An HBase regionserver starts throwing errors, its regions go offline one after another, and once all regions are offline the regionserver itself dies

```
2017-01-08 13:06:23,534 WARN hdfs.DFSClient: Error while syncing
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /hbase/WALs/xhn12,60020,1483801123833-splitting/xhn12%2C60020%2C1483801123833.1483850465810 (inode 53042): File is not open for writing.
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3334)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:3237)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:647)
	at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.getAdditionalDatanode(AuthorizationProviderProxyClientProtocol.java:204)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolServerSideTranslatorPB.java:499)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)

	at org.apache.hadoop.ipc.Client.call(Client.java:1411)
	at org.apache.hadoop.ipc.Client.call(Client.java:1364)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
	at com.sun.proxy.$Proxy14.getAdditionalDatanode(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolTranslatorPB.java:416)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
	at com.sun.proxy.$Proxy15.getAdditionalDatanode(Unknown Source)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:294)
	at com.sun.proxy.$Proxy16.getAdditionalDatanode(Unknown Source)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1040)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1194)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:945)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:496)


```
Could you share more of the regionserver's log output?
2017-01-10 13:10:55,983 DEBUG regionserver.CompactSplitThread: Small Compaction requested: system; Because: MemStoreFlusher.0; compaction_queue=(0:1), split_queue=0, merge_queue=0
2017-01-10 13:10:55,983 DEBUG compactions.RatioBasedCompactionPolicy: Selecting compaction from 3 store files, 0 compacting, 3 eligible, 10 blocking
2017-01-10 13:11:55,715 WARN   util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 58923ms
GC pool 'ParNew' had collection(s): count=1 time=57632ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=1379ms
java.io.EOFException: Premature EOF: no length prefix available
        at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2103)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:810)
2017-01-10 13:11:55,715 WARN   util.Sleeper: We slept 60077ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2017-01-10 13:11:55,715 WARN   util.Sleeper: We slept 65421ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2017-01-10 13:11:55,715 WARN   util.Sleeper: We slept 65415ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2017-01-10 13:11:55,723 WARN   hdfs.DFSClient: Error Recovery for block BP-352369807-10.11.24.57-1483799407745:blk_1073846346_105537 in pipeline 10.11.24.54:50010, 10.11.24.122:50010, 10.11.24.183:50010: bad datanode 10.11.24.54:50010
2017-01-10 13:11:55,720 WARN   hdfs.DFSClient: Slow ReadProcessor read fields took 59028ms (threshold=30000ms); ack: seqno: 68 status: SUCCESS status: SUCCESS status: SUCCESS downstreamAckTimeNanos: 2892347, targets:
2017-01-10 13:11:55,740 FATAL regionserver.HRegionServer: ABORTING region server xhn-slave05,60020,1483945222556: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing xhn-slave05,60020,1483945222556 as dead server
        at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:369)
        at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:274)
        at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1357)
        at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:5087)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2031)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
        at org.apache.hadoop.hbase.ipc.FifoRpcScheduler$1.run(FifoRpcScheduler.java:74)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
It looks like GC pauses ran too long, the ZooKeeper heartbeat timed out, and the master then declared this server dead.

Why did the regionserver's ZooKeeper session expire? Possible causes:
1. A flaky network.
2. A Java full GC, which blocks all threads; if it lasts long enough, the session expires.

What to do?
1. Lengthen the ZooKeeper session timeout.
2. Set "hbase.regionserver.restart.on.zk.expire" to true, so that on a ZooKeeper session expiry the regionserver restarts instead of aborting.

Concretely, add the following to hbase-site.xml:

<property>
  <name>zookeeper.session.timeout</name>
  <value>90000</value>
  <description>ZooKeeper session timeout.
  HBase passes this to the zk quorum as suggested maximum time for a
  session. See http://hadoop.apache.org/zookeeper/docs/current/zookeeperProgrammers.html#ch_zkSessions
  "The client sends a requested timeout, the server responds with the
  timeout that it can give the client. The current implementation
  requires that the timeout be a minimum of 2 times the tickTime
  (as set in the server configuration) and a maximum of 20 times
  the tickTime." Set the zk ticktime with hbase.zookeeper.property.tickTime.
  In milliseconds.</description>
</property>

<property>
  <name>hbase.regionserver.restart.on.zk.expire</name>
  <value>true</value>
  <description>Zookeeper session expired will force regionserver exit.
  Enable this will make the regionserver restart.</description>
</property>
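
Note the 2x-20x tickTime constraint quoted in the description: with ZooKeeper's default tickTime of 2000 ms, the server caps negotiated sessions at 40 s, so the requested 90000 ms would be silently clamped. A minimal sketch of raising the tickTime, assuming HBase manages the ZooKeeper quorum (the 4500 ms value is just an example chosen so that 20 x tickTime = 90 s; standalone ZooKeeper installs set tickTime in zoo.cfg instead):

```
<!-- hbase-site.xml: only takes effect when HBase manages the ZK quorum -->
<property>
  <name>hbase.zookeeper.property.tickTime</name>
  <value>4500</value> <!-- 20 * 4500 ms = 90 s, the session timeout requested above -->
</property>
```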


To reduce the impact of full-GC thread suspension on the ZooKeeper heartbeat, we also need to adjust hbase-env.sh.
Change
{{{export HBASE_OPTS="$HBASE_OPTS -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode"}}}
to
{{{export HBASE_OPTS="$HBASE_OPTS -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseParNewGC -Xmn256m"}}}
(Note that CMSInitiatingOccupancyFraction takes a value, so it is written -XX:CMSInitiatingOccupancyFraction=70, not -XX:+... .)
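
To confirm that the pauses JvmPauseMonitor reports really are GC, and which collector is at fault, it can also help to enable GC logging. A sketch, assuming a JDK 7-era HotSpot like the one in these logs (the log path is an arbitrary example):

```
# hbase-env.sh: write a timestamped GC log alongside the other HBase logs
export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails \
  -XX:+PrintGCDateStamps -Xloggc:/var/log/hbase/gc-regionserver.log"
```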


Separately, when the Linux max-open-files limit is set too low, scanning multiple column families can also crash the regionserver.
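
A quick way to check and raise that limit (a sketch; the user name "hbase" and the 32768 value are assumptions -- use whatever account runs your regionserver and a limit sized for your workload):

```
# Check the current open-file limit as the user running the regionserver
su - hbase -c 'ulimit -n'
# Raise it persistently in /etc/security/limits.conf, then log in again:
#   hbase  soft  nofile  32768
#   hbase  hard  nofile  32768
```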
**Could this be the cause?**
 
 
1. Lately one of the HBase regionservers keeps going down; checking that node's log turned up the following error:
2014-02-22 01:52:02,194 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Close and delete failed
org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /hbase/.logs/testhd3,60020,1392948100268/testhd3%2C60020%2C1392948100268.1393004989411 File does not exist. Holder DFSClient_hb_rs_testhd3,60020,1392948100268 does not have any open files.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1631)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1622)
I dug into the HBase side for a long time without finding the problem; following pointers online, I then checked the Hadoop logs:
2014-02-22 01:52:00,935 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hadoop cause:org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /hbase/.logs/testhd3,60020,1392948100268/testhd3%2C60020%2C1392948100268.1393004989411 File does not exist. Holder DFSClient_hb_rs_testhd3,60020,1392948100268 does not have any open files.
2014-02-22 01:52:00,936 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 on 9000, call addBlock(/hbase/.logs/testhd3,60020,1392948100268/testhd3%2C60020%2C1392948100268.1393004989411, DFSClient_hb_rs_testhd3,60020,1392948100268, null) from 172.72.101.213:59979: error: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /hbase/.logs/testhd3,60020,1392948100268/testhd3%2C60020%2C1392948100268.1393004989411 File does not exist. Holder DFSClient_hb_rs_testhd3,60020,1392948100268 does not have any open files.
org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /hbase/.logs/testhd3,60020,1392948100268/testhd3%2C60020%2C1392948100268.1393004989411 File does not exist. Holder DFSClient_hb_rs_testhd3,60020,1392948100268 does not have any open files.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1631)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1622)
**The two logs contain nearly identical records at almost the same instant, which confirms that the HBase failure is caused by Hadoop. The change:**
Fix: raise the xcievers parameter from its default of 4096 to 8192.
vi /home/dwhftp/opt/hadoop/conf/hdfs-site.xml

<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>8192</value>
</property>

About dfs.datanode.max.xcievers: an HDFS DataNode has an upper bound on the number of files it can serve at the same time. The parameter is called xcievers (the Hadoop authors misspelled the word). Before loading data, make sure this parameter in conf/hdfs-site.xml is set to at least 4096:

<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>
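
Worth knowing if you are on Hadoop 2.x or later: this setting was renamed, though the old misspelled key is still honored as a deprecated alias. A sketch of the modern spelling, same value as above:

```
<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>8192</value>
</property>
```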

mopishv0 - Senior Development Engineer @ Meituan

From the logs you posted, this is usually GC-induced. You need to pin down whether puts or scans are driving it, whether it is a transient burst from some application, and whether that work can be moved to offline processing.
The cluster's write-request load is as follows:
JVM parameter settings:
 

mopishv0 - Senior Development Engineer @ Meituan

The web UI shows fairly little. Check whether the logs contain entries like "large response"; if so, look at their sizes and concurrency. If scan requests are what killed the server, there will be log evidence here.
Also check your monitoring: what did the request traffic look like at the moment it died?
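
For example, a quick way to pull that evidence out of the regionserver log (the file name is an assumption -- adjust to your install; RpcServer tags slow calls "(responseTooSlow)" as in the excerpt below, and oversized results "(responseTooLarge)"):

```
# Count, then sample, the slow-response warnings RpcServer emits
grep -c "responseTooSlow" hbase-regionserver.log
grep "responseTooSlow" hbase-regionserver.log | tail -5
```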
Excerpts of the response-related log entries:
WARN   hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block BP-2008143126-10.10.24.57-1479816673596:blk_1075975305_2234486
java.io.IOException: Bad response ERROR for block BP-2008143126-10.10.24.57-1479816673596:blk_1075975305_2234486 from datanode 10.10.24.55:50010
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:840)
 
WARN   ipc.RpcServer: (responseTooSlow): {"processingtimems":14083,"call":"Multi(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$MultiRequest)","client":"10.10.24.53:26592","starttimems":1483859603791,"queuetimems":0,"class":"HRegionServer","responsesize":8,"method":"Multi"}
 
 
 DEBUG hfile.LruBlockCache: Total=35.56 MB, free=1.41 GB, max=1.45 GB, blocks=1551944960, accesses=1201680, hits=50479, hitRatio=4.20%, , cachingAccesses=51019, cachingHits=49653, cachingHitsRatio=97.32%, evictions=5845, evicted=826, evictedPerRun=0.14131736755371094
 
Before going offline, the regionserver reported a GC timeout:
{{{2017-01-12 06:23:21,414 WARN util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 58674ms
GC pool 'ParNew' had collection(s): count=1 time=57351ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=1336ms
2017-01-12 06:23:21,414 FATAL regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: []
2017-01-12 06:23:21,424 WARN hdfs.DFSClient: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /hbase/oldWALs/kbzy-xjp-slave20%2C60020%2C1484022796657.1484170274386 (inode 1166999): File is not open for writing. Holder DFSClient_hb_rs_kbzy-xjp-slave20,60020,1484022796657_1006405432_33 does not have any open files.}}}
Before going offline, the regionserver was running a string of small compactions:
 
{{{2017-01-12 06:13:30,153 DEBUG compactions.ExploringCompactionPolicy: Exploring compaction algorithm has selected 0 files of size 0 starting at candidate #-1 after considering 1 permutations with 0 in ratio
2017-01-12 06:13:30,153 DEBUG compactions.RatioBasedCompactionPolicy: Not compacting files because we only have 0 files ready for compaction. Need 3 to initiate.
2017-01-12 06:13:30,153 DEBUG regionserver.CompactSplitThread: Not compacting yb_income,20161109-~,1483801779193.ef126d4a983a6d7da07a7960dd6c2990. because compaction request was cancelled
2017-01-12 06:13:30,153 DEBUG compactions.RatioBasedCompactionPolicy: Selecting compaction from 3 store files, 0 compacting, 3 eligible, 10 blocking
2017-01-12 06:13:30,153 DEBUG compactions.ExploringCompactionPolicy: Exploring compaction algorithm has selected 0 files of size 0 starting at candidate #-1 after considering 1 permutations with 0 in ratio
2017-01-12 06:13:30,153 DEBUG compactions.RatioBasedCompactionPolicy: Not compacting files because we only have 0 files ready for compaction. Need 3 to initiate.
2017-01-12 06:13:30,153 DEBUG regionserver.CompactSplitThread: Not compacting addexpr,20170107-10-ios-1308310468-21330478-125604-1483764964,1484112418556.318cc320cbb1fd04931c6cd9e9839a85. because compaction request was cancelled
2017-01-12 06:13:30,153 DEBUG compactions.RatioBasedCompactionPolicy: Selecting compaction from 3 store files, 0 compacting, 3 eligible, 10 blocking
2017-01-12 06:13:30,153 DEBUG compactions.ExploringCompactionPolicy: Exploring compaction algorithm has selected 0 files of size 0 starting at candidate #-1 after considering 1 permutations with 0 in ratio
2017-01-12 06:13:30,153 DEBUG compactions.RatioBasedCompactionPolicy: Not compacting files because we only have 0 files ready for compaction. Need 3 to initiate.
2017-01-12 06:13:30,153 DEBUG regionserver.CompactSplitThread: Not compacting logout,20161115-~,1483801701066.1adce0d0fa549161e19e89633d3f8c77. because compaction request was cancelled
2017-01-12 06:13:30,153 DEBUG compactions.RatioBasedCompactionPolicy: Selecting compaction from 3 store files, 0 compacting, 3 eligible, 10 blocking
2017-01-12 06:13:30,154 DEBUG compactions.ExploringCompactionPolicy: Exploring compaction algorithm has selected 0 files of size 0 starting at candidate #-1 after considering 1 permutations with 0 in ratio
2017-01-12 06:13:30,154 DEBUG compactions.RatioBasedCompactionPolicy: Not compacting files because we only have 0 files ready for compaction. Need 3 to initiate.
2017-01-12 06:13:30,154 DEBUG regionserver.CompactSplitThread: Not compacting yb_income,20161116-~,1483801779194.3b8c51ff5019433f4b61e53dede3438f. because compaction request was cancelled
2017-01-12 06:18:16,780 DEBUG hfile.LruBlockCache: Total=353.67 MB, free=1.09 GB, max=1.43 GB, blocks=1538850816, accesses=2891937, hits=552140, hitRatio=19.09%, , cachingAccesses=557722, cachingHits=552119, cachingHitsRatio=99.00%, evictions=15029, evicted=24, evictedPerRun=0.00159691262524575
2017-01-12 06:21:50,151 INFO regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region online,20170111-~,1483801713477.03cc0f95d60fc9c2af25bd9944b625ef. after a delay of 9114
2017-01-12 06:21:59,265 INFO regionserver.HRegion: Started memstore flush for online,20170111-~,1483801713477.03cc0f95d60fc9c2af25bd9944b625ef., current region memstore size 4.0 M
2017-01-12 06:21:59,297 INFO regionserver.DefaultStoreFlusher: Flushed, sequenceid=15152, memsize=4.0 M, hasBloomFilter=true, into tmp file hdfs://kbzy-xjp-master:9000/hbase/data/default/online/03cc0f95d60fc9c2af25bd9944b625ef/.tmp/6851a6096280410ca692a6cf86cd2dae
2017-01-12 06:21:59,304 DEBUG regionserver.HRegionFileSystem: Committing store file hdfs://kbzy-xjp-master:9000/hbase/data/default/online/03cc0f95d60fc9c2af25bd9944b625ef/.tmp/6851a6096280410ca692a6cf86cd2dae as hdfs://kbzy-xjp-master:9000/hbase/data/default/online/03cc0f95d60fc9c2af25bd9944b625ef/info/6851a6096280410ca692a6cf86cd2dae
2017-01-12 06:21:59,311 INFO regionserver.HStore: Added hdfs://kbzy-xjp-master:9000/hbase/data/default/online/03cc0f95d60fc9c2af25bd9944b625ef/info/6851a6096280410ca692a6cf86cd2dae, entries=18480, sequenceid=15152, filesize=267.8 K
2017-01-12 06:21:59,311 INFO regionserver.HRegion: Finished memstore flush of ~4.0 M/4239504, currentsize=0/0 for region online,20170111-~,1483801713477.03cc0f95d60fc9c2af25bd9944b625ef. in 46ms, sequenceid=15152, compaction requested=false
2017-01-12 06:23:21,403 DEBUG hfile.LruBlockCache: Total=353.67 MB, free=1.09 GB, max=1.43 GB, blocks=1538850816, accesses=2891937, hits=552140, hitRatio=19.09%, , cachingAccesses=557722, cachingHits=552119, cachingHitsRatio=99.00%, evictions=15054, evicted=24, evictedPerRun=0.0015942606842145324
2017-01-12 06:23:21,403 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 65366ms for sessionid 0x159796eae5f243d, closing socket connection and attempting reconnect
2017-01-12 06:23:21,403 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 67766ms for sessionid 0x259796eae6b1bbf, closing socket connection and attempting reconnect
2017-01-12 06:23:21,403 WARN util.Sleeper: We slept 60430ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2017-01-12 06:23:21,403 WARN util.Sleeper: We slept 61250ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2017-01-12 06:23:21,403 WARN util.Sleeper: We slept 61251ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2017-01-12 06:23:21,403 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 68077ms for sessionid 0x159796eae5f2441, closing socket connection and attempting reconnect
2017-01-12 06:23:21,403 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 65280ms for sessionid 0x5597cee2705003d, closing socket connection and attempting reconnect
2017-01-12 06:23:21,404 WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-352369807-10.11.24.57-1483799407745:blk_1073983425_242623
java.io.EOFException: Premature EOF: no length prefix available
at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2103)
at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:810)
2017-01-12 06:23:21,409 WARN hdfs.DFSClient: Error Recovery for block BP-352369807-10.11.24.57-1483799407745:blk_1073983425_242623 in pipeline 10.11.24.182:50010, 10.11.24.49:50010, 10.11.24.180:50010: bad datanode 10.11.24.182:50010
2017-01-12 06:23:21,412 FATAL regionserver.HRegionServer: ABORTING region server kbzy-xjp-slave20,60020,1484022796657: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing kbzy-xjp-slave20,60020,1484022796657 as dead server
at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:369)
at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:274)
at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1357)
at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:5087)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2031)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
at org.apache.hadoop.hbase.ipc.FifoRpcScheduler$1.run(FifoRpcScheduler.java:74)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing kbzy-xjp-slave20,60020,1484022796657 as dead server
at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:369)
at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:274)
at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1357)
at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:5087)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2031)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
at org.apache.hadoop.hbase.ipc.FifoRpcScheduler$1.run(FifoRpcScheduler.java:74)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:304)
at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:1107)
at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:928)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.YouAreDeadException): org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing kbzy-xjp-slave20,60020,1484022796657 as dead server
at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:369)
at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:274)
at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1357)
at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:5087)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2031)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
at org.apache.hadoop.hbase.ipc.FifoRpcScheduler$1.run(FifoRpcScheduler.java:74)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1457)
at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1661)
at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1719)
at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.regionServerReport(RegionServerStatusProtos.java:5414)
at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:1105)
... 2 more}}}
Severe GC caused the ZooKeeper connection to time out.
Background:
25 regionservers, 5 ZooKeeper nodes, and a master with a hot standby. A real-time processing pipeline writes into HBase tables continuously; the HBase cluster shares its machines with HDFS, Hive, and Spark, and the cluster runs roughly 12 hours of analysis jobs every day.

Suspected cause 1 -- compactions and splits too frequent:
Because the configured maximum hfile size was 1 GB, compactions and splits ran frequently and consumed a lot of resources; GC pauses grew too long, HDFS writes started failing, and the regionserver died.

Suspected cause 2 -- HDFS under too much pressure, datanodes overloaded:
With all kinds of jobs running on the cluster, HDFS read/write pressure was high and datanode load was heavy, so the regionserver's HDFS writes failed and it went down.

Fixes: 1. Raise the maximum hfile size to 100 GB and stop the system from running major compactions on its own. 2. Give the datanodes more memory and tune the RPC thread count.
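
A sketch of what fix 1 looks like in hbase-site.xml, assuming the stock property names (hbase.hregion.max.filesize is the region split threshold in bytes; setting hbase.hregion.majorcompaction to 0 disables the periodic automatic major compactions, after which you schedule them yourself):

```
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>107374182400</value> <!-- 100 GB -->
</property>
<property>
  <name>hbase.hregion.majorcompaction</name>
  <value>0</value> <!-- 0 = no automatic major compactions -->
</property>
```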
Hi, could I ask how you implemented the real-time writes into HBase? Could you briefly describe the framework you use there? Thanks.
