HBase regionserver keeps throwing errors that take its regions offline; once all regions are offline, the regionserver itself goes down

```
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /hbase/WALs/xhn12,60020,1483801123833-splitting/xhn12%2C60020%2C1483801123833.1483850465810 (inode 53042): File is not open for writing. [Lease. Holder: DFSClient_hb_rs_xhn12,60020,1483801123833_-808411320_33, pendingcreates: 1]
2017-01-08 13:06:23,534 WARN [regionserver60020-WAL.AsyncSyncer0] hdfs.DFSClient: Error while syncing
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:496)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:945)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1194)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1040)
	at com.sun.proxy.$Proxy16.getAdditionalDatanode(Unknown Source)
	at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:294)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at com.sun.proxy.$Proxy15.getAdditionalDatanode(Unknown Source)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolTranslatorPB.java:416)
	at com.sun.proxy.$Proxy14.getAdditionalDatanode(Unknown Source)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
	at org.apache.hadoop.ipc.Client.call(Client.java:1364)
	at org.apache.hadoop.ipc.Client.call(Client.java:1411)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at java.security.AccessController.doPrivileged(Native Method)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:587)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolServerSideTranslatorPB.java:499)
	at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.getAdditionalDatanode(AuthorizationProviderProxyClientProtocol.java:204)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:647)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:3237)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3334)
```

wangxiaolei

Can you provide more of the regionserver's log output?

Hagrid

2017-01-10 13:10:55,983 DEBUG [MemStoreFlusher.0] regionserver.CompactSplitThread: Small Compaction requested: system; Because: MemStoreFlusher.0; compaction_queue=(0:1), split_queue=0, merge_queue=0
2017-01-10 13:10:55,983 DEBUG [regionserver60020-smallCompactions-1483945539168] compactions.RatioBasedCompactionPolicy: Selecting compaction from 3 store files, 0 compacting, 3 eligible, 10 blocking
2017-01-10 13:11:55,715 WARN  [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 58923ms
GC pool 'ParNew' had collection(s): count=1 time=57632ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=1379ms
java.io.EOFException: Premature EOF: no length prefix available
	at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2103)
	at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:810)
2017-01-10 13:11:55,715 WARN  [regionserver60020] util.Sleeper: We slept 60077ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.h ... pired
2017-01-10 13:11:55,715 WARN  [regionserver60020.periodicFlusher] util.Sleeper: We slept 65421ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.h ... pired
2017-01-10 13:11:55,715 WARN  [regionserver60020.compactionChecker] util.Sleeper: We slept 65415ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.h ... pired
2017-01-10 13:11:55,723 WARN  [DataStreamer for file /hbase/WALs/xhn-slave05,60020,1483945222556/xhn-slave05%2C60020%2C1483945222556.1484023255836 block BP-352369807-10.11.24.57-1483799407745:blk_1073846346_105537] hdfs.DFSClient: Error Recovery for block BP-352369807-10.11.24.57-1483799407745:blk_1073846346_105537 in pipeline 10.11.24.54:50010, 10.11.24.122:50010, 10.11.24.183:50010: bad datanode 10.11.24.54:50010
2017-01-10 13:11:55,720 WARN  [ResponseProcessor for block BP-352369807-10.11.24.57-1483799407745:blk_1073846957_106148] hdfs.DFSClient: Slow ReadProcessor read fields took 59028ms (threshold=30000ms); ack: seqno: 68 status: SUCCESS status: SUCCESS status: SUCCESS downstreamAckTimeNanos: 2892347, targets: [10.11.24.54:50010, 10.11.24.179:50010, 10.11.24.50:50010]
2017-01-10 13:11:55,740 FATAL [regionserver60020] regionserver.HRegionServer: ABORTING region server xhn-slave05,60020,1483945222556: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing xhn-slave05,60020,1483945222556 as dead server
	at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:369)
	at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:274)
	at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1357)
	at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:5087)
	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2031)
	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
	at org.apache.hadoop.hbase.ipc.FifoRpcScheduler$1.run(FifoRpcScheduler.java:74)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

Hagrid

It looks like the GC pauses are too long, which made the ZooKeeper heartbeat connection time out; the master then declared this server dead and took it offline.
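To confirm how long those pauses really are, it can help to turn on GC logging for the regionserver JVM. A minimal sketch for hbase-env.sh, using standard JDK 7/8 GC-logging flags; the log path is only an example:

```
# hbase-env.sh: log every collection and every stop-the-world pause, so the 60s+
# pauses reported by JvmPauseMonitor can be matched against concrete GC events
export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/log/hbase/gc-regionserver.log"
```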

Hagrid

Why did the regionserver's ZooKeeper session expire? Possible causes:
1. A bad network.
2. A Java full GC, which blocks all threads; if it lasts long enough it will also cause a session expiry.

What to do about it:
1. Lengthen the ZooKeeper session timeout.
2. Set "hbase.regionserver.restart.on.zk.expire" to true, so that when a ZooKeeper session expires the regionserver restarts instead of aborting.

Concretely, add the following to hbase-site.xml:

<property>
  <name>zookeeper.session.timeout</name>
  <value>90000</value>
  <description>ZooKeeper session timeout. HBase passes this to the zk quorum as suggested maximum time for a session. See http://hadoop.apache.org/zooke ... sions
  "The client sends a requested timeout, the server responds with the timeout that it can give the client. The current implementation requires that the timeout be a minimum of 2 times the tickTime (as set in the server configuration) and a maximum of 20 times the tickTime."
  Set the zk ticktime with hbase.zookeeper.property.tickTime. In milliseconds.
  </description>
</property>
<property>
  <name>hbase.regionserver.restart.on.zk.expire</name>
  <value>true</value>
  <description>Zookeeper session expired will force regionserver exit. Enable this will make the regionserver restart.</description>
</property>

To limit the impact that full-GC thread suspension has on the ZooKeeper heartbeat, also adjust the GC settings in hbase-env.sh. Change

export HBASE_OPTS="$HBASE_OPTS -XX:+HeapDumpOnOutOfMemoryError \
  -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode"

to

export HBASE_OPTS="$HBASE_OPTS -XX:+HeapDumpOnOutOfMemoryError \
  -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled \
  -XX:CMSInitiatingOccupancyFraction=70 \
  -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseParNewGC -Xmn256m"
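One caveat about the 90000 ms value, based on the quoted ZooKeeper description: the server only grants a session timeout of up to 20 x tickTime, so with the usual default tickTime of 2000 ms the effective maximum is 40000 ms and the 90 s request gets negotiated down. A minimal sketch of also raising the tickTime; the property below applies when HBase manages its own ZooKeeper quorum, otherwise set tickTime in the ensemble's zoo.cfg:

```
<!-- hbase-site.xml: make 20 x tickTime >= zookeeper.session.timeout (90000 ms) -->
<property>
  <name>hbase.zookeeper.property.tickTime</name>
  <value>6000</value>
</property>
```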


Separately, if the Linux open-file limit (max file descriptors) is set too low, scanning multiple column families can also bring a regionserver down; a sketch of raising the limit follows.
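A minimal sketch of raising the per-user file-descriptor limit in /etc/security/limits.conf; the user name hbase and the values are assumptions, so substitute whatever user actually runs the regionserver and datanode processes, then re-login and verify with ulimit -n:

```
# /etc/security/limits.conf -- assumed hbase service user; adjust user and values as needed
hbase  -  nofile  32768
hbase  -  nproc   32000
```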

Hagrid

Could this be the reason?

1. Lately one of our HBase regionservers keeps dying; the log on that node shows the following error:

2014-02-22 01:52:02,194 ERROR org.apache.Hadoop.hbase.regionserver.HRegionServer: Close and delete failed
org.apache.Hadoop.hdfs.server.namenode.LeaseExpiredException: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /hbase/.logs/testhd3,60020,1392948100268/testhd3%2C60020%2C1392948100268.1393004989411 File does not exist. Holder DFSClient_hb_rs_testhd3,60020,1392948100268 does not have any open files.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.Java:1631)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1622)

I spent a long time looking for a problem on the HBase side without finding one. Following some advice online, I then checked the Hadoop logs:

2014-02-22 01:52:00,935 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hadoop cause:org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /hbase/.logs/testhd3,60020,1392948100268/testhd3%2C60020%2C1392948100268.1393004989411 File does not exist. Holder DFSClient_hb_rs_testhd3,60020,1392948100268 does not have any open files.
2014-02-22 01:52:00,936 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 on 9000, call addBlock(/hbase/.logs/testhd3,60020,1392948100268/testhd3%2C60020%2C1392948100268.1393004989411, DFSClient_hb_rs_testhd3,60020,1392948100268, null) from 172.72.101.213:59979: error: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /hbase/.logs/testhd3,60020,1392948100268/testhd3%2C60020%2C1392948100268.1393004989411 File does not exist. Holder DFSClient_hb_rs_testhd3,60020,1392948100268 does not have any open files.
org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /hbase/.logs/testhd3,60020,1392948100268/testhd3%2C60020%2C1392948100268.1393004989411 File does not exist. Holder DFSClient_hb_rs_testhd3,60020,1392948100268 does not have any open files.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1631)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1622)

The two logs contain almost identical entries, which confirms that the HBase failure was caused by Hadoop. The fix was to raise the xcievers parameter from its default of 4096 to 8192, e.g. vi /home/dwhftp/opt/hadoop/conf/hdfs-site.xml:

<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>8192</value>
</property>

About dfs.datanode.max.xcievers: a Hadoop HDFS DataNode has an upper bound on the number of files it will serve at the same time. The parameter is called xcievers (the Hadoop authors misspelled the word). Before loading data, make sure conf/hdfs-site.xml sets this to at least 4096:

<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>
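A side note in case the cluster runs a newer Hadoop: dfs.datanode.max.xcievers has since been deprecated in favour of dfs.datanode.max.transfer.threads, so the roughly equivalent hdfs-site.xml entry would be (keeping the same 8192 value as above):

```
<!-- hdfs-site.xml: current name for the datanode concurrent-transfer limit -->
<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>8192</value>
</property>
```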

mopishv0 - Senior Development Engineer @ Meituan

From the logs you posted, this is usually caused by GC. You still need to work out whether it is puts or scans that are driving it, whether it is a momentary burst of load from some particular workload, and whether that workload could be moved to offline (batch) processing instead.

Hagrid

The cluster's write (ingest) request load looks like this:

Hagrid

JVM parameter settings:

mopishv0 - Senior Development Engineer @ Meituan

Those screenshots don't show much. Check the logs for entries such as large response; if you find any, look at the response sizes and the concurrency. If scan requests are what is killing the server, it will show up in those log entries. Also look at your monitoring to see what the request load was doing at the moment the server went down.

Hagrid

Log excerpts that mention response:

WARN  [ResponseProcessor for block BP-2008143126-10.10.24.57-1479816673596:blk_1075975305_2234486] hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block BP-2008143126-10.10.24.57-1479816673596:blk_1075975305_2234486
java.io.IOException: Bad response ERROR for block BP-2008143126-10.10.24.57-1479816673596:blk_1075975305_2234486 from datanode 10.10.24.55:50010
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:840)

WARN  [B.defaultRpcServer.handler=15,queue=5,port=60020] ipc.RpcServer: (responseTooSlow): {"processingtimems":14083,"call":"Multi(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$MultiRequest)","client":"10.10.24.53:26592","starttimems":1483859603791,"queuetimems":0,"class":"HRegionServer","responsesize":8,"method":"Multi"}

DEBUG [LruStats #0] hfile.LruBlockCache: Total=35.56 MB, free=1.41 GB, max=1.45 GB, blocks=1551944960, accesses=1201680, hits=50479, hitRatio=4.20%, , cachingAccesses=51019, cachingHits=49653, cachingHitsRatio=97.32%, evictions=5845, evicted=826, evictedPerRun=0.14131736755371094

Hagrid

Right before going offline, the regionserver reported a long GC pause:
2017-01-12 06:23:21,414 WARN  [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 58674ms
GC pool 'ParNew' had collection(s): count=1 time=57351ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=1336ms
2017-01-12 06:23:21,414 FATAL [regionserver60020] regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: []
2017-01-12 06:23:21,424 WARN  [DataStreamer for file /hbase/WALs/kbzy-xjp-slave20,60020,1484022796657/kbzy-xjp-slave20%2C60020%2C1484022796657.1484170274386 block BP-352369807-10.11.24.57-1483799407745:blk_1073983425_242623] hdfs.DFSClient: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /hbase/oldWALs/kbzy-xjp-slave20%2C60020%2C1484022796657.1484170274386 (inode 1166999): File is not open for writing. Holder DFSClient_hb_rs_kbzy-xjp-slave20,60020,1484022796657_1006405432_33 does not have any open files.

Hagrid

Before going offline, the regionserver was running a series of smallCompactions:
2017-01-12 06:13:30,153 DEBUG [regionserver60020-smallCompactions-1484022938989] compactions.ExploringCompactionPolicy: Exploring compaction algorithm has selected 0 files of size 0 starting at candidate #-1 after considering 1 permutations with 0 in ratio
2017-01-12 06:13:30,153 DEBUG [regionserver60020-smallCompactions-1484022938989] compactions.RatioBasedCompactionPolicy: Not compacting files because we only have 0 files ready for compaction. Need 3 to initiate.
2017-01-12 06:13:30,153 DEBUG [regionserver60020-smallCompactions-1484022938989] regionserver.CompactSplitThread: Not compacting yb_income,20161109-~,1483801779193.ef126d4a983a6d7da07a7960dd6c2990. because compaction request was cancelled
2017-01-12 06:13:30,153 DEBUG [regionserver60020-smallCompactions-1484022938989] compactions.RatioBasedCompactionPolicy: Selecting compaction from 3 store files, 0 compacting, 3 eligible, 10 blocking
2017-01-12 06:13:30,153 DEBUG [regionserver60020-smallCompactions-1484022938989] compactions.ExploringCompactionPolicy: Exploring compaction algorithm has selected 0 files of size 0 starting at candidate #-1 after considering 1 permutations with 0 in ratio
2017-01-12 06:13:30,153 DEBUG [regionserver60020-smallCompactions-1484022938989] compactions.RatioBasedCompactionPolicy: Not compacting files because we only have 0 files ready for compaction. Need 3 to initiate.
2017-01-12 06:13:30,153 DEBUG [regionserver60020-smallCompactions-1484022938989] regionserver.CompactSplitThread: Not compacting addexpr,20170107-10-ios-1308310468-21330478-125604-1483764964,1484112418556.318cc320cbb1fd04931c6cd9e9839a85. because compaction request was cancelled
2017-01-12 06:13:30,153 DEBUG [regionserver60020-smallCompactions-1484022938989] compactions.RatioBasedCompactionPolicy: Selecting compaction from 3 store files, 0 compacting, 3 eligible, 10 blocking
2017-01-12 06:13:30,153 DEBUG [regionserver60020-smallCompactions-1484022938989] compactions.ExploringCompactionPolicy: Exploring compaction algorithm has selected 0 files of size 0 starting at candidate #-1 after considering 1 permutations with 0 in ratio
2017-01-12 06:13:30,153 DEBUG [regionserver60020-smallCompactions-1484022938989] compactions.RatioBasedCompactionPolicy: Not compacting files because we only have 0 files ready for compaction. Need 3 to initiate.
2017-01-12 06:13:30,153 DEBUG [regionserver60020-smallCompactions-1484022938989] regionserver.CompactSplitThread: Not compacting logout,20161115-~,1483801701066.1adce0d0fa549161e19e89633d3f8c77. because compaction request was cancelled
2017-01-12 06:13:30,153 DEBUG [regionserver60020-smallCompactions-1484022938989] compactions.RatioBasedCompactionPolicy: Selecting compaction from 3 store files, 0 compacting, 3 eligible, 10 blocking
2017-01-12 06:13:30,154 DEBUG [regionserver60020-smallCompactions-1484022938989] compactions.ExploringCompactionPolicy: Exploring compaction algorithm has selected 0 files of size 0 starting at candidate #-1 after considering 1 permutations with 0 in ratio
2017-01-12 06:13:30,154 DEBUG [regionserver60020-smallCompactions-1484022938989] compactions.RatioBasedCompactionPolicy: Not compacting files because we only have 0 files ready for compaction. Need 3 to initiate.
2017-01-12 06:13:30,154 DEBUG [regionserver60020-smallCompactions-1484022938989] regionserver.CompactSplitThread: Not compacting yb_income,20161116-~,1483801779194.3b8c51ff5019433f4b61e53dede3438f. because compaction request was cancelled
2017-01-12 06:18:16,780 DEBUG [LruStats #0] hfile.LruBlockCache: Total=353.67 MB, free=1.09 GB, max=1.43 GB, blocks=1538850816, accesses=2891937, hits=552140, hitRatio=19.09%, , cachingAccesses=557722, cachingHits=552119, cachingHitsRatio=99.00%, evictions=15029, evicted=24, evictedPerRun=0.00159691262524575
2017-01-12 06:21:50,151 INFO  [regionserver60020.periodicFlusher] regionserver.HRegionServer: regionserver60020.periodicFlusher requesting flush for region online,20170111-~,1483801713477.03cc0f95d60fc9c2af25bd9944b625ef. after a delay of 9114
2017-01-12 06:21:59,265 INFO  [MemStoreFlusher.1] regionserver.HRegion: Started memstore flush for online,20170111-~,1483801713477.03cc0f95d60fc9c2af25bd9944b625ef., current region memstore size 4.0 M
2017-01-12 06:21:59,297 INFO  [MemStoreFlusher.1] regionserver.DefaultStoreFlusher: Flushed, sequenceid=15152, memsize=4.0 M, hasBloomFilter=true, into tmp file hdfs://kbzy-xjp-master:9000/hbase/data/default/online/03cc0f95d60fc9c2af25bd9944b625ef/.tmp/6851a6096280410ca692a6cf86cd2dae
2017-01-12 06:21:59,304 DEBUG [MemStoreFlusher.1] regionserver.HRegionFileSystem: Committing store file hdfs://kbzy-xjp-master:9000/hbase/data/default/online/03cc0f95d60fc9c2af25bd9944b625ef/.tmp/6851a6096280410ca692a6cf86cd2dae as hdfs://kbzy-xjp-master:9000/hbase/data/default/online/03cc0f95d60fc9c2af25bd9944b625ef/info/6851a6096280410ca692a6cf86cd2dae
2017-01-12 06:21:59,311 INFO  [MemStoreFlusher.1] regionserver.HStore: Added hdfs://kbzy-xjp-master:9000/hbase/data/default/online/03cc0f95d60fc9c2af25bd9944b625ef/info/6851a6096280410ca692a6cf86cd2dae, entries=18480, sequenceid=15152, filesize=267.8 K
2017-01-12 06:21:59,311 INFO  [MemStoreFlusher.1] regionserver.HRegion: Finished memstore flush of ~4.0 M/4239504, currentsize=0/0 for region online,20170111-~,1483801713477.03cc0f95d60fc9c2af25bd9944b625ef. in 46ms, sequenceid=15152, compaction requested=false
2017-01-12 06:23:21,403 DEBUG [LruStats #0] hfile.LruBlockCache: Total=353.67 MB, free=1.09 GB, max=1.43 GB, blocks=1538850816, accesses=2891937, hits=552140, hitRatio=19.09%, , cachingAccesses=557722, cachingHits=552119, cachingHitsRatio=99.00%, evictions=15054, evicted=24, evictedPerRun=0.0015942606842145324
2017-01-12 06:23:21,403 INFO  [regionserver60020-SendThread(kbzy-xjp-zk01:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 65366ms for sessionid 0x159796eae5f243d, closing socket connection and attempting reconnect
2017-01-12 06:23:21,403 INFO  [SplitLogWorker-kbzy-xjp-slave20,60020,1484022796657-SendThread(kbzy-xjp-zk02:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 67766ms for sessionid 0x259796eae6b1bbf, closing socket connection and attempting reconnect
2017-01-12 06:23:21,403 WARN  [regionserver60020] util.Sleeper: We slept 60430ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.h ... pired
2017-01-12 06:23:21,403 WARN  [regionserver60020.compactionChecker] util.Sleeper: We slept 61250ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.h ... pired
2017-01-12 06:23:21,403 WARN  [regionserver60020.periodicFlusher] util.Sleeper: We slept 61251ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.h ... pired
2017-01-12 06:23:21,403 INFO  [regionserver60020-SendThread(kbzy-xjp-zk01:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 68077ms for sessionid 0x159796eae5f2441, closing socket connection and attempting reconnect
2017-01-12 06:23:21,403 INFO  [regionserver60020-SendThread(kbzy-xjp-sparkmaster:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 65280ms for sessionid 0x5597cee2705003d, closing socket connection and attempting reconnect
2017-01-12 06:23:21,404 WARN  [ResponseProcessor for block BP-352369807-10.11.24.57-1483799407745:blk_1073983425_242623] hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block BP-352369807-10.11.24.57-1483799407745:blk_1073983425_242623
java.io.EOFException: Premature EOF: no length prefix available
	at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2103)
	at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:810)
2017-01-12 06:23:21,409 WARN  [DataStreamer for file /hbase/WALs/kbzy-xjp-slave20,60020,1484022796657/kbzy-xjp-slave20%2C60020%2C1484022796657.1484170274386 block BP-352369807-10.11.24.57-1483799407745:blk_1073983425_242623] hdfs.DFSClient: Error Recovery for block BP-352369807-10.11.24.57-1483799407745:blk_1073983425_242623 in pipeline 10.11.24.182:50010, 10.11.24.49:50010, 10.11.24.180:50010: bad datanode 10.11.24.182:50010
2017-01-12 06:23:21,412 FATAL [regionserver60020] regionserver.HRegionServer: ABORTING region server kbzy-xjp-slave20,60020,1484022796657: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing kbzy-xjp-slave20,60020,1484022796657 as dead server
	at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:369)
	at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:274)
	at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1357)
	at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:5087)
	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2031)
	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
	at org.apache.hadoop.hbase.ipc.FifoRpcScheduler$1.run(FifoRpcScheduler.java:74)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)

org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing kbzy-xjp-slave20,60020,1484022796657 as dead server
	at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:369)
	at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:274)
	at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1357)
	at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:5087)
	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2031)
	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
	at org.apache.hadoop.hbase.ipc.FifoRpcScheduler$1.run(FifoRpcScheduler.java:74)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)

	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
	at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
	at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95)
	at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:304)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:1107)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:928)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.YouAreDeadException): org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing kbzy-xjp-slave20,60020,1484022796657 as dead server
	at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:369)
	at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:274)
	at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1357)
	at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:5087)
	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2031)
	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
	at org.apache.hadoop.hbase.ipc.FifoRpcScheduler$1.run(FifoRpcScheduler.java:74)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)

	at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1457)
	at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1661)
	at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1719)
	at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.regionServerReport(RegionServerStatusProtos.java:5414)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:1105)
	... 2 more

Hagrid

Severe GC pauses caused the ZooKeeper connection to time out.

Hagrid

Background: 25 regionservers, 5 ZooKeeper nodes, and a hot-standby master. A real-time processing program writes to HBase tables continuously; the HBase cluster shares machines with HDFS, Hive and Spark, and the cluster runs roughly 12 hours of analysis jobs every day.

Suspected cause 1 (compactions/splits too frequent): the hfile maximum size was configured at 1 GB, so compactions and splits were frequent and resource-hungry, leading to long GC pauses, HDFS write errors, and finally the regionserver dying.

Suspected cause 2 (HDFS under too much pressure, datanodes overloaded): the cluster runs all kinds of jobs, so HDFS read/write pressure and datanode load are high, the regionserver's HDFS writes fail, and it goes down.

Fix: 1. Raise the hfile maximum size to 100 GB and stop the system from running major compactions on its own (see the sketch below). 2. Give the datanodes more memory and increase their RPC handler threads.
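A minimal hbase-site.xml sketch of that first fix; the property names are the standard ones for this HBase generation, and the exact values (100 GB, fully disabling time-based major compaction) simply follow what is described above:

```
<!-- hbase-site.xml: bigger regions, no automatic (time-based) major compactions -->
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>107374182400</value> <!-- 100 GB -->
</property>
<property>
  <name>hbase.hregion.majorcompaction</name>
  <value>0</value> <!-- 0 disables periodic major compaction; trigger it manually off-peak -->
</property>
```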

数据男孩

Hello, a quick question: how did you implement the real-time writes into your HBase tables? Could you briefly describe the framework you use for that? Thanks.
