Ok... So the issue is that you have a lot of regions on each region server because the max file size is still the default. On your input to HBase, you have a couple of issues:
1) Your data is most likely sorted. (Not good for inserts.)
2) You will want to increase your region size from the default (256MB) to something like 1-2GB.
3) You probably don't have MSLAB set up or GC tuned.
4) Google dfs.balance.bandwidthPerSec. I believe it's also used by HBase when it needs to move regions.

Speaking of which, what happens when HBase decides to move a region? Does it make a copy on the new RS and, once it's there, point to the new RS and then remove the old region?

I'm assuming you're writing out of your reducer straight to HBase. Are you writing your job to 1 reducer, or did you set up multiple reducers? You may want to play with having multiple reducers...

Again, here's the issue: you don't need a reducer when writing to HBase. You would be better served by refactoring your job to have the mapper write to HBase directly. Think about it. (Really, think about it. If you really don't see it, face a white wall with a six-pack of beer, start drinking, and focus on the question of why I would say you don't need a reducer on a map job.) ;-)
Note: if you don't drink, go to the gym, get on a treadmill and run at a good pace. Put your body into a zone and then work through the problem.

(Rough sketches of the config changes from points 2 and 3, and of a map-only job writing straight to HBase, are at the bottom of this message, below the quoted thread.)

HTH

-Mike

On May 10, 2012, at 7:22 AM, Eran Kutner wrote:

> Hi Mike,
> Not sure I understand the question about the reducer. I'm using a reducer
> because my M/R jobs require one, and I want to write the result to HBase.
> I have two tables I'm writing to; one is using the default file size
> (256MB if I remember correctly), the other one is 512MB.
> There are ~700 regions on each server.
> Didn't know there is a bandwidth limit. Is it on HDFS or HBase? How can it
> be configured?
>
> -eran
>
>
>
> On Thu, May 10, 2012 at 2:53 PM, Michel Segel
> <[email protected]> wrote:
>
>> Silly question...
>> Why are you using a reducer when working w/ HBase?
>>
>> Second silly question... What is the max file size of your table that you
>> are writing to?
>>
>> Third silly question... How many regions are on each of your region servers?
>>
>> Fourth silly question... There is this bandwidth setting... Default is
>> 10MB... Did you modify it?
>>
>>
>>
>> Sent from a remote device. Please excuse any typos...
>>
>> Mike Segel
>>
>> On May 10, 2012, at 6:33 AM, Eran Kutner <[email protected]> wrote:
>>
>>> Thanks Igal, but we already have that setting. These are the relevant
>>> settings from hdfs-site.xml:
>>> <property>
>>>   <name>dfs.datanode.max.xcievers</name>
>>>   <value>65536</value>
>>> </property>
>>> <property>
>>>   <name>dfs.datanode.handler.count</name>
>>>   <value>10</value>
>>> </property>
>>> <property>
>>>   <name>dfs.datanode.socket.write.timeout</name>
>>>   <value>0</value>
>>> </property>
>>>
>>> Other ideas?
>>>
>>> -eran
>>>
>>>
>>>
>>> On Thu, May 10, 2012 at 12:25 PM, Igal Shilman <[email protected]> wrote:
>>>
>>>> Hi Eran,
>>>> Do you have dfs.datanode.socket.write.timeout set in hdfs-site.xml?
>>>> (We have set this to zero in our cluster, which means waiting as long as
>>>> necessary for the write to complete.)
>>>>
>>>> Igal.
>>>>
>>>> On Thu, May 10, 2012 at 11:17 AM, Eran Kutner <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>> We're seeing occasional regionserver crashes during heavy write operations
>>>>> to HBase (at the reduce phase of large M/R jobs). I have increased the file
>>>>> descriptors, HDFS xceivers and HDFS threads to the recommended settings,
>>>>> and actually way above.
>>>>>
>>>>> Here is an example of the HBase log (showing only errors):
>>>>>
>>>>> 2012-05-10 03:34:54,291 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_-8928911185099340956_5189425
>>>>> java.io.IOException: Bad response 1 for block blk_-8928911185099340956_5189425 from datanode 10.1.104.6:50010
>>>>>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2986)
>>>>>
>>>>> 2012-05-10 03:34:54,494 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.io.InterruptedIOException: Interruped while waiting for IO on channel java.nio.channels.SocketChannel[connected local=/10.1.104.9:59642 remote=/10.1.104.9:50010]. 0 millis timeout left.
>>>>>         at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:349)
>>>>>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
>>>>>         at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
>>>>>         at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
>>>>>         at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
>>>>>         at java.io.DataOutputStream.write(DataOutputStream.java:90)
>>>>>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2848)
>>>>>
>>>>> 2012-05-10 03:34:54,531 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-8928911185099340956_5189425 bad datanode[2] 10.1.104.6:50010
>>>>> 2012-05-10 03:34:54,531 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-8928911185099340956_5189425 in pipeline 10.1.104.9:50010, 10.1.104.8:50010, 10.1.104.6:50010: bad datanode 10.1.104.6:50010
>>>>> 2012-05-10 03:48:30,174 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=hadoop1-s09.farm-ny.gigya.com,60020,1336476100422, load=(requests=15741, regions=789, usedHeap=6822, maxHeap=7983): regionserver:60020-0x2372c0e8a2f0008 regionserver:60020-0x2372c0e8a2f0008 received expired from ZooKeeper, aborting
>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
>>>>>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:352)
>>>>>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:270)
>>>>>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:531)
>>>>>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507)
>>>>> java.io.InterruptedIOException: Aborting compaction of store properties in region gs_users,6155551|QoCW/euBIKuMat/nRC5Xtw==,1334983658004.878522ea91f41cd76b903ea06ccd17f9. because user requested stop.
>>>>>         at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:998)
>>>>>         at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:779)
>>>>>         at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:776)
>>>>>         at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:721)
>>>>>         at org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:81)
>>>>>
>>>>>
>>>>> This is from 10.1.104.9 (same machine running the region server that crashed):
>>>>> 2012-05-10 03:31:16,785 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-8928911185099340956_5189425 src: /10.1.104.9:59642 dest: /10.1.104.9:50010
>>>>> 2012-05-10 03:35:39,000 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Connection reset
>>>>> 2012-05-10 03:35:39,052 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_-8928911185099340956_5189425 java.nio.channels.ClosedByInterruptException
>>>>> 2012-05-10 03:35:39,053 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-8928911185099340956_5189425 received exception java.io.IOException: Interrupted receiveBlock
>>>>> 2012-05-10 03:35:39,055 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hdfs (auth:SIMPLE) cause:java.io.IOException: Block blk_-8928911185099340956_5189425 length is 24384000 does not match block file length 24449024
>>>>> 2012-05-10 03:35:39,055 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 on 50020, call startBlockRecovery(blk_-8928911185099340956_5189425) from 10.1.104.8:50251: error: java.io.IOException: Block blk_-8928911185099340956_5189425 length is 24384000 does not match block file length 24449024
>>>>> java.io.IOException: Block blk_-8928911185099340956_5189425 length is 24384000 does not match block file length 24449024
>>>>> 2012-05-10 03:35:39,077 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Broken pipe
>>>>> 2012-05-10 03:35:39,077 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Socket closed
>>>>> 2012-05-10 03:35:39,108 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Socket closed
>>>>> 2012-05-10 03:35:39,136 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Socket closed
>>>>> 2012-05-10 03:35:39,165 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Socket closed
>>>>> 2012-05-10 03:35:39,196 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Socket closed
>>>>> 2012-05-10 03:35:39,221 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
>>>>> blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Socket closed
>>>>> 2012-05-10 03:35:39,246 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Socket closed
>>>>> 2012-05-10 03:35:39,271 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Socket closed
>>>>> 2012-05-10 03:35:39,296 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Socket closed
>>>>>
>>>>> This is the log from the 10.1.104.6 datanode for "blk_-8928911185099340956_5189425":
>>>>> 2012-05-10 03:31:16,772 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-8928911185099340956_5189425 src: /10.1.104.8:43828 dest: /10.1.104.6:50010
>>>>> 2012-05-10 03:35:39,041 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_-8928911185099340956_5189425 java.net.SocketException: Connection reset
>>>>> 2012-05-10 03:35:39,043 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for block blk_-8928911185099340956_5189425 Interrupted.
>>>>> 2012-05-10 03:35:39,043 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for block blk_-8928911185099340956_5189425 terminating
>>>>> 2012-05-10 03:35:39,043 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-8928911185099340956_5189425 received exception java.net.SocketException: Connection reset
>>>>>
>>>>>
>>>>> Any idea why this is happening?
>>>>>
>>>>> Thanks.
>>>>>
>>>>> -eran
>>>>>
>>>>
>>
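
For concreteness, here is a minimal sketch of the hbase-site.xml / hbase-env.sh changes behind points 2 and 3 above (bigger regions, MSLAB, GC tuning). The specific values, a ~2GB region size and the CMS flags, are illustrative assumptions rather than settings taken from this thread, so check them against your HBase version and heap size before applying.

  hbase-site.xml:

    <property>
      <!-- Raise the max region size from the old 256MB default toward 1-2GB,
           so each region server carries far fewer regions. -->
      <name>hbase.hregion.max.filesize</name>
      <value>2147483648</value>
    </property>
    <property>
      <!-- MSLAB reduces memstore heap fragmentation, which helps avoid the
           long GC pauses that can expire the ZooKeeper session and abort the RS. -->
      <name>hbase.hregion.memstore.mslab.enabled</name>
      <value>true</value>
    </property>

  hbase-env.sh (assumed CMS-based GC settings; tune for your own heap):

    export HBASE_OPTS="$HBASE_OPTS -XX:+UseConcMarkSweepGC \
      -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly"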
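
And here is a rough sketch of the "no reducer" point: a map-only job that parses its input in the mapper and writes Puts straight to HBase through TableOutputFormat, so there is no shuffle/sort and no reducer bottleneck. The table name ("my_table"), column family ("d"), and the tab-separated input format are made-up placeholders for illustration; the essential pieces are job.setNumReduceTasks(0) and the mapper emitting Puts (HBase 0.9x-era API).

  import java.io.IOException;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
  import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
  import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
  import org.apache.hadoop.hbase.util.Bytes;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

  public class MapOnlyHBaseLoad {

    // Hypothetical mapper: parses "rowkey<TAB>value" lines and emits one Put per line.
    static class LoadMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        String[] parts = line.toString().split("\t", 2);
        if (parts.length < 2) return;   // skip malformed lines
        byte[] row = Bytes.toBytes(parts[0]);
        Put put = new Put(row);
        put.add(Bytes.toBytes("d"), Bytes.toBytes("v"), Bytes.toBytes(parts[1]));
        context.write(new ImmutableBytesWritable(row), put);
      }
    }

    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      Job job = new Job(conf, "map-only HBase load");
      job.setJarByClass(MapOnlyHBaseLoad.class);

      job.setInputFormatClass(TextInputFormat.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));

      job.setMapperClass(LoadMapper.class);
      job.setOutputKeyClass(ImmutableBytesWritable.class);
      job.setOutputValueClass(Put.class);

      // Write from the mappers straight into HBase: no shuffle, no reducer.
      job.setOutputFormatClass(TableOutputFormat.class);
      job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, "my_table");  // placeholder table name
      job.setNumReduceTasks(0);

      TableMapReduceUtil.addDependencyJars(job);
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }

With this shape, each mapper streams Puts into the regions that own its keys instead of funneling the whole job's output through one or a few reducers onto a handful of region servers.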
