This "you don't need a reducer" conversation is distracting from the real problem and is false.
Many MapReduce algorithms require a reduce phase (e.g. sorting). The fact that the output is written to HBase or somewhere else is irrelevant.

-Dave

On Thu, May 10, 2012 at 6:26 AM, Michael Segel <[email protected]> wrote:

> Ok...
>
> So the issue is that you have a lot of regions on a region server, where the max file size is the default.
> On your input to HBase, you have a couple of issues.
>
> 1) Your data is most likely sorted. (Not good on inserts.)
> 2) You will want to increase your region size from the default (256MB) to something like 1-2GB.
> 3) You probably don't have MSLAB set up or GC tuned.
> 4) Google dfs.balance.bandwidthPerSec. I believe it's also used by HBase when it needs to move regions.
> Speaking of which, what happens when HBase decides to move a region? Does it make a copy on the new RS, point to the new RS once it's there, and then remove the old region?
>
> I'm assuming you're writing out of your reducer straight to HBase.
> Are you writing your job to 1 reducer or did you set up multiple reducers? You may want to play with having multiple reducers...
>
> Again, here's the issue. You don't need a reducer when writing to HBase. You would be better served by refactoring your job to have the mapper write to HBase directly.
> Think about it. (Really, think about it. If you really don't see it, face a white wall with a 6-pack of beer, start drinking, and focus on the question of why I would say you don't need a reducer on a map job.) ;-) Note: if you don't drink, go to the gym, get on a treadmill, and run at a good pace. Put your body into a zone and then work through the problem.
>
> HTH
>
> -Mike
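To make the pattern Mike is describing concrete, here is a rough sketch of a map-only job that writes Puts straight to HBase through TableOutputFormat, so no shuffle or reduce phase sits between the mappers and the region servers. It leans on TableMapReduceUtil.initTableReducerJob with a null reducer class plus setNumReduceTasks(0), which is the usual 0.90/0.92-era recipe; the class names, the tab-separated input layout, the "d"/"value" column names, and the command-line arguments are all placeholders, not anything taken from this thread.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class MapOnlyHBaseLoad {

  // The mapper emits (row key, Put) pairs; TableOutputFormat sends each Put to HBase.
  static class LoadMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      // Hypothetical input: tab-separated "rowkey<TAB>value" lines.
      String[] fields = line.toString().split("\t", 2);
      if (fields.length < 2) {
        return; // skip malformed records
      }
      byte[] row = Bytes.toBytes(fields[0]);
      Put put = new Put(row);
      put.add(Bytes.toBytes("d"), Bytes.toBytes("value"), Bytes.toBytes(fields[1]));
      context.write(new ImmutableBytesWritable(row), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "map-only HBase load");
    job.setJarByClass(MapOnlyHBaseLoad.class);

    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // args[0]: input path
    job.setMapperClass(LoadMapper.class);

    // Wires up TableOutputFormat for the named table; a null reducer class is allowed.
    TableMapReduceUtil.initTableReducerJob(args[1], null, job);  // args[1]: table name
    // No reduce phase: mappers write straight to the region servers.
    job.setNumReduceTasks(0);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Whether this refactoring applies depends on the job; as Dave points out above, algorithms that genuinely need a reduce phase (sorting, aggregation) still need one, regardless of where the output lands.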
> On May 10, 2012, at 7:22 AM, Eran Kutner wrote:
>
> > Hi Mike,
> > Not sure I understand the question about the reducer. I'm using a reducer because my M/R jobs require one and I want to write the result to HBase.
> > I have two tables I'm writing to; one is using the default file size (256MB if I remember correctly), the other one is 512MB.
> > There are ~700 regions on each server.
> > Didn't know there is a bandwidth limit. Is it on HDFS or HBase? How can it be configured?
> >
> > -eran
> >
> > On Thu, May 10, 2012 at 2:53 PM, Michel Segel <[email protected]> wrote:
> >
> >> Silly question... Why are you using a reducer when working with HBase?
> >>
> >> Second silly question... What is the max file size of your table that you are writing to?
> >>
> >> Third silly question... How many regions are on each of your region servers?
> >>
> >> Fourth silly question... There is this bandwidth setting... Default is 10MB... Did you modify it?
> >>
> >> Sent from a remote device. Please excuse any typos...
> >>
> >> Mike Segel
> >>
> >> On May 10, 2012, at 6:33 AM, Eran Kutner <[email protected]> wrote:
> >>
> >>> Thanks Igal, but we already have that setting. These are the relevant settings from hdfs-site.xml:
> >>> <property>
> >>>   <name>dfs.datanode.max.xcievers</name>
> >>>   <value>65536</value>
> >>> </property>
> >>> <property>
> >>>   <name>dfs.datanode.handler.count</name>
> >>>   <value>10</value>
> >>> </property>
> >>> <property>
> >>>   <name>dfs.datanode.socket.write.timeout</name>
> >>>   <value>0</value>
> >>> </property>
> >>>
> >>> Other ideas?
> >>>
> >>> -eran
> >>>
> >>> On Thu, May 10, 2012 at 12:25 PM, Igal Shilman <[email protected]> wrote:
> >>>
> >>>> Hi Eran,
> >>>> Do you have dfs.datanode.socket.write.timeout set in hdfs-site.xml?
> >>>> (We have set this to zero in our cluster, which means waiting as long as necessary for the write to complete.)
> >>>>
> >>>> Igal.
> >>>>
> >>>> On Thu, May 10, 2012 at 11:17 AM, Eran Kutner <[email protected]> wrote:
> >>>>
> >>>>> Hi,
> >>>>> We're seeing occasional regionserver crashes during heavy write operations to HBase (at the reduce phase of large M/R jobs). I have increased the file descriptors, HDFS xceivers, and HDFS threads to the recommended settings and actually way above.
> >>>>>
> >>>>> Here is an example of the HBase log (showing only errors):
> >>>>>
> >>>>> 2012-05-10 03:34:54,291 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_-8928911185099340956_5189425
> >>>>> java.io.IOException: Bad response 1 for block blk_-8928911185099340956_5189425 from datanode 10.1.104.6:50010
> >>>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2986)
> >>>>>
> >>>>> 2012-05-10 03:34:54,494 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.io.InterruptedIOException: Interruped while waiting for IO on channel java.nio.channels.SocketChannel[connected local=/10.1.104.9:59642 remote=/10.1.104.9:50010]. 0 millis timeout left.
> >>>>>     at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:349)
> >>>>>     at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
> >>>>>     at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
> >>>>>     at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
> >>>>>     at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
> >>>>>     at java.io.DataOutputStream.write(DataOutputStream.java:90)
> >>>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2848)
> >>>>>
> >>>>> 2012-05-10 03:34:54,531 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-8928911185099340956_5189425 bad datanode[2] 10.1.104.6:50010
> >>>>> 2012-05-10 03:34:54,531 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-8928911185099340956_5189425 in pipeline 10.1.104.9:50010, 10.1.104.8:50010, 10.1.104.6:50010: bad datanode 10.1.104.6:50010
> >>>>> 2012-05-10 03:48:30,174 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=hadoop1-s09.farm-ny.gigya.com,60020,1336476100422, load=(requests=15741, regions=789, usedHeap=6822, maxHeap=7983): regionserver:60020-0x2372c0e8a2f0008 regionserver:60020-0x2372c0e8a2f0008 received expired from ZooKeeper, aborting
> >>>>> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
> >>>>>     at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:352)
> >>>>>     at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:270)
> >>>>>     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:531)
> >>>>>     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507)
> >>>>> java.io.InterruptedIOException: Aborting compaction of store properties in region gs_users,6155551|QoCW/euBIKuMat/nRC5Xtw==,1334983658004.878522ea91f41cd76b903ea06ccd17f9. because user requested stop.
> >>>>>     at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:998)
> >>>>>     at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:779)
> >>>>>     at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:776)
> >>>>>     at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:721)
> >>>>>     at org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:81)
> >>>>>
> >>>>> This is from 10.1.104.9 (same machine running the region server that crashed):
> >>>>> 2012-05-10 03:31:16,785 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-8928911185099340956_5189425 src: /10.1.104.9:59642 dest: /10.1.104.9:50010
> >>>>> 2012-05-10 03:35:39,000 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Connection reset
> >>>>> 2012-05-10 03:35:39,052 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_-8928911185099340956_5189425 java.nio.channels.ClosedByInterruptException
> >>>>> 2012-05-10 03:35:39,053 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-8928911185099340956_5189425 received exception java.io.IOException: Interrupted receiveBlock
> >>>>> 2012-05-10 03:35:39,055 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hdfs (auth:SIMPLE) cause:java.io.IOException: Block blk_-8928911185099340956_5189425 length is 24384000 does not match block file length 24449024
> >>>>> 2012-05-10 03:35:39,055 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 on 50020, call startBlockRecovery(blk_-8928911185099340956_5189425) from 10.1.104.8:50251: error: java.io.IOException: Block blk_-8928911185099340956_5189425 length is 24384000 does not match block file length 24449024
> >>>>> java.io.IOException: Block blk_-8928911185099340956_5189425 length is 24384000 does not match block file length 24449024
> >>>>> 2012-05-10 03:35:39,077 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Broken pipe
> >>>>> 2012-05-10 03:35:39,077 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Socket closed
> >>>>> 2012-05-10 03:35:39,108 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Socket closed
> >>>>> 2012-05-10 03:35:39,136 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Socket closed
> >>>>> 2012-05-10 03:35:39,165 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Socket closed
> >>>>> 2012-05-10 03:35:39,196 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Socket closed
> >>>>> 2012-05-10 03:35:39,221 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Socket closed
> >>>>> 2012-05-10 03:35:39,246 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Socket closed
> >>>>> 2012-05-10 03:35:39,271 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Socket closed
> >>>>> 2012-05-10 03:35:39,296 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Socket closed
> >>>>>
> >>>>> This is the log from the 10.1.104.6 datanode for "blk_-8928911185099340956_5189425":
> >>>>> 2012-05-10 03:31:16,772 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-8928911185099340956_5189425 src: /10.1.104.8:43828 dest: /10.1.104.6:50010
> >>>>> 2012-05-10 03:35:39,041 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_-8928911185099340956_5189425 java.net.SocketException: Connection reset
> >>>>> 2012-05-10 03:35:39,043 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for block blk_-8928911185099340956_5189425 Interrupted.
> >>>>> 2012-05-10 03:35:39,043 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for block blk_-8928911185099340956_5189425 terminating
> >>>>> 2012-05-10 03:35:39,043 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-8928911185099340956_5189425 received exception java.net.SocketException: Connection reset
> >>>>>
> >>>>> Any idea why this is happening?
> >>>>>
> >>>>> Thanks.
> >>>>>
> >>>>> -eran
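The FATAL "received expired from ZooKeeper, aborting" line is the part of these logs worth chasing: one common reading, consistent with Mike's MSLAB/GC point above, is that a long garbage-collection pause under heavy writes outlives the ZooKeeper session, after which the region server aborts itself. Below is a minimal sketch of the configuration keys usually examined for that failure mode, written as a compilable Java snippet only so the property names sit in one self-contained place; in practice they would normally live in hbase-site.xml, and every value shown is illustrative rather than a recommendation from this thread.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class WriteHeavyTuningSketch {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();

    // Mike's point 2: fewer, larger regions than the era's 256MB default.
    conf.setLong("hbase.hregion.max.filesize", 2L * 1024 * 1024 * 1024); // ~2GB, placeholder value

    // Mike's point 3: MSLAB reduces old-gen fragmentation and long stop-the-world pauses.
    conf.setBoolean("hbase.hregion.memstore.mslab.enabled", true);

    // How long a pause the region server can survive before its ZooKeeper session
    // expires (still bounded by the ZooKeeper server's own session-timeout limits).
    conf.setInt("zookeeper.session.timeout", 120000); // milliseconds, placeholder value

    // Print the effective values, just to show the keys resolve.
    for (String key : new String[] {
        "hbase.hregion.max.filesize",
        "hbase.hregion.memstore.mslab.enabled",
        "zookeeper.session.timeout" }) {
      System.out.println(key + " = " + conf.get(key));
    }
  }
}

None of this replaces GC tuning itself (heap sizing, collector choice); the keys above only control how large a region grows, how the memstore allocates, and how much pause the ZooKeeper session will tolerate.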
