This "you don't need a reducer" conversation is distracting from the real problem and is false.
Many MapReduce algorithms require a reduce phase (e.g. sorting). The fact that the output is written to HBase or somewhere else is irrelevant.

-Dave

On Thu, May 10, 2012 at 6:26 AM, Michael Segel <[email protected]> wrote:

> Ok...
>
> So the issue is that you have a lot of regions on a region server, where the max file size is the default.
> On your input to HBase, you have a couple of issues.
>
> 1) Your data is most likely sorted. (Not good on inserts.)
> 2) You will want to increase your region size from the default (256MB) to something like 1-2GB.
> 3) You probably don't have MSLAB set up or GC tuned.
> 4) Google dfs.balance.bandwidthPerSec. I believe it's also used by HBase when it needs to move regions.
> Speaking of which, what happens when HBase decides to move a region? Does it make a copy on the new RS, point to the new RS once it's there, and then remove the old region?
>
> I'm assuming you're writing out of your reducer straight to HBase.
> Are you writing your job to 1 reducer or did you set up multiple reducers? You may want to play with having multiple reducers...
>
> Again, here's the issue. You don't need a reducer when writing to HBase. You would be better served by refactoring your job to have the mapper write to HBase directly.
> Think about it. (Really, think about it. If you really don't see it, face a white wall with a 6-pack of beer, start drinking, and focus on the question of why I would say you don't need a reducer on a map job.) ;-) Note: if you don't drink, go to the gym, get on a treadmill, and run at a good pace. Put your body into a zone and then work through the problem.
>
> HTH
>
> -Mike
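To make the pattern Mike is describing concrete, here is a rough sketch of a map-only job that writes Puts straight to HBase through TableOutputFormat, so no shuffle or reduce phase sits between the mappers and the region servers. It leans on TableMapReduceUtil.initTableReducerJob with a null reducer class plus setNumReduceTasks(0), which is the usual 0.90/0.92-era recipe; the class names, the tab-separated input layout, the "d"/"value" column names, and the command-line arguments are all placeholders, not anything taken from this thread.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class MapOnlyHBaseLoad {

  // The mapper emits (row key, Put) pairs; TableOutputFormat sends each Put to HBase.
  static class LoadMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      // Hypothetical input: tab-separated "rowkey<TAB>value" lines.
      String[] fields = line.toString().split("\t", 2);
      if (fields.length < 2) {
        return; // skip malformed records
      }
      byte[] row = Bytes.toBytes(fields[0]);
      Put put = new Put(row);
      put.add(Bytes.toBytes("d"), Bytes.toBytes("value"), Bytes.toBytes(fields[1]));
      context.write(new ImmutableBytesWritable(row), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "map-only HBase load");
    job.setJarByClass(MapOnlyHBaseLoad.class);

    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // args[0]: input path
    job.setMapperClass(LoadMapper.class);

    // Wires up TableOutputFormat for the named table; a null reducer class is allowed.
    TableMapReduceUtil.initTableReducerJob(args[1], null, job);  // args[1]: table name
    // No reduce phase: mappers write straight to the region servers.
    job.setNumReduceTasks(0);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Whether this refactoring applies depends on the job; as Dave points out above, algorithms that genuinely need a reduce phase (sorting, aggregation) still need one, regardless of where the output lands.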
> On May 10, 2012, at 7:22 AM, Eran Kutner wrote:
>
> > Hi Mike,
> > Not sure I understand the question about the reducer. I'm using a reducer because my M/R jobs require one and I want to write the result to HBase.
> > I have two tables I'm writing to; one is using the default file size (256MB if I remember correctly), the other one is 512MB.
> > There are ~700 regions on each server.
> > Didn't know there is a bandwidth limit. Is it on HDFS or HBase? How can it be configured?
> >
> > -eran
> >
> > On Thu, May 10, 2012 at 2:53 PM, Michel Segel <[email protected]> wrote:
> >
> >> Silly question... Why are you using a reducer when working with HBase?
> >>
> >> Second silly question... What is the max file size of your table that you are writing to?
> >>
> >> Third silly question... How many regions are on each of your region servers?
> >>
> >> Fourth silly question... There is this bandwidth setting... Default is 10MB... Did you modify it?
> >>
> >> Sent from a remote device. Please excuse any typos...
> >>
> >> Mike Segel
> >>
> >> On May 10, 2012, at 6:33 AM, Eran Kutner <[email protected]> wrote:
> >>
> >>> Thanks Igal, but we already have that setting. These are the relevant settings from hdfs-site.xml:
> >>> <property>
> >>>   <name>dfs.datanode.max.xcievers</name>
> >>>   <value>65536</value>
> >>> </property>
> >>> <property>
> >>>   <name>dfs.datanode.handler.count</name>
> >>>   <value>10</value>
> >>> </property>
> >>> <property>
> >>>   <name>dfs.datanode.socket.write.timeout</name>
> >>>   <value>0</value>
> >>> </property>
> >>>
> >>> Other ideas?
> >>>
> >>> -eran
> >>>
> >>> On Thu, May 10, 2012 at 12:25 PM, Igal Shilman <[email protected]> wrote:
> >>>
> >>>> Hi Eran,
> >>>> Do you have dfs.datanode.socket.write.timeout set in hdfs-site.xml?
> >>>> (We have set this to zero in our cluster, which means waiting as long as necessary for the write to complete.)
> >>>>
> >>>> Igal.
> >>>>
> >>>> On Thu, May 10, 2012 at 11:17 AM, Eran Kutner <[email protected]> wrote:
> >>>>
> >>>>> Hi,
> >>>>> We're seeing occasional regionserver crashes during heavy write operations to HBase (at the reduce phase of large M/R jobs). I have increased the file descriptors, HDFS xceivers, and HDFS threads to the recommended settings and actually way above.
> >>>>>
> >>>>> Here is an example of the HBase log (showing only errors):
> >>>>>
> >>>>> 2012-05-10 03:34:54,291 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_-8928911185099340956_5189425
> >>>>> java.io.IOException: Bad response 1 for block blk_-8928911185099340956_5189425 from datanode 10.1.104.6:50010
> >>>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2986)
> >>>>>
> >>>>> 2012-05-10 03:34:54,494 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.io.InterruptedIOException: Interruped while waiting for IO on channel java.nio.channels.SocketChannel[connected local=/10.1.104.9:59642 remote=/10.1.104.9:50010]. 0 millis timeout left.
> >>>>>     at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:349)
> >>>>>     at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
> >>>>>     at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
> >>>>>     at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
> >>>>>     at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
> >>>>>     at java.io.DataOutputStream.write(DataOutputStream.java:90)
> >>>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2848)
> >>>>>
> >>>>> 2012-05-10 03:34:54,531 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-8928911185099340956_5189425 bad datanode[2] 10.1.104.6:50010
> >>>>> 2012-05-10 03:34:54,531 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-8928911185099340956_5189425 in pipeline 10.1.104.9:50010, 10.1.104.8:50010, 10.1.104.6:50010: bad datanode 10.1.104.6:50010
> >>>>> 2012-05-10 03:48:30,174 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=hadoop1-s09.farm-ny.gigya.com,60020,1336476100422, load=(requests=15741, regions=789, usedHeap=6822, maxHeap=7983): regionserver:60020-0x2372c0e8a2f0008 regionserver:60020-0x2372c0e8a2f0008 received expired from ZooKeeper, aborting
> >>>>> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
> >>>>>     at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:352)
> >>>>>     at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:270)
> >>>>>     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:531)
> >>>>>     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507)
> >>>>> java.io.InterruptedIOException: Aborting compaction of store properties in region gs_users,6155551|QoCW/euBIKuMat/nRC5Xtw==,1334983658004.878522ea91f41cd76b903ea06ccd17f9. because user requested stop.
> >>>>>     at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:998)
> >>>>>     at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:779)
> >>>>>     at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:776)
> >>>>>     at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:721)
> >>>>>     at org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:81)
> >>>>>
> >>>>> This is from 10.1.104.9 (same machine running the region server that crashed):
> >>>>> 2012-05-10 03:31:16,785 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-8928911185099340956_5189425 src: /10.1.104.9:59642 dest: /10.1.104.9:50010
> >>>>> 2012-05-10 03:35:39,000 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Connection reset
> >>>>> 2012-05-10 03:35:39,052 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_-8928911185099340956_5189425 java.nio.channels.ClosedByInterruptException
> >>>>> 2012-05-10 03:35:39,053 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-8928911185099340956_5189425 received exception java.io.IOException: Interrupted receiveBlock
> >>>>> 2012-05-10 03:35:39,055 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hdfs (auth:SIMPLE) cause:java.io.IOException: Block blk_-8928911185099340956_5189425 length is 24384000 does not match block file length 24449024
> >>>>> 2012-05-10 03:35:39,055 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 on 50020, call startBlockRecovery(blk_-8928911185099340956_5189425) from 10.1.104.8:50251: error: java.io.IOException: Block blk_-8928911185099340956_5189425 length is 24384000 does not match block file length 24449024
> >>>>> java.io.IOException: Block blk_-8928911185099340956_5189425 length is 24384000 does not match block file length 24449024
> >>>>> 2012-05-10 03:35:39,077 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Broken pipe
> >>>>> 2012-05-10 03:35:39,077 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Socket closed
> >>>>> 2012-05-10 03:35:39,108 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Socket closed
> >>>>> 2012-05-10 03:35:39,136 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Socket closed
> >>>>> 2012-05-10 03:35:39,165 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Socket closed
> >>>>> 2012-05-10 03:35:39,196 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Socket closed
> >>>>> 2012-05-10 03:35:39,221 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Socket closed
> >>>>> 2012-05-10 03:35:39,246 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Socket closed
> >>>>> 2012-05-10 03:35:39,271 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Socket closed
> >>>>> 2012-05-10 03:35:39,296 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_-8928911185099340956_5189425 2 Exception java.net.SocketException: Socket closed
> >>>>>
> >>>>> This is the log from the 10.1.104.6 datanode for "blk_-8928911185099340956_5189425":
> >>>>> 2012-05-10 03:31:16,772 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-8928911185099340956_5189425 src: /10.1.104.8:43828 dest: /10.1.104.6:50010
> >>>>> 2012-05-10 03:35:39,041 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_-8928911185099340956_5189425 java.net.SocketException: Connection reset
> >>>>> 2012-05-10 03:35:39,043 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for block blk_-8928911185099340956_5189425 Interrupted.
> >>>>> 2012-05-10 03:35:39,043 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for block blk_-8928911185099340956_5189425 terminating
> >>>>> 2012-05-10 03:35:39,043 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-8928911185099340956_5189425 received exception java.net.SocketException: Connection reset
> >>>>>
> >>>>> Any idea why this is happening?
> >>>>>
> >>>>> Thanks.
> >>>>>
> >>>>> -eran
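The FATAL "received expired from ZooKeeper, aborting" line is the part of these logs worth chasing: one common reading, consistent with Mike's MSLAB/GC point above, is that a long garbage-collection pause under heavy writes outlives the ZooKeeper session, after which the region server aborts itself. Below is a minimal sketch of the configuration keys usually examined for that failure mode, written as a compilable Java snippet only so the property names sit in one self-contained place; in practice they would normally live in hbase-site.xml, and every value shown is illustrative rather than a recommendation from this thread.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class WriteHeavyTuningSketch {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();

    // Mike's point 2: fewer, larger regions than the era's 256MB default.
    conf.setLong("hbase.hregion.max.filesize", 2L * 1024 * 1024 * 1024); // ~2GB, placeholder value

    // Mike's point 3: MSLAB reduces old-gen fragmentation and long stop-the-world pauses.
    conf.setBoolean("hbase.hregion.memstore.mslab.enabled", true);

    // How long a pause the region server can survive before its ZooKeeper session
    // expires (still bounded by the ZooKeeper server's own session-timeout limits).
    conf.setInt("zookeeper.session.timeout", 120000); // milliseconds, placeholder value

    // Print the effective values, just to show the keys resolve.
    for (String key : new String[] {
        "hbase.hregion.max.filesize",
        "hbase.hregion.memstore.mslab.enabled",
        "zookeeper.session.timeout" }) {
      System.out.println(key + " = " + conf.get(key));
    }
  }
}

None of this replaces GC tuning itself (heap sizing, collector choice); the keys above only control how large a region grows, how the memstore allocates, and how much pause the ZooKeeper session will tolerate.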
