More info on this blog post: http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html
J-D

On Thu, Jul 8, 2010 at 10:11 AM, Jean-Daniel Cryans <[email protected]> wrote:
> This would be done at the expense of network IO, since you will lose
> locality for jobs that read/write to HBase. Also I guess the datanodes
> are also there, so HBase will lose locality with HDFS.
>
> J-D
>
> On Thu, Jul 8, 2010 at 10:07 AM, Jamie Cockrill
> <[email protected]> wrote:
>> Thanks all for your help with this, everything seems much more stable
>> for the meantime. I have a backlog loading job to run over a great
>> deal of data, so I might separate out my region servers from my task
>> trackers for the meantime.
>>
>> Thanks again,
>>
>> Jamie
>>
>> On 8 July 2010 17:46, Jean-Daniel Cryans <[email protected]> wrote:
>>> OS cache is good, glad you figured out your memory problem.
>>>
>>> J-D
>>>
>>> On Thu, Jul 8, 2010 at 2:03 AM, Jamie Cockrill <[email protected]>
>>> wrote:
>>>> Morning all. Day 2 begins...
>>>>
>>>> I discussed this with someone else earlier and they pointed out that
>>>> we also have task trackers running on all of those nodes, which will
>>>> affect the amount of memory being used when jobs are being run. Each
>>>> tasktracker had a maximum of 8 maps and 8 reduces configured per node,
>>>> with a JVM Xmx of 512MB each. Clearly this implies a fully utilised
>>>> node will use 8*512MB + 8*512MB = 8GB of memory on tasks alone. That's
>>>> before the datanode does anything, or HBase for that matter.
>>>>
>>>> As such, I've dropped it to 4 maps, 4 reduces per node and reduced the
>>>> Xmx to 256MB, giving a potential maximum task overhead of 2GB per
>>>> node. Running 'vmstat 20' now, under load from mapreduce jobs,
>>>> suggests that the actual free memory is about the same, but the memory
>>>> cache is much much bigger, which presumably is healthier as, in
>>>> theory, that ought to relinquish memory to processes that request it.
>>>>
>>>> Let's see if that does the trick!
>>>>
>>>> ta
>>>>
>>>> Jamie
>>>>
>>>> On 7 July 2010 19:30, Jean-Daniel Cryans <[email protected]> wrote:
>>>>> YouAreDead means that the region server's session was expired, GC
>>>>> seems like your major problem. (file problems can happen after a GC
>>>>> sleep because they were moved around while the process was sleeping,
>>>>> you also get the same kind of messages with the xcievers issue... sorry
>>>>> for the confusion)
>>>>>
>>>>> By over committing the memory I meant trying to fit too much stuff in
>>>>> the amount of RAM that you have. I guess it's the map and reduce tasks
>>>>> that eat all the free space? Why not lower their number?
>>>>>
>>>>> J-D
>>>>>
>>>>> On Wed, Jul 7, 2010 at 11:22 AM, Jamie Cockrill
>>>>> <[email protected]> wrote:
>>>>>> PS, I've now reset my MAX_FILESIZE back to the default (from the 1GB
>>>>>> I raised it to). It caused me to run into a delightful
>>>>>> 'YouAreDeadException' which looks very related to the Garbage
>>>>>> collection issues on the Troubleshooting page, as my Zookeeper session
>>>>>> expired.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Jamie
>>>>>>
>>>>>> On 7 July 2010 19:19, Jamie Cockrill <[email protected]> wrote:
>>>>>>> By overcommit, do you mean make my overcommit_ratio higher on each box
>>>>>>> (it's at the default 50 at the moment)? What I'm noticing at the moment
>>>>>>> is that hadoop is taking up the vast majority of the memory on the
>>>>>>> boxes.
>>>>>>>
>>>>>>> I found this article:
>>>>>>> http://blog.rapleaf.com/dev/2010/01/05/the-wrath-of-drwho-or-unpredictable-hadoop-memory-usage/
>>>>>>> which Todd, it looks like you replied to. Does this sound like a
>>>>>>> similar problem? No worries if you can't remember, it was back in
>>>>>>> January! This article suggests reducing the amount of memory allocated
>>>>>>> to Hadoop at startup, how would I go about doing this?
>>>>>>>
>>>>>>> Thank you everyone for your patience so far. Sorry if this is taking
>>>>>>> up a lot of your time.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Jamie
>>>>>>>
>>>>>>> On 7 July 2010 19:03, Jean-Daniel Cryans <[email protected]> wrote:
>>>>>>>> swappiness at 0 is good, but also don't overcommit your memory!
>>>>>>>>
>>>>>>>> J-D
>>>>>>>>
>>>>>>>> On Wed, Jul 7, 2010 at 10:53 AM, Jamie Cockrill
>>>>>>>> <[email protected]> wrote:
>>>>>>>>> I think you're right.
>>>>>>>>>
>>>>>>>>> Unfortunately the machines are on a separate network to this laptop,
>>>>>>>>> so I'm having to type everything across, apologies if it doesn't
>>>>>>>>> translate well...
>>>>>>>>>
>>>>>>>>> free -m gave:
>>>>>>>>>
>>>>>>>>>         Total   Used   Free
>>>>>>>>> Mem:     7992   7939     53
>>>>>>>>> b/c:            7877    114
>>>>>>>>> Swap:   23415    895  22519
>>>>>>>>>
>>>>>>>>> I did this on another node that isn't being smashed at the moment and
>>>>>>>>> the numbers came out similar, but the buffers/cache free was higher
>>>>>>>>>
>>>>>>>>> vmstat 20 is giving non-zero si and so's ranging between 3 and just
>>>>>>>>> short of 5000.
>>>>>>>>>
>>>>>>>>> That seems to be it I guess. Hadoop troubleshooting suggests setting
>>>>>>>>> swappiness to 0, is that just a case of changing the value in
>>>>>>>>> /proc/sys/vm/swappiness?
>>>>>>>>>
>>>>>>>>> thanks
>>>>>>>>>
>>>>>>>>> Jamie
>>>>>>>>>
>>>>>>>>> On 7 July 2010 18:40, Todd Lipcon <[email protected]> wrote:
>>>>>>>>>> On Wed, Jul 7, 2010 at 10:32 AM, Jamie Cockrill
>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> On the subject of GC and heap, I've left those as defaults. I could
>>>>>>>>>>> look at those if that's the next logical step? Would there be anything
>>>>>>>>>>> in any of the logs that I should look at?
>>>>>>>>>>>
>>>>>>>>>>> One thing I have noticed is that it does take an absolute age to log
>>>>>>>>>>> in to the DN/RS to restart the RS once it's fallen over; in one
>>>>>>>>>>> instance it took about 10 minutes. These are 8GB, 4 core amd64 boxes
>>>>>>>>>>>
>>>>>>>>>> That indicates swapping. Can you run "free -m" on the node?
>>>>>>>>>>
>>>>>>>>>> Also let "vmstat 20" run while running your job and observe the "si"
>>>>>>>>>> and "so" columns. If those are nonzero, it indicates you're swapping,
>>>>>>>>>> and you've oversubscribed your RAM (very easy on 8G machines)
>>>>>>>>>>
>>>>>>>>>> -Todd
>>>>>>>>>>
>>>>>>>>>>> ta
>>>>>>>>>>>
>>>>>>>>>>> Jamie
>>>>>>>>>>>
>>>>>>>>>>> On 7 July 2010 18:30, Jamie Cockrill <[email protected]> wrote:
>>>>>>>>>>> > Bad news, it looks like my xcievers is set as it should be, it's in
>>>>>>>>>>> > the hdfs-site.xml and looking at the job.xml of one of my jobs in the
>>>>>>>>>>> > job-tracker, it's showing that property as set to 2047. I've cat |
>>>>>>>>>>> > grepped one of the datanode logs and although there were a few in
>>>>>>>>>>> > there, they were from a few months ago. I've upped my MAX_FILESIZE on
>>>>>>>>>>> > my table to 1GB to see if that helps (not sure if it will!).
>>>>>>>>>>> >
>>>>>>>>>>> > Thanks,
>>>>>>>>>>> >
>>>>>>>>>>> > Jamie
>>>>>>>>>>> >
>>>>>>>>>>> > On 7 July 2010 18:12, Jean-Daniel Cryans <[email protected]> wrote:
>>>>>>>>>>> >> xcievers exceptions will be in the datanodes' logs, and your problem
>>>>>>>>>>> >> totally looks like it. 0.20.5 will have the same issue (since it's on
>>>>>>>>>>> >> the HDFS side)
>>>>>>>>>>> >>
>>>>>>>>>>> >> J-D
>>>>>>>>>>> >>
>>>>>>>>>>> >> On Wed, Jul 7, 2010 at 10:08 AM, Jamie Cockrill
>>>>>>>>>>> >> <[email protected]> wrote:
>>>>>>>>>>> >>> Hi Todd & JD,
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> Environment:
>>>>>>>>>>> >>> All (hadoop and HBase) installed as of karmic-cdh3, which means:
>>>>>>>>>>> >>> Hadoop 0.20.2+228
>>>>>>>>>>> >>> HBase 0.89.20100621+17
>>>>>>>>>>> >>> Zookeeper 3.3.1+7
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> Unfortunately my whole cluster of regionservers have now crashed, so I
>>>>>>>>>>> >>> can't really say if it was swapping too much. There is a DEBUG
>>>>>>>>>>> >>> statement just before it crashes saying:
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> org.apache.hadoop.hbase.regionserver.wal.HLog: closing hlog writer in
>>>>>>>>>>> >>> hdfs://<somewhere on my HDFS, in /hbase>
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> What follows is:
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception:
>>>>>>>>>>> >>> org.apache.hadoop.ipc.RemoteException:
>>>>>>>>>>> >>> org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease
>>>>>>>>>>> >>> on <file location as above> File does not exist. Holder
>>>>>>>>>>> >>> DFSClient_-11113603 does not have any open files
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> It then seems to try and do some error recovery (Error Recovery for
>>>>>>>>>>> >>> block null bad datanode[0] nodes == null), fails (Could not get block
>>>>>>>>>>> >>> locations. Source file "<hbase file as before>" - Aborting). There is
>>>>>>>>>>> >>> then an ERROR org.apache...HRegionServer: Close and delete failed.
>>>>>>>>>>> >>> There is then a similar LeaseExpiredException as above.
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> There are then a couple of messages from HRegionServer saying that
>>>>>>>>>>> >>> it's notifying master of its shutdown and stopping itself. The
>>>>>>>>>>> >>> shutdown hook then fires and the RemoteException and
>>>>>>>>>>> >>> LeaseExpiredExceptions are printed again.
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> ulimit is set to 65000 (it's in the regionserver log, printed as I
>>>>>>>>>>> >>> restarted the regionserver), however I haven't got the xcievers set
>>>>>>>>>>> >>> anywhere. I'll give that a go. It does seem very odd as I did have a
>>>>>>>>>>> >>> few of them fall over one at a time with a few early loads, but that
>>>>>>>>>>> >>> seemed to be because the regions weren't splitting properly, so all
>>>>>>>>>>> >>> the traffic was going to one node and it was being overwhelmed. Once I
>>>>>>>>>>> >>> throttled it, after one load a region split seemed to get
>>>>>>>>>>> >>> triggered, which flung regions all over, which made subsequent loads
>>>>>>>>>>> >>> much more distributed. However, perhaps the time-bomb was ticking...
>>>>>>>>>>> >>> I'll have a go at specifying the xcievers property. I'm pretty
>>>>>>>>>>> >>> certain I've got everything else covered, except the patches as
>>>>>>>>>>> >>> referenced in the JIRA.
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> I just grepped some of the log files and didn't get an explicit
>>>>>>>>>>> >>> exception with 'xciever' in it.
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> I am considering downgrading(?) to 0.20.5, however because everything
>>>>>>>>>>> >>> is installed as per karmic-cdh3, I'm a bit reluctant to do so as
>>>>>>>>>>> >>> presumably Cloudera has tested each of these versions against each
>>>>>>>>>>> >>> other? And I don't really want to introduce further versioning
>>>>>>>>>>> >>> issues.
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> Thanks,
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> Jamie
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> On 7 July 2010 17:30, Jean-Daniel Cryans <[email protected]> wrote:
>>>>>>>>>>> >>>> Jamie,
>>>>>>>>>>> >>>>
>>>>>>>>>>> >>>> Does your configuration meet the requirements?
>>>>>>>>>>> >>>>
>>>>>>>>>>> >>>> http://hbase.apache.org/docs/r0.20.5/api/overview-summary.html#requirements
>>>>>>>>>>> >>>>
>>>>>>>>>>> >>>> ulimit and xcievers, if not set, are usually time bombs that blow off
>>>>>>>>>>> >>>> when the cluster is under load.
>>>>>>>>>>> >>>>
>>>>>>>>>>> >>>> J-D
>>>>>>>>>>> >>>>
>>>>>>>>>>> >>>> On Wed, Jul 7, 2010 at 9:11 AM, Jamie Cockrill
>>>>>>>>>>> >>>> <[email protected]> wrote:
>>>>>>>>>>> >>>>
>>>>>>>>>>> >>>>> Dear all,
>>>>>>>>>>> >>>>>
>>>>>>>>>>> >>>>> My current HBase/Hadoop architecture has HBase region servers on the
>>>>>>>>>>> >>>>> same physical boxes as the HDFS data-nodes. I'm getting an awful lot
>>>>>>>>>>> >>>>> of region server crashes. The last thing that happens appears to be a
>>>>>>>>>>> >>>>> DroppedSnapshot Exception, caused by an IOException: could not
>>>>>>>>>>> >>>>> complete write to file <file on HDFS>. I am running it under load.
>>>>>>>>>>> >>>>> How heavy that is, and how it is quantified, I'm not sure, but I'm
>>>>>>>>>>> >>>>> guessing it is a load issue.
>>>>>>>>>>> >>>>>
>>>>>>>>>>> >>>>> Is it common practice to put region servers on data-nodes? Is it
>>>>>>>>>>> >>>>> common to see region server crashes when either the HDFS or region
>>>>>>>>>>> >>>>> server (or both) is under heavy load? I'm guessing that is the case
>>>>>>>>>>> >>>>> as I've seen a few similar posts. I've not got a great deal of
>>>>>>>>>>> >>>>> capacity to be separating region servers from HDFS data nodes, but it
>>>>>>>>>>> >>>>> might be an argument I could make.
>>>>>>>>>>> >>>>>
>>>>>>>>>>> >>>>> Thanks
>>>>>>>>>>> >>>>>
>>>>>>>>>>> >>>>> Jamie
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Todd Lipcon
>>>>>>>>>> Software Engineer, Cloudera
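
For anyone finding this thread later, here is a rough Python sketch of the task-memory arithmetic from Jamie's 8 July message. The function name and the NODE_RAM_MB constant are made up for illustration (7992 MB is the Mem total from the free -m output quoted in the thread). It counts only the configured -Xmx heap per task slot; a JVM generally uses somewhat more than its heap limit, and the datanode and region server need memory too, so treat the numbers as a lower bound on what the tasks can take.

```python
# Hypothetical helper: worst-case task heap if every slot runs at full -Xmx.
def task_heap_budget_mb(map_slots: int, reduce_slots: int, xmx_mb: int) -> int:
    """Return the combined heap limit (MB) of all map and reduce slots on a node."""
    return (map_slots + reduce_slots) * xmx_mb


# Assumption: total RAM per node, taken from the "free -m" output in the thread.
NODE_RAM_MB = 7992

# Original settings: 8 maps + 8 reduces at -Xmx512m each.
before = task_heap_budget_mb(8, 8, 512)   # 8192 MB, i.e. 8GB
# Reduced settings: 4 maps + 4 reduces at -Xmx256m each.
after = task_heap_budget_mb(4, 4, 256)    # 2048 MB, i.e. 2GB

for label, budget in (("before", before), ("after", after)):
    # What is left for the datanode, region server and OS cache.
    headroom = NODE_RAM_MB - budget
    print(f"{label}: tasks may use {budget} MB, leaving {headroom} MB")
```

With the original settings the task heaps alone can exceed the physical RAM on the box, which fits the swapping seen in vmstat; with the reduced settings roughly 5.8GB is left for everything else.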
