Fwd: HBase on same boxes as HDFS Data nodes

vramanathan00 Thu, 08 Jul 2010 10:31:24 -0700

Hi,
I'm fairly new to HBase and to this list. Following up on this thread and the
article:
Could someone elaborate on why locality is lost upon restart? Is it because
of random region assignment by the HMaster, because the HRegionServer is
stateless, or for other reasons?


thanks
venkatesh

-----Original Message-----
From: Jean-Daniel Cryans <[email protected]>
To: [email protected]
Sent: Thu, Jul 8, 2010 1:11 pm
Subject: Re: HBase on same boxes as HDFS Data nodes


More info on this blog post:

http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html



J-D



On Thu, Jul 8, 2010 at 10:11 AM, Jean-Daniel Cryans <[email protected]> wrote:

> This would be done at the expense of network IO, since you will lose

> locality for jobs that read/write to HBase. Also I guess the datanodes

> are also there, so HBase will lose locality with HDFS.

>

> J-D

>

> On Thu, Jul 8, 2010 at 10:07 AM, Jamie Cockrill

> <[email protected]> wrote:

>> Thanks all for your help with this, everything seems much more stable

>> for the meantime. I have a backlog loading job to run over a great

>> deal of data, so I might separate out my region servers from my task

>> trackers for the meantime.

>>

>> Thanks again,

>>

>> Jamie

>>

>>

>>

>> On 8 July 2010 17:46, Jean-Daniel Cryans <[email protected]> wrote:

>>> OS cache is good, glad you figured out your memory problem.

>>>

>>> J-D

>>>

>>> On Thu, Jul 8, 2010 at 2:03 AM, Jamie Cockrill <[email protected]> wrote:

>>>> Morning all. Day 2 begins...

>>>>

>>>> I discussed this with someone else earlier and they pointed out that

>>>> we also have task trackers running on all of those nodes, which will

>>>> affect the amount of memory being used when jobs are being run. Each

>>>> tasktracker had a maximum of 8 maps and 8 reduces configured per node,

>>>> with a JVM Xmx of 512mb each.  Clearly this implies a fully utilised

>>>> node will use 8*512mb + 8*512mb = 8GB of memory on tasks alone. That's

>>>> before the datanode does anything, or HBase for that matter.

>>>>

>>>> As such, I've dropped it to 4 maps, 4 reduces per node and reduced the

>>>> Xmx to 256mb, giving a potential maximum task overhead of 2GB per

>>>> node. Running 'vmstat 20' now, under load from mapreduce jobs,

>>>> suggests that the actual free memory is about the same, but the memory

>>>> cache is much much bigger, which presumably is healthier as, in

>>>> theory, that ought to relinquish memory to processes that request it.

>>>>

>>>> Let's see if that does the trick!

>>>>

>>>> ta

>>>>

>>>> Jamie

>>>>

>>>>

>>>> On 7 July 2010 19:30, Jean-Daniel Cryans <[email protected]> wrote:

>>>>> YouAreDead means that the region server's session was expired, GC

>>>>> seems like your major problem. (file problems can happen after a GC

>>>>> sleep because they were moved around while the process was sleeping,

>>>>> you also get the same kind of messages with xcievers issue... sorry

>>>>> for the confusion)

>>>>>

>>>>> By over committing the memory I meant trying to fit too much stuff in

>>>>> the amount of RAM that you have. I guess it's the map and reduce tasks

>>>>> that eat all the free space? Why not lower their number?

>>>>>

>>>>> J-D

>>>>>

>>>>> On Wed, Jul 7, 2010 at 11:22 AM, Jamie Cockrill

>>>>> <[email protected]> wrote:

>>>>>> PS, I've now reset my MAX_FILESIZE back to the default (from the 1GB

>>>>>> I raised it to). It caused me to run into a delightful

>>>>>> 'YouAreDeadException' which looks very related to the Garbage

>>>>>> collection issues on the Troubleshooting page, as my Zookeeper session

>>>>>> expired.

>>>>>>

>>>>>> Thanks

>>>>>>

>>>>>> Jamie

>>>>>>

>>>>>>

>>>>>>

>>>>>> On 7 July 2010 19:19, Jamie Cockrill <[email protected]> wrote:

>>>>>>> By overcommit, do you mean make my overcommit_ratio higher on each box

>>>>>>> (it's at the default 50 at the moment)? What I'm noticing at the moment

>>>>>>> is that hadoop is taking up the vast majority of the memory on the

>>>>>>> boxes.

>>>>>>>

>>>>>>> I found this article:

>>>>>>> http://blog.rapleaf.com/dev/2010/01/05/the-wrath-of-drwho-or-unpredictable-hadoop-memory-usage/

>>>>>>> which Todd, it looks like you replied to. Does this sound like a

>>>>>>> similar problem? No worries if you can't remember, it was back in

>>>>>>> January! This article suggests reducing the amount of memory allocated

>>>>>>> to Hadoop at startup, how would I go about doing this?

>>>>>>>

>>>>>>> Thank you everyone for your patience so far. Sorry if this is taking

>>>>>>> up a lot of your time.

>>>>>>>

>>>>>>> Thanks,

>>>>>>>

>>>>>>> Jamie

>>>>>>>

>>>>>>> On 7 July 2010 19:03, Jean-Daniel Cryans <[email protected]> wrote:

>>>>>>>> swappiness at 0 is good, but also don't overcommit your memory!

>>>>>>>>

>>>>>>>> J-D

>>>>>>>>

>>>>>>>> On Wed, Jul 7, 2010 at 10:53 AM, Jamie Cockrill

>>>>>>>> <[email protected]> wrote:

>>>>>>>>> I think you're right.

>>>>>>>>>

>>>>>>>>> Unfortunately the machines are on a separate network to this laptop,

>>>>>>>>> so I'm having to type everything across, apologies if it doesn't

>>>>>>>>> translate well...

>>>>>>>>>

>>>>>>>>> free -m gave:

>>>>>>>>>

>>>>>>>>>          Total    Used    Free
>>>>>>>>> Mem:      7992    7939      53
>>>>>>>>> b/c:              7877     114
>>>>>>>>> Swap:    23415     895   22519

>>>>>>>>>

>>>>>>>>> I did this on another node that isn't being smashed at the moment and

>>>>>>>>> the numbers came out similar, but the buffers/cache free was higher

>>>>>>>>>

>>>>>>>>> vmstat 20 is giving non-zero si and so values ranging between 3 and just

>>>>>>>>> short of 5000.

>>>>>>>>>

>>>>>>>>> That seems to be it I guess. Hadoop troubleshooting suggests setting

>>>>>>>>> swappiness to 0, is that just a case of changing the value in

>>>>>>>>> /proc/sys/vm/swappiness?

>>>>>>>>>

>>>>>>>>> thanks

>>>>>>>>>

>>>>>>>>> Jamie

>>>>>>>>>

>>>>>>>>>

>>>>>>>>>

>>>>>>>>>

>>>>>>>>> On 7 July 2010 18:40, Todd Lipcon <[email protected]> wrote:

>>>>>>>>>> On Wed, Jul 7, 2010 at 10:32 AM, Jamie Cockrill 
>>>>>>>>>> <[email protected]> wrote:

>>>>>>>>>>

>>>>>>>>>>> On the subject of GC and heap, I've left those as defaults. I could

>>>>>>>>>>> look at those if that's the next logical step? Would there be anything

>>>>>>>>>>> in any of the logs that I should look at?

>>>>>>>>>>>

>>>>>>>>>>> One thing I have noticed is that it does take an absolute age to log

>>>>>>>>>>> in to the DN/RS to restart the RS once it's fallen over, in one

>>>>>>>>>>> instance it took about 10 minutes. These are 8GB, 4 core amd64 boxes

>>>>>>>>>>>

>>>>>>>>>>>

>>>>>>>>>> That indicates swapping. Can you run "free -m" on the node?

>>>>>>>>>>

>>>>>>>>>> Also let "vmstat 20" run while running your job and observe the "si" and

>>>>>>>>>> "so" columns. If those are nonzero, it indicates you're swapping, and you've

>>>>>>>>>> oversubscribed your RAM (very easy on 8G machines)

>>>>>>>>>>

>>>>>>>>>> -Todd

>>>>>>>>>>

>>>>>>>>>>

>>>>>>>>>>

>>>>>>>>>>> ta

>>>>>>>>>>>

>>>>>>>>>>> Jamie

>>>>>>>>>>>

>>>>>>>>>>>

>>>>>>>>>>>

>>>>>>>>>>> On 7 July 2010 18:30, Jamie Cockrill <[email protected]> wrote:

>>>>>>>>>>> > Bad news, it looks like my xcievers is set as it should be, it's in

>>>>>>>>>>> > the hdfs-site.xml and looking at the job.xml of one of my jobs in the

>>>>>>>>>>> > job-tracker, it's showing that property as set to 2047. I've cat |

>>>>>>>>>>> > grepped one of the datanode logs and although there were a few in

>>>>>>>>>>> > there, they were from a few months ago. I've upped my MAX_FILESIZE on

>>>>>>>>>>> > my table to 1GB to see if that helps (not sure if it will!).

>>>>>>>>>>> >

>>>>>>>>>>> > Thanks,

>>>>>>>>>>> >

>>>>>>>>>>> > Jamie

>>>>>>>>>>> >

>>>>>>>>>>> > On 7 July 2010 18:12, Jean-Daniel Cryans <[email protected]> wrote:

>>>>>>>>>>> >> xcievers exceptions will be in the datanodes' logs, and your problem

>>>>>>>>>>> >> totally looks like it. 0.20.5 will have the same issue (since it's on

>>>>>>>>>>> >> the HDFS side)

>>>>>>>>>>> >>

>>>>>>>>>>> >> J-D

>>>>>>>>>>> >>

>>>>>>>>>>> >> On Wed, Jul 7, 2010 at 10:08 AM, Jamie Cockrill

>>>>>>>>>>> >> <[email protected]> wrote:

>>>>>>>>>>> >>> Hi Todd & JD,

>>>>>>>>>>> >>>

>>>>>>>>>>> >>> Environment:

>>>>>>>>>>> >>> All (hadoop and HBase) installed as of karmic-cdh3, which means:

>>>>>>>>>>> >>> Hadoop 0.20.2+228

>>>>>>>>>>> >>> HBase 0.89.20100621+17

>>>>>>>>>>> >>> Zookeeper 3.3.1+7

>>>>>>>>>>> >>>

>>>>>>>>>>> >>> Unfortunately my whole cluster of regionservers have now crashed, so I

>>>>>>>>>>> >>> can't really say if it was swapping too much. There is a DEBUG

>>>>>>>>>>> >>> statement just before it crashes saying:

>>>>>>>>>>> >>>

>>>>>>>>>>> >>> org.apache.hadoop.hbase.regionserver.wal.HLog: closing hlog writer in

>>>>>>>>>>> >>> hdfs://<somewhere on my HDFS, in /hbase>

>>>>>>>>>>> >>>

>>>>>>>>>>> >>> What follows is:

>>>>>>>>>>> >>>

>>>>>>>>>>> >>> WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception:

>>>>>>>>>>> >>> org.apache.hadoop.ipc.RemoteException:

>>>>>>>>>>> >>> org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease

>>>>>>>>>>> >>> on <file location as above> File does not exist. Holder

>>>>>>>>>>> >>> DFSClient_-11113603 does not have any open files

>>>>>>>>>>> >>>

>>>>>>>>>>> >>> It then seems to try and do some error recovery (Error Recovery for

>>>>>>>>>>> >>> block null bad datanode[0] nodes == null), fails (Could not get block

>>>>>>>>>>> >>> locations. Source file "<hbase file as before>" - Aborting). There is

>>>>>>>>>>> >>> then an ERROR org.apache...HRegionServer: Close and delete failed.

>>>>>>>>>>> >>> There is then a similar LeaseExpiredException as above.

>>>>>>>>>>> >>>

>>>>>>>>>>> >>> There are then a couple of messages from HRegionServer saying that

>>>>>>>>>>> >>> it's notifying master of its shutdown and stopping itself. The

>>>>>>>>>>> >>> shutdown hook then fires and the RemoteException and

>>>>>>>>>>> >>> LeaseExpiredExceptions are printed again.

>>>>>>>>>>> >>>

>>>>>>>>>>> >>> ulimit is set to 65000 (it's in the regionserver log, printed as I

>>>>>>>>>>> >>> restarted the regionserver), however I haven't got the xcievers set

>>>>>>>>>>> >>> anywhere. I'll give that a go. It does seem very odd as I did have a

>>>>>>>>>>> >>> few of them fall over one at a time with a few early loads, but that

>>>>>>>>>>> >>> seemed to be because the regions weren't splitting properly, so all

>>>>>>>>>>> >>> the traffic was going to one node and it was being overwhelmed. Once I

>>>>>>>>>>> >>> throttled it, after one load a region split seemed to get

>>>>>>>>>>> >>> triggered, which flung regions all over, which made subsequent loads

>>>>>>>>>>> >>> much more distributed. However, perhaps the time-bomb was ticking...

>>>>>>>>>>> >>> I'll have a go at specifying the xcievers property. I'm pretty

>>>>>>>>>>> >>> certain I've got everything else covered, except the patches as

>>>>>>>>>>> >>> referenced in the JIRA.

>>>>>>>>>>> >>>

>>>>>>>>>>> >>> I just grepped some of the log files and didn't get an explicit

>>>>>>>>>>> >>> exception with 'xciever' in it.

>>>>>>>>>>> >>>

>>>>>>>>>>> >>> I am considering downgrading(?) to 0.20.5, however because everything

>>>>>>>>>>> >>> is installed as per karmic-cdh3, I'm a bit reluctant to do so as

>>>>>>>>>>> >>> presumably Cloudera has tested each of these versions against each

>>>>>>>>>>> >>> other? And I don't really want to introduce further versioning issues.

>>>>>>>>>>> >>>

>>>>>>>>>>> >>> Thanks,

>>>>>>>>>>> >>>

>>>>>>>>>>> >>> Jamie

>>>>>>>>>>> >>>

>>>>>>>>>>> >>>

>>>>>>>>>>> >>> On 7 July 2010 17:30, Jean-Daniel Cryans <[email protected]> wrote:

>>>>>>>>>>> >>>> Jamie,

>>>>>>>>>>> >>>>

>>>>>>>>>>> >>>> Does your configuration meet the requirements?

>>>>>>>>>>> >>>>

>>>>>>>>>>> >>>> http://hbase.apache.org/docs/r0.20.5/api/overview-summary.html#requirements

>>>>>>>>>>> >>>>

>>>>>>>>>>> >>>> ulimit and xcievers, if not set, are usually time bombs that blow off when

>>>>>>>>>>> >>>> the cluster is under load.

>>>>>>>>>>> >>>>

>>>>>>>>>>> >>>> J-D

>>>>>>>>>>> >>>>

>>>>>>>>>>> >>>> On Wed, Jul 7, 2010 at 9:11 AM, Jamie Cockrill <[email protected]> wrote:

>>>>>>>>>>> >>>>

>>>>>>>>>>> >>>>> Dear all,

>>>>>>>>>>> >>>>>

>>>>>>>>>>> >>>>> My current HBase/Hadoop architecture has HBase region servers on the

>>>>>>>>>>> >>>>> same physical boxes as the HDFS data-nodes. I'm getting an awful lot

>>>>>>>>>>> >>>>> of region server crashes. The last thing that happens appears to be a

>>>>>>>>>>> >>>>> DroppedSnapshotException, caused by an IOException: could not

>>>>>>>>>>> >>>>> complete write to file <file on HDFS>. I am running it under load;

>>>>>>>>>>> >>>>> how heavy that is, or how it is quantified, I'm not sure, but I'm

>>>>>>>>>>> >>>>> guessing it is a load issue.

>>>>>>>>>>> >>>>>

>>>>>>>>>>> >>>>> Is it common practice to put region servers on data-nodes? Is it

>>>>>>>>>>> >>>>> common to see region server crashes when either the HDFS or region

>>>>>>>>>>> >>>>> server (or both) is under heavy load? I'm guessing that is the case as

>>>>>>>>>>> >>>>> I've seen a few similar posts. I've not got a great deal of capacity

>>>>>>>>>>> >>>>> to be separating region servers from HDFS data nodes, but it might be

>>>>>>>>>>> >>>>> an argument I could make.

>>>>>>>>>>> >>>>>

>>>>>>>>>>> >>>>> Thanks

>>>>>>>>>>> >>>>>

>>>>>>>>>>> >>>>> Jamie

>>>>>>>>>>> >>>>>

>>>>>>>>>>> >>>>

>>>>>>>>>>> >>>

>>>>>>>>>>> >>

>>>>>>>>>>> >

>>>>>>>>>>>

>>>>>>>>>>

>>>>>>>>>>

>>>>>>>>>>

>>>>>>>>>> --

>>>>>>>>>> Todd Lipcon

>>>>>>>>>> Software Engineer, Cloudera

>>>>>>>>>>

>>>>>>>>>

>>>>>>>>

>>>>>>>

>>>>>>

>>>>>

>>>>

>>>

>>

>
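
For reference, the task-memory arithmetic quoted in this thread (8 maps + 8 reduces at an Xmx of 512mb, later 4 + 4 at 256mb) can be sketched as below. The function name is made up for illustration, and the result is only a lower bound on real usage, since a JVM needs memory beyond its maximum heap and the datanode and region server are not counted.

```python
def task_heap_mb(map_slots: int, reduce_slots: int, xmx_mb: int) -> int:
    """Worst-case task heap (MB) on one node when every slot is busy.

    Hypothetical helper: counts only the configured -Xmx per task JVM,
    not JVM overhead, the datanode, or the HBase region server.
    """
    return (map_slots + reduce_slots) * xmx_mb


# Settings described in the thread, before and after the change.
before_mb = task_heap_mb(map_slots=8, reduce_slots=8, xmx_mb=512)
after_mb = task_heap_mb(map_slots=4, reduce_slots=4, xmx_mb=256)

print(f"before: {before_mb} MB ({before_mb / 1024:.0f} GB)")
print(f"after:  {after_mb} MB ({after_mb / 1024:.0f} GB)")
```

On the 8GB boxes described above, the first configuration alone can claim all physical RAM, which is consistent with the swapping seen in the vmstat output.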
