Thanks for updating the list, Jack. I added a note to our 'book' on nproc and referenced your email below (will push the changes to the website later). Good stuff, St.Ack
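For the archive, a minimal sketch of checking and applying the nproc fix Jack describes below. The 'hadoop' user and the 32000 value come from his email; the limits.conf lines use the standard pam_limits format, and a fresh login session is needed for them to take effect.

```shell
# Show the current max-user-processes (nproc) ceiling for this shell.
# Native JVM threads count against it, so a 1024 default is easy to
# exhaust under compaction or heavy query load.
ulimit -u

# The fix from the thread: raise the ceiling for the 'hadoop' user in
# /etc/security/limits.conf (takes effect at the next login via
# pam_limits):
#
#   hadoop  soft  nproc  32000
#   hadoop  hard  nproc  32000
```

Note this is a per-user limit, distinct from the per-process open-file limit reported by ulimit -n.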
On Wed, Mar 30, 2011 at 7:31 PM, Jack Levin <[email protected]> wrote:
> Thanks to everyone chiming in to help me fix this issue... It has now
> been resolved. JD and I spent some time looking at thread limits and,
> apparently, our userid 'hadoop' had its nproc limit set to the default
> of 1024. This of course caused us to run out of threads every time we
> were under load (like compaction, a high number of queries, or an RS
> restart). It was addressed in /etc/security/limits.conf, where we set
> "hadoop soft/hard nproc 32000". Please note that this is not the same
> as ulimit -n, and neither is it xcievers, nor handlers, nor anything
> like that. The user "root" does not run into this problem, but anyone
> installing stock HADOOP/HDFS from Cloudera will likely be running
> datanodes as user hadoop, and will hit that problem unless the above
> settings are adjusted.
>
> I ran a Java thread tester that simply creates a bunch of threads and
> tells you when you are at the limit. Here are the results before and
> after:
>
> Thread no. 800 started.
> Creating thread 900 (95ms)
> Thread no. 900 started.
> Error thrown when creating thread 917
> java.lang.OutOfMemoryError: unable to create new native thread
>         at java.lang.Thread.start0(Native Method)
>         at java.lang.Thread.start(Thread.java:614)
>         at CreateThreads.main(CreateThreads.java:42)
>
> after:
>
> Creating thread 32100 (36913ms)
> Thread no. 32100 started.
> Creating thread 32200 (37196ms)
> Thread no. 32200 started.
> Error thrown when creating thread 32207
> java.lang.OutOfMemoryError: unable to create new native thread
>
> You can see the difference.
> Now I can sleep a little better :)
>
> -Jack
>
> On Sat, Mar 12, 2011 at 3:31 AM, Suraj Varma <[email protected]> wrote:
>>>> to: java.lang.OutOfMemoryError: unable to create new native thread
>>
>> This indicates that you are oversubscribed on your RAM to the extent
>> that the JVM doesn't have any space to create native threads (which
>> are allocated outside of the JVM heap).
>>
>> You may actually have to _reduce_ your heap sizes to allow more space
>> for native threads (do an inventory of all the JVM heaps and don't
>> let it go over about 75% of available RAM).
>> Another option is to use the -Xss stack size JVM arg to reduce the
>> per-thread stack size - set it to 512k or 256k (you may have to
>> experiment/perf test a bit to see what the optimum size is).
>> Or ... get more RAM ...
>>
>> --Suraj
>>
>> On Fri, Mar 11, 2011 at 8:11 PM, Jack Levin <[email protected]> wrote:
>>> I am noticing the following errors also:
>>>
>>> 2011-03-11 17:52:00,376 ERROR
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
>>> 10.103.7.3:50010, storageID=DS-824332190-10.103.7.3-50010-1290043658438,
>>> infoPort=50075, ipcPort=50020):DataXceiveServer: Exiting due
>>> to: java.lang.OutOfMemoryError: unable to create new native thread
>>>         at java.lang.Thread.start0(Native Method)
>>>         at java.lang.Thread.start(Thread.java:597)
>>>         at org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:132)
>>>         at java.lang.Thread.run(Thread.java:619)
>>>
>>> and this:
>>>
>>> nf_conntrack: table full, dropping packet.
>>> nf_conntrack: table full, dropping packet.
>>> nf_conntrack: table full, dropping packet.
>>> nf_conntrack: table full, dropping packet.
>>> nf_conntrack: table full, dropping packet.
>>> nf_conntrack: table full, dropping packet.
>>> net_ratelimit: 10 callbacks suppressed
>>> nf_conntrack: table full, dropping packet.
>>> possible SYN flooding on port 9090. Sending cookies.
>>>
>>> This seems like a network stack issue?
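On the "nf_conntrack: table full" messages above: the kernel's connection-tracking table is overflowing, so packets are dropped before they ever reach the datanode. One common mitigation, independent of the nproc issue, is raising the table size. A sketch of a sysctl.conf fragment; the 262144 value is illustrative rather than from the thread, and on older kernels the key is net.ipv4.netfilter.ip_conntrack_max:

```
# /etc/sysctl.conf -- illustrative value, tune to available RAM;
# check the live count with: sysctl net.netfilter.nf_conntrack_count
net.netfilter.nf_conntrack_max = 262144
```

Apply with sysctl -p. Each conntrack entry costs kernel memory, so the ceiling should be sized against RAM rather than set arbitrarily high.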
>>>
>>> So, does the datanode need a higher heap than 1GB? Or is it possible
>>> we ran out of RAM for other reasons?
>>>
>>> -Jack
>>>
>>> On Thu, Mar 10, 2011 at 1:29 PM, Ryan Rawson <[email protected]> wrote:
>>>> Looks like a datanode went down. InterruptedException is how Java
>>>> interrupts IO in threads; it's similar to the EINTR errno. That
>>>> means the actual source of the abort is higher up...
>>>>
>>>> So back to how InterruptedException works... at some point a thread
>>>> in the JVM decides that the VM should abort. So it calls
>>>> thread.interrupt() on all the threads it knows/cares about to
>>>> interrupt their IO. That is what you are seeing in the logs. The
>>>> root cause lies above, I think.
>>>>
>>>> Look for the first "Exception" string or any FATAL or ERROR strings
>>>> in the datanode logfiles.
>>>>
>>>> -ryan
>>>>
>>>> On Thu, Mar 10, 2011 at 1:03 PM, Jack Levin <[email protected]> wrote:
>>>>> http://pastebin.com/ZmsyvcVc Here is the regionserver log; they
>>>>> all have similar stuff.
>>>>>
>>>>> On Thu, Mar 10, 2011 at 11:34 AM, Stack <[email protected]> wrote:
>>>>>> What's in the regionserver logs? Please put up regionserver and
>>>>>> datanode excerpts.
>>>>>> Thanks Jack,
>>>>>> St.Ack
>>>>>>
>>>>>> On Thu, Mar 10, 2011 at 10:31 AM, Jack Levin <[email protected]> wrote:
>>>>>>> All was well, until this happened:
>>>>>>>
>>>>>>> http://pastebin.com/iM1niwrS
>>>>>>>
>>>>>>> and all regionservers went down. Is this the xciever issue?
>>>>>>>
>>>>>>> <property>
>>>>>>>   <name>dfs.datanode.max.xcievers</name>
>>>>>>>   <value>12047</value>
>>>>>>> </property>
>>>>>>>
>>>>>>> this is what I have; should I set it higher?
>>>>>>>
>>>>>>> -Jack
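To put rough numbers on Suraj's point earlier in the thread that native thread stacks live outside the JVM heap: the thread ceiling (before nproc or other limits kick in) is approximately (RAM - total JVM heap) / per-thread stack size. A back-of-the-envelope sketch with illustrative values only (8 GB RAM, 6 GB of heaps, 1 MB default -Xss on many 64-bit HotSpot JVMs; none of these figures come from the thread):

```shell
ram_mb=8192      # total machine RAM (illustrative)
heaps_mb=6144    # sum of all JVM heaps on the box (illustrative)
stack_kb=1024    # default -Xss on many 64-bit HotSpot JVMs

# Approximate native-thread headroom left for the JVMs:
echo "approx thread headroom: $(( (ram_mb - heaps_mb) * 1024 / stack_kb ))"
# -> approx thread headroom: 2048

# Halving the stack size with -Xss512k roughly doubles the headroom:
stack_kb=512
echo "with -Xss512k: $(( (ram_mb - heaps_mb) * 1024 / stack_kb ))"
# -> with -Xss512k: 4096
```

This is only an estimate, since the OS, page cache, and other off-heap allocations also consume RAM, but it shows why shrinking the heaps or -Xss frees room for threads.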
