Thanks to everyone chiming in to help me fix this issue... It has now
been resolved. JD and I spent some time looking at thread limits, and
it turns out our userid 'hadoop' had the default nproc limit of 1024.
This, of course, caused us to run out of threads every time we were
under load (compaction, a high number of queries, or an RS restart).
We addressed it in /etc/security/limits.conf, where we set
"hadoop soft/hard nproc 32000". Please note that this is not the same
as ulimit -n, nor is it xcievers, handlers, or anything like that.
The user "root" does not run into this problem, but anyone installing
stock HADOOP/HDFS from Cloudera is likely running datanodes as user
hadoop, and will hit this problem unless the above setting is adjusted.
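For anyone hitting the same thing, here is a quick way to check the limit, plus the limits.conf lines we ended up with (paths and syntax assume a standard PAM setup; adjust for your distro):

```shell
# Show the per-user process/thread limit (nproc) for the current shell.
# On our boxes this printed the 1024 default for 'hadoop' before the fix.
ulimit -u

# The fix, added to /etc/security/limits.conf (takes effect on next login):
#   hadoop  soft  nproc  32000
#   hadoop  hard  nproc  32000
```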
I ran a Java thread tester that simply creates a bunch of threads and
tells you when you hit the limit. Here are the results before and
after:
Thread no. 800 started.
Creating thread 900 (95ms)
Thread no. 900 started.
Error thrown when creating thread 917
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:614)
at CreateThreads.main(CreateThreads.java:42)
after:
Creating thread 32100 (36913ms)
Thread no. 32100 started.
Creating thread 32200 (37196ms)
Thread no. 32200 started.
Error thrown when creating thread 32207
java.lang.OutOfMemoryError: unable to create new native thread
You can see the difference. Now I can sleep a little better :)
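The CreateThreads.java I used isn't pasted here, but a minimal sketch of such a tester might look like this (the optional cap argument is my addition so a run can be stopped early; the original presumably ran unbounded until the OutOfMemoryError):

```java
public class CreateThreads {
    public static void main(String[] args) {
        // Optional cap (first arg): stop after this many threads instead of
        // running until the OS refuses to create more.
        int max = args.length > 0 ? Integer.parseInt(args[0]) : Integer.MAX_VALUE;
        long start = System.currentTimeMillis();
        int n = 0;
        try {
            while (n < max) {
                n++;
                if (n % 100 == 0) {
                    System.out.println("Creating thread " + n + " ("
                            + (System.currentTimeMillis() - start) + "ms)");
                }
                Thread t = new Thread(() -> {
                    try {
                        // Park forever so the thread stays alive and keeps
                        // counting against the nproc limit.
                        Thread.sleep(Long.MAX_VALUE);
                    } catch (InterruptedException ignored) {
                    }
                });
                t.setDaemon(true); // daemon threads let the JVM exit when main returns
                t.start();
                if (n % 100 == 0) {
                    System.out.println("Thread no. " + n + " started.");
                }
            }
        } catch (OutOfMemoryError e) {
            // With a low nproc limit this fires long before max is reached.
            System.out.println("Error thrown when creating thread " + n);
            e.printStackTrace();
        }
    }
}
```

Run it with no argument to find the real limit, or e.g. `java CreateThreads 500` for a quick sanity check.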
-Jack
On Sat, Mar 12, 2011 at 3:31 AM, Suraj Varma <[email protected]> wrote:
>>> to:java.lang.OutOfMemoryError: unable to create new native thread
>
> This indicates that you are oversubscribed on your RAM to the extent
> that the JVM doesn't have any space to create native threads (which
> are allocated outside of the JVM heap.)
>
> You may actually have to _reduce_ your heap sizes to allow more space
> for native threads (do an inventory of all the JVM heaps and don't let
> it go over about 75% of available RAM.)
> Another option is to use the -Xss stack size JVM arg to reduce the per
> thread stack size - set it to 512k or 256k (you may have to
> experiment/perf test a bit to see what's the optimum size.)
> Or ... get more RAM ...
>
> --Suraj
>
> On Fri, Mar 11, 2011 at 8:11 PM, Jack Levin <[email protected]> wrote:
>> I am noticing following errors also:
>>
>> 2011-03-11 17:52:00,376 ERROR
>> org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
>> 10.103.7.3:50010, storageID=DS-824332190-10.103.7.3-50010-1290043658438,
>> infoPort=50075, ipcPort=50020):DataXceiveServer: Exiting due
>> to:java.lang.OutOfMemoryError: unable to create new native thread
>> at java.lang.Thread.start0(Native Method)
>> at java.lang.Thread.start(Thread.java:597)
>> at
>> org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:132)
>> at java.lang.Thread.run(Thread.java:619)
>>
>>
>> and this:
>>
>> nf_conntrack: table full, dropping packet.
>> nf_conntrack: table full, dropping packet.
>> nf_conntrack: table full, dropping packet.
>> nf_conntrack: table full, dropping packet.
>> nf_conntrack: table full, dropping packet.
>> nf_conntrack: table full, dropping packet.
>> net_ratelimit: 10 callbacks suppressed
>> nf_conntrack: table full, dropping packet.
>> possible SYN flooding on port 9090. Sending cookies.
>>
>> This seems like a network stack issue?
>>
>> So, does datanode need higher heap than 1GB? Or possible we ran out of RAM
>> for other reasons?
>>
>> -Jack
>>
>> On Thu, Mar 10, 2011 at 1:29 PM, Ryan Rawson <[email protected]> wrote:
>>
>>> Looks like a datanode went down. InterruptedException is how Java
>>> interrupts IO in threads; it's similar to the EINTR errno. That
>>> means the actual source of the abort is higher up...
>>>
>>> So back to how InterruptedException works... at some point a thread in
>>> the JVM decides that the VM should abort. So it calls
>>> thread.interrupt() on all the threads it knows/cares about to
>>> interrupt their IO. That is what you are seeing in the logs. The root
>>> cause lies above I think.
>>>
>>> Look for the first "Exception" string or any FATAL or ERROR strings in
>>> the datanode logfiles.
>>>
>>> -ryan
>>>
>>> On Thu, Mar 10, 2011 at 1:03 PM, Jack Levin <[email protected]> wrote:
>>> > http://pastebin.com/ZmsyvcVc Here is the regionserver log; they
>>> > all have similar stuff.
>>> >
>>> > On Thu, Mar 10, 2011 at 11:34 AM, Stack <[email protected]> wrote:
>>> >
>>> >> Whats in the regionserver logs? Please put up regionserver and
>>> >> datanode excerpts.
>>> >> Thanks Jack,
>>> >> St.Ack
>>> >>
>>> >> On Thu, Mar 10, 2011 at 10:31 AM, Jack Levin <[email protected]> wrote:
>>> >> > All was well, until this happen:
>>> >> >
>>> >> > http://pastebin.com/iM1niwrS
>>> >> >
>>> >> > and all regionservers went down, is this xciever issue?
>>> >> >
>>> >> > <property>
>>> >> > <name>dfs.datanode.max.xcievers</name>
>>> >> > <value>12047</value>
>>> >> > </property>
>>> >> >
>>> >> > this is what I have, should I set it higher?
>>> >> >
>>> >> > -Jack
>>> >> >
>>> >>
>>> >
>>>
>>
>