Thanks for updating the list, Jack. I added a note to our 'book' on nproc and referenced your email below (will push the changes to the website later). Good stuff, St.Ack
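For the archive, a minimal sketch of checking and applying the nproc fix Jack describes below. The 'hadoop' user and the 32000 value come from his email; the limits.conf lines use the standard pam_limits format, and a fresh login session is needed for them to take effect.

```shell
# Show the current max-user-processes (nproc) ceiling for this shell.
# Native JVM threads count against it, so a 1024 default is easy to
# exhaust under compaction or heavy query load.
ulimit -u

# The fix from the thread: raise the ceiling for the 'hadoop' user in
# /etc/security/limits.conf (takes effect at the next login via
# pam_limits):
#
#   hadoop  soft  nproc  32000
#   hadoop  hard  nproc  32000
```

Note this is a per-user limit, distinct from the per-process open-file limit reported by ulimit -n.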
On Wed, Mar 30, 2011 at 7:31 PM, Jack Levin <[email protected]> wrote:
> Thanks to everyone chiming in to help me fix this issue... It has now
> been resolved. JD and I spent some time looking at thread limits and,
> apparently, our userid 'hadoop' had its nproc limit set to the default
> of 1024. This of course caused us to run out of threads every time we
> were under load (like compaction, a high number of queries, or an RS
> restart). It was addressed in /etc/security/limits.conf, where we set
> "hadoop soft/hard nproc 32000". Please note that this is not the same
> as ulimit -n, and neither is it xcievers, nor handlers, nor anything
> like that. The user "root" does not run into this problem, but anyone
> installing stock HADOOP/HDFS from Cloudera will likely be running
> datanodes as user hadoop, and will hit that problem unless the above
> settings are adjusted.
>
> I ran a Java thread tester that simply creates a bunch of threads and
> tells you when you are at the limit. Here are the results before and
> after:
>
> Thread no. 800 started.
> Creating thread 900 (95ms)
> Thread no. 900 started.
> Error thrown when creating thread 917
> java.lang.OutOfMemoryError: unable to create new native thread
>         at java.lang.Thread.start0(Native Method)
>         at java.lang.Thread.start(Thread.java:614)
>         at CreateThreads.main(CreateThreads.java:42)
>
> after:
>
> Creating thread 32100 (36913ms)
> Thread no. 32100 started.
> Creating thread 32200 (37196ms)
> Thread no. 32200 started.
> Error thrown when creating thread 32207
> java.lang.OutOfMemoryError: unable to create new native thread
>
> You can see the difference.
> Now I can sleep a little better :)
>
> -Jack
>
> On Sat, Mar 12, 2011 at 3:31 AM, Suraj Varma <[email protected]> wrote:
>>>> to: java.lang.OutOfMemoryError: unable to create new native thread
>>
>> This indicates that you are oversubscribed on your RAM to the extent
>> that the JVM doesn't have any space to create native threads (which
>> are allocated outside of the JVM heap).
>>
>> You may actually have to _reduce_ your heap sizes to allow more space
>> for native threads (do an inventory of all the JVM heaps and don't
>> let it go over about 75% of available RAM).
>> Another option is to use the -Xss stack size JVM arg to reduce the
>> per-thread stack size - set it to 512k or 256k (you may have to
>> experiment/perf test a bit to see what the optimum size is).
>> Or ... get more RAM ...
>>
>> --Suraj
>>
>> On Fri, Mar 11, 2011 at 8:11 PM, Jack Levin <[email protected]> wrote:
>>> I am noticing the following errors also:
>>>
>>> 2011-03-11 17:52:00,376 ERROR
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
>>> 10.103.7.3:50010, storageID=DS-824332190-10.103.7.3-50010-1290043658438,
>>> infoPort=50075, ipcPort=50020):DataXceiveServer: Exiting due
>>> to: java.lang.OutOfMemoryError: unable to create new native thread
>>>         at java.lang.Thread.start0(Native Method)
>>>         at java.lang.Thread.start(Thread.java:597)
>>>         at org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:132)
>>>         at java.lang.Thread.run(Thread.java:619)
>>>
>>> and this:
>>>
>>> nf_conntrack: table full, dropping packet.
>>> nf_conntrack: table full, dropping packet.
>>> nf_conntrack: table full, dropping packet.
>>> nf_conntrack: table full, dropping packet.
>>> nf_conntrack: table full, dropping packet.
>>> nf_conntrack: table full, dropping packet.
>>> net_ratelimit: 10 callbacks suppressed
>>> nf_conntrack: table full, dropping packet.
>>> possible SYN flooding on port 9090. Sending cookies.
>>>
>>> This seems like a network stack issue?
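On the "nf_conntrack: table full" messages above: the kernel's connection-tracking table is overflowing, so packets are dropped before they ever reach the datanode. One common mitigation, independent of the nproc issue, is raising the table size. A sketch of a sysctl.conf fragment; the 262144 value is illustrative rather than from the thread, and on older kernels the key is net.ipv4.netfilter.ip_conntrack_max:

```
# /etc/sysctl.conf -- illustrative value, tune to available RAM;
# check the live count with: sysctl net.netfilter.nf_conntrack_count
net.netfilter.nf_conntrack_max = 262144
```

Apply with sysctl -p. Each conntrack entry costs kernel memory, so the ceiling should be sized against RAM rather than set arbitrarily high.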
>>>
>>> So, does the datanode need a higher heap than 1GB? Or is it possible
>>> we ran out of RAM for other reasons?
>>>
>>> -Jack
>>>
>>> On Thu, Mar 10, 2011 at 1:29 PM, Ryan Rawson <[email protected]> wrote:
>>>> Looks like a datanode went down. InterruptedException is how Java
>>>> interrupts IO in threads; it's similar to the EINTR errno. That
>>>> means the actual source of the abort is higher up...
>>>>
>>>> So back to how InterruptedException works... at some point a thread
>>>> in the JVM decides that the VM should abort. So it calls
>>>> thread.interrupt() on all the threads it knows/cares about to
>>>> interrupt their IO. That is what you are seeing in the logs. The
>>>> root cause lies above, I think.
>>>>
>>>> Look for the first "Exception" string or any FATAL or ERROR strings
>>>> in the datanode logfiles.
>>>>
>>>> -ryan
>>>>
>>>> On Thu, Mar 10, 2011 at 1:03 PM, Jack Levin <[email protected]> wrote:
>>>>> http://pastebin.com/ZmsyvcVc Here is the regionserver log; they
>>>>> all have similar stuff.
>>>>>
>>>>> On Thu, Mar 10, 2011 at 11:34 AM, Stack <[email protected]> wrote:
>>>>>> What's in the regionserver logs? Please put up regionserver and
>>>>>> datanode excerpts.
>>>>>> Thanks Jack,
>>>>>> St.Ack
>>>>>>
>>>>>> On Thu, Mar 10, 2011 at 10:31 AM, Jack Levin <[email protected]> wrote:
>>>>>>> All was well, until this happened:
>>>>>>>
>>>>>>> http://pastebin.com/iM1niwrS
>>>>>>>
>>>>>>> and all regionservers went down. Is this the xciever issue?
>>>>>>>
>>>>>>> <property>
>>>>>>>   <name>dfs.datanode.max.xcievers</name>
>>>>>>>   <value>12047</value>
>>>>>>> </property>
>>>>>>>
>>>>>>> this is what I have; should I set it higher?
>>>>>>>
>>>>>>> -Jack
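To put rough numbers on Suraj's point earlier in the thread that native thread stacks live outside the JVM heap: the thread ceiling (before nproc or other limits kick in) is approximately (RAM - total JVM heap) / per-thread stack size. A back-of-the-envelope sketch with illustrative values only (8 GB RAM, 6 GB of heaps, 1 MB default -Xss on many 64-bit HotSpot JVMs; none of these figures come from the thread):

```shell
ram_mb=8192      # total machine RAM (illustrative)
heaps_mb=6144    # sum of all JVM heaps on the box (illustrative)
stack_kb=1024    # default -Xss on many 64-bit HotSpot JVMs

# Approximate native-thread headroom left for the JVMs:
echo "approx thread headroom: $(( (ram_mb - heaps_mb) * 1024 / stack_kb ))"
# -> approx thread headroom: 2048

# Halving the stack size with -Xss512k roughly doubles the headroom:
stack_kb=512
echo "with -Xss512k: $(( (ram_mb - heaps_mb) * 1024 / stack_kb ))"
# -> with -Xss512k: 4096
```

This is only an estimate, since the OS, page cache, and other off-heap allocations also consume RAM, but it shows why shrinking the heaps or -Xss frees room for threads.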
