Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread
Tejas, I have a total of 364k files fetched in my last crawl, and I used a topN of 2000 with 2 threads per queue. The gap I have noticed is between 5 and 8 minutes. I had a total of 180 rounds in my crawl (I had some big crawls at the beginning with a topN of 10k, but after it crashed I changed topN to 2k). Given my hardware limitations and local mode, I think using smaller rounds saved me quite some time. The downside is having a lot more segments to go through, but I am writing scripts to automate the index and reparse tasks.

On Mon, Mar 4, 2013 at 11:18 PM, Tejas Patil tejas.patil...@gmail.com wrote:
> Hi Kiran,
> Is the 6 mins consistent across those 5 rounds?
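For reference, a minimal sketch of such a reparse-and-index script, assuming a standard Nutch 1.x layout (crawl/crawldb, crawl/linkdb, crawl/segments/*) and a placeholder Solr URL; adjust both for your setup.

  #!/bin/bash
  # Sketch: reparse and index every segment of a finished crawl.
  SOLR_URL="http://localhost:8983/solr"   # placeholder, not from this thread

  for seg in crawl/segments/*; do
    # ParseSegment fails if parse output already exists, so clear it first.
    rm -rf "$seg/crawl_parse" "$seg/parse_data" "$seg/parse_text"
    bin/nutch parse "$seg" || exit 1
    # Index this segment into Solr (Nutch 1.6 solrindex syntax).
    bin/nutch solrindex "$SOLR_URL" crawl/crawldb -linkdb crawl/linkdb "$seg" || exit 1
  done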
Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread
After all documents are fetched (and possibly parsed), the segment has to be written: the sorting of the data is finished and it is copied from the local temp dir (hadoop.tmp.dir) to the segment directory. If IO is a bottleneck, this may take a while. It also looks like you have a lot of content!

On 03/04/2013 06:03 AM, kiran chitturi wrote:
> The big crawl is fetching a large amount of big PDF files. For something
> like below, the fetcher took a lot of time to finish up, even though the
> files are fetched.
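If the gap really is IO-bound segment writing, one knob is where hadoop.tmp.dir points. A hedged example: the -D override applies to Nutch tools that run through Hadoop's ToolRunner (an assumption to verify for your version), and the path and segment name are placeholders.

  # Point the local temp dir at a faster disk for one fetch run.
  bin/nutch fetch -Dhadoop.tmp.dir=/fast-disk/hadoop-tmp crawl/segments/20130304120000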
Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread
Thanks, Sebastian, for the details. This was the bottleneck I had when I was fetching 10k files. Now I switched to 2k and I have a 6-minute gap. It took me some time to find the right configuration in local mode.

On Mon, Mar 4, 2013 at 3:33 PM, Sebastian Nagel wastl.na...@googlemail.com wrote:
> After all documents are fetched (and possibly parsed), the segment has to
> be written: the sorting of the data is finished and it is copied from the
> local temp dir (hadoop.tmp.dir) to the segment directory.
Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread
Hi Kiran,

Is the 6 mins consistent across those 5 rounds? With 10k files it takes ~60 minutes for writing segments. With 2k files, there is a 6-minute gap. You will need 5 such small rounds to get 10k in total, so the total gap time would be (5 * 6) = 30 mins. That's half of the time taken by the 10k crawl! So, in a way, you saved 30 minutes by running small crawls. The math does seem to check out here.

Thanks,
Tejas Patil

On Mon, Mar 4, 2013 at 12:45 PM, kiran chitturi chitturikira...@gmail.com wrote:
> Thanks, Sebastian, for the details. This was the bottleneck I had when I
> was fetching 10k files. Now I switched to 2k and I have a 6-minute gap.
Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread
Kiran, were you able to resolve this issue? I am getting the same error when fetching a huge number of URLs.

-Neeraj.

--
View this message in context: http://lucene.472066.n3.nabble.com/Nutch-1-6-java-lang-OutOfMemoryError-unable-to-create-new-native-thread-tp4044231p4044398.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread
Hi Kiran,

there are many possible reasons for the problem. Besides the limits on the number of processes, there is the stack size in the Java VM and in the system (see java -Xss and ulimit -s). I think in local mode there should be only one mapper and consequently only one thread spent for parsing. So the number of processes/threads is hardly the problem, provided that you don't run any other number-crunching tasks in parallel on your desktop.

Luckily, you should be able to retry via
  bin/nutch parse ...
Then trace the system and the Java process to catch the reason.

Sebastian

On 03/02/2013 08:13 PM, kiran chitturi wrote:
> Sorry, I am looking to crawl 400k documents with the crawl. I said 400 in
> my last message.
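For reference, the limits mentioned above can be inspected and tuned from the shell. A sketch: the sysctl names are Mac OS specific and vary by version, the segment name is a placeholder, and the -Xss value is only an example.

  ulimit -u          # max user processes (threads count against this)
  ulimit -s          # per-thread stack size in KB
  sysctl kern.maxproc kern.maxprocperuid   # Mac OS process ceilings

  # A smaller JVM thread stack leaves room for more threads in the same
  # address space. NUTCH_OPTS is passed to the JVM by bin/nutch.
  NUTCH_OPTS="-Xss256k" bin/nutch parse crawl/segments/20130301000000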
Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread
Thanks, Sebastian, for the suggestions. I got around this by using a lower value for topN (2000) instead of 10000. I decided to use a lower topN with more rounds.

On Sun, Mar 3, 2013 at 3:41 PM, Sebastian Nagel wastl.na...@googlemail.com wrote:
> there are many possible reasons for the problem. Besides the limits on the
> number of processes, there is the stack size in the Java VM and in the
> system (see java -Xss and ulimit -s).
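The lower-topN-with-more-rounds approach is just the standard generate/fetch/parse/updatedb cycle in a loop. A minimal sketch, with paths and round count as assumptions (200 rounds of topN=2000 matches the 400k-document figure in this thread):

  #!/bin/bash
  ROUNDS=200
  TOPN=2000
  for ((i = 1; i <= ROUNDS; i++)); do
    bin/nutch generate crawl/crawldb crawl/segments -topN "$TOPN" || break
    seg=$(ls -d crawl/segments/* | tail -1)   # the segment just generated
    bin/nutch fetch "$seg" || break
    bin/nutch parse "$seg" || break
    bin/nutch updatedb crawl/crawldb "$seg" || break
  done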
Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread
> using a lower value for topN (2000) instead of 10000

That would mean: you need 200 rounds and also 200 segments for 400k documents. That's a work-around, not a solution! If you find the time, you should trace the process. It seems to be either a misconfiguration or even a bug.

Sebastian

On 03/03/2013 09:45 PM, kiran chitturi wrote:
> Thanks, Sebastian, for the suggestions. I got around this by using a lower
> value for topN (2000). I decided to use a lower topN with more rounds.
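One concrete way to trace the parse step, as suggested: watch the JVM's live thread count while the job runs, using standard JDK tools (the segment name is a placeholder).

  bin/nutch parse crawl/segments/20130301000000 &
  sleep 5
  PID=$(jps | awk '/ParseSegment/ {print $1}')   # PID of the parse JVM
  while kill -0 "$PID" 2>/dev/null; do
    # A steadily climbing count points at leaked parser threads.
    jstack "$PID" | grep -c 'java.lang.Thread.State'
    sleep 10
  done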
Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread
> If you find the time you should trace the process. Seems to be either a
> misconfiguration or even a bug.

I will try to track this down soon with the previous configuration. Right now, I am just trying to get the data crawled by Monday.

Kiran.
Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread
I agree with Sebastian. It was a crawl in local mode and not over a cluster. The intended crawl volume is huge, and if we don't override the default heap size to some decent value, there is a high possibility of facing an OOM.

On Sun, Mar 3, 2013 at 1:04 PM, kiran chitturi chitturikira...@gmail.com wrote:
> I will try to track this down soon with the previous configuration. Right
> now, I am just trying to get the data crawled by Monday.
RE: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread
The default heap size of 1G is just enough for a parsing fetcher with 10 threads. The only problem that may arise is too large and complicated PDF files or very large HTML files. If you generate fetch lists of a reasonable size, there won't be a problem most of the time. And if you want to crawl a lot, then just generate more small segments. If there is a bug, it's most likely the parser eating memory and not releasing it.

-Original message-
From: Tejas Patil tejas.patil...@gmail.com
Sent: Sun 03-Mar-2013 22:19
To: user@nutch.apache.org
Subject: Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread

> I agree with Sebastian. It was a crawl in local mode and not over a
> cluster. The intended crawl volume is huge, and if we don't override the
> default heap size to some decent value, there is a high possibility of
> facing an OOM.
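Acting on this advice in Nutch 1.x is two small steps; NUTCH_HEAPSIZE (in MB) is honored by bin/nutch for local-mode jobs, and the numbers below are examples rather than recommendations.

  export NUTCH_HEAPSIZE=2000   # ~2 GB heap instead of the 1 GB default
  # Keep each segment modest instead of one huge fetch list:
  bin/nutch generate crawl/crawldb crawl/segments -topN 2000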
Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread
Thanks for your suggestions, guys!

The big crawl is fetching a large amount of big PDF files. For something like the below, the fetcher took a lot of time to finish up, even though the files are fetched. It shows more than one hour of time:

2013-03-01 19:45:43,217 INFO fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
2013-03-01 19:45:43,217 INFO fetcher.Fetcher - -activeThreads=0
2013-03-01 20:57:55,288 INFO fetcher.Fetcher - Fetcher: finished at 2013-03-01 20:57:55, elapsed: 01:34:09

Does fetching a lot of files cause this issue? Should I stick to one thread in local mode, or use pseudo-distributed mode to improve performance? What is an acceptable time for the fetcher to finish up after the files are fetched? What exactly happens in this step?

Thanks again!
Kiran.

On Sun, Mar 3, 2013 at 4:55 PM, Markus Jelsma markus.jel...@openindex.io wrote:
> The default heap size of 1G is just enough for a parsing fetcher with 10
> threads. The only problem that may arise is too large and complicated PDF
> files or very large HTML files.
Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread
Sorry, I am looking to crawl 400k documents with the crawl. I said 400 in my last message.

On Sat, Mar 2, 2013 at 2:12 PM, kiran chitturi chitturikira...@gmail.com wrote:

> Hi!
>
> I am running Nutch 1.6 on a 4 GB Mac OS desktop with a Core i5 2.8 GHz.
> Last night I started a crawl in local mode for 5 seeds with the config
> given below. If the crawl goes well, it should fetch a total of 400
> documents. The crawling is done on a single host that we own.
>
> Config -
>   fetcher.threads.per.queue - 2
>   fetcher.server.delay - 1
>   fetcher.throughput.threshold.pages - -1
>
> crawl script settings
>   timeLimitFetch - 30
>   numThreads - 5
>   topN - 10000
>   mapred.child.java.opts=-Xmx1000m
>
> I have noticed today that the crawl has stopped due to an error, and I
> have found the below error in the logs.
>
> 2013-03-01 21:45:03,767 INFO parse.ParseSegment - Parsed (0ms): http://scholar.lib.vt.edu/ejournals/JARS/v33n3/v33n3-letcher.htm
> 2013-03-01 21:45:03,790 WARN mapred.LocalJobRunner - job_local_0001
> java.lang.OutOfMemoryError: unable to create new native thread
>         at java.lang.Thread.start0(Native Method)
>         at java.lang.Thread.start(Thread.java:658)
>         at java.util.concurrent.ThreadPoolExecutor.addThread(ThreadPoolExecutor.java:681)
>         at java.util.concurrent.ThreadPoolExecutor.addIfUnderMaximumPoolSize(ThreadPoolExecutor.java:727)
>         at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:655)
>         at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:92)
>         at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:159)
>         at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:93)
>         at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97)
>         at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44)
>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
>
> Did anyone run into the same issue? I am not sure why the new native
> thread is not being created. The link here [0] says that it might be due
> to the limitation on the number of processes in my OS. Will increasing
> them solve the issue?
>
> [0] - http://ww2.cs.fsu.edu/~czhang/errors.html
>
> Thanks!
> --
> Kiran Chitturi

--
Kiran Chitturi
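On the closing question: the ceilings the linked page describes can be read, and raised within the hard limit, from the shell. A sketch; the sysctl names are Mac OS specific and vary by version, and the value is only an example.

  ulimit -u                               # per-shell process limit
  sysctl kern.maxproc kern.maxprocperuid  # system-wide / per-user ceilings
  ulimit -u 1024                          # raise the soft limit (example value)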