> using a lower value for topN (2000) instead of 10000

That would mean: you need 200 rounds and also 200 segments for 400k
documents. That's a work-around, not a solution!
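Before that, it is worth a quick check of the limits involved. A minimal
sketch, assuming bash on Mac OS (sysctl key names and JVM flag support
vary by OS and Java version):

    # per-user process cap; native threads count against it on Mac OS
    ulimit -u
    # per-thread stack size in KB (the JVM side is controlled via -Xss)
    ulimit -s
    # kernel-wide process limits on Mac OS
    sysctl kern.maxproc kern.maxprocperuid
    # effective JVM thread stack size
    java -XX:+PrintFlagsFinal -version | grep ThreadStackSize

If any of these are unusually low, that would explain why the executor
cannot spawn another parser thread.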
If you find the time, you should trace the process. It seems to be either
a misconfiguration or even a bug.

Sebastian

On 03/03/2013 09:45 PM, kiran chitturi wrote:
> Thanks Sebastian for the suggestions. I got around this by using a lower
> value for topN (2000) instead of 10000. I decided to use a lower value
> for topN with more rounds.
>
>
> On Sun, Mar 3, 2013 at 3:41 PM, Sebastian Nagel
> <wastl.na...@googlemail.com> wrote:
>
>> Hi Kiran,
>>
>> there are many possible reasons for the problem. Besides the limits on
>> the number of processes, check the stack size in the Java VM and in the
>> system (see java -Xss and ulimit -s).
>>
>> I think in local mode there should be only one mapper and consequently
>> only one thread spent for parsing. So the number of processes/threads
>> is hardly the problem, provided that you don't run any other
>> number-crunching tasks in parallel on your desktop.
>>
>> Luckily, you should be able to retry via "bin/nutch parse ..."
>> Then trace the system and the Java process to catch the reason.
>>
>> Sebastian
>>
>> On 03/02/2013 08:13 PM, kiran chitturi wrote:
>>> Sorry, I am looking to crawl 400k documents with the crawl. I said 400
>>> in my last message.
>>>
>>>
>>> On Sat, Mar 2, 2013 at 2:12 PM, kiran chitturi
>>> <chitturikira...@gmail.com> wrote:
>>>
>>>> Hi!
>>>>
>>>> I am running Nutch 1.6 on a 4 GB Mac OS desktop with a Core i5 2.8GHz.
>>>>
>>>> Last night I started a crawl in local mode for 5 seeds with the config
>>>> given below. If the crawl goes well, it should fetch a total of 400
>>>> documents. The crawling is done on a single host that we own.
>>>>
>>>> Config
>>>> ---------------------
>>>> fetcher.threads.per.queue - 2
>>>> fetcher.server.delay - 1
>>>> fetcher.throughput.threshold.pages - -1
>>>>
>>>> crawl script settings
>>>> ----------------------------
>>>> timeLimitFetch - 30
>>>> numThreads - 5
>>>> topN - 10000
>>>> mapred.child.java.opts=-Xmx1000m
>>>>
>>>> I have noticed today that the crawl stopped due to an error, and I
>>>> found the error below in the logs:
>>>>
>>>>> 2013-03-01 21:45:03,767 INFO parse.ParseSegment - Parsed (0ms):
>>>>> http://scholar.lib.vt.edu/ejournals/JARS/v33n3/v33n3-letcher.htm
>>>>> 2013-03-01 21:45:03,790 WARN mapred.LocalJobRunner - job_local_0001
>>>>> java.lang.OutOfMemoryError: unable to create new native thread
>>>>>   at java.lang.Thread.start0(Native Method)
>>>>>   at java.lang.Thread.start(Thread.java:658)
>>>>>   at java.util.concurrent.ThreadPoolExecutor.addThread(ThreadPoolExecutor.java:681)
>>>>>   at java.util.concurrent.ThreadPoolExecutor.addIfUnderMaximumPoolSize(ThreadPoolExecutor.java:727)
>>>>>   at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:655)
>>>>>   at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:92)
>>>>>   at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:159)
>>>>>   at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:93)
>>>>>   at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97)
>>>>>   at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44)
>>>>>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>>>>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
>>>>>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>>>>>   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
>>>>
>>>> Did anyone run into the same issue? I am not sure why the new native
>>>> thread is not being created. The link below [0] says that it might be
>>>> due to a limit on the number of processes in my OS. Will increasing
>>>> that limit solve the issue?
>>>>
>>>> [0] - http://ww2.cs.fsu.edu/~czhang/errors.html
>>>>
>>>> Thanks!
>>>>
>>>> --
>>>> Kiran Chitturi
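P.S. To retry just the failed parse step instead of re-crawling: a rough
sketch, assuming bash and that your bin/nutch picks up NUTCH_OPTS; the
segment name is a placeholder and the ulimit value is only illustrative:

    # raise the per-user process cap for this shell session
    ulimit -u 2048
    # smaller per-thread stacks leave room for more native threads;
    # in local mode these options apply to the single local JVM
    export NUTCH_OPTS="-Xmx1000m -Xss512k"
    # re-run only the parse step on the segment that failed
    bin/nutch parse crawl/segments/20130301...

Note that with the LocalJobRunner no child JVMs are forked, so
mapred.child.java.opts generally does not take effect in local mode.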