Thanks for your suggestions, guys! The big crawl is fetching a large number of big PDF files.
For something like the run below, the fetcher took a long time to finish up, even though the files were already fetched. It shows more than an hour:

2013-03-01 19:45:43,217 INFO fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
2013-03-01 19:45:43,217 INFO fetcher.Fetcher - -activeThreads=0
2013-03-01 20:57:55,288 INFO fetcher.Fetcher - Fetcher: finished at 2013-03-01 20:57:55, elapsed: 01:34:09

Does fetching a lot of files cause this issue? Should I stick to one thread in local mode, or use pseudo-distributed mode to improve performance? What is an acceptable time for the fetcher to finish up after the files are fetched? What exactly happens in this step?

Thanks again!
Kiran.

On Sun, Mar 3, 2013 at 4:55 PM, Markus Jelsma <[email protected]> wrote:

> The default heap size of 1G is just enough for a parsing fetcher with 10
> threads. The only problem that may arise is too large and complicated PDF
> files or very large HTML files. If you generate fetch lists of a reasonable
> size there won't be a problem most of the time. And if you want to crawl a
> lot, then just generate more small segments.
>
> If there is a bug, it's most likely to be the parser eating memory and not
> releasing it.
>
> -----Original message-----
> > From: Tejas Patil <[email protected]>
> > Sent: Sun 03-Mar-2013 22:19
> > To: [email protected]
> > Subject: Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread
> >
> > I agree with Sebastian. It was a crawl in local mode and not over a
> > cluster. The intended crawl volume is huge, and if we don't override the
> > default heap size to some decent value, there is a high possibility of
> > facing an OOM.
> >
> > On Sun, Mar 3, 2013 at 1:04 PM, kiran chitturi <[email protected]> wrote:
> >
> > > > If you find the time you should trace the process.
> > > > Seems to be either a misconfiguration or even a bug.
> > >
> > > I will try to track this down soon with the previous configuration. Right
> > > now, I am just trying to get the data crawled by Monday.
> > >
> > > Kiran.
> > >
> > > > > Luckily, you should be able to retry via "bin/nutch parse ..."
> > > > > Then trace the system and the Java process to catch the reason.
> > > > >
> > > > > Sebastian
> > > > >
> > > > > On 03/02/2013 08:13 PM, kiran chitturi wrote:
> > > > > > Sorry, I am looking to crawl 400k documents with the crawl. I said 400 in
> > > > > > my last message.
> > > > > >
> > > > > > On Sat, Mar 2, 2013 at 2:12 PM, kiran chitturi <[email protected]> wrote:
> > > > > >
> > > > > > > Hi!
> > > > > > >
> > > > > > > I am running Nutch 1.6 on a 4 GB Mac OS desktop with a 2.8 GHz Core i5.
> > > > > > >
> > > > > > > Last night I started a crawl in local mode for 5 seeds with the config
> > > > > > > given below. If the crawl goes well, it should fetch a total of 400
> > > > > > > documents. The crawling is done on a single host that we own.
> > > > > > >
> > > > > > > Config
> > > > > > > ---------------------
> > > > > > > fetcher.threads.per.queue - 2
> > > > > > > fetcher.server.delay - 1
> > > > > > > fetcher.throughput.threshold.pages - -1
> > > > > > >
> > > > > > > crawl script settings
> > > > > > > ----------------------------
> > > > > > > timeLimitFetch - 30
> > > > > > > numThreads - 5
> > > > > > > topN - 10000
> > > > > > > mapred.child.java.opts=-Xmx1000m
> > > > > > >
> > > > > > > I have noticed today that the crawl has stopped due to an error, and I
> > > > > > > have found the error below in the logs.
> > > > > > >
> > > > > > > 2013-03-01 21:45:03,767 INFO parse.ParseSegment - Parsed (0ms): http://scholar.lib.vt.edu/ejournals/JARS/v33n3/v33n3-letcher.htm
> > > > > > > 2013-03-01 21:45:03,790 WARN mapred.LocalJobRunner - job_local_0001
> > > > > > > java.lang.OutOfMemoryError: unable to create new native thread
> > > > > > >         at java.lang.Thread.start0(Native Method)
> > > > > > >         at java.lang.Thread.start(Thread.java:658)
> > > > > > >         at java.util.concurrent.ThreadPoolExecutor.addThread(ThreadPoolExecutor.java:681)
> > > > > > >         at java.util.concurrent.ThreadPoolExecutor.addIfUnderMaximumPoolSize(ThreadPoolExecutor.java:727)
> > > > > > >         at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:655)
> > > > > > >         at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:92)
> > > > > > >         at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:159)
> > > > > > >         at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:93)
> > > > > > >         at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97)
> > > > > > >         at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44)
> > > > > > >         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> > > > > > >         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
> > > > > > >         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
> > > > > > >         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
> > > > > > >
> > > > > > > Did anyone run into the same issue? I am not sure why the new native
> > > > > > > thread is not being created.
> > > > > > > The link here [0] says that it might be due to the limit on the
> > > > > > > number of processes in my OS. Will increasing that limit solve
> > > > > > > the issue?
> > > > > > >
> > > > > > > [0] - http://ww2.cs.fsu.edu/~czhang/errors.html
> > > > > > >
> > > > > > > Thanks!
> > > > > > >
> > > > > > > --
> > > > > > > Kiran Chitturi
> > >
> > > --
> > > Kiran Chitturi

--
Kiran Chitturi
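On the question about the OS process limit in [0]: on most Unix-like systems each Java thread counts against the per-user process limit, so it can be checked and raised from the shell before launching the crawl. A minimal sketch (limits and defaults vary by OS and account, so treat these as illustrative, not a recommendation):

```shell
# Show the current per-user soft limit on processes/threads.
ulimit -S -u

# Raise the soft limit up to the hard limit for this shell session,
# so the crawl started from this shell inherits the higher ceiling.
hard=$(ulimit -H -u)
if [ "$hard" != "unlimited" ]; then
  ulimit -S -u "$hard"
fi

# Confirm the new limit took effect.
ulimit -S -u
```

Raising the limit only helps if threads are actually being released; if the parser leaks threads, any limit will eventually be exhausted.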
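The stack trace above shows the parse being submitted to an executor so it can be abandoned on timeout. A minimal, self-contained sketch of that general pattern (this is not Nutch's actual code; the class and variable names are made up) illustrates why an executor that is created per document but never shut down leaks native threads:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ParseWithTimeout {
    public static void main(String[] args) throws Exception {
        // A worker thread does the parse so the caller can enforce a deadline.
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<String> parse = executor.submit(() -> "parsed content");
        try {
            // Give up if the parse exceeds the timeout.
            String result = parse.get(30, TimeUnit.SECONDS);
            System.out.println(result);
        } catch (TimeoutException e) {
            parse.cancel(true); // abandon a hung parser thread
        } finally {
            // Without this shutdown, every document would leave a live
            // worker thread behind; enough of them hits the OS limit and
            // Thread.start0 fails with "unable to create new native thread".
            executor.shutdownNow();
        }
    }
}
```

This matches Markus's suspicion: the OOM here is not heap exhaustion but thread exhaustion, so raising -Xmx alone would not fix a parser that fails to release its threads.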

