Thanks, Sebastian, for the details. This was the bottleneck I had when I was fetching 10k files. Now I have switched to 2k and the gap is down to about 6 minutes. It took me some time to find the right configuration in local mode.
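In case it is useful to anyone else, this is roughly what the relevant part of my setup looks like now (the paths are placeholders and the topN value is just what worked on my machine, not a general recommendation):

    # generate smaller segments (~2k URLs each) instead of a single 10k segment
    bin/nutch generate crawl/crawldb crawl/segments -topN 2000

    # if IO on the local temp dir is the bottleneck (as Sebastian describes below),
    # hadoop.tmp.dir can be pointed at a faster local disk, e.g. in conf/nutch-site.xml:
    #   hadoop.tmp.dir = /path/to/fast/local/tmp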
On Mon, Mar 4, 2013 at 3:33 PM, Sebastian Nagel <[email protected]> wrote:

> After all documents are fetched (and ev. parsed) the segment has to be written:
> finish sorting the data and copy it from local temp dir (hadoop.tmp.dir) to the
> segment directory. If IO is a bottleneck this may take a while. Also looks like
> you have a lot of content!
>
> On 03/04/2013 06:03 AM, kiran chitturi wrote:
> > Thanks for your suggestion guys! The big crawl is fetching large amount of
> > big PDF files.
> >
> > For something like below, the fetcher took a lot of time to finish up, even
> > though the files are fetched. It shows more than one hour of time.
> >
> >> 2013-03-01 19:45:43,217 INFO fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> >> 2013-03-01 *19:45:43,217* INFO fetcher.Fetcher - -activeThreads=0
> >> 2013-03-01 *20:57:55,288* INFO fetcher.Fetcher - Fetcher: finished at 2013-03-01 20:57:55, elapsed: 01:34:09
> >
> > Does fetching a lot of files causes this issue ? Should i stick to one
> > thread per local mode or use pseudo distributed mode to improve performance ?
> >
> > What is an acceptable time fetcher should finish up after fetching the
> > files ? What exactly happens in this step ?
> >
> > Thanks again!
> > Kiran.
> >
> > On Sun, Mar 3, 2013 at 4:55 PM, Markus Jelsma <[email protected]> wrote:
> >
> >> The default heap size of 1G is just enough for a parsing fetcher with 10
> >> threads. The only problem that may rise is too large and complicated PDF
> >> files or very large HTML files. If you generate fetch lists of a reasonable
> >> size there won't be a problem most of the time. And if you want to crawl a
> >> lot, then just generate more small segments.
> >>
> >> If there is a bug it's most likely to be the parser eating memory and not
> >> releasing it.
> >>
> >> -----Original message-----
> >>> From: Tejas Patil <[email protected]>
> >>> Sent: Sun 03-Mar-2013 22:19
> >>> To: [email protected]
> >>> Subject: Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread
> >>>
> >>> I agree with Sebastian. It was a crawl in local mode and not over a
> >>> cluster. The intended crawl volume is huge and if we dont override the
> >>> default heap size to some decent value, there is high possibility of facing
> >>> an OOM.
> >>>
> >>> On Sun, Mar 3, 2013 at 1:04 PM, kiran chitturi <[email protected]> wrote:
> >>>
> >>>>> If you find the time you should trace the process.
> >>>>> Seems to be either a misconfiguration or even a bug.
> >>>>>
> >>>> I will try to track this down soon with the previous configuration. Right
> >>>> now, i am just trying to get data crawled by Monday.
> >>>>
> >>>> Kiran.
> >>>>
> >>>>>>> Luckily, you should be able to retry via "bin/nutch parse ..."
> >>>>>>> Then trace the system and the Java process to catch the reason.
> >>>>>>>
> >>>>>>> Sebastian
> >>>>>>>
> >>>>>>> On 03/02/2013 08:13 PM, kiran chitturi wrote:
> >>>>>>>> Sorry, i am looking to crawl 400k documents with the crawl. I said 400 in
> >>>>>>>> my last message.
> >>>>>>>>
> >>>>>>>> On Sat, Mar 2, 2013 at 2:12 PM, kiran chitturi <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>>> Hi!
> >>>>>>>>>
> >>>>>>>>> I am running Nutch 1.6 on a 4 GB Mac OS desktop with Core i5 2.8GHz.
> >>>>>>>>>
> >>>>>>>>> Last night i started a crawl on local mode for 5 seeds with the config
> >>>>>>>>> given below. If the crawl goes well, it should fetch a total of 400
> >>>>>>>>> documents. The crawling is done on a single host that we own.
> >>>>>>>>>
> >>>>>>>>> Config
> >>>>>>>>> ---------------------
> >>>>>>>>>
> >>>>>>>>> fetcher.threads.per.queue - 2
> >>>>>>>>> fetcher.server.delay - 1
> >>>>>>>>> fetcher.throughput.threshold.pages - -1
> >>>>>>>>>
> >>>>>>>>> crawl script settings
> >>>>>>>>> ----------------------------
> >>>>>>>>> timeLimitFetch - 30
> >>>>>>>>> numThreads - 5
> >>>>>>>>> topN - 10000
> >>>>>>>>> mapred.child.java.opts=-Xmx1000m
> >>>>>>>>>
> >>>>>>>>> I have noticed today that the crawl has stopped due to an error and i have
> >>>>>>>>> found the below error in logs.
> >>>>>>>>>
> >>>>>>>>> 2013-03-01 21:45:03,767 INFO parse.ParseSegment - Parsed (0ms):
> >>>>>>>>>> http://scholar.lib.vt.edu/ejournals/JARS/v33n3/v33n3-letcher.htm
> >>>>>>>>>> 2013-03-01 21:45:03,790 WARN mapred.LocalJobRunner - job_local_0001
> >>>>>>>>>> java.lang.OutOfMemoryError: unable to create new native thread
> >>>>>>>>>> at java.lang.Thread.start0(Native Method)
> >>>>>>>>>> at java.lang.Thread.start(Thread.java:658)
> >>>>>>>>>> at java.util.concurrent.ThreadPoolExecutor.addThread(ThreadPoolExecutor.java:681)
> >>>>>>>>>> at java.util.concurrent.ThreadPoolExecutor.addIfUnderMaximumPoolSize(ThreadPoolExecutor.java:727)
> >>>>>>>>>> at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:655)
> >>>>>>>>>> at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:92)
> >>>>>>>>>> at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:159)
> >>>>>>>>>> at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:93)
> >>>>>>>>>> at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97)
> >>>>>>>>>> at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44)
> >>>>>>>>>> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> >>>>>>>>>> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
> >>>>>>>>>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
> >>>>>>>>>> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
> >>>>>>>>>> (END)
> >>>>>>>>>
> >>>>>>>>> Did anyone run in to the same issue ? I am not sure why the new native
> >>>>>>>>> thread is not being created. The link here says [0] that it might due to
> >>>>>>>>> the limitation of number of processes in my OS. Will increase them solve
> >>>>>>>>> the issue ?
> >>>>>>>>>
> >>>>>>>>> [0] - http://ww2.cs.fsu.edu/~czhang/errors.html
> >>>>>>>>>
> >>>>>>>>> Thanks!
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Kiran Chitturi
> >>>>
> >>>> --
> >>>> Kiran Chitturi

--
Kiran Chitturi
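PS: for anyone who finds this thread because of the "unable to create new native thread" error quoted above, a minimal sketch of the things I would check first when running in local mode on a Mac (the numbers are only examples, not tested recommendations):

    # check how many processes/threads the current user is allowed to create
    ulimit -u

    # raise the per-session limit if it is low and the hard limit permits it
    ulimit -u 2048

    # and give the parsing fetcher enough heap, as in the crawl script settings above
    mapred.child.java.opts=-Xmx1000m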

