Thanks, Sebastian, for the details. This was the bottleneck I had when I was
fetching 10k files. Now that I have switched to 2k, the gap is down to about 6
minutes. It took me some time to find the right configuration for local mode.
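
For reference, the topN change corresponds to something like the generate call
below (directory names are just placeholders for my crawl layout):

  bin/nutch generate crawl/crawldb crawl/segments -topN 2000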



On Mon, Mar 4, 2013 at 3:33 PM, Sebastian Nagel
<[email protected]>wrote:

> After all documents are fetched (and possibly parsed) the segment has to be
> written: finish sorting the data and copy it from the local temp dir
> (hadoop.tmp.dir) to the segment directory. If I/O is a bottleneck this may
> take a while. It also looks like you have a lot of content!
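>
> If that temp dir sits on a slow disk you can point it somewhere faster, e.g.
> in nutch-site.xml or via the crawl script properties (the path below is only
> an example, not a recommendation):
>
>   hadoop.tmp.dir=/fast-local-disk/hadoop-tmp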
>
> On 03/04/2013 06:03 AM, kiran chitturi wrote:
> > Thanks for your suggestions, guys! The big crawl is fetching a large
> > number of big PDF files.
> >
> > For something like the log below, the fetcher took a long time to finish
> > up even though the files were already fetched. It shows more than an hour
> > of elapsed time.
> >
> >>
> >> 2013-03-01 19:45:43,217 INFO  fetcher.Fetcher - -activeThreads=0,
> >> spinWaiting=0, fetchQueues.totalSize=0
> >> 2013-03-01 *19:45:43,217* INFO  fetcher.Fetcher - -activeThreads=0
> >> 2013-03-01 *20:57:55,288* INFO  fetcher.Fetcher - Fetcher: finished at
> >> 2013-03-01 20:57:55, elapsed: 01:34:09
> >
> >
> > Does fetching a lot of files cause this issue? Should I stick to one
> > thread in local mode, or use pseudo-distributed mode to improve
> > performance?
> >
> > What is an acceptable time for the fetcher to finish up after the files
> > are fetched? What exactly happens in this step?
> >
> > Thanks again!
> > Kiran.
> >
> >
> >
> > On Sun, Mar 3, 2013 at 4:55 PM, Markus Jelsma <
> [email protected]>wrote:
> >
> >> The default heap size of 1G is just enough for a parsing fetcher with 10
> >> threads. The only problem that may arise is with very large and
> >> complicated PDF files or very large HTML files. If you generate fetch
> >> lists of a reasonable size there won't be a problem most of the time. And
> >> if you want to crawl a lot, then just generate more small segments.
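> >>
> >> If you do need more headroom, a sketch of how to raise it (values below
> >> are examples only): export NUTCH_HEAPSIZE=2000 before running bin/nutch,
> >> or bump the child task heap, e.g.
> >>
> >>   mapred.child.java.opts=-Xmx2000m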
> >>
> >> If there is a bug, it's most likely the parser eating memory and not
> >> releasing it.
> >>
> >> -----Original message-----
> >>> From:Tejas Patil <[email protected]>
> >>> Sent: Sun 03-Mar-2013 22:19
> >>> To: [email protected]
> >>> Subject: Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create
> >> new native thread
> >>>
> >>> I agree with Sebastian. It was a crawl in local mode and not over a
> >>> cluster. The intended crawl volume is huge, and if we don't override the
> >>> default heap size to some decent value, there is a high possibility of
> >>> facing an OOM.
> >>>
> >>>
> >>> On Sun, Mar 3, 2013 at 1:04 PM, kiran chitturi <
> >> [email protected]>wrote:
> >>>
> >>>>> If you find the time you should trace the process.
> >>>>> Seems to be either a misconfiguration or even a bug.
> >>>>>
> >>>> I will try to track this down soon with the previous configuration.
> >>>> Right now, I am just trying to get the data crawled by Monday.
> >>>>
> >>>> Kiran.
> >>>>
> >>>>
> >>>>>>> Luckily, you should be able to retry via "bin/nutch parse ..."
> >>>>>>> Then trace the system and the Java process to catch the reason.
> >>>>>>>
> >>>>>>> Sebastian
> >>>>>>>
> >>>>>>> On 03/02/2013 08:13 PM, kiran chitturi wrote:
> >>>>>>>> Sorry, I am looking to crawl 400k documents with this crawl, not the
> >>>>>>>> 400 I said in my last message.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Sat, Mar 2, 2013 at 2:12 PM, kiran chitturi <
> >>>>>>> [email protected]>wrote:
> >>>>>>>>
> >>>>>>>>> Hi!
> >>>>>>>>>
> >>>>>>>>> I am running Nutch 1.6 on a 4 GB Mac OS desktop with a 2.8 GHz
> >>>>>>>>> Core i5.
> >>>>>>>>>
> >>>>>>>>> Last night I started a crawl in local mode for 5 seeds with the
> >>>>>>>>> config given below. If the crawl goes well, it should fetch a total
> >>>>>>>>> of 400 documents. The crawling is done on a single host that we own.
> >>>>>>>>>
> >>>>>>>>> Config
> >>>>>>>>> ---------------------
> >>>>>>>>>
> >>>>>>>>> fetcher.threads.per.queue - 2
> >>>>>>>>> fetcher.server.delay - 1
> >>>>>>>>> fetcher.throughput.threshold.pages - -1
> >>>>>>>>>
> >>>>>>>>> crawl script settings
> >>>>>>>>> ----------------------------
> >>>>>>>>> timeLimitFetch- 30
> >>>>>>>>> numThreads - 5
> >>>>>>>>> topN - 10000
> >>>>>>>>> mapred.child.java.opts=-Xmx1000m
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> I noticed today that the crawl stopped due to an error, and I found
> >>>>>>>>> the error below in the logs.
> >>>>>>>>>
> >>>>>>>>>> 2013-03-01 21:45:03,767 INFO  parse.ParseSegment - Parsed (0ms):
> >>>>>>>>>> http://scholar.lib.vt.edu/ejournals/JARS/v33n3/v33n3-letcher.htm
> >>>>>>>>>> 2013-03-01 21:45:03,790 WARN  mapred.LocalJobRunner - job_local_0001
> >>>>>>>>>> java.lang.OutOfMemoryError: unable to create new native thread
> >>>>>>>>>>         at java.lang.Thread.start0(Native Method)
> >>>>>>>>>>         at java.lang.Thread.start(Thread.java:658)
> >>>>>>>>>>         at java.util.concurrent.ThreadPoolExecutor.addThread(ThreadPoolExecutor.java:681)
> >>>>>>>>>>         at java.util.concurrent.ThreadPoolExecutor.addIfUnderMaximumPoolSize(ThreadPoolExecutor.java:727)
> >>>>>>>>>>         at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:655)
> >>>>>>>>>>         at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:92)
> >>>>>>>>>>         at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:159)
> >>>>>>>>>>         at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:93)
> >>>>>>>>>>         at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97)
> >>>>>>>>>>         at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44)
> >>>>>>>>>>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> >>>>>>>>>>         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
> >>>>>>>>>>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
> >>>>>>>>>>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Did anyone run into the same issue? I am not sure why the new
> >>>>>>>>> native thread cannot be created. The link below [0] says that it
> >>>>>>>>> might be due to a limit on the number of processes in my OS. Will
> >>>>>>>>> increasing it solve the issue?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> [0] - http://ww2.cs.fsu.edu/~czhang/errors.html
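> >>>>>>>>>
> >>>>>>>>> (On OS X the current limits can be checked with something like the
> >>>>>>>>> commands below; whether raising them fixes this is only a guess on
> >>>>>>>>> my part.)
> >>>>>>>>>
> >>>>>>>>>   ulimit -u
> >>>>>>>>>   sysctl kern.maxproc kern.maxprocperuid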
> >>>>>>>>>
> >>>>>>>>> Thanks!
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Kiran Chitturi
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>> --
> >>>> Kiran Chitturi
> >>>>
> >>>
> >>
> >
> >
> >
>
>


-- 
Kiran Chitturi
