After all documents are fetched (and, if parsing is enabled, parsed), the
segment has to be written: Nutch finishes sorting the data and copies it from
the local temp dir (hadoop.tmp.dir) to the segment directory. If I/O is a
bottleneck this may take a while. It also looks like you have a lot of content!
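If you want to see whether that finishing phase is still making progress, you can watch the local spill directory grow and shrink. A minimal sketch; the default path is an assumption (hadoop.tmp.dir commonly resolves to /tmp/hadoop-$USER in local mode), so check your own config first:

```shell
#!/bin/sh
# Assumed default for hadoop.tmp.dir in local mode -- adjust to whatever
# your nutch-site.xml / core-site.xml actually says.
TMP_DIR="${HADOOP_TMP_DIR:-/tmp/hadoop-$USER}"

if [ -d "$TMP_DIR" ]; then
  # Snapshot the size; rerun (or wrap in a loop) to see the copy progress.
  du -sh "$TMP_DIR"
else
  echo "no local temp dir at $TMP_DIR (yet)"
fi
```

Rerunning this every minute or so during the "finished" gap shows whether data is still being sorted and moved.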

On 03/04/2013 06:03 AM, kiran chitturi wrote:
> Thanks for your suggestion guys! The big crawl is fetching large amount of
> big PDF files.
> 
> For something like below, the fetcher took a lot of time to finish up, even
> though the files are fetched. It shows more than one hour of time.
> 
>>
>> 2013-03-01 19:45:43,217 INFO  fetcher.Fetcher - -activeThreads=0,
>> spinWaiting=0, fetchQueues.totalSize=0
>> 2013-03-01 19:45:43,217 INFO  fetcher.Fetcher - -activeThreads=0
>> 2013-03-01 20:57:55,288 INFO  fetcher.Fetcher - Fetcher: finished at
>> 2013-03-01 20:57:55, elapsed: 01:34:09
> 
> 
> Does fetching a lot of files cause this issue? Should I stick to one
> thread in local mode, or use pseudo-distributed mode to improve
> performance?
> 
> What is an acceptable time for the fetcher to finish up after fetching the
> files? What exactly happens in this step?
> 
> Thanks again!
> Kiran.
> 
> 
> 
> On Sun, Mar 3, 2013 at 4:55 PM, Markus Jelsma
> <[email protected]> wrote:
> 
>> The default heap size of 1G is just enough for a parsing fetcher with 10
>> threads. The only problem that may arise is too large and complicated PDF
>> files or very large HTML files. If you generate fetch lists of a reasonable
>> size there won't be a problem most of the time. And if you want to crawl a
>> lot, then just generate more small segments.
>>
>> If there is a bug it's most likely to be the parser eating memory and not
>> releasing it.
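Markus's "generate more small segments" advice can be scripted. A rough sketch of one generate/fetch/parse/update round with a capped fetch list, following the usual Nutch 1.x command cycle; the directory layout and the topN value are assumptions:

```shell
#!/bin/sh
# Assumed layout: crawldb and segments under ./crawl -- adjust as needed.
CRAWLDB=crawl/crawldb
SEGMENTS=crawl/segments
TOPN=1000           # small segments: cap each fetch list at 1000 URLs

if [ -x bin/nutch ]; then
  # One round; repeat for as many rounds as you need.
  bin/nutch generate "$CRAWLDB" "$SEGMENTS" -topN "$TOPN"
  SEGMENT=$(ls -d "$SEGMENTS"/2* | tail -1)   # newest segment
  bin/nutch fetch "$SEGMENT"
  bin/nutch parse "$SEGMENT"
  bin/nutch updatedb "$CRAWLDB" "$SEGMENT"
else
  echo "run this from the Nutch runtime directory (bin/nutch not found)"
fi
```

Smaller segments mean each sort-and-copy finishing phase is shorter, and a crashed round loses less work.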
>>
>> -----Original message-----
>>> From:Tejas Patil <[email protected]>
>>> Sent: Sun 03-Mar-2013 22:19
>>> To: [email protected]
>>> Subject: Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create
>> new native thread
>>>
>>> I agree with Sebastian. It was a crawl in local mode and not over a
>>> cluster. The intended crawl volume is huge, and if we don't override the
>>> default heap size to some decent value, there is a high possibility of
>>> facing an OOM.
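In local mode the heap can be raised through the NUTCH_HEAPSIZE environment variable that bin/nutch reads (value in MB). A sketch; 2000 is just an assumed value leaving headroom for the OS on a 4 GB machine, and the segment name is hypothetical:

```shell
#!/bin/sh
# NUTCH_HEAPSIZE is read by bin/nutch and passed to the JVM as -Xmx (in MB).
# 2000 is an assumed value for a 4 GB box; tune to your machine.
NUTCH_HEAPSIZE=2000
export NUTCH_HEAPSIZE

if [ -x bin/nutch ]; then
  bin/nutch parse crawl/segments/20130301194500   # hypothetical segment name
fi
```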
>>>
>>>
>>> On Sun, Mar 3, 2013 at 1:04 PM, kiran chitturi
>>> <[email protected]> wrote:
>>>
>>>>> If you find the time you should trace the process.
>>>>> Seems to be either a misconfiguration or even a bug.
>>>>>
>>>> I will try to track this down soon with the previous configuration.
>>>> Right now, I am just trying to get data crawled by Monday.
>>>>
>>>> Kiran.
>>>>
>>>>
>>>>>>> Luckily, you should be able to retry via "bin/nutch parse ..."
>>>>>>> Then trace the system and the Java process to catch the reason.
>>>>>>>
>>>>>>> Sebastian
>>>>>>>
>>>>>>> On 03/02/2013 08:13 PM, kiran chitturi wrote:
>>>>>>>> Sorry, I am looking to crawl 400k documents with the crawl. I said
>>>>>>>> 400 in my last message.
>>>>>>>>
>>>>>>>>
>>>>>>>>> On Sat, Mar 2, 2013 at 2:12 PM, kiran chitturi
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi!
>>>>>>>>>
>>>>>>>>> I am running Nutch 1.6 on a 4 GB Mac OS desktop with Core i5
>> 2.8GHz.
>>>>>>>>>
>>>>>>>>> Last night I started a crawl in local mode for 5 seeds with the
>>>>>>>>> config given below. If the crawl goes well, it should fetch a
>>>>>>>>> total of 400 documents. The crawling is done on a single host
>>>>>>>>> that we own.
>>>>>>>>>
>>>>>>>>> Config
>>>>>>>>> ---------------------
>>>>>>>>>
>>>>>>>>> fetcher.threads.per.queue - 2
>>>>>>>>> fetcher.server.delay - 1
>>>>>>>>> fetcher.throughput.threshold.pages - -1
>>>>>>>>>
>>>>>>>>> crawl script settings
>>>>>>>>> ----------------------------
>>>>>>>>> timeLimitFetch - 30
>>>>>>>>> numThreads - 5
>>>>>>>>> topN - 10000
>>>>>>>>> mapred.child.java.opts=-Xmx1000m
>>>>>>>>>
>>>>>>>>>
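The three fetcher properties listed above live in conf/nutch-site.xml. A sketch of the corresponding XML, written out via a heredoc so it can be checked; the property names and values are exactly those from the config listing, but the fragment filename is an assumption and should be merged into your real nutch-site.xml rather than replacing it:

```shell
#!/bin/sh
# Write the fetcher properties from the message into an XML fragment.
cat > nutch-site-fragment.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>2</value>
  </property>
  <property>
    <name>fetcher.server.delay</name>
    <value>1</value>
  </property>
  <property>
    <name>fetcher.throughput.threshold.pages</name>
    <value>-1</value>
  </property>
</configuration>
EOF
echo "wrote $(grep -c '<property>' nutch-site-fragment.xml) properties"
```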
>>>>>>>>> I have noticed today that the crawl has stopped due to an error,
>>>>>>>>> and I have found the below error in the logs.
>>>>>>>>>
>>>>>>>>>> 2013-03-01 21:45:03,767 INFO  parse.ParseSegment - Parsed (0ms):
>>>>>>>>>> http://scholar.lib.vt.edu/ejournals/JARS/v33n3/v33n3-letcher.htm
>>>>>>>>>> 2013-03-01 21:45:03,790 WARN  mapred.LocalJobRunner - job_local_0001
>>>>>>>>>> java.lang.OutOfMemoryError: unable to create new native thread
>>>>>>>>>>         at java.lang.Thread.start0(Native Method)
>>>>>>>>>>         at java.lang.Thread.start(Thread.java:658)
>>>>>>>>>>         at java.util.concurrent.ThreadPoolExecutor.addThread(ThreadPoolExecutor.java:681)
>>>>>>>>>>         at java.util.concurrent.ThreadPoolExecutor.addIfUnderMaximumPoolSize(ThreadPoolExecutor.java:727)
>>>>>>>>>>         at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:655)
>>>>>>>>>>         at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:92)
>>>>>>>>>>         at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:159)
>>>>>>>>>>         at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:93)
>>>>>>>>>>         at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97)
>>>>>>>>>>         at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44)
>>>>>>>>>>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>>>>>>>>>         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
>>>>>>>>>>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>>>>>>>>>>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Did anyone run into the same issue? I am not sure why the new
>>>>>>>>> native thread could not be created. The link [0] says this might
>>>>>>>>> be due to the limit on the number of processes in my OS. Will
>>>>>>>>> increasing that limit solve the issue?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> [0] - http://ww2.cs.fsu.edu/~czhang/errors.html
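"unable to create new native thread" usually points at the per-user process/thread limit rather than the heap, so the link's suggestion is worth checking. A quick way to compare the limit against current usage (a sketch; `ulimit -u` works in bash and on macOS, and the process count is only a rough proxy for thread count):

```shell
#!/bin/sh
# Max processes/threads the current user may create; fall back gracefully
# if this shell doesn't support -u.
NPROC_LIMIT=$(ulimit -u 2>/dev/null || echo "unknown")
echo "max user processes: $NPROC_LIMIT"

# Rough count of this user's current processes, to compare against the limit.
RUNNING=$(ps -u "$(id -un)" 2>/dev/null | wc -l)
echo "roughly $RUNNING processes running"
```

If the two numbers are close, raising the limit (e.g. `ulimit -u 2048` before starting the crawl) is a reasonable experiment; if they are far apart, the heap or the parser is the more likely culprit.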
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Kiran Chitturi
>>>>>>>>>
>>>> --
>>>> Kiran Chitturi
>>>>
>>>
>>
