Hi Matteo,

have a look at the property hadoop.tmp.dir, which allows you to point the
temp folder at another volume with more space on it.
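For example, you can set it with something like the following in
conf/nutch-site.xml (/data/nutch-tmp is just a placeholder here for a volume
with enough free space):

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/nutch-tmp</value>
  </property>
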
For "local" crawls:
 - do not share this folder for two simultaneously running Nutch jobs
 - you have to clean-up the temp folder, esp. after failed jobs
   (if no job is currently running with this folder defined as hadoop.tmp.dir
    a clean-up is save)
   Successful jobs do not leave any data in temp except for empty directories.
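
A minimal clean-up sketch, assuming no Nutch job is currently running and
hadoop.tmp.dir points at /data/nutch-tmp as in the example above:

  # remove leftover temporary data, typically left behind by failed jobs
  rm -rf /data/nutch-tmp/*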

Sebastian

P.S.:
Search for nutch + hadoop.tmp.dir; there is plenty of information on the wiki
and the mailing lists.


On 09/19/2012 10:07 AM, Matteo Simoncini wrote:
> Hi,
> 
> I'm running Nutch 1.5.1 on a virtual machine to crawl a large number of URLs.
> I gave enough space to the "crawl" folder, the one where the linkdb and
> crawldb go, and to the Solr folder.
> 
> It worked fine up to 200,000 URLs, but now I get an IOException saying that
> there isn't enough space.
> 
> Looking at the "crawl" folder and the Solr folder, everything is fine. The
> exception is raised because the temp folder (actually the temp/hadoop-root
> folder) has grown to 14GB.
> 
> The solutions to my problem that I can think of are:
> 
> 1) Delete some of the temp files. But which ones, and when?
> 2) Make Nutch write its temp files to another directory (maybe
> <nutch_folder>/tmp).
> 
> How can I do that? Is there a third, better solution?
> 
> Here is a copy of my script:
> 
> #!/bin/bash
> 
> # inject the initial seed into the crawldb
> bin/nutch inject test/crawldb urls
> 
> # initialization of the variables
> counter=1
> error=0
> 
> # while there is no error
> while [ $error -ne 1 ]
> do
>     # generating phase: create a new segment of URLs to fetch
>     echo "[ Script ] Starting generating phase"
>     bin/nutch generate test/crawldb test/segments -topN 10000
>     if [ $? -ne 0 ]
>     then
>         echo "[ Script ] Stopping: No more URLs to fetch."
>         error=1
>         break
>     fi
>     segment=`ls -d test/segments/2* | tail -1`
> 
>     # fetching phase
>     echo "[ Script ] Starting fetching phase"
>     bin/nutch fetch $segment -threads 20
>     if [ $? -ne 0 ]
>     then
>         echo "[ Script ] Fetch $segment failed. Deleting it."
>         rm -rf $segment
>         continue
>     fi
> 
>     # parsing phase
>     echo "[ Script ] Starting parsing phase"
>     bin/nutch parse $segment
> 
>     # updatedb phase
>     echo "[ Script ] Starting updatedb phase"
>     bin/nutch updatedb test/crawldb $segment
> 
>     # indexing with Solr
>     bin/nutch invertlinks test/linkdb -dir test/segments
>     bin/nutch solrindex http://127.0.0.1:8983/solr/ test/crawldb -linkdb test/linkdb test/segments/*
> done
> 
> 
> Thanks for your help.
> 
> Matteo
> 
