Hi Matteo,
have a look at the property hadoop.tmp.dir, which lets you point the temp
folder to another volume with more space on it.
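For example, you could add something like this to conf/nutch-site.xml
(the path below is just an example; point it at whatever volume has enough space):

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/mnt/bigdisk/hadoop-tmp</value>
    <description>Base directory for temporary files (example path)</description>
  </property>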
For "local" crawls:
- do not share this folder between two simultaneously running Nutch jobs
- you have to clean up the temp folder, especially after failed jobs
  (if no job is currently running with this folder defined as hadoop.tmp.dir,
  a clean-up is safe; see the example below)
Successful jobs do not leave any data in temp except for empty directories.
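Once no job is using that folder as its hadoop.tmp.dir, the clean-up is just
something like (assuming the example path from above):

  rm -rf /mnt/bigdisk/hadoop-tmp/*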
Sebastian
P.S.:
Search for nutch + hadoop.tmp.dir; there is plenty of information on the wiki and
the mailing lists.
On 09/19/2012 10:07 AM, Matteo Simoncini wrote:
> Hi,
>
> I'm running Nutch 1.5.1 on a Virtual Machine to crawl a large number of URLs.
> I gave enough space to the "crawl" folder, the one where the linkDB and
> crawlDB go, and to the Solr folder.
>
> It worked fine until 200,000 URLs, but now I get an IOException saying
> that there isn't enough memory.
>
> Looking at the "crawl" folder or the Solr folder, everything is fine. The
> exception was thrown because the temp folder (actually the temp/hadoop-root
> folder) has grown to 14 GB.
>
> The solutions to my problem that I can think of are:
>
> 1) Delete some temp files. But which ones, and when?
> 2) Make Nutch generate its temp files in another directory (maybe
> <nutch_folder>/tmp)
>
> How can I do that? Is there a third, better solution?
>
> Here is a copy of my script.
>
> #!/bin/bash
>
> # inject the initial seed into the crawlDB
> bin/nutch inject test/crawldb urls
>
> # initialization of the variables
> counter=1
> error=0
>
> # while there is no error
> while [ $error -ne 1 ]
> do
>     # generate a new segment of up to 10000 URLs (topN)
>     echo "[ Script ] Starting generating phase"
>     bin/nutch generate test/crawldb test/segments -topN 10000
>     if [ $? -ne 0 ]
>     then
>         echo "[ Script ] Stopping: No more URLs to fetch."
>         error=1
>         break
>     fi
>     segment=$(ls -d test/segments/2* | tail -1)
>
>     # fetching phase
>     echo "[ Script ] Starting fetching phase"
>     bin/nutch fetch "$segment" -threads 20
>     if [ $? -ne 0 ]
>     then
>         echo "[ Script ] Fetch $segment failed. Deleting it."
>         rm -rf "$segment"
>         continue
>     fi
>
>     # parsing phase
>     echo "[ Script ] Starting parsing phase"
>     bin/nutch parse "$segment"
>
>     # updateDB phase
>     echo "[ Script ] Starting updateDB phase"
>     bin/nutch updatedb test/crawldb "$segment"
>
>     # indexing with Solr
>     bin/nutch invertlinks test/linkdb -dir test/segments
>     bin/nutch solrindex http://127.0.0.1:8983/solr/ test/crawldb -linkdb test/linkdb test/segments/*
> done
>
>
> Thanks for your help.
>
> Matteo
>