Thanks, you really helped a lot.

Matteo

2012/9/20 Sebastian Nagel <[email protected]>

> Hi Matteo,
>
> have a look at the property hadoop.tmp.dir which allows you to direct
> the temp folder to another volume with more space on it.
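> For example, a property like the following inside the <configuration>
> element of conf/nutch-site.xml should do it (the path is only an example,
> point it at any volume with enough free space):
>
>   <property>
>     <name>hadoop.tmp.dir</name>
>     <!-- example path: use any local volume with enough free space -->
>     <value>/data/nutch-tmp</value>
>   </property>
>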
> For "local" crawls:
>  - do not share this folder between two simultaneously running Nutch jobs
>  - you have to clean up the temp folder, especially after failed jobs
>    (a clean-up is safe if no job is currently running with this folder
>    set as hadoop.tmp.dir)
>    Successful jobs do not leave any data in temp except for empty
>    directories.
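> If nothing is running, the clean-up can be as simple as (again assuming
> /data/nutch-tmp is the configured hadoop.tmp.dir):
>
>   rm -rf /data/nutch-tmp/*   # example path, only while no job is running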
>
> Sebastian
>
> P.S.:
> Search for nutch + hadoop.tmp.dir; there is plenty of information on the
> wiki and the mailing lists.
>
>
> On 09/19/2012 10:07 AM, Matteo Simoncini wrote:
> > Hi,
> >
> > I'm running Nutch 1.5.1 on a virtual machine to crawl a large number of
> > URLs.
> > I gave enough space to the "crawl" folder, the one where the linkdb and
> > crawldb go, and to the Solr folder.
> >
> > It worked fine up to 200,000 URLs, but now I get an IOException saying
> > that there isn't enough memory.
> >
> > Looking at the "crawl" folder or the Solr folder, everything is fine. The
> > exception was raised because the temp folder (actually the temp/hadoop-root
> > folder) has grown to 14 GB.
> >
> > The solutions to my problem I can think of are:
> >
> > 1) Delete some temp files. But which ones, and when?
> > 2) Make Nutch generate its temp files in another directory (maybe
> > <nutch_folder>/tmp).
> >
> > How can I do that? Is there a third, better solution?
> >
> > Here is a copy of my script.
> >
> > #!/bin/bash
> >
> > # inject the initial seed into the crawldb
> > bin/nutch inject test/crawldb urls
> >
> > # initialization of the variables
> > counter=1
> > error=0
> >
> > # while there is no error
> > while [ $error -ne 1 ]
> > do
> >     # generate a segment of at most 10000 URLs
> >     echo "[ Script ] Starting generating phase"
> >     bin/nutch generate test/crawldb test/segments -topN 10000
> >     if [ $? -ne 0 ]
> >     then
> >         echo "[ Script ] Stopping: no more URLs to fetch."
> >         error=1
> >         break
> >     fi
> >     segment=$(ls -d test/segments/2* | tail -1)
> >
> >     # fetching phase
> >     echo "[ Script ] Starting fetching phase"
> >     bin/nutch fetch "$segment" -threads 20
> >     if [ $? -ne 0 ]
> >     then
> >         echo "[ Script ] Fetch $segment failed. Deleting it."
> >         rm -rf "$segment"
> >         continue
> >     fi
> >
> >     # parsing phase
> >     echo "[ Script ] Starting parsing phase"
> >     bin/nutch parse "$segment"
> >
> >     # updatedb phase
> >     echo "[ Script ] Starting updateDB phase"
> >     bin/nutch updatedb test/crawldb "$segment"
> >
> >     # indexing with Solr
> >     bin/nutch invertlinks test/linkdb -dir test/segments
> >     bin/nutch solrindex http://127.0.0.1:8983/solr/ test/crawldb \
> >         -linkdb test/linkdb test/segments/*
> > done
> >
> >
> > Thanks for your help.
> >
> > Matteo
> >
>
>
