Thanks, you really helped a lot.

Matteo
2012/9/20 Sebastian Nagel <[email protected]>

> Hi Matteo,
>
> Have a look at the property hadoop.tmp.dir, which allows you to direct
> the temp folder to another volume with more space on it.
> For "local" crawls:
> - do not share this folder between two simultaneously running Nutch jobs
> - you have to clean up the temp folder, especially after failed jobs
>   (if no job is currently running with this folder defined as
>   hadoop.tmp.dir, a clean-up is safe)
> Successful jobs do not leave any data in temp except for empty
> directories.
>
> Sebastian
>
> P.S.:
> Search for nutch + hadoop.tmp.dir; there is plenty of information on the
> wiki and the mailing lists.
>
>
> On 09/19/2012 10:07 AM, Matteo Simoncini wrote:
> > Hi,
> >
> > I'm running Nutch 1.5.1 on a virtual machine to crawl a large number
> > of URLs. I gave enough space to the "crawl" folder, the one where the
> > linkdb and crawldb go, and to the Solr folder.
> >
> > It worked fine up to 200,000 URLs, but now I get an IOException saying
> > there is no space left.
> >
> > Looking at the "crawl" folder or the Solr folder, everything is fine.
> > The exception was raised because the temp folder (actually the
> > temp/hadoop-root folder) has grown to 14 GB.
> >
> > The solutions I can think of are:
> >
> > 1) Delete some temp files. But which ones, and when?
> > 2) Make Nutch generate its temp files in another directory (maybe
> >    <nutch_folder>/tmp).
> >
> > How can I do that? Is there a third, better solution?
> >
> > Here is a copy of my script:
> >
> > #!/bin/bash
> >
> > # inject the initial seed into the crawldb
> > bin/nutch inject test/crawldb urls
> >
> > # initialization of the variables
> > counter=1
> > error=0
> >
> > # while there is no error
> > while [ $error -ne 1 ]
> > do
> >     # generate a new segment (up to topN URLs)
> >     echo [ Script ] Starting generating phase
> >     bin/nutch generate test/crawldb test/segments -topN 10000
> >     if [ $? -ne 0 ]
> >     then
> >         echo [ Script ] Stopping: No more URLs to fetch.
> >         error=1
> >         break
> >     fi
> >     segment=`ls -d test/segments/2* | tail -1`
> >
> >     # fetching phase
> >     echo [ Script ] Starting fetching phase
> >     bin/nutch fetch $segment -threads 20
> >     if [ $? -ne 0 ]
> >     then
> >         echo [ Script ] Fetch $segment failed. Deleting it.
> >         rm -rf $segment
> >         continue
> >     fi
> >
> >     # parsing phase
> >     echo [ Script ] Starting parsing phase
> >     bin/nutch parse $segment
> >
> >     # updatedb phase
> >     echo [ Script ] Starting updateDB phase
> >     bin/nutch updatedb test/crawldb $segment
> >
> >     # indexing with Solr
> >     bin/nutch invertlinks test/linkdb -dir test/segments
> >     bin/nutch solrindex http://127.0.0.1:8983/solr/ test/crawldb \
> >         -linkdb test/linkdb test/segments/*
> > done
> >
> > Thanks for your help.
> >
> > Matteo
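
For reference, a minimal sketch of the hadoop.tmp.dir override described above: in local mode Nutch reads properties from conf/nutch-site.xml, so the temp folder can be redirected by adding the property below inside the <configuration> element. The path /data/hadoop-tmp is only an assumed example of a volume with enough free space.

    <!-- goes inside <configuration> in conf/nutch-site.xml -->
    <!-- /data/hadoop-tmp is a placeholder: point it at any volume with enough free space -->
    <property>
      <name>hadoop.tmp.dir</name>
      <value>/data/hadoop-tmp</value>
      <description>Base for Hadoop temporary directories, moved to a larger volume.</description>
    </property>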
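
And a small clean-up sketch along the lines of the advice above: only empty the temp folder when no job is currently running with it defined as hadoop.tmp.dir. The directory path and the pgrep pattern used to detect a running local Nutch job are assumptions, not something prescribed by Nutch itself.

    #!/bin/bash
    # Clean-up sketch: HADOOP_TMP must match the hadoop.tmp.dir value above (assumed path).
    HADOOP_TMP=/data/hadoop-tmp

    # Local Nutch jobs run as Java processes with org.apache.nutch.* main classes,
    # so skip the clean-up if such a process is still alive.
    if pgrep -f "org.apache.nutch" > /dev/null
    then
        echo "A Nutch job appears to be running; not touching $HADOOP_TMP"
    else
        # ${HADOOP_TMP:?} aborts instead of expanding to "/" if the variable is empty.
        rm -rf "${HADOOP_TMP:?}"/*
    fi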

