Any advice? Below the quoted message I have sketched what I had in mind for the two options, in case that makes the question clearer.

Matteo
2012/9/19 Matteo Simoncini <[email protected]>

> Hi,
>
> I'm running Nutch 1.5.1 on a virtual machine to crawl a large number of URLs.
> I gave enough space to the "crawl" folder (the one where the linkDB and
> crawlDB go) and to the Solr folder.
>
> It worked fine up to 200,000 URLs, but now I get an IOException saying that
> there is not enough space.
>
> Looking at the "crawl" folder or the Solr folder, everything is fine. The
> exception is raised because the temp folder (actually the temp/hadoop-root
> folder) has grown to 14 GB.
>
> The solutions I can think of are:
>
> 1) Delete some of the temp files. But which ones, and when?
> 2) Make Nutch write its temp files to another directory (maybe
>    <nutch_folder>/tmp).
>
> How can I do that? Is there a third, better solution?
>
> Here is a copy of my script:
>
> #!/bin/bash
>
> # inject the initial seed into the crawldb
> bin/nutch inject test/crawldb urls
>
> # initialization of the variables
> counter=1
> error=0
>
> # while there is no error
> while [ $error -ne 1 ]
> do
>     # generation phase: select up to 10000 URLs for a new segment
>     echo "[ Script ] Starting generating phase"
>     bin/nutch generate test/crawldb test/segments -topN 10000
>     if [ $? -ne 0 ]
>     then
>         echo "[ Script ] Stopping: no more URLs to fetch."
>         error=1
>         break
>     fi
>     segment=`ls -d test/segments/2* | tail -1`
>
>     # fetching phase
>     echo "[ Script ] Starting fetching phase"
>     bin/nutch fetch $segment -threads 20
>     if [ $? -ne 0 ]
>     then
>         echo "[ Script ] Fetch $segment failed. Deleting it."
>         rm -rf $segment
>         continue
>     fi
>
>     # parsing phase
>     echo "[ Script ] Starting parsing phase"
>     bin/nutch parse $segment
>
>     # updatedb phase
>     echo "[ Script ] Starting updateDB phase"
>     bin/nutch updatedb test/crawldb $segment
>
>     # indexing with Solr
>     bin/nutch invertlinks test/linkdb -dir test/segments
>     bin/nutch solrindex http://127.0.0.1:8983/solr/ test/crawldb -linkdb test/linkdb test/segments/*
> done
>
> Thanks for your help.
>
> Matteo
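For solution 1, the only thing I can think of is clearing the Hadoop temp data at the end of each loop iteration, when no Nutch job is running. Something like the line below, added just before the final "done" (the path is only my guess at where the 14 GB is piling up, and I don't know whether Hadoop still needs any of those files between jobs):

    # sketch only: wipe local-mode Hadoop temp data between iterations
    rm -rf /tmp/hadoop-root/*

For solution 2, what I had in mind is overriding hadoop.tmp.dir, since as far as I understand that is the base directory the local-mode Hadoop inside Nutch uses for its temporary data (by default /tmp/hadoop-<username>). I have not tested this, so I may have the wrong property; the idea would be to add, inside the <configuration> element of conf/nutch-site.xml, something like the following, where the value is just a placeholder for a directory with enough free space:

    <property>
      <name>hadoop.tmp.dir</name>
      <value>/path/with/enough/space/nutch-tmp</value>
    </property>

If I understand the defaults correctly, mapred.local.dir and the other temp locations are derived from hadoop.tmp.dir, so the per-job data should then end up under that path instead of the system temp folder.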

