Any advice?

Matteo

2012/9/19 Matteo Simoncini <[email protected]>

> Hi,
>
> I'm running Nutch 1.5.1 on a virtual machine to crawl a large number of URLs.
> I gave enough space to the "crawl" folder, the one where the linkDB and
> crawlDB go, and to the Solr folder.
>
> It worked fine up to about 200,000 URLs, but now I get an IOException saying
> there isn't enough space.
>
> Looking at the "crawl" folder and the Solr folder, everything is fine. The
> exception was thrown because the temp folder (actually the temp/hadoop-root
> folder) has grown to 14 GB.
>
> The solutions I can think of are:
>
> 1) Delete some of the temp files. But which ones, and when? (My blunt guess
> is sketched after the script below.)
> 2) Make Nutch write its temp files to another directory (maybe
> <nutch_folder>/tmp); a rough sketch of what I mean is below.
>
> How can I do that? Is there a third, better solution?
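>
> For option 2, I guess something along these lines inside the <configuration>
> element of conf/nutch-site.xml might do it, since Nutch should hand its
> configuration over to Hadoop. I haven't verified that hadoop.tmp.dir is
> really honoured in local mode, and the path is only a placeholder:
>
>  <property>
>    <name>hadoop.tmp.dir</name>
>    <value>/path/with/enough/space/hadoop-tmp</value>
>  </property>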
>
> Here is a copy of my script.
>
> #!/bin/bash
>
> # inject the initial seed into crawlDB
> bin/nutch inject test/crawldb urls
>
> # initialization of the variables
> counter=1
> error=0
>
> # while there is no error
> while [ $error -ne 1 ]
> do
>   # generate the next batch of URLs (up to 10000)
>   echo [ Script ] Starting generating phase
>   bin/nutch generate test/crawldb test/segments -topN 10000
>   if [ $? -ne 0 ]
>   then
>     echo [ Script ] Stopping: No more URLs to fetch.
>     error=1
>     break
>   fi
>   segment=`ls -d test/segments/2* | tail -1`
>
>   # fetching phase
>   echo [ Script ] Starting fetching phase
>   bin/nutch fetch $segment -threads 20
>   if [ $? -ne 0 ]
>   then
>     echo [ Script ] Fetch $segment failed. Deleting it.
>     rm -rf $segment
>     continue
>   fi
>
>   # parsing phase
>   echo [ Script ] Starting parsing phase
>   bin/nutch parse $segment
>
>   # updateDB phase
>   echo [ Script ] Starting updateDB phase
>   bin/nutch updatedb test/crawldb $segment
>
>   # indexing with Solr
>   bin/nutch invertlinks test/linkdb -dir test/segments
>   bin/nutch solrindex http://127.0.0.1:8983/solr/ test/crawldb -linkdb test/linkdb test/segments/*
> done
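>
> For option 1, the bluntest thing I can think of is to wipe the job temp data
> at the end of every iteration, i.e. adding something like this just before
> the "done" (but I have no idea whether this is safe while Hadoop is still
> using those files, or which subfolders of temp/hadoop-root could go instead,
> which is exactly what I'm asking):
>
>   # just a guess: remove leftover Hadoop temp data between crawl rounds
>   rm -rf temp/hadoop-root/*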
>
>
> Thanks for your help.
>
> Matteo
>
>
>
