On Thu, 7 Oct 2010 09:48:57 -0700 (PDT), webdev1977
<webdev1...@gmail.com> wrote:
So how is it that one is able to crawl huge websites with the crawl script
and not use fetcher.parse=false? You would have to have an enormous amount
of disk space to run the parse later.
You can run smaller batches by limiting the number of URLs per host or IP
when generating your fetch list.
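For example, something like this in nutch-site.xml caps the generated fetch
list per host (a sketch for Nutch 1.x; the values are illustrative, and older
releases used generate.max.per.host instead):

  <property>
    <name>generate.max.count</name>
    <value>100</value> <!-- at most 100 URLs per host in each fetch list -->
  </property>
  <property>
    <name>generate.count.mode</name>
    <value>byHost</value> <!-- or byDomain / byIp -->
  </property>

Combined with -topN on the generate command, each segment stays small and
bounded:

  bin/nutch generate crawl/crawldb crawl/segments -topN 10000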
I am not even able to run with fetcher.parse=false and
fetcher.store.content=true. I get an out-of-memory error.
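For reference, the relevant entries in my nutch-site.xml look roughly like
this (a sketch):

  <property>
    <name>fetcher.parse</name>
    <value>false</value> <!-- don't parse while fetching -->
  </property>
  <property>
    <name>fetcher.store.content</name>
    <value>true</value> <!-- keep raw content so the parse can run later -->
  </property>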
I can get about 14,500 URLs with fetcher.parse=true and
fetcher.store.content=false, but that is not even close to the correct
number of documents the crawl should be getting: it should be fetching
around 350,000 documents. I increased the value of db.max.outlinks.per.page.
Now I am getting more URLs, but the crawl is dying after around 12,500 or so
with hung-thread errors. It seems like there are a lot of files getting
written by PDFBox to the /tmp directory, one of which was almost 2 GB!!
Not sure if it helps, but try using a tmp dir on a larger disk; use the
hadoop.tmp.dir property to do so. And again, I think smaller batches will be
more helpful, also for debugging.
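Something like this in nutch-site.xml should do it (the path is just an
example; point it at a directory with plenty of free space):

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/tmp</value> <!-- temp dir on a bigger disk -->
  </property>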
Am I kidding myself that running the Nutch crawl script is the way to build
an index of about 350,000 documents, all of which are file: URLs, all within
a local config, not using HDFS? Maybe I am crazy? :-) (that's entirely
possible!)
I have a test setup that fetched, parsed, updated and indexed about three
million URLs (mostly ordinary web pages in my case). The generated segments
take about 120 GB of disk space, and the Solr index is about 30 GB. 350,000
documents isn't that much, but if the files are large then you are on the
right track (I think) in wanting to parse but not store. Using smaller
batches will help again.
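To make the batches concrete: the steps the crawl script runs can be driven
by hand, one small segment at a time. A rough sketch for a local Nutch 1.x
setup (the paths, the Solr URL and the -topN value are illustrative):

  # generate a small fetch list and pick up the new segment
  bin/nutch generate crawl/crawldb crawl/segments -topN 10000
  segment=$(ls -d crawl/segments/* | tail -1)

  # fetch it (parses inline when fetcher.parse=true)
  bin/nutch fetch "$segment"
  # with fetcher.parse=false, parse as a separate step instead:
  # bin/nutch parse "$segment"

  # fold the results back into the crawldb, then index the segment
  bin/nutch updatedb crawl/crawldb "$segment"
  bin/nutch invertlinks crawl/linkdb "$segment"
  bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb "$segment"

Repeat the loop until the generator produces no new URLs; if one batch dies
you only lose that segment, not the whole crawl.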
I don't want to invest the time in setting up a Hadoop cluster: 1. I don't
have the hardware resources at the moment. 2. This is a prototype, and if I
can't prove that a crawl will finish (even if it takes a week), then the
prototype will not be developed into anything more useful.
That'd be a pity. Please try the smaller batches and report back; I hope it
helps.