On Thu, 7 Oct 2010 09:48:57 -0700 (PDT), webdev1977 <webdev1...@gmail.com> wrote:
So how is it that one is able to crawl huge websites with the crawl script and not use fetcher.parse=false? You would need enormous amounts of disk space to run the parse later.

You can run smaller batches by limiting the number of URLs per host or IP when generating your fetch list.
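For example, something like this in nutch-site.xml caps how many URLs per host go into each generated fetch list (a sketch only: the value 1000 is just an example, and in older 1.x releases the equivalent property is generate.max.per.host, so check your conf/nutch-default.xml). You can combine it with a lower -topN on the generate/crawl command:

    <!-- nutch-site.xml: limit URLs per host in each generated fetch list -->
    <property>
      <name>generate.max.count</name>
      <value>1000</value>
    </property>
    <!-- count per host (can also be set to domain or ip) -->
    <property>
      <name>generate.count.mode</name>
      <value>host</value>
    </property>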


I am not even able to run with fetcher.parse=false and fetcher.store.content=true. I get an out-of-memory error.

I can get about 14,500 URLs with fetcher.parse=true and fetcher.store.content=false, but that is not even close to the correct number of documents the crawl should be getting. It should be fetching around 350,000 documents. I increased the value of db.max.outlinks.per.page, and now I am getting more URLs, but the fetch is dying after around 12,500 or so with hung-thread errors. It seems like there are a lot of files getting written by PDFBox to the /tmp directory, one of which was almost 2 GB!

Not sure if it helps, but try using a tmp dir on a larger disk; use the hadoop.tmp.dir property to do so. And again, I think smaller batches will be more helpful, also for debugging.
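Something like this in nutch-site.xml would do it (the path is only an example; point it at whatever disk has the space):

    <!-- nutch-site.xml: move Hadoop's local working/temp files off the small /tmp partition -->
    <property>
      <name>hadoop.tmp.dir</name>
      <value>/data/nutch/tmp</value>
    </property>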


Am I kidding myself that running the Nutch crawl script is the way to build an index of about 350,000 documents? All of which are file URLs? All within a local config, not using HDFS? Maybe I am crazy? :-) (that's entirely possible!).

I have a test setup that fetched, parsed, updated, and indexed about three million URLs (mostly ordinary web pages in my case). The generated segments take about 120 GB of disk space, and the Solr index is about 30 GB. 350,000 documents isn't that much, but if the files are large then you are on the right track (I think) in wanting to parse but not store the content. Using a smaller batch will help here again.
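For reference, the parse-but-don't-store combination I mean is roughly this in nutch-site.xml (just a sketch of the two properties you already mentioned):

    <!-- nutch-site.xml: parse while fetching, but don't keep the raw content in the segment -->
    <property>
      <name>fetcher.parse</name>
      <value>true</value>
    </property>
    <property>
      <name>fetcher.store.content</name>
      <value>false</value>
    </property>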


I don't want to invest the time in setting up a Hadoop cluster: 1. I don't have the hardware resources at the moment, and 2. this is a prototype, and if I can't prove that a crawl will finish (even if it takes a week), then the prototype will not be developed into anything more useful.

That'd be a pity. Please try the smaller batches and report back; I hope it helps.
