So how is it that people are able to crawl huge websites with the crawl script? Surely not with fetcher.parse=false: you would have to have enormous amounts of disk space to store the raw content and run the parse later.
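Just so we are talking about the same knobs, this is the "fetch now, parse later" combination as I understand it, in conf/nutch-site.xml (only a sketch, with the descriptions trimmed; the property names are the ones from nutch-default.xml):

  <configuration>
    <property>
      <name>fetcher.parse</name>
      <value>false</value>
      <description>Do not parse inside the fetcher; parsing is run as a separate step.</description>
    </property>
    <property>
      <name>fetcher.store.content</name>
      <value>true</value>
      <description>Store the raw content in the segment so it can be parsed later.
      This is where the big disk usage comes from.</description>
    </property>
  </configuration>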
I am not even able to run with fetcher.parse=false and fetcher.store.content=true: I get an out-of-memory error. With fetcher.parse=true and fetcher.store.content=false I can fetch about 14,500 URLs, but that is nowhere near the number of documents the crawl should be getting; it should be fetching around 350,000. I increased the value of db.max.outlinks.per.page, and now I am getting more URLs, but the crawl dies after around 12,500 or so with hung-threads errors. It also looks like PDFBox is writing a lot of files to the /tmp directory, one of which was almost 2 GB!

Am I kidding myself that running the Nutch crawl script is the way to build an index of about 350,000 documents, all of them file URLs, all within a local config, not using HDFS? Maybe I am crazy :-) (that's entirely possible!). I don't want to invest the time in setting up a Hadoop cluster:

1. I don't have the hardware resources at the moment.
2. This is a prototype, and if I can't prove that a crawl will finish (even if it takes a week), the prototype will not be developed into anything more useful.
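If the one-shot crawl command keeps falling over, one thing I may try is driving the cycle by hand, roughly as in the Nutch 1.x tutorial. This is only a sketch: the crawl/ and urls/ paths, the -topN value, the thread count, the heap size and the tmpdir location are examples, and I am assuming bin/nutch still honours NUTCH_HEAPSIZE and NUTCH_OPTS and that PDFBox picks up java.io.tmpdir for its scratch files.

  # one step-by-step fetch cycle on the local runner (no HDFS)
  export NUTCH_HEAPSIZE=4000                        # heap for bin/nutch, in MB, to fight the OOM
  export NUTCH_OPTS="-Djava.io.tmpdir=/data/tmp"    # move PDFBox scratch files off the small /tmp

  bin/nutch inject crawl/crawldb urls               # seed the crawldb from the urls/ dir (run once)

  # repeat this block until generate reports no more URLs to fetch
  bin/nutch generate crawl/crawldb crawl/segments -topN 10000
  seg=`ls -d crawl/segments/2* | tail -1`           # the segment just created
  bin/nutch fetch $seg -threads 10                  # with fetcher.parse=false this only fetches
  bin/nutch parse $seg                              # parsing as its own, restartable step
  bin/nutch updatedb crawl/crawldb $seg             # fold new outlinks back into the crawldb
  # (invertlinks and indexing would follow once the loop is done)

The appeal over the crawl script is that if one round dies with hung threads or an OOM, only that generate/fetch/parse/updatedb pass has to be redone, not the whole crawl.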