So how is anyone able to crawl huge websites with the crawl script without using
fetcher.parse=false? You would need an enormous amount of disk space to store the raw
content and run the parse later.
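
Just to make sure I understand what fetcher.parse=false implies, this is roughly the
cycle I have in mind for one round in local mode (the directory names and -topN value
are just placeholders):

  # fetcher only downloads and stores raw content; parsing is a separate step
  bin/nutch generate crawl/crawldb crawl/segments -topN 50000
  s=`ls -d crawl/segments/2* | tail -1`
  bin/nutch fetch $s      # raw content is written into the segment (this is where the disk goes)
  bin/nutch parse $s      # ParseSegment reads the stored content back and parses it
  bin/nutch updatedb crawl/crawldb $s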

I am not even able to run with fetcher.parse=false and fetcher.store.content=true; I get
an out-of-memory error.
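
In case it matters, this is how I have been trying to give the local JVM more memory
before kicking things off (assuming bin/nutch still honours the NUTCH_HEAPSIZE setting,
which I believe is in MB; 4096 is just the number I picked):

  # bump the heap for the local Nutch JVM, then run the all-in-one crawl command
  export NUTCH_HEAPSIZE=4096
  bin/nutch crawl urls -dir crawl -depth 10 -topN 100000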

I can get about 14,500 URLs with fetcher.parse=true and fetcher.store.content=false, but
that is not even close to the number of documents the crawl should be getting; it should
be fetching around 350,000. I increased the value of db.max.outlinks.per.page, and now I
am getting more URLs, but the crawl dies after around 12,500 or so with hung thread
errors. It also looks like PDFBox is writing a lot of files to the /tmp directory, one
of which was almost 2 GB!
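
For completeness, this is roughly what I have in conf/nutch-site.xml at the moment (the
values are just what I am currently experimenting with):

  <property>
    <name>fetcher.parse</name>
    <value>true</value>
  </property>
  <property>
    <name>fetcher.store.content</name>
    <value>false</value>
  </property>
  <property>
    <name>db.max.outlinks.per.page</name>
    <!-- a negative value is supposed to mean no limit -->
    <value>-1</value>
  </property>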

Am I kidding myself that running the Nutch crawl script is the way to build an index of
about 350,000 documents, all of which are file: URLs, all within a local configuration
and without HDFS? Maybe I am crazy? :-) (that's entirely possible!)

I don't want to invest the time in setting up a Hadoop cluster: 1) I don't have the
hardware resources at the moment, and 2) this is a prototype, but if I can't prove that
a crawl will finish (even if it takes a week), the prototype will not be developed into
anything more useful.
