fetcher.store.content and fetcher.parse

webdev1977 Thu, 07 Oct 2010 05:42:35 -0700

Could someone please clarify the relationship between these two properties?


I have read that it is not wise to set fetcher.parse to true, but if you set
it to false and also set fetcher.store.content to false, you get an error
during the crawl:

Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/data/nutch/crawl4/segments/20101007082548/content
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
        at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
        at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)

That makes sense, I guess: I told it not to parse while fetching and not to
store the content, so when it later does need to parse the content, it can't
find it?
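
For reference, the combination described above would look roughly like this
as overrides in nutch-site.xml. This is only a sketch: the two property names
are the ones discussed in this message, while the file location
(conf/nutch-site.xml) and the surrounding <configuration> wrapper are assumed
from a standard Nutch setup.

```xml
<?xml version="1.0"?>
<!-- Assumed location: conf/nutch-site.xml (overrides for nutch-default.xml) -->
<configuration>

  <!-- Do not parse pages inside the fetch step; parsing runs separately. -->
  <property>
    <name>fetcher.parse</name>
    <value>false</value>
  </property>

  <!-- Do not keep the raw fetched content in the segment.
       Combined with fetcher.parse=false, this is the setting pair that
       produced the "Input path does not exist: .../content" error above. -->
  <property>
    <name>fetcher.store.content</name>
    <value>false</value>
  </property>

</configuration>
```

Setting fetcher.store.content to true while leaving fetcher.parse at false
would presumably let the later parse step find the segment's content
directory, at the cost of keeping the raw fetched content on disk.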

Will setting fetcher.store.content to true take up loads of disk space?  

Thanks!!
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/fetcher-store-content-and-fetcher-parse-tp1648127p1648127.html
Sent from the Nutch - User mailing list archive at Nabble.com.
