Re: fetcher.store.content and fetcher.parse

Markus Jelsma Thu, 07 Oct 2010 05:48:11 -0700

Storing content will take up about as much disk space as the content you are fetching. If you don't store, there is nothing to parse.
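
A minimal sketch of the combination that avoids the error quoted below: leave parsing out of the fetch step, but keep the fetched content so a later parse step has something to read. The file location (conf/nutch-site.xml) is an assumption and is not stated in this thread; only the two property names come from the question.

```xml
<!-- conf/nutch-site.xml (assumed location for local overrides) -->
<configuration>
  <!-- Do not parse during the fetch; parsing runs as a separate step. -->
  <property>
    <name>fetcher.parse</name>
    <value>false</value>
  </property>
  <!-- Keep the raw content in the segment so the separate parse step can find it. -->
  <property>
    <name>fetcher.store.content</name>
    <value>true</value>
  </property>
</configuration>
```

With both values set to false, the segment ends up without a content directory, which matches the "Input path does not exist" message in the quoted trace.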

On Thu, 7 Oct 2010 05:42:00 -0700 (PDT), webdev1977 <webdev1...@gmail.com> wrote:

Could someone please clarify the relationship between these two properties?
I have been reading that it is not wise to set fetcher.parse to true, but if you set it to false and then set fetcher.store.content to false you get an
error during the crawl:
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/data/nutch/crawl4/segments/20101007082548/content
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
        at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
        at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
Makes sense I guess, I told it not to parse content, but when it does need
to parse the content it can't find it?
Will setting fetcher.store.content to true take up loads of disk space?
Thanks!!
