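Storing content will take up about as much disk space as the content
you are fetching. If you don't store it, there is nothing to parse:
with fetcher.parse set to false, parsing runs as a separate step that
reads the segment's content directory, which is the missing path in
your stack trace.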
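
For illustration, here is a minimal sketch of the combination that avoids
the error, assuming the two properties are overridden in conf/nutch-site.xml
using the usual Hadoop-style property layout. The values are one workable
pairing, not a tuned recommendation.

```xml
<?xml version="1.0"?>
<!-- Sketch of conf/nutch-site.xml overrides; file location is an assumption. -->
<configuration>
  <!-- Do not parse inside the fetcher; parsing reads the stored content later. -->
  <property>
    <name>fetcher.parse</name>
    <value>false</value>
  </property>
  <!-- Keep the raw fetched content in the segment so there is something to parse. -->
  <property>
    <name>fetcher.store.content</name>
    <value>true</value>
  </property>
</configuration>
```

If disk space is the main concern, the opposite pairing (fetcher.parse true,
fetcher.store.content false) should avoid keeping the raw content, at the
cost of parsing inside the fetcher.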
On Thu, 7 Oct 2010 05:42:00 -0700 (PDT), webdev1977
<webdev1...@gmail.com> wrote:
Could someone please clarify the relationship between these two
properties?
I have been reading that it is not wise to set fetcher.parse to true,
but if you set it to false and then set fetcher.store.content to false,
you get an error during the crawl:
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/data/nutch/crawl4/segments/20101007082548/content
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
Makes sense I guess, I told it not to parse content, but when it does
need to parse the content it can't find it?
Will setting fetcher.store.content to true take up loads of disk space?
Thanks!!