Hi Dean,

Depending on the size of the segments you're fetching, in most cases I would advise you to separate fetching and parsing into individual steps. The reason becomes self-evident as your segments increase in size, and with them the possibility of something going wrong when fetching and parsing are done together. This looks to be a segment which, while being fetched, experienced problems during parsing, therefore no parse_data was produced.
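Something along these lines should do it. This is only a sketch: it assumes a local (non-distributed) runtime, that your crawldb and linkdb sit alongside the segments directory from your log, and that fetcher.parse is set to false in conf/nutch-site.xml:

  # generate a small sample segment (the -topN value is arbitrary)
  bin/nutch generate /opt/nutch/data/crawl/crawldb /opt/nutch/data/crawl/segments -topN 100
  s=`ls -d /opt/nutch/data/crawl/segments/* | tail -1`

  # fetch without parsing, then parse as a separate step;
  # the parse step is what writes the parse_data directory
  bin/nutch fetch $s
  bin/nutch parse $s

  # invertlinks should then find parse_data in each segment it reads
  bin/nutch invertlinks /opt/nutch/data/crawl/linkdb -dir /opt/nutch/data/crawl/segments

Note that invertlinks with -dir reads every segment under that directory, so if an older segment (e.g. the one in your log) still has no parse_data, either parse it or move it aside first.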
Could you please try a test fetch (with the parsing boolean set to false) on a sample segment, then an individual parse as sketched above, and report back to us on this one, please.

Thanks

On Thu, Jan 5, 2012 at 5:28 PM, Dean Pullen <[email protected]> wrote:
> Hi all,
>
> I'm upgrading from nutch 1 to 1.4 and am having problems running
> invertlinks.
>
> Error:
>
> LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not
> exist: file:/opt/nutch/data/crawl/segments/20120105172548/parse_data
>     at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>     at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>     at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>     at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>     at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
>     at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>     at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
>
> I notice that the parse_data directories are produced after a fetch (with
> fetcher.parse set to true), but after the merge the parse_data directory
> doesn't exist.
>
> What behaviour has changed since 1.0 and does anyone have a solution for the
> above?
>
> Thanks in advance,
>
> Dean.

--
Lewis

