I've also tried nutch v1.3 with the same outcome (i.e. parse_data directory is not found).

On 06/01/2012 10:42, Dean Pullen wrote:
I'd like to reiterate that this all works in v1...

Dean

On 06/01/2012 10:04, Dean Pullen wrote:
Lewis,

Many thanks for your reply.

I've separated the parsing from the fetching. Although each segment - we run the crawl 5 times - contains the parse_data directory after parsing (observed by pausing the process), the mergesegs command does not reproduce parse_data in the merged segment, meaning invertlinks fails with the same parse_data-not-found error.

The merged segment directory contains only the crawl_generate and crawl_fetch directories, none of the other subdirectories you can see in the individual segment directories.
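For reference, the sequence I'm running looks roughly like this (segment names and the linkdb path here are illustrative; the real crawl lives under /opt/nutch/data/crawl):

```shell
SEGDIR=crawl/segments
MERGED=crawl/MERGEDsegments

# Each input segment has parse_data after the parse step, e.g.:
ls $SEGDIR/20120105172548
# content  crawl_fetch  crawl_generate  crawl_parse  parse_data  parse_text

# Merge all segments into one:
bin/nutch mergesegs $MERGED -dir $SEGDIR

# The merged segment ends up with only crawl_generate and crawl_fetch,
# so the subsequent invertlinks step fails looking for parse_data:
bin/nutch invertlinks crawl/linkdb -dir $MERGED
```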

Regards,

Dean.


On 5 Jan 2012, at 17:39, Lewis John Mcgibbney wrote:

Hi Dean,

Depending on the size of the segments you're fetching, in most cases I
would advise separating fetching and parsing into individual steps. The
reason becomes self-evident as your segments grow in size, since the
chance of something going wrong increases when fetching and parsing are
done together. This looks like a segment that experienced problems
during parsing while being fetched, so no parse_data was produced.

Can you please try a test fetch (with the parsing boolean set to false)
on a sample segment, then an individual parse, and report back to us?
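Concretely, something along these lines should do it (segment path is just an example; depending on your setup you can either pass -noParsing to the fetcher or set fetcher.parse to false in nutch-site.xml):

```shell
# Fetch a sample segment without parsing
bin/nutch fetch crawl/segments/20120105172548 -noParsing

# Then parse the same segment as a separate step
bin/nutch parse crawl/segments/20120105172548

# parse_data should now exist alongside crawl_fetch etc.
ls crawl/segments/20120105172548
```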

Thanks

On Thu, Jan 5, 2012 at 5:28 PM, Dean Pullen <[email protected]> wrote:
Hi all,

I'm upgrading from nutch 1 to 1.4 and am having problems running
invertlinks.

Error:

LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/opt/nutch/data/crawl/segments/20120105172548/parse_data
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
    at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)

I notice that the parse_data directories are produced after a fetch (with fetcher.parse set to true), but after the merge the parse_data directory no longer exists.

What behaviour has changed since 1.0 and does anyone have a solution for the
above?

Thanks in advance,

Dean.


--
Lewis
