I've also tried nutch v1.3 with the same outcome (i.e. the parse_data directory is not found).
On 06/01/2012 10:42, Dean Pullen wrote:
I'd like to reiterate that this all works in v1...
Dean
On 06/01/2012 10:04, Dean Pullen wrote:
Lewis,
Many thanks for your reply.
I've separated the parsing from the fetching, and although each segment - we run the crawl 5 times - has the parse_data directory after parsing (observed by pausing the process), the mergesegs command does not reproduce the parse_data directory, meaning invertlinks fails with the same "parse_data not found" error.
The merged segment directory contains only the crawl_generate and crawl_fetch directories, none of the others you can see in the unmerged segment directories.
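For reference, the merge and invert steps I'm running look roughly like this (paths are illustrative, not our exact ones):

```shell
# Merge all per-run segments into one output segment;
# each input segment already contains parse_data at this point
bin/nutch mergesegs crawl/MERGEDsegments -dir crawl/segments

# Build the linkdb from the merged segment - this is the step that fails,
# because <merged segment>/parse_data is missing after the merge
bin/nutch invertlinks crawl/linkdb -dir crawl/MERGEDsegments
```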
Regards,
Dean.
On 5 Jan 2012, at 17:39, Lewis John Mcgibbney wrote:
Hi Dean,
Depending on the size of the segments you're fetching, in most cases I
would advise you to separate fetching and parsing into individual
steps. The benefit becomes self-explanatory as your segments grow in
size and the chance of something going wrong increases when fetching
and parsing are done together. This looks like a segment that
experienced problems during parsing while being fetched, so no
parse_data was produced.
Can you please try a test fetch (with the parsing boolean set to false)
on a sample segment, then run an individual parse, and report back to
us with the result?
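Something along these lines, as a sketch (the segment name is taken from your log; fetcher.parse can be overridden to false in conf/nutch-site.xml):

```shell
# Fetch with parsing disabled (fetcher.parse set to false in conf/nutch-site.xml)
bin/nutch fetch crawl/segments/20120105172548

# Parse the same segment as a separate step; this should create
# crawl/segments/20120105172548/parse_data
bin/nutch parse crawl/segments/20120105172548

# Verify that parse_data now exists alongside crawl_fetch etc.
ls crawl/segments/20120105172548
```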
Thanks
On Thu, Jan 5, 2012 at 5:28 PM, Dean Pullen <[email protected]> wrote:
Hi all,
I'm upgrading from nutch 1 to 1.4 and am having problems running
invertlinks.
Error:
LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/opt/nutch/data/crawl/segments/20120105172548/parse_data
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
I notice that the parse_data directories are produced after a fetch
(with fetcher.parse set to true), but after the merge the parse_data
directory doesn't exist.
What behaviour has changed since 1.0, and does anyone have a solution
for the above?
Thanks in advance,
Dean.
--
Lewis