I've also tried nutch v1.3 with the same outcome (i.e. the parse_data directory is not found).
On 06/01/2012 10:42, Dean Pullen wrote:
I'd like to reiterate that this all works in v1...
Dean
On 06/01/2012 10:04, Dean Pullen wrote:
Lewis,
Many thanks for your reply.
I've separated the parsing from the fetching, and although each segment - we run the crawl 5 times - has the parse_data directory after parsing (observed by pausing the process), the mergesegs command does not reproduce the parse_data directory, meaning invertlinks fails with the same "parse_data not found" error.
The merged segment directory contains only the crawl_generate and crawl_fetch directories, none of the others you can see in the unmerged segment directories.
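For reference, the merge and invert steps I'm running look roughly like this (paths are illustrative, not our exact ones):

```shell
# Merge all per-run segments into one output segment;
# each input segment already contains parse_data at this point
bin/nutch mergesegs crawl/MERGEDsegments -dir crawl/segments

# Build the linkdb from the merged segment - this is the step that fails,
# because <merged segment>/parse_data is missing after the merge
bin/nutch invertlinks crawl/linkdb -dir crawl/MERGEDsegments
```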
Regards,
Dean.
On 5 Jan 2012, at 17:39, Lewis John Mcgibbney wrote:
Hi Dean,
Depending on the size of the segments you're fetching, in most cases I
would advise you to separate fetching and parsing into individual
steps. The benefit becomes self-explanatory as your segments grow in
size and the chance of something going wrong increases when fetching
and parsing are done together. This looks like a segment that
experienced problems during parsing while being fetched, so no
parse_data was produced.
Can you please try a test fetch (with the parsing boolean set to false)
on a sample segment, then run an individual parse, and report back to
us with the result?
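Something along these lines, as a sketch (the segment name is taken from your log; fetcher.parse can be overridden to false in conf/nutch-site.xml):

```shell
# Fetch with parsing disabled (fetcher.parse set to false in conf/nutch-site.xml)
bin/nutch fetch crawl/segments/20120105172548

# Parse the same segment as a separate step; this should create
# crawl/segments/20120105172548/parse_data
bin/nutch parse crawl/segments/20120105172548

# Verify that parse_data now exists alongside crawl_fetch etc.
ls crawl/segments/20120105172548
```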
Thanks
On Thu, Jan 5, 2012 at 5:28 PM, Dean Pullen <[email protected]> wrote:
Hi all,
I'm upgrading from nutch 1 to 1.4 and am having problems running
invertlinks.
Error:
LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/opt/nutch/data/crawl/segments/20120105172548/parse_data
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
I notice that the parse_data directories are produced after a fetch
(with fetcher.parse set to true), but after the merge the parse_data
directory doesn't exist.
What behaviour has changed since 1.0, and does anyone have a solution
for the above?
Thanks in advance,
Dean.
--
Lewis