Can you please post your script, or the commands (and parameters) you are passing... I suspect there may be something lurking which we could fix now, e.g. differences between the 1.0/1.3 commands and the current 1.4 ones.
If not, then you may have flagged up something which requires some TLC.

Thanks

On Fri, Jan 6, 2012 at 12:14 PM, Dean Pullen <[email protected]> wrote:
> I've also tried nutch v1.3 with the same outcome (i.e. the parse_data
> directory is not found).
>
> On 06/01/2012 10:42, Dean Pullen wrote:
>> I'd like to reiterate that this all works in v1...
>>
>> Dean
>>
>> On 06/01/2012 10:04, Dean Pullen wrote:
>>> Lewis,
>>>
>>> Many thanks for your reply.
>>>
>>> I've separated the parsing from the fetching, and although each segment
>>> (we run the crawl 5 times) has the parse_data directory after parsing
>>> (observed by pausing the process), the mergesegs command does not
>>> recreate the parse_data directory, meaning invertlinks fails with the
>>> same "parse_data not found" error.
>>>
>>> The merged segments directory contains only the crawl_generate and
>>> crawl_fetch directories, not the others you can see in the unmerged
>>> segment directories.
>>>
>>> Regards,
>>>
>>> Dean.
>>>
>>> On 5 Jan 2012, at 17:39, Lewis John Mcgibbney wrote:
>>>> Hi Dean,
>>>>
>>>> Depending on the size of the segments you're fetching, in most cases I
>>>> would advise you to separate fetching and parsing into individual
>>>> steps. The reason becomes self-explanatory as your segments grow in
>>>> size and the chance increases of something going wrong when fetching
>>>> and parsing are done together. This looks like a segment which
>>>> experienced problems during parsing while being fetched, so no
>>>> parse_data was produced.
>>>>
>>>> Can you please try a test fetch (with the parsing boolean set to
>>>> false) on a sample segment, then an individual parse, and report back
>>>> to us on this one please.
>>>>
>>>> Thanks
>>>>
>>>> On Thu, Jan 5, 2012 at 5:28 PM, Dean Pullen <[email protected]> wrote:
>>>>> Hi all,
>>>>>
>>>>> I'm upgrading from nutch 1.0 to 1.4 and am having problems running
>>>>> invertlinks.
>>>>>
>>>>> Error:
>>>>>
>>>>> LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path
>>>>> does not exist:
>>>>> file:/opt/nutch/data/crawl/segments/20120105172548/parse_data
>>>>>   at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>>>>>   at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>>>>>   at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>>>>>   at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>>>>>   at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>>>>>   at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>>>>>   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>>>>>   at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
>>>>>   at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
>>>>>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>>   at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
>>>>>
>>>>> I notice that the parse_data directories are produced after a fetch
>>>>> (with fetcher.parse set to true), but after the merge the parse_data
>>>>> directory doesn't exist.
>>>>>
>>>>> What behaviour has changed since 1.0, and does anyone have a solution
>>>>> for the above?
>>>>>
>>>>> Thanks in advance,
>>>>>
>>>>> Dean.
>>>>
>>>> --
>>>> Lewis

--
Lewis
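For reference, here is a sketch of the separated fetch/parse workflow Lewis suggests, using the standard Nutch 1.4 command-line tools. The paths under crawl/, the -topN value, and the segment name are illustrative assumptions, not taken from Dean's actual script; NUTCH defaults to a dry-run echo so the commands are only printed.

```shell
#!/bin/sh
# Sketch of a separated fetch/parse crawl cycle for Nutch 1.4 (assumed layout).
# Set NUTCH=bin/nutch to actually execute against a real Nutch installation.
NUTCH=${NUTCH:-"echo bin/nutch"}

CRAWLDB=crawl/crawldb
SEGDIR=crawl/segments

for round in 1 2 3 4 5; do            # the thread runs five crawl rounds
  $NUTCH generate $CRAWLDB $SEGDIR -topN 1000
  # In a real run, pick the newest segment: SEGMENT=$(ls -d $SEGDIR/* | tail -1)
  SEGMENT=$SEGDIR/20120105172548
  $NUTCH fetch $SEGMENT               # fetch only, with fetcher.parse=false
  $NUTCH parse $SEGMENT               # separate parse step writes parse_data
  $NUTCH updatedb $CRAWLDB $SEGMENT
done

# Merge all segments, then invert links against the merged output
# (the step where parse_data reportedly goes missing in the thread).
$NUTCH mergesegs crawl/MERGEDsegments -dir $SEGDIR
$NUTCH invertlinks crawl/linkdb -dir crawl/MERGEDsegments
```

Checking whether parse_data survives in crawl/MERGEDsegments/*/ after the mergesegs step should show where the directory is being dropped.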

