Lewis,

Many thanks for your reply.
I've separated the parsing from the fetching, and although each segment (we run the crawl 5 times) has the parse_data directory after parsing (observed by pausing the process), the mergesegs command does not reproduce the parse_data directory, meaning invertlinks fails with the same "parse_data not found" error. The merged segments directory contains only the crawl_generate and crawl_fetch directories, not any of the others you can see in the individual segment directories.

Regards,

Dean.

On 5 Jan 2012, at 17:39, Lewis John Mcgibbney wrote:

> Hi Dean,
>
> Depending on the size of the segments you're fetching, in most cases I
> would advise you to separate fetching and parsing into individual
> steps. The reason becomes self-explanatory as your segments increase in
> size, as does the likelihood of something going wrong when fetching and
> parsing are done together. This looks to be a segment which
> experienced problems during parsing while being fetched, so no
> parse_data was produced.
>
> Can you please try a test fetch (with the parsing boolean set to false)
> on a sample segment, then an individual parse, and report back to us on
> this one please.
>
> Thanks
>
> On Thu, Jan 5, 2012 at 5:28 PM, Dean Pullen <[email protected]> wrote:
>> Hi all,
>>
>> I'm upgrading from Nutch 1.0 to 1.4 and am having problems running
>> invertlinks.
>>
>> Error:
>>
>> LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not
>> exist: file:/opt/nutch/data/crawl/segments/20120105172548/parse_data
>>     at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>>     at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>>     at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>>     at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>>     at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
>>     at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>     at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
>>
>> I notice that the parse_data directories are produced after a fetch (with
>> fetcher.parse set to true), but after the merge the parse_data directory
>> doesn't exist.
>>
>> What behaviour has changed since 1.0 and does anyone have a solution for the
>> above?
>>
>> Thanks in advance,
>>
>> Dean.
>
>
> --
> Lewis
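[Editor's note: the separated fetch/parse workflow discussed above can be sketched roughly as below. This is an illustrative sequence for Nutch 1.x, not the exact commands Dean ran; the segment path is taken from the error message, and the linkdb/merged-output paths are assumed for the example.]

```shell
#!/bin/sh
# Sketch of fetching and parsing as separate steps (Nutch 1.x).
# Assumes fetcher.parse is set to false in conf/nutch-site.xml so the
# fetch step does not attempt to parse.

CRAWL=/opt/nutch/data/crawl
SEGMENT=$CRAWL/segments/20120105172548

# Fetch only; with parsing disabled this produces crawl_fetch/content
bin/nutch fetch "$SEGMENT"

# Parse as a separate step; this should create parse_data, parse_text
# and crawl_parse inside the segment directory
bin/nutch parse "$SEGMENT"

# Merge all segments into a single output segment
bin/nutch mergesegs "$CRAWL/MERGEDsegments" -dir "$CRAWL/segments"

# Invert links over the merged segment directory
bin/nutch invertlinks "$CRAWL/linkdb" -dir "$CRAWL/MERGEDsegments"
```

If parse_data is present in each input segment but missing from the mergesegs output, checking that the merge step reports processing the parse_data part of each segment (in hadoop.log) would help isolate whether the data is lost during the merge itself.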

