Can you please post your script, or the commands (and parameters) you
are passing? I suspect there may be something lurking which we could
fix now, e.g. differences between the 1.0/1.3 commands and the current
1.4 ones.
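To illustrate, a per-cycle sequence with fetching and parsing separated
might look roughly like this in 1.4 (crawl/crawldb, crawl/segments and
the -topN value are placeholders for your own layout):

```shell
# One crawl cycle with fetching and parsing as separate steps (Nutch 1.4).
# Directory names and -topN are illustrative, not your actual setup.
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
SEGMENT=crawl/segments/$(ls -t crawl/segments | head -1)
bin/nutch fetch $SEGMENT -noParsing
bin/nutch parse $SEGMENT
bin/nutch updatedb crawl/crawldb $SEGMENT
```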

If not, then you may have flagged up something which requires some TLC.
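After parsing, the merge and link inversion would then look something
like the following (again, directory names are illustrative):

```shell
# Merge all per-cycle segments into one, then build the linkdb.
# invertlinks reads each segment's parse_data, which is why a merge
# that drops parse_data makes this step fail.
bin/nutch mergesegs crawl/MERGEDsegments -dir crawl/segments
bin/nutch invertlinks crawl/linkdb -dir crawl/MERGEDsegments
```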

Thanks

On Fri, Jan 6, 2012 at 12:14 PM, Dean Pullen <[email protected]> wrote:
> I've also tried nutch v1.3 with the same outcome (i.e. parse_data directory
> is not found).
>
>
>
> On 06/01/2012 10:42, Dean Pullen wrote:
>>
>> I'd like to reiterate that this all works in v1...
>>
>> Dean
>>
>> On 06/01/2012 10:04, Dean Pullen wrote:
>>>
>>> Lewis,
>>>
>>> Many thanks for your reply.
>>>
>>> I've separated the parsing from the fetching, and although each segment -
>>> we run the crawl 5 times - has the parse_data directory after parsing
>>> (observed by pausing the process), the mergesegs command does not
>>> reproduce the parse_data directory, meaning invertlinks fails with the
>>> same "parse_data not found" error.
>>>
>>> The merged segments directory contains only the crawl_generate and
>>> crawl_fetch directories, none of the others present in the individual
>>> segment directories.
>>>
>>> Regards,
>>>
>>> Dean.
>>>
>>>
>>> On 5 Jan 2012, at 17:39, Lewis John Mcgibbney wrote:
>>>
>>>> Hi Dean,
>>>>
>>>> Depending on the size of the segments you're fetching, in most cases
>>>> I would advise you to separate fetching and parsing into individual
>>>> steps. This becomes increasingly important as your segments grow in
>>>> size, since more can go wrong when fetching and parsing are done
>>>> together. This looks to be a segment which experienced problems
>>>> during parsing while being fetched, therefore no parse_data was
>>>> produced.
>>>>
>>>> Can you please try a test fetch (with the parsing boolean set to
>>>> false) on a sample segment, then an individual parse, and report
>>>> back to us on this one?
>>>>
>>>> Thanks
>>>>
>>>> On Thu, Jan 5, 2012 at 5:28 PM, Dean Pullen <[email protected]>
>>>> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> I'm upgrading from nutch 1 to 1.4 and am having problems running
>>>>> invertlinks.
>>>>>
>>>>> Error:
>>>>>
LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does
>>>>> not exist: file:/opt/nutch/data/crawl/segments/20120105172548/parse_data
>>>>>    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>>>>>    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>>>>>    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>>>>>    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>>>>>    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>>>>>    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>>>>>    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>>>>>    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
>>>>>    at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
>>>>>    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>>    at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
>>>>>
>>>>> I notice that the parse_data directories are produced after a fetch
>>>>> (with
>>>>> fetcher.parse set to true), but after the merge the parse_data
>>>>> directory
>>>>> doesn't exist.
>>>>>
>>>>> What behaviour has changed since 1.0 and does anyone have a solution
>>>>> for the
>>>>> above?
>>>>>
>>>>> Thanks in advance,
>>>>>
>>>>> Dean.
>>>>
>>>>
>>>>
>>>> --
>>>> Lewis
>>
>>
>



-- 
Lewis
