Hi Dean,

I'll have a look into this later today if I get a chance. Is anyone else
experiencing problems with the mergesegs command or code?

Thanks for persisting with this, Dean; hopefully we will get to the
bottom of it soon.
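
In the meantime, one quick sanity check (the merged-output path below is
just a placeholder, use whatever you passed to mergesegs): list the
sub-directories of a segment before and after merging to see which parts
actually got written.

ls /opt/nutch_1_4/data/crawl/segments/20120106171547
# a parsed segment should contain crawl_generate, crawl_fetch, crawl_parse,
# parse_data and parse_text
ls /opt/nutch_1_4/data/crawl/merged_segments/*
# if parse_data is missing from the merged output, invertlinks will fail
# with the "Input path does not exist: .../parse_data" error from your log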

On Mon, Jan 9, 2012 at 1:31 PM, Dean Pullen <[email protected]> wrote:
> Looking through the code, I'm seeing
> org.apache.nutch.segment.SegmentMerger.reduce(..) only being called for
> crawl_fetch and crawl_generate.
>
> Prior to this, org.apache.nutch.segment.SegmentMerger.getRecordWriter(...)
> gets called for all components, i.e. crawl_generate, crawl_fetch,
> crawl_parse, parse_data and parse_text.
>
> I'm not quite sure what's going on in between these two calls...
>
> Dean.
>
>
>
> On 08/01/2012 22:51, Dean Pullen wrote:
>>
>> Where do we go from here? I can start looking/stepping through the
>> mergesegs code, but I'm reluctant due to its probable complexity.
>>
>> Dean.
>>
>>
>> On 08/01/2012 14:26, Dean Pullen wrote:
>>>
>>> No Lewis, -linkdb was already being used for the solrindex command, so we
>>> still have the same problem.
>>>
>>> Many thanks,
>>>
>>> Dean
>>>
>>> On 08/01/2012 14:08, Lewis John Mcgibbney wrote:
>>>>
>>>> Hi Dean, is this sorted?
>>>>
>>>> On Saturday, January 7, 2012, Dean Pullen <[email protected]> wrote:
>>>>>
>>>>> Sorry, you did mean on solrindex - which I already do...
>>>>>
>>>>> On 07/01/2012 13:15, Dean Pullen wrote:
>>>>>
>>>>> The -linkdb param isn't in the invertlinks docs:
>>>>> http://wiki.apache.org/nutch/bin/nutch_invertlinks
>>>>>
>>>>> (However it is in the solrindex docs)
>>>>>
>>>>> Adding it makes no difference to invertlinks.
>>>>>
>>>>> I think the problem is definitely with mergesegs, as opposed to
>>>>> invertlinks etc.
>>>>>
>>>>> Thanks again,
>>>>>
>>>>> Dean.
>>>>>
>>>>> On 06/01/2012 17:53, Lewis John Mcgibbney wrote:
>>>>>
>>>>> OK, so now I think we're at the bottom of it. If you wish to create a
>>>>> linkdb in >= Nutch 1.4, you need to pass the linkdb parameter
>>>>> explicitly. This was implemented because not everyone wishes to
>>>>> create a linkdb.
>>>>>
>>>>> Your invertlinks command should be passed as follows
>>>>>
>>>>> bin/nutch invertlinks path/you/wish/to/have/the/linkdb -dir
>>>>> /path/to/segment/dirs
>>>>> then
>>>>> bin/nutch solrindex http://solrUrl path/to/crawldb -linkdb
>>>>> path/to/linkdb -dir path/to/segment/dirs
>>>>>
>>>>> If you do not pass -linkdb path/to/linkdb explicitly, an exception
>>>>> will be thrown, as the linkdb path is otherwise treated as a segment
>>>>> directory now.
>>>>>
>>>>> On Fri, Jan 6, 2012 at 5:17 PM, Dean Pullen <[email protected]> wrote:
>>>>>
>>>>> Only this:
>>>>>
>>>>> 2012-01-06 17:15:47,972 WARN  mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
>>>>> 2012-01-06 17:15:48,692 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>>>> 2012-01-06 17:15:51,566 INFO  crawl.LinkDb - LinkDb: starting at 2012-01-06 17:15:51
>>>>> 2012-01-06 17:15:51,567 INFO  crawl.LinkDb - LinkDb: linkdb: /opt/nutch_1_4/data/crawl/linkdb
>>>>> 2012-01-06 17:15:51,567 INFO  crawl.LinkDb - LinkDb: URL normalize: true
>>>>> 2012-01-06 17:15:51,567 INFO  crawl.LinkDb - LinkDb: URL filter: true
>>>>> 2012-01-06 17:15:51,576 INFO  crawl.LinkDb - LinkDb: adding segment: file:/opt/nutch_1_4/data/crawl/segments/20120106171547
>>>>> 2012-01-06 17:15:51,721 ERROR crawl.LinkDb - LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_data
>>>>>    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>>>>>    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>>>>>    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>>>>>    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>>>>>    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>>>>>    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>>>>>    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>>>>>    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
>>>>>    at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
>>>>>    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>>    at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
>>>>>
>>>>> 2012-01-06 17:15:52,714 INFO  solr.SolrIndexer - SolrIndexer: starting at 2012-01-06 17:15:52
>>>>> 2012-01-06 17:15:52,782 INFO  indexer.IndexerMapReduce - IndexerMapReduce: crawldb: /opt/nutch_1_4/data/crawl/crawldb
>>>>> 2012-01-06 17:15:52,782 INFO  indexer.IndexerMapReduce - IndexerMapReduce: linkdb: /opt/nutch_1_4/data/crawl/linkdb
>>>>>
>>>
>>
>



-- 
Lewis
