Ok then,

How about your generate command:

2) GENERATE:
/opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/
/opt/semantico/slot/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays 26

Your <segments_dir> seems to point to /opt/semantico/slot/etc/etc/etc,
while everything else used within the crawl cycle points to an
entirely different <segments_dir> path, namely
/opt/nutch_1_4/data/crawl/segments/segment_date

Was this intentional?
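
If not, it might be worth defining the segments path once and reusing
it in every step, e.g. (a rough bash sketch; NUTCH_HOME, CRAWLDB and
SEGMENTS are just names I've made up):

# define the paths once so every step agrees (made-up variable names)
NUTCH_HOME=/opt/nutch_1_4
CRAWLDB=$NUTCH_HOME/data/crawl/crawldb
SEGMENTS=$NUTCH_HOME/data/crawl/segments

$NUTCH_HOME/bin/nutch generate $CRAWLDB $SEGMENTS -topN 10000 -adddays 26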

On Fri, Jan 6, 2012 at 4:08 PM, Dean Pullen <[email protected]> wrote:
> Lewis,
>
> Changing the merge to * returns a similar response:
>
> LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input Pattern
> file:/opt/nutch_1_4/data/crawl/segments/*/parse_data matches 0 files
>
> And yes, your assumption was correct - it's a different segment directory
> each loop.
>
> Many thanks,
>
> Dean.
>
> On 06/01/2012 15:43, Lewis John Mcgibbney wrote:
>>
>> Hi Dean,
>>
>> Without discussing any of your configuration properties can you please try
>>
>> 6) MERGE SEGMENTS:
>> /opt/nutch_1_4/bin/nutch mergesegs /opt/nutch_1_4/data/crawl/MERGEDsegments/
>> -dir /opt/nutch_1_4/data/crawl/segments/* -filter -normalize
>>
>> paying particular attention to the trailing /* wildcard in -dir
>> /opt/nutch_1_4/data/crawl/segments/*
>>
>> Also, presumably when you say you repeat steps 2-5 another 4
>> times, you are not repeatedly generating, fetching, parsing and
>> updating the WebDB against the same segment,
>> /opt/nutch_1_4/data/crawl/segments/20120106152527? The segment
>> directory should change with every iteration of the
>> generate/fetch/parse/updatedb cycle.
>>
>> Thanks
>>
>> On Fri, Jan 6, 2012 at 3:30 PM, Dean Pullen <[email protected]> wrote:
>>>
>>> No problem Lewis, I appreciate you looking into it.
>>>
>>>
>>> Firstly I have a seed URL XML document here:
>>> http://www.ukcigarforums.com/injectlist.xml
>>> This basically has 'http://www.ukcigarforums.com/content.php' as a URL
>>> within it.
>>>
>>> Nutch's regex-urlfilter.txt contains this:
>>>
>>> # allow urls in ukcigarforums.com domain
>>> +http://([a-z0-9\-A-Z]*\.)*ukcigarforums.com/
>>> # deny anything else
>>> -.
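>>>
>>> (As an aside, the dot in ukcigarforums.com is unescaped there, so it
>>> matches any character; a slightly tighter version would be
>>> +http://([a-z0-9\-A-Z]*\.)*ukcigarforums\.com/ - but I don't think
>>> that's the problem here.)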
>>>
>>>
>>> Here's the procedure:
>>>
>>>
>>> 1) INJECT:
>>> /opt/nutch_1_4/bin/nutch inject /opt/nutch_1_4/data/crawl/crawldb/
>>> /opt/nutch_1_4/data/seed/
>>>
>>> 2) GENERATE:
>>> /opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/
>>> /opt/semantico/slot/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays
>>> 26
>>>
>>> 3) FETCH:
>>> /opt/nutch_1_4/bin/nutch fetch
>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15
>>>
>>> 4) PARSE:
>>> /opt/nutch_1_4/bin/nutch parse
>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15
>>>
>>> 5) UPDATE DB:
>>> /opt/nutch_1_4/bin/nutch updatedb /opt/nutch_1_4/data/crawl/crawldb/
>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -normalize -filter
>>>
>>>
>>> Repeat steps 2 to 5 another 4 times, then:
>>>
>>> 6) MERGE SEGMENTS:
>>> /opt/nutch_1_4/bin/nutch mergesegs
>>> /opt/nutch_1_4/data/crawl/MERGEDsegments/
>>> -dir /opt/nutch_1_4/data/crawl/segments/ -filter -normalize
>>>
>>>
>>> Interestingly, this prints out:
>>> "SegmentMerger: using segment data from: crawl_generate crawl_fetch
>>> crawl_parse parse_data parse_text"
>>>
>>> The MERGEDsegments directory then contains just two of those
>>> subdirectories, crawl_generate and crawl_fetch, instead of all of
>>> the ones listed in that output.
>>>
>>> (We then delete the contents of the segments directory and copy the
>>> MERGEDsegments results into it.)
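>>>
>>> Concretely, that step is roughly (assuming bash):
>>>
>>> rm -rf /opt/nutch_1_4/data/crawl/segments/*
>>> cp -r /opt/nutch_1_4/data/crawl/MERGEDsegments/* /opt/nutch_1_4/data/crawl/segments/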
>>>
>>>
>>> Lastly we run invert links after merge segments:
>>>
>>> 7) INVERT LINKS:
>>> /opt/nutch_1_4/bin/nutch invertlinks /opt/nutch_1_4/data/crawl/linkdb/
>>> -dir /opt/nutch_1_4/data/crawl/segments/
>>>
>>> Which produces:
>>>
>>> "LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does
>>> not
>>> exist: file:/opt/nutch_1_4/data/crawl/segments/20120106152527/parse_data"
>>>
>>>
>>
>>
>



-- 
Lewis
