How about merging segments after every subsequent iteration of the crawl cycle? Surely this is a problem with producing the specific parse_data directory. If it doesn't work after two iterations, then we know that it is happening early on in the crawl cycle. Have you manually checked that the directories exist after fetching and parsing?
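For instance (an untested sketch, reusing the paths from your procedure below), after the second generate/fetch/parse/updatedb pass you could run:

/opt/nutch_1_4/bin/nutch mergesegs /opt/nutch_1_4/data/crawl/MERGEDsegments/ -dir /opt/nutch_1_4/data/crawl/segments/ -filter -normalize

and check whether parse_data is already missing from the merged output at that point.

By manually checking I mean something along these lines (the segment name is taken from your example and will differ on each iteration):

ls -l /opt/nutch_1_4/data/crawl/segments/20120106152527

After a successful fetch you would expect to see crawl_fetch (and content, if you are storing content) alongside crawl_generate, and after a successful parse also crawl_parse, parse_data and parse_text.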
On Fri, Jan 6, 2012 at 4:24 PM, Dean Pullen <[email protected]> wrote:
> Good spot because all of that was meant to be removed! No, I'm afraid that's
> just a copy/paste problem.
>
> Dean
>
> On 06/01/2012 16:17, Lewis John Mcgibbney wrote:
>>
>> Ok then,
>>
>> How about your generate command:
>>
>> 2) GENERATE:
>> /opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/
>> /opt/semantico/slot/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays 26
>>
>> Your <segments_dir> seems to point to /opt/semantico/slot/etc/etc/etc,
>> when everything else being utilised within the crawl cycle points to
>> an entirely different <segment_dirs> path, which is
>> /opt/nutch_1_4/data/crawl/segments/segment_date
>>
>> Was this intentional?
>>
>> On Fri, Jan 6, 2012 at 4:08 PM, Dean Pullen <[email protected]> wrote:
>>>
>>> Lewis,
>>>
>>> Changing the merge to * returns a similar response:
>>>
>>> LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input Pattern
>>> file:/opt/nutch_1_4/data/crawl/segments/*/parse_data matches 0 files
>>>
>>> And yes, your assumption was correct - it's a different segment directory
>>> each loop.
>>>
>>> Many thanks,
>>>
>>> Dean.
>>>
>>> On 06/01/2012 15:43, Lewis John Mcgibbney wrote:
>>>>
>>>> Hi Dean,
>>>>
>>>> Without discussing any of your configuration properties, can you please
>>>> try
>>>>
>>>> 6) MERGE SEGMENTS:
>>>> /opt/nutch_1_4/bin/nutch mergesegs
>>>> /opt/nutch_1_4/data/crawl/MERGEDsegments/ -dir
>>>> /opt/nutch_1_4/data/crawl/segments/* -filter -normalize
>>>>
>>>> paying attention to the wildcard /* in -dir
>>>> /opt/nutch_1_4/data/crawl/segments/*
>>>>
>>>> Also presumably, when you mention you repeat steps 2-5 another 4
>>>> times, you are not recursively generating, fetching, parsing and
>>>> updating the WebDB with
>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527? This should change
>>>> with every iteration of the g/f/p/updatedb cycle.
>>>>
>>>> Thanks
>>>>
>>>> On Fri, Jan 6, 2012 at 3:30 PM, Dean Pullen <[email protected]> wrote:
>>>>>
>>>>> No problem Lewis, I appreciate you looking into it.
>>>>>
>>>>> Firstly, I have a seed URL XML document here:
>>>>> http://www.ukcigarforums.com/injectlist.xml
>>>>> This basically has 'http://www.ukcigarforums.com/content.php' as a URL
>>>>> within it.
>>>>>
>>>>> Nutch's regex-urlfilter.txt contains this:
>>>>>
>>>>> # allow urls in ukcigarforums.com domain
>>>>> +http://([a-z0-9\-A-Z]*\.)*ukcigarforums.com/
>>>>> # deny anything else
>>>>> -.
>>>>>
>>>>> Here's the procedure:
>>>>>
>>>>> 1) INJECT:
>>>>> /opt/nutch_1_4/bin/nutch inject /opt/nutch_1_4/data/crawl/crawldb/
>>>>> /opt/nutch_1_4/data/seed/
>>>>>
>>>>> 2) GENERATE:
>>>>> /opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/
>>>>> /opt/semantico/slot/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays 26
>>>>>
>>>>> 3) FETCH:
>>>>> /opt/nutch_1_4/bin/nutch fetch
>>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15
>>>>>
>>>>> 4) PARSE:
>>>>> /opt/nutch_1_4/bin/nutch parse
>>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15
>>>>>
>>>>> 5) UPDATE DB:
>>>>> /opt/nutch_1_4/bin/nutch updatedb /opt/nutch_1_4/data/crawl/crawldb/
>>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -normalize -filter
>>>>>
>>>>> Repeat steps 2 to 5 another 4 times, then:
>>>>>
>>>>> 6) MERGE SEGMENTS:
>>>>> /opt/nutch_1_4/bin/nutch mergesegs
>>>>> /opt/nutch_1_4/data/crawl/MERGEDsegments/
>>>>> -dir /opt/nutch_1_4/data/crawl/segments/ -filter -normalize
>>>>>
>>>>> Interestingly, this prints out:
>>>>> "SegmentMerger: using segment data from: crawl_generate crawl_fetch
>>>>> crawl_parse parse_data parse_text"
>>>>>
>>>>> The MERGEDsegments segment directory then contains just two
>>>>> subdirectories, instead of all of those listed in the last output,
>>>>> i.e. just: crawl_generate and crawl_fetch
>>>>>
>>>>> (We then delete from the segments directory and copy the MERGEDsegments
>>>>> results into it.)
>>>>>
>>>>> Lastly, we run invert links after merge segments:
>>>>>
>>>>> 7) INVERT LINKS:
>>>>> /opt/nutch_1_4/bin/nutch invertlinks /opt/nutch_1_4/data/crawl/linkdb/
>>>>> -dir /opt/nutch_1_4/data/crawl/segments/
>>>>>
>>>>> Which produces:
>>>>>
>>>>> "LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path
>>>>> does not exist:
>>>>> file:/opt/nutch_1_4/data/crawl/segments/20120106152527/parse_data"
>>>>>
>>
>>
>
--
Lewis

