Another thing which I have stupidly not asked yet: have you checked
your hadoop.log to see if there are any problems around the parse
phase?

It should begin

LOG.info("ParseSegment: starting at " + sdf.format(start));
LOG.info("ParseSegment: segment: " + segment);
...
if successful
...
LOG.info("Parsed (" + Long.toString(end - start) + "ms):" + url);
...
if not then
...
LOG.warn("Error parsing: " etc
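
A quick way to check (assuming the default log4j setup, where the local
runtime writes to logs/hadoop.log under the Nutch install, e.g.
/opt/nutch_1_4/logs/hadoop.log) would be something along these lines:

grep "ParseSegment" /opt/nutch_1_4/logs/hadoop.log
grep -c "Parsed (" /opt/nutch_1_4/logs/hadoop.log
grep "Error parsing" /opt/nutch_1_4/logs/hadoop.log

i.e. the start/finish of each parse job, a count of successfully parsed
URLs, and any parse failures.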

Any joy?

On Fri, Jan 6, 2012 at 4:38 PM, Dean Pullen <[email protected]> wrote:
> Two iterations do the same thing - the parse_data directory is missing.
>
> Interestingly, just doing the mergesegs on ONE crawl also removes the
> parse_data dir etc!
>
> Dean.
>
>
>
> On 06/01/2012 16:28, Lewis John Mcgibbney wrote:
>>
>> How about merging segs after every subsequent iteration of the crawl
>> cycle... surely this is a problem with producing the specific
>> parse_data directory. If it doesn't work after two iterations then we
>> know that it is happening early on in the crawl cycle. Have you
>> manually checked that the directories exist after fetching and
>> parsing?
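>>
>> For example (just a sanity check), something like
>>
>> ls /opt/nutch_1_4/data/crawl/segments/*/
>>
>> should list crawl_generate, crawl_fetch, crawl_parse, parse_data and
>> parse_text under each segment once fetching and parsing have run. If
>> parse_data is already missing at that point, the problem is upstream
>> of mergesegs.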
>>
>> On Fri, Jan 6, 2012 at 4:24 PM, Dean Pullen <[email protected]> wrote:
>>>
>>> Good spot because all of that was meant to be removed! No, I'm afraid
>>> that's
>>> just a copy/paste problem.
>>>
>>> Dean
>>>
>>> On 06/01/2012 16:17, Lewis John Mcgibbney wrote:
>>>>
>>>> Ok then,
>>>>
>>>> How about your generate command:
>>>>
>>>> 2) GENERATE:
>>>> /opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/
>>>> /opt/semantico/slot/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays 26
>>>>
>>>> Your <segments_dir> seems to point to /opt/semantico/slot/etc/etc/etc,
>>>> when everything else used within the crawl cycle points to an
>>>> entirely different <segments_dir> path, namely
>>>> /opt/nutch_1_4/data/crawl/segments/segment_date
>>>>
>>>> Was this intentional?
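>>>>
>>>> If it wasn't, the consistent version (just swapping in the segments
>>>> path used by your other commands, so purely a guess at what was
>>>> intended) would be:
>>>>
>>>> /opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/
>>>> /opt/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays 26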
>>>>
>>>> On Fri, Jan 6, 2012 at 4:08 PM, Dean Pullen <[email protected]> wrote:
>>>>>
>>>>> Lewis,
>>>>>
>>>>> Changing the merge to * returns a similar response:
>>>>>
>>>>> LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input Pattern
>>>>> file:/opt/nutch_1_4/data/crawl/segments/*/parse_data matches 0 files
>>>>>
>>>>> And yes, your assumption was correct - it's a different segment
>>>>> directory
>>>>> each loop.
>>>>>
>>>>> Many thanks,
>>>>>
>>>>> Dean.
>>>>>
>>>>> On 06/01/2012 15:43, Lewis John Mcgibbney wrote:
>>>>>>
>>>>>> Hi Dean,
>>>>>>
>>>>>> Without discussing any of your configuration properties, can you
>>>>>> please try:
>>>>>>
>>>>>> 6) MERGE SEGMENTS:
>>>>>> /opt/nutch_1_4/bin/nutch mergesegs
>>>>>> /opt/nutch_1_4/data/crawl/MERGEDsegments/ -dir
>>>>>> /opt/nutch_1_4/data/crawl/segments/* -filter -normalize
>>>>>>
>>>>>> paying attention to the wildcard /* in -dir
>>>>>> /opt/nutch_1_4/data/crawl/segments/*
>>>>>>
>>>>>> Also, presumably when you mention you repeat steps 2-5 another 4
>>>>>> times, you are not repeatedly generating, fetching, parsing and
>>>>>> updating the crawldb with the same segment,
>>>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527? That path should
>>>>>> change with every iteration of the generate/fetch/parse/updatedb cycle.
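>>>>>>
>>>>>> One way to guarantee that (just a rough bash sketch around the
>>>>>> commands you posted; the variable name is made up) is to pick up
>>>>>> the newest segment after each generate:
>>>>>>
>>>>>> SEGMENT=$(ls -d /opt/nutch_1_4/data/crawl/segments/* | tail -1)
>>>>>> /opt/nutch_1_4/bin/nutch fetch $SEGMENT -threads 15
>>>>>> /opt/nutch_1_4/bin/nutch parse $SEGMENT -threads 15
>>>>>> /opt/nutch_1_4/bin/nutch updatedb /opt/nutch_1_4/data/crawl/crawldb/ $SEGMENT -normalize -filter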
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> On Fri, Jan 6, 2012 at 3:30 PM, Dean Pullen <[email protected]> wrote:
>>>>>>>
>>>>>>> No problem Lewis, I appreciate you looking into it.
>>>>>>>
>>>>>>>
>>>>>>> Firstly I have a seed URL XML document here:
>>>>>>> http://www.ukcigarforums.com/injectlist.xml
>>>>>>> This basically has 'http://www.ukcigarforums.com/content.php' as a
>>>>>>> URL
>>>>>>> within it.
>>>>>>>
>>>>>>> Nutch's regex-urlfilter.txt contains this:
>>>>>>>
>>>>>>> # allow urls in ukcigarforums.com domain
>>>>>>> +http://([a-z0-9\-A-Z]*\.)*ukcigarforums.com/
>>>>>>> # deny anything else
>>>>>>> -.
>>>>>>>
>>>>>>>
>>>>>>> Here's the procedure:
>>>>>>>
>>>>>>>
>>>>>>> 1) INJECT:
>>>>>>> /opt/nutch_1_4/bin/nutch inject /opt/nutch_1_4/data/crawl/crawldb/
>>>>>>> /opt/nutch_1_4/data/seed/
>>>>>>>
>>>>>>> 2) GENERATE:
>>>>>>> /opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/
>>>>>>> /opt/semantico/slot/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays 26
>>>>>>>
>>>>>>> 3) FETCH:
>>>>>>> /opt/nutch_1_4/bin/nutch fetch
>>>>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15
>>>>>>>
>>>>>>> 4) PARSE:
>>>>>>> /opt/nutch_1_4/bin/nutch parse
>>>>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15
>>>>>>>
>>>>>>> 5) UPDATE DB:
>>>>>>> /opt/nutch_1_4/bin/nutch updatedb /opt/nutch_1_4/data/crawl/crawldb/
>>>>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -normalize -filter
>>>>>>>
>>>>>>>
>>>>>>> Repeat steps 2 to 5 another 4 times, then:
>>>>>>>
>>>>>>> 6) MERGE SEGMENTS:
>>>>>>> /opt/nutch_1_4/bin/nutch mergesegs
>>>>>>> /opt/nutch_1_4/data/crawl/MERGEDsegments/
>>>>>>> -dir /opt/nutch_1_4/data/crawl/segments/ -filter -normalize
>>>>>>>
>>>>>>>
>>>>>>> Interestingly, this prints out:
>>>>>>> "SegmentMerger: using segment data from: crawl_generate crawl_fetch
>>>>>>> crawl_parse parse_data parse_text"
>>>>>>>
>>>>>>> The MERGEDsegments segment directory then has just two subdirectories
>>>>>>> instead of all of those listed in that output, i.e. just
>>>>>>> crawl_generate and crawl_fetch.
>>>>>>>
>>>>>>> (we then delete everything from the segments directory and copy the
>>>>>>> MERGEDsegments results into it)
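>>>>>>>
>>>>>>> (Roughly: rm -r /opt/nutch_1_4/data/crawl/segments/* followed by
>>>>>>> cp -r /opt/nutch_1_4/data/crawl/MERGEDsegments/* /opt/nutch_1_4/data/crawl/segments/
>>>>>>> assuming everything stays on the local filesystem as above.)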
>>>>>>>
>>>>>>>
>>>>>>> Lastly we run invert links after merge segments:
>>>>>>>
>>>>>>> 7) INVERT LINKS:
>>>>>>> /opt/nutch_1_4/bin/nutch invertlinks /opt/nutch_1_4/data/crawl/linkdb/
>>>>>>> -dir /opt/nutch_1_4/data/crawl/segments/
>>>>>>>
>>>>>>> Which produces:
>>>>>>>
>>>>>>> "LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path
>>>>>>> does not exist:
>>>>>>> file:/opt/nutch_1_4/data/crawl/segments/20120106152527/parse_data"
>>>>>>>
>>>>>>>
>>>>
>>
>>
>



-- 
Lewis
