How about merging segments after every subsequent iteration of the crawl cycle? Surely this is a problem with producing the specific parse_data directory. If it doesn't work after two iterations, then we know that it is happening early on in the crawl cycle. Have you manually checked that the directories exist after fetching and parsing?
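For instance (an untested sketch, reusing the paths from your procedure below), after the second generate/fetch/parse/updatedb pass you could run:

/opt/nutch_1_4/bin/nutch mergesegs /opt/nutch_1_4/data/crawl/MERGEDsegments/ -dir /opt/nutch_1_4/data/crawl/segments/ -filter -normalize

and check whether parse_data is already missing from the merged output at that point.

By manually checking I mean something along these lines (the segment name is taken from your example and will differ on each iteration):

ls -l /opt/nutch_1_4/data/crawl/segments/20120106152527

After a successful fetch you would expect to see crawl_fetch (and content, if you are storing content) alongside crawl_generate, and after a successful parse also crawl_parse, parse_data and parse_text.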
On Fri, Jan 6, 2012 at 4:24 PM, Dean Pullen <[email protected]> wrote:
> Good spot because all of that was meant to be removed! No, I'm afraid that's
> just a copy/paste problem.
>
> Dean
>
> On 06/01/2012 16:17, Lewis John Mcgibbney wrote:
>>
>> Ok then,
>>
>> How about your generate command:
>>
>> 2) GENERATE:
>> /opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/
>> /opt/semantico/slot/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays 26
>>
>> Your <segments_dir> seems to point to /opt/semantico/slot/etc/etc/etc,
>> when everything else being utilised within the crawl cycle points to
>> an entirely different <segment_dirs> path, which is
>> /opt/nutch_1_4/data/crawl/segments/segment_date
>>
>> Was this intentional?
>>
>> On Fri, Jan 6, 2012 at 4:08 PM, Dean Pullen <[email protected]> wrote:
>>>
>>> Lewis,
>>>
>>> Changing the merge to * returns a similar response:
>>>
>>> LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input Pattern
>>> file:/opt/nutch_1_4/data/crawl/segments/*/parse_data matches 0 files
>>>
>>> And yes, your assumption was correct - it's a different segment directory
>>> each loop.
>>>
>>> Many thanks,
>>>
>>> Dean.
>>>
>>> On 06/01/2012 15:43, Lewis John Mcgibbney wrote:
>>>>
>>>> Hi Dean,
>>>>
>>>> Without discussing any of your configuration properties, can you please
>>>> try
>>>>
>>>> 6) MERGE SEGMENTS:
>>>> /opt/nutch_1_4/bin/nutch mergesegs
>>>> /opt/nutch_1_4/data/crawl/MERGEDsegments/ -dir
>>>> /opt/nutch_1_4/data/crawl/segments/* -filter -normalize
>>>>
>>>> paying attention to the wildcard /* in -dir
>>>> /opt/nutch_1_4/data/crawl/segments/*
>>>>
>>>> Also presumably, when you mention you repeat steps 2-5 another 4
>>>> times, you are not recursively generating, fetching, parsing and
>>>> updating the WebDB with
>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527? This should change
>>>> with every iteration of the g/f/p/updatedb cycle.
>>>>
>>>> Thanks
>>>>
>>>> On Fri, Jan 6, 2012 at 3:30 PM, Dean Pullen <[email protected]> wrote:
>>>>>
>>>>> No problem Lewis, I appreciate you looking into it.
>>>>>
>>>>> Firstly, I have a seed URL XML document here:
>>>>> http://www.ukcigarforums.com/injectlist.xml
>>>>> This basically has 'http://www.ukcigarforums.com/content.php' as a URL
>>>>> within it.
>>>>>
>>>>> Nutch's regex-urlfilter.txt contains this:
>>>>>
>>>>> # allow urls in ukcigarforums.com domain
>>>>> +http://([a-z0-9\-A-Z]*\.)*ukcigarforums.com/
>>>>> # deny anything else
>>>>> -.
>>>>>
>>>>> Here's the procedure:
>>>>>
>>>>> 1) INJECT:
>>>>> /opt/nutch_1_4/bin/nutch inject /opt/nutch_1_4/data/crawl/crawldb/
>>>>> /opt/nutch_1_4/data/seed/
>>>>>
>>>>> 2) GENERATE:
>>>>> /opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/
>>>>> /opt/semantico/slot/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays 26
>>>>>
>>>>> 3) FETCH:
>>>>> /opt/nutch_1_4/bin/nutch fetch
>>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15
>>>>>
>>>>> 4) PARSE:
>>>>> /opt/nutch_1_4/bin/nutch parse
>>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15
>>>>>
>>>>> 5) UPDATE DB:
>>>>> /opt/nutch_1_4/bin/nutch updatedb /opt/nutch_1_4/data/crawl/crawldb/
>>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -normalize -filter
>>>>>
>>>>> Repeat steps 2 to 5 another 4 times, then:
>>>>>
>>>>> 6) MERGE SEGMENTS:
>>>>> /opt/nutch_1_4/bin/nutch mergesegs
>>>>> /opt/nutch_1_4/data/crawl/MERGEDsegments/
>>>>> -dir /opt/nutch_1_4/data/crawl/segments/ -filter -normalize
>>>>>
>>>>> Interestingly, this prints out:
>>>>> "SegmentMerger: using segment data from: crawl_generate crawl_fetch
>>>>> crawl_parse parse_data parse_text"
>>>>>
>>>>> The MERGEDsegments segment directory then contains just two
>>>>> subdirectories, instead of all of those listed in the last output,
>>>>> i.e. just: crawl_generate and crawl_fetch
>>>>>
>>>>> (We then delete from the segments directory and copy the MERGEDsegments
>>>>> results into it.)
>>>>>
>>>>> Lastly, we run invert links after merge segments:
>>>>>
>>>>> 7) INVERT LINKS:
>>>>> /opt/nutch_1_4/bin/nutch invertlinks /opt/nutch_1_4/data/crawl/linkdb/
>>>>> -dir /opt/nutch_1_4/data/crawl/segments/
>>>>>
>>>>> Which produces:
>>>>>
>>>>> "LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path
>>>>> does not exist:
>>>>> file:/opt/nutch_1_4/data/crawl/segments/20120106152527/parse_data"
>>>>>
>>
>>
>
--
Lewis

