Ok then, how about your generate command:
2) GENERATE:
/opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/
/opt/semantico/slot/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays 26

Your <segments_dir> seems to point to /opt/semantico/slot/etc/etc/etc,
while everything else used within the crawl cycle points to an entirely
different <segments_dir> path, i.e.
/opt/nutch_1_4/data/crawl/segments/segment_date. Was this intentional?
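If it wasn't, a generate call consistent with the rest of the cycle would
presumably look like this (a sketch, assuming
/opt/nutch_1_4/data/crawl/segments/ is the tree every other step uses):

/opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/
/opt/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays 26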
On Fri, Jan 6, 2012 at 4:08 PM, Dean Pullen <[email protected]> wrote:
> Lewis,
>
> Changing the merge to * returns a similar response:
>
> LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input Pattern
> file:/opt/nutch_1_4/data/crawl/segments/*/parse_data matches 0 files
>
> And yes, your assumption was correct - it's a different segment directory
> each loop.
>
> Many thanks,
>
> Dean.
>
> On 06/01/2012 15:43, Lewis John Mcgibbney wrote:
>>
>> Hi Dean,
>>
>> Without discussing any of your configuration properties, can you
>> please try
>>
>> 6) MERGE SEGMENTS:
>> /opt/nutch_1_4/bin/nutch mergesegs
>> /opt/nutch_1_4/data/crawl/MERGEDsegments/ -dir
>> /opt/nutch_1_4/data/crawl/segments/* -filter -normalize
>>
>> paying attention to the wildcard /* in -dir
>> /opt/nutch_1_4/data/crawl/segments/*
>>
>> Also, presumably, when you mention that you repeat steps 2-5 another 4
>> times, you are not recursively generating, fetching, parsing and
>> updating the WebDB with
>> /opt/nutch_1_4/data/crawl/segments/20120106152527? This should change
>> with every iteration of the g/f/p/updatedb cycle.
>>
>> Thanks
>>
>> On Fri, Jan 6, 2012 at 3:30 PM, Dean Pullen <[email protected]>
>> wrote:
>>>
>>> No problem Lewis, I appreciate you looking into it.
>>>
>>> Firstly, I have a seed URL XML document here:
>>> http://www.ukcigarforums.com/injectlist.xml
>>> This basically has 'http://www.ukcigarforums.com/content.php' as a URL
>>> within it.
>>>
>>> Nutch's regex-urlfilter.txt contains this:
>>>
>>> # allow urls in ukcigarforums.com domain
>>> +http://([a-z0-9\-A-Z]*\.)*ukcigarforums.com/
>>> # deny anything else
>>> -.
>>>
>>> Here's the procedure:
>>>
>>> 1) INJECT:
>>> /opt/nutch_1_4/bin/nutch inject /opt/nutch_1_4/data/crawl/crawldb/
>>> /opt/nutch_1_4/data/seed/
>>>
>>> 2) GENERATE:
>>> /opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/
>>> /opt/semantico/slot/nutch_1_4/data/crawl/segments/ -topN 10000
>>> -adddays 26
>>>
>>> 3) FETCH:
>>> /opt/nutch_1_4/bin/nutch fetch
>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15
>>>
>>> 4) PARSE:
>>> /opt/nutch_1_4/bin/nutch parse
>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15
>>>
>>> 5) UPDATE DB:
>>> /opt/nutch_1_4/bin/nutch updatedb /opt/nutch_1_4/data/crawl/crawldb/
>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -normalize -filter
>>>
>>> Repeat steps 2 to 5 another 4 times, then:
>>>
>>> 6) MERGE SEGMENTS:
>>> /opt/nutch_1_4/bin/nutch mergesegs
>>> /opt/nutch_1_4/data/crawl/MERGEDsegments/
>>> -dir /opt/nutch_1_4/data/crawl/segments/ -filter -normalize
>>>
>>> Interestingly, this prints out:
>>> "SegmentMerger: using segment data from: crawl_generate crawl_fetch
>>> crawl_parse parse_data parse_text"
>>>
>>> The MERGEDsegments segment directory then has just two directories,
>>> instead of all of those listed in the last output, i.e. just
>>> crawl_generate and crawl_fetch.
>>>
>>> (We then delete from the segments directory and copy the
>>> MERGEDsegments results into it.)
>>>
>>> Lastly, we run invert links after merge segments:
>>>
>>> 7) INVERT LINKS:
>>> /opt/nutch_1_4/bin/nutch invertlinks /opt/nutch_1_4/data/crawl/linkdb/
>>> -dir /opt/nutch_1_4/data/crawl/segments/
>>>
>>> Which produces:
>>>
>>> "LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path
>>> does not exist:
>>> file:/opt/nutch_1_4/data/crawl/segments/20120106152527/parse_data"

--
Lewis
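For reference, here is the whole cycle from the thread as one script, with
a consistent <segments_dir> and the current segment picked up after each
generate rather than a hard-coded timestamp. This is only a sketch: it
assumes bash, assumes /opt/nutch_1_4/data/crawl is the intended root, and
the NUTCH/CRAWL/SEGMENT variable names are illustrative, not from the
thread.

#!/bin/bash
# Sketch of the inject/generate/fetch/parse/updatedb cycle described above.
NUTCH=/opt/nutch_1_4/bin/nutch
CRAWL=/opt/nutch_1_4/data/crawl

$NUTCH inject "$CRAWL/crawldb/" /opt/nutch_1_4/data/seed/

for i in 1 2 3 4 5; do
    $NUTCH generate "$CRAWL/crawldb/" "$CRAWL/segments/" -topN 10000 -adddays 26
    # Use the segment generate just created (segment names are
    # timestamps, so a lexical sort puts the newest last), instead of
    # repeating a fixed path such as .../segments/20120106152527.
    SEGMENT=$(ls -d "$CRAWL"/segments/* | sort | tail -n 1)
    $NUTCH fetch "$SEGMENT" -threads 15
    $NUTCH parse "$SEGMENT"
    $NUTCH updatedb "$CRAWL/crawldb/" "$SEGMENT" -normalize -filter
done

$NUTCH mergesegs "$CRAWL/MERGEDsegments/" -dir "$CRAWL/segments/" -filter -normalize
$NUTCH invertlinks "$CRAWL/linkdb/" -dir "$CRAWL/segments/"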

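And since both failures above complain about a missing parse_data, a quick
check before step 7 (my suggestion, not from the thread) is to list what
each segment directory actually contains:

ls /opt/nutch_1_4/data/crawl/segments/*/
# A complete segment holds: crawl_generate crawl_fetch crawl_parse
# parse_data parse_text. If parse_data is absent after the merge and
# copy-back, invertlinks will fail exactly as reported.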
