Hi Dean,

Without discussing any of your configuration properties, can you please try

6) MERGE SEGMENTS:
/opt/nutch_1_4/bin/nutch mergesegs /opt/nutch_1_4/data/crawl/MERGEDsegments/ -dir /opt/nutch_1_4/data/crawl/segments/* -filter -normalize

paying particular attention to the trailing wildcard /* in -dir /opt/nutch_1_4/data/crawl/segments/*.
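To see why the wildcard matters, here is a minimal sketch of what the shell actually hands to mergesegs (the timestamped directory names are invented, and mktemp stands in for /opt/nutch_1_4/data/crawl/segments):

```shell
# Hypothetical segment layout (names invented for illustration)
SEGS=$(mktemp -d)
mkdir -p "$SEGS/20120106152527" "$SEGS/20120106160101"

# With the trailing /*, the shell expands the glob, so the command
# receives one argument per timestamped segment directory:
echo with wildcard: "$SEGS"/*

# Without it, the command receives only the parent directory:
echo without wildcard: "$SES"/ 2>/dev/null || echo without wildcard: "$SEGS"/

rm -rf "$SEGS"
```

The point is only that segments/* enumerates each timestamped segment, whereas segments/ names the parent directory alone.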

Also, presumably when you mention that you repeat steps 2-5 another
4 times, you are not repeatedly generating, fetching, parsing and
updating the crawldb with the same segment,
/opt/nutch_1_4/data/crawl/segments/20120106152527? That path should
change with every iteration of the generate/fetch/parse/updatedb
cycle, because each generate creates a new timestamped segment
directory.
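One common way to make sure each iteration operates on the segment it just generated is to capture the newest timestamped directory after every generate. A minimal sketch, with mktemp and made-up timestamps standing in for the real crawl directory, and the bin/nutch calls commented out since they need a live Nutch install:

```shell
# Stand-in for /opt/nutch_1_4/data/crawl (hypothetical)
CRAWL=$(mktemp -d)
mkdir -p "$CRAWL/segments/20120106152527" "$CRAWL/segments/20120106160101"

# Segment names are timestamps, so a lexicographic sort puts the
# newest one last:
SEGMENT=$(ls -d "$CRAWL/segments/"* | sort | tail -1)
echo "$SEGMENT"

# Each iteration would then run, e.g.:
# /opt/nutch_1_4/bin/nutch fetch "$SEGMENT" -threads 15
# /opt/nutch_1_4/bin/nutch parse "$SEGMENT" -threads 15
# /opt/nutch_1_4/bin/nutch updatedb /opt/nutch_1_4/data/crawl/crawldb/ "$SEGMENT" -normalize -filter

rm -rf "$CRAWL"
```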

Thanks

On Fri, Jan 6, 2012 at 3:30 PM, Dean Pullen <[email protected]> wrote:
> No problem Lewis, I appreciate you looking into it.
>
>
> Firstly I have a seed URL XML document here:
> http://www.ukcigarforums.com/injectlist.xml
> This basically has 'http://www.ukcigarforums.com/content.php' as a URL
> within it.
>
> Nutch's regex-urlfilter.txt contains this:
>
> # allow urls in ukcigarforums.com domain
> +http://([a-z0-9\-A-Z]*\.)*ukcigarforums.com/
> # deny anything else
> -.
>
>
> Here's the procedure:
>
>
> 1) INJECT:
> /opt/nutch_1_4/bin/nutch inject /opt/nutch_1_4/data/crawl/crawldb/
> /opt/nutch_1_4/data/seed/
>
> 2) GENERATE:
> /opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/
> /opt/semantico/slot/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays 26
>
> 3) FETCH:
> /opt/nutch_1_4/bin/nutch fetch
> /opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15
>
> 4) PARSE:
> /opt/nutch_1_4/bin/nutch parse
> /opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15
>
> 5) UPDATE DB:
> /opt/nutch_1_4/bin/nutch updatedb /opt/nutch_1_4/data/crawl/crawldb/
> /opt/nutch_1_4/data/crawl/segments/20120106152527 -normalize -filter
>
>
> Repeat steps 2 to 5 another 4 times, then:
>
> 6) MERGE SEGMENTS:
> /opt/nutch_1_4/bin/nutch mergesegs /opt/nutch_1_4/data/crawl/MERGEDsegments/
> -dir /opt/nutch_1_4/data/crawl/segments/ -filter -normalize
>
>
> Interestingly, this prints out:
> "SegmentMerger: using segment data from: crawl_generate crawl_fetch
> crawl_parse parse_data parse_text"
>
> MERGEDsegments segment directory then has just two directories, instead of
> all of those listed in the last output, i.e. just: crawl_generate and
> crawl_fetch
>
> (we then delete the old segments from the segments directory and copy the
> MERGEDsegments results into it)
>
>
> Lastly we run invert links after merge segments:
>
> 7) INVERT LINKS:
> /opt/nutch_1_4/bin/nutch invertlinks /opt/nutch_1_4/data/crawl/linkdb/ -dir
> /opt/nutch_1_4/data/crawl/segments/
>
> Which produces:
>
> "LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not
> exist: file:/opt/nutch_1_4/data/crawl/segments/20120106152527/parse_data"
>
>



-- 
Lewis
