Good spot, all of that was meant to be removed! No, it wasn't intentional, I'm afraid that's just a copy/paste problem.

Dean

On 06/01/2012 16:17, Lewis John Mcgibbney wrote:
Ok then,

How about your generate command:

2) GENERATE:
/opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/
/opt/semantico/slot/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays 26

Your <segments_dir> seems to point to /opt/semantico/slot/etc/etc/etc,
when everything else being utilised within the crawl cycle points to
an entirely different <segments_dir> path, which is
/opt/nutch_1_4/data/crawl/segments/segment_date

Was this intentional?
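
If it wasn't, a corrected command (just a sketch, assuming the same base
path as the rest of your cycle) would be:

/opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/ /opt/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays 26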

On Fri, Jan 6, 2012 at 4:08 PM, Dean Pullen <[email protected]> wrote:
Lewis,

Changing the merge -dir to use the * wildcard returns a similar response:

LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input Pattern
file:/opt/nutch_1_4/data/crawl/segments/*/parse_data matches 0 files

And yes, your assumption was correct - it's a different segment directory
each loop.
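
For reference, I can check what each segment actually contains with
something like:

ls /opt/nutch_1_4/data/crawl/segments/*/

which lists the subdirectories (crawl_generate, crawl_fetch, parse_data
and so on) present in each segment.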

Many thanks,

Dean.

On 06/01/2012 15:43, Lewis John Mcgibbney wrote:
Hi Dean,

Without discussing any of your configuration properties can you please try

6) MERGE SEGMENTS:
/opt/nutch_1_4/bin/nutch mergesegs /opt/nutch_1_4/data/crawl/MERGEDsegments/ -dir /opt/nutch_1_4/data/crawl/segments/* -filter -normalize

paying attention to the wildcard /* in -dir
/opt/nutch_1_4/data/crawl/segments/*

Also, presumably when you mention that you repeat steps 2-5 another 4
times, you are not recursively generating, fetching, parsing and
updating the WebDB with
/opt/nutch_1_4/data/crawl/segments/20120106152527? That segment path
should change with every iteration of the generate/fetch/parse/updatedb
cycle.
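
For reference, one iteration of that cycle could pick up the freshly
generated segment automatically, along these lines (a rough sketch
reusing the paths and flags from your own commands):

/opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/ /opt/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays 26
# segment names are YYYYMMDDHHMMSS timestamps, so the newest sorts last
SEGMENT=$(ls -d /opt/nutch_1_4/data/crawl/segments/* | sort | tail -1)
/opt/nutch_1_4/bin/nutch fetch $SEGMENT -threads 15
/opt/nutch_1_4/bin/nutch parse $SEGMENT -threads 15
/opt/nutch_1_4/bin/nutch updatedb /opt/nutch_1_4/data/crawl/crawldb/ $SEGMENT -normalize -filter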

Thanks

On Fri, Jan 6, 2012 at 3:30 PM, Dean Pullen <[email protected]> wrote:
No problem Lewis, I appreciate you looking into it.


Firstly, I have a seed URL XML document here:
http://www.ukcigarforums.com/injectlist.xml
It basically has 'http://www.ukcigarforums.com/content.php' as a URL within it.

Nutch's regex-urlfilter.txt contains this:

# allow urls in ukcigarforums.com domain
+http://([a-z0-9\-A-Z]*\.)*ukcigarforums.com/
# deny anything else
-.


Here's the procedure:


1) INJECT:
/opt/nutch_1_4/bin/nutch inject /opt/nutch_1_4/data/crawl/crawldb/
/opt/nutch_1_4/data/seed/

2) GENERATE:
/opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/
/opt/semantico/slot/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays 26

3) FETCH:
/opt/nutch_1_4/bin/nutch fetch
/opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15

4) PARSE:
/opt/nutch_1_4/bin/nutch parse
/opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15

5) UPDATE DB:
/opt/nutch_1_4/bin/nutch updatedb /opt/nutch_1_4/data/crawl/crawldb/
/opt/nutch_1_4/data/crawl/segments/20120106152527 -normalize -filter


Repeat steps 2 to 5 another 4 times, then:

6) MERGE SEGMENTS:
/opt/nutch_1_4/bin/nutch mergesegs
/opt/nutch_1_4/data/crawl/MERGEDsegments/
-dir /opt/nutch_1_4/data/crawl/segments/ -filter -normalize


Interestingly, this prints out:
"SegmentMerger: using segment data from: crawl_generate crawl_fetch
crawl_parse parse_data parse_text"

The MERGEDsegments directory then contains just two subdirectories,
instead of all of those listed in the output above, i.e. only
crawl_generate and crawl_fetch.

(We then delete everything from the segments directory and copy the
MERGEDsegments results into it, roughly as sketched below.)
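
Something along these lines (the exact shell commands here are an
approximation of what we run):

rm -rf /opt/nutch_1_4/data/crawl/segments/*
cp -r /opt/nutch_1_4/data/crawl/MERGEDsegments/* /opt/nutch_1_4/data/crawl/segments/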


Lastly we run invert links after merge segments:

7) INVERT LINKS:
/opt/nutch_1_4/bin/nutch invertlinks /opt/nutch_1_4/data/crawl/linkdb/ -dir /opt/nutch_1_4/data/crawl/segments/

Which produces:

"LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does
not
exist: file:/opt/nutch_1_4/data/crawl/segments/20120106152527/parse_data"
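
Listing that path directly, e.g.

ls /opt/nutch_1_4/data/crawl/segments/20120106152527/

would show whether that segment still exists after the merge step and
whether it has a parse_data subdirectory.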





