No problem Lewis, I appreciate you looking into it.
Firstly, I have a seed URL XML document here:
http://www.ukcigarforums.com/injectlist.xml
It contains 'http://www.ukcigarforums.com/content.php' as one of its
URLs.
Nutch's regex-urlfilter.txt contains this:
# allow urls in ukcigarforums.com domain
+http://([a-z0-9\-A-Z]*\.)*ukcigarforums.com/
# deny anything else
-.
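Just to sanity-check those two rules, here is a rough approximation using grep -E (Nutch's regex-urlfilter plugin uses Java regex, but these particular patterns behave the same under both engines; the `check` helper is mine, not part of Nutch):

```shell
# Approximate the allow rule from regex-urlfilter.txt with grep -E.
ALLOW='http://([a-zA-Z0-9-]*\.)*ukcigarforums.com/'
# Echo ACCEPT if the URL matches the allow rule, REJECT otherwise
# (the "-." deny-all rule catches everything the allow rule misses).
check() { echo "$1" | grep -Eq "$ALLOW" && echo ACCEPT || echo REJECT; }
check "http://www.ukcigarforums.com/content.php"
check "http://example.com/content.php"
```

So the seed URL itself passes the filter, which rules out the URL filter as the cause of what follows.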
Here's the procedure:
1) INJECT:
/opt/nutch_1_4/bin/nutch inject /opt/nutch_1_4/data/crawl/crawldb/
/opt/nutch_1_4/data/seed/
2) GENERATE:
/opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/
/opt/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays 26
3) FETCH:
/opt/nutch_1_4/bin/nutch fetch
/opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15
4) PARSE:
/opt/nutch_1_4/bin/nutch parse
/opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15
5) UPDATE DB:
/opt/nutch_1_4/bin/nutch updatedb /opt/nutch_1_4/data/crawl/crawldb/
/opt/nutch_1_4/data/crawl/segments/20120106152527 -normalize -filter
Repeat steps 2 to 5 another 4 times, then:
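Note that each GENERATE round writes a new timestamped directory under segments/, so each repeat of steps 3 to 5 has to target the newest one rather than the hard-coded 20120106152527 shown above. A minimal sketch of picking it (using a mktemp directory as a stand-in for /opt/nutch_1_4/data/crawl/segments, and two made-up timestamps):

```shell
# Temp dir stands in for /opt/nutch_1_4/data/crawl/segments.
SEGMENTS=$(mktemp -d)
mkdir "$SEGMENTS/20120106152527" "$SEGMENTS/20120107093000"
# Segment names are timestamps, so they sort lexicographically;
# the last entry is the segment the latest GENERATE just created.
LATEST=$(ls "$SEGMENTS" | sort | tail -n 1)
echo "fetch/parse/updatedb target: $SEGMENTS/$LATEST"
```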
6) MERGE SEGMENTS:
/opt/nutch_1_4/bin/nutch mergesegs
/opt/nutch_1_4/data/crawl/MERGEDsegments/ -dir
/opt/nutch_1_4/data/crawl/segments/ -filter -normalize
Interestingly, this prints out:
"SegmentMerger: using segment data from: crawl_generate crawl_fetch
crawl_parse parse_data parse_text"
The MERGEDsegments directory then contains just two of the five parts
listed in that output: crawl_generate and crawl_fetch.
(We then delete the contents of the segments directory and copy the
MERGEDsegments results into it.)
Lastly, we run invertlinks after the segment merge:
7) INVERT LINKS:
/opt/nutch_1_4/bin/nutch invertlinks /opt/nutch_1_4/data/crawl/linkdb/
-dir /opt/nutch_1_4/data/crawl/segments/
This produces:
"LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does
not exist:
file:/opt/nutch_1_4/data/crawl/segments/20120106152527/parse_data"
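Before running invertlinks I can confirm which parts each segment is missing with a quick directory scan. A sketch of that check (the temp layout here just mirrors the reported post-merge state, with only crawl_generate and crawl_fetch present, standing in for /opt/nutch_1_4/data/crawl/segments):

```shell
# Temp layout mirroring the post-merge state described above.
SEGMENTS=$(mktemp -d)
mkdir -p "$SEGMENTS/20120106152527/crawl_generate" \
         "$SEGMENTS/20120106152527/crawl_fetch"
MISSING=0
for seg in "$SEGMENTS"/*; do
  echo "$seg contains: $(ls "$seg" | tr '\n' ' ')"
  # invertlinks reads parse_data; list every absent parse-related part.
  for part in crawl_parse parse_data parse_text; do
    [ -d "$seg/$part" ] || { echo "  missing: $part"; MISSING=$((MISSING + 1)); }
  done
done
echo "missing parts: $MISSING"
```

Run against the real segments directory, this shows the same absence of parse_data that the InvalidInputException is complaining about.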