No problem Lewis, I appreciate you looking into it.
Firstly, I have a seed URL XML document here:
http://www.ukcigarforums.com/injectlist.xml
It contains 'http://www.ukcigarforums.com/content.php' as one of its
URLs.
Nutch's regex-urlfilter.txt contains this:
# allow urls in ukcigarforums.com domain
+http://([a-z0-9\-A-Z]*\.)*ukcigarforums.com/
# deny anything else
-.
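Just to sanity-check those two rules, here is a rough approximation using grep -E (Nutch's regex-urlfilter plugin uses Java regex, but these particular patterns behave the same under both engines; the `check` helper is mine, not part of Nutch):

```shell
# Approximate the allow rule from regex-urlfilter.txt with grep -E.
ALLOW='http://([a-zA-Z0-9-]*\.)*ukcigarforums.com/'
# Echo ACCEPT if the URL matches the allow rule, REJECT otherwise
# (the "-." deny-all rule catches everything the allow rule misses).
check() { echo "$1" | grep -Eq "$ALLOW" && echo ACCEPT || echo REJECT; }
check "http://www.ukcigarforums.com/content.php"
check "http://example.com/content.php"
```

So the seed URL itself passes the filter, which rules out the URL filter as the cause of what follows.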
Here's the procedure:
1) INJECT:
/opt/nutch_1_4/bin/nutch inject /opt/nutch_1_4/data/crawl/crawldb/
/opt/nutch_1_4/data/seed/
2) GENERATE:
/opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/
/opt/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays 26
3) FETCH:
/opt/nutch_1_4/bin/nutch fetch
/opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15
4) PARSE:
/opt/nutch_1_4/bin/nutch parse
/opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15
5) UPDATE DB:
/opt/nutch_1_4/bin/nutch updatedb /opt/nutch_1_4/data/crawl/crawldb/
/opt/nutch_1_4/data/crawl/segments/20120106152527 -normalize -filter
Repeat steps 2 to 5 another 4 times, then:
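Note that each GENERATE round writes a new timestamped directory under segments/, so each repeat of steps 3 to 5 has to target the newest one rather than the hard-coded 20120106152527 shown above. A minimal sketch of picking it (using a mktemp directory as a stand-in for /opt/nutch_1_4/data/crawl/segments, and two made-up timestamps):

```shell
# Temp dir stands in for /opt/nutch_1_4/data/crawl/segments.
SEGMENTS=$(mktemp -d)
mkdir "$SEGMENTS/20120106152527" "$SEGMENTS/20120107093000"
# Segment names are timestamps, so they sort lexicographically;
# the last entry is the segment the latest GENERATE just created.
LATEST=$(ls "$SEGMENTS" | sort | tail -n 1)
echo "fetch/parse/updatedb target: $SEGMENTS/$LATEST"
```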
6) MERGE SEGMENTS:
/opt/nutch_1_4/bin/nutch mergesegs
/opt/nutch_1_4/data/crawl/MERGEDsegments/ -dir
/opt/nutch_1_4/data/crawl/segments/ -filter -normalize
Interestingly, this prints out:
"SegmentMerger: using segment data from: crawl_generate crawl_fetch
crawl_parse parse_data parse_text"
The MERGEDsegments directory then contains just two of the five parts
listed in that output: crawl_generate and crawl_fetch.
(We then delete the contents of the segments directory and copy the
MERGEDsegments results into it.)
Lastly, we run invertlinks after the segment merge:
7) INVERT LINKS:
/opt/nutch_1_4/bin/nutch invertlinks /opt/nutch_1_4/data/crawl/linkdb/
-dir /opt/nutch_1_4/data/crawl/segments/
This produces:
"LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does
not exist:
file:/opt/nutch_1_4/data/crawl/segments/20120106152527/parse_data"
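Before running invertlinks I can confirm which parts each segment is missing with a quick directory scan. A sketch of that check (the temp layout here just mirrors the reported post-merge state, with only crawl_generate and crawl_fetch present, standing in for /opt/nutch_1_4/data/crawl/segments):

```shell
# Temp layout mirroring the post-merge state described above.
SEGMENTS=$(mktemp -d)
mkdir -p "$SEGMENTS/20120106152527/crawl_generate" \
         "$SEGMENTS/20120106152527/crawl_fetch"
MISSING=0
for seg in "$SEGMENTS"/*; do
  echo "$seg contains: $(ls "$seg" | tr '\n' ' ')"
  # invertlinks reads parse_data; list every absent parse-related part.
  for part in crawl_parse parse_data parse_text; do
    [ -d "$seg/$part" ] || { echo "  missing: $part"; MISSING=$((MISSING + 1)); }
  done
done
echo "missing parts: $MISSING"
```

Run against the real segments directory, this shows the same absence of parse_data that the InvalidInputException is complaining about.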