Re: parse data directory not found after merge

Markus Jelsma Wed, 11 Jan 2012 03:33:46 -0800

There is no zip. Anyway, I just did three fetch and parse cycles of 
nutch.apache.org with trunk. Trunk has no changes concerning segments etc. with 
regard to 1.4. I injected nutch.apache.org, and after the first fetch did two 
more with -topN 4, so I got 9 pages in three segments. I also configured the 
crawl to stay within the domain.


CrawlDb statistics start: crawl/crawldb/
Statistics for CrawlDb: crawl/crawldb/
TOTAL urls:     28
retry 0:        28
min score:      0.0010
avg score:      0.080714285
max score:      1.588
status 1 (db_unfetched):        19
status 2 (db_fetched):  9
CrawlDb statistics: done

crawl/segments/20120111122321/:
total 24
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:23 content
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:23 crawl_fetch
drwxr-xr-x 2 markus markus 4096 2012-01-11 12:23 crawl_generate
drwxr-xr-x 2 markus markus 4096 2012-01-11 12:23 crawl_parse
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:23 parse_data
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:23 parse_text

crawl/segments/20120111122438/:
total 24
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:25 content
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:25 crawl_fetch
drwxr-xr-x 2 markus markus 4096 2012-01-11 12:24 crawl_generate
drwxr-xr-x 2 markus markus 4096 2012-01-11 12:25 crawl_parse
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:25 parse_data
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:25 parse_text

crawl/segments/20120111122539/:
total 24
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:26 content
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:26 crawl_fetch
drwxr-xr-x 2 markus markus 4096 2012-01-11 12:25 crawl_generate
drwxr-xr-x 2 markus markus 4096 2012-01-11 12:26 crawl_parse
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:26 parse_data
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:26 parse_text


Let's merge the three segments into one:
$ bin/nutch mergesegs merged_segment -dir crawl/segments/
Merging 3 segments to merged_segment/20120111122826
SegmentMerger:   adding file:/PATH/crawl/segments/20120111122539
SegmentMerger:   adding file:/PATH/crawl/segments/20120111122438
SegmentMerger:   adding file:/PATH/crawl/segments/20120111122321
SegmentMerger: using segment data from: content crawl_generate crawl_fetch 
crawl_parse parse_data parse_text

... it takes a while but finishes. Then I've got this:

$ ls merged_segment/20120111122826/
content  crawl_fetch  crawl_generate  crawl_parse  parse_data  parse_text

I don't see the problem but this should reproduce your problem as your steps 
are not really different from mine. Is it still the parse_data directory that 
is missing?
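
To check every segment the same way instead of eyeballing ls output, something 
like the following could be used. It is only a minimal sketch: check_segment is 
a hypothetical helper name, not part of Nutch, and it assumes a complete segment 
is one containing the six subdirectories listed above.

```shell
#!/bin/sh
# Report which of the six segment parts are missing from a segment directory.
# check_segment is a hypothetical helper, not a Nutch command.
check_segment() {
    seg=$1
    missing=""
    for part in content crawl_fetch crawl_generate crawl_parse parse_data parse_text; do
        # Collect the name of every expected subdirectory that is absent.
        if [ ! -d "$seg/$part" ]; then
            missing="$missing $part"
        fi
    done
    if [ -n "$missing" ]; then
        echo "$seg: missing$missing"
        return 1
    fi
    echo "$seg: complete"
    return 0
}
```

Calling check_segment on each directory under crawl/segments/ and on 
merged_segment/20120111122826 would show whether parse_data is already absent 
before the merge or only afterwards.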

Why are you merging anyway? It is not mandatory at all.


On Wednesday 11 January 2012 12:09:57 Dean Pullen wrote:
> A fresh Nutch 1.4/Hadoop 0.20.2 crawling nutch.apache.org does the same
> thing.
> 
> I've zipped up the nutch/hadoop dir with all config etc, would either of
> you (Markus/Lewis) care to look at it?
> 
> Any help at this stage would be immensely appreciated.
> 
> Regards,
> 
> Dean.

-- 
Markus Jelsma - CTO - Openindex
