There is no zip. Anyway, I just did three fetch and parse cycles of nutch.apache.org with trunk. Trunk has no changes concerning segments etc. compared with 1.4. I injected nutch.apache.org and then did two fetches of -topN 4 pages, so I got 9 pages in three segments. I also configured it to stay within the domain.

CrawlDb statistics start: crawl/crawldb/
Statistics for CrawlDb: crawl/crawldb/
TOTAL urls: 28
retry 0: 28
min score: 0.0010
avg score: 0.080714285
max score: 1.588
status 1 (db_unfetched): 19
status 2 (db_fetched): 9
CrawlDb statistics: done

crawl/segments/20120111122321/:
total 24
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:23 content
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:23 crawl_fetch
drwxr-xr-x 2 markus markus 4096 2012-01-11 12:23 crawl_generate
drwxr-xr-x 2 markus markus 4096 2012-01-11 12:23 crawl_parse
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:23 parse_data
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:23 parse_text

crawl/segments/20120111122438/:
total 24
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:25 content
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:25 crawl_fetch
drwxr-xr-x 2 markus markus 4096 2012-01-11 12:24 crawl_generate
drwxr-xr-x 2 markus markus 4096 2012-01-11 12:25 crawl_parse
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:25 parse_data
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:25 parse_text

crawl/segments/20120111122539/:
total 24
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:26 content
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:26 crawl_fetch
drwxr-xr-x 2 markus markus 4096 2012-01-11 12:25 crawl_generate
drwxr-xr-x 2 markus markus 4096 2012-01-11 12:26 crawl_parse
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:26 parse_data
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:26 parse_text

Let's merge the three segments into one:

$ bin/nutch mergesegs merged_segment -dir crawl/segments/
Merging 3 segments to merged_segment/20120111122826
SegmentMerger: adding file:/PATH/crawl/segments/20120111122539
SegmentMerger: adding file:/PATH/crawl/segments/20120111122438
SegmentMerger: adding file:/PATH/crawl/segments/20120111122321
SegmentMerger: using segment data from: content crawl_generate crawl_fetch crawl_parse parse_data parse_text

.. it takes a while but finishes.
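
Once a merge like the one above has finished, one quick way to confirm the result is complete is to check for the six parts named in the SegmentMerger output. The following is only a minimal sketch: check_segment is a hypothetical helper written for this message, not a Nutch command, and it assumes a segment on the local filesystem rather than on HDFS.

```shell
#!/bin/sh
# check_segment is a hypothetical helper, not part of Nutch.
# It reports which of the six expected segment parts are missing
# from a segment directory on the local filesystem.
check_segment() {
    seg="$1"
    missing=""
    # Part names taken from the "using segment data from" line above.
    for part in content crawl_fetch crawl_generate crawl_parse parse_data parse_text; do
        if [ ! -d "$seg/$part" ]; then
            missing="$missing $part"
        fi
    done
    if [ -n "$missing" ]; then
        echo "missing:$missing"
        return 1
    fi
    echo "complete"
}
```

Calling check_segment merged_segment/20120111122826 would print either "complete" or the names of the missing parts, which makes it easy to compare a merged segment against the segments it was built from.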

Then I've got this:

$ ls merged_segment/20120111122826/
content  crawl_fetch  crawl_generate  crawl_parse  parse_data  parse_text

I don't see the problem, but this should reproduce your problem, as your steps are not really different from mine. Is it still the parse_data directory that is missing?

Why are you merging anyway? It is not mandatory at all.

On Wednesday 11 January 2012 12:09:57 Dean Pullen wrote:
> A fresh Nutch 1.4/Hadoop 0.20.2 crawling nutch.apache.org does the same
> thing.
>
> I've zipped up the nutch/hadoop dir with all config etc, would either of
> you (Markus/Lewis) care to look at it?
>
> Any help at this stage would be immensely appreciated.
>
> Regards,
>
> Dean.

-- 
Markus Jelsma - CTO - Openindex

