That was exactly my problem. Thank you. Do you think an update to the documentation at http://wiki.apache.org/nutch/bin/nutch%20mergedb would be worthwhile to explain this?
-----Original Message-----
From: feng lu [mailto:[email protected]]
Sent: Wednesday, November 20, 2013 8:54 AM
To: [email protected]
Subject: Re: CrawlDB Directory Structure

Hi,

Did you use a command like the one below to merge the crawldb, with the same crawldb directory given as both the output crawldb and one of the input crawldbs?

    bin/nutch mergedb crawldb crawldb/ crawldb_test1/ crawldb_test2 ...

If so, it will create a temporary crawldb directory (crawldb-merge-947753248) inside your output crawldb directory.

You can fix this issue by not passing the same crawldb directory as both the output crawldb and an input crawldb, like this:

    bin/nutch mergedb newCrawldb crawldb crawldb_domain1 crawldb_domain2
    rm -rf crawldb
    mv newCrawldb crawldb

Or you can apply my patch: https://issues.apache.org/jira/browse/NUTCH-1670

On Tue, Nov 19, 2013 at 3:05 AM, Iain Lopata <[email protected]> wrote:

> I suspect that this is a simple problem, but I have not been able to
> figure it out. Any help sincerely appreciated.
>
> I am using Nutch 1.6 on a single Ubuntu machine and using it to crawl
> about 100 domains. Each domain requires manual configuration and
> testing of various plugin parameters (regex-urlfilter etc.), so I am
> developing one domain at a time and testing a small sample of pages,
> but will eventually want to run a crawl across domains.
>
> I am taking an incremental approach and created separate seed files
> for each domain, and then use bin/crawl on that single domain. When I
> am satisfied with the results, I use mergedb to merge the crawldb from
> the single domain into a master crawldb. I would plan to keep
> crawling using the master crawldb once I have configured and tested
> each individual domain.
>
> Before I got too far I wanted to test that I could in fact run a crawl
> using the master crawldb. However it fails with:
>
> java.io.FileNotFoundException: File
> file:/usr/share/apache-nutch-1.6/runtime/local/Crawls/Master/crawldb/current/crawldb-merge-947753248/data
> does not exist.
>
> The data file identified in this error message is in fact located at
> Crawls/Master/crawldb/current/crawldb-merge-947753248/part-00000/data
> -- and all the other data directories in this crawldb are also under a
> part-00000 directory.
>
> Could someone explain how to get bin/crawl to construct the required
> path? Or is there something I should have done earlier to avoid
> creation of the part-00000 component of the directory structure?
>
> Thanks!

--
Don't Grow Old, Grow Up... :-)
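
For anyone who hits the same error, the workaround above (merge into a fresh output directory, then swap it into place) could be wrapped in a small shell function. This is only a sketch under stated assumptions: `safe_mergedb`, the `NUTCH` variable and the `.new` suffix are names made up here for illustration. Only the `bin/nutch mergedb <output> <input> ...` form and the `rm -rf` / `mv` swap come from the thread.

```shell
#!/bin/sh
# Sketch: merge per-domain crawldbs into a master crawldb without ever
# passing the master as both the output and an input of mergedb.
# NUTCH, safe_mergedb and the ".new" suffix are illustrative assumptions.

NUTCH="${NUTCH:-bin/nutch}"

# usage: safe_mergedb <master_crawldb> <crawldb_to_merge> [more crawldbs...]
safe_mergedb() {
    master="$1"
    shift
    tmp="${master}.new"

    # Never write into a leftover output directory from an earlier run.
    if [ -e "$tmp" ]; then
        echo "refusing to overwrite existing $tmp" >&2
        return 1
    fi

    # The output ($tmp) is always distinct from every input.
    if [ -d "$master" ]; then
        "$NUTCH" mergedb "$tmp" "$master" "$@" || return 1
    else
        # First merge: there is no master crawldb yet.
        "$NUTCH" mergedb "$tmp" "$@" || return 1
    fi

    # Replace the master only after mergedb has succeeded.
    rm -rf "$master"
    mv "$tmp" "$master"
}
```

It would be called along the lines of `safe_mergedb Crawls/Master/crawldb Crawls/domain1/crawldb`, where the second path is a hypothetical per-domain crawldb. If mergedb fails, the existing master crawldb is left untouched.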

