Hi,

Did you run mergedb with the same crawldb directory as both the output crawldb and one of the input crawldbs? For example:

    bin/nutch mergedb crawldb crawldb/ crawldb_test1/ crawldb_test2 ...

If so, it creates a temporary crawldb directory (crawldb-merge-947753248) inside your output crawldb directory. You can avoid this by not using the same directory as both the output crawldb and an input crawldb:

    bin/nutch mergedb newCrawldb crawldb crawldb_domain1 crawldb_domain2
    rm -rf crawldb
    mv newCrawldb crawldb

Alternatively, you can apply my patch: https://issues.apache.org/jira/browse/NUTCH-1670

On Tue, Nov 19, 2013 at 3:05 AM, Iain Lopata <[email protected]> wrote:

> I suspect that this is a simple problem, but I have not been able to figure
> it out. Any help sincerely appreciated.
>
> I am using Nutch 1.6 on a single Ubuntu machine and using it to crawl about
> 100 domains. Each domain requires manual configuration and testing of
> various plugin parameters (regex-urlfilter etc.), so I am developing one
> domain at a time and testing a small sample of pages, but will eventually
> want to run a crawl across domains.
>
> I am taking an incremental approach and created separate seed files for each
> domain, and then use bin/crawl on that single domain. When I am satisfied
> with the results, I use mergedb to merge the crawldb from the single domain
> into a master crawldb. I would plan to keep crawling using the master
> crawldb once I have configured and tested each individual domain.
>
> Before I got too far I wanted to test that I could in fact run a crawl using
> the master crawldb. However it fails with:
>
> java.io.FileNotFoundException: File
> file:/usr/share/apache-nutch-1.6/runtime/local/Crawls/Master/crawldb/current/crawldb-merge-947753248/data
> does not exist.
>
> The data file identified in this error message is in fact located at
> Crawls/Master/crawldb/current/crawldb-merge-947753248/part-00000/data -- and
> all the other data directories in this crawldb are also under a part-00000
> directory.
>
> Could someone explain how to get bin/crawl to construct the required path?
> Or is there something I should have done earlier to avoid creation of the
> part-00000 component of the directory structure?
>
> Thanks!

--
Don't Grow Old, Grow Up... :-)
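
The merge-then-swap steps suggested in the reply above could be wrapped in a small shell function so that the output directory is never one of the inputs. This is only a sketch under assumptions: the function name merge_crawldbs and the NUTCH variable are made up for illustration (they are not part of Nutch), and it assumes you run it from the Nutch runtime directory so that bin/nutch resolves.

```shell
#!/bin/sh
# Hypothetical override for the Nutch launcher; NUTCH is not a Nutch setting.
NUTCH="${NUTCH:-bin/nutch}"

# merge_crawldbs MASTER INPUT... (hypothetical helper)
# Merges MASTER and the INPUT crawldbs into a fresh directory,
# then replaces MASTER with the merged result.
merge_crawldbs() {
    # Strip a trailing slash so the temporary directory is created
    # next to the master crawldb, never inside it.
    master=${1%/}
    shift
    tmp="${master}_merged_$$"

    # Refuse to run if the temporary output already exists.
    if [ -e "$tmp" ]; then
        echo "refusing to overwrite $tmp" >&2
        return 1
    fi

    # The output ($tmp) is distinct from every input crawldb.
    "$NUTCH" mergedb "$tmp" "$master" "$@" || return 1

    # Swap only after the merge has succeeded.
    rm -rf "$master" && mv "$tmp" "$master"
}
```

For example, `merge_crawldbs crawldb crawldb_domain1 crawldb_domain2` would run the same three commands as above. Because the old master is removed only after mergedb exits successfully, a failed merge leaves the existing crawldb untouched.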

