CrawlDB Directory Structure

Iain Lopata Mon, 18 Nov 2013 11:06:40 -0800

I suspect that this is a simple problem, but I have not been able to figure
it out.  Any help sincerely appreciated.


 

I am using Nutch 1.6 on a single Ubuntu machine to crawl about 100 domains.
Each domain requires manual configuration and testing of various plugin
parameters (regex-urlfilter etc.), so I am developing one domain at a time
and testing a small sample of pages, but will eventually want to run a
single crawl across all of the domains.

 

I am taking an incremental approach: I create a separate seed file for each
domain and then run bin/crawl on that single domain. When I am satisfied
with the results, I use mergedb to merge the crawldb from the single domain
into a master crawldb.  I plan to keep crawling from the master crawldb
once I have configured and tested each individual domain.

 

Before I got too far I wanted to test that I could in fact run a crawl using
the master crawldb.  However it fails with:

 

java.io.FileNotFoundException: File
file:/usr/share/apache-nutch-1.6/runtime/local/Crawls/Master/crawldb/current/crawldb-merge-947753248/data
does not exist.

 

The data file identified in this error message is in fact located at
Crawls/Master/crawldb/current/crawldb-merge-947753248/part-00000/data -- and
all the other data directories in this crawldb are also under a part-00000
directory.
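
The mismatch described above can be checked with a short script. This is a minimal sketch in Python and not part of Nutch. It assumes only what the error message implies, namely that a data file is looked up directly inside each child of crawldb/current. The function name inspect_crawldb is hypothetical.

```python
from pathlib import Path


def inspect_crawldb(crawldb_dir):
    """Report, for each child directory of <crawldb_dir>/current, where its
    `data` file sits: directly inside it, one level deeper, or nowhere.

    Assumption (taken from the error message, not from Nutch documentation):
    the crawl looks for <crawldb_dir>/current/<child>/data.
    """
    current = Path(crawldb_dir) / "current"
    report = {}
    for child in sorted(p for p in current.iterdir() if p.is_dir()):
        if (child / "data").is_file():
            # Layout matches the path the error message asked for.
            report[child.name] = "direct"
            continue
        # Look exactly one level deeper, e.g. <child>/part-00000/data.
        nested = sorted(
            p.relative_to(child).as_posix()
            for p in child.glob("*/data")
            if p.is_file()
        )
        if nested:
            report[child.name] = "nested: " + ", ".join(nested)
        else:
            report[child.name] = "missing"
    return report


if __name__ == "__main__":
    import sys

    # Usage: python inspect_crawldb.py Crawls/Master/crawldb
    for name, status in inspect_crawldb(sys.argv[1]).items():
        print(f"{name}: {status}")
```

Run against Crawls/Master/crawldb with the layout described above, the entry for crawldb-merge-947753248 would be reported as nested under part-00000/data rather than direct.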

 

Could someone explain how to get bin/crawl to construct the required path?
Or is there something I should have done earlier to avoid creation of the
part-00000 component of the directory structure?

 

Thanks!
