That was exactly my problem. Thank you.

Do you think an update to the documentation at
http://wiki.apache.org/nutch/bin/nutch%20mergedb would be worthwhile to
explain this?

-----Original Message-----
From: feng lu [mailto:[email protected]] 
Sent: Wednesday, November 20, 2013 8:54 AM
To: [email protected]
Subject: Re: CrawlDB Directory Structure

Hi

Did you use a command like this to merge the crawldbs, with the same crawldb directory as both the output crawldb and one of the input crawldbs?

bin/nutch mergedb crawldb crawldb/ crawldb_test1/ crawldb_test2 ...

That will create a temporary crawldb directory (crawldb-merge-947753248) inside
your output crawldb directory.

You can fix this issue by not using the same crawldb directory as both the
output crawldb and an input crawldb, like this:

bin/nutch mergedb newCrawldb crawldb crawldb_domain1 crawldb_domain2
rm -rf crawldb
mv newCrawldb crawldb
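The rm/mv swap above can be made a bit safer by refusing to delete the old
crawldb unless the merge actually produced output. A minimal sketch (the
safe_swap helper and the dummy directories are illustrative, not part of
Nutch; run the real bin/nutch mergedb step first to produce newCrawldb):

```shell
#!/bin/sh
# Swap a freshly merged crawldb into place, but only if it contains a
# "current" segment, so a failed merge never destroys the old crawldb.
safe_swap() {
  new="$1"; old="$2"
  [ -d "$new/current" ] || return 1   # refuse to swap a failed/empty merge
  rm -rf "$old"
  mv "$new" "$old"
}

# Demo with dummy directories standing in for a merged crawldb:
mkdir -p newCrawldb/current
mkdir -p crawldb
safe_swap newCrawldb crawldb && echo "swapped"
```

In a real run you would call safe_swap after
`bin/nutch mergedb newCrawldb crawldb crawldb_domain1 crawldb_domain2`.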

Or you can apply my patch to your build:

https://issues.apache.org/jira/browse/NUTCH-1670




On Tue, Nov 19, 2013 at 3:05 AM, Iain Lopata <[email protected]> wrote:

> I suspect that this is a simple problem, but I have not been able to 
> figure it out.  Any help sincerely appreciated.
>
>
>
> I am using Nutch 1.6 on a single Ubuntu machine and using it to crawl 
> about
> 100 domains.  Each domain requires manual configuration and testing of 
> various plugin parameters (regex-urlfilter etc.), so I am developing 
> one domain at a time and testing a small sample of pages, but will 
> eventually want to run a crawl across domains.
>
>
>
> I am taking an incremental approach: I created separate seed files
> for each domain, and then use bin/crawl on that single domain. When I 
> am satisfied with the results, I use mergedb to merge the crawldb from 
> the single domain into a master crawldb.  I would plan to keep 
> crawling using the master crawldb once I have configured and tested 
> each individual domain.
>
>
>
> Before I got too far I wanted to test that I could in fact run a crawl 
> using the master crawldb.  However it fails with:
>
>
>
> java.io.FileNotFoundException: File
>
> file:/usr/share/apache-nutch-1.6/runtime/local/Crawls/Master/crawldb/current/crawldb-merge-947753248/data does not exist.
>
>
>
> The data file identified in this error message is in fact located at 
> Crawls/Master/crawldb/current/crawldb-merge-947753248/part-00000/data 
> -- and all the other data directories in this crawldb are also under a 
> part-00000 directory.
>
>
>
> Could someone explain how to get bin/crawl to construct the required path?
> Or is there something I should have done earlier to avoid creation of 
> the
> part-00000 component of the directory structure?
>
>
>
> Thanks!
>
>
>
>
>
>


--
Don't Grow Old, Grow Up... :-)
