Re: CrawlDB Directory Structure

feng lu Wed, 20 Nov 2013 06:55:52 -0800

Hi

Did you merge the crawldbs with a command in which the same crawldb
directory is both the output crawldb and one of the input crawldbs? Like this:


bin/nutch mergedb crawldb crawldb/ crawldb_test1/ crawldb_test2 ...

That will create a temporary crawldb directory (crawldb-merge-947753248)
inside your output crawldb directory.
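
To check whether a crawldb already contains such a leftover directory, something
like the following may help. It is only a sketch: find_stray_merge_dirs is a
hypothetical helper name, and the location it searches
(<crawldb>/current/crawldb-merge-<number>) is assumed from the path in the
error message quoted below.

```shell
# Hypothetical helper: list leftover merge directories inside a crawldb.
# Assumes the layout seen in the error message:
#   <crawldb>/current/crawldb-merge-<number>
find_stray_merge_dirs() {
  crawldb_dir=$1
  # Look only at the direct children of "current"; print nothing if none match.
  find "$crawldb_dir/current" -maxdepth 1 -type d -name 'crawldb-merge-*' 2>/dev/null
}
```

For example, find_stray_merge_dirs Crawls/Master/crawldb would print one line
per stray directory, and nothing if the crawldb is clean.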

You can fix this by not using the same crawldb directory as both the output
crawldb and an input crawldb. Like this:

bin/nutch mergedb newCrawldb crawldb crawldb_domain1 crawldb_domain2
rm -rf crawldb
mv newCrawldb crawldb
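
If you repeat this for every domain, the three steps can be wrapped in a small
function. This is a minimal sketch, not part of Nutch: merge_into, the NUTCH
variable and the ".new" suffix are names made up here, and it assumes the
target crawldb already exists.

```shell
# Sketch: merge into a fresh directory, then swap it into place.
# NUTCH and merge_into are hypothetical names; point NUTCH at your bin/nutch.
NUTCH=${NUTCH:-bin/nutch}

merge_into() {
  target=$1
  # An empty target would make the rm -rf below dangerous.
  [ -n "$target" ] || { echo "usage: merge_into <crawldb> <input>..." >&2; return 1; }
  shift
  tmp_out="${target}.new"
  # Refuse to run if the scratch output is left over from an earlier attempt.
  if [ -e "$tmp_out" ]; then
    echo "refusing to overwrite $tmp_out" >&2
    return 1
  fi
  # The output directory is never one of the inputs.
  "$NUTCH" mergedb "$tmp_out" "$target" "$@" || return 1
  # Replace the old crawldb only after the merge succeeded.
  rm -rf "$target"
  mv "$tmp_out" "$target"
}
```

Usage would be, for example: merge_into crawldb crawldb_domain1 crawldb_domain2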

Or you can apply my patch to your copy of Nutch:

https://issues.apache.org/jira/browse/NUTCH-1670




On Tue, Nov 19, 2013 at 3:05 AM, Iain Lopata <[email protected]> wrote:

> I suspect that this is a simple problem, but I have not been able to figure
> it out.  Any help sincerely appreciated.
>
>
>
> I am using Nutch 1.6 on a single Ubuntu machine and using it to crawl about
> 100 domains.  Each domain requires manual configuration and testing of
> various plugin parameters (regex-urlfilter etc.), so I am developing one
> domain at a time and testing a small sample of pages, but will eventually
> want to run a crawl across domains.
>
>
>
> I am taking an incremental approach and created separate seed files for each
> domain, and then use bin/crawl on that single domain. When I am satisfied
> with the results, I use mergedb to merge the crawldb from the single domain
> into a master crawldb.  I would plan to keep crawling using the master
> crawldb once I have configured and tested each individual domain.
>
>
>
> Before I got too far I wanted to test that I could in fact run a crawl using
> the master crawldb.  However it fails with:
>
>
>
> java.io.FileNotFoundException: File
> file:/usr/share/apache-nutch-1.6/runtime/local/Crawls/Master/crawldb/current/crawldb-merge-947753248/data
> does not exist.
>
>
>
> The data file identified in this error message is in fact located at
> Crawls/Master/crawldb/current/crawldb-merge-947753248/part-00000/data --
> and all the other data directories in this crawldb are also under a
> part-00000 directory.
>
>
>
> Could someone explain how to get bin/crawl to construct the required path?
> Or is there something I should have done earlier to avoid creation of the
> part-00000 component of the directory structure?
>
>
>
> Thanks!
>
>
>
>
>
>


-- 
Don't Grow Old, Grow Up... :-)
