I think you need to add a regex to regex-urlfilter.txt. URLs excluded by that
filter will not be fetched by the fetcher.
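For example, an exclusion rule like the following in conf/regex-urlfilter.txt would skip the unwanted URLs (the host here is just a placeholder for whatever was injected; rules are Java regexes, evaluated top to bottom, first match wins, `-` excludes and `+` includes):

```
# skip everything under the accidentally injected host (hypothetical example)
-^http://badhost\.example\.com/

# accept anything else (the default catch-all rule)
+.
```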
 

-----Original Message-----
From: Bai Shen <[email protected]>
To: user <[email protected]>
Sent: Tue, Nov 1, 2011 10:35 am
Subject: Re: Removing urls from crawl db


Already did that.  But it doesn't allow me to delete urls from the list to
be crawled.

On Tue, Nov 1, 2011 at 5:56 AM, Ferdy Galema <[email protected]> wrote:

> As for reading the crawldb, you can use org.apache.nutch.crawl.CrawlDbReader.
> This allows for dumping the crawldb into a readable text file as well as
> querying individual urls. Run without args to see its usage.
>
>
> On 10/31/2011 08:47 PM, Markus Jelsma wrote:
>
>> Hi
>>
>> Write a regex URL filter and use it the next time you update the db; the
>> URLs will disappear. Be sure to back up the db first in case your regex
>> catches valid URLs. Nutch 1.5 will have an option to keep the previous
>> version of the DB after update.
>>
>> cheers
>>
>>  We accidentally injected some urls into the crawl database and I need to
>>> go
>>> remove them.  From what I understand, in 1.4 I can view and modify the
>>> urls
>>> and indexes.  But I can't seem to find any information on how to do this.
>>>
>>> Is there anything regarding this available?
>>>
>>
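For what it's worth, the CrawlDbReader that Ferdy mentions is also exposed through the nutch script as the readdb command, so you can inspect the db without writing any Java. A rough sketch (the crawldb path and url are examples; adjust to your own crawl directory):

```
# dump the whole crawldb to a readable text file
bin/nutch readdb crawl/crawldb -dump crawldb-dump

# query the status of a single url
bin/nutch readdb crawl/crawldb -url http://www.example.com/
```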

 
