> It seems like there would be a better way to do that.

The problem is that there are many files storing URLs: CrawlDB, LinkDB, 
WebGraph DBs, segment data. In Nutch 1.x there is no single place where you 
can find a URL.

For example, if we find URL patterns we don't want, we write additional 
filters for them and have to update all DBs again, which can take minutes, 
hours or days depending on size and cluster capacity.
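For reference, such a filter is typically an entry in conf/regex-urlfilter.txt. A minimal sketch follows; the pattern itself is hypothetical, so adapt it to the URLs you actually want to drop:

```
# Hypothetical rule: exclude any URL under an unwanted path.
# Rules are checked in order and the first matching +/- prefix wins.
-^http://www\.example\.com/unwanted/
# Accept everything else.
+.
```

Note that the trailing `+.` accept-all rule matters: without a final accept, URLs that match no rule are rejected.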

> 
> I thought 1.4 was going to have a Luke-style capability in regards to its
> data?

Where did you read that? That is, unfortunately, not the case :)

> 
> On Tue, Nov 1, 2011 at 4:45 PM, Markus Jelsma 
> <[email protected]> wrote:
> > > I think you must add a regex to regex-urlfilter.txt. In that case
> > > those urls will not be fetched by the fetcher.
> > 
> > Yes but if you use it when doing updatedb it will disappear from the
> > crawldb entirely.
> > 
> > > -----Original Message-----
> > > From: Bai Shen <[email protected]>
> > > To: user <[email protected]>
> > > Sent: Tue, Nov 1, 2011 10:35 am
> > > Subject: Re: Removing urls from crawl db
> > > 
> > > 
> > > Already did that.  But it doesn't allow me to delete urls from the list
> > > to be crawled.
> > > 
> > > On Tue, Nov 1, 2011 at 5:56 AM, Ferdy Galema
> > > <[email protected]> wrote:
> > > > As for reading the crawldb, you can use
> > > > org.apache.nutch.crawl.CrawlDbReader. This allows for dumping the
> > > > crawldb into a readable text file as well as querying individual urls.
> > > > Run without args to see its usage.
> > > > 
> > > > On 10/31/2011 08:47 PM, Markus Jelsma wrote:
> > > >> Hi
> > > >> 
> > > >> Write a regex URL filter and use it the next time you update the db;
> > > >> it will disappear. Be sure to back up the db first in case your regex
> > > >> catches valid URLs. Nutch 1.5 will have an option to keep the previous
> > > >> version of the DB after update.
> > > >> 
> > > >> cheers
> > > >> 
> > > >>> We accidentally injected some urls into the crawl database and I
> > > >>> need to go remove them.  From what I understand, in 1.4 I can view
> > > >>> and modify the urls and indexes.  But I can't seem to find any
> > > >>> information on how to do this.
> > > >>> 
> > > >>> Is there anything regarding this available?
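For completeness, a sketch of the 1.x command-line usage for the tools mentioned in this thread. The paths here are hypothetical (adjust to your crawl directory), and you should run each command without args to see the exact options your version supports:

```
# Summary statistics for the crawldb
bin/nutch readdb crawl/crawldb -stats

# Dump the whole crawldb to readable text files in dump_dir
bin/nutch readdb crawl/crawldb -dump dump_dir

# Query a single URL's record
bin/nutch readdb crawl/crawldb -url http://www.example.com/page

# Apply the active URL filters while updating the db, which is what
# makes filtered URLs disappear from the crawldb (back it up first)
bin/nutch updatedb crawl/crawldb crawl/segments/20111101000000 -filter
```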
