Re: Removing urls from crawl db

Bai Shen Tue, 01 Nov 2011 13:51:01 -0700

It seems like there would be a better way to do that.

I thought 1.4 was going to have a Luke-style capability with regard to its
data?


On Tue, Nov 1, 2011 at 4:45 PM, Markus Jelsma <[email protected]> wrote:

>
> > I think you must add a regex to regex-urlfilter.txt. In that case those
> > URLs will not be fetched by the fetcher.
>
> Yes, but if you use it when doing updatedb, the matching URLs will
> disappear from the crawldb entirely.
>
> >
> >
> > -----Original Message-----
> > From: Bai Shen <[email protected]>
> > To: user <[email protected]>
> > Sent: Tue, Nov 1, 2011 10:35 am
> > Subject: Re: Removing urls from crawl db
> >
> >
> > Already did that.  But it doesn't allow me to delete urls from the list to
> > be crawled.
> >
> > On Tue, Nov 1, 2011 at 5:56 AM, Ferdy Galema <[email protected]> wrote:
> > > As for reading the crawldb, you can use
> > > org.apache.nutch.crawl.CrawlDbReader. This allows for dumping the
> > > crawldb into a readable textfile as well as querying individual urls.
> > > Run without args to see its usage.
> > >
> > > On 10/31/2011 08:47 PM, Markus Jelsma wrote:
> > >> Hi
> > >>
> > >> Write a regex URL filter and use it the next time you update the db;
> > >> the matching URLs will disappear. Be sure to back up the db first in
> > >> case your regex catches valid URLs. Nutch 1.5 will have an option to
> > >> keep the previous version of the DB after update.
> > >>
> > >> cheers
> > >>
> > >>> We accidentally injected some urls into the crawl database and I need
> > >>> to go remove them.  From what I understand, in 1.4 I can view and
> > >>> modify the urls and indexes.  But I can't seem to find any information
> > >>> on how to do this.
> > >>>
> > >>> Is there anything regarding this available?
>
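
For anyone reading this thread later: the regex URL filter approach suggested
above can be illustrated with a small stand-alone sketch. This is a Python
stand-in, not Nutch's Java code. It assumes the regex-urlfilter.txt
conventions as I understand them: each rule is a '+' (accept) or '-' (reject)
followed by a regex, the first rule whose regex is found in the URL decides,
and a URL that matches no rule is dropped. The host bad.example.org and the
function names are placeholders made up for the example.

```python
import re

# Hypothetical rules in the style of regex-urlfilter.txt.
# bad.example.org stands in for the accidentally injected URLs.
RULES_TEXT = r"""
# reject the accidentally injected host
-^https?://bad\.example\.org/
# accept anything else
+.
"""


def parse_rules(text):
    """Turn rule lines into (accept, compiled_regex) pairs, in file order."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First matching rule wins; a URL matching no rule is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


def filter_urls(rules, urls):
    """Keep only the URLs the rule set accepts."""
    return [u for u in urls if accepts(rules, u)]
```

Trying the rules on a handful of known-good URLs with filter_urls() before the
next update is a cheap way to follow the advice about regexes that catch valid
URLs; backing up the crawldb first is still worthwhile.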
