I think you need to add a rule to regex-urlfilter.txt. Those urls will then no longer be fetched by the fetcher.
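For illustration, an exclusion rule in conf/regex-urlfilter.txt might look like the sketch below. The domain is hypothetical; rules are evaluated top-down and the first matching line wins, so exclusions should come before the catch-all accept rule.

```
# Hypothetical example: skip everything from the accidentally injected host
-^https?://bad\.example\.com/

# Accept anything else (the default catch-all usually sits at the end of the file)
+.
```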
-----Original Message-----
From: Bai Shen <[email protected]>
To: user <[email protected]>
Sent: Tue, Nov 1, 2011 10:35 am
Subject: Re: Removing urls from crawl db

Already did that. But it doesn't allow me to delete urls from the list to be
crawled.

On Tue, Nov 1, 2011 at 5:56 AM, Ferdy Galema <[email protected]> wrote:

> As for reading the crawldb, you can use org.apache.nutch.crawl.CrawlDbReader.
> This allows for dumping the crawldb into a readable text file as well as
> querying individual urls. Run it without args to see its usage.
>
> On 10/31/2011 08:47 PM, Markus Jelsma wrote:
>
>> Hi
>>
>> Write a regex URL filter and use it the next time you update the db; it
>> will disappear. Be sure to back up the db first in case your regex
>> catches valid URLs. Nutch 1.5 will have an option to keep the previous
>> version of the DB after update.
>>
>> cheers
>>
>>> We accidentally injected some urls into the crawl database and I need
>>> to go remove them. From what I understand, in 1.4 I can view and modify
>>> the urls and indexes. But I can't seem to find any information on how
>>> to do this.
>>>
>>> Is there anything regarding this available?
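A sketch of the commands involved, assuming a Nutch 1.x layout with the crawldb at crawl/crawldb and segments under crawl/segments (all paths here are example assumptions; CrawlDbReader is exposed through the readdb command):

```shell
# Dump the whole crawldb to a readable text directory for inspection
bin/nutch readdb crawl/crawldb -dump crawldb_dump

# Or look up the status of a single url
bin/nutch readdb crawl/crawldb -url http://example.com/page

# Re-run updatedb with -filter so the configured URL filters
# (including the new regex-urlfilter.txt rule) are applied to the db
bin/nutch updatedb crawl/crawldb crawl/segments/20111101000000 -filter
```

Back up the crawldb directory before running updatedb with -filter, as Markus notes below, since a too-broad regex will silently drop valid urls.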

