If I want to remove example.org from my CrawlDB using regex filters, I'll add:

-^http://example\.org/

and run updatedb with filtering enabled. The URLs will then be deleted.

On Thursday 10 November 2011 16:36:24 Bai Shen wrote:
> Can you give me an example of how I would set my URL filter to do this?
> Right now I'm just using the default.
>
> On Mon, Oct 31, 2011 at 3:47 PM, Markus Jelsma <[email protected]> wrote:
> > Hi
> >
> > Write a regex URL filter and use it the next time you update the db; it
> > will disappear. Be sure to back up the db first in case your regex
> > catches valid URLs. Nutch 1.5 will have an option to keep the previous
> > version of the DB after update.
> >
> > cheers
> >
> > > We accidentally injected some urls into the crawl database and I need
> > > to go remove them. From what I understand, in 1.4 I can view and
> > > modify the urls and indexes. But I can't seem to find any information
> > > on how to do this.
> > >
> > > Is there anything regarding this available?

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
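For anyone unsure how the filter line behaves: below is a rough Python sketch of the first-match semantics that Nutch's regex URL filter uses (rules are tried in order; a `-` rule rejects the URL, a `+` rule accepts it, and a URL matching no rule is rejected). The function name `apply_filters` is my own for illustration, not a Nutch API.

```python
import re

def apply_filters(url, rules):
    """Sketch of regex-urlfilter semantics: first matching rule decides.
    '-' rejects the URL, '+' accepts it; no match means reject."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return url if sign == '+' else None
    return None

rules = [
    r"-^http://example\.org/",  # drop everything under example.org
    r"+.",                      # catch-all: keep any other URL
]

print(apply_filters("http://example.org/page", rules))  # rejected
print(apply_filters("http://example.com/page", rules))  # kept
```

With the `-^http://example\.org/` line added to `conf/regex-urlfilter.txt` (above the catch-all `+.` rule), running the CrawlDb update with filtering enabled, e.g. `bin/nutch updatedb <crawldb> <segments> -filter` in Nutch 1.x, applies the filter to existing entries and drops the matching URLs.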

