Re: Removing urls from crawl db

Bai Shen Mon, 28 Nov 2011 06:12:30 -0800

It was http://www.sipri.org/yearbook/2011/files/SIPRIYB11summaryNL.pdf IIRC.


On Mon, Nov 21, 2011 at 3:12 PM, Markus Jelsma
<[email protected]> wrote:

> Can you pass me the URL?
>
> > Nothing shows up for me.  It just sits there like it's waiting on something
> > or processing.
> >
> > On Thu, Nov 10, 2011 at 3:30 PM, Markus Jelsma
> > <[email protected]> wrote:
> > > Uh, the filter checker immediately produces output.
> > >
> > > > Interesting.  What kind of output should I expect to see?  So far it's
> > > > been running for a while with no output.
> > > >
> > > > On Thu, Nov 10, 2011 at 1:51 PM, Markus Jelsma
> > > > <[email protected]> wrote:
> > > > > You can use bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
> > > > > to test.
> > > > >
> > > > > > Okay.  So I would just put that above the +. line, right?
> > > > > >
> > > > > > Thanks.
> > > > > >
> > > > > > On Thu, Nov 10, 2011 at 10:42 AM, Markus Jelsma
> > > > > > <[email protected]> wrote:
> > > > > > > If I want to remove example.org from my CrawlDB using regex filters,
> > > > > > > I'll add:
> > > > > > >
> > > > > > > -^http://example\.org/
> > > > > > >
> > > > > > > and run updatedb with filtering enabled. The URLs will then be
> > > > > > > deleted.
> > > > >
> > > > > > > On Thursday 10 November 2011 16:36:24 Bai Shen wrote:
> > > > > > > > Can you give me an example of how I would set my URL filter to
> > > > > > > > do this?  Right now I'm just using the default.
> > > > > > > >
> > > > > > > > On Mon, Oct 31, 2011 at 3:47 PM, Markus Jelsma
> > > > > > > > <[email protected]> wrote:
> > > > > > > > > Hi
> > > > > > > > >
> > > > > > > > > Write a regex URL filter and use it the next time you update
> > > > > > > > > the db; the matching URLs will disappear. Be sure to back up
> > > > > > > > > the db first in case your regex catches valid URLs. Nutch 1.5
> > > > > > > > > will have an option to keep the previous version of the DB
> > > > > > > > > after update.
> > > > > > > > >
> > > > > > > > > cheers
> > > > > > > > >
> > > > > > > > > > We accidentally injected some urls into the crawl database
> > > > > > > > > > and I need to go remove them.  From what I understand, in
> > > > > > > > > > 1.4 I can view and modify the urls and indexes.  But I can't
> > > > > > > > > > seem to find any information on how to do this.
> > > > > > > > > >
> > > > > > > > > > Is there anything regarding this available?
> > > > > > >
> > > > > > > --
> > > > > > > Markus Jelsma - CTO - Openindex
> > > > > > > http://www.linkedin.com/in/markus17
> > > > > > > 050-8536620 / 06-50258350
>
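
For reference, here is a rough sketch of why the exclusion rule has to sit above the "+." line, as discussed in the quoted messages. It is a small Python model, not Nutch code: it assumes the regex URL filter applies rules top to bottom, lets the first matching rule decide, and drops a URL that matches nothing. The RULES list and the accepts() helper are hypothetical names made up for this illustration.

```python
import re

# Hypothetical stand-in for the rules in a regex URL filter file:
# a sign ("+" accept, "-" reject) followed by a regular expression.
RULES = [
    ("-", r"^http://example\.org/"),  # exclusion rule from the thread
    ("+", r"."),                      # default catch-all accept rule
]


def accepts(url, rules=RULES):
    """Return True if the first rule that matches the URL is a '+' rule."""
    for sign, pattern in rules:
        # Assumption: patterns may match anywhere in the URL (search, not fullmatch).
        if re.search(pattern, url):
            return sign == "+"
    # Assumption: a URL that matches no rule at all is dropped.
    return False


if __name__ == "__main__":
    # Exclusion rule first: example.org is rejected, other hosts pass.
    print(accepts("http://example.org/page.html"))
    print(accepts("http://www.sipri.org/yearbook/2011/files/SIPRIYB11summaryNL.pdf"))
    # Catch-all first: the exclusion rule is never reached.
    print(accepts("http://example.org/page.html", rules=list(reversed(RULES))))
```

With the rules in the order shown, the example.org URL is rejected and the sipri.org URL is accepted; with the order reversed, the catch-all wins and example.org is accepted again. On the checker appearing to hang: as far as I recall, URLFilterChecker reads URLs from standard input, one per line, so it would sit waiting until a URL is typed or piped in.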
