Re: URL filtering and normalization

Bai Shen Tue, 22 May 2012 11:27:37 -0700

On Tue, May 22, 2012 at 2:00 PM, Markus Jelsma
<[email protected]> wrote:


>
> -----Original message-----
> > From:Bai Shen <[email protected]>
> > Sent: Tue 22-May-2012 19:40
> > To: [email protected]
> > Subject: URL filtering and normalization
> >
> > Somehow my crawler started fetching youtube.  I'm not really sure why as I
> > have db.ignore.external.links set to true.
>
> Weird!
>

That's what I said. :)


> >
> > I've since added the following line to my regex-urlfilter.txt file.
> >
> > -^http://www\.youtube\.com/
>
> For domain filtering you should use the domain-urlfilter or
> domain-blacklistfilter. It's faster and easier to maintain.
>

Do I put the same regex in there?  How do I ensure that it's run?
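
For reference, here is a rough Python sketch of how a rule like the one I added behaves. It assumes the regex filter applies rules in order, lets the first matching rule decide, and rejects URLs that match no rule; that is my understanding of the file format, not something I have checked in the source. The catch-all "+." rule and the sample URLs are made up for illustration.

```python
import re

# Rules in the style of regex-urlfilter.txt: '-' rejects, '+' accepts.
# The first rule is the one from this thread; the catch-all is an assumption.
RULES = [
    ("-", re.compile(r"^http://www\.youtube\.com/")),
    ("+", re.compile(r".")),
]


def accepts(url, rules=RULES):
    """Return True if the first matching rule is '+'; reject if none match."""
    for sign, pattern in rules:
        if pattern.search(url):
            return sign == "+"
    return False


if __name__ == "__main__":
    samples = [
        "http://www.youtube.com/watch?v=abc",
        "https://www.youtube.com/watch?v=abc",
        "http://youtube.com/watch?v=abc",
        "http://m.youtube.com/watch?v=abc",
    ]
    for url in samples:
        print(url, accepts(url))
```

If that model is right, the rule only catches URLs that start with exactly http://www.youtube.com/, so https URLs and other host names such as youtube.com or m.youtube.com would still pass.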


>
> >
> > However, I'm still seeing youtube urls in the fetch logs.  I'm using the
> > -noFilter and -noNorm options with generate.  I'm also not using the
> > -filter and -normalize options for updatedb.
>
> You must either filter out all YT records from the CrawlDB or filter
> during generating.
>
>
I'm not sure what you mean.  In the thread I linked below, you said that 1.4
still filters even when using the options I listed.  Is that not the case?
What is the best and fastest way to filter and normalize my URLs?
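
To make sure I understand what "filter out all YT records" would mean, here is a small Python sketch of matching by domain rather than by regex. This is only an illustration of the idea, not Nutch's domain filter code; the names BLOCKED_DOMAINS, is_blocked and filter_urls are hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical blacklist: plain domain names, no regular expressions.
BLOCKED_DOMAINS = {"youtube.com"}


def is_blocked(url, blocked=BLOCKED_DOMAINS):
    """True if the URL's host equals a blocked domain or is a subdomain of it."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in blocked)


def filter_urls(urls, blocked=BLOCKED_DOMAINS):
    """Keep only the URLs whose host is not covered by the blacklist."""
    return [u for u in urls if not is_blocked(u, blocked)]
```

Matching on the host this way covers http and https as well as subdomains, which a single anchored regex does not.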

>
> > According to Markus in this thread, the normalization and filtering should
> > still occur even when using the above options and using 1.4
> >
> >
> > http://lucene.472066.n3.nabble.com/Re-Re-generate-update-times-and-crawldb-size-td3564078.html
> >
> >
> > Is there a setting I'm missing?  I'm not seeing anything in the logs
> > regarding this.
> >
> > Thanks.
> >
>
