-----Original message-----
> From: Bai Shen <[email protected]>
> Sent: Tue 22-May-2012 19:40
> To: [email protected]
> Subject: URL filtering and normalization
> 
> Somehow my crawler started fetching youtube.  I'm not really sure why as I
> have db.ignore.external.links set to true.

Weird!

> 
> I've since added the following line to my regex-urlfilter.txt file.
> 
> -^http://www\.youtube\.com/

For domain filtering you should use the urlfilter-domain or 
urlfilter-domainblacklist plugin. It's faster and easier to maintain.
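
To illustrate why a single regex rule is fragile for this, here is a minimal
Python sketch (not Nutch code) of how such a rule list might be evaluated. It
assumes first-match-wins semantics and a catch-all "+." rule at the end of the
file; the accepts() helper and the sample URLs are made up for the example.

```python
import re

# Rules in regex-urlfilter.txt style: a leading '-' rejects, '+' accepts.
# Assumption: the first rule whose pattern is found in the URL decides.
RULES = [
    r"-^http://www\.youtube\.com/",  # the rule from the original message
    r"+.",                           # assumed catch-all at the end of the file
]

def accepts(url, rules=RULES):
    """Return True if the first matching rule is a '+' rule."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL that matches no rule is rejected

if __name__ == "__main__":
    for url in [
        "http://www.youtube.com/watch?v=abc",
        "https://www.youtube.com/watch?v=abc",
        "http://youtube.com/watch?v=abc",
        "http://m.youtube.com/watch?v=abc",
    ]:
        print(url, accepts(url))
```

Under these assumptions only the first URL is rejected; the https, bare-domain
and m. variants fall through to the catch-all rule, which is one reason a
domain-level filter is easier to maintain.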

> 
> However, I'm still seeing youtube urls in the fetch logs.  I'm using the
> -noFilter and -noNorm options with generate.  I'm also not using the
> -filter and -normalize options for updatedb.

You must either filter all YouTube records out of the CrawlDB or filter 
during generation, which means running generate without -noFilter.

> 
> According to Markus in this thread, the normalization and filtering should
> still occur even when using the above options and using 1.4
> 
> http://lucene.472066.n3.nabble.com/Re-Re-generate-update-times-and-crawldb-size-td3564078.html
> 
> 
> Is there a setting I'm missing?  I'm not seeing anything in the logs
> regarding this.
> 
> Thanks.
> 
