On Tue, May 22, 2012 at 2:00 PM, Markus Jelsma <[email protected]> wrote:
>
> -----Original message-----
> > From: Bai Shen <[email protected]>
> > Sent: Tue 22-May-2012 19:40
> > To: [email protected]
> > Subject: URL filtering and normalization
> >
> > Somehow my crawler started fetching youtube. I'm not really sure why as I
> > have db.ignore.external.links set to true.
>
> Weird!
>

That's what I said. :)

> >
> > I've since added the following line to my regex-urlfilter.txt file.
> >
> > -^http://www\.youtube\.com/
>
> For domain filtering you should use the domain-urlfilter or
> domain-blacklistfilter. It's faster and easier to maintain.
>

Do I put the same regex in there? How do I ensure that it's run?

> >
> > However, I'm still seeing youtube urls in the fetch logs. I'm using the
> > -noFilter and -noNorm options with generate. I'm also not using the
> > -filter and -normalize options for updatedb.
>
> You must either filter out all YT records from the CrawlDB or filter
> during generating.
>

I'm not sure what you mean. In the link I posted below you said that it
filters in 1.4 even when using the options I listed. Is that not the case?
What is the best and fastest way to filter and normalize my urls?

> > According to Markus in this thread, the normalization and filtering should
> > still occur even when using the above options and using 1.4
> >
> > http://lucene.472066.n3.nabble.com/Re-Re-generate-update-times-and-crawldb-size-td3564078.html
> >
> > Is there a setting I'm missing? I'm not seeing anything in the logs
> > regarding this.
> >
> > Thanks.
> >
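
For reference, here is a rough sketch of what the regex rule quoted above would and would not reject, if the filter were actually applied. It is plain Python, not Nutch code. It assumes rules are tried top to bottom, the first pattern found anywhere in the URL decides, and a URL matching no rule is rejected; that is my understanding of regex-urlfilter, not something confirmed in this thread. The RULES list, the catch-all "+." rule, and the accept() helper are made up for illustration.

```python
import re

# Hypothetical stand-in for regex-urlfilter.txt: (sign, pattern) pairs, in file order.
RULES = [
    ("-", re.compile(r"^http://www\.youtube\.com/")),  # the rule quoted above
    ("+", re.compile(r".")),                           # assumed catch-all accept rule
]

def accept(url, rules=RULES):
    """Return True if the first rule whose pattern is found in url is a '+' rule."""
    for sign, pattern in rules:
        if pattern.search(url):
            return sign == "+"
    # Assumption: a URL that matches no rule at all is rejected.
    return False

if __name__ == "__main__":
    for url in (
        "http://www.youtube.com/watch?v=abc",
        "http://youtube.com/watch?v=abc",
        "https://www.youtube.com/watch?v=abc",
    ):
        print(url, "accepted" if accept(url) else "rejected")
```

Under those assumptions only the first URL is rejected. The rule is anchored to http://www.youtube.com/, so other hosts and schemes (youtube.com without www, https) would still pass, and none of it matters if filtering is skipped at generate time.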

