-----Original message-----
> From:Bai Shen <[email protected]>
> Sent: Tue 22-May-2012 19:40
> To: [email protected]
> Subject: URL filtering and normalization
>
> Somehow my crawler started fetching youtube. I'm not really sure why as I
> have db.ignore.external.links set to true.

Weird!

>
> I've since added the following line to my regex-urlfilter.txt file.
>
> -^http://www\.youtube\.com/

For domain filtering you should use the domain-urlfilter or domain-blacklistfilter. It's faster and easier to maintain.

>
> However, I'm still seeing youtube urls in the fetch logs. I'm using the
> -noFilter and -noNorm options with generate. I'm also not using the
> -filter and -normalize options for updatedb.

You must either filter out all YT records from the CrawlDB or filter during generating.

>
> According to Markus in this thread, the normalization and filtering should
> still occur even when using the above options and using 1.4
>
> http://lucene.472066.n3.nabble.com/Re-Re-generate-update-times-and-crawldb-size-td3564078.html
>
>
> Is there a setting I'm missing? I'm not seeing anything in the logs
> regarding this.
>
> Thanks.
>
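
As a side note, the regex rule quoted above only covers one scheme and one host name, so even where filtering is applied it may let other YouTube URLs through. Below is a minimal Python sketch of that point, not Nutch code. It assumes first-match-wins semantics with an unanchored regex search and rejection when no rule matches; the catch-all `+.` rule, the broader pattern, and the sample URLs are hypothetical and only there for illustration.

```python
import re

# Rules in the style of regex-urlfilter.txt: a sign plus a regex.
# The first rule is the one from the message; the catch-all is an assumption.
NARROW_RULES = [
    ("-", r"^http://www\.youtube\.com/"),
    ("+", r"."),
]

# Hypothetical broader rule: either scheme, any (or no) subdomain.
BROAD_RULES = [
    ("-", r"^https?://([a-z0-9-]+\.)*youtube\.com/"),
    ("+", r"."),
]


def accepts(url, rules):
    """Return True if the first matching rule is '+', else False."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumed behaviour: no matching rule means rejected


if __name__ == "__main__":
    samples = [
        "http://www.youtube.com/watch?v=x",
        "https://www.youtube.com/watch?v=x",
        "http://youtube.com/",
        "http://m.youtube.com/",
        "http://example.org/",
    ]
    for url in samples:
        print(accepts(url, NARROW_RULES), accepts(url, BROAD_RULES), url)
```

None of this matters for a step that runs with filtering switched off, so it complements rather than replaces cleaning the CrawlDB or filtering during generate.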

