Wouldn't it be enough to filter and normalize URLs once during parsing? Then it shouldn't be necessary anymore in generate, updatedb and invertlinks; see the sketch below.
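Something like this (a rough sketch of that workflow with the 1.x command line; the crawl/* paths, the segment timestamp and the -topN value are placeholders, and the exact invertlinks flag names should be checked against your version):

    # outlinks get normalized and filtered once, while the segment is parsed
    bin/nutch generate crawl/crawldb crawl/segments -topN 50000 -noFilter -noNorm
    bin/nutch fetch crawl/segments/20120608123456
    bin/nutch parse crawl/segments/20120608123456
    # no -filter/-normalize here, so the 10M-record crawldb is not re-filtered
    bin/nutch updatedb crawl/crawldb crawl/segments/20120608123456
    bin/nutch invertlinks crawl/linkdb -dir crawl/segments -noFilter -noNormalize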
On Fri, Jun 8, 2012 at 3:53 PM, Bai Shen <[email protected]> wrote:

> I'm attempting to filter during the generating. I removed the noFilter and
> noNorm flags from my generate job. I have around 10M records in my crawldb.
>
> The generate job has been running for several days now. Is there a way to
> check its progress and/or make sure it's not hung?
>
> Also, is there a faster way to do this? It seems like I shouldn't need to
> filter the entire crawldb every time I generate a segment, just the new
> URLs that were found in the latest fetch.
>
> On Tue, May 22, 2012 at 2:00 PM, Markus Jelsma
> <[email protected]> wrote:
>
>> -----Original message-----
>> > From: Bai Shen <[email protected]>
>> > Sent: Tue 22-May-2012 19:40
>> > To: [email protected]
>> > Subject: URL filtering and normalization
>> >
>> > Somehow my crawler started fetching youtube. I'm not really sure why, as
>> > I have db.ignore.external.links set to true.
>>
>> Weird!
>>
>> > I've since added the following line to my regex-urlfilter.txt file.
>> >
>> > -^http://www\.youtube\.com/
>>
>> For domain filtering you should use the domain-urlfilter or
>> domain-blacklistfilter. It's faster and easier to maintain.
>>
>> > However, I'm still seeing youtube urls in the fetch logs. I'm using the
>> > -noFilter and -noNorm options with generate. I'm also not using the
>> > -filter and -normalize options for updatedb.
>>
>> You must either filter out all YT records from the CrawlDB or filter
>> during generating.
>>
>> > According to Markus in this thread, the normalization and filtering should
>> > still occur even when using the above options and using 1.4:
>> >
>> > http://lucene.472066.n3.nabble.com/Re-Re-generate-update-times-and-crawldb-size-td3564078.html
>> >
>> > Is there a setting I'm missing? I'm not seeing anything in the logs
>> > regarding this.
>> >
>> > Thanks.
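A rough sketch of the CrawlDB cleanup Markus mentions, using mergedb (CrawlDbMerger) to rewrite the db through the active URL filters (paths are placeholders; check the mergedb usage and -filter option against your Nutch version):

    # with the youtube rule in regex-urlfilter.txt, write a filtered copy of the crawldb
    bin/nutch mergedb crawl/crawldb_filtered crawl/crawldb -filter
    # once the filtered copy looks sane, swap it in place of the old one
    mv crawl/crawldb crawl/crawldb_old
    mv crawl/crawldb_filtered crawl/crawldb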

