I'm attempting to filter during generation. I removed the -noFilter and -noNorm flags from my generate job. I have around 10M records in my crawldb.
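For context, a generate run with filtering and normalization left on (i.e. without -noFilter/-noNorm) looks roughly like the sketch below. The crawldb and segments paths and the -topN value are placeholders, not taken from this thread:

  # Generate a segment with URL filtering and normalization enabled,
  # i.e. without the -noFilter / -noNorm flags. Every crawldb record is
  # run through the filter/normalizer chain during selection, which is
  # why this gets slow on a large db. Paths and -topN are assumed values.
  bin/nutch generate crawl/crawldb crawl/segments -topN 100000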
The generate job has been running for several days now. Is there a way to check its progress and/or make sure it's not hung? Also, is there a faster way to do this? It seems like I shouldn't need to filter the entire crawldb every time I generate a segment, just the new URLs that were found in the latest fetch.

On Tue, May 22, 2012 at 2:00 PM, Markus Jelsma <[email protected]> wrote:

> -----Original message-----
> > From: Bai Shen <[email protected]>
> > Sent: Tue 22-May-2012 19:40
> > To: [email protected]
> > Subject: URL filtering and normalization
> >
> > Somehow my crawler started fetching youtube. I'm not really sure why as I
> > have db.ignore.external.links set to true.
>
> Weird!
>
> > I've since added the following line to my regex-urlfilter.txt file.
> >
> > -^http://www\.youtube\.com/
>
> For domain filtering you should use the domain-urlfilter or
> domain-blacklistfilter. It's faster and easier to maintain.
>
> > However, I'm still seeing youtube urls in the fetch logs. I'm using the
> > -noFilter and -noNorm options with generate. I'm also not using the
> > -filter and -normalize options for updatedb.
>
> You must either filter out all YT records from the CrawlDB or filter
> during generating.
>
> > According to Markus in this thread, the normalization and filtering should
> > still occur even when using the above options and using 1.4
> >
> > http://lucene.472066.n3.nabble.com/Re-Re-generate-update-times-and-crawldb-size-td3564078.html
> >
> > Is there a setting I'm missing? I'm not seeing anything in the logs
> > regarding this.
> >
> > Thanks.
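A couple of sketches that may help with the progress and speed questions, with the caveat that the paths and segment name below are placeholders rather than anything from this thread:

  # In local mode Nutch writes MapReduce progress to logs/hadoop.log,
  # so tailing it shows whether the generate job's map and reduce
  # phases are still advancing; on a real Hadoop cluster the
  # JobTracker web UI serves the same purpose.
  tail -f logs/hadoop.log

  # One way to avoid re-filtering the whole crawldb on every generate
  # (a sketch, not the only option): filter and normalize only the
  # newly discovered URLs as they enter the db via updatedb, then keep
  # -noFilter/-noNorm on generate. The segment name is a placeholder.
  bin/nutch updatedb crawl/crawldb crawl/segments/20120522000000 -filter -normalize

  # Records already in the crawldb (e.g. the YT records Markus
  # mentions) can be purged by re-running the filters over the db with
  # mergedb; -filter here is my recollection of the CrawlDbMerger
  # option, so verify against your version's usage message first.
  bin/nutch mergedb crawl/crawldb_filtered crawl/crawldb -filter

With that kind of setup the filters run once per batch of new URLs instead of over all 10M records on every generate.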
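On Markus's domain-filter suggestion: those plugins take a flat list of domains instead of regexes, which is why they are faster and easier to maintain; the plugin also has to be enabled in plugin.includes. The file name below is an assumption on my part, so check the plugin's conf files for your Nutch version:

  # Hypothetical contents of the domain blacklist filter's config
  # file, one domain per line; the exact file name and plugin id vary
  # by Nutch version, so treat these as assumptions.
  youtube.com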

