I'm attempting to filter during generation.  I removed the -noFilter and
-noNorm flags from my generate job.  I have around 10M records in my crawldb.
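
For reference, this is roughly what the generate invocation looks like now,
with filtering and normalization left on (the paths and -topN value here are
placeholders, not my actual setup):

    # Without -noFilter/-noNorm the configured URL filters and
    # normalizers run against every record in the CrawlDB.
    bin/nutch generate crawl/crawldb crawl/segments -topN 100000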

The generate job has been running for several days now.  Is there a way to
check its progress and/or make sure it's not hung?
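
So far the only generic checks I know of, assuming a standard Hadoop setup,
are the job tracker and the logs:

    # List MapReduce jobs; hadoop job -status <job-id> shows completion
    hadoop job -list

    # Watch the Nutch/Hadoop log for task progress messages
    tail -f logs/hadoop.log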

Also, is there a faster way to do this?  It seems like I shouldn't need to
filter the entire crawldb every time I generate a segment, only the new
URLs that were found in the latest fetch.
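
Assuming the 1.4 CLI, this is the kind of thing I'm picturing: filter once
at updatedb time, when the db is rewritten anyway, and skip the extra
full-db pass during generate (<segment> is a placeholder for the latest
fetch):

    # Apply filters/normalizers while the CrawlDB is being rewritten
    bin/nutch updatedb crawl/crawldb crawl/segments/<segment> -filter -normalize

    # Generate without an additional filtering pass over the whole db
    bin/nutch generate crawl/crawldb crawl/segments -noFilter -noNorm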

On Tue, May 22, 2012 at 2:00 PM, Markus Jelsma <[email protected]> wrote:

>
> -----Original message-----
> > From: Bai Shen <[email protected]>
> > Sent: Tue 22-May-2012 19:40
> > To: [email protected]
> > Subject: URL filtering and normalization
> >
> > Somehow my crawler started fetching youtube.  I'm not really sure why as
> > I have db.ignore.external.links set to true.
>
> Weird!
>
> >
> > I've since added the following line to my regex-urlfilter.txt file.
> >
> > -^http://www\.youtube\.com/
>
> For domain filtering you should use the domain-urlfilter or
> domain-blacklistfilter. It's faster and easier to maintain.
>
> >
> > However, I'm still seeing youtube urls in the fetch logs.  I'm using the
> > -noFilter and -noNorm options with generate.  I'm also not using the
> > -filter and -normalize options for updatedb.
>
> You must either filter out all YT records from the CrawlDB or filter
> during generation.
>
> >
> > According to Markus in this thread, the normalization and filtering
> > should still occur even when using the above options with 1.4:
> >
> > http://lucene.472066.n3.nabble.com/Re-Re-generate-update-times-and-crawldb-size-td3564078.html
> >
> >
> > Is there a setting I'm missing?  I'm not seeing anything in the logs
> > regarding this.
> >
> > Thanks.
> >
>
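
For the record, here is a sketch of the two options above as I understand
them; the paths are placeholders.  Filtering the existing records out of
the CrawlDB can be done with the mergedb tool, which applies the configured
URL filters when given -filter:

    # Write a filtered copy of the CrawlDB, then swap it in for the old one
    bin/nutch mergedb crawl/crawldb_filtered crawl/crawldb -filter

And assuming the domain blacklist filter Markus mentions is available in
this version (the file name below may differ, check conf/), blocking the
domain going forward would be a one-line entry:

    # domainblacklist-urlfilter.txt -- one domain per line
    youtube.com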
