Wouldn't it be enough to filter and normalize URLs once during parsing?
Then it shouldn't be necessary any more in generate, updatedb and invertlinks.
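
As a rough sketch with the Nutch 1.x command-line tools (the flag names are
the ones mentioned further down in this thread; the paths and the -topN value
are just placeholders), that would look something like:

  # assuming outlinks are already filtered/normalized at parse time, the
  # expensive filtering passes over the whole CrawlDB can be skipped:
  bin/nutch generate crawl/crawldb crawl/segments -topN 50000 -noFilter -noNorm
  bin/nutch updatedb crawl/crawldb crawl/segments/SEGMENT    # no -filter, no -normalize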


On Fri, Jun 8, 2012 at 3:53 PM, Bai Shen <[email protected]> wrote:
> I'm attempting to filter during generating.  I removed the -noFilter and
> -noNorm flags from my generate job.  I have around 10M records in my crawldb.
>
> The generate job has been running for several days now.  Is there a way to
> check its progress and/or make sure it's not hung?
>
> Also, is there a faster way to do this?  It seems like I shouldn't need to
> filter the entire crawldb every time I generate a segment, just the new
> URLs that were found in the latest fetch.
>
> On Tue, May 22, 2012 at 2:00 PM, Markus Jelsma
> <[email protected]>wrote:
>
>>
>> -----Original message-----
>> > From:Bai Shen <[email protected]>
>> > Sent: Tue 22-May-2012 19:40
>> > To: [email protected]
>> > Subject: URL filtering and normalization
>> >
>> > Somehow my crawler started fetching YouTube.  I'm not really sure why, as
>> > I have db.ignore.external.links set to true.
>>
>> Weird!
>>
>> >
>> > I've since added the following line to my regex-urlfilter.txt file.
>> >
>> > -^http://www\.youtube\.com/
>>
>> For domain filtering you should use the domain-urlfilter or the
>> domain-blacklistfilter. They're faster and easier to maintain.
>>
>> >
>> > However, I'm still seeing YouTube URLs in the fetch logs.  I'm using the
>> > -noFilter and -noNorm options with generate.  I'm also not using the
>> > -filter and -normalize options for updatedb.
>>
>> You must either filter out all YT records from the CrawlDB or filter
>> during generating.
>>
>> >
>> > According to Markus in this thread, the normalization and filtering
>> > should still occur even when using the above options and using 1.4:
>> >
>> > http://lucene.472066.n3.nabble.com/Re-Re-generate-update-times-and-crawldb-size-td3564078.html
>> >
>> >
>> > Is there a setting I'm missing?  I'm not seeing anything in the logs
>> > regarding this.
>> >
>> > Thanks.
>> >
>>
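
A rough sketch of the two options Markus mentions above (the plugin and file
names are from memory of Nutch 1.x and may need checking against your version):

  # 1) Block the domain with the domain blacklist filter instead of a regex:
  #    add urlfilter-domainblacklist to plugin.includes in nutch-site.xml and
  #    list the domain in conf/domain-blacklist-urlfilter.txt:
  youtube.com

  # 2) Clean the YouTube records that are already in the CrawlDB by rewriting
  #    it through the configured URL filters (mergedb also accepts a single
  #    input db), then swap the filtered db into place:
  bin/nutch mergedb crawl/crawldb_filtered crawl/crawldb -filter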
