Wouldn't it be enough to filter and normalize URLs once during parsing?
Then it shouldn't be necessary any more in generate, updatedb and invertlinks.
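
As a rough sketch with the Nutch 1.x command-line tools (the flag names are
the ones mentioned further down in this thread; the paths and the -topN value
are just placeholders), that would look something like:

  # assuming outlinks are already filtered/normalized at parse time, the
  # expensive filtering passes over the whole CrawlDB can be skipped:
  bin/nutch generate crawl/crawldb crawl/segments -topN 50000 -noFilter -noNorm
  bin/nutch updatedb crawl/crawldb crawl/segments/SEGMENT    # no -filter, no -normalize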


On Fri, Jun 8, 2012 at 3:53 PM, Bai Shen <[email protected]> wrote:
> I'm attempting to filter during generating.  I removed the -noFilter and
> -noNorm flags from my generate job.  I have around 10M records in my crawldb.
>
> The generate job has been running for several days now.  Is there a way to
> check its progress and/or make sure it's not hung?
>
> Also, is there a faster way to do this?  It seems like I shouldn't need to
> filter the entire crawldb every time I generate a segment, just the new
> URLs that were found in the latest fetch.
>
> On Tue, May 22, 2012 at 2:00 PM, Markus Jelsma
> <[email protected]>wrote:
>
>>
>> -----Original message-----
>> > From:Bai Shen <[email protected]>
>> > Sent: Tue 22-May-2012 19:40
>> > To: [email protected]
>> > Subject: URL filtering and normalization
>> >
>> > Somehow my crawler started fetching YouTube.  I'm not really sure why, as
>> > I have db.ignore.external.links set to true.
>>
>> Weird!
>>
>> >
>> > I've since added the following line to my regex-urlfilter.txt file.
>> >
>> > -^http://www\.youtube\.com/
>>
>> For domain filtering you should use the domain-urlfilter or the
>> domain-blacklistfilter. They're faster and easier to maintain.
>>
>> >
>> > However, I'm still seeing YouTube URLs in the fetch logs.  I'm using the
>> > -noFilter and -noNorm options with generate.  I'm also not using the
>> > -filter and -normalize options for updatedb.
>>
>> You must either filter out all YT records from the CrawlDB or filter
>> during generating.
>>
>> >
>> > According to Markus in this thread, the normalization and filtering
>> > should still occur even when using the above options and using 1.4:
>> >
>> > http://lucene.472066.n3.nabble.com/Re-Re-generate-update-times-and-crawldb-size-td3564078.html
>> >
>> >
>> > Is there a setting I'm missing?  I'm not seeing anything in the logs
>> > regarding this.
>> >
>> > Thanks.
>> >
>>
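
A rough sketch of the two options Markus mentions above (the plugin and file
names are from memory of Nutch 1.x and may need checking against your version):

  # 1) Block the domain with the domain blacklist filter instead of a regex:
  #    add urlfilter-domainblacklist to plugin.includes in nutch-site.xml and
  #    list the domain in conf/domain-blacklist-urlfilter.txt:
  youtube.com

  # 2) Clean the YouTube records that are already in the CrawlDB by rewriting
  #    it through the configured URL filters (mergedb also accepts a single
  #    input db), then swap the filtered db into place:
  bin/nutch mergedb crawl/crawldb_filtered crawl/crawldb -filter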
