Normally when I see a 1M entry URL filter, it's doing domain-level filtering.

If that's the case, I'd use a BloomFilter, which has worked well for us in the 
past during large-scale crawls.
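
A minimal sketch of that idea, assuming Guava's BloomFilter, roughly 1M
domains, and a 1% false-positive rate (all illustrative choices, not details
from the crawls mentioned above):

    import com.google.common.hash.BloomFilter;
    import com.google.common.hash.Funnels;

    import java.nio.charset.StandardCharsets;

    public class DomainBloomSketch {
        public static void main(String[] args) {
            // Illustrative sizing: ~1M domains at a 1% false-positive rate.
            // The resulting bit array is roughly 1.2 MB, far smaller than the raw list.
            BloomFilter<CharSequence> domains = BloomFilter.create(
                    Funnels.stringFunnel(StandardCharsets.UTF_8), 1000000, 0.01);

            // In a real crawl the domains would be loaded from the filter file.
            domains.put("example.com");
            domains.put("apache.org");

            // mightContain() never returns a false negative: "false" means the
            // domain is definitely not in the list, "true" means it probably is
            // (subject to the configured false-positive rate).
            System.out.println(domains.mightContain("apache.org"));   // true
            System.out.println(domains.mightContain("unknown.net"));  // false, with high probability
        }
    }

The trade-off is that a small fraction of unwanted domains slips through as
false positives, which is usually acceptable for crawl scoping.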

-- Ken

On Nov 30, 2011, at 8:19am, Lewis John Mcgibbney wrote:

> Yes, I was interested in seeing if this issue has any traction and what
> interest (if any) there is in kick-starting it.
> 
> From Kirby's original comments on the issue, on the face of it, it looks
> like it would be really useful to you guys doing LARGE crawls.
> 
> On Mon, Nov 28, 2011 at 6:56 PM, Kirby Bohling <[email protected]> wrote:
> 
>> Julien,
>> 
>> 
>> 
>> On Mon, Nov 28, 2011 at 12:47 PM, Julien Nioche
>> <[email protected]> wrote:
>>> That would be a good thing to benchmark. IIRC there is a JIRA about
>>> improvements to the Finite State library we use; it would be good to see
>>> the impact of the patch. The regex-urlfilter will probably take more
>>> memory and be much slower.
>>> 
>> 
>> https://issues.apache.org/jira/browse/NUTCH-1068
>> 
>> Pretty sure that is the JIRA item you are discussing.  Still not sure
>> what to do with the Automaton library; I don't think that the
>> maintainer has integrated any parts of the performance improvements
>> from Lucene.
>> 
>> Kirby
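
As a minimal sketch of the performance point above, assuming the finite-state
library in question is dk.brics.automaton and using an illustrative rule: once
a rule is compiled into a DFA, matching a URL is a single linear scan with no
backtracking, regardless of how the expression is written.

    import dk.brics.automaton.RegExp;
    import dk.brics.automaton.RunAutomaton;

    import java.util.regex.Pattern;

    public class UrlFilterMatchSketch {
        public static void main(String[] args) {
            String url  = "http://www.example.com/some/page.html";
            String rule = "http://(www\\.)?example\\.com/.*";  // illustrative rule only

            // java.util.regex: a backtracking matcher, re-evaluated per rule per URL.
            Pattern jdkPattern = Pattern.compile(rule);
            System.out.println(jdkPattern.matcher(url).matches());  // true

            // dk.brics.automaton: compile the rule once into a DFA; matching is
            // then a single pass over the URL, independent of rule complexity.
            RunAutomaton dfa = new RunAutomaton(new RegExp(rule).toAutomaton());
            System.out.println(dfa.run(url));                       // true
        }
    }

One practical difference at large rule counts is that individual automata can
be unioned into a single DFA, whereas a list of java.util.regex patterns has
to be tried one by one.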
>> 
>> 
>>> Julien
>>> 
>>> On 28 November 2011 18:14, Markus Jelsma <[email protected]> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> Anyone used URL filters containing up to a million rows? In our case
>>>> this would be only 25MB, so heap space is no problem (unless the data
>>>> is not shared between threads). Will it perform?
>>>> 
>>>> Thanks,
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Open Source Solutions for Text Engineering
>>> 
>>> http://digitalpebble.blogspot.com/
>>> http://www.digitalpebble.com
>>> 
>> 
> 
> 
> 
> -- 
> Lewis

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr



