Re: Very large filter lists

Lewis John Mcgibbney Wed, 30 Nov 2011 08:20:11 -0800

Yes, I was interested in seeing whether this issue has any traction and how
much interest (if any) there is in kick-starting it.


From Kirby's original comments on the issue, it looks on the face of it
like it would be really useful to you guys doing LARGE crawls.
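
As a rough way to try the comparison Julien suggests in the quoted thread,
here is a minimal sketch in plain Java. It does not use the Nutch plugin API
or the automaton library. The class name FilterListSketch, the helper methods
and the hostN.example.com rules are all made up for illustration. It only
contrasts a sequential scan over compiled regex rules with a hash-set lookup
of URL prefixes, and a single timed call with no JIT warm-up is an indication
at best, not a proper benchmark.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical sketch, not Nutch code: compares two ways of checking a URL
// against a large rule list using only the JDK.
public class FilterListSketch {

    // Regex-style check: scan the rules in order until one matches.
    // Cost grows with the number of rules.
    static boolean acceptsRegex(List<Pattern> rules, String url) {
        for (Pattern p : rules) {
            if (p.matcher(url).find()) {
                return true;
            }
        }
        return false;
    }

    // Set-based check: test every leading substring of the URL against the
    // set. Cost depends on the URL length, not on the number of rules.
    static boolean acceptsPrefix(Set<String> prefixes, String url) {
        for (int end = 1; end <= url.length(); end++) {
            if (prefixes.contains(url.substring(0, end))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Rule count is an assumption; pass a larger number to stress it.
        int ruleCount = args.length > 0 ? Integer.parseInt(args[0]) : 10_000;
        List<Pattern> regexRules = new ArrayList<>();
        Set<String> prefixRules = new HashSet<>();
        for (int i = 0; i < ruleCount; i++) {
            String prefix = "http://host" + i + ".example.com/";
            regexRules.add(Pattern.compile("^" + Pattern.quote(prefix)));
            prefixRules.add(prefix);
        }

        // Worst case for the sequential scan: last rule, and no rule at all.
        String hit = "http://host" + (ruleCount - 1) + ".example.com/page.html";
        String miss = "http://unlisted.example.org/page.html";

        long t0 = System.nanoTime();
        boolean r1 = acceptsRegex(regexRules, hit);
        boolean r2 = acceptsRegex(regexRules, miss);
        long t1 = System.nanoTime();
        boolean p1 = acceptsPrefix(prefixRules, hit);
        boolean p2 = acceptsPrefix(prefixRules, miss);
        long t2 = System.nanoTime();

        System.out.println("regex : " + r1 + "/" + r2 + " in " + (t1 - t0) / 1_000 + " us");
        System.out.println("prefix: " + p1 + "/" + p2 + " in " + (t2 - t1) / 1_000 + " us");
    }
}
```

Running it with a rule count near a million (for example
"java FilterListSketch 1000000") also gives a feel for the heap the compiled
patterns need compared with the plain strings.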

On Mon, Nov 28, 2011 at 6:56 PM, Kirby Bohling <[email protected]> wrote:

> Julien,
>
>
>
> On Mon, Nov 28, 2011 at 12:47 PM, Julien Nioche
> <[email protected]> wrote:
> > That would be a good thing to benchmark. IIRC there is a JIRA about
> > improvements to the Finite State library we use, would be good to see the
> > impact of the patch. The regex-urlfilter will probably take more memory
> > and be much slower.
> >
>
> https://issues.apache.org/jira/browse/NUTCH-1068
>
> Pretty sure that is the JIRA item you are discussing.  Still not sure
> what to do with the Automaton library, I don't think that the
> maintainer has integrated any parts of the performance improvements
> from Lucene.
>
> Kirby
>
>
> > Julien
> >
> > On 28 November 2011 18:14, Markus Jelsma <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> Anyone used URL filters containing up to a million rows? In our case
> >> this would be only 25MB so heap space is no problem (unless the data is
> >> not shared between threads). Will it perform?
> >>
> >> Thanks,
> >>
> >
> >
> >
> > --
> > *
> > *Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
> >
>



-- 
*Lewis*
