This was actually not about a regex filter, at least not from my point of
view. It seems I wasn't clear.

Anyway, it works well. Instead of a filter we built a normalizer that loads a
large file into a HashMap and does a key look-up.
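
A minimal sketch of such a lookup-based normalizer, for illustration only. The
class name LookupNormalizer, the tab-separated "key<TAB>replacement" file
format, and the pass-through behaviour for unknown keys are all assumptions,
not the code described above, and it does not use the Nutch plugin API:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical class for illustration; not part of Nutch.
class LookupNormalizer {
    // Filled once in load() and never modified afterwards.
    private final Map<String, String> table;

    private LookupNormalizer(Map<String, String> table) {
        this.table = table;
    }

    // Read "key<TAB>replacement" lines into a HashMap.
    // The line format is an assumption made for this sketch.
    static LookupNormalizer load(Reader source) {
        Map<String, String> table = new HashMap<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                int tab = line.indexOf('\t');
                if (tab <= 0) {
                    continue; // skip blank or malformed lines
                }
                table.put(line.substring(0, tab), line.substring(tab + 1));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new LookupNormalizer(table);
    }

    // Return the mapped value, or the input unchanged when there is no entry.
    String normalize(String url) {
        String mapped = table.get(url);
        return mapped != null ? mapped : url;
    }
}
```

Each look-up is a single hash probe, so the cost per URL does not grow with
the number of rows. Because the map is only read after loading, one instance
could in principle be shared between threads instead of loading a copy per
thread.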

Cheers


On Wednesday 30 November 2011 17:19:44 Lewis John Mcgibbney wrote:
> Yes I was interested in seeing if this issue has any traction and where (if
> any) interest there is in kick starting it.
> 
> From Kirby's original comments on the issue, on the face of it it looks
> like it would be really useful to you guys doing LARGE crawls.
> 
> On Mon, Nov 28, 2011 at 6:56 PM, Kirby Bohling <[email protected]> wrote:
> > Julien,
> > 
> > 
> > 
> > On Mon, Nov 28, 2011 at 12:47 PM, Julien Nioche <[email protected]> wrote:
> > > That would be a good thing to benchmark. IIRC there is a JIRA about
> > > improvements to the Finite State library we use, would be good to see
> > > the impact of the patch. The regex-urlfilter will probably take more
> > > memory and be much slower.
> > 
> > https://issues.apache.org/jira/browse/NUTCH-1068
> > 
> > Pretty sure that is the JIRA item you are discussing.  Still not sure
> > what to do with the Automaton library, I don't think that the
> > maintainer has integrated any parts of the performance improvements
> > from Lucene.
> > 
> > Kirby
> > 
> > > Julien
> > > 
> > > On 28 November 2011 18:14, Markus Jelsma <[email protected]> wrote:
> > >> Hi,
> > >> 
> > >> Anyone used URL filters containing up to a million rows? In our case
> > >> this would be only 25MB so heap space is no problem (unless the data
> > >> is not shared between threads). Will it perform?
> > >> 
> > >> Thanks,
> > > 
> > > --
> > > Open Source Solutions for Text Engineering
> > > 
> > > http://digitalpebble.blogspot.com/
> > > http://www.digitalpebble.com

-- 
Markus Jelsma - CTO - Openindex
