This was actually not about a regex filter, at least not from my point of view; I wasn't clear, it seems.
Anyway, it works well. Instead of a filter we built a normalizer that takes a large file and uses a HashMap for a key look-up.

Cheers

On Wednesday 30 November 2011 17:19:44 Lewis John Mcgibbney wrote:
> Yes I was interested in seeing if this issue has any traction and where (if
> any) interest there is in kick starting it.
>
> From Kirby's original comments on the issue, on the face of it it looks
> like it would be really useful to you guys doing LARGE crawls.
>
> On Mon, Nov 28, 2011 at 6:56 PM, Kirby Bohling <[email protected]> wrote:
> > Julien,
> >
> > On Mon, Nov 28, 2011 at 12:47 PM, Julien Nioche
> > <[email protected]> wrote:
> > > That would be a good thing to benchmark. IIRC there is a JIRA about
> > > improvements to the Finite State library we use, would be good to see
> > > the impact of the patch. The regex-urlfilter will probably take more
> > > memory and be much slower.
> >
> > https://issues.apache.org/jira/browse/NUTCH-1068
> >
> > Pretty sure that is the JIRA item you are discussing. Still not sure
> > what to do with the Automaton library, I don't think that the
> > maintainer has integrated any parts of the performance improvements
> > from Lucene.
> >
> > Kirby
> >
> > > Julien
> > >
> > > On 28 November 2011 18:14, Markus Jelsma <[email protected]> wrote:
> > >> Hi,
> > >>
> > >> Anyone used URL filters containing up to a million rows? In our case
> > >> this would be only 25MB so heap space is no problem (unless the data
> > >> is not shared between threads). Will it perform?
> > >>
> > >> Thanks,
> > >
> > > --
> > > *
> > > *Open Source Solutions for Text Engineering
> > >
> > > http://digitalpebble.blogspot.com/
> > > http://www.digitalpebble.com

-- 
Markus Jelsma - CTO - Openindex
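
For readers of the archive, the HashMap-based look-up described at the top of this message could be sketched roughly as below. This is only an illustration under stated assumptions, not the code that was actually built: the class name `MapUrlNormalizer`, the tab-separated "key, replacement" file format, and the use of the whole URL string as the key are all hypothetical, and the class does not implement the Nutch normalizer plugin interface.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: exact-match URL rewriting backed by a HashMap. */
class MapUrlNormalizer {
    // Filled once in the constructor and never modified afterwards, so a
    // single instance can be shared read-only between fetcher threads.
    private final Map<String, String> rules = new HashMap<>();

    /** Reads "key<TAB>replacement" lines (assumed format); other lines are skipped. */
    MapUrlNormalizer(Reader source) {
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                int tab = line.indexOf('\t');
                if (tab <= 0 || line.startsWith("#")) {
                    continue; // no key, no separator, or a comment line
                }
                rules.put(line.substring(0, tab), line.substring(tab + 1));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the mapped URL if the key is known, otherwise the input unchanged. */
    String normalize(String url) {
        String mapped = rules.get(url);
        return mapped != null ? mapped : url;
    }

    /** Number of rules loaded. */
    int size() {
        return rules.size();
    }
}
```

Each look-up is a single hash probe, so the cost per URL stays roughly constant whether the file holds a thousand rows or a million, whereas a list of regex rules is typically tried one after another. Because the map is only read after construction, one copy can serve all threads, which addresses the heap concern raised in the original question.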

