Normally when I see a 1M entry URL filter, it's doing domain-level filtering.
If that's the case, I'd use a BloomFilter, which has worked well for us in the past during large-scale crawls.

-- Ken

On Nov 30, 2011, at 8:19am, Lewis John Mcgibbney wrote:

> Yes I was interested in seeing if this issue has any traction and where (if
> any) interest there is in kick starting it.
>
> From Kirby's original comments on the issue, on the face of it it looks
> like it would be really useful to you guys doing LARGE crawls.
>
> On Mon, Nov 28, 2011 at 6:56 PM, Kirby Bohling <[email protected]> wrote:
>
>> Julien,
>>
>> On Mon, Nov 28, 2011 at 12:47 PM, Julien Nioche
>> <[email protected]> wrote:
>>> That would be a good thing to benchmark. IIRC there is a JIRA about
>>> improvements to the Finite State library we use; it would be good to see
>>> the impact of the patch. The regex-urlfilter will probably take more
>>> memory and be much slower.
>>>
>>
>> https://issues.apache.org/jira/browse/NUTCH-1068
>>
>> Pretty sure that is the JIRA item you are discussing. Still not sure
>> what to do with the Automaton library; I don't think that the
>> maintainer has integrated any of the performance improvements
>> from Lucene.
>>
>> Kirby
>>
>>> Julien
>>>
>>> On 28 November 2011 18:14, Markus Jelsma <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> Has anyone used URL filters containing up to a million rows? In our
>>>> case this would be only 25MB, so heap space is no problem (unless the
>>>> data is not shared between threads). Will it perform?
>>>>
>>>> Thanks,
>>>>
>>>
>>> --
>>> *
>>> *Open Source Solutions for Text Engineering
>>>
>>> http://digitalpebble.blogspot.com/
>>> http://www.digitalpebble.com
>>
>
> --
> *Lewis*

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr
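For readers curious what Ken's BloomFilter suggestion looks like in practice, here is a minimal, self-contained sketch of domain-level filtering with a Bloom filter. This is an illustration only: the class name, sizing constants, and the simple FNV-style seeded hash are assumptions for the example, not Nutch's (or Ken's) actual implementation — a real deployment would more likely use a proven library filter (e.g. Guava's) with MurmurHash under the hood.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/**
 * Illustrative Bloom filter for domain-level URL filtering.
 * May report false positives, but never false negatives.
 */
public class DomainBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    // For ~1M entries, 10 bits/entry with 7 hashes gives roughly a 1% false-positive rate.
    public DomainBloomFilter(int expectedEntries, int bitsPerEntry, int numHashes) {
        this.numBits = expectedEntries * bitsPerEntry;
        this.numHashes = numHashes;
        this.bits = new BitSet(numBits);
    }

    // Simple seeded FNV-style hash; stands in for MurmurHash3 in this sketch.
    private static int seededHash(byte[] data, int seed) {
        int h = seed;
        for (byte b : data) {
            h ^= b;
            h *= 16777619;
        }
        return h;
    }

    // Double hashing: derive the i-th probe position from two base hashes.
    private int index(byte[] data, int i) {
        int h1 = seededHash(data, 0x9747b28c);
        int h2 = seededHash(data, 0x1b873593);
        return Math.floorMod(h1 + i * h2, numBits);
    }

    public void add(String domain) {
        byte[] data = domain.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(data, i));
        }
    }

    public boolean mightContain(String domain) {
        byte[] data = domain.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(data, i))) {
                return false; // definitely not in the set
            }
        }
        return true; // probably in the set
    }

    public static void main(String[] args) {
        DomainBloomFilter filter = new DomainBloomFilter(1_000_000, 10, 7);
        filter.add("example.com");
        filter.add("apache.org");
        System.out.println(filter.mightContain("apache.org"));
        System.out.println(filter.mightContain("nosuchdomain.example"));
    }
}
```

The appeal for a 1M-entry filter is the memory/speed trade-off: at 10 bits per entry the whole structure is about 1.25MB of read-only bits, lookups are a handful of hash probes regardless of entry count, and the structure can be shared safely across threads once populated — at the cost of a small, tunable false-positive rate.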

