another option, instead of looking ahead with the weird Big-O runtime you noticed, is to use a set to keep track of which terms have already been seen at the current position (cleared whenever a token with posInc > 0 arrives).
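
a rough sketch of the set idea, in plain java rather than the real token stream api. the Token class, the removeDuplicates method and the sample terms are all made up for illustration and are not what is in the patch; the only thing taken from above is the rule "clear the set when posInc > 0, drop any term already in the set".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DedupSketch {

    /** Hypothetical stand-in for an analyzed token: term text plus position increment. */
    public static final class Token {
        public final String term;
        public final int posInc;

        public Token(String term, int posInc) {
            this.term = term;
            this.posInc = posInc;
        }
    }

    /**
     * Drops tokens whose term was already seen at the same position.
     * Assumption: posInc > 0 starts a new position, posInc == 0 stacks
     * the token on the current position.
     */
    public static List<Token> removeDuplicates(List<Token> tokens) {
        Set<String> seen = new HashSet<>();
        List<Token> out = new ArrayList<>();
        for (Token t : tokens) {
            if (t.posInc > 0) {
                seen.clear(); // new position: forget terms from the previous one
            }
            if (seen.add(t.term)) { // add() returns false if the term was already present
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Token> in = Arrays.asList(
                new Token("quick", 1),
                new Token("fast", 0),
                new Token("quick", 0), // duplicate at the same position, dropped
                new Token("fox", 1),
                new Token("fox", 0),   // duplicate at the same position, dropped
                new Token("quick", 1)  // same term at a later position, kept
        );
        for (Token t : removeDuplicates(in)) {
            System.out.println(t.term + " (posInc=" + t.posInc + ")");
        }
    }
}
```

each token costs one expected constant-time set operation, so nothing has to be compared against the other tokens stacked at the same position.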
i implemented this with the new TokenStream API already and will plop the patch on SOLR-1657.

On Tue, Dec 22, 2009 at 7:47 PM, Lance Norskog <goks...@gmail.com> wrote:
> It looks like the inner loop of
> org.apache.solr.analysis.RemoveDuplicatesTokenFilter could use a
> 'break'. I don't remember enough Big-O analysis to give the
> difference, but they will be two different formulae.
>
> For people doing large documents (I've heard gigabytes for email
> forensics) this would matter...
>
> --
> Lance Norskog
> goks...@gmail.com

--
Robert Muir
rcm...@gmail.com