help refactoring from 3.x to 4.x

Ryan McKinley Sun, 22 Aug 2010 23:38:46 -0700

I have a function that works well in 3.x, but when I tried to
re-implement it in 4.x it runs very slowly (~20ms in 3.x vs ~45s in 4.x
on an index with ~100K items).


Big picture, I am trying to calculate a bounding box for the items that
match a query.  To calculate this, I have two fields, bboxNS and
bboxEW, that get filled with the min and max values for each doc.  To
get the bounding box, I just need, for each field, the first and the
last term in the index that match a doc in the result set.
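
To make the goal concrete, here is a minimal sketch of the first/last
matching term computation using only the JDK, with no Lucene or Solr
classes.  Everything in it is a stand-in for illustration: the sorted
map plays the role of the term dictionary for one field, the int[]
values play the role of each term's doc ids, the BitSet plays the role
of the DocSet, and the zero-padded term strings are an assumption so
that string order matches numeric order.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical stand-in for the real index structures; not Lucene API.
final class FirstLastSketch {

  // Returns {first, last}: the smallest and largest terms (in sorted
  // order) that have at least one doc in 'docs'. Both are null if no
  // term matches.
  static String[] firstLast(NavigableMap<String, int[]> postings, BitSet docs) {
    String first = null;
    String last = null;
    if (!docs.isEmpty()) {
      // Single forward pass over the terms in sorted order.
      for (Map.Entry<String, int[]> e : postings.entrySet()) {
        if (matchesAny(e.getValue(), docs)) {
          last = e.getKey();
          if (first == null) {
            first = last;
          }
        }
      }
    }
    return new String[] { first, last };
  }

  // True if any doc id listed for the term is in the doc set.
  static boolean matchesAny(int[] docIds, BitSet docs) {
    for (int id : docIds) {
      if (docs.get(id)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    NavigableMap<String, int[]> postings = new TreeMap<>();
    postings.put("010", new int[] { 0, 1 });
    postings.put("020", new int[] { 2 });
    postings.put("035", new int[] { 3, 4 });
    postings.put("050", new int[] { 5 });

    BitSet docs = new BitSet();
    docs.set(2);
    docs.set(3);

    String[] result = firstLast(postings, docs);
    System.out.println(result[0] + " .. " + result[1]);
  }
}
```

The sketch checks each term's doc ids against the doc set directly
rather than building a query per term; whether the same shortcut is
available through the real reader API is not something it shows.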

In 3.x the code looked like this:

public class FirstLastMatchingTerm
{
  String first = null;
  String last = null;

  public static FirstLastMatchingTerm read(SolrIndexSearcher searcher,
      String field, DocSet docs) throws IOException
  {
    FirstLastMatchingTerm firstLast = new FirstLastMatchingTerm();
    if( docs.size() > 0 ) {
      IndexReader reader = searcher.getReader();
      TermEnum te = reader.terms(new Term(field,""));
      do {
        Term t = te.term();
        if( null == t || !t.field().equals(field) ) {
          break;
        }

        if( searcher.numDocs(new TermQuery(t), docs) > 0 ) {
          firstLast.last = t.text();
          if( firstLast.first == null ) {
            firstLast.first = firstLast.last;
          }
        }
      }
      while( te.next() );
    }
    return firstLast;
  }
}


In 4.x, I tried:

public class FirstLastMatchingTerm
{
  String first = null;
  String last = null;

  public static FirstLastMatchingTerm read(SolrIndexSearcher searcher,
      String field, DocSet docs) throws IOException
  {
    FirstLastMatchingTerm firstLast = new FirstLastMatchingTerm();
    if( docs.size() > 0 ) {
      IndexReader reader = searcher.getReader();

      Terms terms = MultiFields.getTerms(reader, field);
      TermsEnum te = terms.iterator();
      BytesRef term = te.next();
      while( term != null ) {
        if( searcher.numDocs(new TermQuery(new Term(field,term)), docs) > 0 ) {
          firstLast.last = term.utf8ToString();
          if( firstLast.first == null ) {
            firstLast.first = firstLast.last;
          }
        }
        term = te.next();
      }
    }
    return firstLast;
  }
}

but it is slow (and the results are incorrect).  I tried some
variations using ReaderUtil.Gather(), but the real hit seems to come from
  if( searcher.numDocs(new TermQuery(new Term(field,term)), docs) > 0 )

Any ideas?  I'm not tied to the approach or indexing strategy, so if
anyone has other suggestions that would be great.  Looking at it
again, it seems crazy that you have to run a query for each term, but
in 3.x

thanks
ryan
