help refactoring from 3.x to 4.x

2010-08-23 Thread Ryan McKinley
I have a function that works well in 3.x, but when I tried to
re-implement in 4.x it runs very very slow (~20ms vs 45s on an index w
~100K items).

Big picture, I am trying to calculate a bounding box for items that
match the query.  To calculate this, I have two fields, bboxNS and
bboxEW, that get filled with the min and max values for that doc.  To
get the bounding box, I just need the first matching term in the index
and the last matching term.
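
[Editor's note: a toy sketch of the idea, not Lucene code — class and data are hypothetical.  Because a Lucene field's terms are stored in sorted order, the first term whose postings intersect the matching doc set is the minimum value and the last such term is the maximum:]

```java
import java.util.*;

// Toy stand-in for a term dictionary: term -> sorted docIDs containing it.
// Terms iterate in sorted order (TreeMap), like a Lucene field's terms.
public class FirstLastDemo {
  static final SortedMap<String, int[]> POSTINGS = new TreeMap<>();
  static {
    POSTINGS.put("10.0", new int[]{0, 3});
    POSTINGS.put("20.5", new int[]{1});
    POSTINGS.put("30.0", new int[]{2, 4});
  }

  // Scan terms in order; first/last term with any matching doc give min/max.
  static String[] firstLast(Set<Integer> matchingDocs) {
    String first = null, last = null;
    for (Map.Entry<String, int[]> e : POSTINGS.entrySet()) {
      for (int doc : e.getValue()) {
        if (matchingDocs.contains(doc)) {  // one hit is enough for this term
          if (first == null) first = e.getKey();
          last = e.getKey();
          break;
        }
      }
    }
    return new String[]{first, last};
  }

  public static void main(String[] args) {
    // query matched docs 1 and 2 -> this bbox edge spans 20.5 .. 30.0
    String[] fl = firstLast(new HashSet<>(Arrays.asList(1, 2)));
    System.out.println(fl[0] + " " + fl[1]);  // prints "20.5 30.0"
  }
}
```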

In 3.x the code looked like this:

public class FirstLastMatchingTerm
{
  String first = null;
  String last = null;

  public static FirstLastMatchingTerm read(SolrIndexSearcher searcher,
      String field, DocSet docs) throws IOException
  {
    FirstLastMatchingTerm firstLast = new FirstLastMatchingTerm();
    if( docs.size() > 0 ) {
      IndexReader reader = searcher.getReader();
      TermEnum te = reader.terms(new Term(field, ""));
      do {
        Term t = te.term();
        if( null == t || !t.field().equals(field) ) {
          break;
        }

        if( searcher.numDocs(new TermQuery(t), docs) > 0 ) {
          firstLast.last = t.text();
          if( firstLast.first == null ) {
            firstLast.first = firstLast.last;
          }
        }
      }
      while( te.next() );
    }
    return firstLast;
  }
}


In 4.x, I tried:

public class FirstLastMatchingTerm
{
  String first = null;
  String last = null;

  public static FirstLastMatchingTerm read(SolrIndexSearcher searcher,
      String field, DocSet docs) throws IOException
  {
    FirstLastMatchingTerm firstLast = new FirstLastMatchingTerm();
    if( docs.size() > 0 ) {
      IndexReader reader = searcher.getReader();

      Terms terms = MultiFields.getTerms(reader, field);
      TermsEnum te = terms.iterator();
      BytesRef term = te.next();
      while( term != null ) {
        if( searcher.numDocs(new TermQuery(new Term(field, term)), docs) > 0 ) {
          firstLast.last = term.utf8ToString();
          if( firstLast.first == null ) {
            firstLast.first = firstLast.last;
          }
        }
        term = te.next();
      }
    }
    return firstLast;
  }
}

but the results are slow (and incorrect).  I tried some variations of
using ReaderUtil.Gather(), but the real hit seems to come from
  if( searcher.numDocs(new TermQuery(new Term(field, term)), docs) > 0 )

Any ideas?  I'm not tied to the approach or indexing strategy, so if
anyone has other suggestions that would be great.  Looking at it
again, it seems crazy that you have to run a query for each term, but
in 3.x…

thanks
ryan


Re: help refactoring from 3.x to 4.x

2010-08-23 Thread Michael McCandless
Spooky that you see incorrect results!  The code looks correct.  What
are the specifics on when it produces an invalid result?

Also spooky that you see it running slower -- how much slower?  Did
you rebuild the index in 4.x (if not, you are using the preflex
codec)?  And is the index otherwise identical?

You could improve perf by not using SolrIndexSearcher.numDocs.  Ie you
don't need the count; you just need to know if it's > 0.  So you could
make your own loop that breaks out on the first docID in common.  You
could also stick w/ BytesRef the whole time (only do .utf8ToString()
at the end on the first/last), though this is presumably a net/net
tiny cost.
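
[Editor's note: a minimal sketch of the "break on first docID in common" suggestion, using plain sorted arrays rather than the Lucene/Solr iterator APIs — the class and method names are hypothetical.  Instead of counting every doc the term shares with the filter (what numDocs does), walk the two sorted docID streams in lockstep and return as soon as they meet:]

```java
// Leapfrog intersection with early exit: both inputs are sorted ascending,
// like a term's postings and a filter's docID iterator.
public class AnyOverlap {
  static boolean anyDocInCommon(int[] postings, int[] docSet) {
    int i = 0, j = 0;
    while (i < postings.length && j < docSet.length) {
      if (postings[i] == docSet[j]) return true;   // first common docID: done
      if (postings[i] < docSet[j]) i++; else j++;  // advance the smaller side
    }
    return false;  // streams exhausted with no overlap
  }

  public static void main(String[] args) {
    int[] postings = {2, 7, 9, 40};
    int[] docSet   = {1, 9, 15};
    System.out.println(anyDocInCommon(postings, docSet));  // prints "true"
  }
}
```

The win over counting is that a frequent term matching early in the filter is confirmed after a handful of comparisons instead of a full intersection.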

But, we should still dig down on why numDocs is slower in 4.x; that's
unexpected; Yonik any ideas?  I'm not familiar with this part of
Solr...

Mike

On Mon, Aug 23, 2010 at 2:38 AM, Ryan McKinley ryan...@gmail.com wrote: