Catching BooleanQuery.TooManyClauses

2006-04-14 Thread bb
Hi Lucene Users,

I would like to catch BooleanQuery.TooManyClauses exception for certain
wildcard searches and display a 'subset' of results.  I have used the
WildcardTermEnum to give me the first X documents matching the wildcard
query.  Below is the code I use to implement the solution.  

Setting performance concerns aside, is this the best solution?
Or should I just tell the user to refine their query?

Thanks

Ben

= QueryParserTest.java  
...
public class QueryParserTest extends LuceneTestCase {
...
    private static final int MAX_HITS = 10;

    public void testCatchTooManyClauses() throws Exception {
        reader = IndexReader.open(directory);
        String queryStr = "9*";
        String field = "PART_NBR";
        Hits hits = null;
        Vector docList;
        try {
            System.out.println("query: " + queryStr);
            System.out.println("field: " + field);
            hits = searcher.search(parser.parse(field + ":" + queryStr));
            docList = new Vector(hits.length());
            Iterator docListIt = hits.iterator();
            while (docListIt.hasNext()) {
                docList.add(((Hit) docListIt.next()).getDocument());
            }
        }
        catch (BooleanQuery.TooManyClauses ex) {
            System.out.println("catch BooleanQuery.TooManyClauses, refining query");
            Term term = new Term(field, queryStr);
            WildcardTermEnum wte = new WildcardTermEnum(reader, term);
            int cnt = 0;
            docList = new Vector(MAX_HITS);
            // The enum is already positioned on the first matching term, so
            // read term() before calling next(); cnt caps the number of terms.
            for (; (term = wte.term()) != null && cnt < MAX_HITS; wte.next(), cnt++) {
                TermQuery query = new TermQuery(new Term(field, term.text()));
                System.out.println("search for " + query.getTerm().text());
                hits = searcher.search(query);
                Iterator docListIt = hits.iterator();
                while (docListIt.hasNext()) {
                    docList.add(((Hit) docListIt.next()).getDocument());
                }
            }
            wte.close();
        }
        System.out.println("found:" + docList.size());
    }
...
= QueryParserTest.java  

-- 
No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.1.385 / Virus Database: 268.4.1/312 - Release Date: 14/04/2006
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
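
For readers following the thread: a wildcard query is rewritten into a BooleanQuery with one clause per matching index term, and BooleanQuery.TooManyClauses is thrown once the clause count passes the limit (1024 by default, adjustable with BooleanQuery.setMaxClauseCount). The sketch below models that behaviour and the "first X terms" fallback without Lucene. Every class, method and exception name in it is a hypothetical stand-in, and the term dictionary is a plain sorted set, so treat it as an illustration of the idea rather than of the Lucene API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Lucene-free model of wildcard expansion; all names here are hypothetical.
class WildcardExpansionSketch {

    // Stand-in for BooleanQuery.TooManyClauses.
    static class TooManyClausesException extends RuntimeException {
        TooManyClausesException(String message) { super(message); }
    }

    // Expands a trailing-wildcard pattern such as "9*" (passed as prefix "9")
    // into every matching term, one clause per term, and fails once the
    // number of clauses would exceed maxClauses.
    static List<String> expand(SortedSet<String> terms, String prefix, int maxClauses) {
        List<String> matches = new ArrayList<>();
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // sorted order: no later term can match
            }
            if (matches.size() == maxClauses) {
                throw new TooManyClausesException("more than " + maxClauses + " clauses");
            }
            matches.add(term);
        }
        return matches;
    }

    // Fallback in the spirit of the post: keep only the first maxTerms matching terms.
    static List<String> firstTerms(SortedSet<String> terms, String prefix, int maxTerms) {
        List<String> matches = new ArrayList<>();
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix) || matches.size() == maxTerms) {
                break;
            }
            matches.add(term);
        }
        return matches;
    }

    // Tries the full expansion and falls back to a truncated one on failure.
    static List<String> expandOrTruncate(SortedSet<String> terms, String prefix,
                                         int maxClauses, int maxTerms) {
        try {
            return expand(terms, prefix, maxClauses);
        } catch (TooManyClausesException ex) {
            return firstTerms(terms, prefix, maxTerms);
        }
    }

    public static void main(String[] args) {
        SortedSet<String> terms = new TreeSet<>();
        for (int i = 0; i < 2000; i++) {
            terms.add("9" + i); // 2000 made-up part numbers starting with 9
        }
        terms.add("8123");
        List<String> kept = expandOrTruncate(terms, "9", 1024, 10);
        System.out.println("terms kept: " + kept.size()); // prints "terms kept: 10"
    }
}
```

Note that the fallback caps the number of terms, not the number of documents; a single term can still match many documents.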



RE: Catching BooleanQuery.TooManyClauses

2006-04-17 Thread bb
Thanks Erick & Paul,

I also found a great example of a custom filter in Lucene in Action (LIA),
section 6.4, "Using a custom filter".

Here's my updated testcase if anybody is interested...

= QueryParserTest.java  
...
public class QueryParserTest extends LuceneTestCase {
...
    private static final int MAX_HITS = 10;

    public void testCatchTooManyClauses() throws Exception {
        System.out.println("===>testCatchTooManyClauses");
        Vector docList = null;
        try {
            causeTooManyClauses();
            // without this, docList stays null and the code below throws an NPE
            fail("expected BooleanQuery.TooManyClauses");
        }
        catch (BooleanQuery.TooManyClauses ex) {
            Term term = new Term(field, queryStr);
            final BitSet bs = new BitSet(reader.maxDoc());
            TermDocs termDocs = reader.termDocs();
            WildcardTermEnum wte = new WildcardTermEnum(reader, term);
            int cnt = 0;
            docList = new Vector(MAX_HITS);
            /*
             termDocs.next() and reader.document() read from different places
             in the Lucene index, so interleaving them sends the disk head up
             and down. Collect the doc ids first, then load the documents.
             See http://lucene.apache.org/java/docs/fileformats.html
            */
            for (; (term = wte.term()) != null && cnt < MAX_HITS; wte.next()) {
                // get doc ids from .frq file
                termDocs.seek(term);
                while (cnt < MAX_HITS && termDocs.next()) {
                    int doc = termDocs.doc();
                    // count each document once, even if it matches several terms
                    if (!bs.get(doc)) {
                        bs.set(doc);
                        cnt++;
                    }
                }
            }
            termDocs.close();
            wte.close();
            // retrieve the Documents in numerical order
            for (int i = bs.nextSetBit(0); i >= 0; i = bs.nextSetBit(i + 1)) {
                docList.add(reader.document(i));
            }
        }
        System.out.println("found:" + docList.size());
        assertEquals(MAX_HITS, docList.size());
    }
...
= QueryParserTest.java 
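
The "collect doc ids first, load documents afterwards" idea used in the test case can also be tried without an index. The following is a minimal sketch under stated assumptions: the postings are a plain TreeMap from term to doc ids, standing in for WildcardTermEnum and TermDocs, and all names are hypothetical. It caps on distinct doc ids via BitSet.cardinality(), because a document that matches several terms sets the same bit only once.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Lucene-free model of "collect doc ids first, load documents afterwards".
// The postings map and all names are hypothetical stand-ins.
class OrderedDocIdSketch {

    // Collects up to maxHits distinct doc ids for all terms starting with prefix.
    static BitSet collect(TreeMap<String, int[]> postings, String prefix,
                          int maxDoc, int maxHits) {
        BitSet bits = new BitSet(maxDoc);
        for (Map.Entry<String, int[]> entry : postings.tailMap(prefix, true).entrySet()) {
            if (!entry.getKey().startsWith(prefix) || bits.cardinality() >= maxHits) {
                break;
            }
            for (int doc : entry.getValue()) {
                if (bits.cardinality() >= maxHits) {
                    break;
                }
                bits.set(doc); // setting an already-set bit does not add a hit
            }
        }
        return bits;
    }

    // Walks the set bits in ascending order, the order in which the
    // documents would then be loaded.
    static List<Integer> inLoadOrder(BitSet bits) {
        List<Integer> ids = new ArrayList<>();
        for (int i = bits.nextSetBit(0); i >= 0; i = bits.nextSetBit(i + 1)) {
            ids.add(i);
        }
        return ids;
    }

    public static void main(String[] args) {
        TreeMap<String, int[]> postings = new TreeMap<>();
        postings.put("900", new int[] {2, 7});
        postings.put("901", new int[] {2, 5}); // doc 2 matches two terms
        postings.put("950", new int[] {1, 9});
        postings.put("800", new int[] {3});    // does not match "9*"
        // Doc ids are visited as 2, 7, 2, 5, 1 but come out sorted.
        System.out.println(inLoadOrder(collect(postings, "9", 10, 4))); // prints [1, 2, 5, 7]
    }
}
```

Doc ids are visited term by term in no particular overall order, yet the BitSet hands them back sorted, which is what keeps the later document reads sequential.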
 

> -Original Message-
> From: Paul Elschot [mailto:[EMAIL PROTECTED] 
> Sent: Sunday, 16 April 2006 5:13 AM
> To: [email protected]
> Subject: Re: Catching BooleanQuery.TooManyClauses
> 
> 
> On Saturday 15 April 2006 13:44, Erick Erickson wrote:
> > With the warning that I'm not the most experienced Lucene user in the
> > world...
> > 
> > I *think* that, rather than search for each term, it's more efficient to
> > just use IndexReader.termDocs, i.e.
> > 
> > IndexReader ir = ;
> > TermDocs termDocs = ir.termDocs();
> > WildcardTermEnum wildEnum = ;
> > 
> > for (Term term = null; (term = wildEnum.term()) != null; wildEnum.next()) {
> >   termDocs.seek(term);
> 
> This avoids the buffer space needed for each TermDocs by using each term
> separately. A BooleanQuery over all the terms will use termDocs.next() and
> termDocs.doc() for all terms at the same time. It has to, because several
> terms might match the same document and it has to compute the query score
> for each document.
> 
> >   while (termDocs.next()) {
> >     Document doc = ir.document(termDocs.doc());
> 
> The methods termDocs.next() and reader.document()
> go to different places in the Lucene index (see the index format),
> so this will send the disk head up and down.
> It's better to collect the termDocs.doc() values first, for example in a
> BitSet, and then retrieve the Documents in numerical order.
> Btw., this is what the ConstantScoreRangeQuery does to avoid using all terms
> at the same time.
> 
> >   }
> > }
> > 
> > I know that for loop looks odd, but I just peeked at the source code for the
> > TermEnum classes and see why it works.
> > 
> > One warning, as the folks on the board have pointed out to me, is that the
> > Hits object is not entirely efficient when you fetch lots of docs (more than
> > 100 has been mentioned) and you should think about TopDocs or some such.
> > 
> > Also, if you can avoid fetching the document (i.e. get everything you want
> > from the index) you'll add efficiency. I have no clue how much you're
> > returning to the user, so I don't know whether that would work for you.
> 
> In other words, one can use the above BitSet in a Filter later on
> during an IndexSearcher.search() (or in a ConstantScoreQuery),
> and use Hits or TopDocs for document retrieval.
> 
> Regards,
> Paul Elschot.
> 