RE: Phrase Query Performance Question

Haishan Chen Fri, 02 Nov 2007 00:52:20 -0800


> From: [EMAIL PROTECTED]
> Subject: Re: Phrase Query Performance Question
> Date: Thu, 1 Nov 2007 11:25:26 -0700
> To: solr-user@lucene.apache.org
>
> On 31-Oct-07, at 11:54 PM, Haishan Chen wrote:
>
> >> Date: Wed, 31 Oct 2007 17:54:53 -0700
> >> Subject: Re: Phrase Query Performance Question
> >> From: [EMAIL PROTECTED]
> >> To: solr-[EMAIL PROTECTED]
> >>
> >> "hurricane katrina" is a very expensive query against a collection
> >> focused on Hurricane Katrina. There will be many matches in many
> >> documents. If you want to measure worst-case, this is fine.
> >>
> >> I'd try other things, like:
> >>
> >> * ninth ward
> >> * Ray Nagin
> >> * Audubon Park
> >> * Canal Street
> >> * French Quarter
> >> * FEMA mistakes
> >> * storm surge
> >> * Jackson Square
> >>
> >> Of course, real query logs are the only real test.
> >>
> >> wunder
> >
> > These terms are not frequent in my index. I believe they are going
> > to be fast. The thing is that I feel 2 million documents is a small
> > index. 100,000 or 200,000 hits is a small set and should always have
> > sub second query performance. Now I am only querying one field and the
> > response is almost one second. I feel I can't achieve sub second
> > performance if I add a bit more complexity to the query.
> >
> > Many of the category terms in my index will appear in more than 5%
> > of the documents and those category terms are very popular search
> > terms. So the example I gave were not extreme cases for my index
>
> I think that you are somewhat misguided about what constitutes a
> small set. A query term that appears in 5-10% of the index in a
> natural language corpus is _extremely_ frequent. Not quite on the
> order of stopwords, but getting there. As a comparison, on an
> extremely large corpus that I have handy, documents containing both
> the word 'auto' and 'repair' (not necessarily adjacent) constitute
> 0.1% of the index. The frequency of the phrase "auto repair" is 0.025%.
>
> @200k docs would be the response rate from an 800million-doc corpus.
>
> What data are you indexing, what what is the intended effect of the
> phrase queries you are performing? Perhaps getting at the issue from
> this end would be more productive than hammering at the phrasequery
> performance question.
 
 
 
 
Thanks for the advice. You certainly have a point. I believe you mean a query 
term that appears in 5-10% of an index in a natural language corpus is 
extremely INFREQUENT?
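
As a quick check on the figures quoted above, the relationship between a term's share of the index and the number of matching documents can be worked out directly. The sketch below is plain Python arithmetic; `docs_matching` is a hypothetical helper name, not part of Solr or Lucene, and the inputs are only the numbers mentioned in this thread.

```python
def docs_matching(total_docs: int, percent_of_index: float) -> int:
    """Return how many documents a given percentage of the index represents."""
    return round(total_docs * percent_of_index / 100)


# A term appearing in 5% and 10% of a 2-million-document index.
print(docs_matching(2_000_000, 5))
print(docs_matching(2_000_000, 10))

# The phrase "auto repair" at 0.025% of an 800-million-document corpus.
print(docs_matching(800_000_000, 0.025))
```

By this arithmetic, a term in 5-10% of a 2-million-document index matches 100,000 to 200,000 documents, which is the same hit count a 0.025% phrase would produce in an 800-million-document corpus.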
 
 
 
 
> > When I start tomcat I saw this message:
> >
> > The Apache Tomcat Native library which allows optimal performance
> > in production environments was not found on the java.library.path
> >
> > Is that mean if I use Apache Tomcat Native library the query
> > performance will be better. Anyone has experience on that?
>
> Unlikely, though it might help you slightly at a high query rate with
> high cache hit ratios.
>
> -Mike
 
I have tried the Apache Tomcat Native library on my Windows machine and you are 
right: there is no obvious difference in query performance.
 
I have also tried the index on a Linux machine. 
The Windows machine: Windows 2003, one Intel(R) Xeon(TM) CPU 3.00 GHz 
(quad-core), 4 GB RAM
The Linux machine: (not sure which version of Linux), two Intel(R) Xeon(R) CPU 
E5310 1.6 GHz (quad-core), 4 GB RAM
 
Both systems have RAID 5, but I don't know how the two arrays differ.
 
I found a substantial improvement in indexing performance on the Linux machine. 
Indexing 2 million documents took more than 5 hours on the Windows machine but 
only one hour on the Linux system. I am really happy to see that. I guess both 
Linux and the extra CPU contributed to the improvement.
 
Query performance is almost the same, though. The CPU on the Linux machine is 
slower, so I think that if the Linux system used the same CPU as the Windows 
system, query performance would improve too. Both indexing and querying are 
CPU bound, if I am right.
 
I guess I have got enough on this question, but I still want to try solr-trunk. 
I will update everyone later.
 
 
 
Thanks
-Haishan