> From: [EMAIL PROTECTED]
> Subject: Re: Phrase Query Performance Question
> Date: Thu, 1 Nov 2007 11:25:26 -0700
> To: solr-user@lucene.apache.org
>
> On 31-Oct-07, at 11:54 PM, Haishan Chen wrote:
>
> >> Date: Wed, 31 Oct 2007 17:54:53 -0700
> >> Subject: Re: Phrase Query Performance Question
> >> From: [EMAIL PROTECTED]
> >> To: solr-[EMAIL PROTECTED]
> >>
> >> "hurricane katrina" is a very expensive query against a collection
> >> focused on Hurricane Katrina. There will be many matches in many
> >> documents. If you want to measure worst-case, this is fine.
> >>
> >> I'd try other things, like:
> >>
> >> * ninth ward
> >> * Ray Nagin
> >> * Audubon Park
> >> * Canal Street
> >> * French Quarter
> >> * FEMA mistakes
> >> * storm surge
> >> * Jackson Square
> >>
> >> Of course, real query logs are the only real test.
> >>
> >> wunder
> >
> > These terms are not frequent in my index. I believe they are going
> > to be fast. The thing is that I feel 2 million documents is a small
> > index.
> >
> > 100,000 or 200,000 hits is a small set and should always have sub
> > second query performance. Now I am only querying one field and the
> > response is almost one second. I feel I can't achieve sub second
> > performance if I add a bit more complexity to the query.
> >
> > Many of the category terms in my index will appear in more than 5%
> > of the documents and those category terms are very popular search
> > terms. So the example I gave were not extreme cases for my index
>
> I think that you are somewhat misguided about what constitutes a
> small set. A query term that appears in 5-10% of the index in a
> natural language corpus is _extremely_ frequent. Not quite on the
> order of stopwords, but getting there. As a comparison, on an
> extremely large corpus that I have handy, documents containing both
> the word 'auto' and 'repair' (not necessarily adjacent) constitute
> 0.1% of the index. The frequency of the phrase "auto repair" is
> 0.025%.
>
> @200k docs would be the response rate from an 800million-doc corpus.
>
> What data are you indexing, and what is the intended effect of the
> phrase queries you are performing? Perhaps getting at the issue from
> this end would be more productive than hammering at the phrasequery
> performance question.

Thanks for the advice. You certainly have a point. I believe you mean a query
term that appears in 5-10% of an index in a natural language corpus is
extremely INFREQUENT?
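
To check the numbers in this exchange, here is a rough sketch of the arithmetic. The function names are made up for illustration; the 2 million figure is my index size and the percentages are the ones quoted above.

```python
# Rough arithmetic check of the figures discussed in this thread.
# Function names are hypothetical; they are not part of Solr or Lucene.

def hits(total_docs, fraction):
    """Number of documents matched when a term/phrase occurs in `fraction` of the index."""
    return round(total_docs * fraction)

def corpus_size_for(hit_count, fraction):
    """Corpus size needed for a term/phrase with the given frequency to yield `hit_count` hits."""
    return round(hit_count / fraction)

INDEX_DOCS = 2_000_000  # size of my index

# Category terms appearing in 5-10% of a 2 million document index
print(hits(INDEX_DOCS, 0.05))
print(hits(INDEX_DOCS, 0.10))

# The quoted "auto repair" phrase frequency (0.025%) applied to the same index
print(hits(INDEX_DOCS, 0.00025))

# Corpus size at which a 0.025% phrase would return 200k documents
print(corpus_size_for(200_000, 0.00025))
```

So a term in 5-10% of my index gives the 100,000 to 200,000 hits I mentioned, while a phrase as common as "auto repair" in the quoted corpus would give only a few hundred.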

> > When I start tomcat I saw this message:
> >
> > The Apache Tomcat Native library which allows optimal performance
> > in production environments was not found on the java.library.path
> >
> > Is that mean if I use Apache Tomcat Native library the query
> > performance will be better. Anyone has experience on that?
>
> Unlikely, though it might help you slightly at a high query rate with
> high cache hit ratios.
>
> -Mike

I have tried the Apache Tomcat Native library on my Windows machine and you
are right. There is no obvious difference in query performance.
I have also tried the index on a Linux machine.
The Windows machine: Windows 2003, one Intel(R) Xeon(TM) CPU 3.00 GHz
(quad-core CPU), 4 GB RAM
The Linux machine: (not sure what version of Linux), two Intel(R) Xeon(R) CPU
E5310 1.6 GHz (quad-core CPUs), 4 GB RAM
Both systems have RAID 5, but I don't know how the two setups differ.
I found a substantial indexing performance improvement on the Linux machine.
On the Windows machine, indexing 2 million documents took more than 5 hours,
but on the Linux system it took only one hour. I am really happy to see that.
I guess both Linux and the extra CPU contributed to the improvement.
Query performance is almost the same, though. The CPU on the Linux machine is
slower, so I think that if the Linux system were using the same CPU as the
Windows system, query performance would improve too. Both indexing and
querying are CPU bound, if I am right.
I guess I have got enough on this question. But I still want to try the
solr-trunk. I will update everyone later.
Thanks
-Haishan