Re: Phrase Query Performance Question and score threshold

2007-11-05 Thread Yonik Seeley
On 11/5/07, Haishan Chen [EMAIL PROTECTED] wrote:
 As for the first issue: the number of different phrase queries with
 performance issues that I have found so far is about 10.

If these are normal phrase queries (no slop), a good solution might be
to simply index and query these phrases as a single token.  One could
do this with a SynonymFilter.
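
A minimal sketch of the single-token idea, assuming the phrases are known in advance. This is plain Python, not Solr's SynonymFilter; KNOWN_PHRASES, merge_known_phrases and analyze are hypothetical names used only for illustration. The point is that the same merging has to run at index time and at query time, so that "auto repair" becomes one term and needs no position lookups.

```python
# Hypothetical phrase list; in a real setup this would come from the
# phrases that are known to be both frequent and popular.
KNOWN_PHRASES = {("auto", "repair"), ("shopping", "center")}


def merge_known_phrases(tokens, phrases=KNOWN_PHRASES, joiner="_"):
    """Collapse each known two-word phrase into a single token."""
    out = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in phrases:
            out.append(joiner.join(pair))  # e.g. "auto_repair"
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def analyze(text):
    """Toy analyzer: lowercase, split on whitespace, merge known phrases.

    Use the same function for documents and for queries.
    """
    return merge_known_phrases(text.lower().split())
```

This sketch replaces the individual words with the merged token. Whether to keep the individual words as well is a separate design choice.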

Oh, and no, a score threshold won't help performance.

 I believe there will be a lot more that I just haven't tried.  It can be
 solved by using faster hardware though.  Also I believe it will help if Solr
 has a distributed search architecture similar to Nutch's, so that it can
 scale out instead of scaling up.

It's coming...

-Yonik


Re: Phrase Query Performance Question

2007-11-02 Thread Walter Underwood
He means extremely frequent and I agree. --wunder

On 11/2/07 1:51 AM, Haishan Chen [EMAIL PROTECTED] wrote:

 Thanks for the advice. You certainly have a point. I believe you mean a query
 term that appears in 5-10% of an index in a natural language corpus is
 extremely INFREQUENT?  



RE: Phrase Query Performance Question

2007-11-02 Thread Haishan Chen




 From: [EMAIL PROTECTED]
 Subject: Re: Phrase Query Performance Question
 Date: Thu, 1 Nov 2007 11:25:26 -0700
 To: solr-user@lucene.apache.org

 On 31-Oct-07, at 11:54 PM, Haishan Chen wrote:

   Date: Wed, 31 Oct 2007 17:54:53 -0700
   Subject: Re: Phrase Query Performance Question
   From: [EMAIL PROTECTED]
   To: solr-[EMAIL PROTECTED]

     hurricane katrina is a very expensive query against a collection
     focused on Hurricane Katrina. There will be many matches in many
     documents. If you want to measure worst-case, this is fine.

     I'd try other things, like:

     * ninth ward
     * Ray Nagin
     * Audubon Park
     * Canal Street
     * French Quarter
     * FEMA mistakes
     * storm surge
     * Jackson Square

     Of course, real query logs are the only real test.

     wunder

   These terms are not frequent in my index. I believe they are going
   to be fast. The thing is that I feel 2 million documents is a small
   index. 100,000 or 200,000 hits is a small set and should always have
   sub-second query performance. Now I am only querying one field and the
   response is almost one second. I feel I can't achieve sub-second
   performance if I add a bit more complexity to the query.

   Many of the category terms in my index will appear in more than 5%
   of the documents and those category terms are very popular search
   terms. So the examples I gave were not extreme cases for my index.

 I think that you are somewhat misguided about what constitutes a
 small set. A query term that appears in 5-10% of the index in a
 natural language corpus is _extremely_ frequent. Not quite on the
 order of stopwords, but getting there. As a comparison, on an
 extremely large corpus that I have handy, documents containing both
 the word 'auto' and 'repair' (not necessarily adjacent) constitute
 0.1% of the index. The frequency of the phrase auto repair is 0.025%.

 @200k docs would be the response rate from an 800million-doc corpus.

 What data are you indexing, and what is the intended effect of the
 phrase queries you are performing? Perhaps getting at the issue from
 this end would be more productive than hammering at the phrase query
 performance question.
 
 
 
 
Thanks for the advice. You certainly have a point. I believe you mean a query 
term that appears in 5-10% of an index in a  natural language corpus is 
extremely INFREQUENT?  
 
 
 
 
   When I start Tomcat I see this message:

   The Apache Tomcat Native library which allows optimal performance
   in production environments was not found on the java.library.path

   Does that mean that if I use the Apache Tomcat Native library, query
   performance will be better? Does anyone have experience with that?

 Unlikely, though it might help you slightly at a high query rate with
 high cache hit ratios.

 -Mike
 
I have tried the Apache Tomcat Native library on my Windows machine and you are
right: there is no obvious difference in query performance.

I have tried the index on a Linux machine.
The Windows machine:  Windows 2003, one Intel(R) Xeon(TM) CPU 3.00 GHz
(quad-core CPU), 4 GB RAM
The Linux machine:  (not sure what version of Linux), two Intel(R) Xeon(R) CPU
E5310 1.6 GHz (quad-core CPU), 4 GB RAM

Both systems have RAID 5 but I don't know the difference.

I found a substantial indexing performance improvement on the Linux machine. On
the Windows machine it took more than 5 hours, but it took only one hour to
index 2 million documents on the Linux system. I am really happy to see that.
I guess both Linux and the extra CPU contributed to the improvement.

Query performance is almost the same though. The CPU on the Linux machine is
slower, so I think that if the Linux system were using the same CPU as the
Windows system, query performance would improve too.  Both indexing and
querying are CPU bound, if I am right.

I guess I got enough on this question. But I still want to try the solr-trunk.
I will update everyone later.
 
 
 
Thanks
-Haishan
 
 
 
 
 
 
 
 
 
 
 
 
 

Re: Phrase Query Performance Question

2007-11-02 Thread Mike Klaas

On 2-Nov-07, at 10:03 AM, Haishan Chen wrote:






Date: Fri, 2 Nov 2007 07:32:30 -0700
Subject: Re: Phrase Query Performance Question
From: [EMAIL PROTECTED]
To: solr-[EMAIL PROTECTED]

He means extremely frequent and I agree. --wunder



Then it means a PHRASE (a combination of terms, excluding stopwords)
that appears in 5% to 10% of an index should NOT be that frequent? I
guess I get the idea.


Phrases should be rarer than individual keywords.  5-10% is  
moderately high even for a _single_ keyword, let alone the  
conjunction of two keywords, let alone the _exact phrase_ of two  
keywords (non stopwords in all of this discussion).


As I mentioned, the 'natural' rate of 'auto'+'repair' on a corpus  
100's of times bigger than yours (web documents) is .1%, and the rate  
of the phrase 'auto repair' is .025%.


It still feels to me that you are trying to do something unique with
your phrase queries.  Unfortunately, you still haven't said what you
are trying to do in general terms, which makes it very difficult for
people to help you.


-Mike


Re: Phrase Query Performance Question

2007-11-02 Thread Chris Hostetter

: It still feels to me that you are trying to do something unique with your
: phrase queries.  Unfortunately, you still haven't said what you are trying to
: do in general terms, which makes it very difficult for people to help you.

Agreed.  This seems like a very special case, but we don't know what the
case is.

If there are specific phrases you know in advance that you will care 
about, and those phrases occur as frequently as the individual 
words, then the best way to deal with them is to index each phrase as 
a single Term (and ignore the individual words)

Speaking more generally to Mike's point...

http://people.apache.org/~hossman/#xyproblem
Your question appears to be an XY Problem ... that is: you are dealing
with X, you are assuming Y will help you, and you are asking about Y
without giving more details about the X so that we can understand the
full issue.  Perhaps the best solution doesn't involve Y at all?
See Also: http://www.perlmonks.org/index.pl?node_id=542341





-Hoss



RE: Phrase Query Performance Question

2007-11-02 Thread Haishan Chen


 Date: Fri, 2 Nov 2007 12:31:29 -0700
 From: [EMAIL PROTECTED]
 To: solr-user@lucene.apache.org
 Subject: Re: Phrase Query Performance Question

 : It still feels to me that you are trying to do something unique with your
 : phrase queries. Unfortunately, you still haven't said what you are trying to
 : do in general terms, which makes it very difficult for people to help you.

 Agreed. This seems like a very special case, but we don't know what the
 case is.

 If there are specific phrases you know in advance that you will care
 about, and those phrases occur as frequently as the individual
 words, then the best way to deal with them is to index each phrase as
 a single Term (and ignore the individual words)

 Speaking more generally to Mike's point...

 http://people.apache.org/~hossman/#xyproblem
 Your question appears to be an XY Problem ... that is: you are dealing
 with X, you are assuming Y will help you, and you are asking about Y
 without giving more details about the X so that we can understand the
 full issue. Perhaps the best solution doesn't involve Y at all?
 See Also: http://www.perlmonks.org/index.pl?node_id=542341

 -Hoss

I think the documents I was indexing cannot be considered natural language
documents. They are constructed following certain rules and then fed into the
indexing process. I guess that because of the rules, many targeted search terms
have a high document frequency. I am under no obligation to achieve
quarter-second performance; I am just interested to see whether it is achievable.
 
Thanks everyone for offering advice
-Haishan
 
 
 
 
 
 
 

Re: Phrase Query Performance Question

2007-11-01 Thread Mike Klaas

On 31-Oct-07, at 11:54 PM, Haishan Chen wrote:



Date: Wed, 31 Oct 2007 17:54:53 -0700
Subject: Re: Phrase Query Performance Question
From: [EMAIL PROTECTED]
To: solr-[EMAIL PROTECTED]

hurricane katrina is a very expensive query against a collection
focused on Hurricane Katrina. There will be many matches in many
documents. If you want to measure worst-case, this is fine.

I'd try other things, like:

* ninth ward
* Ray Nagin
* Audubon Park
* Canal Street
* French Quarter
* FEMA mistakes
* storm surge
* Jackson Square

Of course, real query logs are the only real test.

wunder


These terms are not frequent in my index. I believe they are going
to be fast. The thing is that I feel 2 million documents is a small
index. 100,000 or 200,000 hits is a small set and should always have
sub-second query performance. Now I am only querying one field and the
response is almost one second. I feel I can't achieve sub-second
performance if I add a bit more complexity to the query.


Many of the category terms in my index will appear in more than 5%
of the documents and those category terms are very popular search
terms. So the examples I gave were not extreme cases for my index.


I think that you are somewhat misguided about what constitutes a  
small set.  A query term that appears in 5-10% of the index in a  
natural language corpus is _extremely_ frequent.  Not quite on the  
order of stopwords, but getting there.  As a comparison, on an  
extremely large corpus that I have handy, documents containing both  
the word 'auto' and 'repair' (not necessarily adjacent) constitute  
0.1% of the index.  The frequency of the phrase auto repair is 0.025%.


@200k docs would be the response rate from an 800million-doc corpus.
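
A short worked check of the arithmetic above, as a sketch. Rates are expressed as matches per 100,000 documents so that everything stays in integers; the function names are hypothetical and only the percentages and sizes come from this thread.

```python
def matching_docs(corpus_size, rate_per_100k):
    """Number of documents matching at a given rate (matches per 100,000 docs)."""
    return corpus_size * rate_per_100k // 100_000


def corpus_size_for_hits(hits, rate_per_100k):
    """Corpus size needed to see `hits` matches at the given rate."""
    return hits * 100_000 // rate_per_100k


PHRASE_RATE = 25      # 0.025% of docs contain the phrase auto repair
BOTH_WORDS_RATE = 100  # 0.1% of docs contain both 'auto' and 'repair'
CATEGORY_RATE = 5_000  # 5% of docs, the rate reported for the category terms

if __name__ == "__main__":
    # 200k phrase hits at the 'natural' rate imply an 800-million-doc corpus.
    print(corpus_size_for_hits(200_000, PHRASE_RATE))
    # At the same rate a 2-million-doc index would yield only a few hundred hits.
    print(matching_docs(2_000_000, PHRASE_RATE))
    # At 5% the same index yields 100,000 hits, which is what was measured.
    print(matching_docs(2_000_000, CATEGORY_RATE))
```

In other words, a 5% phrase on a 2-million-doc index produces about as many hits as the natural phrase rate would on a corpus hundreds of times larger.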

What data are you indexing, and what is the intended effect of the
phrase queries you are performing?  Perhaps getting at the issue from
this end would be more productive than hammering at the phrase query
performance question.



When I start tomcat I saw this message:
The Apache Tomcat Native library which allows optimal performance  
in production environments was not found on the java.library.path


Does that mean that if I use the Apache Tomcat Native library, query
performance will be better? Does anyone have experience with that?


Unlikely, though it might help you slightly at a high query rate with  
high cache hit ratios.


-Mike


RE: Phrase Query Performance Question

2007-10-31 Thread Haishan Chen




 From: [EMAIL PROTECTED]
 Subject: Re: Phrase Query Performance Question
 Date: Tue, 30 Oct 2007 11:22:17 -0700
 To: solr-user@lucene.apache.org

 On 30-Oct-07, at 6:09 AM, Yonik Seeley wrote:

   On 10/30/07, Haishan Chen [EMAIL PROTECTED] wrote:

     Thanks a lot for replying Yonik!

     I am running Solr on a Windows 2003 server (standard version), with an
     Intel Xeon CPU at 3.00 GHz and 4.00 GB RAM.
     The index is located on RAID 5 and holds 2 million documents. Is there
     any way to improve query performance without moving to a more
     powerful computer?

     I understand that the query performance of a phrase query (auto
     repair) has to do with the number of documents containing the two
     words. In fact the number of documents that have auto and repair
     is about 10. It is like 5% of the documents containing auto
     and repair. It seems to me 937 ms is too slow.

   Chen, that does seem slow; I'm not sure why.
   1) was this the first search on the index? if so, try running some
   other searches to warm things up first.

 Indeed--phrase matching uses a completely different part of the
 index, so that needs to be warmed too.

 One thing to try is solr trunk: it contains some speedups for phrase
 queries (though perhaps not as substantial as you hope for).

 -Mike
 
Thanks for replying.
The statistics I collected were not from the first query, and I believe I was
running the JVM in server mode.
I configured Tomcat to use the server version of jvm.dll. I guess that is the
way to set it on Windows.
I executed the same phrase query (auto repair) over and over again and that is
the best performance I observed.
Also, when I did the test I disabled all Solr caches. I wanted to see the
performance without the Solr cache.

I am currently trying to test the index on a Linux system with similar hardware.
It will take me some time to set it up.

I read a discussion between Doug Cutting and Andrzej Bialecki about Lucene
performance.

http://mail-archives.apache.org/mod_mbox/lucene-java-user/200512.mbox/[EMAIL PROTECTED]
It mentioned that http://websearch.archive.org/katrina/ (in Nutch) had 10M
documents and a search of hurricane katrina was able to return in 1.35
seconds with 600,867 hits, although the computer it was using might be more
powerful than mine. I feel 937 ms for a phrase query on a single field is kind
of slow. Nutch actually expands a search to more complex queries. My index and
the number of hits on my query (auto repair) are about one fifth of
websearch.archive.org and its testing query. So I feel a reasonable performance
for my query should be less than 300 ms. I am not sure if I am right on that
logic.

Anyway I will collect the statistics on Linux first and try out other options.
 
 
Thanks a lot
Haishan

Re: Phrase Query Performance Question

2007-10-31 Thread Mike Klaas

On 31-Oct-07, at 2:40 PM, Haishan Chen wrote:



http://mail-archives.apache.org/mod_mbox/lucene-java-user/200512.mbox/[EMAIL PROTECTED]
It mentioned that http://websearch.archive.org/katrina/ (in Nutch)
had 10M documents and a search of hurricane katrina was able to
return in 1.35 seconds with 600,867 hits, although the computer
it was using might be more powerful than mine. I feel 937 ms for a
phrase query on a single field is kind of slow. Nutch actually
expands a search to more complex queries. My index and the number of
hits on my query (auto repair) are about one fifth of
websearch.archive.org and its testing query. So I feel a reasonable
performance for my query should be less than 300 ms. I am not sure
if I am right on that logic.


I'm not sure that it is reasonable, but I'm not sure that it isn't.   
However, have you tried other queries?  937ms seems a little high,  
even for phrase queries.


Anyway I will collect the statistics on Linux first and try out
other options.


Have you tried using the performance enhancements present in solr-trunk?

-Mike


RE: Phrase Query Performance Question

2007-10-31 Thread Haishan Chen


 From: [EMAIL PROTECTED]
 Subject: Re: Phrase Query Performance Question
 Date: Wed, 31 Oct 2007 15:25:42 -0700
 To: solr-user@lucene.apache.org

 On 31-Oct-07, at 2:40 PM, Haishan Chen wrote:

   http://mail-archives.apache.org/mod_mbox/lucene-java-user/200512.mbox/[EMAIL PROTECTED]
   It mentioned that http://websearch.archive.org/katrina/ (in Nutch)
   had 10M documents and a search of hurricane katrina was able to
   return in 1.35 seconds with 600,867 hits, although the computer
   it was using might be more powerful than mine. I feel 937 ms for a
   phrase query on a single field is kind of slow. Nutch actually
   expands a search to more complex queries. My index and the number of
   hits on my query (auto repair) are about one fifth of
   websearch.archive.org and its testing query. So I feel a reasonable
   performance for my query should be less than 300 ms. I am not sure
   if I am right on that logic.

 I'm not sure that it is reasonable, but I'm not sure that it isn't.
 However, have you tried other queries? 937ms seems a little high,
 even for phrase queries.

   Anyway I will collect the statistics on Linux first and try out
   other options.

 Have you tried using the performance enhancements present in solr-trunk?

 -Mike
 
Here are some query statistics. The phrase queries look slow to me.
These are queries that have more than 10 hits. For those that return a couple
of thousand hits the response time is quite fast.
But this is a query on one field only.
 
Query                     Hits           Time
(auto repair)             100384 hits    946 ms
(auto repair)             100384 hits    31 ms
(car repair~100)          112183 hits    766 ms
(car repair)              112183 hits    63 ms
(business service~100)    1209751 hits   1500 ms
(business service)        1209751 hits   234 ms
(shopping center~100)     119481 hits    359 ms
(shopping center~100)     119481 hits    63 ms
 
I don't know what solr-trunk is yet, but I will find out.
 
Thank you
Haishan
 
 
 

Re: Phrase Query Performance Question

2007-10-31 Thread Walter Underwood
hurricane katrina is a very expensive query against a collection
focused on Hurricane Katrina. There will be many matches in many
documents. If you want to measure worst-case, this is fine.

I'd try other things, like:

* ninth ward
* Ray Nagin
* Audubon Park
* Canal Street
* French Quarter
* FEMA mistakes
* storm surge
* Jackson Square

Of course, real query logs are the only real test.

wunder

On 10/31/07 3:25 PM, Mike Klaas [EMAIL PROTECTED] wrote:

 On 31-Oct-07, at 2:40 PM, Haishan Chen wrote:
 
 
 http://mail-archives.apache.org/mod_mbox/lucene-java-user/200512.mbox/[EMAIL PROTECTED]
 It mentioned that http://websearch.archive.org/katrina/ (in Nutch)
 had 10M documents and a search of hurricane katrina was able to
 return in 1.35 seconds with 600,867 hits, although the computer
 it was using might be more powerful than mine. I feel 937 ms for a
 phrase query on a single field is kind of slow. Nutch actually
 expands a search to more complex queries. My index and the number of
 hits on my query (auto repair) are about one fifth of
 websearch.archive.org and its testing query. So I feel a reasonable
 performance for my query should be less than 300 ms. I am not sure
 if I am right on that logic.
 
 I'm not sure that it is reasonable, but I'm not sure that it isn't.
 However, have you tried other queries?  937ms seems a little high,
 even for phrase queries.
 
 Anyway I will collect the statistic on linux first and try out
 other options.
 
 Have you tried using the performance enhancements present in solr-trunk?
 
 -Mike



RE: Phrase Query Performance Question

2007-10-31 Thread Chris Hostetter

: (auto repair)             100384 hits    946 ms
: (auto repair)             100384 hits    31 ms
: (car repair~100)          112183 hits    766 ms
: (car repair)              112183 hits    63 ms
: (business service~100)    1209751 hits   1500 ms
: (business service)        1209751 hits   234 ms
: (shopping center~100)     119481 hits    359 ms
: (shopping center~100)     119481 hits    63 ms

if i'm reading those numbers right, every document in your corpus 
containing the words auto or repair also contains the exact phrase 
auto repair with no slop ... this seems HIGHLY unlikely.  can you show 
us *exactly* what the query URLs you are using look like, and show us what 
the request handler section of your solrconfig.xml looks like.

also: where are you getting these times from?  are these from the logging 
output solr produces, or from the client you have hitting solr?
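
A minimal sketch of why the source of the timing matters, assuming the XML response format in which Solr reports its own query time as QTime (milliseconds) inside responseHeader. The fetch argument is a hypothetical stand-in for whatever HTTP call the client makes; nothing here contacts a real server. Client wall-clock time includes network transfer and response rendering, so it can be much larger than QTime.

```python
import time
import xml.etree.ElementTree as ET

# Hand-written sample response for illustration only (not real output).
SAMPLE_RESPONSE = """<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">31</int></lst>
<result name="response" numFound="100384" start="0"/>
</response>"""


def parse_qtime(response_xml):
    """Return the server-side QTime in milliseconds, or None if absent."""
    root = ET.fromstring(response_xml)
    node = root.find("./lst[@name='responseHeader']/int[@name='QTime']")
    return int(node.text) if node is not None else None


def timed_fetch(fetch):
    """Call fetch() (any function returning the response body as text).

    Returns (client_ms, server_qtime_ms) so the two can be compared.
    """
    start = time.perf_counter()
    body = fetch()
    client_ms = (time.perf_counter() - start) * 1000.0
    return client_ms, parse_qtime(body)
```

If the two numbers differ a lot, the time is going somewhere other than the search itself.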

: I don't know what is solr-trunk yet but I will find out

he's referring to the unreleased development code, which you can check out
from the trunk of the Solr subversion repository...

http://lucene.apache.org/solr/version_control.html


-Hoss



RE: Phrase Query Performance Question

2007-10-31 Thread Haishan Chen




 Date: Wed, 31 Oct 2007 19:19:07 -0700
 From: [EMAIL PROTECTED]
 To: solr-user@lucene.apache.org
 Subject: RE: Phrase Query Performance Question

 : (auto repair)             100384 hits    946 ms
 : (auto repair)             100384 hits    31 ms
 : (car repair~100)          112183 hits    766 ms
 : (car repair)              112183 hits    63 ms
 : (business service~100)    1209751 hits   1500 ms
 : (business service)        1209751 hits   234 ms
 : (shopping center~100)     119481 hits    359 ms
 : (shopping center~100)     119481 hits    63 ms

 if i'm reading those numbers right, every document in your corpus
 containing the words auto or repair also contains the exact phrase
 auto repair with no slop ... this seems HIGHLY unlikely. can you show
 us *exactly* what the query URLs you are using look like, and show us what
 the request handler section of your solrconfig.xml looks like.
 
 
Yes, that's exactly what the documents are like. The documents are categorized.
I indexed the category together with the content of the documents using the
text field type.  The URL I used is
select?q=content:(auto repair~100)&fl=title. All other options like
faceting and highlighting are not used.
 
  also: where are you getting these times from? are these from the logging
  output solr produces, or from the client you have hitting solr?

  : I don't know what is solr-trunk yet but I will find out

  he's referring to the unreleased development code, which you can check out
  from the trunk of the Solr subversion repository...

  http://lucene.apache.org/solr/version_control.html

  -Hoss
 
I am getting the time from the client browser
 
 
Thanks
-Haishan
 
 
 
 
 
 
 

RE: Phrase Query Performance Question

2007-10-31 Thread Haishan Chen




 Date: Wed, 31 Oct 2007 17:54:53 -0700
 Subject: Re: Phrase Query Performance Question
 From: [EMAIL PROTECTED]
 To: solr-user@lucene.apache.org

 hurricane katrina is a very expensive query against a collection focused
 on Hurricane Katrina. There will be many matches in many documents. If you
 want to measure worst-case, this is fine.

 I'd try other things, like:

 * ninth ward
 * Ray Nagin
 * Audubon Park
 * Canal Street
 * French Quarter
 * FEMA mistakes
 * storm surge
 * Jackson Square

 Of course, real query logs are the only real test.

 wunder
 
 
 
These terms are not frequent in my index. I believe they are going to be fast.
The thing is that I feel 2 million documents is a small index.
100,000 or 200,000 hits is a small set and should always have sub-second query
performance. Now I am only querying one field and the response is almost one
second. I feel I can't achieve sub-second performance if I add a bit more
complexity to the query.

Many of the category terms in my index will appear in more than 5% of the
documents and those category terms are very popular search terms. So the
examples I gave were not extreme cases for my index.

When I start Tomcat I see this message:
The Apache Tomcat Native library which allows optimal performance in production
environments was not found on the java.library.path

Does that mean that if I use the Apache Tomcat Native library, query
performance will be better? Does anyone have experience with that?
 
 
 
Thanks a lot
-Haishan
 
 
 
 
 
 
 
 
 
 
 
 

RE: Phrase Query Performance Question

2007-10-30 Thread Haishan Chen
Thanks a lot for replying Yonik!
 
I am running Solr on a Windows 2003 server (standard version), with an Intel
Xeon CPU at 3.00 GHz and 4.00 GB RAM.
The index is located on RAID 5 and holds 2 million documents. Is there any way
to improve query performance without moving to a more powerful computer?

I understand that the query performance of a phrase query (auto repair) has to
do with the number of documents containing the two words. In fact the number of
documents that have auto and repair is about 10. It is like 5% of the
documents containing auto and repair.  It seems to me 937 ms is too slow.

Would it be faster if I ran Solr on a Linux system? If so, how much faster
would it generally be?  My performance target for this kind of phrase query is
a quarter of a second or so. Any advice on how to achieve this on the above
hardware?
 
 
Thanks a lot
 
Haishan
 
 
 
 
 
 
 
 
Re: phrase query performance
Yonik Seeley
Fri, 26 Oct 2007 08:09:52 -0700

The differences lie in Lucene. Instead of thinking of phrase queries as slow,
think of term queries as fast :-) Phrase queries need to read and consider
position information that term queries do not.
-Yonik
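
A toy illustration of that point, as a sketch only: this is not Lucene's data structure or code, and build_index, term_query and phrase_query are hypothetical names. The term query below is conjunctive (all words required), which is an assumption about the query being compared. It only intersects document ids, while the phrase query must also walk the position lists of every candidate document.

```python
from collections import defaultdict


def build_index(docs):
    """Map term -> {doc_id: [positions]} for a list of text documents."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def term_query(index, terms):
    """Docs containing every term; only doc ids are consulted."""
    doc_sets = [set(index.get(t, {})) for t in terms]
    return sorted(set.intersection(*doc_sets)) if doc_sets else []


def phrase_query(index, terms):
    """Docs where the terms occur at consecutive positions (no slop)."""
    hits = []
    for doc_id in term_query(index, terms):
        # Extra work compared with term_query: read positions per document.
        for start in index[terms[0]][doc_id]:
            if all(start + k in index[t][doc_id] for k, t in enumerate(terms)):
                hits.append(doc_id)
                break
    return hits
```

When nearly every document that contains both words also contains the phrase, as in this thread, the position check runs for every one of those candidates and filters out almost none.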
 
 
On 10/26/07, Haishan Chen [EMAIL PROTECTED] wrote:
 I am a new Solr user and wonder if anyone can help me with these questions.
 I used Solr to index about two million documents and queried it using the
 standard request handler. I disabled all caches. I found that the phrase
 query was substantially slower than the usual query.  The statistics I
 collected are as follows. I was querying one field only.

 content:(auto repair)        47 ms   repeatable
 content:("auto repair")      937 ms  repeatable
 content:("auto repair"~1)    766 ms  repeatable

 What are the factors affecting phrase query performance? How come the
 phrase query content:("auto repair") is almost 20 times slower than
 content:(auto repair)? I also notice that the phrase query with a slop is
 always faster than the one without a slop. Is the difference I observe here
 a performance problem of Lucene or Solr? It will be appreciated if anyone
 can help.

Re: Phrase Query Performance Question

2007-10-30 Thread Yonik Seeley
On 10/30/07, Haishan Chen [EMAIL PROTECTED] wrote:
 Thanks a lot for replying Yonik!

 I am running Solr on a Windows 2003 server (standard version), with an Intel
 Xeon CPU at 3.00 GHz and 4.00 GB RAM.
 The index is located on RAID 5 and holds 2 million documents. Is there any
 way to improve query performance without moving to a more powerful computer?

 I understand that the query performance of a phrase query (auto repair) has
 to do with the number of documents containing the two words. In fact the
 number of documents that have auto and repair is about 10. It is like 5%
 of the documents containing auto and repair.  It seems to me 937 ms is too
 slow.

Chen, that does seem slow; I'm not sure why.
1) was this the first search on the index?  if so, try running some
other searches to warm things up first.
2) was the JVM in server mode?  (start with -server)
3) shut down unrelated things on the system so that there is more
memory available to the OS to cache the index files
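
Point 1 can be sketched as a small timing harness, assuming run_query is any zero-argument callable that issues the query (a hypothetical placeholder, not a Solr API). The first few calls are discarded as warm-up so that the OS file cache and the JVM have settled before anything is measured.

```python
import time


def best_time_ms(run_query, warmup=3, repeats=5):
    """Run warm-up calls, then return the fastest of `repeats` timed calls (ms)."""
    for _ in range(warmup):
        run_query()  # results discarded: these only warm caches
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        timings.append((time.perf_counter() - start) * 1000.0)
    return min(timings)
```

Reporting the minimum matches the "best performance observed" style of measurement used elsewhere in this thread; a median would be the more conservative choice.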

 Would it be faster if I run solr on linux system?

Maybe... Lucene does rely on the OS caching often used parts of the
index, so this can differ the most between Windows and Linux.  If you
have a Linux box lying around, trying it out quick to remove that
variable would be a good idea.

-Yonik


Re: Phrase Query Performance Question

2007-10-30 Thread Mike Klaas

On 30-Oct-07, at 6:09 AM, Yonik Seeley wrote:


On 10/30/07, Haishan Chen [EMAIL PROTECTED] wrote:

Thanks a lot for replying Yonik!

I am running Solr on a Windows 2003 server (standard version), with an
Intel Xeon CPU at 3.00 GHz and 4.00 GB RAM.
The index is located on RAID 5 and holds 2 million documents. Is there
any way to improve query performance without moving to a more
powerful computer?


I understand that the query performance of a phrase query (auto
repair) has to do with the number of documents containing the two
words. In fact the number of documents that have auto and repair
is about 10. It is like 5% of the documents containing auto
and repair.  It seems to me 937 ms is too slow.


Chen, that does seem slow; I'm not sure why.
1) was this the first search on the index?  if so, try running some
other searches to warm things up first.


Indeed--phrase matching uses a completely different part of the  
index, so that needs to be warmed too.


One thing to try is solr trunk: it contains some speedups for phrase  
queries (though perhaps not as substantial as you hope for).


-Mike




Phrase Query Performance Question

2007-10-26 Thread Haishan Chen
I am a new Solr user and wonder if anyone can help me with these questions. I
used Solr to index about two million documents and queried it using the
standard request handler. I disabled all caches. I found that the phrase query
was substantially slower than the usual query.  The statistics I collected are
as follows. I was querying one field only.

content:(auto repair)        47 ms   repeatable
content:("auto repair")      937 ms  repeatable
content:("auto repair"~1)    766 ms  repeatable

What are the factors affecting phrase query performance? How come the phrase
query content:("auto repair") is almost 20 times slower than
content:(auto repair)? I also notice that the phrase query with a slop is
always faster than the one without a slop. Is the performance difference I
observed here between the phrase query and the regular query a performance
problem of Lucene or Solr?

I was having trouble starting a new discussion thread earlier. Hopefully I did
it right this time.
It will be appreciated if anyone can help.

Haishan