Query Performance of -field:[* TO *]

2009-03-10 Thread Sammy Yu
Hi guys,
I'm trying to limit the size of my index, and one of the things I
have done is to not populate certain fields when the majority of the
documents have the same value.  For example, if most of the documents
in my index have a field color with the value green, I will not
populate that field.  I will populate it only if the document has
another color.

My question is: if I make a query such as -color:[* TO *], will
it be much slower than -color:green?  Thanks in advance for your help.

Best,
Sammy Yu
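
A toy model of the indexing scheme described above may help when comparing the two queries. It is only a sketch in Python, not Solr code: the document list and the helper names (has_field, has_term) are made up for illustration. It suggests that once green is left unindexed, -color:green and -color:[* TO *] no longer select the same documents, so the two are worth comparing for correctness as well as for speed.

```python
# Toy in-memory model; the documents and helper names are hypothetical.
docs = [
    {"id": 1},                    # green, so the color field is left out
    {"id": 2},                    # green, field left out
    {"id": 3, "color": "red"},
    {"id": 4, "color": "blue"},
]

def has_field(doc, field):
    """Models field:[* TO *], i.e. the document has any value in the field."""
    return field in doc

def has_term(doc, field, value):
    """Models field:value, i.e. the document has exactly this indexed term."""
    return doc.get(field) == value

# -color:[* TO *] keeps the documents with no color value (the implicit greens).
no_color = [d["id"] for d in docs if not has_field(d, "color")]

# -color:green excludes nothing here, because green was never indexed.
not_green = [d["id"] for d in docs if not has_term(d, "color", "green")]

print(no_color)
print(not_green)
```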


debugQuery missing boost

2009-02-11 Thread Sammy Yu
Hi,
   I'm trying to get some information on how boost is used in the ranking
calculation via the debugQuery parameter for the following query:
(bodytext:iphone OR bodytext:firmware)^2.0 OR dateCreatedYear:2009^5.0

For one of the matching documents I can see:

4.7144237 = (MATCH) sum of:
  2.2903786 = (MATCH) sum of:
    0.7662499 = (MATCH) weight(bodytext:iphon in 8339166), product of:
      0.427938 = queryWeight(bodytext:iphon), product of:
        5.729801 = idf(docFreq=76646, numDocs=8682037)
        0.07468636 = queryNorm
      1.7905629 = (MATCH) fieldWeight(bodytext:iphon in 8339166), product of:
        1.0 = tf(termFreq(bodytext:iphon)=1)
        5.729801 = idf(docFreq=76646, numDocs=8682037)
        0.3125 = fieldNorm(field=bodytext, doc=8339166)
    1.5241286 = (MATCH) weight(bodytext:firmwar in 8339166), product of:
      0.60354054 = queryWeight(bodytext:firmwar), product of:
        8.081 = idf(docFreq=7300, numDocs=8682037)
        0.07468636 = queryNorm
      2.5253127 = (MATCH) fieldWeight(bodytext:firmwar in 8339166), product of:
        1.0 = tf(termFreq(bodytext:firmwar)=1)
        8.081 = idf(docFreq=7300, numDocs=8682037)
        0.3125 = fieldNorm(field=bodytext, doc=8339166)
  2.424045 = (MATCH) weight(dateCreatedYear:2009^5.0 in 8339166), product of:
    0.6727613 = queryWeight(dateCreatedYear:2009^5.0), product of:
      5.0 = boost
      3.603128 = idf(docFreq=642831, numDocs=8682037)
      0.03734318 = queryNorm
    3.603128 = (MATCH) fieldWeight(dateCreatedYear:2009 in 8339166), product of:
      1.0 = tf(termFreq(dateCreatedYear:2009)=1)
      3.603128 = idf(docFreq=642831, numDocs=8682037)
      1.0 = fieldNorm(field=dateCreatedYear, doc=8339166)

which shows that the 5.0 boost in dateCreatedYear:2009^5.0 is being
applied; however, the 2.0 boost in (bodytext:iphone OR
bodytext:firmware)^2.0 is missing.  How is the 2.0 boost being applied
to the score?

Thanks,
Sammy
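
The numbers in the explain output above can be checked by hand, and doing so hints at where the 2.0 went. The Python sketch below uses only values copied from that output (the variable names are mine). It recomputes one bodytext clause and compares the two queryNorm values. The bodytext clauses report a queryNorm twice as large as the one reported for dateCreatedYear, which would be consistent with the 2.0 boost of the parenthesized group being folded into queryNorm instead of being printed as a separate boost line. That reading is an inference from the arithmetic, not something the output states.

```python
import math

# Values copied from the debugQuery output above.
idf_iphon = 5.729801
field_norm = 0.3125
query_norm_bodytext = 0.07468636   # reported under the bodytext clauses
query_norm_year = 0.03734318       # reported under dateCreatedYear:2009^5.0

# weight = queryWeight * fieldWeight for bodytext:iphon
query_weight = idf_iphon * query_norm_bodytext
field_weight = 1.0 * idf_iphon * field_norm   # tf * idf * fieldNorm
weight = query_weight * field_weight

# Ratio between the two reported queryNorm values.
ratio = query_norm_bodytext / query_norm_year

print(round(weight, 5))
print(round(ratio, 5))
```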


Re: SOLR 1.4 and 1.3 diff and other

2008-12-17 Thread Sammy Yu
Hi Yonik,
   Thanks for the quick response.  Do you know when 1.4 is scheduled
to be released, or whether it is possible to backport the NIO
implementation into 1.3?  If you could give me a pointer, that would be
great.  It seems like a huge performance gain that would be of value
to a lot of people.

Thanks,
Sammy

On Wed, Dec 17, 2008 at 5:36 PM, Yonik Seeley ysee...@gmail.com wrote:
 On Wed, Dec 17, 2008 at 7:52 PM, Sammy Yu temi...@gmail.com wrote:
 I read somewhere that there are contention issues with the current
 cache implementation of LRUCache in 1.3 in that it is synchronous.
 Could this be the reason why the filter queries are slow?

 Probably not.  The change is much more likely due to using a
 non-blocking NIO implementation in Lucene.

 -Yonik



Slow Response time after optimize

2008-12-15 Thread Sammy Yu
Hi guys,
   I have a typical master/slave setup running Solr 1.3.0.  I did
some basic scalability tests with JMeter, tweaked our environment,
and determined that we can handle approximately 26 simultaneous
threads with end-to-end response times under 200ms, even with
distribution typically happening every 5 minutes.  However, as soon as
I issue a single optimize on the master, the response time goes up to
over 500ms and does not seem to recover.  As soon as I restart, the
response time is back down to 200ms.  My index is approximately 5 GB
in size, and the queries are simple disjunction queries such as
title:iphone OR bodytext:iphone.  Has anybody seen this issue before?

Thanks,
Sammy


Re: Standard request with functional query

2008-12-15 Thread Sammy Yu
Hey guys,
Thanks for the response, but how would I make recency a factor in
scoring documents with the standard request handler?
The query (title:iphone OR bodytext:iphone OR title:firmware OR
bodytext:firmware) AND _val_:ord(dateCreated)^0.1
seems to do something very similar to just sorting by dateCreated,
rather than making dateCreated a part of the score.

Thanks,
Sammy
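
One possible explanation, offered as an assumption and not verified against this index: ord(dateCreated) returns the position of the date in the sorted list of distinct values, so with millions of distinct dates the function value can be in the millions, and even a 0.1 boost leaves it far larger than a text relevance score of a few units. The Solr FunctionQuery documentation describes recip(x,m,a,b), computed as a/(m*x+b), together with rord, as a way to keep a recency term bounded. The Python sketch below only illustrates the magnitudes; the text score, ordinals, and constants are invented for illustration, and the simple addition ignores query normalization.

```python
def recip(x, m, a, b):
    """Reciprocal function as documented for Solr function queries: a / (m*x + b)."""
    return a / (m * x + b)

# Hypothetical numbers, chosen only to show orders of magnitude.
num_distinct_dates = 5_000_000
text_score = 2.3                      # made-up relevance score for the text clauses

for label, ord_value in [("oldest", 1), ("newest", num_distinct_dates)]:
    rord_value = num_distinct_dates - ord_value + 1   # reverse ordinal

    # Boosted raw ordinal, as in _val_:ord(dateCreated)^0.1
    raw_term = 0.1 * ord_value
    # Bounded alternative: recip(rord(dateCreated),1,1000,1000)
    bounded_term = recip(rord_value, 1, 1000, 1000)

    print(label, text_score + raw_term, round(text_score + bounded_term, 4))
```

With the raw ordinal, the date term swamps the text score for recent documents, which would look much like a sort by date; with the reciprocal form, the date term stays between 0 and 1.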



Standard request with functional query

2008-12-04 Thread Sammy Yu
Hi guys,
I have a standard query that searches across multiple text fields such as
q=title:iphone OR bodytext:iphone OR title:firmware OR bodytext:firmware

This comes back with documents that have iphone and firmware, which is
great (I know I can use the dismax handler, but it seems to be really
slow).  Now I want to give some more weight to more recent documents
(there is a dateCreated field in each document).

So I've modified the query as such:
(title:iphone OR bodytext:iphone OR title:firmware OR
bodytext:firmware) AND _val_:ord(dateCreated)^0.1
URLencoded to 
q=(title%3Aiphone+OR+bodytext%3Aiphone+OR+title%3Afirmware+OR+bodytext%3Afirmware)+AND+_val_%3Aord(dateCreated)^0.1

However, the results are not what one would expect.  The first few
documents that come back contain only the word iphone and appear to be
sorted by date created.  It seems to completely ignore the text score
and use the dateCreated field for the score.

On a not directly related issue, it seems that if you put the weight
within the double quotes:
(title:iphone OR bodytext:iphone OR title:firmware OR
bodytext:firmware) AND _val_:"ord(dateCreated)^0.1"

the parser complains:
org.apache.lucene.queryParser.ParseException: Cannot parse
'(title:iphone OR bodytext:iphone OR title:firmware OR
bodytext:firmware) AND _val_:ord(dateCreated)^0.1': Expected ',' at
position 16 in 'ord(dateCreated)^0.1'

Thanks,
Sammy
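
For the URL-encoded form, a small Python sketch using the standard library's urllib.parse is shown below. It percent-encodes every reserved character, including the parentheses and the ^, which the hand-encoded URL above leaves as-is. Whether that difference matters depends on the HTTP client and servlet container, so treat it as a precaution and not as a diagnosis.

```python
from urllib.parse import quote_plus, unquote_plus

query = (
    "(title:iphone OR bodytext:iphone OR title:firmware OR "
    "bodytext:firmware) AND _val_:ord(dateCreated)^0.1"
)

# quote_plus turns spaces into '+' and percent-encodes ':', '(', ')' and '^'.
encoded = quote_plus(query)
print("q=" + encoded)

# Decoding gives back the original query string.
assert unquote_plus(encoded) == query
```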


Re: SOLR query times

2008-10-13 Thread Sammy Yu
Hi Grant,
   Thanks for your response.  I'm trying to simulate our production
environment's search traffic, which has a very low cache hit rate.
Turning off the caches can help us better understand query times and
the load on the slaves when distribution occurs, using a small list of
pre-canned queries.

If the latency is caused by the loading and caching of Lucene's segments,
is there a way to force Lucene to preload them?  This seems to be what
happens in our production environment: when SOLR restarts, the load
spikes and it takes a couple of hours before it settles down.

Also, are there generally accepted ways of doing scalability and
performance characterization?

Thanks,
Sammy.

On Sun, Oct 12, 2008 at 8:17 AM, Grant Ingersoll wrote:
 This is pretty typical.  The first query is always more expensive, as Lucene
 lazily loads some pieces of the index into memory and you may see the
 FieldCache in action, depending on sorting, not to mention you are also
 seeing operating system caching take place.

 Is there some reason you don't want these or are you just trying to
 understand the why?

 -Grant


 --
 Grant Ingersoll
 Lucene Boot Camp Training Nov. 3-4, 2008, ApacheCon US New Orleans.
 http://www.lucenebootcamp.com


 Lucene Helpful Hints:
 http://wiki.apache.org/lucene-java/BasicsOfPerformance
 http://wiki.apache.org/lucene-java/LuceneFAQ
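
The lazy loading described above can be pictured with a toy model. The Python sketch below is not Lucene code; LazyIndex, its data, and the 50 ms delay are invented for illustration. It shows why the first query can be slow and later ones fast even though no result cache exists: the data needed to answer the query is loaded once, on first use.

```python
import time

class LazyIndex:
    """Toy stand-in for an index that loads its term data on first use."""

    def __init__(self):
        self._terms = None
        self.loads = 0

    def _load_terms(self):
        # Simulates an expensive one-time read from disk.
        time.sleep(0.05)
        self.loads += 1
        return {"iphone": [1, 5, 9], "firmware": [5, 7]}

    def search(self, term):
        if self._terms is None:          # only the first query pays this cost
            self._terms = self._load_terms()
        return self._terms.get(term, [])

index = LazyIndex()
timings = []
for _ in range(3):
    start = time.perf_counter()
    hits = index.search("iphone")
    timings.append(time.perf_counter() - start)

print(hits, index.loads)
print(["%.1f ms" % (t * 1000) for t in timings])
```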


SOLR query times

2008-10-10 Thread Sammy Yu
Hi,
   I'm using SOLR 1.3 on an index with approximately 8 million
documents.  I would like to disable SOLR's caches so that it is easier
for me to test the scenario where cache hits are unlikely.  I've
disabled caching by commenting out the filterCache, queryResultCache,
and documentCache sections in solrconfig.xml, as suggested by the Wiki.
It seems to be disabled because the admin interface no longer shows
any entries in the Cache section.

However, it appears that there is still some sort of caching taking
place.  The first time I make a specific query it takes around 100
msec; subsequent queries take around 15 msec.  Is there some
sort of caching happening at the Lucene level?

Thanks for your help,
Sammy Yu