In my previous posting, I said: "Subsequent calls to ScaleFloatFunction.getValues bypassed 'createScaleInfo' and added ~0 time."

These subsequent calls are for the remaining segments in the index reader (21 segments).

Peter
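For context on why the later calls are nearly free: ScaleFloatFunction computes its min/max once per query and caches the result in the per-query context map, so every remaining segment simply reuses it. A simplified sketch of that pattern (an approximation of the Lucene 4.x code, not a verbatim copy; 'scaledValues' is a hypothetical helper):

    // Sketch of the caching pattern in ScaleFloatFunction.getValues
    // (simplified approximation of the Lucene 4.x class).
    public FunctionValues getValues(Map context, AtomicReaderContext readerContext)
        throws IOException {
      ScaleInfo scaleInfo = (ScaleInfo) context.get(source);
      if (scaleInfo == null) {
        // First segment only: scan every document in the reader to find the
        // min and max scores (the one-time ~33 ms cost measured below)
        scaleInfo = createScaleInfo(context, readerContext);
      }
      // Remaining segments (the 21 mentioned above) hit the cache and add ~0 time
      return scaledValues(scaleInfo, context, readerContext);
    }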
On Fri, Dec 6, 2013 at 2:10 PM, Peter Keegan <peterlkee...@gmail.com> wrote:

> I added some timing logging to IndexSearcher and ScaleFloatFunction and
> compared a simple DisMax query with a DisMax query wrapped in the scale
> function. The index size was 500K docs, 61K docs match the DisMax query.
> The simple DisMax query took 33 ms, the function query took 89 ms. What I
> found was:
>
> 1. The scale query only normalized the scores once (in
> ScaleInfo.createScaleInfo) and added 33 ms to the Qtime. Subsequent calls
> to ScaleFloatFunction.getValues bypassed 'createScaleInfo' and added ~0
> time.
>
> 2. The FunctionQuery 'nextDoc' iterations added 16 ms over the DisMax
> 'nextDoc' iterations.
>
> Here's the breakdown:
>
> Simple DisMax query:
> weight.scorer: 3 ms (get term enum)
> scorer.score: 23 ms (nextDoc iterations)
> other: 3 ms
> Total: 33 ms
>
> DisMax wrapped in ScaleFloatFunction:
> weight.scorer: 39 ms (get scaled values)
> scorer.score: 39 ms (nextDoc iterations)
> other: 11 ms
> Total: 89 ms
>
> Even with any improvements to 'scale', all function queries will add a
> linear increase to the Qtime as index size increases, since they match all
> docs.
>
> Trey: I'd be happy to test any patch that you find improves the speed.
>
>
> On Mon, Dec 2, 2013 at 11:21 PM, Trey Grainger <solrt...@gmail.com> wrote:
>
>> We're working on the same problem with the scale(query(...)) combination,
>> so I'd like to share a bit more information that may be useful.
>>
>> *On the scale function:*
>> Even though the scale query has to calculate the scores for all
>> documents, it is actually doing this work twice for each ValueSource
>> (once to calculate the min and max values, and then again when actually
>> scoring the documents), which is inefficient.
>>
>> To solve the problem, we're in the process of putting a cache inside the
>> scale function to remember the values for each document when they are
>> initially computed (to find the min and max) so that the second pass can
>> just use the previously computed values for each document. Our theory is
>> that most of the extra time due to the scale function is really just the
>> result of doing duplicate work.
>>
>> No promises this won't be overly costly in terms of memory utilization,
>> but we'll see what we get in terms of speed improvements and will share
>> the code if it works out well. Alternate implementation suggestions (or
>> criticism of a cache like this) are also welcomed.
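For illustration, here is a rough sketch of the kind of per-document cache Trey describes; all names are hypothetical and this is not the actual patch:

    import org.apache.lucene.queries.function.FunctionValues;

    // Hypothetical sketch: remember each document's inner-query score during
    // the min/max pass so the scaling pass can reuse it instead of re-scoring.
    static float[] scaleWithCache(FunctionValues inner, int maxDoc,
                                  float scaleMin, float scaleMax) {
      float[] cached = new float[maxDoc];  // the extra memory Trey mentions
      float min = Float.POSITIVE_INFINITY, max = Float.NEGATIVE_INFINITY;
      // Pass 1: find min/max, caching every value as it is computed
      for (int doc = 0; doc < maxDoc; doc++) {
        float v = inner.floatVal(doc);
        cached[doc] = v;
        if (v < min) min = v;
        if (v > max) max = v;
      }
      // Pass 2: scale the cached values; the inner query is never re-scored
      float range = (max == min) ? 1f : (max - min);
      for (int doc = 0; doc < maxDoc; doc++) {
        cached[doc] = (cached[doc] - min) / range * (scaleMax - scaleMin) + scaleMin;
      }
      return cached;
    }

At 4 bytes per document, such a cache would cost about 2 MB for the 500K-doc index discussed above, which is the memory trade-off Trey is flagging.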
>>
>> *On the NoOp product function: scale(prod(1, query(...))):*
>> We do the same thing, which ultimately is just an unnecessary waste of a
>> loop through all documents to do an extra multiplication step. I just
>> debugged the code and uncovered the problem. There is a Map (called
>> context) that is passed through to each value source to store
>> intermediate state, and both the query and scale functions are passing
>> the ValueSource for the query function in as the KEY to this Map (as
>> opposed to using some composite key that makes sense in the current
>> context). Essentially, these lines are overwriting each other:
>>
>> Inside ScaleFloatFunction: context.put(this.source, scaleInfo);
>> //this.source refers to the QueryValueSource, and the scaleInfo refers to
>> a ScaleInfo object
>> Inside QueryValueSource: context.put(this, w); //this refers to the same
>> QueryValueSource from above, and the w refers to a Weight object
>>
>> As such, when the ScaleFloatFunction later goes to read the ScaleInfo from
>> the context Map, it unexpectedly pulls the Weight object out instead, and
>> thus the invalid cast exception occurs. The NoOp multiplication works
>> because it puts a "different" ValueSource between the query and the
>> ScaleFloatFunction, such that this.source (in ScaleFloatFunction) != this
>> (in QueryValueSource).
>>
>> This should be an easy fix. I'll create a JIRA ticket to use better key
>> names in these functions and push up a patch. This will eliminate the
>> need for the extra NoOp function.
>>
>> -Trey
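To make the collision concrete, here is a minimal illustration along with one hypothetical way to namespace the keys (the actual patch may choose something different):

    // Both entries use the same QueryValueSource instance as the map key:
    context.put(this.source, scaleInfo);  // in ScaleFloatFunction
    context.put(this, w);                 // in QueryValueSource; 'this' here is
                                          // the same object as 'this.source' above
    // ScaleFloatFunction's later context.get(source) therefore finds the
    // Weight, and the cast to ScaleInfo throws the ClassCastException.

    // One possible fix: composite keys that cannot collide, e.g.
    context.put(new AbstractMap.SimpleEntry<>("scale", this.source), scaleInfo);
    context.put(new AbstractMap.SimpleEntry<>("weight", this), w);

(AbstractMap.SimpleEntry compares by value, so reader and writer only need to construct the same composite key, not share an instance.)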
>>
>> On Mon, Dec 2, 2013 at 12:41 PM, Peter Keegan <peterlkee...@gmail.com> wrote:
>>
>> > I'm pursuing this possible PostFilter solution. I can see how to collect
>> > all the hits and recompute the scores in a PostFilter, after all the
>> > hits have been collected (for scaling). What I can't see is how to get
>> > the custom doc/score values back into the main query's HitQueue. Any
>> > advice?
>> >
>> > Thanks,
>> > Peter
>> >
>> >
>> > On Fri, Nov 29, 2013 at 9:18 AM, Peter Keegan <peterlkee...@gmail.com> wrote:
>> >
>> > > Instead of using a function query, could I use the edismax query (plus
>> > > some low-cost filters not shown in the example) and implement the
>> > > scale/sum/product computation in a PostFilter? Is the query's maxScore
>> > > available there?
>> > >
>> > > Thanks,
>> > > Peter
>> > >
>> > >
>> > > On Wed, Nov 27, 2013 at 1:58 PM, Peter Keegan <peterlkee...@gmail.com> wrote:
>> > >
>> > >> Although the 'scale' is a big part of it, here's a closer breakdown.
>> > >> Here are 4 queries with increasingly complex functions, and their
>> > >> response times (caching turned off in solrconfig):
>> > >>
>> > >> 100 msec:
>> > >> select?q={!edismax v='news' qf='title^2 body'}
>> > >>
>> > >> 135 msec:
>> > >> select?qq={!edismax v='news' qf='title^2 body'}
>> > >>   &q={!func}product(field(myfield),query($qq))&fq={!query v=$qq}
>> > >>
>> > >> 200 msec:
>> > >> select?qq={!edismax v='news' qf='title^2 body'}
>> > >>   &q={!func}sum(product(0.75,query($qq)),product(0.25,field(myfield)))
>> > >>   &fq={!query v=$qq}
>> > >>
>> > >> 320 msec:
>> > >> select?qq={!edismax v='news' qf='title^2 body'}
>> > >>   &scaledQ=scale(product(query($qq),1),0,1)
>> > >>   &q={!func}sum(product(0.75,$scaledQ),product(0.25,field(myfield)))
>> > >>   &fq={!query v=$qq}
>> > >>
>> > >> Btw, that no-op product is necessary, else you get this exception:
>> > >>
>> > >> org.apache.lucene.search.BooleanQuery$BooleanWeight cannot be cast to
>> > >> org.apache.lucene.queries.function.valuesource.ScaleFloatFunction$ScaleInfo
>> > >>
>> > >> thanks,
>> > >> peter
>> > >>
>> > >>
>> > >> On Wed, Nov 27, 2013 at 1:30 PM, Chris Hostetter <hossman_luc...@fucit.org> wrote:
>> > >>
>> > >>> : So, this query does just what I want, but it's typically 3 times
>> > >>> : slower than the edismax query without the functions:
>> > >>>
>> > >>> That's because the scale() function is inherently slow (it has to
>> > >>> compute the min & max value for every document in order to know how
>> > >>> to scale them).
>> > >>>
>> > >>> What you are seeing is the price you have to pay to get that query
>> > >>> with a "normalized" 0-1 value.
>> > >>>
>> > >>> (You might be able to save a little bit of time by eliminating that
>> > >>> no-op multiply by 1: "product(query($qq),1)" ... but I doubt you'll
>> > >>> even notice much of a change given that scale function.)
>> > >>>
>> > >>> : Is there any way to speed this up? Would writing a custom function
>> > >>> : query that compiled all the function queries together be any faster?
>> > >>>
>> > >>> If you can find a faster implementation for scale() then by all means
>> > >>> let us know, and we can fold it back into Solr.
>> > >>>
>> > >>>
>> > >>> -Hoss
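For reference, the 320 msec query computes, per matching document, with min and max taken over the wrapped query's scores across all documents (the full pass Hoss describes):

    scaled = (score - min) / (max - min)              // scale(...,0,1)
    final  = 0.75 * scaled + 0.25 * myfield

Once the context-map key collision described above is fixed, the no-op product should no longer be necessary, so the scaled parameter could presumably be written directly as:

    scaledQ=scale(query($qq),0,1)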