In my previous posting, I said:

  "Subsequent calls to ScaleFloatFuntion.getValues bypassed
'createScaleInfo and  added ~0 time."

These subsequent calls are for the remaining segments in the index reader
(21 segments).
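
To make the call pattern concrete, here's a standalone sketch (my own
simplified model with made-up names, not the actual Lucene source) of how
the per-query context map lets the first getValues call pay for the
min/max pass while the calls for the remaining segments reuse it:

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class ScaleCacheSketch {
  // stands in for the wrapped ValueSource used as the context key
  private final Object source = new Object();

  // full pass over every segment to find min/max (~33 ms in my test)
  float[] createScaleInfo(float[][] segments) {
    float min = Float.POSITIVE_INFINITY, max = Float.NEGATIVE_INFINITY;
    for (float[] seg : segments)
      for (float v : seg) { min = Math.min(min, v); max = Math.max(max, v); }
    return new float[] { min, max };
  }

  // called once per segment; only the first call pays for createScaleInfo
  float[] getValues(Map<Object, Object> context, float[][] allSegments, int segment) {
    float[] scaleInfo = (float[]) context.get(source);
    if (scaleInfo == null) {
      scaleInfo = createScaleInfo(allSegments);
      context.put(source, scaleInfo); // cached for the remaining segments
    }
    float min = scaleInfo[0], range = scaleInfo[1] - min;
    float[] scaled = allSegments[segment].clone();
    for (int i = 0; i < scaled.length; i++) // scale into [0,1]
      scaled[i] = range == 0 ? 0 : (scaled[i] - min) / range;
    return scaled;
  }

  public static void main(String[] args) {
    ScaleCacheSketch s = new ScaleCacheSketch();
    Map<Object, Object> context = new HashMap<>();
    float[][] segments = { { 1f, 5f }, { 3f, 9f } };
    System.out.println(Arrays.toString(s.getValues(context, segments, 0)));
    System.out.println(Arrays.toString(s.getValues(context, segments, 1))); // ~0 extra work
  }
}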

Peter



On Fri, Dec 6, 2013 at 2:10 PM, Peter Keegan <peterlkee...@gmail.com> wrote:

> I added some timing logging to IndexSearcher and ScaleFloatFunction and
> compared a simple DisMax query with a DisMax query wrapped in the scale
> function. The index size was 500K docs, 61K docs match the DisMax query.
> The simple DisMax query took 33 ms, the function query took 89 ms. What I
> found was:
>
> 1. The scale query only normalized the scores once (in
> ScaleInfo.createScaleInfo) and added 33 ms to the Qtime. Subsequent calls
> to ScaleFloatFunction.getValues bypassed createScaleInfo and added ~0 time.
>
> 2. The FunctionQuery 'nextDoc' iterations added 16 ms over the DisMax
> 'nextDoc' iterations.
>
> Here's the breakdown:
>
> Simple DisMax query:
> weight.scorer: 3 ms (get term enum)
> scorer.score: 23 ms (nextDoc iterations)
> other: 3 ms
> Total: 33 ms
>
> DisMax wrapped in ScaleFloatFunction:
> weight.scorer: 39 ms (get scaled values)
> scorer.score: 39 ms (nextDoc iterations)
> other: 11 ms
> Total: 89 ms
>
> Even with improvements to 'scale', all function queries will increase
> Qtime linearly with index size, since they match all docs.
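>
> To make that concrete, here's a rough standalone model (my sketch, not
> the actual Lucene scorer) of why a FunctionQuery's cost is O(maxDoc):
> its scorer matches every document, so the nextDoc work grows with the
> index no matter how selective the wrapped query is.
>
> // Simplified model of a FunctionQuery scorer: nextDoc() visits every
> // doc in the segment, so the iteration cost is O(maxDoc).
> class AllDocsScorerSketch {
>   private final int maxDoc;
>   private int doc = -1;
>
>   AllDocsScorerSketch(int maxDoc) { this.maxDoc = maxDoc; }
>
>   int nextDoc() {
>     // no skipping: every doc id up to maxDoc is returned
>     return ++doc < maxDoc ? doc : Integer.MAX_VALUE; // NO_MORE_DOCS
>   }
>
>   float score() {
>     return 1.0f; // a real scorer would evaluate the function for 'doc'
>   }
> }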
>
> Trey: I'd be happy to test any patch that you find improves the speed.
>
>
>
> On Mon, Dec 2, 2013 at 11:21 PM, Trey Grainger <solrt...@gmail.com> wrote:
>
>> We're working on the same problem with the scale(query(...))
>> combination, so I'd like to share a bit more information that may be
>> useful.
>>
>> *On the scale function:*
>> Even though the scale query has to calculate the scores for all
>> documents, it is actually doing this work twice for each ValueSource
>> (once to calculate the min and max values, and then again when actually
>> scoring the documents), which is inefficient.
>>
>> To solve the problem, we're in the process of putting a cache inside
>> the scale function that remembers each document's value when it is
>> initially computed (to find the min and max), so that the second pass
>> can just reuse the previously computed values. Our theory is that most
>> of the extra time due to the scale function is really just the result
>> of doing duplicate work.
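>>
>> As a rough sketch of the idea (hypothetical, not the final patch): the
>> min/max pass stores each document's value so the scoring pass can read
>> it back instead of re-running the wrapped ValueSource.
>>
>> // Hypothetical per-segment cache for scale(): remember each value
>> // computed while finding min/max, then serve the scoring pass from
>> // the cache.
>> class CachingScaleSketch {
>>   private float[] cached; // one slot per doc in the segment
>>   private float min, max;
>>
>>   void firstPass(float[] rawValues) { // doubles as the min/max pass
>>     cached = rawValues.clone();
>>     min = Float.POSITIVE_INFINITY;
>>     max = Float.NEGATIVE_INFINITY;
>>     for (float v : cached) {
>>       min = Math.min(min, v);
>>       max = Math.max(max, v);
>>     }
>>   }
>>
>>   float scaledValue(int doc) { // second pass: no recomputation
>>     float range = max - min;
>>     return range == 0 ? 0 : (cached[doc] - min) / range;
>>   }
>> }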
>>
>> No promises this won't be overly costly in terms of memory utilization,
>> but we'll see what we get in terms of speed improvements and will share
>> the code if it works out well. Alternate implementation suggestions (or
>> criticism of a cache like this) are also welcomed.
>>
>>
>> *On the NoOp product function: scale(prod(1, query(...))):*
>> We do the same thing, which ultimately is just an unnecessary waste of a
>> loop through all documents to do an extra multiplication step.  I just
>> debugged the code and uncovered the problem.  There is a Map (called
>> context) that is passed through to each value source to store intermediate
>> state, and both the query and scale functions are passing the ValueSource
>> for the query function in as the KEY to this Map (as opposed to using some
>> composite key that makes sense in the current context). Essentially,
>> these lines are overwriting each other:
>>
>> Inside ScaleFloatFunction: context.put(this.source, scaleInfo);
>>   // this.source refers to the QueryValueSource, and scaleInfo refers
>>   // to a ScaleInfo object
>> Inside QueryValueSource: context.put(this, w);
>>   // this refers to the same QueryValueSource from above, and w refers
>>   // to a Weight object
>>
>> As such, when the ScaleFloatFunction later goes to read the ScaleInfo from
>> the context Map, it unexpectedly pulls the Weight object out instead, and
>> thus the invalid cast exception occurs. The NoOp multiplication works
>> because it puts a "different" ValueSource between the query and the
>> ScaleFloatFunction such that this.source (in ScaleFloatFunction) != this
>> (in QueryValueSource).
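>>
>> To illustrate, here's a toy demo of the collision and one possible
>> shape of the fix (composite keys); the strings stand in for the real
>> ScaleInfo and Weight objects:
>>
>> import java.util.Arrays;
>> import java.util.HashMap;
>> import java.util.Map;
>>
>> public class ContextKeySketch {
>>   public static void main(String[] args) {
>>     Object source = new Object(); // stands in for the QueryValueSource
>>     Map<Object, Object> context = new HashMap<>();
>>
>>     // Today: both functions key on the same ValueSource, so the second
>>     // put overwrites the first...
>>     context.put(source, "scaleInfo"); // ScaleFloatFunction
>>     context.put(source, "weight");    // QueryValueSource clobbers it
>>     // ...and the later read gets a Weight where a ScaleInfo is expected.
>>     System.out.println(context.get(source)); // prints "weight"
>>
>>     // Possible fix: composite keys that cannot collide.
>>     context.put(Arrays.asList(source, "scale"), "scaleInfo");
>>     context.put(Arrays.asList(source, "weight"), "weight");
>>     System.out.println(context.get(Arrays.asList(source, "scale")));
>>   }
>> }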
>>
>> This should be an easy fix. I'll create a JIRA ticket to use better key
>> names in these functions and push up a patch. This will eliminate the
>> need for the extra NoOp function.
>>
>> -Trey
>>
>>
>> On Mon, Dec 2, 2013 at 12:41 PM, Peter Keegan <peterlkee...@gmail.com> wrote:
>>
>> > I'm pursuing this possible PostFilter solution. I can see how to
>> > collect all the hits and recompute the scores in a PostFilter, after
>> > all the hits have been collected (for scaling). What I can't see is
>> > how to get the custom doc/score values back into the main query's
>> > HitQueue. Any advice?
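>> >
>> > For reference, here's roughly the shape I have so far (simplified,
>> > and assuming Solr 4.x's DelegatingCollector API): buffering the hits
>> > works, but the last step is the part I can't see.
>> >
>> > import java.io.IOException;
>> > import java.util.ArrayList;
>> > import java.util.List;
>> > import org.apache.lucene.index.AtomicReaderContext;
>> > import org.apache.lucene.search.Scorer;
>> > import org.apache.solr.search.DelegatingCollector;
>> >
>> > public class ScalingCollector extends DelegatingCollector {
>> >   private final List<Integer> globalDocs = new ArrayList<>();
>> >   private final List<Float> rawScores = new ArrayList<>();
>> >   private Scorer scorer;
>> >   private int docBase;
>> >
>> >   @Override
>> >   public void setScorer(Scorer scorer) throws IOException {
>> >     this.scorer = scorer;
>> >     super.setScorer(scorer);
>> >   }
>> >
>> >   @Override
>> >   public void setNextReader(AtomicReaderContext context) throws IOException {
>> >     this.docBase = context.docBase; // for segment-relative doc ids
>> >     super.setNextReader(context);
>> >   }
>> >
>> >   @Override
>> >   public void collect(int doc) throws IOException {
>> >     globalDocs.add(docBase + doc); // buffer instead of forwarding
>> >     rawScores.add(scorer.score());
>> >   }
>> >
>> >   @Override
>> >   public void finish() throws IOException {
>> >     // All hits seen: compute min/max over rawScores and rescale here.
>> >     // But how do these (doc, newScore) pairs get back into the main
>> >     // query's HitQueue, given collection is per-segment and finished?
>> >   }
>> > }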
>> >
>> > Thanks,
>> > Peter
>> >
>> >
>> > On Fri, Nov 29, 2013 at 9:18 AM, Peter Keegan <peterlkee...@gmail.com> wrote:
>> >
>> > > Instead of using a function query, could I use the edismax query (plus
>> > > some low cost filters not shown in the example) and implement the
>> > > scale/sum/product computation in a PostFilter? Is the query's maxScore
>> > > available there?
>> > >
>> > > Thanks,
>> > > Peter
>> > >
>> > >
>> > > On Wed, Nov 27, 2013 at 1:58 PM, Peter Keegan <peterlkee...@gmail.com> wrote:
>> > >
>> > >> Although the 'scale' is a big part of it, here's a closer breakdown.
>> > >> Here are 4 queries with increasing functions, and their response
>> > >> times (caching turned off in solrconfig):
>> > >>
>> > >> 100 msec:
>> > >> select?q={!edismax v='news' qf='title^2 body'}
>> > >>
>> > >> 135 msec:
>> > >> select?qq={!edismax v='news' qf='title^2 body'}
>> > >>   &q={!func}product(field(myfield),query($qq))&fq={!query v=$qq}
>> > >>
>> > >> 200 msec:
>> > >> select?qq={!edismax v='news' qf='title^2 body'}
>> > >>   &q={!func}sum(product(0.75,query($qq)),product(0.25,field(myfield)))
>> > >>   &fq={!query v=$qq}
>> > >>
>> > >> 320 msec:
>> > >> select?qq={!edismax v='news' qf='title^2 body'}
>> > >>   &scaledQ=scale(product(query($qq),1),0,1)
>> > >>   &q={!func}sum(product(0.75,$scaledQ),product(0.25,field(myfield)))
>> > >>   &fq={!query v=$qq}
>> > >>
>> > >> Btw, that no-op product is necessary, else you get this exception:
>> > >>
>> > >> org.apache.lucene.search.BooleanQuery$BooleanWeight cannot be cast to
>> > >> org.apache.lucene.queries.function.valuesource.ScaleFloatFunction$ScaleInfo
>> > >>
>> > >> Thanks,
>> > >>
>> > >> Peter
>> > >>
>> > >>
>> > >>
>> > >> On Wed, Nov 27, 2013 at 1:30 PM, Chris Hostetter <hossman_luc...@fucit.org> wrote:
>> > >>
>> > >>>
>> > >>> : So, this query does just what I want, but it's typically 3 times
>> > >>> : slower than the edismax query without the functions:
>> > >>>
>> > >>> that's because the scale() function is inherently slow (it has to
>> > >>> compute the min & max value for every document in order to know
>> > >>> how to scale them)
>> > >>>
>> > >>> what you are seeing is the price you have to pay to get that query
>> > >>> with a "normalized" 0-1 value.
>> > >>>
>> > >>> (you might be able to save a little bit of time by eliminating that
>> > >>> no-op multiply by 1: "product(query($qq),1)" ... but I doubt you'll
>> > >>> even notice much of a change given that scale function.)
>> > >>>
>> > >>> : Is there any way to speed this up? Would writing a custom
>> > >>> : function query that compiled all the function queries together
>> > >>> : be any faster?
>> > >>>
>> > >>> If you can find a faster implementation for scale() then by all
>> > >>> means let us know, and we can fold it back into Solr.
>> > >>>
>> > >>>
>> > >>> -Hoss
>> > >>>
>> > >>
>> > >>
>> > >
>> >
>>
>
>
