Re: Facet Performance

2020-06-17 Thread Erick Erickson
queryResultCache doesn’t really help with faceting, even if it’s hit for the 
main query. 
That cache only stores a subset of the hits, and to facet properly you need 
the entire result set….
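
(For context on why that is: the queryResultCache keeps only a small, ordered 
window of document ids per query, governed by solrconfig.xml settings like the 
ones below -- values shown are the stock examples, purely for illustration -- so 
it cannot stand in for the full DocSet that faceting has to walk.)

<queryResultWindowSize>20</queryResultWindowSize>
<queryResultMaxDocsCached>200</queryResultMaxDocsCached>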

> On Jun 17, 2020, at 12:47 PM, James Bodkin  
> wrote:
> 
> We've noticed that the filterCache uses a significant amount of memory, as 
> we've assigned 8GB Heap per instance.
> In total, we have 32 shards with 2 replicas, hence (8*32*2) 512GB of heap space 
> alone; further memory is required to ensure the index is always memory-mapped 
> for performance reasons.
> 
> Ideally I would like to be able to reduce the amount of memory assigned to 
> the heap by using docValues instead of indexed but it doesn't seem possible.
> The QTime (after warming) for facet.method=enum is around 150-250ms whereas 
> the QTime for facet.method=fc is around 1000-1200ms.
> As we require the results in real time for customers searching on our 
> website, the latter QTime of 1000-1200ms is too slow for us to use.
> 
> Our facet queries change as the customer selects different search criteria, 
> and hence the sheer number of potential queries makes the query result cache 
> largely ineffective.
> We already have a custom implementation in which we check our Redis cache for 
> queries before they are sent to our aggregators, which runs at a 30% hit rate.
> 
> Kind Regards,
> 
> James Bodkin
> 
> On 17/06/2020, 16:21, "Michael Gibney"  wrote:
> 
>To expand a bit on what Erick said regarding performance: my sense is
>that the RefGuide assertion that "docValues=true" makes faceting
>"faster" could use some qualification/clarification. My take, fwiw:
> 
>First, to reiterate/paraphrase what Erick said: the "faster" assertion
>is not comparing to "facet.method=enum". For low-cardinality fields,
>if you have the heap space, and are very intentional about configuring
>your filterCache (and monitoring it as access patterns might change),
>"facet.method=enum" will likely be as fast as you can get (at least
>for "legacy" facets or whatever -- not sure about "enum" method in
>JSON facets).
> 
>Even where "docValues=true" arguably does make faceting "faster", the
>main benefit is that the "uninverted" data structures are serialized
>on disk, so you're avoiding the need to uninvert each facet field
>on-heap for every new indexSearcher, which is generally high-latency
>-- user perception of this latency can be mitigated using warming
>queries, but it can still be problematic, esp. for frequent index
>updates. On-heap uninversion also inherently consumes a lot of heap
>space, which has general implications wrt GC, etc ... so in that
>respect even if faceting per se might not be "faster" with
>"docValues=true", your overall system may in many cases perform
>better.
> 
>(and Anthony, I'm pretty sure that tag/ex on facets should be
>orthogonal to the "facet.method=enum"/filterCache discussion, as
>tag/ex only affects the DocSet domain over which facets are calculated
>... I think that step is pretty cleanly separated from the actual
>calculation of the facets. I'm not 100% sure on that, so proceed with
>caution, but it could definitely be worth evaluating for your use
>case!)
> 
>Michael
> 
>On Wed, Jun 17, 2020 at 10:42 AM Erick Erickson  
> wrote:
>> 
>> Uninvertible is a safety mechanism to make sure that you don’t _unknowingly_ 
>> use a docValues=false
>> field for faceting/grouping/sorting/function queries. The primary point of 
>> docValues=true is twofold:
>> 
>> 1> reduce Java heap requirements by using the OS memory to hold it
>> 
>> 2> uninverting can be expensive CPU wise too, although not with just a few
>>unique values (for each term, read the list of docs that have it and flip 
>> a bit).
>> 
>> It doesn’t really make sense to set it on an index=false field, since 
>> uninverting only happens on
>> index=true docValues=false. OTOH, I don’t think it would do any harm either. 
>> That said, I frankly
>> don’t know how that interacts with facet.method=enum.
>> 
>> As far as speed… yeah, you’re in the edge cases. All things being equal, 
>> stuffing these into the
>> filterCache is the fastest way to facet if you have the memory. I’ve seen 
>> very few installations
>> where people have that luxury though. Each entry in the filterCache can 
>> occupy maxDoc/8 + some overhead
>> bytes. If maxDoc is very large, this’ll chew up an enormous amount of 
>> memory. I’m cheating a bit here: the entry can be smaller if only a few docs 
>> match it. But that’s the worst case you have to allow for ’cause you 
>> could theoretically hit
>> the perfect storm where, due to some particular sequence of queries, your 
>> entire filter
>> cache fills up with entries that size.
>> 
>> You’ll have some overhead to keep the cache at that size, but it sounds like 
>> it’s worth it.

Re: Facet Performance

2020-06-17 Thread James Bodkin
We've noticed that the filterCache uses a significant amount of memory, as 
we've assigned 8GB Heap per instance.
In total, we have 32 shards with 2 replicas, hence (8*32*2) 512GB of heap space 
alone; further memory is required to ensure the index is always memory-mapped 
for performance reasons.

Ideally I would like to be able to reduce the amount of memory assigned to the 
heap by using docValues instead of indexed but it doesn't seem possible.
The QTime (after warming) for facet.method=enum is around 150-250ms whereas the 
QTime for facet.method=fc is around 1000-1200ms.
As we require the results in real time for customers searching on our website, 
the latter QTime of 1000-1200ms is too slow for us to use.

Our facet queries change as the customer selects different search criteria, and 
hence the sheer number of potential queries makes the query result cache largely 
ineffective.
We already have a custom implementation in which we check our Redis cache for 
queries before they are sent to our aggregators, which runs at a 30% hit rate.

Kind Regards,

James Bodkin

On 17/06/2020, 16:21, "Michael Gibney"  wrote:

To expand a bit on what Erick said regarding performance: my sense is
that the RefGuide assertion that "docValues=true" makes faceting
"faster" could use some qualification/clarification. My take, fwiw:

First, to reiterate/paraphrase what Erick said: the "faster" assertion
is not comparing to "facet.method=enum". For low-cardinality fields,
if you have the heap space, and are very intentional about configuring
your filterCache (and monitoring it as access patterns might change),
"facet.method=enum" will likely be as fast as you can get (at least
for "legacy" facets or whatever -- not sure about "enum" method in
JSON facets).

Even where "docValues=true" arguably does make faceting "faster", the
main benefit is that the "uninverted" data structures are serialized
on disk, so you're avoiding the need to uninvert each facet field
on-heap for every new indexSearcher, which is generally high-latency
-- user perception of this latency can be mitigated using warming
queries, but it can still be problematic, esp. for frequent index
updates. On-heap uninversion also inherently consumes a lot of heap
space, which has general implications wrt GC, etc ... so in that
respect even if faceting per se might not be "faster" with
"docValues=true", your overall system may in many cases perform
better.

(and Anthony, I'm pretty sure that tag/ex on facets should be
orthogonal to the "facet.method=enum"/filterCache discussion, as
tag/ex only affects the DocSet domain over which facets are calculated
... I think that step is pretty cleanly separated from the actual
calculation of the facets. I'm not 100% sure on that, so proceed with
caution, but it could definitely be worth evaluating for your use
case!)

Michael

On Wed, Jun 17, 2020 at 10:42 AM Erick Erickson  
wrote:
>
> Uninvertible is a safety mechanism to make sure that you don’t 
_unknowingly_ use a docValues=false
> field for faceting/grouping/sorting/function queries. The primary point 
of docValues=true is twofold:
>
> 1> reduce Java heap requirements by using the OS memory to hold it
>
> 2> uninverting can be expensive CPU wise too, although not with just a few
> unique values (for each term, read the list of docs that have it and 
flip a bit).
>
> It doesn’t really make sense to set it on an index=false field, since 
uninverting only happens on
> index=true docValues=false. OTOH, I don’t think it would do any harm 
either. That said, I frankly
> don’t know how that interacts with facet.method=enum.
>
> As far as speed… yeah, you’re in the edge cases. All things being equal, 
stuffing these into the
> filterCache is the fastest way to facet if you have the memory. I’ve seen 
very few installations
> where people have that luxury though. Each entry in the filterCache can 
occupy maxDoc/8 + some overhead
> bytes. If maxDoc is very large, this’ll chew up an enormous amount of 
> memory. I’m cheating a bit here: the entry can be smaller if only a few docs 
> > match it. But that’s the worst case you have to allow for ’cause 
you could theoretically hit
> the perfect storm where, due to some particular sequence of queries, your 
entire filter
> cache fills up with entries that size.
>
> You’ll have some overhead to keep the cache at that size, but it sounds 
like it’s worth it.
>
> Best,
> Erick
>
>
>
> > On Jun 17, 2020, at 10:05 AM, James Bodkin 
 wrote:
> >
> > The large majority of the relevant fields have fewer than 20 unique 
values. We have two fields over that with 150 unique values and 5300 unique 
values respectively.
> > At the moment, 

Re: Facet Performance

2020-06-17 Thread Michael Gibney
To expand a bit on what Erick said regarding performance: my sense is
that the RefGuide assertion that "docValues=true" makes faceting
"faster" could use some qualification/clarification. My take, fwiw:

First, to reiterate/paraphrase what Erick said: the "faster" assertion
is not comparing to "facet.method=enum". For low-cardinality fields,
if you have the heap space, and are very intentional about configuring
your filterCache (and monitoring it as access patterns might change),
"facet.method=enum" will likely be as fast as you can get (at least
for "legacy" facets or whatever -- not sure about "enum" method in
JSON facets).
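
A rough sketch of the filterCache tuning that implies, for solrconfig.xml (the 
cache class and numbers are illustrative only -- the size has to cover one entry 
per indexed term across the fields faceted this way, on top of the normal fq 
entries):

<filterCache class="solr.FastLRUCache"
             size="8192"
             initialSize="8192"
             autowarmCount="32"/>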

Even where "docValues=true" arguably does make faceting "faster", the
main benefit is that the "uninverted" data structures are serialized
on disk, so you're avoiding the need to uninvert each facet field
on-heap for every new indexSearcher, which is generally high-latency
-- user perception of this latency can be mitigated using warming
queries, but it can still be problematic, esp. for frequent index
updates. On-heap uninversion also inherently consumes a lot of heap
space, which has general implications wrt GC, etc ... so in that
respect even if faceting per se might not be "faster" with
"docValues=true", your overall system may in many cases perform
better.

(and Anthony, I'm pretty sure that tag/ex on facets should be
orthogonal to the "facet.method=enum"/filterCache discussion, as
tag/ex only affects the DocSet domain over which facets are calculated
... I think that step is pretty cleanly separated from the actual
calculation of the facets. I'm not 100% sure on that, so proceed with
caution, but it could definitely be worth evaluating for your use
case!)
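
For reference, the tag/exclusion pattern in question looks roughly like this 
(field names borrowed from James's examples further down the thread; a sketch 
only, untested here):

fq={!tag=dep}D_DepartureAirport:(2660)
facet=true
facet.field={!ex=dep}D_DepartureAirport
facet.field=D_Destination
facet.limit=-1
rows=0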

Michael

On Wed, Jun 17, 2020 at 10:42 AM Erick Erickson  wrote:
>
> Uninvertible is a safety mechanism to make sure that you don’t _unknowingly_ 
> use a docValues=false
> field for faceting/grouping/sorting/function queries. The primary point of 
> docValues=true is twofold:
>
> 1> reduce Java heap requirements by using the OS memory to hold it
>
> 2> uninverting can be expensive CPU wise too, although not with just a few
> unique values (for each term, read the list of docs that have it and flip 
> a bit).
>
> It doesn’t really make sense to set it on an index=false field, since 
> uninverting only happens on
> index=true docValues=false. OTOH, I don’t think it would do any harm either. 
> That said, I frankly
> don’t know how that interacts with facet.method=enum.
>
> As far as speed… yeah, you’re in the edge cases. All things being equal, 
> stuffing these into the
> filterCache is the fastest way to facet if you have the memory. I’ve seen 
> very few installations
> where people have that luxury though. Each entry in the filterCache can 
> occupy maxDoc/8 + some overhead
> bytes. If maxDoc is very large, this’ll chew up an enormous amount of memory. 
> I’m cheating a bit here: the entry can be smaller if only a few docs 
> match it. But that’s the worst case you have to allow for ’cause you 
> could theoretically hit
> the perfect storm where, due to some particular sequence of queries, your 
> entire filter
> cache fills up with entries that size.
>
> You’ll have some overhead to keep the cache at that size, but it sounds like 
> it’s worth it.
>
> Best,
> Erick
>
>
>
> > On Jun 17, 2020, at 10:05 AM, James Bodkin  
> > wrote:
> >
> > The large majority of the relevant fields have fewer than 20 unique values. 
> > We have two fields over that with 150 unique values and 5300 unique values 
> > respectively.
> > At the moment, our filterCache is configured with a maximum size of 8192.
> >
> > The DocValues documentation 
> > (https://lucene.apache.org/solr/guide/8_3/docvalues.html) mentions that 
> > this approach promises to make lookups for faceting, sorting and grouping 
> > much faster.
> > Hence I thought that using DocValues would be better than using Indexed and 
> > in turn improve our response times and possibly lower memory requirements. 
> > It sounds like this isn't the case if you are able to allocate enough 
> > memory to the filterCache.
> >
> > I haven't yet tried changing the uninvertible setting, I was looking at the 
> > documentation for this field earlier today.
> > Should we be setting uninvertible="false" if docValues="true" regardless of 
> > whether indexed is true or false?
> >
> > Kind Regards,
> >
> > James Bodkin
> >
> > On 17/06/2020, 14:02, "Michael Gibney"  wrote:
> >
> >facet.method=enum works by executing a query (against indexed values)
> >for each indexed value in a given field (which, for indexed=false, is
> >"no values"). So that explains why facet.method=enum no longer works.
> >I was going to suggest that you might not want to set indexed=false on
> >the docValues facet fields anyway, since the indexed values are still
> >used for facet refinement (assuming your index is distributed).
> >

Re: Facet Performance

2020-06-17 Thread Erick Erickson
Uninvertible is a safety mechanism to make sure that you don’t _unknowingly_ 
use a docValues=false
field for faceting/grouping/sorting/function queries. The primary point of 
docValues=true is twofold:

1> reduce Java heap requirements by using the OS memory to hold it

2> uninverting can be expensive CPU wise too, although not with just a few
unique values (for each term, read the list of docs that have it and flip a 
bit).

It doesn’t really make sense to set it on an index=false field, since 
uninverting only happens on
index=true docValues=false. OTOH, I don’t think it would do any harm either. 
That said, I frankly
don’t know how that interacts with facet.method=enum.
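
For concreteness, the sort of field definition being discussed would look 
something like this in the schema (the field name is taken from James's examples 
elsewhere in the thread; the exact attribute mix is illustrative, with 
indexed="true" kept for distributed facet refinement):

<field name="D_Destination" type="string" indexed="true" stored="false"
       docValues="true" uninvertible="false"/>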

As far as speed… yeah, you’re in the edge cases. All things being equal, 
stuffing these into the
filterCache is the fastest way to facet if you have the memory. I’ve seen very 
few installations
where people have that luxury though. Each entry in the filterCache can occupy 
maxDoc/8 + some overhead
bytes. If maxDoc is very large, this’ll chew up an enormous amount of memory. 
I’m cheating a bit here: the entry can be smaller if only a few docs 
match it. But that’s the worst case you have to allow for ’cause you 
could theoretically hit
the perfect storm where, due to some particular sequence of queries, your 
entire filter
cache fills up with entries that size. 
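
To put rough numbers on that worst case (figures illustrative, not taken from 
this thread):

maxDoc            = 100,000,000 docs
bytes per entry   ~ 100,000,000 / 8  = 12.5 MB
8192 entries      ~ 8192 * 12.5 MB   = ~100 GB of heap, worst case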

You’ll have some overhead to keep the cache at that size, but it sounds like 
it’s worth it.

Best,
Erick



> On Jun 17, 2020, at 10:05 AM, James Bodkin  
> wrote:
> 
> The large majority of the relevant fields have fewer than 20 unique values. 
> We have two fields over that with 150 unique values and 5300 unique values 
> respectively.
> At the moment, our filterCache is configured with a maximum size of 8192.
> 
> The DocValues documentation 
> (https://lucene.apache.org/solr/guide/8_3/docvalues.html) mentions that 
> this approach promises to make lookups for faceting, sorting and grouping 
> much faster.
> Hence I thought that using DocValues would be better than using Indexed and 
> in turn improve our response times and possibly lower memory requirements. It 
> sounds like this isn't the case if you are able to allocate enough memory to 
> the filterCache.
> 
> I haven't yet tried changing the uninvertible setting, I was looking at the 
> documentation for this field earlier today.
> Should we be setting uninvertible="false" if docValues="true" regardless of 
> whether indexed is true or false?
> 
> Kind Regards,
> 
> James Bodkin
> 
> On 17/06/2020, 14:02, "Michael Gibney"  wrote:
> 
>facet.method=enum works by executing a query (against indexed values)
>for each indexed value in a given field (which, for indexed=false, is
>"no values"). So that explains why facet.method=enum no longer works.
>I was going to suggest that you might not want to set indexed=false on
>the docValues facet fields anyway, since the indexed values are still
>used for facet refinement (assuming your index is distributed).
> 
>What's the number of unique values in the relevant fields? If it's low
>enough, setting docValues=false and indexed=true and using
>facet.method=enum (with a sufficiently large filterCache) is
>definitely a viable option, and will almost certainly be faster than
>docValues-based faceting. (As an aside, noting for future reference:
>high-cardinality facets over high-cardinality DocSet domains might be
>able to benefit from a term facet count cache:
>https://issues.apache.org/jira/browse/SOLR-13807)
> 
>I think you didn't specifically mention whether you acted on Erick's
>suggestion of setting "uninvertible=false" (I think Erick accidentally
>said "uninvertible=true") to fail fast. I'd also recommend doing that,
>perhaps even above all else -- it shouldn't actually *do* anything,
>but will help ensure that things are behaving as you expect them to!
> 
>Michael
> 
>On Wed, Jun 17, 2020 at 4:31 AM James Bodkin
> wrote:
>> 
>> Thanks, I've implemented some queries that improve the first-hit execution 
>> for faceting.
>> 
>> Since turning off indexed on those fields, we've noticed that 
>> facet.method=enum no longer returns the facets when used.
>> Using facet.method=fc/fcs is significantly slower compared to 
>> facet.method=enum for us. Why do these two differences exist?
>> 
>> On 16/06/2020, 17:52, "Erick Erickson"  wrote:
>> 
>>Ok, I see the disconnect... Necessary parts of the index are read from 
>> disk
>>lazily. So your newSearcher or firstSearcher query needs to do whatever
>>operation causes the relevant parts of the index to be read. In this case,
>>probably just facet on all the fields you care about. I'd add sorting too
>>if you sort on different fields.
>> 
>>The *:* query without facets or sorting does virtually nothing due to some
>>special handling...
>> 
>>On Tue, Jun 16, 

Re: Facet Performance

2020-06-17 Thread James Bodkin
The large majority of the relevant fields have fewer than 20 unique values. We 
have two fields over that with 150 unique values and 5300 unique values 
respectively.
At the moment, our filterCache is configured with a maximum size of 8192.

The DocValues documentation 
(https://lucene.apache.org/solr/guide/8_3/docvalues.html) mentions that 
this approach promises to make lookups for faceting, sorting and grouping much 
faster.
Hence I thought that using DocValues would be better than using Indexed and in 
turn improve our response times and possibly lower memory requirements. It 
sounds like this isn't the case if you are able to allocate enough memory to 
the filterCache.

I haven't yet tried changing the uninvertible setting, I was looking at the 
documentation for this field earlier today.
Should we be setting uninvertible="false" if docValues="true" regardless of 
whether indexed is true or false?

Kind Regards,

James Bodkin

On 17/06/2020, 14:02, "Michael Gibney"  wrote:

facet.method=enum works by executing a query (against indexed values)
for each indexed value in a given field (which, for indexed=false, is
"no values"). So that explains why facet.method=enum no longer works.
I was going to suggest that you might not want to set indexed=false on
the docValues facet fields anyway, since the indexed values are still
used for facet refinement (assuming your index is distributed).

What's the number of unique values in the relevant fields? If it's low
enough, setting docValues=false and indexed=true and using
facet.method=enum (with a sufficiently large filterCache) is
definitely a viable option, and will almost certainly be faster than
docValues-based faceting. (As an aside, noting for future reference:
high-cardinality facets over high-cardinality DocSet domains might be
able to benefit from a term facet count cache:
https://issues.apache.org/jira/browse/SOLR-13807)

I think you didn't specifically mention whether you acted on Erick's
suggestion of setting "uninvertible=false" (I think Erick accidentally
said "uninvertible=true") to fail fast. I'd also recommend doing that,
perhaps even above all else -- it shouldn't actually *do* anything,
but will help ensure that things are behaving as you expect them to!

Michael

On Wed, Jun 17, 2020 at 4:31 AM James Bodkin
 wrote:
>
> Thanks, I've implemented some queries that improve the first-hit 
execution for faceting.
>
> Since turning off indexed on those fields, we've noticed that 
facet.method=enum no longer returns the facets when used.
> Using facet.method=fc/fcs is significantly slower compared to 
facet.method=enum for us. Why do these two differences exist?
>
> On 16/06/2020, 17:52, "Erick Erickson"  wrote:
>
> Ok, I see the disconnect... Necessary parts of the index are read 
from disk
> lazily. So your newSearcher or firstSearcher query needs to do 
whatever
> operation causes the relevant parts of the index to be read. In this 
case,
> probably just facet on all the fields you care about. I'd add sorting 
too
> if you sort on different fields.
>
> The *:* query without facets or sorting does virtually nothing due to 
some
> special handling...
>
> On Tue, Jun 16, 2020, 10:48 James Bodkin 

> wrote:
>
> > I've been trying to build a query that I can use in newSearcher 
based off
> > the information in your previous e-mail. I thought you meant to 
build a *:*
> > query as per Query 1 in my previous e-mail but I'm still seeing the
> > first-hit execution.
> > Now I'm wondering if you meant to create a *:* query with each of 
the
> > fields as part of the fl query parameters or a *:* query with each 
of the
> > fields and values as part of the fq query parameters.
> >
> > At the moment I've been running these manually as I expected that I 
would
> > see the first-execution penalty disappear by the time I got to 
query 4, as
> > I thought this would replicate the actions of the newSearcher.
> > Unfortunately we can't use the autowarm count that is available as 
part of
> > the filterCache/queryResultCache due to the custom deployment mechanism 
we use
> > to update our index.
> >
> > Kind Regards,
> >
> > James Bodkin
> >
> > On 16/06/2020, 15:30, "Erick Erickson"  
wrote:
> >
> > Did you try the autowarming like I mentioned in my previous 
e-mail?
> >
> > > On Jun 16, 2020, at 10:18 AM, James Bodkin <
> > james.bod...@loveholidays.com> wrote:
> > >
> > > We've changed the schema to enable docValues for these fields 
and
> > this led to an improvement in the response time. We found a further
> > 

Re: Facet Performance

2020-06-17 Thread Anthony Groves
Ah, interesting! So if the number of possible values is low (like <= 10),
it is faster to *not* use docValues on that (indexed) faceted field?
Does this hold true even when using faceting techniques like tag and
exclusion?

Thanks,
Anthony


On Wed, Jun 17, 2020 at 9:37 AM David Smiley 
wrote:

> I strongly recommend setting indexed=true on a field you facet on for the
> purposes of efficient refinement (fq=field:value). But it isn't strictly
> required, as you have discovered.
>
> ~ David
>
>
> On Wed, Jun 17, 2020 at 9:02 AM Michael Gibney 
> wrote:
>
> > facet.method=enum works by executing a query (against indexed values)
> > for each indexed value in a given field (which, for indexed=false, is
> > "no values"). So that explains why facet.method=enum no longer works.
> > I was going to suggest that you might not want to set indexed=false on
> > the docValues facet fields anyway, since the indexed values are still
> > used for facet refinement (assuming your index is distributed).
> >
> > What's the number of unique values in the relevant fields? If it's low
> > enough, setting docValues=false and indexed=true and using
> > facet.method=enum (with a sufficiently large filterCache) is
> > definitely a viable option, and will almost certainly be faster than
> > docValues-based faceting. (As an aside, noting for future reference:
> > high-cardinality facets over high-cardinality DocSet domains might be
> > able to benefit from a term facet count cache:
> > https://issues.apache.org/jira/browse/SOLR-13807)
> >
> > I think you didn't specifically mention whether you acted on Erick's
> > suggestion of setting "uninvertible=false" (I think Erick accidentally
> > said "uninvertible=true") to fail fast. I'd also recommend doing that,
> > perhaps even above all else -- it shouldn't actually *do* anything,
> > but will help ensure that things are behaving as you expect them to!
> >
> > Michael
> >
> > On Wed, Jun 17, 2020 at 4:31 AM James Bodkin
> >  wrote:
> > >
> > > Thanks, I've implemented some queries that improve the first-hit
> > execution for faceting.
> > >
> > > Since turning off indexed on those fields, we've noticed that
> > facet.method=enum no longer returns the facets when used.
> > > Using facet.method=fc/fcs is significantly slower compared to
> > facet.method=enum for us. Why do these two differences exist?
> > >
> > > On 16/06/2020, 17:52, "Erick Erickson" 
> wrote:
> > >
> > > Ok, I see the disconnect... Necessary parts of the index are read
> > from disk
> > > lazily. So your newSearcher or firstSearcher query needs to do
> > whatever
> > > operation causes the relevant parts of the index to be read. In
> this
> > case,
> > > probably just facet on all the fields you care about. I'd add
> > sorting too
> > > if you sort on different fields.
> > >
> > > The *:* query without facets or sorting does virtually nothing due
> > to some
> > > special handling...
> > >
> > > On Tue, Jun 16, 2020, 10:48 James Bodkin <
> > james.bod...@loveholidays.com>
> > > wrote:
> > >
> > > > I've been trying to build a query that I can use in newSearcher
> > based off
> > > > the information in your previous e-mail. I thought you meant to
> > build a *:*
> > > > query as per Query 1 in my previous e-mail but I'm still seeing
> the
> > > > first-hit execution.
> > > > Now I'm wondering if you meant to create a *:* query with each of
> > the
> > > > fields as part of the fl query parameters or a *:* query with
> each
> > of the
> > > > fields and values as part of the fq query parameters.
> > > >
> > > > At the moment I've been running these manually as I expected that
> > I would
> > > > see the first-execution penalty disappear by the time I got to
> > query 4, as
> > > > I thought this would replicate the actions of the newSearcher.
> > > > Unfortunately we can't use the autowarm count that is available
> as
> > part of
> > > > the filterCache/queryResultCache due to the custom deployment
> mechanism
> > we use
> > > > to update our index.
> > > >
> > > > Kind Regards,
> > > >
> > > > James Bodkin
> > > >
> > > > On 16/06/2020, 15:30, "Erick Erickson"  >
> > wrote:
> > > >
> > > > Did you try the autowarming like I mentioned in my previous
> > e-mail?
> > > >
> > > > > On Jun 16, 2020, at 10:18 AM, James Bodkin <
> > > > james.bod...@loveholidays.com> wrote:
> > > > >
> > > > > We've changed the schema to enable docValues for these
> > fields and
> > > > this led to an improvement in the response time. We found a
> further
> > > > improvement by also switching off indexed as these fields are
> used
> > for
> > > > faceting and filtering only.
> > > > > Since those changes, we've found that the first-execution
> for
> > > > queries is really noticeable. I thought this would be the
> > filterCache based
> > > > on what I saw in 

Re: Facet Performance

2020-06-17 Thread David Smiley
I strongly recommend setting indexed=true on a field you facet on for the
purposes of efficient refinement (fq=field:value). But it isn't strictly
required, as you have discovered.

~ David


On Wed, Jun 17, 2020 at 9:02 AM Michael Gibney 
wrote:

> facet.method=enum works by executing a query (against indexed values)
> for each indexed value in a given field (which, for indexed=false, is
> "no values"). So that explains why facet.method=enum no longer works.
> I was going to suggest that you might not want to set indexed=false on
> the docValues facet fields anyway, since the indexed values are still
> used for facet refinement (assuming your index is distributed).
>
> What's the number of unique values in the relevant fields? If it's low
> enough, setting docValues=false and indexed=true and using
> facet.method=enum (with a sufficiently large filterCache) is
> definitely a viable option, and will almost certainly be faster than
> docValues-based faceting. (As an aside, noting for future reference:
> high-cardinality facets over high-cardinality DocSet domains might be
> able to benefit from a term facet count cache:
> https://issues.apache.org/jira/browse/SOLR-13807)
>
> I think you didn't specifically mention whether you acted on Erick's
> suggestion of setting "uninvertible=false" (I think Erick accidentally
> said "uninvertible=true") to fail fast. I'd also recommend doing that,
> perhaps even above all else -- it shouldn't actually *do* anything,
> but will help ensure that things are behaving as you expect them to!
>
> Michael
>
> On Wed, Jun 17, 2020 at 4:31 AM James Bodkin
>  wrote:
> >
> > Thanks, I've implemented some queries that improve the first-hit
> execution for faceting.
> >
> > Since turning off indexed on those fields, we've noticed that
> facet.method=enum no longer returns the facets when used.
> > Using facet.method=fc/fcs is significantly slower compared to
> facet.method=enum for us. Why do these two differences exist?
> >
> > On 16/06/2020, 17:52, "Erick Erickson"  wrote:
> >
> > Ok, I see the disconnect... Necessary parts of the index are read
> from disk
> > lazily. So your newSearcher or firstSearcher query needs to do
> whatever
> > operation causes the relevant parts of the index to be read. In this
> case,
> > probably just facet on all the fields you care about. I'd add
> sorting too
> > if you sort on different fields.
> >
> > The *:* query without facets or sorting does virtually nothing due
> to some
> > special handling...
> >
> > On Tue, Jun 16, 2020, 10:48 James Bodkin <
> james.bod...@loveholidays.com>
> > wrote:
> >
> > > I've been trying to build a query that I can use in newSearcher
> based off
> > > the information in your previous e-mail. I thought you meant to
> build a *:*
> > > query as per Query 1 in my previous e-mail but I'm still seeing the
> > > first-hit execution.
> > > Now I'm wondering if you meant to create a *:* query with each of
> the
> > > fields as part of the fl query parameters or a *:* query with each
> of the
> > > fields and values as part of the fq query parameters.
> > >
> > > At the moment I've been running these manually as I expected that
> I would
> > > see the first-execution penalty disappear by the time I got to
> query 4, as
> > > I thought this would replicate the actions of the newSearcher.
> > > Unfortunately we can't use the autowarm count that is available as
> part of
> > > the filterCache/queryResultCache due to the custom deployment mechanism
> we use
> > > to update our index.
> > >
> > > Kind Regards,
> > >
> > > James Bodkin
> > >
> > > On 16/06/2020, 15:30, "Erick Erickson" 
> wrote:
> > >
> > > Did you try the autowarming like I mentioned in my previous
> e-mail?
> > >
> > > > On Jun 16, 2020, at 10:18 AM, James Bodkin <
> > > james.bod...@loveholidays.com> wrote:
> > > >
> > > > We've changed the schema to enable docValues for these
> fields and
> > > this led to an improvement in the response time. We found a further
> > > improvement by also switching off indexed as these fields are used
> for
> > > faceting and filtering only.
> > > > Since those changes, we've found that the first-execution for
> > > queries is really noticeable. I thought this would be the
> filterCache based
> > > on what I saw in NewRelic however it is probably trying to read the
> > > docValues from disk. How can we use the autowarming to improve
> this?
> > > >
> > > > For example, I've run the following queries in sequence and
> each
> > > query has a first-execution penalty.
> > > >
> > > > Query 1:
> > > >
> > > > q=*:*
> > > > facet=true
> > > > facet.field=D_DepartureAirport
> > > > facet.field=D_Destination
> > > > facet.limit=-1
> > > > rows=0
> > >

Re: Facet Performance

2020-06-17 Thread Michael Gibney
facet.method=enum works by executing a query (against indexed values)
for each indexed value in a given field (which, for indexed=false, is
"no values"). So that explains why facet.method=enum no longer works.
I was going to suggest that you might not want to set indexed=false on
the docValues facet fields anyway, since the indexed values are still
used for facet refinement (assuming your index is distributed).

What's the number of unique values in the relevant fields? If it's low
enough, setting docValues=false and indexed=true and using
facet.method=enum (with a sufficiently large filterCache) is
definitely a viable option, and will almost certainly be faster than
docValues-based faceting. (As an aside, noting for future reference:
high-cardinality facets over high-cardinality DocSet domains might be
able to benefit from a term facet count cache:
https://issues.apache.org/jira/browse/SOLR-13807)
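
Concretely, the comparison is just a change of facet.method on an otherwise 
identical request (parameters borrowed from James's examples; enum answers one 
filterCache-backed query per term, while fc/fcs reads docValues or an uninverted 
field):

q=*:*
facet=true
facet.field=D_Destination
facet.limit=-1
facet.method=enum
rows=0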

I think you didn't specifically mention whether you acted on Erick's
suggestion of setting "uninvertible=false" (I think Erick accidentally
said "uninvertible=true") to fail fast. I'd also recommend doing that,
perhaps even above all else -- it shouldn't actually *do* anything,
but will help ensure that things are behaving as you expect them to!

Michael

On Wed, Jun 17, 2020 at 4:31 AM James Bodkin
 wrote:
>
> Thanks, I've implemented some queries that improve the first-hit execution 
> for faceting.
>
> Since turning off indexed on those fields, we've noticed that 
> facet.method=enum no longer returns the facets when used.
> Using facet.method=fc/fcs is significantly slower compared to 
> facet.method=enum for us. Why do these two differences exist?
>
> On 16/06/2020, 17:52, "Erick Erickson"  wrote:
>
> Ok, I see the disconnect... Necessary parts of the index are read from 
> disk
> lazily. So your newSearcher or firstSearcher query needs to do whatever
> operation causes the relevant parts of the index to be read. In this case,
> probably just facet on all the fields you care about. I'd add sorting too
> if you sort on different fields.
>
> The *:* query without facets or sorting does virtually nothing due to some
> special handling...
>
> On Tue, Jun 16, 2020, 10:48 James Bodkin 
> wrote:
>
> > I've been trying to build a query that I can use in newSearcher based 
> off
> > the information in your previous e-mail. I thought you meant to build a 
> *:*
> > query as per Query 1 in my previous e-mail but I'm still seeing the
> > first-hit execution.
> > Now I'm wondering if you meant to create a *:* query with each of the
> > fields as part of the fl query parameters or a *:* query with each of 
> the
> > fields and values as part of the fq query parameters.
> >
> > At the moment I've been running these manually as I expected that I 
> would
> > see the first-execution penalty disappear by the time I got to query 4, 
> as
> > I thought this would replicate the actions of the newSearcher.
> > Unfortunately we can't use the autowarm count that is available as part 
> of
> > the filterCache/queryResultCache due to the custom deployment mechanism we 
> use
> > to update our index.
> >
> > Kind Regards,
> >
> > James Bodkin
> >
> > On 16/06/2020, 15:30, "Erick Erickson"  wrote:
> >
> > Did you try the autowarming like I mentioned in my previous e-mail?
> >
> > > On Jun 16, 2020, at 10:18 AM, James Bodkin <
> > james.bod...@loveholidays.com> wrote:
> > >
> > > We've changed the schema to enable docValues for these fields and
> > this led to an improvement in the response time. We found a further
> > improvement by also switching off indexed as these fields are used for
> > faceting and filtering only.
> > > Since those changes, we've found that the first-execution for
> > queries is really noticeable. I thought this would be the filterCache 
> based
> > on what I saw in NewRelic however it is probably trying to read the
> > docValues from disk. How can we use the autowarming to improve this?
> > >
> > > For example, I've run the following queries in sequence and each
> > query has a first-execution penalty.
> > >
> > > Query 1:
> > >
> > > q=*:*
> > > facet=true
> > > facet.field=D_DepartureAirport
> > > facet.field=D_Destination
> > > facet.limit=-1
> > > rows=0
> > >
> > > Query 2:
> > >
> > > q=*:*
> > > fq=D_DepartureAirport:(2660)
> > > facet=true
> > > facet.field=D_Destination
> > > facet.limit=-1
> > > rows=0
> > >
> > > Query 3:
> > >
> > > q=*:*
> > > fq=D_DepartureAirport:(2661)
> > > facet=true
> > > facet.field=D_Destination
> > > facet.limit=-1
> > > rows=0
> > >
> > > 

Re: Facet Performance

2020-06-17 Thread James Bodkin
Thanks, I've implemented some queries that improve the first-hit execution for 
faceting.

Since turning off indexed on those fields, we've noticed that facet.method=enum 
no longer returns the facets when used.
Using facet.method=fc/fcs is significantly slower compared to facet.method=enum 
for us. Why do these two differences exist?

On 16/06/2020, 17:52, "Erick Erickson"  wrote:

Ok, I see the disconnect... Necessary parts of the index are read from disk
lazily. So your newSearcher or firstSearcher query needs to do whatever
operation causes the relevant parts of the index to be read. In this case,
probably just facet on all the fields you care about. I'd add sorting too
if you sort on different fields.

The *:* query without facets or sorting does virtually nothing due to some
special handling...

On Tue, Jun 16, 2020, 10:48 James Bodkin 
wrote:

> I've been trying to build a query that I can use in newSearcher based off
> the information in your previous e-mail. I thought you meant to build a 
*:*
> query as per Query 1 in my previous e-mail but I'm still seeing the
> first-hit execution.
> Now I'm wondering if you meant to create a *:* query with each of the
> fields as part of the fl query parameters or a *:* query with each of the
> fields and values as part of the fq query parameters.
>
> At the moment I've been running these manually as I expected that I would
> see the first-execution penalty disappear by the time I got to query 4, as
> I thought this would replicate the actions of the newSearcher.
> Unfortunately we can't use the autowarm count that is available as part of
> the filterCache/queryResultCache due to the custom deployment mechanism we use
> to update our index.
>
> Kind Regards,
>
> James Bodkin
>
> On 16/06/2020, 15:30, "Erick Erickson"  wrote:
>
> Did you try the autowarming like I mentioned in my previous e-mail?
>
> > On Jun 16, 2020, at 10:18 AM, James Bodkin <
> james.bod...@loveholidays.com> wrote:
> >
> > We've changed the schema to enable docValues for these fields and
> this led to an improvement in the response time. We found a further
> improvement by also switching off indexed as these fields are used for
> faceting and filtering only.
> > Since those changes, we've found that the first-execution for
> queries is really noticeable. I thought this would be the filterCache 
based
> on what I saw in NewRelic however it is probably trying to read the
> docValues from disk. How can we use the autowarming to improve this?
> >
> > For example, I've run the following queries in sequence and each
> query has a first-execution penalty.
> >
> > Query 1:
> >
> > q=*:*
> > facet=true
> > facet.field=D_DepartureAirport
> > facet.field=D_Destination
> > facet.limit=-1
> > rows=0
> >
> > Query 2:
> >
> > q=*:*
> > fq=D_DepartureAirport:(2660)
> > facet=true
> > facet.field=D_Destination
> > facet.limit=-1
> > rows=0
> >
> > Query 3:
> >
> > q=*:*
> > fq=D_DepartureAirport:(2661)
> > facet=true
> > facet.field=D_Destination
> > facet.limit=-1
> > rows=0
> >
> > Query 4:
> >
> > q=*:*
> > fq=D_DepartureAirport:(2660+OR+2661)
> > facet=true
> > facet.field=D_Destination
> > facet.limit=-1
> > rows=0
> >
> > We've kept the field type as a string, as the value is mapped by
> the application that accesses Solr. In the examples above, the values are
> mapped to airports and destinations.
> > Is it possible to prewarm the above queries without having to define
> all the potential filters manually in the auto warming?
> >
> > At the moment, we update and optimise our index in a different
> environment and then copy the index to our production instances by using a
> rolling deployment in Kubernetes.
> >
> > Kind Regards,
> >
> > James Bodkin
> >
> > On 12/06/2020, 18:58, "Erick Erickson" 
> wrote:
> >
> >I question whether the filterCache has anything to do with it; I
> suspect what’s really happening is that the first time you’re reading the
> relevant bits from disk into memory. And to double-check you should have
> docValues enabled for all these fields. The “uninverting” process can be
> very expensive, and docValues bypasses that.
> >
> >As of Solr 7.6, you can define “uninvertible=true” to your
> field(Type) to “fail fast” if Solr needs to uninvert the field.
> >
> >But that’s an aside. In either case, my claim is that 

Re: Facet Performance

2020-06-16 Thread Erick Erickson
Ok, I see the disconnect... Necessary parts of the index are read from disk
lazily. So your newSearcher or firstSearcher query needs to do whatever
operation causes the relevant parts of the index to be read. In this case,
probably just facet on all the fields you care about. I'd add sorting too
if you sort on different fields.

The *:* query without facets or sorting does virtually nothing due to some
special handling...
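
A minimal sketch of that warming query in solrconfig.xml (facet fields copied 
from James's examples; add a similar entry with a sort parameter for any fields 
you sort on, and mirror it under a firstSearcher listener for warming at startup):

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">*:*</str>
      <str name="rows">0</str>
      <str name="facet">true</str>
      <str name="facet.field">D_DepartureAirport</str>
      <str name="facet.field">D_Destination</str>
      <str name="facet.limit">-1</str>
    </lst>
  </arr>
</listener>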

On Tue, Jun 16, 2020, 10:48 James Bodkin 
wrote:

> I've been trying to build a query that I can use in newSearcher based off
> the information in your previous e-mail. I thought you meant to build a *:*
> query as per Query 1 in my previous e-mail but I'm still seeing the
> first-hit execution.
> Now I'm wondering if you meant to create a *:* query with each of the
> fields as part of the fl query parameters or a *:* query with each of the
> fields and values as part of the fq query parameters.
>
> At the moment I've been running these manually as I expected that I would
> see the first-execution penalty disappear by the time I got to query 4, as
> I thought this would replicate the actions of the newSearcher.
> Unfortunately we can't use the autowarm count that is available as part of
> the filterCache/queryResultCache due to the custom deployment mechanism we use
> to update our index.
>
> Kind Regards,
>
> James Bodkin
>
> On 16/06/2020, 15:30, "Erick Erickson"  wrote:
>
> Did you try the autowarming like I mentioned in my previous e-mail?
>
> > On Jun 16, 2020, at 10:18 AM, James Bodkin <
> james.bod...@loveholidays.com> wrote:
> >
> > We've changed the schema to enable docValues for these fields and
> this led to an improvement in the response time. We found a further
> improvement by also switching off indexed as these fields are used for
> faceting and filtering only.
> > Since those changes, we've found that the first-execution for
> queries is really noticeable. I thought this would be the filterCache based
> on what I saw in NewRelic however it is probably trying to read the
> docValues from disk. How can we use the autowarming to improve this?
> >
> > For example, I've run the following queries in sequence and each
> query has a first-execution penalty.
> >
> > Query 1:
> >
> > q=*:*
> > facet=true
> > facet.field=D_DepartureAirport
> > facet.field=D_Destination
> > facet.limit=-1
> > rows=0
> >
> > Query 2:
> >
> > q=*:*
> > fq=D_DepartureAirport:(2660)
> > facet=true
> > facet.field=D_Destination
> > facet.limit=-1
> > rows=0
> >
> > Query 3:
> >
> > q=*:*
> > fq=D_DepartureAirport:(2661)
> > facet=true
> > facet.field=D_Destination
> > facet.limit=-1
> > rows=0
> >
> > Query 4:
> >
> > q=*:*
> > fq=D_DepartureAirport:(2660+OR+2661)
> > facet=true
> > facet.field=D_Destination
> > facet.limit=-1
> > rows=0
> >
> > We've kept the field type as a string, as the value is mapped by
> the application that accesses Solr. In the examples above, the values are
> mapped to airports and destinations.
> > Is it possible to prewarm the above queries without having to define
> all the potential filters manually in the auto warming?
> >
> > At the moment, we update and optimise our index in a different
> environment and then copy the index to our production instances by using a
> rolling deployment in Kubernetes.
> >
> > Kind Regards,
> >
> > James Bodkin
> >
> > On 12/06/2020, 18:58, "Erick Erickson" 
> wrote:
> >
> >I question whether the filterCache has anything to do with it; I
> suspect what’s really happening is that the first time you’re reading the
> relevant bits from disk into memory. And to double-check you should have
> docValues enabled for all these fields. The “uninverting” process can be
> very expensive, and docValues bypasses that.
> >
> >As of Solr 7.6, you can define “uninvertible=true” to your
> field(Type) to “fail fast” if Solr needs to uninvert the field.
> >
> >But that’s an aside. In either case, my claim is that first-time
> execution does “something”, either reads the serialized docValues from disk
> or uninverts the field on Solr’s heap.
> >
> >You can have this autowarmed by any combination of
> >1> specifying an autowarm count on your queryResultCache. That’s
> hit or miss, as it replays the most recent N queries which may or may not
> contain the sorts. That said, specifying 10-20 for autowarm count is
> usually a good idea, assuming you’re not committing more than, say, every
> 30 seconds. I’d add the same to filterCache too.
> >
> >2> specifying a newSearcher or firstSearcher query in
> solrconfig.xml. The difference is that newSearcher is fired every time a
> commit happens, while firstSearcher is only fired when Solr starts, the
> theory being that there’s no cache autowarming available 

Re: Facet Performance

2020-06-16 Thread James Bodkin
I've been trying to build a query that I can use in newSearcher based off the 
information in your previous e-mail. I thought you meant to build a *:* query 
as per Query 1 in my previous e-mail but I'm still seeing the first-hit 
execution.
Now I'm wondering if you meant to create a *:* query with each of the fields as 
part of the fl query parameters or a *:* query with each of the fields and 
values as part of the fq query parameters.

At the moment I've been running these manually as I expected that I would see 
the first-execution penalty disappear by the time I got to query 4, as I 
thought this would replicate the actions of the newSearcher.
Unfortunately we can't use the autowarm count that is available as part of the 
filterCache/queryResultCache due to the custom deployment mechanism we use to update 
our index.

Kind Regards,

James Bodkin

On 16/06/2020, 15:30, "Erick Erickson"  wrote:

Did you try the autowarming like I mentioned in my previous e-mail?

> On Jun 16, 2020, at 10:18 AM, James Bodkin 
 wrote:
> 
> We've changed the schema to enable docValues for these fields and this 
led to an improvement in the response time. We found a further improvement by 
also switching off indexed as these fields are used for faceting and filtering 
only.
> Since those changes, we've found that the first-execution for queries is 
really noticeable. I thought this would be the filterCache based on what I saw 
in NewRelic however it is probably trying to read the docValues from disk. How 
can we use the autowarming to improve this?
> 
> For example, I've run the following queries in sequence and each query 
has a first-execution penalty.
> 
> Query 1:
> 
> q=*:*
> facet=true
> facet.field=D_DepartureAirport
> facet.field=D_Destination
> facet.limit=-1
> rows=0
> 
> Query 2:
> 
> q=*:*
> fq=D_DepartureAirport:(2660) 
> facet=true
> facet.field=D_Destination
> facet.limit=-1
> rows=0
> 
> Query 3:
> 
> q=*:*
> fq=D_DepartureAirport:(2661)
> facet=true
> facet.field=D_Destination
> facet.limit=-1
> rows=0
> 
> Query 4:
> 
> q=*:*
> fq=D_DepartureAirport:(2660+OR+2661)
> facet=true
> facet.field=D_Destination
> facet.limit=-1
> rows=0
> 
> We've kept the field type as a string, as the value is mapped by 
the application that accesses Solr. In the examples above, the values are mapped to 
airports and destinations.
> Is it possible to prewarm the above queries without having to define all 
the potential filters manually in the auto warming?
> 
> At the moment, we update and optimise our index in a different 
environment and then copy the index to our production instances by using a 
rolling deployment in Kubernetes.
> 
> Kind Regards,
> 
> James Bodkin
> 
> On 12/06/2020, 18:58, "Erick Erickson"  wrote:
> 
>I question whether the filterCache has anything to do with it; I suspect 
what’s really happening is that the first time you’re reading the relevant bits 
from disk into memory. And to double-check you should have docValues enabled for 
all these fields. The “uninverting” process can be very expensive, and 
docValues bypasses that.
> 
>As of Solr 7.6, you can define “uninvertible=true” to your field(Type) 
to “fail fast” if Solr needs to uninvert the field.
> 
>But that’s an aside. In either case, my claim is that first-time 
execution does “something”, either reads the serialized docValues from disk or 
uninverts the field on Solr’s heap.
> 
>You can have this autowarmed by any combination of
>1> specifying an autowarm count on your queryResultCache. That’s hit 
or miss, as it replays the most recent N queries which may or may not contain 
the sorts. That said, specifying 10-20 for autowarm count is usually a good 
idea, assuming you’re not committing more than, say, every 30 seconds. I’d add 
the same to filterCache too.
> 
>2> specifying a newSearcher or firstSearcher query in solrconfig.xml. 
The difference is that newSearcher is fired every time a commit happens, while 
firstSearcher is only fired when Solr starts, the theory being that there’s no 
cache autowarming available when Solr first powers up. Usually, people don’t 
bother with firstSearcher or just make it the same as newSearcher. Note that a 
query doesn’t have to be “real” at all. You can just add all the facet fields 
to a *:* query in a single go.
> 
>BTW, Trie fields will stay around for a long time even though 
deprecated. Or at least until we find something to replace them with that 
doesn’t have this penalty, so I’d feel pretty safe using those and they’ll be 
more efficient than strings.
> 
>Best,
>Erick
> 



Re: Facet Performance

2020-06-16 Thread Erick Erickson
Did you try the autowarming like I mentioned in my previous e-mail?

> On Jun 16, 2020, at 10:18 AM, James Bodkin  
> wrote:
> 
> We've changed the schema to enable docValues for these fields and this led to 
> an improvement in the response time. We found a further improvement by also 
> switching off indexed as these fields are used for faceting and filtering 
> only.
> Since those changes, we've found that the first-execution for queries is 
> really noticeable. I thought this would be the filterCache based on what I 
> saw in NewRelic however it is probably trying to read the docValues from 
> disk. How can we use the autowarming to improve this?
> 
> For example, I've run the following queries in sequence and each query has a 
> first-execution penalty.
> 
> Query 1:
> 
> q=*:*
> facet=true
> facet.field=D_DepartureAirport
> facet.field=D_Destination
> facet.limit=-1
> rows=0
> 
> Query 2:
> 
> q=*:*
> fq=D_DepartureAirport:(2660) 
> facet=true
> facet.field=D_Destination
> facet.limit=-1
> rows=0
> 
> Query 3:
> 
> q=*:*
> fq=D_DepartureAirport:(2661)
> facet=true
> facet.field=D_Destination
> facet.limit=-1
> rows=0
> 
> Query 4:
> 
> q=*:*
> fq=D_DepartureAirport:(2660+OR+2661)
> facet=true
> facet.field=D_Destination
> facet.limit=-1
> rows=0
> 
> We've kept the field type as a string, as the value is mapped by the application 
> that accesses Solr. In the examples above, the values are mapped to airports 
> and destinations.
> Is it possible to prewarm the above queries without having to define all the 
> potential filters manually in the auto warming?
> 
> At the moment, we update and optimise our index in a different environment 
> and then copy the index to our production instances by using a rolling 
> deployment in Kubernetes.
> 
> Kind Regards,
> 
> James Bodkin
> 
> On 12/06/2020, 18:58, "Erick Erickson"  wrote:
> 
>I question whether the filterCache has anything to do with it; I suspect what’s 
> really happening is that the first time you’re reading the relevant bits from 
> disk into memory. And to double-check you should have docValues enabled for 
> all these fields. The “uninverting” process can be very expensive, and 
> docValues bypasses that.
> 
>As of Solr 7.6, you can define “uninvertible=true” to your field(Type) to 
> “fail fast” if Solr needs to uninvert the field.
> 
>But that’s an aside. In either case, my claim is that first-time execution 
> does “something”, either reads the serialized docValues from disk or 
> uninverts the field on Solr’s heap.
> 
>You can have this autowarmed by any combination of
>1> specifying an autowarm count on your queryResultCache. That’s hit or 
> miss, as it replays the most recent N queries which may or may not contain 
> the sorts. That said, specifying 10-20 for autowarm count is usually a good 
> idea, assuming you’re not committing more than, say, every 30 seconds. I’d 
> add the same to filterCache too.
> 
>2> specifying a newSearcher or firstSearcher query in solrconfig.xml. The 
> difference is that newSearcher is fired every time a commit happens, while 
> firstSearcher is only fired when Solr starts, the theory being that there’s 
> no cache autowarming available when Solr first powers up. Usually, people 
> don’t bother with firstSearcher or just make it the same as newSearcher. Note 
> that a query doesn’t have to be “real” at all. You can just add all the facet 
> fields to a *:* query in a single go.
> 
>BTW, Trie fields will stay around for a long time even though deprecated. 
> Or at least until we find something to replace them with that doesn’t have 
> this penalty, so I’d feel pretty safe using those and they’ll be more 
> efficient than strings.
> 
>Best,
>Erick
> 



Re: Facet Performance

2020-06-16 Thread James Bodkin
We've changed the schema to enable docValues for these fields and this led to 
an improvement in the response time. We found a further improvement by also 
switching off indexed as these fields are used for faceting and filtering only.
Since those changes, we've found that the first-execution for queries is really 
noticeable. I thought this would be the filterCache based on what I saw in 
NewRelic however it is probably trying to read the docValues from disk. How can 
we use the autowarming to improve this?

For example, I've run the following queries in sequence and each query has a 
first-execution penalty.

Query 1:

q=*:*
facet=true
facet.field=D_DepartureAirport
facet.field=D_Destination
facet.limit=-1
rows=0

Query 2:

q=*:*
fq=D_DepartureAirport:(2660) 
facet=true
facet.field=D_Destination
facet.limit=-1
rows=0

Query 3:

q=*:*
fq=D_DepartureAirport:(2661)
facet=true
facet.field=D_Destination
facet.limit=-1
rows=0

Query 4:

q=*:*
fq=D_DepartureAirport:(2660+OR+2661)
facet=true
facet.field=D_Destination
facet.limit=-1
rows=0

We've kept the field type as a string, as the value is mapped by the application 
that accesses Solr. In the examples above, the values are mapped to airports 
and destinations.
Is it possible to prewarm the above queries without having to define all the 
potential filters manually in the auto warming?

At the moment, we update and optimise our index in a different environment and 
then copy the index to our production instances by using a rolling deployment 
in Kubernetes.

Kind Regards,

James Bodkin

On 12/06/2020, 18:58, "Erick Erickson"  wrote:

I question whether filterCache has anything to do with it; I suspect what’s 
really happening is that the first time you’re reading the relevant bits from disk 
into memory. And to double check, you should have docValues enabled for all these 
fields. The “uninverting” process can be very expensive, and docValues 
bypasses that.

As of Solr 7.6, you can set “uninvertible=false” on your field(Type) to 
“fail fast” if Solr needs to uninvert the field.

But that’s an aside. In either case, my claim is that first-time execution 
does “something”: it either reads the serialized docValues from disk or uninverts 
the field on Solr’s heap.

You can have this autowarmed by any combination of
1> specifying an autowarm count on your queryResultCache. That’s hit or 
miss, as it replays the most recent N queries which may or may not contain the 
sorts. That said, specifying 10-20 for autowarm count is usually a good idea, 
assuming you’re not committing more than, say, every 30 seconds. I’d add the 
same to filterCache too.

2> specifying a newSearcher or firstSearcher query in solrconfig.xml. The 
difference is that newSearcher is fired every time a commit happens, while 
firstSearcher is only fired when Solr starts, the theory being that there’s no 
cache autowarming available when Solr first powers up. Usually, people don’t 
bother with firstSearcher or just make it the same as newSearcher. Note that a 
query doesn’t have to be “real” at all. You can just add all the facet fields 
to a *:* query in a single go.

BTW, Trie fields will stay around for a long time even though deprecated. 
Or at least until we find something to replace them with that doesn’t have this 
penalty, so I’d feel pretty safe using those and they’ll be more efficient than 
strings.

Best,
Erick



Re: Facet Performance

2020-06-12 Thread Erick Erickson
I question whether filterCache has anything to do with it; I suspect what’s 
really happening is that the first time you’re reading the relevant bits from disk 
into memory. And to double check, you should have docValues enabled for all these 
fields. The “uninverting” process can be very expensive, and docValues 
bypasses that.

As of Solr 7.6, you can set “uninvertible=false” on your field(Type) to “fail 
fast” if Solr needs to uninvert the field.
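
As an illustration only (a sketch, not the schema from this thread), a faceting-only 
field along those lines might look like the following; the exact attribute mix is an 
assumption, with uninvertible="false" providing the fail-fast behaviour:

<field name="D_Destination" type="string" indexed="false"
       docValues="true" uninvertible="false"/>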

But that’s an aside. In either case, my claim is that first-time execution does 
“something”: it either reads the serialized docValues from disk or uninverts the 
field on Solr’s heap.

You can have this autowarmed by any combination of
1> specifying an autowarm count on your queryResultCache. That’s hit or miss, 
as it replays the most recent N queries which may or may not contain the sorts. 
That said, specifying 10-20 for autowarm count is usually a good idea, assuming 
you’re not committing more than, say, every 30 seconds. I’d add the same to 
filterCache too.
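
For reference, a hedged solrconfig.xml sketch of that suggestion (the sizes here are 
placeholders, not recommendations):

<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="20"/>
<filterCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="20"/>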

2> specifying a newSearcher or firstSearcher query in solrconfig.xml. The 
difference is that newSearcher is fired every time a commit happens, while 
firstSearcher is only fired when Solr starts, the theory being that there’s no 
cache autowarming available when Solr first powers up. Usually, people don’t 
bother with firstSearcher or just make it the same as newSearcher. Note that a 
query doesn’t have to be “real” at all. You can just add all the facet fields 
to a *:* query in a single go.
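
A sketch of such a listener in solrconfig.xml, using the facet fields mentioned in 
this thread (the exact warming query is an assumption; adjust fields and parameters 
to your index):

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">*:*</str>
      <str name="rows">0</str>
      <str name="facet">true</str>
      <str name="facet.field">D_DepartureAirport</str>
      <str name="facet.field">D_Destination</str>
      <str name="facet.limit">-1</str>
    </lst>
  </arr>
</listener>

A firstSearcher listener takes the same form with event="firstSearcher".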

BTW, Trie fields will stay around for a long time even though deprecated. Or at 
least until we find something to replace them with that doesn’t have this 
penalty, so I’d feel pretty safe using those and they’ll be more efficient than 
strings.

Best,
Erick

> On Jun 12, 2020, at 12:39 PM, James Bodkin  
> wrote:
> 
> We've run the performance test after changing the fields to be of the type 
> string. We're seeing improved performance, especially after the first time 
> the query has run. The first run is taking around 1-2 seconds rather than 6-8 
> seconds and when the filter cache is present, the response time is around 
> 400ms.
> Do you have any more suggestions that we could try in order to optimise the 
> performance?
> 
> On 11/06/2020, 14:49, "Erick Erickson"  wrote:
> 
>There’s a lot of confusion about using points-based fields for faceting, 
> see: https://issues.apache.org/jira/browse/SOLR-13227 for instance.
> 
>Two options you might try:
>1> copyField to a string field and facet on that (won’t work, of course, 
> for any kind of interval/range facet)
>2> use the deprecated Trie field instead. You could use the copyField to a 
> Trie field for this too.
> 
>Best,
>Erick
> 



Re: Facet Performance

2020-06-12 Thread James Bodkin
We've run the performance test after changing the fields to be of the type 
string. We're seeing improved performance, especially after the first time the 
query has run. The first run is taking around 1-2 seconds rather than 6-8 
seconds and when the filter cache is present, the response time is around 400ms.
Do you have any more suggestions that we could try in order to optimise the 
performance?

On 11/06/2020, 14:49, "Erick Erickson"  wrote:

There’s a lot of confusion about using points-based fields for faceting, 
see: https://issues.apache.org/jira/browse/SOLR-13227 for instance.

Two options you might try:
1> copyField to a string field and facet on that (won’t work, of course, 
for any kind of interval/range facet)
2> use the deprecated Trie field instead. You could use the copyField to a 
Trie field for this too.

Best,
Erick



Re: Facet Performance

2020-06-11 Thread James Bodkin
Could you explain why the performance is an issue for points-based fields? I've 
looked through the referenced issue (which is fixed in the version we are 
running) but I'm missing the link between the two. Is there an issue to improve 
this for points-based fields?
We're going to change the field type to a string, as our queries are always 
looking for a specific value (and not intervals/ranges) and rerun our load test.


Kind Regards,

James Bodkin

On 11/06/2020, 14:49, "Erick Erickson"  wrote:

There’s a lot of confusion about using points-based fields for faceting, 
see: https://issues.apache.org/jira/browse/SOLR-13227 for instance.

Two options you might try:
1> copyField to a string field and facet on that (won’t work, of course, 
for any kind of interval/range facet)
2> use the deprecated Trie field instead. You could use the copyField to a 
Trie field for this too.

Best,
Erick



Re: Facet Performance

2020-06-11 Thread Erick Erickson
There’s a lot of confusion about using points-based fields for faceting, see: 
https://issues.apache.org/jira/browse/SOLR-13227 for instance.

Two options you might try:
1> copyField to a string field and facet on that (won’t work, of course, for 
any kind of interval/range facet)
2> use the deprecated Trie field instead. You could use the copyField to a Trie 
field for this too.
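
A rough sketch of option 1 for the field in this thread (the *_str field name is 
made up, and "string" is assumed to map to solr.StrField in your schema):

<field name="D_UserRatingGte_str" type="string" indexed="false" stored="false"
       docValues="true" multiValued="true"/>
<copyField source="D_UserRatingGte" dest="D_UserRatingGte_str"/>

Facet requests would then use facet.field=D_UserRatingGte_str while searches keep 
using the original point-based field.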

Best,
Erick

> On Jun 11, 2020, at 9:39 AM, James Bodkin  
> wrote:
> 
> We’ve been running a load test against our index and have noticed that the 
> facet queries are significantly slower than we would like.
> Currently these types of queries are taking several seconds to execute and 
> are wondering if it would be possible to speed these up.
> Repeating the same query over and over does not improve the response time so 
> does not appear to utilise any caching.
> Ideally we would like to be targeting a response time around tens or hundreds 
> of milliseconds if possible.
> 
> An example query that is taking around 2-3 seconds to execute is:
> 
> q=*:*
> facet=true
> facet.field=D_UserRatingGte
> facet.mincount=1
> facet.limit=-1
> rows=0
> 
> "response":{"numFound":18979503,"start":0,"maxScore":1.0,"docs":[]}
> "facet_counts":{
>"facet_queries":{},
>"facet_fields":{
>  "D_UserRatingGte":[
>"1575",16614238,
>"1576",16614238,
>"1577",16614238,
>"1578",16065938,
>"1579",12079545,
>"1580",458799]},
>"facet_ranges":{},
>"facet_intervals":{},
>"facet_heatmaps":{}}}
> 
> I have also tried the equivalent query using the JSON Facet API with the same 
> outcome of slow response time.
> Additionally I have tried changing the facet method (on both facet apis) with 
> the same outcome of slow response time.
> 
> The underlying field for the above query is configured as a 
> solr.IntPointField with docValues, indexed and multiValued set to true.
> The index has just under 19 million documents and the physical size on disk 
> is 10.95GB. The index is read-only and consists of 4 segments with 0 
> deletions.
> We’re running standalone Solr 8.3.1 with an 8GB Heap and the underlying Google 
> Cloud Virtual Machine in our load test environment has 6 vCPUs, 32G RAM and 
> 100GB SSD.
> 
> Would anyone be able to point me in a direction to either improve the 
> performance or understand whether the current performance is expected?
> 
> Kind Regards,
> 
> James Bodkin



Re: Facet performance problem

2018-02-20 Thread Shawn Heisey

On 2/20/2018 1:18 AM, LOPEZ-CORTES Mariano-ext wrote:

We return a facet list of values in "motifPresence" field (person status).
Status:
[ ] status1
[x] status2
[x] status3

The user then selects 1 or multiple status (It's this step that we called "facet 
filtering").

Query is then re-executed with fq=motifPresence:(status2 OR status3)

We use fq in order to not alter the score in main query.

We've read that docValues=true is recommended for facet fields.

Do we also need indexed=true?


Facets, grouping, and sorting are more efficient with docValues, but 
searches aren't helped by docValues.  Without indexed="true", searches 
on the field will be VERY slow.  A filter query is still a search.  The 
"filter" in filter query just refers to the fact that it's separate from 
the main query, and that it does not affect relevancy scoring.
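
In schema terms, a field that is both filtered on (fq) and faceted would typically 
carry both properties; this is a sketch only, since the real field type isn't shown 
in the thread:

<field name="motifPresence" type="string" indexed="true" stored="true"
       docValues="true" required="false"/>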


Thanks,
Shawn



RE: Facet performance problem

2018-02-20 Thread LOPEZ-CORTES Mariano-ext
Our query looks like this:

...&facet=true&facet.field=motifPresence

We return a facet list of values in "motifPresence" field (person status).
Status:
[ ] status1
[x] status2
[x] status3

The user then selects 1 or multiple status (It's this step that we called 
"facet filtering").

Query is then re-executed with fq=motifPresence:(status2 OR status3)

We use fq in order to not alter the score in main query.

We've read that docValues=true is recommended for facet fields.

Do we also need indexed=true?
Is there any other problem in our solution?

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Monday, 19 February 2018 18:18
To: solr-user
Subject: Re: Facet performance problem

I'm confused here. What do you mean by "facet filtering"? Your examples have no 
facets at all, just a _filter query_.

I'll assume you want to use filter query (fq), and faceting has nothing to do 
with it. This is one of the tricky bits of docValues.
While it's _possible_ to search on a field that's defined as above, it's very 
inefficient since there's no "inverted index" for the field, you specified 
'indexed="false" '. So the docValues are searched, and it's essentially a table 
scan.

If you mean to search against this field, set indexed="true". You'll have to 
completely reindex your corpus of course.

If you intend to facet, group or sort on this field, you should _also_ have 
docValues="true".

Best,
Erick

On Mon, Feb 19, 2018 at 7:47 AM, MOUSSA MZE Oussama-ext 
<oussama.moussa-mze-...@pole-emploi.fr> wrote:
> Hi
>
> We have the following environment:
>
> 3 nodes cluster
> 1 shard
> Replication factor = 2
> 8GB per node
>
> 29 millions of documents
>
> We're faceting over the field "motifPresence", defined as follows:
>
> <field name="motifPresence" ... indexed="false" stored="true" required="false"/>
>
> Once the user selects a motifPresence filter, we execute the search again with:
>
> fq: (value1 OR value2 OR value3 OR ...)
>
> The problem is: during facet filtering the query is too slow, and its response 
> time is greater than the main search (without facet filtering).
>
> Thanks in advance!


Re: Facet performance problem

2018-02-19 Thread Erick Erickson
I'm confused here. What do you mean by "facet filtering"? Your
examples have no facets at all, just a _filter query_.

I'll assume you want to use filter query (fq), and faceting has
nothing to do with it. This is one of the tricky bits of docValues.
While it's _possible_ to search on a field that's defined as above,
it's very inefficient since there's no "inverted index" for the field,
you specified 'indexed="false" '. So the docValues are searched, and
it's essentially a table scan.

If you mean to search against this field, set indexed="true". You'll
have to completely reindex your corpus of course.

If you intend to facet, group or sort on this field, you should _also_
have docValues="true".

Best,
Erick

On Mon, Feb 19, 2018 at 7:47 AM, MOUSSA MZE Oussama-ext
 wrote:
> Hi
>
> We have the following environment:
>
> 3 nodes cluster
> 1 shard
> Replication factor = 2
> 8GB per node
>
> 29 millions of documents
>
> We're faceting over the field "motifPresence", defined as follows:
>
> <field name="motifPresence" ... indexed="false" stored="true" required="false"/>
>
> Once the user selects a motifPresence filter, we execute the search again with:
>
> fq: (value1 OR value2 OR value3 OR ...)
>
> The problem is: during facet filtering the query is too slow, and its response 
> time is greater than the main search (without facet filtering).
>
> Thanks in advance!


RE: Facet performance

2013-10-23 Thread Toke Eskildsen
On Tue, 2013-10-22 at 17:25 +0200, Lemke, Michael SZ/HZA-ZSW wrote:
 On Tue, October 22, 2013 11:54 AM Andre Bois-Crettez wrote:
  This is with Solr 1.4.
 Really?
 This sounds really outdated to me.
 Have you tried a more recent version? 4.5 just went out.
 
 Sorry, can't.  Too much `grown' stuff.

I did not see that. I guess I parsed it as 4.1.

Well, that rules out DocValues and fcs (as far as I remember). I am a
bit surprised that the limit on #terms with fc is also in 1.4. I thought
it was introduced in a later version.

We too have been in a position where upgrading was hard due to homegrown
addons. We even scrapped some DidYouMean-like functionality when going
from 3.x to 4.x, but 4.x was so much better that there was little
choice.

Last suggestion for using fc: Create 2 or more CONTENT-fields and choose
between them randomly when indexing. Facet on all the CONTENT fields and
merge the results. It will take a bit more RAM though, so it is still
out on your (assumedly) 32 bit machine.

Regards,
Toke Eskildsen, State and University Library, Denmark



RE: Facet performance

2013-10-23 Thread Lemke, Michael SZ/HZA-ZSW
On Tue, October 22, 2013 5:23 PM Michael Lemke wrote:
On Tue, October 22, 2013 9:23 AM Toke Eskildsen wrote:
On Mon, 2013-10-21 at 16:57 +0200, Lemke, Michael SZ/HZA-ZSW wrote:
 QTime fc:
never returns, webserver restarts itself after 30 min with 100% CPU 
 load

It might be because it dies due to garbage collection. But since more
memory (as your test server presumably has) just leads to the too many
values-error, there isn't much to do.

Essentially, fc is out then.


 QTime=41205  facet.prefix=   q=frequent_word  
 numFound=44532
 
 Same query repeated:
 QTime=225810 facet.prefix=   q=ottomotor  
 numFound=909
 QTime=199839 facet.prefix=   q=ottomotor  
 numFound=909

I am stumped on this, sorry. I do not understand why the 'ottomotor'
query can take 5 times as long as the 'frequent_word'-one.

I looked into this some more this morning.  I noticed the java process was 
doing
a lot of I/O as shown in Process Explorer.  For the frequent_word it read 
about 
180MB, for ottomotor it was about seven times as much, ~ 1,200 MB.


Got another observation today.  The response time for q=ottomotor depends on 
facet.limit:

QTime=59300  facet.limit=2
QTime=69395  facet.limit=4
QTime=85208  facet.limit=6
QTime=158150 facet.limit=8
QTime=186276 facet.limit=10
QTime=231763 facet.limit=15
QTime=260437 facet.limit=20
QTime=312268 facet.limit=30

For q=frequent_word the result is much less pronounced and shows only
for facet.limit = 15 :

QTime=0  facet.limit=0
QTime=20535  facet.limit=1
QTime=13456  facet.limit=2
QTime=13925  facet.limit=4
QTime=13705  facet.limit=6
QTime=13924  facet.limit=8
QTime=13799  facet.limit=10
QTime=14361  facet.limit=15
QTime=14704  facet.limit=20
QTime=15189  facet.limit=30
QTime=16783  facet.limit=50
QTime=57128  facet.limit=500

Looks to me like, for Solr to collect enough facets to fulfill the limit constraint,
it has to read much more of the index in the case of the infrequent word.

jconsole didn't show anything unusual according to our more experienced Java 
experts here.  Nor was the machine swapping.

Is it possible to screw up an index such that this sort of faceting leads to
constant reading of the index?  Something like full table scans in a db?


Michael




RE: Facet performance

2013-10-22 Thread Toke Eskildsen
On Mon, 2013-10-21 at 16:57 +0200, Lemke, Michael SZ/HZA-ZSW wrote:
 QTime enum:
  1st call: 1200
  subsequent calls: 200

Those numbers seems fine.

 QTime fc:
never returns, webserver restarts itself after 30 min with 100% CPU 
 load

It might be because it dies due to garbage collection. But since more
memory (as your test server presumably has) just leads to the too many
values-error, there isn't much to do.

 QTime=41205  facet.prefix=   q=frequent_word  
 numFound=44532
 
 Same query repeated:
 QTime=225810 facet.prefix=   q=ottomotor  
 numFound=909
 QTime=199839 facet.prefix=   q=ottomotor  
 numFound=909

I am stumped on this, sorry. I do not understand why the 'ottomotor'
query can take 5 times as long as the 'frequent_word'-one.

 QTime=185948 facet.prefix=   q=ottomotor  
 numFound=909
 
 QTime=3344   facet.prefix=d   q=ottomotor  
 numFound=909

Fits with expectations.

 - Documents in your index
 13,434,414
 
 - Unique values in the CONTENT field
 Not sure how to get this.  In luke I find
 21,797,514 term count CONTENT

Those are the relevant numbers for faceting. There is a limit of 2^24
(16M) terms for facet.method=fc, although I am a bit unsure if that is
for the whole index or per segment.

Come to think of it, if you have a multi-segmented index, you might want
to try facet.method=fcs. It should have faster startup than fc and
better performance than enum for fields with a large number of unique
values. Memory requirements should be between fc and enum.

 - Xmx
 The maximum the system allows me to get: 1612m
 
 Maybe I have a hopelessly under-dimensioned server for this sort of things?

Well, 1612m should be enough for the faceting in itself; it is the
startup that is the killer. 

A rule of thumb for fc is that the internal structure takes at least
#docs*log(#references) + #references*log(#unique_values) bytes

If your content field is a description, let's say that each description
has 40 words, which gives us 500M references from documents to facet
values. This translates to
13M*log(500M) + 500M*log(22M) bytes ~= 13M*29 + 500M*25 bytes ~= 380MB.

Taking into account that building the structure has an overhead of 2-3
times that, we are approaching the memory limit of 1612m. If the index
is updated, a new facet structure is built all over again while the old
structure is still in memory.


If you need better performance on your large field I would suggest, in
order of priority:

- facet.method=fcs
- facet.method=fcs with DocValues
- Shard your index and use facet.method=fc
- SOLR-2412 (https://issues.apache.org/jira/browse/SOLR-2412)

SOLR-2412 is a last resort, but it does have the same speed as
facet.method=fc only without the 16M unique values limitation.

Regards,
Toke Eskildsen, State and University Library, Denmark



Re: Facet performance

2013-10-22 Thread Andre Bois-Crettez

This is with Solr 1.4.

Really ?
This sounds really outdated to me.
Have you tried a more recent version? 4.5 just went out.

--
André Bois-Crettez

Software Architect
Search Developer
http://www.kelkoo.com/


Kelkoo SAS
Société par Actions Simplifiée
Au capital de € 4.168.964,30
Siège social : 8, rue du Sentier 75002 Paris
425 093 069 RCS Paris



RE: Facet performance

2013-10-22 Thread Lemke, Michael SZ/HZA-ZSW
On Tue, October 22, 2013 9:23 AM Toke Eskildsen wrote:
On Mon, 2013-10-21 at 16:57 +0200, Lemke, Michael SZ/HZA-ZSW wrote:
 QTime fc:
never returns, webserver restarts itself after 30 min with 100% CPU 
 load

It might be because it dies due to garbage collection. But since more
memory (as your test server presumably has) just leads to the too many
values-error, there isn't much to do.

Essentially, fc is out then.


 QTime=41205  facet.prefix=   q=frequent_word  
 numFound=44532
 
 Same query repeated:
 QTime=225810 facet.prefix=   q=ottomotor  
 numFound=909
 QTime=199839 facet.prefix=   q=ottomotor  
 numFound=909

I am stumped on this, sorry. I do not understand why the 'ottomotor'
query can take 5 times as long as the 'frequent_word'-one.

I looked into this some more this morning.  I noticed the java process was doing
a lot of I/O as shown in Process Explorer.  For the frequent_word it read about 
180MB, for ottomotor it was about seven times as much, ~ 1,200 MB.

jconsole didn’t show anything unusual according to our more experienced Java 
experts here.  Nor was the machine swapping.

Is it possible to screw up an index such that this sort of faceting leads to
constant reading of the index?  Something like full table scans in a db?

Michael


RE: Facet performance

2013-10-22 Thread Lemke, Michael SZ/HZA-ZSW
On Tue, October 22, 2013 11:54 AM Andre Bois-Crettez wrote:

 This is with Solr 1.4.
Really ?
This sounds really outdated to me.
Have you tried a more recent version? 4.5 just went out.

Sorry, can't.  Too much `grown' stuff.

Michael


RE: Facet performance

2013-10-21 Thread Toke Eskildsen
On Fri, 2013-10-18 at 18:30 +0200, Lemke, Michael SZ/HZA-ZSW wrote:
 Toke Eskildsen [mailto:t...@statsbiblioteket.dk] wrote:
  Unfortunately the enum-solution is normally quite slow when there
  are enough unique values to trigger the too many  values-exception.
  [...]
 
 [...] And yes, the fc method was terribly slow in a case where it did
 work.  Something like 20 minutes whereas enum returned within a few
 seconds.

Err.. What? That sounds _very_ strange. You have millions of unique
values so fc should be a lot faster than enum, not the other way around.

I assume the 20 minutes was for the first call. How fast does subsequent
calls return for fc?


Maybe you could provide some approximate numbers?

- Documents in your index
- Unique values in the CONTENT field
- Hits are returned from a typical query
- Xmx

Regards,
Toke Eskildsen, State and University Library, Denmark



RE: Facet performance

2013-10-21 Thread Lemke, Michael SZ/HZA-ZSW
On Mon, October 21, 2013 10:04 AM, Toke Eskildsen wrote:
On Fri, 2013-10-18 at 18:30 +0200, Lemke, Michael SZ/HZA-ZSW wrote:
 Toke Eskildsen wrote:
  Unfortunately the enum-solution is normally quite slow when there
  are enough unique values to trigger the too many  values-exception.
  [...]
 
 [...] And yes, the fc method was terribly slow in a case where it did
 work.  Something like 20 minutes whereas enum returned within a few
 seconds.

Err.. What? That sounds _very_ strange. You have millions of unique
values so fc should be a lot faster than enum, not the other way around.

I assume the 20 minutes was for the first call. How fast does subsequent
calls return for fc?

QTime enum:
 1st call: 1200
 subsequent calls: 200

QTime fc:
   never returns, webserver restarts itself after 30 min with 100% CPU load


This is on the test system, the production system managed to return with
... Too many values for UnInvertedField faceting 

However, I also have different faceting queries I played with today.

One complete example:

q=ottomotor&facet.field=CONTENT&facet=true&facet.prefix=&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0

These are the results, all with facet.method=enum (fc doesn't work).  They
were executed in the sequence shown on an otherwise unused server:

QTime=41205  facet.prefix=   q=frequent_word  
numFound=44532

Same query repeated:
QTime=225810 facet.prefix=   q=ottomotor  
numFound=909
QTime=199839 facet.prefix=   q=ottomotor  
numFound=909

QTime=0  facet.prefix=   q=ottomotor jkdhwjfh 
numFound=0
QTime=0  facet.prefix=   q=jkdhwjfh   
numFound=0

QTime=185948 facet.prefix=   q=ottomotor  
numFound=909

QTime=3344   facet.prefix=d   q=ottomotor  
numFound=909
QTime=3078   facet.prefix=d   q=ottomotor  
numFound=909
QTime=3141   facet.prefix=d   q=ottomotor  
numFound=909

The response time is obviously not dependent on the number of documents found.
Caching doesn't kick in either.



Maybe you could provide some approximate numbers?

I'll try, see below.  Thanks for asking and having a closer look.


- Documents in your index
13,434,414

- Unique values in the CONTENT field
Not sure how to get this.  In luke I find
21,797,514 term count CONTENT

Is that what you mean?

- Hits are returned from a typical query
Hm, that can be anything between 0 and 40,000 or more.
Or do you mean from the facets?  Or do my tests above
answer it?

- Xmx
The maximum the system allows me to get: 1612m


Maybe I have a hopelessly under-dimensioned server for this sort of things?

Thanks a lot for your help,
Michael


RE: Facet performance

2013-10-18 Thread Toke Eskildsen
Lemke, Michael  SZ/HZA-ZSW [lemke...@schaeffler.com] wrote:
 1. 
 q=word&facet.field=CONTENT&facet=true&facet.prefix=&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0
 2. 
 q=word&facet.field=CONTENT&facet=true&facet.prefix=a&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0

 The only difference is an empty facet.prefix in the first query.

 The first query returns after some 20 seconds (QTime 2 in the result) 
 while
 the second one takes only 80 msec (QTime 80). Why is this?

If your index was just opened when you issued your queries, the first request 
will be notably slower than the second as the facet values might not be in the 
disk cache.

Furthermore, for enum the difference between no prefix and some prefix is huge. 
As enum iterates values first (as opposed to fc that iterates hits first), 
limiting to only the values that starts with 'a' ought to speed up retrieval by 
a factor 10 or more.

 And as a side note: facet.method=fc makes the queries run 'forever' and 
 eventually
 fail with org.apache.solr.common.SolrException: Too many values for 
 UnInvertedField faceting on field CONTENT.

An internal memory structure optimization in Solr limits the amount of possible 
unique values when using fc. It is not a bug as such, but more a consequence of 
a choice. Unfortunately the enum-solution is normally quite slow when there are 
enough unique values to trigger the too many values-exception. I know too 
little about the structures for DocValues to say if they will help here, but 
you might want to take a look at those.

- Toke Eskildsen

RE: Facet performance

2013-10-18 Thread Lemke, Michael SZ/HZA-ZSW
Toke Eskildsen [mailto:t...@statsbiblioteket.dk] wrote:
Lemke, Michael  SZ/HZA-ZSW [lemke...@schaeffler.com] wrote:
 1. 
 q=word&facet.field=CONTENT&facet=true&facet.prefix=&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0
 2. 
 q=word&facet.field=CONTENT&facet=true&facet.prefix=a&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0

 The only difference is an empty facet.prefix in the first query.

 The first query returns after some 20 seconds (QTime 2 in the result) 
 while
 the second one takes only 80 msec (QTime 80). Why is this?

If your index was just opened when you issued your queries, the first request 
will be notably slower than the second as the facet values might not be in 
the disk cache.

I know, but it shouldn't be orders of magnitude as in this example, should it?


Furthermore, for enum the difference between no prefix and some prefix is 
huge. As enum iterates values first (as opposed to fc that iterates hits 
first), limiting to only the values that starts with 'a' ought to speed up 
retrieval by a factor 10 or more.

Thanks.  That is what we sort of figured but it's good to know for sure.  Of 
course it begs the question if there is a way to speed this up?


 And as a side note: facet.method=fc makes the queries run 'forever' and 
 eventually
 fail with org.apache.solr.common.SolrException: Too many values for 
 UnInvertedField faceting on field CONTENT.

An internal memory structure optimization in Solr limits the amount of 
possible unique values when using fc. It is not a bug as such, but more a 
consequence of a choice. Unfortunately the enum-solution is normally quite 
slow when there are enough unique values to trigger the too many 
values-exception. I know too little about the structures for DocValues to say 
if they will help here, but you might want to take a look at those.

What is DocValues?  Haven't heard of it yet.  And yes, the fc method was 
terribly slow in a case where it did work.  Something like 20 minutes whereas 
enum returned within a few seconds.

Michael



Re: Facet performance

2013-10-18 Thread Otis Gospodnetic
DocValues is the new black
http://wiki.apache.org/solr/DocValues

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
SOLR Performance Monitoring -- http://sematext.com/spm



On Fri, Oct 18, 2013 at 12:30 PM, Lemke, Michael  SZ/HZA-ZSW
lemke...@schaeffler.com wrote:
 Toke Eskildsen [mailto:t...@statsbiblioteket.dk] wrote:
Lemke, Michael  SZ/HZA-ZSW [lemke...@schaeffler.com] wrote:
 1. 
 q=word&facet.field=CONTENT&facet=true&facet.prefix=&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0
 2. 
 q=word&facet.field=CONTENT&facet=true&facet.prefix=a&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0

 The only difference is an empty facet.prefix in the first query.

 The first query returns after some 20 seconds (QTime 2 in the result) 
 while
 the second one takes only 80 msec (QTime 80). Why is this?

If your index was just opened when you issued your queries, the first request 
will be notably slower than the second as the facet values might not be in
 the disk cache.

 I know, but it shouldn't be orders of magnitude as in this example, should it?


Furthermore, for enum the difference between no prefix and some prefix is 
huge. As enum iterates values first (as opposed to fc that iterates hits 
first), limiting to only the values that starts with 'a' ought to speed up 
retrieval by a factor 10 or more.

 Thanks.  That is what we sort of figured but it's good to know for sure.  Of 
 course it begs the question if there is a way to speed this up?


 And as a side note: facet.method=fc makes the queries run 'forever' and 
 eventually
 fail with org.apache.solr.common.SolrException: Too many values for 
 UnInvertedField faceting on field CONTENT.

An internal memory structure optimization in Solr limits the amount of 
possible unique values when using fc. It is not a bug as such, but more a 
consequence of a choice. Unfortunately the enum-solution is normally quite 
slow when there are enough unique values to trigger the too many 
values-exception. I know too little about the structures for DocValues to 
say if they will help here, but you might want to take a look at those.

 What is DocValues?  Haven't heard of it yet.  And yes, the fc method was 
 terribly slow in a case where it did work.  Something like 20 minutes whereas 
 enum returned within a few seconds.

 Michael



RE: Facet performance

2013-10-18 Thread Chris Hostetter

:  1. 
q=word&facet.field=CONTENT&facet=true&facet.prefix=&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0
:  2. 
q=word&facet.field=CONTENT&facet=true&facet.prefix=a&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0
: 
:  The only difference is an empty facet.prefix in the first query.

: If your index was just opened when you issued your queries, the first 
: request will be notably slower than the second as the facet values might 
: not be in the disk cache.
: 
: I know, but it shouldn't be orders of magnitude as in this example, should it?

in and of itself: it can be if your index is large enough and none of the 
disk pages are in the file system buffer.

more significantly, however, is that depending on how big your filterCache 
is, the first request could easily be caching all of the filters needed for 
the second query -- at a minimum it's definitely caching your main query, 
which will be re-used and save a lot of time independent of the faceting.


-Hoss


Re: facet performance tips

2009-08-13 Thread Jérôme Etévé
Thanks everyone for your advices.

I increased my filterCache, and the faceting performances improved greatly.

My faceted field can have at the moment ~4 different terms, so I
did set a filterCache size of 5 and it works very well.

However, I'm planning to increase the number of terms to maybe around
500 000, so I guess this approach won't work anymore, as I doubt a 500
000 sized fieldCache would work.

So I guess my best move would be to upgrade to the soon to be 1.4
version of solr to benefit from its new faceting method.

I know this is a bit off-topic, but do you have a rough idea about
when 1.4 will be an official release?
As well, is the current trunk OK for production? Is it compatible with
1.3 configuration files?

Thanks !

Jerome.

2009/8/13 Stephen Duncan Jr stephen.dun...@gmail.com:
 Note that depending on the profile of your field (full text and how many
 unique terms on average per document), the improvements from 1.4 may not
 apply, as you may exceed the limits of the new faceting technique in Solr
 1.4.
 -Stephen

 On Wed, Aug 12, 2009 at 2:12 PM, Erik Hatcher ehatc...@apache.org wrote:

 Yes, increasing the filterCache size will help with Solr 1.3 performance.

 Do note that trunk (soon Solr 1.4) has dramatically improved faceting
 performance.

Erik


 On Aug 12, 2009, at 1:30 PM, Jérôme Etévé wrote:

  Hi everyone,

  I'm using some faceting on a solr index containing ~ 160K documents.
 I perform facets on multivalued string fields. The number of possible
 different values is quite large.

 Enabling facets degrades the performance by a factor 3.

 Because I'm using solr 1.3, I guess the facetting makes use of the
 filter cache to work. My filterCache is set
 to a size of 2048. I also noticed in my solr stats a very small ratio
 of cache hit (~ 0.01%).

 Can it be the reason why the faceting is slow? Does it make sense to
 increase the filterCache size so it matches more or less the number
 of different possible values for the faceted fields? Would that not
 make the memory usage explode?

 Thanks for your help !

 --
 Jerome Eteve.

 Chat with me live at http://www.eteve.net

 jer...@eteve.net





 --
 Stephen Duncan Jr
 www.stephenduncanjr.com




-- 
Jerome Eteve.

Chat with me live at http://www.eteve.net

jer...@eteve.net


RE: facet performance tips

2009-08-13 Thread Fuad Efendi
I took 1.4 from trunk three days ago, it seems Ok for production (at least for 
my Master instance which is doing writes-only). I use the same config files.

500 000 terms are Ok too; I am using several millions with pre-1.3 SOLR taken 
from trunk.

However, do not try to facet (probably outdated term after SOLR-475) on 
generic queries such as [* TO *] (with huge resultset). For smaller query 
results (100,000 instead of 100,000,000) counting terms is fast enough (few 
milliseconds at http://www.tokenizer.org)

 

-Original Message-
From: Jérôme Etévé [mailto:jerome.et...@gmail.com] 
Sent: August-13-09 5:38 AM
To: solr-user@lucene.apache.org
Subject: Re: facet performance tips

Thanks everyone for your advices.

I increased my filterCache, and the faceting performances improved greatly.

My faceted field can have at the moment ~4 different terms, so I
did set a filterCache size of 5 and it works very well.

However, I'm planning to increase the number of terms to maybe around
500 000, so I guess this approach won't work anymore, as I doubt a 500
000 sized fieldCache would work.

So I guess my best move would be to upgrade to the soon to be 1.4
version of solr to benefit from its new faceting method.

I know this is a bit off-topic, but do you have a rough idea about
when 1.4 will be an official release?
As well, is the current trunk OK for production? Is it compatible with
1.3 configuration files?

Thanks !

Jerome.

2009/8/13 Stephen Duncan Jr stephen.dun...@gmail.com:
 Note that depending on the profile of your field (full text and how many
 unique terms on average per document), the improvements from 1.4 may not
 apply, as you may exceed the limits of the new faceting technique in Solr
 1.4.
 -Stephen

 On Wed, Aug 12, 2009 at 2:12 PM, Erik Hatcher ehatc...@apache.org wrote:

 Yes, increasing the filterCache size will help with Solr 1.3 performance.

 Do note that trunk (soon Solr 1.4) has dramatically improved faceting
 performance.

Erik


 On Aug 12, 2009, at 1:30 PM, Jérôme Etévé wrote:

  Hi everyone,

  I'm using some faceting on a solr index containing ~ 160K documents.
 I perform facets on multivalued string fields. The number of possible
 different values is quite large.

 Enabling facets degrades the performance by a factor 3.

 Because I'm using solr 1.3, I guess the facetting makes use of the
 filter cache to work. My filterCache is set
 to a size of 2048. I also noticed in my solr stats a very small ratio
 of cache hit (~ 0.01%).

 Can it be the reason why the faceting is slow? Does it make sense to
 increase the filterCache size so it matches more or less the number
 of different possible values for the faceted fields? Would that not
 make the memory usage explode?

 Thanks for your help !

 --
 Jerome Eteve.

 Chat with me live at http://www.eteve.net

 jer...@eteve.net





 --
 Stephen Duncan Jr
 www.stephenduncanjr.com




-- 
Jerome Eteve.

Chat with me live at http://www.eteve.net

jer...@eteve.net




RE: facet performance tips

2009-08-13 Thread Fuad Efendi
It seems BOBO-Browse is an alternate faceting engine; it would be interesting to
compare performance with SOLR... Distributed?


-Original Message-
From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com] 
Sent: August-12-09 6:12 PM
To: solr-user@lucene.apache.org
Subject: Re: facet performance tips

For your fields with many terms you may want to try Bobo
http://code.google.com/p/bobo-browse/ which could work well with your
case.






RE: facet performance tips

2009-08-13 Thread Fuad Efendi
Interesting, it has BoboRequestHandler implements SolrRequestHandler
- easy to try it; and shards support



[Fuad Efendi] It seems BOBO-Browse is alternate faceting engine; would be
interesting to
compare performance with SOLR... Distributed?


[Jason Rutherglen] For your fields with many terms you may want to try Bobo
http://code.google.com/p/bobo-browse/ which could work well with your
case.








Re: facet performance tips

2009-08-13 Thread Jason Rutherglen
Yeah we need a performance comparison, I haven't had time to put
one together. If/when I do I'll compare Bobo performance against
Solr bitset intersection based facets, compare memory
consumption.

For near realtime Solr needs to cache and merge bitsets at the
SegmentReader level, and Bobo needs to be upgraded to work with
Lucene 2.9's searching at the segment level (currently it uses a
MultiSearcher).

Distributed search on either should be fairly straightforward?

On Thu, Aug 13, 2009 at 9:55 AM, Fuad Efendif...@efendi.ca wrote:
 It seems BOBO-Browse is alternate faceting engine; would be interesting to
 compare performance with SOLR... Distributed?


 -Original Message-
 From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com]
 Sent: August-12-09 6:12 PM
 To: solr-user@lucene.apache.org
 Subject: Re: facet performance tips

 For your fields with many terms you may want to try Bobo
 http://code.google.com/p/bobo-browse/ which could work well with your
 case.







RE: facet performance tips

2009-08-13 Thread Fuad Efendi
SOLR-1.4-trunk uses terms counting instead of bitset intersects (seems to
be); check this
http://issues.apache.org/jira/browse/SOLR-475
(and probably http://issues.apache.org/jira/browse/SOLR-711)

-Original Message-
From: Jason Rutherglen 

Yeah we need a performance comparison, I haven't had time to put
one together. If/when I do I'll compare Bobo performance against
Solr bitset intersection based facets, compare memory
consumption.

For near realtime Solr needs to cache and merge bitsets at the
SegmentReader level, and Bobo needs to be upgraded to work with
Lucene 2.9's searching at the segment level (currently it uses a
MultiSearcher).

Distributed search on either should be fairly straightforward?

On Thu, Aug 13, 2009 at 9:55 AM, Fuad Efendif...@efendi.ca wrote:
 It seems BOBO-Browse is alternate faceting engine; would be interesting to
 compare performance with SOLR... Distributed?


 -Original Message-
 From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com]
 Sent: August-12-09 6:12 PM
 To: solr-user@lucene.apache.org
 Subject: Re: facet performance tips

 For your fields with many terms you may want to try Bobo
 http://code.google.com/p/bobo-browse/ which could work well with your
 case.









Re: facet performance tips

2009-08-13 Thread Jason Rutherglen
Right, I haven't used SOLR-475 yet and am more familiar with
Bobo. I believe there are differences but I haven't gone into
them yet. As I'm using Solr 1.4 now, maybe I'll test the
UnInvertedField modality.

Feel free to report back results as I don't think I've seen much
yet?

On Thu, Aug 13, 2009 at 10:51 AM, Fuad Efendif...@efendi.ca wrote:
 SOLR-1.4-trunk uses terms counting instead of bitset intersects (seems to
 be); check this
 http://issues.apache.org/jira/browse/SOLR-475
 (and probably http://issues.apache.org/jira/browse/SOLR-711)

 -Original Message-
 From: Jason Rutherglen

 Yeah we need a performance comparison, I haven't had time to put
 one together. If/when I do I'll compare Bobo performance against
 Solr bitset intersection based facets, compare memory
 consumption.

 For near realtime Solr needs to cache and merge bitsets at the
 SegmentReader level, and Bobo needs to be upgraded to work with
 Lucene 2.9's searching at the segment level (currently it uses a
 MultiSearcher).

 Distributed search on either should be fairly straightforward?

 On Thu, Aug 13, 2009 at 9:55 AM, Fuad Efendif...@efendi.ca wrote:
 It seems BOBO-Browse is alternate faceting engine; would be interesting to
 compare performance with SOLR... Distributed?


 -Original Message-
 From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com]
 Sent: August-12-09 6:12 PM
 To: solr-user@lucene.apache.org
 Subject: Re: facet performance tips

 For your fields with many terms you may want to try Bobo
 http://code.google.com/p/bobo-browse/ which could work well with your
 case.










RE: facet performance tips

2009-08-12 Thread Manepalli, Kalyan
Jerome,
Yes, you need to increase the filterCache size to something close to the 
number of unique facet elements. But also consider the RAM required to 
accommodate the increase. 
I did see a significant performance gain by increasing the filterCache size
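
As a rough illustration of that advice (the numbers are placeholders; size it from 
your own unique-term counts and available RAM):

<filterCache
  class="solr.LRUCache"
  size="50000"
  initialSize="4096"
  autowarmCount="4096"/>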

Thanks,
Kalyan Manepalli

-Original Message-
From: Jérôme Etévé [mailto:jerome.et...@gmail.com] 
Sent: Wednesday, August 12, 2009 12:31 PM
To: solr-user@lucene.apache.org
Subject: facet performance tips

Hi everyone,

  I'm using some faceting on a solr index containing ~ 160K documents.
I perform facets on multivalued string fields. The number of possible
different values is quite large.

Enabling facets degrades the performance by a factor 3.

Because I'm using solr 1.3, I guess the facetting makes use of the
filter cache to work. My filterCache is set
to a size of 2048. I also noticed in my solr stats a very small ratio
of cache hit (~ 0.01%).

Can it be the reason why the faceting is slow? Does it make sense to
increase the filterCache size so it matches more or less the number
of different possible values for the faceted fields? Would that not
make the memory usage explode?

Thanks for your help !

-- 
Jerome Eteve.

Chat with me live at http://www.eteve.net

jer...@eteve.net


RE: facet performance tips

2009-08-12 Thread Fuad Efendi
I am currently faceting on a tokenized multi-valued field at
http://www.tokenizer.org (25 million simple docs)

It uses some home-made quick fixes similar to SOLR-475 (SOLR-711) and
non-synchronized cache (similar to LingPipe's FastCache, SOLR-665, SOLR-667)

Average faceting on query results: 0.2 - 0.3 seconds; without those
patches - 20-50 seconds.

I am going to upgrade to SOLR-1.4 from trunk (with SOLR-475  SOLR-667) and
to compare results...




P.S.
Avoid faceting on a field with a heavy distribution of terms (such as a few
million terms in my case); it won't work in SOLR 1.3.

TIP: use a non-tokenized single-valued field for faceting, such as a
non-tokenized country field.



P.P.S.
Would be nice to load/stress
http://alias-i.com/lingpipe/docs/api/com/aliasi/util/FastCache.html against
ConcurrentHashMap (which can put the CPU in a spin loop).



-Original Message-
From: Erik Hatcher [mailto:ehatc...@apache.org] 
Sent: August-12-09 2:12 PM
To: solr-user@lucene.apache.org
Subject: Re: facet performance tips

Yes, increasing the filterCache size will help with Solr 1.3  
performance.

Do note that trunk (soon Solr 1.4) has dramatically improved faceting  
performance.

Erik

On Aug 12, 2009, at 1:30 PM, Jérôme Etévé wrote:

 Hi everyone,

  I'm using some faceting on a solr index containing ~ 160K documents.
 I perform facets on multivalued string fields. The number of possible
 different values is quite large.

 Enabling facets degrades the performance by a factor 3.

 Because I'm using solr 1.3, I guess the facetting makes use of the
 filter cache to work. My filterCache is set
 to a size of 2048. I also noticed in my solr stats a very small ratio
 of cache hit (~ 0.01%).

 Can it be the reason why the faceting is slow? Does it make sense to
 increase the filterCache size so it matches more or less the number
 of different possible values for the faceted fields? Would that not
 make the memory usage explode?

 Thanks for your help !

 -- 
 Jerome Eteve.

 Chat with me live at http://www.eteve.net

 jer...@eteve.net





Re: facet performance tips

2009-08-12 Thread Erik Hatcher
Yes, increasing the filterCache size will help with Solr 1.3  
performance.


Do note that trunk (soon Solr 1.4) has dramatically improved faceting  
performance.


Erik

On Aug 12, 2009, at 1:30 PM, Jérôme Etévé wrote:


Hi everyone,

 I'm using some faceting on a solr index containing ~ 160K documents.
I perform facets on multivalued string fields. The number of possible
different values is quite large.

Enabling facets degrades the performance by a factor 3.

Because I'm using solr 1.3, I guess the facetting makes use of the
filter cache to work. My filterCache is set
to a size of 2048. I also noticed in my solr stats a very small ratio
of cache hit (~ 0.01%).

Can it be the reason why the faceting is slow? Does it make sense to
increase the filterCache size so it matches more or less the number
of different possible values for the faceted fields? Would that not
make the memory usage explode?

Thanks for your help !

--
Jerome Eteve.

Chat with me live at http://www.eteve.net

jer...@eteve.net




Re: facet performance tips

2009-08-12 Thread Jason Rutherglen
For your fields with many terms you may want to try Bobo
http://code.google.com/p/bobo-browse/ which could work well with your
case.

On Wed, Aug 12, 2009 at 12:02 PM, Fuad Efendif...@efendi.ca wrote:
 I am currently faceting on tokenized multi-valued field at
 http://www.tokenizer.org (25 mlns simple docs)

 It uses some home-made quick fixes similar to SOLR-475 (SOLR-711) and
 non-synchronized cache (similar to LingPipe's FastCache, SOLR-665, SOLR-667)

 Average faceting on query results: 0.2 - 0.3 seconds; without those
 patches - 20-50 seconds.

 I am going to upgrade to SOLR-1.4 from trunk (with SOLR-475  SOLR-667) and
 to compare results...




 P.S.
 Avoid faceting on a field with heavy distribution of terms (such as few
 millions of terms in my case); It won't work in SOLR 1.3.

 TIP: use non-tokenized single-valued field for faceting, such as
 non-tokenized country field.



 P.P.S.
 Would be nice to load/stress
 http://alias-i.com/lingpipe/docs/api/com/aliasi/util/FastCache.html against
 putting CPU in a spin loop ConcurrentHashMap.



 -Original Message-
 From: Erik Hatcher [mailto:ehatc...@apache.org]
 Sent: August-12-09 2:12 PM
 To: solr-user@lucene.apache.org
 Subject: Re: facet performance tips

 Yes, increasing the filterCache size will help with Solr 1.3
 performance.

 Do note that trunk (soon Solr 1.4) has dramatically improved faceting
 performance.

        Erik

 On Aug 12, 2009, at 1:30 PM, Jérôme Etévé wrote:

 Hi everyone,

  I'm using some faceting on a solr index containing ~ 160K documents.
 I perform facets on multivalued string fields. The number of possible
 different values is quite large.

 Enabling facets degrades the performance by a factor 3.

 Because I'm using solr 1.3, I guess the facetting makes use of the
 filter cache to work. My filterCache is set
 to a size of 2048. I also noticed in my solr stats a very small ratio
 of cache hit (~ 0.01%).

 Can it be the reason why the faceting is slow? Does it make sense to
 increase the filterCache size so it matches more or less the number
 of different possible values for the faceted fields? Would that not
 make the memory usage explode?

 Thanks for your help !

 --
 Jerome Eteve.

 Chat with me live at http://www.eteve.net

 jer...@eteve.net






Re: facet performance tips

2009-08-12 Thread Stephen Duncan Jr
Note that depending on the profile of your field (full text and how many
unique terms on average per document), the improvements from 1.4 may not
apply, as you may exceed the limits of the new faceting technique in Solr
1.4.
-Stephen

On Wed, Aug 12, 2009 at 2:12 PM, Erik Hatcher ehatc...@apache.org wrote:

 Yes, increasing the filterCache size will help with Solr 1.3 performance.

 Do note that trunk (soon Solr 1.4) has dramatically improved faceting
 performance.

Erik


 On Aug 12, 2009, at 1:30 PM, Jérôme Etévé wrote:

  Hi everyone,

  I'm using some faceting on a solr index containing ~ 160K documents.
 I perform facets on multivalued string fields. The number of possible
 different values is quite large.

 Enabling facets degrades the performance by a factor 3.

 Because I'm using solr 1.3, I guess the facetting makes use of the
 filter cache to work. My filterCache is set
 to a size of 2048. I also noticed in my solr stats a very small ratio
 of cache hit (~ 0.01%).

 Can it be the reason why the faceting is slow? Does it make sense to
 increase the filterCache size so it matches more or less the number
 of different possible values for the faceted fields? Would that not
 make the memory usage explode?

 Thanks for your help !

 --
 Jerome Eteve.

 Chat with me live at http://www.eteve.net

 jer...@eteve.net





-- 
Stephen Duncan Jr
www.stephenduncanjr.com


Re: Facet Performance

2008-07-31 Thread Funtick

Hoss,

This is still an extremely interesting area for possible improvements; I simply
don't want the topic to die: 
http://www.nabble.com/Facet-Performance-td7746964.html

http://issues.apache.org/jira/browse/SOLR-665
http://issues.apache.org/jira/browse/SOLR-667
http://issues.apache.org/jira/browse/SOLR-669

I am currently using faceting on a single-valued _tokenized_ field with a huge
number of documents; an _unsynchronized_ version of FIFOCache; 1.5 seconds
average response time (for faceted queries only!)

I think we can use an additional cache for facet results (to store calculated
values!); Lucene's FieldCache can be used only for non-tokenized
single-valued non-boolean fields.

-Fuad



hossman_lucene wrote:
 
 
 : Unfortunately which strategy will be chosen is currently undocumented
 : and control is a bit oblique:  If the field is tokenized or multivalued
 : or Boolean, the FilterQuery method will be used; otherwise the
 : FieldCache method.  I expect I or others will improve that shortly.
 
 Bear in mind, what's provide out of the box is SimpleFacets ... it's
 designed to meet simple faceting needs ... when you start talking about
 100s or thousands of constraints per facet, you are getting outside the
 scope of what it was intended to serve efficiently.
 
 At a certain point the only practical thing to do is write a custom
 request handler that makes the best choices for your data.
 
 For the record: a really simple patch someone could submit would be to
 add an optional field-based param indicating which type of faceting
 (termenum/fieldcache) should be used to generate the list of terms, and
 then make SimpleFacets.getFacetFieldCounts use that and call the
 appropriate method instead of calling getTermCounts -- that way you could
 force one or the other if you know it's better for your data/query.
 
 
 
 -Hoss
 
 
 

-- 
View this message in context: 
http://www.nabble.com/Facet-Performance-tp7746964p18756500.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Facet Performance

2006-12-08 Thread Andrew Nagy

Yonik Seeley wrote:


1) facet on single-valued strings if you can
2) if you can't do (1) then enlarge the fieldcache so that the number
of filters (one per possible term in the field you are filtering on)
can fit.


I changed the filterCache to the following:
   <filterCache
     class="solr.LRUCache"
     size="25600"
     initialSize="5120"
     autowarmCount="1024"/>

However a search that normally takes .04s is taking 74 seconds once I 
use the facets since I am faceting on 4 fields.


Can you suggest a better configuration that would solve this performance 
issue, or should I not use faceting?
I figure I could run the query twice, once limited to 20 records and 
then again with the limit set to the total number of records and develop 
my own facets.  I have in fact done this before with a different back-end 
and my code runs in under .01 seconds.


Why is faceting so slow?

Andrew


Re: Facet Performance

2006-12-08 Thread Andrew Nagy

Chris Hostetter wrote:


: Could you suggest a better configuration based on this?

If that's what your stats look like after a single request, then i would
guess you would need to make your cache size at least 1.6 million in order
for it to be of any use in improving your facet speed.
 

Would this have any strong impacts on my system?  Should I just set it 
to an even 2 million to allow for growth?



: My data is 492,000 records of book data.  I am faceting on 4 fields:
: author, subject, language, format.
: Format and language are fairly simple as there are only a few unique
: terms.  Author and subject however are much different in that there are
: thousands of unique terms.

by the looks of it, you have a lot more than a few thousand unique terms
in those two fields ... are you tokenizing on these fields?  that's
probably not what you want for fields you're going to facet on.
 

All of these fields are set as string in my schema, so if I understand 
the fields correctly, they are not being tokenized.  I also have an 
author field that is set as text for searching.


Thanks
Andrew


Re: Facet Performance

2006-12-08 Thread Yonik Seeley

On 12/8/06, Andrew Nagy [EMAIL PROTECTED] wrote:

Chris Hostetter wrote:

: Could you suggest a better configuration based on this?

If that's what your stats look like after a single request, then i would
guess you would need to make your cache size at least 1.6 million in order
for it to be of any use in improving your facet speed.


Would this have any strong impacts on my system?  Should I just set it
to an even 2 million to allow for growth?


Change the following in solrconfig.xml, and you should be fine with a
higher setting.
<useFilterForSortedQuery>true</useFilterForSortedQuery>
to
<useFilterForSortedQuery>false</useFilterForSortedQuery>

That will prevent the filtercache from being used for anything but
filters and faceting, so if you set it too high, it won't be utilized
anyway.


: My data is 492,000 records of book data.  I am faceting on 4 fields:
: author, subject, language, format.
: Format and language are fairly simple as there are only a few unique
: terms.  Author and subject however are much different in that there are
: thousands of unique terms.

by the looks of it, you have a lot more than a few thousand unique terms
in those two fields ... are you tokenizing on these fields?  that's
probably not what you want for fields you're going to facet on.


All of these fields are set as string in my schema


Are they multivalued, and do they need to be?
Anything that is of type string and not multivalued will use the
lucene FieldCache rather than the filterCache.

-Yonik


Re: Facet Performance

2006-12-08 Thread Andrew Nagy

Yonik Seeley wrote:


Are they multivalued, and do they need to be.
Anything that is of type string and not multivalued will use the
lucene FieldCache rather than the filterCache.


The author field is multivalued.  Will this be a strong performance issue?

I could make multiple author fields as to not have the multivalued field 
and then only facet on the first author.


Thanks
Andrew




Re: Facet Performance

2006-12-08 Thread J.J. Larrea
Andrew Nagy, ditto on what Yonik said.  Here is some further elaboration:

I am doing much the same thing (faceting on Author etc.). When my Author field 
was defined as a solr.TextField, even using solr.KeywordTokenizerFactory so it 
wasn't actually tokenized, the faceting code chose the QueryFilter approach, 
and faceting on Author for 100k+ documents took about 4 seconds.

When I changed the field to string e.g. solr.StrField, the faceting code 
recognized it as untokenized and used the FieldCache approach.  Times have 
dropped to about 120ms for the first query (when the FieldCache is generated) 
and  10ms for subsequent queries returning a few thousand results.  Quite a 
difference.

The strategy must be chosen on a field-by-field basis.  While QueryFilter is 
excellent for fields with a small set of enumerated values such as Language or 
Format, it is inappropriate for large value sets such as Author.

Unfortunately which strategy will be chosen is currently undocumented and 
control is a bit oblique:  If the field is tokenized or multivalued or Boolean, 
the FilterQuery method will be used; otherwise the FieldCache method.  I expect 
I or others will improve that shortly.

- J.J.

At 2:58 PM -0500 12/8/06, Yonik Seeley wrote:
Right, if any of these are tokenized, then you could make them
non-tokenized (use string type).  If they really need to be
tokenized (author for example), then you could use copyField to make
another copy to a non-tokenized field that you can use for faceting.
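
For illustration, a copyField setup along those lines might look like
this in schema.xml (the author_facet name here is just a placeholder):

   <field name="author" type="text" indexed="true" stored="true"/>
   <field name="author_facet" type="string" indexed="true" stored="false"/>
   <copyField source="author" dest="author_facet"/>

You would then keep searching on author but pass
facet.field=author_facet.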

After that, as Hoss suggests, run a single faceting query with all 4
fields and look at the filterCache statistics.  Take the lookups
number and multiply it by, say, 1.5 to leave some room for future
growth, and use that as your cache size.  You probably want to bump up
both initialSize and autowarmCount as well.

The first query will still be slow.  The second should be relatively fast.
You may hit an OOM error.  Increase the JVM heap size if this happens.
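
Purely as an illustration of that arithmetic: if the statistics showed
roughly 1.6 million lookups after the warm-up query, the 1.5x rule would
give something in the neighborhood of

   <filterCache
     class="solr.LRUCache"
     size="2400000"
     initialSize="2400000"
     autowarmCount="100000"/>

(numbers hypothetical; a cache this large needs a correspondingly large
heap, and autowarmCount is a trade-off between warm-up time and
first-query latency).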

-Yonik



Re: Facet Performance

2006-12-08 Thread Yonik Seeley

On 12/8/06, J.J. Larrea [EMAIL PROTECTED] wrote:

Unfortunately which strategy will be chosen is currently undocumented and 
control is a bit oblique:  If the field is tokenized or multivalued or Boolean, 
the FilterQuery method will be used; otherwise the FieldCache method.


If anyone had time some of this could be documented here:
http://wiki.apache.org/solr/SimpleFacetParameters
The wiki is open to all.

Or perhaps a new top level FacetedSearching page that references
SimpleFacetParameters

-Yonik


Re: Facet Performance

2006-12-08 Thread Andrew Nagy

J.J. Larrea wrote:


Unfortunately which strategy will be chosen is currently undocumented and 
control is a bit oblique:  If the field is tokenized or multivalued or Boolean, 
the FilterQuery method will be used; otherwise the FieldCache method.  I expect 
I or others will improve that shortly.
 

Good to hear, cause I can't really get away with not having a 
multi-valued field for author.


I'm really excited by Solr and really impressed so far.

Thanks!
Andrew


Re: Facet Performance

2006-12-08 Thread Chris Hostetter

: Unfortunately which strategy will be chosen is currently undocumented
: and control is a bit oblique:  If the field is tokenized or multivalued
: or Boolean, the FilterQuery method will be used; otherwise the
: FieldCache method.  I expect I or others will improve that shortly.

Bear in mind, what's provided out of the box is SimpleFacets ... it's
designed to meet simple faceting needs ... when you start talking about
100s or thousands of constraints per facet, you are getting outside the
scope of what it was intended to serve efficiently.

At a certain point the only practical thing to do is write a custom
request handler that makes the best choices for your data.

For the record: a really simple patch someone could submit would be to
add an optional field-based param indicating which type of faceting
(termenum/fieldcache) should be used to generate the list of terms, and
then make SimpleFacets.getFacetFieldCounts use that and call the
appropriate method instead of calling getTermCounts -- that way you could
force one or the other if you know it's better for your data/query.



-Hoss



Re: Facet Performance

2006-12-08 Thread Andrew Nagy

Erik Hatcher wrote:


On Dec 8, 2006, at 2:15 PM, Andrew Nagy wrote:

My data is 492,000 records of book data.  I am faceting on 4 fields:
author, subject, language, format.
Format and language are fairly simple as there are only a few unique
terms.  Author and subject however are much different in that there
are thousands of unique terms.



When encountering difficult issues, I like to think in terms of the  
user interface.  Surely you're not presenting 400k+ authors to the  
users in one shot.  In Collex, we have put an AJAX drop-down that  
shows the author facet (we call it name on the UI, with various roles  
like author, painter, etc).  You can see this in action here:


In our data, we don't have unique authors for each record ... so let's 
say out of the 500,000 records ... we have 200,000 authors.  What I am 
trying to display is the top 10 authors from the results of a search.  
So I do a search for title:Gone with the wind and I would like to see 
the top 10 matching authors from these results.


But no worries, I have written my own facet handler and I am now back to 
under a second with faceting!


Thanks for everyone's help and keep up the good work!

Andrew


Re: Facet performance with heterogeneous 'facets'?

2006-09-22 Thread Michael Imbeault
Excellent news; as you guessed, my schema was (for some reason) set to 
version 1.0. This also caused some of the problems I had with the 
original SolrPHP (parsing the wrong response).


But better yet, the 800-second query is now running in 0.5-2 seconds! 
Amazing optimization! I can now do faceting on journal title (17 000 
different titles) and last author (400 000 authors), + 12 date range 
queries, in a very reasonable time (considering I'm on a test Windows 
desktop box and not a server).


The only problem is that if I add first author, I get a 
java.lang.OutOfMemoryError: Java heap space. I'm sure this problem will 
go away on a server with more than the current 500 megs I can allocate 
to Tomcat.


Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212

Yonik Seeley wrote:

On 9/22/06, Michael Imbeault [EMAIL PROTECTED] wrote:

I upgraded to the most recent Solr build (9-22) and sadly it's still
really slow: an 800-second query with a single facet on first_author, 15
million documents total, and the query returns 180. Maybe I'm doing
something wrong? Also, this is on my personal desktop, not on a server.
Still, I'm getting 0.1-second queries without facets, so I don't think
that's the cause. In the admin panel I can still see the filterCache
doing millions of lookups (and tons of evictions once it hits the
maxsize).

The fact that you see all the filtercache usage means that the
optimization didn't kick in for some reason.


Here's the field I'm using in schema.xml:
<field name="first_author" type="string" indexed="true" stored="true"/>


That looks fine...


This is the query :
q=hiv red blood&start=0&rows=20&fl=article_title+authors+journal_iso+pubdate+pmid+score&qt=standard&facet=true&facet.field=first_author&facet.limit=5&facet.missing=false&facet.zeros=false



That looks OK too.
I assume that you didn't change the fieldtype definition for string,
and that the schema has version=1.1?  Before 1.1, all fields were
assumed to be multiValued (there was no checking or enforcement).
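(For reference, that version attribute sits on the root element of
schema.xml, e.g. <schema name="example" version="1.1">.)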

-Yonik



Re: Facet performance with heterogeneous 'facets'?

2006-09-21 Thread Yonik Seeley

On 9/21/06, Michael Imbeault [EMAIL PROTECTED] wrote:

It turns out that journal_name has 17038 different tokens, which is
manageable, but first_author has  400 000. I don't think this will ever
yield good performance, so i might only do journal_name facets.


Hang in there Michael, a fix is on the way for your scenario (and
subscribe to solr-dev if you want to stay on the bleeding edge):

http://www.nabble.com/big-faceting-speedup-for-single-valued-fields-tf2308153.html

-Yonik


Re: Facet performance with heterogeneous 'facets'?

2006-09-21 Thread Yonik Seeley

On 9/21/06, Michael Imbeault [EMAIL PROTECTED] wrote:

Btw, Any plans for a facets cache?


Maybe a partial one (like caching top terms to implement some other
optimizations).  My general philosophy on caching in Solr has been to
cache things the client can't: elemental things, or *parts* of
requests to make many different requests faster (most
bang-for-the-buck).

Caching complete requests/responses is generally less useful since it
requires even more memory, has a worse hit ratio, and can be done
anyway by the client or a separate process like squid.

-Yonik


Re: Facet performance with heterogeneous 'facets'?

2006-09-21 Thread Yonik Seeley

On 9/21/06, Yonik Seeley [EMAIL PROTECTED] wrote:

Hang in there Michael, a fix is on the way for your scenario (and
subscribe to solr-dev if you want to stay on the bleeding edge):


OK, the optimization has been checked in.  You can checkout from svn
and build Solr, or wait for the 9-22 nightly build (after 8:30 EDT).
I'd be interested in hearing your results with it.

The first facet request on a field will take longer than subsequent
ones because the FieldCache entry is loaded on demand.  You can use a
firstSearcher/newSearcher hook in solrconfig.xml to send a facet
request so that a real user would never see this slower query.
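
A sketch of such a hook, using the QuerySenderListener from the stock
solrconfig.xml (the query and facet field below are placeholders):

   <listener event="firstSearcher" class="solr.QuerySenderListener">
     <arr name="queries">
       <lst>
         <str name="q">solr</str>
         <str name="rows">0</str>
         <str name="facet">true</str>
         <str name="facet.field">first_author</str>
       </lst>
     </arr>
   </listener>

The same block can be registered for the newSearcher event so the cache
is repopulated after commits as well.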

-Yonik


Re: Facet performance with heterogeneous 'facets'?

2006-09-21 Thread Michael Imbeault
I upgraded to the most recent Solr build (9-22) and sadly it's still
really slow: an 800-second query with a single facet on first_author, 15
million documents total, and the query returns 180. Maybe I'm doing
something wrong? Also, this is on my personal desktop, not on a server.
Still, I'm getting 0.1-second queries without facets, so I don't think
that's the cause. In the admin panel I can still see the filterCache
doing millions of lookups (and tons of evictions once it hits the maxsize).


Here's the field I'm using in schema.xml:
<field name="first_author" type="string" indexed="true" stored="true"/>

This is the query :
q=hiv red blood&start=0&rows=20&fl=article_title+authors+journal_iso+pubdate+pmid+score&qt=standard&facet=true&facet.field=first_author&facet.limit=5&facet.missing=false&facet.zeros=false


I'll do more testing on the weekend,

Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212



Yonik Seeley wrote:

On 9/21/06, Yonik Seeley [EMAIL PROTECTED] wrote:

Hang in there Michael, a fix is on the way for your scenario (and
subscribe to solr-dev if you want to stay on the bleeding edge):


OK, the optimization has been checked in.  You can checkout from svn
and build Solr, or wait for the 9-22 nightly build (after 8:30 EDT).
I'd be interested in hearing your results with it.

The first facet request on a field will take longer than subsequent
ones because the FieldCache entry is loaded on demand.  You can use a
firstSearcher/newSearcher hook in solrconfig.xml to send a facet
request so that a real user would never see this slower query.

-Yonik



Re: Facet performance with heterogeneous 'facets'?

2006-09-19 Thread Yonik Seeley

On 9/18/06, Michael Imbeault [EMAIL PROTECTED] wrote:

Yonik Seeley wrote:
 For cases like author, if there is only one value per document, then
 a possible fix is to use the field cache.  If there can be multiple
 occurrences, there doesn't seem to be a good way that preserves exact
 counts, except maybe if the number of documents matching a query is
 low.

I have one value per document (I have fields for authors, last_author
and first_author, and I'm doing faceted search on first and last authors
fields). How would I use the field cache to fix my problem?


Unless you want to dive into Solr development, you don't :-)
It requires extensive changes to the faceting code and doing things a
different way in some cases.

The FieldCache is the fastest way to uninvert single valued
fields... it's currently only used for Sorting, where one needs to
quickly know the field value given the document id.
The downside is high memory use, and that it's not a general
solution... it can't handle fields with multiple tokens (tokenized
fields or multi-valued fields).

So the strategy would be to step through the documents, get the value
for the field from the FieldCache, increment a counter for that value,
then find the top counters when we are done.
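
In rough Java form (names illustrative, not Solr's actual classes; assumes
one value per doc, which is what the FieldCache gives you):

   import java.util.*;

   class FieldCacheFacetSketch {
       // authorOfDoc[docId] plays the role of the uninverted FieldCache entry;
       // matchingDocs holds the doc ids that matched the query.
       static List<Map.Entry<String, Integer>> topValues(
               String[] authorOfDoc, int[] matchingDocs, int topN) {
           Map<String, Integer> counts = new HashMap<>();
           for (int docId : matchingDocs) {
               String value = authorOfDoc[docId];     // FieldCache-style lookup
               if (value != null) counts.merge(value, 1, Integer::sum);
           }
           List<Map.Entry<String, Integer>> sorted = new ArrayList<>(counts.entrySet());
           sorted.sort((a, b) -> b.getValue() - a.getValue());  // biggest counts first
           return sorted.subList(0, Math.min(topN, sorted.size()));
       }
   }

The work is proportional to the number of matching documents rather than
to the number of unique authors in the index.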


Also, would
it be better to store a unique number (for each possible author) in an
int field along with the string, and do the faceted searching on the int
field?


It won't really help.  It wouldn't be faster, and it would require
only slightly less memory.


 Just a little follow-up - I did a little more testing, and the query
 takes 20 seconds no matter what - If there's one document in the results
 set, or if I do a query that returns all 13 documents.

 Yes, currently the same strategy is always used.
   intersection_count(docs_matching_query, docs_matching_author1)
   intersection_count(docs_matching_query, docs_matching_author2)
   intersection_count(docs_matching_query, docs_matching_author3)
   etc...

 Normally, the docsets will be cached, but since the number of authors
 is greater than the size of the filtercache, the effective cache hit
 rate will be 0%

 -Yonik
So more memory would fix the problem?


Yes, if your collection size isn't that large...  it's not a practical
solution for many cases though.


Also, I was under the impression
that it was only searching / sorting for authors that it knows are in
the result set...


That's the problem... it's not necessarily easy to know *what* authors
are in the result set.  If we could quickly determine that, we could
just count them and not do any intersections or anything at all.


 in the case of only one document (1 result), it seems
strange that it takes the same time as for 130 000 results. It should
just check the results, see that there's only one author, and return
that? And in the case of 2 documents, just sort 2 authors (or 1 if
they're the same)? I understand your answer (it does intersections), but
I wonder why it's intersecting from the whole document set at first, and 
not docs_matching_query like you said.


It is just intersecting docs_matching_query.  The problem is that it's
intersecting that set with all possible author sets since it doesn't
know ahead of time what authors are in the docs that match the query.

There could be optimizations when docs_matching_query.size() is small,
so we start somehow with terms in the documents rather than terms in
the index.  That requires termvectors to be stored (medium speed), or
requires that the field be stored and that we re-analyze it (very
slow).

More optimization of special cases hasn't been done simply because no
one has done it yet... (as you note, faceting is a new feature).


-Yonik


Re: Facet performance with heterogeneous 'facets'?

2006-09-19 Thread Joachim Martin

Michael Imbeault wrote:

Also, are there any plans to add an option not to run a facet search if 
the result set is too big? To avoid 40-second queries if the docset 
is too large...



You could run one query with facet=false, check the result size, and then 
run it again (should be fast because it is cached) with 
facet=true&rows=0 to get facet results only.
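
For example, reusing the query from earlier in the thread, the two
requests would look something like (parameters illustrative):

   q=hiv red blood&start=0&rows=20&facet=false
   q=hiv red blood&start=0&rows=0&facet=true&facet.field=first_author&facet.limit=5

The second request should be answered from the caches the first one
populated, so the only extra cost is the faceting itself.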


I would think that the decision to run/not run facets would be highly 
custom to your collection and not easily developed as a configurable 
feature.


--Joachim


Re: Facet performance with heterogeneous 'facets'?

2006-09-19 Thread Yonik Seeley

I just updated the comments in solrconfig.xml:

  <!-- Cache used by SolrIndexSearcher for filters (DocSets),
       unordered sets of *all* documents that match a query.
       When a new searcher is opened, its caches may be prepopulated
       or autowarmed using data from caches in the old searcher.
       autowarmCount is the number of items to prepopulate.  For LRUCache,
       the autowarmed items will be the most recently accessed items.
     Parameters:
       class - the SolrCache implementation (currently only LRUCache)
       size - the maximum number of entries in the cache
       initialSize - the initial capacity (number of entries) of
         the cache.  (see java.util.HashMap)
       autowarmCount - the number of entries to prepopulate from
         an old cache.
    -->
    <filterCache
      class="solr.LRUCache"
      size="512"
      initialSize="512"
      autowarmCount="256"/>

On 9/18/06, Michael Imbeault [EMAIL PROTECTED] wrote:

Another followup: I bumped all the caches in solrconfig.xml to

  size="1600384"
  initialSize="400096"
  autowarmCount="400096"

It seemed to fix the problem on a very small index (facets on last and
first author fields, + 12 date range facets, sub-0.3-second queries).
I'll check on the full index tomorrow (it's indexing right now, 400
docs/sec!). However, I still don't have an idea what these values
represent, or how I should estimate what to set them to. Originally I
thought it was the size of the cache in KB, and someone on the list told
me it was the number of items, but I don't quite get it. Better
documentation on that would be welcomed :)

Also, are there any plans to add an option not to run a facet search if
the result set is too big? To avoid 40-second queries if the docset is
too large...


I'd like to speed up certain corner cases, but you can always set
timeouts in whatever frontend is making the request to Solr too.

-Yonik


Re: Facet performance with heterogeneous 'facets'?

2006-09-19 Thread Chris Hostetter

Quick question: did you say you are faceting on the first name field
separately from the last name field? ... why?

You'll probably see a sharp increase in performance if you have a single
untokenized author field containing the full name and you facet on that --
there will be far fewer unique terms to use when computing DocSets and
intersections.

Second: you mentioned increasing the size of your filterCache
significantly, but we don't really know how heterogeneous your index is ...
once you made that change did your filterCache hit rate increase? .. do you
have any evictions? (you can check on the Statistics page)

:  Also, I was under the impression
:  that it was only searching / sorting for authors that it knows are in
:  the result set...
:
: That's the problem... it's not necessarily easy to know *what* authors
: are in the result set.  If we could quickly determine that, we could
: just count them and not do any intersections or anything at all.

another way to look at it is that by looking at all the authors, the work
done for generating the facet counts for query A can be completely reused
for the next query B -- presuming your filterCache is large enough to hold
all of the author filters.

: There could be optimizations when docs_matching_query.size() is small,
: so we start somehow with terms in the documents rather than terms in
: the index.  That requires termvectors to be stored (medium speed), or
: requires that the field be stored and that we re-analyze it (very
: slow).
:
: More optimization of special cases hasn't been done simply because no
: one has done it yet... (as you note, faceting is a new feature).

the optimization I anticipated from the beginning would probably be
useful in the situation Michael is describing ... if there is a long tail
of authors (and in my experience, there typically is) we can cache an
ordered list of the top N most prolific authors, along with the count of
how many documents they have in the index (this info is easy to get from
TermEnum.docFreq).  When we facet on the authors, we start with that list
and go in order, generating their facet constraint count using the DocSet
intersection just like we currently do ... if we reach our facet.limit
before we reach the end of the list and the lowest constraint count is
higher than the total doc count of the last author in the list, then we
know we don't need to bother testing any other author, because no other
author can possibly have a higher facet constraint count than the ones on
our list (since they haven't even written that many documents)
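
In sketch form, with java.util.BitSet standing in for Solr's DocSets and
all names purely illustrative (this is not Solr code, just the shape of
the early-termination idea):

   import java.util.*;

   class ProlificAuthorFacetSketch {
       // authorsByDocFreq: (author, docs containing that author), pre-sorted
       // by descending doc count for the whole index -- the cached list of
       // most prolific authors.
       static List<Map.Entry<String, Integer>> topN(
               BitSet docsMatchingQuery,
               List<Map.Entry<String, BitSet>> authorsByDocFreq,
               int facetLimit) {
           PriorityQueue<Map.Entry<String, Integer>> best =
                   new PriorityQueue<>((a, b) -> a.getValue() - b.getValue());
           for (Map.Entry<String, BitSet> author : authorsByDocFreq) {
               int docFreq = author.getValue().cardinality();
               // Early exit: no remaining author can beat the smallest kept count.
               if (best.size() == facetLimit && docFreq <= best.peek().getValue()) break;
               BitSet common = (BitSet) docsMatchingQuery.clone();
               common.and(author.getValue());          // the usual DocSet intersection
               best.offer(new AbstractMap.SimpleEntry<>(author.getKey(), common.cardinality()));
               if (best.size() > facetLimit) best.poll();
           }
           List<Map.Entry<String, Integer>> out = new ArrayList<>(best);
           out.sort((a, b) -> b.getValue() - a.getValue());
           return out;
       }
   }

For queries matching many documents the cutoff kicks in early; for small
result sets it helps much less, which is where a FieldCache-style
per-document count is the better fit.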



-Hoss



Re: Facet performance with heterogeneous 'facets'?

2006-09-19 Thread Yonik Seeley

On 9/19/06, Chris Hostetter [EMAIL PROTECTED] wrote:


Quick question: did you say you are faceting on the first name field
separately from the last name field? ... why?

You'll probably see a sharp increase in performance if you have a single
untokenized author field containing the full name and you facet on that --
there will be far fewer unique terms to use when computing DocSets and
intersections.

Second: you mentioned increasing the size of your filterCache
significantly, but we don't really know how heterogeneous your index is ...
once you made that change did your filterCache hit rate increase? .. do you
have any evictions? (you can check on the Statistics page)

:  Also, I was under the impression
:  that it was only searching / sorting for authors that it knows are in
:  the result set...
:
: That's the problem... it's not necessarily easy to know *what* authors
: are in the result set.  If we could quickly determine that, we could
: just count them and not do any intersections or anything at all.

another way to look at it is that by looking at all the authors, the work
done for generating the facet counts for query A can be completely reused
for the next query B -- presuming your filterCache is large enough to hold
all of the author filters.

: There could be optimizations when docs_matching_query.size() is small,
: so we start somehow with terms in the documents rather than terms in
: the index.  That requires termvectors to be stored (medium speed), or
: requires that the field be stored and that we re-analyze it (very
: slow).
:
: More optimization of special cases hasn't been done simply because no
: one has done it yet... (as you note, faceting is a new feature).

the optimization I anticipated from the beginning would probably be
useful in the situation Michael is describing ... if there is a long tail
of authors (and in my experience, there typically is)



we
can cache an ordered list of the top N most prolific authors, along with
the count of how many documents they have in the index (this info is easy
to get from TermEnum.docFreq).


Yeah, I've thought about a fieldInfoCache too.  It could also cache
the total number of terms in order to make decisions about what
faceting strategy to follow.


when we facet on the authors, we start with
that list and go in order, generating their facet constraint count using
the DocSet intersection just like we currently do ... if we reach our
facet.limit before we reach the end of the list and the lowest constraint
count is higher than the total doc count of the last author in the list,
then we know we don't need to bother testing any other author, because no
other author can possibly have a higher facet constraint count than the
ones on our list


This works OK if the intersection counts are high (as a percentage of
the facet sets).  I'm not sure how often this will be the case though.

Another tradeoff is to allow getting inexact counts with multi-token fields by:
- simply faceting on the most popular values
  OR
- do some sort of statistical sampling by reading term vectors for a
fraction of the matching docs.

-Yonik


Re: Facet performance with heterogeneous 'facets'?

2006-09-19 Thread Chris Hostetter

:  when we facet on the authors, we start with
:  that list and go in order, generating their facet constraint count using
:  the DocSet intersection just like we currently do ... if we reach our
:  facet.limit before we reach the end of the list and the lowest constraint
:  count is higher than the total doc count of the last author in the list,
:  then we know we don't need to bother testing any other author, because no
:  other author can possibly have a higher facet constraint count than the
:  ones on our list
:
: This works OK if the intersection counts are high (as a percentage of
: the facet sets).  I'm not sure how often this will be the case though.

well, keep in mind N could be very big, big enough to store the full
list of Terms sorted in docFreq order (it shouldn't take up much space
since it's just the Term and an int) ... for any query that returns a
large number of results, you probably won't need to reach the end of the
list before you can tell that all the remaining Terms have a lower docFreq
than the current last constraint count in your facet.limit list.  For
queries that return a small number of results, it wouldn't be as
useful, but that's where a switch could be flipped to start with the values
mapped to the docs (using the FieldCache -- assuming single-valued fields)

: Another tradeoff is to allow getting inexact counts with multi-token fields 
by:
:  - simply faceting on the most popular values
:OR
:  - do some sort of statistical sampling by reading term vectors for a
: fraction of the matching docs.

I loathe inexact counts ... I think of them as Astrology to the Astronomy
of true Faceted Searching ... but I'm sure they would be good enough for
some people's use cases.



-Hoss



Re: Facet performance with heterogeneous 'facets'?

2006-09-18 Thread Michael Imbeault
Just a little follow-up - I did a little more testing, and the query 
takes 20 seconds no matter what - If there's one document in the results 
set, or if I do a query that returns all 13 documents.


It seems something isn't right... it looks like Solr is doing a faceted 
search on the whole index, no matter what the result set is, when doing 
facets on a string field. I must be doing something wrong?


Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212



Michael Imbeault wrote:
Been playing around with the new 'facet search' and it works very 
well, but it's really slow for some particular applications. I've been 
trying to use it to display the most frequent authors of articles; 
this is from a huge (15 million articles) database and names of 
authors are rare and heterogeneous. On a query that takes (without 
facets) 0.1 seconds, it jumps to ~20 seconds with just 1% of the 
documents indexed (I've been getting java.lang.OutOfMemoryError with 
the full index). ~40 seconds for a faceted search on 2 (string) 
fields. Range queries on a slong field are more acceptable (even with a 
dozen of them, query time is still in the subsecond range).


Am I trying to do something which isn't what faceted search was made 
for? It would be understandable; after all, I guess the facet engine 
has to check every doc in the index and sort... which shouldn't yield 
good performance no matter what, sadly.


Is there any other way I could achieve what I'm trying to do? Just a 
list of the most frequent (top 5) authors present in the results of a 
query.


Thanks,



Re: Facet performance with heterogeneous 'facets'?

2006-09-18 Thread Yonik Seeley

On 9/18/06, Michael Imbeault [EMAIL PROTECTED] wrote:

Just a little follow-up - I did a little more testing, and the query
takes 20 seconds no matter what - If there's one document in the results
set, or if I do a query that returns all 13 documents.


Yes, currently the same strategy is always used.
  intersection_count(docs_matching_query, docs_matching_author1)
  intersection_count(docs_matching_query, docs_matching_author2)
  intersection_count(docs_matching_query, docs_matching_author3)
  etc...

Normally, the docsets will be cached, but since the number of authors
is greater than the size of the filtercache, the effective cache hit
rate will be 0%
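
In simplified form, with java.util.BitSet standing in for Solr's DocSets
(illustrative only, not Solr code):

   import java.util.*;

   class IntersectionFacetSketch {
       // One doc set per author term, intersected with the query's doc set.
       // The cost is one intersection per unique author, regardless of how
       // many documents the query matched -- which is why a 1-hit query is
       // no faster than one matching the whole index.
       static Map<String, Integer> counts(BitSet docsMatchingQuery,
                                          Map<String, BitSet> docsByAuthor) {
           Map<String, Integer> result = new HashMap<>();
           for (Map.Entry<String, BitSet> e : docsByAuthor.entrySet()) {
               BitSet common = (BitSet) docsMatchingQuery.clone();
               common.and(e.getValue());               // intersection_count(...)
               result.put(e.getKey(), common.cardinality());
           }
           return result;
       }
   }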

-Yonik


Re: Facet performance with heterogeneous 'facets'?

2006-09-18 Thread Michael Imbeault

Yonik Seeley wrote:

I noticed this too, and have been thinking about ways to fix it.
The root of the problem is that Lucene, like all full-text search
engines, uses inverted indices.  It's fast and easy to get all
documents for a particular term, but getting all terms for a document
is either not possible, or not fast (assuming many documents
match a query).
Yeah that's what I've been thinking; the index isn't built to handle 
such searches, sadly :( It would be very nice to be able to rapidly 
search by most frequent author, journal, etc.

For cases like author, if there is only one value per document, then
a possible fix is to use the field cache.  If there can be multiple
occurrences, there doesn't seem to be a good way that preserves exact
counts, except maybe if the number of documents matching a query is
low.

I have one value per document (I have fields for authors, last_author 
and first_author, and I'm doing faceted search on first and last authors 
fields). How would I use the field cache to fix my problem? Also, would 
it be better to store a unique number (for each possible author) in an 
int field along with the string, and do the faceted searching on the int 
field? Would this be faster / require less memory? I guess that yes, and 
I'll test that when I have the time.



Just a little follow-up - I did a little more testing, and the query
takes 20 seconds no matter what - If there's one document in the results
set, or if I do a query that returns all 13 documents.


Yes, currently the same strategy is always used.
  intersection_count(docs_matching_query, docs_matching_author1)
  intersection_count(docs_matching_query, docs_matching_author2)
  intersection_count(docs_matching_query, docs_matching_author3)
  etc...

Normally, the docsets will be cached, but since the number of authors
is greater than the size of the filtercache, the effective cache hit
rate will be 0%

-Yonik
So more memory would fix the problem? Also, I was under the impression 
that it was only searching / sorting for authors that it knows are in 
the result set... in the case of only one document (1 result), it seems 
strange that it takes the same time as for 130 000 results. It should 
just check the results, see that there's only one author, and return 
that? And in the case of 2 documents, just sort 2 authors (or 1 if 
they're the same)? I understand your answer (it does intersections), but 
I wonder why it's intersecting from the whole document set at first, and 
not docs_matching_query like you said.


Thanks for the support,

Michael


Re: Facet performance with heterogeneous 'facets'?

2006-09-18 Thread Michael Imbeault

Another followup: I bumped all the caches in solrconfig.xml to

 size="1600384"
 initialSize="400096"
 autowarmCount="400096"

It seemed to fix the problem on a very small index (facets on last and 
first author fields, + 12 date range facets, sub-0.3-second queries). 
I'll check on the full index tomorrow (it's indexing right now, 400 
docs/sec!). However, I still don't have an idea what these values 
represent, or how I should estimate what to set them to. Originally I 
thought it was the size of the cache in KB, and someone on the list told 
me it was the number of items, but I don't quite get it. Better 
documentation on that would be welcomed :)


Also, are there any plans to add an option not to run a facet search if 
the result set is too big? To avoid 40-second queries if the docset is 
too large...


Thanks,

Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212



Yonik Seeley wrote:

On 9/18/06, Michael Imbeault [EMAIL PROTECTED] wrote:

Just a little follow-up - I did a little more testing, and the query
takes 20 seconds no matter what - If there's one document in the results
set, or if I do a query that returns all 13 documents.


Yes, currently the same strategy is always used.
  intersection_count(docs_matching_query, docs_matching_author1)
  intersection_count(docs_matching_query, docs_matching_author2)
  intersection_count(docs_matching_query, docs_matching_author3)
  etc...

Normally, the docsets will be cached, but since the number of authors
is greater than the size of the filtercache, the effective cache hit
rate will be 0%

-Yonik