Re: Facet Performance

2020-06-17 Thread Erick Erickson
queryResultCache doesn’t really help with faceting, even if it’s hit for the
main query.
That cache only stores a subset of the hits, and to facet properly you need
the entire result set.

> On Jun 17, 2020, at 12:47 PM, James Bodkin  
> wrote:
> 
> We've noticed that the filterCache uses a significant amount of memory, as 
> we've assigned 8GB Heap per instance.
> In total, we have 32 shards with 2 replicas, hence (8*32*2) 512G Heap space 
> alone, further memory is required to ensure the index is always memory mapped 
> for performance reasons.
> 
> Ideally I would like to be able to reduce the amount of memory assigned to 
> the heap by using docValues instead of indexed but it doesn't seem possible.
> The QTime (after warming) for facet.method=enum is around 150-250ms whereas 
> the QTime for facet.method=fc is around 1000-1200ms.
> As we require the results in real-time for customers searching on our 
> website, the later QTime of 1000-1200ms is too slow for us to be able to use.
> 
> Our facet queries change as the customer selects different search criteria, 
> and hence the possible number of potential queries makes it very difficult 
> for the query result cache.
> We already have a custom implementation in which we check our redis cache for 
> queries before they are sent to our aggregators which runs at 30% hit rate.
> 
> Kind Regards,
> 
> James Bodkin
> 
> On 17/06/2020, 16:21, "Michael Gibney"  wrote:
> 
>To expand a bit on what Erick said regarding performance: my sense is
>that the RefGuide assertion that "docValues=true" makes faceting
>"faster" could use some qualification/clarification. My take, fwiw:
> 
>First, to reiterate/paraphrase what Erick said: the "faster" assertion
>is not comparing to "facet.method=enum". For low-cardinality fields,
>if you have the heap space, and are very intentional about configuring
>your filterCache (and monitoring it as access patterns might change),
>"facet.method=enum" will likely be as fast as you can get (at least
>for "legacy" facets or whatever -- not sure about "enum" method in
>JSON facets).
> 
>Even where "docValues=true" arguably does make faceting "faster", the
>main benefit is that the "uninverted" data structures are serialized
>on disk, so you're avoiding the need to uninvert each facet field
>on-heap for every new indexSearcher, which is generally high-latency
>-- user perception of this latency can be mitigated using warming
>queries, but it can still be problematic, esp. for frequent index
>updates. On-heap uninversion also inherently consumes a lot of heap
>space, which has general implications wrt GC, etc ... so in that
>respect even if faceting per se might not be "faster" with
>"docValues=true", your overall system may in many cases perform
>better.
> 
>(and Anthony, I'm pretty sure that tag/ex on facets should be
>orthogonal to the "facet.method=enum"/filterCache discussion, as
>tag/ex only affects the DocSet domain over which facets are calculated
>... I think that step is pretty cleanly separated from the actual
>calculation of the facets. I'm not 100% sure on that, so proceed with
>caution, but it could definitely be worth evaluating for your use
>case!)
> 
>Michael
> 
>On Wed, Jun 17, 2020 at 10:42 AM Erick Erickson  
> wrote:
>> 
>> Uninvertible is a safety mechanism to make sure that you don’t _unknowingly_ 
>> use a docValues=false
>> field for faceting/grouping/sorting/function queries. The primary point of 
>> docValues=true is twofold:
>> 
>> 1> reduce Java heap requirements by using the OS memory to hold it
>> 
>> 2> uninverting can be expensive CPU wise too, although not with just a few
>>unique values (for each term, read the list of docs that have it and flip 
>> a bit).
>> 
>> It doesn’t really make sense to set it on an index=false field, since 
>> uninverting only happens on
>> index=true docValues=false. OTOH, I don’t think it would do any harm either. 
>> That said, I frankly
>> don’t know how that interacts with facet.method=enum.
>> 
>> As far as speed… yeah, you’re in the edge cases. All things being equal, 
>> stuffing these into the
>> filterCache is the fastest way to facet if you have the memory. I’ve seen 
>> very few installations
>> where people have that luxury though. Each entry in the filterCache can 
>> occupy maxDoc/8 + some overhead
>> bytes. If maxDoc is very large, this’ll chew up an enormous amount of 
>> memory. I’m cheating
>> a bit here since the size might be smaller if only a few docs have any 
>> particular entry then the
>> size is smaller. But that’s the worst-case you have to allow for ‘cause you 
>> could theoretically hit
>> the perfect storm where, due to some particular sequence of queries, your 
>> entire filter
>> cache fills up with entries that size.
>> 
>> You’ll have some overhead to keep the cache at that size, but it sounds like 
>> it’s worth it.

Re: Facet Performance

2020-06-17 Thread James Bodkin
We've noticed that the filterCache uses a significant amount of memory, as
we've assigned 8GB of heap per instance.
In total, we have 32 shards with 2 replicas, hence (8*32*2) 512GB of heap space
alone. Further memory is required to ensure the index is always memory mapped
for performance reasons.

Ideally I would like to reduce the amount of memory assigned to the heap by
using docValues instead of indexed, but it doesn't seem possible.
The QTime (after warming) for facet.method=enum is around 150-250ms, whereas the
QTime for facet.method=fc is around 1000-1200ms.
As we require the results in real time for customers searching on our website,
the latter QTime of 1000-1200ms is too slow for us to use.

Our facet queries change as the customer selects different search criteria, and
the number of possible queries makes it very difficult for the query result
cache to be effective.
We already have a custom implementation in which we check our Redis cache for
queries before they are sent to our aggregators, which runs at a 30% hit rate.

Kind Regards,

James Bodkin

On 17/06/2020, 16:21, "Michael Gibney"  wrote:

To expand a bit on what Erick said regarding performance: my sense is
that the RefGuide assertion that "docValues=true" makes faceting
"faster" could use some qualification/clarification. My take, fwiw:

First, to reiterate/paraphrase what Erick said: the "faster" assertion
is not comparing to "facet.method=enum". For low-cardinality fields,
if you have the heap space, and are very intentional about configuring
your filterCache (and monitoring it as access patterns might change),
"facet.method=enum" will likely be as fast as you can get (at least
for "legacy" facets or whatever -- not sure about "enum" method in
JSON facets).

Even where "docValues=true" arguably does make faceting "faster", the
main benefit is that the "uninverted" data structures are serialized
on disk, so you're avoiding the need to uninvert each facet field
on-heap for every new indexSearcher, which is generally high-latency
-- user perception of this latency can be mitigated using warming
queries, but it can still be problematic, esp. for frequent index
updates. On-heap uninversion also inherently consumes a lot of heap
space, which has general implications wrt GC, etc ... so in that
respect even if faceting per se might not be "faster" with
"docValues=true", your overall system may in many cases perform
better.

(and Anthony, I'm pretty sure that tag/ex on facets should be
orthogonal to the "facet.method=enum"/filterCache discussion, as
tag/ex only affects the DocSet domain over which facets are calculated
... I think that step is pretty cleanly separated from the actual
calculation of the facets. I'm not 100% sure on that, so proceed with
caution, but it could definitely be worth evaluating for your use
case!)

Michael

On Wed, Jun 17, 2020 at 10:42 AM Erick Erickson  
wrote:
>
> Uninvertible is a safety mechanism to make sure that you don’t _unknowingly_
> use a docValues=false field for faceting/grouping/sorting/function queries.
> The primary point of docValues=true is twofold:
>
> 1> reduce Java heap requirements by using the OS memory to hold it
>
> 2> uninverting can be expensive CPU wise too, although not with just a few
> unique values (for each term, read the list of docs that have it and flip
> a bit).
>
> It doesn’t really make sense to set it on an index=false field, since
> uninverting only happens on index=true docValues=false. OTOH, I don’t think
> it would do any harm either. That said, I frankly don’t know how that
> interacts with facet.method=enum.
>
> As far as speed… yeah, you’re in the edge cases. All things being equal,
> stuffing these into the filterCache is the fastest way to facet if you have
> the memory. I’ve seen very few installations where people have that luxury
> though. Each entry in the filterCache can occupy maxDoc/8 + some overhead
> bytes. If maxDoc is very large, this’ll chew up an enormous amount of memory.
> I’m cheating a bit here since the size might be smaller if only a few docs
> have any particular entry then the size is smaller. But that’s the worst-case
> you have to allow for ‘cause you could theoretically hit the perfect storm
> where, due to some particular sequence of queries, your entire filter cache
> fills up with entries that size.
>
> You’ll have some overhead to keep the cache at that size, but it sounds like
> it’s worth it.
>
> Best,
> Erick
>
>
>
> > On Jun 17, 2020, at 10:05 AM, James Bodkin wrote:
> >
> > The large majority of the relevant fields have fewer than 20 unique values.
> > We have two fields over that with 150 unique values and 5300 unique values
> > retrospectively.
> > At the moment, 

Re: Facet Performance

2020-06-17 Thread Michael Gibney
To expand a bit on what Erick said regarding performance: my sense is
that the RefGuide assertion that "docValues=true" makes faceting
"faster" could use some qualification/clarification. My take, fwiw:

First, to reiterate/paraphrase what Erick said: the "faster" assertion
is not comparing to "facet.method=enum". For low-cardinality fields,
if you have the heap space, and are very intentional about configuring
your filterCache (and monitoring it as access patterns might change),
"facet.method=enum" will likely be as fast as you can get (at least
for "legacy" facets or whatever -- not sure about "enum" method in
JSON facets).
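
A toy sketch of the idea, for illustration only: facet.method=enum produces
one count per indexed term by intersecting that term's set of matching
documents with the result set, and the filterCache is what keeps those per-term
sets around between requests. The Python below models this with plain sets and
a dict standing in for the filterCache. The function name, the variable names
and the sample airport codes are invented for the example; none of them are
Solr APIs.

```python
def enum_facet_counts(term_to_docs, result_docs, filter_cache):
    """Count, for each term, how many result docs contain it."""
    counts = {}
    for term, docs in term_to_docs.items():
        # One cached doc set per term, analogous to one filterCache entry.
        if term not in filter_cache:
            filter_cache[term] = frozenset(docs)
        counts[term] = len(filter_cache[term] & result_docs)
    return counts


# Hypothetical index: term -> ids of the docs that contain that term.
index = {"LGW": [0, 1, 4], "MAN": [2, 3], "STN": [5]}
cache = {}
print(enum_facet_counts(index, {1, 2, 3, 5}, cache))
```

With few unique values the cache stays small and every later request reuses
the same entries, which is why this approach suits low-cardinality fields.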

Even where "docValues=true" arguably does make faceting "faster", the
main benefit is that the "uninverted" data structures are serialized
on disk, so you're avoiding the need to uninvert each facet field
on-heap for every new indexSearcher, which is generally high-latency
-- user perception of this latency can be mitigated using warming
queries, but it can still be problematic, esp. for frequent index
updates. On-heap uninversion also inherently consumes a lot of heap
space, which has general implications wrt GC, etc ... so in that
respect even if faceting per se might not be "faster" with
"docValues=true", your overall system may in many cases perform
better.
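
As a rough sketch of what such a warming query could look like, following
Erick's advice elsewhere in this thread to facet on all the fields you care
about: the Python below only builds the request parameters. The two field names
come from Query 1 quoted in this thread. How the query is registered as a
newSearcher or firstSearcher listener depends on your solrconfig.xml and is not
shown here.

```python
from urllib.parse import urlencode

# Field names taken from Query 1 quoted in this thread; extend as needed.
FACET_FIELDS = ["D_DepartureAirport", "D_Destination"]


def warming_params(facet_fields):
    """Build query parameters that facet on every field of interest."""
    params = [("q", "*:*"), ("rows", "0"), ("facet", "true"), ("facet.limit", "-1")]
    params += [("facet.field", field) for field in facet_fields]
    return params


print(urlencode(warming_params(FACET_FIELDS)))
```

If you also sort on particular fields, a sort parameter would need to be added
to the list for the same reason.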

(and Anthony, I'm pretty sure that tag/ex on facets should be
orthogonal to the "facet.method=enum"/filterCache discussion, as
tag/ex only affects the DocSet domain over which facets are calculated
... I think that step is pretty cleanly separated from the actual
calculation of the facets. I'm not 100% sure on that, so proceed with
caution, but it could definitely be worth evaluating for your use
case!)

Michael

On Wed, Jun 17, 2020 at 10:42 AM Erick Erickson  wrote:
>
> Uninvertible is a safety mechanism to make sure that you don’t _unknowingly_ 
> use a docValues=false
> field for faceting/grouping/sorting/function queries. The primary point of 
> docValues=true is twofold:
>
> 1> reduce Java heap requirements by using the OS memory to hold it
>
> 2> uninverting can be expensive CPU wise too, although not with just a few
> unique values (for each term, read the list of docs that have it and flip 
> a bit).
>
> It doesn’t really make sense to set it on an index=false field, since 
> uninverting only happens on
> index=true docValues=false. OTOH, I don’t think it would do any harm either. 
> That said, I frankly
> don’t know how that interacts with facet.method=enum.
>
> As far as speed… yeah, you’re in the edge cases. All things being equal, 
> stuffing these into the
> filterCache is the fastest way to facet if you have the memory. I’ve seen 
> very few installations
> where people have that luxury though. Each entry in the filterCache can 
> occupy maxDoc/8 + some overhead
> bytes. If maxDoc is very large, this’ll chew up an enormous amount of memory. 
> I’m cheating
> a bit here since the size might be smaller if only a few docs have any 
> particular entry then the
> size is smaller. But that’s the worst-case you have to allow for ‘cause you 
> could theoretically hit
> the perfect storm where, due to some particular sequence of queries, your 
> entire filter
> cache fills up with entries that size.
>
> You’ll have some overhead to keep the cache at that size, but it sounds like 
> it’s worth it.
>
> Best,
> Erick
>
>
>
> > On Jun 17, 2020, at 10:05 AM, James Bodkin  
> > wrote:
> >
> > The large majority of the relevant fields have fewer than 20 unique values. 
> > We have two fields over that with 150 unique values and 5300 unique values 
> > retrospectively.
> > At the moment, our filterCache is configured with a maximum size of 8192.
> >
> > From the DocValues documentation 
> > (https://lucene.apache.org/solr/guide/8_3/docvalues.html), it mentions that 
> > this approach promises to make lookups for faceting, sorting and grouping 
> > much faster.
> > Hence I thought that using DocValues would be better than using Indexed and 
> > in turn improve our response times and possibly lower memory requirements. 
> > It sounds like this isn't the case if you are able to allocate enough 
> > memory to the filterCache.
> >
> > I haven't yet tried changing the uninvertible setting, I was looking at the 
> > documentation for this field earlier today.
> > Should we be setting uninvertible="false" if docValues="true" regardless of 
> > whether indexed is true or false?
> >
> > Kind Regards,
> >
> > James Bodkin
> >
> > On 17/06/2020, 14:02, "Michael Gibney"  wrote:
> >
> >facet.method=enum works by executing a query (against indexed values)
> >for each indexed value in a given field (which, for indexed=false, is
> >"no values"). So that explains why facet.method=enum no longer works.
> >I was going to suggest that you might not want to set indexed=false on
> >the docValues facet fields anyway, since the indexed values are still
> >used for facet refinement (assuming your index is distributed).
> >

Re: Facet Performance

2020-06-17 Thread Erick Erickson
Uninvertible is a safety mechanism to make sure that you don’t _unknowingly_
use a docValues=false field for faceting/grouping/sorting/function queries.
The primary point of docValues=true is twofold:

1> reduce Java heap requirements by using the OS memory to hold it

2> avoid uninverting, which can be expensive CPU-wise too, although not with
just a few unique values (for each term, read the list of docs that have it
and flip a bit).

It doesn’t really make sense to set it on an indexed=false field, since
uninverting only happens on indexed=true docValues=false fields. OTOH, I don’t
think it would do any harm either. That said, I frankly don’t know how that
interacts with facet.method=enum.

As far as speed… yeah, you’re in the edge cases. All things being equal,
stuffing these into the filterCache is the fastest way to facet if you have the
memory. I’ve seen very few installations where people have that luxury though.
Each entry in the filterCache can occupy maxDoc/8 bytes plus some overhead. If
maxDoc is very large, this’ll chew up an enormous amount of memory. I’m cheating
a bit here, since an entry can be smaller if only a few docs match it. But
maxDoc/8 is the worst case you have to allow for, ‘cause you could theoretically
hit the perfect storm where, due to some particular sequence of queries, your
entire filter cache fills up with entries that size.
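
To put a number on that worst case, here is a back-of-the-envelope sketch in
Python. The cache size of 8192 is the figure James mentions in this thread. The
maxDoc of 10 million is purely an assumed value for illustration, and the
per-entry overhead is ignored.

```python
def worst_case_filter_cache_bytes(max_doc, cache_size):
    """Upper bound: every entry is a full bitset of max_doc bits."""
    bytes_per_entry = max_doc // 8  # one bit per document, overhead ignored
    return bytes_per_entry * cache_size


# max_doc is a made-up figure; 8192 is the cache size mentioned in this thread.
total = worst_case_filter_cache_bytes(max_doc=10_000_000, cache_size=8192)
print(f"{total / 2**30:.2f} GiB")
```

Under those assumptions a completely full cache of full-size entries would
need roughly 9.5 GiB, which is more than an 8GB heap.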

You’ll have some overhead to keep the cache at that size, but it sounds like 
it’s worth it.

Best,
Erick



> On Jun 17, 2020, at 10:05 AM, James Bodkin  
> wrote:
> 
> The large majority of the relevant fields have fewer than 20 unique values. 
> We have two fields over that with 150 unique values and 5300 unique values 
> retrospectively.
> At the moment, our filterCache is configured with a maximum size of 8192.
> 
> From the DocValues documentation 
> (https://lucene.apache.org/solr/guide/8_3/docvalues.html), it mentions that 
> this approach promises to make lookups for faceting, sorting and grouping 
> much faster.
> Hence I thought that using DocValues would be better than using Indexed and 
> in turn improve our response times and possibly lower memory requirements. It 
> sounds like this isn't the case if you are able to allocate enough memory to 
> the filterCache.
> 
> I haven't yet tried changing the uninvertible setting, I was looking at the 
> documentation for this field earlier today.
> Should we be setting uninvertible="false" if docValues="true" regardless of 
> whether indexed is true or false?
> 
> Kind Regards,
> 
> James Bodkin
> 
> On 17/06/2020, 14:02, "Michael Gibney"  wrote:
> 
>facet.method=enum works by executing a query (against indexed values)
>for each indexed value in a given field (which, for indexed=false, is
>"no values"). So that explains why facet.method=enum no longer works.
>I was going to suggest that you might not want to set indexed=false on
>the docValues facet fields anyway, since the indexed values are still
>used for facet refinement (assuming your index is distributed).
> 
>What's the number of unique values in the relevant fields? If it's low
>enough, setting docValues=false and indexed=true and using
>facet.method=enum (with a sufficiently large filterCache) is
>definitely a viable option, and will almost certainly be faster than
>docValues-based faceting. (As an aside, noting for future reference:
>high-cardinality facets over high-cardinality DocSet domains might be
>able to benefit from a term facet count cache:
>https://issues.apache.org/jira/browse/SOLR-13807)
> 
>I think you didn't specifically mention whether you acted on Erick's
>suggestion of setting "uninvertible=false" (I think Erick accidentally
>said "uninvertible=true") to fail fast. I'd also recommend doing that,
>perhaps even above all else -- it shouldn't actually *do* anything,
>but will help ensure that things are behaving as you expect them to!
> 
>Michael
> 
>On Wed, Jun 17, 2020 at 4:31 AM James Bodkin
> wrote:
>> 
>> Thanks, I've implemented some queries that improve the first-hit execution 
>> for faceting.
>> 
>> Since turning off indexed on those fields, we've noticed that 
>> facet.method=enum no longer returns the facets when used.
>> Using facet.method=fc/fcs is significantly slower compared to 
>> facet.method=enum for us. Why do these two differences exist?
>> 
>> On 16/06/2020, 17:52, "Erick Erickson"  wrote:
>> 
>>Ok, I see the disconnect... Necessary parts if the index are read from 
>> disk
>>lazily. So your newSearcher or firstSearcher query needs to do whatever
>>operation causes the relevant parts of the index to be read. In this case,
>>probably just facet on all the fields you care about. I'd add sorting too
>>if you sort on different fields.
>> 
>>The *:* query without facets or sorting does virtually nothing due to some
>>special handling...
>> 
>>On Tue, Jun 16, 

Re: Facet Performance

2020-06-17 Thread James Bodkin
The large majority of the relevant fields have fewer than 20 unique values. We
have two fields over that, with 150 unique values and 5300 unique values
respectively.
At the moment, our filterCache is configured with a maximum size of 8192.

The DocValues documentation
(https://lucene.apache.org/solr/guide/8_3/docvalues.html) mentions that
this approach promises to make lookups for faceting, sorting and grouping much
faster.
Hence I thought that using DocValues would be better than using Indexed and in 
turn improve our response times and possibly lower memory requirements. It 
sounds like this isn't the case if you are able to allocate enough memory to 
the filterCache.

I haven't yet tried changing the uninvertible setting; I was looking at the
documentation for this field earlier today.
Should we be setting uninvertible="false" if docValues="true" regardless of 
whether indexed is true or false?

Kind Regards,

James Bodkin

On 17/06/2020, 14:02, "Michael Gibney"  wrote:

facet.method=enum works by executing a query (against indexed values)
for each indexed value in a given field (which, for indexed=false, is
"no values"). So that explains why facet.method=enum no longer works.
I was going to suggest that you might not want to set indexed=false on
the docValues facet fields anyway, since the indexed values are still
used for facet refinement (assuming your index is distributed).

What's the number of unique values in the relevant fields? If it's low
enough, setting docValues=false and indexed=true and using
facet.method=enum (with a sufficiently large filterCache) is
definitely a viable option, and will almost certainly be faster than
docValues-based faceting. (As an aside, noting for future reference:
high-cardinality facets over high-cardinality DocSet domains might be
able to benefit from a term facet count cache:
https://issues.apache.org/jira/browse/SOLR-13807)

I think you didn't specifically mention whether you acted on Erick's
suggestion of setting "uninvertible=false" (I think Erick accidentally
said "uninvertible=true") to fail fast. I'd also recommend doing that,
perhaps even above all else -- it shouldn't actually *do* anything,
but will help ensure that things are behaving as you expect them to!

Michael

On Wed, Jun 17, 2020 at 4:31 AM James Bodkin
 wrote:
>
> Thanks, I've implemented some queries that improve the first-hit execution
> for faceting.
>
> Since turning off indexed on those fields, we've noticed that
> facet.method=enum no longer returns the facets when used.
> Using facet.method=fc/fcs is significantly slower compared to
> facet.method=enum for us. Why do these two differences exist?
>
> On 16/06/2020, 17:52, "Erick Erickson" wrote:
>
> Ok, I see the disconnect... Necessary parts of the index are read from disk
> lazily. So your newSearcher or firstSearcher query needs to do whatever
> operation causes the relevant parts of the index to be read. In this case,
> probably just facet on all the fields you care about. I'd add sorting too
> if you sort on different fields.
>
> The *:* query without facets or sorting does virtually nothing due to some
> special handling...
>
> On Tue, Jun 16, 2020, 10:48 James Bodkin wrote:
>
> > I've been trying to build a query that I can use in newSearcher based off
> > the information in your previous e-mail. I thought you meant to build a *:*
> > query as per Query 1 in my previous e-mail but I'm still seeing the
> > first-hit execution.
> > Now I'm wondering if you meant to create a *:* query with each of the
> > fields as part of the fl query parameters or a *:* query with each of the
> > fields and values as part of the fq query parameters.
> >
> > At the moment I've been running these manually as I expected that I would
> > see the first-execution penalty disappear by the time I got to query 4, as
> > I thought this would replicate the actions of the newSearcher.
> > Unfortunately we can't use the autowarm count that is available as part of
> > the filterCache/filterCache due to the custom deployment mechanism we use
> > to update our index.
> >
> > Kind Regards,
> >
> > James Bodkin
> >
> > On 16/06/2020, 15:30, "Erick Erickson" wrote:
> >
> > Did you try the autowarming like I mentioned in my previous e-mail?
> >
> > > On Jun 16, 2020, at 10:18 AM, James Bodkin <james.bod...@loveholidays.com> wrote:
> > >
> > > We've changed the schema to enable docValues for these fields 
and
> > this led to an improvement in the response time. We found a further
> > 

Re: Facet Performance

2020-06-17 Thread Anthony Groves
Ah, interesting! So if the number of possible values is low (like <= 10),
it is faster to *not* use docValues on that (indexed) faceted field?
Does this hold true even when using faceting techniques like tag and
exclusion?
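
For reference, tag and exclusion is expressed with local params: the filter is
tagged in fq, and the facet field excludes that tag so that its counts ignore
the filter. The Python below is a small sketch that builds such a request.
D_Destination is a field named elsewhere in this thread, while the tag name
"dest" and the value "Spain" are made up for the example.

```python
from urllib.parse import urlencode


def tag_ex_params(field, value, tag):
    """Filter on field:value, but count facets for field without that filter."""
    return [
        ("q", "*:*"),
        ("rows", "0"),
        ("facet", "true"),
        ("fq", f"{{!tag={tag}}}{field}:{value}"),
        ("facet.field", f"{{!ex={tag}}}{field}"),
    ]


params = tag_ex_params("D_Destination", "Spain", "dest")
print(urlencode(params))
```

The exclusion only changes which documents the facet is counted over; whether
that affects the enum versus docValues comparison would need measuring.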

Thanks,
Anthony


On Wed, Jun 17, 2020 at 9:37 AM David Smiley 
wrote:

> I strongly recommend setting indexed=true on a field you facet on for the
> purposes of efficient refinement (fq=field:value).  But it strictly isn't
> required, as you have discovered.
>
> ~ David
>
>
> On Wed, Jun 17, 2020 at 9:02 AM Michael Gibney 
> wrote:
>
> > facet.method=enum works by executing a query (against indexed values)
> > for each indexed value in a given field (which, for indexed=false, is
> > "no values"). So that explains why facet.method=enum no longer works.
> > I was going to suggest that you might not want to set indexed=false on
> > the docValues facet fields anyway, since the indexed values are still
> > used for facet refinement (assuming your index is distributed).
> >
> > What's the number of unique values in the relevant fields? If it's low
> > enough, setting docValues=false and indexed=true and using
> > facet.method=enum (with a sufficiently large filterCache) is
> > definitely a viable option, and will almost certainly be faster than
> > docValues-based faceting. (As an aside, noting for future reference:
> > high-cardinality facets over high-cardinality DocSet domains might be
> > able to benefit from a term facet count cache:
> > https://issues.apache.org/jira/browse/SOLR-13807)
> >
> > I think you didn't specifically mention whether you acted on Erick's
> > suggestion of setting "uninvertible=false" (I think Erick accidentally
> > said "uninvertible=true") to fail fast. I'd also recommend doing that,
> > perhaps even above all else -- it shouldn't actually *do* anything,
> > but will help ensure that things are behaving as you expect them to!
> >
> > Michael
> >
> > On Wed, Jun 17, 2020 at 4:31 AM James Bodkin
> >  wrote:
> > >
> > > Thanks, I've implemented some queries that improve the first-hit
> > execution for faceting.
> > >
> > > Since turning off indexed on those fields, we've noticed that
> > facet.method=enum no longer returns the facets when used.
> > > Using facet.method=fc/fcs is significantly slower compared to
> > facet.method=enum for us. Why do these two differences exist?
> > >
> > > On 16/06/2020, 17:52, "Erick Erickson" 
> wrote:
> > >
> > > Ok, I see the disconnect... Necessary parts if the index are read
> > from disk
> > > lazily. So your newSearcher or firstSearcher query needs to do
> > whatever
> > > operation causes the relevant parts of the index to be read. In
> this
> > case,
> > > probably just facet on all the fields you care about. I'd add
> > sorting too
> > > if you sort on different fields.
> > >
> > > The *:* query without facets or sorting does virtually nothing due
> > to some
> > > special handling...
> > >
> > > On Tue, Jun 16, 2020, 10:48 James Bodkin <
> > james.bod...@loveholidays.com>
> > > wrote:
> > >
> > > > I've been trying to build a query that I can use in newSearcher
> > based off
> > > > the information in your previous e-mail. I thought you meant to
> > build a *:*
> > > > query as per Query 1 in my previous e-mail but I'm still seeing
> the
> > > > first-hit execution.
> > > > Now I'm wondering if you meant to create a *:* query with each of
> > the
> > > > fields as part of the fl query parameters or a *:* query with
> each
> > of the
> > > > fields and values as part of the fq query parameters.
> > > >
> > > > At the moment I've been running these manually as I expected that
> > I would
> > > > see the first-execution penalty disappear by the time I got to
> > query 4, as
> > > > I thought this would replicate the actions of the newSeacher.
> > > > Unfortunately we can't use the autowarm count that is available
> as
> > part of
> > > > the filterCache/filterCache due to the custom deployment
> mechanism
> > we use
> > > > to update our index.
> > > >
> > > > Kind Regards,
> > > >
> > > > James Bodkin
> > > >
> > > > On 16/06/2020, 15:30, "Erick Erickson"  >
> > wrote:
> > > >
> > > > Did you try the autowarming like I mentioned in my previous
> > e-mail?
> > > >
> > > > > On Jun 16, 2020, at 10:18 AM, James Bodkin <
> > > > james.bod...@loveholidays.com> wrote:
> > > > >
> > > > > We've changed the schema to enable docValues for these
> > fields and
> > > > this led to an improvement in the response time. We found a
> further
> > > > improvement by also switching off indexed as these fields are
> used
> > for
> > > > faceting and filtering only.
> > > > > Since those changes, we've found that the first-execution
> for
> > > > queries is really noticeable. I thought this would be the
> > filterCache based
> > > > on what I saw in 

Re: Facet Performance

2020-06-17 Thread David Smiley
I strongly recommend setting indexed=true on a field you facet on for the
purposes of efficient refinement (fq=field:value).  But it strictly isn't
required, as you have discovered.

~ David


On Wed, Jun 17, 2020 at 9:02 AM Michael Gibney 
wrote:

> facet.method=enum works by executing a query (against indexed values)
> for each indexed value in a given field (which, for indexed=false, is
> "no values"). So that explains why facet.method=enum no longer works.
> I was going to suggest that you might not want to set indexed=false on
> the docValues facet fields anyway, since the indexed values are still
> used for facet refinement (assuming your index is distributed).
>
> What's the number of unique values in the relevant fields? If it's low
> enough, setting docValues=false and indexed=true and using
> facet.method=enum (with a sufficiently large filterCache) is
> definitely a viable option, and will almost certainly be faster than
> docValues-based faceting. (As an aside, noting for future reference:
> high-cardinality facets over high-cardinality DocSet domains might be
> able to benefit from a term facet count cache:
> https://issues.apache.org/jira/browse/SOLR-13807)
>
> I think you didn't specifically mention whether you acted on Erick's
> suggestion of setting "uninvertible=false" (I think Erick accidentally
> said "uninvertible=true") to fail fast. I'd also recommend doing that,
> perhaps even above all else -- it shouldn't actually *do* anything,
> but will help ensure that things are behaving as you expect them to!
>
> Michael
>

Re: Facet Performance

2020-06-17 Thread Michael Gibney
facet.method=enum works by executing a query (against indexed values)
for each indexed value in a given field (which, for indexed=false, is
"no values"). So that explains why facet.method=enum no longer works.
I was going to suggest that you might not want to set indexed=false on
the docValues facet fields anyway, since the indexed values are still
used for facet refinement (assuming your index is distributed).
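To make the first point concrete, here is a toy model of the enum approach in Python. It is a sketch of the idea only, not Solr's implementation: the term-to-documents map stands in for the per-term filters, and the index contents are made up.

```python
# Toy model of facet.method=enum, not Solr code: each indexed term maps to
# the set of document ids containing it (a stand-in for a filterCache entry).
def enum_facet(term_to_docs, result_docs):
    """Count, for every indexed term, how many matching documents contain it."""
    counts = {}
    for term, docs in term_to_docs.items():
        # One set intersection per indexed term.
        counts[term] = len(docs & result_docs)
    return counts


# Hypothetical indexed terms for D_Destination over six documents.
indexed_terms = {"100": {0, 1, 2}, "200": {2, 3}, "300": {4, 5}}
result_set = {1, 2, 3}

print(enum_facet(indexed_terms, result_set))

# With indexed=false there are no indexed terms to iterate over,
# so the loop never runs and no facet counts come back.
print(enum_facet({}, result_set))
```

The cost grows with the number of distinct terms, which is why the cardinality question below matters.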

What's the number of unique values in the relevant fields? If it's low
enough, setting docValues=false and indexed=true and using
facet.method=enum (with a sufficiently large filterCache) is
definitely a viable option, and will almost certainly be faster than
docValues-based faceting. (As an aside, noting for future reference:
high-cardinality facets over high-cardinality DocSet domains might be
able to benefit from a term facet count cache:
https://issues.apache.org/jira/browse/SOLR-13807)

I think you didn't specifically mention whether you acted on Erick's
suggestion of setting "uninvertible=false" (I think Erick accidentally
said "uninvertible=true") to fail fast. I'd also recommend doing that,
perhaps even above all else -- it shouldn't actually *do* anything,
but will help ensure that things are behaving as you expect them to!
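For reference, a sketch of what that change could look like as a Schema API request body, built in Python. The field name comes from this thread; the "string" type and the other property values are assumptions about your schema, so adjust them before posting this to your collection's /schema endpoint.

```python
import json


def fail_fast_field(name, field_type="string"):
    """Build a Schema API 'replace-field' command that disables uninversion.

    The type and the indexed/docValues values are assumptions, not taken
    from the real schema discussed in this thread.
    """
    return {
        "replace-field": {
            "name": name,
            "type": field_type,
            "indexed": True,
            "docValues": True,
            # Makes Solr raise an error instead of silently uninverting.
            "uninvertible": False,
        }
    }


payload = fail_fast_field("D_Destination")
print(json.dumps(payload, indent=2))
```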

Michael

Re: Facet Performance

2020-06-17 Thread James Bodkin
Thanks, I've implemented some queries that improve the first-hit execution for 
faceting.

Since turning off indexed on those fields, we've noticed that facet.method=enum 
no longer returns the facets when used.
Using facet.method=fc/fcs is significantly slower compared to facet.method=enum 
for us. Why do these two differences exist?

Re: Facet Performance

2020-06-16 Thread Erick Erickson
Ok, I see the disconnect... Necessary parts of the index are read from disk
lazily. So your newSearcher or firstSearcher query needs to do whatever
operation causes the relevant parts of the index to be read. In this case,
probably just facet on all the fields you care about. I'd add sorting too
if you sort on different fields.

The *:* query without facets or sorting does virtually nothing due to some
special handling...
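A rough sketch of building such a warming request in Python, in case you fire it from a deployment script rather than from solrconfig.xml. The two facet fields are the ones from this thread; the sort field D_Price is made up for illustration. rows is left as a parameter because whether a rows=0 request actually exercises the sort is something to verify on your own setup.

```python
from urllib.parse import urlencode


def warming_params(facet_fields, sort=None, rows=0):
    """Return the request parameters for one warming query as (name, value) pairs."""
    params = [
        ("q", "*:*"),
        ("rows", str(rows)),
        ("facet", "true"),
        ("facet.limit", "-1"),
    ]
    # One facet.field parameter per field that should be read from disk.
    params += [("facet.field", field) for field in facet_fields]
    if sort:
        params.append(("sort", sort))
    return params


# D_Price is a hypothetical sort field, not one mentioned in the thread.
params = warming_params(
    ["D_DepartureAirport", "D_Destination"], sort="D_Price asc", rows=1
)
print("/select?" + urlencode(params))
```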

Re: Facet Performance

2020-06-16 Thread James Bodkin
I've been trying to build a query that I can use in newSearcher based off the 
information in your previous e-mail. I thought you meant to build a *:* query 
as per Query 1 in my previous e-mail but I'm still seeing the first-hit 
execution.
Now I'm wondering if you meant to create a *:* query with each of the fields as 
part of the fl query parameters or a *:* query with each of the fields and 
values as part of the fq query parameters.

At the moment I've been running these manually as I expected that I would see 
the first-execution penalty disappear by the time I got to query 4, as I 
thought this would replicate the actions of the newSearcher.
Unfortunately we can't use the autowarm count that is available as part of the 
filterCache/queryResultCache due to the custom deployment mechanism we use to 
update our index.

Kind Regards,

James Bodkin

Re: Facet Performance

2020-06-16 Thread Erick Erickson
Did you try the autowarming like I mentioned in my previous e-mail?

Re: Facet Performance

2020-06-16 Thread James Bodkin
We've changed the schema to enable docValues for these fields and this led to 
an improvement in the response time. We found a further improvement by also 
switching off indexed as these fields are used for faceting and filtering only.
Since those changes, we've found that the first-execution for queries is really 
noticeable. I thought this would be the filterCache based on what I saw in 
NewRelic however it is probably trying to read the docValues from disk. How can 
we use the autowarming to improve this?

For example, I've run the following queries in sequence and each query has a 
first-execution penalty.

Query 1:

q=*:*
facet=true
facet.field=D_DepartureAirport
facet.field=D_Destination
facet.limit=-1
rows=0

Query 2:

q=*:*
fq=D_DepartureAirport:(2660) 
facet=true
facet.field=D_Destination
facet.limit=-1
rows=0

Query 3:

q=*:*
fq=D_DepartureAirport:(2661)
facet=true
facet.field=D_Destination
facet.limit=-1
rows=0

Query 4:

q=*:*
fq=D_DepartureAirport:(2660+OR+2661)
facet=true
facet.field=D_Destination
facet.limit=-1
rows=0

We've kept the field type as a string, as the value is mapped by the 
application that accesses Solr. In the examples above, the values are mapped 
to airports and destinations.
Is it possible to prewarm the above queries without having to define all the 
potential filters manually in the auto warming?

At the moment, we update and optimise our index in a different environment and 
then copy the index to our production instances by using a rolling deployment 
in Kubernetes.

Kind Regards,

James Bodkin

Re: Facet Performance

2020-06-12 Thread Erick Erickson
I question whether filterCache has anything to do with it. I suspect what’s 
really happening is that the first time, you’re reading the relevant bits from 
disk into memory. And to double check, you should have docValues enabled for 
all these fields. The “uninverting” process can be very expensive, and 
docValues bypasses that.

As of Solr 7.6, you can define “uninvertible=true” to your field(Type) to “fail 
fast” if Solr needs to uninvert the field.

But that’s an aside. In either case, my claim is that first-time execution does 
“something”, either reads the serialized docValues from disk or uninverts the 
field on Solr’s heap.

You can have this autowarmed by any combination of
1> specifying an autowarm count on your queryResultCache. That’s hit or miss, 
as it replays the most recent N queries which may or may not contain the sorts. 
That said, specifying 10-20 for autowarm count is usually a good idea, assuming 
you’re not committing more than, say, every 30 seconds. I’d add the same to 
filterCache too.

2> specifying a newSearcher or firstSearcher query in solrconfig.xml. The 
difference is that newSearcher is fired every time a commit happens, while 
firstSearcher is only fired when Solr starts, the theory being that there’s no 
cache autowarming available when Solr first powers up. Usually, people don’t 
bother with firstSearcher or just make it the same as newSearcher. Note that a 
query doesn’t have to be “real” at all. You can just add all the facet fields 
to a *:* query in a single go.
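As an illustration of option 2, the following Python sketch generates a listener element of the kind that goes into solrconfig.xml. It assumes the stock solr.QuerySenderListener and uses the two facet fields that come up later in this thread; check the printed XML against the Reference Guide for your Solr version before using it.

```python
import xml.etree.ElementTree as ET


def warming_listener(event, facet_fields):
    """Build a <listener> element that facets on every given field in one *:* query."""
    listener = ET.Element(
        "listener", {"event": event, "class": "solr.QuerySenderListener"}
    )
    queries = ET.SubElement(listener, "arr", {"name": "queries"})
    query = ET.SubElement(queries, "lst")
    fixed = [("q", "*:*"), ("rows", "0"), ("facet", "true"), ("facet.limit", "-1")]
    for name, value in fixed:
        ET.SubElement(query, "str", {"name": name}).text = value
    # One facet.field entry per field to warm.
    for field in facet_fields:
        ET.SubElement(query, "str", {"name": "facet.field"}).text = field
    return listener


# Same query for both events, as suggested above.
for event in ("newSearcher", "firstSearcher"):
    element = warming_listener(event, ["D_DepartureAirport", "D_Destination"])
    print(ET.tostring(element, encoding="unicode"))
```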

BTW, Trie fields will stay around for a long time even though deprecated. Or at 
least until we find something to replace them with that doesn’t have this 
penalty, so I’d feel pretty safe using those and they’ll be more efficient than 
strings.

Best,
Erick

Re: Facet Performance

2020-06-12 Thread James Bodkin
We've run the performance test after changing the fields to be of the type 
string. We're seeing improved performance, especially after the first time the 
query has run. The first run is taking around 1-2 seconds rather than 6-8 
seconds and when the filter cache is present, the response time is around 400ms.
Do you have any more suggestions that we could try in order to optimise the 
performance?

Re: Facet Performance

2020-06-11 Thread James Bodkin
Could you explain why the performance is an issue for points-based fields? I've 
looked through the referenced issue (which is fixed in the version we are 
running) but I'm missing the link between the two. Is there an issue to improve 
this for points-based fields?
We're going to change the field type to a string, as our queries are always 
looking for a specific value (and not intervals/ranges) and rerun our load test.


Kind Regards,

James Bodkin

Re: Facet Performance

2020-06-11 Thread Erick Erickson
There’s a lot of confusion about using points-based fields for faceting, see: 
https://issues.apache.org/jira/browse/SOLR-13227 for instance.

Two options you might try:
1> copyField to a string field and facet on that (won’t work, of course, for 
any kind of interval/range facet)
2> use the deprecated Trie field instead. You could use the copyField to a Trie 
field for this too.

Best,
Erick

> On Jun 11, 2020, at 9:39 AM, James Bodkin  
> wrote:
> 
> We’ve been running a load test against our index and have noticed that the 
> facet queries are significantly slower than we would like.
> Currently these types of queries are taking several seconds to execute and we 
> are wondering if it would be possible to speed these up.
> Repeating the same query over and over does not improve the response time so 
> does not appear to utilise any caching.
> Ideally we would like to be targeting a response time around tens or hundreds 
> of milliseconds if possible.
> 
> An example query that is taking around 2-3 seconds to execute is:
> 
> q=*:*
> facet=true
> facet.field=D_UserRatingGte
> facet.mincount=1
> facet.limit=-1
> rows=0
> 
> "response":{"numFound":18979503,"start":0,"maxScore":1.0,"docs":[]}
> "facet_counts":{
>"facet_queries":{},
>"facet_fields":{
>  "D_UserRatingGte":[
>"1575",16614238,
>"1576",16614238,
>"1577",16614238,
>"1578",16065938,
>"1579",12079545,
>"1580",458799]},
>"facet_ranges":{},
>"facet_intervals":{},
>"facet_heatmaps":{}}}
> 
> I have also tried the equivalent query using the JSON Facet API with the same 
> outcome of slow response time.
> Additionally I have tried changing the facet method (on both facet apis) with 
> the same outcome of slow response time.
> 
> The underlying field for the above query is configured as a 
> solr.IntPointField with docValues, indexed and multiValued set to true.
> The index has just under 19 million documents and the physical size on disk 
> is 10.95GB. The index is read-only and consists of 4 segments with 0 
> deletions.
> We’re running standalone Solr 8.3.1 with a 8GB Heap and the underlying Google 
> Cloud Virtual Machine in our load test environment has 6 vCPUs, 32G RAM and 
> 100GB SSD.
> 
> Would anyone be able to point me in a direction to either improve the 
> performance or understand whether the current performance is expected?
> 
> Kind Regards,
> 
> James Bodkin



Facet Performance

2020-06-11 Thread James Bodkin
We’ve been running a load test against our index and have noticed that the 
facet queries are significantly slower than we would like.
Currently these types of queries are taking several seconds to execute and we 
are wondering if it would be possible to speed these up.
Repeating the same query over and over does not improve the response time so 
does not appear to utilise any caching.
Ideally we would like to be targeting a response time around tens or hundreds 
of milliseconds if possible.

An example query that is taking around 2-3 seconds to execute is:

q=*:*
facet=true
facet.field=D_UserRatingGte
facet.mincount=1
facet.limit=-1
rows=0

"response":{"numFound":18979503,"start":0,"maxScore":1.0,"docs":[]}
"facet_counts":{
"facet_queries":{},
"facet_fields":{
  "D_UserRatingGte":[
"1575",16614238,
"1576",16614238,
"1577",16614238,
"1578",16065938,
"1579",12079545,
"1580",458799]},
"facet_ranges":{},
"facet_intervals":{},
"facet_heatmaps":{}}}

I have also tried the equivalent query using the JSON Facet API with the same 
outcome of slow response time.
Additionally I have tried changing the facet method (on both facet apis) with 
the same outcome of slow response time.

The underlying field for the above query is configured as a solr.IntPointField 
with docValues, indexed and multiValued set to true.
The index has just under 19 million documents and the physical size on disk is 
10.95GB. The index is read-only and consists of 4 segments with 0 deletions.
We’re running standalone Solr 8.3.1 with a 8GB Heap and the underlying Google 
Cloud Virtual Machine in our load test environment has 6 vCPUs, 32G RAM and 
100GB SSD.

Would anyone be able to point me in a direction to either improve the 
performance or understand whether the current performance is expected?

Kind Regards,

James Bodkin
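
The facet_fields section of the response above is a flat list that alternates value and count. A minimal sketch of pairing it up on the client side follows; the helper name `pair_facet_counts` is made up for illustration, and the numbers are copied from the response in this message.

```python
def pair_facet_counts(flat):
    """Turn a flat [value, count, value, count, ...] list into a dict."""
    if len(flat) % 2 != 0:
        raise ValueError("expected an even number of entries")
    return dict(zip(flat[0::2], flat[1::2]))


facet_field = ["1575", 16614238, "1576", 16614238, "1577", 16614238,
               "1578", 16065938, "1579", 12079545, "1580", 458799]
num_found = 18979503

counts = pair_facet_counts(facet_field)
for value, count in counts.items():
    # Share of all matched documents that carry this value.
    print(value, count, f"{count / num_found:.1%}")
```

The counts add up to more than numFound, which is consistent with the field being multiValued: one document can contribute to several buckets.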


Re: Facet performance problem

2018-02-20 Thread Shawn Heisey

On 2/20/2018 1:18 AM, LOPEZ-CORTES Mariano-ext wrote:

We return a facet list of values in "motifPresence" field (person status).
Status:
[ ] status1
[x] status2
[x] status3

The user then selects 1 or multiple status (It's this step that we called "facet 
filtering").

Query is then re-executed with fq=motifPresence:(status2 OR status3)

We use fq in order to not alter the score in main query.

We've read that docValues=true for facet fields.

We need also indexed=true?


Facets, grouping, and sorting are more efficient with docValues, but 
searches aren't helped by docValues.  Without indexed="true", searches 
on the field will be VERY slow.  A filter query is still a search.  The 
"filter" in filter query just refers to the fact that it's separate from 
the main query, and that it does not affect relevancy scoring.


Thanks,
Shawn
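
The flow described in this message (facet first, then re-run the search with an fq built from the selected values) might look like the sketch below. The helper names are hypothetical, and `q=*:*` stands in for the real main query, which the thread does not show.

```python
def quote_term(value):
    """Wrap a value in double quotes, escaping backslashes and quotes inside it."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_status_filter(field, selected):
    """Return an fq clause matching any of the selected facet values."""
    if not selected:
        return None  # nothing selected: send no filter at all
    joined = " OR ".join(quote_term(value) for value in selected)
    return f"{field}:({joined})"


# Placeholder main query; the facet parameters stay the same on every request.
params = {"q": "*:*", "facet": "true", "facet.field": "motifPresence"}
fq = build_status_filter("motifPresence", ["status2", "status3"])
if fq is not None:
    params["fq"] = fq
print(params["fq"])
```

Because the fq is still a search on motifPresence, it only performs well when the field has indexed="true", as noted above; docValues covers the faceting side of the request.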



RE: Facet performance problem

2018-02-20 Thread LOPEZ-CORTES Mariano-ext
Our query looks like this:

...facet=true&facet.field=motifPresence

We return a facet list of values in "motifPresence" field (person status).
Status:
[ ] status1
[x] status2
[x] status3

The user then selects one or more statuses (this is the step that we called 
"facet filtering").

Query is then re-executed with fq=motifPresence:(status2 OR status3)

We use fq in order to not alter the score in main query.

We've read that facet fields should have docValues=true.

Do we also need indexed=true?
Is there any other problem in our solution?

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Monday, 19 February 2018 18:18
To: solr-user
Subject: Re: Facet performance problem

I'm confused here. What do you mean by "facet filtering"? Your examples have no 
facets at all, just a _filter query_.

I'll assume you want to use filter query (fq), and faceting has nothing to do 
with it. This is one of the tricky bits of docValues.
While it's _possible_ to search on a field that's defined as above, it's very 
inefficient since there's no "inverted index" for the field, you specified 
'indexed="false" '. So the docValues are searched, and it's essentially a table 
scan.

If you mean to search against this field, set indexed="true". You'll have to 
completely reindex your corpus of course.

If you intend to facet, group or sort on this field, you should _also_ have 
docValues="true".

Best,
Erick

On Mon, Feb 19, 2018 at 7:47 AM, MOUSSA MZE Oussama-ext 
<oussama.moussa-mze-...@pole-emploi.fr> wrote:
> Hi
>
> We have following environement :
>
> 3 nodes cluster
> 1 shard
> Replication factor = 2
> 8GB per node
>
> 29 millions of documents
>
> We've faceting over field "motifPresence" defined as follow:
>
>  indexed="false" stored="true" required="false"/>
>
> Once the user selects motifPresence filter we executes search again with:
>
> fq: (value1 OR value2 OR value3 OR ...)
>
> The problem is: During facet filtering query is too slow and her response 
> time is greater than main search (without facet filtering).
>
> Thanks in advance!


Re: Facet performance problem

2018-02-19 Thread Erick Erickson
I'm confused here. What do you mean by "facet filtering"? Your
examples have no facets at all, just a _filter query_.

I'll assume you want to use filter query (fq), and faceting has
nothing to do with it. This is one of the tricky bits of docValues.
While it's _possible_ to search on a field that's defined as above,
it's very inefficient since there's no "inverted index" for the field,
you specified 'indexed="false" '. So the docValues are searched, and
it's essentially a table scan.

If you mean to search against this field, set indexed="true". You'll
have to completely reindex your corpus of course.

If you intend to facet, group or sort on this field, you should _also_
have docValues="true".

Best,
Erick

On Mon, Feb 19, 2018 at 7:47 AM, MOUSSA MZE Oussama-ext
 wrote:
> Hi
>
> We have following environement :
>
> 3 nodes cluster
> 1 shard
> Replication factor = 2
> 8GB per node
>
> 29 millions of documents
>
> We've faceting over field "motifPresence" defined as follow:
>
>  stored="true" required="false"/>
>
> Once the user selects motifPresence filter we executes search again with:
>
> fq: (value1 OR value2 OR value3 OR ...)
>
> The problem is: During facet filtering query is too slow and her response 
> time is greater than main search (without facet filtering).
>
> Thanks in advance!
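
The difference between searching an indexed field and searching a docValues-only field, as described in this reply, can be illustrated with a toy model. This is not how Lucene stores either structure; the dictionaries below are stand-ins chosen only to show why one lookup is cheap and the other touches every document.

```python
from collections import defaultdict

# Toy corpus: doc id -> value of one single-valued field.
docs = {0: "status1", 1: "status2", 2: "status3", 3: "status2", 4: "status1"}


def build_inverted_index(field_values):
    """Map each term to the sorted doc ids containing it (the indexed="true" case)."""
    index = defaultdict(list)
    for doc_id in sorted(field_values):
        index[field_values[doc_id]].append(doc_id)
    return index


def search_inverted(index, wanted):
    """Look up each wanted term directly; work grows with the number of matches."""
    hits = set()
    for term in wanted:
        hits.update(index.get(term, []))
    return sorted(hits)


def search_by_scan(field_values, wanted):
    """Check every document's value; work grows with the size of the corpus."""
    return sorted(doc_id for doc_id, value in field_values.items() if value in wanted)


wanted = {"status2", "status3"}
index = build_inverted_index(docs)
assert search_inverted(index, wanted) == search_by_scan(docs, wanted)
print(search_inverted(index, wanted))
```

Both functions return the same documents; with 29 million documents the scan has to visit all of them for every filter query, which matches the "table scan" behaviour described above.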


Facet performance problem

2018-02-19 Thread MOUSSA MZE Oussama-ext
Hi

We have the following environment:

3 nodes cluster
1 shard
Replication factor = 2
8GB per node

29 million documents

We're faceting on the field "motifPresence", defined as follows:



Once the user selects a motifPresence filter, we execute the search again with:

fq: (value1 OR value2 OR value3 OR ...)

The problem is that the query with facet filtering is too slow: its response 
time is greater than that of the main search (without facet filtering).

Thanks in advance!


Re: Really slow facet performance in 6.6

2017-10-25 Thread Yonik Seeley
On Mon, Oct 23, 2017 at 3:06 PM, John Davis <johndavis925...@gmail.com> wrote:
> Hello,
>
> We are seeing really slow facet performance with new solr release. This is
> on an index of 2M documents. A few things we've tried:

What happens when you run this facet request again?
The first time a UIF faceting method runs for a field on a changed
index, the data structure needs to be rebuilt (i.e. it's not good for
NRT).  Maybe that build time is being included.  Otherwise I've never
seen faceting so slow and there is something else going on here.

-Yonik


Re: Really slow facet performance in 6.6

2017-10-24 Thread Günter Hipler

have a look for more background information:

https://issues.apache.org/jira/browse/SOLR-8096

it's not only related to version 6.6. It's a question of design since 5.x

Günter


On 23.10.2017 21:06, John Davis wrote:

Hello,

We are seeing really slow facet performance with new solr release. This is
on an index of 2M documents. A few things we've tried:

1. method=uif however that didn't help much (the facet fields have
docValues=false since they are multi-valued). Debug info below.

2. changing query (q=) that selects what documents to compute facets on
didn't help a lot, except repeating the same query was fast presumably due
to exact cache hits.

Sample debug info:

"timing": {
  "prepare": {
    "debug": {"time": 0.0},
    "expand": {"time": 0.0},
    "facet": {"time": 0.0},
    "facet_module": {"time": 0.0},
    "highlight": {"time": 0.0},
    "mlt": {"time": 0.0},
    "query": {"time": 0.0},
    "stats": {"time": 0.0},
    "terms": {"time": 0.0},
    "time": 0.0
  },
  "process": {
    "debug": {"time": 87.0},
    "expand": {"time": 0.0},
    "facet": {"time": 9814.0},
    "facet_module": {"time": 0.0},
    "highlight": {"time": 0.0},
    "mlt": {"time": 0.0},
    "query": {"time": 20.0},
    "stats": {"time": 0.0},
    "terms": {"time": 0.0},
    "time": 9922.0
  },
  "time": 9923.0
}
},

"facet-debug": {
 "elapse": 8310,
 "sub-facet": [
 {
 "action": "field facet",
 "elapse": 8310,
 "maxThreads": 2,
 "processor": "SimpleFacets",
 "sub-facet": [
 {},
 {
 "appliedMethod": "UIF",
 "field": "school",
 "inputDocSetSize": 476,
 "requestedMethod": "UIF"
 },
 {
 "appliedMethod": "UIF",
 "elapse": 2575,
 "field": "work",
 "inputDocSetSize": 476,
 "requestedMethod": "UIF"
 },
 {
 "appliedMethod": "UIF",
 "elapse": 8310,
 "field": "level",
 "inputDocSetSize": 476,
 "requestedMethod": "UIF"
 }
 ]
 }

Thanks
John



--
Günter Hipler

Universität Basel | Universitätsbibliothek | Projekt swissbib

Schönbeinstrasse 18-20 | 4056 Basel | Schweiz

Tel +41 61 207 31 12 | Fax +41 61 207 31 03

E-Mail guenter.hip...@unibas.ch | http://www.ub.unibas.ch | 
https://www.swissbib.ch



Re: Really slow facet performance in 6.6

2017-10-23 Thread Toke Eskildsen
John Davis <johndavis925...@gmail.com> wrote:
> We are seeing really slow facet performance with new solr release.
> This is on an index of 2M documents.

I am currently running some performance experiments on simple String faceting, 
comparing Solr 4 & 6. There is definitely a performance difference, but it is 
not trivial to pinpoint where it is. My first thought was that it was tied to 
the Solr version, with Solr 6 being markedly slower than Solr 4. However, 
looking at segment count, I can see that Solr 6 has twice as many segments as 
Solr 4 for my test setup. I tried optimizing down to 10 segments, which flipped 
the result: Suddenly Solr 6 was faster than Solr 4.

I'm still poking at this, but I guess my takeaway for now is to be sure to 
compare on fair terms. The strategy for creating segments can be tweaked and 
(guessing a lot here) it seems that the Solr 6 defaults lean towards faster 
indexing (by having more small segments) at the cost of faceting performance.

These JIRAs seem relevant:
https://issues.apache.org/jira/browse/SOLR-8096
https://issues.apache.org/jira/browse/SOLR-9599

> 1. method=uif however that didn't help much (the facet fields have
> docValues=false since they are multi-valued). Debug info below.

docValues works fine with multi-values (at least for Strings).

- Toke Eskildsen


Really slow facet performance in 6.6

2017-10-23 Thread John Davis
Hello,

We are seeing really slow facet performance with new solr release. This is
on an index of 2M documents. A few things we've tried:

1. method=uif however that didn't help much (the facet fields have
docValues=false since they are multi-valued). Debug info below.

2. changing query (q=) that selects what documents to compute facets on
didn't help a lot, except repeating the same query was fast presumably due
to exact cache hits.

Sample debug info:

"timing": {
  "prepare": {
    "debug": {"time": 0.0},
    "expand": {"time": 0.0},
    "facet": {"time": 0.0},
    "facet_module": {"time": 0.0},
    "highlight": {"time": 0.0},
    "mlt": {"time": 0.0},
    "query": {"time": 0.0},
    "stats": {"time": 0.0},
    "terms": {"time": 0.0},
    "time": 0.0
  },
  "process": {
    "debug": {"time": 87.0},
    "expand": {"time": 0.0},
    "facet": {"time": 9814.0},
    "facet_module": {"time": 0.0},
    "highlight": {"time": 0.0},
    "mlt": {"time": 0.0},
    "query": {"time": 20.0},
    "stats": {"time": 0.0},
    "terms": {"time": 0.0},
    "time": 9922.0
  },
  "time": 9923.0
}
},

"facet-debug": {
"elapse": 8310,
"sub-facet": [
{
"action": "field facet",
"elapse": 8310,
"maxThreads": 2,
"processor": "SimpleFacets",
"sub-facet": [
{},
{
"appliedMethod": "UIF",
"field": "school",
"inputDocSetSize": 476,
"requestedMethod": "UIF"
},
{
"appliedMethod": "UIF",
"elapse": 2575,
"field": "work",
"inputDocSetSize": 476,
"requestedMethod": "UIF"
},
{
"appliedMethod": "UIF",
"elapse": 8310,
"field": "level",
"inputDocSetSize": 476,
"requestedMethod": "UIF"
}
]
}

Thanks
John
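
A quick way to read the timing block above is to find the component with the largest share of the process phase. The sketch below copies the per-component numbers from the debug output in this message; the function name is made up for illustration.

```python
# Per-component times (ms) from the "process" section of the debug output.
process_timing = {
    "debug": 87.0, "expand": 0.0, "facet": 9814.0, "facet_module": 0.0,
    "highlight": 0.0, "mlt": 0.0, "query": 20.0, "stats": 0.0, "terms": 0.0,
}
process_total = 9922.0  # "time" reported for the whole process phase


def slowest_component(timing):
    """Return the (name, milliseconds) pair with the largest time."""
    return max(timing.items(), key=lambda item: item[1])


name, millis = slowest_component(process_timing)
share = millis / process_total
print(f"{name}: {millis:.0f} ms ({share:.1%} of process time)")
```

Here the facet component accounts for almost all of the process time, while the query itself takes 20 ms, so the slowness lies in facet computation rather than in matching documents.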


Re: JSON facet performance for aggregations

2017-05-25 Thread Saman Rasheed
hi yonik,


i like your work on solr very much, and i'm hoping it can deliver what we are 
looking to achieve here... and apologies for the direct approach but i don't 
think i have a choice, i've submitted the request below to the mailing list and 
i still haven't had a reply ... and part of me is wondering if it's because 
either i have missed out on something very obvious, or maybe my approach to my 
problem is using the wrong technology here!


The mailing list is not allowing me to send you a direct link to the issue 
unless you want to see my message with a lot of xml 

so i'm pasting the contents of my message below:

thanks,

~

i have an english book whose contents i have successfully indexed into a 
field called 'content',
with the following properties:





so if i need to return the number of occurrences of terms matching a specific 
pattern, e.g. '*olomo*', and my document
contains 'Solomon' twice, it should give me 'Solomon' with a term frequency = 2.


I've tried going through the term vector section in the reference and various 
other posts
on the internet but still i haven't managed to figure out how.


the nearest i found is the following syntax/way:


http://localhost:8983/solr/test/tvrh?q=content:[*%20TO%20*]=true=true=true


which brings my pc to a near halt for about a couple of minutes, and then it 
returns the term
frequency of every term! but i only need the term frequency of a particular 
pattern/regex:


is there a way to narrow it down to just one regex term, e.g. *thing*, so it 
will find soothing,
something, everything, each with its number of occurrences for the document?


thanks,



~





From: Yonik Seeley <ysee...@gmail.com>
Sent: 24 May 2017 10:45
To: solr-user@lucene.apache.org
Subject: Re: JSON facet performance for aggregations

On Mon, May 8, 2017 at 11:27 AM, Yonik Seeley <ysee...@gmail.com> wrote:
> I opened https://issues.apache.org/jira/browse/SOLR-10634 to address
> this performance issue.

OK, this has been committed.
A quick test shows about a 30x speedup when faceting on a
string/numeric docvalues field with 100K unique values and doing a
simple aggregation on another numeric field (and when the limit:-1).

-Yonik


Re: JSON facet performance for aggregations

2017-05-24 Thread Yonik Seeley
On Mon, May 8, 2017 at 11:27 AM, Yonik Seeley  wrote:
> I opened https://issues.apache.org/jira/browse/SOLR-10634 to address
> this performance issue.

OK, this has been committed.
A quick test shows about a 30x speedup when faceting on a
string/numeric docvalues field with 100K unique values and doing a
simple aggregation on another numeric field (and when the limit:-1).

-Yonik


Re: JSON facet performance for aggregations

2017-05-08 Thread Yonik Seeley
On Mon, May 8, 2017 at 3:55 AM, Mikhail Ibraheem
<mikhail.ibrah...@oracle.com> wrote:
> Thanks Yonik.
> It is double because our use case allows to group by any field of any type.

Grouping in Solr does not require a double type, so I'm not sure how
that logically follows.  Perhaps it's a limitation in the system using
Solr?

> According to your below valuable explanation, is it better at this case to 
> use flat faceting instead of JSON faceting?

I don't think it would help.

I opened https://issues.apache.org/jira/browse/SOLR-10634 to address
this performance issue.

> Indexing the field should give us better performance than flat faceting?

Indexing the studentId field should give better performance wherever
you need to search for or filter by specific student ids.

-Yonik


> Indexing the field should give us better performance than flat faceting?
> Do you recommend streaming at that case?
>
> Please advise.
>
> Thanks
> Mikhail
>
> -Original Message-
> From: Yonik Seeley [mailto:ysee...@gmail.com]
> Sent: Sunday, May 07, 2017 6:25 PM
> To: solr-user@lucene.apache.org
> Subject: Re: JSON facet performance for aggregations
>
> OK, so I think I know what's going on.
>
> The current code is more optimized for finding the top K buckets from a total 
> of N.
> When one asks to return the top 10 buckets when there are potentially 
> millions of buckets, it makes sense to defer calculating other metrics for 
> those buckets until we know which ones they are.  After we identify the top 
> 10 buckets, we calculate the domain for that bucket and use that to calculate 
> the remaining metrics.
>
> The current method is obviously much slower when one is requesting
> *all* buckets.  We might as well just calculate all metrics in the first pass 
> rather than trying to defer them.
>
> This inefficiency is compounded by the fact that the fields are not indexed.  
> In the second phase, finding the domain for a bucket is a field query.  For 
> an indexed field, this would involve a single term lookup.  For a non-indexed 
> docValues field, this involves a full column scan.
>
> If you ever want to do quick lookups on studentId, it would make sense for it 
> to be indexed (and why is it a double, anyway?)
>
> I'll open up a JIRA issue for the first problem (don't defer metrics if we're 
> going to return all buckets anyway)
>
> -Yonik
>
>
> On Sun, Apr 30, 2017 at 8:58 AM, Mikhail Ibraheem 
> <mikhail.ibrah...@oracle.com> wrote:
>> Hi Yonik,
>> We are using Solr 6.5
>> Both studentId and grades are double:
>>   > indexed="false" stored="true" docValues="true" multiValued="false"
>> required="false"/>
>>
>> We have 1.5 million records.
>>
>> Thanks
>> Mikhail
>>
>> -Original Message-
>> From: Yonik Seeley [mailto:ysee...@gmail.com]
>> Sent: Sunday, April 30, 2017 1:04 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: JSON facet performance for aggregations
>>
>> It is odd there would be quite such a big performance delta.
>> What version of solr are you using?
>> What is the fieldType of "grades"?
>> -Yonik
>>
>>
>> On Sun, Apr 30, 2017 at 5:15 AM, Mikhail Ibraheem 
>> <mikhail.ibrah...@oracle.com> wrote:
>>> 1-
>>> studentId has docValue = true . it is of type double which is
>>> >> stored="true" docValues="true" multiValued="false" required="false"/>
>>>
>>>
>>> 2- If we just facet without aggregation it finishes in good time 60ms:
>>>
>>> json.facet={
>>>studentId:{
>>>   type:terms,
>>>   limit:-1,
>>>   field:" studentId "
>>>
>>>}
>>> }
>>>
>>>
>>> Thanks
>>>
>>>
>>> -Original Message-
>>> From: Vijay Tiwary [mailto:vijaykr.tiw...@gmail.com]
>>> Sent: Sunday, April 30, 2017 10:44 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: RE: JSON facet performance for aggregations
>>>
>>> Please enable doc values and try.
>>> There is a bug in the source code which causes json facet on string field 
>>> to run very slow. On numeric fields it runs fine with doc value enabled.
>>>
>>> On Apr 30, 2017 1:41 PM, "Mikhail Ibraheem"
>>> <mikhail.ibrah...@oracle.com>
>>> wrote:
>>>
>>>> Hi Vijay,
>>>> It is already numeric field.
>>>> It is huge difference between json and flat here. Do y

RE: JSON facet performance for aggregations

2017-05-08 Thread Mikhail Ibraheem
Thanks Yonik.
It is a double because our use case allows grouping by any field of any type.
Given your valuable explanation below, is it better in this case to use 
flat faceting instead of JSON faceting?
Would indexing the field give us better performance than flat faceting?
Do you recommend streaming in that case?

Please advise.

Thanks
Mikhail

-Original Message-
From: Yonik Seeley [mailto:ysee...@gmail.com] 
Sent: Sunday, May 07, 2017 6:25 PM
To: solr-user@lucene.apache.org
Subject: Re: JSON facet performance for aggregations

OK, so I think I know what's going on.

The current code is more optimized for finding the top K buckets from a total 
of N.
When one asks to return the top 10 buckets when there are potentially millions 
of buckets, it makes sense to defer calculating other metrics for those buckets 
until we know which ones they are.  After we identify the top 10 buckets, we 
calculate the domain for that bucket and use that to calculate the remaining 
metrics.

The current method is obviously much slower when one is requesting
*all* buckets.  We might as well just calculate all metrics in the first pass 
rather than trying to defer them.

This inefficiency is compounded by the fact that the fields are not indexed.  
In the second phase, finding the domain for a bucket is a field query.  For an 
indexed field, this would involve a single term lookup.  For a non-indexed 
docValues field, this involves a full column scan.

If you ever want to do quick lookups on studentId, it would make sense for it 
to be indexed (and why is it a double, anyway?)

I'll open up a JIRA issue for the first problem (don't defer metrics if we're 
going to return all buckets anyway)

-Yonik


On Sun, Apr 30, 2017 at 8:58 AM, Mikhail Ibraheem <mikhail.ibrah...@oracle.com> 
wrote:
> Hi Yonik,
> We are using Solr 6.5
> Both studentId and grades are double:
>indexed="false" stored="true" docValues="true" multiValued="false" 
> required="false"/>
>
> We have 1.5 million records.
>
> Thanks
> Mikhail
>
> -Original Message-
> From: Yonik Seeley [mailto:ysee...@gmail.com]
> Sent: Sunday, April 30, 2017 1:04 PM
> To: solr-user@lucene.apache.org
> Subject: Re: JSON facet performance for aggregations
>
> It is odd there would be quite such a big performance delta.
> What version of solr are you using?
> What is the fieldType of "grades"?
> -Yonik
>
>
> On Sun, Apr 30, 2017 at 5:15 AM, Mikhail Ibraheem 
> <mikhail.ibrah...@oracle.com> wrote:
>> 1-
>> studentId has docValue = true . it is of type double which is 
>> > stored="true" docValues="true" multiValued="false" required="false"/>
>>
>>
>> 2- If we just facet without aggregation it finishes in good time 60ms:
>>
>> json.facet={
>>studentId:{
>>   type:terms,
>>   limit:-1,
>>   field:" studentId "
>>
>>}
>> }
>>
>>
>> Thanks
>>
>>
>> -Original Message-
>> From: Vijay Tiwary [mailto:vijaykr.tiw...@gmail.com]
>> Sent: Sunday, April 30, 2017 10:44 AM
>> To: solr-user@lucene.apache.org
>> Subject: RE: JSON facet performance for aggregations
>>
>> Please enable doc values and try.
>> There is a bug in the source code which causes json facet on string field to 
>> run very slow. On numeric fields it runs fine with doc value enabled.
>>
>> On Apr 30, 2017 1:41 PM, "Mikhail Ibraheem"
>> <mikhail.ibrah...@oracle.com>
>> wrote:
>>
>>> Hi Vijay,
>>> It is already numeric field.
>>> It is huge difference between json and flat here. Do you know the 
>>> reason for this? Is there a way to improve it ?
>>>
>>> -Original Message-
>>> From: Vijay Tiwary [mailto:vijaykr.tiw...@gmail.com]
>>> Sent: Sunday, April 30, 2017 9:58 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: JSON facet performance for aggregations
>>>
>>> Json facet on string fields run lot slower than on numeric fields.
>>> Try and see if you can represent studentid as a numeric field.
>>>
>>> On Apr 30, 2017 1:19 PM, "Mikhail Ibraheem"
>>> <mikhail.ibrah...@oracle.com>
>>> wrote:
>>>
>>> > Hi,
>>> >
>>> > I am trying to do aggregation with JSON faceting but performance 
>>> > is very bad for one of the requests:
>>> >
>>> > json.facet={
>>> >
>>> >studentId:{
>>> >
>>> >   type:terms,
>>> >
>>> >   limit:-1,
&

Re: JSON facet performance for aggregations

2017-05-07 Thread Yonik Seeley
OK, so I think I know what's going on.

The current code is more optimized for finding the top K buckets from
a total of N.
When one asks to return the top 10 buckets when there are potentially
millions of buckets, it makes sense to defer calculating other metrics
for those buckets until we know which ones they are.  After we
identify the top 10 buckets, we calculate the domain for that bucket
and use that to calculate the remaining metrics.

The current method is obviously much slower when one is requesting
*all* buckets.  We might as well just calculate all metrics in the
first pass rather than trying to defer them.

This inefficiency is compounded by the fact that the fields are not
indexed.  In the second phase, finding the domain for a bucket is a
field query.  For an indexed field, this would involve a single term
lookup.  For a non-indexed docValues field, this involves a full
column scan.

If you ever want to do quick lookups on studentId, it would make sense
for it to be indexed (and why is it a double, anyway?)

I'll open up a JIRA issue for the first problem (don't defer metrics
if we're going to return all buckets anyway)

-Yonik
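
The cost difference described above can be illustrated with a toy model. This is not Solr's code: `two_phase` and `single_pass` are made-up names, rows are plain (studentId, grade) tuples, and "visits" simply counts how many rows each strategy touches when the bucket field is not indexed.

```python
from collections import Counter, defaultdict


def two_phase(rows, limit):
    """Count buckets first, then compute sum(grade) per chosen bucket by scanning."""
    counts = Counter(student for student, _ in rows)
    visits = len(rows)  # first pass over all rows
    chosen = counts.most_common(None if limit < 0 else limit)
    sums = {}
    for student, _ in chosen:
        # Non-indexed field: finding this bucket's rows means scanning every row.
        sums[student] = sum(grade for s, grade in rows if s == student)
        visits += len(rows)
    return sums, visits


def single_pass(rows):
    """Accumulate sum(grade) for every bucket during one pass over the rows."""
    sums = defaultdict(float)
    for student, grade in rows:
        sums[student] += grade
    return dict(sums), len(rows)


# 1000 rows spread over 50 distinct student ids.
rows = [(i % 50, float(i % 7)) for i in range(1000)]
deferred, deferred_visits = two_phase(rows, limit=-1)
eager, eager_visits = single_pass(rows)
assert deferred == eager
print(deferred_visits, eager_visits)
```

In this model the deferred strategy does one extra full scan per returned bucket, so it only pays off when the limit is small relative to the number of buckets; with limit:-1 every bucket is returned and the single pass does far less work for the same result.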


On Sun, Apr 30, 2017 at 8:58 AM, Mikhail Ibraheem
<mikhail.ibrah...@oracle.com> wrote:
> Hi Yonik,
> We are using Solr 6.5
> Both studentId and grades are double:
>stored="true" docValues="true" multiValued="false" required="false"/>
>
> We have 1.5 million records.
>
> Thanks
> Mikhail
>
> -Original Message-
> From: Yonik Seeley [mailto:ysee...@gmail.com]
> Sent: Sunday, April 30, 2017 1:04 PM
> To: solr-user@lucene.apache.org
> Subject: Re: JSON facet performance for aggregations
>
> It is odd there would be quite such a big performance delta.
> What version of solr are you using?
> What is the fieldType of "grades"?
> -Yonik
>
>
> On Sun, Apr 30, 2017 at 5:15 AM, Mikhail Ibraheem 
> <mikhail.ibrah...@oracle.com> wrote:
>> 1-
>> studentId has docValue = true . it is of type double which is
>> > stored="true" docValues="true" multiValued="false" required="false"/>
>>
>>
>> 2- If we just facet without aggregation it finishes in good time 60ms:
>>
>> json.facet={
>>studentId:{
>>   type:terms,
>>   limit:-1,
>>   field:" studentId "
>>
>>}
>> }
>>
>>
>> Thanks
>>
>>
>> -Original Message-
>> From: Vijay Tiwary [mailto:vijaykr.tiw...@gmail.com]
>> Sent: Sunday, April 30, 2017 10:44 AM
>> To: solr-user@lucene.apache.org
>> Subject: RE: JSON facet performance for aggregations
>>
>> Please enable doc values and try.
>> There is a bug in the source code which causes json facet on string field to 
>> run very slow. On numeric fields it runs fine with doc value enabled.
>>
>> On Apr 30, 2017 1:41 PM, "Mikhail Ibraheem"
>> <mikhail.ibrah...@oracle.com>
>> wrote:
>>
>>> Hi Vijay,
>>> It is already numeric field.
>>> It is huge difference between json and flat here. Do you know the
>>> reason for this? Is there a way to improve it ?
>>>
>>> -Original Message-
>>> From: Vijay Tiwary [mailto:vijaykr.tiw...@gmail.com]
>>> Sent: Sunday, April 30, 2017 9:58 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: JSON facet performance for aggregations
>>>
>>> Json facet on string fields run lot slower than on numeric fields.
>>> Try and see if you can represent studentid as a numeric field.
>>>
>>> On Apr 30, 2017 1:19 PM, "Mikhail Ibraheem"
>>> <mikhail.ibrah...@oracle.com>
>>> wrote:
>>>
>>> > Hi,
>>> >
>>> > I am trying to do aggregation with JSON faceting but performance is
>>> > very bad for one of the requests:
>>> >
>>> > json.facet={
>>> >
>>> >studentId:{
>>> >
>>> >   type:terms,
>>> >
>>> >   limit:-1,
>>> >
>>> >   field:"studentId",
>>> >
>>> >   facet:{
>>> >
>>> >   x:"sum(grades)"
>>> >
>>> >   }
>>> >
>>> >}
>>> >
>>> > }
>>> >
>>> >
>>> >
>>> > This request finishes in 250 seconds, and we can't paginate for
>>> > this service for functional reason so we have to use limit:-1, and
>>> > the cardinality of the studentId is 7500.
>>> >
>>> >
>>> >
>>> > If I try the same with flat facet it finishes in 3 seconds :
>>> > stats=true&facet=true&stats.field={!tag=piv1
>>> > sum=true}grades&facet.pivot={!stats=piv1}studentId
>>> >
>>> >
>>> >
>>> > We are hoping to use one approach json or flat for all our services.
>>> > JSON facet performance is better for many case.
>>> >
>>> >
>>> >
>>> > Please advise on why the performance for this is so bad and if we
>>> > can improve it. Also what is the default algorithm used for json facet.
>>> >
>>> >
>>> >
>>> > Thanks
>>> >
>>> > Mikhail
>>> >
>>>


RE: JSON facet performance for aggregations

2017-04-30 Thread Mikhail Ibraheem
Hi Yonik,
We are using Solr 6.5
Both studentId and grades are double:
  

We have 1.5 million records.

Thanks
Mikhail

-Original Message-
From: Yonik Seeley [mailto:ysee...@gmail.com] 
Sent: Sunday, April 30, 2017 1:04 PM
To: solr-user@lucene.apache.org
Subject: Re: JSON facet performance for aggregations

It is odd there would be quite such a big performance delta.
What version of solr are you using?
What is the fieldType of "grades"?
-Yonik




Re: JSON facet performance for aggregations

2017-04-30 Thread Yonik Seeley
It is odd there would be quite such a big performance delta.
What version of solr are you using?
What is the fieldType of "grades"?
-Yonik


On Sun, Apr 30, 2017 at 5:15 AM, Mikhail Ibraheem
<mikhail.ibrah...@oracle.com> wrote:
> 1-
> studentId has docValues=true. It is of type double, which is
> <fieldType name="double" class="solr.TrieDoubleField" indexed="false" stored="true"
> docValues="true" multiValued="false" required="false"/>
>
>
> 2- If we just facet without aggregation it finishes in good time 60ms:
>
> json.facet={
>studentId:{
>   type:terms,
>   limit:-1,
>   field:" studentId "
>
>}
> }
>
>
> Thanks
>
>


RE: JSON facet performance for aggregations

2017-04-30 Thread Mikhail Ibraheem
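1-
studentId has docValues=true. It is of type double, which is
<fieldType name="double" class="solr.TrieDoubleField" indexed="false" stored="true"
docValues="true" multiValued="false" required="false"/>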

2- If we just facet without aggregation it finishes in good time 60ms:

json.facet={  
   studentId:{  
  type:terms,
  limit:-1,
  field:" studentId "
  
   }
}


Thanks


-Original Message-
From: Vijay Tiwary [mailto:vijaykr.tiw...@gmail.com] 
Sent: Sunday, April 30, 2017 10:44 AM
To: solr-user@lucene.apache.org
Subject: RE: JSON facet performance for aggregations

Please enable doc values and try.
There is a bug in the source code which causes json facet on string field to 
run very slow. On numeric fields it runs fine with doc value enabled.



RE: JSON facet performance for aggregations

2017-04-30 Thread Vijay Tiwary
Please enable doc values and try.
There is a bug in the source code which causes json facet on string field
to run very slow. On numeric fields it runs fine with doc value enabled.

On Apr 30, 2017 1:41 PM, "Mikhail Ibraheem" <mikhail.ibrah...@oracle.com>
wrote:

> Hi Vijay,
> It is already numeric field.
> It is huge difference between json and flat here. Do you know the reason
> for this? Is there a way to improve it ?


RE: JSON facet performance for aggregations

2017-04-30 Thread Mikhail Ibraheem
Hi Vijay,
It is already numeric field.
It is huge difference between json and flat here. Do you know the reason for 
this? Is there a way to improve it ?

-Original Message-
From: Vijay Tiwary [mailto:vijaykr.tiw...@gmail.com] 
Sent: Sunday, April 30, 2017 9:58 AM
To: solr-user@lucene.apache.org
Subject: Re: JSON facet performance for aggregations

Json facet on string fields run lot slower than on numeric fields. Try and see 
if you can represent studentid as a numeric field.



Re: JSON facet performance for aggregations

2017-04-30 Thread Vijay Tiwary
Json facet on string fields run lot slower than on numeric fields. Try and
see if you can represent studentid as a numeric field.



JSON facet performance for aggregations

2017-04-30 Thread Mikhail Ibraheem
Hi,

I am trying to do aggregation with JSON faceting but performance is very bad 
for one of the requests:

json.facet={
  studentId:{
    type:terms,
    limit:-1,
    field:"studentId",
    facet:{
      x:"sum(grades)"
    }
  }
}

This request takes 250 seconds to finish. We can't paginate for this service
for functional reasons, so we have to use limit:-1. The cardinality of
studentId is 7500.

If I try the same with flat faceting, it finishes in 3 seconds:
stats=true&facet=true&stats.field={!tag=piv1 sum=true}grades&facet.pivot={!stats=piv1}studentId

We are hoping to use one approach, JSON or flat, for all our services. JSON
facet performance is better in many cases.

Please advise on why the performance for this is so bad and whether we can
improve it. Also, what is the default algorithm used for JSON facets?
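
For readers following along, the nested request above asks for one bucket per
studentId, with the sum of grades inside each bucket. Here is a minimal Java
sketch of that computation done outside Solr. The class name, method name and
the two parallel arrays are hypothetical stand-ins for the two columns; this
only illustrates the expected result and is not how Solr implements it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SumPerStudent {
    // studentIds[i] and grades[i] belong to the same (hypothetical) document.
    // Returns one bucket per distinct studentId holding the sum of its grades.
    static Map<Double, Double> sumGrades(double[] studentIds, double[] grades) {
        if (studentIds.length != grades.length) {
            throw new IllegalArgumentException("columns must have the same length");
        }
        Map<Double, Double> buckets = new LinkedHashMap<>();
        for (int doc = 0; doc < studentIds.length; doc++) {
            // merge() creates the bucket on first sight, otherwise adds to it.
            buckets.merge(studentIds[doc], grades[doc], Double::sum);
        }
        return buckets;
    }
}
```

With 1.5 million documents and 7500 distinct studentIds, this is a single pass
over the documents and 7500 buckets, which is why a 250 second response looks
out of proportion.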

 

Thanks

Mikhail


Re: prefix facet performance

2017-04-24 Thread Yonik Seeley
In SimpleFacets.getFacetTermEnumCounts, we seek to the first term
matching the prefix using the index and then for each term after
compare the prefix until it no longer matches.

-Yonik


On Mon, Apr 24, 2017 at 5:04 AM, alessandro.benedetti
<a.benede...@sease.io> wrote:
> Thanks Yonik and Maria.
> It make sense, if we reduce the number of terms, term enum becomes a very
> good solution.
> @Yonik : do we still check the prefix on the term dictionary one by one, or
> an FST is used to identify the set of candidate terms ?
>
> I will check the code later,
>
> Regards
>
>
>
> -
> ---
> Alessandro Benedetti
> Search Consultant, R Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/prefix-facet-performance-tp4330684p4331553.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: prefix facet performance

2017-04-24 Thread alessandro.benedetti
Thanks Yonik and Maria.
It makes sense: if we reduce the number of terms, term enum becomes a very
good solution.
@Yonik: do we still check the prefix against the term dictionary one term at
a time, or is an FST used to identify the set of candidate terms?

I will check the code later,

Regards



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: 
http://lucene.472066.n3.nabble.com/prefix-facet-performance-tp4330684p4331553.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: prefix facet performance

2017-04-21 Thread Maria Muslea
I see. Once I specify a prefix the number of terms is MUCH smaller.

Thank you again for all your help.

Maria

On Fri, Apr 21, 2017 at 1:46 PM, Yonik Seeley  wrote:

> On Fri, Apr 21, 2017 at 4:25 PM, Maria Muslea 
> wrote:
> > The field is:
> >
> > 
> >
> > and using unique() I found that it has 700K+ unique values.
> >
> > The query before (that takes ~10s):
> >
> > wt=json&indent=true&q=*:*&rows=0&facet=true&facet.field=concept&facet.prefix=A/
> >
> > the query after (that is almost instant):
> >
> > wt=json&indent=true&q=*:*&rows=0&facet=true&facet.field=concept&facet.prefix=A/&facet.method=enum
>
> Ah, the fact that you specify a facet.prefix makes this perfectly
> aligned for the "enum" method, which can skip directly to the first
> term on-or-after "A/"
> facet.method=enum goes term-by-term, calculating the intersection with
> the facet domain.
> In this case, it's the number of terms that start with "A/" that
> matters, not the number of terms in the entire field (hence the
> speedup).
>
> -Yonik
>


Re: prefix facet performance

2017-04-21 Thread Yonik Seeley
On Fri, Apr 21, 2017 at 4:25 PM, Maria Muslea  wrote:
> The field is:
>
> 
>
> and using unique() I found that it has 700K+ unique values.
>
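> The query before (that takes ~10s):
>
> wt=json&indent=true&q=*:*&rows=0&facet=true&facet.field=concept&facet.prefix=A/
>
> the query after (that is almost instant):
>
> wt=json&indent=true&q=*:*&rows=0&facet=true&facet.field=concept&facet.prefix=A/&facet.method=enum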

Ah, the fact that you specify a facet.prefix makes this perfectly
aligned for the "enum" method, which can skip directly to the first
term on-or-after "A/"
facet.method=enum goes term-by-term, calculating the intersection with
the facet domain.
In this case, it's the number of terms that start with "A/" that
matters, not the number of terms in the entire field (hence the
speedup).
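
The seek-then-scan behaviour described above can be sketched in a few lines of
Java. This is only an illustration under stated assumptions: the sorted
TreeMap stands in for the term dictionary, each BitSet stands in for a term's
posting list, and the class and method names are hypothetical. It is not the
actual SimpleFacets code.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.NavigableMap;

final class PrefixEnumSketch {
    // terms:  sorted term -> ids of the documents containing that term
    // domain: ids of the documents matched by the query (the facet domain)
    // Returns the count per matching term, skipping terms with no hits.
    static Map<String, Integer> count(NavigableMap<String, BitSet> terms,
                                      BitSet domain, String prefix) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        // Seek directly to the first term on-or-after the prefix.
        for (Map.Entry<String, BitSet> e : terms.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break; // terms are sorted, so no later term can match
            }
            // Intersect this term's documents with the facet domain.
            BitSet hits = (BitSet) e.getValue().clone();
            hits.and(domain);
            if (hits.cardinality() > 0) {
                counts.put(e.getKey(), hits.cardinality());
            }
        }
        return counts;
    }
}
```

The loop body runs once per term that starts with the prefix, so the cost
follows the number of "A/" terms rather than the 700K+ terms in the field.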

-Yonik


Re: prefix facet performance

2017-04-21 Thread Maria Muslea
The field is:



and using unique() I found that it has 700K+ unique values.

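The query before (that takes ~10s):

wt=json&indent=true&q=*:*&rows=0&facet=true&facet.field=concept&facet.prefix=A/

the query after (that is almost instant):

wt=json&indent=true&q=*:*&rows=0&facet=true&facet.field=concept&facet.prefix=A/&facet.method=enum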

Maria

On Fri, Apr 21, 2017 at 8:59 AM, alessandro.benedetti <a.benede...@sease.io>
wrote:

> That is quite interesting !
> You can use the stats module ( in association with the Json facets if you
> need it) to calculate an accurate approximation of the unique values [1]
> [2]
> .
>
> Good to know it improved your scenario, I may need to update my knowledge
> of
> term enum internals!
> Can you describe your schema configuration for the field and the way you
> were faceting before in comparison to the way you facet now ( with the
> related benefit)
>
> [1] https://cwiki.apache.org/confluence/display/solr/The+Stats+Component
> [2] http://yonik.com/solr-count-distinct/
>
>
>
> -
> ---
> Alessandro Benedetti
> Search Consultant, R Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> View this message in context: http://lucene.472066.n3.
> nabble.com/prefix-facet-performance-tp4330684p4331309.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: prefix facet performance

2017-04-21 Thread alessandro.benedetti
That is quite interesting!
You can use the stats module (in association with the JSON facets if you
need it) to calculate an accurate approximation of the number of unique
values [1] [2].

Good to know it improved your scenario; I may need to update my knowledge of
term enum internals!
Can you describe your schema configuration for the field, and the way you
were faceting before compared with the way you facet now (with the
related benefit)?

[1] https://cwiki.apache.org/confluence/display/solr/The+Stats+Component
[2] http://yonik.com/solr-count-distinct/



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: 
http://lucene.472066.n3.nabble.com/prefix-facet-performance-tp4330684p4331309.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: prefix facet performance

2017-04-21 Thread Maria Muslea
Actually using facet.method=enum made a HUGE difference even in my case
where I have many unique values. I am happy with the query response time
now.

Is there a way in SOLR to count the unique values for a field? If not, I
could run the reindexing and count the unique values while I add them to
give you a more accurate count of how many I have (there is a good chance
that I have more than 500K).

Thanks,
Maria

On Fri, Apr 21, 2017 at 1:16 AM, alessandro.benedetti <a.benede...@sease.io>
wrote:

> Hi Maria,
> If you have 100-500.000 unique values for the field you are interested in,
> and the cardinality of your search results is actually quite small in
> comparison, I am not that sure term enum will help you that much ...
>
> To simplify, with the term enum approach, you iterate over each unique
> value, if it matches the prefix and then you count the intersection of the
> result set with the posting list for that term.
> In your case, your result set is likely to be much smaller than the number
> of unique values.
> I would assume you are using the fc approach, which in my opinion was not a
> bad idea.
> Let's start from the algorithm you are using and the schema config for your
> field,
>
> Cheers
>
>
>
> -
> ---
> Alessandro Benedetti
> Search Consultant, R Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> View this message in context: http://lucene.472066.n3.
> nabble.com/prefix-facet-performance-tp4330684p4331221.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: prefix facet performance

2017-04-21 Thread alessandro.benedetti
Hi Maria,
If you have 100,000-500,000 unique values for the field you are interested in,
and the cardinality of your search results is actually quite small in
comparison, I am not sure term enum will help you that much...

To simplify: with the term enum approach, you iterate over each unique
value and, if it matches the prefix, you count the intersection of the
result set with the posting list for that term.
In your case, your result set is likely to be much smaller than the number
of unique values.
I would assume you are using the fc approach, which in my opinion was not a
bad idea.
Let's start from the algorithm you are using and the schema config for your
field.

Cheers



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: 
http://lucene.472066.n3.nabble.com/prefix-facet-performance-tp4330684p4331221.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: prefix facet performance

2017-04-18 Thread Maria Muslea
Hmmm, not sure. Probably in the range of 100K-500K.

Before writing the email I was just looking at:
http://yonik.com/facet-performance/

Wow, using facet.method=enum makes a big difference. I will read on it to
understand what it does.

Thank you so much.

Maria

On Tue, Apr 18, 2017 at 5:21 PM, Yonik Seeley <ysee...@gmail.com> wrote:

> How many unique values in the index?
> You could try facet.method=enum
>
> -Yonik
>
>
> On Tue, Apr 18, 2017 at 8:16 PM, Maria Muslea <maria.mus...@gmail.com>
> wrote:
> > Hi,
> >
> > I have ~40K documents in SOLR (not many) and a multivalued facet field
> that
> > contains at least 2K values per document.
> >
> > The values of the facet field look like: A/B, A/C, A/D, C/E, M/F, etc,
> and
> > I use facet.prefix.
> >
> > q=*:*&rows=0&facet=true&facet.field=concept&facet.prefix=A/
> >
> >
> > with "concept" defined as:
> >
> >
> > 
> >
> >
> > This generates the output that I am looking for, but it takes more than
> 10
> > seconds per query.
> >
> >
> > Is there any way that I could improve the facet query performance for
> this
> > example?
> >
> >
> > Thank you,
> >
> > Maria
>


Re: prefix facet performance

2017-04-18 Thread Yonik Seeley
How many unique values in the index?
You could try facet.method=enum

-Yonik


On Tue, Apr 18, 2017 at 8:16 PM, Maria Muslea  wrote:
> Hi,
>
> I have ~40K documents in SOLR (not many) and a multivalued facet field that
> contains at least 2K values per document.
>
> The values of the facet field look like: A/B, A/C, A/D, C/E, M/F, etc, and
> I use facet.prefix.
>
> q=*:*&rows=0&facet=true&facet.field=concept&facet.prefix=A/
>
>
> with "concept" defined as:
>
>
> 
>
>
> This generates the output that I am looking for, but it takes more than 10
> seconds per query.
>
>
> Is there any way that I could improve the facet query performance for this
> example?
>
>
> Thank you,
>
> Maria


prefix facet performance

2017-04-18 Thread Maria Muslea
Hi,

I have ~40K documents in SOLR (not many) and a multivalued facet field that
contains at least 2K values per document.

The values of the facet field look like: A/B, A/C, A/D, C/E, M/F, etc, and
I use facet.prefix.

q=*:*&rows=0&facet=true&facet.field=concept&facet.prefix=A/


with "concept" defined as:





This generates the output that I am looking for, but it takes more than 10
seconds per query.


Is there any way that I could improve the facet query performance for this
example?


Thank you,

Maria


Re: 5.4 facet performance thumbs-up

2015-12-23 Thread Yonik Seeley
Awesome, thanks for the feedback!

-Yonik

On Tue, Dec 22, 2015 at 5:36 PM, Aigner, Max  wrote:
> I'm happy to report that we are seeing significant speed-ups in our queries 
> with Json facets on 5.4 vs regular facets on 5.1. Our queries contain mostly 
> terms facets, many of them with exclusion tags and prefix filtering.
> Nice work!


5.4 facet performance thumbs-up

2015-12-22 Thread Aigner, Max
I'm happy to report that we are seeing significant speed-ups in our queries 
with Json facets on 5.4 vs regular facets on 5.1. Our queries contain mostly 
terms facets, many of them with exclusion tags and prefix filtering.
Nice work!



Re: (Issue) How improve solr facet performance

2014-05-27 Thread Alice.H.Yang (mis.cnsh04.Newegg) 41493
Hi Toke,

1.
I set the 3 fields with hundreds of values to use fc and the rest to use
enum; performance improved 2 times compared with no parameter. Then I added
facet.method=20, and performance improved about 4 times compared with no
parameter.
I also tried copying the 9 facet fields into one copyField; in my test,
performance improved about 2.5 times compared with no parameter.
So it improved a lot thanks to your advice. Thanks a lot.
2.
Now I have another performance issue: grouping performance. The amount of
data is the same as in the facet performance scenario.
When the keyword search hits about one million documents, the QTime is about
600 ms. (This is not the first run of the query; it is already cached.)

Query URL:
select?fl=item_catalog&q=default_search:paramter&defType=edismax&rows=50&group=true&group.field=item_group_id&group.ngroups=true&group.sort=stock4sort%20desc,final_price%20asc,is_selleritem%20asc&sort=score%20desc,default_sort%20desc

It needs a QTime of about 600 ms.

This query has two notable parameters:
1. fl with one field
2. group=true,
group.ngroups=true

If I set group=false, the QTime is only 1 ms.
But I need both group and group.ngroups. How can I improve grouping
performance under this requirement? Do you have any advice for me? I'm
looking forward to your reply.

Best Regards,
Alice Yang
+86-021-51530666*41493
Floor 19,KaiKai Plaza,888,Wanhandu Rd,Shanghai(200042)


-Original Message-
From: Toke Eskildsen [mailto:t...@statsbiblioteket.dk]
Sent: May 24, 2014 15:17
To: solr-user@lucene.apache.org
Subject: RE: (Issue) How improve solr facet performance

Alice.H.Yang (mis.cnsh04.Newegg) 41493 [alice.h.y...@newegg.com] wrote:
> 1.  I'm sorry, I have made a mistake, the total number of documents is 32
> Million, not 320 Million.
> 2.  The system memory is large for solr index, OS total has 256G, I set the
> solr tomcat HEAPSIZE=-Xms25G -Xmx100G

100G is a very high number. What special requirements dictate such a large
heap size?

> Reply:  9 fields I facet on.

Solr treats each facet separately and with facet.method=fc and 10M hits, this
means that it will iterate 9*10M = 90M document IDs and update the counters for
those.

> Reply:  3 facet fields have one hundred unique values, other 6 facet fields'
> unique values are between 3 to 15.

So very low cardinality. This is confirmed by your low response time of 6ms for
2925 hits.

> And we test this scenario:  If the number of facet fields' unique values is
> less we add facet.method=enum, there is a little to improve performance.
That is a shame: enum is normally the simple answer to a setup like yours. Have 
you tried fine-tuning your fc/enum selection, so that the 3 fields with 
hundreds of values uses fc and the rest uses enum? That might halve your 
response time.


Since the number of unique facets is so low, I do not think that DocValues can 
help you here. Besides the fine-grained fc/enum-selection above, you could try 
collapsing all 9 facet-fields into a single field. The idea behind this is that 
for facet.method=fc, performing faceting on a field with (for example) 300 
unique values takes practically the same amount of time as faceting on a field 
with 1000 unique values: Faceting on a single slightly larger field is much 
faster than faceting on 9 smaller fields. After faceting with facet.limit=-1 on 
the single super-facet-field, you must match the returned values back to their 
original fields:


If you have the facet-fields

field0: 34
field1: 187
field2: 78432
field3: 3
...

then collapse them by or-ing a field-specific mask that is bigger than the max 
in any field, then put it all into a single field:

fieldAll: 0xA000 | 34
fieldAll: 0xA100 | 187
fieldAll: 0xA200 | 78432
fieldAll: 0xA300 | 3
...

perform the facet request on fieldAll with facet.limit=-1 and split the 
resulting counts with

for (entry: facetResultAll) {
  switch (0xFF00 & entry.value) {
    case 0xA000:
      field0.add(entry.value, entry.count);
      break;
    case 0xA100:
      field1.add(entry.value, entry.count);
      break;
    ...
  }
}
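
A self-contained Java sketch of the same pack-and-split idea follows. It is a
variation on the masks above, not a copy of them: it assumes every original
value fits in 24 bits and puts a small field id in the bits above that, so the
two parts can never overlap. The class name, method names and the 24-bit split
are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PackedFacet {
    static final int VALUE_BITS = 24;                    // assumption: value < 2^24
    static final int VALUE_MASK = (1 << VALUE_BITS) - 1;

    // Index time: combine field id and value into one int for the shared field.
    static int pack(int fieldId, int value) {
        if (fieldId < 0 || fieldId > 0x7F) {
            throw new IllegalArgumentException("fieldId out of range: " + fieldId);
        }
        if (value < 0 || value > VALUE_MASK) {
            throw new IllegalArgumentException("value out of range: " + value);
        }
        return (fieldId << VALUE_BITS) | value;
    }

    static int fieldOf(int packed) { return packed >>> VALUE_BITS; }

    static int valueOf(int packed) { return packed & VALUE_MASK; }

    // Query time: split the counts of the combined field back into one map
    // (original value -> count) per original field.
    static List<Map<Integer, Long>> split(Map<Integer, Long> combined, int fieldCount) {
        List<Map<Integer, Long>> perField = new ArrayList<>();
        for (int i = 0; i < fieldCount; i++) {
            perField.add(new HashMap<>());
        }
        for (Map.Entry<Integer, Long> e : combined.entrySet()) {
            int packed = e.getKey();
            perField.get(fieldOf(packed)).put(valueOf(packed), e.getValue());
        }
        return perField;
    }
}
```

The range checks in pack() matter: a value larger than the reserved bits would
silently spill into the field id and be counted under the wrong field.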


Regards,
Toke Eskildsen, State and University Library, Denmark


Re: Re: (Issue) How improve solr facet performance

2014-05-27 Thread david.w.smi...@gmail.com
Alice,

RE grouping, try Solr 4.8’s new “collapse” qparser with the “expand”
SearchComponent.  The ref guide has the docs.  It’s usually a faster
equivalent approach to group=true.

Do you care to comment further on NewEgg’s apparent switch from Endeca to
Solr?  (confirm true/false and rationale)

~ David Smiley
Freelance Apache Lucene/Solr Search Consultant/Developer
http://www.linkedin.com/in/davidwsmiley



RE: (Issue) How improve solr facet performance

2014-05-24 Thread Toke Eskildsen
Alice.H.Yang (mis.cnsh04.Newegg) 41493 [alice.h.y...@newegg.com] wrote:
 1.  I'm sorry, I have made a mistake, the total number of documents is 32 
 Million, not 320 Million.
 2.  The system memory is large for solr index, OS total has 256G, I set the 
 solr tomcat HEAPSIZE=-Xms25G -Xmx100G

100G is a very high number. What special requirements dictate such a large 
heap size?

 Reply:  9 fields I facet on.

Solr treats each facet separately and with facet.method=fc and 10M hits, this 
means that it will iterate 9*10M = 90M document IDs and update the counters for 
those.

 Reply:  3 facet fields have one hundred unique values, other 6 facet fields' 
 unique values are between 3 to 15.

So very low cardinality. This is confirmed by your low response time of 6ms for 
2925 hits.

 And we test this scenario:  If the number of facet fields' unique values is 
 less we add facet.method=enum, there is a little to improve performance.

That is a shame: enum is normally the simple answer to a setup like yours. Have 
you tried fine-tuning your fc/enum selection, so that the 3 fields with 
hundreds of values use fc and the rest use enum? That might halve your 
response time.
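
A rough sketch of the per-field selection suggested above: Solr lets a request override a facet parameter for a single field with the f.<fieldname>.<parameter> syntax, so facet.method can be chosen field by field. The class name FacetMethodParams and the field names used in the usage note (brand, color, size) are hypothetical, and the values are assumed to be URL-safe already.

```java
import java.util.List;
import java.util.StringJoiner;

final class FacetMethodParams {
    /** Builds a query string that facets on every given field, using fc for some and enum for others. */
    static String build(String q, List<String> fcFields, List<String> enumFields) {
        StringJoiner params = new StringJoiner("&");
        params.add("q=" + q);      // assumed to be URL-safe already; no encoding is done here
        params.add("rows=0");
        params.add("facet=true");
        addFields(params, fcFields, "fc");
        addFields(params, enumFields, "enum");
        return params.toString();
    }

    private static void addFields(StringJoiner params, List<String> fields, String method) {
        for (String field : fields) {
            params.add("facet.field=" + field);
            // Per-field override: applies to this field only.
            params.add("f." + field + ".facet.method=" + method);
        }
    }
}
```

Calling build with one fc field (brand) and two enum fields (color, size) yields one facet.field parameter and one f.<field>.facet.method override per field, so the effect of each choice can be timed separately.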


Since the number of unique facets is so low, I do not think that DocValues can 
help you here. Besides the fine-grained fc/enum-selection above, you could try 
collapsing all 9 facet-fields into a single field. The idea behind this is that 
for facet.method=fc, performing faceting on a field with (for example) 300 
unique values takes practically the same amount of time as faceting on a field 
with 1000 unique values: Faceting on a single slightly larger field is much 
faster than faceting on 9 smaller fields. After faceting with facet.limit=-1 on 
the single super-facet-field, you must match the returned values back to their 
original fields:


If you have the facet-fields

field0: 34
field1: 187
field2: 78432
field3: 3
...

then collapse them by or-ing a field-specific mask that is bigger than the max 
in any field, then put it all into a single field:

fieldAll: 0xA000 | 34
fieldAll: 0xA100 | 187
fieldAll: 0xA200 | 78432
fieldAll: 0xA300 | 3
...

perform the facet request on fieldAll with facet.limit=-1 and split the 
resulting counts with

for (entry: facetResultAll) {
  // Mask off the low bits to recover the field-specific prefix
  switch (0xFF00 & entry.value) {
    case 0xA000:
      field0.add(entry.value, entry.count);
      break;
    case 0xA100:
      field1.add(entry.value, entry.count);
      break;
    ...
  }
}
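
For illustration only, here is a self-contained sketch of the same collapse-and-split idea. It is not Solr code: the class SuperFacet and its method names are hypothetical. Note that a value such as 78432 does not fit in the low 8 bits left free by the 0xFF00 mask, so this sketch assumes a wider layout, with the field id in the high 32 bits of a long and the value in the low 32 bits.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SuperFacet {
    static final int VALUE_BITS = 32;
    static final long VALUE_MASK = (1L << VALUE_BITS) - 1;

    /** Packs a field id and a non-negative value into one combined facet value. */
    static long encode(int fieldId, long value) {
        if (fieldId < 0 || value < 0 || value > VALUE_MASK) {
            throw new IllegalArgumentException("field id or value out of range");
        }
        return ((long) fieldId << VALUE_BITS) | value;
    }

    static int fieldOf(long combined) {
        return (int) (combined >>> VALUE_BITS);
    }

    static long valueOf(long combined) {
        return combined & VALUE_MASK;
    }

    /** Splits counts returned for the combined field back into per-field counts. */
    static Map<Integer, Map<Long, Long>> split(Map<Long, Long> combinedCounts) {
        Map<Integer, Map<Long, Long>> perField = new LinkedHashMap<>();
        for (Map.Entry<Long, Long> e : combinedCounts.entrySet()) {
            perField.computeIfAbsent(fieldOf(e.getKey()), k -> new LinkedHashMap<>())
                    .put(valueOf(e.getKey()), e.getValue());
        }
        return perField;
    }
}
```

The encode step would run at index time and the split step on the client after the single facet request with facet.limit=-1 returns.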


Regards,
Toke Eskildsen, State and University Library, Denmark


fw: (Issue) How improve solr facet performance

2014-05-23 Thread Alice.H.Yang (mis.cnsh04.Newegg) 41493
Hi, Solr Developer

  Thanks very much for your timely reply.

1.  I'm sorry, I have made a mistake, the total number of documents is 32 
Million, not 320 Million.
2.  The system memory is large for solr index, OS total has 256G, I set the 
solr tomcat HEAPSIZE=-Xms25G -Xmx100G

-How many fields are you faceting on?

Reply:  9 fields I facet on.

- How many unique values does your facet fields have (approximately)?

Reply:  3 facet fields have one hundred unique values, other 6 facet fields' 
unique values are between 3 to 15. 


- What is the content of your facets (Strings, numbers?)

Reply:  9 fields are all numbers.

- Which facet.method do you use?

Reply:  Used the default facet.method=fc

And we tested this scenario:  when the facet fields have few unique values 
we add facet.method=enum, which improves performance only a little.

- What is the response time with faceting and a few thousand hits?

Reply:   <result name="response" numFound="2925" start="0">
   QTime is  <int name="QTime">6</int>


Best Regards,
Alice Yang
+86-021-51530666*41493
Floor 19,KaiKai Plaza,888,Wanhandu Rd,Shanghai(200042)

-Original Message-
From: Toke Eskildsen [mailto:t...@statsbiblioteket.dk]
Sent: Friday, May 23, 2014 8:08 PM
To: d...@lucene.apache.org
Subject: Re: (Issue) How improve solr facet performance

On Fri, 2014-05-23 at 11:45 +0200, Alice.H.Yang (mis.cnsh04.Newegg)
41493 wrote:
We are blocked by solr facet performance when query hits many 
 documents. (about 10,000,000)

[320M documents, immediate response for plain search with 1M hits]

 But when we add several facet.field parameters, QTime increases to 
 220ms or more.

It is not clear whether your observation of increased response time is due to 
many hits or faceting in itself.

- How many fields are you faceting on?
- How many unique values does your facet fields have (approximately)?
- What is the content of your facets (Strings, numbers?)
- Which facet.method do you use?
- What is the response time with faceting and a few thousand hits?

 Do you have some advice on how improve the facet performance when hit 
 many documents.

That depends on whether your bottleneck is the hitcount itself, the number of 
unique facet values, or something else such as I/O.


- Toke Eskildsen, State and University Library, Denmark






RE: Facet performance

2013-10-23 Thread Toke Eskildsen
On Tue, 2013-10-22 at 17:25 +0200, Lemke, Michael SZ/HZA-ZSW wrote:
 On Tue, October 22, 2013 11:54 AM Andre Bois-Crettez wrote:
  This is with Solr 1.4.
 Really?
 This sounds really outdated to me.
 Have you tried a more recent version? 4.5 just went out.
 
 Sorry, can't.  Too much `grown' stuff.

I did not see that. I guess I parsed it as 4.1.

Well, that rules out DocValues and fcs (as far as I remember). I am a
bit surprised that the limit on #terms with fc is also in 1.4. I thought
it was introduced in a later version.

We too have been in a position where upgrading was hard due to homegrown
addons. We even scrapped some DidYouMean-like functionality when going
from 3.x to 4.x, but 4.x was so much better that there was little
choice.

Last suggestion for using fc: Create 2 or more CONTENT-fields and choose
between them randomly when indexing. Facet on all the CONTENT fields and
merge the results. It will take a bit more RAM though, so it is still
out on your (presumably) 32 bit machine.
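
A minimal sketch of the merge step, assuming each document's content was indexed into exactly one of the split fields, so that a term's total count is the sum of its counts across those fields. The class FacetMerge is hypothetical, as is the idea of passing one count map per split field.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class FacetMerge {
    /** Sums the per-term counts returned for each of the split fields. */
    static Map<String, Long> merge(List<Map<String, Long>> perFieldCounts) {
        Map<String, Long> total = new TreeMap<>();
        for (Map<String, Long> counts : perFieldCounts) {
            for (Map.Entry<String, Long> e : counts.entrySet()) {
                // A term seen in several split fields gets its counts added together.
                total.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        return total;
    }
}
```

If each field is requested with a small facet.limit, a term that ranks low in every individual field can be missed, so the merged top entries are only approximate unless the per-field limits are raised.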

Regards,
Toke Eskildsen, State and University Library, Denmark



RE: Facet performance

2013-10-23 Thread Lemke, Michael SZ/HZA-ZSW
On Tue, October 22, 2013 5:23 PM Michael Lemke wrote:
On Tue, October 22, 2013 9:23 AM Toke Eskildsen wrote:
On Mon, 2013-10-21 at 16:57 +0200, Lemke, Michael SZ/HZA-ZSW wrote:
 QTime fc:
never returns, webserver restarts itself after 30 min with 100% CPU 
 load

It might be because it dies due to garbage collection. But since more
memory (as your test server presumably has) just leads to the too many
values-error, there isn't much to do.

Essentially, fc is out then.


 QTime=41205   facet.prefix=    q=frequent_word   numFound=44532
 
 Same query repeated:
 QTime=225810  facet.prefix=    q=ottomotor       numFound=909
 QTime=199839  facet.prefix=    q=ottomotor       numFound=909

I am stumped on this, sorry. I do not understand why the 'ottomotor'
query can take 5 times as long as the 'frequent_word'-one.

I looked into this some more this morning.  I noticed the java process was 
doing a lot of I/O as shown in Process Explorer.  For the frequent_word it read 
about 180MB, for ottomotor it was about seven times as much, ~ 1,200 MB.


Got another observation today.  The response time for q=ottomotor depends on 
facet.limit:

QTime=59300  facet.limit=2
QTime=69395  facet.limit=4
QTime=85208  facet.limit=6
QTime=158150 facet.limit=8
QTime=186276 facet.limit=10
QTime=231763 facet.limit=15
QTime=260437 facet.limit=20
QTime=312268 facet.limit=30

For q=frequent_word the result is much less pronounced and shows only
for facet.limit = 15 :

QTime=0  facet.limit=0
QTime=20535  facet.limit=1
QTime=13456  facet.limit=2
QTime=13925  facet.limit=4
QTime=13705  facet.limit=6
QTime=13924  facet.limit=8
QTime=13799  facet.limit=10
QTime=14361  facet.limit=15
QTime=14704  facet.limit=20
QTime=15189  facet.limit=30
QTime=16783  facet.limit=50
QTime=57128  facet.limit=500

It looks to me as if, to collect enough facets to fulfill the limit constraint,
Solr has to read much more of the index in the case of the infrequent word.

jconsole didn't show anything unusual according to our more experienced Java 
experts here.  Nor was the machine swapping.

Is it possible to screw up an index such that this sort of faceting leads to
constant reading of the index?  Something like full table scans in a db?


Michael




RE: Facet performance

2013-10-22 Thread Toke Eskildsen
On Mon, 2013-10-21 at 16:57 +0200, Lemke, Michael SZ/HZA-ZSW wrote:
 QTime enum:
  1st call: 1200
  subsequent calls: 200

Those numbers seem fine.

 QTime fc:
never returns, webserver restarts itself after 30 min with 100% CPU 
 load

It might be because it dies due to garbage collection. But since more
memory (as your test server presumably has) just leads to the too many
values-error, there isn't much to do.

 QTime=41205   facet.prefix=    q=frequent_word   numFound=44532
 
 Same query repeated:
 QTime=225810  facet.prefix=    q=ottomotor       numFound=909
 QTime=199839  facet.prefix=    q=ottomotor       numFound=909

I am stumped on this, sorry. I do not understand why the 'ottomotor'
query can take 5 times as long as the 'frequent_word'-one.

 QTime=185948  facet.prefix=    q=ottomotor       numFound=909
 
 QTime=3344    facet.prefix=d   q=ottomotor       numFound=909

Fits with expectations.

 - Documents in your index
 13,434,414
 
 - Unique values in the CONTENT field
 Not sure how to get this.  In luke I find
 21,797,514 term count CONTENT

Those are the relevant numbers for faceting. There is a limit of 2^24
(16M) terms for facet.method=fc, although I am a bit unsure if that is
for the whole index or per segment.

Come to think of it, if you have a multi-segmented index, you might want
to try facet.method=fcs. It should have faster startup than fc and
better performance than enum for fields with a large number of unique
values. Memory requirements should be between fc and enum.

 - Xmx
 The maximum the system allows me to get: 1612m
 
 Maybe I have a hopelessly under-dimensioned server for this sort of things?

Well, 1612m should be enough for the faceting in itself; it is the
startup that is the killer. 

A rule of thumb for fc is that the internal structure takes at least
#docs*log(#references) + #references*log(#unique_values) bytes

If your content field is a description, let's say that each description
has 40 words, which gives us 500M references from documents to facet
values. This translates to
13M*log(500M) + 500M*log(22M) bytes ~= 13M*29 + 500M*25 bytes ~= 380MB.

Taking into account that building the structure has an overhead of 2-3
times that, we are approaching the memory limit of 1612m. If the index
is updated, a new facet structure is built all over again while the old
structure is still in memory.
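
To make the rule of thumb easier to play with, here is a small sketch that evaluates its two terms separately. It assumes log means base 2 rounded up (which matches the 29 and 25 used above) and uses the rounded inputs from the message; the class and method names are hypothetical. The message does not spell out how the two terms combine into the final figure, so the sketch only evaluates them.

```java
final class FcMemoryEstimate {
    /** Smallest number of bits able to address n distinct positions, i.e. ceil(log2(n)). */
    static int ceilLog2(long n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be positive");
        }
        return 64 - Long.numberOfLeadingZeros(n - 1);
    }

    /** First term of the rule of thumb: #docs * log(#references). */
    static long docsTerm(long docs, long references) {
        return docs * ceilLog2(references);
    }

    /** Second term of the rule of thumb: #references * log(#unique_values). */
    static long referencesTerm(long references, long uniqueValues) {
        return references * ceilLog2(uniqueValues);
    }

    public static void main(String[] args) {
        long docs = 13_000_000L;          // ~13M documents
        long references = 500_000_000L;   // ~40 words per document, rounded
        long uniqueValues = 22_000_000L;  // ~22M terms in CONTENT
        System.out.println("docs term:       " + docsTerm(docs, references));
        System.out.println("references term: " + referencesTerm(references, uniqueValues));
    }
}
```

With these inputs the first term evaluates to 377,000,000 and the second to 12,500,000,000; plugging in your own document and term counts shows quickly which term dominates.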


If you need better performance on your large field I would suggest, in
order of priority:

- facet.method=fcs
- facet.method=fcs with DocValues
- Shard your index and use facet.method=fc
- SOLR-2412 (https://issues.apache.org/jira/browse/SOLR-2412)

SOLR-2412 is a last resort, but it does have the same speed as
facet.method=fc only without the 16M unique values limitation.

Regards,
Toke Eskildsen, State and University Library, Denmark



Re: Facet performance

2013-10-22 Thread Andre Bois-Crettez

This is with Solr 1.4.

Really?
This sounds really outdated to me.
Have you tried a more recent version? 4.5 just went out.

--
André Bois-Crettez

Software Architect
Search Developer
http://www.kelkoo.com/




RE: Facet performance

2013-10-22 Thread Lemke, Michael SZ/HZA-ZSW
On Tue, October 22, 2013 9:23 AM Toke Eskildsen wrote:
On Mon, 2013-10-21 at 16:57 +0200, Lemke, Michael SZ/HZA-ZSW wrote:
 QTime fc:
never returns, webserver restarts itself after 30 min with 100% CPU 
 load

It might be because it dies due to garbage collection. But since more
memory (as your test server presumably has) just leads to the too many
values-error, there isn't much to do.

Essentially, fc is out then.


 QTime=41205   facet.prefix=    q=frequent_word   numFound=44532
 
 Same query repeated:
 QTime=225810  facet.prefix=    q=ottomotor       numFound=909
 QTime=199839  facet.prefix=    q=ottomotor       numFound=909

I am stumped on this, sorry. I do not understand why the 'ottomotor'
query can take 5 times as long as the 'frequent_word'-one.

I looked into this some more this morning.  I noticed the java process was doing
a lot of I/O as shown in Process Explorer.  For the frequent_word it read about 
180MB, for ottomotor it was about seven times as much, ~ 1,200 MB.

jconsole didn’t show anything unusual according to our more experienced Java 
experts here.  Nor was the machine swapping.

Is it possible to screw up an index such that this sort of faceting leads to
constant reading of the index?  Something like full table scans in a db?

Michael


RE: Facet performance

2013-10-22 Thread Lemke, Michael SZ/HZA-ZSW
On Tue, October 22, 2013 11:54 AM Andre Bois-Crettez wrote:

 This is with Solr 1.4.
Really?
This sounds really outdated to me.
Have you tried a more recent version? 4.5 just went out.

Sorry, can't.  Too much `grown' stuff.

Michael


RE: Facet performance

2013-10-21 Thread Toke Eskildsen
On Fri, 2013-10-18 at 18:30 +0200, Lemke, Michael SZ/HZA-ZSW wrote:
 Toke Eskildsen [mailto:t...@statsbiblioteket.dk] wrote:
  Unfortunately the enum-solution is normally quite slow when there
  are enough unique values to trigger the too many  values-exception.
  [...]
 
 [...] And yes, the fc method was terribly slow in a case where it did
 work.  Something like 20 minutes whereas enum returned within a few
 seconds.

Err.. What? That sounds _very_ strange. You have millions of unique
values so fc should be a lot faster than enum, not the other way around.

I assume the 20 minutes was for the first call. How fast do subsequent
calls return for fc?


Maybe you could provide some approximate numbers?

- Documents in your index
- Unique values in the CONTENT field
- Hits returned from a typical query
- Xmx

Regards,
Toke Eskildsen, State and University Library, Denmark



RE: Facet performance

2013-10-21 Thread Lemke, Michael SZ/HZA-ZSW
On Mon, October 21, 2013 10:04 AM, Toke Eskildsen wrote:
On Fri, 2013-10-18 at 18:30 +0200, Lemke, Michael SZ/HZA-ZSW wrote:
 Toke Eskildsen wrote:
  Unfortunately the enum-solution is normally quite slow when there
  are enough unique values to trigger the too many  values-exception.
  [...]
 
 [...] And yes, the fc method was terribly slow in a case where it did
 work.  Something like 20 minutes whereas enum returned within a few
 seconds.

Err.. What? That sounds _very_ strange. You have millions of unique
values so fc should be a lot faster than enum, not the other way around.

I assume the 20 minutes was for the first call. How fast does subsequent
calls return for fc?

QTime enum:
 1st call: 1200
 subsequent calls: 200

QTime fc:
   never returns, webserver restarts itself after 30 min with 100% CPU load


This is on the test system, the production system managed to return with
... Too many values for UnInvertedField faceting 

However, I also have different faceting queries I played with today.

One complete example:

q=ottomotor&facet.field=CONTENT&facet=true&facet.prefix=&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0

These are the results, all with facet.method=enum (fc doesn't work).  They
were executed in the sequence shown on an otherwise unused server:

QTime=41205   facet.prefix=    q=frequent_word       numFound=44532

Same query repeated:
QTime=225810  facet.prefix=    q=ottomotor           numFound=909
QTime=199839  facet.prefix=    q=ottomotor           numFound=909

QTime=0       facet.prefix=    q=ottomotor jkdhwjfh  numFound=0
QTime=0       facet.prefix=    q=jkdhwjfh            numFound=0

QTime=185948  facet.prefix=    q=ottomotor           numFound=909

QTime=3344    facet.prefix=d   q=ottomotor           numFound=909
QTime=3078    facet.prefix=d   q=ottomotor           numFound=909
QTime=3141    facet.prefix=d   q=ottomotor           numFound=909

The response time is obviously not dependent on the number of documents found.
Caching doesn't kick in either.



Maybe you could provide some approximate numbers?

I'll try, see below.  Thanks for asking and having a closer look.


- Documents in your index
13,434,414

- Unique values in the CONTENT field
Not sure how to get this.  In luke I find
21,797,514 term count CONTENT

Is that what you mean?

- Hits are returned from a typical query
Hm, that can be anything between 0 and 40,000 or more.
Or do you mean from the facets?  Or do my tests above
answer it?

- Xmx
The maximum the system allows me to get: 1612m


Maybe I have a hopelessly under-dimensioned server for this sort of things?

Thanks a lot for your help,
Michael


Facet performance

2013-10-18 Thread Lemke, Michael SZ/HZA-ZSW
I am working with Solr facet fields and have come across a 
performance problem I don't understand. Consider these 
two queries:

1. 
q=word&facet.field=CONTENT&facet=true&facet.prefix=&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0

2. 
q=word&facet.field=CONTENT&facet=true&facet.prefix=a&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0

The only difference is an empty facet.prefix in the first query.

The first query returns after some 20 seconds (QTime 2 in the result) while 
the second one takes only 80 msec (QTime 80). Why is this?

And as a side note: facet.method=fc makes the queries run 'forever' and 
eventually fail with org.apache.solr.common.SolrException: Too many values for 
UnInvertedField faceting on field CONTENT.

This is with Solr 1.4.




RE: Facet performance

2013-10-18 Thread Toke Eskildsen
Lemke, Michael  SZ/HZA-ZSW [lemke...@schaeffler.com] wrote:
 1. 
 q=word&facet.field=CONTENT&facet=true&facet.prefix=&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0
 2. 
 q=word&facet.field=CONTENT&facet=true&facet.prefix=a&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0

 The only difference is an empty facet.prefix in the first query.

 The first query returns after some 20 seconds (QTime 2 in the result) 
 while
 the second one takes only 80 msec (QTime 80). Why is this?

If your index was just opened when you issued your queries, the first request 
will be notably slower than the second as the facet values might not be in the 
disk cache.

Furthermore, for enum the difference between no prefix and some prefix is huge. 
As enum iterates values first (as opposed to fc that iterates hits first), 
limiting to only the values that start with 'a' ought to speed up retrieval by 
a factor of 10 or more.
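
The difference between the two iteration orders can be pictured with a toy model. This is a sketch under simplifying assumptions, not Solr's implementation: terms live in a sorted in-memory map, documents are small integer ids, and the class FacetStrategies is hypothetical. The values-first method only touches terms that share the prefix, which is why a non-empty prefix helps it so much; the hits-first method always walks every hit.

```java
import java.util.BitSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class FacetStrategies {
    /** Values first (enum-like): walk the terms with the prefix, intersect each with the hits. */
    static Map<String, Integer> countByTerms(TreeMap<String, BitSet> termToDocs,
                                             BitSet hits, String prefix) {
        Map<String, Integer> counts = new TreeMap<>();
        // In sorted order, all terms starting with the prefix form one contiguous run.
        for (Map.Entry<String, BitSet> e : termToDocs.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break; // past the run, no later term can match
            }
            BitSet both = (BitSet) e.getValue().clone();
            both.and(hits);
            if (!both.isEmpty()) {
                counts.put(e.getKey(), both.cardinality());
            }
        }
        return counts;
    }

    /** Hits first (fc-like): walk the matching documents, bump a counter per term. */
    static Map<String, Integer> countByHits(List<List<String>> docToTerms,
                                            BitSet hits, String prefix) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int doc = hits.nextSetBit(0); doc >= 0; doc = hits.nextSetBit(doc + 1)) {
            for (String term : docToTerms.get(doc)) {
                if (term.startsWith(prefix)) {
                    counts.merge(term, 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```

Both methods return the same counts; they differ only in how much work they do, which depends on the number of terms in the prefix range for the first and on the number of hits for the second.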

 And as side note: facet.method=fc makes the queries run 'forever' and 
 eventually
 fail with org.apache.solr.common.SolrException: Too many values for 
 UnInvertedField faceting on field CONTENT.

An internal memory structure optimization in Solr limits the number of possible 
unique values when using fc. It is not a bug as such, but more a consequence of 
a choice. Unfortunately the enum-solution is normally quite slow when there are 
enough unique values to trigger the too many values-exception. I know too 
little about the structures for DocValues to say if they will help here, but 
you might want to take a look at those.

- Toke Eskildsen

RE: Facet performance

2013-10-18 Thread Lemke, Michael SZ/HZA-ZSW
Toke Eskildsen [mailto:t...@statsbiblioteket.dk] wrote:
Lemke, Michael  SZ/HZA-ZSW [lemke...@schaeffler.com] wrote:
 1. 
 q=word&facet.field=CONTENT&facet=true&facet.prefix=&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0
 2. 
 q=word&facet.field=CONTENT&facet=true&facet.prefix=a&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0

 The only difference is an empty facet.prefix in the first query.

 The first query returns after some 20 seconds (QTime 2 in the result) 
 while
 the second one takes only 80 msec (QTime 80). Why is this?

If you index was just opened when you issued your queries, the first request 
will be notably slower than the second as the facet values might not be in 
the disk cache.

I know, but it shouldn't be orders of magnitude as in this example, should it?


Furthermore, for enum the difference between no prefix and some prefix is 
huge. As enum iterates values first (as opposed to fc that iterates hits 
first), limiting to only the values that starts with 'a' ought to speed up 
retrieval by a factor 10 or more.

Thanks.  That is what we sort of figured but it's good to know for sure.  Of 
course it begs the question if there is a way to speed this up?


 And as side note: facet.method=fc makes the queries run 'forever' and 
 eventually
 fail with org.apache.solr.common.SolrException: Too many values for 
 UnInvertedField faceting on field CONTENT.

An internal memory structure optimization in Solr limits the amount of 
possible unique values when using fc. It is not a bug as such, but more a 
consequence of a choice. Unfortunately the enum-solution is normally quite 
slow when there are enough unique values to trigger the too many 
values-exception. I know too little about the structures for DocValues to say 
if they will help here, but you might want to take a look at those.

What is DocValues?  Haven't heard of it yet.  And yes, the fc method was 
terribly slow in a case where it did work.  Something like 20 minutes whereas 
enum returned within a few seconds.

Michael



Re: Facet performance

2013-10-18 Thread Otis Gospodnetic
DocValues is the new black
http://wiki.apache.org/solr/DocValues

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
SOLR Performance Monitoring -- http://sematext.com/spm






RE: Facet performance

2013-10-18 Thread Chris Hostetter

:  1. 
q=word&facet.field=CONTENT&facet=true&facet.prefix=&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0
:  2. 
q=word&facet.field=CONTENT&facet=true&facet.prefix=a&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0
: 
:  The only difference is an empty facet.prefix in the first query.

: If you index was just opened when you issued your queries, the first 
: request will be notably slower than the second as the facet values might 
: not be in the disk cache.
: 
: I know but it shouldn't be orders of magnitudes as in this example, should it?

in and of itself: it can be if your index is large enough and none of the 
disk pages are in the file system buffer.

more significantly however, is that depending on how big your filterCache 
is, the first request could easily be caching all of the filters needed for 
the second query -- at a minimum it's definitely caching your main query 
which will be re-used and save a lot of time independent of the faceting.


-Hoss


Re: Multivalued fields and facet performance

2011-01-10 Thread Otis Gospodnetic
Hi Howard,

This is normal.  Your first query is reading a bunch of index data from disk 
and your RAM is then caching it.  If your first query involves sorting, some 
more data for FieldCache is being read and stored.  If there are multiple sort 
fields, one such thing for each.  If facets are involved, more of that stuff.  
If you are optimizing your index you are likely to be forcing more disk IO.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Howard Lee how...@workdigital.co.uk
 To: solr-user@lucene.apache.org
 Sent: Mon, January 10, 2011 8:59:03 AM
 Subject: Multivalued fields and facet performance
 
 Hi,
 
 I'd appreciate some explanation on what may be going on in the  following
 scenario using multivalued fields and facets.
 
 Solr version:  1.5
 
 Our index contains 35 million docs, and our search is using 2  multivalued
 fields as facets. There are approx 5 million different values in  one field
 and 5000 in the other. We are seeing the following, and I'm curious  as what
 is actually happening in the background.
 
 The first search can  take up to 5 minutes, all subsequent queries of any q
 return in under a  second. This is fine unless you are the first search or
 new  searcher.
 
 I plan on adding a first searcher and new searcher in the  config to avoid
 long delays every time the index is updated (once a day) but  I have concerns
 of the length of the delay in launching a new searcher, and  whether this is
 causing too much overhead.
 
 Can someone explain to me what processes are going on in the background that
 cause this behaviour so I can understand the implications or make some
 adjustments in the config to compensate.
 
 thanx
 
 Howard
 


Re: Multivalued fields and facet performance

2011-01-10 Thread Howard Lee
Otis,
The reason I ask is that I run a number of sites on Solr, some with 10
million+ docs faceting on similar types of data, and have not seen anywhere
near this length of initial delay. The main difference is that these sites
facet on single value fields rather than multivalued and that this site is
searching on 3 times the volume of data. Would switching to single valued
(I'd rather not) make much of a difference?

I've also noticed that multivalued fields aren't populating the lucene field
cache. Is this the correct behaviour?

Regards

Howard



-- 
WORKDIGITAL LTD
workdigital.co.uk
32-34 Broadwick Street
W1A 2HG London, UK

Howard Lee
CEO

M  +44(0)7931 476 766
E  how...@workdigital.co.uk

workhound.co.uk - salarytrack.co.uk - twitterjobsearch.com -
dreamjobalert.co.uk - recruitmentadnetwork.com


facet performance when number of values is large

2010-03-03 Thread Andy
I have a facet field whose values are created by users. So potentially there 
could be a very large number of values. Is that going to be a problem 
performance-wise?

A few more questions to help me understand how facet works:
- after the filter cache warmed up, will the (if any) performance problems 
caused by large number of facet values go away?
I thought that would be the case but according to the benchmark here: 
http://wiki.apache.org/solr/HierarchicalFaceting
SOLR-64 still had very poor performance even after the filter caches are warmed 

- In the wiki it was stated that facet.method=fc is excellent for situations 
where the number of indexed values for the field is high. Would that be the 
solution?




  

Re: facet performance tips

2009-08-13 Thread Jérôme Etévé
Thanks everyone for your advice.

I increased my filterCache, and the faceting performance improved greatly.

My faceted field can have at the moment ~4 different terms, so I
did set a filterCache size of 5 and it works very well.

However, I'm planning to increase the number of terms to maybe around
500 000, so I guess this approach won't work anymore, as I doubt a 500
000 sized filterCache would work.
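
A back-of-the-envelope sketch of why a cache of that size is worrying. It assumes every cached filter is held as a bitset with one bit per document, which gives an upper bound: sparse filters may be stored more compactly and per-entry object overhead is ignored. The class FilterCacheBound is hypothetical, and the document count is the ~160K mentioned in the original question.

```java
final class FilterCacheBound {
    /** Upper bound in bytes, assuming one bit per document for every cached filter. */
    static long worstCaseBytes(long cacheEntries, long maxDoc) {
        long bytesPerBitset = (maxDoc + 7) / 8; // round up to whole bytes
        return cacheEntries * bytesPerBitset;
    }

    public static void main(String[] args) {
        long maxDoc = 160_000L;
        System.out.println("2048 entries:    " + worstCaseBytes(2_048L, maxDoc) + " bytes");
        System.out.println("500 000 entries: " + worstCaseBytes(500_000L, maxDoc) + " bytes");
    }
}
```

Under that assumption each entry costs 20,000 bytes, so 2048 entries are bounded by roughly 41 MB while 500 000 entries could reach about 10 GB.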

So I guess my best move would be to upgrade to the soon to be 1.4
version of solr to benefit from its new faceting method.

I know this is a bit off-topic, but do you have a rough idea about
when 1.4 will be an official release?
As well, is the current trunk OK for production? Is it compatible with
1.3 configuration files?

Thanks !

Jerome.


-- 
Jerome Eteve.

Chat with me live at http://www.eteve.net

jer...@eteve.net


RE: facet performance tips

2009-08-13 Thread Fuad Efendi
I took 1.4 from trunk three days ago, it seems Ok for production (at least for 
my Master instance which is doing writes-only). I use the same config files.

500 000 terms are Ok too; I am using several millions with pre-1.3 SOLR taken 
from trunk.

However, do not try to facet (probably outdated term after SOLR-475) on 
generic queries such as [* TO *] (with huge resultset). For smaller query 
results (100,000 instead of 100,000,000) counting terms is fast enough (few 
milliseconds at http://www.tokenizer.org)

 





RE: facet performance tips

2009-08-13 Thread Fuad Efendi
It seems BOBO-Browse is an alternative faceting engine; it would be interesting
to compare its performance with Solr... Is it distributed?


-Original Message-
From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com] 
Sent: August-12-09 6:12 PM
To: solr-user@lucene.apache.org
Subject: Re: facet performance tips

For your fields with many terms you may want to try Bobo
http://code.google.com/p/bobo-browse/ which could work well with your
case.






RE: facet performance tips

2009-08-13 Thread Fuad Efendi
Interesting: it has a BoboRequestHandler that implements SolrRequestHandler,
so it is easy to try, and it has shards support.



[Fuad Efendi] It seems BOBO-Browse is alternate faceting engine; would be
interesting to
compare performance with SOLR... Distributed?


[Jason Rutherglen] For your fields with many terms you may want to try Bobo
http://code.google.com/p/bobo-browse/ which could work well with your
case.








Re: facet performance tips

2009-08-13 Thread Jason Rutherglen
Yeah, we need a performance comparison; I haven't had time to put
one together. If/when I do, I'll compare Bobo performance against
Solr's bitset-intersection-based facets, and compare memory
consumption.

For near realtime Solr needs to cache and merge bitsets at the
SegmentReader level, and Bobo needs to be upgraded to work with
Lucene 2.9's searching at the segment level (currently it uses a
MultiSearcher).

Distributed search on either should be fairly straightforward?








RE: facet performance tips

2009-08-13 Thread Fuad Efendi
Solr 1.4 trunk seems to use term counting instead of bitset intersections;
see
http://issues.apache.org/jira/browse/SOLR-475
(and probably http://issues.apache.org/jira/browse/SOLR-711)










Re: facet performance tips

2009-08-13 Thread Jason Rutherglen
Right, I haven't used SOLR-475 yet and am more familiar with
Bobo. I believe there are differences but I haven't gone into
them yet. As I'm using Solr 1.4 now, maybe I'll test the
UnInvertedField modality.

Feel free to report back results as I don't think I've seen much
yet?











facet performance tips

2009-08-12 Thread Jérôme Etévé
Hi everyone,

  I'm using some faceting on a solr index containing ~ 160K documents.
I perform facets on multivalued string fields. The number of possible
different values is quite large.

Enabling facets degrades the performance by a factor of 3.

Because I'm using Solr 1.3, I guess the faceting makes use of the
filter cache to work. My filterCache is set
to a size of 2048. I also noticed in my Solr stats a very small
cache hit ratio (~ 0.01%).

Can it be the reason why the faceting is slow? Does it make sense to
increase the filterCache size so it matches more or less the number
of different possible values for the faceted fields? Would that not
make the memory usage explode?
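
A near-zero hit ratio is what an LRU cache tends to produce when each request cycles through more distinct keys than the cache can hold. The sketch below is a toy simulation, not Solr's cache implementation; the function name is made up, and the 5000-term figure is an assumed example, since the thread does not state the real number of terms.

```python
from collections import OrderedDict

def lru_hit_ratio(capacity, num_terms, passes):
    """Simulate `passes` faceting requests that each look up every term once."""
    cache = OrderedDict()
    hits = 0
    lookups = 0
    for _ in range(passes):
        for term in range(num_terms):
            lookups += 1
            if term in cache:
                hits += 1
                cache.move_to_end(term)        # mark as most recently used
            else:
                cache[term] = True             # stands in for a cached filter
                if len(cache) > capacity:
                    cache.popitem(last=False)  # evict the least recently used entry
    return hits / lookups

too_small = lru_hit_ratio(capacity=2048, num_terms=5000, passes=3)
big_enough = lru_hit_ratio(capacity=5000, num_terms=5000, passes=3)
```

With the cache smaller than the term set, every entry is evicted before it is looked up again, so the simulated hit ratio is zero. Once the cache can hold every term, only the first pass misses.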

Thanks for your help !

-- 
Jerome Eteve.

Chat with me live at http://www.eteve.net

jer...@eteve.net


RE: facet performance tips

2009-08-12 Thread Manepalli, Kalyan
Jerome,
Yes, you need to increase the filterCache size to something close to the
number of unique facet values. But also consider the RAM required to 
accommodate the increase. 
I did see a significant performance gain by increasing the filterCache size.

Thanks,
Kalyan Manepalli
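
As a rough way to weigh the RAM cost mentioned above, the sketch below computes an upper bound that assumes every cached filter is stored as a bitset with one bit per document in the index. That is an assumption for illustration: small filters can be stored more compactly, so real usage may be much lower. The function names are hypothetical; the 160K document count, the 2048 cache size, and the 500 000 term target are the figures quoted in this thread.

```python
def bitset_bytes(max_doc):
    """Bytes needed for a bitset holding one bit per document."""
    return (max_doc + 7) // 8

def worst_case_filter_cache_bytes(max_doc, entries):
    """Upper bound on cache memory, assuming every entry is a full bitset."""
    return bitset_bytes(max_doc) * entries

per_entry = bitset_bytes(160_000)
current = worst_case_filter_cache_bytes(160_000, 2_048)
planned = worst_case_filter_cache_bytes(160_000, 500_000)
```

Under that assumption each entry costs about 20 KB, the current cache tops out near 41 MB, and a 500 000 entry cache would top out near 10 GB.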



RE: facet performance tips

2009-08-12 Thread Fuad Efendi
I am currently faceting on a tokenized multi-valued field at
http://www.tokenizer.org (25 million simple docs)

It uses some home-made quick fixes similar to SOLR-475 (SOLR-711) and a
non-synchronized cache (similar to LingPipe's FastCache, SOLR-665, SOLR-667)

Average faceting on query results: 0.2 - 0.3 seconds; without those
patches - 20-50 seconds.

I am going to upgrade to SOLR-1.4 from trunk (with SOLR-475 and SOLR-667) and
compare results...




P.S.
Avoid faceting on a field with a heavy distribution of terms (such as a few
million terms in my case); it won't work in SOLR 1.3.

TIP: use a non-tokenized single-valued field for faceting, such as a
non-tokenized country field.



P.P.S.
Would be nice to load/stress
http://alias-i.com/lingpipe/docs/api/com/aliasi/util/FastCache.html against
putting CPU in a spin loop ConcurrentHashMap.








Re: facet performance tips

2009-08-12 Thread Erik Hatcher
Yes, increasing the filterCache size will help with Solr 1.3  
performance.


Do note that trunk (soon Solr 1.4) has dramatically improved faceting  
performance.


Erik





Re: facet performance tips

2009-08-12 Thread Jason Rutherglen
For your fields with many terms you may want to try Bobo
http://code.google.com/p/bobo-browse/ which could work well with your
case.







Re: facet performance tips

2009-08-12 Thread Stephen Duncan Jr
Note that depending on the profile of your field (full text and how many
unique terms on average per document), the improvements from 1.4 may not
apply, as you may exceed the limits of the new faceting technique in Solr
1.4.
-Stephen


-- 
Stephen Duncan Jr
www.stephenduncanjr.com


Re: Facet Performance

2008-07-31 Thread Funtick

Hoss,

This is still an extremely interesting area for possible improvements; I simply
don't want the topic to die 
http://www.nabble.com/Facet-Performance-td7746964.html

http://issues.apache.org/jira/browse/SOLR-665
http://issues.apache.org/jira/browse/SOLR-667
http://issues.apache.org/jira/browse/SOLR-669

I am currently using faceting on a single-valued _tokenized_ field with a huge
number of documents and an _unsynchronized_ version of FIFOCache; 1.5 seconds
average response time (for faceted queries only!)

I think we can use an additional cache for facet results (to store calculated
values!); Lucene's FieldCache can be used only for non-tokenized
single-valued non-boolean fields.

-Fuad







Re: Facet Performance

2006-12-08 Thread Andrew Nagy

Yonik Seeley wrote:

> 1) facet on single-valued strings if you can
> 2) if you can't do (1) then enlarge the filterCache so that the number
> of filters (one per possible term in the field you are filtering on)
> can fit.

I changed the filterCache to the following:

   <filterCache
     class="solr.LRUCache"
     size="25600"
     initialSize="5120"
     autowarmCount="1024"/>

However, a search that normally takes .04s is taking 74 seconds once I 
use the facets, since I am faceting on 4 fields.


Can you suggest a better configuration that would solve this performance 
issue, or should I not use faceting?
I figure I could run the query twice, once limited to 20 records and 
then again with the limit set to the total number of records, and develop 
my own facets.  I have in fact done this before with a different back-end 
and my code is processed in under .01 seconds.


Why is faceting so slow?

Andrew


Re: Facet Performance

2006-12-08 Thread Andrew Nagy

Chris Hostetter wrote:

> : Could you suggest a better configuration based on this?
>
> If that's what your stats look like after a single request, then I would
> guess you would need to make your cache size at least 1.6 million in order
> for it to be of any use in improving your facet speed.

Would this have any strong impact on my system?  Should I just set it 
to an even 2 million to allow for growth?

> : My data is 492,000 records of book data.  I am faceting on 4 fields:
> : author, subject, language, format.
> : Format and language are fairly simple as there are only a few unique
> : terms.  Author and subject however are much different in that there are
> : thousands of unique terms.
>
> by the looks of it, you have a lot more than a few thousand unique terms
> in those two fields ... are you tokenizing on these fields?  that's
> probably not what you want for fields you're going to facet on.

All of these fields are set as string in my schema, so if I understand 
the fields correctly, they are not being tokenized.  I also have an 
author field that is set as text for searching.


Thanks
Andrew


Re: Facet Performance

2006-12-08 Thread Yonik Seeley

On 12/8/06, Andrew Nagy [EMAIL PROTECTED] wrote:
> Chris Hostetter wrote:
> > : Could you suggest a better configuration based on this?
> >
> > If that's what your stats look like after a single request, then I would
> > guess you would need to make your cache size at least 1.6 million in order
> > for it to be of any use in improving your facet speed.
>
> Would this have any strong impact on my system?  Should I just set it
> to an even 2 million to allow for growth?

Change the following in solrconfig.xml, and you should be fine with a
higher setting.
  <useFilterForSortedQuery>true</useFilterForSortedQuery>
to
  <useFilterForSortedQuery>false</useFilterForSortedQuery>

That will prevent the filterCache from being used for anything but
filters and faceting, so if you set it too high, it won't be utilized
anyway.

> > : My data is 492,000 records of book data.  I am faceting on 4 fields:
> > : author, subject, language, format.
> > : Format and language are fairly simple as there are only a few unique
> > : terms.  Author and subject however are much different in that there are
> > : thousands of unique terms.
> >
> > by the looks of it, you have a lot more than a few thousand unique terms
> > in those two fields ... are you tokenizing on these fields?  that's
> > probably not what you want for fields you're going to facet on.
>
> All of these fields are set as string in my schema

Are they multivalued, and do they need to be?
Anything that is of type string and not multivalued will use the
Lucene FieldCache rather than the filterCache.

-Yonik


Re: Facet Performance

2006-12-08 Thread Andrew Nagy

Yonik Seeley wrote:

> Are they multivalued, and do they need to be?
> Anything that is of type string and not multivalued will use the
> Lucene FieldCache rather than the filterCache.

The author field is multivalued.  Will this be a strong performance issue?

I could make multiple author fields so as not to have the multivalued field 
and then only facet on the first author.


Thanks
Andrew




Re: Facet Performance

2006-12-08 Thread J.J. Larrea
Andrew Nagy, ditto on what Yonik said.  Here is some further elaboration:

I am doing much the same thing (faceting on Author etc.). When my Author field 
was defined as a solr.TextField, even using solr.KeywordTokenizerFactory so it 
wasn't actually tokenized, the faceting code chose the QueryFilter approach, 
and faceting on Author for 100k+ documents took about 4 seconds.

When I changed the field to string e.g. solr.StrField, the faceting code 
recognized it as untokenized and used the FieldCache approach.  Times have 
dropped to about 120ms for the first query (when the FieldCache is generated) 
and  10ms for subsequent queries returning a few thousand results.  Quite a 
difference.

The strategy must be chosen on a field-by-field basis.  While QueryFilter is 
excellent for fields with a small set of enumerated values such as Language or 
Format, it is inappropriate for large value sets such as Author.

Unfortunately which strategy will be chosen is currently undocumented and 
control is a bit oblique:  If the field is tokenized or multivalued or Boolean, 
the FilterQuery method will be used; otherwise the FieldCache method.  I expect 
I or others will improve that shortly.

- J.J.
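
The selection rule described above can be written down as a small decision function. This is only a restatement of the rule as given in this message for the Solr code of that time, not the actual SimpleFacets source; the function name, parameter names, and returned labels are hypothetical.

```python
def facet_strategy(tokenized, multi_valued, is_boolean):
    """Pick a faceting approach using the rule described in this message."""
    if tokenized or multi_valued or is_boolean:
        return "filter-query"   # one cached filter per term, intersected with the results
    return "field-cache"        # per-document value lookup via the Lucene FieldCache

# An untokenized, single-valued string field such as Language or Format:
plain_string = facet_strategy(tokenized=False, multi_valued=False, is_boolean=False)
# A multivalued Author field, even when its type is string:
multi_author = facet_strategy(tokenized=False, multi_valued=True, is_boolean=False)
```

Read this way, a field type that reports itself as tokenized is enough to trigger the filter approach, which matches the solr.TextField experience described above.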

At 2:58 PM -0500 12/8/06, Yonik Seeley wrote:
> Right, if any of these are tokenized, then you could make them
> non-tokenized (use string type).  If they really need to be
> tokenized (author for example), then you could use copyField to make
> another copy to a non-tokenized field that you can use for faceting.
>
> After that, as Hoss suggests, run a single faceting query with all 4
> fields and look at the filterCache statistics.  Take the lookups
> number and multiply it by, say, 1.5 to leave some room for future
> growth, and use that as your cache size.  You probably want to bump up
> both initialSize and autowarmCount as well.
>
> The first query will still be slow.  The second should be relatively fast.
> You may hit an OOM error.  Increase the JVM heap size if this happens.
>
> -Yonik
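
The sizing rule of thumb in the quoted message (lookups from one faceting query, times about 1.5) is simple arithmetic, sketched below. The helper names are made up for illustration. The 1.6 million lookups figure is the one Chris Hostetter mentions elsewhere in this thread, and the initialSize and autowarmCount values are placeholders that carry over the earlier configuration, not recommendations.

```python
import math

def suggest_filter_cache_size(lookups, headroom=1.5):
    """Scale the observed lookups of one faceting query by a growth factor."""
    return math.ceil(lookups * headroom)

def filter_cache_element(size, initial_size, autowarm_count):
    """Render a solrconfig.xml filterCache element for the chosen numbers."""
    return (
        f'<filterCache class="solr.LRUCache" size="{size}" '
        f'initialSize="{initial_size}" autowarmCount="{autowarm_count}"/>'
    )

size = suggest_filter_cache_size(1_600_000)
element = filter_cache_element(size, initial_size=5120, autowarm_count=1024)
```

A cache of that size only helps if the heap can hold it, which is the out-of-memory caveat raised in the quoted message.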



Re: Facet Performance

2006-12-08 Thread Yonik Seeley

On 12/8/06, J.J. Larrea [EMAIL PROTECTED] wrote:
> Unfortunately which strategy will be chosen is currently undocumented and
> control is a bit oblique:  If the field is tokenized or multivalued or Boolean,
> the FilterQuery method will be used; otherwise the FieldCache method.

If anyone has time, some of this could be documented here:
http://wiki.apache.org/solr/SimpleFacetParameters
The wiki is open to all.

Or perhaps a new top-level FacetedSearching page that references
SimpleFacetParameters

-Yonik


Re: Facet Performance

2006-12-08 Thread Andrew Nagy

J.J. Larrea wrote:

> Unfortunately which strategy will be chosen is currently undocumented and
> control is a bit oblique:  If the field is tokenized or multivalued or Boolean,
> the FilterQuery method will be used; otherwise the FieldCache method.  I expect
> I or others will improve that shortly.

Good to hear, because I can't really get away with not having a 
multi-valued field for author.


I'm really excited by Solr and really impressed so far.

Thanks!
Andrew


Re: Facet Performance

2006-12-08 Thread Chris Hostetter

: Unfortunately which strategy will be chosen is currently undocumented
: and control is a bit oblique:  If the field is tokenized or multivalued
: or Boolean, the FilterQuery method will be used; otherwise the
: FieldCache method.  I expect I or others will improve that shortly.

Bear in mind, what's provided out of the box is SimpleFacets ... it's
designed to meet simple faceting needs ... when you start talking about
100s or thousands of constraints per facet, you are getting outside the
scope of what it was intended to serve efficiently.

At a certain point the only practical thing to do is write a custom
request handler that makes the best choices for your data.

For the record: a really simple patch someone could submit would be to
add an optional field-based param indicating which type of faceting
(termenum/fieldcache) should be used to generate the list of terms and
then make SimpleFacets.getFacetFieldCounts use that and call the
appropriate method instead of calling getTermCounts -- that way you could
force one or the other if you know it's better for your data/query.



-Hoss



Re: Facet Performance

2006-12-08 Thread Andrew Nagy

Erik Hatcher wrote:

> On Dec 8, 2006, at 2:15 PM, Andrew Nagy wrote:
> > My data is 492,000 records of book data.  I am faceting on 4 fields:
> > author, subject, language, format.
> > Format and language are fairly simple as there are only a few unique
> > terms.  Author and subject however are much different in that there
> > are thousands of unique terms.
>
> When encountering difficult issues, I like to think in terms of the
> user interface.  Surely you're not presenting 400k+ authors to the
> users in one shot.  In Collex, we have put an AJAX drop-down that
> shows the author facet (we call it name on the UI, with various roles
> like author, painter, etc).  You can see this in action here:

In our data, we don't have unique authors for each record ... so let's 
say out of the 500,000 records ... we have 200,000 authors.  What I am 
trying to display is the top 10 authors from the results of a search.  
So I do a search for title:Gone with the wind and I would like to see 
the top 10 matching authors from these results.


But no worries, I have written my own facet handler and I am now back to 
under a second with faceting!
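
The custom handler itself is not shown in the thread. As a minimal sketch of the idea only (count field values over the matching documents and keep the most frequent ones), something like the following would do; the function name, the dictionary-based document layout, and the sample authors are all assumptions made for illustration.

```python
from collections import Counter

def top_facets(result_docs, field, limit=10):
    """Count the values of `field` across the matching docs; return the top `limit`."""
    counts = Counter()
    for doc in result_docs:
        # A multivalued field is modelled as a list of values per document.
        counts.update(doc.get(field, []))
    return counts.most_common(limit)

# Hypothetical search results, one dict per matching record.
results = [
    {"title": "Record 1", "author": ["Author A"]},
    {"title": "Record 2", "author": ["Author A", "Author B"]},
    {"title": "Record 3", "author": ["Author C"]},
    {"title": "Record 4"},  # no author value
]
top_two = top_facets(results, "author", limit=2)
```

The work here is proportional to the number of matching documents rather than to the number of distinct authors in the whole index, which is why it suits narrow result sets.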


Thanks for everyone's help and keep up the good work!

Andrew


Facet Performance

2006-12-07 Thread Andrew Nagy
In September there was a thread [1] on this list about heterogeneous 
facets and their performance.  I am having a similar issue and am 
unclear as to the resolution of this thread.


I performed a search against my dataset (492,000 records) and got the 
results I am looking for in .3 seconds.  I then set facet to true and 
got results in 16 seconds, and the facets include data that is not in my 
result set; it is from the entire set.  How do I limit the faceting to 
my result set and speed up the results?


Thanks!
Andrew

[1] http://www.mail-archive.com/solr-user@lucene.apache.org/msg00955.html


Re: Facet performance with heterogeneous 'facets'?

2006-09-22 Thread Michael Imbeault
Excellent news; as you guessed, my schema was (for some reason) set to 
version 1.0. This also caused some of the problems I had with the 
original SolrPHP (parsing the wrong response).


But better yet, the 800 seconds query is now running in 0.5-2 seconds! 
Amazing optimization! I can now do faceting on journal title (17 000 
different titles) and last author (400 000 authors), + 12 date range 
queries, in a very reasonable time (considering im on a test windows 
desktop box and not a server).


The only problem is if I add first author, I get a 
java.lang.OutOfMemoryError: Java heap space. I'm sure this problem will 
go away on a server with more than the current 500 megs I can allocate 
to Tomcat.


Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212

Yonik Seeley wrote:

On 9/22/06, Michael Imbeault [EMAIL PROTECTED] wrote:

I upgraded to the most recent Solr build (9-22) and sadly it's still
really slow: an 800-second query with a single facet on first_author, 15
million documents total, and the query returns 180. Maybe I'm doing
something wrong? Also, this is on my personal desktop, not on a server.
Still, I'm getting 0.1-second queries without facets, so I don't think
that's the cause. In the admin panel I can still see the filtercache
doing millions of lookups (and tons of evictions once it hits the
maxsize).


The fact that you see all the filtercache usage means that the
optimization didn't kick in for some reason.


Here's the field I'm using in schema.xml:
<field name="first_author" type="string" indexed="true" stored="true"/>


That looks fine...


This is the query :
q=hiv red blood&start=0&rows=20&fl=article_title+authors+journal_iso+pubdate+pmid+score&qt=standard&facet=true&facet.field=first_author&facet.limit=5&facet.missing=false&facet.zeros=false



That looks OK too.
I assume that you didn't change the fieldtype definition for string,
and that the schema has version=1.1?  Before 1.1, all fields were
assumed to be multiValued (there was no checking or enforcement).

-Yonik
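
A rough sketch of why heavy filtercache traffic points at the slow path. It is an illustration only, not Solr's implementation, and every name in it is hypothetical. One strategy intersects the result set with a document set per distinct term, so 400 000 authors means on the order of 400 000 lookups; the other reads one value per matching document from a per-field array, so 180 hits means 180 lookups. The second strategy only works if each document has exactly one value for the field.

```python
from collections import Counter

# Hypothetical single-valued field: one first_author value per document id.
FIRST_AUTHOR = ["smith", "jones", "smith", "lee", "jones", "smith"]


def facet_by_term_filters(result_ids):
    """One set intersection per distinct term; cost grows with term count."""
    result = set(result_ids)
    # Per-term document sets, standing in for entries in a filter cache.
    term_docs = {}
    for doc_id, term in enumerate(FIRST_AUTHOR):
        term_docs.setdefault(term, set()).add(doc_id)
    counts = {}
    for term, docs in term_docs.items():
        n = len(docs & result)
        if n:
            counts[term] = n
    return counts


def facet_by_field_values(result_ids):
    """One array lookup per matching document; cost grows with result size."""
    return dict(Counter(FIRST_AUTHOR[doc_id] for doc_id in result_ids))


hits = [0, 2, 4]
print(facet_by_term_filters(hits))
print(facet_by_field_values(hits))
```

Both functions return the same counts; they differ only in how much work they do when the field has many distinct values and the result set is small.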



Re: Facet performance with heterogeneous 'facets'?

2006-09-21 Thread Yonik Seeley

On 9/21/06, Michael Imbeault [EMAIL PROTECTED] wrote:

It turns out that journal_name has 17038 different tokens, which is
manageable, but first_author has  400 000. I don't think this will ever
yield good performance, so I might only do journal_name facets.


Hang in there Michael, a fix is on the way for your scenario (and
subscribe to solr-dev if you want to stay on the bleeding edge):

http://www.nabble.com/big-faceting-speedup-for-single-valued-fields-tf2308153.html

-Yonik


Re: Facet performance with heterogeneous 'facets'?

2006-09-21 Thread Yonik Seeley

On 9/21/06, Michael Imbeault [EMAIL PROTECTED] wrote:

Btw, any plans for a facet cache?


Maybe a partial one (like caching top terms to implement some other
optimizations).  My general philosophy on caching in Solr has been to
cache things the client can't: elemental things, or *parts* of
requests to make many different requests faster (most
bang-for-the-buck).

Caching complete requests/responses is generally less useful since it
requires even more memory, has a worse hit ratio, and can be done
anyway by the client or a separate process like squid.

-Yonik
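
A minimal sketch of caching complete responses on the client side, as suggested above. The class name `ResponseCache` and the `fake_backend` stand-in are assumptions for illustration; a real client would make an HTTP request where the stand-in returns a string.

```python
from collections import OrderedDict
from urllib.parse import urlencode


class ResponseCache:
    """Small LRU cache keyed on the normalized request parameters."""

    def __init__(self, fetch, max_entries=128):
        self._fetch = fetch              # callable: query string -> response
        self._max_entries = max_entries
        self._entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, params):
        # Sort parameters so equivalent requests share one cache key.
        key = urlencode(sorted(params.items()))
        if key in self._entries:
            self.hits += 1
            self._entries.move_to_end(key)
            return self._entries[key]
        self.misses += 1
        response = self._fetch(key)
        self._entries[key] = response
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict least recently used
        return response


def fake_backend(query_string):
    # Stand-in for a real HTTP call to the search server.
    return "response for " + query_string


cache = ResponseCache(fake_backend, max_entries=2)
cache.get({"q": "hiv", "facet": "true"})
cache.get({"facet": "true", "q": "hiv"})  # same request, different order
print(cache.hits, cache.misses)
```

A cache like this only pays off when identical requests recur, which is the hit-ratio concern raised above.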


Re: Facet performance with heterogeneous 'facets'?

2006-09-21 Thread Yonik Seeley

On 9/21/06, Yonik Seeley [EMAIL PROTECTED] wrote:

Hang in there Michael, a fix is on the way for your scenario (and
subscribe to solr-dev if you want to stay on the bleeding edge):


OK, the optimization has been checked in.  You can check out from svn
and build Solr, or wait for the 9-22 nightly build (after 8:30 EDT).
I'd be interested in hearing your results with it.

The first facet request on a field will take longer than subsequent
ones because the FieldCache entry is loaded on demand.  You can use a
firstSearcher/newSearcher hook in solrconfig.xml to send a facet
request so that a real user would never see this slower query.

-Yonik
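
A toy model of the load-on-demand behaviour and the warm-up idea described above. It is not Solr code: `LazyFieldCache`, `load_from_index`, and `on_new_searcher` are hypothetical names, and in Solr itself the warm-up would be configured in solrconfig.xml, not written like this.

```python
class LazyFieldCache:
    """Builds a per-field value array on first use, then reuses it."""

    def __init__(self, load_field):
        self._load_field = load_field    # callable: field name -> list of values
        self._arrays = {}
        self.loads = 0

    def values(self, field):
        if field not in self._arrays:
            self.loads += 1              # the slow, one-time step
            self._arrays[field] = self._load_field(field)
        return self._arrays[field]


def load_from_index(field):
    # Stand-in for reading every document's value from the index.
    return ["smith", "jones", "smith"]


def on_new_searcher(cache, warm_fields):
    """Hypothetical warm-up hook: touch each faceted field before users do."""
    for field in warm_fields:
        cache.values(field)


cache = LazyFieldCache(load_from_index)
on_new_searcher(cache, ["first_author"])
loads_after_warmup = cache.loads
cache.values("first_author")             # a user request: no extra load
print(loads_after_warmup, cache.loads)
```

The slow load happens once, during warm-up, so the first real request finds the array already built.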


Re: Facet performance with heterogeneous 'facets'?

2006-09-21 Thread Michael Imbeault
I upgraded to the most recent Solr build (9-22) and sadly it's still 
really slow: an 800-second query with a single facet on first_author, 15 
million documents total, and the query returns 180. Maybe I'm doing 
something wrong? Also, this is on my personal desktop, not on a server. 
Still, I'm getting 0.1-second queries without facets, so I don't think 
that's the cause. In the admin panel I can still see the filtercache 
doing millions of lookups (and tons of evictions once it hits the maxsize).


Here's the field I'm using in schema.xml:
<field name="first_author" type="string" indexed="true" stored="true"/>

This is the query :
q=hiv red blood&start=0&rows=20&fl=article_title+authors+journal_iso+pubdate+pmid+score&qt=standard&facet=true&facet.field=first_author&facet.limit=5&facet.missing=false&facet.zeros=false
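
For reference, one way to assemble a request like this so the parameters are separated and encoded correctly. This is a sketch using Python's standard library; the parameter names and values are taken from the query above, and the variable names are illustrative.

```python
from urllib.parse import urlencode

# Parameters from the query in this message, in the same order.
params = [
    ("q", "hiv red blood"),
    ("start", 0),
    ("rows", 20),
    ("fl", "article_title authors journal_iso pubdate pmid score"),
    ("qt", "standard"),
    ("facet", "true"),
    ("facet.field", "first_author"),
    ("facet.limit", 5),
    ("facet.missing", "false"),
    ("facet.zeros", "false"),
]

# urlencode joins pairs with "&" and encodes spaces as "+".
query_string = urlencode(params)
print(query_string)
```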


I'll do more testing on the weekend,

Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212



Yonik Seeley wrote:

On 9/21/06, Yonik Seeley [EMAIL PROTECTED] wrote:

Hang in there Michael, a fix is on the way for your scenario (and
subscribe to solr-dev if you want to stay on the bleeding edge):


OK, the optimization has been checked in.  You can check out from svn
and build Solr, or wait for the 9-22 nightly build (after 8:30 EDT).
I'd be interested in hearing your results with it.

The first facet request on a field will take longer than subsequent
ones because the FieldCache entry is loaded on demand.  You can use a
firstSearcher/newSearcher hook in solrconfig.xml to send a facet
request so that a real user would never see this slower query.

-Yonik


