Re: Facet Performance
queryResultCache doesn't really help with faceting, even if it's hit for the main query. That cache only stores a subset of the hits, and to facet properly you need the entire result set.

On Jun 17, 2020, at 12:47 PM, James Bodkin wrote:
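The queryResultCache point above can be sketched in a few lines of Python (purely illustrative, not Solr internals): that cache holds only a windowed slice of the top hits for a query, while facet counting needs the full set of matching documents.

```python
# Illustrative sketch (not Solr code): a queryResultCache hit stores only
# a top-N window of doc ids, but faceting must count over the whole DocSet.
matching_docs = {3, 7, 11, 25, 42, 90, 113}   # full DocSet for the query
cached_window = sorted(matching_docs)[:3]     # roughly what queryResultCache keeps

print(cached_window)        # [3, 7, 11] -- enough to render page 1 of results
print(len(matching_docs))   # 7 -- what facet counting actually needs to see
```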
Re: Facet Performance
We've noticed that the filterCache uses a significant amount of memory, as we've assigned 8GB of heap per instance. In total, we have 32 shards with 2 replicas, hence (8 * 32 * 2) 512GB of heap space alone; further memory is required to ensure the index is always memory-mapped for performance reasons.

Ideally I would like to reduce the amount of memory assigned to the heap by using docValues instead of indexed, but it doesn't seem possible. The QTime (after warming) for facet.method=enum is around 150-250ms, whereas the QTime for facet.method=fc is around 1000-1200ms. As we require the results in real time for customers searching on our website, the latter QTime of 1000-1200ms is too slow for us to use.

Our facet queries change as the customer selects different search criteria, and hence the number of potential queries makes it very difficult for the query result cache. We already have a custom implementation in which we check our Redis cache for queries before they are sent to our aggregators, which runs at a 30% hit rate.

Kind Regards,

James Bodkin

On 17/06/2020, 16:21, "Michael Gibney" wrote:
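James's cluster-wide heap figure is easy to sanity-check with back-of-the-envelope arithmetic (numbers as stated above: 8GB heap per instance, 32 shards, 2 replicas, one instance per shard replica):

```python
# Sanity check of the heap figure quoted above: one Solr instance per
# shard replica, each with an 8GB heap.
heap_per_instance_gb = 8
shards = 32
replicas_per_shard = 2

total_heap_gb = heap_per_instance_gb * shards * replicas_per_shard
print(total_heap_gb)  # 512
```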
Re: Facet Performance
To expand a bit on what Erick said regarding performance: my sense is that the RefGuide assertion that "docValues=true" makes faceting "faster" could use some qualification/clarification. My take, fwiw:

First, to reiterate/paraphrase what Erick said: the "faster" assertion is not comparing to "facet.method=enum". For low-cardinality fields, if you have the heap space, and are very intentional about configuring your filterCache (and monitoring it as access patterns might change), "facet.method=enum" will likely be as fast as you can get (at least for "legacy" facets or whatever -- not sure about the "enum" method in JSON facets).

Even where "docValues=true" arguably does make faceting "faster", the main benefit is that the "uninverted" data structures are serialized on disk, so you're avoiding the need to uninvert each facet field on-heap for every new indexSearcher, which is generally high-latency. User perception of this latency can be mitigated using warming queries, but it can still be problematic, especially for frequent index updates. On-heap uninversion also inherently consumes a lot of heap space, which has general implications wrt GC, etc., so in that respect, even if faceting per se might not be "faster" with "docValues=true", your overall system may in many cases perform better.

(And Anthony, I'm pretty sure that tag/ex on facets should be orthogonal to the "facet.method=enum"/filterCache discussion, as tag/ex only affects the DocSet domain over which facets are calculated. I think that step is pretty cleanly separated from the actual calculation of the facets. I'm not 100% sure on that, so proceed with caution, but it could definitely be worth evaluating for your use case!)

Michael

On Wed, Jun 17, 2020 at 10:42 AM Erick Erickson wrote:
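Michael's mention of warming queries maps to a newSearcher listener in solrconfig.xml. A sketch, under the assumption that you facet on the two fields named in the queries later in this thread (adapt the list to whatever you actually facet and sort on):

```xml
<!-- solrconfig.xml sketch: warm each new searcher by running the facets
     you use in production, so uninversion/docValues reads happen before
     user queries arrive. Field names are taken from this thread; adjust. -->
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">*:*</str>
      <str name="rows">0</str>
      <str name="facet">true</str>
      <str name="facet.field">D_DepartureAirport</str>
      <str name="facet.field">D_Destination</str>
      <str name="facet.limit">-1</str>
    </lst>
  </arr>
</listener>
```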
Re: Facet Performance
Uninvertible is a safety mechanism to make sure that you don't _unknowingly_ use a docValues=false field for faceting/grouping/sorting/function queries. The primary point of docValues=true is twofold:

1> reduce Java heap requirements by using the OS memory to hold it

2> uninverting can be expensive CPU-wise too, although not with just a few unique values (for each term, read the list of docs that have it and flip a bit).

It doesn't really make sense to set it on an index=false field, since uninverting only happens on index=true docValues=false. OTOH, I don't think it would do any harm either. That said, I frankly don't know how that interacts with facet.method=enum.

As far as speed... yeah, you're in the edge cases. All things being equal, stuffing these into the filterCache is the fastest way to facet if you have the memory. I've seen very few installations where people have that luxury though. Each entry in the filterCache can occupy maxDoc/8 + some overhead bytes. If maxDoc is very large, this'll chew up an enormous amount of memory. I'm cheating a bit here, since an entry might be smaller if only a few docs match it, but that worst case is what you have to allow for, because you could theoretically hit the perfect storm where, due to some particular sequence of queries, your entire filterCache fills up with entries of that size.

You'll have some overhead to keep the cache at that size, but it sounds like it's worth it.

Best,
Erick

On Jun 17, 2020, at 10:05 AM, James Bodkin wrote:
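Erick's worst-case arithmetic can be made concrete. A rough sketch (the 50M maxDoc is a hypothetical example; 8192 is the filterCache size mentioned elsewhere in the thread; real entries may be much smaller):

```python
def filter_cache_worst_case_bytes(max_doc: int, cache_entries: int) -> int:
    """Worst case per Erick's estimate: every filterCache entry is a full
    bitset of maxDoc bits (maxDoc/8 bytes), ignoring per-entry overhead."""
    return (max_doc // 8) * cache_entries

# Hypothetical 50M-doc core with a filterCache sized at 8192 entries:
worst = filter_cache_worst_case_bytes(50_000_000, 8192)
print(worst)  # 51200000000 bytes, i.e. roughly 48 GiB
```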
Re: Facet Performance
The large majority of the relevant fields have fewer than 20 unique values. We have two fields over that, with 150 unique values and 5300 unique values respectively. At the moment, our filterCache is configured with a maximum size of 8192.

From the DocValues documentation (https://lucene.apache.org/solr/guide/8_3/docvalues.html), it mentions that this approach promises to make lookups for faceting, sorting and grouping much faster. Hence I thought that using DocValues would be better than using Indexed, and in turn improve our response times and possibly lower memory requirements. It sounds like this isn't the case if you are able to allocate enough memory to the filterCache.

I haven't yet tried changing the uninvertible setting; I was looking at the documentation for this field earlier today. Should we be setting uninvertible="false" if docValues="true", regardless of whether indexed is true or false?

Kind Regards,

James Bodkin

On 17/06/2020, 14:02, "Michael Gibney" wrote:
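James's uninvertible question boils down to a per-field schema setting. A sketch of what the advice in this thread adds up to (field name from the thread, field type assumed): keep indexed for filtering and refinement, use docValues for faceting, and set uninvertible=false so accidental uninversion fails fast rather than silently eating heap.

```xml
<!-- schema.xml sketch (field type assumed to be a plain string):
     indexed=true   -> efficient fq filtering and distributed facet refinement
     docValues=true -> faceting reads serialized structures, not on-heap ones
     uninvertible=false -> fail fast if anything tries on-heap uninversion -->
<field name="D_Destination" type="string" indexed="true" stored="false"
       docValues="true" uninvertible="false"/>
```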
Re: Facet Performance
Ah, interesting! So if the number of possible values is low (like <= 10), it is faster to *not* use docValues on that (indexed) faceted field? Does this hold true even when using faceting techniques like tag and exclusion?

Thanks,
Anthony

On Wed, Jun 17, 2020 at 9:37 AM David Smiley wrote:
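For concreteness, the tag/exclusion technique Anthony asks about looks like this as request parameters (field name from this thread; the filter value "Spain" is made up). The {!ex=...} only changes the DocSet domain the facet is counted over, which is why, as Michael notes elsewhere in the thread, it should be independent of the facet.method choice:

```
q=*:*
fq={!tag=dest}D_Destination:Spain
facet=true
facet.field={!ex=dest}D_Destination
facet.method=enum
rows=0
```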
Re: Facet Performance
I strongly recommend setting indexed=true on a field you facet on for the purposes of efficient refinement (fq=field:value). But strictly speaking it isn't required, as you have discovered.

~ David
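A facet field set up along the lines David recommends might be declared like this in the schema (a sketch only; the type and attribute choices are assumptions, with the field name borrowed from the queries later in this thread):

```xml
<!-- docValues drives the facet counts; indexed="true" keeps distributed
     refinement queries (fq=D_Destination:value) on the inverted index -->
<field name="D_Destination" type="string" indexed="true" docValues="true"
       stored="false" multiValued="true"/>
```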
Re: Facet Performance
facet.method=enum works by executing a query (against indexed values) for each indexed value in a given field (which, for indexed=false, is "no values"). So that explains why facet.method=enum no longer works. I was going to suggest that you might not want to set indexed=false on the docValues facet fields anyway, since the indexed values are still used for facet refinement (assuming your index is distributed).

What's the number of unique values in the relevant fields? If it's low enough, setting docValues=false and indexed=true and using facet.method=enum (with a sufficiently large filterCache) is definitely a viable option, and will almost certainly be faster than docValues-based faceting. (As an aside, noting for future reference: high-cardinality facets over high-cardinality DocSet domains might be able to benefit from a term facet count cache: https://issues.apache.org/jira/browse/SOLR-13807)

I think you didn't specifically mention whether you acted on Erick's suggestion of setting "uninvertible=false" (I think Erick accidentally said "uninvertible=true") to fail fast. I'd also recommend doing that, perhaps even above all else -- it shouldn't actually *do* anything, but will help ensure that things are behaving as you expect them to!

Michael
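The enum approach Michael describes is requested per query; a sketch of such a request (host, collection, and the choice of field are illustrative). Since each unique term in the field effectively becomes a cached filter, the filterCache would need to be sized to at least the field's unique-value count:

```
http://localhost:8983/solr/mycollection/select?q=*:*&rows=0&facet=true&facet.field=D_Destination&facet.method=enum&facet.limit=-1
```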
Re: Facet Performance
Thanks, I've implemented some queries that improve the first-hit execution for faceting.

Since turning off indexed on those fields, we've noticed that facet.method=enum no longer returns the facets when used. Using facet.method=fc/fcs is significantly slower compared to facet.method=enum for us. Why do these two differences exist?
Re: Facet Performance
Ok, I see the disconnect... Necessary parts of the index are read from disk lazily. So your newSearcher or firstSearcher query needs to do whatever operation causes the relevant parts of the index to be read. In this case, probably just facet on all the fields you care about. I'd add sorting too if you sort on different fields.

The *:* query without facets or sorting does virtually nothing due to some special handling...
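Erick's suggestion, a warming query that facets on all the fields you care about, would look roughly like this in solrconfig.xml (a sketch; the field list is taken from the queries in this thread and would need to match your schema, and a sort clause could be added for any fields you sort on):

```xml
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <!-- one *:* query faceting on every facet field, to force the
           relevant index/docValues data to be read off disk -->
      <str name="q">*:*</str>
      <str name="rows">0</str>
      <str name="facet">true</str>
      <str name="facet.field">D_DepartureAirport</str>
      <str name="facet.field">D_Destination</str>
      <str name="facet.limit">-1</str>
    </lst>
  </arr>
</listener>
```

An equivalent listener with event="firstSearcher" covers the cold-start case, since no autowarming data exists when Solr first starts.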
Re: Facet Performance
I've been trying to build a query that I can use in newSearcher based off the information in your previous e-mail. I thought you meant to build a *:* query as per Query 1 in my previous e-mail but I'm still seeing the first-hit execution. Now I'm wondering if you meant to create a *:* query with each of the fields as part of the fl query parameters or a *:* query with each of the fields and values as part of the fq query parameters.

At the moment I've been running these manually as I expected that I would see the first-execution penalty disappear by the time I got to query 4, as I thought this would replicate the actions of the newSearcher. Unfortunately we can't use the autowarm count that is available as part of the filterCache/queryResultCache due to the custom deployment mechanism we use to update our index.

Kind Regards,

James Bodkin
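The autowarm counts discussed in this thread are configured on the caches in solrconfig.xml. A sketch using the 10-20 range suggested above (sizes are illustrative, and the cache class may differ by Solr version):

```xml
<!-- autowarmCount replays the N most recently used entries into the
     caches of each newly opened searcher -->
<filterCache class="solr.CaffeineCache" size="512" initialSize="512"
             autowarmCount="20"/>
<queryResultCache class="solr.CaffeineCache" size="512" initialSize="512"
                  autowarmCount="20"/>
```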
Re: Facet Performance
Did you try the autowarming like I mentioned in my previous e-mail?
Re: Facet Performance
We've changed the schema to enable docValues for these fields and this led to an improvement in the response time. We found a further improvement by also switching off indexed as these fields are used for faceting and filtering only.
Since those changes, we've found that the first-execution penalty for queries is really noticeable. I thought this would be the filterCache based on what I saw in NewRelic however it is probably trying to read the docValues from disk. How can we use the autowarming to improve this?

For example, I've run the following queries in sequence and each query has a first-execution penalty.

Query 1:

q=*:*
facet=true
facet.field=D_DepartureAirport
facet.field=D_Destination
facet.limit=-1
rows=0

Query 2:

q=*:*
fq=D_DepartureAirport:(2660)
facet=true
facet.field=D_Destination
facet.limit=-1
rows=0

Query 3:

q=*:*
fq=D_DepartureAirport:(2661)
facet=true
facet.field=D_Destination
facet.limit=-1
rows=0

Query 4:

q=*:*
fq=D_DepartureAirport:(2660+OR+2661)
facet=true
facet.field=D_Destination
facet.limit=-1
rows=0

We've kept the field type as a string, as the value is mapped by the application that accesses Solr. In the examples above, the values are mapped to airports and destinations.
Is it possible to prewarm the above queries without having to define all the potential filters manually in the auto warming?

At the moment, we update and optimise our index in a different environment and then copy the index to our production instances by using a rolling deployment in Kubernetes.

Kind Regards,

James Bodkin
Re: Facet Performance
I question whether the filterCache has anything to do with it; I suspect what’s really happening is that the first time, you’re reading the relevant bits from disk into memory. And to double check, you should have docValues enabled for all these fields. The “uninverting” process can be very expensive, and docValues bypasses that.

As of Solr 7.6, you can add “uninvertible=true” to your field(Type) to “fail fast” if Solr needs to uninvert the field.

But that’s an aside. In either case, my claim is that first-time execution does “something”: either it reads the serialized docValues from disk or it uninverts the field on Solr’s heap.

You can have this autowarmed by any combination of:
1> specifying an autowarm count on your queryResultCache. That’s hit or miss, as it replays the most recent N queries, which may or may not contain the sorts. That said, specifying 10-20 for the autowarm count is usually a good idea, assuming you’re not committing more than, say, every 30 seconds. I’d add the same to filterCache too.
2> specifying a newSearcher or firstSearcher query in solrconfig.xml. The difference is that newSearcher is fired every time a commit happens, while firstSearcher is only fired when Solr starts, the theory being that there’s no cache autowarming available when Solr first powers up. Usually, people don’t bother with firstSearcher or just make it the same as newSearcher. Note that a query doesn’t have to be “real” at all. You can just add all the facet fields to a *:* query in a single go.

BTW, Trie fields will stay around for a long time even though deprecated. Or at least until we find something to replace them with that doesn’t have this penalty, so I’d feel pretty safe using those and they’ll be more efficient than strings.

Best,
Erick
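Erick's aside about Trie fields could be tried without disturbing the existing field, by faceting on a Trie-typed copy. A sketch, assuming the IntPointField mentioned later in this thread; the derived field name is hypothetical:

```xml
<!-- Deprecated but still supported: TrieIntField avoids the points-based
     faceting penalty and is more compact than a string copy -->
<fieldType name="tint" class="solr.TrieIntField" precisionStep="0" docValues="true"/>
<field name="D_UserRatingGte_tri" type="tint" indexed="true" stored="false"
       multiValued="true"/>
<copyField source="D_UserRatingGte" dest="D_UserRatingGte_tri"/>
```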
Re: Facet Performance
We've run the performance test after changing the fields to be of the type string. We're seeing improved performance, especially after the first time the query has run. The first run is taking around 1-2 seconds rather than 6-8 seconds and when the filter cache is present, the response time is around 400ms.
Do you have any more suggestions that we could try in order to optimise the performance?
Re: Facet Performance
Could you explain why the performance is an issue for points-based fields? I've looked through the referenced issue (which is fixed in the version we are running) but I'm missing the link between the two. Is there an issue to improve this for points-based fields?
We're going to change the field type to a string, as our queries are always looking for a specific value (and not intervals/ranges), and rerun our load test.

Kind Regards,

James Bodkin
Re: Facet Performance
There’s a lot of confusion about using points-based fields for faceting, see: https://issues.apache.org/jira/browse/SOLR-13227 for instance. Two options you might try: 1> copyField to a string field and facet on that (won’t work, of course, for any kind of interval/range facet) 2> use the deprecated Trie field instead. You could use the copyField to a Trie field for this too. Best, Erick > On Jun 11, 2020, at 9:39 AM, James Bodkin > wrote: > > We’ve been running a load test against our index and have noticed that the > facet queries are significantly slower than we would like. > Currently these types of queries are taking several seconds to execute and > are wondering if it would be possible to speed these up. > Repeating the same query over and over does not improve the response time so > does not appear to utilise any caching. > Ideally we would like to be targeting a response time around tens or hundreds > of milliseconds if possible. > > An example query that is taking around 2-3 seconds to execute is: > > q=*.* > facet=true > facet.field=D_UserRatingGte > facet.mincount=1 > facet.limit=-1 > rows=0 > > "response":{"numFound":18979503,"start":0,"maxScore":1.0,"docs":[]} > "facet_counts":{ >"facet_queries":{}, >"facet_fields":{ > "D_UserRatingGte":[ >"1575",16614238, >"1576",16614238, >"1577",16614238, >"1578",16065938, >"1579",12079545, >"1580",458799]}, >"facet_ranges":{}, >"facet_intervals":{}, >"facet_heatmaps":{}}} > > I have also tried the equivalent query using the JSON Facet API with the same > outcome of slow response time. > Additionally I have tried changing the facet method (on both facet apis) with > the same outcome of slow response time. > > The underlying field for the above query is configured as a > solr.IntPointField with docValues, indexed and multiValued set to true. > The index has just under 19 million documents and the physical size on disk > is 10.95GB. The index is read-only and consists of 4 segments with 0 > deletions. 
> We’re running standalone Solr 8.3.1 with an 8GB heap, and the underlying Google > Cloud Virtual Machine in our load test environment has 6 vCPUs, 32G RAM and > 100GB SSD. > > Would anyone be able to point me in a direction to either improve the > performance or understand whether the current performance is expected? > > Kind Regards, > > James Bodkin
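Erick's option 1> (copyField to a string field and facet on the copy) would look roughly like the following schema.xml fragment. This is a sketch only: the `_str` field name and the `pint` type are illustrative assumptions, not configuration taken from the thread.

```xml
<!-- Sketch of Erick's option 1>: keep the point field for range queries,
     and facet on a docValues string copy. Names/types here are
     illustrative; adapt to your schema. -->
<field name="D_UserRatingGte"     type="pint"   indexed="true"  stored="true"
       docValues="true" multiValued="true"/>
<field name="D_UserRatingGte_str" type="string" indexed="false" stored="false"
       docValues="true" multiValued="true"/>
<copyField source="D_UserRatingGte" dest="D_UserRatingGte_str"/>
```

Queries would then use facet.field=D_UserRatingGte_str; since the copy holds string terms, interval/range faceting on it is not possible, as Erick notes.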
Re: Facet performance problem
On 2/20/2018 1:18 AM, LOPEZ-CORTES Mariano-ext wrote: We return a facet list of values in "motifPresence" field (person status). Status: [ ] status1 [x] status2 [x] status3 The user then selects 1 or multiple statuses (It's this step that we called "facet filtering"). The query is then re-executed with fq=motifPresence:(status2 OR status3) We use fq in order not to alter the score of the main query. We've read that docValues=true is recommended for facet fields. Do we also need indexed=true? Facets, grouping, and sorting are more efficient with docValues, but searches aren't helped by docValues. Without indexed="true", searches on the field will be VERY slow. A filter query is still a search. The "filter" in filter query just refers to the fact that it's separate from the main query, and that it does not affect relevancy scoring. Thanks, Shawn
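Shawn's advice translates to a field definition along these lines (a sketch; the `string` type is an assumption, since the thread never shows the field's type):

```xml
<!-- indexed="true" makes fq=motifPresence:(status2 OR status3) fast,
     because a filter query is still a search against the inverted index;
     docValues="true" makes faceting/grouping/sorting efficient. -->
<field name="motifPresence" type="string" indexed="true" stored="true"
       docValues="true" required="false"/>
```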
RE: Facet performance problem
Our query looks like this: ...facet=true&facet.field=motifPresence We return a facet list of values in "motifPresence" field (person status). Status: [ ] status1 [x] status2 [x] status3 The user then selects 1 or multiple statuses (It's this step that we called "facet filtering"). The query is then re-executed with fq=motifPresence:(status2 OR status3) We use fq in order not to alter the score of the main query. We've read that docValues=true is recommended for facet fields. Do we also need indexed=true? Is there any other problem in our solution? -----Original Message----- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Monday, February 19, 2018 18:18 To: solr-user Subject: Re: Facet performance problem I'm confused here. What do you mean by "facet filtering"? Your examples have no facets at all, just a _filter query_. I'll assume you want to use filter query (fq), and faceting has nothing to do with it. This is one of the tricky bits of docValues. While it's _possible_ to search on a field that's defined as above, it's very inefficient since there's no "inverted index" for the field, you specified 'indexed="false" '. So the docValues are searched, and it's essentially a table scan. If you mean to search against this field, set indexed="true". You'll have to completely reindex your corpus of course. If you intend to facet, group or sort on this field, you should _also_ have docValues="true". Best, Erick On Mon, Feb 19, 2018 at 7:47 AM, MOUSSA MZE Oussama-ext wrote: > Hi > > We have the following environment: > > 3 nodes cluster > 1 shard > Replication factor = 2 > 8GB per node > > 29 million documents > > We're faceting on the field "motifPresence", defined as follows: > > <field name="motifPresence" ... indexed="false" stored="true" required="false"/> > > Once the user selects the motifPresence filter we execute the search again with: > > fq: (value1 OR value2 OR value3 OR ...) > > The problem is: during facet filtering the query is too slow and its response > time is greater than for the main search (without facet filtering). 
> > Thanks in advance!
Re: Facet performance problem
I'm confused here. What do you mean by "facet filtering"? Your examples have no facets at all, just a _filter query_. I'll assume you want to use filter query (fq), and faceting has nothing to do with it. This is one of the tricky bits of docValues. While it's _possible_ to search on a field that's defined as above, it's very inefficient since there's no "inverted index" for the field, you specified 'indexed="false" '. So the docValues are searched, and it's essentially a table scan. If you mean to search against this field, set indexed="true". You'll have to completely reindex your corpus of course. If you intend to facet, group or sort on this field, you should _also_ have docValues="true". Best, Erick On Mon, Feb 19, 2018 at 7:47 AM, MOUSSA MZE Oussama-ext wrote: > Hi > > We have the following environment: > > 3 nodes cluster > 1 shard > Replication factor = 2 > 8GB per node > > 29 million documents > > We're faceting on the field "motifPresence", defined as follows: > > <field name="motifPresence" ... indexed="false" stored="true" required="false"/> > > Once the user selects the motifPresence filter we execute the search again with: > > fq: (value1 OR value2 OR value3 OR ...) > > The problem is: during facet filtering the query is too slow and its response > time is greater than for the main search (without facet filtering). > > Thanks in advance!
RE: Facet performance
On Tue, October 22, 2013 5:23 PM Michael Lemke wrote: >On Tue, October 22, 2013 9:23 AM Toke Eskildsen wrote: >>On Mon, 2013-10-21 at 16:57 +0200, Lemke, Michael SZ/HZA-ZSW wrote: >>> QTime fc: >>>never returns, webserver restarts itself after 30 min with 100% CPU >>> load >> >>It might be because it dies due to garbage collection. But since more >>memory (as your test server presumably has) just leads to the too many >>values-error, there isn't much to do. > >Essentially, fc is out then. > >> >>> QTime=41205 facet.prefix=q=frequent_word >>> numFound=44532 >>> >>> Same query repeated: >>> QTime=225810 facet.prefix=q=ottomotor >>> numFound=909 >>> QTime=199839 facet.prefix=q=ottomotor >>> numFound=909 >> >>I am stumped on this, sorry. I do not understand why the 'ottomotor' >>query can take 5 times as long as the 'frequent_word'-one. > >I looked into this some more this morning. I noticed the java process was >doing >a lot of I/O as shown in Process Explorer. For the frequent_word it read >about >180MB, for ottomotor it was about seven times as much, ~ 1,200 MB. > Got another observation today. The response time for q=ottomotor depends on facet.limit: QTime=59300 facet.limit=2 QTime=69395 facet.limit=4 QTime=85208 facet.limit=6 QTime=158150 facet.limit=8 QTime=186276 facet.limit=10 QTime=231763 facet.limit=15 QTime=260437 facet.limit=20 QTime=312268 facet.limit=30 For q=frequent_word the effect is much less pronounced and shows only for facet.limit >= 15: QTime=0 facet.limit=0 QTime=20535 facet.limit=1 QTime=13456 facet.limit=2 QTime=13925 facet.limit=4 QTime=13705 facet.limit=6 QTime=13924 facet.limit=8 QTime=13799 facet.limit=10 QTime=14361 facet.limit=15 QTime=14704 facet.limit=20 QTime=15189 facet.limit=30 QTime=16783 facet.limit=50 QTime=57128 facet.limit=500 It looks to me like, in order to collect enough facets to fulfil the limit constraint, Solr has to read much more of the index in the case of the infrequent word. 
>jconsole didn't show anything unusual according to our more experienced Java >experts here. Nor was the machine swapping. > >Is it possible to screw up an index such that this sort of faceting leads to >constant reading of the index? Something like full table scans in a db? > Michael
RE: Facet performance
On Tue, 2013-10-22 at 17:25 +0200, Lemke, Michael SZ/HZA-ZSW wrote: > On Tue, October 22, 2013 11:54 AM Andre Bois-Crettez wrote: > >> This is with Solr 1.4. > >Really ? > >This sounds really outdated to me. > >Have you tried a more recent version? 4.5 just went out. > > Sorry, can't. Too much `grown' stuff. I did not see that. I guess I parsed it as 4.1. Well, that rules out DocValues and fcs (as far as I remember). I am a bit surprised that the limit on #terms with fc is also in 1.4. I thought it was introduced in a later version. We too have been in a position where upgrading was hard due to homegrown addons. We even scrapped some DidYouMean-like functionality when going from 3.x to 4.x, but 4.x was so much better that there was little choice. Last suggestion for using fc: create 2 or more CONTENT fields and choose between them randomly when indexing. Facet on all the CONTENT fields and merge the results. It will take a bit more RAM though, so it is still out on your (assumedly) 32-bit machine. Regards, Toke Eskildsen, State and University Library, Denmark
RE: Facet performance
On Tue, October 22, 2013 11:54 AM Andre Bois-Crettez wrote: > >> This is with Solr 1.4. >Really ? >This sounds really outdated to me. >Have you tried a more recent version? 4.5 just went out. Sorry, can't. Too much `grown' stuff. Michael
RE: Facet performance
On Tue, October 22, 2013 9:23 AM Toke Eskildsen wrote: >On Mon, 2013-10-21 at 16:57 +0200, Lemke, Michael SZ/HZA-ZSW wrote: >> QTime fc: >>never returns, webserver restarts itself after 30 min with 100% CPU >> load > >It might be because it dies due to garbage collection. But since more >memory (as your test server presumably has) just leads to the too many >values-error, there isn't much to do. Essentially, fc is out then. > >> QTime=41205 facet.prefix=q=frequent_word >> numFound=44532 >> >> Same query repeated: >> QTime=225810 facet.prefix=q=ottomotor >> numFound=909 >> QTime=199839 facet.prefix=q=ottomotor >> numFound=909 > >I am stumped on this, sorry. I do not understand why the 'ottomotor' >query can take 5 times as long as the 'frequent_word'-one. I looked into this some more this morning. I noticed the java process was doing a lot of I/O as shown in Process Explorer. For the frequent_word it read about 180MB, for ottomotor it was about seven times as much, ~ 1,200 MB. jconsole didn’t show anything unusual according to our more experienced Java experts here. Nor was the machine swapping. Is it possible to screw up an index such that this sort of faceting leads to constant reading of the index? Something like full table scans in a db? Michael
Re: Facet performance
This is with Solr 1.4. Really ? This sounds really outdated to me. Have you tried a more recent version? 4.5 just went out. -- André Bois-Crettez Software Architect Search Developer http://www.kelkoo.com/ Kelkoo SAS
RE: Facet performance
On Mon, 2013-10-21 at 16:57 +0200, Lemke, Michael SZ/HZA-ZSW wrote: > QTime enum: > 1st call: 1200 > subsequent calls: 200 Those numbers seem fine. > QTime fc: >never returns, webserver restarts itself after 30 min with 100% CPU > load It might be because it dies due to garbage collection. But since more memory (as your test server presumably has) just leads to the too many values-error, there isn't much to do. > QTime=41205 facet.prefix=q=frequent_word > numFound=44532 > > Same query repeated: > QTime=225810 facet.prefix=q=ottomotor > numFound=909 > QTime=199839 facet.prefix=q=ottomotor > numFound=909 I am stumped on this, sorry. I do not understand why the 'ottomotor' query can take 5 times as long as the 'frequent_word'-one. > QTime=185948 facet.prefix=q=ottomotor > numFound=909 > > QTime=3344 facet.prefix=d q=ottomotor > numFound=909 Fits with expectations. > >- Documents in your index > 13,434,414 > > >- Unique values in the CONTENT field > Not sure how to get this. In luke I find > 21,797,514 term count CONTENT Those are the relevant numbers for faceting. There is a limit of 2^24 (16M) terms for facet.method=enum, although I am a bit unsure if that is for the whole index or per segment. Come to think of it, if you have a multi-segmented index, you might want to try facet.method=fcs. It should have faster startup than fc and better performance than enum for fields with a large number of unique values. Memory requirements should be between fc and enum. > >- Xmx > The maximum the system allows me to get: 1612m > > Maybe I have a hopelessly under-dimensioned server for this sort of thing? Well, 1612m should be enough for the faceting in itself; it is the startup that is the killer. A rule of thumb for fc is that the internal structure takes at least #docs*log(#references) + #references*log(#unique_values) bytes If your content field is a description, let's say that each description has 40 words, which gives us 500M references from documents to facet values. 
This translates to 13M*log(500M) + 500M*log(22M) bytes ~= 13M*29 + 500M*25 bytes ~= 380MB. Taking into account that building the structure has an overhead of 2-3 times that, we are approaching the memory limit of 1612m. If the index is updated, a new facet structure is built all over again while the old structure is still in memory. If you need better performance on your large field I would suggest, in order of priority: - facet.method=fcs - facet.method=fcs with DocValues - Shard your index and use facet.method=fc - SOLR-2412 (https://issues.apache.org/jira/browse/SOLR-2412) SOLR-2412 is a last resort, but it does have the same speed as facet.method=fc only without the 16M unique values limitation. Regards, Toke Eskildsen, State and University Library, Denmark
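Toke's rule of thumb is easy to re-run with the exact numbers from this thread (13,434,414 documents, 21,797,514 unique CONTENT terms). Note that the 40 terms per document is his assumption, and the log base and result units aren't spelled out in the thread; the sketch below assumes base-2 logs and a result in bits:

```python
import math

def uninverted_size_mb(num_docs, refs_per_doc, unique_values):
    """Rough lower bound for the fc 'uninverted' structure.
    Assumes base-2 logs and a result in bits (an assumption; the
    original post does not state its units)."""
    refs = num_docs * refs_per_doc
    bits = num_docs * math.log2(refs) + refs * math.log2(unique_values)
    return bits / 8 / 1024 / 1024  # bits -> MiB

# Numbers from the thread: 13,434,414 docs, 21,797,514 unique terms,
# and Toke's assumed 40 terms per document.
est = uninverted_size_mb(13_434_414, 40, 21_797_514)
print(f"estimated structure size: {est:.0f} MiB")
```

Under these assumptions the estimate lands in the neighbourhood of the 1612m heap, which fits Toke's conclusion that structure build-up, not the faceting itself, is the killer.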
RE: Facet performance
On Mon, October 21, 2013 10:04 AM, Toke Eskildsen wrote: >On Fri, 2013-10-18 at 18:30 +0200, Lemke, Michael SZ/HZA-ZSW wrote: >> Toke Eskildsen wrote: >> > Unfortunately the enum-solution is normally quite slow when there >> > are enough unique values to trigger the "too many > values"-exception. >> > [...] >> >> [...] And yes, the fc method was terribly slow in a case where it did >> work. Something like 20 minutes whereas enum returned within a few >> seconds. > >Err.. What? That sounds _very_ strange. You have millions of unique >values so fc should be a lot faster than enum, not the other way around. > >I assume the 20 minutes was for the first call. How fast does subsequent >calls return for fc? QTime enum: 1st call: 1200 subsequent calls: 200 QTime fc: never returns, webserver restarts itself after 30 min with 100% CPU load This is on the test system, the production system managed to return with "... Too many values for UnInvertedField faceting ...". However, I also have different faceting queries I played with today. One complete example: q=ottomotor&facet.field=CONTENT&facet=true&facet.prefix=&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0 These are the results, all with facet.method=enum (fc doesn't work). They were executed in the sequence shown on an otherwise unused server: QTime=41205 facet.prefix=q=frequent_word numFound=44532 Same query repeated: QTime=225810 facet.prefix=q=ottomotor numFound=909 QTime=199839 facet.prefix=q=ottomotor numFound=909 QTime=0 facet.prefix=q=ottomotor jkdhwjfh numFound=0 QTime=0 facet.prefix=q=jkdhwjfh numFound=0 QTime=185948 facet.prefix=q=ottomotor numFound=909 QTime=3344 facet.prefix=d q=ottomotor numFound=909 QTime=3078 facet.prefix=d q=ottomotor numFound=909 QTime=3141 facet.prefix=d q=ottomotor numFound=909 The response time is obviously not dependent on the number of documents found. Caching doesn't kick in either. > > >Maybe you could provide some approximate numbers? I'll try, see below. 
Thanks for asking and having a closer look. > >- Documents in your index 13,434,414 >- Unique values in the CONTENT field Not sure how to get this. In luke I find 21,797,514 term count CONTENT Is that what you mean? >- Hits are returned from a typical query Hm, that can be anything between 0 and 40,000 or more. Or do you mean from the facets? Or do my tests above answer it? >- Xmx The maximum the system allows me to get: 1612m Maybe I have a hopelessly under-dimensioned server for this sort of things? Thanks a lot for your help, Michael
RE: Facet performance
On Fri, 2013-10-18 at 18:30 +0200, Lemke, Michael SZ/HZA-ZSW wrote: > Toke Eskildsen [mailto:t...@statsbiblioteket.dk] wrote: > > Unfortunately the enum-solution is normally quite slow when there > > are enough unique values to trigger the "too many > values"-exception. > > [...] > > [...] And yes, the fc method was terribly slow in a case where it did > work. Something like 20 minutes whereas enum returned within a few > seconds. Err.. What? That sounds _very_ strange. You have millions of unique values so fc should be a lot faster than enum, not the other way around. I assume the 20 minutes was for the first call. How fast do subsequent calls return for fc? Maybe you could provide some approximate numbers? - Documents in your index - Unique values in the CONTENT field - Hits returned from a typical query - Xmx Regards, Toke Eskildsen, State and University Library, Denmark
RE: Facet performance
: >> 1. q=word&facet.field=CONTENT&facet=true&facet.prefix=&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0 : >> 2. q=word&facet.field=CONTENT&facet=true&facet.prefix=a&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0 : > : >> The only difference is an empty facet.prefix in the first query. : >If your index was just opened when you issued your queries, the first : request will be notably slower than the second as the facet values might : not be in the disk cache. : : I know but it shouldn't be orders of magnitude as in this example, should it? in and of itself: it can be if your index is large enough and none of the disk pages are in the file system buffer. more significantly however, is that depending on how big your filterCache is, the first request could easily be caching all of the filters needed for the second query -- at a minimum it's definitely caching your main query which will be re-used and save a lot of time independent of the faceting. -Hoss
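The filterCache that Hoss refers to is configured in solrconfig.xml. With facet.method=enum, Solr can build one cached filter per facet term it examines, so the cache needs to be sized with the faceted field's term count in mind. A sketch, with illustrative values that are not a recommendation from the thread:

```xml
<!-- solrconfig.xml (sketch). facet.method=enum may cache a filter for
     each term of the faceted field, on top of the filters from fq
     clauses, so an undersized cache will thrash. Values are examples. -->
<filterCache class="solr.FastLRUCache"
             size="16384"
             initialSize="4096"
             autowarmCount="512"/>
```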
Re: Facet performance
DocValues is the new black http://wiki.apache.org/solr/DocValues Otis -- Solr & ElasticSearch Support -- http://sematext.com/ SOLR Performance Monitoring -- http://sematext.com/spm On Fri, Oct 18, 2013 at 12:30 PM, Lemke, Michael SZ/HZA-ZSW wrote: > Toke Eskildsen [mailto:t...@statsbiblioteket.dk] wrote: >>Lemke, Michael SZ/HZA-ZSW [lemke...@schaeffler.com] wrote: >>> 1. >>> q=word&facet.field=CONTENT&facet=true&facet.prefix=&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0 >>> 2. >>> q=word&facet.field=CONTENT&facet=true&facet.prefix=a&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0 >> >>> The only difference is an empty facet.prefix in the first query. >> >>> The first query returns after some 20 seconds (QTime 2 in the result) >>> while >>> the second one takes only 80 msec (QTime 80). Why is this? >> >>If your index was just opened when you issued your queries, the first request >>will be notably slower than the second as the facet values might not be in > the disk cache. > > I know but it shouldn't be orders of magnitude as in this example, should it? > >> >>Furthermore, for enum the difference between no prefix and some prefix is >>huge. As enum iterates values first (as opposed to fc that iterates hits >>first), limiting to only the values that start with 'a' ought to speed up >>retrieval by a factor 10 or more. > > Thanks. That is what we sort of figured but it's good to know for sure. Of > course it begs the question if there is a way to speed this up? > >> >>> And as a side note: facet.method=fc makes the queries run 'forever' and >>> eventually >>> fail with org.apache.solr.common.SolrException: Too many values for >>> UnInvertedField faceting on field CONTENT. >> >>An internal memory structure optimization in Solr limits the number of >>possible unique values when using fc. It is not a bug as such, but more a >>consequence of a choice. 
Unfortunately the enum-solution is normally quite >>slow when there are enough unique values to trigger the "too many >>values"-exception. I know too little about the structures for DocValues to >>say if they will help here, but you might want to take a look at those. > > What is DocValues? Haven't heard of it yet. And yes, the fc method was > terribly slow in a case where it did work. Something like 20 minutes whereas > enum returned within a few seconds. > > Michael >
RE: Facet performance
Toke Eskildsen [mailto:t...@statsbiblioteket.dk] wrote: >Lemke, Michael SZ/HZA-ZSW [lemke...@schaeffler.com] wrote: >> 1. >> q=word&facet.field=CONTENT&facet=true&facet.prefix=&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0 >> 2. >> q=word&facet.field=CONTENT&facet=true&facet.prefix=a&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0 > >> The only difference is an empty facet.prefix in the first query. > >> The first query returns after some 20 seconds (QTime 2 in the result) >> while >> the second one takes only 80 msec (QTime 80). Why is this? > >If your index was just opened when you issued your queries, the first request >will be notably slower than the second as the facet values might not be in the disk cache. I know but it shouldn't be orders of magnitude as in this example, should it? > >Furthermore, for enum the difference between no prefix and some prefix is >huge. As enum iterates values first (as opposed to fc that iterates hits >first), limiting to only the values that start with 'a' ought to speed up >retrieval by a factor 10 or more. Thanks. That is what we sort of figured but it's good to know for sure. Of course it begs the question if there is a way to speed this up? > >> And as a side note: facet.method=fc makes the queries run 'forever' and >> eventually >> fail with org.apache.solr.common.SolrException: Too many values for >> UnInvertedField faceting on field CONTENT. > >An internal memory structure optimization in Solr limits the number of >possible unique values when using fc. It is not a bug as such, but more a >consequence of a choice. Unfortunately the enum-solution is normally quite >slow when there are enough unique values to trigger the "too many >values"-exception. I know too little about the structures for DocValues to say >if they will help here, but you might want to take a look at those. What is DocValues? Haven't heard of it yet. And yes, the fc method was terribly slow in a case where it did work. 
Something like 20 minutes whereas enum returned within a few seconds. Michael
RE: Facet performance
Lemke, Michael SZ/HZA-ZSW [lemke...@schaeffler.com] wrote: > 1. > q=word&facet.field=CONTENT&facet=true&facet.prefix=&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0 > 2. > q=word&facet.field=CONTENT&facet=true&facet.prefix=a&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0 > The only difference is an empty facet.prefix in the first query. > The first query returns after some 20 seconds (QTime 2 in the result) > while > the second one takes only 80 msec (QTime 80). Why is this? If your index was just opened when you issued your queries, the first request will be notably slower than the second as the facet values might not be in the disk cache. Furthermore, for enum the difference between no prefix and some prefix is huge. As enum iterates values first (as opposed to fc that iterates hits first), limiting to only the values that start with 'a' ought to speed up retrieval by a factor 10 or more. > And as a side note: facet.method=fc makes the queries run 'forever' and > eventually > fail with org.apache.solr.common.SolrException: Too many values for > UnInvertedField faceting on field CONTENT. An internal memory structure optimization in Solr limits the number of possible unique values when using fc. It is not a bug as such, but more a consequence of a choice. Unfortunately the enum-solution is normally quite slow when there are enough unique values to trigger the "too many values"-exception. I know too little about the structures for DocValues to say if they will help here, but you might want to take a look at those. - Toke Eskildsen
Re: facet performance tips
Right, I haven't used SOLR-475 yet and am more familiar with Bobo. I believe there are differences but I haven't gone into them yet. As I'm using Solr 1.4 now, maybe I'll test the UnInvertedField modality. Feel free to report back results as I don't think I've seen much yet? On Thu, Aug 13, 2009 at 10:51 AM, Fuad Efendi wrote: > SOLR-1.4-trunk uses terms counting instead of bitset intersects (seems to > be); check this > http://issues.apache.org/jira/browse/SOLR-475 > (and probably http://issues.apache.org/jira/browse/SOLR-711) > > -Original Message- > From: Jason Rutherglen > > Yeah we need a performance comparison, I haven't had time to put > one together. If/when I do I'll compare Bobo performance against > Solr bitset intersection based facets, compare memory > consumption. > > For near realtime Solr needs to cache and merge bitsets at the > SegmentReader level, and Bobo needs to be upgraded to work with > Lucene 2.9's searching at the segment level (currently it uses a > MultiSearcher). > > Distributed search on either should be fairly straightforward? > > On Thu, Aug 13, 2009 at 9:55 AM, Fuad Efendi wrote: >> It seems BOBO-Browse is alternate faceting engine; would be interesting to >> compare performance with SOLR... Distributed? >> >> >> -Original Message- >> From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com] >> Sent: August-12-09 6:12 PM >> To: solr-user@lucene.apache.org >> Subject: Re: facet performance tips >> >> For your fields with many terms you may want to try Bobo >> http://code.google.com/p/bobo-browse/ which could work well with your >> case. >> >> >> >> >> > > >
RE: facet performance tips
SOLR-1.4-trunk uses terms counting instead of bitset intersects (seems to be); check this http://issues.apache.org/jira/browse/SOLR-475 (and probably http://issues.apache.org/jira/browse/SOLR-711) -Original Message- From: Jason Rutherglen Yeah we need a performance comparison, I haven't had time to put one together. If/when I do I'll compare Bobo performance against Solr bitset intersection based facets, compare memory consumption. For near realtime Solr needs to cache and merge bitsets at the SegmentReader level, and Bobo needs to be upgraded to work with Lucene 2.9's searching at the segment level (currently it uses a MultiSearcher). Distributed search on either should be fairly straightforward? On Thu, Aug 13, 2009 at 9:55 AM, Fuad Efendi wrote: > It seems BOBO-Browse is alternate faceting engine; would be interesting to > compare performance with SOLR... Distributed? > > > -Original Message- > From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com] > Sent: August-12-09 6:12 PM > To: solr-user@lucene.apache.org > Subject: Re: facet performance tips > > For your fields with many terms you may want to try Bobo > http://code.google.com/p/bobo-browse/ which could work well with your > case. > > > > >
Re: facet performance tips
Yeah we need a performance comparison, I haven't had time to put one together. If/when I do I'll compare Bobo performance against Solr bitset intersection based facets, compare memory consumption. For near realtime Solr needs to cache and merge bitsets at the SegmentReader level, and Bobo needs to be upgraded to work with Lucene 2.9's searching at the segment level (currently it uses a MultiSearcher). Distributed search on either should be fairly straightforward? On Thu, Aug 13, 2009 at 9:55 AM, Fuad Efendi wrote: > It seems BOBO-Browse is alternate faceting engine; would be interesting to > compare performance with SOLR... Distributed? > > > -Original Message- > From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com] > Sent: August-12-09 6:12 PM > To: solr-user@lucene.apache.org > Subject: Re: facet performance tips > > For your fields with many terms you may want to try Bobo > http://code.google.com/p/bobo-browse/ which could work well with your > case. > > > > >
RE: facet performance tips
Interesting, it has "BoboRequestHandler implements SolrRequestHandler" - easy to try it; and shards support [Fuad Efendi] It seems BOBO-Browse is alternate faceting engine; would be interesting to compare performance with SOLR... Distributed? [Jason Rutherglen] For your fields with many terms you may want to try Bobo http://code.google.com/p/bobo-browse/ which could work well with your case.
RE: facet performance tips
It seems BOBO-Browse is alternate faceting engine; would be interesting to compare performance with SOLR... Distributed? -Original Message- From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com] Sent: August-12-09 6:12 PM To: solr-user@lucene.apache.org Subject: Re: facet performance tips For your fields with many terms you may want to try Bobo http://code.google.com/p/bobo-browse/ which could work well with your case.
RE: facet performance tips
I took 1.4 from trunk three days ago, it seems Ok for production (at least for my Master instance which is doing writes-only). I use the same config files. 500 000 terms are Ok too; I am using several millions with pre-1.3 SOLR taken from trunk. However, do not try to "facet" (probably outdated term after SOLR-475) on generic queries such as [* TO *] (with huge resultset). For smaller query results (100,000 instead of 100,000,000) "counting terms" is fast enough (few milliseconds at http://www.tokenizer.org) -Original Message- From: Jérôme Etévé [mailto:jerome.et...@gmail.com] Sent: August-13-09 5:38 AM To: solr-user@lucene.apache.org Subject: Re: facet performance tips Thanks everyone for your advices. I increased my filterCache, and the faceting performances improved greatly. My faceted field can have at the moment ~4 different terms, so I did set a filterCache size of 5 and it works very well. However, I'm planning to increase the number of terms to maybe around 500 000, so I guess this approach won't work anymore, as I doubt a 500 000 sized fieldCache would work. So I guess my best move would be to upgrade to the soon to be 1.4 version of solr to benefit from its new faceting method. I know this is a bit off-topic, but do you have a rough idea about when 1.4 will be an official release? As well, is the current trunk OK for production? Is it compatible with 1.3 configuration files? Thanks ! Jerome. 2009/8/13 Stephen Duncan Jr : > Note that depending on the profile of your field (full text and how many > unique terms on average per document), the improvements from 1.4 may not > apply, as you may exceed the limits of the new faceting technique in Solr > 1.4. > -Stephen > > On Wed, Aug 12, 2009 at 2:12 PM, Erik Hatcher wrote: > >> Yes, increasing the filterCache size will help with Solr 1.3 performance. >> >> Do note that trunk (soon Solr 1.4) has dramatically improved faceting >> performance. 
>> >>Erik >> >> >> On Aug 12, 2009, at 1:30 PM, Jérôme Etévé wrote: >> >> Hi everyone, >>> >>> I'm using some faceting on a solr index containing ~ 160K documents. >>> I perform facets on multivalued string fields. The number of possible >>> different values is quite large. >>> >>> Enabling facets degrades the performance by a factor 3. >>> >>> Because I'm using solr 1.3, I guess the facetting makes use of the >>> filter cache to work. My filterCache is set >>> to a size of 2048. I also noticed in my solr stats a very small ratio >>> of cache hit (~ 0.01%). >>> >>> Can it be the reason why the faceting is slow? Does it make sense to >>> increase the filterCache size so it matches more or less the number >>> of different possible values for the faceted fields? Would that not >>> make the memory usage explode? >>> >>> Thanks for your help ! >>> >>> -- >>> Jerome Eteve. >>> >>> Chat with me live at http://www.eteve.net >>> >>> jer...@eteve.net >>> >> >> > > > -- > Stephen Duncan Jr > www.stephenduncanjr.com > -- Jerome Eteve. Chat with me live at http://www.eteve.net jer...@eteve.net
Re: facet performance tips
Thanks everyone for your advice. I increased my filterCache, and the faceting performance improved greatly. My faceted field can have at the moment ~4 different terms, so I did set a filterCache size of 5 and it works very well. However, I'm planning to increase the number of terms to maybe around 500 000, so I guess this approach won't work anymore, as I doubt a 500 000 sized filterCache would work. So I guess my best move would be to upgrade to the soon-to-be 1.4 version of Solr to benefit from its new faceting method. I know this is a bit off-topic, but do you have a rough idea about when 1.4 will be an official release? As well, is the current trunk OK for production? Is it compatible with 1.3 configuration files? Thanks ! Jerome. 2009/8/13 Stephen Duncan Jr : > Note that depending on the profile of your field (full text and how many > unique terms on average per document), the improvements from 1.4 may not > apply, as you may exceed the limits of the new faceting technique in Solr > 1.4. > -Stephen > > On Wed, Aug 12, 2009 at 2:12 PM, Erik Hatcher wrote: > >> Yes, increasing the filterCache size will help with Solr 1.3 performance. >> >> Do note that trunk (soon Solr 1.4) has dramatically improved faceting >> performance. >> >>Erik >> >> >> On Aug 12, 2009, at 1:30 PM, Jérôme Etévé wrote: >> >> Hi everyone, >>> >>> I'm using some faceting on a solr index containing ~ 160K documents. >>> I perform facets on multivalued string fields. The number of possible >>> different values is quite large. >>> >>> Enabling facets degrades the performance by a factor 3. >>> >>> Because I'm using solr 1.3, I guess the faceting makes use of the >>> filter cache to work. My filterCache is set >>> to a size of 2048. I also noticed in my solr stats a very small ratio >>> of cache hits (~ 0.01%). >>> >>> Can it be the reason why the faceting is slow? 
Does it make sense to >>> increase the filterCache size so it matches more or less the number >>> of different possible values for the faceted fields? Would that not >>> make the memory usage explode? >>> >>> Thanks for your help ! >>> >>> -- >>> Jerome Eteve. >>> >>> Chat with me live at http://www.eteve.net >>> >>> jer...@eteve.net >>> >> >> > > > -- > Stephen Duncan Jr > www.stephenduncanjr.com > -- Jerome Eteve. Chat with me live at http://www.eteve.net jer...@eteve.net
Re: facet performance tips
Note that depending on the profile of your field (full text and how many unique terms on average per document), the improvements from 1.4 may not apply, as you may exceed the limits of the new faceting technique in Solr 1.4. -Stephen On Wed, Aug 12, 2009 at 2:12 PM, Erik Hatcher wrote: > Yes, increasing the filterCache size will help with Solr 1.3 performance. > > Do note that trunk (soon Solr 1.4) has dramatically improved faceting > performance. > >Erik > > > On Aug 12, 2009, at 1:30 PM, Jérôme Etévé wrote: > > Hi everyone, >> >> I'm using some faceting on a solr index containing ~ 160K documents. >> I perform facets on multivalued string fields. The number of possible >> different values is quite large. >> >> Enabling facets degrades the performance by a factor 3. >> >> Because I'm using solr 1.3, I guess the facetting makes use of the >> filter cache to work. My filterCache is set >> to a size of 2048. I also noticed in my solr stats a very small ratio >> of cache hit (~ 0.01%). >> >> Can it be the reason why the faceting is slow? Does it make sense to >> increase the filterCache size so it matches more or less the number >> of different possible values for the faceted fields? Would that not >> make the memory usage explode? >> >> Thanks for your help ! >> >> -- >> Jerome Eteve. >> >> Chat with me live at http://www.eteve.net >> >> jer...@eteve.net >> > > -- Stephen Duncan Jr www.stephenduncanjr.com
Re: facet performance tips
For your fields with many terms you may want to try Bobo http://code.google.com/p/bobo-browse/ which could work well with your case. On Wed, Aug 12, 2009 at 12:02 PM, Fuad Efendi wrote: > I am currently faceting on tokenized multi-valued field at > http://www.tokenizer.org (25 mlns simple docs) > > It uses some home-made quick fixes similar to SOLR-475 (SOLR-711) and > non-synchronized cache (similar to LingPipe's FastCache, SOLR-665, SOLR-667) > > Average "faceting" on query results: 0.2 - 0.3 seconds; without those > patches - 20-50 seconds. > > I am going to upgrade to SOLR-1.4 from trunk (with SOLR-475 & SOLR-667) and > to compare results... > > > > > P.S. > Avoid faceting on a field with heavy distribution of terms (such as few > millions of terms in my case); It won't work in SOLR 1.3. > > TIP: use non-tokenized single-valued field for faceting, such as > non-tokenized "country" field. > > > > P.P.S. > Would be nice to load/stress > http://alias-i.com/lingpipe/docs/api/com/aliasi/util/FastCache.html against > putting CPU in a spin loop ConcurrentHashMap. > > > > -Original Message----- > From: Erik Hatcher [mailto:ehatc...@apache.org] > Sent: August-12-09 2:12 PM > To: solr-user@lucene.apache.org > Subject: Re: facet performance tips > > Yes, increasing the filterCache size will help with Solr 1.3 > performance. > > Do note that trunk (soon Solr 1.4) has dramatically improved faceting > performance. > > Erik > > On Aug 12, 2009, at 1:30 PM, Jérôme Etévé wrote: > >> Hi everyone, >> >> I'm using some faceting on a solr index containing ~ 160K documents. >> I perform facets on multivalued string fields. The number of possible >> different values is quite large. >> >> Enabling facets degrades the performance by a factor 3. >> >> Because I'm using solr 1.3, I guess the facetting makes use of the >> filter cache to work. My filterCache is set >> to a size of 2048. I also noticed in my solr stats a very small ratio >> of cache hit (~ 0.01%). 
>> >> Can it be the reason why the faceting is slow? Does it make sense to >> increase the filterCache size so it matches more or less the number >> of different possible values for the faceted fields? Would that not >> make the memory usage explode? >> >> Thanks for your help ! >> >> -- >> Jerome Eteve. >> >> Chat with me live at http://www.eteve.net >> >> jer...@eteve.net > > > >
RE: facet performance tips
I am currently faceting on a tokenized multi-valued field at http://www.tokenizer.org (25 million simple docs) It uses some home-made quick fixes similar to SOLR-475 (SOLR-711) and a non-synchronized cache (similar to LingPipe's FastCache, SOLR-665, SOLR-667) Average "faceting" on query results: 0.2 - 0.3 seconds; without those patches - 20-50 seconds. I am going to upgrade to SOLR-1.4 from trunk (with SOLR-475 & SOLR-667) and compare results... P.S. Avoid faceting on a field with a heavy distribution of terms (such as a few million terms in my case); it won't work in SOLR 1.3. TIP: use a non-tokenized single-valued field for faceting, such as a non-tokenized "country" field. P.P.S. Would be nice to load/stress http://alias-i.com/lingpipe/docs/api/com/aliasi/util/FastCache.html against ConcurrentHashMap, which puts the CPU in a spin loop. -----Original Message----- From: Erik Hatcher [mailto:ehatc...@apache.org] Sent: August-12-09 2:12 PM To: solr-user@lucene.apache.org Subject: Re: facet performance tips Yes, increasing the filterCache size will help with Solr 1.3 performance. Do note that trunk (soon Solr 1.4) has dramatically improved faceting performance. Erik On Aug 12, 2009, at 1:30 PM, Jérôme Etévé wrote: > Hi everyone, > > I'm using some faceting on a solr index containing ~ 160K documents. > I perform facets on multivalued string fields. The number of possible > different values is quite large. > > Enabling facets degrades the performance by a factor 3. > > Because I'm using solr 1.3, I guess the facetting makes use of the > filter cache to work. My filterCache is set > to a size of 2048. I also noticed in my solr stats a very small ratio > of cache hit (~ 0.01%). > > Can it be the reason why the faceting is slow? Does it make sense to > increase the filterCache size so it matches more or less the number > of different possible values for the faceted fields? Would that not > make the memory usage explode? > > Thanks for your help ! > > -- > Jerome Eteve.
> > Chat with me live at http://www.eteve.net > > jer...@eteve.net
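Fuad's tip maps directly onto a schema.xml field declaration. A sketch (the field name is his "country" example; the attributes are illustrative, not from the thread):

```xml
<!-- schema.xml: a single-valued, non-tokenized "string" field is the
     cheapest thing to facet on; tokenized or multi-valued fields fall
     back to the much slower per-term filterCache approach in Solr 1.3 -->
<field name="country" type="string" indexed="true" stored="true"
       multiValued="false"/>
```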
Re: facet performance tips
Yes, increasing the filterCache size will help with Solr 1.3 performance. Do note that trunk (soon Solr 1.4) has dramatically improved faceting performance. Erik On Aug 12, 2009, at 1:30 PM, Jérôme Etévé wrote: Hi everyone, I'm using some faceting on a solr index containing ~ 160K documents. I perform facets on multivalued string fields. The number of possible different values is quite large. Enabling facets degrades the performance by a factor 3. Because I'm using solr 1.3, I guess the facetting makes use of the filter cache to work. My filterCache is set to a size of 2048. I also noticed in my solr stats a very small ratio of cache hit (~ 0.01%). Can it be the reason why the faceting is slow? Does it make sense to increase the filterCache size so it matches more or less the number of different possible values for the faceted fields? Would that not make the memory usage explode? Thanks for your help ! -- Jerome Eteve. Chat with me live at http://www.eteve.net jer...@eteve.net
RE: facet performance tips
Jerome, Yes, you need to increase the filterCache size to something close to the number of unique facet values. But also consider the RAM required to accommodate the increase. I did see a significant performance gain by increasing the filterCache size. Thanks, Kalyan Manepalli -----Original Message----- From: Jérôme Etévé [mailto:jerome.et...@gmail.com] Sent: Wednesday, August 12, 2009 12:31 PM To: solr-user@lucene.apache.org Subject: facet performance tips Hi everyone, I'm using some faceting on a solr index containing ~ 160K documents. I perform facets on multivalued string fields. The number of possible different values is quite large. Enabling facets degrades the performance by a factor 3. Because I'm using solr 1.3, I guess the facetting makes use of the filter cache to work. My filterCache is set to a size of 2048. I also noticed in my solr stats a very small ratio of cache hit (~ 0.01%). Can it be the reason why the faceting is slow? Does it make sense to increase the filterCache size so it matches more or less the number of different possible values for the faceted fields? Would that not make the memory usage explode? Thanks for your help ! -- Jerome Eteve. Chat with me live at http://www.eteve.net jer...@eteve.net
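In solrconfig.xml terms, Kalyan's advice looks roughly like this. The numbers here are illustrative, not from the thread; the idea is to size the cache slightly above the total count of unique terms across all faceted fields, and to budget RAM accordingly:

```xml
<!-- solrconfig.xml: filterCache sized for faceting on a high-cardinality
     field (LRUCache syntax as in Solr 1.3/1.4; values are illustrative) -->
<filterCache class="solr.LRUCache"
             size="2100000"
             initialSize="2100000"
             autowarmCount="0"/>
```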
Re: Facet Performance
Hoss, This is still an extremely interesting area for possible improvements; I simply don't want the topic to die http://www.nabble.com/Facet-Performance-td7746964.html http://issues.apache.org/jira/browse/SOLR-665 http://issues.apache.org/jira/browse/SOLR-667 http://issues.apache.org/jira/browse/SOLR-669 I am currently using faceting on a single-valued _tokenized_ field with a huge number of documents; an _unsynchronized_ version of FIFOCache; 1.5 seconds average response time (for faceted queries only!) I think we can use an additional cache for facet results (to store calculated values!); Lucene's FieldCache can be used only for non-tokenized single-valued non-boolean fields -Fuad hossman_lucene wrote: > > > : Unfortunately which strategy will be chosen is currently undocumented > : and control is a bit oblique: If the field is tokenized or multivalued > : or Boolean, the FilterQuery method will be used; otherwise the > : FieldCache method. I expect I or others will improve that shortly. > > Bear in mind, what's provided out of the box is "SimpleFacets" ... it's > designed to meet simple faceting needs ... when you start talking about > 100s or thousands of constraints per facet, you are getting outside the > scope of what it was intended to serve efficiently. > > At a certain point the only practical thing to do is write a custom > request handler that makes the best choices for your data. > > For the record: a really simple patch someone could submit would be to > add an optional field-based param indicating which type of faceting > (termenum/fieldcache) should be used to generate the list of terms and > then make SimpleFacets.getFacetFieldCounts use that and call the > appropriate method instead of calling getTermCounts -- that way you could > force one or the other if you know it's better for your data/query.
> > > > -Hoss
Re: Facet Performance
Erik Hatcher wrote: On Dec 8, 2006, at 2:15 PM, Andrew Nagy wrote: My data is 492,000 records of book data. I am faceting on 4 fields: author, subject, language, format. Format and language are fairly simple as there are only a few unique terms. Author and subject however are much different in that there are thousands of unique terms. When encountering difficult issues, I like to think in terms of the user interface. Surely you're not presenting 400k+ authors to the users in one shot. In Collex, we have put an AJAX drop-down that shows the author facet (we call it name on the UI, with various roles like author, painter, etc). You can see this in action here: In our data, we don't have unique authors for each record ... so let's say out of the 500,000 records ... we have 200,000 authors. What I am trying to display is the top 10 authors from the results of a search. So I do a search for title:"Gone with the wind" and I would like to see the top 10 matching authors from these results. But no worries, I have written my own facet handler and I am now back to under a second with faceting! Thanks for everyone's help and keep up the good work! Andrew
Re: Facet Performance
: Unfortunately which strategy will be chosen is currently undocumented : and control is a bit oblique: If the field is tokenized or multivalued : or Boolean, the FilterQuery method will be used; otherwise the : FieldCache method. I expect I or others will improve that shortly. Bear in mind, what's provided out of the box is "SimpleFacets" ... it's designed to meet simple faceting needs ... when you start talking about 100s or thousands of constraints per facet, you are getting outside the scope of what it was intended to serve efficiently. At a certain point the only practical thing to do is write a custom request handler that makes the best choices for your data. For the record: a really simple patch someone could submit would be to add an optional field-based param indicating which type of faceting (termenum/fieldcache) should be used to generate the list of terms and then make SimpleFacets.getFacetFieldCounts use that and call the appropriate method instead of calling getTermCounts -- that way you could force one or the other if you know it's better for your data/query. -Hoss
Re: Facet Performance
On Dec 8, 2006, at 2:15 PM, Andrew Nagy wrote: My data is 492,000 records of book data. I am faceting on 4 fields: author, subject, language, format. Format and language are fairly simple as there are only a few unique terms. Author and subject however are much different in that there are thousands of unique terms. When encountering difficult issues, I like to think in terms of the user interface. Surely you're not presenting 400k+ authors to the users in one shot. In Collex, we have put an AJAX drop-down that shows the author facet (we call it name on the UI, with various roles like author, painter, etc). You can see this in action here: http://www.nines.org/collex type "da" into the name box, for example. I developed a custom request handler in Solr for returning these types of suggest interfaces complete with facet counts. My code is very specific to our fields, so it's not usable in a general sense, but maybe this gives you some ideas on where to go with these large sets of facet values. Erik
Re: Facet Performance
J.J. Larrea wrote: Unfortunately which strategy will be chosen is currently undocumented and control is a bit oblique: If the field is tokenized or multivalued or Boolean, the FilterQuery method will be used; otherwise the FieldCache method. I expect I or others will improve that shortly. Good to hear, cause I can't really get away with not having a multi-valued field for author. I'm really excited by Solr and really impressed so far. Thanks! Andrew
Re: Facet Performance
On 12/8/06, J.J. Larrea <[EMAIL PROTECTED]> wrote: Unfortunately which strategy will be chosen is currently undocumented and control is a bit oblique: If the field is tokenized or multivalued or Boolean, the FilterQuery method will be used; otherwise the FieldCache method. If anyone had time, some of this could be documented here: http://wiki.apache.org/solr/SimpleFacetParameters The wiki is open to all. Or perhaps a new top-level FacetedSearching page that references SimpleFacetParameters. -Yonik
Re: Facet Performance
Andrew Nagy, ditto on what Yonik said. Here is some further elaboration: I am doing much the same thing (faceting on Author etc.). When my Author field was defined as a solr.TextField, even using solr.KeywordTokenizerFactory so it wasn't actually tokenized, the faceting code chose the QueryFilter approach, and faceting on Author for 100k+ documents took about 4 seconds. When I changed the field to "string", i.e. solr.StrField, the faceting code recognized it as untokenized and used the FieldCache approach. Times have dropped to about 120ms for the first query (when the FieldCache is generated) and < 10ms for subsequent queries returning a few thousand results. Quite a difference. The strategy must be chosen on a field-by-field basis. While QueryFilter is excellent for fields with a small set of enumerated values such as Language or Format, it is inappropriate for large value sets such as Author. Unfortunately which strategy will be chosen is currently undocumented and control is a bit oblique: If the field is tokenized or multivalued or Boolean, the FilterQuery method will be used; otherwise the FieldCache method. I expect I or others will improve that shortly. - J.J. At 2:58 PM -0500 12/8/06, Yonik Seeley wrote: >Right, if any of these are tokenized, then you could make them >non-tokenized (use "string" type). If they really need to be >tokenized (author for example), then you could use copyField to make >another copy to a non-tokenized field that you can use for faceting. > >After that, as Hoss suggests, run a single faceting query with all 4 >fields and look at the filterCache statistics. Take the "lookups" >number and multiply it by, say, 1.5 to leave some room for future >growth, and use that as your cache size. You probably want to bump up >both initialSize and autowarmCount as well. > >The first query will still be slow. The second should be relatively fast. >You may hit an OOM error. Increase the JVM heap size if this happens. > >-Yonik
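J.J.'s before/after can be sketched in schema.xml (field and type names here are illustrative). The key point is that only a true solr.StrField is recognized as untokenized and routed to the FieldCache strategy:

```xml
<!-- Before: untokenized in practice, but still a solr.TextField,
     so the slower QueryFilter faceting strategy was chosen -->
<fieldType name="keyword" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>
<field name="author" type="keyword" indexed="true" stored="true"/>

<!-- After: a plain "string" (solr.StrField) field selects the
     FieldCache strategy -->
<field name="author" type="string" indexed="true" stored="true"/>
```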
Re: Facet Performance
Yonik Seeley wrote: Are they multivalued, and do they need to be. Anything that is of type "string" and not multivalued will use the lucene FieldCache rather than the filterCache. The author field is multivalued. Will this be a strong performance issue? I could make multiple author fields as to not have the multivalued field and then only facet on the first author. Thanks Andrew
Re: Facet Performance
On 12/8/06, Andrew Nagy <[EMAIL PROTECTED]> wrote: Chris Hostetter wrote: >: Could you suggest a better configuration based on this? > >If that's what your stats look like after a single request, then i would >guess you would need to make your cache size at least 1.6 million in order >for it to be of any use in improving your facet speed. > > Would this have any strong impact on my system? Should I just set it to an even 2 million to allow for growth? Change the following setting in solrconfig.xml from true to false, and you should be fine with a higher cache size. That will prevent the filterCache from being used for anything but filters and faceting, so if you set it too high, it won't be utilized anyway. >: My data is 492,000 records of book data. I am faceting on 4 fields: >: author, subject, language, format. >: Format and language are fairly simple as their are only a few unique >: terms. Author and subject however are much different in that there are >: thousands of unique terms. > >by the looks of it, you have a lot more than a few thousand unique terms >in those two fields ... are you tokenizing on these fields? that's >probably not what you want for fields you're going to facet on. > > All of these fields are set as "string" in my schema Are they multivalued, and do they need to be? Anything that is of type "string" and not multivalued will use the lucene FieldCache rather than the filterCache. -Yonik
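The XML element Yonik quoted was stripped by the archive, leaving only "true to false". Given his description (keep the filterCache for filters and faceting only), the setting is most likely useFilterForSortedQuery; treat this reconstruction as an assumption:

```xml
<!-- solrconfig.xml: assumed reconstruction of the stripped element.
     Setting it to false keeps sorted, non-scoring queries from consuming
     filterCache entries, leaving the cache to filters and faceting. -->
<useFilterForSortedQuery>false</useFilterForSortedQuery>
```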
Re: Facet Performance
Chris Hostetter wrote: : Could you suggest a better configuration based on this? If that's what your stats look like after a single request, then i would guess you would need to make your cache size at least 1.6 million in order for it to be of any use in improving your facet speed. Would this have any strong impact on my system? Should I just set it to an even 2 million to allow for growth? : My data is 492,000 records of book data. I am faceting on 4 fields: : author, subject, language, format. : Format and language are fairly simple as there are only a few unique : terms. Author and subject however are much different in that there are : thousands of unique terms. by the looks of it, you have a lot more than a few thousand unique terms in those two fields ... are you tokenizing on these fields? that's probably not what you want for fields you're going to facet on. All of these fields are set as "string" in my schema, so if I understand the fields correctly, they are not being tokenized. I also have an author field that is set as "text" for searching. Thanks Andrew
Re: Facet Performance
On 12/8/06, Chris Hostetter <[EMAIL PROTECTED]> wrote: : My data is 492,000 records of book data. I am faceting on 4 fields: : author, subject, language, format. : Format and language are fairly simple as there are only a few unique : terms. Author and subject however are much different in that there are : thousands of unique terms. by the looks of it, you have a lot more than a few thousand unique terms in those two fields ... are you tokenizing on these fields? that's probably not what you want for fields you're going to facet on. Right, if any of these are tokenized, then you could make them non-tokenized (use "string" type). If they really need to be tokenized (author for example), then you could use copyField to make another copy to a non-tokenized field that you can use for faceting. After that, as Hoss suggests, run a single faceting query with all 4 fields and look at the filterCache statistics. Take the "lookups" number and multiply it by, say, 1.5 to leave some room for future growth, and use that as your cache size. You probably want to bump up both initialSize and autowarmCount as well. The first query will still be slow. The second should be relatively fast. You may hit an OOM error. Increase the JVM heap size if this happens. -Yonik
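The copyField arrangement Yonik suggests might look like this in schema.xml (the names are illustrative):

```xml
<!-- Keep the tokenized field for full-text search... -->
<field name="author"       type="text"   indexed="true" stored="true"/>
<!-- ...and maintain an untokenized "string" copy purely for faceting -->
<field name="author_facet" type="string" indexed="true" stored="false"/>
<copyField source="author" dest="author_facet"/>
```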
Re: Facet Performance
: Here are the stats, Im still a newbie to SOLR, so Im not totally sure : what this all means: : lookups : 1530036 : hits : 2 : hitratio : 0.00 : inserts : 1530035 : evictions : 1504435 : size : 25600 those numbers are telling you that your cache is capable of holding 25,600 items. you have attempted to lookup something in the cache 1,530,036 times, and only 2 of those times did you get a hit. you have added 1,530,035 items to the cache, and 1,504,435 items have been removed from your cache to make room for newer items. in short: your cache isn't really helping you at all. : Could you suggest a better configuration based on this? If that's what your stats look like after a single request, then i would guess you would need to make your cache size at least 1.6 million in order for it to be of any use in improving your facet speed. : My data is 492,000 records of book data. I am faceting on 4 fields: : author, subject, language, format. : Format and language are fairly simple as their are only a few unique : terms. Author and subject however are much different in that there are : thousands of unique terms. by the looks of it, you have a lot more than a few thousand unique terms in those two fields ... are you tokenizing on these fields? that's probably not what you want for fields you're going to facet on. -Hoss
Re: Facet Performance
Yonik Seeley wrote: On 12/8/06, Andrew Nagy <[EMAIL PROTECTED]> wrote: I changed the filterCache to the following: However a search that normally takes .04s is taking 74 seconds once I use the facets since I am faceting on 4 fields. The first time or subsequent times? Is your filterCache big enough yet? What do you see for evictions and hit ratio? Here are the stats, I'm still a newbie to Solr, so I'm not totally sure what this all means: lookups : 1530036 hits : 2 hitratio : 0.00 inserts : 1530035 evictions : 1504435 size : 25600 cumulative_lookups : 1530036 cumulative_hits : 2 cumulative_hitratio : 0.00 cumulative_inserts : 1530035 cumulative_evictions : 1504435 Could you suggest a better configuration based on this? Can you suggest a better configuration that would solve this performance issue, or should I not use faceting? Faceting isn't something that will always be fast... one often needs to design things in a way that it can be fast. Can you give some examples of your faceted queries? Can you show the field and fieldtype definitions for the fields you are faceting on? For each field that you are faceting on, how many different terms are in it? My data is 492,000 records of book data. I am faceting on 4 fields: author, subject, language, format. Format and language are fairly simple as there are only a few unique terms. Author and subject however are much different in that there are thousands of unique terms. Thanks for your help! Andrew
Re: Facet Performance
On 12/8/06, Andrew Nagy <[EMAIL PROTECTED]> wrote: I changed the filterCache to the following: However a search that normally takes .04s is taking 74 seconds once I use the facets since I am faceting on 4 fields. The first time or subsequent times? Is your filterCache big enough yet? What do you see for evictions and hit ratio? Can you suggest a better configuration that would solve this performance issue, or should I not use faceting? Faceting isn't something that will always be fast... one often needs to design things in a way that it can be fast. Can you give some examples of your faceted queries? Can you show the field and fieldtype definitions for the fields you are faceting on? For each field that you are faceting on, how many different terms are in it? I figure I could run the query twice, once limited to 20 records and then again with the limit set to the total number of records and develop my own facets. I have infact done this before with a different back-end and my code is processed in under .01 seconds. Why is faceting so slow? It's computationally expensive to get exact facet counts for a large number of hits, and that is what the current faceting code is designed to do. No single method will be appropriate *and* fast for all scenarios. Another method that hasn't been implemented is some statistical faceting based on the top hits, using stored fields or stored term vectors. -Yonik
Re: Facet Performance
Yonik Seeley wrote: 1) facet on single-valued strings if you can 2) if you can't do (1) then enlarge the filterCache so that the number of filters (one per possible term in the field you are filtering on) can fit. I changed the filterCache to the following: However, a search that normally takes .04s is taking 74 seconds once I use the facets since I am faceting on 4 fields. Can you suggest a better configuration that would solve this performance issue, or should I not use faceting? I figure I could run the query twice, once limited to 20 records and then again with the limit set to the total number of records and develop my own facets. I have in fact done this before with a different back-end and my code is processed in under .01 seconds. Why is faceting so slow? Andrew
Re: Facet Performance
: > This seems like a poor choice for an element : > name. Why not just name the element what is in the "name" attribute? : > It would make parsing much easier! : : When the XML was first conceived, there was a preference for limiting : the number of tags. : The structure could have been inverted so that ...but then we couldn't support arbitrary field names, and it would be impossible to validate the XML docs independent of the schema, see this previous explanation... http://www.nabble.com/Default-XML-Output-Schema-tf2312439.html#a643 -Hoss
Re: Facet Performance
On 12/7/06, Andrew Nagy <[EMAIL PROTECTED]> wrote: One complaint about the faceting though: Why is the element that is returned called "1st". I think maybe you are seeing lst (it starts with an L, not a one). It is short for NamedList, an ordered list whose elements are named. This seems like a poor choice for an element name. Why not just name the element what is in the "name" attribute? It would make parsing much easier! When the XML was first conceived, there was a preference for limiting the number of tags. The structure could have been inverted so that -Yonik
Re: Facet Performance
Yonik Seeley wrote: 1) facet on single-valued strings if you can 2) if you can't do (1) then enlarge the filterCache so that the number of filters (one per possible term in the field you are filtering on) can fit. I will try this out. 3) facet counts are limited to the results of the query, filtered by any filters. Is there a reason you think they are not? No, you are right. I was thrown off at 1st. One complaint about the faceting though: Why is the element that is returned called "1st". This seems like a poor choice for an element name. Why not just name the element what is in the "name" attribute? It would make parsing much easier! Thanks! Andrew
Re: Facet Performance
On 12/7/06, Andrew Nagy <[EMAIL PROTECTED]> wrote: In September there was a thread [1] on this list about heterogeneous facets and their performance. I am having a similar issue and am unclear as the resolution of this thread. I performed a search against my dataset (492,000 records) and got the results I am looking for in .3 seconds. I then set facet to true and got results in 16 seconds and the facets include data that is not in my result set, it is from the entire set. How do I limit the faceting to my results set and speed up the results? 1) facet on single-valued strings if you can 2) if you can't do (1) then enlarge the filterCache so that the number of filters (one per possible term in the field you are filtering on) can fit. 3) facet counts are limited to the results of the query, filtered by any filters. Is there a reason you think they are not? -Yonik
Re: Facet performance with heterogeneous 'facets'?
On 9/22/06, Michael Imbeault <[EMAIL PROTECTED]> wrote: Excellent news; as you guessed, my schema was (for some reason) set to version 1.0. Yeah, I just realized that having "version" right next to "name" would lead people to think it's "their" version number, when it's really Solr's version number. I've added a comment to the example schema to clarify that. But better yet, the 800 seconds query is now running in 0.5-2 seconds! Amazing optimization! I can now do faceting on journal title (17 000 different titles) and last author (>400 000 authors), + 12 date range queries, in a very reasonable time (considering im on a test windows desktop box and not a server). The only problem is if I add first author, I get a java.lang.OutOfMemoryError: Java heap space. I'm sure this problem will get away on a server with more than the current 500 megs I can allocate to Tomcat. Yes, the Lucene FieldCache takes up a lot of memory. It basically holds the entire field in a non-inverted form: http://lucene.apache.org/java/docs/api/org/apache/lucene/search/FieldCache.StringIndex.html It's currently also used for sorting, which also needs fast document->fieldvalue lookups, rather than the inverted term->documents_containing_that_term -Yonik
Re: Facet performance with heterogeneous 'facets'?
Excellent news; as you guessed, my schema was (for some reason) set to version 1.0. This also caused some of the problems I had with the original SolrPHP (parsing the wrong response). But better yet, the 800-second query is now running in 0.5-2 seconds! Amazing optimization! I can now do faceting on journal title (17 000 different titles) and last author (>400 000 authors), + 12 date range queries, in a very reasonable time (considering I'm on a test Windows desktop box and not a server). The only problem is if I add first author, I get a java.lang.OutOfMemoryError: Java heap space. I'm sure this problem will go away on a server with more than the current 500 megs I can allocate to Tomcat. Michael Imbeault CHUL Research Center (CHUQ) 2705 boul. Laurier Ste-Foy, QC, Canada, G1V 4G2 Tel: (418) 654-2705, Fax: (418) 654-2212 Yonik Seeley wrote: On 9/22/06, Michael Imbeault <[EMAIL PROTECTED]> wrote: I upgraded to the most recent Solr build (9-22) and sadly it's still really slow. 800 seconds query with a single facet on first_author, 15 millions documents total, the query return 180. Maybe i'm doing something wrong? Also, this is on my personal desktop; not on a server. Still, I'm getting 0.1 seconds queries without facets, so I don't think thats the cause. In the admin panel i can still see the filtercache doing millions of lookups (and tons of evictions once it hits the maxsize). The fact that you see all the filtercache usage means that the optimization didn't kick in for some reason. Here's the field i'm using in schema.xml : That looks fine... This is the query : q="hiv red blood"&start=0&rows=20&fl=article_title+authors+journal_iso+pubdate+pmid+score&qt=standard&facet=true&facet.field=first_author&facet.limit=5&facet.missing=false&facet.zeros=false That looks OK too. I assume that you didn't change the fieldtype definition for "string", and that the schema has version="1.1"?
Before 1.1, all fields were assumed to be multiValued (there was no checking or enforcement). -Yonik
Re: Facet performance with heterogeneous 'facets'?
On 9/22/06, Michael Imbeault <[EMAIL PROTECTED]> wrote: I upgraded to the most recent Solr build (9-22) and sadly it's still really slow. An 800-second query with a single facet on first_author, 15 million documents total, and the query returns 180 results. Maybe I'm doing something wrong? Also, this is on my personal desktop, not on a server. Still, I'm getting 0.1-second queries without facets, so I don't think that's the cause. In the admin panel I can still see the filterCache doing millions of lookups (and tons of evictions once it hits the maxSize). The fact that you see all the filterCache usage means that the optimization didn't kick in for some reason. Here's the field I'm using in schema.xml : That looks fine... This is the query : q="hiv red blood"&start=0&rows=20&fl=article_title+authors+journal_iso+pubdate+pmid+score&qt=standard&facet=true&facet.field=first_author&facet.limit=5&facet.missing=false&facet.zeros=false That looks OK too. I assume that you didn't change the fieldtype definition for "string", and that the schema has version="1.1"? Before 1.1, all fields were assumed to be multiValued (there was no checking or enforcement). -Yonik
Re: Facet performance with heterogeneous 'facets'?
I upgraded to the most recent Solr build (9-22) and sadly it's still really slow. An 800-second query with a single facet on first_author, 15 million documents total, and the query returns 180 results. Maybe I'm doing something wrong? Also, this is on my personal desktop, not on a server. Still, I'm getting 0.1-second queries without facets, so I don't think that's the cause. In the admin panel I can still see the filterCache doing millions of lookups (and tons of evictions once it hits the maxSize). Here's the field I'm using in schema.xml : This is the query : q="hiv red blood"&start=0&rows=20&fl=article_title+authors+journal_iso+pubdate+pmid+score&qt=standard&facet=true&facet.field=first_author&facet.limit=5&facet.missing=false&facet.zeros=false I'll do more testing on the weekend, Michael Imbeault CHUL Research Center (CHUQ) 2705 boul. Laurier Ste-Foy, QC, Canada, G1V 4G2 Tel: (418) 654-2705, Fax: (418) 654-2212 Yonik Seeley wrote: On 9/21/06, Yonik Seeley <[EMAIL PROTECTED]> wrote: Hang in there Michael, a fix is on the way for your scenario (and subscribe to solr-dev if you want to stay on the bleeding edge): OK, the optimization has been checked in. You can check out from svn and build Solr, or wait for the 9-22 nightly build (after 8:30 EDT). I'd be interested in hearing your results with it. The first facet request on a field will take longer than subsequent ones because the FieldCache entry is loaded on demand. You can use a firstSearcher/newSearcher hook in solrconfig.xml to send a facet request so that a real user would never see this slower query. -Yonik
Re: Facet performance with heterogeneous 'facets'?
On 9/21/06, Yonik Seeley <[EMAIL PROTECTED]> wrote: Hang in there Michael, a fix is on the way for your scenario (and subscribe to solr-dev if you want to stay on the bleeding edge): OK, the optimization has been checked in. You can check out from svn and build Solr, or wait for the 9-22 nightly build (after 8:30 EDT). I'd be interested in hearing your results with it. The first facet request on a field will take longer than subsequent ones because the FieldCache entry is loaded on demand. You can use a firstSearcher/newSearcher hook in solrconfig.xml to send a facet request so that a real user would never see this slower query. -Yonik
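The firstSearcher/newSearcher hook Yonik mentions is a QuerySenderListener in solrconfig.xml. A minimal sketch, reusing the first_author field from this thread -- the query string itself is a placeholder, not taken from anyone's actual config:

```xml
<!-- Fire a facet request whenever a new searcher is opened, so the
     FieldCache entry is loaded before the first real user query. -->
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">solr</str>
      <str name="rows">0</str>
      <str name="facet">true</str>
      <str name="facet.field">first_author</str>
    </lst>
  </arr>
</listener>
```

The same block can be repeated with event="newSearcher" so that warming also happens after every commit, not just at startup.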
Re: Facet performance with heterogeneous 'facets'?
On 9/21/06, Michael Imbeault <[EMAIL PROTECTED]> wrote: Btw, Any plans for a facets cache? Maybe a partial one (like caching top terms to implement some other optimizations). My general philosophy on caching in Solr has been to cache things the client can't: elemental things, or *parts* of requests to make many different requests faster (most bang-for-the-buck). Caching complete requests/responses is generally less useful since it requires even more memory, has a worse hit ratio, and can be done anyway by the client or a separate process like squid. -Yonik
Re: Facet performance with heterogeneous 'facets'?
Dude, stop being so awesome (and the whole Solr team). Seriously! Every problem / request (MoreLikeThis class, change AND/OR preference programmatically, etc.) I've submitted to this mailing list has received a quick, more-than-I-ever-expected answer. I'll subscribe to the dev list (been reading it off and on), but I'm afraid I couldn't code my way out of a paper bag in Java. I'll contribute to the Solr wiki (the SolrPHP part in particular) as soon as I can. That's the least I can do! Btw, any plans for a facets cache? Michael Imbeault CHUL Research Center (CHUQ) 2705 boul. Laurier Ste-Foy, QC, Canada, G1V 4G2 Tel: (418) 654-2705, Fax: (418) 654-2212 Yonik Seeley wrote: On 9/21/06, Michael Imbeault <[EMAIL PROTECTED]> wrote: It turns out that journal_name has 17038 different tokens, which is manageable, but first_author has > 400 000. I don't think this will ever yield good performance, so I might only do journal_name facets. Hang in there Michael, a fix is on the way for your scenario (and subscribe to solr-dev if you want to stay on the bleeding edge): http://www.nabble.com/big-faceting-speedup-for-single-valued-fields-tf2308153.html -Yonik
Re: Facet performance with heterogeneous 'facets'?
On 9/21/06, Michael Imbeault <[EMAIL PROTECTED]> wrote: It turns out that journal_name has 17038 different tokens, which is manageable, but first_author has > 400 000. I don't think this will ever yield good performance, so I might only do journal_name facets. Hang in there Michael, a fix is on the way for your scenario (and subscribe to solr-dev if you want to stay on the bleeding edge): http://www.nabble.com/big-faceting-speedup-for-single-valued-fields-tf2308153.html -Yonik
Re: Facet performance with heterogeneous 'facets'?
Thanks for all the great answers. Quick Question: did you say you are faceting on the first name field separately from the last name field? ... why? You misunderstood. I'm doing faceting on the first author, and the last author of the list. Life science papers have author lists, and the first one is usually the guy who did most of the work, and the last one is usually the boss of the lab. I already have untokenized author fields for that using copyField. Second: you mentioned increasing the size of your filterCache significantly, but we don't really know how heterogeneous your index is ... once you made that change did your filterCache hit rate increase? .. do you have any evictions (you can check on the "Statistics" page) It was at the default (16000) and it hit the ceiling so to speak. I did maxSize=1600 (for testing purposes) and now size : 17038 and 0 evictions. For a single facet field (journal name) with a limit of 5 and 12 faceted query fields (range on publication date), I now have 0.5-second searches, which is not too bad. The filterCache size is pretty much constant no matter how many queries I do. However, if I try to add another facet field (such as first_author), something strange happens. 99% CPU, the filterCache is filling up really fast, the hit ratio goes to hell, no disk activity, and it can stay that way for at least 30 minutes (didn't test longer, no point really). It turns out that journal_name has 17038 different tokens, which is manageable, but first_author has > 400 000. I don't think this will ever yield good performance, so I might only do journal_name facets. Any reason why faceting tries to preload every term in the field? I have noticed that facets are not cached. Facets off, a cached query takes 0.01 seconds. Facets on, uncached and cached queries take 0.7 seconds. Any plans for a facets cache? I know that faceting is still a very early feature, but it's already awesome; my application is maybe unrealistic. Thanks, Michael
Re: Facet performance with heterogeneous 'facets'?
: I just updated the comments in solrconfig.xml: I've tweaked the SolrCaching wiki page to include some of this info as well; feel free to add any additional info you think would be helpful to other people (or ask any questions about it if any of it still doesn't seem clear to you)... http://wiki.apache.org/solr/SolrCaching : > now, 400 docs/sec!). However, I still don't have an idea what these : > values represent, and how I should estimate what values I should set : > them to. Originally I thought it was the size of the cache in kb, and : > someone on the list told me it was the number of items, but I don't quite : > get it. Better documentation on that would be welcomed :) -Hoss
Re: Facet performance with heterogeneous 'facets'?
: > when we facet on the authors, we start with : > that list and go in order, generating their facet constraint count using : > the DocSet intersection just like we currently do ... if we reach our : > facet.limit before we reach the end of the list and the lowest constraint : > count is higher than the total doc count of the last author in the list, : > then we know we don't need to bother testing any other author, because no : > other author can possibly have a higher facet constraint count than the : > ones on our list : : This works OK if the intersection counts are high (as a percentage of : the facet sets). I'm not sure how often this will be the case though. well, keep in mind "N" could be very big, big enough to store the full list of Terms sorted in docFreq order (it shouldn't take up much space since it's just the Term and an int) ... for any query that returns a "large" number of results, you probably won't need to reach the end of the list before you can tell that all the remaining Terms have a lower docFreq than the current last constraint count in your facet.limit list. For queries that return a "small" number of results, it wouldn't be as useful, but that's where a switch could be flipped to start with the values mapped to the docs (using FieldCache -- assuming single-value fields) : Another tradeoff is to allow getting inexact counts with multi-token fields by: : - simply faceting on the most popular values : OR : - do some sort of statistical sampling by reading term vectors for a : fraction of the matching docs. I loathe inexact counts ... I think of them as "Astrology" to the Astronomy of true faceted searching ... but I'm sure they would be "good enough" for some people's use cases. -Hoss
Re: Facet performance with heterogeneous 'facets'?
On 9/19/06, Chris Hostetter <[EMAIL PROTECTED]> wrote: Quick Question: did you say you are faceting on the first name field separately from the last name field? ... why? You'll probably see a sharp increase in performance if you have a single untokenized author field containing the full name and you facet on that -- there will be a lot fewer unique terms to use when computing DocSets and intersections. Second: you mentioned increasing the size of your filterCache significantly, but we don't really know how heterogeneous your index is ... once you made that change did your filterCache hit rate increase? .. do you have any evictions (you can check on the "Statistics" page) : > Also, I was under the impression : > that it was only searching / sorting for authors that it knows are in : > the result set... : : That's the problem... it's not necessarily easy to know *what* authors : are in the result set. If we could quickly determine that, we could : just count them and not do any intersections or anything at all. another way to look at it is that by looking at all the authors, the work done for generating the facet counts for query A can be completely reused for the next query B -- presuming your filterCache is large enough to hold all of the author filters. : There could be optimizations when docs_matching_query.size() is small, : so we start somehow with terms in the documents rather than terms in : the index. That requires termvectors to be stored (medium speed), or : requires that the field be stored and that we re-analyze it (very : slow). : : More optimization of special cases hasn't been done simply because no : one has done it yet... (as you note, faceting is a new feature). the optimization I anticipated from the beginning would probably be useful in the situation Michael is describing ... 
if there is a "long tail" of authors (and in my experience, there typically is) we can cache an ordered list of the top N most prolific authors, along with the count of how many documents they have in the index (this info is easy to get from TermEnum.docFreq). Yeah, I've thought about a fieldInfoCache too. It could also cache the total number of terms in order to make decisions about what faceting strategy to follow. when we facet on the authors, we start with that list and go in order, generating their facet constraint count using the DocSet intersection just like we currently do ... if we reach our facet.limit before we reach the end of the list and the lowest constraint count is higher than the total doc count of the last author in the list, then we know we don't need to bother testing any other author, because no other author can possibly have a higher facet constraint count than the ones on our list This works OK if the intersection counts are high (as a percentage of the facet sets). I'm not sure how often this will be the case though. Another tradeoff is to allow getting inexact counts with multi-token fields by: - simply faceting on the most popular values OR - do some sort of statistical sampling by reading term vectors for a fraction of the matching docs. -Yonik
Re: Facet performance with heterogeneous 'facets'?
Quick Question: did you say you are faceting on the first name field separately from the last name field? ... why? You'll probably see a sharp increase in performance if you have a single untokenized author field containing the full name and you facet on that -- there will be a lot fewer unique terms to use when computing DocSets and intersections. Second: you mentioned increasing the size of your filterCache significantly, but we don't really know how heterogeneous your index is ... once you made that change did your filterCache hit rate increase? .. do you have any evictions (you can check on the "Statistics" page) : > Also, I was under the impression : > that it was only searching / sorting for authors that it knows are in : > the result set... : : That's the problem... it's not necessarily easy to know *what* authors : are in the result set. If we could quickly determine that, we could : just count them and not do any intersections or anything at all. another way to look at it is that by looking at all the authors, the work done for generating the facet counts for query A can be completely reused for the next query B -- presuming your filterCache is large enough to hold all of the author filters. : There could be optimizations when docs_matching_query.size() is small, : so we start somehow with terms in the documents rather than terms in : the index. That requires termvectors to be stored (medium speed), or : requires that the field be stored and that we re-analyze it (very : slow). : : More optimization of special cases hasn't been done simply because no : one has done it yet... (as you note, faceting is a new feature). the optimization I anticipated from the beginning would probably be useful in the situation Michael is describing ... 
if there is a "long tail" of authors (and in my experience, there typically is) we can cache an ordered list of the top N most prolific authors, along with the count of how many documents they have in the index (this info is easy to get from TermEnum.docFreq). when we facet on the authors, we start with that list and go in order, generating their facet constraint count using the DocSet intersection just like we currently do ... if we reach our facet.limit before we reach the end of the list and the lowest constraint count is higher than the total doc count of the last author in the list, then we know we don't need to bother testing any other author, because no other author can possibly have a higher facet constraint count than the ones on our list (since they haven't even written that many documents) -Hoss
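Hoss's early-termination argument can be sketched in a few lines of Java. This is a hypothetical illustration, not Solr code: the docFreq and intersection-count maps stand in for TermEnum.docFreq() and the cached DocSet intersections, and authors are assumed to arrive pre-sorted by docFreq, descending.

```java
import java.util.*;

public class TopAuthorsSketch {
    // Returns the facet.limit authors with the highest intersection counts,
    // stopping early: once the weakest member of the current top-k has a
    // count >= the docFreq of the next author, no remaining author can win,
    // because an intersection count never exceeds the author's total docFreq.
    static List<String> topAuthors(List<String> authorsByDocFreqDesc,
                                   Map<String, Integer> docFreq,
                                   Map<String, Integer> intersectionCount,
                                   int limit) {
        // min-heap on intersection count, so peek() is the weakest of the top-k
        PriorityQueue<String> heap = new PriorityQueue<>(
            Comparator.comparingInt(a -> intersectionCount.getOrDefault(a, 0)));
        for (String author : authorsByDocFreqDesc) {
            if (heap.size() == limit
                    && intersectionCount.getOrDefault(heap.peek(), 0) >= docFreq.get(author)) {
                break; // early termination: top-k can no longer change
            }
            heap.offer(author);
            if (heap.size() > limit) heap.poll();
        }
        List<String> out = new ArrayList<>(heap);
        out.sort(Comparator.comparingInt(
            (String a) -> intersectionCount.getOrDefault(a, 0)).reversed());
        return out;
    }
}
```

As Yonik notes in his reply, this only pays off when the intersection counts are high relative to docFreq; for queries matching few documents the loop degenerates into scanning most of the term list.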
Re: Facet performance with heterogeneous 'facets'?
I just updated the comments in solrconfig.xml: On 9/18/06, Michael Imbeault <[EMAIL PROTECTED]> wrote: Another follow-up: I bumped all the caches in solrconfig.xml to size="1600384" initialSize="400096" autowarmCount="400096" It seemed to fix the problem on a very small index (facets on last and first author fields, + 12 range date facets, sub 0.3 seconds for queries). I'll check on the full index tomorrow (it's indexing right now, 400 docs/sec!). However, I still don't have an idea what these values represent, and how I should estimate what values to set them to. Originally I thought it was the size of the cache in kb, and someone on the list told me it was the number of items, but I don't quite get it. Better documentation on that would be welcomed :) Also, are there any plans to add an option not to run a facet search if the result set is too big? To avoid 40-second queries if the docset is too large... I'd like to speed up certain corner cases, but you can always set timeouts in whatever frontend is making the request to Solr too. -Yonik
Re: Facet performance with heterogeneous 'facets'?
Michael Imbeault wrote: Also, is there any plans to add an option not to run a facet search if the result set is too big? To avoid 40 seconds queries if the docset is too large... You could run one query with facet=false, check the result size and then run it again (should be fast because it is cached) with facet=true&rows=0 to get facet results only. I would think that the decision to run/not run facets would be highly custom to your collection and not easily developed as a configurable feature. --Joachim
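Joachim's two-pass pattern can be sketched as a small client-side helper. The helper name and threshold are hypothetical, not a Solr API; only the facet=true and rows=0 parameters come from his suggestion.

```java
public class TwoPassFacetSketch {
    // Pass 1 ran with facet=false and gave us numFound. Decide whether to
    // issue the facet-only follow-up (rows=0, served largely from the
    // queryResultCache since the query itself was just executed).
    static String facetFollowUp(String baseQuery, long numFound, long maxDocsForFacets) {
        if (numFound > maxDocsForFacets) {
            return null; // result set too big: skip faceting entirely
        }
        return baseQuery + "&facet=true&rows=0";
    }
}
```

Whether 100 000 or some other cutoff is "too big" is exactly the collection-specific judgment Joachim says is hard to turn into a generic Solr feature.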
Re: Facet performance with heterogeneous 'facets'?
On 9/18/06, Michael Imbeault <[EMAIL PROTECTED]> wrote: Yonik Seeley wrote: > For cases like "author", if there is only one value per document, then > a possible fix is to use the field cache. If there can be multiple > occurrences, there doesn't seem to be a good way that preserves exact > counts, except maybe if the number of documents matching a query is > low. > I have one value per document (I have fields for authors, last_author and first_author, and I'm doing faceted search on the first and last author fields). How would I use the field cache to fix my problem? Unless you want to dive into Solr development, you don't :-) It requires extensive changes to the faceting code and doing things a different way in some cases. The FieldCache is the fastest way to "uninvert" single valued fields... it's currently only used for Sorting, where one needs to quickly know the field value given the document id. The downside is high memory use, and that it's not a general solution... it can't handle fields with multiple tokens (tokenized fields or multi-valued fields). So the strategy would be to step through the documents, get the value for the field from the FieldCache, increment a counter for that value, then find the top counters when we are done. Also, would it be better to store a unique number (for each possible author) in an int field along with the string, and do the faceted searching on the int field? It won't really help. It wouldn't be faster, and it would require only slightly less memory. >> Just a little follow-up - I did a little more testing, and the query >> takes 20 seconds no matter what - If there's one document in the results >> set, or if I do a query that returns all 130 000 documents. > > Yes, currently the same strategy is always used. > intersection_count(docs_matching_query, docs_matching_author1) > intersection_count(docs_matching_query, docs_matching_author2) > intersection_count(docs_matching_query, docs_matching_author3) > etc... 
> > Normally, the docsets will be cached, but since the number of authors > is greater than the size of the filtercache, the effective cache hit > rate will be 0% > > -Yonik So more memory would fix the problem? Yes, if your collection size isn't that large... it's not a practical solution for many cases though. Also, I was under the impression that it was only searching / sorting for authors that it knows are in the result set... That's the problem... it's not necessarily easy to know *what* authors are in the result set. If we could quickly determine that, we could just count them and not do any intersections or anything at all. in the case of only one document (1 result), it seems strange that it takes the same time as for 130 000 results. It should just check the results, see that there's only one author, and return that? And in the case of 2 documents, just sort 2 authors (or 1 if they're the same)? I understand your answer (it does intersections), but I wonder why its intersecting from the whole document set at first, and not docs_matching_query like you said. It is just intersecting docs_matching_query. The problem is that it's intersecting that set with all possible author sets since it doesn't know ahead of time what authors are in the docs that match the query. There could be optimizations when docs_matching_query.size() is small, so we start somehow with terms in the documents rather than terms in the index. That requires termvectors to be stored (medium speed), or requires that the field be stored and that we re-analyze it (very slow). More optimization of special cases hasn't been done simply because no one has done it yet... (as you note, faceting is a new feature). -Yonik
Re: Facet performance with heterogeneous 'facets'?
Another follow-up: I bumped all the caches in solrconfig.xml to size="1600384" initialSize="400096" autowarmCount="400096" It seemed to fix the problem on a very small index (facets on last and first author fields, + 12 range date facets, sub 0.3 seconds for queries). I'll check on the full index tomorrow (it's indexing right now, 400 docs/sec!). However, I still don't have an idea what these values represent, and how I should estimate what values to set them to. Originally I thought it was the size of the cache in kb, and someone on the list told me it was the number of items, but I don't quite get it. Better documentation on that would be welcomed :) Also, are there any plans to add an option not to run a facet search if the result set is too big? To avoid 40-second queries if the docset is too large... Thanks, Michael Imbeault CHUL Research Center (CHUQ) 2705 boul. Laurier Ste-Foy, QC, Canada, G1V 4G2 Tel: (418) 654-2705, Fax: (418) 654-2212 Yonik Seeley wrote: On 9/18/06, Michael Imbeault <[EMAIL PROTECTED]> wrote: Just a little follow-up - I did a little more testing, and the query takes 20 seconds no matter what - If there's one document in the results set, or if I do a query that returns all 130 000 documents. Yes, currently the same strategy is always used. intersection_count(docs_matching_query, docs_matching_author1) intersection_count(docs_matching_query, docs_matching_author2) intersection_count(docs_matching_query, docs_matching_author3) etc... Normally, the docsets will be cached, but since the number of authors is greater than the size of the filterCache, the effective cache hit rate will be 0% -Yonik
Re: Facet performance with heterogeneous 'facets'?
Yonik Seeley wrote: I noticed this too, and have been thinking about ways to fix it. The root of the problem is that Lucene, like all full-text search engines, uses inverted indices. It's fast and easy to get all documents for a particular term, but getting all terms for a document is either not possible, or not fast (assuming many documents match a query). Yeah, that's what I've been thinking; the index isn't built to handle such searches, sadly :( It would be very nice to be able to rapidly search by most frequent author, journal, etc. For cases like "author", if there is only one value per document, then a possible fix is to use the field cache. If there can be multiple occurrences, there doesn't seem to be a good way that preserves exact counts, except maybe if the number of documents matching a query is low. I have one value per document (I have fields for authors, last_author and first_author, and I'm doing faceted search on the first and last author fields). How would I use the field cache to fix my problem? Also, would it be better to store a unique number (for each possible author) in an int field along with the string, and do the faceted searching on the int field? Would this be faster / require less memory? I'd guess yes, and I'll test that when I have the time. Just a little follow-up - I did a little more testing, and the query takes 20 seconds no matter what - If there's one document in the results set, or if I do a query that returns all 130 000 documents. Yes, currently the same strategy is always used. intersection_count(docs_matching_query, docs_matching_author1) intersection_count(docs_matching_query, docs_matching_author2) intersection_count(docs_matching_query, docs_matching_author3) etc... Normally, the docsets will be cached, but since the number of authors is greater than the size of the filterCache, the effective cache hit rate will be 0% -Yonik So more memory would fix the problem? 
Also, I was under the impression that it was only searching / sorting for authors that it knows are in the result set... in the case of only one document (1 result), it seems strange that it takes the same time as for 130 000 results. It should just check the results, see that there's only one author, and return that? And in the case of 2 documents, just sort 2 authors (or 1 if they're the same)? I understand your answer (it does intersections), but I wonder why it's intersecting from the whole document set at first, and not docs_matching_query like you said. Thanks for the support, Michael
Re: Facet performance with heterogeneous 'facets'?
On 9/18/06, Michael Imbeault <[EMAIL PROTECTED]> wrote: Just a little follow-up - I did a little more testing, and the query takes 20 seconds no matter what - If there's one document in the results set, or if I do a query that returns all 130 000 documents. Yes, currently the same strategy is always used. intersection_count(docs_matching_query, docs_matching_author1) intersection_count(docs_matching_query, docs_matching_author2) intersection_count(docs_matching_query, docs_matching_author3) etc... Normally, the docsets will be cached, but since the number of authors is greater than the size of the filterCache, the effective cache hit rate will be 0% -Yonik
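Yonik's intersection_count strategy can be sketched with plain BitSets standing in for Solr's cached DocSets. Note that the loop runs once per unique term in the facet field, regardless of how few documents match the query -- which is why a 1-result query and a 130 000-result query take the same 20 seconds, and why >400 000 authors overwhelm the filterCache.

```java
import java.util.*;

public class FacetByIntersectionSketch {
    // One intersection count per indexed term in the facet field.
    static Map<String, Integer> facetCounts(BitSet docsMatchingQuery,
                                            Map<String, BitSet> docsMatchingTerm) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, BitSet> e : docsMatchingTerm.entrySet()) {
            BitSet intersection = (BitSet) docsMatchingQuery.clone();
            intersection.and(e.getValue()); // intersection_count(query, term)
            int n = intersection.cardinality();
            if (n > 0) {
                counts.put(e.getKey(), n);
            }
        }
        return counts;
    }
}
```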
Re: Facet performance with heterogeneous 'facets'?
On 9/18/06, Michael Imbeault <[EMAIL PROTECTED]> wrote: Been playing around with the new 'facets search' and it works very well, but it's really slow for some particular applications. I've been trying to use it to display the most frequent authors of articles I noticed this too, and have been thinking about ways to fix it. The root of the problem is that Lucene, like all full-text search engines, uses inverted indices. It's fast and easy to get all documents for a particular term, but getting all terms for a document is either not possible, or not fast (assuming many documents match a query). For cases like "author", if there is only one value per document, then a possible fix is to use the field cache. If there can be multiple occurrences, there doesn't seem to be a good way that preserves exact counts, except maybe if the number of documents matching a query is low. -Yonik
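The field-cache fix Yonik sketches for single-valued fields works the other way around: walk only the documents matching the query and look each one up in an uninverted doc->value array. A plain array stands in here for Lucene's FieldCache entry; the cost becomes proportional to the number of matching documents, not the number of unique terms.

```java
import java.util.*;

public class FieldCacheFacetSketch {
    // docToValue[doc] is the (single) field value for each document,
    // as a FieldCache-style uninverted array.
    static Map<String, Integer> facetCounts(int[] docsMatchingQuery, String[] docToValue) {
        Map<String, Integer> counts = new HashMap<>();
        for (int doc : docsMatchingQuery) {
            counts.merge(docToValue[doc], 1, Integer::sum);
        }
        return counts;
    }
}
```

This is exactly why a 1-result query would become nearly free under this strategy, at the price of holding the whole uninverted field in memory, the OutOfMemoryError Michael hits elsewhere in the thread.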
Re: Facet performance with heterogeneous 'facets'?
Just a little follow-up - I did a little more testing, and the query takes 20 seconds no matter what - If there's one document in the results set, or if I do a query that returns all 130 000 documents. It seems something isn't right... it looks like Solr is doing a faceted search on the whole index no matter what the result set is when faceting on a string field. I must be doing something wrong? Michael Imbeault CHUL Research Center (CHUQ) 2705 boul. Laurier Ste-Foy, QC, Canada, G1V 4G2 Tel: (418) 654-2705, Fax: (418) 654-2212 Michael Imbeault wrote: Been playing around with the new 'facets search' and it works very well, but it's really slow for some particular applications. I've been trying to use it to display the most frequent authors of articles; this is from a huge (15 million articles) database and names of authors are rare and heterogeneous. On a query that takes (without facets) 0.1 seconds, it jumps to ~20 seconds with just 1% of the documents indexed (I've been getting java.lang.OutOfMemoryError with the full index). ~40 seconds for a faceted search on 2 (string) fields. Range queries on a slong field are more acceptable (even with a dozen of them, query time is still in the subsecond range). Am I trying to do something which isn't what faceted search was made for? It would be understandable; after all, I guess the facets engine has to check every doc in the index and sort... which shouldn't yield good performance no matter what, sadly. Is there any other way I could achieve what I'm trying to do? Just a list of the most frequent (top 5) authors present in the results of a query. Thanks,