Re: Multiple fq vs combined fq performance

2020-07-10 Thread Tomás Fernández Löbbe
All non-cached filters will be executed together (leapfrogging between
them) and will be sorted by the filter cost (I guess that, since you aren't
setting a cost, the order of the input matters).  You can try setting
a cost on your filters (lower than 100, so that they don't become post
filters).
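
A minimal sketch of what setting a cost could look like, using the three fq
values from the original question. The cost numbers (10, 20, 30) and the
q value "*:*" are arbitrary illustrations, and the helper name uncached_fq
is made up for this example; the only constraint taken from the thread is
keeping the cost below 100.

```python
def uncached_fq(query: str, cost: int) -> str:
    """Build an fq value with caching disabled and an explicit cost."""
    if not 0 <= cost < 100:
        # Per the thread: 100 and above is post-filter territory.
        raise ValueError("keep cost below 100 so the filter is not a post filter")
    return f"{{!cache=false cost={cost}}}{query}"


# Assumption: the most selective filter gets the lowest cost.
params = {
    "q": "*:*",  # hypothetical main query, not shown in the thread
    "fq": [
        uncached_fq("taggedTickets_ticketId:100241", cost=10),
        uncached_fq("_class:taggedTickets", cost=20),
        uncached_fq("companyId:22476", cost=30),
    ],
}

for fq in params["fq"]:
    print(fq)
```

Whether this changes QTime depends on the query types involved, so it is
worth measuring rather than assuming.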

One other thing though, I guess you are using Point fields? If you
typically query for a single value like in this example (vs. ranges), you
may want to use string fields for those. See
https://issues.apache.org/jira/browse/SOLR-11078.






Re: Multiple fq vs combined fq performance

2020-07-10 Thread Chris Dempsey
Thanks for the suggestion, Alex. It doesn't appear that
IndexOrDocValuesQuery (at least in Solr 7.7.1) supports the PostFilter
interface. I've tried various values for cost on each of the fq and it
doesn't change the QTime.

So, after digging around a bit: even though
{!cache=false}taggedTickets_ticketId:100241 matches exactly one
document in the collection, that doesn't matter for the other two fq, which
still scan the index of the whole collection. Is that correct?



Re: Multiple fq vs combined fq performance

2020-07-09 Thread Alexandre Rafalovitch
I _think_ it will run all 3 and then do index hopping. But if you know one
fq is super expensive, you could assign it a cost.
A value of 100 or more will try to use PostFilter and apply the query on top
of the results from the other queries.
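
As a rough illustration of what "index hopping" (leapfrogging) between
filters means, here is a toy model over sorted Python lists of doc ids. It
is a sketch of the general technique only, not Lucene's implementation, and
the function name leapfrog_intersect is invented for this example.

```python
from bisect import bisect_left


def leapfrog_intersect(postings):
    """Intersect sorted doc-id lists by advancing each list to the current candidate."""
    if not postings or any(not p for p in postings):
        return []
    # Lead with the shortest list so the rarest filter drives the candidates.
    postings = sorted(postings, key=len)
    positions = [0] * len(postings)
    result = []
    candidate = postings[0][0]
    while True:
        agreed = True
        for i, plist in enumerate(postings):
            # "advance": jump to the first doc id >= candidate in this list.
            positions[i] = bisect_left(plist, candidate, positions[i])
            if positions[i] == len(plist):
                return result  # one list is exhausted, nothing more can match
            if plist[positions[i]] != candidate:
                # This list skipped past the candidate; hop to its doc id instead.
                candidate = plist[positions[i]]
                agreed = False
                break
        if agreed:
            result.append(candidate)
            candidate += 1


print(leapfrog_intersect([[1, 3, 5, 7, 9], [3, 4, 5, 9], [5, 9, 11]]))
```

The hopping itself is cheap when one list is tiny; this model says nothing
about the cost of building each filter's iterator in the first place.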


https://lucene.apache.org/solr/guide/8_4/common-query-parameters.html#cache-parameter

Hope it helps,
Alex.



Multiple fq vs combined fq performance

2020-07-09 Thread Chris Dempsey
Hi all! In a collection with ~54 million documents, we've noticed that a
query with the following filters takes roughly 450ms:

"fq":["{!cache=false}_class:taggedTickets",
  "{!cache=false}taggedTickets_ticketId:100241",
  "{!cache=false}companyId:22476"]

when I debugQuery I see:

"parsed_filter_queries":[
  "{!cache=false}_class:taggedTickets",
  "{!cache=false}IndexOrDocValuesQuery(taggedTickets_ticketId:[100241 TO 100241])",
  "{!cache=false}IndexOrDocValuesQuery(companyId:[22476 TO 22476])"
]

If we remove `{!cache=false}companyId:22476`, it drops down to ~5ms (it's
important to note that `taggedTickets_ticketId` is globally unique).

If we change the fqs to:

"fq":["{!cache=false}_class:taggedTickets",
  "{!cache=false}+companyId:22476 +taggedTickets_ticketId:100241"]

when I debugQuery I see:

"parsed_filter_queries":[
   "{!cache=false}_class:taggedTickets",
   "{!cache=false}+IndexOrDocValuesQuery(companyId:[22476 TO 22476]) +IndexOrDocValuesQuery(taggedTickets_ticketId:[100241 TO 100241])"
]

we get the correct result back in ~5ms.

My current thought is that in the slow scenario Solr is still running
`{!cache=false}IndexOrDocValuesQuery(companyId:[22476 TO 22476])`
even though it "has the answer" from the first two fq.

Am I off-base or misunderstanding how `fq` are processed?


Re: fq performance

2017-06-11 Thread mganeshs
Thanks for the suggestions Erick, Michael and all. I guess using a single
field for access control makes sense. We can have access_control_user as a
multi-valued field holding the user list (permissions given to users
individually) and another multi-valued field, access_control_group, holding
the group list (permissions given to groups) for that document. I tried this
with 6 million documents, and in the fq I used almost 50 values, as follows:
fq={!cache=false}acl_groups_ss:(G43 G96 G72 G80 G7 G24 G16 G67 G43 G57 G84
G23 G8 G38 G33 G10 G13 G65 G57 G72 G44 G34 G63 G90 G100 G63)

I also tried these queries with 20 concurrent users and got response times
under 1 second, so it should be fine for us for now.

But I'm curious how this is handled in bigger applications like LinkedIn
and other social media. What would the schema be? Is access control kept in
the documents / resources themselves, or is it kept outside and joined in
at query time?





Re: fq performance

2017-03-18 Thread Damien Kamerman
You may want to consider a join, especially if you ever expect thousands of
groups, e.g.
fq={!join from=access_control_group to=doc_group}access_control_user_id:USERID



Re: fq performance

2017-03-17 Thread Yonik Seeley
On Fri, Mar 17, 2017 at 2:17 PM, Shawn Heisey  wrote:
> On 3/17/2017 8:11 AM, Yonik Seeley wrote:
>> For Solr 6.4, we've managed to circumvent this for filter queries and
>> other contexts where scoring isn't needed.
>> http://yonik.com/solr-6-4/  "More efficient filter queries"
>
> Nice!
>
> If the filter looks like the following (because q.op=AND), does it still
> use TermsQuery?
>
> fq=id:(id1 OR id2 OR id3 OR ... id2000)

Yep, that works as well.  As does fq=id:id1 OR id:id2 OR id:id3 ...
Was implemented here: https://issues.apache.org/jira/browse/SOLR-9786

-Yonik


Re: fq performance

2017-03-17 Thread Shawn Heisey
On 3/17/2017 8:11 AM, Yonik Seeley wrote:
> For Solr 6.4, we've managed to circumvent this for filter queries and
> other contexts where scoring isn't needed.
> http://yonik.com/solr-6-4/  "More efficient filter queries"

Nice!

If the filter looks like the following (because q.op=AND), does it still
use TermsQuery?

fq=id:(id1 OR id2 OR id3 OR ... id2000)

Thanks,
Shawn



Re: fq performance

2017-03-17 Thread Erick Erickson
And to chime in.

bq: It contains information about who have access to the
documents, like field as (U1_s:true).

I wanted to make explicit the implications of Michael's response.

You are talking about different _fields_ per user or group (e.g. U1_s).
Don't do this, it's horribly wasteful. Instead, as Michael suggests,
have a single field ("access_control" in his example)
that contains the groups and users, i.e.
it might contain U1, G1, G4, U1000, and then form the
fq clauses as he suggests.

Also, if you're on an earlier version than 6.4 you can have massive
OR clauses by using the TermsQueryParser.
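
For illustration, a filter using the terms query parser could be assembled
like this. This is a sketch under assumptions: the field name access_control
comes from Michael's example, the values are the ones listed above, the
helper name terms_fq is invented here, and values are assumed not to contain
commas (the parser's default separator).

```python
def terms_fq(field, values, cache=True):
    """Build an fq value of the form {!terms f=<field>}v1,v2,..."""
    for value in values:
        if "," in value:
            # Comma is the default separator, so it cannot appear inside a value.
            raise ValueError(f"value contains a comma: {value!r}")
    local_params = f"!terms f={field}"
    if not cache:
        local_params += " cache=false"
    return "{" + local_params + "}" + ",".join(values)


print(terms_fq("access_control", ["U1", "G1", "G4", "U1000"]))
print(terms_fq("access_control", ["U1", "G1"], cache=False))
```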

Best,
Erick





Re: fq performance

2017-03-17 Thread Yonik Seeley
On Fri, Mar 17, 2017 at 9:09 AM, Shawn Heisey  wrote:
[...]
> Lucene has a global configuration called "maxBooleanClauses" which
> defaults to 1024.

For Solr 6.4, we've managed to circumvent this for filter queries and
other contexts where scoring isn't needed.
http://yonik.com/solr-6-4/  "More efficient filter queries"

-Yonik


Re: fq performance

2017-03-17 Thread Shawn Heisey
On 3/17/2017 12:46 AM, Ganesh M wrote:
> For how many ORs solr can give the results in less than one second.Can
> I pass 100's of OR condtion in the solr query? will that affects the
> performance ? 

This is a question that's impossible to answer.  The number will vary
depending on the nature of the queries, the size and nature of the data
in the index, and the hardware resources available in the server running
Solr.

https://lucidworks.com/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Another wrinkle affecting your question:

Lucene has a global configuration called "maxBooleanClauses" which
defaults to 1024.  This means that if one query with a bunch of
AND/OR/NOT clauses ends up with more than 1024 of them, and this
configuration value has not been increased, the query will simply fail
to execute.  This parameter can be increased up to a value a little
larger than two billion, but due to the global nature of the
configuration, you must increase it in *EVERY* Solr configuration, or
you may find that the ones you didn't increase it in will reset it back
down to 1024 -- and this will affect every index, because it's global.

Thanks,
Shawn



Re: fq performance

2017-03-17 Thread Michael Kuhlmann

Hi Ganesh,

you might want to use something like this:

fq=access_control:(g1 g2 g5 g99 ...)

Then it's only one fq filter per request. Internally it's like an OR condition,
but in a more condensed form. I have already used this with up to 500 values
without major performance degradation (but in that case it was the unique id
field).

You should think a minute about your filter cache here. Since you only have one
fq filter per request, you won't blow your cache that fast. But it depends on
your use case whether you should cache these filters at all. When it's common
that a single user will send several requests within one commit interval, or
when it's likely that several users will be in the same groups, then just use
it like that. But when it's more likely that each request belongs to a
different user with different security settings, then you should consider
disabling the cache for this fq filter so that your filter cache (for other
filters you probably have) won't be polluted:
fq={!cache=false}access_control:(g1 g2 g5 g99 ...). See
http://yonik.com/advanced-filter-caching-in-solr/ for information on that.

-Michael




Re: fq performance

2017-03-17 Thread Ganesh M
Hi Shawn / Michael,

Thanks for your replies. I think you have understood my scenario exactly.

Initially my documents contained information about who has access to them,
as fields like U1_s:true. If 100 users can access a document, it has 100
such fields, one per user. So when U1 wants to see all of their documents,
I query for all documents where U1_s:true.

If user U5 is added to group G1, I have to take all the documents of group
G1 and add U5_s:true to each of them. For this, I have to re-index all the
documents in that group.

To avoid this, I was trying to keep group information instead of user
information in the document, like G1_s:true, G2_s:true. To query a user's
documents, I first get all the groups of user U1 and then query for all
documents where G1_s:true OR G2_s:true OR G3_s:true. This way we don't need
to re-index all the documents, but the query needs an OR over all the
groups the user belongs to.

With how many ORs can Solr return results in less than one second? Can I
pass hundreds of OR conditions in a Solr query? Will that affect
performance?

Please share your valuable inputs.


Re: fq performance

2017-03-16 Thread Shawn Heisey
On 3/16/2017 6:02 AM, Ganesh M wrote:
> We have 1 million of documents and would like to query with multiple fq 
> values.
>
> We have kept the access_control ( multi value field ) which holds information 
> about for which group that document is accessible.
>
> Now to get the list of all the documents of an user, we would like to pass 
> multiple fq values ( one for each group user belongs to )
>
> q:somefiled:value:access_control:g1:access_control:g2:access_control:g3:access_control:g4:access_control:g5...
>
> Like this, there could be 100 groups for an user.

The correct syntax is fq=field:value -- what you have there is not going
to work.

This might not do what you expect.  Filter queries are ANDed together --
*every* filter must match, which means that if a document that you want
has only one of those values in access_control, or has 98 of them but
not all 100, then the query isn't going to match that document.  The
solution is one filter query that can match ANY of them, which also
might run faster.  I can't say whether this is a problem for you or
not.  Your data might be completely correct for matching 100 filters.
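
To make the AND behaviour concrete, here is a small simulation with plain
Python sets. The three documents and their group lists are invented for
illustration; only the matching rules (separate fq values are intersected,
one fq with several values matches any of them) come from the explanation
above.

```python
# Hypothetical documents and the groups allowed to see them.
docs = {
    "doc1": {"g1"},
    "doc2": {"g1", "g2"},
    "doc3": {"g3"},
}
user_groups = ["g1", "g2"]


def match_separate_fqs(docs, groups):
    """One fq per group: a document must match every filter (AND)."""
    return sorted(d for d, acl in docs.items() if all(g in acl for g in groups))


def match_single_or_fq(docs, groups):
    """One fq listing all groups: a document must match any of them (OR)."""
    return sorted(d for d, acl in docs.items() if any(g in acl for g in groups))


print(match_separate_fqs(docs, user_groups))   # ['doc2']
print(match_single_or_fq(docs, user_groups))   # ['doc1', 'doc2']
```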

Also keep in mind that there is a limit to the size of a URL that you
can send into any webserver, including the container that runs Solr. 
That default limit is 8192 bytes, and includes the "GET " or "POST " at
the beginning and the " HTTP/1.1" at the end (note the spaces).  The
filter query information for 100 of the filters you mentioned is going
to be over 2K, which will fit in the default, but if your query has more
complexity than you have mentioned here, the total URL might not fit. 
There's a workaround to this -- use a POST request and put the
parameters in the request body.
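
A sketch of the POST workaround using only the Python standard library. The
host, port and collection name are placeholders, the q value "*:*" is
assumed, and the helper name build_select_request is made up for this
example. The request is only constructed here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_select_request(base_url, params):
    """Put the query parameters in a form-encoded POST body instead of the URL."""
    body = urlencode(params, doseq=True).encode("utf-8")
    return Request(
        f"{base_url}/select",
        data=body,  # supplying a body makes urllib issue a POST
        headers={"Content-Type": "application/x-www-form-urlencoded; charset=UTF-8"},
    )


groups = [f"g{i}" for i in range(1, 101)]
request = build_select_request(
    "http://localhost:8983/solr/mycollection",  # hypothetical Solr location
    {"q": "*:*", "fq": "access_control:(" + " ".join(groups) + ")"},
)
print(request.get_method(), request.full_url, len(request.data), "bytes in body")
# urllib.request.urlopen(request) would actually send it.
```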

> If we fire query with 100 values in the fq, whats the penalty on the 
> performance ? Can we get the result in less than one second for 1 million of 
> documents.

With one million documents, each internal filter query result is 125000
bytes -- the number of documents divided by eight.  That's 12.5 megabytes
for 100 of them.  In addition, every time a filter is run, it must
examine every document in the index to create that 125000 byte
structure, which means that filters which *aren't* found in the
filterCache are relatively slow.  If they are found in the cache,
they're lightning fast, because the cache will contain the entire 125000
byte bitset.
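
The arithmetic behind that estimate, one bit per document, can be checked
with a few lines. This is a back-of-the-envelope sketch: it ignores Java
object overhead and any more compact representation Solr may choose for
small result sets, and the function name is invented for this example.

```python
def filter_bitset_bytes(max_doc: int) -> int:
    """Bytes needed for a bitset with one bit per document, rounded up."""
    return (max_doc + 7) // 8


per_filter = filter_bitset_bytes(1_000_000)
total_for_100 = 100 * per_filter

print(per_filter)               # 125000 bytes per cached filter
print(total_for_100 / 1_000_000)  # 12.5 megabytes for 100 filters
```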

If you make your filterCache large enough, it's going to consume a LOT
of java heap memory, particularly if the index gets bigger.  The nice
thing about the filterCache is that once the cache entries exist, the
filters are REALLY fast, and if they're all cached, you would DEFINITELY
be able to get results in under one second.  I have no idea whether the
same would happen when filters aren't cached.  It might.  Filters that
do not exist in the cache will be executed in parallel, so the number of
CPUs that you have in the machine, along with the query rate, will have
a big impact on the overall performance of a single query with a lot of
filters.

Also related to the filterCache, keep in mind that every time a commit
is made that opens a new searcher, the filterCache will be autowarmed. 
If the autowarmCount value for the filterCache is large, that can make
commits take a very long time, which will cause problems if commits are
happening frequently.  On the other hand, a very small autowarmCount can
cause slow performance after a commit if you use a lot of filters.

My reply is longer and more dense than I had anticipated.  Apologies if
it's information overload.

Thanks,
Shawn



Re: fq performance

2017-03-16 Thread Michael Kuhlmann
First of all, from what I can see, this won't do what you're expecting.
Multiple fq conditions are always combined using AND, so if a user is a
member of 100 groups, but the document is accessible to only 99 of them,
then the user won't find it.


Or in other words, if you add a user to some group, then she would get
*fewer* results than before.


But coming back to your performance question: Just try it. Having 100 fq
conditions will of course slow down your query a bit, but not that much.
I see the problem rather with the filter cache: It will only be fast
enough if all of your fq filters fit into the cache. Each possible fq
filter will take 1 million/8 == 125k bytes, so having hundreds of
possible access group conditions might blow up your filter cache (which
must fit into RAM).


-Michael







fq performance

2017-03-16 Thread Ganesh M
Hi,

We have 1 million documents and would like to query with multiple fq values.

We keep an access_control field (multi-valued) which holds information
about which groups a document is accessible to.

Now, to get the list of all the documents of a user, we would like to pass
multiple fq values (one for each group the user belongs to):

q:somefiled:value&
fq:access_control:g1:access_control:g2:access_control:g3:access_control:g4:access_control:g5...

Like this, there could be 100 groups for a user.

If we fire a query with 100 values in the fq, what's the penalty on
performance? Can we get the result in less than one second for 1 million
documents?

Let us know your valuable inputs on this.

Regards,