Re: dealing with large result sets

Christian Stocker Tue, 10 Apr 2012 02:55:56 -0700


On 10.04.12 11:51, Ard Schrijvers wrote:
> On Tue, Apr 10, 2012 at 11:42 AM, Christian Stocker
> <[email protected]> wrote:
>>
>>
>> On 10.04.12 11:32, Ard Schrijvers wrote:
>>> On Tue, Apr 10, 2012 at 11:21 AM, Lukas Kahwe Smith <[email protected]> 
>>> wrote:
>>>> Hi,
>>>>
>>>> Currently I see some big issues with queries that return large result 
>>>> sets. A lot of work is not done inside Lucene, which will probably not be 
>>>> fixed soon (or maybe never inside 2.x). However I think it's important to 
>>>> do some intermediate improvements.
>>>>
>>>> Here are some suggestions I have. I hope we can brainstorm together on 
>>>> some ideas that are feasible to get implemented in a shorter time period 
>>>> than waiting for Oak:
>>>>
>>>> 1) there should be a way to get a count
>>>>
>>>> This way if I need to do a query that needs to be ordered, I can first 
>>>> check if the count is too high to determine if I should even bother 
>>>> running the search. Aka in most cases a search leading to 100+ results 
>>>> means that whoever did the search needs to further narrow it down.
>>>
>>> CPU time is not spent in ordering the results: that is done quite fast
>>> in Lucene, unless you have millions of hits
>>
>> I read the code and also read this
>> https://issues.apache.org/jira/browse/JCR-2959 and it looks to me that
>> jackrabbit always sorts the result set by itself and not in lucene (or
>> maybe additionally). This makes it slow even if you have a limit set,
>> because it first sorts all nodes (fetching it from the PM if necessary),
>> then does the limit. Maybe I have missed something but real life tests
>> showed exactly this behaviour.
> 
> Ah, I don't know about that part: we always stuck to XPath queries.
> Sorting is done in Lucene (more precisely, in some Lucene extensions in
> jr, which are equally fast) for at least XPath, I am quite sure


Is the search part done differently in SQL2 and XPath? Can't remember ;)

>>> The problem with getting a correct count is authorization: this total
>>> search index count is fast (if you try to avoid some known slow
>>> searches). However, authorizing for example 100k+ nodes if they are
>>> not in the jackrabbit caches is very expensive.
>>>
>>> Either way: You get a correct count if you make sure that you include
>>> in your (xpath) search at least an order by clause. Then, to avoid
>>> 100k + hits, make sure you also set a limit. For example a limit of
>>> 501 : You can then show 50 pages of 10 hits, and if the count is 501
>>> you state that there are at least 500+ hits
>>
>> That's what we do now, but it doesn't help (as said above) if we have
>> thousands of results which have to be ordered first.
> 
> And the second sort is also slow? The first sort is also slow with
> Lucene, as Lucene needs to load all terms to sort on from FS in
> memory. However, consecutive searches are fast. We don't have problems
> for resultsets sorting for a million hits

It definitely loaded all nodes from the PM before sorting them. The
Lucene part itself was fast enough, that wasn't the issue.
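
To illustrate why "sort everything, then apply the limit" hurts, here is a
minimal sketch of the alternative: keep only the best N hits in a bounded
heap while iterating, so memory and comparisons stay proportional to the
limit. This is plain Java for illustration only, not Jackrabbit or Lucene
code; the class and method names (TopN, topN) are made up for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative only: TopN and topN are hypothetical names, not Jackrabbit API.
final class TopN {
    // Returns the first n elements of hits according to order,
    // holding at most n elements in memory at any time.
    static <T> List<T> topN(Iterable<T> hits, int n, Comparator<T> order) {
        if (n <= 0) {
            return new ArrayList<>();
        }
        // Heap ordered by the reversed comparator: its head is the
        // worst element kept so far, so it is the one to evict.
        PriorityQueue<T> heap = new PriorityQueue<>(n, order.reversed());
        for (T hit : hits) {
            if (heap.size() < n) {
                heap.add(hit);
            } else if (order.compare(hit, heap.peek()) < 0) {
                // The new hit beats the worst kept one: replace it.
                heap.poll();
                heap.add(hit);
            }
        }
        // Only the n survivors need a final sort.
        List<T> result = new ArrayList<>(heap);
        result.sort(order);
        return result;
    }
}
```

With a limit of 10 and thousands of hits, only 10 entries are ever kept,
instead of materializing and sorting the full result set.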

> 
>>
>>>
>>> We also wanted to get around this, so we hooked a 'getTotalSize()' into
>>> our API, which returns the unauthorized Lucene count
>>
>> That would help us a lot, since we currently don't use the ACLs of
>> Jackrabbit, so the lucene count would be pretty correct for our use case.
> 
> Yes, however, you would have to hook into jr itself to get this done

Yep, saw that, that's somewhere deep in the code. That's why I didn't
try to address that yet
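
For reference, the "limit of 501" workaround discussed above boils down to
something like the sketch below. The helper names (CappedCount, queryLimit,
label) are made up for this example. In a real application the limit would
presumably be passed to javax.jcr.query.Query.setLimit(long), and the number
of returned rows taken from the query result.

```java
// Illustrative only: CappedCount and its methods are hypothetical helpers.
final class CappedCount {
    // Ask for one row more than can be displayed, so that "more than
    // we show" can be told apart from "exactly as many as we show".
    static long queryLimit(int pageSize, int maxPages) {
        return (long) pageSize * maxPages + 1;
    }

    // If the query returned as many rows as the limit allows, the real
    // total is unknown, so report it as "at least limit - 1".
    static String label(long returned, long limit) {
        return returned >= limit ? (limit - 1) + "+" : Long.toString(returned);
    }
}
```

With 10 hits per page and 50 pages, queryLimit gives 501, and a result of
501 rows is shown as "500+". It does not avoid the cost of ordering the full
result set first, which is the part that hurts us.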

chregu

> 
> Regards Ard
> 
>>
>> chregu
>>
>>>
>>>>
>>>> I guess the most sensible thing would be to simply offer a way to do 
>>>> SELECT COUNT(*) FROM ..
>>>>
>>>> 2) a way to automatically stop long running queries
>>>
>>> It is not just about 'long'. Some queries easily blow up and bring
>>> your app to an OOM before they can be stopped. For example jcr:like is
>>> such a thing. Or range queries on many unique values
>>
>>
>>>
>>> Regards Ard
>>>
>>>>
>>>> It would be great if one could define a timeout for queries. If a query 
>>>> takes longer than X, it should just fail. This should be a global setting, 
>>>> but ideally it should be possible to override this on a per query basis.
>>>>
>>>> 3) .. ?
>>>>
>>>> regards,
>>>> Lukas Kahwe Smith
>>>> [email protected]
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>> --
>> Liip AG  //  Feldstrasse 133 //  CH-8004 Zurich
>> Tel +41 43 500 39 81 // Mobile +41 76 561 88 60
>> www.liip.ch // blog.liip.ch // GnuPG 0x0748D5FE
>>
> 
> 
> 

-- 
Liip AG  //  Feldstrasse 133 //  CH-8004 Zurich
Tel +41 43 500 39 81 // Mobile +41 76 561 88 60
www.liip.ch // blog.liip.ch // GnuPG 0x0748D5FE
