On Wed, Apr 11, 2012 at 9:46 AM, Christian Stocker <[email protected]> wrote:
> Ok, that made me wonder, so I did some short tests on my MacBook.
> There are approx. 600'000 nodes which match these queries.
>
> With XPath, without ordering:
>
> <d:searchrequest xmlns:d="DAV:"
>     xmlns:dcr="http://www.day.com/jcr/webdav/1.0">
>   <dcr:xpath>
>     /jcr:root/article//*[@phpcr:class = 'Own\ApiBundle\Document\Article']
>   </dcr:xpath>
>   <d:limit>
>     <d:nresults>10</d:nresults>
>   </d:limit>
> </d:searchrequest>
>
> 1st run: 455 ms
> 2nd run: 42 ms
>
> With XPath, with order by:
>
> <d:searchrequest xmlns:d="DAV:"
>     xmlns:dcr="http://www.day.com/jcr/webdav/1.0">
>   <dcr:xpath>
>     /jcr:root/article//*[@phpcr:class = 'Own\ApiBundle\Document\Article']
>     order by @firstImportDate
>   </dcr:xpath>
>   <d:limit>
>     <d:nresults>10</d:nresults>
>   </d:limit>
> </d:searchrequest>
>
> 1st run: 2555 ms
> 2nd run: 16 ms
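Request bodies like the ones quoted above can also be generated programmatically. A minimal sketch in Python (the helper name is illustrative, not part of any PHPCR or Jackrabbit API; the namespaces match the requests in this thread):

```python
# Sketch: build a WebDAV SEARCH request body carrying an XPath query,
# in the same shape as the searchrequest documents quoted above.
from xml.sax.saxutils import escape

def build_xpath_searchrequest(xpath: str, limit: int) -> str:
    """Return the XML body for a DAV: SEARCH request with an XPath query."""
    return (
        '<d:searchrequest xmlns:d="DAV:" '
        'xmlns:dcr="http://www.day.com/jcr/webdav/1.0">\n'
        f'  <dcr:xpath>{escape(xpath)}</dcr:xpath>\n'
        '  <d:limit>\n'
        f'    <d:nresults>{limit}</d:nresults>\n'
        '  </d:limit>\n'
        '</d:searchrequest>'
    )

# The first query from the measurements above, limited to 10 results.
body = build_xpath_searchrequest(
    "/jcr:root/article//*[@phpcr:class = 'Own\\ApiBundle\\Document\\Article']",
    10,
)
```

The body would then be sent to the repository's WebDAV endpoint with the SEARCH method; the endpoint URL depends on the server setup and is not shown here.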
This makes sense: only the first time, Lucene needs to load the unique
terms of the 600,000 nodes you want to sort on, hence the many FS lookups.
After that, they are cached.

About SQL2, I have no experience with it, nor have I ever looked at it,
so I can't help you out there. We stick to XPath, as in jr 2.x.

Regards Ard

> Those numbers seem to be reasonable.
>
> With SQL2, without ordering:
>
> <D:searchrequest xmlns:D="DAV:">
>   <JCR-SQL2>
>     <![CDATA[
>       SELECT data.* FROM [nt:base] AS data WHERE data.[phpcr:class] =
>       'Own\ApiBundle\Document\Article' AND ISDESCENDANTNODE(data, '/article')
>     ]]>
>   </JCR-SQL2>
>   <D:limit>
>     <D:nresults>10</D:nresults>
>   </D:limit>
> </D:searchrequest>
>
> 1st run: 2'006'634 ms (33 minutes.)
>
> From the log:
>
> SQL2 SELECT took 2004498 ms. selector: [nt:base] AS data, columns:
> [data.jcr:primaryType], constraint: (data.[phpcr:class] =
> 'Own\ApiBundle\Document\Article') AND (ISDESCENDANTNODE(data,
> [/article])), offset 0, limit 10
> SQL2 SORT took 1479 ms.
> SQL2 QUERY execute took 2006634 ms. native sort is false.
>
> With those results, I didn't even try a 2nd time (the caches are full
> anyway) or with ordering.
>
> Something seems to be quite wrong here. If you want more measurements,
> just tell me.
>
> Greetings
>
> chregu
>
>
> On 10.04.12 11:55, Christian Stocker wrote:
>>
>> On 10.04.12 11:51, Ard Schrijvers wrote:
>>> On Tue, Apr 10, 2012 at 11:42 AM, Christian Stocker
>>> <[email protected]> wrote:
>>>>
>>>> On 10.04.12 11:32, Ard Schrijvers wrote:
>>>>> On Tue, Apr 10, 2012 at 11:21 AM, Lukas Kahwe Smith <[email protected]>
>>>>> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Currently I see some big issues with queries that return large result
>>>>>> sets. A lot of the work is not done inside Lucene, which will probably
>>>>>> not be fixed soon (or maybe never inside 2.x). However, I think it's
>>>>>> important to make some intermediate improvements.
>>>>>>
>>>>>> Here are some suggestions I have.
>>>>>> I hope we can brainstorm together on some ideas that are feasible
>>>>>> to implement in a shorter time period than waiting for Oak:
>>>>>>
>>>>>> 1) There should be a way to get a count.
>>>>>>
>>>>>> This way, if I need to do a query that needs to be ordered, I can
>>>>>> first check whether the count is too high, to determine if I should
>>>>>> even bother running the search. That is, in most cases a search
>>>>>> leading to 100+ results means that whoever did the search needs to
>>>>>> narrow it down further.
>>>>>
>>>>> The CPU time is not spent ordering the results: that is done quite
>>>>> fast in Lucene, unless you have millions of hits.
>>>>
>>>> I read the code and also read
>>>> https://issues.apache.org/jira/browse/JCR-2959, and it looks to me
>>>> like Jackrabbit always sorts the result set by itself, not in Lucene
>>>> (or maybe additionally). This makes it slow even if you have a limit
>>>> set, because it first sorts all nodes (fetching them from the PM if
>>>> necessary), then applies the limit. Maybe I have missed something,
>>>> but real-life tests showed exactly this behaviour.
>>>
>>> Ah, I don't know about that part: we always stuck to XPath queries.
>>> Sorting is done in Lucene (more precisely, in some Lucene extensions
>>> in jr, but those are equally fast), at least for XPath, I am quite
>>> sure.
>>
>> Is the search part done differently in SQL2 and XPath? Can't remember ;)
>>
>>>>> The problem with getting a correct count is authorization: the total
>>>>> search index count is fast (if you avoid some known slow searches).
>>>>> However, authorizing for example 100k+ nodes, if they are not in the
>>>>> Jackrabbit caches, is very expensive.
>>>>>
>>>>> Either way: you get a correct count if you make sure that you
>>>>> include in your (xpath) search at least an order by clause. Then, to
>>>>> avoid 100k+ hits, make sure you also set a limit.
>>>>> For example, a limit of 501: you can then show 50 pages of 10 hits,
>>>>> and if the count is 501 you state that there are at least 500+ hits.
>>>>
>>>> That's what we do now, but it doesn't help (as said above) if we have
>>>> thousands of results which have to be ordered first.
>>>
>>> And the second sort is also slow? The first sort is also slow with
>>> Lucene, as Lucene needs to load all the terms to sort on from the FS
>>> into memory. However, consecutive searches are fast. We don't have
>>> problems sorting result sets of a million hits.
>>
>> It definitely loaded all nodes from the PM before sorting them. The
>> Lucene part itself was fast enough; that wasn't the issue.
>>
>>>>> We also wanted to get around this, so in our API we hooked in a
>>>>> getTotalSize() which returns the unauthorized Lucene count.
>>>>
>>>> That would help us a lot, since we currently don't use the ACLs of
>>>> Jackrabbit, so the Lucene count would be pretty much correct for our
>>>> use case.
>>>
>>> Yes; however, you would have to hook into jr itself to get this done.
>>
>> Yep, saw that, it's somewhere deep in the code. That's why I didn't
>> try to address that yet.
>>
>> chregu
>>
>>> Regards Ard
>>>
>>>> chregu
>>>>
>>>>>> I guess the most sensible thing would be to simply offer a way to
>>>>>> do SELECT COUNT(*) FROM ..
>>>>>>
>>>>>> 2) A way to automatically stop long-running queries.
>>>>>
>>>>> It is not just about 'long'. Some queries easily blow up and bring
>>>>> your app to an OOM before they can be stopped. For example, jcr:like
>>>>> is such a thing. Or range queries on many unique values.
>>>>>
>>>>> Regards Ard
>>>>>
>>>>>> It would be great if one could define a timeout for queries. If a
>>>>>> query takes longer than X, it should just fail. This should be a
>>>>>> global setting, but ideally it should be possible to override it on
>>>>>> a per-query basis.
>>>>>>
>>>>>> 3) .. ?
>>>>>>
>>>>>> regards,
>>>>>> Lukas Kahwe Smith
>>>>>> [email protected]
>>>>>
>>>>
>>>> --
>>>> Liip AG // Feldstrasse 133 // CH-8004 Zurich
>>>> Tel +41 43 500 39 81 // Mobile +41 76 561 88 60
>>>> www.liip.ch // blog.liip.ch // GnuPG 0x0748D5FE
>>>
>>

--
Amsterdam - Oosteinde 11, 1017 WT Amsterdam
Boston - 1 Broadway, Cambridge, MA 02142
US +1 877 414 4776 (toll free)
Europe +31(0)20 522 4466
www.onehippo.com
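Two of the client-side workarounds discussed in this thread, the "limit of 501" capped count and a per-query timeout, can be sketched roughly as follows. The function names are purely illustrative (not Jackrabbit or PHPCR API), and the timeout version only unblocks the caller; as noted above, really stopping a runaway query would need support inside the repository itself.

```python
import threading

# Sketch of the "limit 501" trick: query with limit = max_exact + 1; if
# the result set comes back full, only claim "500+" instead of an exact
# count, since the true total was never computed.
def describe_count(result_size: int, max_exact: int = 500) -> str:
    if result_size > max_exact:
        return f"{max_exact}+"
    return str(result_size)

# Sketch of a client-side per-query timeout: run the query in a worker
# thread and stop waiting after `timeout` seconds. Note the worker thread
# (and the server-side query) keeps running; this only frees the caller.
def run_with_timeout(query_fn, timeout: float):
    result = {}

    def worker():
        result["value"] = query_fn()

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout)
    if "value" not in result:
        raise TimeoutError(f"query did not finish within {timeout}s")
    return result["value"]
```

For example, a paginated UI would run the search with limit 501, then display `describe_count(len(rows))` as "500+" when the cap is hit.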
