Re: dealing with large result sets

Christian Stocker Wed, 11 Apr 2012 00:47:16 -0700

OK, that made me wonder, so I ran some quick tests on my MacBook.
Approx. 600'000 nodes match these queries.


With xpath, without ordering


<d:searchrequest xmlns:d="DAV:"
xmlns:dcr="http://www.day.com/jcr/webdav/1.0">
<dcr:xpath>

/jcr:root/article//*[@phpcr:class = 'Own\ApiBundle\Document\Article']

</dcr:xpath>
<d:limit>
     <d:nresults>10</d:nresults>
</d:limit>
</d:searchrequest>


1st run: 455 ms
2nd run:  42 ms
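
For anyone who wants to reproduce the measurement, here is a rough Python
sketch that builds the same request body and times a SEARCH call. The
helper names, the host/port/path arguments and the default timeout are
made up for illustration (they are not part of Jackrabbit), and
authentication is left out entirely.

```python
import http.client
import time
import xml.etree.ElementTree as ET

DAV_NS = "DAV:"
DCR_NS = "http://www.day.com/jcr/webdav/1.0"


def build_xpath_search(xpath: str, limit: int) -> bytes:
    """Build a d:searchrequest body with a dcr:xpath query and a result limit."""
    ET.register_namespace("d", DAV_NS)
    ET.register_namespace("dcr", DCR_NS)
    root = ET.Element(f"{{{DAV_NS}}}searchrequest")
    ET.SubElement(root, f"{{{DCR_NS}}}xpath").text = xpath
    limit_el = ET.SubElement(root, f"{{{DAV_NS}}}limit")
    ET.SubElement(limit_el, f"{{{DAV_NS}}}nresults").text = str(limit)
    return ET.tostring(root, encoding="utf-8")


def timed_search(host: str, port: int, path: str, body: bytes,
                 timeout: float = 60.0) -> tuple[int, float]:
    """Send the body with the SEARCH method; return (HTTP status, elapsed ms).

    host, port and path are placeholders for your own server setup.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        start = time.perf_counter()
        conn.request("SEARCH", path, body=body,
                     headers={"Content-Type": "text/xml; charset=utf-8"})
        response = conn.getresponse()
        response.read()  # drain the reply so the timing covers the full response
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        return response.status, elapsed_ms
    finally:
        conn.close()


body = build_xpath_search(
    r"/jcr:root/article//*[@phpcr:class = 'Own\ApiBundle\Document\Article']",
    limit=10,
)
```

Calling timed_search twice with the same body gives the 1st run / 2nd run
pair reported here.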

With xpath, with order by

<d:searchrequest xmlns:d="DAV:"
xmlns:dcr="http://www.day.com/jcr/webdav/1.0">
<dcr:xpath>

/jcr:root/article//*[@phpcr:class = 'Own\ApiBundle\Document\Article']
order by @firstImportDate

</dcr:xpath>
<d:limit>
     <d:nresults>10</d:nresults>
</d:limit>
</d:searchrequest>


1st run: 2555 ms
2nd run:   16 ms


Those numbers seem to be reasonable.
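
With order by plus a limit behaving like this, the limit-plus-one approach
from the quoted mail below (query with a limit of 501, show 50 pages of 10
hits, report "500+" when all 501 come back) looks workable for XPath. A
rough Python sketch of just the counting logic; the function names are
invented for illustration and nothing here is a Jackrabbit API.

```python
def query_limit(page_size: int = 10, max_pages: int = 50) -> int:
    """Limit to send with the query: one more hit than we are willing to show."""
    return page_size * max_pages + 1


def summarize_hits(fetched: int, page_size: int = 10,
                   max_pages: int = 50) -> tuple[str, int]:
    """Return (label for the hit count, number of pages to offer).

    `fetched` is the number of rows the limited query actually returned.
    """
    cap = page_size * max_pages
    if fetched > cap:
        # The extra row only signals that more hits exist than we display.
        return f"{cap}+", max_pages
    pages = -(-fetched // page_size)  # ceiling division
    return str(fetched), pages
```

The extra row is never displayed, it only tells us that the real count
exceeds what we are prepared to page through.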

With SQL2 without ordering:

<D:searchrequest xmlns:D="DAV:">
        <JCR-SQL2>
                <![CDATA[
                SELECT data.* FROM [nt:base] AS data WHERE data.[phpcr:class] =
'Own\ApiBundle\Document\Article'  AND  ISDESCENDANTNODE(data, '/article')
                ]]>
        </JCR-SQL2>
        <D:limit>
                <D:nresults>10</D:nresults>
        </D:limit>
</D:searchrequest>

1st run: 2'006'634 ms (about 33 minutes)

From the log:

 SQL2 SELECT took 2004498 ms. selector: [nt:base] AS data, columns:
[data.jcr:primaryType], constraint: (data.[phpcr:class] =
'Own\ApiBundle\Document\Article') AND (ISDESCENDANTNODE(data,
[/article])), offset 0, limit 10
 SQL2 SORT took 1479 ms.
 SQL2 QUERY execute took 2006634 ms. native sort is false.
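
To see where the time goes, the three timings from the log can be split
up with a few lines of Python. This only does arithmetic on the numbers
above; the function name and the regular expression are my own and assume
log lines of exactly this shape.

```python
import re

LOG = """\
SQL2 SELECT took 2004498 ms.
SQL2 SORT took 1479 ms.
SQL2 QUERY execute took 2006634 ms.
"""


def phase_timings(log: str) -> dict[str, int]:
    """Extract the per-phase timings (in ms) from the SQL2 log lines."""
    pattern = re.compile(r"SQL2 (SELECT|SORT|QUERY execute) took (\d+) ms")
    return {name: int(ms) for name, ms in pattern.findall(log)}


timings = phase_timings(LOG)
total_ms = timings["QUERY execute"]
select_share = timings["SELECT"] / total_ms
sort_share = timings["SORT"] / total_ms
unaccounted_ms = total_ms - timings["SELECT"] - timings["SORT"]
total_minutes = total_ms / 60_000

print(f"SELECT: {select_share:.2%}, SORT: {sort_share:.2%}, "
      f"other: {unaccounted_ms} ms, total: {total_minutes:.1f} min")
```

So practically all of the time is spent in the SELECT phase; the sort
itself is negligible in this run.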


With those results, I didn't even try a 2nd time (caches are full
anyway) or with ordering.

Something seems to be quite wrong here. If you want more measurements,
just tell me.

Greetings

chregu





On 10.04.12 11:55, Christian Stocker wrote:
> 
> 
> On 10.04.12 11:51, Ard Schrijvers wrote:
>> On Tue, Apr 10, 2012 at 11:42 AM, Christian Stocker wrote:
>>>
>>>
>>> On 10.04.12 11:32, Ard Schrijvers wrote:
>>>> On Tue, Apr 10, 2012 at 11:21 AM, Lukas Kahwe Smith wrote:
>>>>> Hi,
>>>>>
>>>>> Currently I see some big issues with queries that return large result 
>>>>> sets. A lot of work is not done inside Lucene, which will probably not be 
>>>>> fixed soon (or maybe never inside 2.x). However I think it's important to
>>>>> do some intermediate improvements.
>>>>>
>>>>> Here are some suggestions I have. I hope we can brainstorm together on 
>>>>> some ideas that are feasible to get implemented in a shorter time period 
>>>>> than waiting for Oak:
>>>>>
>>>>> 1) there should be a way to get a count
>>>>>
>>>>> This way if I need to do a query that needs to be ordered, I can first 
>>>>> check if the count is too high to determine if I should even bother 
>>>>> running the search. In other words, in most cases a search leading to
>>>>> 100+ results means that whoever did the search needs to narrow it down.
>>>>
>>>> The CPU time is not spent ordering the results: that is done quite fast
>>>> in Lucene, unless you have millions of hits.
>>>
>>> I read the code and also read this
>>> https://issues.apache.org/jira/browse/JCR-2959 and it looks to me that
>>> jackrabbit always sorts the result set by itself and not in lucene (or
>>> maybe additionally). This makes it slow even if you have a limit set,
>>> because it first sorts all nodes (fetching it from the PM if necessary),
>>> then does the limit. Maybe I have missed something but real life tests
>>> showed exactly this behaviour.
>>
>> Ah, I don't know about that part: we always stuck to XPath queries.
>> Sorting is done in Lucene (more precisely, in some Lucene extensions in
>> jr, which are equally fast), at least for XPath, I am quite sure.
> 
> Is the search part done differently in SQL2 and XPath? Can't remember ;)
> 
>>>> The problem with getting a correct count is authorization: the total
>>>> search index count should be fast (if you avoid some known slow
>>>> searches). However, authorizing for example 100k+ nodes if they are
>>>> not in the jackrabbit caches is very expensive.
>>>>
>>>> Either way: You get a correct count if you make sure that you include
>>>> in your (xpath) search at least an order by clause. Then, to avoid
>>>> 100k + hits, make sure you also set a limit. For example a limit of
>>>> 501 : You can then show 50 pages of 10 hits, and if the count is 501
>>>> you state that there are at least 500+ hits
>>>
>>> That's what we do now, but it doesn't help (as said above) if we have
>>> thousands of results which have to be ordered first.
>>
>> And the second sort is also slow? The first sort is also slow with
>> Lucene, as Lucene needs to load all terms to sort on from FS in
>> memory. However, consecutive searches are fast. We don't have problems
>> sorting result sets of a million hits.
> 
> It definitely loaded all nodes from the PM before sorting them. The
> Lucene part itself was fast enough, that wasn't the issue.
> 
>>
>>>
>>>>
>>>> We also wanted to get around this, so in our API we hooked in a
>>>> 'getTotalSize()' which returns the unauthorized Lucene count.
>>>
>>> That would help us a lot, since we currently don't use the ACLs of
>>> Jackrabbit, so the lucene count would be pretty correct for our use case.
>>
>> Yes, however, you would have to hook into jr itself to get this done
> 
> Yep, saw that, that's somewhere deep in the code. That's why I didn't
> try to address that yet.
> 
> chregu
> 
>>
>> Regards Ard
>>
>>>
>>> chregu
>>>
>>>>
>>>>>
>>>>> I guess the most sensible thing would be to simply offer a way to do 
>>>>> SELECT COUNT(*) FROM ..
>>>>>
>>>>> 2) a way to automatically stop long running queries
>>>>
>>>> It is not just about 'long'. Some queries easily blow up and bring
>>>> your app to an OOM before they can be stopped. For example, jcr:like is
>>>> such a thing. Or range queries on many unique values.
>>>
>>>
>>>>
>>>> Regards Ard
>>>>
>>>>>
>>>>> It would be great if one could define a timeout for queries. If a query 
>>>>> takes longer than X, it should just fail. This should be a global 
>>>>> setting, but ideally it should be possible to override this on a per 
>>>>> query basis.
>>>>>
>>>>> 3) .. ?
>>>>>
>>>>> regards,
>>>>> Lukas Kahwe Smith
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>> --
>>> Liip AG  //  Feldstrasse 133 //  CH-8004 Zurich
>>> Tel +41 43 500 39 81 // Mobile +41 76 561 88 60
>>> www.liip.ch // blog.liip.ch // GnuPG 0x0748D5FE
>>>
>>
>>
>>
>
