Ok, that made me wondering and I did some short tests on my macbook, there are approx. 600'000 nodes, which match those queries
With xpath, without ordering <d:searchrequest xmlns:d="DAV:" xmlns:dcr="http://www.day.com/jcr/webdav/1.0"> <dcr:xpath> /jcr:root/article//*[@phpcr:class = 'Own\ApiBundle\Document\Article'] </dcr:xpath> <d:limit> <d:nresults>10</d:nresults> </d:limit> </d:searchrequest> 1st run: 455 ms 2nd run: 42 ms With xpath, with order by <d:searchrequest xmlns:d="DAV:" xmlns:dcr="http://www.day.com/jcr/webdav/1.0"> <dcr:xpath> /jcr:root/article//*[@phpcr:class = 'Own\ApiBundle\Document\Article'] order by @firstImportDate </dcr:xpath> <d:limit> <d:nresults>10</d:nresults> </d:limit> </d:searchrequest> 1st run: 2555 ms 2nd run: 16 ms Those numbers seem to be reasonable. With SQL2 without ordering: <D:searchrequest xmlns:D="DAV:"> <JCR-SQL2> <![CDATA[ SELECT data.* FROM [nt:base] AS data WHERE data.[phpcr:class] = 'Own\ApiBundle\Document\Article' AND ISDESCENDANTNODE(data, '/article') ]]> </JCR-SQL2> <D:limit> <D:nresults>10</D:nresults> </D:limit> </D:searchrequest> 1st run: 2'006'634 ms (33 minutes.) >From the log SQL2 SELECT took 2004498 ms. selector: [nt:base] AS data, columns: [data.jcr:primaryType], constraint: (data.[phpcr:class] = 'Own\ApiBundle\Document\Article') AND (ISDESCENDANTNODE(data, [/article])), offset 0, limit 10 SQL2 SORT took 1479 ms. SQL2 QUERY execute took 2006634 ms. native sort is false. With those results, I didn't even try a 2nd time (caches are full anyway) or with ordering. Something seems to be quite wrong here. If you want more measurements, just tell me Greetings chregu On 10.04.12 11:55, Christian Stocker wrote: > > > On 10.04.12 11:51, Ard Schrijvers wrote: >> On Tue, Apr 10, 2012 at 11:42 AM, Christian Stocker >> <[email protected]> wrote: >>> >>> >>> On 10.04.12 11:32, Ard Schrijvers wrote: >>>> On Tue, Apr 10, 2012 at 11:21 AM, Lukas Kahwe Smith <[email protected]> >>>> wrote: >>>>> Hi, >>>>> >>>>> Currently I see some big issues with queries that return large result >>>>> sets. A lot of work is not done inside Lucene, which will probably not be >>>>> fixed soon (or maybe never inside 2.x). However I think its important to >>>>> do some intermediate improvements. >>>>> >>>>> Here are some suggestions I have. I hope we can brainstorm together on >>>>> some ideas that are feasible to get implemented in a shorter time period >>>>> than waiting for Oak: >>>>> >>>>> 1) there should be a way to get a count >>>>> >>>>> This way if I need to do a query that needs to be ordered, I can first >>>>> check if the count is too high to determine if I should even bother >>>>> running the search. Aka in most cases a search leading to 100+ results >>>>> means that who ever did the search needs to further narrow it down. >>>> >>>> The cpu is not spend in ordering the results: That is done quite fast >>>> in Lucene, unless you have millions of hits >>> >>> I read the code and also read this >>> https://issues.apache.org/jira/browse/JCR-2959 and it looks to me that >>> jackrabbit always sorts the result set by itself and not in lucene (or >>> maybe additionally). This makes it slow even if you have a limit set, >>> because it first sorts all nodes (fetching it from the PM if necessary), >>> then does the limit. Maybe I have missed something but real life tests >>> showed exactly this behaviour. >> >> Ah, I don't know about that part: We always sticked to xpath queries : >> Sorting is done in Lucene (more precisely, in some Lucene exensions in >> jr, but are equally fast) for at least xpath, I am quite sure > > Is the search part done differently in SQL2 and XPath? Can't remember ;) > >>>> The problem with getting a correct count is authorization : This total >>>> search index count should is fast (if you try to avoid some known slow >>>> searches). However, authorizing for example 100k+ nodes if they are >>>> not in the jackrabbit caches is very expensive. >>>> >>>> Either way: You get a correct count if you make sure that you include >>>> in your (xpath) search at least an order by clause. Then, to avoid >>>> 100k + hits, make sure you also set a limit. For example a limit of >>>> 501 : You can then show 50 pages of 10 hits, and if the count is 501 >>>> you state that there are at least 500+ hits >>> >>> That's what we do now, but it doesn't help (as said above) if we have >>> thousends of results which have to be ordered first. >> >> And the second sort is also slow? The first sort is also slow with >> Lucene, as Lucene needs to load all terms to sort on from FS in >> memory. However, consecutive searches are fast. We don't have problems >> for resultsets sorting for a million hits > > It definitively loaded all nodes from the PM before sorting it. The > lucene part itself was fast enough, that wasn't the issue. > >> >>> >>>> >>>> We also wanted to get around this, thus in our api hooked in a >>>> 'getTotalSize()' which returns the Lucene unauthorized count >>> >>> That would help us a lot, since we currently don't use the ACLs of >>> Jackrabbit, so the lucene count would be pretty correct for our use case. >> >> Yes, however, you would have to hook into jr itself to get this done > > Yep, saw that, that's somewhere deep in the code. That's why I didn't > try to adress that yet > > chregu > >> >> Regards Ard >> >>> >>> chregu >>> >>>> >>>>> >>>>> I guess the most sensible thing would be to simply offer a way to do >>>>> SELECT COUNT(*) FROM .. >>>>> >>>>> 2) a way to automatically stop long running queries >>>> >>>> It is not just about 'long' . Some queries easily blow up, and bring >>>> you app to an OOM before they can be stopped. For example jcr:like is >>>> such a thing. Or range queries on many unique values >>> >>> >>>> >>>> Regards Ard >>>> >>>>> >>>>> It would be great if one could define a timeout for queries. If a query >>>>> takes longer than X, it should just fail. This should be a global >>>>> setting, but ideally it should be possible to override this on a per >>>>> query basis. >>>>> >>>>> 3) .. ? >>>>> >>>>> regards, >>>>> Lukas Kahwe Smith >>>>> [email protected] >>>>> >>>>> >>>>> >>>> >>>> >>>> >>> >>> -- >>> Liip AG // Feldstrasse 133 // CH-8004 Zurich >>> Tel +41 43 500 39 81 // Mobile +41 76 561 88 60 >>> www.liip.ch // blog.liip.ch // GnuPG 0x0748D5FE >>> >> >> >> >
