On Wed, Apr 11, 2012 at 9:46 AM, Christian Stocker <[email protected]> wrote:
> Ok, that made me wonder, so I did some short tests on my MacBook.
> There are approx. 600'000 nodes which match these queries.
>
> With XPath, without ordering:
>
> <d:searchrequest xmlns:d="DAV:"
>     xmlns:dcr="http://www.day.com/jcr/webdav/1.0">
>   <dcr:xpath>
>     /jcr:root/article//*[@phpcr:class = 'Own\ApiBundle\Document\Article']
>   </dcr:xpath>
>   <d:limit>
>     <d:nresults>10</d:nresults>
>   </d:limit>
> </d:searchrequest>
>
> 1st run: 455 ms
> 2nd run: 42 ms
>
> With XPath, with order by:
>
> <d:searchrequest xmlns:d="DAV:"
>     xmlns:dcr="http://www.day.com/jcr/webdav/1.0">
>   <dcr:xpath>
>     /jcr:root/article//*[@phpcr:class = 'Own\ApiBundle\Document\Article']
>     order by @firstImportDate
>   </dcr:xpath>
>   <d:limit>
>     <d:nresults>10</d:nresults>
>   </d:limit>
> </d:searchrequest>
>
> 1st run: 2555 ms
> 2nd run: 16 ms
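Request bodies like the ones quoted above can also be generated programmatically. A minimal sketch in Python (the helper name is illustrative, not part of any PHPCR or Jackrabbit API; the namespaces match the requests in this thread):

```python
# Sketch: build a WebDAV SEARCH request body carrying an XPath query,
# in the same shape as the searchrequest documents quoted above.
from xml.sax.saxutils import escape

def build_xpath_searchrequest(xpath: str, limit: int) -> str:
    """Return the XML body for a DAV: SEARCH request with an XPath query."""
    return (
        '<d:searchrequest xmlns:d="DAV:" '
        'xmlns:dcr="http://www.day.com/jcr/webdav/1.0">\n'
        f'  <dcr:xpath>{escape(xpath)}</dcr:xpath>\n'
        '  <d:limit>\n'
        f'    <d:nresults>{limit}</d:nresults>\n'
        '  </d:limit>\n'
        '</d:searchrequest>'
    )

# The first query from the measurements above, limited to 10 results.
body = build_xpath_searchrequest(
    "/jcr:root/article//*[@phpcr:class = 'Own\\ApiBundle\\Document\\Article']",
    10,
)
```

The body would then be sent to the repository's WebDAV endpoint with the SEARCH method; the endpoint URL depends on the server setup and is not shown here.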
This makes sense: only the first time, Lucene needs to load the unique
terms of the 600,000 nodes you want to sort on, hence the many FS lookups.
After that, they are cached.

About SQL2, I have no experience with it, nor have I ever looked at it,
so I can't help you out there. We stick to XPath, as in jr 2.x.

Regards Ard

> Those numbers seem to be reasonable.
>
> With SQL2, without ordering:
>
> <D:searchrequest xmlns:D="DAV:">
>   <JCR-SQL2>
>     <![CDATA[
>       SELECT data.* FROM [nt:base] AS data WHERE data.[phpcr:class] =
>       'Own\ApiBundle\Document\Article' AND ISDESCENDANTNODE(data, '/article')
>     ]]>
>   </JCR-SQL2>
>   <D:limit>
>     <D:nresults>10</D:nresults>
>   </D:limit>
> </D:searchrequest>
>
> 1st run: 2'006'634 ms (33 minutes.)
>
> From the log:
>
> SQL2 SELECT took 2004498 ms. selector: [nt:base] AS data, columns:
> [data.jcr:primaryType], constraint: (data.[phpcr:class] =
> 'Own\ApiBundle\Document\Article') AND (ISDESCENDANTNODE(data,
> [/article])), offset 0, limit 10
> SQL2 SORT took 1479 ms.
> SQL2 QUERY execute took 2006634 ms. native sort is false.
>
> With those results, I didn't even try a 2nd time (the caches are full
> anyway) or with ordering.
>
> Something seems to be quite wrong here. If you want more measurements,
> just tell me.
>
> Greetings
>
> chregu
>
>
> On 10.04.12 11:55, Christian Stocker wrote:
>>
>> On 10.04.12 11:51, Ard Schrijvers wrote:
>>> On Tue, Apr 10, 2012 at 11:42 AM, Christian Stocker
>>> <[email protected]> wrote:
>>>>
>>>> On 10.04.12 11:32, Ard Schrijvers wrote:
>>>>> On Tue, Apr 10, 2012 at 11:21 AM, Lukas Kahwe Smith <[email protected]>
>>>>> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Currently I see some big issues with queries that return large result
>>>>>> sets. A lot of the work is not done inside Lucene, which will probably
>>>>>> not be fixed soon (or maybe never inside 2.x). However, I think it's
>>>>>> important to make some intermediate improvements.
>>>>>>
>>>>>> Here are some suggestions I have.
>>>>>> I hope we can brainstorm together on some ideas that are feasible
>>>>>> to implement in a shorter time period than waiting for Oak:
>>>>>>
>>>>>> 1) There should be a way to get a count.
>>>>>>
>>>>>> This way, if I need to do a query that needs to be ordered, I can
>>>>>> first check whether the count is too high, to determine if I should
>>>>>> even bother running the search. That is, in most cases a search
>>>>>> leading to 100+ results means that whoever did the search needs to
>>>>>> narrow it down further.
>>>>>
>>>>> The CPU time is not spent ordering the results: that is done quite
>>>>> fast in Lucene, unless you have millions of hits.
>>>>
>>>> I read the code and also read
>>>> https://issues.apache.org/jira/browse/JCR-2959, and it looks to me
>>>> like Jackrabbit always sorts the result set by itself, not in Lucene
>>>> (or maybe additionally). This makes it slow even if you have a limit
>>>> set, because it first sorts all nodes (fetching them from the PM if
>>>> necessary), then applies the limit. Maybe I have missed something,
>>>> but real-life tests showed exactly this behaviour.
>>>
>>> Ah, I don't know about that part: we always stuck to XPath queries.
>>> Sorting is done in Lucene (more precisely, in some Lucene extensions
>>> in jr, but those are equally fast), at least for XPath, I am quite
>>> sure.
>>
>> Is the search part done differently in SQL2 and XPath? Can't remember ;)
>>
>>>>> The problem with getting a correct count is authorization: the total
>>>>> search index count is fast (if you avoid some known slow searches).
>>>>> However, authorizing for example 100k+ nodes, if they are not in the
>>>>> Jackrabbit caches, is very expensive.
>>>>>
>>>>> Either way: you get a correct count if you make sure that you
>>>>> include in your (xpath) search at least an order by clause. Then, to
>>>>> avoid 100k+ hits, make sure you also set a limit.
>>>>> For example, a limit of 501: you can then show 50 pages of 10 hits,
>>>>> and if the count is 501 you state that there are at least 500+ hits.
>>>>
>>>> That's what we do now, but it doesn't help (as said above) if we have
>>>> thousands of results which have to be ordered first.
>>>
>>> And the second sort is also slow? The first sort is also slow with
>>> Lucene, as Lucene needs to load all the terms to sort on from the FS
>>> into memory. However, consecutive searches are fast. We don't have
>>> problems sorting result sets of a million hits.
>>
>> It definitely loaded all nodes from the PM before sorting them. The
>> Lucene part itself was fast enough; that wasn't the issue.
>>
>>>>> We also wanted to get around this, so in our API we hooked in a
>>>>> getTotalSize() which returns the unauthorized Lucene count.
>>>>
>>>> That would help us a lot, since we currently don't use the ACLs of
>>>> Jackrabbit, so the Lucene count would be pretty much correct for our
>>>> use case.
>>>
>>> Yes; however, you would have to hook into jr itself to get this done.
>>
>> Yep, saw that, it's somewhere deep in the code. That's why I didn't
>> try to address that yet.
>>
>> chregu
>>
>>> Regards Ard
>>>
>>>> chregu
>>>>
>>>>>> I guess the most sensible thing would be to simply offer a way to
>>>>>> do SELECT COUNT(*) FROM ..
>>>>>>
>>>>>> 2) A way to automatically stop long-running queries.
>>>>>
>>>>> It is not just about 'long'. Some queries easily blow up and bring
>>>>> your app to an OOM before they can be stopped. For example, jcr:like
>>>>> is such a thing. Or range queries on many unique values.
>>>>>
>>>>> Regards Ard
>>>>>
>>>>>> It would be great if one could define a timeout for queries. If a
>>>>>> query takes longer than X, it should just fail. This should be a
>>>>>> global setting, but ideally it should be possible to override it on
>>>>>> a per-query basis.
>>>>>>
>>>>>> 3) .. ?
>>>>>>
>>>>>> regards,
>>>>>> Lukas Kahwe Smith
>>>>>> [email protected]
>>>>>
>>>>
>>>> --
>>>> Liip AG // Feldstrasse 133 // CH-8004 Zurich
>>>> Tel +41 43 500 39 81 // Mobile +41 76 561 88 60
>>>> www.liip.ch // blog.liip.ch // GnuPG 0x0748D5FE
>>>
>>

--
Amsterdam - Oosteinde 11, 1017 WT Amsterdam
Boston - 1 Broadway, Cambridge, MA 02142
US +1 877 414 4776 (toll free)
Europe +31(0)20 522 4466
www.onehippo.com
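Two of the client-side workarounds discussed in this thread, the "limit of 501" capped count and a per-query timeout, can be sketched roughly as follows. The function names are purely illustrative (not Jackrabbit or PHPCR API), and the timeout version only unblocks the caller; as noted above, really stopping a runaway query would need support inside the repository itself.

```python
import threading

# Sketch of the "limit 501" trick: query with limit = max_exact + 1; if
# the result set comes back full, only claim "500+" instead of an exact
# count, since the true total was never computed.
def describe_count(result_size: int, max_exact: int = 500) -> str:
    if result_size > max_exact:
        return f"{max_exact}+"
    return str(result_size)

# Sketch of a client-side per-query timeout: run the query in a worker
# thread and stop waiting after `timeout` seconds. Note the worker thread
# (and the server-side query) keeps running; this only frees the caller.
def run_with_timeout(query_fn, timeout: float):
    result = {}

    def worker():
        result["value"] = query_fn()

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout)
    if "value" not in result:
        raise TimeoutError(f"query did not finish within {timeout}s")
    return result["value"]
```

For example, a paginated UI would run the search with limit 501, then display `describe_count(len(rows))` as "500+" when the cap is hit.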
