On 10.04.12 11:51, Ard Schrijvers wrote:
> On Tue, Apr 10, 2012 at 11:42 AM, Christian Stocker
> <[email protected]> wrote:
>>
>> On 10.04.12 11:32, Ard Schrijvers wrote:
>>> On Tue, Apr 10, 2012 at 11:21 AM, Lukas Kahwe Smith <[email protected]>
>>> wrote:
>>>> Hi,
>>>>
>>>> Currently I see some big issues with queries that return large result
>>>> sets. A lot of work is not done inside Lucene, which will probably not be
>>>> fixed soon (or maybe never inside 2.x). However I think it's important to
>>>> do some intermediate improvements.
>>>>
>>>> Here are some suggestions I have. I hope we can brainstorm together on
>>>> some ideas that are feasible to get implemented in a shorter time period
>>>> than waiting for Oak:
>>>>
>>>> 1) there should be a way to get a count
>>>>
>>>> This way if I need to do a query that needs to be ordered, I can first
>>>> check if the count is too high to determine if I should even bother
>>>> running the search. Aka in most cases a search leading to 100+ results
>>>> means that whoever did the search needs to further narrow it down.
>>>
>>> The CPU time is not spent in ordering the results: that is done quite
>>> fast in Lucene, unless you have millions of hits
>>
>> I read the code and also read this
>> https://issues.apache.org/jira/browse/JCR-2959 and it looks to me that
>> jackrabbit always sorts the result set by itself and not in lucene (or
>> maybe additionally). This makes it slow even if you have a limit set,
>> because it first sorts all nodes (fetching them from the PM if necessary),
>> then does the limit. Maybe I have missed something but real life tests
>> showed exactly this behaviour.
>
> Ah, I don't know about that part: we always stuck to xpath queries.
> Sorting is done in Lucene (more precisely, in some Lucene extensions in
> jr, but they are equally fast) for at least xpath, I am quite sure

Is the search part done differently in SQL2 and XPath? Can't remember ;)

>>> The problem with getting a correct count is authorization: the total
>>> search index count is fast (if you try to avoid some known slow
>>> searches). However, authorizing for example 100k+ nodes if they are
>>> not in the jackrabbit caches is very expensive.
>>>
>>> Either way: you get a correct count if you make sure that you include
>>> in your (xpath) search at least an order by clause. Then, to avoid
>>> 100k+ hits, make sure you also set a limit. For example a limit of
>>> 501: you can then show 50 pages of 10 hits, and if the count is 501
>>> you state that there are 500+ hits
>>
>> That's what we do now, but it doesn't help (as said above) if we have
>> thousands of results which have to be ordered first.
>
> And the second sort is also slow? The first sort is also slow with
> Lucene, as Lucene needs to load all terms to sort on from FS into
> memory. However, consecutive searches are fast. We don't have problems
> sorting result sets of a million hits

It definitely loaded all nodes from the PM before sorting them. The lucene
part itself was fast enough, that wasn't the issue.

>>> We also wanted to get around this, thus in our api we hooked in a
>>> 'getTotalSize()' which returns the Lucene unauthorized count
>>
>> That would help us a lot, since we currently don't use the ACLs of
>> Jackrabbit, so the lucene count would be pretty correct for our use case.
>
> Yes, however, you would have to hook into jr itself to get this done

Yep, saw that, that's somewhere deep in the code. That's why I didn't try
to address that yet

chregu

>
> Regards Ard
>
>>
>> chregu
>>
>>>> I guess the most sensible thing would be to simply offer a way to do
>>>> SELECT COUNT(*) FROM ..
>>>>
>>>> 2) a way to automatically stop long running queries
>>>
>>> It is not just about 'long'.
>>> Some queries easily blow up, and bring
>>> your app to an OOM before they can be stopped. For example jcr:like is
>>> such a thing. Or range queries on many unique values
>>>
>>> Regards Ard
>>>
>>>> It would be great if one could define a timeout for queries. If a query
>>>> takes longer than X, it should just fail. This should be a global setting,
>>>> but ideally it should be possible to override this on a per query basis.
>>>>
>>>> 3) .. ?
>>>>
>>>> regards,
>>>> Lukas Kahwe Smith
>>>> [email protected]

-- 
Liip AG // Feldstrasse 133 // CH-8004 Zurich
Tel +41 43 500 39 81 // Mobile +41 76 561 88 60
www.liip.ch // blog.liip.ch // GnuPG 0x0748D5FE
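
For reference, the "limit of 501" approach quoted above could be sketched roughly as follows. This is only an illustration in plain Java, not Jackrabbit code: the class name CappedCount, its constants and its methods are made up for this example. In a real JCR 2.0 client the limit would presumably be passed via javax.jcr.query.Query.setLimit(long), and the number of rows actually returned would then be fed into describe() and pages().

```java
/**
 * Hypothetical helper illustrating the "limit + 1" trick from the thread:
 * ask the query for one row more than you are willing to show, and treat
 * hitting that cap as "at least this many" instead of an exact count.
 */
final class CappedCount {
    /** Assumed page size, taken from the "50 pages of 10 hits" example. */
    static final int PAGE_SIZE = 10;
    /** Assumed maximum number of hits we are willing to page through. */
    static final int MAX_SHOWN = 500;

    private CappedCount() {
    }

    /** Limit to set on the query: one more than we will ever display. */
    static long queryLimit() {
        return MAX_SHOWN + 1L;
    }

    /** Text for the UI: an exact count, or "500+" once the cap was hit. */
    static String describe(long rowsReturned) {
        if (rowsReturned > MAX_SHOWN) {
            return MAX_SHOWN + "+ hits";
        }
        return rowsReturned + " hits";
    }

    /** Number of result pages to offer, never more than MAX_SHOWN / PAGE_SIZE. */
    static long pages(long rowsReturned) {
        long shown = Math.min(rowsReturned, MAX_SHOWN);
        return (shown + PAGE_SIZE - 1) / PAGE_SIZE; // round up
    }
}
```

With these assumed values, a query that returns 42 rows is reported as "42 hits" on 5 pages, while one that fills the limit of 501 is reported as "500+ hits" on 50 pages. Note that this only caps what is displayed. It does not avoid the cost discussed in the thread of sorting all matching nodes before the limit is applied.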
