On 10.04.12 11:32, Ard Schrijvers wrote:
> On Tue, Apr 10, 2012 at 11:21 AM, Lukas Kahwe Smith <[email protected]> wrote:
>> Hi,
>>
>> Currently I see some big issues with queries that return large result
>> sets. A lot of the work is not done inside Lucene, which will probably
>> not be fixed soon (or maybe never inside 2.x). However, I think it's
>> important to make some intermediate improvements.
>>
>> Here are some suggestions I have. I hope we can brainstorm together on
>> ideas that are feasible to implement in a shorter time frame than
>> waiting for Oak:
>>
>> 1) There should be a way to get a count.
>>
>> That way, if I need to run a query whose results must be ordered, I can
>> first check whether the count is too high and decide if I should even
>> bother running the search. In most cases a search leading to 100+
>> results means that whoever ran the search needs to narrow it down
>> further.
>
> The CPU time is not spent on ordering the results: that is done quite
> fast in Lucene, unless you have millions of hits.
I read the code and also this issue:
https://issues.apache.org/jira/browse/JCR-2959. It looks to me like
Jackrabbit always sorts the result set itself, not in Lucene (or maybe in
addition to it). That makes queries slow even with a limit set, because
Jackrabbit first sorts all nodes (fetching them from the persistence
manager if necessary) and only then applies the limit. Maybe I have missed
something, but real-life tests showed exactly this behaviour.

> The problem with getting a correct count is authorization: the total
> search index count is fast (if you avoid some known slow searches).
> However, authorizing, say, 100k+ nodes when they are not in the
> Jackrabbit caches is very expensive.
>
> Either way: you get a correct count if you make sure you include at
> least an "order by" clause in your (XPath) search. Then, to avoid 100k+
> hits, make sure you also set a limit. For example, with a limit of 501
> you can show 50 pages of 10 hits, and if the count comes back as 501
> you state that there are at least 500+ hits.

That's what we do now, but (as said above) it doesn't help if we have
thousands of results which have to be ordered first.

> We also wanted to get around this, so in our API we hooked in a
> getTotalSize() which returns the unauthorized Lucene count.

That would help us a lot: since we currently don't use Jackrabbit's ACLs,
the Lucene count would be pretty much correct for our use case.

chregu

>>
>> I guess the most sensible thing would be to simply offer a way to do
>> SELECT COUNT(*) FROM ..
>>
>> 2) A way to automatically stop long-running queries.
>
> It is not just about "long". Some queries easily blow up and bring your
> app to an OOM before they can be stopped. jcr:like is such a thing, for
> example. Or range queries over many unique values.
>
> Regards Ard
>
>>
>> It would be great if one could define a timeout for queries. If a query
>> takes longer than X, it should just fail.
>> This should be a global setting, but ideally it should be possible to
>> override it on a per-query basis.
>>
>> 3) .. ?
>>
>> regards,
>> Lukas Kahwe Smith
>> [email protected]
>>
>>
>

--
Liip AG // Feldstrasse 133 // CH-8004 Zurich
Tel +41 43 500 39 81 // Mobile +41 76 561 88 60
www.liip.ch // blog.liip.ch // GnuPG 0x0748D5FE
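[As a rough illustration of the "limit 501" trick discussed in the thread: run the ordered query with a hard limit of pages x page-size + 1, then show either the exact count or "500+". The sketch below is plain Java; the JCR query itself appears only in a comment, since it would need a live javax.jcr.Session, and the class name and label format are made up for illustration, not part of any Jackrabbit API.]

```java
public class BoundedCount {
    static final long PAGE_SIZE = 10;
    static final long MAX_PAGES = 50;
    static final long LIMIT = PAGE_SIZE * MAX_PAGES + 1; // 501

    // With a live JCR session, the capped hit count would come from
    // something roughly like this (sketch only, needs javax.jcr):
    //   Query q = session.getWorkspace().getQueryManager().createQuery(
    //       "//element(*, nt:base) order by @jcr:score", Query.XPATH);
    //   q.setLimit(LIMIT);
    //   long hits = q.execute().getNodes().getSize();

    /** Turn a capped hit count into a user-facing label. */
    static String label(long hits) {
        // If we hit the cap, we only know there are at least LIMIT-1 results.
        return hits >= LIMIT ? (LIMIT - 1) + "+ hits" : hits + " hits";
    }

    public static void main(String[] args) {
        System.out.println(label(42));    // below the cap: exact count
        System.out.println(label(LIMIT)); // capped: lower bound only
    }
}
```

[The point of the +1 in the limit is that a result of exactly 501 is distinguishable from "exactly 500": it tells you the real count was truncated.]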

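[Point 2) of the thread, a per-query timeout, does not exist in Jackrabbit 2.x. A client-side approximation is to run the query on a worker thread and cancel it after a deadline. The sketch below uses only java.util.concurrent; runWithTimeout and the dummy callables are invented for illustration, and note that cancellation merely interrupts the worker thread. As Ard points out, a query that is already blowing up the heap cannot be reclaimed this way.]

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class QueryTimeout {
    /** Run a (query) task, failing with TimeoutException after the deadline. */
    static <T> T runWithTimeout(Callable<T> task, long millis) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<T> future = pool.submit(task);
        try {
            return future.get(millis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // best-effort: interrupts the worker thread
            throw e;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // A fast "query" finishes normally...
        System.out.println(runWithTimeout(() -> "done", 1000));
        // ...while a slow one is cut off after 100 ms.
        try {
            runWithTimeout(() -> { Thread.sleep(5000); return "never"; }, 100);
        } catch (TimeoutException e) {
            System.out.println("timed out");
        }
    }
}
```

[A real implementation would sit inside the query engine, where the index scan can check for interruption between hits; this wrapper can only stop the caller from waiting, not the work itself.]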