On Tue, Apr 10, 2012 at 11:21 AM, Lukas Kahwe Smith <[email protected]> wrote:
> Hi,
>
> Currently I see some big issues with queries that return large result sets. A
> lot of work is not done inside Lucene, which will probably not be fixed soon
> (or maybe never inside 2.x). However, I think it's important to make some
> intermediate improvements.
>
> Here are some suggestions I have. I hope we can brainstorm together on some
> ideas that are feasible to get implemented in a shorter time period than
> waiting for Oak:
>
> 1) there should be a way to get a count
>
> This way, if I need to do a query that needs to be ordered, I can first check
> whether the count is too high to determine if I should even bother running
> the search. In other words, in most cases a search leading to 100+ results
> means that whoever did the search needs to narrow it down further.

The CPU time is not spent ordering the results: Lucene does that quite fast,
unless you have millions of hits. The problem with getting a correct count is
authorization: the raw search index count itself is fast (as long as you avoid
some known slow searches), but authorizing, for example, 100k+ nodes that are
not in the Jackrabbit caches is very expensive.

Either way: you get a correct count if you include at least an order by clause
in your (xpath) search. Then, to avoid 100k+ hits, make sure you also set a
limit, for example a limit of 501: you can then show 50 pages of 10 hits, and
if the count is 501 you state that there are at least 500+ hits. A sketch of
this pattern follows below. We also wanted to get around the authorization
cost, so in our own api we hooked in a 'getTotalSize()' which returns the
unauthorized Lucene count.
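To make this concrete, here is a minimal sketch against the plain JCR 2.0 API.
The XPath statement, the nt:unstructured node type, and the limit of 501 are
only illustrations of the pattern, not code from Jackrabbit itself:

    import javax.jcr.NodeIterator;
    import javax.jcr.RepositoryException;
    import javax.jcr.Session;
    import javax.jcr.query.Query;
    import javax.jcr.query.QueryManager;
    import javax.jcr.query.QueryResult;

    public class BoundedCountExample {

        // Runs an ordered, limited query and reports "500+ hits" instead of
        // counting (and authorizing) an unbounded result set.
        public static void printBoundedCount(Session session)
                throws RepositoryException {
            QueryManager qm = session.getWorkspace().getQueryManager();

            // The order by clause is what makes the reported size exact.
            Query query = qm.createQuery(
                    "//element(*, nt:unstructured) order by @jcr:score descending",
                    Query.XPATH);

            // Cap the expensive, authorized part of the work: 50 pages of
            // 10 hits, plus one extra hit to detect "more than 500".
            query.setLimit(501);

            QueryResult result = query.execute();
            NodeIterator nodes = result.getNodes();

            // At most 501 because of the limit; -1 would mean the size is
            // unknown.
            long size = nodes.getSize();
            if (size > 500) {
                System.out.println("500+ hits");
            } else {
                System.out.println(size + " hits");
            }
        }
    }

Note that the 'getTotalSize()' mentioned above is a hook in Hippo's own
repository API, not part of the standard javax.jcr interfaces; with plain JCR
you only get the authorized count shown here.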
> I guess the most sensible thing would be to simply offer a way to do SELECT
> COUNT(*) FROM ..
>
> 2) a way to automatically stop long running queries

It is not just about 'long': some queries easily blow up and bring your app to
an OOM before they can be stopped. jcr:like is such a thing, for example, and
so are range queries over many unique values; see the examples below.
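For illustration, two XPath query shapes of the kind meant here (the property
names are hypothetical). A jcr:like with a leading wildcard forces Lucene to
enumerate every term of the property to build the query, and a range over a
property with many unique values (such as a timestamp) expands in a similar
way:

    //*[jcr:like(@title, '%foo%')]

    //*[@jcr:lastModified > xs:dateTime('2012-01-01T00:00:00.000Z')]

Both can exhaust memory during query expansion, long before any wall-clock
timeout would get a chance to fire.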
Regards,
Ard

> It would be great if one could define a timeout for queries. If a query takes
> longer than X, it should just fail. This should be a global setting, but
> ideally it should be possible to override this on a per query basis.
>
> 3) .. ?
>
> regards,
> Lukas Kahwe Smith
> [email protected]

--
Amsterdam - Oosteinde 11, 1017 WT Amsterdam
Boston - 1 Broadway, Cambridge, MA 02142
US +1 877 414 4776 (toll free)
Europe +31(0)20 522 4466
www.onehippo.com