Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())
Hello, Sorry to chime in so late in this thread, hope my remarks are still welcome. I did read the entire thread, and won't reply in line, but just try to recap and explain how we got around it in Hippo repository. The problem is obvious: *** How to get efficiently a correct count of total hits when finegrained authorization is involved *** Regarding the remark in this thread : 'For example, it doesn't make any sense to display 1045430 hits if calculating this number takes 1.5 hours' , I wholeheartedly agree, but our customers *never* agree. They want the exact hit count, no matter what So, we tackled this at Hippo as follows. 1) Next to getSize() iterator we also added getTotalSize(). I don't like the name because it is actually more something like: getTotalSizeWithoutCheckingACLs(). This method give you back directly the number of hits from the backing search index. That one is fast of course. What is slow, is authorizing potentially 1.000.000 hits because all those nodes need to be fetched from backing storage, etc etc. However, most of our customers have an application that show the results for some siteuser below some folder : The siteuser has read access for the entire folder. We just show the getTotalSizeWithoutCheckingACLs() as total hits. Worst case, the number is higher than the actual number the siteuser is allowed to read 2) We have our ACLs based on node properties. Hence, we have been able to create an AuthorizationQuery, mapped directly to a cached Lucene bitset. When a jcr session searches in our repository, we combine his cached authorization bitset. After changes in the repository we need to reload these bitsets (on request) but they are shared between all users that have the same authorization. It is blistering fast that way, resulting in correct authorized counts Now, I don't think (2) can be part of oak, as it implies a certain ACL model which is not generic enough. Quite some ACL mappings of course cannot be translated to a Lucene query. However, (1) should be very issue, and already is a lot better. It is up to the developer then to use getTotalSizeWithoutCheckingACLs() (and then a decent name :-) or not My 2 cents Regards Ard On Tue, Sep 11, 2012 at 12:08 PM, Jukka Zitting jukka.zitt...@gmail.com wrote: Hi, [moving this to oak-dev@ for a broader discussion] On Tue, Sep 11, 2012 at 9:55 AM, Thomas Mueller (JIRA) j...@apache.org wrote: [...] For compatibility with Jackrabbit 2.0, and for ease of use, it would be good to have a clearly defined way to get the size of the result. [...] I've always found the -1 return value from getSize() incredibly annoying as it forces client code to use extra conditionals and go through extra hoops if the size turns out not to be available. There are basically three potential scenarios: 1. The client doesn't need to know the size, so it never calls getSize(). 2. The client does need to know the size, so it calls getSize() and has to iterate through all results if getSize() returns -1. 3. The client could use the size (for UI, optimization, etc.), so it calls getSize() and ignores the result if its -1. The main problem I have with the -1 return value is that case 2 becomes really annoying to handle. Instead I'd propose the following design: * The getSize() method always returns the size, by buffering all results in memory if necessary. * A separate hasSize() method can be used to check if the size is quickly available (i.e. if getSize() will complete in O(1) time). With such a design the above cases become easier to handle: 1. The client doesn't need to know the size, so it never calls getSize(). 2. The client does need to know the size, so it calls getSize(). 3. The client could use the size (for UI, optimization, etc.), so it calls hasSize() and possibly follows up with getSize(). PS. Note that implementing an estimated size feature like seen in many public search engines (results 1-10 of thousands) is really difficult to implement in a manner that's both efficient and secure. Public search engines can make such estimates efficiently since all their content is public and they thus don't need to worry about accidentally leaking sensitive information. BR, Jukka Zitting -- Amsterdam - Oosteinde 11, 1017 WT Amsterdam Boston - 1 Broadway, Cambridge, MA 02142 US +1 877 414 4776 (toll free) Europe +31(0)20 522 4466 www.onehippo.com
Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())
Hi, 1) Next to getSize() iterator we also added getTotalSize(). I don't like the name because it is actually more something like: getTotalSizeWithoutCheckingACLs(). Hm, wouldn't that be a security problem? Couldn't it be better (from a security perspective) if you can only get this number if you use an admin session or similar? The query engine currently doesn't have a way to retrieve the size, without reading the nodes - even for the admin session. Could you create an (OAK-) issue about this? It could be implemented, but I wonder if we shouldn't rather implement aggregate functions in the query language: select count(*) from [nt:base] where ... That way the query engine knows in advance it doesn't have to read the nodes at all. Regards, Thomas
Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())
On 14.09.2012, at 07:41, Thomas Mueller muel...@adobe.com wrote: If getSize() returns -1 then you know for sure there are more than 20 results, so you know you have to display 'next page'. How do I know it's for sure more than 20, or whatever my page size happens to be? -1 simply means: no idea. That the implementation does a timeout handling cannot really be assumed on a JCR application level. In that case the oak getSize(max) is probably a better fit. If you call getSize(Integer.MAX_VALUE), then no, it's worse, because that call might take hours. The only thing where it helps is if you have a different page size than 20. I was not referring to Integer.MAX_VALUE, but simply to the proposed new method. Please note if you use offset and limit, getSize() will return the size of the result minus the offset (it will give you the number of rows you can read from the result). Really? Is that according to the JCR spec? Cheers, Alex
Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())
Hi, How do I know it's for sure more than 20 Because the PrefetchIterator will try to prefetch 20 nodes. or whatever my page size happens to be? If you have a higher page size then you need getSize(max). Please note if you use offset and limit, getSize() will return the size of the result minus the offset (it will give you the number of rows you can read from the result). Really? Is that according to the JCR spec? Is it not? Regards, Thomas
Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())
On 12.9.12 19:06, Michael Marth wrote: As an alternative: we could use a separate method getSize(int max) which * if called with max == -1 returns the exact size if quickly available, * returns -1 otherwise, and * returns the exact size but not more then max when called with max = 0. This allows for estimates but leaves the caller in control. +1 (and getSize() would still return -1 I guess?) For backward compatibility I'd leave the behaviour as unchanged as possible. That is, return -1 if the size is not quickly available. Michael
Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())
Hi, There's no need for the Oak API to reflect JCR in all its details. Sure. First we need to define how the JCR API implementation is supposed to behave. Based on that we can then still decide what the Oak API should look like. The Oak API is (more or less) an implementation detail. Of course the Oak API support more features, but that's not really relevant for the user of Oak. When I opened OAK-300, it was for the JCR API implementation. If getSize(int max) is just in the Oak API but not available for the end user, then we didn't really gain much :-) Regards, Thomas
Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())
Hi, The main question is still what the JCR API method getSize() should return. A new method getSize(int max) is nice, and of course we can do that. But I guess people will not use it in the near future because it's not part of the JCR API. Regards, Thomas On 9/12/12 8:06 PM, Michael Marth mma...@adobe.com wrote: As an alternative: we could use a separate method getSize(int max) which * if called with max == -1 returns the exact size if quickly available, * returns -1 otherwise, and * returns the exact size but not more then max when called with max = 0. This allows for estimates but leaves the caller in control. +1 (and getSize() would still return -1 I guess?)
Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())
Hi, On Thu, Sep 13, 2012 at 9:56 AM, Thomas Mueller muel...@adobe.com wrote: [...] When I opened OAK-300, it was for the JCR API implementation. [...] Ah, I see, sorry for confusing the matter. For the JCR API I'd simply start by ensuring that getSize() will always return the correct size if the result set has fewer than something like 1000 entries. That should cover most practical cases and existing client code without requiring new query syntax or other API extensions. If getSize(int max) is just in the Oak API but not available for the end user, then we didn't really gain much :-) If we come up with a better API design in oak-core, that'll be a good candidate for inclusion in JCR 2.1. BR, Jukka Zitting
Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())
Hi, return the correct size if the result set has fewer than something like 1000 entries. That should cover most practical cases Yes. Let's discuss the value now! 1000 sounds OK in general, however there is a potential performance problem. For Jackrabbit 2.x, if there are more than a few million nodes in the repository, you can only load about 100 nodes per second (when using a regular hard disk). Now if we always try to prefetch 1000 nodes (which is what you have to do in this case), then that could result in getSize() to take 10 seconds. I think that's not acceptable. Therefore I argue the value should be about 20, which means worst case getSize() would take 0.2 seconds. Of course we can still change the value later, but I would start with 20 (or even lower) just so we detect problems with that early on. If getSize(int max) is just in the Oak API but not available for the end user, then we didn't really gain much :-) If we come up with a better API design in oak-core, that'll be a good candidate for inclusion in JCR 2.1. Yes, that makes sense of course. Regards, Thomas
Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())
Hi, On Thu, Sep 13, 2012 at 10:40 AM, Thomas Mueller muel...@adobe.com wrote: Sounds good. I guess we could do both: always prefetch 20 nodes, and if there is still time, fetch more up to 0.1 seconds or so, or at most 200 nodes. I guess 200 should be enough to for a GUI to decide what to display (10 pages for 20 nodes per page). +1 BR, Jukka Zitting
Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())
On 13.09.2012, at 01:40, Thomas Mueller muel...@adobe.com wrote: Sounds good. I guess we could do both: always prefetch 20 nodes, and if there is still time, fetch more up to 0.1 seconds or so, or at most 200 nodes. I guess 200 should be enough to for a GUI to decide what to display (10 pages for 20 nodes per page). The idea with the timeout sounds good, but what should we recommend an application to do if getSize() takes too long and returns -1? Imagine while paging search results, the first page query is fast enough (getSize() returns something), but the second is too long and now returns -1: should the application give up and loose the page navigation or should it count itself, now taking ages again In that case the oak getSize(max) is probably a better fit. (BTW, there should probably be a way to indicate the difference between search results == max and search results max, for example by returning max + 1 in the latter case). Cheers, Alex
Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())
Hi, The idea with the timeout sounds good, but what should we recommend an application to do if getSize() takes too long and returns -1? Imagine while paging search results, the first page query is fast enough (getSize() returns something), but the second is too long and now returns -1: should the application give up and loose the page navigation or should it count itself, now taking ages again If getSize() returns -1 then you know for sure there are more than 20 results, so you know you have to display 'next page'. If it returns 0 you know there are no rows. If it returns a value between 1 and 20 you know that there are this many rows. Does this answer your question? In that case the oak getSize(max) is probably a better fit. If you call getSize(Integer.MAX_VALUE), then no, it's worse, because that call might take hours. The only thing where it helps is if you have a different page size than 20. Please note if you use offset and limit, getSize() will return the size of the result minus the offset (it will give you the number of rows you can read from the result). Regards, Thomas
Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())
Hi, 2. The client does need to know the size, so it calls getSize() and I currently can't come up with a convincing use case - what is your use case? Display the actual number of search results to the user? Do you want to risk that the method getSize() takes 1.5 hours just to display the actual number of search results to the user? In this case, I suggest the application doesn't always display the exact size, he could display one of the following instead: * 0 hits * 2 hits * 13 hits * 23 hits * More than 30 hits The reason for the More than 30 hits versus the exact number of hits is: calculating the exact number of hits is very inefficient if there are more than just a few hits. For example, it doesn't make any sense to display 1045430 hits if calculating this number takes 1.5 hours. If they show next page, they usually also show the number of results and the number of pages. Yes, I understand it is a nice feature. But it is a very dangerous feature given the problems described above. In many cases the paging etc. doesn't have to be exact, so if there is a way that Oak could provide a fairly good estimate that would be really cool (like caching the numbers). Estimating the result size is sometimes possible, that's true, but it is potentially a security problem. Do you want me to describe the attack scenario? It is easier for Google and so on where everybody may see everything. Regards, Thomas
Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())
As an alternative: we could use a separate method getSize(int max) which * if called with max == -1 returns the exact size if quickly available, * returns -1 otherwise, and * returns the exact size but not more then max when called with max = 0. This allows for estimates but leaves the caller in control. +1 (and getSize() would still return -1 I guess?)
Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())
Hi, On Wed, Sep 12, 2012 at 10:37 AM, Thomas Mueller muel...@adobe.com wrote: Yes, that would work as well. I have added that to OAK-300. The disadvantage is that it's a new API (not part of the JCR specification). The two options I proposed don't require a new API. There's no need for the Oak API to reflect JCR in all its details. If there's a better way to design something, we should do so as long as it's still possible for oak-jcr to implement the relevant parts of JCR based on the underlying Oak API. BR, Jukka Zitting
Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())
On 12.09.2012, at 00:33, Thomas Mueller muel...@adobe.com wrote: Display the actual number of search results to the user? Do you want to risk that the method getSize() takes 1.5 hours just to display the actual number of search results to the user? Well, the application can decide if there are other options such as the proposed hasSize() or getSize(max int) (which I like). If they show next page, they usually also show the number of results and the number of pages. Yes, I understand it is a nice feature. But it is a very dangerous feature given the problems described above. Developers are used to get fast counts in RDBMS with dedicated indexes. Cheers, Alex
Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())
On 12.09.2012, at 01:23, Michael Dürig mdue...@apache.org wrote: As an alternative: we could use a separate method getSize(int max) which * if called with max == -1 returns the exact size if quickly available, * returns -1 otherwise, and * returns the exact size but not more then max when called with max = 0. +1 clean and obvious (points the app developer to the issue without having to know another method like hasSize()) Cheers, Alex
The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())
Hi, [moving this to oak-dev@ for a broader discussion] On Tue, Sep 11, 2012 at 9:55 AM, Thomas Mueller (JIRA) j...@apache.org wrote: [...] For compatibility with Jackrabbit 2.0, and for ease of use, it would be good to have a clearly defined way to get the size of the result. [...] I've always found the -1 return value from getSize() incredibly annoying as it forces client code to use extra conditionals and go through extra hoops if the size turns out not to be available. There are basically three potential scenarios: 1. The client doesn't need to know the size, so it never calls getSize(). 2. The client does need to know the size, so it calls getSize() and has to iterate through all results if getSize() returns -1. 3. The client could use the size (for UI, optimization, etc.), so it calls getSize() and ignores the result if its -1. The main problem I have with the -1 return value is that case 2 becomes really annoying to handle. Instead I'd propose the following design: * The getSize() method always returns the size, by buffering all results in memory if necessary. * A separate hasSize() method can be used to check if the size is quickly available (i.e. if getSize() will complete in O(1) time). With such a design the above cases become easier to handle: 1. The client doesn't need to know the size, so it never calls getSize(). 2. The client does need to know the size, so it calls getSize(). 3. The client could use the size (for UI, optimization, etc.), so it calls hasSize() and possibly follows up with getSize(). PS. Note that implementing an estimated size feature like seen in many public search engines (results 1-10 of thousands) is really difficult to implement in a manner that's both efficient and secure. Public search engines can make such estimates efficiently since all their content is public and they thus don't need to worry about accidentally leaking sensitive information. BR, Jukka Zitting