Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())

2012-09-20 Thread Ard Schrijvers
Hello,

Sorry to chime in so late in this thread, hope my remarks are still
welcome. I did read the entire thread, and won't reply in line, but
just try to recap and explain how we got around it in Hippo
repository. The problem is obvious:

*** How to get efficiently a correct count of total hits  when
finegrained authorization is involved ***

Regarding the remark in this thread : 'For example, it doesn't make
any sense to display 1045430 hits if calculating this number takes
1.5 hours' , I wholeheartedly agree, but our customers *never* agree.
They want the exact hit count, no matter what

So, we tackled this at Hippo as follows.

1) Next to getSize() iterator we also added getTotalSize(). I don't
like the name because it is actually more something like:
getTotalSizeWithoutCheckingACLs(). This method give you back directly
the number of hits from the backing search index. That one is fast of
course. What is slow, is authorizing potentially 1.000.000 hits
because all those nodes need to be fetched from backing storage, etc
etc. However, most of our customers have an application that show the
results for some siteuser below some folder : The siteuser has read
access for the entire folder. We just show the
getTotalSizeWithoutCheckingACLs() as total hits. Worst case, the
number is higher than the actual number the siteuser is allowed to
read

2) We have our ACLs based on node properties. Hence, we have been able
to create an AuthorizationQuery, mapped directly to a cached Lucene
bitset. When a jcr session searches in our repository, we combine his
cached authorization bitset. After changes in the repository we need
to reload these bitsets (on request) but they are shared between all
users that have the same authorization. It is blistering fast that
way, resulting in correct authorized counts

Now, I don't think (2) can be part of oak, as it implies a certain ACL
model which is not generic enough. Quite some ACL mappings of course
cannot be translated to a Lucene query. However, (1) should be very
issue, and already is a lot better. It is up to the developer then to
use getTotalSizeWithoutCheckingACLs() (and then a decent name :-) or
not

My 2 cents

Regards Ard

On Tue, Sep 11, 2012 at 12:08 PM, Jukka Zitting jukka.zitt...@gmail.com wrote:
 Hi,

 [moving this to oak-dev@ for a broader discussion]

 On Tue, Sep 11, 2012 at 9:55 AM, Thomas Mueller (JIRA) j...@apache.org 
 wrote:
 [...] For compatibility with Jackrabbit 2.0, and for ease of use, it would 
 be good to
 have a clearly defined way to get the size of the result. [...]

 I've always found the -1 return value from getSize() incredibly
 annoying as it forces client code to use extra conditionals and go
 through extra hoops if the size turns out not to be available. There
 are basically three potential scenarios:

 1. The client doesn't need to know the size, so it never calls getSize().
 2. The client does need to know the size, so it calls getSize() and
 has to iterate through all results if getSize() returns -1.
 3. The client could use the size (for UI, optimization, etc.), so it
 calls getSize() and ignores the result if its -1.

 The main problem I have with the -1 return value is that case 2
 becomes really annoying to handle.

 Instead I'd propose the following design:

 * The getSize() method always returns the size, by buffering all
 results in memory if necessary.
 * A separate hasSize() method can be used to check if the size is
 quickly available (i.e. if getSize() will complete in O(1) time).

 With such a design the above cases become easier to handle:

 1. The client doesn't need to know the size, so it never calls getSize().
 2. The client does need to know the size, so it calls getSize().
 3. The client could use the size (for UI, optimization, etc.), so it
 calls hasSize() and possibly follows up with getSize().

 PS. Note that implementing an estimated size feature like seen in
 many public search engines (results 1-10 of thousands) is really
 difficult to implement in a manner that's both efficient and secure.
 Public search engines can make such estimates efficiently since all
 their content is public and they thus don't need to worry about
 accidentally leaking sensitive information.

 BR,

 Jukka Zitting



-- 
Amsterdam - Oosteinde 11, 1017 WT Amsterdam
Boston - 1 Broadway, Cambridge, MA 02142

US +1 877 414 4776 (toll free)
Europe +31(0)20 522 4466
www.onehippo.com


Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())

2012-09-20 Thread Thomas Mueller
Hi,

1) Next to getSize() iterator we also added getTotalSize(). I don't
like the name because it is actually more something like:
getTotalSizeWithoutCheckingACLs().

Hm, wouldn't that be a security problem?

Couldn't it be better (from a security perspective) if you can only get
this number if you use an admin session or similar?

The query engine currently doesn't have a way to retrieve the size,
without reading the nodes - even for the admin session. Could you create
an (OAK-) issue about this? It could be implemented, but I wonder if we
shouldn't rather implement aggregate functions in the query language:

select count(*) from [nt:base] where ...

That way the query engine knows in advance it doesn't have to read the
nodes at all.

Regards,
Thomas



Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())

2012-09-19 Thread Alexander Klimetschek
On 14.09.2012, at 07:41, Thomas Mueller muel...@adobe.com wrote:

 If getSize() returns -1 then you know for sure there are more than 20
 results, so you know you have to display 'next page'.

How do I know it's for sure more than 20, or whatever my page size happens to 
be? -1 simply means: no idea. That the implementation does a timeout handling 
cannot really be assumed on a JCR application level.

 In that case the oak getSize(max) is probably a better fit.
 
 If you call getSize(Integer.MAX_VALUE), then no, it's worse, because that
 call might take hours. The only thing where it helps is if you have a
 different page size than 20.

I was not referring to Integer.MAX_VALUE, but simply to the proposed new method.

 Please note if you use offset and limit, getSize() will return the size of
 the result minus the offset (it will give you the number of rows you can
 read from the result).

Really? Is that according to the JCR spec?

Cheers,
Alex

Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())

2012-09-19 Thread Thomas Mueller
Hi,

How do I know it's for sure more than 20

Because the PrefetchIterator will try to prefetch 20 nodes.

or whatever my page size happens to be?

If you have a higher page size then you need getSize(max).

Please note if you use offset and limit, getSize() will return the size
of
 the result minus the offset (it will give you the number of rows you can
 read from the result).

Really? Is that according to the JCR spec?

Is it not?

Regards,
Thomas



Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())

2012-09-13 Thread Michael Dürig



On 12.9.12 19:06, Michael Marth wrote:

As an alternative: we could use a separate method getSize(int max) which

* if called with max == -1 returns the exact size if quickly available,
* returns -1 otherwise, and
* returns the exact size but not more then max when called with max = 0.

This allows for estimates but leaves the caller in control.


+1

(and getSize() would still return -1 I guess?)


For backward compatibility I'd leave the behaviour as unchanged as 
possible. That is, return -1 if the size is not quickly available.


Michael





Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())

2012-09-13 Thread Thomas Mueller
Hi,

There's no need for the Oak API to reflect JCR in all its details.

Sure. First we need to define how the JCR API implementation is supposed
to behave. Based on that we can then still decide what the Oak API should
look like. The Oak API is (more or less) an implementation detail. Of
course the Oak API support more features, but that's not really relevant
for the user of Oak. When I opened OAK-300, it was for the JCR API
implementation. If getSize(int max) is just in the Oak API but not
available for the end user, then we didn't really gain much :-)

Regards,
Thomas



Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())

2012-09-13 Thread Thomas Mueller
Hi,

The main question is still what the JCR API method getSize() should return.


A new method getSize(int max) is nice, and of course we can do that. But I
guess people will not use it in the near future because it's not part of
the JCR API.

Regards,
Thomas




On 9/12/12 8:06 PM, Michael Marth mma...@adobe.com wrote:

 As an alternative: we could use a separate method getSize(int max) which
 
 * if called with max == -1 returns the exact size if quickly available,
 * returns -1 otherwise, and
 * returns the exact size but not more then max when called with max =
0.
 
 This allows for estimates but leaves the caller in control.

+1

(and getSize() would still return -1 I guess?)



Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())

2012-09-13 Thread Jukka Zitting
Hi,

On Thu, Sep 13, 2012 at 9:56 AM, Thomas Mueller muel...@adobe.com wrote:
 [...] When I opened OAK-300, it was for the JCR API implementation. [...]

Ah, I see, sorry for confusing the matter.

For the JCR API I'd simply start by ensuring that getSize() will
always return the correct size if the result set has fewer than
something like 1000 entries. That should cover most practical cases
and existing client code without requiring new query syntax or other
API extensions.

 If getSize(int max) is just in the Oak API but not
 available for the end user, then we didn't really gain much :-)

If we come up with a better API design in oak-core, that'll be a good
candidate for inclusion in JCR 2.1.

BR,

Jukka Zitting


Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())

2012-09-13 Thread Thomas Mueller
Hi,

return the correct size if the result set has fewer than
something like 1000 entries. That should cover most practical cases

Yes. Let's discuss the value now! 1000 sounds OK in general, however there
is a potential performance problem. For Jackrabbit 2.x, if there are more
than a few million nodes in the repository, you can only load about 100
nodes per second (when using a regular hard disk). Now if we always try to
prefetch 1000 nodes (which is what you have to do in this case), then that
could result in getSize() to take 10 seconds. I think that's not
acceptable. Therefore I argue the value should be about 20, which means
worst case getSize() would take 0.2 seconds. Of course we can still change
the value later, but I would start with 20 (or even lower) just so we
detect problems with that early on.

 If getSize(int max) is just in the Oak API but not
 available for the end user, then we didn't really gain much :-)

If we come up with a better API design in oak-core, that'll be a good
candidate for inclusion in JCR 2.1.

Yes, that makes sense of course.

Regards,
Thomas



Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())

2012-09-13 Thread Jukka Zitting
Hi,

On Thu, Sep 13, 2012 at 10:40 AM, Thomas Mueller muel...@adobe.com wrote:
 Sounds good. I guess we could do both: always prefetch 20 nodes, and if
 there is still time, fetch more up to 0.1 seconds or so, or at most 200
 nodes. I guess 200 should be enough to for a GUI to decide what to display
 (10 pages for 20 nodes per page).

+1

BR,

Jukka Zitting


Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())

2012-09-13 Thread Alexander Klimetschek
On 13.09.2012, at 01:40, Thomas Mueller muel...@adobe.com wrote:

 Sounds good. I guess we could do both: always prefetch 20 nodes, and if
 there is still time, fetch more up to 0.1 seconds or so, or at most 200
 nodes. I guess 200 should be enough to for a GUI to decide what to display
 (10 pages for 20 nodes per page).

The idea with the timeout sounds good, but what should we recommend an 
application to do if getSize() takes too long and returns -1? 

Imagine while paging search results, the first page query is fast enough 
(getSize() returns something), but the second is too long and now returns -1: 
should the application give up and loose the page navigation or should it 
count itself, now taking ages again

In that case the oak getSize(max) is probably a better fit. (BTW, there should 
probably be a way to indicate the difference between search results == max and 
search results  max, for example by returning max + 1 in the latter case).

Cheers,
Alex



Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())

2012-09-13 Thread Thomas Mueller
Hi,

The idea with the timeout sounds good, but what should we recommend an
application to do if getSize() takes too long and returns -1?

Imagine while paging search results, the first page query is fast enough
(getSize() returns something), but the second is too long and now returns
-1: should the application give up and loose the page navigation or
should it count itself, now taking ages again

If getSize() returns -1 then you know for sure there are more than 20
results, so you know you have to display 'next page'. If it returns 0 you
know there are no rows. If it returns a value between 1 and 20 you know
that there are this many rows. Does this answer your question?

In that case the oak getSize(max) is probably a better fit.

If you call getSize(Integer.MAX_VALUE), then no, it's worse, because that
call might take hours. The only thing where it helps is if you have a
different page size than 20.

Please note if you use offset and limit, getSize() will return the size of
the result minus the offset (it will give you the number of rows you can
read from the result).

Regards,
Thomas



Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())

2012-09-12 Thread Thomas Mueller
Hi,

2. The client does need to know the size, so it calls getSize() and
 
 I currently can't come up with a convincing use case - what is your use
 case?

Display the actual number of search results to the user?

Do you want to risk that the method getSize() takes 1.5 hours just to
display the actual number of search results to the user?

In this case, I suggest the application doesn't always display the exact
size, he could display one of the following instead:


* 0 hits
* 2 hits
* 13 hits
* 23 hits
* More than 30 hits

The reason for the More than 30 hits versus the exact number of hits is:
calculating the exact number of hits is very inefficient if there are more
than just a few hits. For example, it doesn't make any sense to display
1045430 hits if calculating this number takes 1.5 hours.

If they show next page, they usually also show the number of results
and the number of pages.

Yes, I understand it is a nice feature. But it is a very dangerous feature
given the problems described above.

In many cases the paging etc. doesn't have to be exact, so if there is a
way that Oak could provide a fairly good estimate that would be really
cool (like caching the numbers).

Estimating the result size is sometimes possible, that's true, but it is
potentially a security problem. Do you want me to describe the attack
scenario? It is easier for Google and so on where everybody may see
everything.

Regards,
Thomas
 



Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())

2012-09-12 Thread Michael Marth
 As an alternative: we could use a separate method getSize(int max) which
 
 * if called with max == -1 returns the exact size if quickly available,
 * returns -1 otherwise, and
 * returns the exact size but not more then max when called with max = 0.
 
 This allows for estimates but leaves the caller in control.

+1

(and getSize() would still return -1 I guess?)


Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())

2012-09-12 Thread Jukka Zitting
Hi,

On Wed, Sep 12, 2012 at 10:37 AM, Thomas Mueller muel...@adobe.com wrote:
 Yes, that would work as well. I have added that to OAK-300. The
 disadvantage is that it's a new API (not part of the JCR specification).
 The two options I proposed don't require a new API.

There's no need for the Oak API to reflect JCR in all its details. If
there's a better way to design something, we should do so as long as
it's still possible for oak-jcr to implement the relevant parts of JCR
based on the underlying Oak API.

BR,

Jukka Zitting


Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())

2012-09-12 Thread Alexander Klimetschek
On 12.09.2012, at 00:33, Thomas Mueller muel...@adobe.com wrote:

 Display the actual number of search results to the user?
 
 Do you want to risk that the method getSize() takes 1.5 hours just to
 display the actual number of search results to the user?

Well, the application can decide if there are other options such as the 
proposed hasSize() or getSize(max int) (which I like).

 If they show next page, they usually also show the number of results
 and the number of pages.
 
 Yes, I understand it is a nice feature. But it is a very dangerous feature
 given the problems described above.

Developers are used to get fast counts in RDBMS with dedicated indexes.

Cheers,
Alex

Re: The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())

2012-09-12 Thread Alexander Klimetschek
On 12.09.2012, at 01:23, Michael Dürig mdue...@apache.org wrote:

 As an alternative: we could use a separate method getSize(int max) which
 
 * if called with max == -1 returns the exact size if quickly available,
 * returns -1 otherwise, and
 * returns the exact size but not more then max when called with max = 0.

+1 clean and obvious (points the app developer to the issue without having to 
know another method like hasSize())

Cheers,
Alex

The infamous getSize() == -1 (Was: [jira] [Created] (OAK-300) Query: QueryResult.getRows().getSize())

2012-09-11 Thread Jukka Zitting
Hi,

[moving this to oak-dev@ for a broader discussion]

On Tue, Sep 11, 2012 at 9:55 AM, Thomas Mueller (JIRA) j...@apache.org wrote:
 [...] For compatibility with Jackrabbit 2.0, and for ease of use, it would be 
 good to
 have a clearly defined way to get the size of the result. [...]

I've always found the -1 return value from getSize() incredibly
annoying as it forces client code to use extra conditionals and go
through extra hoops if the size turns out not to be available. There
are basically three potential scenarios:

1. The client doesn't need to know the size, so it never calls getSize().
2. The client does need to know the size, so it calls getSize() and
has to iterate through all results if getSize() returns -1.
3. The client could use the size (for UI, optimization, etc.), so it
calls getSize() and ignores the result if its -1.

The main problem I have with the -1 return value is that case 2
becomes really annoying to handle.

Instead I'd propose the following design:

* The getSize() method always returns the size, by buffering all
results in memory if necessary.
* A separate hasSize() method can be used to check if the size is
quickly available (i.e. if getSize() will complete in O(1) time).

With such a design the above cases become easier to handle:

1. The client doesn't need to know the size, so it never calls getSize().
2. The client does need to know the size, so it calls getSize().
3. The client could use the size (for UI, optimization, etc.), so it
calls hasSize() and possibly follows up with getSize().

PS. Note that implementing an estimated size feature like seen in
many public search engines (results 1-10 of thousands) is really
difficult to implement in a manner that's both efficient and secure.
Public search engines can make such estimates efficiently since all
their content is public and they thus don't need to worry about
accidentally leaking sensitive information.

BR,

Jukka Zitting