Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-06-01 Thread Chetan Mehrotra
I have started a new mail thread around "Usecases around Binary handling in
Oak" so as to first collect the kinds of use cases we need to support. Once
we agree on those, we can discuss possible solutions.

So let's continue the discussion on that thread.

Chetan Mehrotra


Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-17 Thread Angela Schreiber
Hi Oak-Devs

Just for the record: This topic has been discussed in an Adobe-internal
Oak-coordination call last Wednesday.

Michael Marth first provided some background information and
we discussed the various concerns mentioned in this thread
and tried to identify the core issue(s).

Marcel, Michael Duerig and Thomas proposed alternative approaches
on how to address the original issues that led to the API
proposal, all of which would avoid leaking out information about
the internal blob handling.

Unfortunately we ran out of time and didn't conclude the call
with an agreement on how to proceed.

From my perception the concerns raised here could not be resolved
by the additional information.

I would suggest that we try to continue the discussion here
on the list. Maybe with a summary of the alternative proposals?

Kind regards
Angela




Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-11 Thread Ian Boston
Hi,

On 11 May 2016 at 14:21, Marius Petria  wrote:

> Hi,
>
> I would add another use case in the same area, even if it is more
> problematic from the point of view of security. To better support load
>> spikes an application could return 302 redirects to (signed) S3 URLs such
>> that binaries are fetched directly from S3.

Perhaps that question exposes the underlying requirement for some
downstream users.

This is a question, not a statement:

If the application using Oak exposed a RESTful API that had all the same
functionality as [1], and was able to perform at the scale of S3, and had
the same security semantics as Oak, would applications that need
direct access to S3 or a file-based datastore be able to use that API in
preference?

Is this really about issues with scalability and performance rather than a
fundamental need to drill deep into the internals of Oak? If so, shouldn't
the scalability and performance be fixed? (assuming it's a real concern)




>
> (if this can already be done or you think is not really related to the
> other two please disregard).
>

AFAIK this is not possible at the moment. If it were, deployments could use
nginx X-Sendfile-style offloading (X-Accel-Redirect) and other request
offloading mechanisms.
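For illustration, a minimal sketch of such offloading (hypothetical servlet
and path lookup; assumes nginx maps an internal /protected/ location onto the
datastore directory):

import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class OffloadServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        // Hypothetical application-specific lookup of the blob's path
        // relative to the datastore root; not an Oak API.
        String relPath = resolveBlobPath(req.getPathInfo());
        resp.setContentType("application/octet-stream");
        // nginx intercepts this header and serves the file from its internal
        // location, so the bytes never pass through the JVM.
        resp.setHeader("X-Accel-Redirect", "/protected/" + relPath);
    }

    private String resolveBlobPath(String pathInfo) {
        return pathInfo; // placeholder for the real lookup
    }
}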

Best Regards
Ian


1 http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectOps.html




Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-11 Thread Marius Petria
Hi,

I would add another use case in the same area, even if it is more problematic
from the security point of view. To better support load spikes, an
application could return 302 redirects to (signed) S3 URLs such that binaries
are fetched directly from S3.

(if this can already be done, or you think it is not really related to the
other two, please disregard).
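For illustration only, a minimal sketch of such a redirect using the AWS SDK
for Java v1 (the bucket name and key lookup are hypothetical; this is not an
Oak API):

import java.io.IOException;
import java.util.Date;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import com.amazonaws.HttpMethod;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class BinaryRedirectServlet extends HttpServlet {
    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        // Hypothetical mapping from the request to the S3 key of the binary.
        String key = req.getPathInfo();
        // Short-lived signed URL; the client then fetches the bytes from S3.
        Date expiry = new Date(System.currentTimeMillis() + 60 * 1000);
        String url = s3.generatePresignedUrl("my-bucket", key, expiry,
                HttpMethod.GET).toString();
        resp.setStatus(HttpServletResponse.SC_FOUND); // 302
        resp.setHeader("Location", url);
    }
}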

Marius





Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-11 Thread Angela Schreiber
Hi Chetan

IMHO your original mail didn't write down the fundamental analysis
but instead presented the solution; for both of the 2 cases I was
lacking the information _why_ this is needed.

Both have been answered in private conversations only (1 today in
the oak call and 2 in a private discussion with Tom). And
having heard them didn't make me more confident that the solution
you propose is the right thing to do.

Kind regards
Angela




Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-11 Thread Chetan Mehrotra
Hi Angela,

On Tue, May 10, 2016 at 9:49 PM, Angela Schreiber  wrote:

> Quite frankly I would very much appreciate if we took the time to collect
> and write down the required (i.e. currently known and expected)
> functionality.
>
> Then look at the requirements and look at what is wrong with the current
> API such that we can't meet those requirements:
> - is it just missing API extensions that can be added with moderate effort?
> - are there fundamental problems with the current API that we need to
> address?
> - maybe we even have intrinsic issues with the way we think about the role
> of the repo?
>
> IMHO, sticking to kludges might look promising in the short term but
> I am convinced that we are better off with a fundamental analysis of
> the problems... after all the Binary topic comes up on a regular basis.
> That leaves me with the impression that yet another tiny extra and
> adaptables won't really address the core issues.
>

Makes sense.

Have a look at the initial mail in the thread at [1], which talks about
the 2 use cases I know of. The image rendition use case manifests itself in
one form or another, basically providing access to native programs via a
file path reference.

The approach proposed so far would be able to address them and hence is
closer to "is it just missing API extensions that can be added with moderate
effort?". If there is any other approach with which we can address both of
the referred use cases, then we can implement that.

Let me know if more details are required. If required I can put it up on a
wiki page also.

Chetan Mehrotra
[1]
http://markmail.org/thread/6mq4je75p64c5nyn#query:+page:1+mid:zv5dzsgmoegupd7l+state:results


Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-11 Thread Chetan Mehrotra
> what guarantees do/can we give re. this file handle within this context.
> Can it suddenly go away (e.g. because of gc or internal re-organisation)?
> How do we establish, test and maintain (e.g. protect from regressions) such
> guarantees?

Logically it should not go away suddenly, so the GC logic should be aware of
such "inUse" instances (there is already such support for in-use cases).
Such a requirement can be validated via an integration test case.
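A minimal sketch (hypothetical, not actual Oak code; BlobProcessor and
AdaptableBlob are the strawman interfaces from earlier in this thread) of how
the callback boundary could feed such an "inUse" set that the GC consults:

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class GuardedBlobStore {
    // Blob ids the GC sweep must skip. A real implementation would need
    // reference counting to cope with concurrent callbacks on one blob.
    private final Set<String> inUse = ConcurrentHashMap.newKeySet();

    public void process(String blobId, BlobProcessor processor) {
        inUse.add(blobId);
        try {
            processor.process(resolve(blobId));
        } finally {
            inUse.remove(blobId);
        }
    }

    public boolean isInUse(String blobId) {
        return inUse.contains(blobId); // queried by the GC sweep
    }

    private AdaptableBlob resolve(String blobId) {
        throw new UnsupportedOperationException("datastore lookup (hypothetical)");
    }
}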

> and more concerningly, how do we protect Oak from data corruption by
> misbehaving clients? E.g. clients writing to that handle or removing it?
> Again, if this is a public API we need ways to test this.

Not sure what is meant by misbehaving client - is it malicious (by design) or
badly written code? For the latter, yes, that might pose a problem, but we
can have some defense. I would expect the code making use of the API to
behave properly. In addition, as proposed above [1], for FileDataStore we can
provide a symlinked file reference which exposes a read-only file handle. For
S3DataStore the code would need access to the AWS credentials to perform any
write operation, which should be a sufficient defense.

> In an earlier mail you quite fittingly compared this to commit hooks,
> which for good reason are an internal SPI.

Bit of a nitpick here ;) As per the Jcr class [2] one can provide a CommitHook
instance, so not sure if we can term it internal. However, the point that I
wanted to emphasize is that Oak does provide some critical extension points,
and with misbehaving code one can shoot oneself in the foot; as the
implementation, only so much can be done.
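For reference, a sketch of how a CommitHook is supplied through the Jcr
builder referenced in [2] (a no-op hook; signatures as I recall them from Oak
trunk, so treat this as an approximation):

import javax.jcr.Repository;
import org.apache.jackrabbit.oak.api.CommitFailedException;
import org.apache.jackrabbit.oak.jcr.Jcr;
import org.apache.jackrabbit.oak.spi.commit.CommitHook;
import org.apache.jackrabbit.oak.spi.commit.CommitInfo;
import org.apache.jackrabbit.oak.spi.state.NodeState;

public class HookExample {
    public static Repository createRepository() {
        CommitHook hook = new CommitHook() {
            @Override
            public NodeState processCommit(NodeState before, NodeState after,
                    CommitInfo info) throws CommitFailedException {
                // A misbehaving implementation here can block or corrupt
                // every commit - the "shoot yourself in the foot" risk.
                return after;
            }
        };
        return new Jcr().with(hook).createRepository();
    }
}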

regards
Chetan
[1]
http://markmail.org/thread/6mq4je75p64c5nyn#query:+page:1+mid:237kzuhor5y3tpli+state:results
[2]
https://github.com/apache/jackrabbit-oak/blob/trunk/oak-jcr/src/main/java/org/apache/jackrabbit/oak/jcr/Jcr.java#L190

Chetan Mehrotra


Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-11 Thread Michael Dürig


Such an approach makes the API contract more explicit to the consumer by
providing a context outside of which there are no guarantees for the
passed "file handle". However, there are still the issues of:


- what guarantees do/can we give re. this file handle within this 
context. Can it suddenly go away (e.g. because of gc or internal 
re-organisation)? How do we establish, test and maintain (e.g. protect from
regressions) such guarantees?


- and more concerningly, how do we protect Oak from data corruption by
misbehaving clients? E.g. clients writing to that handle or removing it?
Again, if this is a public API we need ways to test this.


In an earlier mail you quite fittingly compared this to commit hooks, 
which for good reason are an internal SPI. The same applies here: this 
is a very low level concern so it must only be exposed as an internal SPI.


Michael


On 9.5.16 3:45, Chetan Mehrotra wrote:

Had an offline discussion with Michael on this and explained the use case
requirement in more detail. One concern that has been raised is that such
a generic adaptTo API is too inviting for improper use, and Oak does not
have any context around when this URL is exposed or for how long it is used.

So instead of having a generic adaptTo API at the JCR level we can have a
BlobProcessor callback (Approach #B). Below is more of a strawman proposal.
Once we have a consensus we can go over the details

interface BlobProcessor {
    void process(AdaptableBlob blob);
}

Where AdaptableBlob is

public interface AdaptableBlob {
    <AdapterType> AdapterType adaptTo(Class<AdapterType> type);
}

The BlobProcessor instance can be passed via the BlobStore API. So a client
would look for a BlobStore service (i.e. use the Oak-level API) and pass it
the ContentIdentity of the JCR Binary, aka the blobId

interface BlobStore {
    void process(String blobId, BlobProcessor processor);
}

The approach ensures

1. That any blob handle exposed is only guaranteed for the duration
of the 'process' invocation
2. There is no guarantee on the utility of the blob handle (File, S3 Object)
beyond the callback. So one should not collect the passed File handle for
later use

Hopefully this should address some of the concerns raised in this thread.
Looking forward to feedback :)
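To illustrate guarantees 1 and 2, a hedged sketch (hypothetical, not part of
the proposal itself) of how a file-backed implementation of 'process' could
invalidate the handle once the callback returns:

import java.io.File;

public class FileBlobStore {
    public void process(String blobId, BlobProcessor processor) {
        FileBackedBlob blob = new FileBackedBlob(resolveFile(blobId));
        try {
            processor.process(blob);
        } finally {
            // adaptTo() returns null from now on. Note: a File reference
            // already handed out cannot be revoked this way.
            blob.invalidate();
        }
    }

    private File resolveFile(String blobId) {
        throw new UnsupportedOperationException("datastore lookup (hypothetical)");
    }
}

class FileBackedBlob implements AdaptableBlob {
    private volatile File file;

    FileBackedBlob(File file) { this.file = file; }

    void invalidate() { file = null; }

    @SuppressWarnings("unchecked")
    public <AdapterType> AdapterType adaptTo(Class<AdapterType> type) {
        File f = file;
        return (type == File.class && f != null) ? (AdapterType) f : null;
    }
}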

Chetan Mehrotra

On Mon, May 9, 2016 at 6:24 PM, Michael Dürig  wrote:




On 9.5.16 11:43 , Chetan Mehrotra wrote:


To highlight - as mentioned earlier, the user of the proposed API is tying
itself to implementation details of Oak, and if these change later then that
code would also need to be changed. Or as Ian summed it up:

if the API is introduced it should create an out of band agreement with
the consumers of the API to act responsibly.



So what does "to act responsibly" actually mean? Are we even in a
position to precisely specify this? Experience tells me that we only find
out about those semantics after the fact when dealing with painful and
expensive customer escalations.

And even if we could, it would tie Oak into very tight constraints on how
it has to behave and how not. Constraints that would turn out prohibitively
expensive for future evolution. Furthermore a huge amount of resources
would be required to formalise such constraints via test coverage to guard
against regressions.




The method is to be used for those important cases where you do rely on
implementation details to get optimal performance in very specific
scenarios. It's like DocumentNodeStore making use of some Mongo-specific API
to perform some critical operation to achieve better performance
by checking if the underlying DocumentStore is Mongo based.



Right, but the Mongo specific API is a (hopefully) well thought through
API whereas with your proposal there are a lot of open questions and
concerns as per my last mail.

Mongo (and any other COTS DB) for good reasons also doesn't give you direct
access to its internal file handles.




I have seen the discussion of JCR-3534 and other related issues but still do
not see any conclusion on how to answer such queries where direct access to
blobs is required for performance reasons. This issue is not about exposing
the blob reference for remote access but more about an optimal path for
in-VM access



One bottom line of the discussions in that issue is that we came to a
conclusion after clarifying the specifics of the use case. Something I'm
still missing here. The case you brought forward is too general to serve as
a guideline for a solution. Quite to the contrary, to me it looks like a
solution to some problem (I'm trying to understand).




who owns the resource? Who coordinates (concurrent) access to it and how?
What are the correctness and performance implications here (races,
deadlock, corruptions, JCR semantics)?

The client code would need to be implemented in a proper way. It's more like
implementing a CommitHook. If implemented in an incorrect way it would cause
issues, deadlocks etc. But then we assume that anyone implementing that
interface would take 

Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-10 Thread Ian Boston
Hi Angela,

On 10 May 2016 at 17:19, Angela Schreiber  wrote:

> Hi Ian
>
> >Fair enough, provided there is a solution that addresses the issue Chetan
> >is trying to address.
>
> That's what we are all looking for :)
>
> >The alternative, for some applications, seems to be to store the binary data
> >outside Oak, which defeats the purpose completely.
>
> You mean with the current setup, right?
>

yes.


>
> That might well be... while I haven't been involved with a concrete
> case I wouldn't categorically reject that this might in some cases
> even be the right solution.
> But maybe I am biased due to the fact that we also have a big
> community that effectively stores and manages their user/group
> accounts outside the repository and where I am seeing plenty of
> trouble with the conception that those accounts _must_ be synced
> (i.e. copied) into the repo.
>
> So, I'd definitely like to understand why you think that this
> "completely defeats the purpose". I agree that it's not always
> desirable but nevertheless there might be valid use-cases.
>


If the purpose of Oak is to provide a content repository to store metadata
and assets, then if the application built on top of Oak, in order to
achieve its scalability targets, has to store its asset data (blobs) outside
Oak, that defeats the purpose of supporting the storage of assets within
Oak. Oak should support the storage of assets within Oak while meeting the
scalability requirements of the application. Since those are non-trivial and
hard to quantify, that means horizontal scalability limited only by the
available budget to purchase VMs or hardware.

You can argue that horizontal scalability is not really required.
I can share use cases, not exactly the same ones Chetan is working on, where
it is.
Sorry I can't share them on list.



>
> >I don't have a perfect handle on the issue he is trying to address or what
> >would be an acceptable solution, but I suspect the only solution that is
> >not vulnerable by design will be a solution that abstracts all the required
> >functionality behind an Oak API (i.e. no S3Object, File object or anything
> >that could leak) and then provides all the required functionality with an
> >acceptable level of performance in the implementation. That is doable, but
> >a lot more work.
>
> Not sure about that :-)
> Quite frankly I would very much appreciate if we took the time to collect
> and write down the required (i.e. currently known and expected)
> functionality.
>

In the context of what I said above, for AWS deployment that means wrapping
[1] so nothing can leak and supporting almost everything expressed by [2]
via an Oak API/jar in a way that enables horizontal scalability.


>
> Then look at the requirements and look at what is wrong with the current
> API such that we can't meet those requirements:
> - is it just missing API extensions that can be added with moderate effort?
> - are there fundamental problems with the current API that we need to
> address?
> - maybe we even have intrinsic issues with the way we think about the role
> of the repo?
>
> IMHO, sticking to kludges might look promising in the short term but
> I am convinced that we are better off with a fundamental analysis of
> the problems... after all the Binary topic comes up on a regular basis.
> That leaves me with the impression that yet another tiny extra and
> adaptables won't really address the core issues.
>

I agree.
It comes up time and again because the applications are being asked to do
something Oak does not currently support, so developers look for a
workaround.
It should be done properly, once and for all.
imvho, that is a lot of work upfront, but since I am not the one doing the
work it's not right for me to estimate or suggest anyone do it.

Best Regards
Ian

1
http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/model/S3Object.html
2 http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectOps.html



> Kind regards
> Angela

Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-10 Thread Angela Schreiber
Hi Ian

>Fair enough, provided there is a solution that addresses the issue Chetan
>is trying to address.

That's what we are all looking for :)

>The alternative, for some applications, seems to be to store the binary data
>outside Oak, which defeats the purpose completely.

You mean with the current setup, right?

That might well be... while I haven't been involved with a concrete
case I wouldn't categorically reject that this might in some cases
even be the right solution.
But maybe I am biased due to the fact that we also have a big
community that effectively stores and manages their user/group
accounts outside the repository and where I am seeing plenty of
trouble with the conception that those accounts _must_ be synced
(i.e. copied) into the repo.

So, I'd definitely like to understand why you think that this
"completely defeats the purpose". I agree that it's not always
desirable but nevertheless there might be valid use-cases.

>I don't have a perfect handle on the issue he is trying to address or what
>would be an acceptable solution, but I suspect the only solution that is
>not vulnerable by design will be a solution that abstracts all the required
>functionality behind an Oak API (i.e. no S3Object, File object or anything
>that could leak) and then provides all the required functionality with an
>acceptable level of performance in the implementation. That is doable, but
>a lot more work.

Not sure about that :-)
Quite frankly I would very much appreciate if we took the time to collect
and write down the required (i.e. currently known and expected)
functionality.

Then look at the requirements and look at what is wrong with the current
API such that we can't meet those requirements:
- is it just missing API extensions that can be added with moderate effort?
- are there fundamental problems with the current API that we need to
address?
- maybe we even have intrinsic issues with the way we think about the role
of the repo?

IMHO, sticking to kludges might look promising in the short term but
I am convinced that we are better off with a fundamental analysis of
the problems... after all the Binary topic comes up on a regular basis.
That leaves me with the impression that yet another tiny extra and
adaptables won't really address the core issues.

Kind regards
Angela




Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-10 Thread Michael Dürig



On 10.5.16 5:39, Ian Boston wrote:

I don't have a perfect handle on the issue he is trying to address or what
would be an acceptable solution, but I suspect the only solution that is
not vulnerable by design will be a solution that abstracts all the required
functionality behind an Oak API (i.e. no S3Object, File object or anything
that could leak) and then provides all the required functionality with an
acceptable level of performance in the implementation. That is doable, but
a lot more work.


I doubt this. It is a lot more *upfront work* vs. never-ending
firefighting in production systems.


Michael


Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-10 Thread Ian Boston
On 10 May 2016 at 15:02, Angela Schreiber  wrote:

> Hi Ian
>
> On 04/05/16 18:37, "Ian Boston"  wrote:
> >[...] The locations will quite probably leak
> >outside the context of an Oak session so the API contract should make it
> >clear that the code using a direct location needs to behave responsibly.
>
> See my reply to Chetan, who was referring to
> SlingRepository.loginAdministrative
> which always had a pretty clear API contract wrt responsible usage.
>
> As a matter of fact (and I guess you are aware of this) it turned into a
> total nightmare, with developers using it just everywhere, ignoring not
> only the API contract but also all concerns raised for years. This can
> even be seen in the Apache Sling code base itself.


> So, I am quite pessimistic about responsible usage and API contract
> and definitely prefer an API implementation that effectively enforces
> the contract.
>
> Vulnerable by design is IMHO a bad guideline for introducing new APIs.
> From my experiences they backfire usually sooner than later and need
> to be abandoned again... so, I'd rather aim for a properly secured
> solution right from the beginning.
>

Fair enough, provided there is a solution that addresses the issue Chetan
is trying to address.
The alternative, for some applications, seems to be to store the binary data
outside Oak, which defeats the purpose completely.

I don't have a perfect handle on the issue he is trying to address or what
would be an acceptable solution, but I suspect the only solution that is
not vulnerable by design will be a solution that abstracts all the required
functionality behind an Oak API (i.e. no S3Object, File object or anything
that could leak) and then provides all the required functionality with an
acceptable level of performance in the implementation. That is doable, but
a lot more work.


Best Regards
Ian


>
> Kind regards
> Angela
>

Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-10 Thread Angela Schreiber
Hi,

Same here... Francesco already summarised my concerns very nicely.

The links Michael provided below resonate with what came to my mind
regarding past discussions around binary handling in the JCR/Jackrabbit
API and in Oak.

I also distinctly remember that one key argument for the current
design of the Oak Blob API was that access to the binaries created
through this API is properly secured, because they (or their
references) are read and written from/to the Oak repository through
calls that are subject to the configured security setup, i.e. are
always secured.

@Chetan, regarding your original comment wrt security:

> 2. Security - Thomas suggested that for better security the ability to
> adapt should be restricted based on session permissions. So if the user
>has
> required permission then only adaptation would work otherwise null would
>be
> returned.

As others said before, I don't think that this is the critical
part from a security point of view... The access to the property
is secured by the authorization model present with the given
repository. IMO the troublesome part comes only _after_ the adaptation
to something else, where you lose the ability to enforce the
constraints imposed by the permission setup.

After all I am not convinced that we should rush this API into the
code base at the current state... from my PoV there are too many valid
concerns. And honestly, I weight the architectural and consistency
concerns even higher than the security issues.

Having said this: I'd rather take one step back again and start looking
for other approaches that would allow us to address the issue(s)
at hand in a better way.

Kind regards
Angela

On 09/05/16 10:58, "Michael Dürig"  wrote:

>
>Hi,
>
>I very much share Francesco's concerns here. Unconditionally exposing
>access to operating system resources underlying Oak's inner workings is
>troublesome for various reasons:
>
>- who owns the resource? Who coordinates (concurrent) access to it and
>how? What are the correctness and performance implications here (races,
>deadlock, corruptions, JCR semantics)?
>
>- it limits implementation freedom and hinders further evolution
>(chunking, de-duplication, content based addressing, compression, gc,
>etc.) for data stores.
>
>- bypassing JCR's security model
>
>Pretty much all of this has been discussed in the scope of
>https://issues.apache.org/jira/browse/JCR-3534 and
>https://issues.apache.org/jira/browse/OAK-834. So I suggest to review
>those discussions before we jump to conclusion.
>
>
>Also what is the use case requiring such a vast API surface? Can't we
>come up with an API that allows the blobs to stay under control of Oak?
>If not, this is probably an indication that those blobs shouldn't go
>into Oak but just references to them, as Francesco already proposed.
>Anything else is whether fish nor fowl: you can't have the JCR goodies
>but at the same time access underlying resources at will.
>
>Michael
>
>
>
>On 5.5.16 11:00 , Francesco Mari wrote:
>> This proposal introduces a huge leak of abstractions and has deep
>> security implications.
>>
>> I guess that the reason for this proposal is that some users of Oak would
>> like to perform some operations on binaries in a more performant way by
>> leveraging the way those binaries are stored. If this is the case, I
>> suggest those users evaluate an applicative solution implemented on top
>> of the JCR API.
>>
>> If a user needs to store some important binary data (files, images, etc.)
>> in an S3 bucket or on the file system for performance reasons, this
>> shouldn't affect how Oak handles blobs internally. If some assets are of
>> special interest for the user, then the user should bypass Oak and take
>> care of the storage of those assets directly. Oak can be used to store
>> *references* to those assets, that can be used in user code to manipulate
>> the assets in their own business logic.
>>
>> If the scenario I outlined is not what inspired this proposal, I would
>> like to know more about the reasons why this proposal was brought up.
>> Which problems are we going to solve with this API? Is there a more
>> concrete use case that we can use as a driving example?
>>
>> 2016-05-05 10:06 GMT+02:00 Davide Giannella :
>>
>>> On 04/05/2016 17:37, Ian Boston wrote:
 Hi,
 If the File or URL is writable, will writing to the location cause issues
 for Oak? IIRC some Oak DS implementations use a digest of the content to
 determine the location in the DS, so changing the content via Oak will
 change the location, but changing the content via the File or URL won't.
 If I didn't remember correctly, then ignore the concern. Fully supportive
 of the approach, as a consumer of Oak. The locations will quite probably
 leak outside the context of an Oak session so the API contract should make
 it clear that the code using a 

Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-10 Thread Angela Schreiber
Hi Ian

On 04/05/16 18:37, "Ian Boston"  wrote:
>[...] The locations will quite probably leak
>outside the context of an Oak session so the API contract should make it
>clear that the code using a direct location needs to behave responsibly.

See my reply to Chetan, who was referring to
SlingRepository.loginAdministrative
which always had a pretty clear API contract wrt responsible usage.

As a matter of fact (and I guess you are aware of this) it turned into a
total nightmare, with developers using it just everywhere, ignoring not
only the API contract but also all concerns raised for years. This can
even be seen in the Apache Sling code base itself.

So, I am quite pessimistic about responsible usage and API contract
and definitely prefer an API implementation that effectively enforces
the contract.

Vulnerable by design is IMHO a bad guideline for introducing new APIs.
From my experience they backfire usually sooner than later and need
to be abandoned again... so, I'd rather aim for a properly secured
solution right from the beginning.

Kind regards
Angela

>
>Best Regards
>Ian
>
>
>On 3 May 2016 at 15:36, Chetan Mehrotra  wrote:
>
>> Hi Team,
>>
>> For OAK-1963 we need to allow access to the actual Blob location, say in
>> the form of a File instance or an S3 object id etc. This access is needed
>> to perform optimized IO operations around the binary object e.g.
>>
>> 1. The File object can be used to spool the file content with zero copy
>> using NIO by accessing the FileChannel directly [1] (a sketch follows
>> this list)
>>
>> 2. Client code can efficiently replicate a binary stored in S3 by having
>> direct access to the S3 object using a copy operation
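For reference, a minimal sketch of the zero-copy spooling mentioned in
point 1, in plain NIO and independent of Oak (the output stream is just an
example sink):

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;

public class ZeroCopySpool {
    public static void spool(File file, OutputStream out) throws IOException {
        try (FileChannel in = new FileInputStream(file).getChannel()) {
            WritableByteChannel target = Channels.newChannel(out);
            long pos = 0;
            long size = in.size();
            while (pos < size) {
                // transferTo can use sendfile(2) underneath, avoiding copies
                // through user-space buffers.
                pos += in.transferTo(pos, size - pos, target);
            }
        }
    }
}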
>>
>> To allow such access we would need a new API in the form of
>> AdaptableBinary.
>>
>> API
>> ===
>>
>> public interface AdaptableBinary {
>>
>>     /**
>>      * Adapts the binary to another type like File, URL etc
>>      *
>>      * @param <AdapterType> The generic type to which this binary is
>>      *            adapted to
>>      * @param type The Class object of the target type, such as
>>      *            File.class
>>      * @return The adapter target or null if the binary cannot
>>      *            adapt to the requested type
>>      */
>>     <AdapterType> AdapterType adaptTo(Class<AdapterType> type);
>> }
>>
>> Usage
>> =
>>
>> Binary binProp = node.getProperty("jcr:data").getBinary();
>>
>> //Check if Binary is of type AdaptableBinary
>> if (binProp instanceof AdaptableBinary){
>>  AdaptableBinary adaptableBinary = (AdaptableBinary) binProp;
>>
>> //Adapt it to File instance
>>  File file = adaptableBinary.adaptTo(File.class);
>> }
>>
>>
>>
>> The Binary instance returned by Oak
>> i.e. org.apache.jackrabbit.oak.plugins.value.BinaryImpl would then
>> implement this interface and calling code can then check the type and
>>cast
>> it and then adapt it
>>
>> Key Points
>> 
>>
>> 1. Depending on the backing BlobStore the binary can be adapted to various
>> types. For FileDataStore it can be adapted to File. For S3DataStore it can
>> either be adapted to a URL or some S3DataStore-specific type.
>>
>> 2. Security - Thomas suggested that for better security the ability to
>> adapt should be restricted based on session permissions. So only if the
>> user has the required permission would the adaptation work; otherwise
>> null would be returned.
>>
>> 3. Adaptation proposal is based on Sling Adaptable [2]
>>
>> 4. This API is for now exposed only at the JCR level. Not sure whether we
>> should do it at the Oak level, as Blob instances are currently not bound
>> to any session. So the proposal is to place this in the
>> 'org.apache.jackrabbit.oak.api' package
>>
>> Kindly provide your feedback! Also any suggestion/guidance around how
>> the access control should be implemented
>>
>> Chetan Mehrotra
>> [1] http://www.ibm.com/developerworks/library/j-zerocopy/
>> [2]
>> https://sling.apache.org/apidocs/sling5/org/apache/sling/api/adapter/Adaptable.html
>>



Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-10 Thread Ian Boston
Hi,
By processing independently I meant async, outside the callback, e.g. inside
a Mesos+Fenzo cluster [1], with processors not running Oak.
Best Regards
Ian


1
http://techblog.netflix.com/2015/08/fenzo-oss-scheduler-for-apache-mesos.html

On 10 May 2016 at 06:02, Chetan Mehrotra  wrote:

> On Mon, May 9, 2016 at 8:27 PM, Ian Boston  wrote:
>
> > I thought the consumers of this API want things like the absolute path of
> > the File in the BlobStore, or the bucket and key of the S3 Object, so that
> > they could transmit it and use it for processing independently of Oak
> > outside the callback?
> >
>
> Most cases can still be done, just do it within the callback
>
> blobStore.process("xxx", new BlobProcessor() {
>     public void process(AdaptableBlob blob) {
>         File file = blob.adaptTo(File.class);
>         transformImage(file);
>     }
> });
>
> Doing this within the callback would allow Oak to enforce some safeguards
> (more on that in the next mail) and still allows the user to perform
> optimal binary processing
>
> Chetan Mehrotra
>


Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-09 Thread Chetan Mehrotra
Some more points around the proposed callback-based approach

1. Possible security, i.e. enforcing read-only access to the exposed file -
the file provided within the BlobProcessor callback can be a symlink
created with an OS user account which only has read access. The symlink
can be removed once the callback returns
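A hedged sketch of the symlink idea (paths and the callback type are
hypothetical; the read-only guarantee itself would come from OS-level
permissions, not from Java):

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Consumer;

public class SymlinkView {
    public static void withReadOnlyView(Path blobFile, Consumer<File> callback)
            throws IOException {
        // Expose the blob through a link in a scratch directory for the
        // duration of the callback only.
        Path link = Files.createTempDirectory("blob-view").resolve("binary.bin");
        Files.createSymbolicLink(link, blobFile);
        try {
            callback.accept(link.toFile());
        } finally {
            Files.deleteIfExists(link); // the handle dies with the callback
        }
    }
}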

2. S3DataStore security concern - For the S3DataStore we would only be
exposing the S3 object identifier, and the client code would still need the
AWS credentials to connect to the bucket and perform the required copy
operation
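For point 2, a minimal sketch with the AWS SDK for Java v1 (bucket and key
names are hypothetical): the copy happens server-side in S3 and requires the
caller's own credentials:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class S3Replicate {
    public static void replicate(String key) {
        // Credentials are resolved from the environment; without them the
        // exposed object identifier alone allows no write access.
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        s3.copyObject("source-bucket", key, "replica-bucket", key);
    }
}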

3. Possibility of further optimization in S3DataStore processing -
Currently when reading a binary from the S3DataStore the binary content is
*always* spooled to a local temporary file (in the local cache) and then an
InputStream is opened on that file. So even if the code needs to read only
the initial few bytes of the stream, the whole file has to be fetched. This
happens because with the current JCR Binary API we are not in control of the
lifetime of the exposed InputStream, so if we expose the InputStream we
cannot determine until when the backing S3 SDK resources need to be held.

Also, the current S3DataStore always creates a local copy - with a
callback-based approach we can safely expose this file, which would allow
layers above to avoid spooling the content again locally for processing. And
with the callback boundary we can later do the required cleanup
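For point 3, a sketch of the kind of optimization a callback boundary would
permit (AWS SDK v1; bucket and key names hypothetical): fetch only a byte
range instead of spooling the whole object, releasing the SDK resources when
the scope ends:

import java.io.IOException;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3Object;

public class RangedRead {
    public static byte[] readHeader(String key) throws IOException {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        GetObjectRequest req = new GetObjectRequest("bucket", key)
                .withRange(0, 1023); // first 1 KB, e.g. to sniff the file type
        try (S3Object obj = s3.getObject(req)) {
            byte[] header = new byte[1024];
            int n = obj.getObjectContent().read(header);
            // n may be less than 1024; a real caller would loop until EOF.
            return header;
        }
    }
}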


Chetan Mehrotra

On Mon, May 9, 2016 at 7:15 PM, Chetan Mehrotra 
wrote:

> Had an offline discussion with Michael on this and explained the usecase
> requirement in more details. One concern that has been raised is that such
> a generic adaptTo API is too inviting for improper use and Oak does not
> have any context around when this url is exposed for what time it is used.
>
> So instead of having a generic adaptTo API at JCR level we can have a
> BlobProcessor callback (Approach #B). Below is more of a strawman proposal.
> Once we have a consensus then we can go over the details
>
> interface BlobProcessor {
>void process(AdaptableBlob blob);
> }
>
> Where AdaptableBlob is
>
> public interface AdaptableBlob {
>  AdapterType adaptTo(Class type);
> }
>
> The BlobProcessor instance can be passed via BlobStore API. So client
> would look for a BlobStore service (so use the Oak level API) and pass it
> the ContentIdentity of JCR Binary aka blobId
>
> interface BlobStore{
>  void process(String blobId, BlobProcessor processor)
> }
>
> The approach ensures
>
> 1. That any blob handle exposed is only guaranteed for the duration
> of  'process' invocation
> 2. There is no guarantee on the utility of blob handle (File, S3 Object)
> beyond the callback. So one should not collect the passed File handle for
> later use
>
> Hopefully this should address some of the concerns raised in this thread.
> Looking forward to feedback :)
>
> Chetan Mehrotra
>
> On Mon, May 9, 2016 at 6:24 PM, Michael Dürig  wrote:
>
>>
>>
>> On 9.5.16 11:43 , Chetan Mehrotra wrote:
>>
>>> To highlight - As mentioned earlier the user of proposed api is tying
>>> itself to implementation details of Oak and if this changes later then
>>> that
>>> code would also need to be changed. Or as Ian summed it up
>>>
>>> if the API is introduced it should create an out of band agreement with

>>> the consumers of the API to act responsibly.
>>>
>>
>> So what does "to act responsibly" actually means? Are we even in a
>> position to precisely specify this? Experience tells me that we only find
>> out about those semantics after the fact when dealing with painful and
>> expensive customer escalations.
>>
>> And even if we could, it would tie Oak into very tight constraints on how
>> it has to behave and how not. Constraints that would turn out prohibitively
>> expensive for future evolution. Furthermore a huge amount of resources
>> would be required to formalise such constraints via test coverage to guard
>> against regressions.
>>
>>
>>
>>> The method is to be used for those important cases where you do rely on
>>> implementation details to get optimal performance in very specific
>>> scenarios. It's like DocumentNodeStore making use of some Mongo specific
>>> API to perform some critical operation to achieve better performance
>>> by checking if the underlying DocumentStore is Mongo based.
>>>
>>
>> Right, but the Mongo specific API is a (hopefully) well thought through
>> API, whereas with your proposal there are a lot of open questions and
>> concerns as per my last mail.
>>
>> Mongo (and any other COTS DB) for good reasons also doesn't give you direct
>> access to its internal file handles.
>>
>>
>>
>>> I have seen the discussion of JCR-3534 and other related issues but still
>>> do not see any conclusion on how to answer such queries where direct access
>>> to blobs is required for performance reasons. This issue is not about
>>> exposing the blob reference for remote access but more about an optimal
>>> path for in VM access

Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-09 Thread Marius Petria
Hi,

Can the use cases presented by Chetan be solved the other way around? Instead
of exposing implementation details via the JCR/Oak API, maybe it is possible to
include the blobId in the S3 id/filename (a prefix?), such that external
applications can identify external resources based on their Oak storage. This
can be optionally enabled for the blob stores that support such naming
conventions.
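For example (purely illustrative; the convention and separator are
assumptions):

    // S3 key derived from the Oak blobId so external tools can correlate the two
    String s3Key = blobId + "-" + originalFileName;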

Marius




On 5/9/16, 5:57 PM, "ianbos...@gmail.com on behalf of Ian Boston" 
 wrote:

>Hi,
>
>Thinking about the validity of the File and S3 Objects
>
>I thought the consumers of this API want things like the absolute path of
>the File in the BlobStore, or the bucket and key of the S3 Object, so that
>they could transmit it and use it for processing independently of Oak
>outside the callback ?
>
>Or are you proposing, if they want to do that, they should not use JCR Data
>but should (as others have suggested) store pointers to the data as JCR
>properties and not store any large scale binary data in Oak ? (ie store
>the S3 bucket and key or a relative path from a known location as
>a property of the node.)
>
>
>Best Regards
>Ian
>
>
>
>
>
>On 9 May 2016 at 14:45, Chetan Mehrotra  wrote:
>
>> Had an offline discussion with Michael on this and explained the usecase
>> requirement in more detail. One concern that has been raised is that such
>> a generic adaptTo API is too inviting for improper use, and Oak does not
>> have any context around when this URL is exposed or for how long it is used.
>>
>> So instead of having a generic adaptTo API at JCR level we can have a
>> BlobProcessor callback (Approach #B). Below is more of a strawman proposal.
>> Once we have a consensus then we can go over the details
>>
>> interface BlobProcessor {
>>void process(AdaptableBlob blob);
>> }
>>
>> Where AdaptableBlob is
>>
>> public interface AdaptableBlob {
>>  <AdapterType> AdapterType adaptTo(Class<AdapterType> type);
>> }
>>
>> The BlobProcessor instance can be passed via the BlobStore API. So the
>> client would look for a BlobStore service (i.e. use the Oak level API) and
>> pass it the ContentIdentity of the JCR Binary, aka the blobId
>>
>> interface BlobStore {
>>     void process(String blobId, BlobProcessor processor);
>> }
>>
>> The approach ensures
>>
>> 1. That any blob handle exposed is only guaranteed for the duration
>> of the 'process' invocation
>> 2. There is no guarantee on the usability of the blob handle (File, S3 object)
>> beyond the callback. So one should not keep the passed File handle for
>> later use
>>
>> Hopefully this should address some of the concerns raised in this thread.
>> Looking forward to feedback :)
>>
>> Chetan Mehrotra
>>
>> On Mon, May 9, 2016 at 6:24 PM, Michael Dürig  wrote:
>>
>> >
>> >
>> > On 9.5.16 11:43 , Chetan Mehrotra wrote:
>> >
>> >> To highlight - As mentioned earlier the user of the proposed API is tying
>> >> itself to implementation details of Oak, and if these change later then
>> >> that code would also need to be changed. Or as Ian summed it up
>> >>
>> >> if the API is introduced it should create an out of band agreement with
>> >> the consumers of the API to act responsibly.
>> >>
>> >
>> > So what does "to act responsibly" actually mean? Are we even in a
>> > position to precisely specify this? Experience tells me that we only find
>> > out about those semantics after the fact when dealing with painful and
>> > expensive customer escalations.
>> >
>> > And even if we could, it would tie Oak into very tight constraints on how
>> > it has to behave and how not. Constraints that would turn out
>> prohibitively
>> > expensive for future evolution. Furthermore a huge amount of resources
>> > would be required to formalise such constraints via test coverage to
>> guard
>> > against regressions.
>> >
>> >
>> >
>> >> The method is to be used for those important cases where you do rely on
>> >> implementation details to get optimal performance in very specific
>> >> scenarios. It's like DocumentNodeStore making use of some Mongo specific
>> >> API to perform some critical operation to achieve better performance
>> >> by checking if the underlying DocumentStore is Mongo based.
>> >>
>> >
>> > Right, but the Mongo specific API is a (hopefully) well thought through
>> > API, whereas with your proposal there are a lot of open questions and
>> > concerns as per my last mail.
>> >
>> > Mongo (and any other COTS DB) for good reasons also doesn't give you direct
>> > access to its internal file handles.
>> >
>> >
>> >
>> >> I have seen the discussion of JCR-3534 and other related issues but still
>> >> do not see any conclusion on how to answer such queries where direct
>> >> access to blobs is required for performance reasons. This issue is not
>> >> about exposing the blob reference for remote access but more about an
>> >> optimal path for in VM access
>> >>
>> >
> > One bottom line of the discussions in that issue is that we came to a
> > conclusion after clarifying the specifics of the use case.

Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-09 Thread Ian Boston
Hi,

Thinking about the validity of the File and S3 Objects

I thought the consumers of this API want things like the absolute path of
the File in the BlobStore, or the bucket and key of the S3 Object, so that
they could transmit it and use it for processing independently of Oak
outside the callback ?

Or are you proposing, if they want to do that, they should not use JCR Data
but should (as others have suggested) store pointers to the data as JCR
properties and not store any large scale binary data in Oak ? (ie store
the S3 bucket and key or a relative path from a known location as
a property of the node.)


Best Regards
Ian





On 9 May 2016 at 14:45, Chetan Mehrotra  wrote:

> Had an offline discussion with Michael on this and explained the usecase
> requirement in more detail. One concern that has been raised is that such
> a generic adaptTo API is too inviting for improper use, and Oak does not
> have any context around when this URL is exposed or for how long it is used.
>
> So instead of having a generic adaptTo API at JCR level we can have a
> BlobProcessor callback (Approach #B). Below is more of a strawman proposal.
> Once we have a consensus then we can go over the details
>
> interface BlobProcessor {
>void process(AdaptableBlob blob);
> }
>
> Where AdaptableBlob is
>
> public interface AdaptableBlob {
>  <AdapterType> AdapterType adaptTo(Class<AdapterType> type);
> }
>
> The BlobProcessor instance can be passed via the BlobStore API. So the client
> would look for a BlobStore service (i.e. use the Oak level API) and pass it
> the ContentIdentity of the JCR Binary, aka the blobId
>
> interface BlobStore {
>     void process(String blobId, BlobProcessor processor);
> }
>
> The approach ensures
>
> 1. That any blob handle exposed is only guaranteed for the duration
> of the 'process' invocation
> 2. There is no guarantee on the usability of the blob handle (File, S3 object)
> beyond the callback. So one should not keep the passed File handle for
> later use
>
> Hopefully this should address some of the concerns raised in this thread.
> Looking forward to feedback :)
>
> Chetan Mehrotra
>
> On Mon, May 9, 2016 at 6:24 PM, Michael Dürig  wrote:
>
> >
> >
> > On 9.5.16 11:43 , Chetan Mehrotra wrote:
> >
> >> To highlight - As mentioned earlier the user of the proposed API is tying
> >> itself to implementation details of Oak, and if these change later then
> >> that code would also need to be changed. Or as Ian summed it up
> >>
> >> if the API is introduced it should create an out of band agreement with
> >> the consumers of the API to act responsibly.
> >>
> >
> > So what does "to act responsibly" actually mean? Are we even in a
> > position to precisely specify this? Experience tells me that we only find
> > out about those semantics after the fact when dealing with painful and
> > expensive customer escalations.
> >
> > And even if we could, it would tie Oak into very tight constraints on how
> > it has to behave and how not. Constraints that would turn out
> prohibitively
> > expensive for future evolution. Furthermore a huge amount of resources
> > would be required to formalise such constraints via test coverage to
> guard
> > against regressions.
> >
> >
> >
> >> The method is to be used for those important cases where you do rely on
> >> implementation details to get optimal performance in very specific
> >> scenarios. It's like DocumentNodeStore making use of some Mongo specific
> >> API to perform some critical operation to achieve better performance
> >> by checking if the underlying DocumentStore is Mongo based.
> >>
> >
> > Right, but the Mongo specific API is a (hopefully) well thought through
> > API, whereas with your proposal there are a lot of open questions and
> > concerns as per my last mail.
> >
> > Mongo (and any other COTS DB) for good reasons also doesn't give you direct
> > access to its internal file handles.
> >
> >
> >
> >> I have seen the discussion of JCR-3534 and other related issues but still
> >> do not see any conclusion on how to answer such queries where direct
> >> access to blobs is required for performance reasons. This issue is not
> >> about exposing the blob reference for remote access but more about an
> >> optimal path for in VM access
> >>
> >
> > One bottom line of the discussions in that issue is that we came to a
> > conclusion after clarifying the specifics of the use case. Something I'm
> > still missing here. The case you brought forward is too general to serve
> as
> > a guideline for a solution. Quite to the contrary, to me it looks like a
> > solution to some problem (I'm trying to understand).
> >
> >
> >
> >> who owns the resource? Who coordinates (concurrent) access to it and
> how?
> >>>
> >> What are the correctness and performance implications here (races,
> >> deadlock, corruptions, JCR semantics)?
> >>
> >> The client code would need to be implemented in a proper way. It's more
> >> like implementing a CommitHook.

Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-09 Thread Chetan Mehrotra
Had an offline discussion with Michael on this and explained the usecase
requirement in more detail. One concern that has been raised is that such
a generic adaptTo API is too inviting for improper use, and Oak does not
have any context around when this URL is exposed or for how long it is used.

So instead of having a generic adaptTo API at JCR level we can have a
BlobProcessor callback (Approach #B). Below is more of a strawman proposal.
Once we have a consensus then we can go over the details

interface BlobProcessor {
   void process(AdaptableBlob blob);
}

Where AdaptableBlob is

public interface AdaptableBlob {
 <AdapterType> AdapterType adaptTo(Class<AdapterType> type);
}

The BlobProcessor instance can be passed via the BlobStore API. So the client
would look for a BlobStore service (i.e. use the Oak level API) and pass it
the ContentIdentity of the JCR Binary, aka the blobId

interface BlobStore {
    void process(String blobId, BlobProcessor processor);
}

The approach ensures

1. That any blob handle exposed is only guaranteed for the duration
of the 'process' invocation
2. There is no guarantee on the usability of the blob handle (File, S3 object)
beyond the callback. So one should not keep the passed File handle for
later use
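To make the intended usage concrete, here is a sketch of a client of this
strawman API (how the blobId is obtained, e.g. via
JackrabbitValue#getContentIdentity(), and the file based backend are
assumptions):

    import java.io.File;

    static void readWithinCallback(BlobStore blobStore, String blobId) {
        blobStore.process(blobId, new BlobProcessor() {
            @Override
            public void process(AdaptableBlob blob) {
                File file = blob.adaptTo(File.class);  // null when the backend is not file based
                if (file != null) {
                    // use the file strictly within this callback; never keep the handle
                }
            }
        });
    }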

Hopefully this should address some of the concerns raised in this thread.
Looking forward to feedback :)

Chetan Mehrotra

On Mon, May 9, 2016 at 6:24 PM, Michael Dürig  wrote:

>
>
> On 9.5.16 11:43 , Chetan Mehrotra wrote:
>
>> To highlight - As mentioned earlier the user of the proposed API is tying
>> itself to implementation details of Oak, and if these change later then
>> that code would also need to be changed. Or as Ian summed it up
>>
>> if the API is introduced it should create an out of band agreement with
>> the consumers of the API to act responsibly.
>>
>
> So what does "to act responsibly" actually mean? Are we even in a
> position to precisely specify this? Experience tells me that we only find
> out about those semantics after the fact when dealing with painful and
> expensive customer escalations.
>
> And even if we could, it would tie Oak into very tight constraints on how
> it has to behave and how not. Constraints that would turn out prohibitively
> expensive for future evolution. Furthermore a huge amount of resources
> would be required to formalise such constraints via test coverage to guard
> against regressions.
>
>
>
>> The method is to be used for those important cases where you do rely on
>> implementation details to get optimal performance in very specific
>> scenarios. It's like DocumentNodeStore making use of some Mongo specific
>> API to perform some critical operation to achieve better performance
>> by checking if the underlying DocumentStore is Mongo based.
>>
>
> Right, but the Mongo specific API is a (hopefully) well thought through
> API, whereas with your proposal there are a lot of open questions and
> concerns as per my last mail.
>
> Mongo (and any other COTS DB) for good reasons also doesn't give you direct
> access to its internal file handles.
>
>
>
>> I have seen the discussion of JCR-3534 and other related issues but still
>> do not see any conclusion on how to answer such queries where direct access
>> to blobs is required for performance reasons. This issue is not about
>> exposing the blob reference for remote access but more about an optimal
>> path for in VM access
>>
>
> One bottom line of the discussions in that issue is that we came to a
> conclusion after clarifying the specifics of the use case. Something I'm
> still missing here. The case you brought forward is too general to serve as
> a guideline for a solution. Quite to the contrary, to me it looks like a
> solution to some problem (I'm trying to understand).
>
>
>
>> who owns the resource? Who coordinates (concurrent) access to it and how?
>>>
>> What are the correctness and performance implications here (races,
>> deadlock, corruptions, JCR semantics)?
>>
>> The client code would need to be implemented in a proper way. It's more
>> like implementing a CommitHook. If implemented in an incorrect way it would
>> cause issues, deadlocks etc. But then we assume that anyone implementing
>> that interface would take proper care in the implementation.
>>
>
> But a commit hook is an internal SPI. It is not advertised to the whole
> world as a public API.
>
>
>
>>  it limits implementation freedom and hinders further evolution
>>>
>> (chunking, de-duplication, content based addressing, compression, gc,
>> etc.)
>> for data stores.
>>
>> As mentioned earlier, some parts of the API indicate a closer dependency on
>> how things work (like an SPI, or a ConsumerType API in OSGi terms). By using
>> such an API client code definitely ties itself to Oak implementation
>> details, but it should not limit how Oak implementation details evolve. So
>> when they change, client code needs to adapt itself accordingly. Oak can
>> express that by incrementing the minor version of the exported package to
>> indicate the change in behavior.

Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-09 Thread Michael Dürig



On 9.5.16 11:43 , Chetan Mehrotra wrote:

To highlight - As mentioned earlier the user of the proposed API is tying
itself to implementation details of Oak, and if these change later then that
code would also need to be changed. Or as Ian summed it up


if the API is introduced it should create an out of band agreement with
the consumers of the API to act responsibly.


So what does "to act responsibly" actually mean? Are we even in a
position to precisely specify this? Experience tells me that we only 
find out about those semantics after the fact when dealing with painful 
and expensive customer escalations.


And even if we could, it would tie Oak into very tight constraints on 
how it has to behave and how not. Constraints that would turn out 
prohibitively expensive for future evolution. Furthermore a huge amount 
of resources would be required to formalise such constraints via test 
coverage to guard against regressions.





The method is to be used for those important cases where you do rely on
implementation details to get optimal performance in very specific
scenarios. It's like DocumentNodeStore making use of some Mongo specific API
to perform some critical operation to achieve better performance
by checking if the underlying DocumentStore is Mongo based.


Right, but the Mongo specific API is a (hopefully) well thought through 
API, whereas with your proposal there are a lot of open questions and
concerns as per my last mail.


Mongo (and any other COTS DB) for good reasons also doesn't give you
direct access to its internal file handles.





I have seen the discussion of JCR-3534 and other related issues but still do
not see any conclusion on how to answer such queries where direct access to
blobs is required for performance reasons. This issue is not about exposing
the blob reference for remote access but more about an optimal path for in VM
access


One bottom line of the discussions in that issue is that we came to a 
conclusion after clarifying the specifics of the use case. Something I'm 
still missing here. The case you brought forward is too general to serve 
as a guideline for a solution. Quite to the contrary, to me it looks 
like a solution to some problem (I'm trying to understand).






who owns the resource? Who coordinates (concurrent) access to it and how?

What are the correctness and performance implications here (races,
deadlock, corruptions, JCR semantics)?

The client code would need to be implemented in a proper way. It's more like
implementing a CommitHook. If implemented in an incorrect way it would cause
issues, deadlocks etc. But then we assume that anyone implementing that
interface would take proper care in the implementation.


But a commit hook is an internal SPI. It is not advertised to the whole 
world as a public API.






 it limits implementation freedom and hinders further evolution

(chunking, de-duplication, content based addressing, compression, gc, etc.)
for data stores.

As mentioned earlier, some parts of the API indicate a closer dependency on how
things work (like an SPI, or a ConsumerType API in OSGi terms). By using such an
API client code definitely ties itself to Oak implementation details, but it
should not limit how Oak implementation details evolve. So when they change,
client code needs to adapt itself accordingly. Oak can express that
by incrementing the minor version of the exported package to indicate the change
in behavior.


Which IMO is completely contradictory. Such an API would prevent us from 
refactoring internal storage formats if a new format couldn't implement 
the API (e.g. because of chunking, compression, deduplication etc).




Can't we come up with an API that allows the blobs to stay under control

of Oak?

The code needs to work either at the OS level (say a file handle) or at the S3
object level. So I do not see a way where it can work without having access to
those details


Again, why? What's the precise use case here? If this really is the 
conclusions, then a corollary would be that those binaries must not go 
into Oak.




FWIW there is code out there which reverse engineers the blobId to access
the actual binary. People do it so as to get decent throughput in image
rendition logic for large scale deployments. The proposal here was to
formalize that approach by providing a proper API. If we do not provide
such an API then the only way for them would be to continue relying on
reverse engineering the blobId!


This is hardly a good argument. Formalising other people's hacks means 
making us liable. What *we* need to do is understand their use case and 
come up with a clean solution.






If not, this is probably an indication that those blobs shouldn't go into
Oak but just references to them, as Francesco already proposed. Anything else
is neither fish nor fowl: you can't have the JCR goodies but at the same
time access underlying resources at will.

That's a fine argument to make. But then users here have a real problem to
solve which we should not ignore. Oak based systems

Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-09 Thread Chetan Mehrotra
To highlight - As mentioned earlier the user of the proposed API is tying
itself to implementation details of Oak, and if these change later then that
code would also need to be changed. Or as Ian summed it up

> if the API is introduced it should create an out of band agreement with
the consumers of the API to act responsibly.

The method is to be used for those important cases where you do rely on
implementation details to get optimal performance in very specific
scenarios. It's like DocumentNodeStore making use of some Mongo specific API
to perform some critical operation to achieve better performance
by checking if the underlying DocumentStore is Mongo based.

I have seen the discussion of JCR-3534 and other related issues but still do
not see any conclusion on how to answer such queries where direct access to
blobs is required for performance reasons. This issue is not about exposing
the blob reference for remote access but more about an optimal path for in VM
access

> who owns the resource? Who coordinates (concurrent) access to it and how?
What are the correctness and performance implications here (races,
deadlock, corruptions, JCR semantics)?

The client code would need to be implemented in a proper way. It's more like
implementing a CommitHook. If implemented in an incorrect way it would cause
issues, deadlocks etc. But then we assume that anyone implementing that
interface would take proper care in the implementation.

>  it limits implementation freedom and hinders further evolution
(chunking, de-duplication, content based addressing, compression, gc, etc.)
for data stores.

As mentioned earlier, some parts of the API indicate a closer dependency on how
things work (like an SPI, or a ConsumerType API in OSGi terms). By using such an
API client code definitely ties itself to Oak implementation details, but it
should not limit how Oak implementation details evolve. So when they change,
client code needs to adapt itself accordingly. Oak can express that
by incrementing the minor version of the exported package to indicate the change
in behavior.

> bypassing JCR's security model

I still do not see the attack vector which we need to defend against
differently here. Again, the blob URL is not being exposed, say, as part of
WebDAV or any other remote call. So I would like to understand the security
concern better here (unless it is defending against malicious or badly
implemented client code, which we discussed above)

> Can't we come up with an API that allows the blobs to stay under control
of Oak?

The code needs to work either at the OS level (say a file handle) or at the S3
object level. So I do not see a way where it can work without having access to
those details

FWIW there is code out there which reverse engineers the blobId to access
the actual binary. People do it so as to get decent throughput in image
rendition logic for large scale deployments. The proposal here was to
formalize that approach by providing a proper API. If we do not provide
such an API then the only way for them would be to continue relying on
reverse engineering the blobId!

> If not, this is probably an indication that those blobs shouldn't go into
Oak but just references to them, as Francesco already proposed. Anything else
is neither fish nor fowl: you can't have the JCR goodies but at the same
time access underlying resources at will.

That's a fine argument to make. But then users here have a real problem to
solve which we should not ignore. Oak based systems are being proposed for
large asset deployments where one of the primary requirements is asset
handling/processing of hundreds of TB of binary data. So we would then have to
recommend for such cases not to use the JCR Binary abstraction and to manage
the binaries on your own. That would then solve both problems (though that
might break lots of tooling built on top of the JCR API to manage those
binaries)!

Thinking more - another approach that I can suggest is that people
implement their own BlobStore (maybe by extending ours) and provide this
API there, i.e. one which takes a blob id and provides the required details.
This way we "outsource" the problem. Would that be acceptable?
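A minimal sketch of that outsourcing idea (the interface is hypothetical,
deployment specific, and not existing Oak API):

    import java.io.File;

    public interface BlobLocationProvider {

        /** Resolves a blobId to its backing file, or null if the store is not file based. */
        File getFileForBlob(String blobId);
    }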

Chetan Mehrotra

On Mon, May 9, 2016 at 2:28 PM, Michael Dürig  wrote:

>
> Hi,
>
> I very much share Francesco's concerns here. Unconditionally exposing
> access to operating system resources underlying Oak's inner workings is
> troublesome for various reasons:
>
> - who owns the resource? Who coordinates (concurrent) access to it and
> how? What are the correctness and performance implications here (races,
> deadlock, corruptions, JCR semantics)?
>
> - it limits implementation freedom and hinders further evolution
> (chunking, de-duplication, content based addressing, compression, gc, etc.)
> for data stores.
>
> - bypassing JCR's security model
>
> Pretty much all of this has been discussed in the scope of
> https://issues.apache.org/jira/browse/JCR-3534 and
> https://issues.apache.org/jira/browse/OAK-834. So I suggest to review
> those discussions before we jump to conclusions.

Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-09 Thread Michael Dürig


Hi,

I very much share Francesco's concerns here. Unconditionally exposing 
access to operating system resources underlying Oak's inner workings is
troublesome for various reasons:


- who owns the resource? Who coordinates (concurrent) access to it and 
how? What are the correctness and performance implications here (races, 
deadlock, corruptions, JCR semantics)?


- it limits implementation freedom and hinders further evolution 
(chunking, de-duplication, content based addressing, compression, gc, 
etc.) for data stores.


- bypassing JCR's security model

Pretty much all of this has been discussed in the scope of 
https://issues.apache.org/jira/browse/JCR-3534 and 
https://issues.apache.org/jira/browse/OAK-834. So I suggest to review 
those discussions before we jump to conclusions.



Also what is the use case requiring such a vast API surface? Can't we
come up with an API that allows the blobs to stay under the control of Oak?
If not, this is probably an indication that those blobs shouldn't go
into Oak but just references to them, as Francesco already proposed.
Anything else is neither fish nor fowl: you can't have the JCR goodies
but at the same time access underlying resources at will.


Michael



On 5.5.16 11:00 , Francesco Mari wrote:

This proposal introduces a huge leak of abstractions and has deep security
implications.

I guess that the reason for this proposal is that some users of Oak would
like to perform some operations on binaries in a more performant way by
leveraging the way those binaries are stored. If this is the case, I
suggest those users evaluate an application-level solution implemented on top
of the JCR API.

If a user needs to store some important binary data (files, images, etc.)
in an S3 bucket or on the file system for performance reasons, this
shouldn't affect how Oak handles blobs internally. If some assets are of
special interest for the user, then the user should bypass Oak and take
care of the storage of those assets directly. Oak can be used to store
*references* to those assets, that can be used in user code to manipulate
the assets in his own business logic.

If the scenario I outlined is not what inspired this proposal, I would like
to know more about the reasons why this proposal was brought up. Which
problems are we going to solve with this API? Is there a more concrete use
case that we can use as a driving example?

2016-05-05 10:06 GMT+02:00 Davide Giannella :


On 04/05/2016 17:37, Ian Boston wrote:

Hi,
If the File or URL is writable, will writing to the location cause issues
for Oak ?
IIRC some Oak DS implementations use a digest of the content to determine
the location in the DS, so changing the content via Oak will change the
location, but changing the content via the File or URL won't. If I didn't
remember correctly, then ignore the concern. Fully supportive of the
approach, as a consumer of Oak. The locations will quite probably leak
outside the context of an Oak session so the API contract should make it
clear that the code using a direct location needs to behave responsibly.



It's a reasonable concern and I'm not in the details of the
implementation. It's worth keeping in mind though, and remember if we
want to adapt to URL or File that maybe we'll have to come up with some
sort of read-only version of such.

For the File class, IIRC, we could force/use the setReadOnly(),
setWritable() methods. I remember those to be quite expensive in time
though.

Davide







Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-05 Thread Chetan Mehrotra
On Thu, May 5, 2016 at 5:07 PM, Francesco Mari 
wrote:

>
> This is a totally different thing. The change to the node will be committed
> with the privileges of the session that retrieved the node. If the session
> doesn't have enough privileges to delete that node, the node will not be
> deleted. There is no escape from the security model.


A "bad code" when passes a node backed via admin session can still do bad
thing as admin session has all the privileges. In same way if a bad code is
passed a file handle then it can cause issue. So I am still not sure on the
attack vector which we are defending against.

Chetan Mehrotra


Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-05 Thread Francesco Mari
2016-05-05 13:22 GMT+02:00 Chetan Mehrotra :

> On Thu, May 5, 2016 at 4:38 PM, Francesco Mari 
> wrote:
>
> > The security concern is quite easy to explain: it's a bypass of our
> > security model. Imagine that, using a session with the appropriate
> > privileges, a user accesses a Blob and adapts it to a file handle, an S3
> > bucket or a URL. This code passes this reference to another piece of code
> > that modifies the data directly even if - in the same deployment - it
> > shouldn't be able to access the Blob instance to begin with.
> >
>
> How is this different from the case where some code obtains a Node via an
> admin session and passes that Node instance to other code which, say,
> deletes important content via it? In the end we have to trust the client
> code to do the correct thing when given appropriate rights. So in the current
> proposal the code can only adapt the binary if the session has the expected
> permissions. Post that we need to trust the code to behave properly.
>

This is a totally different thing. The change to the node will be committed
with the privileges of the session that retrieved the node. If the session
doesn't have enough privileges to delete that node, the node will not be
deleted. There is no escape from the security model.


>
> > In both use cases, the customer is coupling the data with the most
> > appropriate storage solution for his business case. In this case,
> > customer code - and not Oak - should be responsible for the management
> > of that data.
>
> Well then it means that the customer implements their very own DataStore-like
> solution and all the application code does not make use of JCR Binary and
> instead uses another service to resolve the references. This would greatly
> reduce the usefulness of JCR for asset heavy applications which use JCR to
> manage binary content along with its metadata
>

What I said doesn't reduce the usefulness of JCR. JCR defines an
abstraction that is independent from the actual storage solution. If a
client is fine with using the abstraction, JCR can be a very useful tool.
If a client needs to escape the abstraction, he has to do it at his own
risk without breaking the abstraction for everyone else. In the outlined
use cases, the customer needs to be responsible for his own storage
mechanisms.


>
>
> Chetan Mehrotra
>


Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-05 Thread Chetan Mehrotra
On Thu, May 5, 2016 at 4:38 PM, Francesco Mari 
wrote:

> The security concern is quite easy to explain: it's a bypass of our
> security model. Imagine that, using a session with the appropriate
> privileges, a user accesses a Blob and adapts it to a file handle, an S3
> bucket or a URL. This code passes this reference to another piece of code
> that modifies the data directly even if - in the same deployment - it
> shouldn't be able to access the Blob instance to begin with.
>

How is this different from the case where some code obtains a Node via an
admin session and passes that Node instance to other code which, say,
deletes important content via it? In the end we have to trust the client
code to do the correct thing when given appropriate rights. So in the current
proposal the code can only adapt the binary if the session has the expected
permissions. Post that we need to trust the code to behave properly.

> In both use cases, the customer is coupling the data with the most
> appropriate storage solution for his business case. In this case, customer
> code - and not Oak - should be responsible for the management of that
> data.

Well then it means that the customer implements their very own DataStore-like
solution and all the application code does not make use of JCR Binary and
instead uses another service to resolve the references. This would greatly
reduce the usefulness of JCR for asset heavy applications which use JCR to
manage binary content along with its metadata


Chetan Mehrotra


Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-05 Thread Francesco Mari
The security concern is quite easy to explain: it's a bypass of our
security model. Imagine that, using a session with the appropriate
privileges, a user accesses a Blob and adapts it to a file handle, an S3
bucket or a URL. This code passes this reference to another piece of code
that modifies the data directly even if - in the same deployment - it
shouldn't be able to access the Blob instance to begin with.

In addition to that, I'm very concerned with the correctness of this
solution. In both the use cases you mentioned above, you assume that the
leaked reference is only used to read the data. The truth is that, once a
reference leaks, we can't be sure that we are the only agent managing the
data. We would have to program defensively because we are - as a matter of
fact - sharing the management of the data with an unspecified amount of
user code. I don't even know if it's possible to anticipate every single
thing that can go wrong.

In both use cases, the customer is coupling the data with the most
appropriate storage solution for his business case. In this case, customer
code - and not Oak - should be responsible for the management of that data.
Oak can still be used to store references to that data - paths on the file
system, the ID of the S3 bucket or the URI to the resource.
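A sketch of that reference-only pattern over plain JCR (the property name,
path and URI are made up):

    import javax.jcr.Node;
    import javax.jcr.RepositoryException;
    import javax.jcr.Session;

    static void linkExternalAsset(Session session) throws RepositoryException {
        Node asset = session.getNode("/content/assets/intro-video");
        // Oak stores only the pointer; the bytes live in the external store
        asset.setProperty("externalRef", "s3://my-bucket/videos/intro.mp4");
        session.save();
    }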

2016-05-05 12:38 GMT+02:00 Chetan Mehrotra :

> > This proposal introduces a huge leak of abstractions and has deep
> security
> implications.
>
> I understand the leak of abstractions concern. However I would like to
> understand the security concern a bit more.
>
> One way I can think of that it can cause a security concern is if you have
> some malicious code running in the same JVM which can then do bad things with
> the file handle. Do note that the File handle would not get exposed via any
> remoting API we currently support. Now in this case, if malicious code is
> already running in the same JVM then security is breached and the code can
> anyway make use of reflection to access internal details.
>
> So if there is any other possible security concern then I would like to
> discuss it.
>
> Coming to usecases
>
> Usecase A - Image rendition generation
> -
>
> We have some bigger deployments where lots of images get uploaded to the
> repository and there are some conversions (rendition generation) which are
> performed by OS specific native executables. Such programs work directly on a
> file handle. Without this change we currently need to first spool the file
> content into some temporary location and then pass that to the other
> program. This adds unnecessary overhead, something which can be avoided
> when a FileDataStore is being used, where we can provide direct
> access to the file
>
> Usecase B - Efficient replication across regions in S3
> --
>
> This is for an AEM based setup which is running on Oak with the S3DataStore.
> There we have a global deployment where the author instance is running in one
> region and binary content is to be distributed to publish instances running
> in different regions. The DataStore size is huge, say 100TB, and for efficient
> operation we need to use binary-less replication. In most cases only a very
> small subset of the binary content would need to be present in other regions.
> The current way (via a shared DataStore) to support that would involve
> synchronizing the S3 bucket across all such regions, which would increase the
> storage cost considerably.
>
> Instead of that, the plan is to replicate the specific assets via the S3 copy
> operation. This would ensure that big assets can be copied efficiently at the
> S3 level, and that would require direct access to the S3 object.
>
> Again, in all such cases one can always resort to the current level of
> support, i.e. copy over all the content via an InputStream into some
> temporary store and then use that. But that would add considerable overhead
> when assets are of 100MB size or more. So the approach proposed would allow
> client code to do this efficiently depending on the underlying storage
> capability
>
> > To me sounds like breaching the JCR and NodeState layers to directly
> > manipulate NodeStore binaries (from the DataStore), e.g. to perform smart
> > replication across different instances, but imho the right way to address
> > that is extending one of the current DataStore implementations or create
> a
> > new one.
>
> The original proposed approach in OAK-1963 was like that, i.e. introduce
> this access method on BlobStore, keyed by reference. But in that case client
> code would need to deal with the BlobStore API. In either case access to the
> actual binary storage data would be required
>
> Chetan Mehrotra
>
> On Thu, May 5, 2016 at 2:49 PM, Tommaso Teofili  >
> wrote:
>
> > +1 to Francesco's concerns, exposing the location of a binary at the
> > application level doesn't sound good from a security perspective.
> 

Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-05 Thread Chetan Mehrotra
> This proposal introduces a huge leak of abstractions and has deep security
implications.

I understand the leak of abstractions concern. However I would like to
understand the security concern a bit more.

One way I can think of that it can cause a security concern is if you have some
malicious code running in the same JVM which can then do bad things with the
file handle. Do note that the File handle would not get exposed via any
remoting API we currently support. Now in this case, if malicious code is
already running in the same JVM then security is breached and the code can
anyway make use of reflection to access internal details.

So if there is any other possible security concern then I would like to
discuss it.

Coming to usecases

Usecase A - Image rendition generation
-

We have some bigger deployments where lots of images get uploaded to the
repository and there are some conversions (rendition generation) which are
performed by OS specific native executables. Such programs work directly on a
file handle. Without this change we currently need to first spool the file
content into some temporary location and then pass that to the other
program. This adds unnecessary overhead, something which can be avoided
when a FileDataStore is being used, where we can provide direct
access to the file (see the sketch below)
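A sketch of what this would look like with direct file access ('convert'
stands in for the OS specific native executable; names and paths are
assumptions):

    import java.io.File;
    import java.io.IOException;

    static void rendition(File original) throws IOException, InterruptedException {
        // no spooling step: the native tool reads the blob file directly
        Process p = new ProcessBuilder(
                "convert", original.getAbsolutePath(), "/tmp/rendition.png")
                .inheritIO()
                .start();
        p.waitFor();
    }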

Usecase B - Efficient replication across regions in S3
--

This is for an AEM based setup which is running on Oak with the S3DataStore.
There we have a global deployment where the author instance is running in one
region and binary content is to be distributed to publish instances running in
different regions. The DataStore size is huge, say 100TB, and for efficient
operation we need to use binary-less replication. In most cases only a very
small subset of the binary content would need to be present in other regions.
The current way (via a shared DataStore) to support that would involve
synchronizing the S3 bucket across all such regions, which would increase the
storage cost considerably.

Instead of that, the plan is to replicate the specific assets via the S3 copy
operation. This would ensure that big assets can be copied efficiently at the
S3 level, and that would require direct access to the S3 object.
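For illustration, such a replication step could be a pure server-side copy
(AWS SDK for Java v1 assumed; the bucket names and key are made up), so no
binary content flows through the JVM:

    import com.amazonaws.services.s3.AmazonS3;

    static void replicate(AmazonS3 s3, String s3Key) {
        // S3 moves the bytes between buckets; the JVM only issues the request
        s3.copyObject("author-datastore", s3Key, "publish-datastore-eu", s3Key);
    }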

Again, in all such cases one can always resort to the current level of
support, i.e. copy over all the content via an InputStream into some temporary
store and then use that. But that would add considerable overhead when assets
are of 100MB size or more. So the approach proposed would allow client code to
do this efficiently depending on the underlying storage capability

> To me sounds like breaching the JCR and NodeState layers to directly
> manipulate NodeStore binaries (from the DataStore), e.g. to perform smart
> replication across different instances, but imho the right way to address
> that is extending one of the current DataStore implementations or create a
> new one.

The original proposed approach in OAK-1963 was like that, i.e. introduce
this access method on BlobStore, keyed by reference. But in that case client
code would need to deal with the BlobStore API. In either case access to the
actual binary storage data would be required

Chetan Mehrotra

On Thu, May 5, 2016 at 2:49 PM, Tommaso Teofili 
wrote:

> +1 to Francesco's concerns, exposing the location of a binary at the
> application level doesn't sound good from a security perspective.
> To me it sounds like breaching the JCR and NodeState layers to directly
> manipulate NodeStore binaries (from the DataStore), e.g. to perform smart
> replication across different instances, but imho the right way to address
> that is extending one of the current DataStore implementations or creating a
> new one.
> I am also concerned that this Adaptable pattern would open room for other
> such hacks into the stack.
>
> My 2 cents,
> Tommaso
>
>
> Il giorno gio 5 mag 2016 alle ore 11:00 Francesco Mari <
> mari.france...@gmail.com> ha scritto:
>
> > This proposal introduces a huge leak of abstractions and has deep
> security
> > implications.
> >
> > I guess that the reason for this proposal is that some users of Oak would
> > like to perform some operations on binaries in a more performant way by
> > leveraging the way those binaries are stored. If this is the case, I
> > suggest those users evaluate an application-level solution implemented on
> top
> > of the JCR API.
> >
> > If a user needs to store some important binary data (files, images, etc.)
> > in an S3 bucket or on the file system for performance reasons, this
> > shouldn't affect how Oak handles blobs internally. If some assets are of
> > special interest for the user, then the user should bypass Oak and take
> > care of the storage of those assets directly. Oak can be used to store
> > *references* to those assets, that can be used in user code to manipulate
> > the assets in his own business logic.
> >
> > If the scenario I outlined is 

Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-05 Thread Tommaso Teofili
+1 to Francesco's concerns, exposing the location of a binary at the
application level doesn't sound good from a security perspective.
To me it sounds like breaching the JCR and NodeState layers to directly
manipulate NodeStore binaries (from the DataStore), e.g. to perform smart
replication across different instances, but IMHO the right way to address
that is extending one of the current DataStore implementations or creating a
new one.
I am also concerned that this Adaptable pattern would open room for other
such hacks into the stack.

My 2 cents,
Tommaso


Il giorno gio 5 mag 2016 alle ore 11:00 Francesco Mari <
mari.france...@gmail.com> ha scritto:

> This proposal introduces a huge leak of abstractions and has deep security
> implications.
>
> I guess that the reason for this proposal is that some users of Oak would
> like to perform some operations on binaries in a more performant way by
> leveraging the way those binaries are stored. If this is the case, I
> suggest those users evaluate an application-level solution implemented on top
> of the JCR API.
>
> If a user needs to store some important binary data (files, images, etc.)
> in an S3 bucket or on the file system for performance reasons, this
> shouldn't affect how Oak handles blobs internally. If some assets are of
> special interest for the user, then the user should bypass Oak and take
> care of the storage of those assets directly. Oak can be used to store
> *references* to those assets, that can be used in user code to manipulate
> the assets in his own business logic.
>
> If the scenario I outlined is not what inspired this proposal, I would like
> to know more about the reasons why this proposal was brought up. Which
> problems are we going to solve with this API? Is there a more concrete use
> case that we can use as a driving example?
>
> 2016-05-05 10:06 GMT+02:00 Davide Giannella :
>
> > On 04/05/2016 17:37, Ian Boston wrote:
> > > Hi,
> > > If the File or URL is writable, will writing to the location cause
> issues
> > > for Oak ?
> > > IIRC some Oak DS implementations use a digest of the content to
> determine
> > > the location in the DS, so changing the content via Oak will change the
> > > location, but changing the content via the File or URL won't. If I
> didn't
> > > remember correctly, then ignore the concern.  Fully supportive of the
> > > approach, as a consumer of Oak. The locations will quite probably
> > leak
> > > outside the context of an Oak session so the API contract should make
> it
> > > clear that the code using a direct location needs to behave
> responsibly.
> > >
> >
> > It's a reasonable concern and I'm not in the details of the
> > implementation. It's worth keeping in mind though, and remember if we
> > want to adapt to URL or File that maybe we'll have to come up with some
> > sort of read-only version of such.
> >
> > For the File class, IIRC, we could force/use the setReadOnly(),
> > setWritable() methods. I remember those to be quite expensive in time
> > though.
> >
> > Davide
> >
> >
> >
>


Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-05 Thread Chetan Mehrotra
On Wed, May 4, 2016 at 10:07 PM, Ian Boston  wrote:

> If the File or URL is writable, will writing to the location cause issues
> for Oak ?
>

Yes that would cause problem. Expectation here is that code using a direct
location needs to behave responsibly.

Chetan Mehrotra


Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-05 Thread Francesco Mari
This proposal introduces a huge leak of abstractions and has deep security
implications.

I guess that the reason for this proposal is that some users of Oak would
like to perform some operations on binaries in a more performant way by
leveraging the way those binaries are stored. If this is the case, I
suggest those users evaluate an application-level solution implemented on top
of the JCR API.

If a user needs to store some important binary data (files, images, etc.)
in an S3 bucket or on the file system for performance reasons, this
shouldn't affect how Oak handles blobs internally. If some assets are of
special interest for the user, then the user should bypass Oak and take
care of the storage of those assets directly. Oak can be used to store
*references* to those assets, that can be used in user code to manipulate
the assets in his own business logic.

If the scenario I outlined is not what inspired this proposal, I would like
to know more about the reasons why this proposal was brought up. Which
problems are we going to solve with this API? Is there a more concrete use
case that we can use as a driving example?

2016-05-05 10:06 GMT+02:00 Davide Giannella :

> On 04/05/2016 17:37, Ian Boston wrote:
> > Hi,
> > If the File or URL is writable, will writing to the location cause issues
> > for Oak ?
> > IIRC some Oak DS implementations use a digest of the content to determine
> > the location in the DS, so changing the content via Oak will change the
> > location, but changing the content via the File or URL won't. If I didn't
> > remember correctly, then ignore the concern.  Fully supportive of the
> > approach, as a consumer of Oak. The locations will quite probably
> leak
> > outside the context of an Oak session so the API contract should make it
> > clear that the code using a direct location needs to behave responsibly.
> >
>
> It's a reasonable concern and I'm not in the details of the
> > implementation. It's worth keeping in mind though, and remember if we
> want to adapt to URL or File that maybe we'll have to come up with some
> sort of read-only version of such.
>
> For the File class, IIRC, we could force/use the setReadOnly(),
> setWritable() methods. I remember those to be quite expensive in time
> though.
>
> Davide
>
>
>


Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-05 Thread Davide Giannella
On 04/05/2016 17:37, Ian Boston wrote:
> Hi,
> If the File or URL is writable, will writing to the location cause issues
> for Oak ?
> IIRC some Oak DS implementations use a digest of the content to determine
> the location in the DS, so changing the content via Oak will change the
> location, but changing the content via the File or URL won't. If I didn't
> remember correctly, then ignore the concern.  Fully supportive of the
> approach, as a consumer of Oak. The locations will quite probably leak
> outside the context of an Oak session so the API contract should make it
> clear that the code using a direct location needs to behave responsibly.
>

It's a reasonable concern and I'm not in the details of the
implementation. It's worth keeping in mind though, and remember if we
want to adapt to URL or File that maybe we'll have to come up with some
sort of read-only version of such.

For the File class, IIRC, we could force/use the setReadOnly(),
setWritable() methods. I remember those to be quite expensive in time
though.
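A sketch of that read-only idea at the java.io.File level (AdaptableBinary as
proposed in this thread; whether this is enforceable enough is an open
question):

    import java.io.File;

    static File readOnlyHandle(AdaptableBinary binary) {
        File file = binary.adaptTo(File.class);
        if (file != null) {
            file.setReadOnly();   // best effort: mark the handle read-only before it leaks out
        }
        return file;
    }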

Davide




Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-05 Thread Davide Giannella
On 03/05/2016 15:36, Chetan Mehrotra wrote:
> ...
> //Check if Binary is of type AdaptableBinary
> if (binProp instanceof AdaptableBinary){

Would it be possible to avoid the `instanceof`? That means, in my
opinion, all our binaries should be Adaptable. In case the
implementation is not adaptable, it can return null. Would that work as the
API contract? It would ease the usage of such an API.

Plus I would anyhow add an oak.api interface Adaptable so that we can
then, if needed, apply the same concept anywhere else (see the sketch below).
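Such a generic interface could be as small as this sketch (names are only
suggestions):

    public interface Adaptable {

        /** Returns null when this instance cannot adapt to the requested type. */
        <AdapterType> AdapterType adaptTo(Class<AdapterType> type);
    }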

> ...
>
> 1. Depending on the backing BlobStore, the binary can be adapted to various
> types. For the FileDataStore it can be adapted to a File. For the S3DataStore
> it can either be adapted to a URL or to some S3DataStore specific type.

+1

> ...
>
> 2. Security - Thomas suggested that for better security the ability to
> adapt should be restricted based on session permissions. So adaptation would
> work only if the user has the required permission; otherwise null would be
> returned.

+1


> ...
>
> 4. This API is for now exposed only at the JCR level. Not sure whether we
> should do it at the Oak level, as Blob instances are currently not bound to
> any session. So the proposal is to place this in the
> 'org.apache.jackrabbit.oak.api' package

As said above I would create an Adaptable interface at oak level and
then use it where needed. It's a powerful tool.

Cheers
Davide




Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-04 Thread Ian Boston
Hi,
If the File or URL is writable, will writing to the location cause issues
for Oak ?
IIRC some Oak DS implementations use a digest of the content to determine
the location in the DS, so changing the content via Oak will change the
location, but changing the content via the File or URL won't. If I didn't
remember correctly, then ignore the concern.  Fully supportive of the
approach, as a consumer of Oak. The locations will quite probably leak
outside the context of an Oak session so the API contract should make it
clear that the code using a direct location needs to behave responsibly.

Best Regards
Ian


On 3 May 2016 at 15:36, Chetan Mehrotra  wrote:

> Hi Team,
>
> For OAK-1963 we need to allow access to the actual Blob location, say in the
> form of a File instance or an S3 object id etc. This access is needed to
> perform optimized IO operations around the binary object, e.g.
>
> 1. The File object can be used to spool the file content with zero copy
> using NIO by accessing the File Channel directly [1]
>
> 2. Client code can efficiently replicate a binary stored in S3 by having
> direct access to S3 object using copy operation
>
> To allow such access we would need a new API in the form of
> AdaptableBinary.
>
> API
> ===
>
> public interface AdaptableBinary {
>
> /**
>  * Adapts the binary to another type like File, URL etc
>  *
>  * @param <AdapterType> The generic type to which this binary is adapted to
>  * @param type The Class object of the target type, such as File.class
>  * @return The adapter target or null if the binary cannot adapt to the
>  *         requested type
>  */
>  <AdapterType> AdapterType adaptTo(Class<AdapterType> type);
> }
>
> Usage
> =
>
> Binary binProp = node.getProperty("jcr:data").getBinary();
>
> //Check if Binary is of type AdaptableBinary
> if (binProp instanceof AdaptableBinary){
>  AdaptableBinary adaptableBinary = (AdaptableBinary) binProp;
>
> //Adapt it to File instance
>  File file = adaptableBinary.adaptTo(File.class);
> }
>
>
>
> The Binary instance returned by Oak,
> i.e. org.apache.jackrabbit.oak.plugins.value.BinaryImpl, would then
> implement this interface; calling code can then check the type, cast it
> and then adapt it
>
> Key Points
> 
>
> 1. Depending on the backing BlobStore, the binary can be adapted to various
> types. For the FileDataStore it can be adapted to a File. For the S3DataStore
> it can either be adapted to a URL or to some S3DataStore specific type.
>
> 2. Security - Thomas suggested that for better security the ability to
> adapt should be restricted based on session permissions. So adaptation would
> work only if the user has the required permission; otherwise null would be
> returned.
>
> 3. Adaptation proposal is based on Sling Adaptable [2]
>
> 4. This API is for now exposed only at the JCR level. Not sure whether we
> should do it at the Oak level, as Blob instances are currently not bound to
> any session. So the proposal is to place this in the
> 'org.apache.jackrabbit.oak.api' package
>
> Kindly provide your feedback! Also any suggestion/guidance around how the
> access control should be implemented is welcome
>
> Chetan Mehrotra
> [1] http://www.ibm.com/developerworks/library/j-zerocopy/
> [2]
>
> https://sling.apache.org/apidocs/sling5/org/apache/sling/api/adapter/Adaptable.html
>


API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-03 Thread Chetan Mehrotra
Hi Team,

For OAK-1963 we need to allow access to the actual Blob location, say in the
form of a File instance or an S3 object id etc. This access is needed to
perform optimized IO operations around the binary object, e.g.

1. The File object can be used to spool the file content with zero copy
using NIO by accessing the FileChannel directly [1] (sketched below)

2. Client code can efficiently replicate a binary stored in S3 by having
direct access to S3 object using copy operation
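As an illustration of point 1, a zero-copy spool could look like this sketch
(Java NIO; the target channel, e.g. a socket channel, is assumed to come from
elsewhere):

    import java.io.File;
    import java.io.IOException;
    import java.nio.channels.FileChannel;
    import java.nio.channels.WritableByteChannel;
    import java.nio.file.StandardOpenOption;

    static void spool(File blobFile, WritableByteChannel target) throws IOException {
        try (FileChannel src = FileChannel.open(blobFile.toPath(), StandardOpenOption.READ)) {
            long pos = 0, size = src.size();
            while (pos < size) {
                // kernel-level transfer, no copy through a user-space buffer
                pos += src.transferTo(pos, size - pos, target);
            }
        }
    }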

To allow such access we would need a new API in the form of
AdaptableBinary.

API
===

public interface AdaptableBinary {

/**
 * Adapts the binary to another type like File, URL etc
 *
 * @param <AdapterType> The generic type to which this binary is adapted to
 * @param type The Class object of the target type, such as File.class
 * @return The adapter target or null if the binary cannot adapt to the
 *         requested type
 */
 <AdapterType> AdapterType adaptTo(Class<AdapterType> type);
}

Usage
=

Binary binProp = node.getProperty("jcr:data").getBinary();

//Check if Binary is of type AdaptableBinary
if (binProp instanceof AdaptableBinary){
 AdaptableBinary adaptableBinary = (AdaptableBinary) binProp;

//Adapt it to File instance
 File file = adaptableBinary.adaptTo(File.class);
}



The Binary instance returned by Oak,
i.e. org.apache.jackrabbit.oak.plugins.value.BinaryImpl, would then
implement this interface; calling code can then check the type, cast it
and then adapt it

Key Points


1. Depending on the backing BlobStore, the binary can be adapted to various
types. For the FileDataStore it can be adapted to a File. For the S3DataStore
it can either be adapted to a URL or to some S3DataStore specific type.

2. Security - Thomas suggested that for better security the ability to
adapt should be restricted based on session permissions. So adaptation would
work only if the user has the required permission; otherwise null would be
returned.

3. Adaptation proposal is based on Sling Adaptable [2]

4. This API is for now exposed only at the JCR level. Not sure whether we
should do it at the Oak level, as Blob instances are currently not bound to
any session. So the proposal is to place this in the
'org.apache.jackrabbit.oak.api' package

Kindly provide your feedback! Also any suggestion/guidance around how the
access control should be implemented is welcome

Chetan Mehrotra
[1] http://www.ibm.com/developerworks/library/j-zerocopy/
[2]
https://sling.apache.org/apidocs/sling5/org/apache/sling/api/adapter/Adaptable.html