Re: API proposal for - Expose URL for Blob source (OAK-1963)
I have started a new mail thread around "Usecases around Binary handling in Oak" so as to first collect the kinds of usecases we need to support. Once we decide that, we can discuss the possible solutions. So let's continue the discussion on that thread. Chetan Mehrotra On Tue, May 17, 2016 at 12:31 PM, Angela Schreiber wrote: > Hi Oak-Devs > > Just for the record: This topic has been discussed in an Adobe > internal Oak-coordination call last Wednesday. > > Michael Marth first provided some background information and > we discussed the various concerns mentioned in this thread > and tried to identify the core issue(s). > > Marcel, Michael Duerig and Thomas proposed alternative approaches > on how to address the original issues that led to the API > proposal, all of which would avoid leaking out information about > the internal blob handling. > > Unfortunately we ran out of time and didn't conclude the call > with an agreement on how to proceed. > > From my perception the concerns raised here could not be resolved > by the additional information. > > I would suggest that we try to continue the discussion here > on the list. Maybe with a summary of the alternative proposals? > > Kind regards > Angela > > On 11/05/16 15:38, "Ian Boston" wrote: > > >Hi, > > > >On 11 May 2016 at 14:21, Marius Petria wrote: > > > >> Hi, > >> > >> I would add another use case in the same area, even if it is more > >> problematic from the point of view of security. To better support load > >> spikes an application could return 302 redirects to (signed) S3 URLs such > >> that binaries are fetched directly from S3. > >> > > > >Perhaps that question exposes the underlying requirement for some > >downstream users. > > > >This is a question, not a statement: > > > >If the application using Oak exposed a RESTful API that had all the same > >functionality as [1], and was able to perform at the scale of S3, and had > >the same security semantics as Oak, would applications that need > >direct access to S3 or a file based datastore be able to use that API in > >preference? > > > >Is this really about issues with scalability and performance rather than a > >fundamental need to drill deep into the internals of Oak? If so, shouldn't > >the scalability and performance be fixed? (assuming it's a real concern) > > > > > >> > >> (if this can already be done or you think is not really related to the > >> other two please disregard). > >> > > > >AFAIK this is not possible at the moment. If it were, deployments could use > >nginx X-SendFile and other request offloading mechanisms. > > > >Best Regards > >Ian > > > > > >1 http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectOps.html > > > > > >> > >> Marius > >> > >> > >> > >> On 5/11/16, 1:41 PM, "Angela Schreiber" wrote: > >> > >> >Hi Chetan > >> > > >> >IMHO your original mail didn't write down the fundamental analysis > >> >but instead presented the solution; for each of the 2 cases I was > >> >lacking the information _why_ this is needed. > >> > > >> >Both have been answered in private conversations only (1 today in > >> >the oak call and 2 in a private discussion with tom). And > >> >having heard them didn't make me more confident that the solution > >> >you propose is the right thing to do. 
> >> > > >> >Kind regards > >> >Angela > >> > > >> >On 11/05/16 12:17, "Chetan Mehrotra" > wrote: > >> > > >> >>Hi Angela, > >> >> > >> >>On Tue, May 10, 2016 at 9:49 PM, Angela Schreiber > >> >>wrote: > >> >> > >> >>> Quite frankly I would very much appreciate if we took the time to collect > >> >>> and write down the required (i.e. currently known and expected) > >> >>> functionality. > >> >>> > >> >>> Then look at the requirements and look at what is wrong with the current > >> >>> API such that we can't meet those requirements: > >> >>> - is it just missing API extensions that can be added with moderate > >> >>>effort? > >> >>> - are there fundamental problems with the current API that we need to > >> >>> address? > >> >>> - maybe we even have intrinsic issues with the way we think about the > >> >>>role > >> >>> of the repo? > >> >>> > >> >>> IMHO, sticking to kludges might look promising in the short term but > >> >>> I am convinced that we are better off with a fundamental analysis of > >> >>> the problems... after all the Binary topic comes up on a regular basis. > >> >>> That leaves me with the impression that yet another tiny extra and > >> >>> adaptables won't really address the core issues. > >> >>> > >> >> > >> >>Makes sense. > >> >> > >> >>Have a look at the initial mail in the thread at [1], which talks about > >> >>the 2 usecases I know of. The image rendition usecase manifests itself in one > >> >>form or another, basically providing access to native programs via a file path > >> >>reference. > >> >> > >> >>The approach proposed
Re: API proposal for - Expose URL for Blob source (OAK-1963)
Hi Oak-Devs Just for the record: This topic has been discussed in an Adobe internal Oak-coordination call last Wednesday. Michael Marth first provided some background information and we discussed the various concerns mentioned in this thread and tried to identify the core issue(s). Marcel, Michael Duerig and Thomas proposed alternative approaches on how to address the original issues that led to the API proposal, all of which would avoid leaking out information about the internal blob handling. Unfortunately we ran out of time and didn't conclude the call with an agreement on how to proceed. From my perception the concerns raised here could not be resolved by the additional information. I would suggest that we try to continue the discussion here on the list. Maybe with a summary of the alternative proposals? Kind regards Angela On 11/05/16 15:38, "Ian Boston" wrote: >Hi, > >On 11 May 2016 at 14:21, Marius Petria wrote: > >> Hi, >> >> I would add another use case in the same area, even if it is more >> problematic from the point of view of security. To better support load >> spikes an application could return 302 redirects to (signed) S3 URLs such >> that binaries are fetched directly from S3. >> > >Perhaps that question exposes the underlying requirement for some >downstream users. > >This is a question, not a statement: > >If the application using Oak exposed a RESTful API that had all the same >functionality as [1], and was able to perform at the scale of S3, and had >the same security semantics as Oak, would applications that need >direct access to S3 or a file based datastore be able to use that API in >preference? > >Is this really about issues with scalability and performance rather than a >fundamental need to drill deep into the internals of Oak? If so, shouldn't >the scalability and performance be fixed? (assuming it's a real concern) > > > >> >> (if this can already be done or you think is not really related to the >> other two please disregard). >> > >AFAIK this is not possible at the moment. If it were, deployments could use >nginx X-SendFile and other request offloading mechanisms. > >Best Regards >Ian > > >1 http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectOps.html > > >> >> Marius >> >> >> >> On 5/11/16, 1:41 PM, "Angela Schreiber" wrote: >> >> >Hi Chetan >> > >> >IMHO your original mail didn't write down the fundamental analysis >> >but instead presented the solution; for each of the 2 cases I was >> >lacking the information _why_ this is needed. >> > >> >Both have been answered in private conversations only (1 today in >> >the oak call and 2 in a private discussion with tom). And >> >having heard them didn't make me more confident that the solution >> >you propose is the right thing to do. >> > >> >Kind regards >> >Angela >> > >> >On 11/05/16 12:17, "Chetan Mehrotra" wrote: >> > >> >>Hi Angela, >> >> >> >>On Tue, May 10, 2016 at 9:49 PM, Angela Schreiber >> >>wrote: >> >> >> >>> Quite frankly I would very much appreciate if we took the time to collect >> >>> and write down the required (i.e. currently known and expected) >> >>> functionality. >> >>> >> >>> Then look at the requirements and look at what is wrong with the current >> >>> API such that we can't meet those requirements: >> >>> - is it just missing API extensions that can be added with moderate >> >>>effort? >> >>> - are there fundamental problems with the current API that we need to >> >>> address? >> >>> - maybe we even have intrinsic issues with the way we think about the >> >>>role >> >>> of the repo? 
>> >>> >> >>> IMHO, sticking to kludges might look promising in the short term but >> >>> I am convinced that we are better off with a fundamental analysis of >> >>> the problems... after all the Binary topic comes up on a regular basis. >> >>> That leaves me with the impression that yet another tiny extra and >> >>> adaptables won't really address the core issues. >> >>> >> >> >> >>Makes sense. >> >> >> >>Have a look at the initial mail in the thread at [1], which talks about >> >>the 2 usecases I know of. The image rendition usecase manifests itself in one >> >>form or another, basically providing access to native programs via a file path >> >>reference. >> >> >> >>The approach proposed so far would be able to address them and hence is closer >> >>to "is it just missing API extensions that can be added with moderate >> >>effort?". If there are any other approaches that can address both of the >> >>referred usecases then we can implement them. >> >> >> >>Let me know if more details are required. If required I can put it up on a >> >>wiki page also. >> >> >> >>Chetan Mehrotra >> >>[1] >> >>http://markmail.org/thread/6mq4je75p64c5nyn#query:+page:1+mid:zv5dzsgmoegupd7l+state:results >> > >>
Re: API proposal for - Expose URL for Blob source (OAK-1963)
Hi, On 11 May 2016 at 14:21, Marius Petria wrote: > Hi, > > I would add another use case in the same area, even if it is more > problematic from the point of view of security. To better support load > spikes an application could return 302 redirects to (signed) S3 URLs such > that binaries are fetched directly from S3. > Perhaps that question exposes the underlying requirement for some downstream users. This is a question, not a statement: If the application using Oak exposed a RESTful API that had all the same functionality as [1], and was able to perform at the scale of S3, and had the same security semantics as Oak, would applications that need direct access to S3 or a file based datastore be able to use that API in preference? Is this really about issues with scalability and performance rather than a fundamental need to drill deep into the internals of Oak? If so, shouldn't the scalability and performance be fixed? (assuming it's a real concern) > > (if this can already be done or you think is not really related to the > other two please disregard). > AFAIK this is not possible at the moment. If it were, deployments could use nginx X-SendFile and other request offloading mechanisms. Best Regards Ian 1 http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectOps.html > > Marius > > > > On 5/11/16, 1:41 PM, "Angela Schreiber" wrote: > > >Hi Chetan > > > >IMHO your original mail didn't write down the fundamental analysis > >but instead presented the solution; for each of the 2 cases I was > >lacking the information _why_ this is needed. > > > >Both have been answered in private conversations only (1 today in > >the oak call and 2 in a private discussion with tom). And > >having heard them didn't make me more confident that the solution > >you propose is the right thing to do. > > > >Kind regards > >Angela > > > >On 11/05/16 12:17, "Chetan Mehrotra" wrote: > > > >>Hi Angela, > >> > >>On Tue, May 10, 2016 at 9:49 PM, Angela Schreiber > >>wrote: > >> > >>> Quite frankly I would very much appreciate if we took the time to collect > >>> and write down the required (i.e. currently known and expected) > >>> functionality. > >>> > >>> Then look at the requirements and look at what is wrong with the current > >>> API such that we can't meet those requirements: > >>> - is it just missing API extensions that can be added with moderate > >>>effort? > >>> - are there fundamental problems with the current API that we need to > >>> address? > >>> - maybe we even have intrinsic issues with the way we think about the > >>>role > >>> of the repo? > >>> > >>> IMHO, sticking to kludges might look promising in the short term but > >>> I am convinced that we are better off with a fundamental analysis of > >>> the problems... after all the Binary topic comes up on a regular basis. > >>> That leaves me with the impression that yet another tiny extra and > >>> adaptables won't really address the core issues. > >>> > >> > >>Makes sense. > >> > >>Have a look at the initial mail in the thread at [1], which talks about > >>the 2 usecases I know of. The image rendition usecase manifests itself in one > >>form or another, basically providing access to native programs via a file path > >>reference. > >> > >>The approach proposed so far would be able to address them and hence is closer > >>to "is it just missing API extensions that can be added with moderate > >>effort?". If there are any other approaches that can address both of the > >>referred usecases then we can implement them. > >> > >>Let me know if more details are required. 
If required I can put it up on a wiki page also. > >> > >>Chetan Mehrotra > >>[1] > >>http://markmail.org/thread/6mq4je75p64c5nyn#query:+page:1+mid:zv5dzsgmoegupd7l+state:results > > >
Re: API proposal for - Expose URL for Blob source (OAK-1963)
Hi, I would add another use case in the same area, even if it is more problematic from the point of view of security. To better support load spikes an application could return 302 redirects to (signed) S3 URLs such that binaries are fetched directly from S3. (if this can already be done or you think is not really related to the other two please disregard). Marius On 5/11/16, 1:41 PM, "Angela Schreiber" wrote: >Hi Chetan > >IMHO your original mail didn't write down the fundamental analysis >but instead presented the solution; for each of the 2 cases I was >lacking the information _why_ this is needed. > >Both have been answered in private conversations only (1 today in >the oak call and 2 in a private discussion with tom). And >having heard them didn't make me more confident that the solution >you propose is the right thing to do. > >Kind regards >Angela > >On 11/05/16 12:17, "Chetan Mehrotra" wrote: > >>Hi Angela, >> >>On Tue, May 10, 2016 at 9:49 PM, Angela Schreiber >>wrote: >> >>> Quite frankly I would very much appreciate if we took the time to collect >>> and write down the required (i.e. currently known and expected) >>> functionality. >>> >>> Then look at the requirements and look at what is wrong with the current >>> API such that we can't meet those requirements: >>> - is it just missing API extensions that can be added with moderate >>>effort? >>> - are there fundamental problems with the current API that we need to >>> address? >>> - maybe we even have intrinsic issues with the way we think about the >>>role >>> of the repo? >>> >>> IMHO, sticking to kludges might look promising in the short term but >>> I am convinced that we are better off with a fundamental analysis of >>> the problems... after all the Binary topic comes up on a regular basis. >>> That leaves me with the impression that yet another tiny extra and >>> adaptables won't really address the core issues. >>> >> >>Makes sense. >> >>Have a look at the initial mail in the thread at [1], which talks about >>the 2 usecases I know of. The image rendition usecase manifests itself in >>one >>form or another, basically providing access to native programs via a file path >>reference. >> >>The approach proposed so far would be able to address them and hence is >>closer >>to "is it just missing API extensions that can be added with moderate >>effort?". If there are any other approaches that can address both of the >>referred usecases then we can implement them. >> >>Let me know if more details are required. If required I can put it up on a >>wiki page also. >> >>Chetan Mehrotra >>[1] >>http://markmail.org/thread/6mq4je75p64c5nyn#query:+page:1+mid:zv5dzsgmoegupd7l+state:results >
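For illustration, Marius's 302 idea might look roughly like the sketch below, using the AWS SDK for Java; the servlet wiring and the bucket/key lookup are hypothetical and not part of any proposal in this thread.

    import com.amazonaws.HttpMethod;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.GeneratePresignedUrlRequest;
    import javax.servlet.http.HttpServletResponse;
    import java.net.URL;
    import java.util.Date;

    public class SignedUrlRedirect {
        private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Answer a binary request with a 302 to a short-lived signed S3 URL,
        // so the binary bytes never flow through the Oak-based application.
        void redirect(String bucket, String key, HttpServletResponse resp) {
            Date expiry = new Date(System.currentTimeMillis() + 60_000L); // valid 1 minute
            GeneratePresignedUrlRequest req =
                    new GeneratePresignedUrlRequest(bucket, key)
                            .withMethod(HttpMethod.GET)
                            .withExpiration(expiry);
            URL signed = s3.generatePresignedUrl(req);
            resp.setStatus(HttpServletResponse.SC_FOUND); // 302
            resp.setHeader("Location", signed.toExternalForm());
        }
    }

The security concern Marius raises remains: once signed, the URL bypasses Oak's access control until it expires, so the expiry window has to be chosen carefully.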
Re: API proposal for - Expose URL for Blob source (OAK-1963)
Hi Chetan IMHO your original mail didn't write down the fundamental analysis but instead presented the solution; for each of the 2 cases I was lacking the information _why_ this is needed. Both have been answered in private conversations only (1 today in the oak call and 2 in a private discussion with tom). And having heard them didn't make me more confident that the solution you propose is the right thing to do. Kind regards Angela On 11/05/16 12:17, "Chetan Mehrotra" wrote: >Hi Angela, > >On Tue, May 10, 2016 at 9:49 PM, Angela Schreiber >wrote: > >> Quite frankly I would very much appreciate if we took the time to collect >> and write down the required (i.e. currently known and expected) >> functionality. >> >> Then look at the requirements and look at what is wrong with the current >> API such that we can't meet those requirements: >> - is it just missing API extensions that can be added with moderate >>effort? >> - are there fundamental problems with the current API that we need to >> address? >> - maybe we even have intrinsic issues with the way we think about the >>role >> of the repo? >> >> IMHO, sticking to kludges might look promising in the short term but >> I am convinced that we are better off with a fundamental analysis of >> the problems... after all the Binary topic comes up on a regular basis. >> That leaves me with the impression that yet another tiny extra and >> adaptables won't really address the core issues. >> > >Makes sense. > >Have a look at the initial mail in the thread at [1], which talks about >the 2 usecases I know of. The image rendition usecase manifests itself in >one >form or another, basically providing access to native programs via a file path >reference. > >The approach proposed so far would be able to address them and hence is >closer >to "is it just missing API extensions that can be added with moderate >effort?". If there are any other approaches that can address both of the >referred usecases then we can implement them. > >Let me know if more details are required. If required I can put it up on a >wiki page also. > >Chetan Mehrotra >[1] >http://markmail.org/thread/6mq4je75p64c5nyn#query:+page:1+mid:zv5dzsgmoegupd7l+state:results
Re: API proposal for - Expose URL for Blob source (OAK-1963)
Hi Angela, On Tue, May 10, 2016 at 9:49 PM, Angela Schreiber wrote: > Quite frankly I would very much appreciate if we took the time to collect > and write down the required (i.e. currently known and expected) > functionality. > > Then look at the requirements and look at what is wrong with the current > API such that we can't meet those requirements: > - is it just missing API extensions that can be added with moderate effort? > - are there fundamental problems with the current API that we need to > address? > - maybe we even have intrinsic issues with the way we think about the role > of the repo? > > IMHO, sticking to kludges might look promising in the short term but > I am convinced that we are better off with a fundamental analysis of > the problems... after all the Binary topic comes up on a regular basis. > That leaves me with the impression that yet another tiny extra and > adaptables won't really address the core issues. > Makes sense. Have a look at the initial mail in the thread at [1], which talks about the 2 usecases I know of. The image rendition usecase manifests itself in one form or another, basically providing access to native programs via a file path reference. The approach proposed so far would be able to address them and hence is closer to "is it just missing API extensions that can be added with moderate effort?". If there are any other approaches that can address both of the referred usecases then we can implement them. Let me know if more details are required. If required I can put it up on a wiki page also. Chetan Mehrotra [1] http://markmail.org/thread/6mq4je75p64c5nyn#query:+page:1+mid:zv5dzsgmoegupd7l+state:results
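As a concrete illustration of the image rendition usecase, the snippet below assumes the AdaptableBinary interface proposed at the start of this thread and an ImageMagick install; both the adaptTo call and the convert invocation are sketches, not existing Oak API.

    import java.io.File;
    import javax.jcr.Binary;
    import javax.jcr.Node;

    void renderThumbnail(Node node) throws Exception {
        Binary binary = node.getProperty("jcr:data").getBinary();
        if (binary instanceof AdaptableBinary) {
            // hand the blob's backing file straight to a native program,
            // avoiding a copy of the content through the JVM
            File file = ((AdaptableBinary) binary).adaptTo(File.class);
            if (file != null) {
                new ProcessBuilder("convert", file.getAbsolutePath(),
                        "-resize", "200x200", "/tmp/thumb.png")
                        .inheritIO().start().waitFor();
            }
        }
    }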
Re: API proposal for - Expose URL for Blob source (OAK-1963)
> what guarantees do/can we give re. this file handle within this context. Can it suddenly go away (e.g. because of gc or internal re-organisation)? How do we establish, test and maintain (e.g. from regressions) such guarantees? Logically it should not go away suddenly. So GC logic should be aware of such "inUse" instances (there is already such support for inUse cases). Such a requirement can be validated via an integration testcase > and more concerningly, how do we protect Oak from data corruption by misbehaving clients? E.g. clients writing on that handle or removing it? Again, if this is public API we need ways to test this. Not sure what is meant by a misbehaving client - is it malicious (by design) or badly written code? For the latter, yes, that might pose a problem, but we can have some defense. I would expect the code making use of the api to behave properly. In addition, as proposed above [1], for FileDataStore we can provide a symlinked file reference which exposes a read only file handle. For S3DataStore the code would need access to the AWS credentials to perform any write operation, which should be a sufficient defense > In an earlier mail you quite fittingly compared this to commit hooks, which for good reason are an internal SPI. Bit of a nitpick here ;) As per the Jcr class [2] one can provide a CommitHook instance, so not sure if we can term it internal. However, the point that I wanted to emphasize is that Oak does provide some critical extension points, and with misbehaving code one can shoot oneself in the foot; as the implementation, only so much can be done. regards Chetan [1] http://markmail.org/thread/6mq4je75p64c5nyn#query:+page:1+mid:237kzuhor5y3tpli+state:results [2] https://github.com/apache/jackrabbit-oak/blob/trunk/oak-jcr/src/main/java/org/apache/jackrabbit/oak/jcr/Jcr.java#L190 Chetan Mehrotra
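A minimal sketch of the symlink idea mentioned above; note that a symlink itself carries no permissions of its own on most systems, so the read-only property would have to come from the OS user/permission setup as described. The helper below (hypothetical, standard NIO only) shows just the lifecycle: the link exists exactly as long as the callback runs.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.function.Consumer;

    void withLinkedView(Path blobFile, Consumer<Path> callback) throws IOException {
        Path dir = Files.createTempDirectory("oak-blob");
        Path link = dir.resolve(blobFile.getFileName());
        Files.createSymbolicLink(link, blobFile); // client only ever sees the link
        try {
            callback.accept(link);
        } finally {
            Files.deleteIfExists(link); // the handle is invalid past this point
            Files.deleteIfExists(dir);
        }
    }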
Re: API proposal for - Expose URL for Blob source (OAK-1963)
Such an approach makes the API contract more explicit to the consumer by providing a context outside which there will be no guarantees for the passed "file handle". However, there are still the issues of - what guarantees do/can we give re. this file handle within this context. Can it suddenly go away (e.g. because of gc or internal re-organisation)? How do we establish, test and maintain (e.g. from regressions) such guarantees? - and more concerningly, how do we protect Oak from data corruption by misbehaving clients? E.g. clients writing on that handle or removing it? Again, if this is public API we need ways to test this. In an earlier mail you quite fittingly compared this to commit hooks, which for good reason are an internal SPI. The same applies here: this is a very low level concern so it must only be exposed as an internal SPI. Michael On 9.5.16 3:45, Chetan Mehrotra wrote: Had an offline discussion with Michael on this and explained the usecase requirement in more detail. One concern that has been raised is that such a generic adaptTo API is too inviting for improper use and Oak does not have any context around when this url is exposed and for how long it is used. So instead of having a generic adaptTo API at JCR level we can have a BlobProcessor callback (Approach #B). Below is more of a strawman proposal. Once we have a consensus then we can go over the details

interface BlobProcessor {
    void process(AdaptableBlob blob);
}

Where AdaptableBlob is

public interface AdaptableBlob {
    <AdapterType> AdapterType adaptTo(Class<AdapterType> type);
}

The BlobProcessor instance can be passed via the BlobStore API. So the client would look for a BlobStore service (so use the Oak level API) and pass it the ContentIdentity of the JCR Binary aka blobId

interface BlobStore {
    void process(String blobId, BlobProcessor processor);
}

The approach ensures 1. That any blob handle exposed is only guaranteed for the duration of the 'process' invocation 2. That there is no guarantee on the utility of the blob handle (File, S3 Object) beyond the callback. So one should not collect the passed File handle for later use Hopefully this should address some of the concerns raised in this thread. Looking forward to feedback :) Chetan Mehrotra On Mon, May 9, 2016 at 6:24 PM, Michael Dürig wrote: On 9.5.16 11:43, Chetan Mehrotra wrote: To highlight - As mentioned earlier, the user of the proposed api is tying itself to implementation details of Oak, and if this changes later then that code would also need to be changed. Or as Ian summed it up: if the API is introduced it should create an out of band agreement with the consumers of the API to act responsibly. So what does "to act responsibly" actually mean? Are we even in a position to precisely specify this? Experience tells me that we only find out about those semantics after the fact when dealing with painful and expensive customer escalations. And even if we could, it would tie Oak into very tight constraints on how it has to behave and how not. Constraints that would turn out prohibitively expensive for future evolution. Furthermore a huge amount of resources would be required to formalise such constraints via test coverage to guard against regressions. The method is to be used for those important cases where you do rely on implementation detail to get optimal performance in very specific scenarios. It's like DocumentNodeStore making use of some Mongo specific API to perform some important critical operation to achieve better performance by checking if the underlying DocumentStore is Mongo based. 
Right, but the Mongo specific API is a (hopefully) well thought through API, whereas with your proposal there are a lot of open questions and concerns as per my last mail. Mongo (and any other COTS DB) for good reasons also doesn't give you direct access to its internal file handles. I have seen the discussion of JCR-3534 and other related issues but still do not see any conclusion on how to answer such queries where direct access to blobs is required for the performance aspect. This issue is not about exposing the blob reference for remote access but more about an optimal path for in-VM access One bottom line of the discussions in that issue is that we came to a conclusion after clarifying the specifics of the use case. Something I'm still missing here. The case you brought forward is too general to serve as a guideline for a solution. Quite to the contrary, to me it looks like a solution to some problem (I'm trying to understand). who owns the resource? Who coordinates (concurrent) access to it and how? What are the correctness and performance implications here (races, deadlock, corruptions, JCR semantics)? The client code would need to be implemented in a proper way. It's more like implementing a CommitHook. If implemented in an incorrect way it would cause issues, deadlocks etc. But then we assume that anyone implementing that interface would take
Re: API proposal for - Expose URL for Blob source (OAK-1963)
Hi Angela, On 10 May 2016 at 17:19, Angela Schreiber wrote: > Hi Ian > > >Fair enough, provided there is a solution that addresses the issue Chetan > >is trying to address. > > That's what we are all looking for :) > > >The alternative, for some applications, seems to store the binary data > >outside Oak, which defeats the purpose completely. > > You mean with the current setup, right? > yes. > > That might well be... while I haven't been involved with a concrete > case I wouldn't categorically reject that this might in some cases > even be the right solution. > But maybe I am biased due to the fact that we also have a big > community that effectively stores and manages their user/group > accounts outside the repository and where I am seeing plenty of > trouble with the conception that those accounts _must_ be synced > (i.e. copied) into the repo. > > So, I'd definitely like to understand why you think that this > "completely defeats the purpose". I agree that it's not always > desirable but nevertheless there might be valid use-cases. > If the purpose of Oak is to provide a content repository to store metadata and assets, then if the application built on top of Oak, in order to achieve its scalability targets, has to store its asset data (blobs) outside Oak, that defeats the purpose of supporting the storage of assets within Oak. Oak should support the storage of assets within Oak while supporting the scalability requirements of the application. Since they are non-trivial and hard to quantify, that means horizontal scalability limited only by the available budget to purchase VMs or hardware. You can argue that horizontal scalability is not really required. I can share use cases, not exactly the same ones Chetan is working on, where it is. Sorry I can't share them on list. > > >I don't have a perfect handle on the issue he is trying to address or what > >would be an acceptable solution, but I suspect the only solution that is > >not vulnerable by design will be a solution that abstracts all the required > >functionality behind an Oak API (ie no S3Object, File object or anything > >that could leak) and then provides all the required functionality with an > >acceptable level of performance in the implementation. That is doable, but > >a lot more work. > > Not sure about that :-) > Quite frankly I would very much appreciate if we took the time to collect > and write down the required (i.e. currently known and expected) > functionality. > In the context of what I said above, for an AWS deployment that means wrapping [1] so nothing can leak and supporting almost everything expressed by [2] via an Oak API/jar in a way that enables horizontal scalability. > > Then look at the requirements and look at what is wrong with the current > API such that we can't meet those requirements: > - is it just missing API extensions that can be added with moderate effort? > - are there fundamental problems with the current API that we need to > address? > - maybe we even have intrinsic issues with the way we think about the role > of the repo? > > IMHO, sticking to kludges might look promising in the short term but > I am convinced that we are better off with a fundamental analysis of > the problems... after all the Binary topic comes up on a regular basis. > That leaves me with the impression that yet another tiny extra and > adaptables won't really address the core issues. > I agree. It comes up time and again because the applications are being asked to do something Oak does not currently support, so developers look for a workaround. 
It should be done properly, once and for all. IMVHO, that is a lot of work upfront, but since I am not the one doing the work it's not right for me to estimate or suggest anyone do it. Best Regards Ian 1 http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/model/S3Object.html 2 http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectOps.html > Kind regards > Angela > > > > > > > > >Best Regards > >Ian > > > > > >> > >> Kind regards > >> Angela > >> > >> > > >> >Best Regards > >> >Ian > >> > > >> > > >> >On 3 May 2016 at 15:36, Chetan Mehrotra > >> wrote: > >> > > >> >> Hi Team, > >> >> > >> >> For OAK-1963 we need to allow access to the actual Blob location, say in the form > >> >> of a File instance or S3 object id etc. This access is needed to perform > >> >>optimized > >> >> IO operations around binary objects e.g. > >> >> > >> >> 1. The File object can be used to spool the file content with zero copy > >> >> using NIO by accessing the File Channel directly [1] > >> >> > >> >> 2. Client code can efficiently replicate a binary stored in S3 by having > >> >> direct access to the S3 object using a copy operation > >> >> > >> >> To allow such access we would need a new API in the form of > >> >> AdaptableBinary. > >> >> > >> >> API > >> >> === > >> >> > >> >> public interface AdaptableBinary { > >> >> > >> >> /** > >> >> * Adapts the
Re: API proposal for - Expose URL for Blob source (OAK-1963)
Hi Ian >Fair enough, provided there is a solution that addresses the issue Chetan >is trying to address. That's what we are all looking for :) >The alternative, for some applications, seems to store the binary data >outside Oak, which defeats the purpose completely. You mean with the current setup, right? That might well be... while I haven't been involved with a concrete case I wouldn't categorically reject that this might in some cases even be the right solution. But maybe I am biased due to the fact that we also have a big community that effectively stores and manages their user/group accounts outside the repository and where I am seeing plenty of trouble with the conception that those accounts _must_ be synced (i.e. copied) into the repo. So, I'd definitely like to understand why you think that this "completely defeats the purpose". I agree that it's not always desirable but nevertheless there might be valid use-cases. >I don't have a perfect handle on the issue he is trying to address or what >would be an acceptable solution, but I suspect the only solution that is >not vulnerable by design will be a solution that abstracts all the required >functionality behind an Oak API (ie no S3Object, File object or anything >that could leak) and then provides all the required functionality with an >acceptable level of performance in the implementation. That is doable, but >a lot more work. Not sure about that :-) Quite frankly I would very much appreciate if we took the time to collect and write down the required (i.e. currently known and expected) functionality. Then look at the requirements and look at what is wrong with the current API such that we can't meet those requirements: - is it just missing API extensions that can be added with moderate effort? - are there fundamental problems with the current API that we need to address? - maybe we even have intrinsic issues with the way we think about the role of the repo? IMHO, sticking to kludges might look promising in the short term but I am convinced that we are better off with a fundamental analysis of the problems... after all the Binary topic comes up on a regular basis. That leaves me with the impression that yet another tiny extra and adaptables won't really address the core issues. Kind regards Angela > > >Best Regards >Ian > > >> >> Kind regards >> Angela >> >> > >> >Best Regards >> >Ian >> > >> > >> >On 3 May 2016 at 15:36, Chetan Mehrotra wrote: >> > >> >> Hi Team, >> >> >> >> For OAK-1963 we need to allow access to the actual Blob location, say in the form >> >> of a File instance or S3 object id etc. This access is needed to perform >> >>optimized >> >> IO operations around binary objects e.g. >> >> >> >> 1. The File object can be used to spool the file content with zero copy >> >> using NIO by accessing the File Channel directly [1] >> >> >> >> 2. Client code can efficiently replicate a binary stored in S3 by having >> >> direct access to the S3 object using a copy operation >> >> >> >> To allow such access we would need a new API in the form of >> >> AdaptableBinary. 
>> >>
>> >> API
>> >> ===
>> >>
>> >> public interface AdaptableBinary {
>> >>
>> >>     /**
>> >>      * Adapts the binary to another type like File, URL etc
>> >>      *
>> >>      * @param <AdapterType> The generic type to which this binary is adapted to
>> >>      * @param type The Class object of the target type, such as File.class
>> >>      * @return The adapter target or null if the binary cannot adapt to the requested type
>> >>      */
>> >>     <AdapterType> AdapterType adaptTo(Class<AdapterType> type);
>> >> }
>> >>
>> >> Usage
>> >> =====
>> >>
>> >> Binary binProp = node.getProperty("jcr:data").getBinary();
>> >>
>> >> //Check if Binary is of type AdaptableBinary
>> >> if (binProp instanceof AdaptableBinary){
>> >>     AdaptableBinary adaptableBinary = (AdaptableBinary) binProp;
>> >>
>> >>     //Adapt it to File instance
>> >>     File file = adaptableBinary.adaptTo(File.class);
>> >> }
>> >>
>> >> The Binary instance returned by Oak, i.e. org.apache.jackrabbit.oak.plugins.value.BinaryImpl, would then >> implement this interface and calling code can then check the type, cast >> it and then adapt it >> >> Key Points >> >> >> 1. Depending on the backing BlobStore the binary can be adapted to various >> types. For FileDataStore it can be adapted to File. For S3DataStore it can >> either be adapted to URL or some S3DataStore specific type. >> >> 2. Security - Thomas suggested that for better security the ability to >> adapt should be restricted based on session permissions. So if the user has the >> required permission, only then would adaptation work; otherwise null would be >> returned. >> >> 3. Adaptation proposal is based on Sling Adaptable [2] >> >> 4. This API is for now exposed only at JCR level. Not sure should
Re: API proposal for - Expose URL for Blob source (OAK-1963)
On 10.5.16 5:39, Ian Boston wrote: I don't have a perfect handle on the issue he is trying to address or what would be an acceptable solution, but I suspect the only solution that is not vulnerable by design will be a solution that abstracts all the required functionality behind an Oak API (ie no S3Object, File object or anything that could leak) and then provides all the required functionality with an acceptable level of performance in the implementation. That is doable, but a lot more work. I doubt this. It is a lot more *upfront work* vs. never ending fire fighting in production systems. Michael
Re: API proposal for - Expose URL for Blob source (OAK-1963)
On 10 May 2016 at 15:02, Angela Schreiber wrote: > Hi Ian > > On 04/05/16 18:37, "Ian Boston" wrote: > >[...] The locations will certainly probably leak > >outside the context of an Oak session so the API contract should make it > >clear that the code using a direct location needs to behave responsibly. > > See my reply to Chetan, who was referring to > SlingRepository.loginAdministrative > which always had a pretty clear API contract wrt responsible usage. > > As a matter of fact (and I guess you are aware of this) it turned into a > total nightmare with developers using it just everywhere, ignoring not only > the API contract but also all concerns raised for years. This can even > be seen in the Apache Sling code base itself. > So, I am quite pessimistic about responsible usage and API contracts > and definitely prefer an API implementation that effectively enforces > the contract. > > Vulnerable by design is IMHO a bad guideline for introducing new APIs. > From my experience they backfire usually sooner than later and need > to be abandoned again... so, I'd rather aim for a properly secured > solution right from the beginning. > Fair enough, provided there is a solution that addresses the issue Chetan is trying to address. The alternative, for some applications, seems to store the binary data outside Oak, which defeats the purpose completely. I don't have a perfect handle on the issue he is trying to address or what would be an acceptable solution, but I suspect the only solution that is not vulnerable by design will be a solution that abstracts all the required functionality behind an Oak API (ie no S3Object, File object or anything that could leak) and then provides all the required functionality with an acceptable level of performance in the implementation. That is doable, but a lot more work. Best Regards Ian > > Kind regards > Angela > > > > >Best Regards > >Ian > > > > > >On 3 May 2016 at 15:36, Chetan Mehrotra > wrote: > > > >> Hi Team, > >> > >> For OAK-1963 we need to allow access to the actual Blob location, say in the form > >> of a File instance or S3 object id etc. This access is needed to perform > >>optimized > >> IO operations around binary objects e.g. > >> > >> 1. The File object can be used to spool the file content with zero copy > >> using NIO by accessing the File Channel directly [1] > >> > >> 2. Client code can efficiently replicate a binary stored in S3 by having > >> direct access to the S3 object using a copy operation > >> > >> To allow such access we would need a new API in the form of > >> AdaptableBinary.
> >>
> >> API
> >> ===
> >>
> >> public interface AdaptableBinary {
> >>
> >>     /**
> >>      * Adapts the binary to another type like File, URL etc
> >>      *
> >>      * @param <AdapterType> The generic type to which this binary is adapted to
> >>      * @param type The Class object of the target type, such as File.class
> >>      * @return The adapter target or null if the binary cannot adapt to the requested type
> >>      */
> >>     <AdapterType> AdapterType adaptTo(Class<AdapterType> type);
> >> }
> >>
> >> Usage
> >> =====
> >>
> >> Binary binProp = node.getProperty("jcr:data").getBinary();
> >>
> >> //Check if Binary is of type AdaptableBinary
> >> if (binProp instanceof AdaptableBinary){
> >>     AdaptableBinary adaptableBinary = (AdaptableBinary) binProp;
> >>
> >>     //Adapt it to File instance
> >>     File file = adaptableBinary.adaptTo(File.class);
> >> }
> >>
> >> The Binary instance returned by Oak > >> i.e. 
org.apache.jackrabbit.oak.plugins.value.BinaryImpl would then > >> implement this interface and calling code can then check the type, cast > >> it and then adapt it > >> > >> Key Points > >> > >> > >> 1. Depending on the backing BlobStore the binary can be adapted to various > >> types. For FileDataStore it can be adapted to File. For S3DataStore it can > >> either be adapted to URL or some S3DataStore specific type. > >> > >> 2. Security - Thomas suggested that for better security the ability to > >> adapt should be restricted based on session permissions. So if the user has the > >> required permission, only then would adaptation work; otherwise null would be > >> returned. > >> > >> 3. Adaptation proposal is based on Sling Adaptable [2] > >> > >> 4. This API is for now exposed only at JCR level. Not sure if we should do it > >> at Oak level as Blob instances are currently not bound to any session. So the > >> proposal is to place this in the 'org.apache.jackrabbit.oak.api' package > >> > >> Kindly provide your feedback! Also any suggestion/guidance around how the > >> access control should be implemented > >> > >> Chetan Mehrotra > >> [1] http://www.ibm.com/developerworks/library/j-zerocopy/ > >> [2] > >> https://sling.apache.org/apidocs/sling5/org/apache/sling/api/adapter/Adaptable.html > >
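For use case 1 of the proposal, the zero-copy spooling mentioned above would, once a File has been obtained via adaptTo(File.class), look roughly like the following (standard NIO, nothing Oak specific; the method name is illustrative):

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.channels.Channels;
    import java.nio.channels.FileChannel;
    import java.nio.channels.WritableByteChannel;

    void spool(File file, OutputStream out) throws IOException {
        try (FileChannel in = new FileInputStream(file).getChannel()) {
            WritableByteChannel target = Channels.newChannel(out);
            long pos = 0, size = in.size();
            while (pos < size) {
                // transferTo lets the kernel move the bytes; no JVM-side buffers
                pos += in.transferTo(pos, size - pos, target);
            }
        }
    }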
Re: API proposal for - Expose URL for Blob source (OAK-1963)
Hi Same here... Francesco already summarised my concerns very nicely. The links Michael provided below resonate with what came to my mind regarding past discussions around binary handling in the JCR/Jackrabbit API and in Oak. I also distinctly remember that one key argument for the current design of the Oak Blob API was the fact that the access to the binaries created through this API is properly secured due to the fact that they (or their references) are read and written from/to the Oak repository through calls that are subject to the configured security setup i.e. are always secured. @Chetan, regarding your original comment wrt security: > 2. Security - Thomas suggested that for better security the ability to > adapt should be restricted based on session permissions. So if the user has the > required permission, only then would adaptation work; otherwise null would > be returned. As others said before I don't think that this is the critical part from a security point of view... The access to the property is secured by the authorization model present with the given repository. IMO the troublesome part comes only _after_ the adaption to something else, where you lose the ability to enforce the constraints imposed by the permission setup. After all I am not convinced that we should rush this API into the code base in the current state... from my PoV there are too many valid concerns. And honestly, I weigh the architectural and consistency concerns even higher than the security issues. Having said this: I'd rather take one step back again and start looking for other approaches that would allow us to address the issue(s) at hand in a better way. Kind regards Angela On 09/05/16 10:58, "Michael Dürig" wrote: > >Hi, > >I very much share Francesco's concerns here. Unconditionally exposing >access to operating system resources underlying Oak's inner workings is >troublesome for various reasons: > >- who owns the resource? Who coordinates (concurrent) access to it and >how? What are the correctness and performance implications here (races, >deadlock, corruptions, JCR semantics)? > >- it limits implementation freedom and hinders further evolution >(chunking, de-duplication, content based addressing, compression, gc, >etc.) for data stores. > >- bypassing JCR's security model > >Pretty much all of this has been discussed in the scope of >https://issues.apache.org/jira/browse/JCR-3534 and >https://issues.apache.org/jira/browse/OAK-834. So I suggest to review >those discussions before we jump to conclusions. > > >Also what is the use case requiring such a vast API surface? Can't we >come up with an API that allows the blobs to stay under the control of Oak? >If not, this is probably an indication that those blobs shouldn't go >into Oak but just references to them, as Francesco already proposed. >Anything else is neither fish nor fowl: you can't have the JCR goodies >but at the same time access underlying resources at will. > >Michael > > > >On 5.5.16 11:00, Francesco Mari wrote: >> This proposal introduces a huge leak of abstractions and has deep >>security >> implications. >> >> I guess that the reason for this proposal is that some users of Oak >>would >> like to perform some operations on binaries in a more performant way by >> leveraging the way those binaries are stored. If this is the case, I >> suggest those users to evaluate an applicative solution implemented on >>top >> of the JCR API. >> >> If a user needs to store some important binary data (files, images, >>etc.) 
>> in an S3 bucket or on the file system for performance reasons, this >> shouldn't affect how Oak handles blobs internally. If some assets are of >> special interest for the user, then the user should bypass Oak and take >> care of the storage of those assets directly. Oak can be used to store >> *references* to those assets, that can be used in user code to >>manipulate >> the assets in his own business logic. >> >> If the scenario I outlined is not what inspired this proposal, I would >>like >> to know more about the reasons why this proposal was brought up. Which >> problems are we going to solve with this API? Is there a more concrete >>use >> case that we can use as a driving example? >> >> 2016-05-05 10:06 GMT+02:00 Davide Giannella: >> >>> On 04/05/2016 17:37, Ian Boston wrote: Hi, If the File or URL is writable, will writing to the location cause issues for Oak? IIRC some Oak DS implementations use a digest of the content to determine the location in the DS, so changing the content via Oak will change the location, but changing the content via the File or URL won't. If I didn't remember correctly, then ignore the concern. Fully supportive of the approach, as a consumer of Oak. The locations will certainly probably >>> leak outside the context of an Oak session so the API contract should make it clear that the code using a
Re: API proposal for - Expose URL for Blob source (OAK-1963)
Hi Ian On 04/05/16 18:37, "Ian Boston" wrote: >[...] The locations will certainly probably leak >outside the context of an Oak session so the API contract should make it >clear that the code using a direct location needs to behave responsibly. See my reply to Chetan, who was referring to SlingRepository.loginAdministrative which always had a pretty clear API contract wrt responsible usage. As a matter of fact (and I guess you are aware of this) it turned into a total nightmare with developers using it just everywhere, ignoring not only the API contract but also all concerns raised for years. This can even be seen in the Apache Sling code base itself. So, I am quite pessimistic about responsible usage and API contracts and definitely prefer an API implementation that effectively enforces the contract. Vulnerable by design is IMHO a bad guideline for introducing new APIs. From my experience they backfire usually sooner than later and need to be abandoned again... so, I'd rather aim for a properly secured solution right from the beginning. Kind regards Angela > >Best Regards >Ian > > >On 3 May 2016 at 15:36, Chetan Mehrotra wrote: >> Hi Team, >> >> For OAK-1963 we need to allow access to the actual Blob location, say in the form >> of a File instance or S3 object id etc. This access is needed to perform >>optimized >> IO operations around binary objects e.g. >> >> 1. The File object can be used to spool the file content with zero copy >> using NIO by accessing the File Channel directly [1] >> >> 2. Client code can efficiently replicate a binary stored in S3 by having >> direct access to the S3 object using a copy operation >> >> To allow such access we would need a new API in the form of >> AdaptableBinary.
>>
>> API
>> ===
>>
>> public interface AdaptableBinary {
>>
>>     /**
>>      * Adapts the binary to another type like File, URL etc
>>      *
>>      * @param <AdapterType> The generic type to which this binary is adapted to
>>      * @param type The Class object of the target type, such as File.class
>>      * @return The adapter target or null if the binary cannot adapt to the requested type
>>      */
>>     <AdapterType> AdapterType adaptTo(Class<AdapterType> type);
>> }
>>
>> Usage
>> =====
>>
>> Binary binProp = node.getProperty("jcr:data").getBinary();
>>
>> //Check if Binary is of type AdaptableBinary
>> if (binProp instanceof AdaptableBinary){
>>     AdaptableBinary adaptableBinary = (AdaptableBinary) binProp;
>>
>>     //Adapt it to File instance
>>     File file = adaptableBinary.adaptTo(File.class);
>> }
>>
>> The Binary instance returned by Oak >> i.e. org.apache.jackrabbit.oak.plugins.value.BinaryImpl would then >> implement this interface and calling code can then check the type, cast >> it and then adapt it >> >> Key Points >> >> >> 1. Depending on the backing BlobStore the binary can be adapted to various >> types. For FileDataStore it can be adapted to File. For S3DataStore it can >> either be adapted to URL or some S3DataStore specific type. >> >> 2. Security - Thomas suggested that for better security the ability to >> adapt should be restricted based on session permissions. So if the user has the >> required permission, only then would adaptation work; otherwise null >>would be >> returned. >> >> 3. Adaptation proposal is based on Sling Adaptable [2] >> >> 4. This API is for now exposed only at JCR level. Not sure if we should do it >> at Oak level as Blob instances are currently not bound to any session. So the >> proposal is to place this in the 'org.apache.jackrabbit.oak.api' package >> >> Kindly provide your feedback! 
Also any suggestion/guidance around how the >> access control should be implemented >> >> Chetan Mehrotra >> [1] http://www.ibm.com/developerworks/library/j-zerocopy/ >> [2] >> https://sling.apache.org/apidocs/sling5/org/apache/sling/api/adapter/Adaptable.html >>
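Point 2 of the proposal (permission-restricted adaptation) could be sketched as a simple wrapper; how the boolean is derived from the session's permission setup is left open here and is purely illustrative:

    public class SecuredBinary implements AdaptableBinary {
        private final AdaptableBinary delegate;
        private final boolean mayAdapt; // derived from the session's permissions (not shown)

        public SecuredBinary(AdaptableBinary delegate, boolean mayAdapt) {
            this.delegate = delegate;
            this.mayAdapt = mayAdapt;
        }

        @Override
        public <AdapterType> AdapterType adaptTo(Class<AdapterType> type) {
            // without the required permission, adaptation yields null,
            // exactly as point 2 of the proposal describes
            return mayAdapt ? delegate.adaptTo(type) : null;
        }
    }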
Re: API proposal for - Expose URL for Blob source (OAK-1963)
Hi, By processing independently I meant async, outside the callback, eg inside a Mesos+Fenzo cluster [1], processors not running Oak. Best Regards Ian 1 http://techblog.netflix.com/2015/08/fenzo-oss-scheduler-for-apache-mesos.html On 10 May 2016 at 06:02, Chetan Mehrotra wrote: > On Mon, May 9, 2016 at 8:27 PM, Ian Boston wrote: > > > I thought the consumers of this api want things like the absolute path of > > the File in the BlobStore, or the bucket and key of the S3 Object, so that > > they could transmit it and use it for processing independently of Oak > > outside the callback? > > > > Most cases can still be done, just do it within the callback
>
> blobStore.process("xxx", new BlobProcessor(){
>     void process(AdaptableBlob blob){
>         File file = blob.adaptTo(File.class);
>         transformImage(file);
>     }
> });
>
> Doing this within the callback would allow Oak to enforce some safeguards (more > on that in the next mail) and still allows the user to perform optimal binary > processing > > Chetan Mehrotra >
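What enforcing those safeguards might look like on the implementation side, for a FileDataStore-backed store: getFile, markInUse, clearInUse and FileBlob below are hypothetical helpers; the point is only that the handle's validity is bounded by the callback.

    public void process(String blobId, BlobProcessor processor) {
        File file = getFile(blobId);   // resolve blobId to its backing file
        markInUse(blobId);             // keep GC away from the blob while the callback runs
        try {
            processor.process(new FileBlob(file));
        } finally {
            clearInUse(blobId);        // no guarantees on the handle beyond this point
        }
    }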
Re: API proposal for - Expose URL for Blob source (OAK-1963)
Some more points around the proposed callback based approach

1. Possible security, or enforcing read only access to the exposed file - The file provided within the BlobProcessor callback can be a symlink created with an OS user account which only has read only access. The symlink can be removed once the callback returns

2. S3 DataStore security concern - For S3DataStore we would only be exposing the S3 object identifier, and the client code would still need the AWS credentials to connect to the bucket and perform the required copy operation

3. Possibility of further optimization in S3DataStore processing - Currently when reading a binary from S3DataStore the binary content is *always* spooled to some local temporary file (in the local cache) and then an InputStream is opened on that file. So even if the code needs to read only the initial few bytes of the stream, the whole file would have to be read. This happens because with the current JCR Binary API we are not in control of the lifetime of the exposed InputStream. So if, say, we expose the InputStream we cannot determine until when the backing S3 SDK resources need to be held

Also, the current S3DataStore always creates a local copy - with a callback based approach we can safely expose this file, which would allow layers above to avoid spooling the content again locally for processing. And with the callback boundary we can later do the required cleanup Chetan Mehrotra On Mon, May 9, 2016 at 7:15 PM, Chetan Mehrotra wrote: > Had an offline discussion with Michael on this and explained the usecase > requirement in more detail. One concern that has been raised is that such > a generic adaptTo API is too inviting for improper use and Oak does not > have any context around when this url is exposed and for how long it is used. > > So instead of having a generic adaptTo API at JCR level we can have a > BlobProcessor callback (Approach #B). Below is more of a strawman proposal. > Once we have a consensus then we can go over the details
>
> interface BlobProcessor {
>    void process(AdaptableBlob blob);
> }
>
> Where AdaptableBlob is
>
> public interface AdaptableBlob {
>     <AdapterType> AdapterType adaptTo(Class<AdapterType> type);
> }
>
> The BlobProcessor instance can be passed via the BlobStore API. So the client > would look for a BlobStore service (so use the Oak level API) and pass it > the ContentIdentity of the JCR Binary aka blobId
>
> interface BlobStore {
>     void process(String blobId, BlobProcessor processor);
> }
>
> The approach ensures > > 1. That any blob handle exposed is only guaranteed for the duration > of the 'process' invocation > 2. That there is no guarantee on the utility of the blob handle (File, S3 Object) > beyond the callback. So one should not collect the passed File handle for > later use > > Hopefully this should address some of the concerns raised in this thread. > Looking forward to feedback :) > > Chetan Mehrotra > > On Mon, May 9, 2016 at 6:24 PM, Michael Dürig wrote: > >> >> >> On 9.5.16 11:43, Chetan Mehrotra wrote: >> >>> To highlight - As mentioned earlier, the user of the proposed api is tying >>> itself to implementation details of Oak, and if this changes later then >>> that >>> code would also need to be changed. Or as Ian summed it up: >>> >>> if the API is introduced it should create an out of band agreement with >>> the consumers of the API to act responsibly. >>> >> >> So what does "to act responsibly" actually mean? Are we even in a >> position to precisely specify this? Experience tells me that we only find >> out about those semantics after the fact when dealing with painful and >> expensive customer escalations. 
>> >> And even if we could, it would tie Oak into very tight constraints on how >> it has to behave and how not. Constraints that would turn out prohibitively >> expensive for future evolution. Furthermore a huge amount of resources >> would be required to formalise such constraints via test coverage to guard >> against regressions. >> >> >> >>> The method is to be used for those important cases where you do rely on >>> implementation detail to get optimal performance in very specific >>> scenarios. It's like DocumentNodeStore making use of some Mongo specific >>> API >>> to perform some important critical operation to achieve better >>> performance >>> by checking if the underlying DocumentStore is Mongo based. >>> >> >> Right, but the Mongo specific API is a (hopefully) well thought through >> API, whereas with your proposal there are a lot of open questions and >> concerns as per my last mail. >> >> Mongo (and any other COTS DB) for good reasons also doesn't give you direct >> access to its internal file handles. >> >> >> >>> I have seen the discussion of JCR-3534 and other related issues but still do >>> not >>> see any conclusion on how to answer such queries where direct access to >>> blobs is required for the performance aspect. This issue is not about >>> exposing >>> the blob reference for remote access but more about an optimal path for in
Re: API proposal for - Expose URL for Blob source (OAK-1963)
Hi,

Can the use cases presented by Chetan be solved the other way around? Instead of exposing implementation details via the JCR/Oak API, maybe it is possible to include the blobId in the S3 id/filename (a prefix?), such that external applications can identify external resources based on their Oak storage. This could be optionally enabled for the blob stores that support such naming conventions.

Marius

On 5/9/16, 5:57 PM, "ianbos...@gmail.com on behalf of Ian Boston" wrote:
[...]
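A rough sketch of the kind of naming convention Marius hints at (purely illustrative: the prefix, the key layout and the '#length' suffix handling are assumptions, not the actual S3DataStore key format):

    // Derive a predictable S3 object key from an Oak blobId so that external
    // tools can locate the object without going through Oak. Illustrative only.
    public final class BlobKeyConvention {

        private static final String PREFIX = "oak-blobs/";

        public static String s3KeyForBlobId(String blobId) {
            // Some Oak blobIds carry a "#<length>" suffix; strip it before use.
            int hash = blobId.indexOf('#');
            String id = (hash == -1) ? blobId : blobId.substring(0, hash);
            return PREFIX + id;
        }
    }

Whether such a convention is safe depends on the store: content-addressed layouts, chunking or garbage collection could all invalidate the mapping, which is exactly the evolution concern raised elsewhere in this thread.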
Re: API proposal for - Expose URL for Blob source (OAK-1963)
Hi,

Thinking about the validity of the File and S3 Objects:

I thought the consumers of this API want things like the absolute path of the File in the BlobStore, or the bucket and key of the S3 Object, so that they could transmit it and use it for processing independently of Oak, outside the callback?

Or are you proposing that, if they want to do that, they should not use JCR Data but should (as others have suggested) store pointers to the data as JCR properties and not store any large scale binary data in Oak? (i.e. store the S3 bucket and key, or a relative path from a known location, as a property of the node.)

Best Regards
Ian

On 9 May 2016 at 14:45, Chetan Mehrotra wrote:
[...]
Re: API proposal for - Expose URL for Blob source (OAK-1963)
Had an offline discussion with Michael on this and explained the usecase requirements in more detail. One concern that has been raised is that such a generic adaptTo API is too inviting for improper use, and Oak does not have any context around when this URL is exposed and for how long it is used.

So instead of having a generic adaptTo API at the JCR level we can have a BlobProcessor callback (Approach #B). Below is more of a strawman proposal. Once we have a consensus, we can go over the details.

interface BlobProcessor {
    void process(AdaptableBlob blob);
}

Where AdaptableBlob is

public interface AdaptableBlob {
    <AdapterType> AdapterType adaptTo(Class<AdapterType> type);
}

The BlobProcessor instance can be passed via the BlobStore API. So the client would look for a BlobStore service (i.e. use the Oak level API) and pass it the ContentIdentity of the JCR Binary, aka the blobId:

interface BlobStore {
    void process(String blobId, BlobProcessor processor);
}

The approach ensures:

1. That any blob handle exposed is only guaranteed for the duration of the 'process' invocation.
2. There is no guarantee on the utility of the blob handle (File, S3 Object) beyond the callback. So one should not collect the passed File handle for later use.

Hopefully this should address some of the concerns raised in this thread. Looking forward to feedback :)

Chetan Mehrotra

On Mon, May 9, 2016 at 6:24 PM, Michael Dürig wrote:
[...]
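For illustration, client code built against this strawman could look roughly like the following. The interfaces are the proposed ones above, restated so the sketch is self-contained (none of this is existing Oak API), and the native-converter call is a placeholder:

    public class RenditionJob {

        // Proposed (not existing) interfaces, restated for self-containment:
        interface AdaptableBlob { <T> T adaptTo(Class<T> type); }
        interface BlobProcessor { void process(AdaptableBlob blob); }
        interface BlobStore { void process(String blobId, BlobProcessor processor); }

        // blobStore would be the BlobStore service, e.g. looked up from OSGi;
        // blobId is the ContentIdentity obtained from the JCR Binary.
        void generateRendition(BlobStore blobStore, String blobId) {
            blobStore.process(blobId, blob -> {
                java.io.File file = blob.adaptTo(java.io.File.class);
                if (file != null) {
                    // The handle is only valid until process() returns, so the
                    // path must be consumed here, not stored for later use.
                    runNativeConverter(file.getAbsolutePath());
                }
            });
        }

        private void runNativeConverter(String path) {
            // Placeholder for invoking the OS specific rendition executable.
        }
    }

The callback boundary is what gives Oak back control: once process() returns, the implementation is free to delete symlinks, evict cache entries or release S3 SDK resources.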
Re: API proposal for - Expose URL for Blob source (OAK-1963)
On 9.5.16 11:43 , Chetan Mehrotra wrote:

> To highlight - As mentioned earlier the user of the proposed API is tying
> itself to implementation details of Oak, and if this changes later then
> that code would also need to be changed. Or as Ian summed it up
>
>> if the API is introduced it should create an out of band agreement with
>> the consumers of the API to act responsibly.

So what does "to act responsibly" actually mean? Are we even in a position to precisely specify this? Experience tells me that we only find out about those semantics after the fact, when dealing with painful and expensive customer escalations.

And even if we could, it would tie Oak into very tight constraints on how it has to behave and how not. Constraints that would turn out prohibitively expensive for future evolution. Furthermore a huge amount of resources would be required to formalise such constraints via test coverage to guard against regressions.

> The method is to be used for those important cases where you do rely on
> implementation detail to get optimal performance in very specific
> scenarios. It's like DocumentNodeStore making use of some Mongo specific
> API to perform some important critical operation to achieve better
> performance by checking if the underlying DocumentStore is Mongo based.

Right, but the Mongo specific API is a (hopefully) well thought through API, whereas with your proposal there are a lot of open questions and concerns as per my last mail.

Mongo (and any other COTS DB) for good reasons also doesn't give you direct access to its internal file handles.

> I have seen the discussion of JCR-3534 and other related issues but still
> do not see any conclusion on how to answer such queries where direct
> access to blobs is required for performance reasons. This issue is not
> about exposing the blob reference for remote access but more about the
> optimal path for in-VM access.

One bottom line of the discussions in that issue is that we came to a conclusion after clarifying the specifics of the use case. Something I'm still missing here. The case you brought forward is too general to serve as a guideline for a solution. Quite to the contrary, to me it looks like a solution to some problem (I'm trying to understand).

>> who owns the resource? Who coordinates (concurrent) access to it and how?
>> What are the correctness and performance implications here (races,
>> deadlock, corruptions, JCR semantics)?
>
> The client code would need to be implemented in a proper way. It's more
> like implementing a CommitHook. If implemented in an incorrect way it
> would cause issues, deadlocks etc. But then we assume that anyone
> implementing that interface would take proper care in the implementation.

But a commit hook is an internal SPI. It is not advertised to the whole world as a public API.

>> it limits implementation freedom and hinders further evolution (chunking,
>> de-duplication, content based addressing, compression, gc, etc.) for data
>> stores.
>
> As mentioned earlier, some part of the API indicates a closer dependency
> on how things work (like an SPI, or a ConsumerType API in OSGi terms). By
> using such an API client code definitely ties itself to Oak implementation
> details, but it should not limit how the Oak implementation details
> evolve. So when they change, client code needs to adapt itself
> accordingly. Oak can express that by incrementing the minor version of the
> exported package to indicate the change in behavior.

Which IMO is completely contradictory. Such an API would prevent us from refactoring internal storage formats if a new format couldn't implement the API (e.g. because of chunking, compression, deduplication etc).

>> Can't we come up with an API that allows the blobs to stay under the
>> control of Oak?
>
> The code needs to work either at the OS level, say on a file handle, or
> say on an S3 object. So I do not see a way where it can work without
> having access to those details.

Again, why? What's the precise use case here? If this really is the conclusion, then a corollary would be that those binaries must not go into Oak.

> FWIW there is code out there which reverse engineers the blobId to access
> the actual binary. People do it so as to get decent throughput in image
> rendition logic for large scale deployments. The proposal here was to
> formalize that approach by providing a proper API. If we do not provide
> such an API then the only way for them would be to continue relying on
> reverse engineering the blobId!

This is hardly a good argument. Formalising other people's hacks means making us liable. What *we* need to do is understand their use case and come up with a clean solution.

>> If not, this is probably an indication that those blobs shouldn't go into
>> Oak but just references to them, as Francesco already proposed. Anything
>> else is neither fish nor fowl: you can't have the JCR goodies but at the
>> same time access underlying resources at will.
>
> That's a fine argument to make. But then users here have a real problem to
> solve which we should not ignore. Oak based systems [...]
Re: API proposal for - Expose URL for Blob source (OAK-1963)
To highlight - As mentioned earlier the user of the proposed API is tying itself to implementation details of Oak, and if this changes later then that code would also need to be changed. Or as Ian summed it up

> if the API is introduced it should create an out of band agreement with the consumers of the API to act responsibly.

The method is to be used for those important cases where you do rely on implementation detail to get optimal performance in very specific scenarios. It's like DocumentNodeStore making use of some Mongo specific API to perform some important critical operation to achieve better performance by checking if the underlying DocumentStore is Mongo based.

I have seen the discussion of JCR-3534 and other related issues but still do not see any conclusion on how to answer such queries where direct access to blobs is required for performance reasons. This issue is not about exposing the blob reference for remote access but more about the optimal path for in-VM access.

> who owns the resource? Who coordinates (concurrent) access to it and how? What are the correctness and performance implications here (races, deadlock, corruptions, JCR semantics)?

The client code would need to be implemented in a proper way. It's more like implementing a CommitHook. If implemented in an incorrect way it would cause issues, deadlocks etc. But then we assume that anyone implementing that interface would take proper care in the implementation.

> it limits implementation freedom and hinders further evolution (chunking, de-duplication, content based addressing, compression, gc, etc.) for data stores.

As mentioned earlier, some part of the API indicates a closer dependency on how things work (like an SPI, or a ConsumerType API in OSGi terms). By using such an API client code definitely ties itself to Oak implementation details, but it should not limit how the Oak implementation details evolve. So when they change, client code needs to adapt itself accordingly. Oak can express that by incrementing the minor version of the exported package to indicate the change in behavior.

> bypassing JCR's security model

I do not yet see the attack vector which we need to defend against differently here. Again, the blob URL is not being exposed, say, as part of WebDAV or any other remote call. So I would like to understand the security concern better here (unless it is defending against malicious or badly implemented client code, which we discussed above).

> Can't we come up with an API that allows the blobs to stay under control of Oak?

The code needs to work either at the OS level, say on a file handle, or say on an S3 object. So I do not see a way where it can work without having access to those details.

FWIW there is code out there which reverse engineers the blobId to access the actual binary. People do it so as to get decent throughput in image rendition logic for large scale deployments. The proposal here was to formalize that approach by providing a proper API. If we do not provide such an API then the only way for them would be to continue relying on reverse engineering the blobId!

> If not, this is probably an indication that those blobs shouldn't go into Oak but just references to them, as Francesco already proposed. Anything else is neither fish nor fowl: you can't have the JCR goodies but at the same time access underlying resources at will.

That's a fine argument to make. But then users here have a real problem to solve which we should not ignore. Oak based systems are being proposed for large asset deployments where one of the primary requirements is asset handling/processing of 100s of TB of binary data. So we would then have to recommend for such cases not to use the JCR Binary abstraction and to manage the binaries on your own. That would then solve both of the problems (though it might break lots of tooling built on top of the JCR API to manage those binaries)!

Thinking more - another approach I can suggest is that people implement their own BlobStore (maybe by extending ours) and provide this API there, i.e. one which takes the blobId and provides the required details. This way we "outsource" the problem. Would that be acceptable?

Chetan Mehrotra

On Mon, May 9, 2016 at 2:28 PM, Michael Dürig wrote:
[...]
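A minimal sketch of that "outsourced" variant, assuming a deployment-owned subclass of a file-based data store; the interface and method names are invented for the example and nothing here is existing Oak API:

    import java.io.File;

    // The deployment ships its own BlobStore/DataStore subclass and exposes
    // the blobId-to-location mapping itself, keeping Oak's public API
    // untouched.
    public interface DirectAccessBlobResolver {

        /**
         * Resolves a blobId to the underlying file, or returns null if this
         * store cannot (or will not) expose one.
         */
        File resolveToFile(String blobId);
    }

The trade-off is the one discussed above: such a deployment explicitly couples itself to the storage layout of its own store and carries the compatibility burden itself, instead of Oak carrying it for everyone.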
Re: API proposal for - Expose URL for Blob source (OAK-1963)
Hi,

I very much share Francesco's concerns here. Unconditionally exposing access to the operating system resources underlying Oak's inner workings is troublesome for various reasons:

- who owns the resource? Who coordinates (concurrent) access to it and how? What are the correctness and performance implications here (races, deadlock, corruptions, JCR semantics)?

- it limits implementation freedom and hinders further evolution (chunking, de-duplication, content based addressing, compression, gc, etc.) for data stores.

- bypassing JCR's security model

Pretty much all of this has been discussed in the scope of https://issues.apache.org/jira/browse/JCR-3534 and https://issues.apache.org/jira/browse/OAK-834. So I suggest reviewing those discussions before we jump to conclusions.

Also, what is the use case requiring such a vast API surface? Can't we come up with an API that allows the blobs to stay under the control of Oak? If not, this is probably an indication that those blobs shouldn't go into Oak, but just references to them, as Francesco already proposed. Anything else is neither fish nor fowl: you can't have the JCR goodies but at the same time access the underlying resources at will.

Michael

On 5.5.16 11:00 , Francesco Mari wrote:
[...]
Re: API proposal for - Expose URL for Blob source (OAK-1963)
On Thu, May 5, 2016 at 5:07 PM, Francesco Mari wrote:

> This is a totally different thing. The change to the node will be committed
> with the privileges of the session that retrieved the node. If the session
> doesn't have enough privileges to delete that node, the node will not be
> deleted. There is no escape from the security model.

"Bad code", when passed a node backed by an admin session, can still do bad things, as the admin session has all the privileges. In the same way, if bad code is passed a file handle it can cause issues. So I am still not sure about the attack vector which we are defending against.

Chetan Mehrotra
Re: API proposal for - Expose URL for Blob source (OAK-1963)
2016-05-05 13:22 GMT+02:00 Chetan Mehrotra:

> On Thu, May 5, 2016 at 4:38 PM, Francesco Mari wrote:
>
>> The security concern is quite easy to explain: it's a bypass of our
>> security model. Imagine that, using a session with the appropriate
>> privileges, a user accesses a Blob and adapts it to a file handle, an S3
>> bucket or a URL. This code passes this reference to another piece of code
>> that modifies the data directly even if - in the same deployment - it
>> shouldn't be able to access the Blob instance to begin with.
>
> How is this different from the case where code obtains a Node via an admin
> session and passes that Node instance to other code which, say, deletes
> important content via it? In the end we have to trust the client code to
> do the correct thing when given appropriate rights. So in the current
> proposal the code can only adapt the binary if the session has the
> expected permissions. Past that point we need to trust the code to behave
> properly.

This is a totally different thing. The change to the node will be committed with the privileges of the session that retrieved the node. If the session doesn't have enough privileges to delete that node, the node will not be deleted. There is no escape from the security model.

>> In both the use cases, the customer is coupling the data with the most
>> appropriate storage solution for his business case. In this case,
>> customer code - and not Oak - should be responsible for the management of
>> that data.
>
> Well then it means that the customer implements their very own DataStore
> like solution and all the application code does not make use of JCR Binary
> and instead uses another service to resolve the references. This would
> greatly reduce the usefulness of JCR for asset heavy applications which
> use JCR to manage binary content along with its metadata.

What I said doesn't reduce the usefulness of JCR. JCR defines an abstraction that is independent from the actual storage solution. If a client is fine with using the abstraction, JCR can be a very useful tool. If a client needs to escape the abstraction, he has to do it at his own risk, without breaking the abstraction for everyone else. In the use cases outlined, the customer needs to be responsible for his own storage mechanisms.

> Chetan Mehrotra
Re: API proposal for - Expose URL for Blob source (OAK-1963)
On Thu, May 5, 2016 at 4:38 PM, Francesco Mari wrote:

> The security concern is quite easy to explain: it's a bypass of our
> security model. Imagine that, using a session with the appropriate
> privileges, a user accesses a Blob and adapts it to a file handle, an S3
> bucket or a URL. This code passes this reference to another piece of code
> that modifies the data directly even if - in the same deployment - it
> shouldn't be able to access the Blob instance to begin with.

How is this different from the case where code obtains a Node via an admin session and passes that Node instance to other code which, say, deletes important content via it? In the end we have to trust the client code to do the correct thing when given appropriate rights. So in the current proposal the code can only adapt the binary if the session has the expected permissions. Past that point we need to trust the code to behave properly.

> In both the use cases, the customer is coupling the data with the most
> appropriate storage solution for his business case. In this case, customer
> code - and not Oak - should be responsible for the management of that data.

Well then it means that the customer implements their very own DataStore like solution and all the application code does not make use of JCR Binary and instead uses another service to resolve the references. This would greatly reduce the usefulness of JCR for asset heavy applications which use JCR to manage binary content along with its metadata.

Chetan Mehrotra
Re: API proposal for - Expose URL for Blob source (OAK-1963)
The security concern is quite easy to explain: it's a bypass of our security model. Imagine that, using a session with the appropriate privileges, a user accesses a Blob and adapts it to a file handle, an S3 bucket or a URL. This code passes this reference to another piece of code that modifies the data directly, even if - in the same deployment - it shouldn't be able to access the Blob instance to begin with.

In addition to that, I'm very concerned with the correctness of this solution. In both the use cases you mentioned above, you assume that the leaked reference is only used to read the data. The truth is that, once a reference leaks, we can't be sure that we are the only agent managing the data. We would have to program defensively because we are - as a matter of fact - sharing the management of the data with an unspecified amount of user code. I don't even know if it's possible to anticipate every single thing that can go wrong.

In both the use cases, the customer is coupling the data with the most appropriate storage solution for his business case. In this case, customer code - and not Oak - should be responsible for the management of that data. Oak can still be used to store references to that data - paths on the file system, the ID of the S3 bucket, or the URI to the resource.

2016-05-05 12:38 GMT+02:00 Chetan Mehrotra:
[...]
Re: API proposal for - Expose URL for Blob source (OAK-1963)
> This proposal introduces a huge leak of abstractions and has deep security implications.

I understand the leak of abstractions concern. However, I would like to understand the security concern a bit more.

One way I can think of that it can cause a security concern is if you have some malicious code running in the same JVM which can then do bad things with the file handle. Do note that the file handle would not get exposed via any remoting API we currently support. Now in this case, if malicious code is already running in the same JVM then security is breached anyway and the code can make use of reflection to access internal details.

So if there is any other possible security concern, I would like to discuss it.

Coming to the use cases:

Usecase A - Image rendition generation

We have some bigger deployments where lots of images get uploaded to the repository and there are some conversions (rendition generation) which are performed by OS specific native executables. Such programs work directly on a file handle. Without this change we currently need to first spool the file content into some temporary location and then pass that to the other program. This adds unnecessary overhead, and it is something which can be avoided when a FileDataStore is being used, where we can provide direct access to the file.

Usecase B - Efficient replication across regions in S3

This is for an AEM based setup which is running on Oak with the S3DataStore. There we have a global deployment where the author instance is running in one region and binary content is to be distributed to publish instances running in different regions. The DataStore size is huge, say 100TB, and for efficient operation we need to use binaryless replication. In most cases only a very small subset of the binary content would need to be present in other regions. The current way (via a shared DataStore) to support that would involve synchronizing the S3 bucket across all such regions, which would increase the storage cost considerably.

Instead of that, the plan is to replicate the specific assets via an S3 copy operation. This would ensure that big assets can be copied efficiently at the S3 level, and that would require direct access to the S3 object.

Again, in all such cases one can always resort to the current level of support, i.e. copy over all the content via an InputStream into some temporary store and then use that. But that would add considerable overhead when assets are 100MB in size or more. So the approach proposed would allow client code to do this efficiently, depending on the underlying storage capability.

> To me sounds like breaching the JCR and NodeState layers to directly manipulate NodeStore binaries (from the DataStore), e.g. to perform smart replication across different instances, but imho the right way to address that is extending one of the current DataStore implementations or create a new one.

The originally proposed approach in OAK-1963 was like that, i.e. introduce this access method on BlobStore, working on a reference. But in that case client code would need to deal with the BlobStore API. In either case, access to the actual binary storage data would be required.

Chetan Mehrotra

On Thu, May 5, 2016 at 2:49 PM, Tommaso Teofili wrote:
[...]
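To make Usecase B concrete: with direct access to the S3 object identifier, a replication agent can ask S3 to copy the object server side. A sketch with the AWS SDK for Java (bucket names and key derivation are placeholders; credential and region handling is elided):

    import com.amazonaws.services.s3.AmazonS3;

    public class CrossRegionReplicator {

        // Copies a single binary between buckets entirely on the S3 side, so
        // a 100MB+ asset never has to flow through the Oak JVM as an
        // InputStream.
        public static void replicate(AmazonS3 s3, String sourceBucket,
                                     String targetBucket, String objectKey) {
            s3.copyObject(sourceBucket, objectKey, targetBucket, objectKey);
        }
    }

The essential point is that only the copy request, not the binary itself, leaves the JVM.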
Re: API proposal for - Expose URL for Blob source (OAK-1963)
+1 to Francesco's concerns, exposing the location of a binary at the application level doesn't sound good from a security perspective.

To me it sounds like breaching the JCR and NodeState layers to directly manipulate NodeStore binaries (from the DataStore), e.g. to perform smart replication across different instances, but imho the right way to address that is extending one of the current DataStore implementations or creating a new one.

I am also concerned that this Adaptable pattern would open room for other such hacks into the stack.

My 2 cents,
Tommaso

Il giorno gio 5 mag 2016 alle ore 11:00 Francesco Mari < mari.france...@gmail.com> ha scritto:
[...]
Re: API proposal for - Expose URL for Blob source (OAK-1963)
On Wed, May 4, 2016 at 10:07 PM, Ian Boston wrote:

> If the File or URL is writable, will writing to the location cause issues
> for Oak ?

Yes, that would cause problems. The expectation here is that code using a direct location needs to behave responsibly.

Chetan Mehrotra
Re: API proposal for - Expose URL for Blob source (OAK-1963)
This proposal introduces a huge leak of abstractions and has deep security implications.

I guess that the reason for this proposal is that some users of Oak would like to perform some operations on binaries in a more performant way by leveraging the way those binaries are stored. If this is the case, I suggest that those users evaluate an applicative solution implemented on top of the JCR API.

If a user needs to store some important binary data (files, images, etc.) in an S3 bucket or on the file system for performance reasons, this shouldn't affect how Oak handles blobs internally. If some assets are of special interest for the user, then the user should bypass Oak and take care of the storage of those assets directly. Oak can be used to store *references* to those assets, which can be used in user code to manipulate the assets in his own business logic.

If the scenario I outlined is not what inspired this proposal, I would like to know more about the reasons why this proposal was brought up. Which problems are we going to solve with this API? Is there a more concrete use case that we can use as a driving example?

2016-05-05 10:06 GMT+02:00 Davide Giannella:
[...]
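Francesco's reference-based alternative, sketched against the plain JCR API (the property name and the URI scheme are made up for the example):

    import javax.jcr.Node;
    import javax.jcr.RepositoryException;
    import javax.jcr.Session;

    public class AssetReferences {

        // Instead of storing the binary in Oak, store a pointer to it and
        // keep the asset itself in S3 (or on the file system) under
        // application control.
        public static void linkAsset(Session session, String assetPath,
                                     String bucket, String key)
                throws RepositoryException {
            Node asset = session.getNode(assetPath);
            asset.setProperty("app:binaryRef", "s3://" + bucket + "/" + key);
            session.save();
        }
    }

The application then resolves "app:binaryRef" itself; Oak never sees the binary, so none of the blob-handle lifetime questions arise, at the cost of losing the JCR Binary tooling.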
Re: API proposal for - Expose URL for Blob source (OAK-1963)
On 04/05/2016 17:37, Ian Boston wrote:
> Hi,
> If the File or URL is writable, will writing to the location cause issues
> for Oak ?
> IIRC some Oak DS implementations use a digest of the content to determine
> the location in the DS, so changing the content via Oak will change the
> location, but changing the content via the File or URL won't. If I didn't
> remember correctly, then ignore the concern. Fully supportive of the
> approach, as a consumer of Oak. The locations will quite probably leak
> outside the context of an Oak session, so the API contract should make it
> clear that the code using a direct location needs to behave responsibly.

It's a reasonable concern and I'm not into the details of the implementation. It's worth keeping in mind though, and remember that if we want to adapt to URL or File, maybe we'll have to come up with some sort of read-only version of those.

For the File class, IIRC, we could force/use the setReadOnly() and setWritable() methods. I remember those being quite expensive in time though.

Davide
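For reference, the java.io.File flags Davide mentions (a sketch only; whether a data store could rely on them is exactly the open question, since they change the file system permissions rather than revoke already-open handles):

    import java.io.File;

    public class ReadOnlyHandles {

        // Best effort: drop the write bit before handing the file out.
        public static File asReadOnly(File blobFile) {
            if (!blobFile.setReadOnly()) {
                throw new IllegalStateException("Could not mark read-only: " + blobFile);
            }
            return blobFile;
        }

        // Restore the (owner) write bit if Oak itself still needs to write.
        public static void restoreWritable(File blobFile) {
            blobFile.setWritable(true);
        }
    }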
Re: API proposal for - Expose URL for Blob source (OAK-1963)
On 03/05/2016 15:36, Chetan Mehrotra wrote:
> ...
> //Check if Binary is of type AdaptableBinary
> if (binProp instanceof AdaptableBinary){

Would it be possible to avoid the `instanceof`? Which means, in my opinion, all our binaries should be Adaptable. In case the implementation is not adaptable, it can return null. Would that work as an API contract? It would ease the usage of such an API.

Plus, I would anyhow add an oak.api interface Adaptable, so that we can then, if needed, apply the same concept anywhere else.

> ...
>
> 1. Depending on backing BlobStore the binary can be adapted to various
> types. For FileDataStore it can be adapted to File. For S3DataStore it can
> either be adapted to URL or some S3DataStore specific type.

+1

> ...
>
> 2. Security - Thomas suggested that for better security the ability to
> adapt should be restricted based on session permissions. So if the user
> has the required permission then only would adaptation work, otherwise
> null would be returned.

+1

> ...
>
> 4. This API is for now exposed only at JCR level. Not sure should we do it
> at Oak level as Blob instances are currently not bound to any session. So
> proposal is to place this in 'org.apache.jackrabbit.oak.api' package

As said above, I would create an Adaptable interface at the Oak level and then use it where needed. It's a powerful tool.

Cheers
Davide
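What Davide describes might look like this (a sketch of the suggested contract, not existing Oak API):

    // A generic Adaptable contract at the Oak API level: every Binary
    // implements it, and implementations that cannot adapt simply return
    // null, so callers never need an instanceof check.
    public interface Adaptable {
        <T> T adaptTo(Class<T> type);
    }

Client code then degrades gracefully: File f = binary.adaptTo(File.class); returns null on stores that cannot expose a file, instead of forcing a type check and cast first.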
Re: API proposal for - Expose URL for Blob source (OAK-1963)
Hi,

If the File or URL is writable, will writing to the location cause issues for Oak? IIRC some Oak DS implementations use a digest of the content to determine the location in the DS, so changing the content via Oak will change the location, but changing the content via the File or URL won't. If I didn't remember correctly, then ignore the concern. Fully supportive of the approach, as a consumer of Oak. The locations will quite probably leak outside the context of an Oak session, so the API contract should make it clear that the code using a direct location needs to behave responsibly.

Best Regards
Ian

On 3 May 2016 at 15:36, Chetan Mehrotra wrote:
[...]
API proposal for - Expose URL for Blob source (OAK-1963)
Hi Team,

For OAK-1963 we need to allow access to the actual Blob location, say in the form of a File instance or an S3 object id etc. This access is needed to perform optimized IO operations around the binary object, e.g.

1. The File object can be used to spool the file content with zero copy using NIO by accessing the FileChannel directly [1]

2. Client code can efficiently replicate a binary stored in S3 by having direct access to the S3 object, using a copy operation

To allow such access we would need a new API in the form of AdaptableBinary.

API
===

public interface AdaptableBinary {

    /**
     * Adapts the binary to another type like File, URL etc.
     *
     * @param <AdapterType> The generic type to which this binary is adapted
     * @param type The Class object of the target type, such as File.class
     * @return The adapter target, or null if the binary cannot adapt to the
     *         requested type
     */
    <AdapterType> AdapterType adaptTo(Class<AdapterType> type);
}

Usage
=====

Binary binProp = node.getProperty("jcr:data").getBinary();

//Check if Binary is of type AdaptableBinary
if (binProp instanceof AdaptableBinary) {
    AdaptableBinary adaptableBinary = (AdaptableBinary) binProp;

    //Adapt it to File instance
    File file = adaptableBinary.adaptTo(File.class);
}

The Binary instance returned by Oak, i.e. org.apache.jackrabbit.oak.plugins.value.BinaryImpl, would then implement this interface, and calling code can check the type, cast it and then adapt it.

Key Points
==========

1. Depending on the backing BlobStore the binary can be adapted to various types. For FileDataStore it can be adapted to File. For S3DataStore it can either be adapted to URL or to some S3DataStore specific type.

2. Security - Thomas suggested that for better security the ability to adapt should be restricted based on session permissions. So only if the user has the required permission would the adaptation work; otherwise null would be returned.

3. The adaptation proposal is based on the Sling Adaptable pattern [2].

4. This API is for now exposed only at the JCR level. Not sure whether we should do it at the Oak level, as Blob instances are currently not bound to any session. So the proposal is to place this in the 'org.apache.jackrabbit.oak.api' package.

Kindly provide your feedback! Also any suggestions/guidance around how the access control should be implemented.

Chetan Mehrotra
[1] http://www.ibm.com/developerworks/library/j-zerocopy/
[2] https://sling.apache.org/apidocs/sling5/org/apache/sling/api/adapter/Adaptable.html
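As a concrete illustration of point 1 (zero-copy spooling, cf. [1]), assuming the proposed adaptTo(File.class) succeeded; the surrounding plumbing is invented for the example:

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.channels.Channels;
    import java.nio.channels.FileChannel;
    import java.nio.channels.WritableByteChannel;

    public class BinarySpooler {

        // Streams the blob file straight to the output channel; with a
        // FileChannel the JVM can use sendfile-style zero-copy transfers
        // instead of copying through user-space buffers.
        public static void spool(File blobFile, OutputStream out) throws IOException {
            try (FileChannel in = new FileInputStream(blobFile).getChannel()) {
                WritableByteChannel target = Channels.newChannel(out);
                long position = 0;
                long size = in.size();
                while (position < size) {
                    position += in.transferTo(position, size - position, target);
                }
            }
        }
    }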