Re: A federated data store
I put together a very crude initial POC, which can be seen at [0]. It simply allows a FileDataStore to be used as a delegate data store and the FederatedDataStore to be used in Oak as the primary data store. The approach is that the FederatedDataStore has information about the delegates (one primary and zero or more secondaries) and defers all actions to the appropriate delegate. The goal of this POC was to determine whether this simple idea could work at all.

I'm doing an internal mapping from a simple data store name to a fully qualified class name, and then using reflection to create the data store. This avoids coupling between the FederatedDataStore and other data stores, but also limits it to working only with supported data store delegates.

One question I have concerns the basic correctness of the approach. Is it acceptable to create the data store objects directly (e.g. OakCachingFDS), or should the service instead go through OSGi to create other data store service objects (e.g. FileDataStoreService)? My concern is that creating service objects may mean OSGi limits me to a single service, whereas if we create the data store objects directly we could have a number of them: for example, multiple S3DataStore objects, each with a different bucket for a different purpose. But I'm not sure that limitation on service objects really exists. Thoughts?

[0] - https://github.com/mattvryan/jackrabbit-oak/tree/federated-data-store/oak-blob-federated/src/main/java/org/apache/jackrabbit/oak/blob/federated

-MR

On Thu, Apr 20, 2017 at 12:20 PM, Matt Ryan <o...@mvryan.org> wrote:
> Hi,
>
> I'm looking at the possibility of creating a new kind of data store, let's
> call it a federated data store, and wanted to see what everyone thinks
> about this.
>
> The basic idea is that the federated data store would allow more than
> one data store to be configured for an Oak instance. Oak would then be
> able to choose which data store to use based on a number of criteria, like
> file size, JCR path, node type, existence of a node property, a node
> property value, or other items, or a combination of items. In my thinking
> these are defined in configuration, so the federated data store would know
> how to select which data store is used to store which binary.
>
> I think this is a step towards UC14 - Hierarchical BlobStore in [0]. Once
> the federated data store is implemented we should be able to support UC14
> with little additional work. I can also foresee other capabilities it could
> offer, such as storing blobs for different node types in different data
> stores, or choosing from a few different data stores based on geographic
> location (UC2 in [0]).
>
> In my mind we could add capability to DataStoreBlobStore.writeStream(),
> where the decision is made whether to write a stream to the data store
> delegate or keep it in memory. Instead we could defer the decision directly
> to the delegate, adding a method to the appropriate interface (BlobStore or
> GarbageCollectibleBlobStore) to handle this decision, and default the
> decision in AbstractBlobStore to be based on the record size (which is the
> current behavior, except that currently the decision is made in
> DataStoreBlobStore, IIUC). All other existing data stores would then
> behave the same. But in the case of the federated data store this decision
> would be more involved, selecting the right data store based on
> configuration.
>
> The federated data store would need to exist independently of other data
> stores, so figuring out how to create those data stores without a
> code dependency would be a challenge.
>
> Please let me know what you think: is my idea about the implementation
> flawed, is there a better way to accomplish this, what concerns are there
> about it, etc. I'd like to brainstorm with the list something that can
> work in this area and then I'll create a ticket for it. Or I can create
> the ticket and we can have the discussion in the ticket. Let me know
> which is best.
>
> [0] - https://wiki.apache.org/jackrabbit/JCR%20Binary%20Usecase
>
> - Matt Ryan
>
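To make the internal mapping mentioned at the top of this message concrete, a rough sketch of the lookup-plus-reflection step might look like the following. The short names and fully qualified class names in the map are illustrative, not necessarily the actual POC values:

```java
import java.util.HashMap;
import java.util.Map;

// A rough sketch of the POC's name-to-class lookup. The map entries below
// are illustrative guesses, not necessarily what the POC actually uses.
class DelegateFactory {
    private static final Map<String, String> DELEGATES = new HashMap<>();
    static {
        DELEGATES.put("FileDataStore",
                "org.apache.jackrabbit.core.data.FileDataStore");
        DELEGATES.put("S3DataStore",
                "org.apache.jackrabbit.oak.blob.cloud.aws.s3.S3DataStore");
    }

    static boolean isSupported(String name) {
        return DELEGATES.containsKey(name);
    }

    // Reflection keeps the FederatedDataStore decoupled from the delegate
    // types, at the cost of supporting only the names present in the map.
    static Object createDelegate(String name) throws Exception {
        String className = DELEGATES.get(name);
        if (className == null) {
            throw new IllegalArgumentException("Unsupported delegate: " + name);
        }
        return Class.forName(className).getDeclaredConstructor().newInstance();
    }
}
```

The trade-off described above is visible here: adding a new delegate type means adding a map entry, but no compile-time dependency on the delegate is needed.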
Re: A federated data store
On Fri, Apr 21, 2017 at 7:20 AM, Davide Giannella wrote:
> On 20/04/2017 19:30, Matt Ryan wrote:
> > I misremembered above when I was describing a possible implementation. I
> > was thinking we'd add a method to the delegate, but that would be added to
> > the DataStore interface, obviously (not BlobStore or
> > GarbageCollectibleBlobStore). Likewise, the default implementation would
> > exist in AbstractDataStore (not AbstractBlobStore).
>
> I like the idea overall, and I'm not familiar with the DS codebase, so
> what I'm saying could be wrong.
>
> If I think about the idea without knowing the current implementation, I
> would expect some sort of API which allows the Visitor pattern to be
> leveraged. That way, in an OSGi environment we could simply pull in
> all the Visitor services and act on them, while in plain Java it would be
> more about the repository construction/configuration.

Davide, thanks for the suggestion of using the Visitor pattern. I spent a fair bit of time over the past couple of weeks researching the Visitor pattern again and thinking about how it would apply. I am not opposed to using that or any other relevant design pattern (I'm generally a fan). But I'm struggling to see how the Visitor pattern would work here, so maybe you can help me see what you had in mind.

From [0] there is an image of a sequence diagram for the Visitor pattern [1] that is essentially taken right out of the GoF "Design Patterns" book. Looking at the sequence diagram and trying to map it to this problem:

- I believe the class labeled "xx:Composite" would be the FederatedDataStore (some class within this component).
- I believe the classes labeled "anA:ConcreteA" and "aB:ConcreteB" would be delegate data stores, e.g. FileDataStore, S3DataStore, or something like that.
- I believe the class labeled "v:ConcreteVisitorType1" is ... ??? That's where I get stuck - I can't figure out what the delegated data stores would be visiting.

In the GoF "Design Patterns" book, under "Applicability" for the Visitor pattern (page 333):

- Bullet one says to use the Visitor when "an object structure contains many classes of objects with differing interfaces". That shouldn't be the case here - all the data store delegates should be able to be treated pretty much the same.
- Bullet two says to use the Visitor when "many distinct and unrelated operations need to be performed on an object structure, and you want to avoid 'polluting' their classes with these operations." I don't think this applies either - the operations differ slightly in implementation but are similar in purpose, and are not unrelated; we don't need to perform many operations but rather select which one is right; and we actually do want to 'pollute' their classes with the operations, because it is within those classes that the logic to do the operation is contained.

Can you help me see what you had in mind? I think I'm missing it.

[0] - http://www.ghytred.com/ShowArticle.aspx?VisitorPattern
[1] - http://www.ghytred.com/images/visitor2.jpg

-MR
Re: A federated data store
Davide, Chetan, thanks for the feedback. Please allow me some time to process it, and I'll try to come back with something more concrete and detailed to discuss further. Additional ideas, suggestions, and corrections welcome.

On Fri, Apr 21, 2017 at 8:53 AM, Chetan Mehrotra <chetan.mehro...@gmail.com> wrote:
> Hi Matt,
>
> On Thu, Apr 20, 2017 at 11:50 PM, Matt Ryan <o...@mvryan.org> wrote:
> > Oak would then be
> > able to choose which data store to use based on a number of criteria, like
> > file size, JCR path, node type, existence of a node property, a node
> > property value, or other items, or a combination of items. In my thinking
> > these are defined in configuration so the federated data store would know
> > how to select which data store is used to store which binary.
>
> This would need some more details. The way a binary gets written using
> the JCR API is:
>
> 1. Code creates a Binary using ValueFactory, say by spooling the stream.
> By this time the binary has already been added to the DataStore.
> 2. The returned binary reference is then stored as part of a JCR Node by
> setting the passed Binary property.
>
> So making storage of a Binary a function of the final Node would require
> some more thought. A federated store has 2 aspects:
>
> 1. Writing a binary - destination store selection = f(node, path, user
> option)
> 2. Reading a binary - this would be simple, as the actual store
> information would be encoded within the blobId (like some url?), and
> the BlobStore which needs to be used for reading would be selected
> based on the scheme in the blobId.
>
> Further, the current Blob-related API is used in the following ways:
>
> B1. Code logic dealing with blob creation - JCR ValueFactory,
> NodeStore#createBlob. They only work with the BlobStore API.
> B2. Code logic dealing with BlobGC - it uses methods in
> GarbageCollectableBlobStore.
>
> Amit added a BlobStore#writeBlob(InputStream, BlobOption) as part of
> OAK-5174. This can now be extended to support the federated use case. One
> possible approach could look like this:
>
> 1. The setup would have multiple BlobStore service implementations registered.
> 2. These services would have a property "type" defined to indicate the
> scheme.
> 3. The setup would have a default BlobStore and multiple secondary stores.
> 4. Any code in #B1 above would be dealing with a FederatedBlobStore,
> aka the "master"/primary store.
> 5. The NodeStores would be bound to this "master" BlobStore.
>
> FederatedBlobStore would use the default store for any Binary created
> via NodeStore#createBlob. However, any call to
> BlobStore#writeBlob(InputStream, BlobOption) would be passed to the other
> stores, which can indicate whether they can handle the call or not. If yes,
> they would return the Blob ID. We can also look into exposing the
> new method as part of the NodeStore API.
>
> OakValueFactory could then wrap the "context", i.e. path, node, etc., as
> part of BlobOption, which could then be used for store selection.
>
> How this impacts the GC logic would also need to be thought about.
>
> Chetan Mehrotra
>
> PS: Above is more of a brain dump in thinking-out-loud mode :)
>
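If it helps, the write-side delegation in the quoted proposal could be sketched roughly as below. DelegateBlobStore, canHandle, and the BlobOptions marker are invented stand-ins for illustration, not the real Oak types:

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Stand-in for the real Oak BlobOptions; here it is just a marker carrying
// whatever "context" (path, node, user option) the caller wraps into it.
class BlobOptions {}

// Hypothetical delegate contract: canHandle lets a secondary store accept
// or decline a write. This interface is invented for illustration.
interface DelegateBlobStore {
    boolean canHandle(BlobOptions options);
    String writeBlob(InputStream in, BlobOptions options) throws Exception;
}

class FederatedBlobStore {
    private final DelegateBlobStore defaultStore;
    private final List<DelegateBlobStore> secondaries = new ArrayList<>();

    FederatedBlobStore(DelegateBlobStore defaultStore) {
        this.defaultStore = defaultStore;
    }

    void addSecondary(DelegateBlobStore store) {
        secondaries.add(store);
    }

    // Binaries created via NodeStore#createBlob carry no options and go
    // straight to the default store; calls with options poll the
    // secondaries first, and the first store that accepts the write wins.
    String writeBlob(InputStream in, BlobOptions options) throws Exception {
        if (options != null) {
            for (DelegateBlobStore store : secondaries) {
                if (store.canHandle(options)) {
                    return store.writeBlob(in, options);
                }
            }
        }
        return defaultStore.writeBlob(in, options);
    }
}
```

The returned Blob ID then comes from whichever store accepted the write, matching the "if yes then they would return the Blob ID" step in the quoted message.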
Re: A federated data store
Hi Matt,

On Thu, Apr 20, 2017 at 11:50 PM, Matt Ryan <o...@mvryan.org> wrote:
> Oak would then be
> able to choose which data store to use based on a number of criteria, like
> file size, JCR path, node type, existence of a node property, a node
> property value, or other items, or a combination of items. In my thinking
> these are defined in configuration so the federated data store would know
> how to select which data store is used to store which binary.

This would need some more details. The way a binary gets written using the JCR API is:

1. Code creates a Binary using ValueFactory, say by spooling the stream. By this time the binary has already been added to the DataStore.
2. The returned binary reference is then stored as part of a JCR Node by setting the passed Binary property.

So making storage of a Binary a function of the final Node would require some more thought. A federated store has 2 aspects:

1. Writing a binary - destination store selection = f(node, path, user option)
2. Reading a binary - this would be simple, as the actual store information would be encoded within the blobId (like some url?), and the BlobStore which needs to be used for reading would be selected based on the scheme in the blobId.

Further, the current Blob-related API is used in the following ways:

B1. Code logic dealing with blob creation - JCR ValueFactory, NodeStore#createBlob. They only work with the BlobStore API.
B2. Code logic dealing with BlobGC - it uses methods in GarbageCollectableBlobStore.

Amit added a BlobStore#writeBlob(InputStream, BlobOption) as part of OAK-5174. This can now be extended to support the federated use case. One possible approach could look like this:

1. The setup would have multiple BlobStore service implementations registered.
2. These services would have a property "type" defined to indicate the scheme.
3. The setup would have a default BlobStore and multiple secondary stores.
4. Any code in #B1 above would be dealing with a FederatedBlobStore, aka the "master"/primary store.
5. The NodeStores would be bound to this "master" BlobStore.

FederatedBlobStore would use the default store for any Binary created via NodeStore#createBlob. However, any call to BlobStore#writeBlob(InputStream, BlobOption) would be passed to the other stores, which can indicate whether they can handle the call or not. If yes, they would return the Blob ID. We can also look into exposing the new method as part of the NodeStore API.

OakValueFactory could then wrap the "context", i.e. path, node, etc., as part of BlobOption, which could then be used for store selection.

How this impacts the GC logic would also need to be thought about.

Chetan Mehrotra

PS: Above is more of a brain dump in thinking-out-loud mode :)
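The read side described above (selecting the store from the scheme encoded in the blobId) could be sketched as follows. The "scheme:" prefix convention and the resolver class are illustrative assumptions, not Oak's actual blobId format:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of scheme-based read dispatch. The "scheme:" prefix convention
// and this resolver are illustrative, not Oak's actual blobId format.
class BlobStoreResolver {
    private final Map<String, String> storesByScheme = new HashMap<>();

    // Each registered BlobStore service advertises a "type" (the scheme).
    void register(String scheme, String storeName) {
        storesByScheme.put(scheme, storeName);
    }

    // Extract the scheme from a blobId like "s3:af39..." and look up the
    // store registered for it; anything unrecognized falls back to the
    // default store.
    String storeFor(String blobId) {
        int sep = blobId.indexOf(':');
        if (sep > 0) {
            String store = storesByScheme.get(blobId.substring(0, sep));
            if (store != null) {
                return store;
            }
        }
        return "default";
    }
}
```

Under this convention a legacy blobId with no scheme prefix would automatically resolve to the default store, which keeps existing binaries readable.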
Re: A federated data store
On 20/04/2017 19:30, Matt Ryan wrote:
> I misremembered above when I was describing a possible implementation. I
> was thinking we'd add a method to the delegate, but that would be added to
> the DataStore interface, obviously (not BlobStore or
> GarbageCollectibleBlobStore). Likewise, the default implementation would
> exist in AbstractDataStore (not AbstractBlobStore).

I like the idea overall, and I'm not familiar with the DS codebase, so what I'm saying could be wrong.

If I think about the idea without knowing the current implementation, I would expect some sort of API which allows the Visitor pattern to be leveraged. That way, in an OSGi environment we could simply pull in all the Visitor services and act on them, while in plain Java it would be more about the repository construction/configuration.

The overall approach would also allow Oak-based applications to implement their own business-specific strategies by simply implementing the strategy (Visitor) and, in the case of OSGi, simply deploying the service. Each strategy, in the OSGi use case, would have to require configuration to be active, which would allow for very fine-grained repository configuration.

As I said at the beginning, I'm not familiar at all with the current codebase and don't know what we could actually do even by changing the API.

D.
Re: A federated data store
I misremembered above when I was describing a possible implementation. I was thinking we'd add a method to the delegate, but that would be added to the DataStore interface, obviously (not BlobStore or GarbageCollectibleBlobStore). Likewise, the default implementation would exist in AbstractDataStore (not AbstractBlobStore). Sorry about the mix-up.

On Thu, Apr 20, 2017 at 12:20 PM, Matt Ryan <o...@mvryan.org> wrote:
> Hi,
>
> I'm looking at the possibility of creating a new kind of data store, let's
> call it a federated data store, and wanted to see what everyone thinks
> about this.
>
> The basic idea is that the federated data store would allow more than
> one data store to be configured for an Oak instance. Oak would then be
> able to choose which data store to use based on a number of criteria, like
> file size, JCR path, node type, existence of a node property, a node
> property value, or other items, or a combination of items. In my thinking
> these are defined in configuration, so the federated data store would know
> how to select which data store is used to store which binary.
>
> I think this is a step towards UC14 - Hierarchical BlobStore in [0]. Once
> the federated data store is implemented we should be able to support UC14
> with little additional work. I can also foresee other capabilities it could
> offer, such as storing blobs for different node types in different data
> stores, or choosing from a few different data stores based on geographic
> location (UC2 in [0]).
>
> In my mind we could add capability to DataStoreBlobStore.writeStream(),
> where the decision is made whether to write a stream to the data store
> delegate or keep it in memory. Instead we could defer the decision directly
> to the delegate, adding a method to the appropriate interface (BlobStore or
> GarbageCollectibleBlobStore) to handle this decision, and default the
> decision in AbstractBlobStore to be based on the record size (which is the
> current behavior, except that currently the decision is made in
> DataStoreBlobStore, IIUC). All other existing data stores would then
> behave the same. But in the case of the federated data store this decision
> would be more involved, selecting the right data store based on
> configuration.
>
> The federated data store would need to exist independently of other data
> stores, so figuring out how to create those data stores without a
> code dependency would be a challenge.
>
> Please let me know what you think: is my idea about the implementation
> flawed, is there a better way to accomplish this, what concerns are there
> about it, etc. I'd like to brainstorm with the list something that can
> work in this area and then I'll create a ticket for it. Or I can create
> the ticket and we can have the discussion in the ticket. Let me know
> which is best.
>
> [0] - https://wiki.apache.org/jackrabbit/JCR%20Binary%20Usecase
>
> - Matt Ryan
>
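As a sketch of the correction above (a default, record-size-based decision living in the data store, which a federated implementation could override), something like the following. The class and method names here are invented for illustration; this is not the actual Oak API:

```java
// Sketch only: the class and method names are invented to illustrate
// moving the in-memory vs. delegate decision into the data store itself,
// defaulted on record size as the current behavior is described above.
abstract class AbstractDataStoreSketch {
    private final int minRecordLength;

    AbstractDataStoreSketch(int minRecordLength) {
        this.minRecordLength = minRecordLength;
    }

    // Default decision: binaries smaller than minRecordLength are kept
    // in-memory (inlined) rather than written to the delegate. A
    // FederatedDataStore would override this with configuration-driven
    // selection instead.
    boolean shouldStoreInline(long recordSize) {
        return recordSize < minRecordLength;
    }
}
```

The point of the sketch is only to show where the decision would live: every existing data store inherits the size-based default unchanged, while the federated store overrides one method.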
A federated data store
Hi,

I'm looking at the possibility of creating a new kind of data store, let's call it a federated data store, and wanted to see what everyone thinks about this.

The basic idea is that the federated data store would allow more than one data store to be configured for an Oak instance. Oak would then be able to choose which data store to use based on a number of criteria, like file size, JCR path, node type, existence of a node property, a node property value, or other items, or a combination of items. In my thinking these are defined in configuration, so the federated data store would know how to select which data store is used to store which binary.

I think this is a step towards UC14 - Hierarchical BlobStore in [0]. Once the federated data store is implemented we should be able to support UC14 with little additional work. I can also foresee other capabilities it could offer, such as storing blobs for different node types in different data stores, or choosing from a few different data stores based on geographic location (UC2 in [0]).

In my mind we could add capability to DataStoreBlobStore.writeStream(), where the decision is made whether to write a stream to the data store delegate or keep it in memory. Instead we could defer the decision directly to the delegate, adding a method to the appropriate interface (BlobStore or GarbageCollectibleBlobStore) to handle this decision, and default the decision in AbstractBlobStore to be based on the record size (which is the current behavior, except that currently the decision is made in DataStoreBlobStore, IIUC). All other existing data stores would then behave the same. But in the case of the federated data store this decision would be more involved, selecting the right data store based on configuration.

The federated data store would need to exist independently of other data stores, so figuring out how to create those data stores without a code dependency would be a challenge.

Please let me know what you think: is my idea about the implementation flawed, is there a better way to accomplish this, what concerns are there about it, etc. I'd like to brainstorm with the list something that can work in this area and then I'll create a ticket for it. Or I can create the ticket and we can have the discussion in the ticket. Let me know which is best.

[0] - https://wiki.apache.org/jackrabbit/JCR%20Binary%20Usecase

- Matt Ryan
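As a sketch of the configuration-driven selection described above, a delegate-selection rule could look something like the following. All names are hypothetical, and real criteria such as node type or property values would need richer context than the two parameters shown here:

```java
// Hypothetical selection rule combining two of the criteria mentioned
// above (file size and JCR path). All names are invented for illustration.
interface DataStoreSelectionRule {
    // Return the id of the delegate data store that should hold the
    // binary, or null if this rule does not apply.
    String select(long size, String path);
}

class SizeAndPathRule implements DataStoreSelectionRule {
    private final long sizeThreshold;
    private final String pathPrefix;
    private final String storeId;

    SizeAndPathRule(long sizeThreshold, String pathPrefix, String storeId) {
        this.sizeThreshold = sizeThreshold;
        this.pathPrefix = pathPrefix;
        this.storeId = storeId;
    }

    @Override
    public String select(long size, String path) {
        if (size >= sizeThreshold && path.startsWith(pathPrefix)) {
            return storeId;
        }
        return null; // defer to the next rule or the default store
    }
}
```

A federated data store could hold an ordered list of such rules from configuration and fall back to a default delegate when none of them matches.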