Re: A federated data store

2017-05-05 Thread Matt Ryan
I put together a very crude initial POC which can be seen at [0].  This
simply allows a FileDataStore to be used as a delegate data store and the
FederatedDataStore to be used in Oak as the primary data store.

The approach is that the FederatedDataStore has information about the
delegates (one primary and zero or more secondaries) and defers all actions
to the appropriate delegate.  The goal of this POC was to determine whether
this simple idea could work.  I'm doing an internal mapping from a simple
data store name to a fully qualified class name, and then using reflection
to create the data store.  This avoids coupling between the
FederatedDataStore and other data stores, but also limits it to working
only with supported data store delegates.

One question I have with this has to do with the basic correctness of the
approach.  Is it acceptable to create the data store objects directly (e.g.
OakCachingFDS), or should the service be going through OSGi to create other
data store service objects instead (e.g. FileDataStoreService)?

I have a concern that creating service objects may mean OSGi limits me to a
single service, whereas if we create the data store objects directly we
could have a number of them.  For example, multiple S3DataStore objects,
each with a different bucket for different purposes.  But I'm not sure if
that limitation on service objects really exists.

Thoughts?


[0] -
https://github.com/mattvryan/jackrabbit-oak/tree/federated-data-store/oak-blob-federated/src/main/java/org/apache/jackrabbit/oak/blob/federated


-MR

On Thu, Apr 20, 2017 at 12:20 PM, Matt Ryan <o...@mvryan.org> wrote:

> Hi,
>
> I'm looking at the possibility of creating a new kind of data store, let's
> call it a federated data store, and wanted to see what everyone thinks
> about this.
>
> The basic idea is that the federated data store would allow for more than
> one data store to be configured for an Oak instance.  Oak would then be
> able to choose which data store to use based on a number of criteria, like
> file size, JCR path, node type, existence of a node property, a node
> property value, or other items, or a combination of items.  In my thinking
> these are defined in configuration so the federated data store would know
> how to select which data store is used to store which binary.
>
> I think this is a step towards UC14 - Hierarchical BlobStore in [0].  Once
> the federated data store is implemented we should be able to support UC14
> with little work.  I can also foresee other possible capabilities it could
> offer, such as storing blobs for different node types in different data
> stores, or choosing from a few different data stores based on geographic
> location (UC2 in [0]).
>
> In my mind we could add capability to DataStoreBlobStore.writeStream(),
> which is where the decision is currently made whether to write a stream to
> the data store delegate or keep it in memory.  Instead, we could defer the
> decision directly to the delegate, adding a method to the appropriate
> interface (BlobStore or GarbageCollectibleBlobStore) to handle this
> decision, and default the decision in AbstractBlobStore to be based on the
> record size (which is the current behavior, except that currently the
> decision is made in DataStoreBlobStore, IIUC).  All other existing data
> stores would then behave the same.  But in the case of the federated data
> store this decision would be more involved, selecting the right data store
> based on configuration.
>
> The federated data store would need to exist independently of other data
> stores, so figuring out how to create those data stores without a code
> dependency would be a challenge.
>
>
> Please let me know what you think: is my idea about the implementation
> flawed, is there a better way to accomplish this, what concerns are there
> about it, etc.?  I'd like to brainstorm with the list something that can
> work in this area and then I'll create a ticket for it.  Or I can create
> the ticket, and we can have the discussion in the ticket.  Let me know
> which is best.
>
>
> [0] - https://wiki.apache.org/jackrabbit/JCR%20Binary%20Usecase
>
>
> - Matt Ryan
>


Re: A federated data store

2017-05-05 Thread Matt Ryan
On Fri, Apr 21, 2017 at 7:20 AM, Davide Giannella  wrote:

> On 20/04/2017 19:30, Matt Ryan wrote:
> > I misremembered above when I was describing a possible implementation.  I
> > was thinking we'd add a method to the delegate, but that would be added
> to
> > the DataStore interface, obviously (not BlobStore or
> > GarbageCollectibleBlobStore).  Likewise, the default implementation would
> > exist in AbstractDataStore (not AbstractBlobStore).
>
> I like the idea overall, but I'm not familiar with the DS codebase, so
> what I'm saying may be wrong.
>
> If I think about the idea without knowing the current implementation, I
> would expect some sort of API which allows the Visitor pattern to be
> leveraged. In this way, in an OSGi environment we could simply pull in
> all the Visitor services and act on them, while in plain Java it would be
> more around the repository construction/configuration.
>
>
Davide, thanks for the suggestion of using the Visitor pattern.

I spent a fair bit of time over the past couple of weeks researching the
Visitor pattern again and thinking about how it would apply.  I am not
opposed to using that or any other relevant design pattern (I'm generally a
fan).  But I'm struggling to see how the Visitor pattern would work here,
so maybe you can help me see what you had in mind.

From [0] there is an image of a sequence diagram for the visitor pattern
[1] that is essentially taken right out of the GoF "Design Patterns" book.
Looking at the sequence diagram and trying to map it to this problem:
-  I believe the class labeled "xx:Composite" would be the
FederatedDataStore (some class within this component).
-  I believe the classes labeled "anA:ConcreteA" and "aB:ConcreteB" would
be delegate data stores, e.g. FileDataStore, S3DataStore, or something like
that.
-  I believe the class labeled "v:ConcreteVisitorType1" is ... ???

That's where I get stuck - I can't figure out what the delegate data
stores would be visiting.

In the GoF "Design Patterns" book for the Visitor Pattern under
"Applicability" (page 333):
-  Bullet one says use the Visitor when "an object structure contains many
classes of objects with differing interfaces".  That shouldn't be the case
here - all the data store delegates should be able to be treated pretty
much the same.
-  Bullet two says use the Visitor when "many distinct and unrelated
operations need to be performed on an object structure, and you want to
avoid 'polluting' their classes with these operations."  I don't think this
applies either - the operations are slightly different in implementation
but similar in purpose, and are not unrelated; we don't need to perform
many operations but rather select which one is right; and we actually do
want to 'pollute' their classes with the operations, because it is within
those classes that the logic to do the operation is contained.

Can you help me see what you had in mind?  I think I'm missing it.


[0] - http://www.ghytred.com/ShowArticle.aspx?VisitorPattern
[1] - http://www.ghytred.com/images/visitor2.jpg


-MR


Re: A federated data store

2017-04-21 Thread Matt Ryan
Davide, Chetan, thanks for the feedback.  Please allow me some time to
process it, and I'll try to come back with something more concrete and
detailed to discuss further.

Additional ideas, suggestions, and corrections welcome.

On Fri, Apr 21, 2017 at 8:53 AM, Chetan Mehrotra <chetan.mehro...@gmail.com>
wrote:

> Hi Matt,
>
> On Thu, Apr 20, 2017 at 11:50 PM, Matt Ryan <o...@mvryan.org> wrote:
> > Oak would then be
> > able to choose which data store to use based on a number of criteria,
> like
> > file size, JCR path, node type, existence of a node property, a node
> > property value, or other items, or a combination of items.  In my
> thinking
> > these are defined in configuration so the federated data store would know
> > how to select which data store is used to store which binary.
>
> This would need some more details. The way a binary gets written using
> the JCR API is:
>
> 1. Code creates a Binary using ValueFactory, say by spooling the stream.
> By this time the binary is already added to the DataStore.
> 2. The returned binary reference is then stored as part of the JCR Node by
> setting the passed Binary property.
>
> So making storage of a Binary a function of the final Node would require
> some more thought. A federated store has 2 aspects:
>
> 1. Writing a binary - Destination store selection = f(node, path, user
> option)
>
> 2. Reading a binary - This would be simple, as the actual store
> information would be encoded within the blobId (like some URL?), and
> then the BlobStore which needs to be used for reading would be selected
> based on the scheme in the blobId.
>
> Further, the current Blob-related API is used in the following ways:
>
> B1. Code logic dealing with blob creation - JCR ValueFactory,
> NodeStore#createBlob. These only work with the BlobStore API.
> B2. Code logic dealing with BlobGC - This uses methods in
> GarbageCollectableBlobStore.
>
> Amit added a BlobStore#writeBlob(InputStream, BlobOption) as part of
> OAK-5174. This can now be extended to support the federated use case. One
> possible approach could be as below:
>
> 1. The setup would have multiple BlobStore service implementations
> registered.
> 2. These services would have a property "type" defined to indicate the
> scheme.
> 3. The setup would have a default BlobStore and multiple secondary stores.
> 4. Any code in #B1 above would be dealing with a FederatedBlobStore,
> aka the "master"/primary store.
> 5. The NodeStores would be bound to this "master" BlobStore.
>
> FederatedBlobStore would use the default store for any Binary created
> via NodeStore#createBlob. However, any call to
> BlobStore#writeBlob(InputStream, BlobOption) would be passed to the other
> stores, which can indicate whether or not they can handle the call. If
> yes, then they would return the blob ID. We can also look into exposing
> the new method as part of the NodeStore API.
>
> OakValueFactory can then wrap the "context", i.e. path, node, etc., as
> part of BlobOption, which can then be used for store selection.
>
> How this impacts the GC logic would also need to be thought about.
>
> Chetan Mehrotra
>
> PS: Above is more of a brain dump in thinking out loud mode :)
>


Re: A federated data store

2017-04-21 Thread Chetan Mehrotra
Hi Matt,

On Thu, Apr 20, 2017 at 11:50 PM, Matt Ryan <o...@mvryan.org> wrote:
> Oak would then be
> able to choose which data store to use based on a number of criteria, like
> file size, JCR path, node type, existence of a node property, a node
> property value, or other items, or a combination of items.  In my thinking
> these are defined in configuration so the federated data store would know
> how to select which data store is used to store which binary.

This would need some more details. The way a binary gets written using
the JCR API is:

1. Code creates a Binary using ValueFactory, say by spooling the stream.
By this time the binary is already added to the DataStore.
2. The returned binary reference is then stored as part of the JCR Node by
setting the passed Binary property.

So making storage of a Binary a function of the final Node would require
some more thought. A federated store has 2 aspects:

1. Writing a binary - Destination store selection = f(node, path, user option)

2. Reading a binary - This would be simple, as the actual store
information would be encoded within the blobId (like some URL?), and
then the BlobStore which needs to be used for reading would be selected
based on the scheme in the blobId.
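
A rough sketch of what the scheme-based selection on read could look like
(illustrative only; it assumes the existing
org.apache.jackrabbit.oak.spi.blob.BlobStore interface as the delegate
type, and the scheme prefix format is hypothetical):

    import java.util.Map;

    import org.apache.jackrabbit.oak.spi.blob.BlobStore;

    // Illustrative only: selects the delegate BlobStore for a read based
    // on a scheme prefix encoded in the blob id (e.g. "s3:..." or "file:...").
    public class BlobIdRouting {

        public static String schemeOf(String blobId) {
            int idx = blobId.indexOf(':');
            return idx > 0 ? blobId.substring(0, idx) : "default";
        }

        public static BlobStore delegateFor(Map<String, BlobStore> delegatesByScheme,
                                            String blobId) {
            BlobStore delegate = delegatesByScheme.get(schemeOf(blobId));
            if (delegate == null) {
                throw new IllegalArgumentException("No delegate registered for: " + blobId);
            }
            return delegate;
        }
    }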

Further, the current Blob-related API is used in the following ways:

B1. Code logic dealing with blob creation - JCR ValueFactory,
NodeStore#createBlob. These only work with the BlobStore API.
B2. Code logic dealing with BlobGC - This uses methods in
GarbageCollectableBlobStore.

Amit added a BlobStore#writeBlob(InputStream, BlobOption) as part of
OAK-5174. This can now be extended to support the federated use case. One
possible approach could be as below:

1. The setup would have multiple BlobStore service implementations registered.
2. These services would have a property "type" defined to indicate the scheme.
3. The setup would have a default BlobStore and multiple secondary stores.
4. Any code in #B1 above would be dealing with a FederatedBlobStore,
aka the "master"/primary store.
5. The NodeStores would be bound to this "master" BlobStore.

FederatedBlobStore would use the default store for any Binary created
via NodeStore#createBlob. However, any call to
BlobStore#writeBlob(InputStream, BlobOption) would be passed to the other
stores, which can indicate whether or not they can handle the call. If yes,
then they would return the blob ID. We can also look into exposing the
new method as part of the NodeStore API.

OakValueFactory can then wrap the "context", i.e. path, node, etc., as
part of BlobOption, which can then be used for store selection.
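
In rough pseudo-code the dispatch could look something like the following
(all type names here are placeholders standing in for the real
BlobStore/BlobOptions types, not the actual Oak API):

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.List;

    // Sketch only: DelegateBlobStore and WriteContext are hypothetical names.
    public class FederatedWriteDispatch {

        // A delegate that can opt in or out of handling a particular write.
        public interface DelegateBlobStore {
            boolean canHandle(WriteContext context);
            String writeBlob(InputStream in, WriteContext context) throws IOException;
        }

        // The "context" (path, node, user option) wrapped by OakValueFactory.
        public static class WriteContext {
            public final String path;
            public WriteContext(String path) { this.path = path; }
        }

        private final DelegateBlobStore defaultStore;
        private final List<DelegateBlobStore> secondaries;

        public FederatedWriteDispatch(DelegateBlobStore defaultStore,
                                      List<DelegateBlobStore> secondaries) {
            this.defaultStore = defaultStore;
            this.secondaries = secondaries;
        }

        // Ask each secondary store whether it can handle the call; fall
        // back to the default store otherwise.
        public String writeBlob(InputStream in, WriteContext context) throws IOException {
            for (DelegateBlobStore store : secondaries) {
                if (store.canHandle(context)) {
                    return store.writeBlob(in, context);
                }
            }
            return defaultStore.writeBlob(in, context);
        }
    }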

How this impacts the GC logic would also need to be thought about.

Chetan Mehrotra

PS: Above is more of a brain dump in thinking out loud mode :)


Re: A federated data store

2017-04-21 Thread Davide Giannella
On 20/04/2017 19:30, Matt Ryan wrote:
> I misremembered above when I was describing a possible implementation.  I
> was thinking we'd add a method to the delegate, but that would be added to
> the DataStore interface, obviously (not BlobStore or
> GarbageCollectibleBlobStore).  Likewise, the default implementation would
> exist in AbstractDataStore (not AbstractBlobStore).

I like the idea overall, but I'm not familiar with the DS codebase, so
what I'm saying may be wrong.

If I think about the idea without knowing the current implementation, I
would expect some sort of API which allows the Visitor pattern to be
leveraged. In this way, in an OSGi environment we could simply pull in
all the Visitor services and act on them, while in plain Java it would be
more around the repository construction/configuration.

The overall approach would also allow Oak-based applications to implement
their own business-specific strategies by simply implementing the strategy
(Visitor) and, in the case of OSGi, simply deploying the service.
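
Just to give an idea of what I mean by a strategy, a minimal sketch (the
interface name and methods are purely illustrative; nothing like this
exists today):

    // Purely illustrative: a pluggable selection strategy that an
    // application (or an OSGi service) could implement to route a binary
    // to a delegate data store.
    public interface DataStoreSelectionStrategy {

        // Decide whether this strategy wants to handle the given binary,
        // based on whatever context ends up being available at write time.
        boolean accepts(String jcrPath, long size);

        // The identifier of the delegate data store this strategy routes to.
        String delegateName();
    }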

Each strategy, for the OSGi use case, would have to require configuration to
be active, which would allow for very fine-grained repository configuration.

As I said at the beginning, I'm not familiar at all with the current
codebase and don't know what we could actually do even by changing the API.

D.



Re: A federated data store

2017-04-20 Thread Matt Ryan
I misremembered above when I was describing a possible implementation.  I
was thinking we'd add a method to the delegate, but that would be added to
the DataStore interface, obviously (not BlobStore or
GarbageCollectibleBlobStore).  Likewise, the default implementation would
exist in AbstractDataStore (not AbstractBlobStore).

Sorry about the mix-up.

On Thu, Apr 20, 2017 at 12:20 PM, Matt Ryan <o...@mvryan.org> wrote:

> Hi,
>
> I'm looking at the possibility of creating a new kind of data store, let's
> call it a federated data store, and wanted to see what everyone thinks
> about this.
>
> The basic idea is that the federated data store would allow for more than
> one data store to be configured for an Oak instance.  Oak would then be
> able to choose which data store to use based on a number of criteria, like
> file size, JCR path, node type, existence of a node property, a node
> property value, or other items, or a combination of items.  In my thinking
> these are defined in configuration so the federated data store would know
> how to select which data store is used to store which binary.
>
> I think this is a step towards UC14 - Hierarchical BlobStore in [0].  Once
> the federated data store is implemented we should be able to support UC14
> with little work.  I can also foresee other possible capabilities it could
> offer, such as storing blobs for different node types in different data
> stores, or choosing from a few different data stores based on geographic
> location (UC2 in [0]).
>
> In my mind we could add capability to DataStoreBlobStore.writeStream(),
> which is where the decision is currently made whether to write a stream to
> the data store delegate or keep it in memory.  Instead, we could defer the
> decision directly to the delegate, adding a method to the appropriate
> interface (BlobStore or GarbageCollectibleBlobStore) to handle this
> decision, and default the decision in AbstractBlobStore to be based on the
> record size (which is the current behavior, except that currently the
> decision is made in DataStoreBlobStore, IIUC).  All other existing data
> stores would then behave the same.  But in the case of the federated data
> store this decision would be more involved, selecting the right data store
> based on configuration.
>
> The federated data store would need to exist independently of other data
> stores, so figuring out how to create those data stores without a code
> dependency would be a challenge.
>
>
> Please let me know what you think: is my idea about the implementation
> flawed, is there a better way to accomplish this, what concerns are there
> about it, etc.?  I'd like to brainstorm with the list something that can
> work in this area and then I'll create a ticket for it.  Or I can create
> the ticket, and we can have the discussion in the ticket.  Let me know
> which is best.
>
>
> [0] - https://wiki.apache.org/jackrabbit/JCR%20Binary%20Usecase
>
>
> - Matt Ryan
>


A federated data store

2017-04-20 Thread Matt Ryan
Hi,

I'm looking at the possibility of creating a new kind of data store, let's
call it a federated data store, and wanted to see what everyone thinks
about this.

The basic idea is that the federated data store would allow for more than
one data store to be configured for an Oak instance.  Oak would then be
able to choose which data store to use based on a number of criteria, like
file size, JCR path, node type, existence of a node property, a node
property value, or other items, or a combination of items.  In my thinking
these are defined in configuration so the federated data store would know
how to select which data store is used to store which binary.
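
To make that a bit more concrete, here is a toy sketch of what the
configured selection might look like (the class names, criteria, and
delegate names are all made up for illustration):

    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.function.Predicate;

    // Toy example: ordered criteria mapping to delegate data store names,
    // where the first matching criterion wins and there is a default delegate.
    public class SelectionCriteriaExample {

        // Stand-in for whatever context (size, path, node type, ...) is
        // available when the binary is stored.
        public static class BinaryContext {
            public final long size;
            public final String jcrPath;
            public BinaryContext(long size, String jcrPath) {
                this.size = size;
                this.jcrPath = jcrPath;
            }
        }

        public static void main(String[] args) {
            Map<Predicate<BinaryContext>, String> criteria = new LinkedHashMap<>();
            criteria.put(ctx -> ctx.size > 10 * 1024 * 1024, "S3DataStore");
            criteria.put(ctx -> ctx.jcrPath.startsWith("/content/archive"), "ArchiveDataStore");

            BinaryContext ctx = new BinaryContext(512, "/content/dam/asset.png");
            String delegate = criteria.entrySet().stream()
                    .filter(e -> e.getKey().test(ctx))
                    .map(Map.Entry::getValue)
                    .findFirst()
                    .orElse("FileDataStore"); // the default delegate
            System.out.println("Selected delegate: " + delegate);
        }
    }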

I think this is a step towards UC14 - Hierarchical BlobStore in [0].  Once
the federated data store is implemented we should be able to support UC14
with little work.  I can also foresee other possible capabilities it could
offer, such as storing blobs for different node types in different data
stores, or choosing from a few different data stores based on geographic
location (UC2 in [0]).

In my mind we could add capability to DataStoreBlobStore.writeStream(),
which is where the decision is currently made whether to write a stream to
the data store delegate or keep it in memory.  Instead, we could defer the
decision directly to the delegate, adding a method to the appropriate
interface (BlobStore or GarbageCollectibleBlobStore) to handle this
decision, and default the decision in AbstractBlobStore to be based on the
record size (which is the current behavior, except that currently the
decision is made in DataStoreBlobStore, IIUC).  All other existing data
stores would then behave the same.  But in the case of the federated data
store this decision would be more involved, selecting the right data store
based on configuration.
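
As a rough illustration of what that default decision might look like on
the delegate side (the class and method names are hypothetical, not
existing Oak code):

    // Hypothetical sketch: the delegate decides whether a record should be
    // stored externally, defaulting to a simple size threshold (matching
    // the current behavior described above). A federated implementation
    // would override this with configuration-driven selection.
    public abstract class AbstractDelegatingDataStoreSketch {

        // Records smaller than this are kept inline/in memory.
        private int minRecordLength = 4096;

        public boolean shouldWriteToDelegate(long recordSize) {
            return recordSize >= minRecordLength;
        }

        public void setMinRecordLength(int minRecordLength) {
            this.minRecordLength = minRecordLength;
        }
    }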

The federated data store would need to exist independently of other data
stores, so figuring out how to create those data stores without a code
dependency would be a challenge.


Please let me know what you think: is my idea about the implementation
flawed, is there a better way to accomplish this, what concerns are there
about it, etc.?  I'd like to brainstorm with the list something that can
work in this area and then I'll create a ticket for it.  Or I can create
the ticket, and we can have the discussion in the ticket.  Let me know
which is best.


[0] - https://wiki.apache.org/jackrabbit/JCR%20Binary%20Usecase


- Matt Ryan