Re: How does oak cluster work
Hi,

> I'm still very interested in understanding some of the design choices the
> Oak core team has made, and why. What is the use case for long-lived
> snapshots? I would also like to understand how indexes are synced between
> nodes, the role of an Oak leader, and how the leader node election occurs.

One of the primary use cases is asynchronous index updates. The current index state is always associated with a snapshot of the repository. Potential updates of asynchronous indexes are checked periodically by comparing the current repository state with the snapshot referenced in the index. If an index update is needed, the new index state will reference the more recent state of the repository and release the older snapshot. Most of the time the repository only has to keep rather recent snapshots, but there may also be cases when a snapshot must be kept for a longer period of time, e.g. when an index is re-created.

Oak indexes are stored in the repository just like regular content. The only difference is that the actual index data (e.g. the Lucene files) is stored on hidden nodes. You don't see those nodes when you access the repository over the JCR API. The data is managed internally by Oak.

Leader election is not something done by Oak but delegated to another module. Apache Sling Discovery works well and you can find documentation here:
https://sling.apache.org/documentation/bundles/discovery-api-and-impl.html

More general information on Oak is also available here:
https://jackrabbit.apache.org/oak/docs/architecture/nodestate.html

Regards
Marcel
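The checkpoint-driven update cycle described in this reply can be illustrated with a toy model. This is only a sketch and not Oak's implementation: `ToyRepository`, `ToyAsyncIndexer` and all of their methods are hypothetical names. Only the idea of creating, retrieving and releasing a checkpoint is taken from the thread. The toy index maps a value to its path and assumes values are unique.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Toy repository (hypothetical, not Oak code): each commit yields a new immutable snapshot. */
class ToyRepository {
    private Map<String, String> head = Map.of();
    private final Map<String, Map<String, String>> checkpoints = new HashMap<>();
    private int checkpointCounter = 0;

    void commit(String path, String value) {
        Map<String, String> next = new HashMap<>(head);
        next.put(path, value);
        head = Map.copyOf(next); // older snapshots stay untouched
    }

    /** Pins the current head so it can be read later, and returns a reference to it. */
    String checkpoint() {
        String id = "cp-" + (++checkpointCounter);
        checkpoints.put(id, head);
        return id;
    }

    Map<String, String> retrieve(String id) { return checkpoints.get(id); }

    boolean release(String id) { return checkpoints.remove(id) != null; }

    int retainedCheckpoints() { return checkpoints.size(); }
}

/** Toy asynchronous indexer (hypothetical): the index state always references one checkpoint. */
class ToyAsyncIndexer {
    private final ToyRepository repo;
    private final Map<String, String> pathByValue = new TreeMap<>();
    private String indexedCheckpoint; // null until the first successful run

    ToyAsyncIndexer(ToyRepository repo) { this.repo = repo; }

    /** One periodic run; returns true if the index was updated. */
    boolean runOnce() {
        String current = repo.checkpoint();
        Map<String, String> after = repo.retrieve(current);
        Map<String, String> before =
                indexedCheckpoint == null ? Map.<String, String>of() : repo.retrieve(indexedCheckpoint);
        if (after.equals(before)) {
            repo.release(current); // nothing changed, drop the new snapshot
            return false;
        }
        for (Map.Entry<String, String> e : after.entrySet()) {
            String old = before.get(e.getKey());
            if (!e.getValue().equals(old)) {
                if (old != null) {
                    pathByValue.remove(old); // value changed, drop the stale entry
                }
                pathByValue.put(e.getValue(), e.getKey());
            }
        }
        if (indexedCheckpoint != null) {
            repo.release(indexedCheckpoint); // the older snapshot is no longer needed
        }
        indexedCheckpoint = current;
        return true;
    }

    String lookup(String value) { return pathByValue.get(value); }
}
```

In this model at most one checkpoint is retained between runs. A full re-index would be the case where the referenced snapshot has to live for much longer.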
Re: How does oak cluster work
Hi Team,

I'm still very interested in understanding some of the design choices the Oak core team has made, and why. What is the use case for long-lived snapshots? I would also like to understand how indexes are synced between nodes, the role of an Oak leader, and how the leader node election occurs.

Thanks
Emily
Re: How does oak cluster work
Hi Marcel,

Thanks for the information. I would love to understand the use cases for having long-lived snapshots in Oak. Would you be able to provide specific examples or functions within Oak that need this capability?
Re: How does oak cluster work
Hi,

On 18.12.18, 01:55, "ems eril" wrote:

> 1) Is this a blocking call? Are there any plans for callback or Java Future
> support?

Yes, Clusterable.isVisible() is a blocking call and you can give it a timeout. There are no plans right now to add an async variant of this feature.

> 2) Is there any JCR-level API we can use, as it's currently very low level?

No, there is no JCR/Jackrabbit API equivalent for this feature.

> If not, does Sling have any plans to use this?

You will have to ask this on the Sling list.

> 3) Any reason why the DocumentStore needs to implement revision snapshotting?
> Why can't we leverage existing document store database capabilities such as
> MongoDB's https://docs.mongodb.com/manual/core/wiredtiger/, as most support
> MVCC?

In Oak we have the requirement to keep a snapshot of the repository for a longer period of time and not just for concurrency control. E.g. you can create a checkpoint with a lifetime of several days or even months [0].

Regards
Marcel

[0] https://jackrabbit.apache.org/oak/docs/apidocs/org/apache/jackrabbit/oak/spi/state/NodeStore.html#checkpoint-long-java.util.Map-
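Since there is no async variant, an application that wants a future could wrap the blocking call itself. The following is a minimal sketch under assumptions: `VisibilityCheck` is a hypothetical interface that only mirrors the shape of a blocking check with a timeout (a token plus a maximum wait in milliseconds), and `AsyncVisibility` is a made-up helper. Check the Clusterable Javadoc for the real signature before adapting this.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.Executor;

/** Hypothetical stand-in for a blocking visibility check with a timeout; not an Oak type. */
interface VisibilityCheck {
    boolean isVisible(String visibilityToken, long maxWaitMillis) throws InterruptedException;
}

final class AsyncVisibility {
    private AsyncVisibility() {}

    /**
     * Runs the blocking check on the given executor and exposes the result as a future.
     * The calling thread is not blocked; one executor thread is, for up to maxWaitMillis.
     */
    static CompletableFuture<Boolean> isVisibleAsync(
            VisibilityCheck check, String token, long maxWaitMillis, Executor executor) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                return check.isVisible(token, maxWaitMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt status
                throw new CompletionException(e);
            }
        }, executor);
    }
}
```

A caller can then chain work with `thenAccept(...)` instead of waiting. Using a dedicated executor keeps the blocked threads out of the common pool.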
Re: How does oak cluster work
Thanks Marcel, this is very helpful. I have a couple of questions about this interface:

1) Is this a blocking call? Are there any plans for callback or Java Future support?
2) Is there any JCR-level API we can use, as it's currently very low level? If not, does Sling have any plans to use this?
3) Any reason why the DocumentStore needs to implement revision snapshotting? Why can't we leverage existing document store database capabilities such as MongoDB's https://docs.mongodb.com/manual/core/wiredtiger/, as most support MVCC?

Thanks
Emily

On Sun, Dec 16, 2018 at 11:58 PM Marcel Reutegger wrote:

> Hi,
>
> There are different ways to approach this in Oak.
>
> Your application can register an event listener and get notified about
> changes when they are visible on the local cluster node.
>
> The application can store a visibility token with the job data you have in
> Kafka. The visibility token concept is described on the Clusterable [0]
> interface, which is an extension to the NodeStore implemented by the
> DocumentNodeStore. On the processing cluster node the visibility token is
> then used to suspend the job until the changes are visible.
>
> Regards
> Marcel
>
> [0] https://jackrabbit.apache.org/oak/docs/apidocs/org/apache/jackrabbit/oak/spi/state/Clusterable.html
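The visibility-token approach from the quoted reply can be sketched as follows. All names here are assumptions for illustration: `ClusterNodeView` is a hypothetical interface standing in for whatever the application uses to obtain and check tokens, and `ContentJob` and `JobFlow` are made up. The Kafka transport is left out; the point is only that the token travels with the job data.

```java
import java.util.function.Consumer;

/** Hypothetical view of one cluster node; not an Oak type. */
interface ClusterNodeView {
    /** Token describing the changes made on this node so far. */
    String getVisibilityToken();

    /** Blocks up to maxWaitMillis until the changes behind the token are visible here. */
    boolean isVisible(String visibilityToken, long maxWaitMillis) throws InterruptedException;
}

/** Job payload as it would be put on the queue: content path plus visibility token. */
final class ContentJob {
    final String contentPath;
    final String visibilityToken;

    ContentJob(String contentPath, String visibilityToken) {
        this.contentPath = contentPath;
        this.visibilityToken = visibilityToken;
    }
}

final class JobFlow {
    private JobFlow() {}

    /** Called on the node that wrote the content, after the write has been committed. */
    static ContentJob createJob(ClusterNodeView writer, String contentPath) {
        return new ContentJob(contentPath, writer.getVisibilityToken());
    }

    /**
     * Called on the processing node. The job is suspended until the writer's changes
     * are visible locally; returns false if they are not visible within the timeout,
     * in which case the caller would typically retry the job later.
     */
    static boolean process(ClusterNodeView reader, ContentJob job,
                           long maxWaitMillis, Consumer<String> handler) {
        try {
            if (!reader.isVisible(job.visibilityToken, maxWaitMillis)) {
                return false;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        handler.accept(job.contentPath);
        return true;
    }
}
```

With this shape the consumer never reads the content before the local node has caught up, so no extra read checks are needed in the handler itself.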
Re: How does oak cluster work
Hi Matt,

Yes, you're correct, the job is triggered by a consumer listening to a Kafka queue. But I have to disagree with your earlier statement that this is not an Oak issue. In MongoDB you can control the write concern and make replication synchronous, but we cannot do something similar in Oak.

Thanks

On Fri, Dec 14, 2018 at 3:25 PM Matt Ryan wrote:

> Hi,
>
> I believe your concern is: Content could be uploaded to the cluster via
> one Oak instance, and your job to process the content runs in a different
> Oak instance, and there is a possibility that the job to process the
> content reads from a MongoDB node that has stale data, so the content is
> not available yet.
>
> If I've understood your concern correctly, you are correct that this is
> something you have to worry about: there is a possibility that when
> the job runs it gets stale data because where it reads from has not been
> updated yet. However, that's not something being caused by Oak; this would
> be something you'd have to deal with whether Oak was there or not, no
> matter what type of backing database cluster was being used.
>
> Maybe I'm still missing something in your question. How are you planning
> to trigger your job?
Re: How does oak cluster work
Hi Matt,

I was looking for more details on the inner workings. I came across this https://markmail.org/message/jbkrsmz3krllqghr where it is mentioned that changes in the cluster eventually appear on other nodes, and that this is not a MongoDB-specific issue but something Oak has introduced. I can set the write concern to majority in MongoDB, but if Oak has its own eventual consistency model, this can cause stale reads from other nodes, which would be a problem for the distributed job I'm trying to create.

Thanks
Re: How does oak cluster work
Hi Emily,

Content is stored in Oak in two different configurable storage services. This is a bit of an oversimplification, but basically the structure of the content repository (the content tree, nodes, properties, etc.) is stored in a Node Store [0] and the binary content is stored in a Blob Store [1] (you'll also sometimes see the term "data store"). Oak manages all of this transparently to external clients.

Oak clustering is therefore achieved by configuring Oak instances to use clusterable storage services underneath [2]. For the node store, an implementation of a DocumentNodeStore [3] is needed; one such implementation uses MongoDB [4]. For the blob store, an implementation of a SharedDataStore is needed. For example, both the SharedS3DataStore and AzureDataStore implementations can be used as a data store for an Oak cluster.

So, assume you were using MongoDB and S3. Setting up an Oak cluster then merely means that you have more than one Oak instance, each of which is configured to use the MongoDB cluster as the node store, and S3 as the data store.

[0] - https://github.com/apache/jackrabbit-oak/blob/trunk/oak-doc/src/site/markdown/nodestore/overview.md
[1] - https://github.com/apache/jackrabbit-oak/blob/trunk/oak-doc/src/site/markdown/plugins/blobstore.md
[2] - https://github.com/apache/jackrabbit-oak/blob/trunk/oak-doc/src/site/markdown/clustering.md
[3] - https://github.com/apache/jackrabbit-oak/blob/trunk/oak-doc/src/site/markdown/nodestore/documentmk.md
[4] - https://github.com/apache/jackrabbit-oak/blob/trunk/oak-doc/src/site/markdown/nodestore/document/mongo-document-store.md

Does that help?

-MR

On Thu, Dec 13, 2018 at 5:52 PM ems eril wrote:

> Hi Team,
>
> I'm really interested in understanding how an Oak cluster works and how
> cluster nodes sync up. These are some of the questions I have:
>
> 1) How do the nodes sync?
> 2) What is MongoDB's role?
> 3) How do indexes in a cluster work and sync up?
> 4) What is the distributed model: master/slave or multi-master?
> 5) What is coordinated by the master node?
> 6) How is the master node elected?
>
> One use case I have is to leverage an Oak cluster to upload images/videos
> and have a consumer on one of the nodes process them in a distributed way.
> I would like to do my best to avoid unnecessary read checks if possible.
>
> Thanks
>
> Emily
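The split between node store and blob store described in Matt's reply can be pictured with a small toy model. This is a sketch only: `ToyNodeStore`, `ToyBlobStore` and `ToyOakInstance` are hypothetical classes, not Oak APIs, and the model deliberately ignores the delay with which changes become visible on other cluster nodes, which is the topic of the later messages in this thread.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** Toy shared node store (hypothetical): holds structure, here just path -> blob reference. */
class ToyNodeStore {
    private final Map<String, String> blobRefByPath = new ConcurrentHashMap<>();

    void setBlobRef(String path, String blobId) { blobRefByPath.put(path, blobId); }

    String getBlobRef(String path) { return blobRefByPath.get(path); }
}

/** Toy shared blob store (hypothetical): holds binaries under an id derived from the content. */
class ToyBlobStore {
    private final Map<String, byte[]> blobs = new ConcurrentHashMap<>();

    String write(byte[] data) {
        String id = UUID.nameUUIDFromBytes(data).toString();
        blobs.putIfAbsent(id, data.clone());
        return id;
    }

    byte[] read(String id) {
        byte[] data = blobs.get(id);
        return data == null ? null : data.clone();
    }
}

/** One cluster member: it owns no data, it is only configured with the two shared stores. */
class ToyOakInstance {
    private final ToyNodeStore nodeStore;
    private final ToyBlobStore blobStore;

    ToyOakInstance(ToyNodeStore nodeStore, ToyBlobStore blobStore) {
        this.nodeStore = nodeStore;
        this.blobStore = blobStore;
    }

    /** The binary goes to the blob store; the node store only records a reference to it. */
    void upload(String path, byte[] data) {
        nodeStore.setBlobRef(path, blobStore.write(data));
    }

    byte[] download(String path) {
        String id = nodeStore.getBlobRef(path);
        return id == null ? null : blobStore.read(id);
    }
}
```

Creating two `ToyOakInstance` objects over the same pair of stores corresponds to "more than one Oak instance, each configured to use the MongoDB cluster as the node store, and S3 as the data store": content uploaded through one instance can be downloaded through the other.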