Re: Feedback on S3DataStore JMX stats patch (OAK-4712)
Thanks Matt!! I will take a look tomorrow. Regards Amit On Thu, Sep 8, 2016 at 8:53 PM, Matt Ryanwrote: > New patch attached to address feedback in OAK-4712 on the last patch. > > On Thu, Sep 1, 2016 at 11:00 AM, Matt Ryan wrote: > >> This patch I believe addresses the issues identified with the previous >> patch. I've also uploaded it to OAK-4712. >> >> Looking for review and feedback. >> >> This patch introduces a new interface I've named NodeIdMapper.java. Not >> sure where to put this. I've added it in order to be able to obtain the >> blob ID for a named node within the context of a NodeStoreService class by >> using an anonymous class. However, most of the functionality in each case >> (DocumentNodeStoreService and SegmentNodeStoreService) is similar. This >> could probably be addressed with a lot less code duplication by using an >> abstract base class, but since I'm not sure where would be best to put such >> a class. Ideas? >> >> I also want to call attention to how the blob ID is being used in these >> anonymous classes in the *NodeStoreService classes. The IDs I was getting >> back had trailing information with a '#' followed by some number of >> digits. I just split the string and discarded the last part. Is there a >> better way to do this? >> >> >> On Sun, Aug 28, 2016 at 10:04 PM, Amit Jain wrote: >> >>> Hi Matt, >>> >>> I have directly replied to your comments on the jira. >>> >>> Thanks >>> Amit >>> >>> On Sat, Aug 27, 2016 at 4:22 AM, Matt Ryan wrote: >>> >>> > Use this patch instead; updated patch from latest in trunk. >>> > >>> > On Fri, Aug 26, 2016 at 4:41 PM, Matt Ryan wrote: >>> > >>> >> Hi Oak Devs, >>> >> >>> >> I've created OAK-4712 and submitted a patch for the same. I've >>> attached >>> >> the same patch to this email. >>> >> >>> >> The submission is to add a new MBean, S3DataStoreStats, which will >>> allow >>> >> reporting via JMX about the state of the S3DataStore. Two metrics are >>> >> intended. The first is to report the number of files that are in the >>> sync >>> >> state, meaning they have been added to the S3DataStore but are not yet >>> >> completely copied into S3. The second is to allow to query the sync >>> state >>> >> of a file - a return value of true would mean that the file provided >>> is >>> >> fully synchronized. This uses an external file name, not the >>> internal ID. >>> >> >>> >> I have some questions about the implementation first: >>> >> 1. For the request to query the sync state of a file, should the file >>> >> name provided be a full path to a local file or a path to a file >>> within OAK >>> >> (e.g. /content/dam/myimage.jpg)? Current implementation uses a local >>> file >>> >> path but I have been wondering if it should be an OAK path. >>> >> 2. For the request to query the sync state of a file, when converting >>> >> from the externally-supplied file name to an internal DataIdentifier, >>> this >>> >> implementation is performing the same calculation to determine the >>> internal >>> >> ID name as is done when a file is stored. I have a number of >>> concerns with >>> >> this: >>> >>- It is inefficient - the entire file has to be read and digested >>> in >>> >> order to compute the internal ID. This takes a long time for large >>> assets. >>> >>- I've essentially duplicated the logic from CachingDataStore into >>> >> S3DataStore to compute the internal ID. I hate duplicating the code, >>> but I >>> >> am trying to avoid exposing internal IDs in API, and I am not seeing >>> a good >>> >> way in the current implementation to avoid this without either >>> modifying >>> >> public API to CachingDataStore, or exposing the internal ID via API, >>> or >>> >> both. >>> >> >>> >> Any suggestions on these two issues? >>> >> >>> >> >>> >> I'm also experiencing a problem with this patch. In my testing it >>> >> appears to work fine, until I delete a file. For example, if I >>> delete an >>> >> asset via the REST API, I will see the asset deleted in CRXDE. >>> However, >>> >> the file still remains in S3. This MBean as implemented only knows >>> how to >>> >> check with S3DataStore and the corresponding backend, and these all >>> appear >>> >> to believe the file still exists. So the MBean continues to report >>> that >>> >> the file's sync state is synchronized (i.e. isFileSynced() returns >>> true) >>> >> even though the file has been removed from the JCR. Any suggestions >>> on how >>> >> to resolve this? >>> >> >>> >> >>> >> Finally, any feedback on the patch is welcome. And if I did the >>> process >>> >> wrong please correct me (gently) - first time submission here. Thanks >>> >> >>> >> >>> > >>> >> >> >
Re: Feedback on S3DataStore JMX stats patch (OAK-4712)
New patch attached to address feedback in OAK-4712 on the last patch. On Thu, Sep 1, 2016 at 11:00 AM, Matt Ryanwrote: > This patch I believe addresses the issues identified with the previous > patch. I've also uploaded it to OAK-4712. > > Looking for review and feedback. > > This patch introduces a new interface I've named NodeIdMapper.java. Not > sure where to put this. I've added it in order to be able to obtain the > blob ID for a named node within the context of a NodeStoreService class by > using an anonymous class. However, most of the functionality in each case > (DocumentNodeStoreService and SegmentNodeStoreService) is similar. This > could probably be addressed with a lot less code duplication by using an > abstract base class, but since I'm not sure where would be best to put such > a class. Ideas? > > I also want to call attention to how the blob ID is being used in these > anonymous classes in the *NodeStoreService classes. The IDs I was getting > back had trailing information with a '#' followed by some number of > digits. I just split the string and discarded the last part. Is there a > better way to do this? > > > On Sun, Aug 28, 2016 at 10:04 PM, Amit Jain wrote: > >> Hi Matt, >> >> I have directly replied to your comments on the jira. >> >> Thanks >> Amit >> >> On Sat, Aug 27, 2016 at 4:22 AM, Matt Ryan wrote: >> >> > Use this patch instead; updated patch from latest in trunk. >> > >> > On Fri, Aug 26, 2016 at 4:41 PM, Matt Ryan wrote: >> > >> >> Hi Oak Devs, >> >> >> >> I've created OAK-4712 and submitted a patch for the same. I've >> attached >> >> the same patch to this email. >> >> >> >> The submission is to add a new MBean, S3DataStoreStats, which will >> allow >> >> reporting via JMX about the state of the S3DataStore. Two metrics are >> >> intended. The first is to report the number of files that are in the >> sync >> >> state, meaning they have been added to the S3DataStore but are not yet >> >> completely copied into S3. The second is to allow to query the sync >> state >> >> of a file - a return value of true would mean that the file provided is >> >> fully synchronized. This uses an external file name, not the internal >> ID. >> >> >> >> I have some questions about the implementation first: >> >> 1. For the request to query the sync state of a file, should the file >> >> name provided be a full path to a local file or a path to a file >> within OAK >> >> (e.g. /content/dam/myimage.jpg)? Current implementation uses a local >> file >> >> path but I have been wondering if it should be an OAK path. >> >> 2. For the request to query the sync state of a file, when converting >> >> from the externally-supplied file name to an internal DataIdentifier, >> this >> >> implementation is performing the same calculation to determine the >> internal >> >> ID name as is done when a file is stored. I have a number of concerns >> with >> >> this: >> >>- It is inefficient - the entire file has to be read and digested in >> >> order to compute the internal ID. This takes a long time for large >> assets. >> >>- I've essentially duplicated the logic from CachingDataStore into >> >> S3DataStore to compute the internal ID. I hate duplicating the code, >> but I >> >> am trying to avoid exposing internal IDs in API, and I am not seeing a >> good >> >> way in the current implementation to avoid this without either >> modifying >> >> public API to CachingDataStore, or exposing the internal ID via API, or >> >> both. >> >> >> >> Any suggestions on these two issues? >> >> >> >> >> >> I'm also experiencing a problem with this patch. In my testing it >> >> appears to work fine, until I delete a file. For example, if I delete >> an >> >> asset via the REST API, I will see the asset deleted in CRXDE. >> However, >> >> the file still remains in S3. This MBean as implemented only knows >> how to >> >> check with S3DataStore and the corresponding backend, and these all >> appear >> >> to believe the file still exists. So the MBean continues to report >> that >> >> the file's sync state is synchronized (i.e. isFileSynced() returns >> true) >> >> even though the file has been removed from the JCR. Any suggestions >> on how >> >> to resolve this? >> >> >> >> >> >> Finally, any feedback on the patch is welcome. And if I did the >> process >> >> wrong please correct me (gently) - first time submission here. Thanks >> >> >> >> >> > >> > > diff --git a/oak-blob-cloud/pom.xml b/oak-blob-cloud/pom.xml index 800f716..ad3cb3c 100644 --- a/oak-blob-cloud/pom.xml +++ b/oak-blob-cloud/pom.xml @@ -41,7 +41,7 @@ maven-bundle-plugin - org.apache.jackrabbit.oak.blob.cloud.aws.s3 + org.apache.jackrabbit.oak.blob.cloud.aws.s3,org.apache.jackrabbit.oak.blob.cloud.aws.s3.stats sun.io
Re: Feedback on S3DataStore JMX stats patch (OAK-4712)
This patch I believe addresses the issues identified with the previous patch. I've also uploaded it to OAK-4712. Looking for review and feedback. This patch introduces a new interface I've named NodeIdMapper.java. Not sure where to put this. I've added it in order to be able to obtain the blob ID for a named node within the context of a NodeStoreService class by using an anonymous class. However, most of the functionality in each case (DocumentNodeStoreService and SegmentNodeStoreService) is similar. This could probably be addressed with a lot less code duplication by using an abstract base class, but since I'm not sure where would be best to put such a class. Ideas? I also want to call attention to how the blob ID is being used in these anonymous classes in the *NodeStoreService classes. The IDs I was getting back had trailing information with a '#' followed by some number of digits. I just split the string and discarded the last part. Is there a better way to do this? On Sun, Aug 28, 2016 at 10:04 PM, Amit Jainwrote: > Hi Matt, > > I have directly replied to your comments on the jira. > > Thanks > Amit > > On Sat, Aug 27, 2016 at 4:22 AM, Matt Ryan wrote: > > > Use this patch instead; updated patch from latest in trunk. > > > > On Fri, Aug 26, 2016 at 4:41 PM, Matt Ryan wrote: > > > >> Hi Oak Devs, > >> > >> I've created OAK-4712 and submitted a patch for the same. I've attached > >> the same patch to this email. > >> > >> The submission is to add a new MBean, S3DataStoreStats, which will allow > >> reporting via JMX about the state of the S3DataStore. Two metrics are > >> intended. The first is to report the number of files that are in the > sync > >> state, meaning they have been added to the S3DataStore but are not yet > >> completely copied into S3. The second is to allow to query the sync > state > >> of a file - a return value of true would mean that the file provided is > >> fully synchronized. This uses an external file name, not the internal > ID. > >> > >> I have some questions about the implementation first: > >> 1. For the request to query the sync state of a file, should the file > >> name provided be a full path to a local file or a path to a file within > OAK > >> (e.g. /content/dam/myimage.jpg)? Current implementation uses a local > file > >> path but I have been wondering if it should be an OAK path. > >> 2. For the request to query the sync state of a file, when converting > >> from the externally-supplied file name to an internal DataIdentifier, > this > >> implementation is performing the same calculation to determine the > internal > >> ID name as is done when a file is stored. I have a number of concerns > with > >> this: > >>- It is inefficient - the entire file has to be read and digested in > >> order to compute the internal ID. This takes a long time for large > assets. > >>- I've essentially duplicated the logic from CachingDataStore into > >> S3DataStore to compute the internal ID. I hate duplicating the code, > but I > >> am trying to avoid exposing internal IDs in API, and I am not seeing a > good > >> way in the current implementation to avoid this without either modifying > >> public API to CachingDataStore, or exposing the internal ID via API, or > >> both. > >> > >> Any suggestions on these two issues? > >> > >> > >> I'm also experiencing a problem with this patch. In my testing it > >> appears to work fine, until I delete a file. For example, if I delete > an > >> asset via the REST API, I will see the asset deleted in CRXDE. However, > >> the file still remains in S3. This MBean as implemented only knows how > to > >> check with S3DataStore and the corresponding backend, and these all > appear > >> to believe the file still exists. So the MBean continues to report that > >> the file's sync state is synchronized (i.e. isFileSynced() returns true) > >> even though the file has been removed from the JCR. Any suggestions on > how > >> to resolve this? > >> > >> > >> Finally, any feedback on the patch is welcome. And if I did the process > >> wrong please correct me (gently) - first time submission here. Thanks > >> > >> > > > diff --git a/oak-blob-cloud/pom.xml b/oak-blob-cloud/pom.xml index 800f716..ad3cb3c 100644 --- a/oak-blob-cloud/pom.xml +++ b/oak-blob-cloud/pom.xml @@ -41,7 +41,7 @@ maven-bundle-plugin - org.apache.jackrabbit.oak.blob.cloud.aws.s3 + org.apache.jackrabbit.oak.blob.cloud.aws.s3,org.apache.jackrabbit.oak.blob.cloud.aws.s3.stats sun.io @@ -101,6 +101,13 @@ ${jackrabbit.version} + + +org.apache.jackrabbit +oak-commons +${project.version} + + com.amazonaws @@ -140,6 +147,12 @@
Re: Feedback on S3DataStore JMX stats patch (OAK-4712)
Hi Matt, I have directly replied to your comments on the jira. Thanks Amit On Sat, Aug 27, 2016 at 4:22 AM, Matt Ryanwrote: > Use this patch instead; updated patch from latest in trunk. > > On Fri, Aug 26, 2016 at 4:41 PM, Matt Ryan wrote: > >> Hi Oak Devs, >> >> I've created OAK-4712 and submitted a patch for the same. I've attached >> the same patch to this email. >> >> The submission is to add a new MBean, S3DataStoreStats, which will allow >> reporting via JMX about the state of the S3DataStore. Two metrics are >> intended. The first is to report the number of files that are in the sync >> state, meaning they have been added to the S3DataStore but are not yet >> completely copied into S3. The second is to allow to query the sync state >> of a file - a return value of true would mean that the file provided is >> fully synchronized. This uses an external file name, not the internal ID. >> >> I have some questions about the implementation first: >> 1. For the request to query the sync state of a file, should the file >> name provided be a full path to a local file or a path to a file within OAK >> (e.g. /content/dam/myimage.jpg)? Current implementation uses a local file >> path but I have been wondering if it should be an OAK path. >> 2. For the request to query the sync state of a file, when converting >> from the externally-supplied file name to an internal DataIdentifier, this >> implementation is performing the same calculation to determine the internal >> ID name as is done when a file is stored. I have a number of concerns with >> this: >>- It is inefficient - the entire file has to be read and digested in >> order to compute the internal ID. This takes a long time for large assets. >>- I've essentially duplicated the logic from CachingDataStore into >> S3DataStore to compute the internal ID. I hate duplicating the code, but I >> am trying to avoid exposing internal IDs in API, and I am not seeing a good >> way in the current implementation to avoid this without either modifying >> public API to CachingDataStore, or exposing the internal ID via API, or >> both. >> >> Any suggestions on these two issues? >> >> >> I'm also experiencing a problem with this patch. In my testing it >> appears to work fine, until I delete a file. For example, if I delete an >> asset via the REST API, I will see the asset deleted in CRXDE. However, >> the file still remains in S3. This MBean as implemented only knows how to >> check with S3DataStore and the corresponding backend, and these all appear >> to believe the file still exists. So the MBean continues to report that >> the file's sync state is synchronized (i.e. isFileSynced() returns true) >> even though the file has been removed from the JCR. Any suggestions on how >> to resolve this? >> >> >> Finally, any feedback on the patch is welcome. And if I did the process >> wrong please correct me (gently) - first time submission here. Thanks >> >> >
Re: Feedback on S3DataStore JMX stats patch (OAK-4712)
Use this patch instead; updated patch from latest in trunk. On Fri, Aug 26, 2016 at 4:41 PM, Matt Ryanwrote: > Hi Oak Devs, > > I've created OAK-4712 and submitted a patch for the same. I've attached > the same patch to this email. > > The submission is to add a new MBean, S3DataStoreStats, which will allow > reporting via JMX about the state of the S3DataStore. Two metrics are > intended. The first is to report the number of files that are in the sync > state, meaning they have been added to the S3DataStore but are not yet > completely copied into S3. The second is to allow to query the sync state > of a file - a return value of true would mean that the file provided is > fully synchronized. This uses an external file name, not the internal ID. > > I have some questions about the implementation first: > 1. For the request to query the sync state of a file, should the file name > provided be a full path to a local file or a path to a file within OAK > (e.g. /content/dam/myimage.jpg)? Current implementation uses a local file > path but I have been wondering if it should be an OAK path. > 2. For the request to query the sync state of a file, when converting from > the externally-supplied file name to an internal DataIdentifier, this > implementation is performing the same calculation to determine the internal > ID name as is done when a file is stored. I have a number of concerns with > this: >- It is inefficient - the entire file has to be read and digested in > order to compute the internal ID. This takes a long time for large assets. >- I've essentially duplicated the logic from CachingDataStore into > S3DataStore to compute the internal ID. I hate duplicating the code, but I > am trying to avoid exposing internal IDs in API, and I am not seeing a good > way in the current implementation to avoid this without either modifying > public API to CachingDataStore, or exposing the internal ID via API, or > both. > > Any suggestions on these two issues? > > > I'm also experiencing a problem with this patch. In my testing it appears > to work fine, until I delete a file. For example, if I delete an asset via > the REST API, I will see the asset deleted in CRXDE. However, the file > still remains in S3. This MBean as implemented only knows how to check > with S3DataStore and the corresponding backend, and these all appear to > believe the file still exists. So the MBean continues to report that the > file's sync state is synchronized (i.e. isFileSynced() returns true) even > though the file has been removed from the JCR. Any suggestions on how to > resolve this? > > > Finally, any feedback on the patch is welcome. And if I did the process > wrong please correct me (gently) - first time submission here. Thanks > > diff --git a/oak-blob-cloud/pom.xml b/oak-blob-cloud/pom.xml index 800f716..ad3cb3c 100644 --- a/oak-blob-cloud/pom.xml +++ b/oak-blob-cloud/pom.xml @@ -41,7 +41,7 @@ maven-bundle-plugin - org.apache.jackrabbit.oak.blob.cloud.aws.s3 + org.apache.jackrabbit.oak.blob.cloud.aws.s3,org.apache.jackrabbit.oak.blob.cloud.aws.s3.stats sun.io @@ -101,6 +101,13 @@ ${jackrabbit.version} + + +org.apache.jackrabbit +oak-commons +${project.version} + + com.amazonaws @@ -140,6 +147,12 @@ logback-classic test + +org.mockito +mockito-core +1.10.19 +test + diff --git a/oak-blob-cloud/src/main/java/org/apache/jackrabbit/oak/blob/cloud/aws/s3/S3DataStore.java b/oak-blob-cloud/src/main/java/org/apache/jackrabbit/oak/blob/cloud/aws/s3/S3DataStore.java index fc21bf6..be2ed54 100644 --- a/oak-blob-cloud/src/main/java/org/apache/jackrabbit/oak/blob/cloud/aws/s3/S3DataStore.java +++ b/oak-blob-cloud/src/main/java/org/apache/jackrabbit/oak/blob/cloud/aws/s3/S3DataStore.java @@ -16,17 +16,64 @@ */ package org.apache.jackrabbit.oak.blob.cloud.aws.s3; +import java.io.FileInputStream; +import java.io.OutputStream; +import java.security.DigestOutputStream; +import java.security.MessageDigest; +import java.security.NoSuchAlgorithmException; import java.util.Properties; +import java.util.concurrent.ExecutionException; +import java.util.concurrent.TimeUnit; + +import com.google.common.cache.CacheBuilder; +import com.google.common.cache.CacheLoader; +import com.google.common.cache.LoadingCache; +import org.apache.commons.io.IOUtils; +import org.apache.commons.io.output.NullOutputStream; import org.apache.jackrabbit.core.data.Backend; import org.apache.jackrabbit.core.data.CachingDataStore; +import org.apache.jackrabbit.core.data.DataIdentifier; +import
Feedback on S3DataStore JMX stats patch (OAK-4712)
Hi Oak Devs, I've created OAK-4712 and submitted a patch for the same. I've attached the same patch to this email. The submission is to add a new MBean, S3DataStoreStats, which will allow reporting via JMX about the state of the S3DataStore. Two metrics are intended. The first is to report the number of files that are in the sync state, meaning they have been added to the S3DataStore but are not yet completely copied into S3. The second is to allow to query the sync state of a file - a return value of true would mean that the file provided is fully synchronized. This uses an external file name, not the internal ID. I have some questions about the implementation first: 1. For the request to query the sync state of a file, should the file name provided be a full path to a local file or a path to a file within OAK (e.g. /content/dam/myimage.jpg)? Current implementation uses a local file path but I have been wondering if it should be an OAK path. 2. For the request to query the sync state of a file, when converting from the externally-supplied file name to an internal DataIdentifier, this implementation is performing the same calculation to determine the internal ID name as is done when a file is stored. I have a number of concerns with this: - It is inefficient - the entire file has to be read and digested in order to compute the internal ID. This takes a long time for large assets. - I've essentially duplicated the logic from CachingDataStore into S3DataStore to compute the internal ID. I hate duplicating the code, but I am trying to avoid exposing internal IDs in API, and I am not seeing a good way in the current implementation to avoid this without either modifying public API to CachingDataStore, or exposing the internal ID via API, or both. Any suggestions on these two issues? I'm also experiencing a problem with this patch. In my testing it appears to work fine, until I delete a file. For example, if I delete an asset via the REST API, I will see the asset deleted in CRXDE. However, the file still remains in S3. This MBean as implemented only knows how to check with S3DataStore and the corresponding backend, and these all appear to believe the file still exists. So the MBean continues to report that the file's sync state is synchronized (i.e. isFileSynced() returns true) even though the file has been removed from the JCR. Any suggestions on how to resolve this? Finally, any feedback on the patch is welcome. And if I did the process wrong please correct me (gently) - first time submission here. Thanks diff --git a/oak-blob-cloud/pom.xml b/oak-blob-cloud/pom.xml index 800f716..ad3cb3c 100644 --- a/oak-blob-cloud/pom.xml +++ b/oak-blob-cloud/pom.xml @@ -41,7 +41,7 @@ maven-bundle-plugin - org.apache.jackrabbit.oak.blob.cloud.aws.s3 + org.apache.jackrabbit.oak.blob.cloud.aws.s3,org.apache.jackrabbit.oak.blob.cloud.aws.s3.stats sun.io @@ -101,6 +101,13 @@ ${jackrabbit.version} + + +org.apache.jackrabbit +oak-commons +${project.version} + + com.amazonaws @@ -140,6 +147,12 @@ logback-classic test + +org.mockito +mockito-core +1.10.19 +test + diff --git a/oak-blob-cloud/src/main/java/org/apache/jackrabbit/oak/blob/cloud/aws/s3/S3DataStore.java b/oak-blob-cloud/src/main/java/org/apache/jackrabbit/oak/blob/cloud/aws/s3/S3DataStore.java index fc21bf6..be2ed54 100644 --- a/oak-blob-cloud/src/main/java/org/apache/jackrabbit/oak/blob/cloud/aws/s3/S3DataStore.java +++ b/oak-blob-cloud/src/main/java/org/apache/jackrabbit/oak/blob/cloud/aws/s3/S3DataStore.java @@ -16,17 +16,64 @@ */ package org.apache.jackrabbit.oak.blob.cloud.aws.s3; +import java.io.FileInputStream; +import java.io.OutputStream; +import java.security.DigestOutputStream; +import java.security.MessageDigest; +import java.security.NoSuchAlgorithmException; import java.util.Properties; +import java.util.concurrent.ExecutionException; +import java.util.concurrent.TimeUnit; + +import com.google.common.cache.CacheBuilder; +import com.google.common.cache.CacheLoader; +import com.google.common.cache.LoadingCache; +import org.apache.commons.io.IOUtils; +import org.apache.commons.io.output.NullOutputStream; import org.apache.jackrabbit.core.data.Backend; import org.apache.jackrabbit.core.data.CachingDataStore; +import org.apache.jackrabbit.core.data.DataIdentifier; +import org.apache.jackrabbit.core.data.DataStoreException; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; /** * An Amazon S3 data store. */ public class S3DataStore extends CachingDataStore { + +/** + * Logger instance. + */ +