[jira] [Commented] (OAK-8552) Minimize network calls required when creating a direct download URI

2019-08-28 Thread Amit Jain (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-8552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16918283#comment-16918283
 ] 

Amit Jain commented on OAK-8552:


[~alexander.klimetschek] 
Yes, nothing needs to change now. I was just trying to highlight that there is 
dynamism involved with lastModified, while the other properties are stable over 
the lifetime of the blob even without de-duplication.
{quote}With direct binary access we also no longer support de-duplication (too 
costly), so that aspect can be ignored for binaries uploaded that way or if a 
corresponding configuration is set
{quote}
But binaries can still be uploaded through JCR, so I guess an application would 
still have a mix of both.
{quote}But there is still copying/versioning of nodes which leads to multiple 
references to the same blob - does this lead to a last modified update of the 
blob?
{quote}
No, copying nodes would not update the timestamp of the blob.

Also, there's the case of a 'Shared' DataStore, where nodes might have been 
replicated from another instance and both instances share the DataStore.

> Minimize network calls required when creating a direct download URI
> ---
>
> Key: OAK-8552
> URL: https://issues.apache.org/jira/browse/OAK-8552
> Project: Jackrabbit Oak
>  Issue Type: Sub-task
>  Components: blob-cloud, blob-cloud-azure
>Reporter: Matt Ryan
>Assignee: Matt Ryan
>Priority: Major
> Fix For: 1.18.0
>
> Attachments: OAK-8552_ApiChange.patch
>
>
> We need to isolate and try to optimize network calls required to create a 
> direct download URI.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (OAK-8552) Minimize network calls required when creating a direct download URI

2019-08-28 Thread Matt Ryan (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-8552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16918033#comment-16918033
 ] 

Matt Ryan commented on OAK-8552:


Speeding up signed URI generation is fixed with 
[r1866044|https://svn.apache.org/viewvc?view=revision&revision=1866044].

I believe that is sufficient to resolve this issue, considering that the other 
aspect (improving the check to determine if a binary was inlined) was resolved 
via OAK-8578 IIUC.  [~amitjain] please reopen if you disagree.



[jira] [Commented] (OAK-8552) Minimize network calls required when creating a direct download URI

2019-08-28 Thread Alexander Klimetschek (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-8552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16917841#comment-16917841
 ] 

Alexander Klimetschek commented on OAK-8552:


[~amitjain] I don’t think [~ianeboston] was suggesting to use the 
jcr:lastModified JCR property for that. All of these could be internal fields 
not exposed on the JCR level.

With direct binary access we also no longer support de-duplication (too 
costly), so that aspect can be ignored for binaries uploaded that way or if a 
corresponding configuration is set. But there is still copying/versioning of 
nodes which leads to multiple references to the same blob - does this lead to a 
last modified update of the blob? (Just curious because I wasn’t aware of that)

Nonetheless, for the issue at hand I don’t think we need to change anything 
stored in the NodeStore. The size/length is already stored in the NS as part of 
the internal blob id.
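
As a rough illustration of that last point, here is a minimal sketch of reading the length from the id without any network call. It assumes the internal id is shaped like "<hash>#<length>"; the class name, method name and separator are hypothetical and only for illustration, not Oak API.

```java
import java.util.OptionalLong;

// Hypothetical helper, not Oak API: reads a length suffix from a blob id.
final class BlobIdLengthSketch {
    // Assumed separator between the content hash and the length.
    private static final char SEPARATOR = '#';

    static OptionalLong lengthOf(String blobId) {
        int idx = blobId.lastIndexOf(SEPARATOR);
        if (idx < 0 || idx == blobId.length() - 1) {
            return OptionalLong.empty(); // no length suffix present
        }
        try {
            return OptionalLong.of(Long.parseLong(blobId.substring(idx + 1)));
        } catch (NumberFormatException e) {
            return OptionalLong.empty(); // suffix is not a number
        }
    }

    public static void main(String[] args) {
        System.out.println(lengthOf("0a1b2c3d#4096").getAsLong());
        System.out.println(lengthOf("0a1b2c3d").isPresent());
    }
}
```

The point is only that the length is answerable from the NodeStore side alone; an id without the suffix simply yields an empty result.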



[jira] [Commented] (OAK-8552) Minimize network calls required when creating a direct download URI

2019-08-28 Thread Amit Jain (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-8552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16917525#comment-16917525
 ] 

Amit Jain commented on OAK-8552:


[~ianeboston]

 
{quote}the Oak NodeStore (SegmentNodeStore or DocumentNodeStore) should be the 
record of authority for blob existence, length, lastModified
{quote}
The node's jcr:lastModified does not reflect the blob's lastModified timestamp; 
it only signifies when the node was last modified (IIUC the spec also says 
that [1]). I am also not sure it could reflect the blob's lastModified without 
taking a severe performance hit. The reason is that blobs are de-duplicated: 
when an already existing blob is uploaded to JCR, it is not uploaded to the 
DataStore again; only its lastModified timestamp in the DataStore is updated. 
This update to the blob's lastModified is a requirement for DGC.

The updated lastModified of the blob cannot be propagated retroactively to all 
the nodes that already reference it without a performance hit (and maybe a 
design change: the DataStore is the lowest layer and has no information about 
the NodeStore, and de-duplication with a SHA hash is an implementation detail 
not known to the NodeStore). 
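
To make the de-duplication behaviour concrete, here is a toy sketch, not the actual DataStore code. It assumes a content-addressed store keyed by a SHA-256 hash; all names are hypothetical. A second upload of the same bytes only touches lastModified, and nothing in the store knows which nodes reference the id.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Toy content-addressed store (hypothetical, for illustration only).
final class DedupStoreSketch {
    // content hash -> last-modified timestamp in millis
    private final Map<String, Long> lastModifiedById = new HashMap<>();
    private int uploads = 0;

    String addRecord(byte[] content, long now) {
        String id = sha256Hex(content);
        if (!lastModifiedById.containsKey(id)) {
            uploads++; // new content: this is where a real upload would happen
        }
        // New or already present, the timestamp is refreshed so GC sees it as in use.
        lastModifiedById.put(id, now);
        return id;
    }

    long lastModified(String id) {
        return lastModifiedById.get(id);
    }

    int uploads() {
        return uploads;
    }

    static String sha256Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        DedupStoreSketch store = new DedupStoreSketch();
        String first = store.addRecord(new byte[] {1, 2, 3}, 1000L);
        String second = store.addRecord(new byte[] {1, 2, 3}, 2000L);
        System.out.println(first.equals(second) + " uploads=" + store.uploads()
                + " lastModified=" + store.lastModified(first));
    }
}
```

Any node that stored the id after the first call would still see the old timestamp unless something walked back over all references, which is the cost described above.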

[1] - [https://docs.adobe.com/docs/en/spec/jcr/2.0/3_Repository_Model.html]



[jira] [Commented] (OAK-8552) Minimize network calls required when creating a direct download URI

2019-08-27 Thread Matt Ryan (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-8552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16916987#comment-16916987
 ] 

Matt Ryan commented on OAK-8552:


I did an implementation of how this would work using the config option to 
disable the existence check.  See this [pull 
request|https://github.com/apache/jackrabbit-oak/compare/trunk...mattvryan:OAK-8552-with-config-to-disable-existence-check?expand=1]
 for details, and please feel free to comment.



[jira] [Commented] (OAK-8552) Minimize network calls required when creating a direct download URI

2019-08-27 Thread Ian Boston (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-8552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16916482#comment-16916482
 ] 

Ian Boston commented on OAK-8552:

I have not digested all the details, but IMHO, the Oak NodeStore 
(SegmentNodeStore or DocumentNodeStore) should be the record of authority for 
blob existence, length, lastModified, blobID, such that to answer any question 
about a binary, only the NodeStore should have to be consulted, and no network 
API calls made.  Only when a process actively needs to validate the data in 
NodeStore or interact directly with the binaries (upload, download, streaming) 
should it be forced to make network API calls to the blob storage API.

IIUC there are edge cases highlighted by Alex that violate this principle (async 
upload?), but in general the principle holds. Also IIUC, making the Oak 
NodeStore the record of authority might require some additional properties to 
be stored (existence flag, length?, lastModified?, etc.?)

Sorry if I have oversimplified.



[jira] [Commented] (OAK-8552) Minimize network calls required when creating a direct download URI

2019-08-26 Thread Alexander Klimetschek (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-8552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16916427#comment-16916427
 ] 

Alexander Klimetschek commented on OAK-8552:


[~amitjain] +1 to Blob#isInlined(); that seems like the right solution in place 
of the getReference() == null check.
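
A minimal sketch of what such a check could look like, with stand-in types rather than the real Oak interfaces. The interface name, the default implementation and the helper method are all assumptions for illustration; the actual proposal is in the attached patch.

```java
// Stand-in types only; the real Blob interface lives in Oak and differs.
final class InlinedCheckSketch {
    interface SketchBlob {
        String getReference();

        // Hypothetical addition; a default keeps existing implementations compiling.
        default boolean isInlined() {
            return false;
        }
    }

    static final class InlineBlob implements SketchBlob {
        @Override public String getReference() {
            return null;
        }

        @Override public boolean isInlined() {
            return true;
        }
    }

    static final class ExternalBlob implements SketchBlob {
        private final String id;
        int referenceLookups = 0; // counts the potentially expensive calls

        ExternalBlob(String id) {
            this.id = id;
        }

        @Override public String getReference() {
            referenceLookups++; // in the real code this may hit the backend
            return id + ":ref";
        }
    }

    // Decides whether a download URI makes sense without calling getReference().
    static boolean canHaveDownloadUri(SketchBlob blob) {
        return !blob.isInlined();
    }

    public static void main(String[] args) {
        ExternalBlob external = new ExternalBlob("abc");
        System.out.println(canHaveDownloadUri(external) + " lookups=" + external.referenceLookups);
        System.out.println(canHaveDownloadUri(new InlineBlob()));
    }
}
```

The caller learns whether the blob is inlined without ever asking for the reference, so no backend round trip is involved in that decision.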





[jira] [Commented] (OAK-8552) Minimize network calls required when creating a direct download URI

2019-08-26 Thread Alexander Klimetschek (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-8552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16916411#comment-16916411
 ] 

Alexander Klimetschek commented on OAK-8552:


To add some context:

In our case that uncovered this issue, the problem really only exists because 
in one special case we uploaded binaries through JCR with async uploads enabled 
in the caching DS, and then immediately after the session.save() requested 
presigned GET URLs to pass them on to an external service. That led us first 
to make the presigned URL generation support an existence check and "polling" 
by returning null for not-yet-in-blob-store cases. However, that shifts the 
solution to the wrong end and increases application complexity (polling loop).

(Also note this edge case is a short term solution to be replaced at some point 
with proper direct binary access for upload)

The source of the problem here is the async upload: we need to switch this to 
synchronous uploads = blocking session.save(), to avoid the issue in the first 
place.

In all regular cases, binaries are uploaded through the new direct binary 
access, which by design guarantees the presence of the binary when the 
reference is in the NodeStore. 

If the binary gets deleted from the blob store due to some malfunction, then 
it does not matter to the application whether we return null or a URL that 
returns 404. But the latter allows us to completely drop any existence checks 
upon presigned GET URL generation.

Same for inlined: configuration must prevent this in the first place, then no 
special check is required at access time.



[jira] [Commented] (OAK-8552) Minimize network calls required when creating a direct download URI

2019-08-26 Thread Matt Ryan (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-8552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16916403#comment-16916403
 ] 

Matt Ryan commented on OAK-8552:


More thinking about this - this could also be done via configuration.  But if 
done via config, the default would probably be to check existence, since async 
uploads are also the default (IIUC).  Turning off the existence check without 
also disabling async uploads would be inadvisable because of the condition 
addressed in OAK-7998.  But adding a config option here would be a simple fix 
that requires no API change and lets an instance skip the existence check for 
users who are aware of the tradeoff.

This would require a doc change if this approach was taken.
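
Roughly, the config approach could look like the sketch below. The class, the flag and the URI format are hypothetical placeholders, and the predicate stands in for the network call to the backend; this is not the actual backend code.

```java
import java.net.URI;
import java.util.function.Predicate;

// Hypothetical sketch of a backend with a configurable existence check.
final class ConfigurableExistenceCheck {
    private final boolean checkExistence;            // assumed config option, default true
    private final Predicate<String> existsInBackend; // stands in for the network call

    ConfigurableExistenceCheck(boolean checkExistence, Predicate<String> existsInBackend) {
        this.checkExistence = checkExistence;
        this.existsInBackend = existsInBackend;
    }

    URI createDownloadUri(String blobId) {
        // Short-circuit: the network call only happens when the check is enabled.
        if (checkExistence && !existsInBackend.test(blobId)) {
            return null; // blob not (yet) in storage
        }
        // Signing is a local computation; the URI below is a placeholder.
        return URI.create("https://example.invalid/container/" + blobId + "?sig=placeholder");
    }

    public static void main(String[] args) {
        ConfigurableExistenceCheck checked = new ConfigurableExistenceCheck(true, id -> false);
        ConfigurableExistenceCheck unchecked = new ConfigurableExistenceCheck(false, id -> false);
        System.out.println(checked.createDownloadUri("abc123"));
        System.out.println(unchecked.createDownloadUri("abc123"));
    }
}
```

With the check disabled, a caller may receive a URI that returns 404 until the upload has finished, which is the tradeoff mentioned above.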



[jira] [Commented] (OAK-8552) Minimize network calls required when creating a direct download URI

2019-08-26 Thread Matt Ryan (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-8552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16916280#comment-16916280
 ] 

Matt Ryan commented on OAK-8552:


Another option (mentioned in offline conversation w/ [~alexander.klimetschek]) 
would be somewhat of a combination of 2 and 3 in the previous comment.  In this 
case we would allow clients via the API to specify that they want an existence 
check performed before generating the URI.  The default would be to not check 
for existence.

This way clients would be able to specify whether they want an existence 
guarantee and are willing to pay the performance hit.

We could do this via the {{BinaryDownloadOptions}} object, perhaps via a new 
interface so we can avoid making a breaking API change?
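
One possible shape for this, as a sketch only: the types below are stand-ins and not the real {{BinaryDownloadOptions}} class, and the marker interface name is made up. Existing option objects keep working and default to "no check"; only options that implement the extra interface can request one.

```java
// Stand-in types; names are hypothetical and not part of any existing API.
final class ExistenceCheckOptionSketch {
    interface DownloadOptions {
        String getFileName();
    }

    // Hypothetical opt-in interface, added alongside the existing options type.
    interface ExistenceCheckAware {
        boolean isExistenceCheckRequested();
    }

    static class BasicOptions implements DownloadOptions {
        private final String fileName;

        BasicOptions(String fileName) {
            this.fileName = fileName;
        }

        @Override public String getFileName() {
            return fileName;
        }
    }

    static final class CheckedOptions extends BasicOptions implements ExistenceCheckAware {
        CheckedOptions(String fileName) {
            super(fileName);
        }

        @Override public boolean isExistenceCheckRequested() {
            return true;
        }
    }

    // Default is no check; only opted-in options trigger the network call.
    static boolean shouldCheckExistence(DownloadOptions options) {
        return options instanceof ExistenceCheckAware
                && ((ExistenceCheckAware) options).isExistenceCheckRequested();
    }

    public static void main(String[] args) {
        System.out.println(shouldCheckExistence(new BasicOptions("a.png")));
        System.out.println(shouldCheckExistence(new CheckedOptions("a.png")));
    }
}
```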



[jira] [Commented] (OAK-8552) Minimize network calls required when creating a direct download URI

2019-08-26 Thread Matt Ryan (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-8552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16916035#comment-16916035
 ] 

Matt Ryan commented on OAK-8552:


WRT removing the existence check from {{createHttpDownloadURI()}}, I see the 
following options:
 # Leave the code as-is and live with the O(100ms) cost of checking existence 
before creating the signed URI - but clients know the blob existed at the time 
the URI was created
 # Revert OAK-7998, dropping the cost of generating a signed URI to O(100 
microseconds) but clients may get a URI that returns a 404 (blob not yet in 
storage)
 # Leave the fix for OAK-7998 but add some form of cache or lookup table, which 
would consume additional memory but could speed up a non-zero number of signed 
URI generation requests, while still guaranteeing to clients that the blob 
existed at the time the URI was created
 ** If doing this we would need to figure out how to populate the cache or 
lookup table - on demand, at startup, etc.

I'm pretty sure [~ianeboston] would vote for #2 or #3.  Any other options or 
votes?
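
For option 3, an on-demand cache could be sketched as below. This assumes a bounded LRU map that remembers only positive results; the class name is hypothetical and the predicate stands in for the backend's network call.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical on-demand cache of blob ids known to exist in the backend.
final class ExistenceCacheSketch {
    private final Predicate<String> backendExists; // stands in for the network call
    private final Map<String, Boolean> known;

    ExistenceCacheSketch(int maxEntries, Predicate<String> backendExists) {
        this.backendExists = backendExists;
        // Access-ordered LinkedHashMap gives simple LRU eviction.
        this.known = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > maxEntries;
            }
        };
    }

    synchronized boolean exists(String blobId) {
        if (known.get(blobId) != null) {
            return true; // cache hit, no network call
        }
        boolean exists = backendExists.test(blobId);
        if (exists) {
            known.put(blobId, Boolean.TRUE); // cache positives only
        }
        return exists;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        ExistenceCacheSketch cache = new ExistenceCacheSketch(100, id -> {
            calls[0]++;
            return true;
        });
        cache.exists("a");
        cache.exists("a");
        System.out.println("backend calls: " + calls[0]);
    }
}
```

Negative results are deliberately not cached, since a blob that is missing now may arrive shortly via an async upload. A cached positive can still go stale if the blob is later removed, so the guarantee is slightly weaker than checking every time.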

 



[jira] [Commented] (OAK-8552) Minimize network calls required when creating a direct download URI

2019-08-25 Thread Amit Jain (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-8552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16915493#comment-16915493
 ] 

Amit Jain commented on OAK-8552:


bq. I'm afraid I don't currently understand why "remove the existence check" 
would "lead to perceived performance drop"
Not removing the existence check but disabling the asynchronous uploads would 
lead to a perceived performance drop.

bq. I don't understand how "remove the existence check" would work at all
Yes it is about reverting OAK-7998. Without the existence check any download 
attempt using the signed download URI might fail because the blob backing it is 
not available (yet) in the cloud. I believe that is how things worked before 
introducing this additional check.



[jira] [Commented] (OAK-8552) Minimize network calls required when creating a direct download URI

2019-08-23 Thread Thomas Mueller (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-8552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16914292#comment-16914292
 ] 

Thomas Mueller commented on OAK-8552:

* #getReference
** My vote would go to "introduce a new cleaner API Blob#isInlined (name can be 
changed)". It might be more work, but I think the solution would be much clearer.
* #exists
** I think the exists method implies a network access (unless inlined).
** I'm afraid I don't currently understand why "remove the existence check" 
would "lead to perceived performance drop"... I don't understand how "remove 
the existence check" would work at all... I mean, it's reverting OAK-7998, 
right? If we revert OAK-7998 (which might be OK), then I assume we need to 
solve the root problem of OAK-7998 in some other way.



[jira] [Commented] (OAK-8552) Minimize network calls required when creating a direct download URI

2019-08-23 Thread Amit Jain (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-8552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16914136#comment-16914136
 ] 

Amit Jain commented on OAK-8552:


[~mattvryan]

Regarding the 2 issues above
 * #getReference - Tried a few optimizations:
 ** Removed the need to get a DataRecord instance by using a dummy DataRecord 
instance [1]. This fails the segment standby test case(s), which expect that a 
non-null getReference() means the blob is available locally [2]. That can be 
fixed, but the solution seems hacky.
 ** Another option is to introduce a new, cleaner API, Blob#isInlined (the name 
can be changed), as outlined in the patch [^OAK-8552_ApiChange.patch]. The 
changes touch a lot of places but are quite trivial. Test cases still need to 
be added.
 * #exists check -
 ** IIUC, the existence check is needed because of asynchronous uploads, so one 
option is to disable those and remove the existence check. That would lead to a 
perceived performance drop; perceived because with asynchronous uploads the JCR 
call returns quickly, but the time for a binary to reach the cloud backend is 
the same as with synchronous uploads, or even a little worse.
 ** Another option is to introduce an in-memory cache in the Backend for the 
ids that have been uploaded. The idea of using the BlobTracker does not work 
because it was only introduced for DSGC and it does not wait for an 
asynchronous upload to finish before adding the id. Also, if DSGC is meant to 
run outside the server (i.e. oak-run), then the BlobTracker would most likely 
be disabled.

[~tmueller] wdyt?
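
To illustrate the in-memory cache idea and the ordering requirement (the id must be recorded only once the asynchronous upload has actually finished), here is a rough sketch. The class is hypothetical and the CompletableFuture stands in for whatever the async upload mechanism really is.

```java
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical tracker: an id becomes "known" only after its upload completes.
final class UploadedIdTracker {
    private final Set<String> uploaded = ConcurrentHashMap.newKeySet();

    // Registers the id when, and only when, the upload future completes normally.
    CompletableFuture<Void> track(String blobId, CompletableFuture<Void> upload) {
        return upload.thenRun(() -> uploaded.add(blobId));
    }

    boolean isKnownUploaded(String blobId) {
        return uploaded.contains(blobId);
    }

    public static void main(String[] args) {
        UploadedIdTracker tracker = new UploadedIdTracker();
        CompletableFuture<Void> upload = new CompletableFuture<>();
        CompletableFuture<Void> tracked = tracker.track("abc", upload);
        System.out.println(tracker.isKnownUploaded("abc")); // upload still pending
        upload.complete(null);
        tracked.join();
        System.out.println(tracker.isKnownUploaded("abc")); // upload finished
    }
}
```

A failed upload never registers its id, so a lookup miss just means falling back to the real existence check. The cache is local to one instance, so it would not help with ids uploaded elsewhere.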

[1]
{code:java}
Index: oak-blob-plugins/src/main/java/org/apache/jackrabbit/oak/plugins/blob/datastore/DataStoreBlobStore.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===
--- oak-blob-plugins/src/main/java/org/apache/jackrabbit/oak/plugins/blob/datastore/DataStoreBlobStore.java (revision b4e0a5ba954b7de4b508aa197847223800f1c320)
+++ oak-blob-plugins/src/main/java/org/apache/jackrabbit/oak/plugins/blob/datastore/DataStoreBlobStore.java (date 1566293954000)
@@ -52,6 +52,8 @@
 import com.google.common.io.Closeables;
 import org.apache.commons.io.FileUtils;
 import org.apache.commons.io.IOUtils;
+import org.apache.jackrabbit.core.data.AbstractDataRecord;
+import org.apache.jackrabbit.core.data.AbstractDataStore;
 import org.apache.jackrabbit.core.data.DataIdentifier;
 import org.apache.jackrabbit.core.data.DataRecord;
 import org.apache.jackrabbit.core.data.DataStore;
@@ -312,16 +314,34 @@
             return null;
         }
 
-        DataRecord record;
-        try {
-            record = delegate.getRecordIfStored(new DataIdentifier(blobId));
-            if (record != null) {
-                return record.getReference();
-            } else {
-                log.debug("No blob found for id [{}]", blobId);
-            }
-        } catch (DataStoreException e) {
-            log.warn("Unable to access the blobId for  [{}]", blobId, e);
+        // Get reference without possible round-tripping using a dummy data record
+        if (delegate instanceof AbstractDataStore) {
+            return new AbstractDataRecord((AbstractDataStore) delegate, new DataIdentifier(blobId)) {
+
+                @Override public long getLength() {
+                    return 0;
+                }
+
+                @Override public InputStream getStream() {
+                    return null;
+                }
+
+                @Override public long getLastModified() {
+                    return 0;
+                }
+            }.getReference();
+        } else {
+            DataRecord record;
+            try {
+                record = delegate.getRecordIfStored(new DataIdentifier(blobId));
+                if (record != null) {
+                    return record.getReference();
+                } else {
+                    log.debug("No blob found for id [{}]", blobId);
+                }
+            } catch (DataStoreException e) {
+                log.warn("Unable to access the blobId for  [{}]", blobId, e);
+            }
         }
         return  null;
     }
{code}
[2] 
[https://github.com/apache/jackrabbit-oak/blob/trunk/oak-segment-tar/src/main/java/org/apache/jackrabbit/oak/segment/standby/client/RemoteBlobProcessor.java#L78-L89]


[jira] [Commented] (OAK-8552) Minimize network calls required when creating a direct download URI

2019-08-21 Thread Matt Ryan (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-8552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16912527#comment-16912527
 ] 

Matt Ryan commented on OAK-8552:


Simply removing the existence check implemented in OAK-7998 changes the stats 
as follows:
 * Getting a download URI for binaries uploaded through the JCR - average time 
drops from 65 milliseconds to around 120 microseconds (0.12 milliseconds).
 * Getting a download URI for binaries uploaded directly - average time drops 
from 200 milliseconds to around 130 milliseconds.

I still think we need the fix in OAK-7998, but clearly if we can eliminate a 
network call to check existence that will help, although it won't completely 
solve the problem by itself.



[jira] [Commented] (OAK-8552) Minimize network calls required when creating a direct download URI

2019-08-21 Thread Matt Ryan (Jira)


[ 
https://issues.apache.org/jira/browse/OAK-8552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16912519#comment-16912519
 ] 

Matt Ryan commented on OAK-8552:


I wrote a simple test program which first creates a basic repo with binaries in 
it, then tries to get a download URI for each binary and times each step 
(getting the {{Binary}} object, and then requesting the download URI).

Here's what I found:
 * Getting the {{Binary}} object from the repository takes almost no time.  
Average time is around 30 microseconds (0.03 milliseconds).
 * Almost all of the total time is in getting the download URI.  In my testing 
today, average time to get a download URI is around 65 milliseconds for a 
binary uploaded through the JCR and around 200 milliseconds for a binary 
uploaded via direct upload.  The difference between the two is probably the 
cost of the {{getReference()}} call mentioned above; my guess is the binaries 
uploaded via the JCR are probably still in cache so that makes 
{{getReference()}} faster for those.

I will try a few different optimizations and post results.
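
For reference, the per-step timing was along these lines. This is a simplified sketch, not the actual test program; the helper name is made up and the supplier stands in for the step being measured (getting the {{Binary}}, or requesting the download URI).

```java
import java.util.function.Supplier;

// Simplified timing helper (hypothetical; not the actual test program).
final class TimingSketch {
    // Runs the step the given number of times and returns the average in microseconds.
    static <T> double averageMicros(int iterations, Supplier<T> step) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            step.get();
        }
        long elapsedNanos = System.nanoTime() - start;
        return elapsedNanos / 1_000.0 / iterations;
    }

    public static void main(String[] args) {
        // The lambda is a placeholder for the call being measured.
        double avg = averageMicros(1000, () -> "placeholder".length());
        System.out.println("average micros per call: " + avg);
    }
}
```

Measured numbers will vary with network latency and cache state, so they are only meaningful relative to each other within one run.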



[jira] [Commented] (OAK-8552) Minimize network calls required when creating a direct download URI

2019-08-18 Thread Amit Jain (JIRA)


[ 
https://issues.apache.org/jira/browse/OAK-8552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16910176#comment-16910176
 ] 

Amit Jain commented on OAK-8552:


[~mattvryan] 
1. Essentially, I guess what you are saying is that there should be only one 
network call. I don't know enough about the Azure blob APIs, but maybe we can 
see if we can substitute the call below so that we don't need the additional 
download attributes call later.
{code}
CloudBlockBlob blob = getAzureContainer().getBlockBlobReference(key);
{code}

bq. But all that is really needed in this case is the reference, which can be 
obtained from the back end directly using the blob id - no network calls 
required.  
Yes it still makes a call for existence in the 
DataStoreBlobStore#getReference() which can be potentially removed. Not sure if 
there are cases when the call receives a blob id but is not available in the 
backend. But in such a case that will lead to situation 2, is it not?

bq. Furthermore, the reason we are even trying to get the reference in the 
first place is to determine if this blob is stored inline or not.  Maybe there 
is a better way to determine this.
I don't think there's much we can do here. The storage inline or not is an 
internal implementation of the blob store and the code in oak-store-spi would 
have to be aware of such details to check the same.

2. Here it seems there's no alternative but to check for existence. Is that 
right? Also, did the test that you conducted remove the existence check only 
here, or also in the condition observed in 1.? If it's just the newly added 
existence check, then that is the major cause of the slowdown, ~250 times 
(roughly 40 s vs. 147 ms). Not sure how much changing 1 will help according to 
your test.




[jira] [Commented] (OAK-8552) Minimize network calls required when creating a direct download URI

2019-08-16 Thread Matt Ryan (JIRA)


[ 
https://issues.apache.org/jira/browse/OAK-8552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16909545#comment-16909545
 ] 

Matt Ryan commented on OAK-8552:


The entry point for getting a direct download URI begins with a {{Binary}} 
instance and the {{getURI()}} call.

Known causes of network requests in this call:
 * Starting at 
[https://github.com/apache/jackrabbit-oak/blob/22c3be68e4bc7fdf811ab0fbb2471f2d026508e7/oak-store-spi/src/main/java/org/apache/jackrabbit/oak/plugins/value/jcr/BinaryImpl.java#L96]
 - the call to {{getReference()}} calls through the blob implementation into 
{{DataStoreBlobStore#getReference()}} which calls 
{{AbstractSharedCachingDataStore#getRecordIfStored()}}. If the blob is not 
cached this will result in a call to the backend's {{getRecord()}}.  For 
{{AzureBlobStoreBackend}}, for example, this actually currently makes two 
network calls - one to check if the blob exists, and another to get the blob 
metadata needed to construct the {{DataRecord}}.  (See 
[https://github.com/apache/jackrabbit-oak/blob/22c3be68e4bc7fdf811ab0fbb2471f2d026508e7/oak-blob-cloud-azure/src/main/java/org/apache/jackrabbit/oak/blob/cloud/azure/blobstorage/AzureBlobStoreBackend.java#L355].)
  But all that is really needed in this case is the reference, which can be 
obtained from the back end directly using the blob id - no network calls 
required.  Furthermore, the reason we are even trying to get the reference in 
the first place is to determine if this blob is stored inline or not.  Maybe 
there is a better way to determine this.
 * Starting at 
[https://github.com/apache/jackrabbit-oak/blob/22c3be68e4bc7fdf811ab0fbb2471f2d026508e7/oak-store-spi/src/main/java/org/apache/jackrabbit/oak/plugins/value/jcr/BinaryImpl.java#L107]
 - the call to {{getDownloadURI()}} eventually results in a call to the data 
store implementation's {{getDownloadURI()}} method.  In the case of 
{{AzureDataStore}}, this calls into the backend's {{createHttpDownloadURI()}} 
method which (now, due to OAK-7998) is checking that the binary exists - a 
network call - before creating the signed download URI.  Note that creating the 
download URI doesn't require the network call, but checking for the existence 
of the blob ID does.
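
To illustrate why the reference itself needs no network call, here is a sketch that derives it purely from the blob id and a locally held secret. It assumes the reference is the id plus an HMAC-SHA1 signature over the id; the class and method names are hypothetical, and the exact format used by the data store may differ.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical sketch: computes a reference locally from id + secret.
final class LocalReferenceSketch {
    private static final String ALGORITHM = "HmacSHA1";

    static String referenceFor(String blobId, byte[] secret) {
        try {
            Mac mac = Mac.getInstance(ALGORITHM);
            mac.init(new SecretKeySpec(secret, ALGORITHM));
            byte[] signature = mac.doFinal(blobId.getBytes(StandardCharsets.UTF_8));
            // Assumed layout: "<id>:<hex signature>"
            StringBuilder reference = new StringBuilder(blobId).append(':');
            for (byte b : signature) {
                reference.append(String.format("%02x", b));
            }
            return reference.toString();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] secret = "not-a-real-secret".getBytes(StandardCharsets.UTF_8);
        System.out.println(referenceFor("abc", secret));
    }
}
```

Everything here is a local computation, so the only reason to go to the backend in this path is the existence check, not the reference.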

In a benchmark test I showed that creating 1000 download URIs took just over 
40,000 milliseconds, averaging around 40 milliseconds per request.  This result 
is actually not that bad - but removing the existence check and running the 
test again dropped the time to 147 milliseconds for all 1000 URIs.  So we can 
see that if the network latency is bad this could potentially be a problem.
