[jira] [Comment Edited] (OAK-6922) Azure support for the segment-tar

2018-03-17 Thread JIRA

[ 
https://issues.apache.org/jira/browse/OAK-6922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16402371#comment-16402371
 ] 

Tomek Rękawek edited comment on OAK-6922 at 3/17/18 9:49 PM:
-

[~frm], [~mduerig] - thanks for the comments. I moved all the required 
interfaces to SPI packages and applied all the suggestions (I think). Since the 
changes in oak-segment-tar are now quite extensive, I created a separate 
issue to cover the SPI update: OAK-7355. See that issue for the patch and the 
summary of changes.

The [^OAK-6922-3.patch] now contains only the new Azure bundle; it requires 
OAK-7355 to work.



> Azure support for the segment-tar
> -
>
> Key: OAK-6922
> URL: https://issues.apache.org/jira/browse/OAK-6922
> Project: Jackrabbit Oak
>  Issue Type: New Feature
>  Components: segment-tar
>Reporter: Tomek Rękawek
>Assignee: Tomek Rękawek
>Priority: Major
> Fix For: 1.9.0, 1.10
>
> Attachments: OAK-6922-2.patch, OAK-6922-3.patch, OAK-6922.patch
>
>
> An Azure Blob Storage implementation of the segment storage, based on the 
> OAK-6921 work.
> h3. Segment files layout
> The new implementation doesn't use tar files. They are replaced with 
> directories, storing segments, named after their UUIDs. This approach has 
> the following advantages:
> * no need to call seek(), which may be expensive on a remote file system. 
> Instead, we can read the whole file (=segment) at once.
> * it's possible to send multiple segments at once, asynchronously, which 
> reduces the performance overhead (see below).
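
To illustrate the first advantage on local files only (no Azure API is involved), here is a minimal sketch contrasting the two access patterns. The class and method names are hypothetical and not taken from the patch; the offset and length arguments stand in for whatever a tar index would supply.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration on local files; not code from the patch.
final class SegmentReadSketch {

    /** Tar-style access: seek to an entry inside one large archive, then read it. */
    static byte[] readFromArchive(Path archive, long offset, int length) {
        try (RandomAccessFile file = new RandomAccessFile(archive.toFile(), "r")) {
            file.seek(offset);
            byte[] data = new byte[length];
            file.readFully(data);
            return data;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** One file per segment: the whole file is the segment, so no seek() is needed. */
    static byte[] readWholeSegment(Path segmentFile) {
        try {
            return Files.readAllBytes(segmentFile);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With one file per segment the caller needs only the name, while the archive variant also needs an offset and a length for every read.
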
> The file structure is as follows:
> {noformat}
> [~]$ az storage blob list -c oak --output table
> Name                                                      Blob Type   Blob Tier  Length  Content Type              Last Modified
> --------------------------------------------------------  ----------  ---------  ------  ------------------------  -------------------------
> oak/data0a.tar/.ca1326d1-edf4-4d53-aef0-0f14a6d05b63      BlockBlob              192     application/octet-stream  2018-01-31T10:59:14+00:00
> oak/data0a.tar/0001.c6e03426-db9d-4315-a20a-12559e6aee54  BlockBlob              262144  application/octet-stream  2018-01-31T10:59:14+00:00
> oak/data0a.tar/0002.b3784e27-6d16-4f80-afc1-6f3703f6bdb9  BlockBlob              262144  application/octet-stream  2018-01-31T10:59:14+00:00
> oak/data0a.tar/0003.5d2f9588-0c92-4547-abf7-0263ee7c37bb  BlockBlob              259216  application/octet-stream  2018-01-31T10:59:14+00:00
> ...
> oak/data0a.tar/006e.7b8cf63d-849a-4120-aa7c-47c3dde25e48  BlockBlob              4368    application/octet-stream  2018-01-31T12:01:09+00:00
> oak/data0a.tar/006f.93799ae9-288e-4b32-afc2-bbc676fad7e5  BlockBlob              3792    application/octet-stream  2018-01-31T12:01:14+00:00
> oak/data0a.tar/0070.8b2d5ff2-6a74-4ac3-a3cc-cc439367c2aa  BlockBlob              3680    application/octet-stream  2018-01-31T12:01:14+00:00
> oak/data0a.tar/0071.2a1c49f0-ce33-4777-a042-8aa8a704d202  BlockBlob              7760    application/octet-stream  2018-01-31T12:10:54+00:00
> oak/journal.log.001                                       AppendBlob             1010    application/octet-stream  2018-01-31T12:10:54+00:00
> oak/manifest                                              BlockBlob              46      application/octet-stream  2018-01-31T10:59:14+00:00
> oak/repo.lock                                             BlockBlob                      application/octet-stream  2018-01-31T10:59:14+00:00
> {noformat}
> For the segment files, each name is prefixed with the index number. This 
> makes it possible to maintain an order, as in the tar archive. This order is 
> normally stored in the index files as well, but if the index is missing, the 
> recovery process uses the prefixes to restore the order.
> Each file contains the raw segment data, with no padding/headers. Apart from 
> the segment files, there are 3 special files: binary references (.brf), 
> segment graph (.gph) and segment index (.idx).
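
A minimal sketch of how the index prefix could be used to restore the segment order during recovery. It assumes the prefix is a hexadecimal number separated from the UUID by the first dot (the listing goes from 006f to 0070, which suggests hex). The class and method names are hypothetical and not taken from the patch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.UUID;

// Hypothetical helper, not part of the patch: orders segment names by index prefix.
final class SegmentNameOrder {

    /** Parsed form of a name such as "006f.93799ae9-288e-4b32-afc2-bbc676fad7e5". */
    static final class Entry {
        final int index;
        final UUID uuid;

        Entry(int index, UUID uuid) {
            this.index = index;
            this.uuid = uuid;
        }
    }

    /** Splits the name at the first dot; the prefix is assumed to be hexadecimal. */
    static Entry parse(String name) {
        int dot = name.indexOf('.');
        if (dot <= 0) {
            throw new IllegalArgumentException("No index prefix in: " + name);
        }
        int index = Integer.parseInt(name.substring(0, dot), 16);
        UUID uuid = UUID.fromString(name.substring(dot + 1));
        return new Entry(index, uuid);
    }

    /** Returns the segment UUIDs in the order given by their index prefixes. */
    static List<UUID> recoverOrder(List<String> names) {
        List<Entry> entries = new ArrayList<>();
        for (String name : names) {
            entries.add(parse(name));
        }
        entries.sort(Comparator.comparingInt((Entry e) -> e.index));
        List<UUID> ordered = new ArrayList<>();
        for (Entry e : entries) {
            ordered.add(e.uuid);
        }
        return ordered;
    }
}
```

The names are passed without the directory part (for example without "oak/data0a.tar/"); a name with no prefix before the dot is rejected rather than guessed.
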
> h3. Asynchronous writes
> Normally, all the TarWriter writes are synchronous, appending the segments to 
> the tar file. In case of Azure Blob Stor

[jira] [Comment Edited] (OAK-6922) Azure support for the segment-tar

2018-02-27 Thread JIRA

[ 
https://issues.apache.org/jira/browse/OAK-6922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16337848#comment-16337848
 ] 

Tomek Rękawek edited comment on OAK-6922 at 2/27/18 11:57 AM:
--

The azure implementation:
https://github.com/trekawek/jackrabbit-oak/tree/OAK-6922


was (Author: tomek.rekawek):
The azure implementation:
https://github.com/trekawek/jackrabbit-oak/tree/segment-tar-trunk/azure

> Azure support for the segment-tar
> -
>
> Key: OAK-6922
> URL: https://issues.apache.org/jira/browse/OAK-6922
> Project: Jackrabbit Oak
>  Issue Type: New Feature
>  Components: segment-tar
>Reporter: Tomek Rękawek
>Assignee: Tomek Rękawek
>Priority: Major
> Fix For: 1.9.0, 1.10
>
> Attachments: OAK-6922.patch
>

[jira] [Comment Edited] (OAK-6922) Azure support for the segment-tar

2018-02-13 Thread JIRA

[ 
https://issues.apache.org/jira/browse/OAK-6922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16337848#comment-16337848
 ] 

Tomek Rękawek edited comment on OAK-6922 at 2/13/18 10:36 AM:
--

The azure implementation:
https://github.com/trekawek/jackrabbit-oak/tree/segment-tar-trunk/azure


was (Author: tomek.rekawek):
The azure implementation:
https://github.com/trekawek/jackrabbit-oak/tree/segment-tar/azure

> Azure support for the segment-tar
> -
>
> Key: OAK-6922
> URL: https://issues.apache.org/jira/browse/OAK-6922
> Project: Jackrabbit Oak
>  Issue Type: New Feature
>  Components: segment-tar
>Reporter: Tomek Rękawek
>Priority: Major
> Fix For: 1.9.0, 1.10
>
> Attachments: OAK-6922.patch
>
>
> An Azure Blob Storage implementation of the segment storage, based on the 
> OAK-6921 work.
> h3. Segment files layout
> The new implementation doesn't use tar files. They are replaced with 
> directories, storing segments, named after their UUIDs. This approach has 
> the following advantages:
> * no need to call seek(), which may be expensive on a remote file system. 
> Instead, we can read the whole file (=segment) at once.
> * it's possible to send multiple segments at once, asynchronously, which 
> reduces the performance overhead (see below).
> The file structure is as follows:
> {noformat}
> [~]$ az storage blob list -c oak --output table
> Name                                                      Blob Type   Blob Tier  Length  Content Type              Last Modified
> --------------------------------------------------------  ----------  ---------  ------  ------------------------  -------------------------
> oak/data0a.tar/.ca1326d1-edf4-4d53-aef0-0f14a6d05b63      BlockBlob              192     application/octet-stream  2018-01-31T10:59:14+00:00
> oak/data0a.tar/0001.c6e03426-db9d-4315-a20a-12559e6aee54  BlockBlob              262144  application/octet-stream  2018-01-31T10:59:14+00:00
> oak/data0a.tar/0002.b3784e27-6d16-4f80-afc1-6f3703f6bdb9  BlockBlob              262144  application/octet-stream  2018-01-31T10:59:14+00:00
> oak/data0a.tar/0003.5d2f9588-0c92-4547-abf7-0263ee7c37bb  BlockBlob              259216  application/octet-stream  2018-01-31T10:59:14+00:00
> ...
> oak/data0a.tar/006e.7b8cf63d-849a-4120-aa7c-47c3dde25e48  BlockBlob              4368    application/octet-stream  2018-01-31T12:01:09+00:00
> oak/data0a.tar/006f.93799ae9-288e-4b32-afc2-bbc676fad7e5  BlockBlob              3792    application/octet-stream  2018-01-31T12:01:14+00:00
> oak/data0a.tar/0070.8b2d5ff2-6a74-4ac3-a3cc-cc439367c2aa  BlockBlob              3680    application/octet-stream  2018-01-31T12:01:14+00:00
> oak/data0a.tar/0071.2a1c49f0-ce33-4777-a042-8aa8a704d202  BlockBlob              7760    application/octet-stream  2018-01-31T12:10:54+00:00
> oak/journal.log.001                                       AppendBlob             1010    application/octet-stream  2018-01-31T12:10:54+00:00
> oak/manifest                                              BlockBlob              46      application/octet-stream  2018-01-31T10:59:14+00:00
> oak/repo.lock                                             BlockBlob                      application/octet-stream  2018-01-31T10:59:14+00:00
> {noformat}
> For the segment files, each name is prefixed with the index number. This 
> makes it possible to maintain an order, as in the tar archive. This order is 
> normally stored in the index files as well, but if the index is missing, the 
> recovery process needs the prefixes to restore the order.
> Each file contains the raw segment data, with no padding/headers. Apart from 
> the segment files, there are 3 special files: binary references (.brf), 
> segment graph (.gph) and segment index (.idx).
> h3. Asynchronous writes
> Normally, all the TarWriter writes are synchronous, appending the segments to 
> the tar file. In the case of Azure Blob Storage, each write incurs network 
> latency. That's why the SegmentWriteQueue was introduced. The segments are 
> added to a blocking deque, which is served by a number of consumer 
> threads, writing the segments to the cloud. There's also a map UUID->Segment, 
> which makes it possible to return segments that are requested by the 
> readSegment() method before they are actually persisted. Segments are removed 
> from the map only after a successful write operation.
> The flush() method blocks the acceptance of new segments and returns after all 
> waiting segments are written. The close() method waits until the current 
> operations are finished and stops all threads.
> The asynchronous mode can be disabled by setting the number of threads to 0.
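
The queue described above could be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the SegmentWriteQueue from the patch: the class name, the byte[] payload and the BiConsumer standing in for the remote write are all hypothetical, and error handling and retries are left out (a failing write would leave its segment in the map and block flush()).

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.BlockingDeque;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.TimeUnit;
import java.util.function.BiConsumer;

// Simplified, hypothetical sketch; the SegmentWriteQueue in the patch may differ.
final class WriteQueueSketch {
    private final BlockingDeque<UUID> queue = new LinkedBlockingDeque<>();
    private final Map<UUID, byte[]> pending = new ConcurrentHashMap<>();
    private final BiConsumer<UUID, byte[]> writer; // performs the remote write
    private final Thread[] consumers;
    private final Object lock = new Object();
    private boolean accepting = true;        // guarded by lock
    private volatile boolean closed = false; // written under lock

    WriteQueueSketch(int threads, BiConsumer<UUID, byte[]> writer) {
        this.writer = writer;
        this.consumers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            consumers[i] = new Thread(this::consume, "segment-writer-" + i);
            consumers[i].setDaemon(true);
            consumers[i].start();
        }
    }

    void addSegment(UUID id, byte[] data) {
        synchronized (lock) {
            while (!accepting) await(); // a flush() is in progress
            if (closed) throw new IllegalStateException("queue is closed");
            if (consumers.length == 0) {
                writer.accept(id, data); // zero threads: synchronous mode
                return;
            }
            pending.put(id, data);
            queue.addLast(id);
        }
    }

    /** Returns a segment that is still waiting to be written, or null. */
    byte[] readSegment(UUID id) {
        return pending.get(id);
    }

    void flush() {
        synchronized (lock) {
            accepting = false;
            try {
                while (!pending.isEmpty()) await();
            } finally {
                accepting = true;
                lock.notifyAll();
            }
        }
    }

    void close() {
        flush();
        synchronized (lock) { closed = true; }
        for (Thread t : consumers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    private void consume() {
        while (!closed || !queue.isEmpty()) {
            try {
                UUID id = queue.poll(100, TimeUnit.MILLISECONDS);
                if (id == null) continue;
                writer.accept(id, pending.get(id));
                synchronized (lock) {
                    pending.remove(id); // only after a successful write
                    lock.notifyAll();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    private void await() {
        try {
            lock.wait();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
    }
}
```

Passing 0 as the thread count makes addSegment() call the writer directly, which mirrors the switch to synchronous mode mentioned above.
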
> h