[jira] [Commented] (HDFS-13186) [PROVIDED Phase 2] Multipart Uploader API

2019-03-14 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16793001#comment-16793001
 ] 

Steve Loughran commented on HDFS-13186:
---

New feature requirement: can the complete() call return (PathHandle, 
FileStatus)? I ask because in the S3A code, having a status returned atomically 
means the caller gets the exact details of the file just created, with no extra 
round-trip overhead and no need to worry about inconsistencies, especially if 
that status includes the etag and version ID.
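
A minimal sketch of what such a combined result could look like; 
{{CompleteResult}} is a hypothetical name, not an existing Hadoop class:

{code:java}
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.PathHandle;

/**
 * Hypothetical return type for complete(): pairs the opaque PathHandle with
 * the FileStatus of the object just created, so callers get length, etag and
 * version ID (where the store exposes them) without a second getFileStatus().
 */
public final class CompleteResult {
  private final PathHandle pathHandle;
  private final FileStatus status;

  public CompleteResult(PathHandle pathHandle, FileStatus status) {
    this.pathHandle = pathHandle;
    this.status = status;
  }

  public PathHandle getPathHandle() { return pathHandle; }

  public FileStatus getStatus() { return status; }
}
{code}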

> [PROVIDED Phase 2] Multipart Uploader API
> -
>
> Key: HDFS-13186
> URL: https://issues.apache.org/jira/browse/HDFS-13186
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Ewan Higgs
>Assignee: Ewan Higgs
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: HDFS-13186.001.patch, HDFS-13186.002.patch, 
> HDFS-13186.003.patch, HDFS-13186.004.patch, HDFS-13186.005.patch, 
> HDFS-13186.006.patch, HDFS-13186.007.patch, HDFS-13186.008.patch, 
> HDFS-13186.009.patch, HDFS-13186.010.patch
>
>
> To write files in parallel to an external storage system as in HDFS-12090, 
> there are two approaches:
>  # Naive approach: use a single datanode per file that copies blocks locally 
> as it streams data to the external service. This requires a copy for each 
> block inside the HDFS system and then a copy for the block to be sent to the 
> external system.
>  # Better approach: a single point (e.g. the Namenode or an SPS-style 
> external client) and the Datanodes coordinate a multipart, multinode upload.
> This system needs to work with multiple back ends and needs to coordinate 
> across the network. So we propose an API that resembles the following:
> {code:java}
> public UploadHandle multipartInit(Path filePath) throws IOException;
> public PartHandle multipartPutPart(InputStream inputStream,
>     int partNumber, UploadHandle uploadId) throws IOException;
> public void multipartComplete(Path filePath,
>     List<Pair<Integer, PartHandle>> handles,
>     UploadHandle multipartUploadId) throws IOException;{code}
> Here, UploadHandle and PartHandle are opaque handles in the vein of 
> PathHandle, so they can be serialized and deserialized in the hadoop-hdfs 
> project without knowing how to deserialize, e.g., S3A's version of an 
> UploadHandle and PartHandle.
> In an object store such as S3A, the implementation is straightforward. In 
> the case of writing multipart/multinode to HDFS, we can write each block as a 
> file part. The complete call will perform a concat on the blocks.
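
A brief usage sketch of the proposed call sequence above; the {{uploader}} 
instance, {{partStreamFor()}} helper and part count are assumptions for 
illustration, not committed API:

{code:java}
// Hedged sketch: the init / putPart / complete sequence from the description.
UploadHandle upload = uploader.multipartInit(dest);

List<Pair<Integer, PartHandle>> parts = new ArrayList<>();
for (int partNumber = 1; partNumber <= numParts; partNumber++) {
  // partStreamFor() is a hypothetical helper returning one part's bytes.
  try (InputStream in = partStreamFor(partNumber)) {
    parts.add(Pair.of(partNumber,
        uploader.multipartPutPart(in, partNumber, upload)));
  }
}

// Only after complete() should the destination file become visible.
uploader.multipartComplete(dest, parts, upload);
{code}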



[jira] [Commented] (HDFS-13186) [PROVIDED Phase 2] Multipart Uploader API

2019-03-07 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16786610#comment-16786610
 ] 

Steve Loughran commented on HDFS-13186:
---

We've started on the async code with openFile(), which, if your code can handle 
it, can compensate for the latency of the open.

We may even want to change the spec there to state that openFile() may postpone 
the existence checks until the first read, which would let us skip the initial 
HEAD check on the open and wait for the GET to fail, saving at least one round 
trip.

I think for the MPU API we should do the following:

# get the base PathCapabilities API in, with the core operations probed for, 
but allow that set to be expanded in future
# move MPU uploader creation to a method on the FS which takes a path, and 
declare that the MPU so instantiated only works for paths under the given 
directory. Why so: viewfs & co. may map different paths to different underlying 
stores
# then make it async. Our biggest limitation there is that java.lang closures 
don't take IOEs. I've added some helpers in org.apache.hadoop.fs.impl for 
implementations, but I don't want to expose that in any public API until we've 
done more internal use to understand how best to do this; a rough sketch of the 
wrapping problem follows below
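
As a rough illustration of that closure limitation: a minimal, hypothetical way 
to run IOException-throwing work through a CompletableFuture, independent of 
whatever the org.apache.hadoop.fs.impl helpers actually provide:

{code:java}
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;

/** Hypothetical sketch: Java lambdas cannot throw IOException, so tunnel it. */
final class AsyncIOSketch {

  @FunctionalInterface
  interface IOCallable<T> {
    T call() throws IOException;
  }

  /** Run an IOException-throwing operation asynchronously on an executor. */
  static <T> CompletableFuture<T> submit(IOCallable<T> op, Executor executor) {
    return CompletableFuture.supplyAsync(() -> {
      try {
        return op.call();
      } catch (IOException e) {
        // Tunnel the checked exception; callers unwrap UncheckedIOException.
        throw new UncheckedIOException(e);
      }
    }, executor);
  }
}
{code}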


[jira] [Commented] (HDFS-13186) [PROVIDED Phase 2] Multipart Uploader API

2019-03-07 Thread Ewan Higgs (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16786534#comment-16786534
 ] 

Ewan Higgs commented on HDFS-13186:
---

{quote}The HADOOP-15691 PathCapabilities patch is intended to allow callers to 
probe for a feature being available before making the API Call. This'd let you 
go{quote}
A capability model is much better, indeed.

{quote}Bear in mind I also want to move the MPU API to being async block 
uploads, complete calls. For the classic local and HDFS stores, these would 
actually be done in the current thread. For S3 they'd run in the thread pool, 
so you could trivially kick off a parallel upload of blocks from a single 
thread without even knowing that the FS impl worked that way.{quote}
This is a good idea. When designing the API I thought we wanted to stick to a 
synchronous model to be consistent with the rest of the APIs, but async is a 
much better fit here: these are remote calls, and we don't do anything like 
locking (which can be hairy in async code).


[jira] [Commented] (HDFS-13186) [PROVIDED Phase 2] Multipart Uploader API

2019-03-06 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16786039#comment-16786039
 ] 

Steve Loughran commented on HDFS-13186:
---

We are going to have an API to request an MPU on a path of an existing 
FileSystem/FileContext. This is needed because the service loader API is 
brittle and inflexible, and cannot handle things like ProxyFS, ViewFS, etc.

bq. Would it make sense to make a checksumFS MPU that throws upon creation?

But how do you load it through the service loader API, when that's bound to the 
FS scheme? And file:// matches both RawLocalFileSystem and the checksummed FS.

bq. but using inheritance to remove functionality as checksum FS is doing is 
already broken.

Not sure how else you'd do it.

The HADOOP-15691 PathCapabilities patch is intended to allow callers to probe 
for a feature being available before making the API call. This'd let you go:

{code}
if (fs.hasPathCapability("fs.path.multipart-upload", dest)) {
  MultipartUploader uploader = fs.createMultipartUpload(dest);
  ...
} else {
  // fallback
}
{code}

Bear in mind I also want to move the MPU API to async block uploads and 
complete calls. For the classic local and HDFS stores, these would actually be 
done in the current thread. For S3 they'd run in the thread pool, so you could 
trivially kick off a parallel upload of blocks from a single thread without 
even knowing that the FS impl worked that way.

[~fabbri] another use of this is that it effectively provides a stable API for 
the S3A committers to move to, one which could even be accessed through filter 
filesystems if needed, as well as a high-speed distcp. Currently distcp upload 
of very large files from HDFS to S3 is really slow because it's done a file at 
a time; this will enable block-at-a-time uploads.
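
A hedged sketch of that "parallel from a single thread" pattern, using the 
multipartPutPart()/multipartComplete() names from the issue description; the 
{{uploader}}, {{uploadId}}, {{dest}}, {{numParts}} and {{openPart()}} names are 
assumptions for illustration:

{code:java}
// Sketch only: kick off parallel part uploads from one thread, then complete.
ExecutorService executor = Executors.newFixedThreadPool(4);
List<CompletableFuture<Pair<Integer, PartHandle>>> futures = new ArrayList<>();

for (int part = 1; part <= numParts; part++) {
  final int n = part;
  futures.add(CompletableFuture.supplyAsync(() -> {
    // openPart(n) is a hypothetical helper returning that part's bytes.
    try (InputStream in = openPart(n)) {
      return Pair.of(n, uploader.multipartPutPart(in, n, uploadId));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }, executor));
}

List<Pair<Integer, PartHandle>> parts = new ArrayList<>();
for (CompletableFuture<Pair<Integer, PartHandle>> f : futures) {
  parts.add(f.join());    // propagates UncheckedIOException on failure
}
uploader.multipartComplete(dest, parts, uploadId);
executor.shutdown();
{code}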




[jira] [Commented] (HDFS-13186) [PROVIDED Phase 2] Multipart Uploader API

2019-03-06 Thread Ewan Higgs (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16785547#comment-16785547
 ] 

Ewan Higgs commented on HDFS-13186:
---

[~ste...@apache.org]
{quote}
Bad news; that new concat operation in raw local makes it possible to create 
files in a checksummed FS which don't have checksums: HADOOP-16150
{quote}
Would it make sense to make a ChecksumFS MPU that throws upon creation (e.g. 
the sketch below)? I don't like the approach, but using inheritance to remove 
functionality, the way ChecksumFileSystem already does, is already broken.
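
A minimal sketch of that idea; {{ChecksumFsMultipartUploader}} is a 
hypothetical class, not existing code:

{code:java}
/** Hypothetical: refuse multipart uploads on a checksummed filesystem. */
class ChecksumFsMultipartUploader {
  ChecksumFsMultipartUploader() {
    throw new UnsupportedOperationException(
        "Multipart upload is not supported on ChecksumFileSystem: "
        + "concat would produce files without checksums (HADOOP-16150)");
  }
}
{code}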


[jira] [Commented] (HDFS-13186) [PROVIDED Phase 2] Multipart Uploader API

2019-03-06 Thread Ewan Higgs (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16785546#comment-16785546
 ] 

Ewan Higgs commented on HDFS-13186:
---

[~fabbri]
{quote}What is the motivation for this?  Even if not part of FileSystem it is 
more surface area we need to deal with.{quote}

The motivation for this is to be able to write files to FileSystems in parallel 
and to survive upload failures without having to restart the entire upload. One 
immediate use case is that Tiered Storage can write files from datanodes to a 
synchronization endpoint without having to reassemble the files locally: the NN 
initializes the write and tells the DNs to upload the parts, and when they are 
done the NN commits the work. Further down the line, a tool like DistCp could 
be written in terms of this uploader, letting users/admins copy data from one 
HDFS cluster to another without having to stream blocks locally to a single 
worker on a single DN.


[jira] [Commented] (HDFS-13186) [PROVIDED Phase 2] Multipart Uploader API

2019-02-26 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778477#comment-16778477
 ] 

Steve Loughran commented on HDFS-13186:
---

Bad news: the new concat operation in RawLocalFileSystem makes it possible to 
create files in a checksummed FS which don't have checksums: HADOOP-16150.


[jira] [Commented] (HDFS-13186) [PROVIDED Phase 2] Multipart Uploader API

2018-11-05 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16676211#comment-16676211
 ] 

Sunil Govindan commented on HDFS-13186:
---

For future release managers' reference: the commit message for this Jira reads 
HADOOP-13186 instead of HDFS-13186.
{code:java}
Author: Chris Douglas 
Date: Sun Jun 17 11:54:26 2018 -0700

 HADOOP-13186. Multipart Uploader API. Contributed by Ewan Higgs{code}


[jira] [Commented] (HDFS-13186) [PROVIDED Phase 2] Multipart Uploader API

2018-07-03 Thread Aaron Fabbri (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16532066#comment-16532066
 ] 

Aaron Fabbri commented on HDFS-13186:
-

Hey folks. Catching up on stuff after vacation. Took a quick look at this. A 
couple of comments:

 # Thanks for keeping the API backend-agnostic (good layering).
 # What is the motivation for this? Even if it's not part of FileSystem, it is 
more surface area we need to deal with.

Looks like you have the basic idea of how to update S3Guard. Think of the 
S3Guard MetadataStore as a "trailing log of metadata changes made to the 
underlying bucket".


[jira] [Commented] (HDFS-13186) [PROVIDED Phase 2] Multipart Uploader API

2018-07-02 Thread Ewan Higgs (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16529946#comment-16529946
 ] 

Ewan Higgs commented on HDFS-13186:
---

{quote}Also doesn't work with S3Guard, because it's calling 
{{S3AFileSystem.getAmazonS3Client()}}; S3Guard won't notice when a new object 
is created.
{quote}
AIUI, all that's needed here is to add this to complete:
{code:java}
MetadataStore metadataStore = s3a.getMetadataStore();
S3AInstrumentation instrumentation = s3a.getInstrumentation();
S3AFileStatus status = s3aFileStatus(filePath, key, Collections.emptySet());
S3Guard.putAndReturn(metadataStore, status, instrumentation);{code}

{quote}
Ewan, if you are doing stuff in the hadoop FS APIs, I'll be expecting a 
followup "add this to the Hadoop FS API",
{quote}
This is intentionally not in the Hadoop FS API, but the contract tests are a 
very good idea.


[jira] [Commented] (HDFS-13186) [PROVIDED Phase 2] Multipart Uploader API

2018-07-02 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16529894#comment-16529894
 ] 

Steve Loughran commented on HDFS-13186:
---

You know, I've never actually looked at the code behind this.

Ewan, if you are doing stuff in the hadoop FS APIs, I'll be expecting a 
followup "add this to the Hadoop FS API", with some minimal contract tests to 
verify that all FSs implementing this have the correct semantics (not visible 
until completed; when completed you get the blocks you uploaded, in the right 
order; etc.). Sorry.

Also, it doesn't work with S3Guard, because it's calling 
{{S3AFileSystem.getAmazonS3Client()}}; S3Guard won't notice when a new object 
is created. I know that's not a direct concern of yours, but you are, I'm 
afraid, expected to at least work with it, so that those new tests you are 
going to have to add will work with S3Guard in auth mode.

thanks
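
A hedged sketch of the kind of contract test being described, checking the 
"not visible until completed" semantics; the scaffolding (getFileSystem(), 
path(), the {{uploader}} and the {{dataset}} byte array) and the method names 
follow the proposed API from the issue description and are assumptions, not an 
existing test suite:

{code:java}
@Test
public void testUploadNotVisibleUntilComplete() throws Exception {
  FileSystem fs = getFileSystem();            // assumed contract-test helper
  Path dest = path("multipart-visibility");   // assumed contract-test helper

  UploadHandle upload = uploader.multipartInit(dest);
  PartHandle part;
  try (InputStream in = new ByteArrayInputStream(dataset)) {
    part = uploader.multipartPutPart(in, 1, upload);
  }

  // Nothing should be visible at the destination before complete().
  assertFalse("file visible before complete()", fs.exists(dest));

  uploader.multipartComplete(dest,
      Collections.singletonList(Pair.of(1, part)), upload);

  // After complete(), the whole object is there with the expected length.
  assertEquals(dataset.length, fs.getFileStatus(dest).getLen());
}
{code}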




[jira] [Commented] (HDFS-13186) [PROVIDED Phase 2] Multipart Uploader API

2018-06-15 Thread genericqa (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514542#comment-16514542
 ] 

genericqa commented on HDFS-13186:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
20s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 6 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  1m 
39s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 23m 
44s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 26m 
59s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
22s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  3m 
46s{color} | {color:green} trunk passed {color} |
| {color:red}-1{color} | {color:red} shadedclient {color} | {color:red}  6m 
37s{color} | {color:red} branch has errors when building and testing our client 
artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  5m 
33s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  3m  
0s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
18s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  2m 
49s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 25m 
48s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 25m 
48s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 25m 
48s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
24s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  3m 
41s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green}  0m  
1s{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:red}-1{color} | {color:red} shadedclient {color} | {color:red}  2m 
22s{color} | {color:red} patch has errors when building and testing our client 
artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  6m  
6s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  2m 
56s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  8m 
15s{color} | {color:green} hadoop-common in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  1m 
45s{color} | {color:green} hadoop-hdfs-client in the patch passed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red}101m 24s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  4m 
39s{color} | {color:green} hadoop-aws in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  1m 
 2s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}231m 29s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.web.TestWebHdfsTimeouts |
|   | hadoop.hdfs.TestDFSInotifyEventInputStreamKerberized |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:abb62dd |
| JIRA Issue | HDFS-13186 |
| JIRA Patch URL | 

[jira] [Commented] (HDFS-13186) [PROVIDED Phase 2] Multipart Uploader API

2018-06-15 Thread Chris Douglas (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514298#comment-16514298
 ] 

Chris Douglas commented on HDFS-13186:
--

v010 just fixes the javadoc errors and adds more param javadoc for 
{{MultipartUploader}}. I'm +1 on this and will commit if Jenkins comes back 
clean.
