Re: S3 Writes using SIG4 Authentication

2018-03-07 Thread Rahul Huilgol
I was looking at SIG4's documentation for S3 earlier. The section on Chunked
Upload confused me because it says I need to pass a Content-Length header in
the request.
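For context, that chunked-upload section describes a streaming scheme where the
initial request's headers must already declare the total decoded length of the
payload. Roughly, from my reading of the docs (all values below are made-up
placeholders; this is just a sketch, not our code):

headers = {
    # The payload is signed chunk by chunk rather than as one blob.
    "x-amz-content-sha256": "STREAMING-AWS4-HMAC-SHA256-PAYLOAD",
    # The body is wrapped in aws-chunked framing.
    "Content-Encoding": "aws-chunked",
    # Total size of the decoded payload -- the field that forces you to know
    # the full size of the data up front.
    "x-amz-decoded-content-length": "66560",
    # Size of the framed body actually sent on the wire (chunks + signatures).
    "Content-Length": "66824",
}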

I now realize that I was using the terms `chunked upload` and `multipart
upload` interchangeably; they are actually different things.
Multipart upload works much like the existing behavior I described earlier:
each part is sent as a normal PUT request carrying an uploadId parameter, and
an individual part can itself be uploaded as multiple chunks if necessary. A
chunked upload requires the total size of that part up front, but a multipart
upload as a whole does not require the total length of the data beforehand.
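To make the distinction concrete, here is a minimal sketch of the multipart
flow using boto3, purely for illustration (the bucket, key, and data are
placeholders; the dmlc-core code talks to the REST API directly, so boto3 here
only shows the shape of the flow). Note that the total object size is never
declared before the final completion call:

import io
import boto3

s3 = boto3.client("s3")
bucket, key = "my-bucket", "my-key"            # placeholders, not real names

data = io.BytesIO(b"x" * (20 * 1024 * 1024))   # stand-in for the data being written
part_size = 8 * 1024 * 1024                    # parts must be >= 5 MB, except the last

# Start the upload; no total object size is declared here.
upload_id = s3.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]

parts, part_number = [], 1
while True:
    chunk = data.read(part_size)
    if not chunk:
        break
    # Each part is an ordinary PUT; only this part's own length is needed.
    resp = s3.upload_part(Bucket=bucket, Key=key, UploadId=upload_id,
                          PartNumber=part_number, Body=chunk)
    parts.append({"PartNumber": part_number, "ETag": resp["ETag"]})
    part_number += 1

# Finalize; only now does S3 assemble the parts into one object.
s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id,
                             MultipartUpload={"Parts": parts})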

I've now updated my PR to support writes as well.

Thanks for your help!

Regards,
Rahul

On Wed, Mar 7, 2018 at 9:56 AM, Bhavin Thaker 
wrote:

> Multi-part upload with finalization seems like a good approach for this
> problem.
>
> Bhavin Thaker.
>
> On Wed, Mar 7, 2018 at 7:45 AM Naveen Swamy  wrote:
>
> > Rahul,
> > IMO It is not Ok to write to a local file before streaming, you have to
> > consider security implications such as:
> > 1) will your local file be encrypted(encryption at rest)
> > 2) what happens if the process crashes, you will have to make sure the
> > local file is deleted in failure and process exit scenarios.
> >
> > My understanding is for multi part uploads it uses chunked transfer
> > encoding and for that you do not need to know the total size and only
> know
> > the chunked data size.
> > https://en.wikipedia.org/wiki/Chunked_transfer_encoding
> >
> > See this SO answer:
> >
> > https://stackoverflow.com/questions/8653146/can-i-stream-a-file-upload-to-s3-without-a-content-length-header
> >
> > Can you point to the literature that asks to know the total size.
> >
> > -Naveen
> >
> >
> > On Tue, Mar 6, 2018 at 10:34 PM, Rahul Huilgol 
> > wrote:
> >
> > > Hi Chris,
> > >
> > > S3 doesn't support append calls. They promote the use of multipart
> > uploads
> > > to upload large files in parallel, or when network reliability is an
> > issue.
> > > Writing like a stream does not seem to be the purpose of multipart
> > uploads.
> > >
> > > I looked into what the AWS SDK does (in Java). It buffers in memory
> > however
> > > large the file might be, and then uploads. I imagine this involves
> > > reallocating and copying the buffer to the larger buffer. There are few
> > > issues raised regarding this on the sdk repos like this
> > > . But this doesn't
> seem
> > to
> > > be something the SDKs can do anything about. People seem to be writing
> to
> > > temporary files and then uploading.
> > >
> > > Regards,
> > > Rahul
> > >
> > > On Tue, Mar 6, 2018 at 9:04 PM, Chris Olivier 
> > > wrote:
> > >
> > > > it seems strange that s3 would make such a major restriction. there’s
> > > > literally no way to incrementally write a file without knowing the
> size
> > > > beforehand? some sort of separate append calls, maybe?
> > > >
> > > > On Tue, Mar 6, 2018 at 8:53 PM Rahul Huilgol  >
> > > > wrote:
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > I have been looking at updating the authentication used by
> > S3FileSystem
> > > > in
> > > > > dmlc-core. Current code uses Signature version 2, which works only
> in
> > > the
> > > > > region us-east-1 now. We need to update the authentication scheme
> to
> > > use
> > > > > Signature version 4 (SIG4).
> > > > >
> > > > > I've submitted a PR 
> to
> > > > change
> > > > > this for Reads. But I wanted to seek out thoughts on what to do for
> > > > Writes,
> > > > > as there is a potential problem.
> > > > >
> > > > > *How writes to S3 work currently:*
> > > > > Whenever s3filesystem's stream.write() is called, data is buffered.
> > > When
> > > > > the buffer is full, a request is made to S3. Since this can happen
> > > > multiple
> > > > > times, multipart upload feature is used. An upload id is created
> when
> > > > > stream is initialized. This upload id is used till the stream is
> > > closed.
> > > > > Default buffer size is 64MB.
> > > > >
> > > > > *Problem:*
> > > > > The new SIG4 authentication scheme changes how multipart uploads
> > work.
> > > > Such
> > > > > an upload now requires that we know the total size of data to be
> sent
> > > > (sum
> > > > > of sizes of all parts) when we create the first request itself. We
> > need
> > > > to
> > > > > pass the total size of payload as part of header. This is not
> > possible
> > > > > given that we don't know all the write calls beforehand. For
> > example, a
> > > > > call to save model's parameters makes 145 calls to the stream's
> > write.
> > > > >
> > > > > *Approach?*
> > > > > Is it okay to buffer it to a local file, and then 

Re: S3 Writes using SIG4 Authentication

2018-03-07 Thread Bhavin Thaker
Multi-part upload with finalization seems like a good approach for this
problem.

Bhavin Thaker.

On Wed, Mar 7, 2018 at 7:45 AM Naveen Swamy  wrote:

> Rahul,
> IMO It is not Ok to write to a local file before streaming, you have to
> consider security implications such as:
> 1) will your local file be encrypted(encryption at rest)
> 2) what happens if the process crashes, you will have to make sure the
> local file is deleted in failure and process exit scenarios.
>
> My understanding is for multi part uploads it uses chunked transfer
> encoding and for that you do not need to know the total size and only know
> the chunked data size.
> https://en.wikipedia.org/wiki/Chunked_transfer_encoding
>
> See this SO answer:
>
> https://stackoverflow.com/questions/8653146/can-i-stream-a-file-upload-to-s3-without-a-content-length-header
>
> Can you point to the literature that asks to know the total size.
>
> -Naveen
>
>
> On Tue, Mar 6, 2018 at 10:34 PM, Rahul Huilgol 
> wrote:
>
> > Hi Chris,
> >
> > S3 doesn't support append calls. They promote the use of multipart
> uploads
> > to upload large files in parallel, or when network reliability is an
> issue.
> > Writing like a stream does not seem to be the purpose of multipart
> uploads.
> >
> > I looked into what the AWS SDK does (in Java). It buffers in memory
> however
> > large the file might be, and then uploads. I imagine this involves
> > reallocating and copying the buffer to the larger buffer. There are few
> > issues raised regarding this on the sdk repos like this
> > . But this doesn't seem
> to
> > be something the SDKs can do anything about. People seem to be writing to
> > temporary files and then uploading.
> >
> > Regards,
> > Rahul
> >
> > On Tue, Mar 6, 2018 at 9:04 PM, Chris Olivier 
> > wrote:
> >
> > > it seems strange that s3 would make such a major restriction. there’s
> > > literally no way to incrementally write a file without knowing the size
> > > beforehand? some sort of separate append calls, maybe?
> > >
> > > On Tue, Mar 6, 2018 at 8:53 PM Rahul Huilgol 
> > > wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > I have been looking at updating the authentication used by
> S3FileSystem
> > > in
> > > > dmlc-core. Current code uses Signature version 2, which works only in
> > the
> > > > region us-east-1 now. We need to update the authentication scheme to
> > use
> > > > Signature version 4 (SIG4).
> > > >
> > > > I've submitted a PR  to
> > > change
> > > > this for Reads. But I wanted to seek out thoughts on what to do for
> > > Writes,
> > > > as there is a potential problem.
> > > >
> > > > *How writes to S3 work currently:*
> > > > Whenever s3filesystem's stream.write() is called, data is buffered.
> > When
> > > > the buffer is full, a request is made to S3. Since this can happen
> > > multiple
> > > > times, multipart upload feature is used. An upload id is created when
> > > > stream is initialized. This upload id is used till the stream is
> > closed.
> > > > Default buffer size is 64MB.
> > > >
> > > > *Problem:*
> > > > The new SIG4 authentication scheme changes how multipart uploads
> work.
> > > Such
> > > > an upload now requires that we know the total size of data to be sent
> > > (sum
> > > > of sizes of all parts) when we create the first request itself. We
> need
> > > to
> > > > pass the total size of payload as part of header. This is not
> possible
> > > > given that we don't know all the write calls beforehand. For
> example, a
> > > > call to save model's parameters makes 145 calls to the stream's
> write.
> > > >
> > > > *Approach?*
> > > > Is it okay to buffer it to a local file, and then upload this file to
> > S3
> > > at
> > > > the end?
> > > > What use case do we have for writes to S3 generally? I believe we
> would
> > > > want to write params after training or logs. These wouldn't be too
> > large
> > > or
> > > > frequent I imagine. What would you suggest?
> > > >
> > > > Appreciate your thoughts and suggestions.
> > > >
> > > > Thanks,
> > > > Rahul Huilgol
> > > >
> > >
> >
> >
> >
> > --
> > Rahul Huilgol
> >
>


Re: S3 Writes using SIG4 Authentication

2018-03-07 Thread Naveen Swamy
Rahul,
IMO it is not OK to write to a local file before streaming; you have to
consider security implications such as:
1) Will your local file be encrypted (encryption at rest)?
2) What happens if the process crashes? You will have to make sure the local
file is deleted in failure and process-exit scenarios.

My understanding is that multipart uploads use chunked transfer encoding, and
for that you do not need to know the total size, only the size of each chunk:
https://en.wikipedia.org/wiki/Chunked_transfer_encoding
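As a generic illustration (not S3-specific; the endpoint below is a
placeholder): with the Python requests library, passing a generator as the
body makes it send Transfer-Encoding: chunked, so no Content-Length is ever
computed:

import requests

def gen_chunks():
    # Yield data as it becomes available; the total size is never computed,
    # only each chunk's own size is known when it is sent.
    for i in range(3):
        yield f"chunk {i}\n".encode()

# An iterator with no known length makes requests use chunked transfer
# encoding instead of setting a Content-Length header.
resp = requests.put("https://example.com/upload", data=gen_chunks())
print(resp.status_code)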

See this SO answer:
https://stackoverflow.com/questions/8653146/can-i-stream-a-file-upload-to-s3-without-a-content-length-header

Can you point to the documentation that requires knowing the total size?

-Naveen


On Tue, Mar 6, 2018 at 10:34 PM, Rahul Huilgol 
wrote:

> Hi Chris,
>
> S3 doesn't support append calls. They promote the use of multipart uploads
> to upload large files in parallel, or when network reliability is an issue.
> Writing like a stream does not seem to be the purpose of multipart uploads.
>
> I looked into what the AWS SDK does (in Java). It buffers in memory however
> large the file might be, and then uploads. I imagine this involves
> reallocating and copying the buffer to the larger buffer. There are few
> issues raised regarding this on the sdk repos like this
> . But this doesn't seem to
> be something the SDKs can do anything about. People seem to be writing to
> temporary files and then uploading.
>
> Regards,
> Rahul
>
> On Tue, Mar 6, 2018 at 9:04 PM, Chris Olivier 
> wrote:
>
> > it seems strange that s3 would make such a major restriction. there’s
> > literally no way to incrementally write a file without knowing the size
> > beforehand? some sort of separate append calls, maybe?
> >
> > On Tue, Mar 6, 2018 at 8:53 PM Rahul Huilgol 
> > wrote:
> >
> > > Hi everyone,
> > >
> > > I have been looking at updating the authentication used by S3FileSystem
> > in
> > > dmlc-core. Current code uses Signature version 2, which works only in
> the
> > > region us-east-1 now. We need to update the authentication scheme to
> use
> > > Signature version 4 (SIG4).
> > >
> > > I've submitted a PR  to
> > change
> > > this for Reads. But I wanted to seek out thoughts on what to do for
> > Writes,
> > > as there is a potential problem.
> > >
> > > *How writes to S3 work currently:*
> > > Whenever s3filesystem's stream.write() is called, data is buffered.
> When
> > > the buffer is full, a request is made to S3. Since this can happen
> > multiple
> > > times, multipart upload feature is used. An upload id is created when
> > > stream is initialized. This upload id is used till the stream is
> closed.
> > > Default buffer size is 64MB.
> > >
> > > *Problem:*
> > > The new SIG4 authentication scheme changes how multipart uploads work.
> > Such
> > > an upload now requires that we know the total size of data to be sent
> > (sum
> > > of sizes of all parts) when we create the first request itself. We need
> > to
> > > pass the total size of payload as part of header. This is not possible
> > > given that we don't know all the write calls beforehand. For example, a
> > > call to save model's parameters makes 145 calls to the stream's write.
> > >
> > > *Approach?*
> > > Is it okay to buffer it to a local file, and then upload this file to
> S3
> > at
> > > the end?
> > > What use case do we have for writes to S3 generally? I believe we would
> > > want to write params after training or logs. These wouldn't be too
> large
> > or
> > > frequent I imagine. What would you suggest?
> > >
> > > Appreciate your thoughts and suggestions.
> > >
> > > Thanks,
> > > Rahul Huilgol
> > >
> >
>
>
>
> --
> Rahul Huilgol
>


Re: S3 Writes using SIG4 Authentication

2018-03-07 Thread Bhavin Thaker
Hi Rahul,

Rahul> Is it okay to buffer it to a local file, and then upload this file
to S3 at the end?

Short answer: Yes.

S3 is object storage. See: https://en.m.wikipedia.org/wiki/Object_storage
The granularity of access for object storage is an object. This is
different from a filesystem or block storage where the granularity of
access is a block.

So I think it is perfectly reasonable to buffer/collect all the I/O locally,
either in a memory region or on a local filesystem, and do the S3 put once the
data is ready to be written to the remote S3 object over the network. Note
that this is also efficient, because local memory or disk I/O (assuming an SSD
or a high-grade rotating disk) will be faster than multiple S3 put calls over
a long-distance network.
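A minimal sketch of that approach, assuming boto3 and placeholder bucket/key
names; the temporary file is removed whether the upload succeeds or fails:

import os
import tempfile
import boto3

s3 = boto3.client("s3")
bucket, key = "my-bucket", "training/model.params"   # placeholders

buffers = [b"first chunk of params", b"second chunk"]   # stand-in for stream writes

# Collect all writes into a local temporary file first.
fd, path = tempfile.mkstemp(prefix="s3write-")
try:
    with os.fdopen(fd, "wb") as f:
        for buf in buffers:
            f.write(buf)
    # A single put at the end, once the full object is ready.
    s3.upload_file(path, bucket, key)
finally:
    # Remove the local copy on success or failure (a hard crash that skips
    # this cleanup would still leave the file behind).
    os.remove(path)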

Bhavin Thaker.

On Tue, Mar 6, 2018 at 10:34 PM Rahul Huilgol 
wrote:

> Hi Chris,
>
> S3 doesn't support append calls. They promote the use of multipart uploads
> to upload large files in parallel, or when network reliability is an issue.
> Writing like a stream does not seem to be the purpose of multipart uploads.
>
> I looked into what the AWS SDK does (in Java). It buffers in memory however
> large the file might be, and then uploads. I imagine this involves
> reallocating and copying the buffer to the larger buffer. There are few
> issues raised regarding this on the sdk repos like this
> . But this doesn't seem to
> be something the SDKs can do anything about. People seem to be writing to
> temporary files and then uploading.
>
> Regards,
> Rahul
>
> On Tue, Mar 6, 2018 at 9:04 PM, Chris Olivier 
> wrote:
>
> > it seems strange that s3 would make such a major restriction. there’s
> > literally no way to incrementally write a file without knowing the size
> > beforehand? some sort of separate append calls, maybe?
> >
> > On Tue, Mar 6, 2018 at 8:53 PM Rahul Huilgol 
> > wrote:
> >
> > > Hi everyone,
> > >
> > > I have been looking at updating the authentication used by S3FileSystem
> > in
> > > dmlc-core. Current code uses Signature version 2, which works only in
> the
> > > region us-east-1 now. We need to update the authentication scheme to
> use
> > > Signature version 4 (SIG4).
> > >
> > > I've submitted a PR  to
> > change
> > > this for Reads. But I wanted to seek out thoughts on what to do for
> > Writes,
> > > as there is a potential problem.
> > >
> > > *How writes to S3 work currently:*
> > > Whenever s3filesystem's stream.write() is called, data is buffered.
> When
> > > the buffer is full, a request is made to S3. Since this can happen
> > multiple
> > > times, multipart upload feature is used. An upload id is created when
> > > stream is initialized. This upload id is used till the stream is
> closed.
> > > Default buffer size is 64MB.
> > >
> > > *Problem:*
> > > The new SIG4 authentication scheme changes how multipart uploads work.
> > Such
> > > an upload now requires that we know the total size of data to be sent
> > (sum
> > > of sizes of all parts) when we create the first request itself. We need
> > to
> > > pass the total size of payload as part of header. This is not possible
> > > given that we don't know all the write calls beforehand. For example, a
> > > call to save model's parameters makes 145 calls to the stream's write.
> > >
> > > *Approach?*
> > > Is it okay to buffer it to a local file, and then upload this file to
> S3
> > at
> > > the end?
> > > What use case do we have for writes to S3 generally? I believe we would
> > > want to write params after training or logs. These wouldn't be too
> large
> > or
> > > frequent I imagine. What would you suggest?
> > >
> > > Appreciate your thoughts and suggestions.
> > >
> > > Thanks,
> > > Rahul Huilgol
> > >
> >
>
>
>
> --
> Rahul Huilgol
>


Re: S3 Writes using SIG4 Authentication

2018-03-06 Thread Rahul Huilgol
Hi Chris,

S3 doesn't support append calls. It promotes the use of multipart uploads to
upload large files in parallel, or when network reliability is an issue.
Writing like a stream does not seem to be the purpose of multipart uploads.

I looked into what the AWS SDK does (in Java). It buffers the whole file in
memory, however large it might be, and then uploads. I imagine this involves
reallocating and copying the buffer into a larger buffer. There are a few
issues raised about this on the SDK repos, like this one. But it doesn't seem
to be something the SDKs can do anything about. People seem to be writing to
temporary files and then uploading.

Regards,
Rahul

On Tue, Mar 6, 2018 at 9:04 PM, Chris Olivier  wrote:

> it seems strange that s3 would make such a major restriction. there’s
> literally no way to incrementally write a file without knowing the size
> beforehand? some sort of separate append calls, maybe?
>
> On Tue, Mar 6, 2018 at 8:53 PM Rahul Huilgol 
> wrote:
>
> > Hi everyone,
> >
> > I have been looking at updating the authentication used by S3FileSystem
> in
> > dmlc-core. Current code uses Signature version 2, which works only in the
> > region us-east-1 now. We need to update the authentication scheme to use
> > Signature version 4 (SIG4).
> >
> > I've submitted a PR  to
> change
> > this for Reads. But I wanted to seek out thoughts on what to do for
> Writes,
> > as there is a potential problem.
> >
> > *How writes to S3 work currently:*
> > Whenever s3filesystem's stream.write() is called, data is buffered. When
> > the buffer is full, a request is made to S3. Since this can happen
> multiple
> > times, multipart upload feature is used. An upload id is created when
> > stream is initialized. This upload id is used till the stream is closed.
> > Default buffer size is 64MB.
> >
> > *Problem:*
> > The new SIG4 authentication scheme changes how multipart uploads work.
> Such
> > an upload now requires that we know the total size of data to be sent
> (sum
> > of sizes of all parts) when we create the first request itself. We need
> to
> > pass the total size of payload as part of header. This is not possible
> > given that we don't know all the write calls beforehand. For example, a
> > call to save model's parameters makes 145 calls to the stream's write.
> >
> > *Approach?*
> > Is it okay to buffer it to a local file, and then upload this file to S3
> at
> > the end?
> > What use case do we have for writes to S3 generally? I believe we would
> > want to write params after training or logs. These wouldn't be too large
> or
> > frequent I imagine. What would you suggest?
> >
> > Appreciate your thoughts and suggestions.
> >
> > Thanks,
> > Rahul Huilgol
> >
>



-- 
Rahul Huilgol


Re: S3 Writes using SIG4 Authentication

2018-03-06 Thread Chris Olivier
It seems strange that S3 would make such a major restriction. There's
literally no way to incrementally write a file without knowing the size
beforehand? Some sort of separate append calls, maybe?

On Tue, Mar 6, 2018 at 8:53 PM Rahul Huilgol  wrote:

> Hi everyone,
>
> I have been looking at updating the authentication used by S3FileSystem in
> dmlc-core. Current code uses Signature version 2, which works only in the
> region us-east-1 now. We need to update the authentication scheme to use
> Signature version 4 (SIG4).
>
> I've submitted a PR  to change
> this for Reads. But I wanted to seek out thoughts on what to do for Writes,
> as there is a potential problem.
>
> *How writes to S3 work currently:*
> Whenever s3filesystem's stream.write() is called, data is buffered. When
> the buffer is full, a request is made to S3. Since this can happen multiple
> times, multipart upload feature is used. An upload id is created when
> stream is initialized. This upload id is used till the stream is closed.
> Default buffer size is 64MB.
>
> *Problem:*
> The new SIG4 authentication scheme changes how multipart uploads work. Such
> an upload now requires that we know the total size of data to be sent (sum
> of sizes of all parts) when we create the first request itself. We need to
> pass the total size of payload as part of header. This is not possible
> given that we don't know all the write calls beforehand. For example, a
> call to save model's parameters makes 145 calls to the stream's write.
>
> *Approach?*
> Is it okay to buffer it to a local file, and then upload this file to S3 at
> the end?
> What use case do we have for writes to S3 generally? I believe we would
> want to write params after training or logs. These wouldn't be too large or
> frequent I imagine. What would you suggest?
>
> Appreciate your thoughts and suggestions.
>
> Thanks,
> Rahul Huilgol
>