[ 
https://issues.apache.org/jira/browse/FLINK-5706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15895648#comment-15895648
 ] 

Steve Loughran edited comment on FLINK-5706 at 3/4/17 6:24 PM:
---------------------------------------------------------------

Stefan, I don't think you appreciate how hard it is to do this.

I will draw your attention to all the features coming in Hadoop 2.8, 
HADOOP-11694, including seek-optimised input streams; disk/heap/byte-buffer 
block-buffered uploads; support for encryption; and optimisation of every 
request, HTTPS call by HTTPS call.

Then there's the todo list for later, HADOOP-13204. One aspect of this is 
HADOOP-13345, S3Guard, which uses DynamoDB for a consistent view of the 
metadata; another is HADOOP-13786, a mechanism to commit work directly to S3 
that supports speculation and fault tolerance. 

These are all things you get to replicate, along with the scale tests, which 
do find things: HADOOP-14028 only showed up on 70 GB writes. Then there are 
the various intermittent failures you don't see often but which cause serious 
problems when they do. Example: the final POST of a multipart PUT doesn't get 
retried for you; you have to do the retries yourself. After you find the 
problem.
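
To make that last point concrete, here is a minimal sketch against the AWS SDK for Java 1.x of what "doing the retries yourself" looks like; the class name, retry count, and backoff policy are mine for illustration, not what s3a actually ships:

{code:java}
import com.amazonaws.AmazonClientException;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.CompleteMultipartUploadRequest;
import com.amazonaws.services.s3.model.CompleteMultipartUploadResult;
import com.amazonaws.services.s3.model.PartETag;

import java.util.List;

/** Hypothetical helper, not the s3a code: retry the completion POST ourselves. */
public final class MultipartCompleter {

    private MultipartCompleter() {}

    /**
     * Complete a multipart upload, retrying the final POST with exponential
     * backoff, because that request is not retried for you.
     */
    public static CompleteMultipartUploadResult completeWithRetries(
            AmazonS3 s3, String bucket, String key, String uploadId,
            List<PartETag> parts, int maxAttempts) throws InterruptedException {
        AmazonClientException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return s3.completeMultipartUpload(
                        new CompleteMultipartUploadRequest(bucket, key, uploadId, parts));
            } catch (AmazonClientException e) {
                // A real implementation would distinguish retryable from
                // non-retryable failures; this sketch just backs off and retries.
                lastFailure = e;
                Thread.sleep(1000L << (attempt - 1)); // 1s, 2s, 4s, ...
            }
        }
        throw lastFailure; // all attempts failed
    }
}
{code}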

As a sibling project, you are free to lift the entirety of the s3a code, along 
with all the tests it includes. But you then take on the burden of maintaining 
it, fielding support calls, doing your entire integration test workload 
yourself, and performance tuning. Did [I mention 
testing?|https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/testing.md]
 We have put a lot of effort in there. You have to balance remote test runs 
against scale, performance, and affordability; reusing Amazon's own public 
datasets is the secret, along with our own TPC-DS datasets and running tests 
in-EC2 to really stress things.

This is not trivial, even if you lift the latest branch-2 code. We are still 
finding bugs there, ones that only surface in the field. We admittedly have a 
broader set of downstream things to deal with: distcp, MapReduce, Hive, Spark, 
even Flink, but that helps us with the test reports (we keep an eye on JIRAs 
and Stack Overflow for the word "s3a") and with the different deployment 
scenarios.

Please, do not take this on lightly.

Returning to your example above:
# It's not just that the {{exists()/HEAD}} probe can take time to respond; it's 
that the directory listing lags the direct {{HEAD object}} call: even if the 
exists() check returns 404, a LIST operation may still list the entry. And 
because the front-end load balancers cache things, the code deleting the object 
may get a 404 indicating that the object is gone, yet *there is no guarantee 
that a different caller will not get a 200*. 
# You may even see this in the same process, though if your calls use a thread 
pool of keep-alive HTTP/1.1 connections and all of them go over the same TCP 
connection, you'll be hitting the same load balancer and so get the cached 404. 
Because yes, load balancers cache 404 entries, meaning you don't even get 
create consistency if you do an existence check first (see the sketch after 
this list).
# S3 doesn't have read-after-write (RAW) consistency in general. It now has 
create consistency across all regions (yes, for a long time US-East behaved 
differently from the others), provided you don't do a HEAD first.
# You don't get PUT-over-PUT consistency or DELETE consistency, and metadata 
queries invariably lag the object state, even on create.
# There is no such thing as {{rename()}}, merely a COPY at roughly 6-10 MB/s, 
so it is O(data) and non-atomic.
# If you are copying on top of objects with the same name, you hit update 
consistency, for which there are no guarantees. Again, different callers may 
see different results, irrespective of call ordering, and listings will lag 
creation.
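
As a concrete illustration of points 1-3, here is a minimal check-then-create sketch against the AWS SDK for Java 1.x; the bucket and key names are made up, and the comments mark where the cached 404s and the listing lag leave you without guarantees:

{code:java}
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

/** Illustration only: the check-then-create pattern that S3 cannot make reliable. */
public class CheckThenCreate {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        String bucket = "example-bucket";              // made-up names
        String key = "checkpoints/chk-42/_metadata";

        // HEAD probe. A front-end load balancer may cache this 404, and doing
        // the probe at all forfeits the create-consistency guarantee for the
        // PUT below.
        if (!s3.doesObjectExist(bucket, key)) {
            // Another writer may create the object between the probe and this
            // PUT, and neither of you will reliably observe the other: LIST
            // lags HEAD/PUT, and a caller behind a different load balancer may
            // still get a 200 for a deleted object or a 404 for a created one.
            s3.putObject(bucket, key, "marker");
        }
    }
}
{code}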

What you have seen so far is "demo scale" behaviour over a reused HTTP/1.1 
connection against the same load balancer. You cannot extrapolate from what 
works there to what offers guaranteed outcomes for large-scale operations on 
production data across multiple clusters, except for the special case "if it 
doesn't work here, it won't magically work in production".




> Implement Flink's own S3 filesystem
> -----------------------------------
>
>                 Key: FLINK-5706
>                 URL: https://issues.apache.org/jira/browse/FLINK-5706
>             Project: Flink
>          Issue Type: New Feature
>          Components: filesystem-connector
>            Reporter: Stephan Ewen
>
> As part of the effort to make Flink completely independent from Hadoop, Flink 
> needs its own S3 filesystem implementation. Currently Flink relies on 
> Hadoop's S3a and S3n file systems.
> Such an S3 file system can be implemented using the AWS SDK. As the basis of 
> the implementation, the Hadoop File System can be used (Apache Licensed, 
> should be okay to reuse some code as long as we do a proper attribution).


