[jira] [Commented] (FLINK-19481) Add support for a flink native GCS FileSystem

2021-05-17 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17346542#comment-17346542
 ] 

Xintong Song commented on FLINK-19481:
--

{quote}The runtime complexity of having the additional Hadoop layer will likely 
be strictly worse. This is because each layer has its own configuration and 
things like thread pooling, pool sizes, buffering, and other non-trivial tuning 
parameters.
{quote}
I'm not sure about this. Looking into o.a.f.runtime.fs.hdfs.HadoopFileSystem, 
the Flink filesystem is practically a layer of API mappings around the Hadoop 
filesystem. It may be true that the parameters to be tuned are spread across 
different layers, but I wonder how many extra parameters, and thus how much 
complexity, the additional layer really introduces. Shouldn't the total number 
of parameters be roughly the same?
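To make the "layer of API mappings" point concrete, here is a hypothetical 
sketch (not code from Galen's PR; the class name and wiring are made up for 
illustration) of how a gs:// factory could simply wrap Google's Hadoop 
connector in Flink's existing HadoopFileSystem adapter:
{code:java}
// Hypothetical sketch only: serving the gs:// scheme by adapting the Google Hadoop
// connector through Flink's existing o.a.f.runtime.fs.hdfs.HadoopFileSystem wrapper.
import java.io.IOException;
import java.net.URI;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.FileSystemFactory;
import org.apache.flink.runtime.fs.hdfs.HadoopFileSystem;

import com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem;

public class GcsViaHadoopFactory implements FileSystemFactory {

    @Override
    public String getScheme() {
        return "gs";
    }

    @Override
    public void configure(Configuration config) {
        // A real factory would translate Flink options into the Hadoop Configuration here.
    }

    @Override
    public FileSystem create(URI fsUri) throws IOException {
        // The Google connector is itself an org.apache.hadoop.fs.FileSystem implementation.
        org.apache.hadoop.fs.FileSystem gcsHadoopFs = new GoogleHadoopFileSystem();
        gcsHadoopFs.initialize(fsUri, new org.apache.hadoop.conf.Configuration());
        // Flink's HadoopFileSystem then maps the Hadoop API onto Flink's FileSystem API.
        return new HadoopFileSystem(gcsHadoopFs);
    }
}
{code}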
{quote}In my experience the more native (fewer layers of abstraction) you can 
achieve the better the result.
{quote}
I admit that, if we were building the GCS file system from the ground up, the 
fewer layers the better. 
 # GCS SDK -> Hadoop FileSystem -> Flink FileSystem
 # GCS SDK -> Flink FileSystem

However, we don't have to build everything from the ground up. In the first 
path above, there are already off-the-shelf solutions for both mappings (the 
Google connector for SDK -> Hadoop FS, and o.a.f.runtime.fs.hdfs.HadoopFileSystem 
for Hadoop -> Flink). It requires almost no extra effort beyond assembling 
existing artifacts. On the other hand, in the second path we would need to 
implement a brand new file system, which seems like re-inventing the wheel.
{quote}It seems from reading the comments here though that a good solution 
would be a hybrid of Ben's work on the native GCS Filesystem combined with 
Galen's work on the RecoverableWriter.
{quote}
Unless there is more input on why we should have a native GCS file system, I'm 
leaning towards not introducing such a native implementation, based on the 
discussion so far.

 

> Add support for a flink native GCS FileSystem
> -
>
> Key: FLINK-19481
> URL: https://issues.apache.org/jira/browse/FLINK-19481
> Project: Flink
>  Issue Type: Improvement
>  Components: Connectors / FileSystem, FileSystems
>Affects Versions: 1.12.0
>Reporter: Ben Augarten
>Priority: Minor
>  Labels: auto-deprioritized-major
>
> Currently, GCS is supported, but only by using the Hadoop connector [1].
>  
> The objective of this improvement is to add support for checkpointing to 
> Google Cloud Storage with the Flink File System.
>  
> This would allow the `gs://` scheme to be used for savepointing and 
> checkpointing. Long term, it would be nice if we could use the GCS FileSystem 
> as a source and sink in flink jobs as well. 
>  
> Long term, I hope that implementing a flink native GCS FileSystem will 
> simplify usage of GCS because the Hadoop FileSystem ends up bringing in many 
> unshaded dependencies.
>  
> [1] [https://github.com/GoogleCloudDataproc/hadoop-connectors]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-19481) Add support for a flink native GCS FileSystem

2021-05-17 Thread Jamie Grier (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17346179#comment-17346179
 ] 

Jamie Grier commented on FLINK-19481:
-

Yes, [~xintongsong], that's my opinion based on experience.  The runtime 
complexity of having the additional Hadoop layer will likely be strictly worse. 
 This is because each layer has its own configuration and things like thread 
pooling, pool sizes, buffering, and other non-trivial tuning parameters.

 

It can be very difficult to tune this stuff for production workloads with 
non-trivial throughput, and having all of those layers makes it (much) worse.  
Because of the configuration, it's a leaky abstraction, so you end up having to 
understand, configure, and tune the Flink, Hadoop, and GCS layers anyway.
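As a rough illustration of that layering (a sketch only; the connector keys 
below are taken from the GoogleCloudDataproc hadoop-connectors project, but 
treat the exact names and values as assumptions made for illustration):
{code:java}
// Illustrative only: the same write path touches knobs at several layers.
import org.apache.hadoop.conf.Configuration;

public class LayeredTuningExample {
    public static void main(String[] args) {
        // Hadoop/connector layer: the GCS connector carries its own buffering/upload knobs.
        Configuration hadoopConf = new Configuration();
        hadoopConf.set("fs.gs.project.id", "my-gcp-project");               // hypothetical project
        hadoopConf.set("fs.gs.outputstream.upload.chunk.size", "8388608");  // connector-side buffer

        // Flink layer: configured separately (e.g. state.checkpoints.dir in flink-conf.yaml).
        // GCS client layer: the library underneath the connector has its own retry and
        // transport settings. Three places to understand when tuning a single write path.
        System.out.println(hadoopConf.get("fs.gs.outputstream.upload.chunk.size"));
    }
}
{code}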

 

Again, this is based mostly on my experience with the various flavors of the S3 
connector but it will still apply here.  In my experience the more native 
(fewer layers of abstraction) you can achieve the better the result.

 

That said I have not looked at Galen's PR.  It seems from reading the comments 
here though that a good solution would be a hybrid of Ben's work on the native 
GCS Filesystem combined with Galen's work on the RecoverableWriter.

 

 



[jira] [Commented] (FLINK-19481) Add support for a flink native GCS FileSystem

2021-05-16 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17345823#comment-17345823
 ] 

Xintong Song commented on FLINK-19481:
--

Hi [~jgrier], thanks for your input.

I have noticed your earlier comment. However, that comment was before Galen's 
PR and I think things are a bit different with this PR now.
- IIUC, what you've described are the benefits of a native implementation 
*compared to the current status*, where Flink does not provide any specific 
support for GCS and users have to deal with the Hadoop dependencies and Flink's 
FS abstractions by themselves. 
- What I'm trying to understand are the benefits *compared to the status once 
Galen's PR is merged*. The PR provides an out-of-the-box GCS FS implementation, 
so that users no longer need to deal with the dependencies and abstractions. In 
that case, is it still beneficial for this implementation to be built internally 
on top of the native GCS SDK, rather than leveraging the existing Hadoop stack 
provided by the Google storage connector?



[jira] [Commented] (FLINK-19481) Add support for a flink native GCS FileSystem

2021-05-15 Thread Jamie Grier (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17345032#comment-17345032
 ] 

Jamie Grier commented on FLINK-19481:
-

The primary benefits of a native implementation are described earlier in this 
ticket.  This is based on my own experience in production for several years 
with the other Hadoop-based file systems, primarily the S3 one.

 
{noformat}
I think a native GCS filesystem would be a major benefit to Flink users.  The 
only way to support GCS currently is, as stated, through the Hadoop Filesystem 
implementation which brings several problems along with it.  The two largest 
problems I've experienced are:

1) Hadoop has a huge dependency footprint which is a significant headache for 
Flink application developers dealing with dependency-hell.

2) The total stack of FileSystem abstractions on this path becomes very 
difficult to tune, understand, and support.  By stack I'm referring to Flink's 
own FileSystem abstraction, then the Hadoop layer, then the GCS libraries.  This 
is very difficult to work with in production as each layer has its own 
intricacies, connection pools, thread pools, tunable configuration, versions, 
dependency versions, etc.

Having gone down this path with the old-style Hadoop+S3 filesystem approach I 
know how difficult it can be and a native implementation should prove to be 
much simpler to support and easier to tune and modify for performance.  This is 
why the presto-s3-fs filesystem was adopted, for example.
{noformat}



[jira] [Commented] (FLINK-19481) Add support for a flink native GCS FileSystem

2021-05-10 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342242#comment-17342242
 ] 

Xintong Song commented on FLINK-19481:
--

[~galenwarren],

I think the RecoverableWriter implementation is beneficial regardless of which 
file system implementation we use (or both). In the meantime, I'm still a bit 
unsure about introducing another FS implementation. We probably should not 
block a definite improvement on an uncertain thread. 

Even if we decide to introduce a native GCS FS implementation, only a small 
fraction of FLINK-11838 would need further changes. I think we can make those 
changes when we actually introduce the native GCS FS.



[jira] [Commented] (FLINK-19481) Add support for a flink native GCS FileSystem

2021-05-10 Thread Galen Warren (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17342087#comment-17342087
 ] 

Galen Warren commented on FLINK-19481:
--

I wanted to check in here. Should I wait until this question is resolved before 
proceeding with the PR? 

Personally, my preference would be to see Flink HadoopFileSystem + 
GoogleHadoopFileSystem as at least _an_ option for the file system 
implementation, just because those components seem to be well established. I'm 
not opposed to an alternate implementation, though, i.e. as has been done with 
S3. If that's the path we're going down, it might mean some changes for the 
code in the PR I'm working on, hence the question.

 



[jira] [Commented] (FLINK-19481) Add support for a flink native GCS FileSystem

2021-05-05 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17339981#comment-17339981
 ] 

Xintong Song commented on FLINK-19481:
--

Thanks for the discussion.

I agree with Ben that Galen's PR mostly focuses on adding support for 
RecoverableWriter and does not depend on the specific file system 
implementation. Migrating from the Hadoop-based to a native GCS file system 
implementation should not require much rework.

However, before deciding to do that, I'd like to understand the benefits of 
implementing a native GCS file system. Galen's PR does introduce a 
GSFileSystem, which simply wraps GoogleHadoopFileSystem. It seems to me this 
already solves most of the problems:
- "gs://" scheme can be supported
- Users no longer need to deal with the dependencies and FileSystem 
hierarchies. They should simply add the new flink-gs-fs-hadoop artifact, and 
ideally everything else needed should be included.
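To make the out-of-the-box usage concrete, here is a hedged sketch (the 
bucket/path is made up, and FsStateBackend is just the 1.12-era way of pointing 
checkpoints at a file system path; this is not code from the PR):
{code:java}
// Usage sketch under the assumption that a gs:// FileSystem is registered by the new artifact.
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class GcsCheckpointingExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000);
        // Checkpoints (and savepoints) can then simply target a gs:// path.
        env.setStateBackend(new FsStateBackend("gs://my-bucket/flink/checkpoints"));

        env.fromElements(1, 2, 3).print();
        env.execute("gcs-checkpointing-example");
    }
}
{code}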

Are there any significant benefits that I have overlooked, which can only be 
achieved by a native GCS file system implementation rather than the 
GSFileSystem in Galen's PR?



[jira] [Commented] (FLINK-19481) Add support for a flink native GCS FileSystem

2021-05-03 Thread Ben Augarten (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17338455#comment-17338455
 ] 

Ben Augarten commented on FLINK-19481:
--

Hey Robert and Galen, I appreciate you both weighing in. I got a chance to read 
through Galen's PR briefly, and it does seem to be mostly concerned with adding 
support for the RecoverableWriter interface; their implementation of the 
RecoverableWriter interface does not have any explicit Hadoop dependencies. So 
it seems like their implementation would be useful with either a native or a 
Hadoop-based implementation of the Google Cloud Storage file system.

Our native implementation does have support for RecoverableWriter, but I didn't 
work directly on that and I don't believe it's being used in production right 
now. We've primarily been using our implementation for checkpointing, 
savepointing, and job graph storage.

The two paths forward I see are:

* As Galen proposed, keep two separate implementations of the GCS FileSystem, 
one that goes through the Hadoop stack and one that uses the GCS SDKs, both 
using the shared RecoverableWriter implementation.
* Consolidate down to a native GCS FileSystem implementation, using Galen's 
implementation of the RecoverableWriter.

To me, the second option makes the most sense, based on my experience as a user 
of Flink and my general impression of the desire to move away from Hadoop-based 
file systems.

To accomplish that, I think Galen should continue working on their PR. I can 
open another PR once theirs lands on master, or open a PR against their WIP. 
That said, I'd prefer to wait until the outstanding discussions are resolved.



[jira] [Commented] (FLINK-19481) Add support for a flink native GCS FileSystem

2021-05-03 Thread Galen Warren (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17338418#comment-17338418
 ] 

Galen Warren commented on FLINK-19481:
--

Hi all, I'm the author of the other 
[PR|https://github.com/apache/flink/pull/15599] that relates to Google Cloud 
Storage. [~xintongsong] has been working with me on this.

The main goal of my PR is to add support for the RecoverableWriter interface, 
so that one can write to GCS via a StreamingFileSink. The file system support 
goes through the Hadoop stack, as noted above, using Google's [cloud storage 
connector|https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage].
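For context, a hedged usage sketch of what this enables (the bucket/path is 
made up and this is illustrative usage, not code from the PR): a 
StreamingFileSink writing to a gs:// path, which relies on RecoverableWriter 
support to commit files on checkpoints.
{code:java}
// Illustrative StreamingFileSink usage against a hypothetical gs:// path.
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

public class GcsStreamingFileSinkExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000); // the sink finalizes in-progress files on checkpoints

        StreamingFileSink<String> sink = StreamingFileSink
                .forRowFormat(new Path("gs://my-bucket/output"),
                        new SimpleStringEncoder<String>("UTF-8"))
                .build();

        env.fromElements("a", "b", "c").addSink(sink);
        env.execute("gcs-streaming-file-sink-example");
    }
}
{code}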

I have not personally had problems using the GCS connector and the Hadoop stack 
– it seems to write checkpoints/savepoints properly. I also use it to write job 
manager HA data to GCS, which seems to work fine.

However, if we do want to support a native implementation in addition to the 
Hadoop-based one, we could approach it similarly to what has been done for S3, 
i.e. have a shared base project (flink-gs-fs-base?) and then projects for each 
of the implementations (flink-gs-fs-hadoop and flink-gs-fs-native?). The 
recoverable-writer code could go into the shared project so that both 
implementations could use it (assuming that the native implementation doesn't 
already have a recoverable-writer implementation).

I'll defer to the Flink experts on whether that's a worthwhile effort or not. 
At this point, from my perspective, it wouldn't be that much work to rework the 
project structure to support this.

 



[jira] [Commented] (FLINK-19481) Add support for a flink native GCS FileSystem

2021-05-03 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17338198#comment-17338198
 ] 

Robert Metzger commented on FLINK-19481:


Hi Ben, thanks a lot for getting back on this ticket. It's great to hear that 
you have a battle-tested connector implementation available. It has been 
brought up a few times that Flink doesn't ship with a GCS connector.
Before proceeding, we have to figure out one problem: related to FLINK-11838, 
there seems to be a pull request under review 
(https://github.com/apache/flink/pull/15599) that also intends to add a GCS 
file system implementation. In that case, it is mostly about supporting the 
StreamingFileSink.
The implementation from PR #15599 seems to go through the Hadoop stack, so I 
guess it is different from your implementation, which goes to the Google APIs 
directly.
Does your implementation support the RecoverableWriter interface?
I'm not very deep into the filesystem implementations these days; could you 
take a quick look at the other PR and make a proposal for how we can proceed? 
(Join forces? Add two implementations? Discard one?)



[jira] [Commented] (FLINK-19481) Add support for a flink native GCS FileSystem

2021-05-01 Thread Ben Augarten (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17337835#comment-17337835
 ] 

Ben Augarten commented on FLINK-19481:
--

We've been using a non-Hadoop implementation of a GCS FileSystem for a few 
months now in production. I think it's in a good place to open source / 
contribute back if there is consensus that such a plugin is worth adding to 
open-source Flink. 

 

If there is consensus, I am happy to work on a PR



[jira] [Commented] (FLINK-19481) Add support for a flink native GCS FileSystem

2021-04-29 Thread Flink Jira Bot (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17336078#comment-17336078
 ] 

Flink Jira Bot commented on FLINK-19481:


This issue was labeled "stale-major" 7 days ago and has not received any 
updates, so it is being deprioritized. If this ticket is actually Major, please 
raise the priority and ask a committer to assign you the issue or revive the 
public discussion.




[jira] [Commented] (FLINK-19481) Add support for a flink native GCS FileSystem

2021-04-22 Thread Flink Jira Bot (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17327621#comment-17327621
 ] 

Flink Jira Bot commented on FLINK-19481:


This major issue is unassigned, and neither it nor any of its Sub-Tasks has 
been updated for 30 days, so it has been labeled "stale-major". If this ticket 
is indeed "major", please either assign yourself or give an update, and then 
remove the label. In 7 days the issue will be deprioritized.



[jira] [Commented] (FLINK-19481) Add support for a flink native GCS FileSystem

2021-01-25 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17271241#comment-17271241
 ] 

Robert Metzger commented on FLINK-19481:


I'm generally +1 on adding this to Flink. I've seen quite a few people on the 
ML who seem to use GCS.

Once we have somebody who's willing to drive this, let's see how it goes and 
whether we need an ML discussion (if there's a committer strongly supporting 
this, then I don't think we need one).



[jira] [Commented] (FLINK-19481) Add support for a flink native GCS FileSystem

2020-10-10 Thread Yun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17211631#comment-17211631
 ] 

Yun Tang commented on FLINK-19481:
--

[~jgrier] Sounds reasonable to me.

[~rmetzger], what do you think of this ticket? Shall we start a discussion 
thread on the mailing list about this support, in case there are no resources 
to review the PR, if we agree to continue?



[jira] [Commented] (FLINK-19481) Add support for a flink native GCS FileSystem

2020-10-09 Thread Jamie Grier (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17211477#comment-17211477
 ] 

Jamie Grier commented on FLINK-19481:
-

I think a native GCS filesystem would be a major benefit to Flink users.  The 
only way to support GCS currently is, as stated, through the Hadoop Filesystem 
implementation which brings several problems along with it.  The two largest 
problems I've experienced are:

1) Hadoop has a huge dependency footprint which is a significant headache for 
Flink application developers dealing with dependency-hell.

2) The total stack of FileSystem abstractions on this path becomes very 
difficult to tune, understand, and support.  By stack I'm referring to Flink's 
own FileSystem abstraction, then the Hadoop layer, then the GCS libraries.  
This is very difficult to work with in production as each layer has its own 
intricacies, connection pools, thread pools, tunable configuration, versions, 
dependency versions, etc.

Having gone down this path with the old-style Hadoop+S3 filesystem approach I 
know how difficult it can be and a native implementation should prove to be 
much simpler to support and easier to tune and modify for performance.  This is 
why the presto-s3-fs filesystem was adopted, for example.

 



[jira] [Commented] (FLINK-19481) Add support for a flink native GCS FileSystem

2020-10-09 Thread Yun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17211228#comment-17211228
 ] 

Yun Tang commented on FLINK-19481:
--

If so, why do we still need to support a native GCS file system? Supporting 
another file system requires more community resources. [~baugarten]



[jira] [Commented] (FLINK-19481) Add support for a flink native GCS FileSystem

2020-10-06 Thread Ben Augarten (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208844#comment-17208844
 ] 

Ben Augarten commented on FLINK-19481:
--

[~yunta] yes, that's right. Checkpointing on GCS via the Hadoop filesystem 
currently works well, as far as I'm aware.



[jira] [Commented] (FLINK-19481) Add support for a flink native GCS FileSystem

2020-10-03 Thread Yun Tang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1720#comment-1720
 ] 

Yun Tang commented on FLINK-19481:
--

[~baugarten], from my point of view, we could still checkpoint to Google Cloud 
Storage via the [hadoop file 
system|https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/filesystems/#hadoop-file-system-hdfs-and-its-other-implementations]
 without this improvement, is that right?
