[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15992322#comment-15992322 ]

ASF GitHub Bot commented on BEAM-2005:
--------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/beam/pull/2777

> Add a Hadoop FileSystem implementation of Beam's FileSystem
> -----------------------------------------------------------
>
>                 Key: BEAM-2005
>                 URL: https://issues.apache.org/jira/browse/BEAM-2005
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-java-extensions
>            Reporter: Stephen Sisk
>            Assignee: Luke Cwik
>             Fix For: First stable release
>
>
> Beam's FileSystem creates an abstraction for reading from files in many
> different places.
> We should add a Hadoop FileSystem implementation
> (https://hadoop.apache.org/docs/r2.8.0/api/org/apache/hadoop/fs/FileSystem.html)
>  - that would enable us to read from any file system that implements
> FileSystem (including HDFS, azure, s3, etc..)
> I'm investigating this now.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15989699#comment-15989699 ]

ASF GitHub Bot commented on BEAM-2005:
--------------------------------------

GitHub user lukecwik opened a pull request:

    https://github.com/apache/beam/pull/2777

    [BEAM-2005, BEAM-2030, BEAM-2031, BEAM-2032, BEAM-2033, BEAM-2070] Base implementation of HadoopFileSystem.

    TODO:
    * Add multiplexing FileSystem that is able to route requests based upon the base URI when configured for multiple file systems.
    * Take a look at copy/rename again to see if we can do better than moving all the bytes through the local machine.

    Be sure to do all of the following to help us incorporate your contribution
    quickly and easily:

     - [x] Make sure the PR title is formatted like:
       `[BEAM-] Description of pull request`
     - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable
           Travis-CI on your fork and ensure the whole test matrix passes).
     - [x] Replace `` in the title with the actual Jira issue
           number, if there is one.
     - [x] If this contribution is large, please file an Apache
           [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).

    ---

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/lukecwik/incubator-beam hdfs

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/beam/pull/2777.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2777

----
commit c792da69467ccb0a486fe74ff7df0e5e9cb54abf
Author: Luke Cwik
Date:   2017-04-29T02:07:59Z

    [BEAM-2005, BEAM-2030, BEAM-2031, BEAM-2032, BEAM-2033, BEAM-2070] Base implementation of HadoopFileSystem.

    TODO:
    * Add multiplexing FileSystem that is able to route requests based upon the base URI when configured for multiple file systems.
    * Take a look at copy/rename again to see if we can do better than moving all the bytes through the local machine.
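The first TODO item (routing requests by base URI when several file systems are configured) could look roughly like the sketch below. This is only an illustration under stated assumptions: `BaseUriRouter` and its methods are hypothetical names and are not part of the pull request or of Beam, and the "base URI" is assumed to mean the scheme plus authority of a resource.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Objects;

/**
 * Hypothetical router (not Beam API): picks a delegate file system by the
 * scheme and authority of a resource URI, e.g. "hdfs://namenode-a:8020".
 */
final class BaseUriRouter<T> {
  // Key: normalized "scheme://authority"; value: the delegate configured for it.
  private final Map<String, T> delegates = new LinkedHashMap<>();

  /** Registers a delegate for a base URI; rejects duplicate base URIs. */
  void register(String baseUri, T delegate) {
    Objects.requireNonNull(delegate, "delegate");
    String key = keyOf(URI.create(baseUri));
    if (delegates.putIfAbsent(key, delegate) != null) {
      throw new IllegalArgumentException("Base URI already registered: " + key);
    }
  }

  /** Returns the delegate whose base URI matches the given resource. */
  T route(String resource) {
    String key = keyOf(URI.create(resource));
    T delegate = delegates.get(key);
    if (delegate == null) {
      throw new IllegalArgumentException("No file system configured for " + key);
    }
    return delegate;
  }

  // Scheme and authority are compared case-insensitively; the path is ignored.
  private static String keyOf(URI uri) {
    String scheme = uri.getScheme();
    if (scheme == null) {
      throw new IllegalArgumentException("Missing scheme: " + uri);
    }
    String authority = uri.getAuthority() == null ? "" : uri.getAuthority();
    return scheme.toLowerCase(Locale.ROOT) + "://" + authority.toLowerCase(Locale.ROOT);
  }
}
```

With two delegates registered under `hdfs://namenode-a:8020` and `hdfs://namenode-b:8020` (made-up host names), `route("hdfs://namenode-a:8020/data/x")` would return the first one. Matching on an exact scheme and authority is the simplest choice; a longest-prefix match on the path would be needed if two configurations shared one authority.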
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15989688#comment-15989688 ]

ASF GitHub Bot commented on BEAM-2005:
--------------------------------------

Github user lukecwik closed the pull request at:

    https://github.com/apache/beam/pull/2776
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15989683#comment-15989683 ]

ASF GitHub Bot commented on BEAM-2005:
--------------------------------------

GitHub user lukecwik opened a pull request:

    https://github.com/apache/beam/pull/2776

    [BEAM-2005, BEAM-2030, BEAM-2031, BEAM-2032, BEAM-2033, BEAM-2070] Base implementation of HadoopFileSystem.

    TODO:
    * Add multiplexing FileSystem that is able to route requests based upon the base URI when configured for multiple file systems.
    * Take a look at copy/rename again to see if we can do better than moving all the bytes through the local machine.

    Be sure to do all of the following to help us incorporate your contribution
    quickly and easily:

     - [x] Make sure the PR title is formatted like:
       `[BEAM-] Description of pull request`
     - [x] Make sure tests pass via `mvn clean verify`. (Even better, enable
           Travis-CI on your fork and ensure the whole test matrix passes).
     - [x] Replace `` in the title with the actual Jira issue
           number, if there is one.
     - [x] If this contribution is large, please file an Apache
           [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).

    ---

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/lukecwik/incubator-beam hdfs2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/beam/pull/2776.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2776

----
commit ecfb534c79cb38bdf378dffe25d6a7ee3e20e5c6
Author: Luke Cwik
Date:   2017-04-29T01:37:03Z

    [BEAM-2005, BEAM-2030, BEAM-2031, BEAM-2032, BEAM-2033, BEAM-2070] Base implementation of HDFS.

    TODO:
    * Add multiplexing FileSystem that is able to route requests based upon the base URI when configured for multiple file systems.
    * Take a look at copy/rename again to see if we can do better than moving all the bytes through the local machine.
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15989625#comment-15989625 ]

ASF GitHub Bot commented on BEAM-2005:
--------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/beam/pull/2772
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15989592#comment-15989592 ]

ASF GitHub Bot commented on BEAM-2005:
--------------------------------------

GitHub user lukecwik opened a pull request:

    https://github.com/apache/beam/pull/2772

    [BEAM-2005] Swap to use Lists within MatchResult instead of arrays.

    Be sure to do all of the following to help us incorporate your contribution
    quickly and easily:

     - [ ] Make sure the PR title is formatted like:
       `[BEAM-] Description of pull request`
     - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable
           Travis-CI on your fork and ensure the whole test matrix passes).
     - [ ] Replace `` in the title with the actual Jira issue
           number, if there is one.
     - [ ] If this contribution is large, please file an Apache
           [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).

    ---

    Everywhere in our codebase we go through and want to have lists as the return type.
    This also AutoValue's them.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/lukecwik/incubator-beam hdfs2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/beam/pull/2772.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2772

----
commit f9fe921cb64a19a092590644e1d94a4b202db3c9
Author: Luke Cwik
Date:   2017-04-28T23:27:53Z

    [BEAM-2005] Swap to use Lists within MatchResult instead of arrays.
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15989245#comment-15989245 ]

ASF GitHub Bot commented on BEAM-2005:
--------------------------------------

Github user tgroh closed the pull request at:

    https://github.com/apache/beam/pull/2763
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15989203#comment-15989203 ]

ASF GitHub Bot commented on BEAM-2005:
--------------------------------------

GitHub user tgroh opened a pull request:

    https://github.com/apache/beam/pull/2763

    Revert "[BEAM-2005] Move getScheme from FileSystemRegistrar to FileSy…

    …stem"

    This reverts commit ce88c88b14e963ac17fac83dd19495c835c1f6cb.

    Be sure to do all of the following to help us incorporate your contribution
    quickly and easily:

     - [ ] Make sure the PR title is formatted like:
       `[BEAM-] Description of pull request`
     - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable
           Travis-CI on your fork and ensure the whole test matrix passes).
     - [ ] Replace `` in the title with the actual Jira issue
           number, if there is one.
     - [ ] If this contribution is large, please file an Apache
           [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).

    ---

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tgroh/beam revert_getscheme_fs_registrar

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/beam/pull/2763.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2763

----
commit 25f9b4a4b1001549e15989eb4d94b6c2978872ac
Author: Thomas Groh
Date:   2017-04-28T17:30:15Z

    Revert "[BEAM-2005] Move getScheme from FileSystemRegistrar to FileSystem"

    This reverts commit ce88c88b14e963ac17fac83dd19495c835c1f6cb.
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15989071#comment-15989071 ]

ASF GitHub Bot commented on BEAM-2005:
--------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/beam/pull/2757

> Add a Hadoop FileSystem implementation of Beam's FileSystem
> -----------------------------------------------------------
>
>                 Key: BEAM-2005
>                 URL: https://issues.apache.org/jira/browse/BEAM-2005
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-java-extensions
>            Reporter: Stephen Sisk
>            Assignee: Stephen Sisk
>             Fix For: First stable release
>
>
> Beam's FileSystem creates an abstraction for reading from files in many
> different places.
> We should add a Hadoop FileSystem implementation
> (https://hadoop.apache.org/docs/r2.8.0/api/org/apache/hadoop/fs/FileSystem.html)
>  - that would enable us to read from any file system that implements
> FileSystem (including HDFS, azure, s3, etc..)
> I'm investigating this now.
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15987992#comment-15987992 ]

ASF GitHub Bot commented on BEAM-2005:
--------------------------------------

GitHub user lukecwik opened a pull request:

    https://github.com/apache/beam/pull/2757

    [BEAM-2005] Move getScheme from FileSystemRegistrar to FileSystem

    Note that I needed to update FileSystems to instantiate the FileSystem(s) upfront instead of remembering the mapping from scheme to registrar during "registration" of the PipelineOptions.

    Be sure to do all of the following to help us incorporate your contribution
    quickly and easily:

     - [ ] Make sure the PR title is formatted like:
       `[BEAM-] Description of pull request`
     - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable
           Travis-CI on your fork and ensure the whole test matrix passes).
     - [ ] Replace `` in the title with the actual Jira issue
           number, if there is one.
     - [ ] If this contribution is large, please file an Apache
           [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).

    ---

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/lukecwik/incubator-beam hdfs2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/beam/pull/2757.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2757

----
commit a2ccc739f7dc2e6f606fafd6e158b63efe190b10
Author: Luke Cwik
Date:   2017-04-28T01:13:43Z

    [BEAM-2005] Move getScheme from FileSystemRegistrar to FileSystem

    Note that I needed to update FileSystems to instantiate the FileSystem(s) upfront instead of remembering the mapping from scheme to registrar.
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15982303#comment-15982303 ]

Jean-Baptiste Onofré commented on BEAM-2005:
--------------------------------------------

+1 (sorry I'm late on this, I was busy with CassandraIO and Spark 2 support).

Gonna help you in the coming days.
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15982174#comment-15982174 ]

Stephen Sisk commented on BEAM-2005:
------------------------------------

Also, all the sub-tasks on this are assigned to me since we don't want FSR items unassigned, but I'm currently working on only one of them. Let me know if you're interested in working on this with me.
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15982100#comment-15982100 ]

Stephen Sisk commented on BEAM-2005:
------------------------------------

Also, I filed https://issues.apache.org/jira/browse/BEAM-2069 (Remove FileSystem.getCurrentDirectory()?) based on an issue I ran into while implementing HadoopResourceId. For now, I'm having getCurrentDirectory() throw NotImplementedException()
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15982030#comment-15982030 ]

Stephen Sisk commented on BEAM-2005:
------------------------------------

Note that adding support for other Hadoop FileSystems should be simple (I'm not doing anything to block them) and might be as straightforward as adding another FileSystemRegistrar that returns the HadoopFileSystem for different schemes.
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15981966#comment-15981966 ]

Stephen Sisk commented on BEAM-2005:
------------------------------------

Since I haven't heard anything and we are trying to finish soon for FSR, I'm doing as follows:

1. Implement a simple version of FileSystem that supports only HDFS and only one configuration, and that gets that configuration via a PipelineOptions value that looks like a JSON blob.
2. It will be checked in at the same location as the existing HadoopFileSystem.java (and thus, this will live in extensions).

I like the idea of the scheme -> module mapping, but that probably belongs in another JIRA if we want it.
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15977537#comment-15977537 ]

Stephen Sisk commented on BEAM-2005:
------------------------------------

There are a couple of possible ways we could adapt to this:
* Potentially we could give different connections different schemes, but that falls apart if someone wants to use URIs generated elsewhere.
* Start passing in the FileSystem object as an option on the read transform (like Hadoop does). This also incidentally solves the "how will people know if hdfs is in the set of modules loaded on your system" problem that was discussed above: they'll need to instantiate the instance themselves, and they'll go through their normal discovery mechanism for doing so.
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15977497#comment-15977497 ]

Stephen Sisk commented on BEAM-2005:
------------------------------------

Some additional questions that I think are related to the registration question we're talking about here.

As discussed above, Hadoop FileSystem can be used to access multiple types of filesystems (s3/hdfs/etc...)

1) However, FileSystemRegistrar only allows one scheme to be registered per FileSystemRegistrar. That means a single class can only handle one scheme. We could either change the interface to allow registering multiple schemes, or create multiple classes that all inherit from a base class and each declare a separate scheme (e.g. s3HadoopFileSystem, HdfsHadoopFileSystem, etc...).

2) Additionally, Hadoop filesystems are configured via Configuration objects (e.g. the options discussed here: https://issues.apache.org/jira/browse/HADOOP-10400 for S3). That means a user probably should be able to configure those options and have multiple connections per scheme type (i.e. "I want to connect to two different HDFS instances"). Looking at how the Beam FileSystem is currently implemented, it's not clear to me that it is possible today to handle this scenario.

This second question shouldn't block getting a simple "I can read from one hdfs instance" case working, but it does seem important in the long run.

cc [~davor] [~dhalp...@google.com]
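The second option in point 1 (several thin classes sharing one base class, each declaring its own scheme) might be sketched as follows. Everything here is a hypothetical stand-in: `SingleSchemeRegistrar` only imitates the "one scheme per registrar" constraint described in the comment and is not Beam's actual `FileSystemRegistrar` interface, and the returned string stands in for a real file system object.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a registrar that can declare exactly one scheme. */
interface SingleSchemeRegistrar {
  String getScheme();

  /** Stand-in for "create the file system"; a real registrar would return an object. */
  String describeFileSystem();
}

/** Shared base class: all behaviour lives here, subclasses only pick the scheme. */
abstract class HadoopBackedRegistrar implements SingleSchemeRegistrar {
  private final String scheme;

  protected HadoopBackedRegistrar(String scheme) {
    this.scheme = scheme;
  }

  @Override
  public final String getScheme() {
    return scheme;
  }

  @Override
  public final String describeFileSystem() {
    return "HadoopFileSystem(" + scheme + ")";
  }
}

/** One thin subclass per scheme, as suggested in the comment above. */
final class HdfsRegistrar extends HadoopBackedRegistrar {
  HdfsRegistrar() {
    super("hdfs");
  }
}

final class S3Registrar extends HadoopBackedRegistrar {
  S3Registrar() {
    super("s3");
  }
}

/** Builds the scheme -> registrar lookup and rejects two registrars for one scheme. */
final class RegistrarIndex {
  private RegistrarIndex() {}

  static Map<String, SingleSchemeRegistrar> index(List<SingleSchemeRegistrar> registrars) {
    Map<String, SingleSchemeRegistrar> byScheme = new HashMap<>();
    for (SingleSchemeRegistrar registrar : registrars) {
      if (byScheme.put(registrar.getScheme(), registrar) != null) {
        throw new IllegalStateException("Duplicate scheme: " + registrar.getScheme());
      }
    }
    return byScheme;
  }
}
```

The trade-off is one extra class per scheme in exchange for leaving the registrar interface unchanged. It does not address point 2, since each scheme still maps to a single configuration.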
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15976952#comment-15976952 ]

Stephen Sisk commented on BEAM-2005:
------------------------------------

I don't want to derail this conversation, but I did have a couple other concerns.

Beam's FileSystem has a copy() command, however I can't find a good analog in Hadoop's FileSystem. https://hadoop.apache.org/docs/r2.8.0/api/org/apache/hadoop/fs/FileSystem.html shows lots of copy to/from local files, but no "copy between these two arbitrary paths".

I also believe that since Beam FileSystem objects are configured via PipelineOptions, we need to pass a Hadoop Configuration through PipelineOptions. I think that's very solvable, but it does seem semi-complicated.

I'm going to open subtasks for discussion so we can discuss in separate threads.
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15976308#comment-15976308 ]

Aviem Zur commented on BEAM-2005:
---------------------------------

Yes, it makes sense that the code would be in an extension, and a BoM/archetypes + good documentation will help users get up and running.

However, I still think the case I mentioned will happen in practice:
bq. a user creates a project from scratch, adds a dependency on a runner (say direct runner), uses TextIO to do a word count and it works for them when passing "file://path/to/file", changing this to "hdfs://path/to/file" will not work.

So in this case the user will have to resort to looking up documentation on how to achieve what they wanted.

What we could do, if we don't want to have {{core}} bloated with dependencies on all filesystems out of the box, is at least have a {{scheme}} -> {{module}} mapping which can be used to display an informative error message such as:
bq. "To enable HDFS support add a dependency on sdk-java-extensions-hadoop"

And a similar message for the other filesystem schemes which we have support for in our extension modules.

This could be achieved by a static {{Map}} in {{core}}.
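A minimal sketch of the static scheme -> module map proposed in this comment is shown below. It is an illustration only: `SchemeModuleHints` and `describeMissingScheme` are hypothetical names, the single map entry is taken from the example message in the comment, and nothing here reflects how Beam reports a missing file system.

```java
import java.net.URI;
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: maps a URI scheme to the module that would provide it. */
final class SchemeModuleHints {
  // Illustrative entry only; the module name is the one quoted in the comment above.
  private static final Map<String, String> SCHEME_TO_MODULE;

  static {
    Map<String, String> hints = new HashMap<>();
    hints.put("hdfs", "sdk-java-extensions-hadoop");
    SCHEME_TO_MODULE = Collections.unmodifiableMap(hints);
  }

  private SchemeModuleHints() {}

  /** Builds an error message for a path whose scheme has no registered file system. */
  static String describeMissingScheme(String path) {
    String scheme = URI.create(path).getScheme();
    if (scheme == null) {
      return "No scheme found in '" + path + "'";
    }
    String base = "No file system registered for scheme '" + scheme + "'.";
    String module = SCHEME_TO_MODULE.get(scheme.toLowerCase(Locale.ROOT));
    if (module == null) {
      // Unknown scheme: there is no hint to offer beyond the plain error.
      return base;
    }
    return base + " To enable support, add a dependency on " + module;
  }
}
```

Calling `SchemeModuleHints.describeMissingScheme("hdfs://path/to/file")` would produce a message that names the module, while an unlisted scheme falls back to the plain error. Keeping the map as plain strings means `core` needs no compile-time dependency on any extension module.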
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15976262#comment-15976262 ]

Jean-Baptiste Onofré commented on BEAM-2005:

I think there are two topics:
- where to put the code itself
- what dependency artifacts do we bring with the core (via a BoM for example)

Definitely, I think it's better to have the HDFS filesystem outside of the core, in an extension. Now, from a user perspective, this extension can come by default with the dependency set (again using a BoM for instance).

> Add a Hadoop FileSystem implementation of Beam's FileSystem
> ---
>
> Key: BEAM-2005
> URL: https://issues.apache.org/jira/browse/BEAM-2005
> Project: Beam
> Issue Type: New Feature
> Components: sdk-java-extensions
> Reporter: Stephen Sisk
> Assignee: Stephen Sisk
> Fix For: First stable release
>
>
> Beam's FileSystem creates an abstraction for reading from files in many
> different places.
> We should add a Hadoop FileSystem implementation
> (https://hadoop.apache.org/docs/r2.8.0/api/org/apache/hadoop/fs/FileSystem.html)
> - that would enable us to read from any file system that implements
> FileSystem (including HDFS, azure, s3, etc..)
> I'm investigating this now.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15976256#comment-15976256 ]

Ismaël Mejía commented on BEAM-2005:

[~aviemzur] I think that this should be part of the Hadoop extensions and SDK core should NOT depend on it for two reasons:

(1) We have done a big process of degooglification of the SDK so I don’t see a strong reason to add a strong dependency on another group of libraries like the ones that Hadoop will bring.

(2) The SDK should be object storage neutral, so I don’t think there is a particular reason to add support out of the box for hdfs and not do it for s3 or other storage systems, especially since we can register those dynamically via BeamFileSystemRegistrar once the dependency is added (like runners do).

Note that I expect that this will also be the case for Google Storage and that the GCP dependencies won’t be needed as part of the core sdk for neutrality reasons too.

However, I agree with your feeling that for user experience having this support out of the box would be nice, but we can cover this with better documentation or with some starter (batteries included) maven poms, e.g. one for people who are full on GCP with all the google storage out of the box, one for people on spark that can bring the hadoop extensions ready, etc.

> Add a Hadoop FileSystem implementation of Beam's FileSystem
> ---
>
> Key: BEAM-2005
> URL: https://issues.apache.org/jira/browse/BEAM-2005
> Project: Beam
> Issue Type: New Feature
> Components: sdk-java-extensions
> Reporter: Stephen Sisk
> Assignee: Stephen Sisk
> Fix For: First stable release
>
>
> Beam's FileSystem creates an abstraction for reading from files in many
> different places.
> We should add a Hadoop FileSystem implementation
> (https://hadoop.apache.org/docs/r2.8.0/api/org/apache/hadoop/fs/FileSystem.html)
> - that would enable us to read from any file system that implements
> FileSystem (including HDFS, azure, s3, etc..)
> I'm investigating this now.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
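As a rough sketch of the dynamic registration idea in the comment above, the snippet below assumes discovery works through `java.util.ServiceLoader`, which the comparison to runners suggests but the comment does not spell out. `SchemeRegistrar` is a hypothetical stand-in interface, not Beam's actual registrar type, and `RegistrarDiscovery` is likewise invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.ServiceLoader;

public class RegistrarDiscovery {
    /** Hypothetical stand-in for a registrar that an extension module would implement. */
    public interface SchemeRegistrar {
        String scheme();
    }

    /**
     * Collects every registrar visible on the classpath, keyed by scheme.
     * A provider is only found if its jar lists the implementation under
     * META-INF/services, so adding the extension dependency is what enables it.
     */
    public static Map<String, SchemeRegistrar> discover() {
        Map<String, SchemeRegistrar> byScheme = new HashMap<>();
        for (SchemeRegistrar registrar : ServiceLoader.load(SchemeRegistrar.class)) {
            byScheme.put(registrar.scheme(), registrar);
        }
        return byScheme;
    }

    public static void main(String[] args) {
        // With no provider jars on the classpath, nothing is registered.
        System.out.println("registered schemes: " + discover().keySet());
    }
}
```

Under this assumption, core never names Hadoop or GCP classes directly; it only sees whichever registrars the user's dependencies supply.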
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15976046#comment-15976046 ] Aviem Zur commented on BEAM-2005: - Regarding {{core}} vs {{extensions}}. This can reside in a separate module from core, but I think that core should depend on it so users get this functionality out of the box. For example, if a user uses {{TextIO}} and it works for them when passing {{"file://path/to/file"}}, changing this to {{"hdfs://path/to/file"}} should work without the need to add a new dependency to their project. > Add a Hadoop FileSystem implementation of Beam's FileSystem > --- > > Key: BEAM-2005 > URL: https://issues.apache.org/jira/browse/BEAM-2005 > Project: Beam > Issue Type: New Feature > Components: sdk-java-extensions >Reporter: Stephen Sisk >Assignee: Stephen Sisk > Fix For: First stable release > > > Beam's FileSystem creates an abstraction for reading from files in many > different places. > We should add a Hadoop FileSystem implementation > (https://hadoop.apache.org/docs/r2.8.0/api/org/apache/hadoop/fs/FileSystem.html) > - that would enable us to read from any file system that implements > FileSystem (including HDFS, azure, s3, etc..) > I'm investigating this now. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15975557#comment-15975557 ] Stephen Sisk commented on BEAM-2005: [~aviemzur] correct, that's what we'll need to do. [~jbonofre] I've also started working on this - mind sharing your code? Maybe in a feature branch of the beam repo where we could collaborate? > Add a Hadoop FileSystem implementation of Beam's FileSystem > --- > > Key: BEAM-2005 > URL: https://issues.apache.org/jira/browse/BEAM-2005 > Project: Beam > Issue Type: New Feature > Components: sdk-java-extensions >Reporter: Stephen Sisk >Assignee: Stephen Sisk > Fix For: First stable release > > > Beam's FileSystem creates an abstraction for reading from files in many > different places. > We should add a Hadoop FileSystem implementation > (https://hadoop.apache.org/docs/r2.8.0/api/org/apache/hadoop/fs/FileSystem.html) > - that would enable us to read from any file system that implements > FileSystem (including HDFS, azure, s3, etc..) > I'm investigating this now. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15975196#comment-15975196 ] Aviem Zur commented on BEAM-2005: - Wouldn't this ticket mean actually implementing https://github.com/apache/beam/blob/master/sdks/java/io/hdfs/src/main/java/org/apache/beam/sdk/io/hdfs/HadoopFileSystem.java ? > Add a Hadoop FileSystem implementation of Beam's FileSystem > --- > > Key: BEAM-2005 > URL: https://issues.apache.org/jira/browse/BEAM-2005 > Project: Beam > Issue Type: New Feature > Components: sdk-java-extensions >Reporter: Stephen Sisk >Assignee: Stephen Sisk > Fix For: First stable release > > > Beam's FileSystem creates an abstraction for reading from files in many > different places. > We should add a Hadoop FileSystem implementation > (https://hadoop.apache.org/docs/r2.8.0/api/org/apache/hadoop/fs/FileSystem.html) > - that would enable us to read from any file system that implements > FileSystem (including HDFS, azure, s3, etc..) > I'm investigating this now. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15975193#comment-15975193 ] Aviem Zur commented on BEAM-2005: - I'll refine - it should be in a separate module, but I think core should depend on it (i.e. out of the box functionality). > Add a Hadoop FileSystem implementation of Beam's FileSystem > --- > > Key: BEAM-2005 > URL: https://issues.apache.org/jira/browse/BEAM-2005 > Project: Beam > Issue Type: New Feature > Components: sdk-java-extensions >Reporter: Stephen Sisk >Assignee: Stephen Sisk > Fix For: First stable release > > > Beam's FileSystem creates an abstraction for reading from files in many > different places. > We should add a Hadoop FileSystem implementation > (https://hadoop.apache.org/docs/r2.8.0/api/org/apache/hadoop/fs/FileSystem.html) > - that would enable us to read from any file system that implements > FileSystem (including HDFS, azure, s3, etc..) > I'm investigating this now. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15975164#comment-15975164 ] Eugene Kirpichov commented on BEAM-2005: After this is done, and before the first stable release, we should delete https://github.com/apache/beam/tree/master/sdks/java/io/hdfs/src/main/java/org/apache/beam/sdk/io/hdfs. > Add a Hadoop FileSystem implementation of Beam's FileSystem > --- > > Key: BEAM-2005 > URL: https://issues.apache.org/jira/browse/BEAM-2005 > Project: Beam > Issue Type: New Feature > Components: sdk-java-extensions >Reporter: Stephen Sisk >Assignee: Stephen Sisk > Fix For: First stable release > > > Beam's FileSystem creates an abstraction for reading from files in many > different places. > We should add a Hadoop FileSystem implementation > (https://hadoop.apache.org/docs/r2.8.0/api/org/apache/hadoop/fs/FileSystem.html) > - that would enable us to read from any file system that implements > FileSystem (including HDFS, azure, s3, etc..) > I'm investigating this now. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15975001#comment-15975001 ] Daniel Halperin commented on BEAM-2005: --- `core` vs `extensions` -- this won't be in `sdk-java-core` itself, it will probably be in `sdk-java-extensions-hadoop` or whatever (just like `GcsFileSystem` either is moving or has moved to the new `sdks-java-extensions-gcp-core`). I could see also `sdk-java-io-hadoop`, but I think this is a reasonable-ish use of `core`. Our JIRA tags are not perfect. > Add a Hadoop FileSystem implementation of Beam's FileSystem > --- > > Key: BEAM-2005 > URL: https://issues.apache.org/jira/browse/BEAM-2005 > Project: Beam > Issue Type: New Feature > Components: sdk-java-extensions >Reporter: Stephen Sisk >Assignee: Stephen Sisk > Fix For: First stable release > > > Beam's FileSystem creates an abstraction for reading from files in many > different places. > We should add a Hadoop FileSystem implementation > (https://hadoop.apache.org/docs/r2.8.0/api/org/apache/hadoop/fs/FileSystem.html) > - that would enable us to read from any file system that implements > FileSystem (including HDFS, azure, s3, etc..) > I'm investigating this now. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15974991#comment-15974991 ]

Jean-Baptiste Onofré commented on BEAM-2005:

I don't mind whether it's core or extension (it could make sense to have a filesystem extension).
Fully agree that it's a must-have for the first stable release. That's why I started to focus on this last weekend.

> Add a Hadoop FileSystem implementation of Beam's FileSystem
> ---
>
> Key: BEAM-2005
> URL: https://issues.apache.org/jira/browse/BEAM-2005
> Project: Beam
> Issue Type: New Feature
> Components: sdk-java-extensions
> Reporter: Stephen Sisk
> Assignee: Stephen Sisk
> Fix For: First stable release
>
>
> Beam's FileSystem creates an abstraction for reading from files in many
> different places.
> We should add a Hadoop FileSystem implementation
> (https://hadoop.apache.org/docs/r2.8.0/api/org/apache/hadoop/fs/FileSystem.html)
> - that would enable us to read from any file system that implements
> FileSystem (including HDFS, azure, s3, etc..)
> I'm investigating this now.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15974980#comment-15974980 ] Aviem Zur commented on BEAM-2005: - I think this should be in `core` not in `extensions`. Also, I think this should be a blocker for first stable release. WDYT? > Add a Hadoop FileSystem implementation of Beam's FileSystem > --- > > Key: BEAM-2005 > URL: https://issues.apache.org/jira/browse/BEAM-2005 > Project: Beam > Issue Type: New Feature > Components: sdk-java-extensions >Affects Versions: First stable release >Reporter: Stephen Sisk >Assignee: Stephen Sisk > > Beam's FileSystem creates an abstraction for reading from files in many > different places. > We should add a Hadoop FileSystem implementation > (https://hadoop.apache.org/docs/r2.8.0/api/org/apache/hadoop/fs/FileSystem.html) > - that would enable us to read from any file system that implements > FileSystem (including HDFS, azure, s3, etc..) > I'm investigating this now. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15974783#comment-15974783 ]

Jean-Baptiste Onofré commented on BEAM-2005:

I'm thinking about something like:
{code}
TextIO.from("hdfs://credentialTag@namenode/path/to/file")
{code}

[~sisk] [~dhalp...@google.com] Thoughts?

> Add a Hadoop FileSystem implementation of Beam's FileSystem
> ---
>
> Key: BEAM-2005
> URL: https://issues.apache.org/jira/browse/BEAM-2005
> Project: Beam
> Issue Type: New Feature
> Components: sdk-java-extensions
> Affects Versions: First stable release
> Reporter: Stephen Sisk
> Assignee: Stephen Sisk
>
> Beam's FileSystem creates an abstraction for reading from files in many
> different places.
> We should add a Hadoop FileSystem implementation
> (https://hadoop.apache.org/docs/r2.8.0/api/org/apache/hadoop/fs/FileSystem.html)
> - that would enable us to read from any file system that implements
> FileSystem (including HDFS, azure, s3, etc..)
> I'm investigating this now.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
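To make the proposed spec format concrete, here is a small sketch of how such a string splits into its parts using only `java.net.URI`. It assumes the credential tag would travel in the URI's user-info position, which is how the example above reads; the class `HdfsSpec` and its helper methods are hypothetical and say nothing about how Beam itself would resolve the tag.

```java
import java.net.URI;

public class HdfsSpec {
    /** Returns the part before '@' in the authority, i.e. the proposed credential tag. */
    public static String credentialTag(String spec) {
        return URI.create(spec).getUserInfo();
    }

    /** Returns the host part of the authority, i.e. the namenode in the proposal. */
    public static String namenode(String spec) {
        return URI.create(spec).getHost();
    }

    /** Returns the file path within the filesystem. */
    public static String path(String spec) {
        return URI.create(spec).getPath();
    }

    public static void main(String[] args) {
        String spec = "hdfs://credentialTag@namenode/path/to/file";
        System.out.println(credentialTag(spec)); // credentialTag
        System.out.println(namenode(spec));      // namenode
        System.out.println(path(spec));          // /path/to/file
    }
}
```

One consequence of this layout is that a spec without a tag still parses: the user-info part is simply null.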
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15974779#comment-15974779 ]

Jean-Baptiste Onofré commented on BEAM-2005:

By the way, an important part of HDFS is Kerberos support. I think it would require a couple of additional methods (as we have for Google Storage) around Kerberos authentication.

> Add a Hadoop FileSystem implementation of Beam's FileSystem
> ---
>
> Key: BEAM-2005
> URL: https://issues.apache.org/jira/browse/BEAM-2005
> Project: Beam
> Issue Type: New Feature
> Components: sdk-java-extensions
> Affects Versions: First stable release
> Reporter: Stephen Sisk
> Assignee: Stephen Sisk
>
> Beam's FileSystem creates an abstraction for reading from files in many
> different places.
> We should add a Hadoop FileSystem implementation
> (https://hadoop.apache.org/docs/r2.8.0/api/org/apache/hadoop/fs/FileSystem.html)
> - that would enable us to read from any file system that implements
> FileSystem (including HDFS, azure, s3, etc..)
> I'm investigating this now.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
[jira] [Commented] (BEAM-2005) Add a Hadoop FileSystem implementation of Beam's FileSystem
[ https://issues.apache.org/jira/browse/BEAM-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15974120#comment-15974120 ]

Jean-Baptiste Onofré commented on BEAM-2005:

Fully agree. I started this + S3 + Azure (as there are some slight differences).
I'm also experimenting with the MongoDB GridFS filesystem.

> Add a Hadoop FileSystem implementation of Beam's FileSystem
> ---
>
> Key: BEAM-2005
> URL: https://issues.apache.org/jira/browse/BEAM-2005
> Project: Beam
> Issue Type: New Feature
> Components: sdk-java-extensions
> Affects Versions: First stable release
> Reporter: Stephen Sisk
> Assignee: Stephen Sisk
>
> Beam's FileSystem creates an abstraction for reading from files in many
> different places.
> We should add a Hadoop FileSystem implementation
> (https://hadoop.apache.org/docs/r2.8.0/api/org/apache/hadoop/fs/FileSystem.html)
> - that would enable us to read from any file system that implements
> FileSystem (including HDFS, azure, s3, etc..)
> I'm investigating this now.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)