Re: Any reason to not use [vfs]?

2018-03-06 Thread Romain Manni-Bucau
Just to share a bit more than a ticket, here is a bootstrap implementation:
https://github.com/apache/beam/pull/4803


Romain Manni-Bucau


2018-03-06 12:42 GMT+01:00 Reuven Lax :

> Cool. Then for now we should create a separate Vfs-backed Filesystem impl.
> Once Vfs supports all we need, I think we can consider keeping only that.
>
> Keep in mind that the bulk operations Luke mentioned translate to native
> bulk operations for Gcs at least (BatchRequest is part of the Gcs API). I'm
> not entirely sure whether HDFS natively supports this or not. This implies
> that we would need some way of expressing bulk operations through Vfs.

Re: Any reason to not use [vfs]?

2018-03-06 Thread Reuven Lax
Cool. Then for now we should create a separate Vfs-backed Filesystem impl.
Once Vfs supports all we need, I think we can consider keeping only that.

Keep in mind that the bulk operations Luke mentioned translate to native
bulk operations for Gcs at least (BatchRequest is part of the Gcs API). I'm
not entirely sure whether HDFS natively supports this or not. This implies
that we would need some way of expressing bulk operations through Vfs.
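One way to picture "expressing bulk operations through Vfs" is an optional capability that a backend can advertise, with a per-file fallback for backends that lack it. The following is only a sketch: `SingleFileOps`, `BulkDeleteOps`, and the recording backends are hypothetical names made up for illustration, not part of Beam or Commons VFS.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical per-file API, standing in for what a VFS-style layer offers. */
interface SingleFileOps {
  void delete(String path);
}

/** Hypothetical optional capability for backends that have a native batch call. */
interface BulkDeleteOps {
  void deleteAll(List<String> paths);
}

final class BulkDelete {
  private BulkDelete() {}

  /** Uses the native bulk call when the backend offers one, else deletes file by file. */
  static void deleteAll(SingleFileOps fs, List<String> paths) {
    if (fs instanceof BulkDeleteOps) {
      ((BulkDeleteOps) fs).deleteAll(paths);
      return;
    }
    for (String path : paths) {
      fs.delete(path);
    }
  }
}

/** In-memory backend without bulk support; it records every call it receives. */
class RecordingFs implements SingleFileOps {
  final List<String> calls = new ArrayList<>();

  @Override
  public void delete(String path) {
    calls.add("delete " + path);
  }
}

/** In-memory backend that also advertises the bulk capability. */
final class RecordingBulkFs extends RecordingFs implements BulkDeleteOps {
  @Override
  public void deleteAll(List<String> paths) {
    calls.add("deleteAll x" + paths.size());
  }
}
```

With this shape, a backend with a native batch API receives one call for the whole list, while any other backend still works, just one file at a time.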


On Tue, Mar 6, 2018 at 2:27 AM Romain Manni-Bucau 
wrote:

> @Reuven: this was what I had in mind yes.

Re: Any reason to not use [vfs]?

2018-03-06 Thread Jean-Baptiste Onofré
+1 for the discussion and tracking.

Regards
JB

On 03/06/2018 12:07 PM, Romain Manni-Bucau wrote:
> Created https://issues.apache.org/jira/browse/BEAM-3786 to track the
> discussion (without putting too many details in the ticket for now).

Re: Any reason to not use [vfs]?

2018-03-06 Thread Romain Manni-Bucau
Created https://issues.apache.org/jira/browse/BEAM-3786 to track the
discussion (without putting too many details in the ticket for now).


Romain Manni-Bucau


2018-03-06 11:27 GMT+01:00 Romain Manni-Bucau :

> @Reuven: this was what I had in mind yes.

Re: Any reason to not use [vfs]?

2018-03-06 Thread Romain Manni-Bucau
@Reuven: this was what I had in mind yes.


Romain Manni-Bucau


2018-03-06 11:24 GMT+01:00 Reuven Lax :

> Part of the point of the current Filesystem class _is_ to handle these
> things (e.g. bulk delete/rename). If Vfs doesn't, then maybe the right
> answer is to keep Filesystem but put Vfs under it (and maybe that will
> eventually allow us to remove some of the current code).

Re: Any reason to not use [vfs]?

2018-03-06 Thread Reuven Lax
Part of the point of the current Filesystem class _is_ to handle these
things (e.g. bulk delete/rename). If Vfs doesn't, then maybe the right
answer is to keep Filesystem but put Vfs under it (and maybe that will
eventually allow us to remove some of the current code).
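A rough sketch of what "keep Filesystem but put Vfs under it" could look like at the lookup level: native implementations stay registered per scheme, and a single VFS-backed implementation picks up every scheme that has no native one. `SchemeFs` and `SchemeRegistry` are hypothetical names for this illustration; this is not Beam's actual registration mechanism.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical stand-in for a filesystem implementation; not Beam's FileSystem class. */
interface SchemeFs {
  String name();
}

final class SchemeRegistry {
  private final Map<String, SchemeFs> nativeByScheme = new HashMap<>();
  private final SchemeFs vfsBacked;

  /** The VFS-backed implementation serves every scheme without a native entry. */
  SchemeRegistry(SchemeFs vfsBacked) {
    this.vfsBacked = vfsBacked;
  }

  /** Registers a native implementation (for example one with real bulk calls). */
  void registerNative(String scheme, SchemeFs fs) {
    nativeByScheme.put(scheme.toLowerCase(Locale.ROOT), fs);
  }

  /** Prefers the native implementation, falling back to the VFS-backed one. */
  SchemeFs forScheme(String scheme) {
    return nativeByScheme.getOrDefault(scheme.toLowerCase(Locale.ROOT), vfsBacked);
  }
}
```

Under this assumption, removing a native implementation later only means dropping its `registerNative` call; lookups for that scheme then land on the VFS-backed one.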



Re: Any reason to not use [vfs]?

2018-03-05 Thread Romain Manni-Bucau
On 6 March 2018 at 01:05, "Lukasz Cwik" wrote:

> As is, how does VFS improve upon the current FileSystem solution?

This is about the ecosystem. I have never seen a Beam filesystem implemented
outside Beam, but I have seen tons of VFS users and implementations.

> How much work is it before VFS supports the Apache Beam use cases (bulk
> operations, glob support)?

I don't expect VFS to handle globs, but I expect Beam to handle them on top
of VFS. In other words, glob matching is independent of the filesystem
implementation.

Bulk is a good thing, but I think it can be handled as well. Best would be
to support it on top of VFS.

I see VFS as the pure connectivity layer allowing Beam to split and process
data in parallel, not as a complete replacement.

> Is it the right direction for the VFS project to support the above changes?
> (things that are important to a data parallel processing system aren't
> always important for a filesystem implementation.)

Bulk will be. Distributed processing will stay in Beam; parallel processing
is important for VFS IMHO since it targets plain batch (like jbatch) as
well.

> For example, Apache Beam relies on bulk match, bulk delete, bulk rename to
> be able to do things within FileIO efficiently (Dataflow has had a bunch of
> experience where renaming one file at a time, even when using multiple
> threads, is quite slow when you have a million files to rename, so having
> bulk APIs is important). It has a registration mechanism and ties into
> PipelineOptions pretty well. In my opinion the largest deficiency I see
> with FileSystems is that we should have used URIs[1] instead of abstract
> resource types since we could standardize how URIs are resolved and how
> globs work for them, allowing FileSystem authors to implement even less.
> The tricky part is how does a URI map onto the file system correctly.

Sounds like something unrelated to VFS, right? That said, it is not too late
to use a fallback mechanism when parsing the path, is it?



1: https://issues.apache.org/jira/browse/BEAM-2283
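To illustrate the claim that glob matching is independent of the filesystem implementation, here is a minimal sketch that filters a plain listing of names, as any backend could return it, with the JDK's glob matcher. `GlobOverListing` is a hypothetical helper; the JDK matcher is used here only as a pattern engine, and a real implementation would still need to decide how the listing itself is produced.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

final class GlobOverListing {
  private GlobOverListing() {}

  /**
   * Keeps the entries of a listing that match the glob. The listing is just a
   * list of relative names, so it could come from any backend.
   */
  static List<String> match(String glob, List<String> listing) {
    PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
    List<String> matched = new ArrayList<>();
    for (String name : listing) {
      // In java.nio glob syntax, "*" does not cross directory boundaries.
      if (matcher.matches(Paths.get(name))) {
        matched.add(name);
      }
    }
    return matched;
  }
}
```

The backend only has to list names; the pattern logic lives in one place above it.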

On Mon, Mar 5, 2018 at 1:45 PM, Romain Manni-Bucau 
wrote:

>
>
> On 5 March 2018 at 22:26, "Robert Bradshaw" wrote:
>
>> First, let's try to make the terminology abundantly clear, as I for one
>> have (I think) misinterpreted what has been proposed.
>>
>> VfsFileSystem: A subclass of
>> https://github.com/apache/beam/blob/9f81fd299bd32e0d6056a7da9fa994cf74db0ed9/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystem.java
>>
>> VfsIO: A replacement for
>> https://github.com/apache/beam/blob/1e84e49e253f8833f28f1268bec3813029f582d0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java
>> written using Vfs instead of
>> https://github.com/apache/beam/blob/29859eb54d05b96a9db477e7bb04537510273bd2/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystems.java
>
> Ack
>
>> Between these two options, VfsFileSystem is the way to go. It will allow
>> us to use all our existing File sources/sinks (including all the fancy
>> watching/streaming support from FileIO) with any filesystem supported by
>> Vfs. Long-term, if VFS is good enough (and we'll be able to do direct
>> experiments of HadoopFileSystem vs. VfsFileSystem-on-Hadoop) we could
>> consider moving to VFS entirely and even removing the layer of indirection.
>> Vfs is a filesystem; this is the right level of abstraction to plug into.
>> Even if it's lacking in some respects, it may still be worth keeping in
>> parallel to the existing FileSystem implementations long-term if it has
>> significantly better coverage.
>
> Ok
>
>> On the other hand, a re-implementation of FileIO on top of Vfs seems like
>> a lot of duplication of code (and ongoing maintenance cost) and will be
>> difficult to build on top of (e.g. the binding of TextIO to FileIO is not
>> dynamic like the binding of filesystems).
>
> Well, it shouldn't. Let me clarify my view: we, as the ASF and not just
> Beam, can make both projects grow from that work and become more mature and
> interoperable with the existing ecosystem (who implements a Beam filesystem
> when providing a new filesystem?). Interestingly, recent Java versions have
> a filesystem abstraction too, but that one is harder to evolve for our
> needs. The high-level goal is to stay ecosystem-friendly and not create yet
> another one.
>
>
>
>
> On Mon, Mar 5, 2018 at 1:05 PM Romain Manni-Bucau 
> wrote:
>
>> Not backing VFS by a filesystem sounds saner, so VfsIO is probably the way
>> to go. It would be a competitor to FileIO and hopefully its replacement in
>> the mid/long term.
>>
>> What about doing the opposite: implementing a VFS filesystem for all the
>> filesystems we support, potentially enriching VFS if needed? Then, from
>> what I read, we can just drop the Beam abstraction.
>>
>> On 5 March 2018 at 20:49, "Reuven Lax" wrote:
>>
>>> terminology is confusing here, since the existing FileIO is a
>>> PTransform. VfsFilesystem would be a better name.
>>>
>>>
>>> On Mon, Mar 5, 2018 at 11:46 AM Robert Bradshaw 
>>> wrote:
>>>
 On Mon, Mar 5, 2018 at 11:38 AM Reuven Lax  wrote:

> 

Re: Any reason to not use [vfs]?

2018-03-05 Thread Lukasz Cwik
As is, how does VFS improve upon the current FileSystem solution?

How much work is it before VFS supports the Apache Beam use cases (bulk
operations, glob support)?

Is it the right direction for the VFS project to support the above changes?
(things that are important to a data parallel processing system aren't
always important for a filesystem implementation.)

For example, Apache Beam relies on bulk match, bulk delete and bulk rename
to do things within FileIO efficiently (Dataflow has had a bunch of
experience showing that renaming one file at a time, even with multiple
threads, is quite slow when you have a million files to rename, so having
bulk APIs is important). Beam's FileSystem also has a registration mechanism
and ties into PipelineOptions pretty well. In my opinion the largest
deficiency with FileSystems is that we should have used URIs[1] instead of
abstract resource types, since we could then standardize how URIs are
resolved and how globs work for them, allowing FileSystem authors to
implement even less. The tricky part is how a URI maps onto the file system
correctly.

1: https://issues.apache.org/jira/browse/BEAM-2283
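
To make the bulk-operation point concrete, here is a minimal sketch of how a bulk rename could be layered over a single-file API. All names (SingleFileOps, BulkFileOps, LoopingBulkOps) are hypothetical and belong to neither Beam nor Commons VFS; the assumption is only that the underlying library renames one file per call.

```java
import java.util.List;

/** Hypothetical single-file API, standing in for a VFS-style filesystem. */
interface SingleFileOps {
    void rename(String source, String destination);
}

/** Hypothetical bulk API of the kind described above. */
interface BulkFileOps {
    void renameAll(List<String> sources, List<String> destinations);
}

/**
 * Fallback that issues one rename per file. A backend with a native batch
 * request would implement BulkFileOps directly instead of using this.
 */
final class LoopingBulkOps implements BulkFileOps {
    private final SingleFileOps delegate;

    LoopingBulkOps(SingleFileOps delegate) {
        this.delegate = delegate;
    }

    @Override
    public void renameAll(List<String> sources, List<String> destinations) {
        if (sources.size() != destinations.size()) {
            throw new IllegalArgumentException("sources and destinations differ in length");
        }
        // One call per pair: correct, but this is the slow path for a million files.
        for (int i = 0; i < sources.size(); i++) {
            delegate.rename(sources.get(i), destinations.get(i));
        }
    }
}
```

The loop keeps the bulk contract usable on any backend, while leaving room for a store with a real batch request to skip the loop entirely.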

On Mon, Mar 5, 2018 at 1:45 PM, Romain Manni-Bucau 
wrote:

>
>
> Le 5 mars 2018 22:26, "Robert Bradshaw"  a écrit :
>
> First, let's try to make the terminology abundantly clear, as I for one
> have (I think) misinterpreted what has been proposed.
>
> VfsFileSystem: A subclass of https://github.com/apache/b
> eam/blob/9f81fd299bd32e0d6056a7da9fa994cf74db0ed9/sdks/java/
> core/src/main/java/org/apache/beam/sdk/io/FileSystem.java
>
> VfsIO: A replacement for https://github.com/apache/
> beam/blob/1e84e49e253f8833f28f1268bec3813029f582d0/sdks/
> java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java written using
> Vfs instead of https://github.com/apache/beam/blob/29859eb54d05b96a9db47
> 7e7bb04537510273bd2/sdks/java/core/src/main/java/org/apache/
> beam/sdk/io/FileSystems.java
>
>
>
> Ack
>
>
> Between these two options, VfsFileSystem is the way to go. It will allow
> us to use all our existing File sources/sinks (including all the fancy
> watching/streaming support from FileIO) with any filesystem supported by
> Vfs. Long-term, if VFS is good enough (and we'll be able to do direct
> experiments of HadoopFileSystem vs. VfsFileSystem-on-Hadoop) we could
> consider moving to VFS entirely and even removing the layer of indirection.
> Vfs is a filesystem, this is the right level of abstraction to plug into.
> Even if it's lacking in some respects, it may still be worth keeping in
> parallel to the existing FileSystem implementations long-term if it has
> significantly better coverage.
>
>
> Ok
>
>
> On the other hand, a re-implementation of FileIO on top of Vfs seems like
> a lot of duplication of code (and ongoing maintenance cost) and will be
> difficult to build on top of (e.g. the binding of TextIO to FileIO is not
> dynamic like the binding of filesystems).
>
>
> Well, it shouldn't. Let me clarify my view: we, as the ASF and not just
> Beam, can help both projects grow from that work and become more mature and
> interoperable with the existing ecosystem (who implements a Beam filesystem
> when providing a new filesystem?). Interestingly, recent Java versions have
> a filesystem abstraction too, but that one is harder to evolve for our
> needs. The high-level goal is to stay ecosystem friendly and not create yet
> another one.
>
>
>
>
> On Mon, Mar 5, 2018 at 1:05 PM Romain Manni-Bucau 
> wrote:
>
>> Not backing vfs by a filesystem sounds saner so VfsIO is probably the way
>> to go. It would be a FileIO concurrent and hopefully replacement on the
>> mid/long term.
>>
>> What about doing the opposite: implementing a vfs filesystem for all the
>> fs we support, potentially enrich vfs if needed? Then we can just drop beam
>> abstraction from what i read.
>>
>> Le 5 mars 2018 20:49, "Reuven Lax"  a écrit :
>>
>>> terminology is confusing here, since the existing FileIO is a
>>> PTransform. VfsFilesystem would be a better name.
>>>
>>>
>>> On Mon, Mar 5, 2018 at 11:46 AM Robert Bradshaw 
>>> wrote:
>>>
 On Mon, Mar 5, 2018 at 11:38 AM Reuven Lax  wrote:

> What about a beam Filesystem impl on top of Vfs as an alternative
> short-term solution? This would allow Vfs to be used with any IO.
>

 Yes, I think this is the VfsIO that was proposed.


> On Mon, Mar 5, 2018 at 11:37 AM Robert Bradshaw 
> wrote:
>
>>
>> On Mon, Mar 5, 2018 at 11:23 AM Romain Manni-Bucau <
>> rmannibu...@gmail.com> wrote:
>>
>>>
>>> 2018-03-05 20:04 GMT+01:00 Chamikara Jayalath 
>>> :
>>>
 I assume you mean https://commons.apache.org/proper/commons-vfs/.

 I'm not sure if we considered this when we originally implemented
 our own file-system abstraction but based on a quick look seems like 
 this
 is Java only.

>>>
>>> Yes, java only
>>>
>>>

 I t

Re: Any reason to not use [vfs]?

2018-03-05 Thread Romain Manni-Bucau
Le 5 mars 2018 22:26, "Robert Bradshaw"  a écrit :

First, let's try to make the terminology abundantly clear, as I for one
have (I think) misinterpreted what has been proposed.

VfsFileSystem: A subclass of https://github.com/apache/beam/blob/
9f81fd299bd32e0d6056a7da9fa994cf74db0ed9/sdks/java/core/src/
main/java/org/apache/beam/sdk/io/FileSystem.java

VfsIO: A replacement for https://github.com/apache/beam/blob/
1e84e49e253f8833f28f1268bec3813029f582d0/sdks/java/core/src/
main/java/org/apache/beam/sdk/io/FileIO.java written using Vfs instead of
https://github.com/apache/beam/blob/29859eb54d05b96a9db477e7bb0453
7510273bd2/sdks/java/core/src/main/java/org/apache/beam/sdk/
io/FileSystems.java



Ack


Between these two options, VfsFileSystem is the way to go. It will allow us
to use all our existing File sources/sinks (including all the fancy
watching/streaming support from FileIO) with any filesystem supported by
Vfs. Long-term, if VFS is good enough (and we'll be able to do direct
experiments of HadoopFileSystem vs. VfsFileSystem-on-Hadoop) we could
consider moving to VFS entirely and even removing the layer of indirection.
Vfs is a filesystem, this is the right level of abstraction to plug into.
Even if it's lacking in some respects, it may still be worth keeping in
parallel to the existing FileSystem implementations long-term if it has
significantly better coverage.


Ok


On the other hand, a re-implementation of FileIO on top of Vfs seems like a
lot of duplication of code (and ongoing maintenance cost) and will be
difficult to build on top of (e.g. the binding of TextIO to FileIO is not
dynamic like the binding of filesystems).


Well, it shouldn't. Let me clarify my view: we, as the ASF and not just
Beam, can help both projects grow from that work and become more mature and
interoperable with the existing ecosystem (who implements a Beam filesystem
when providing a new filesystem?). Interestingly, recent Java versions have
a filesystem abstraction too, but that one is harder to evolve for our
needs. The high-level goal is to stay ecosystem friendly and not create yet
another one.




On Mon, Mar 5, 2018 at 1:05 PM Romain Manni-Bucau 
wrote:

> Not backing vfs by a filesystem sounds saner so VfsIO is probably the way
> to go. It would be a FileIO concurrent and hopefully replacement on the
> mid/long term.
>
> What about doing the opposite: implementing a vfs filesystem for all the
> fs we support, potentially enrich vfs if needed? Then we can just drop beam
> abstraction from what i read.
>
> Le 5 mars 2018 20:49, "Reuven Lax"  a écrit :
>
>> terminology is confusing here, since the existing FileIO is a PTransform.
>> VfsFilesystem would be a better name.
>>
>>
>> On Mon, Mar 5, 2018 at 11:46 AM Robert Bradshaw 
>> wrote:
>>
>>> On Mon, Mar 5, 2018 at 11:38 AM Reuven Lax  wrote:
>>>
 What about a beam Filesystem impl on top of Vfs as an alternative
 short-term solution? This would allow Vfs to be used with any IO.

>>>
>>> Yes, I think this is the VfsIO that was proposed.
>>>
>>>
 On Mon, Mar 5, 2018 at 11:37 AM Robert Bradshaw 
 wrote:

>
> On Mon, Mar 5, 2018 at 11:23 AM Romain Manni-Bucau <
> rmannibu...@gmail.com> wrote:
>
>>
>> 2018-03-05 20:04 GMT+01:00 Chamikara Jayalath :
>>
>>> I assume you mean https://commons.apache.org/proper/commons-vfs/.
>>>
>>> I'm not sure if we considered this when we originally implemented
>>> our own file-system abstraction but based on a quick look seems like 
>>> this
>>> is Java only.
>>>
>>
>> Yes, java only
>>
>>
>>>
>>> I think having a similar file-system abstraction for various
>>> languages is a plus point for Beam. May be we should consider a Java
>>> file-system implementation for VFS ?
>>>
>>
>> Can be an option but when I see the current complexity I'm not sure
>> mixing 2 abstractions would help, maybe just a VfsIO for java users would
>> be good enough - thinking out loud.
>>
>> What sounds clear to me is that each language will need its own
>> abstraction - which kind of join your proposal. However we can still make
>> it smooth and easy on the java side - which
>> will likely stay mainstream for still some years - using vfs as our
>> java impl instead of reimplementing the full abstraction? This way we 
>> keep
>> our *API* but we drop beam *impl* to just reuse VFS.
>>
>> PS: for gcs https://github.com/ltouati/vfs-gcs can be a good example
>> on how it can work.
>>
>
> I think a VfsIO makes a lot of sense in the short term, and will give
> us the experience needed to decide if we can move solely to VFS (for Java
> at least) for implementation, and possibly API in a future major release,
> in the long run.
>



Re: Any reason to not use [vfs]?

2018-03-05 Thread Eugene Kirpichov
If VFS was mature enough for our needs, then I'd give a +1 to using it in
Beam Java SDK - currently it's not, so we can't use it directly.
It's indeed a reasonable option to use the VFS API inside Beam, and port
our implementations of FileSystem(s) to that API, and then potentially
donate that to the VFS project.
Upsides:
- big contribution to the Apache ecosystem
- more contributors
Downsides:
- we become dependent on VFS release cycles for fixing bugs in our
filesystem implementations, but maybe that's ok, depending on how frequent
its releases are and how well it's maintained in general.
- since the codebase is no longer under our full control, we become
dependent on the diligence of VFS committers and their testing procedures
for code quality. I'd assume they are diligent, but being unfamiliar with
the project, it may be a risk.

I'd say the upsides outweigh the downsides, so - this seems like a very
substantial amount of work but if someone's willing to do it, great.

As for creating a VfsIO transform: I'm very strongly against this. A
filesystem layer should transparently work with everything file-related,
and not be limited to use from a single transform. Same reason we don't
have GCSIO, S3IO, LocalIO, ZipIO etc.
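
A minimal sketch of why a filesystem layer makes per-backend transforms unnecessary: if the backend is picked from the URI scheme, the same transform code works for every registered store. SchemeRegistry and Backend are hypothetical names for illustration, not Beam's or VFS's actual registration API, and defaulting scheme-less paths to "file" is an assumption.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical backend handle; a real one would expose open/match/delete. */
interface Backend {
    String name();
}

/** Hypothetical registry keyed by URI scheme. */
final class SchemeRegistry {
    private final Map<String, Backend> backends = new HashMap<>();

    void register(String scheme, Backend backend) {
        backends.put(scheme.toLowerCase(Locale.ROOT), backend);
    }

    /** Picks the backend from the scheme; plain paths are assumed to be local. */
    Backend resolve(String spec) {
        String scheme = URI.create(spec).getScheme();
        if (scheme == null) {
            scheme = "file";
        }
        Backend backend = backends.get(scheme.toLowerCase(Locale.ROOT));
        if (backend == null) {
            throw new IllegalArgumentException("No filesystem registered for scheme: " + scheme);
        }
        return backend;
    }
}
```

A transform would call resolve(spec) once and never need to know whether it is talking to GCS, S3 or a local disk.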

On Mon, Mar 5, 2018 at 1:05 PM Romain Manni-Bucau 
wrote:

> Not backing vfs by a filesystem sounds saner so VfsIO is probably the way
> to go. It would be a FileIO concurrent and hopefully replacement on the
> mid/long term.
>
> What about doing the opposite: implementing a vfs filesystem for all the
> fs we support, potentially enrich vfs if needed? Then we can just drop beam
> abstraction from what i read.
>
> Le 5 mars 2018 20:49, "Reuven Lax"  a écrit :
>
>> terminology is confusing here, since the existing FileIO is a PTransform.
>> VfsFilesystem would be a better name.
>>
>>
>> On Mon, Mar 5, 2018 at 11:46 AM Robert Bradshaw 
>> wrote:
>>
>>> On Mon, Mar 5, 2018 at 11:38 AM Reuven Lax  wrote:
>>>
 What about a beam Filesystem impl on top of Vfs as an alternative
 short-term solution? This would allow Vfs to be used with any IO.

>>>
>>> Yes, I think this is the VfsIO that was proposed.
>>>
>>>
 On Mon, Mar 5, 2018 at 11:37 AM Robert Bradshaw 
 wrote:

>
> On Mon, Mar 5, 2018 at 11:23 AM Romain Manni-Bucau <
> rmannibu...@gmail.com> wrote:
>
>>
>> 2018-03-05 20:04 GMT+01:00 Chamikara Jayalath :
>>
>>> I assume you mean https://commons.apache.org/proper/commons-vfs/.
>>>
>>> I'm not sure if we considered this when we originally implemented
>>> our own file-system abstraction but based on a quick look seems like 
>>> this
>>> is Java only.
>>>
>>
>> Yes, java only
>>
>>
>>>
>>> I think having a similar file-system abstraction for various
>>> languages is a plus point for Beam. May be we should consider a Java
>>> file-system implementation for VFS ?
>>>
>>
>> Can be an option but when I see the current complexity I'm not sure
>> mixing 2 abstractions would help, maybe just a VfsIO for java users would
>> be good enough - thinking out loud.
>>
>> What sounds clear to me is that each language will need its own
>> abstraction - which kind of join your proposal. However we can still make
>> it smooth and easy on the java side - which
>> will likely stay mainstream for still some years - using vfs as our
>> java impl instead of reimplementing the full abstraction? This way we 
>> keep
>> our *API* but we drop beam *impl* to just reuse VFS.
>>
>> PS: for gcs https://github.com/ltouati/vfs-gcs can be a good example
>> on how it can work.
>>
>
> I think a VfsIO makes a lot of sense in the short term, and will give
> us the experience needed to decide if we can move solely to VFS (for Java
> at least) for implementation, and possibly API in a future major release,
> in the long run.
>



Re: Any reason to not use [vfs]?

2018-03-05 Thread Robert Bradshaw
First, let's try to make the terminology abundantly clear, as I for one
have (I think) misinterpreted what has been proposed.

VfsFileSystem: A subclass of
https://github.com/apache/beam/blob/9f81fd299bd32e0d6056a7da9fa994cf74db0ed9/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystem.java

VfsIO: A replacement for
https://github.com/apache/beam/blob/1e84e49e253f8833f28f1268bec3813029f582d0/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java
written using Vfs instead of
https://github.com/apache/beam/blob/29859eb54d05b96a9db477e7bb04537510273bd2/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystems.java

Between these two options, VfsFileSystem is the way to go. It will allow us
to use all our existing File sources/sinks (including all the fancy
watching/streaming support from FileIO) with any filesystem supported by
Vfs. Long-term, if VFS is good enough (and we'll be able to do direct
experiments of HadoopFileSystem vs. VfsFileSystem-on-Hadoop) we could
consider moving to VFS entirely and even removing the layer of indirection.
Vfs is a filesystem, this is the right level of abstraction to plug into.
Even if it's lacking in some respects, it may still be worth keeping in
parallel to the existing FileSystem implementations long-term if it has
significantly better coverage.

On the other hand, a re-implementation of FileIO on top of Vfs seems like a
lot of duplication of code (and ongoing maintenance cost) and will be
difficult to build on top of (e.g. the binding of TextIO to FileIO is not
dynamic like the binding of filesystems).
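
To illustrate the VfsFileSystem option, here is a minimal sketch of the adapter shape: one subclass of the pipeline's filesystem contract that delegates to a VFS-style library. VfsLike and PipelineFileSystem are hypothetical stand-ins, not the real Commons VFS or Beam FileSystem types, and the real contracts are considerably larger (match, create, rename, delete).

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical stand-in for a VFS-style library entry point. */
interface VfsLike {
    InputStream read(String uri) throws IOException;
}

/** Hypothetical stand-in for the filesystem contract that file transforms call. */
abstract class PipelineFileSystem {
    abstract String scheme();

    abstract InputStream open(String spec) throws IOException;
}

/** Adapter: existing transforms see a normal filesystem, VFS does the I/O. */
final class VfsBackedFileSystem extends PipelineFileSystem {
    private final String scheme;
    private final VfsLike vfs;

    VfsBackedFileSystem(String scheme, VfsLike vfs) {
        this.scheme = scheme;
        this.vfs = vfs;
    }

    @Override
    String scheme() {
        return scheme;
    }

    @Override
    InputStream open(String spec) throws IOException {
        if (!spec.startsWith(scheme + "://")) {
            throw new IllegalArgumentException("Expected a " + scheme + ":// URI, got: " + spec);
        }
        // Delegate the actual read; nothing above this layer changes.
        return vfs.read(spec);
    }
}
```

Under this shape the existing sources and sinks stay untouched, which is the duplication argument made above against rewriting FileIO itself.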


On Mon, Mar 5, 2018 at 1:05 PM Romain Manni-Bucau 
wrote:

> Not backing vfs by a filesystem sounds saner so VfsIO is probably the way
> to go. It would be a FileIO concurrent and hopefully replacement on the
> mid/long term.
>
> What about doing the opposite: implementing a vfs filesystem for all the
> fs we support, potentially enrich vfs if needed? Then we can just drop beam
> abstraction from what i read.
>
> Le 5 mars 2018 20:49, "Reuven Lax"  a écrit :
>
>> terminology is confusing here, since the existing FileIO is a PTransform.
>> VfsFilesystem would be a better name.
>>
>>
>> On Mon, Mar 5, 2018 at 11:46 AM Robert Bradshaw 
>> wrote:
>>
>>> On Mon, Mar 5, 2018 at 11:38 AM Reuven Lax  wrote:
>>>
 What about a beam Filesystem impl on top of Vfs as an alternative
 short-term solution? This would allow Vfs to be used with any IO.

>>>
>>> Yes, I think this is the VfsIO that was proposed.
>>>
>>>
 On Mon, Mar 5, 2018 at 11:37 AM Robert Bradshaw 
 wrote:

>
> On Mon, Mar 5, 2018 at 11:23 AM Romain Manni-Bucau <
> rmannibu...@gmail.com> wrote:
>
>>
>> 2018-03-05 20:04 GMT+01:00 Chamikara Jayalath :
>>
>>> I assume you mean https://commons.apache.org/proper/commons-vfs/.
>>>
>>> I'm not sure if we considered this when we originally implemented
>>> our own file-system abstraction but based on a quick look seems like 
>>> this
>>> is Java only.
>>>
>>
>> Yes, java only
>>
>>
>>>
>>> I think having a similar file-system abstraction for various
>>> languages is a plus point for Beam. May be we should consider a Java
>>> file-system implementation for VFS ?
>>>
>>
>> Can be an option but when I see the current complexity I'm not sure
>> mixing 2 abstractions would help, maybe just a VfsIO for java users would
>> be good enough - thinking out loud.
>>
>> What sounds clear to me is that each language will need its own
>> abstraction - which kind of join your proposal. However we can still make
>> it smooth and easy on the java side - which
>> will likely stay mainstream for still some years - using vfs as our
>> java impl instead of reimplementing the full abstraction? This way we 
>> keep
>> our *API* but we drop beam *impl* to just reuse VFS.
>>
>> PS: for gcs https://github.com/ltouati/vfs-gcs can be a good example
>> on how it can work.
>>
>
> I think a VfsIO makes a lot of sense in the short term, and will give
> us the experience needed to decide if we can move solely to VFS (for Java
> at least) for implementation, and possibly API in a future major release,
> in the long run.
>



Re: Any reason to not use [vfs]?

2018-03-05 Thread Romain Manni-Bucau
Not backing vfs by a filesystem sounds saner, so VfsIO is probably the way
to go. It would be a competitor to FileIO and hopefully its replacement in
the mid/long term.

What about doing the opposite: implementing a vfs filesystem for all the fs
we support, potentially enriching vfs if needed? Then, from what I read, we
could just drop the Beam abstraction.

Le 5 mars 2018 20:49, "Reuven Lax"  a écrit :

> terminology is confusing here, since the existing FileIO is a PTransform.
> VfsFilesystem would be a better name.
>
>
> On Mon, Mar 5, 2018 at 11:46 AM Robert Bradshaw 
> wrote:
>
>> On Mon, Mar 5, 2018 at 11:38 AM Reuven Lax  wrote:
>>
>>> What about a beam Filesystem impl on top of Vfs as an alternative
>>> short-term solution? This would allow Vfs to be used with any IO.
>>>
>>
>> Yes, I think this is the VfsIO that was proposed.
>>
>>
>>> On Mon, Mar 5, 2018 at 11:37 AM Robert Bradshaw 
>>> wrote:
>>>

 On Mon, Mar 5, 2018 at 11:23 AM Romain Manni-Bucau <
 rmannibu...@gmail.com> wrote:

>
> 2018-03-05 20:04 GMT+01:00 Chamikara Jayalath :
>
>> I assume you mean https://commons.apache.org/proper/commons-vfs/.
>>
>> I'm not sure if we considered this when we originally implemented our
>> own file-system abstraction but based on a quick look seems like this is
>> Java only.
>>
>
> Yes, java only
>
>
>>
>> I think having a similar file-system abstraction for various
>> languages is a plus point for Beam. May be we should consider a Java
>> file-system implementation for VFS ?
>>
>
> Can be an option but when I see the current complexity I'm not sure
> mixing 2 abstractions would help, maybe just a VfsIO for java users would
> be good enough - thinking out loud.
>
> What sounds clear to me is that each language will need its own
> abstraction - which kind of join your proposal. However we can still make
> it smooth and easy on the java side - which
> will likely stay mainstream for still some years - using vfs as our
> java impl instead of reimplementing the full abstraction? This way we keep
> our *API* but we drop beam *impl* to just reuse VFS.
>
> PS: for gcs https://github.com/ltouati/vfs-gcs can be a good example
> on how it can work.
>

 I think a VfsIO makes a lot of sense in the short term, and will give
 us the experience needed to decide if we can move solely to VFS (for Java
 at least) for implementation, and possibly API in a future major release,
 in the long run.

>>>


Re: Any reason to not use [vfs]?

2018-03-05 Thread Reuven Lax
The terminology is confusing here, since the existing FileIO is a PTransform.
VfsFilesystem would be a better name.


On Mon, Mar 5, 2018 at 11:46 AM Robert Bradshaw  wrote:

> On Mon, Mar 5, 2018 at 11:38 AM Reuven Lax  wrote:
>
>> What about a beam Filesystem impl on top of Vfs as an alternative
>> short-term solution? This would allow Vfs to be used with any IO.
>>
>
> Yes, I think this is the VfsIO that was proposed.
>
>
>> On Mon, Mar 5, 2018 at 11:37 AM Robert Bradshaw 
>> wrote:
>>
>>>
>>> On Mon, Mar 5, 2018 at 11:23 AM Romain Manni-Bucau <
>>> rmannibu...@gmail.com> wrote:
>>>

 2018-03-05 20:04 GMT+01:00 Chamikara Jayalath :

> I assume you mean https://commons.apache.org/proper/commons-vfs/.
>
> I'm not sure if we considered this when we originally implemented our
> own file-system abstraction but based on a quick look seems like this is
> Java only.
>

 Yes, java only


>
> I think having a similar file-system abstraction for various languages
> is a plus point for Beam. May be we should consider a Java file-system
> implementation for VFS ?
>

 Can be an option but when I see the current complexity I'm not sure
 mixing 2 abstractions would help, maybe just a VfsIO for java users would
 be good enough - thinking out loud.

 What sounds clear to me is that each language will need its own
 abstraction - which kind of join your proposal. However we can still make
 it smooth and easy on the java side - which
 will likely stay mainstream for still some years - using vfs as our
 java impl instead of reimplementing the full abstraction? This way we keep
 our *API* but we drop beam *impl* to just reuse VFS.

 PS: for gcs https://github.com/ltouati/vfs-gcs can be a good example
 on how it can work.

>>>
>>> I think a VfsIO makes a lot of sense in the short term, and will give
>>> us the experience needed to decide if we can move solely to VFS (for Java
>>> at least) for implementation, and possibly API in a future major release,
>>> in the long run.
>>>
>>


Re: Any reason to not use [vfs]?

2018-03-05 Thread Robert Bradshaw
On Mon, Mar 5, 2018 at 11:38 AM Reuven Lax  wrote:

> What about a beam Filesystem impl on top of Vfs as an alternative
> short-term solution? This would allow Vfs to be used with any IO.
>

Yes, I think this is the VfsIO that was proposed.


> On Mon, Mar 5, 2018 at 11:37 AM Robert Bradshaw 
> wrote:
>
>>
>> On Mon, Mar 5, 2018 at 11:23 AM Romain Manni-Bucau 
>> wrote:
>>
>>>
>>> 2018-03-05 20:04 GMT+01:00 Chamikara Jayalath :
>>>
 I assume you mean https://commons.apache.org/proper/commons-vfs/.

 I'm not sure if we considered this when we originally implemented our
 own file-system abstraction but based on a quick look seems like this is
 Java only.

>>>
>>> Yes, java only
>>>
>>>

 I think having a similar file-system abstraction for various languages
 is a plus point for Beam. May be we should consider a Java file-system
 implementation for VFS ?

>>>
>>> Can be an option but when I see the current complexity I'm not sure
>>> mixing 2 abstractions would help, maybe just a VfsIO for java users would
>>> be good enough - thinking out loud.
>>>
>>> What sounds clear to me is that each language will need its own
>>> abstraction - which kind of join your proposal. However we can still make
>>> it smooth and easy on the java side - which
>>> will likely stay mainstream for still some years - using vfs as our java
>>> impl instead of reimplementing the full abstraction? This way we keep our
>>> *API* but we drop beam *impl* to just reuse VFS.
>>>
>>> PS: for gcs https://github.com/ltouati/vfs-gcs can be a good example on
>>> how it can work.
>>>
>>
>> I think a VfsIO makes a lot of sense in the short term, and will give us
>> the experience needed to decide if we can move solely to VFS (for Java at
>> least) for implementation, and possibly API in a future major release, in
>> the long run.
>>
>


Re: Any reason to not use [vfs]?

2018-03-05 Thread Reuven Lax
What about a beam Filesystem impl on top of Vfs as an alternative
short-term solution? This would allow Vfs to be used with any IO.


On Mon, Mar 5, 2018 at 11:37 AM Robert Bradshaw  wrote:

>
> On Mon, Mar 5, 2018 at 11:23 AM Romain Manni-Bucau 
> wrote:
>
>>
>> 2018-03-05 20:04 GMT+01:00 Chamikara Jayalath :
>>
>>> I assume you mean https://commons.apache.org/proper/commons-vfs/.
>>>
>>> I'm not sure if we considered this when we originally implemented our
>>> own file-system abstraction but based on a quick look seems like this is
>>> Java only.
>>>
>>
>> Yes, java only
>>
>>
>>>
>>> I think having a similar file-system abstraction for various languages
>>> is a plus point for Beam. May be we should consider a Java file-system
>>> implementation for VFS ?
>>>
>>
>> Can be an option but when I see the current complexity I'm not sure
>> mixing 2 abstractions would help, maybe just a VfsIO for java users would
>> be good enough - thinking out loud.
>>
>> What sounds clear to me is that each language will need its own
>> abstraction - which kind of join your proposal. However we can still make
>> it smooth and easy on the java side - which
>> will likely stay mainstream for still some years - using vfs as our java
>> impl instead of reimplementing the full abstraction? This way we keep our
>> *API* but we drop beam *impl* to just reuse VFS.
>>
>> PS: for gcs https://github.com/ltouati/vfs-gcs can be a good example on
>> how it can work.
>>
>
> I think a VfsIO makes a lot of sense in the short term, and will give us
> the experience needed to decide if we can move solely to VFS (for Java at
> least) for implementation, and possibly API in a future major release, in
> the long run.
>


Re: Any reason to not use [vfs]?

2018-03-05 Thread Chamikara Jayalath
On Mon, Mar 5, 2018 at 11:14 AM Romain Manni-Bucau 
wrote:

> 2018-03-05 19:54 GMT+01:00 Reuven Lax :
>
>> Are the filesystem classes marked experimental? If so, precise
>> compatibility is less of a concern. However vfs does need to have better fs
>> support first.
>>
>
> Anyone has some cycle to list the details here? (even without being a spec
> but a few bullet points a bit structured with a small description
> sentence). I can get in touch with vfs to see what they think but I used it
> to write in my previous job (in java batches) so it sounds like a very good
> candidate to be pluggable.
>

Here are current Java and Python Beam file-system abstractions in case
that's the information you are asking for.

https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystem.java
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/filesystem.py

This indeed seems to be marked as experimental, but we have other file-based
sources as well as end users (for example, directly using this from a
ParDo) that might be using this. Additionally, these abstractions are
similar, as I mentioned earlier, which might help users who are transitioning
from Java to Python and vice versa.



>
>
>>
>> Also what about other languages?
>>
>
> This is a bit "?" for me if other languages must go through java or not.
> Last option meaning we can't have any valid codebase and increasing the
> beam maintenance costs a lot. Since other languages should go through the
> portable API IMHO, and most - all? - runners are java based it would be a
> better way to go through vfs to have more pluggability than a custom system
> rarely extended in the ecosystem, no?
>
>
>>
>> On Mon, Mar 5, 2018, 3:35 PM Romain Manni-Bucau 
>> wrote:
>>
>>> I'd say to beam 2.x and to beam 3 to move all IO/extension from the core
>>> to actual IO/extension modules. Sounds compatible this way - in the sense
>>> we can have it eagerly without breaking anything.
>>>
>>> wdyt?
>>>
>>>
>>> Romain Manni-Bucau
>>> @rmannibucau
>>>
>>> 2018-03-05 19:32 GMT+01:00 Reuven Lax :
>>>
 Actually FileIO is only somewhat related.

 It's an interesting proposal. However a quick look shows that vfs only
 has read-only support for hdfs and I'm not sure it has any support for gcs.
 Both are often used with Beam. Once vfs supports these filesystems it's
 worth looking at.

 Maybe add to the beam 3.0 hotlist?

 On Mon, Mar 5, 2018, 3:26 PM Romain Manni-Bucau 
 wrote:

> Yes (FileIO being the visible part of the FileSystems iceberg ;)).
>
>
> Romain Manni-Bucau
> @rmannibucau
>
> 2018-03-05 19:23 GMT+01:00 Reuven Lax :
>
>> I'm confused, as FileIO doesn't seem the same as vfs. Are you maybe
>> referring to the filesystem abstraction instead?
>>
>> On Mon, Mar 5, 2018, 3:19 PM Romain Manni-Bucau <
>> rmannibu...@gmail.com> wrote:
>>
>>> Hi guys,
>>>
>>> What's the rationale behind the FileIO impl?
>>>
>>> Why not use commons-vfs + a pluggable format? Sounds way more open
>>> and reusable for end users than a few hardcoded supported formats, no?
>>> What's the blocker? If there is a blocker, can't we contribute to
>>> [vfs] to make it disappear?
>>>
>>> Romain Manni-Bucau
>>> @rmannibucau
>>>
>>
>
>>>


Re: Any reason to not use [vfs]?

2018-03-05 Thread Reuven Lax
Java only is not a blocker - we don't expect all language SDKs to look the
same. They should all support the same functionality, but should do so in a
way that is idiomatically correct for that language.


On Mon, Mar 5, 2018 at 11:23 AM Romain Manni-Bucau 
wrote:

>
> 2018-03-05 20:04 GMT+01:00 Chamikara Jayalath :
>
>> I assume you mean https://commons.apache.org/proper/commons-vfs/.
>>
>> I'm not sure if we considered this when we originally implemented our own
>> file-system abstraction but based on a quick look seems like this is Java
>> only.
>>
>
> Yes, java only
>
>
>>
>> I think having a similar file-system abstraction for various languages is
>> a plus point for Beam. May be we should consider a Java file-system
>> implementation for VFS ?
>>
>
> Can be an option but when I see the current complexity I'm not sure mixing
> 2 abstractions would help, maybe just a VfsIO for java users would be good
> enough - thinking out loud.
>
> What sounds clear to me is that each language will need its own
> abstraction - which kind of join your proposal. However we can still make
> it smooth and easy on the java side - which
> will likely stay mainstream for still some years - using vfs as our java
> impl instead of reimplementing the full abstraction? This way we keep our
> *API* but we drop beam *impl* to just reuse VFS.
>
> PS: for gcs https://github.com/ltouati/vfs-gcs can be a good example on
> how it can work.
>
>
>
>>
>> Thanks,
>> Cham
>>
>>
>>
>> On Mon, Mar 5, 2018 at 10:56 AM Reuven Lax  wrote:
>>
>>> Are the filesystem classes marked experimental? If so, precise
>>> compatibility is less of a concern. However vfs does need to have better fs
>>> support first.
>>>
>>> Also what about other languages?
>>>
>>> On Mon, Mar 5, 2018, 3:35 PM Romain Manni-Bucau 
>>> wrote:
>>>
 I'd say to beam 2.x and to beam 3 to move all IO/extension from the
 core to actual IO/extension modules. Sounds compatible this way - in the
 sense we can have it eagerly without breaking anything.

 wdyt?


 Romain Manni-Bucau
 @rmannibucau

 2018-03-05 19:32 GMT+01:00 Reuven Lax :

> Actually FileIO is only somewhat related.
>
> It's an interesting proposal. However a quick look shows that vfs only
> has read-only support for hdfs and I'm not sure it has any support for 
> gcs.
> Both are often used with Beam. Once vfs supports these filesystems it's
> worth looking at.
>
> Maybe add to the beam 3.0 hotlist?
>
> On Mon, Mar 5, 2018, 3:26 PM Romain Manni-Bucau 
> wrote:
>
>> Yes (FileIO being the visible part of the FileSystems iceberg ;)).
>>
>>
>> Romain Manni-Bucau
>> @rmannibucau  |  Blog
>>  | Old Blog
>>  | Github
>>  | LinkedIn
>>  | Book
>> 
>>
>> 2018-03-05 19:23 GMT+01:00 Reuven Lax :
>>
>>> I'm confused, as FileIO doesn't seem the same as vfs. Are you maybe
>>> referring to the filesystem abstraction instead?
>>>
>>> On Mon, Mar 5, 2018, 3:19 PM Romain Manni-Bucau <
>>> rmannibu...@gmail.com> wrote:
>>>
 Hi guys,

 What's the rational behind the fileIO impl?

 Why not using commons-vfs + a pluggable format? Sounds way more
 open and reusable for end users than a few hardcoded supported 
 formats, no?
 What's the blocker? If there is a blocker, can't we contribute to  
 [vfs] to
 make it disappear?

 Romain Manni-Bucau
 @rmannibucau  |  Blog
  | Old Blog
  | Github
  | LinkedIn
  | Book
 

>>>
>>

>


Re: Any reason to not use [vfs]?

2018-03-05 Thread Robert Bradshaw
On Mon, Mar 5, 2018 at 11:23 AM Romain Manni-Bucau 
wrote:

>
> 2018-03-05 20:04 GMT+01:00 Chamikara Jayalath :
>
>> I assume you mean https://commons.apache.org/proper/commons-vfs/.
>>
>> I'm not sure if we considered this when we originally implemented our own
>> file-system abstraction, but based on a quick look it seems to be Java
>> only.
>>
>
> Yes, Java only
>
>
>>
>> I think having a similar file-system abstraction for various languages is
>> a plus point for Beam. Maybe we should consider a Java file-system
>> implementation for VFS?
>>
>
> That can be an option, but given the current complexity I'm not sure mixing
> 2 abstractions would help. Maybe just a VfsIO for Java users would be good
> enough - thinking out loud.
>
> What seems clear to me is that each language will need its own
> abstraction, which more or less matches your proposal. However, we can
> still make it smooth and easy on the Java side - which will likely stay
> mainstream for some years - by using vfs as our Java impl instead of
> reimplementing the full abstraction. This way we keep our *API* but drop
> the beam *impl* and just reuse VFS.
>
> PS: for gcs https://github.com/ltouati/vfs-gcs can be a good example on
> how it can work.
>

I think a VfsIO makes a lot of sense in the short term, and will give us
the experience needed to decide whether, in the long run, we can move solely
to VFS (for Java at least) for the implementation, and possibly for the API
in a future major release.


Re: Any reason to not use [vfs]?

2018-03-05 Thread Romain Manni-Bucau
2018-03-05 20:04 GMT+01:00 Chamikara Jayalath :

> I assume you mean https://commons.apache.org/proper/commons-vfs/.
>
> I'm not sure if we considered this when we originally implemented our own
> file-system abstraction, but based on a quick look it seems to be Java
> only.
>

Yes, Java only


>
> I think having a similar file-system abstraction for various languages is
> a plus point for Beam. Maybe we should consider a Java file-system
> implementation for VFS?
>

That can be an option, but given the current complexity I'm not sure mixing
2 abstractions would help. Maybe just a VfsIO for Java users would be good
enough - thinking out loud.

What seems clear to me is that each language will need its own abstraction,
which more or less matches your proposal. However, we can still make it
smooth and easy on the Java side - which will likely stay mainstream for
some years - by using vfs as our Java impl instead of reimplementing the
full abstraction. This way we keep our *API* but drop the beam *impl* and
just reuse VFS.

PS: for gcs https://github.com/ltouati/vfs-gcs can be a good example on how
it can work.
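
To make the "keep our *API*, drop our *impl*" idea above concrete, here is a
minimal Java sketch. Everything in it is an assumption made for illustration:
PipelineFileSystem is a hypothetical stand-in for the user-facing abstraction
(it is not Beam's FileSystem class), InMemoryFileSystem plays the role of an
in-house implementation, and LibraryBackedFileSystem plays the role of an
implementation that delegates to an external library. java.nio.file is used as
that library only so the sketch runs without dependencies; the proposal in
this thread would delegate to commons-vfs instead.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical stand-in for the stable, user-facing API (NOT Beam's real FileSystem class).
interface PipelineFileSystem {
    void write(String name, String content);
    String read(String name);
    List<String> list();                 // file names in sorted order
    void deleteAll(List<String> names);  // the bulk operation stays part of the API
}

// Plays the role of the in-house implementation.
class InMemoryFileSystem implements PipelineFileSystem {
    private final Map<String, String> files = new TreeMap<>();

    public void write(String name, String content) { files.put(name, content); }
    public String read(String name) { return files.get(name); }
    public List<String> list() { return new ArrayList<>(files.keySet()); }
    public void deleteAll(List<String> names) {
        for (String name : names) {
            files.remove(name);
        }
    }
}

// Plays the role of a library-backed implementation. java.nio.file is an assumption
// chosen to keep the sketch dependency-free; it is not the commons-vfs API.
class LibraryBackedFileSystem implements PipelineFileSystem {
    private final Path root;

    LibraryBackedFileSystem(Path root) { this.root = root; }

    // Convenience factory: a fresh temporary directory as the storage root.
    static LibraryBackedFileSystem inTempDir() {
        try {
            return new LibraryBackedFileSystem(Files.createTempDirectory("keep-api-sketch"));
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }

    public void write(String name, String content) {
        try {
            Files.write(root.resolve(name), content.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }

    public String read(String name) {
        try {
            return new String(Files.readAllBytes(root.resolve(name)), StandardCharsets.UTF_8);
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }

    public List<String> list() {
        try (Stream<Path> entries = Files.list(root)) {
            return entries.map(p -> p.getFileName().toString())
                          .sorted()
                          .collect(Collectors.toList());
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }

    public void deleteAll(List<String> names) {
        // No native batch call is available here, so the bulk operation degrades to a loop.
        for (String name : names) {
            try {
                Files.deleteIfExists(root.resolve(name));
            } catch (IOException e) { throw new UncheckedIOException(e); }
        }
    }
}

class KeepApiSwapImplSketch {
    // Caller code depends only on the interface, so either implementation can be plugged in.
    static List<String> writeThenPrune(PipelineFileSystem fs) {
        fs.write("a.txt", "alpha");
        fs.write("b.txt", "beta");
        fs.write("c.txt", "gamma");
        fs.deleteAll(Arrays.asList("b.txt"));
        return fs.list();
    }

    public static void main(String[] args) {
        System.out.println(writeThenPrune(new InMemoryFileSystem()));
        System.out.println(writeThenPrune(LibraryBackedFileSystem.inTempDir()));
    }
}
```

Both calls in main print [a.txt, c.txt], because the caller only ever sees the
interface. Note that deleteAll falls back to a per-file loop in the
library-backed class; that is the kind of gap the bulk-operation concern
raised elsewhere in this thread is about.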





Re: Any reason to not use [vfs]?

2018-03-05 Thread Chamikara Jayalath
I assume you mean https://commons.apache.org/proper/commons-vfs/.

I'm not sure if we considered this when we originally implemented our own
file-system abstraction, but based on a quick look it seems to be Java
only.

I think having a similar file-system abstraction for various languages is a
plus point for Beam. Maybe we should consider a Java file-system
implementation for VFS?

Thanks,
Cham




Re: Any reason to not use [vfs]?

2018-03-05 Thread Romain Manni-Bucau
2018-03-05 19:54 GMT+01:00 Reuven Lax :

> Are the filesystem classes marked experimental? If so, precise
> compatibility is less of a concern. However vfs does need to have better fs
> support first.
>

Does anyone have cycles to list the details here? (It doesn't need to be a
spec; a few structured bullet points, each with a one-sentence description,
would do.) I can get in touch with the vfs folks to see what they think. I
used it for writes in my previous job (in Java batches), so it sounds like a
very good candidate to be plugged in.


>
> Also what about other languages?
>

It is an open question for me whether other languages must go through Java
or not. The latter option means we can't have a single common codebase and
increases the beam maintenance cost a lot. Since other languages should go
through the portable API IMHO, and most - all? - runners are Java based,
wouldn't it be better to go through vfs, which offers more pluggability than
a custom system that is rarely extended in the ecosystem?




Re: Any reason to not use [vfs]?

2018-03-05 Thread Reuven Lax
Are the filesystem classes marked experimental? If so, precise
compatibility is less of a concern. However vfs does need to have better fs
support first.

Also what about other languages?



Re: Any reason to not use [vfs]?

2018-03-05 Thread Romain Manni-Bucau
I'd say add it in beam 2.x, and in beam 3 move all IO/extensions from the
core to actual IO/extension modules. That sounds compatible - in the sense
that we can have it early without breaking anything.

wdyt?




Re: Any reason to not use [vfs]?

2018-03-05 Thread Reuven Lax
Actually FileIO is only somewhat related.

It's an interesting proposal. However a quick look shows that vfs only has
read-only support for hdfs and I'm not sure it has any support for gcs.
Both are often used with Beam. Once vfs supports these filesystems it's
worth looking at.

Maybe add to the beam 3.0 hotlist?



Re: Any reason to not use [vfs]?

2018-03-05 Thread Romain Manni-Bucau
Yes (FileIO being the visible part of the FileSystems iceberg ;)).




2018-03-05 19:23 GMT+01:00 Reuven Lax :

> I'm confused, as FileIO doesn't seem the same as vfs. Are you maybe
> referring to the filesystem abstraction instead?


Re: Any reason to not use [vfs]?

2018-03-05 Thread Reuven Lax
I'm confused, as FileIO doesn't seem the same as vfs. Are you maybe
referring to the filesystem abstraction instead?

On Mon, Mar 5, 2018, 3:19 PM Romain Manni-Bucau 
wrote:

> Hi guys,
>
> What's the rationale behind the FileIO impl?
>
> Why not use commons-vfs + a pluggable format? Sounds way more open and
> reusable for end users than a few hardcoded supported formats, no? What's
> the blocker? If there is a blocker, can't we contribute to [vfs] to make
> it disappear?
>
> Romain Manni-Bucau
>