Re: Schema-Aware PCollections revisited

2018-03-05 Thread Reuven Lax
Yes, I do have a PoC in progress. The Beam Row class was being refactored, so I paused to wait for that to finish. On Sun, Mar 4, 2018 at 8:24 PM Jean-Baptiste Onofré wrote: > Hi Reuven, > > I revive this discussion as I think it would be a great addition. > > We had discussion on the fly, but

Re: Schema-Aware PCollections revisited

2018-03-05 Thread Jean-Baptiste Onofré
Cool, can I work with you on this (sharing a branch, for instance)? Thanks! Regards, JB On 03/05/2018 01:01 PM, Reuven Lax wrote: > Yes, I do have a PoC in progress. The Beam Row class was being refactored, so I paused to wait for that to finish. > > > On Sun, Mar 4, 2018 at 8:24 PM Jean-

Re: Schema-Aware PCollections revisited

2018-03-05 Thread Reuven Lax
Of course! I think some BeamSQL folks should be involved as well, as this directly affects SQL work. Anton especially has expressed interest in Row and schemas. Reuven On Mon, Mar 5, 2018 at 4:30 AM Jean-Baptiste Onofré wrote: > Cool, > > can I work with you on this (sharing a branch for insta

Should tests fail due to transient errors on Dataflow Runner?

2018-03-05 Thread Łukasz Gajowy
Hi there! I wonder: why do tests that use TestDataflowRunner fail when there are transient difficulties in the Dataflow pipeline? Let's consider the JDBC Performance test case: its pipelines sometimes have trouble connecting to a Postgres instance. If this happens, they retry processin

Any reason to not use [vfs]?

2018-03-05 Thread Romain Manni-Bucau
Hi guys, What's the rationale behind the FileIO impl? Why not use commons-vfs + a pluggable format? That sounds far more open and reusable for end users than a few hardcoded supported formats, no? What's the blocker? If there is a blocker, can't we contribute to [vfs] to make it disappear? Romain M

Re: Any reason to not use [vfs]?

2018-03-05 Thread Reuven Lax
I'm confused, as FileIO doesn't seem the same as vfs. Are you maybe referring to the filesystem abstraction instead? On Mon, Mar 5, 2018, 3:19 PM Romain Manni-Bucau wrote: > Hi guys, > > What's the rationale behind the FileIO impl? > > Why not use commons-vfs + a pluggable format? Sounds way mo

Re: to a modular embedded java runner to replace the direct runner?

2018-03-05 Thread Romain Manni-Bucau
Hi Lukasz, concretely it is pretty simple (if not, let me know and I'll try to gist some code, but I don't think we need to). I'll use module names; let's not discuss them, they are just there to share the idea. I see it as follows: 1. beam-java-runner - bare API impl (extracted from direct runner, this is not a

Re: Any reason to not use [vfs]?

2018-03-05 Thread Romain Manni-Bucau
Yes (FileIO being the visible part of the FileSystems iceberg ;)).

Re: Any reason to not use [vfs]?

2018-03-05 Thread Reuven Lax
Actually FileIO is only somewhat related. It's an interesting proposal. However a quick look shows that vfs only has read-only support for hdfs and I'm not sure it has any support for gcs. Both are often used with Beam. Once vfs supports these filesystems it's worth looking at. Maybe add to the b

Re: Any reason to not use [vfs]?

2018-03-05 Thread Romain Manni-Bucau
I'd say add it to Beam 2.x, and use Beam 3 to move all IO/extensions from the core into actual IO/extension modules. That sounds compatible, in the sense that we can have it eagerly without breaking anything. WDYT?

Re: to a modular embedded java runner to replace the direct runner?

2018-03-05 Thread Thomas Groh
The portable java 'DirectRunner' is already in progress, and has been for several months - it's tracked by https://issues.apache.org/jira/browse/BEAM-2899 My expectation is that the actual portability augmentations are unlikely to require significant changes to the DirectRunner implementations. I'd

Re: to a modular embedded java runner to replace the direct runner?

2018-03-05 Thread Romain Manni-Bucau
Interesting view Thomas - and it makes a lot of sense. Would you rather see 2 modules? embedded-runner+portable-runner+direct-runner (with inheritance in between)? Would work for me.

Re: Any reason to not use [vfs]?

2018-03-05 Thread Reuven Lax
Are the filesystem classes marked experimental? If so, precise compatibility is less of a concern. However vfs does need to have better fs support first. Also what about other languages? On Mon, Mar 5, 2018, 3:35 PM Romain Manni-Bucau wrote: > I'd say to beam 2.x and to beam 3 to move all IO/ex

Re: Any reason to not use [vfs]?

2018-03-05 Thread Romain Manni-Bucau
2018-03-05 19:54 GMT+01:00 Reuven Lax : > Are the filesystem classes marked experimental? If so, precise > compatibility is less of a concern. However vfs does need to have better fs > support first. > Does anyone have some cycles to list the details here? (even without being a spec but a few bullet poi

Re: Any reason to not use [vfs]?

2018-03-05 Thread Chamikara Jayalath
I assume you mean https://commons.apache.org/proper/commons-vfs/. I'm not sure if we considered this when we originally implemented our own file-system abstraction, but based on a quick look it seems to be Java-only. I think having a similar file-system abstraction for various languages is a p

Re: Any reason to not use [vfs]?

2018-03-05 Thread Romain Manni-Bucau
2018-03-05 20:04 GMT+01:00 Chamikara Jayalath : > I assume you mean https://commons.apache.org/proper/commons-vfs/. > > I'm not sure if we considered this when we originally implemented our own > file-system abstraction but based on a quick look seems like this is Java > only. > Yes, java only

Re: Any reason to not use [vfs]?

2018-03-05 Thread Robert Bradshaw
On Mon, Mar 5, 2018 at 11:23 AM Romain Manni-Bucau wrote: > > 2018-03-05 20:04 GMT+01:00 Chamikara Jayalath : > >> I assume you mean https://commons.apache.org/proper/commons-vfs/. >> >> I'm not sure if we considered this when we originally implemented our own >> file-system abstraction but based

Re: Any reason to not use [vfs]?

2018-03-05 Thread Reuven Lax
Java only is not a blocker - we don't expect all language SDKs to look the same. They should all support the same functionality, but should do so in a way that is idiomatically correct for that language. On Mon, Mar 5, 2018 at 11:23 AM Romain Manni-Bucau wrote: > > 2018-03-05 20:04 GMT+01:00 Ch

Re: Any reason to not use [vfs]?

2018-03-05 Thread Chamikara Jayalath
On Mon, Mar 5, 2018 at 11:14 AM Romain Manni-Bucau wrote: > 2018-03-05 19:54 GMT+01:00 Reuven Lax : > >> Are the filesystem classes marked experimental? If so, precise >> compatibility is less of a concern. However vfs does need to have better fs >> support first. >> > > Anyone has some cycle to

Re: Any reason to not use [vfs]?

2018-03-05 Thread Reuven Lax
What about a beam Filesystem impl on top of Vfs as an alternative short-term solution? This would allow Vfs to be used with any IO. On Mon, Mar 5, 2018 at 11:37 AM Robert Bradshaw wrote: > > On Mon, Mar 5, 2018 at 11:23 AM Romain Manni-Bucau > wrote: > >> >> 2018-03-05 20:04 GMT+01:00 Chamikar

Re: Any reason to not use [vfs]?

2018-03-05 Thread Robert Bradshaw
On Mon, Mar 5, 2018 at 11:38 AM Reuven Lax wrote: > What about a beam Filesystem impl on top of Vfs as an alternative > short-term solution? This would allow Vfs to be used with any IO. > Yes, I think this is the VfsIO that was proposed. > On Mon, Mar 5, 2018 at 11:37 AM Robert Bradshaw > wro

Re: Any reason to not use [vfs]?

2018-03-05 Thread Reuven Lax
terminology is confusing here, since the existing FileIO is a PTransform. VfsFilesystem would be a better name. On Mon, Mar 5, 2018 at 11:46 AM Robert Bradshaw wrote: > On Mon, Mar 5, 2018 at 11:38 AM Reuven Lax wrote: > >> What about a beam Filesystem impl on top of Vfs as an alternative >> s

Re: Any reason to not use [vfs]?

2018-03-05 Thread Romain Manni-Bucau
Not backing vfs by a filesystem sounds saner, so VfsIO is probably the way to go. It would be a competitor to FileIO and, hopefully, its replacement in the mid/long term. What about doing the opposite: implementing a vfs filesystem for all the fs we support, potentially enriching vfs if needed? Then we can jus

Re: Any reason to not use [vfs]?

2018-03-05 Thread Robert Bradshaw
First, let's try to make the terminology abundantly clear, as I for one have (I think) misinterpreted what has been proposed. VfsFileSystem: A subclass of https://github.com/apache/beam/blob/9f81fd299bd32e0d6056a7da9fa994cf74db0ed9/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileSystem.jav

Re: Any reason to not use [vfs]?

2018-03-05 Thread Eugene Kirpichov
If VFS were mature enough for our needs, then I'd give a +1 to using it in the Beam Java SDK - currently it's not, so we can't use it directly. It's indeed a reasonable option to use the VFS API inside Beam, and port our implementations of FileSystem(s) to that API, and then potentially donate that to t

Re: Any reason to not use [vfs]?

2018-03-05 Thread Romain Manni-Bucau
On 5 March 2018 at 22:26, "Robert Bradshaw" wrote: First, let's try to make the terminology abundantly clear, as I for one have (I think) misinterpreted what has been proposed. VfsFileSystem: A subclass of https://github.com/apache/beam/blob/9f81fd299bd32e0d6056a7da9fa994cf74db0ed9/sdks/java/cor

Re: Any reason to not use [vfs]?

2018-03-05 Thread Lukasz Cwik
As is, how does VFS improve upon the current FileSystem solution? How much work is it before VFS supports the Apache Beam use cases (bulk operations, glob support)? Is it the right direction for the VFS project to support the above changes? (things that are important to a data parallel processing

Re: Merging Python code? Help avoid Python 3 regressions with these two simple steps :)

2018-03-05 Thread Ahmet Altay
Sent https://github.com/apache/beam/pull/4801 to enable py3 lint for precommits. On Fri, Mar 2, 2018 at 11:23 AM, Robert Bradshaw wrote: > To address the first point, 3.4 is almost certainly sufficient for our > needs (running lint_py3 to prevent regressions). Also, +1 that automating > this is

Re: Should tests fail due to transient errors on Dataflow Runner?

2018-03-05 Thread Lukasz Cwik
That makes sense but you'll want to make sure that no test + runner is relying on this behavior by making your change and running all the validates runner tests. Historically what you say was not always the case because Dataflow streaming jobs were never "DONE", they only were in the "RUNNING" sta

Re: Any reason to not use [vfs]?

2018-03-05 Thread Romain Manni-Bucau
On 6 March 2018 at 01:05, "Lukasz Cwik" wrote: As is, how does VFS improve upon the current FileSystem solution? This is about the ecosystem. I have never seen a Beam fs implemented outside Beam, but I have seen tons of vfs users and implementations. How much work is it before VFS supports the Apache Beam usecases (b