Re: Custom URNs and runner translation

2018-04-24 Thread Chamikara Jayalath
On Tue, Apr 24, 2018 at 3:44 PM Henning Rohde wrote: > > Note that a KafkaDoFn still needs to be provided, but could be a DoFn that fails loudly if it's actually called in the short term rather than a full Python implementation. > For configurable runner-native

Re: Splittable DoFN in Spark discussion

2018-04-24 Thread Reuven Lax
Could we do this behind the scenes by writing a Receiver that publishes periodic pings? On Tue, Apr 24, 2018 at 10:09 PM Eugene Kirpichov wrote: > Kenn - I'm arguing that in Spark, SDF-style computation cannot be expressed at all, and neither can Beam's timers.
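For context, a minimal sketch of what such a Receiver might look like using Spark Streaming's public Receiver API (the payload and interval here are illustrative, not a proposed design):

```java
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.receiver.Receiver;

// A Receiver that stores a "ping" (here, the current time) at a fixed
// interval, giving downstream stages a periodic trigger to react to.
public class PingReceiver extends Receiver<Long> {
  private final long intervalMillis;

  public PingReceiver(long intervalMillis) {
    super(StorageLevel.MEMORY_ONLY());
    this.intervalMillis = intervalMillis;
  }

  @Override
  public void onStart() {
    new Thread(() -> {
      while (!isStopped()) {
        store(System.currentTimeMillis());  // publish one ping
        try {
          Thread.sleep(intervalMillis);
        } catch (InterruptedException e) {
          return;
        }
      }
    }, "ping-receiver").start();
  }

  @Override
  public void onStop() {
    // The isStopped() check above lets the worker thread exit on its own.
  }
}
```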

Re: Custom URNs and runner translation

2018-04-24 Thread Henning Rohde
> Note that a KafkaDoFn still needs to be provided, but could be a DoFn that fails loudly if it's actually called in the short term rather than a full Python implementation. For configurable runner-native IO, for now, I think it is reasonable to use a URN + special data payload directly
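A minimal sketch of that "fails loudly" placeholder idea (the class name and message are illustrative, not from the Beam codebase): the SDK registers a DoFn under the agreed URN, and if a runner ever executes it instead of substituting its native implementation, the pipeline aborts with a clear error.

```java
import org.apache.beam.sdk.transforms.DoFn;

// Placeholder for a runner-native Kafka source. It should never actually run;
// a runner that understands the URN replaces it during translation.
public class FailingKafkaDoFn extends DoFn<byte[], byte[]> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    throw new UnsupportedOperationException(
        "This transform has no portable implementation; the runner was "
            + "expected to substitute a native Kafka source for its URN.");
  }
}
```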

Re: Splittable DoFN in Spark discussion

2018-04-24 Thread Eugene Kirpichov
Kenn - I'm arguing that in Spark, SDF-style computation cannot be expressed at all, and neither can Beam's timers. Spark, unlike Flink, does not have a timer facility (only state), and as far as I can tell its programming model has no other primitive that can map a finite RDD into an infinite
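For reference, the timer facility being contrasted here, as it appears in the Beam Java SDK (a minimal stateful DoFn; the names are illustrative). The key property is that the callback fires on its own schedule, independent of new input arriving, which is what has no direct Spark equivalent:

```java
import org.apache.beam.sdk.state.TimeDomain;
import org.apache.beam.sdk.state.Timer;
import org.apache.beam.sdk.state.TimerSpec;
import org.apache.beam.sdk.state.TimerSpecs;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Duration;

class WakeUpFn extends DoFn<KV<String, String>, String> {
  @TimerId("wakeUp")
  private final TimerSpec wakeUpSpec = TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

  @ProcessElement
  public void process(ProcessContext c, @TimerId("wakeUp") Timer wakeUp) {
    // Ask to be called back in 10 seconds, whether or not more input arrives.
    wakeUp.offset(Duration.standardSeconds(10)).setRelative();
  }

  @OnTimer("wakeUp")
  public void onWakeUp(OnTimerContext c) {
    c.output("timer fired");
  }
}
```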

Re: Custom URNs and runner translation

2018-04-24 Thread Robert Bradshaw
On Tue, Apr 24, 2018 at 1:14 PM Thomas Weise wrote: > Hi Cham, > Thanks for the feedback! > I should probably have clarified that my POC and questions aren't specific to Kafka as a source, but apply to pretty much any other source/sink that we use internally as well. We have existing

Re: Splittable DoFN in Spark discussion

2018-04-24 Thread Kenneth Knowles
I don't think I understand what the limitations of timers are that you are referring to. FWIW I would say implementing other primitives like SDF is an explicit non-goal for Beam state & timers. I got lost at some point in this thread, but is it actually necessary that a bounded PCollection maps

Re: Splittable DoFN in Spark discussion

2018-04-24 Thread Eugene Kirpichov
I would like to revive this thread one more time. At this point I'm pretty certain that Spark can't support this out of the box and we're going to have to make changes to Spark. Holden, could you suggest some Spark experts (yourself included :) ) who could advise on what kind of Spark change

Re: Custom URNs and runner translation

2018-04-24 Thread Thomas Weise
Hi Cham, Thanks for the feedback! I should probably have clarified that my POC and questions aren't specific to Kafka as a source, but apply to pretty much any other source/sink that we use internally as well. We have existing Flink pipelines that are written in Java and we want to use the same connectors

Re: Custom URNs and runner translation

2018-04-24 Thread Chamikara Jayalath
Hi Thomas, Seems like we are working on (partially) similar things :). On Tue, Apr 24, 2018 at 9:03 AM Thomas Weise wrote: > I'm working on a mini POC to enable Kafka as a custom streaming source for a Python pipeline executing on the (in-progress) portable Flink runner.

Re: Merge options in Github UI are confusing

2018-04-24 Thread Robert Bradshaw
1. Yep. 2. I would phrase this as "merge is the default, but we may (should?) squash/rewrite obvious fixup commits." 3. If separable commits belong to the same PR, let's keep them in the same PR. Sounds like we made a mistake for this PR, and should err more on the side of not squashing. On Tue,

Re: Merge options in Github UI are confusing

2018-04-24 Thread Kenneth Knowles
I would like to push back on 2 & 3 and replace them with this: (not 2, not 3) Don't squash/rebase someone's pull request unless it is obvious that they are OK with it. If someone has used git in the standard way and you squash their commits, you may have broken them, and it is (minor) data loss

Re: [SQL] Reconciling Beam SQL Environments with Calcite Schema

2018-04-24 Thread James
Kenn: yes, MetaStore is user-facing. Users can choose to implement their own MetaStore; currently there is only an InMemory implementation in the Beam codebase. Andrew: I like the second option, since it "retains the ability for DDL operations to be processed by a custom MetaStore." IMO we should have the
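To make the "implement their own MetaStore" point concrete, a deliberately simplified sketch (this interface shape is hypothetical, not the actual Beam SQL MetaStore API):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical: the essence of a pluggable metastore is a table registry
// behind an interface, so DDL operations can be routed to a custom backend.
interface TableRegistry {
  void createTable(String name, String definition);
  String getTable(String name);
}

// An in-memory implementation, analogous in spirit to Beam's InMemory one.
class InMemoryTableRegistry implements TableRegistry {
  private final Map<String, String> tables = new HashMap<>();

  @Override
  public void createTable(String name, String definition) {
    tables.put(name, definition);
  }

  @Override
  public String getTable(String name) {
    return tables.get(name);
  }
}
```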

Custom URNs and runner translation

2018-04-24 Thread Thomas Weise
I'm working on a mini POC to enable Kafka as a custom streaming source for a Python pipeline executing on the (in-progress) portable Flink runner. We eventually want to use the same native Flink connectors for sources and sinks that we also use in other Flink jobs. I got a simple example to work
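To illustrate the translation side of such a POC, a hedged sketch of how a (Java-based) Flink runner might swap a URN-tagged transform for a native Kafka source during translation; the URN string and the payload format are invented for this example:

```java
import org.apache.beam.model.pipeline.v1.RunnerApi;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;

import java.util.Properties;

public class NativeKafkaTranslation {
  // Hypothetical URN; a real one would be agreed upon between SDK and runner.
  static final String KAFKA_URN = "beam:transform:mycompany:kafka_read:v1";

  static DataStream<String> translate(
      RunnerApi.PTransform transform, StreamExecutionEnvironment env) {
    if (!KAFKA_URN.equals(transform.getSpec().getUrn())) {
      throw new IllegalArgumentException(
          "Unknown URN: " + transform.getSpec().getUrn());
    }
    // For illustration, treat the payload as just the topic name; a real
    // payload would be a structured proto (brokers, topic, deserializer...).
    String topic = transform.getSpec().getPayload().toStringUtf8();
    Properties props = new Properties();
    props.setProperty("bootstrap.servers", "localhost:9092");
    return env.addSource(
        new FlinkKafkaConsumer011<>(topic, new SimpleStringSchema(), props));
  }
}
```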

Re: Merge options in Github UI are confusing

2018-04-24 Thread Andrew Pilloud
Thanks for the feedback. Sounds like there are a few takeaways from the discussion: 1. Not everyone is a git power user; squash and merge is commonly used and can't be disabled. 2. As a non-committer I should expect my commits to be squashed, rewritten, or otherwise changed at the discretion of

Re: [SQL] Reconciling Beam SQL Environments with Calcite Schema

2018-04-24 Thread Kenneth Knowles
Can you say more about how the metastore is used? I presume it is or will be user-facing, so are Beam SQL users already providing their own? I'm sure we want something like that eventually to support things like Apache Atlas and HCatalog, IIUC for the "create if needed" logic when using Beam SQL

Re: [BEAM-4088] Test isolation differences in Gradle

2018-04-24 Thread Kenneth Knowles
What I am trying to figure out is which of these we have: (a) the "test is broken" scenario: pass in Maven (thread did not leak), pass in IDEA (thread did not leak), fail in Gradle (thread did not leak, but the test incorrectly thinks one did); (b) the "found a bug" scenario: pass in
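For what "the test thinks a thread leaked" means in practice, a rough sketch of the kind of leak check involved (not the actual Elasticsearch test framework code): it snapshots live threads around the test, so its verdict depends on what else ran in the same JVM, which is exactly where Maven, IDEA, and Gradle can differ.

```java
import java.util.HashSet;
import java.util.Set;

public class ThreadLeakCheck {
  private Set<Thread> before;

  public void beforeTest() {
    // Snapshot all live threads before the test body runs.
    before = new HashSet<>(Thread.getAllStackTraces().keySet());
  }

  public void afterTest() {
    // Anything alive now that wasn't alive before is reported as a leak,
    // including threads started by unrelated tests sharing the JVM.
    Set<Thread> leaked = new HashSet<>(Thread.getAllStackTraces().keySet());
    leaked.removeAll(before);
    leaked.removeIf(t -> !t.isAlive());
    if (!leaked.isEmpty()) {
      throw new AssertionError("Leaked threads: " + leaked);
    }
  }
}
```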

Re: [BEAM-4088] Test isolation differences in Gradle

2018-04-24 Thread Etienne Chauchot
@Kenn, Yes, it is through the Elasticsearch test framework that you reference that I discovered that the direct runner counter metrics thread leaked. I opened this ticket, https://issues.apache.org/jira/browse/BEAM-3119, at the time. But it seemed overkill to ship the ES test framework in Romain's PR unit

Re: performance tests of spark fail

2018-04-24 Thread Etienne Chauchot
Hi Kamil, I did not code nor schedule the Spark performance tests, but I see in the log (https://builds.apache.org/blue/organizations/jenkins/beam_PerformanceTests_Spark/detail/beam_PerformanceTests_Spark/1628/pipeline) that it uses PerfKit and also that the job that is run is this one:

Re: performance tests of spark fail

2018-04-24 Thread Kamil Szewczyk
Hi Etienne, I was recently playing a lot with BigQuery while working on an anomaly detection tool and noticed that in the DB schema the timestamp is defined as FLOAT. Perfkit also produces it as a float, 'timestamp': 1524485484.41655, so the upload passes. Probably it was defined as a float from the
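As a quick illustration of why such a value passes a FLOAT column, epoch time with sub-second precision is naturally a double (an assumed illustration, not Perfkit's actual code):

```java
public class EpochTimestamp {
  public static void main(String[] args) {
    // Epoch seconds with sub-second precision, e.g. 1524485484.41655,
    // which matches a FLOAT schema column, so the upload passes.
    double timestamp = System.currentTimeMillis() / 1000.0;
    System.out.println(timestamp);
  }
}
```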