Re: Streaming data from Pubsub to Spanner with Beam dataflow pipeline

2019-10-30 Thread Kenneth Knowles
Moving to u...@beam.apache.org, the best mailing list for questions like this. Yes, this kind of workload is a core use case for Beam. If you have a problem, please write to this user list with details. Kenn On Wed, Oct 30, 2019 at 4:07 AM Taher Koitawala wrote: > Hi All, > My

Re: [spark structured streaming runner] merge to master?

2019-10-30 Thread Kenneth Knowles
Very good points. We definitely ship a lot of code/features in very early stages, and there seems to be no problem. I intend mostly to leave this judgment to people like you who know better about Spark users. But I do think 1 or 2 jars is better than 3. I really don't like "3 jars" and I did

Re: Rethinking the Flink Runner modes

2019-10-30 Thread Robert Bradshaw
On Wed, Oct 30, 2019 at 3:34 PM Maximilian Michels wrote: > > > One thing I don't understand is what it means for "CLI or REST API > > context [to be] present." Where does this context come from? A config > > file in a standard location on the user's machine? Or is this > > something that is only

Re: aggregating over triggered results

2019-10-30 Thread Robert Bradshaw
On Tue, Oct 29, 2019 at 7:01 PM Aaron Dixon wrote: > > Thank you, Luke and Robert. Sorry for hitting dev@, I criss-crossed and meant > to hit user@, but as we're here could you clarify your two points, however-- No problem. This is veering into dev@ territory anyway :). > 1) I am under the

Re: Rethinking the Flink Runner modes

2019-10-30 Thread Maximilian Michels
One thing I don't understand is what it means for "CLI or REST API context [to be] present." Where does this context come from? A config file in a standard location on the user's machine? Or is this something that is only present when a user uploads a jar and then Flink runs it in a specific

Re: [spark structured streaming runner] merge to master?

2019-10-30 Thread Ismaël Mejía
I am still a bit lost as to why we are discussing options without giving any arguments or reasons for them. Why is 2 modules better than 3, or 3 better than 2, or even better, what forces us to have anything other than a single module? What are the reasons for wanting to have separate

Re: RFC: python static typing PR

2019-10-30 Thread Robert Bradshaw
On Wed, Oct 30, 2019 at 1:26 PM Chad Dombrova wrote: > >> Do you believe that a future mypy plugin could replace pipeline type checks >> in Beam, or are there limits to what it can do? > > mypy will get us quite far on its own once we completely annotate the beam > code. That said, my PR does

Re: RFC: python static typing PR

2019-10-30 Thread Ismaël Mejía
The herculean term is perfect to describe this impressive achievement, Chad. Congratulations and thanks for the effort to make this happen. This will not only give Beam users improved functionality but, as Robert mentioned, will also help others understand the internals of the Python SDK more quickly.

Re: Python SDK timestamp precision

2019-10-30 Thread Robert Bradshaw
On Wed, Oct 30, 2019 at 2:00 AM Jan Lukavský wrote: > > TL;DR - can we solve this by representing aggregations as not point-wise > events in time, but time ranges? Explanation below. > > Hi, > > this is pretty interesting from a theoretical point of view. The > question generally seems to be -

Re: RFC: python static typing PR

2019-10-30 Thread Chad Dombrova
> Do you believe that a future mypy plugin could replace pipeline type > checks in Beam, or are there limits to what it can do? > mypy will get us quite far on its own once we completely annotate the beam code. That said, my PR does not include my efforts to turn PTransforms into Generics, which
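To make the discussion above concrete, here is a minimal standalone sketch (not taken from the Beam codebase; the function name and types are hypothetical) of the kind of annotations being added: once a function carries hints like these, mypy can flag a caller that passes the wrong type before the code ever runs, which is what the thread means by mypy "getting us quite far on its own".

```python
from typing import Dict, Iterable, Tuple

def count_per_key(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Sum values per key. With these annotations, mypy rejects e.g.
    count_per_key("not pairs") statically, without running anything."""
    totals: Dict[str, int] = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return totals

print(count_per_key([("a", 1), ("a", 2), ("b", 5)]))  # → {'a': 3, 'b': 5}
```

Runtime pipeline type checks can still catch what static analysis cannot (e.g. types only known at graph construction time), which is why the thread treats mypy as a complement rather than a full replacement.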

Re: Rethinking the Flink Runner modes

2019-10-30 Thread Robert Bradshaw
One more question: https://issues.apache.org/jira/browse/BEAM-8396 still seems valuable, but with [auto] as the default, how should we detect whether LOOPBACK is safe to enable from Python? On Wed, Oct 30, 2019 at 11:53 AM Robert Bradshaw wrote: > > Sounds good to me. > > One thing I don't

Re: Rethinking the Flink Runner modes

2019-10-30 Thread Robert Bradshaw
Sounds good to me. One thing I don't understand is what it means for "CLI or REST API context [to be] present." Where does this context come from? A config file in a standard location on the user's machine? Or is this something that is only present when a user uploads a jar and then Flink runs it

Re: [spark structured streaming runner] merge to master?

2019-10-30 Thread Alexey Romanenko
Yes, agree, two jars included in an uber jar will work in a similar way. Though having 3 jars still looks quite confusing to me. > On 29 Oct 2019, at 23:54, Kenneth Knowles wrote: > > Is it just as easy to have two jars and build an uber jar with both included? > Then the runner can still be

Re: why are so many transformation needed for a simple TextIO.write() operation

2019-10-30 Thread Luke Cwik
A lot of the logic is around handling various error scenarios. You should notice that the majority of that graph is about passing around metadata about what files were written and what errors there were. That metadata is tiny in comparison and should only be a blip when compared to writing the

Re: RFC: python static typing PR

2019-10-30 Thread Chad Dombrova
> > As Beam devs will be gaining more first-hand experience with the tooling, > we may need to add a style guide/best practices/FAQ to our contributor > guide to clarify known issues. > I'm happy to help out with that, just let me know. -chad

Re: RFC: python static typing PR

2019-10-30 Thread Luke Cwik
+1 for type annotations. On Mon, Oct 28, 2019 at 7:41 PM Robert Burke wrote: > As someone who cribs from the Python SDK to make changes in the Go SDK, > this will make things much easier to follow! Thank you. > > On Mon, Oct 28, 2019, 6:52 PM Chad Dombrova wrote: > >> >> Wow, that is an

Streaming data from Pubsub to Spanner with Beam dataflow pipeline

2019-10-30 Thread Taher Koitawala
Hi All, My current use-case is to write data from Pubsub to Spanner using a streaming pipeline. I do see that Beam does have a SpannerIO to write. However, with Pub/Sub being streaming and Spanner being RDBMS-like, it would be helpful if you guys can tell me if this will be

Re: Python SDK timestamp precision

2019-10-30 Thread Jan Lukavský
TL;DR - can we solve this by representing aggregations as not point-wise events in time, but time ranges? Explanation below. Hi, this is pretty interesting from a theoretical point of view. The question generally seems to be - having two events, can I reliably order them? One event might be
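The precision question behind this thread can be illustrated with a small standalone sketch (illustrative only; the timestamps are invented): two events that are perfectly orderable at microsecond precision collapse into the same instant once truncated to milliseconds, which is exactly the "having two events, can I reliably order them?" ambiguity.

```python
def to_millis(micros: int) -> int:
    """Truncate a microsecond timestamp to millisecond precision,
    discarding any sub-millisecond ordering information."""
    return micros // 1000

# Two events 998 microseconds apart (hypothetical values).
event_a = 1_572_393_600_000_001  # microseconds since epoch
event_b = 1_572_393_600_000_999

assert event_a < event_b                          # orderable at microsecond precision
assert to_millis(event_a) == to_millis(event_b)   # indistinguishable at millisecond precision
```

Representing an aggregation as a time *range* rather than a point, as proposed above, sidesteps the need to pick a single truncated instant for the result.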