Re: RabbitMQ and CheckpointMark feasibility

2019-11-13 Thread Jan Lukavský
Hi Danny, as Eugene pointed out, there are essentially two "modes of operation" of CheckpointMark. It can:  a) be used to somehow restore state of a reader (in call to UnboundedSource#createReader)  b) confirm processed elements in CheckpointMark#finalizeCheckpoint If your source doesn't

Re: [discuss] Using a logger hierarchy in Python

2019-11-13 Thread Chad Dombrova
Hi Thomas, > Will this include the ability for users to configure logging via pipeline > options? > We're working on a proposal to allow pluggable logging handlers that can be configured via pipeline options. For example, it would allow you to add a new logging handler for StackDriver or

Re: RabbitMQ and CheckpointMark feasibility

2019-11-13 Thread Daniel Robert
I believe I've nailed down a situation that happens in practice that causes Beam and Rabbit to be incompatible. It seems that runners can and do make assumptions about the serializability (via Coder) of a CheckpointMark. To start, these are the semantics of RabbitMQ: - the client establishes

Why is Pipeline not Serializable and can it be changed to be Serializable

2019-11-13 Thread Pulasthi Supun Wickramasinghe
Hi Devs, Currently, the Pipeline class in Beam is not Serializable. This is not a problem for the current runners since the pipeline is translated and submitted through a centralized, driver-like model. However, if the runner has a decentralized model similar to OpenMPI (MPI), which is also the

Re: session window puzzle

2019-11-13 Thread Aaron Dixon
This is a great help. Thank you. I like the custom window solution pattern as a way to hold the watermark and merge down to keep the watermark where it is needed. Perhaps there is some interesting generalized session window here.. I'll have to digest the stateful DoFn approach. Avoiding

Re: session window puzzle

2019-11-13 Thread Kenneth Knowles
You've done a very good analysis* and I think your solution is pretty clever. The simple fact is this: the watermark has to be held to the minimum of any output you intend to produce. So for your use case, the hold has to be the timestamp of the Green element. Your solution does hold the watermark
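
The rule stated above (the watermark is held to the minimum timestamp of any output still to be produced) can be illustrated with a toy model. This is only a sketch in plain Python, not Beam's API: the function name `held_watermark` and the integer timestamps are assumptions made up for illustration.

```python
def held_watermark(input_watermark, pending_output_timestamps):
    """Toy model of a watermark hold (hypothetical helper, not a Beam API).

    The output watermark may not advance past the earliest timestamp
    of any element that is still going to be emitted.
    """
    if not pending_output_timestamps:
        # Nothing is pending, so nothing holds the watermark back.
        return input_watermark
    return min(input_watermark, min(pending_output_timestamps))


# The input watermark has reached 10, but an output stamped 4 is still pending.
hold_with_pending = held_watermark(10, [4, 7])
hold_without_pending = held_watermark(10, [])
```

In this model, a pending output with the timestamp of the earliest element keeps the watermark at that timestamp until the output is emitted.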

Re: [discuss] Using a logger hierarchy in Python

2019-11-13 Thread Pablo Estrada
Okay, I've just gone and done this for most modules: Runners modules: https://github.com/apache/beam/pull/10097 IO modules: https://github.com/apache/beam/pull/10099 Other modules (testing, utils): https://github.com/apache/beam/pull/10100 I imagine the trickier one will be runners, since

Re: [Discuss] Beam mascot

2019-11-13 Thread Valentyn Tymofieiev
I like the firefly sketch a lot, it's my favorite so far. On Wed, Nov 13, 2019 at 12:58 PM Robert Bradshaw wrote: > #37 from the sketches was the cuttlefish, which would put it at (with > 4 votes) the most popular so far. I do like the firefly too. > > On Wed, Nov 13, 2019 at 12:03 PM Gris

Re: [Discuss] Beam mascot

2019-11-13 Thread Robert Bradshaw
#37 from the sketches was the cuttlefish, which would put it at (with 4 votes) the most popular so far. I do like the firefly too. On Wed, Nov 13, 2019 at 12:03 PM Gris Cuevas wrote: > > Hi everyone, so exciting to see this convo taking off! > > I loved Alex's firefly! -- it can have so many

[PROPOSAL] Add support for writing flattened schemas to pubsub

2019-11-13 Thread Brian Hulette
I've been looking into adding support for writing (i.e. INSERT INTO statements) for the pubsub DDL, which currently only supports reading. This DDL requires the defined schema to have exactly three fields: event_timestamp, attributes, and payload, corresponding to the fields in PubsubMessage

Re: [Discuss] Beam mascot

2019-11-13 Thread Gris Cuevas
Hi everyone, so exciting to see this convo taking off! I loved Alex's firefly! -- it can have so many cool variations, and as a stuffed animal is very original. Other ideas I had are a caterpillar because it looks like a data pipeline, lol or the beaver! Feedback on the current sketches.

Re: Pipeline AttributeError on Python3

2019-11-13 Thread Valentyn Tymofieiev
I also opened https://issues.apache.org/jira/browse/BEAM-8651 to track this issue and any recommendation for the users that will come out of it. On Thu, Nov 7, 2019 at 6:25 PM Valentyn Tymofieiev wrote: > I think we have heard of this issue from the same source: > > This looks exactly like a

org.apache.beam.sdk.io.clickhouse.AtomicInsertTest.testIdempotentInsert fails

2019-11-13 Thread Tomo Suzuki
Hi Beam developers, The org.apache.beam.sdk.io.clickhouse.AtomicInsertTest fails in my development environment. Created https://issues.apache.org/jira/browse/BEAM-8650 The error message indicates that ClickHouse (which I'm not familiar with) is trying to connect to a strange (random) IP address for

Re: Type of builtin PTransform/PCollection metrics

2019-11-13 Thread Robert Bradshaw
On Wed, Nov 13, 2019 at 10:56 AM Maximilian Michels wrote: > > > Are you referring specifically to? > > * beam:metric:element_count:v1 > > * beam:metric:pardo_execution_time:start_bundle_msecs:v1 > > * beam:metric:pardo_execution_time:process_bundle_msecs:v1 > > *

Re: [Discuss] Beam mascot

2019-11-13 Thread Jozef Vilcek
Interesting topic :) I also kind of liked Alex's firefly and the impression it made on me. To drive it further: hands on hips make a strong / serious pose, hovering in the air above all. I would put the logo on him, to become his torso / body or a dress. The logo with a big B on it almost looks like

Re: Make environment_id a top level attribute of PTransform

2019-11-13 Thread Chamikara Jayalath
On Wed, Nov 13, 2019 at 10:42 AM Luke Cwik wrote: > The original ideology was that only those things that needed to set it > would contain the attribute, but once something becomes > common enough it makes sense to have it as an optional parameter, so +1. > > Are there areas where

Re: [discuss] Using a logger hierarchy in Python

2019-11-13 Thread Chad Dombrova
On Wed, Nov 13, 2019 at 10:52 AM Robert Bradshaw wrote: > I would be in favor of using module-level loggers as well. +1

Re: Type of builtin PTransform/PCollection metrics

2019-11-13 Thread Maximilian Michels
Are you referring specifically to? * beam:metric:element_count:v1 * beam:metric:pardo_execution_time:start_bundle_msecs:v1 * beam:metric:pardo_execution_time:process_bundle_msecs:v1 * beam:metric:pardo_execution_time:finish_bundle_msecs:v1 * beam:metric:ptransform_execution_time:total_msecs:v1

Re: [discuss] Using a logger hierarchy in Python

2019-11-13 Thread Robert Bradshaw
I would be in favor of using module-level loggers as well. I think per-class would be overkill and unlike Java not everything is in a class, as well as being more conventional in Python (where modules are generally seen as the unit of compilation, vs. Java where classes are the unit of compilation
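
For readers unfamiliar with the module-level logger convention being discussed, here is a minimal sketch using only the standard `logging` module. The package name `example_pkg` and the variable names are hypothetical and are not taken from the Beam codebase.

```python
import logging

# Conventional module-level logger: one per module, named after the module.
_LOGGER = logging.getLogger(__name__)

# Dotted names form a hierarchy, so configuring a parent logger
# affects every logger beneath it (names here are hypothetical).
parent = logging.getLogger("example_pkg")
child = logging.getLogger("example_pkg.io.reader")

# The child has no level of its own, so it inherits the parent's.
parent.setLevel(logging.WARNING)
child_level = child.getEffectiveLevel()

# A more specific setting on the child overrides the inherited one.
child.setLevel(logging.DEBUG)
child_level_after_override = child.getEffectiveLevel()
```

With this pattern a user can adjust verbosity for a whole package or for a single module by name, without touching the root logger.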

Re: Make environment_id a top level attribute of PTransform

2019-11-13 Thread Luke Cwik
The original ideology was that only those things that needed to set it would contain the attribute, but once something becomes common enough it makes sense to have it as an optional parameter, so +1. Are there areas where the environment id will still exist outside of a PTransform?

Re: Date/Time Ranges & Protobuf

2019-11-13 Thread Luke Cwik
I do agree that Apache Beam can represent dates and times with arbitrary precision and can do it many different ways. My argument has always been around whether we should restrict this range to a common standard to increase interoperability across other systems. For example, SQL database servers

Re: Cleaning up Approximate Algorithms in Beam

2019-11-13 Thread Reuven Lax
On Wed, Nov 13, 2019 at 9:58 AM Ahmet Altay wrote: > Thank you for writing this summary. > > On Tue, Nov 12, 2019 at 6:35 PM Reza Rokni wrote: > >> Hi everyone; >> >> TL/DR : Discussion on Beam's various Approximate Distinct Count >> algorithms. >> >> Today there are several options for

Re: Cleaning up Approximate Algorithms in Beam

2019-11-13 Thread Ahmet Altay
Thank you for writing this summary. On Tue, Nov 12, 2019 at 6:35 PM Reza Rokni wrote: > Hi everyone; > > TL/DR : Discussion on Beam's various Approximate Distinct Count algorithms. > > Today there are several options for Approximate Algorithms in Apache Beam > 2.16 with HLLCount being the most

Re: Type of builtin PTransform/PCollection metrics

2019-11-13 Thread Luke Cwik
Are you referring specifically to? * beam:metric:element_count:v1 * beam:metric:pardo_execution_time:start_bundle_msecs:v1 * beam:metric:pardo_execution_time:process_bundle_msecs:v1 * beam:metric:pardo_execution_time:finish_bundle_msecs:v1 * beam:metric:ptransform_execution_time:total_msecs:v1

Re: [discuss] Using a logger hierarchy in Python

2019-11-13 Thread Luke Cwik
That doesn't seem like a very invasive change so if we adopt it we should adopt it everywhere in the same CL so people see the common pattern and use it. I'm for using a named logger and would rather that it is per class instead of per module since many of the modules have lots of classes but +1

Re: [Discuss] Beam mascot

2019-11-13 Thread Maximilian Michels
Same. What about 37 with the eyes from 52? +1 That would combine two ideas: (1) "Beam" eyes and (2) sea animal. We could set this as the working idea and build a logo based off that. On 12.11.19 22:41, Robert Bradshaw wrote: On Tue, Nov 12, 2019 at 1:29 PM Aizhamal Nurmamat kyzy wrote: 52

Re: [spark structured streaming runner] merge to master?

2019-11-13 Thread Etienne Chauchot
Ok for 1 jar with the 2 runners then. I'll add the banner to the logs and the Experimental in the code and in the javadocs. Thanks for your opinions guys! Etienne On 08/11/2019 18:50, Kenneth Knowles wrote: On Thu, Nov 7, 2019 at 5:32 PM Etienne Chauchot >

Type of builtin PTransform/PCollection metrics

2019-11-13 Thread Maximilian Michels
Hi, We have a series of builtin PTransform/PCollection metrics: https://github.com/apache/beam/blob/808cb35018cd228a59b152234b655948da2455fa/model/pipeline/src/main/proto/metrics.proto#L74 Why are those counters ("beam:metrics:sum_int_64")? I think the better default type for most users

Re: Date/Time Ranges & Protobuf

2019-11-13 Thread Jan Lukavský
Hi, just an idea on these related topics that appear these days - it might help to realize that we actually don't need full arithmetic on timestamps (the Beam model IMHO doesn't need to know the exact difference between two events). What we actually need is a slightly

Re: Date/Time Ranges & Protobuf

2019-11-13 Thread jincheng sun
Thanks for bringing up this discussion @Luke. As @Kenn mentioned, in Beam we have defined constant values for the min/max/end of the global window. I noticed that google.protobuf.Timestamp/Duration is only used in window definitions, such as FixedWindowsPayload, SlidingWindowsPayload,