Re: [VOTE] Remove deprecated Java Reference Runner code from repository.

2019-05-21 Thread Kenneth Knowles
Nicely done! On Mon, May 20, 2019 at 4:06 PM Daniel Oliveira wrote: > Pablo has merged the PR in and assigned a tag to the commit to make the > ULR code easy to find in the future (java-ulr-removal > ). The Java ULR is > officially removed!

Custom Watermark Instance being created multiple times for KafkaIO

2019-05-21 Thread rahul patwari
Hi, We are using withTimestampPolicyFactory (TimestampPolicyFactory

Re: Definition of Unified model (WAS: Semantics of PCollection.isBounded)

2019-05-21 Thread Kenneth Knowles
On Tue, May 21, 2019 at 7:02 AM Robert Bradshaw wrote: > Reza: One could provide something like this as a utility class, but > one downside is that it is not scale invariant. It requires a tuning > parameter that, if to small, won't mitigate the problem, but if to > big, greatly increases

Re: RedisIO refactoring

2019-05-21 Thread Varun Dhussa
Hi all, Meta may not be a requirement, but RedisRecord can make more operations cleaner. Modelling Redis Mutations allows a clear way for operations with different payloads (, and >). I feel that Hash-map and sorted set operations can be modeled better with a mutation class. Of-course, this can

Re: SqlTransform Metadata

2019-05-21 Thread Reza Rokni
Hi, Coming back to this do we have enough of a consensus to say that in principle this is a good idea? If yes I will raise a Jira for this. Cheers Reza On Thu, 16 May 2019 at 02:58, Robert Bradshaw wrote: > On Wed, May 15, 2019 at 8:51 PM Kenneth Knowles wrote: > > > > On Wed, May 15, 2019

Re: Better naming for runner specific options

2019-05-21 Thread Reza Rokni
Hi, Coming back to this, is the general consensus that this can be addressed via https://issues.apache.org/jira/browse/BEAM-6531 in Beam 3.0? Cheers Reza On Tue, 7 May 2019 at 23:15, Valentyn Tymofieiev wrote: > I think using RunnerOptions was an idea at some point, but in Python, we > ended

Please add me to BEAM JIRA

2019-05-21 Thread E. J. Arens
Hi, As instructed by Valentyn Tymofieiev at https://issues.apache.org/jira/browse/BEAM-7101 Could you please add me to Beam JIRA, so that you can assign the issue to myself? Met vriendelijke groet, Kind regards, Elwin Arens

Re: [Discussion] A tweak to existing large iterable protocol?

2019-05-21 Thread Kenneth Knowles
The performance tradeoff debate is probably better put to a benchmark than an email thread. Kenn On Tue, May 21, 2019 at 12:33 PM Lukasz Cwik wrote: > I don't think the runner needs to know the size of the iterable ahead of > time for encoding, the binary format could be something like: >

Re: [Discuss] Ideas for Apache Beam presence in social media

2019-05-21 Thread Kenneth Knowles
Great idea. Austin - point well taken about whether the PMC really has to micro-manage here. The stakes are potentially very high, but so are the stakes for code and website changes. I know that comdev votes authoring privileges to people who are not committers, but they are not speaking on

Re: [Discussion] A tweak to existing large iterable protocol?

2019-05-21 Thread Robert Bradshaw
The primary con I see with this is that the runner must know ahead of time, when it starts encoding the iterable, whether or not to treat it as a large one (to also cache the part it's encoding to the data channel, or at least some kind of pointer to it). With the current protocol it can be lazy,

Re: Definition of Unified model (WAS: Semantics of PCollection.isBounded)

2019-05-21 Thread Reza Rokni
In a lot of cases the initial combiner can dramatically reduce the amount of data in this last phase making it tractable for a lot of use cases. I assume in your example the first phase would not provide enough compression? Cheers Reza On Tue, 21 May 2019, 16:47 Jan Lukavský, wrote: > Hi

Re: Definition of Unified model (WAS: Semantics of PCollection.isBounded)

2019-05-21 Thread Jan Lukavský
Hi Reza, thanks for reaction, comments inline. On 5/21/19 1:02 AM, Reza Rokni wrote: Hi, If I have understood the use case correctly, your output is an ordered counter of state changes. One approach  which might be worth exploring is outlined below, haven't had a chance to test it so could

Re: Definition of Unified model (WAS: Semantics of PCollection.isBounded)

2019-05-21 Thread Jan Lukavský
Hi Robert, > Beam has an exactly-once model. If the data was consumed, state mutated, and outputs written downstream (these three are committed together atomically) it will not be replayed. That does not, of course, solve the non-determanism due to ordering (including the fact that two

Re: Definition of Unified model (WAS: Semantics of PCollection.isBounded)

2019-05-21 Thread Jan Lukavský
Hi Kenn, OK, so if we introduce annotation, we can have stateful ParDo with sorting, that would perfectly resolve my issues. I still have some doubts, though. Let me explain. The current behavior of stateful ParDo has the following properties:  a) might fail in batch, although runs fine in

Re: Definition of Unified model (WAS: Semantics of PCollection.isBounded)

2019-05-21 Thread Robert Bradshaw
On Mon, May 20, 2019 at 5:24 PM Jan Lukavský wrote: > > > I don't see batch vs. streaming as part of the model. One can have > microbatch, or even a runner that alternates between different modes. > > Although I understand motivation of this statement, this project name is > "Apache Beam: An

Re: Definition of Unified model

2019-05-21 Thread Jan Lukavský
Hi, > Actually, I think it is a larger (open) question whether exactly once is guaranteed by the model or whether runners are allowed to relax that. I would think, however, that sources correctly implemented should be idempotent when run atop an exactly once infrastructure such as Flink of

Re: Definition of Unified model (WAS: Semantics of PCollection.isBounded)

2019-05-21 Thread Robert Bradshaw
Reza: One could provide something like this as a utility class, but one downside is that it is not scale invariant. It requires a tuning parameter that, if to small, won't mitigate the problem, but if to big, greatly increases latency. (Possibly one could define a dynamic session-like window to

Re: RedisIO refactoring

2019-05-21 Thread Alexey Romanenko
Varun, thank you for starting this topic. I agree that it would make sense to do some refactoring and introduce something like “RedisRecord” which will be parameterised and contain all additional metadata if needed. In the same time, we can keep current public API based on KV, but move to

Re: Definition of Unified model (WAS: Semantics of PCollection.isBounded)

2019-05-21 Thread Jan Lukavský
Hi Reza, I think it probably would provide enough compression. But it would introduce complications and latency for the streaming case. Although I see your point, I was trying to figure out if the Beam model should support these use cases more "natively". Cheers,  Jan On 5/21/19 11:03 AM,

Re: Definition of Unified model (WAS: Semantics of PCollection.isBounded)

2019-05-21 Thread Reza Rokni
Hi Jan, It's a super interesting use case you have and has a lot of similarity with complexity that comes up when dealing with time series problems. I wonder if it would be interesting to see if the pattern generalises enough to make some utility classes abstracting the complexity from the user.

Re: Definition of Unified model

2019-05-21 Thread Lukasz Cwik
On Tue, May 21, 2019 at 7:49 AM Jan Lukavský wrote: > Hi, > > > Actually, I think it is a larger (open) question whether exactly once > is guaranteed by the model or whether runners are allowed to relax that. > I would think, however, that sources correctly implemented should be > idempotent

Re: Definition of Unified model

2019-05-21 Thread Reuven Lax
I don't think this is completely accurate w.r.t Flink sinks. Flink provides a way for sinks to buffer data until a snapshot has been performed, at which point the data going to the sink is persistent. This has the exact same effect as other runners (e.g. Dataflow) that persistently buffer data.

Re: [Discussion] A tweak to existing large iterable protocol?

2019-05-21 Thread Lukasz Cwik
I don't think the runner needs to know the size of the iterable ahead of time for encoding, the binary format could be something like: first page | token for first page | token for second page The only difference between the current encoding and the new encoding is whether there are one or two

Re: RedisIO refactoring

2019-05-21 Thread Ismaël Mejía
After a quick review of the code now I think I understand why it was modeled as KV in the first place, the library that RedisIO uses (Jedis) only supports 'mget' operation on Strings, so the first issue would be to find a way to do the native byte[] operations, maybe another library? Ideas? About

Re: Proposal: Add permanent url to community metrics dashboard

2019-05-21 Thread Ahmet Altay
If SSL is a concern that makes sense, I am not familiar with that enough to suggest whether another way to do this exists or not. It will be good to check that we can set robots.txt properly from the begging if we go down this path. On Mon, May 20, 2019 at 10:54 AM Mikhail Gryzykhin wrote: >

Re: Proposal: Add permanent url to community metrics dashboard

2019-05-21 Thread Mikhail Gryzykhin
Current http://104.154.241.245/robots.txt is already disallow all, so we are good here. On Tue, May 21, 2019 at 12:57 PM Ahmet Altay wrote: > If SSL is a concern that makes sense, I am not familiar with that enough > to suggest whether another way to do this exists or not. > > It will be good

Re: Beam dashboards

2019-05-21 Thread Mikhail Gryzykhin
@Łukasz Gajowy Reviving old thread: Grafana doesn't support BigQuery officially, but recently there were news with unofficial BQ plugin: * https://blog.doit-intl.com/power-grafana-with-google-bigquery-6822443a7f99 * https://github.com/doitintl/bigquery-grafana I didn't try it yet, but is looks