Unexpected behavior of StateSpecs

2019-05-09 Thread Jan Lukavský
Hi, I have spent several hours digging into a strange issue with DirectRunner that manifested as a non-deterministic pipeline run. The pipeline contains basically only a single stateful ParDo, which adds elements into state and, after some timeout, flushes these elements to output. The issues wa

Re: Unexpected behavior of StateSpecs

2019-05-09 Thread Jan Lukavský
ly a bug. On Thu, May 9, 2019 at 7:42 AM Jan Lukavský <je...@seznam.cz> wrote: Hi, I have spent several hours digging into a strange issue with DirectRunner that manifested as a non-deterministic pipeline run. The pipeline contains basically only single s

Re: Unexpected behavior of StateSpecs

2019-05-10 Thread Jan Lukavský
unner but wasn't able to figure it out yet: https://lists.apache.org/thread.html/dae8b605a218532c085a0eea4e71338eae51922c26820f37b24875c0@%3Cdev.beam.apache.org%3E Regards, Anton *From: *Jan Lukavský <je...@seznam.cz> *Date: *Thu, May 9, 2019 at 8:13 AM *To: * mailto:dev

Re: Unexpected behavior of StateSpecs

2019-05-10 Thread Jan Lukavský
I'm not sure. Generally it affects any runner that uses HashMap to store StateSpec. Jan On 5/9/19 6:32 PM, Reuven Lax wrote: Is this specific to the DirectRunner, or does it affect other runners? On Thu, May 9, 2019 at 8:13 AM Jan Lukavský <je...@seznam.cz> wrote:

Re: Unexpected behavior of StateSpecs

2019-05-10 Thread Jan Lukavský
Fn as its id and should only be using that id for comparisons/lookups. On Fri, May 10, 2019 at 1:07 AM Jan Lukavský <je...@seznam.cz> wrote: I'm not sure. Generally it affects any runner that uses HashMap to store StateSpec. Jan On 5/9/1

Re: Unexpected behavior of StateSpecs

2019-05-10 Thread Jan Lukavský
as a first pass for understanding the implications. On Fri, May 10, 2019 at 9:31 AM Jan Lukavský <je...@seznam.cz> wrote: Hm, yes, the fix might also be in fixing hashCode and equals of SimpleStateTag, so that it doesn't hash and compare the StateSpec, but only th
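
Note: the fix proposed in the snippets above (hash and compare a state tag by its id only, not by the StateSpec) can be illustrated with a minimal sketch. Everything here is hypothetical: `IdOnlyTag` and `SpecWithoutEquals` are stand-ins invented for illustration, not Beam's SimpleStateTag or StateSpec, and the sketch simply assumes a spec type that lacks value-based equals.

```java
import java.util.HashMap;
import java.util.Map;

class StateTagSketch {
  // Stores a value under one tag and reads it back through a second tag that
  // has the same id but a different spec instance.
  static String lookupWithFreshSpec() {
    Map<IdOnlyTag, String> state = new HashMap<>();
    state.put(new IdOnlyTag("buffer", new SpecWithoutEquals()), "stored");
    return state.get(new IdOnlyTag("buffer", new SpecWithoutEquals()));
  }

  // Same experiment keyed directly by the spec: identity-based hashing misses.
  static String lookupBySpec() {
    Map<SpecWithoutEquals, String> state = new HashMap<>();
    state.put(new SpecWithoutEquals(), "stored");
    return state.get(new SpecWithoutEquals());
  }

  public static void main(String[] args) {
    System.out.println(lookupWithFreshSpec()); // found via the id
    System.out.println(lookupBySpec());        // not found
  }
}

// Hypothetical spec type (assumption): inherits identity-based equals/hashCode
// from Object, so two instances never compare equal.
final class SpecWithoutEquals {}

// Hypothetical tag type (assumption): equality and hash depend on the id only.
final class IdOnlyTag {
  final String id;
  final SpecWithoutEquals spec;

  IdOnlyTag(String id, SpecWithoutEquals spec) {
    this.id = id;
    this.spec = spec;
  }

  @Override
  public boolean equals(Object o) {
    return o instanceof IdOnlyTag && ((IdOnlyTag) o).id.equals(id);
  }

  @Override
  public int hashCode() {
    return id.hashCode();
  }
}
```

With id-only equality the lookup succeeds regardless of which spec instance the tag carries, which is the property the thread asks for.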

Semantics of PCollection.isBounded

2019-05-15 Thread Jan Lukavský
Hi, I have come across an unexpected (at least for me) inconsistency between how a PCollection is processed in DirectRunner and what PCollection.isBounded signals. Let me explain: - I have a stateful ParDo, which needs to make sure that elements arrive in order - it accomp

Re: Semantics of PCollection.isBounded

2019-05-15 Thread Jan Lukavský
uch a need. Jan On 5/15/19 5:20 PM, Jan Lukavský wrote: Hi, I have come across an unexpected (at least for me) inconsistency between how a PCollection is processed in DirectRunner and what PCollection.isBounded signals. Let me explain: - I have a stateful ParDo, which nee

Re: Semantics of PCollection.isBounded

2019-05-15 Thread Jan Lukavský
a, in places where this might matter (which therefore enables some optimizations). The order would be relevant only in the stateful ParDo, I'd say. Jan On 5/15/19 8:34 PM, Jan Lukavský wrote: Just to clarify, I understand that changing the semantics of PCollection.isBounded is prob

Definition of Unified model (WAS: Semantics of PCollection.isBounded)

2019-05-16 Thread Jan Lukavský
especially global sorting. If you want to have sorting by key, you could do a GroupByKey and then sort the groups in memory. This only works, of course, if your groups are not too large. On 15. May 2019, at 21:01, Jan Lukavský wrote: Hmmm, looking into the code of FlinkRunner (and also by ob
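
Note: the suggestion above (group by key, then sort each group in memory) can be sketched in plain Java. This is only a model of the idea, not Beam's GroupByKey API; the `Event` type and all names are hypothetical, and, as the snippet says, the approach only works when each group fits in memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SortWithinGroupsSketch {
  // Hypothetical element type: a key plus an event timestamp in milliseconds.
  static final class Event {
    final String key;
    final long timestamp;

    Event(String key, long timestamp) {
      this.key = key;
      this.timestamp = timestamp;
    }
  }

  // Groups events by key, then sorts the timestamps of each group in memory.
  static Map<String, List<Long>> sortedTimestampsByKey(List<Event> events) {
    Map<String, List<Long>> groups = new TreeMap<>();
    for (Event e : events) {
      groups.computeIfAbsent(e.key, k -> new ArrayList<>()).add(e.timestamp);
    }
    for (List<Long> group : groups.values()) {
      group.sort(Comparator.naturalOrder());
    }
    return groups;
  }

  public static void main(String[] args) {
    List<Event> events = Arrays.asList(
        new Event("a", 30L), new Event("b", 5L), new Event("a", 10L), new Event("a", 20L));
    System.out.println(sortedTimestampsByKey(events)); // {a=[10, 20, 30], b=[5]}
  }
}
```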

Re: Definition of Unified model (WAS: Semantics of PCollection.isBounded)

2019-05-16 Thread Jan Lukavský
y understanding of the Beam model. :-O Best, Aljoscha On 16. May 2019, at 13:53, Jan Lukavský wrote: Hi, this is starting to be really exciting. It seems to me that there is either something wrong with my definition of "Unified model" or with how it is implemented inside (at least)

Re: Definition of Unified model (WAS: Semantics of PCollection.isBounded)

2019-05-16 Thread Jan Lukavský
o, this is just off the top of my head and I might be wrong in my understanding of the Beam model. :-O > > Best, > Aljoscha > >> On 16. May 2019, at 13:53, Jan Lukavský wrote: >> >> Hi, >> >> this is starting to be really exciting. It seems to me that

Re: Definition of Unified model (WAS: Semantics of PCollection.isBounded)

2019-05-17 Thread Jan Lukavský
r timestamp in the future (but you sometimes might) > > Those seem quite “poor”, but I think you can’t get better guarantees for general cases for the reasons mentioned above. Also, this is just off the top of my head and I might be wrong in my understanding of the Beam model. :-O > > Be

Re: Definition of Unified model (WAS: Semantics of PCollection.isBounded)

2019-05-17 Thread Jan Lukavský
w-java/src/main/java/org/apache/beam/runners/dataflow/BatchStatefulParDoOverrides.java#L64 On Thu, May 16, 2019 at 5:09 PM Jan Lukavský <je...@seznam.cz> wrote: Hi Max, answers inline. -- Original e-mail -- From: Maximilian Michels mailto:m...

Re: Definition of Unified model (WAS: Semantics of PCollection.isBounded)

2019-05-20 Thread Jan Lukavský
/19 12:41 PM, Robert Bradshaw wrote: On Fri, May 17, 2019 at 4:48 PM Jan Lukavský wrote: Hi Reuven, How so? AFAIK stateful DoFns work just fine in batch runners. Stateful ParDo works in batch as far as the logic inside the state works for absolutely unbounded out-of-orderness of elements. Th

Re: Definition of Unified model (WAS: Semantics of PCollection.isBounded)

2019-05-20 Thread Jan Lukavský
On 5/20/19 1:39 PM, Reuven Lax wrote: On Mon, May 20, 2019 at 4:19 AM Jan Lukavský <je...@seznam.cz> wrote: Hi Robert, yes, I think you rephrased my point - although no *explicit* guarantees of ordering are given in either mode, there is *implicit* or

Re: Definition of Unified model (WAS: Semantics of PCollection.isBounded)

2019-05-20 Thread Jan Lukavský
timestamp only in batch case (and do it for *all* batch stateful pardos without annotation), or introduce annotation, but then make the same guarantees for streaming case as well. Jan On 5/20/19 4:41 PM, Robert Bradshaw wrote: On Mon, May 20, 2019 at 1:19 PM Jan Lukavský wrote: Hi Robert, yes, I

Re: Definition of Unified model (WAS: Semantics of PCollection.isBounded)

2019-05-20 Thread Jan Lukavský
mply cannot first read everything from distributed storage and then buffer it all into memory, just to read it again, but sorted. That will not work. And even if it would, it would be a terrible waste of resources. Jan On 5/20/19 8:39 PM, Lukasz Cwik wrote: On Mon, May 20, 2019 at 8:24 AM J

Re: Definition of Unified model (WAS: Semantics of PCollection.isBounded)

2019-05-20 Thread Jan Lukavský
e a specific window/trigger that is missing that you feel would remove the need for you to use StatefulParDo? On Mon, May 20, 2019 at 12:54 PM Jan Lukavský <je...@seznam.cz> wrote: Hi Lukasz, > Today, if you must have a strict order, you must guarantee that y

Re: Definition of Unified model (WAS: Semantics of PCollection.isBounded)

2019-05-21 Thread Jan Lukavský
tly what the result is, only that it meets the spec. And other cases like user funnels are monotonic enough that you also don't actually need sorting. Kenn On Mon, May 20, 2019 at 2:59 PM Jan Lukavský <je...@seznam.cz> wrote: Yes, the problem will probably arise mostly

Re: Definition of Unified model (WAS: Semantics of PCollection.isBounded)

2019-05-21 Thread Jan Lukavský
't actually need sorting. Kenn On Mon, May 20, 2019 at 2:59 PM Jan Lukavský <je...@seznam.cz> wrote: Yes, the problem will probably arise mostly when the keys are not well distributed (or there are too few keys). I'm really not sure if a pure

Re: Definition of Unified model (WAS: Semantics of PCollection.isBounded)

2019-05-21 Thread Jan Lukavský
M, Robert Bradshaw wrote: On Mon, May 20, 2019 at 5:24 PM Jan Lukavský wrote: > I don't see batch vs. streaming as part of the model. One can have microbatch, or even a runner that alternates between different modes. Although I understand motivation of this statement, this project name is

Re: Definition of Unified model (WAS: Semantics of PCollection.isBounded)

2019-05-21 Thread Jan Lukavský
o don't actually need sorting. Kenn On Mon, May 20, 2019 at 2:59 PM Jan Lukavský <je...@seznam.cz> wrote: Yes, the problem will probably arise mostly when the keys are not well distributed (or there are too few keys). I'm really

Re: Definition of Unified model

2019-05-21 Thread Jan Lukavský
ent stuff in timestamp order to optimize. (That and, in practice, our annotation-style process methods don't lend themselves to easy composition.) I think it could work in specific cases though. More inline below. On Tue, May 21, 2019 at 11:38 AM Jan Lukavský wrote: Hi Robert, > Beam has

Re: Jenkins not triggering test runs? Be careful with merges.

2019-05-22 Thread Jan Lukavský
Looks like something is wrong with triggers on the GitHub side. Triggers are not triggered on other projects, too. Jan -- Original e-mail -- From: Valentyn Tymofieiev To: dev Date: 22. 5. 2019 21:23:19 Subject: Jenkins not triggering test runs? Be careful with merges. " Is somethin

Re: Definition of Unified model

2019-05-23 Thread Jan Lukavský
allows us to buffer (and checkpoint) processed data and only emit it once a Flink checkpoint has completed. Cheers, Max On 21.05.19 16:49, Jan Lukavský wrote: Hi,  > Actually, I think it is a larger (open) question whether exactly once is guaranteed by the model or whether runners are allowed t

@RequireTimeSortedInput design draft

2019-05-23 Thread Jan Lukavský
Hi, I have written a very brief draft of how it might be possible to implement @RequireTimeSortedInput discussed in [1]. I see the document [2] as a starting point for a discussion. There are several open questions, which I believe can be resolved by this great community. :-) Jan [1] http://ma

Re: Definition of Unified model

2019-05-23 Thread Jan Lukavský
l have a necessary ordering. Especially given Beam's decision to have millisecond timestamps this is possible, but even at microsecond or nanosecond precision this can happen at scale. To handle state machines you usually need some sort of FIFO ordering along with ordered sources, such as Kafk

Re: Definition of Unified model

2019-05-23 Thread Jan Lukavský
scale. To handle state machines you usually need some sort of FIFO ordering along with ordered sources, such as Kafka, not timestamp ordering. > > Reuven > > On Thu, May 23, 2019 at 12:32 AM Jan Lukavský <je...@seznam.cz> wrote: >> >> Hi all, >> >> th

Re: DISCUSS: Sorted MapState API

2019-05-24 Thread Jan Lukavský
Hi, absolutely +1 to add this to the model, but does this imply that MapState can be dropped (or backed by this)? It can have a different insert or delete time complexity (O(1) instead of O(log n)). Jan -- Original e-mail -- From: Aljoscha Krettek To: dev@beam.apache.org Date: 24.

Re: @RequireTimeSortedInput design draft

2019-05-27 Thread Jan Lukavský
hough it can be (relatively) cheaply done in streaming and batch, it's done in very different ways, and also that it's hard to do via composition). On Thu, May 23, 2019 at 4:10 PM Jan Lukavský wrote: Hi, I have written a very brief draft of how it might be possible to implement @RequireTim

Re: Definition of Unified model

2019-05-28 Thread Jan Lukavský
r to each record to sort by. I still think your proposal is very useful by the way. I'm merely pointing out that to solve the state-machine problem we probably need something more. Reuven On Thu, May 23, 2019 at 9:50 AM Jan Lukavský wrote: Hi, yes. It seems that ordering by use

Re: Definition of Unified model

2019-05-28 Thread Jan Lukavský
eam provides this). It also gets awkward with Flatten - the sequence number is no longer enough, you must also encode which side of the flatten each element came from. On Tue, May 28, 2019 at 3:18 AM Jan Lukavský <je...@seznam.cz> wrote: As I understood it, Kenn was support

Re: Definition of Unified model

2019-05-29 Thread Jan Lukavský
ime, which makes them useless for comparison. On 5/29/19 2:44 PM, Robert Bradshaw wrote: On Tue, May 28, 2019 at 12:18 PM Jan Lukavský wrote: As I understood it, Kenn was supporting the idea that sequence metadata is preferable over FIFO. I was trying to point out, that it even should provide the s

Re: Definition of Unified model

2019-05-30 Thread Jan Lukavský
lated to what limits parallelism in streaming sources and why batch sources are generally better parallelised. Jan On 5/30/19 1:35 PM, Reuven Lax wrote: Files can grow (depending on the filesystem), and tailing growing files is a valid use case. On Wed, May 29, 2019 at 3:23 PM Jan Lukavský <

Re: @RequireTimeSortedInput design draft

2019-06-06 Thread Jan Lukavský
ly, though it can be (relatively) cheaply done in streaming and batch, it's done in very different ways, and also that it's hard to do via composition). On Thu, May 23, 2019 at 4:10 PM Jan Lukavský wrote: Hi, I have written a very brief draft of how it might be possible to implement @RequireT

Re: @RequireTimeSortedInput design draft

2019-06-06 Thread Jan Lukavský
in for now, but that is a compromise. /4) more tests (for batch and validatesRunner) are needed/ I just posted a question on the best way to make use of the @ValidateRunner annotation on this list, sounds like it might be useful to you as well :-) On Thu, 6 Jun 2019 at 23:03, Jan Lukavský

Re: I'm thinking about new features, what do you think?

2019-06-07 Thread Jan Lukavský
Hi, that sounds interesting, but it seems to be computationally intensive and might not be well scalable, if I understand it correctly. It looks like it needs a transitive closure, am I right?  Jan On 6/7/19 11:17 AM, i.am.moai wrote: Hello everyone, nice to meet you I am Naoki Hyu(日宇尚記).

Re: @RequireTimeSortedInput design draft

2019-06-10 Thread Jan Lukavský
of the class in other places. Cheers Reza On Fri, 7 Jun 2019 at 14:50, Jan Lukavský <je...@seznam.cz> wrote: Hi Reza, interesting suggestions, thanks. When you mentioned join, I recalled an older issue (which apparently was not yet transferred to Beam's JI

DirectRunner timers are not strictly time ordered

2019-06-10 Thread Jan Lukavský
Hi, I have come across issue [1], and I'm not sure how to solve it in the most elegant way. Any suggestions? Thanks, Jan [1] https://issues.apache.org/jira/browse/BEAM-7520

Re: DirectRunner timers are not strictly time ordered

2019-06-10 Thread Jan Lukavský
indow), which results in state being cleared before it gets flushed, which means data loss. Jan On 6/10/19 5:08 PM, Reuven Lax wrote: Do you mean for a single key or across keys? On Mon, Jun 10, 2019, 5:11 AM Jan Lukavský <je...@seznam.cz> wrote: Hi, I have come a

Re: DirectRunner timers are not strictly time ordered

2019-06-10 Thread Jan Lukavský
ode I have if anyone is interested. Jan On 6/10/19 5:32 PM, Lukasz Cwik wrote: We hit an instance of this problem before and solved it by rescheduling the GC timer again if there was a conflicting timer that was also meant to fire. On Mon, Jun 10, 2019 at 8:17 AM Jan Lukavský <mailto:je...@seznam

Re: DirectRunner timers are not strictly time ordered

2019-06-10 Thread Jan Lukavský
imers? Because I think you will need to do the former. On Mon, Jun 10, 2019 at 8:41 AM Jan Lukavský <je...@seznam.cz> wrote: Hm, that would probably work, thanks! But, should the timers behave like that? I'm trying to fix this by introducing a sequence of wate

Re: DirectRunner timers are not strictly time ordered

2019-06-10 Thread Jan Lukavský
rying to build support for time sorted input on top of the Beam model for timers? Because I think you will need to do the former. On Mon, Jun 10, 2019 at 8:41 AM Jan Lukavský <je...@seznam.cz> wrote: Hm, that would probably work, thanks! But, should th

SparkRunner Combine.perKey performance

2019-06-13 Thread Jan Lukavský
Hi, I have hit a performance issue with the Spark runner that seems to be related to its current Combine.perKey implementation. I'll try to summarize what I have found in the code: - Combine.perKey uses Spark's combineByKey primitive, which is pretty similar to the definition of CombineFn - it

Re: SparkRunner Combine.perKey performance

2019-06-13 Thread Jan Lukavský
to avoid fusion breaks). On Thu, Jun 13, 2019 at 3:20 PM Jan Lukavský <je...@seznam.cz> wrote: Hi, I have hit a performance issue with the Spark runner that seems to be related to its current Combine.perKey implementation. I'll try to summarize what I have

Re: SparkRunner Combine.perKey performance

2019-06-13 Thread Jan Lukavský
the code? On Thu, Jun 13, 2019 at 3:56 PM Jan Lukavský <je...@seznam.cz> wrote: Hi Robert, there is a comment around that which states that the current solution should be more efficient. I'd say that (for non-merging windows) it would be best to first exp

Re: SparkRunner Combine.perKey performance

2019-06-13 Thread Jan Lukavský
On 6/13/19 6:10 PM, Robert Bradshaw wrote: On Thu, Jun 13, 2019 at 5:28 PM Jan Lukavský <je...@seznam.cz> wrote: On 6/13/19 4:31 PM, Robert Bradshaw wrote: The comment fails to take into account the asymmetry between calling addInput vs. mergeAccumulators. It als

Re: SparkRunner Combine.perKey performance

2019-06-14 Thread Jan Lukavský
Hi Robert, thanks for the discussion. I will create a JIRA with summary of this. Some comments inline. Jan On 6/14/19 10:49 AM, Robert Bradshaw wrote: On Thu, Jun 13, 2019 at 8:43 PM Jan Lukavský wrote: On 6/13/19 6:10 PM, Robert Bradshaw wrote: On Thu, Jun 13, 2019 at 5:28 PM Jan

Re: SparkRunner Combine.perKey performance

2019-06-14 Thread Jan Lukavský
ging && not many windows per element, because otherwise it makes sense to group the windows although it is non-merging windowing (sliding with small slide step). Otherwise we would explode the data too much. On 6/14/19 12:19 PM, Robert Bradshaw wrote: On Fri, Jun 14, 2019 at 12:10 PM

Re: SparkRunner Combine.perKey performance

2019-06-17 Thread Jan Lukavský
On 6/14/19 1:42 PM, Robert Bradshaw wrote: On Fri, Jun 14, 2019 at 1:02 PM Jan Lukavský wrote: > Interesting. However, there we should never need to sort the windows of the input, only the set of live windows (of which there may be any number regardless of whether WindowFn does singleton assignme

Re: DirectRunner timers are not strictly time ordered

2019-06-20 Thread Jan Lukavský
e there are only two options - either two panes can come with isLast flag (both end-of-window and late), or it might be possible that no pane will be marked with this flag (because no late pane is fired). Jan [1] https://github.com/apache/beam/pull/8815 On 6/10/19 6:26 PM, Jan Lukavský wrote:

Re: DirectRunner timers are not strictly time ordered

2019-06-20 Thread Jan Lukavský
ink? Jan On 6/20/19 5:29 PM, Reuven Lax wrote: On Thu, Jun 20, 2019 at 3:08 PM Jan Lukavský <je...@seznam.cz> wrote: Hi, this problem seems to be harder than I thought. I have somewhat working code in [1], but some tests are still failing (now tests f

Re: DirectRunner timers are not strictly time ordered

2019-06-20 Thread Jan Lukavský
o set more timers, let's suppose it sets timer for T2 - the second instance of timer A (set for T2) will fire *after* timer B (set for T3), breaking the time invariant Jan On 6/20/19 8:43 PM, Reuven Lax wrote: On Thu, Jun 20, 2019 at 8:03 PM Jan Lukavský <je...@seznam.cz> w

Re: DirectRunner timers are not strictly time ordered

2019-06-20 Thread Jan Lukavský
On 6/20/19 9:30 PM, Reuven Lax wrote: On Thu, Jun 20, 2019 at 8:54 PM Jan Lukavský <je...@seznam.cz> wrote: > But that is exactly how time advances. Watermarks often don't move smoothly, as a single old element can hold up the watermark. When that el

Re: DirectRunner timers are not strictly time ordered

2019-06-20 Thread Jan Lukavský
gle.com> wrote: On Thu, Jun 20, 2019 at 9:35 PM Jan Lukavský <je...@seznam.cz> wrote: On 6/20/19 9:30 PM, Reuven Lax wrote: On Thu, Jun 20, 2019 at 8:54 PM Jan Lukavský <je...@seznam.cz> wrote:

Re: Looping timer blog

2019-06-21 Thread Jan Lukavský
Hi Reza, great presentation at the Beam Summit. I have had a few posts here on the list during the last few weeks, some of which might actually be related to both looping timers and validity windows. But maybe you will be able to see a different approach than I do, so questions: a) because of

Re: Looping timer blog

2019-06-25 Thread Jan Lukavský
Jan On 6/25/19 6:00 AM, Reza Rokni wrote: On Fri, 21 Jun 2019 at 18:02, Jan Lukavský <je...@seznam.cz> wrote: Hi Reza, great presentation at the Beam Summit. I have had a few posts here on the list during the last few weeks, some of which might actually be rela

Re: Looping timer blog

2019-06-25 Thread Jan Lukavský
On 6/25/19 1:43 PM, Reza Rokni wrote: On Tue, 25 Jun 2019 at 18:12, Jan Lukavský <je...@seznam.cz> wrote: > The TTL check would be in the same Timer rather than a separate Timer. The max value processed in each OnTimer call would be stored in valuestate and

[DISCUSS] Solving timer ordering on immutable bundles

2019-06-26 Thread Jan Lukavský
Hi, I have mentioned an issue I have come across [1] on several other threads, but it probably didn't attract the attention that it deserves. I will try to restate the problem here for clarity: - on runners that use the concept of bundles (the original issue mentions DirectRunner, but it wi

Re: [DISCUSS] Solving timer ordering on immutable bundles

2019-06-26 Thread Jan Lukavský
T2 and modifies T3, these would take effect (locally, the runner may not even know about T2) before T3 was processed. On Wed, Jun 26, 2019 at 11:13 AM Jan Lukavský wrote: Hi, I have mentioned an issue I have come across [1] on several other threads, but it probably didn't attract the atte

Re: [DISCUSS] Solving timer ordering on immutable bundles

2019-06-26 Thread Jan Lukavský
unner. On Wed, Jun 26, 2019 at 1:02 PM Jan Lukavský wrote: I think that this approach breaks the assumption that bundles are executed as immutable pieces of work. This way, runners would have to update the runner while executing it. It is another possible option, but seems to have issues of its own

Re: [DISCUSS] Solving timer ordering on immutable bundles

2019-06-26 Thread Jan Lukavský
e fired in order, and if any timers are set to fire them immediately before processing later timers. In other words, if T1 sets T2 and modifies T3, these would take effect (locally, the runner may not even know about T2) before T3 was processed. On Wed, Jun 26, 2019 at 11:13 AM Jan Lukavský wro
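
Note: the ordering rule quoted above (fire timers in order, and let a timer set during firing take effect before any later timer is processed) can be sketched with a priority queue. This is a toy model under stated assumptions, not DirectRunner code: timers are bare timestamps, and the "user callback" that sets a new timer at 2 when the timer at 1 fires is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

class TimerOrderSketch {
  // Fires every timer with timestamp <= watermark in timestamp order. A timer
  // set while firing goes into the same queue, so it fires before later timers.
  static List<Long> fireInOrder(List<Long> initialTimers, long watermark) {
    PriorityQueue<Long> pending = new PriorityQueue<>(initialTimers);
    List<Long> fired = new ArrayList<>();
    while (!pending.isEmpty() && pending.peek() <= watermark) {
      long timestamp = pending.poll();
      fired.add(timestamp);
      // Hypothetical user callback: the timer at 1 sets a new timer at 2.
      if (timestamp == 1L) {
        pending.add(2L);
      }
    }
    return fired;
  }

  public static void main(String[] args) {
    // The watermark hops straight to 3, making timers 1 and 3 eligible at once.
    System.out.println(fireInOrder(Arrays.asList(1L, 3L), 3L)); // [1, 2, 3]
  }
}
```

If the eligible timers were instead collected up front into an immutable batch, the timer set for 2 would only be seen after 3 had fired, which is the inversion the thread describes.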

Re: [Current spark runner] Combine globally translation is risky and not very performant

2019-06-27 Thread Jan Lukavský
Hi Etienne, I saw that too while working on solving [1]. It seems a little weird and I was a little tempted to change it to something roughly equivalent to Combine.perKey with a single key. But, actually the Combine.globally should be rather small, right? There will be single value for each wi

Re: Looping timer blog

2019-06-27 Thread Jan Lukavský
m and batch) and SparkRunner (batch)). [1] https://github.com/apache/beam/pull/8774 On 6/27/19 6:16 AM, Reza Rokni wrote: On Tue, 25 Jun 2019 at 21:20, Jan Lukavský <je...@seznam.cz> wrote: On 6/25/19 1:43 PM, Reza Rokni wrote: On Tue, 25 Jun 2019 at 18:12, Jan Luka

Re: [DISCUSS] Solving timer ordering on immutable bundles

2019-06-27 Thread Jan Lukavský
>> >> ordering of timers by requiring timers to be fired in order, and if >> >> any timers are set to fire them immediately before processing later >> >> timers. In other words, if T1 sets T2 and modifies T3, these

Re: [DISCUSS] Solving timer ordering on immutable bundles

2019-06-27 Thread Jan Lukavský
timers are set to fire them immediately before processing later >> >> timers. In other words, if T1 sets T2 and modifies T3, these would >> >> take effect (locally, the runner may not even know about T2) before

Re: [DISCUSS] Solving timer ordering on immutable bundles

2019-06-27 Thread Jan Lukavský
sed, the "timer watermark" is on hold). On 6/28/19 12:40 AM, Jan Lukavský wrote: At least the implementation in DirectRunner fires timers according to the input watermark. Holding the timer up to the output watermark causes deadlocks, because timers fired at time T might clear watermark hold

Re: [DISCUSS] Solving timer ordering on immutable bundles

2019-06-28 Thread Jan Lukavský
we would still be able to fire multiple timers (at most one per key) and still have good performance even when the input watermark makes a "hop"? On Thu, Jun 27, 2019 at 3:43 PM Jan Lukavský <je...@seznam.cz> wrote: It would be possible to ha

[DISCUSS] Thoughts on stateful DoFns in merging windows

2019-06-28 Thread Jan Lukavský
Hi, during my implementation of @RequiresTimeSortedInput I found out, that current runners do not support stateful DoFns on merging windows [1]. The question is why is that? One obvious reason seems to be, that current definition of StateSpec doesn't define a state merge function, which is ne

Re: [DISCUSS] Thoughts on stateful DoFns in merging windows

2019-07-01 Thread Jan Lukavský
ing timers is a bit more awkward. I now tend to think that we're better off providing an onMerge function and let the user handle this. On Fri, Jun 28, 2019, 11:06 AM Jan Lukavský <je...@seznam.cz> wrote: Hi, during my implementation of @RequiresTimeSortedInp

Re: [Current spark runner] Combine globally translation is risky and not very performant

2019-07-01 Thread Jan Lukavský
(maybeAccumulated.get()); call implementation. So potentially more than one value per window. For the new spark runner, what I'm using is native combine that all happens at the dataset (equivalent of rdd to simplify) side, so it is all in parallel. Etienne Le jeudi 27 juin 2019 à 15:13 +0200, Jan Lukav

Re: [DISCUSS] Thoughts on stateful DoFns in merging windows

2019-07-02 Thread Jan Lukavský
be wrong. I'm leaning to the conclusion that we're much better off providing an onMerge callback, and letting the user explicitly handle the merging there. Reuven On Mon, Jul 1, 2019 at 1:04 AM Jan Lukavský <je...@seznam.cz> wrote: What are the issue

Re: [ANNOUNCE] New committer: Jan Lukavský

2019-08-01 Thread Jan Lukavský
> > Hi, > > Please join me and the rest of the Beam PMC in welcoming a new > committer: Jan Lukavský. >

Re: [ANNOUNCE] New committer: Rui Wang

2019-08-07 Thread Jan Lukavský
Congrats Rui! On 8/7/19 7:01 AM, Connell O'Callaghan wrote: Well done Rui!!! On Tue, Aug 6, 2019 at 15:41 Chamikara Jayalath wrote: Congrats Rui. On Tue, Aug 6, 2019 at 2:00 PM Melissa Pashniak <meliss...@google.com> wrote: Congrats

Re: [ANNOUNCE] New committer: Kyle Weaver

2019-08-07 Thread Jan Lukavský
Congrats Kyle! On 8/7/19 7:00 AM, Connell O'Callaghan wrote: Well done congratulations Kyle!!! On Tue, Aug 6, 2019 at 21:58 Thomas Weise wrote: Congrats! On Tue, Aug 6, 2019, 7:24 PM Reza Rokni <r...@google.com> wrote: Congratz! On W

Re: Inconsistent Results with GroupIntoBatches PTransform

2019-08-09 Thread Jan Lukavský
Hi Rahul, what version of Beam are you using? There was a bug [1], which was fixed in 2.14.0. This bug could cause what you observe. Jan [1] https://issues.apache.org/jira/browse/BEAM-7269 On 8/9/19 10:35 AM, rahul patwari wrote: Hi Robert, When PCollection is created using "Create.of(lis

Re: Inconsistent Results with GroupIntoBatches PTransform

2019-08-09 Thread Jan Lukavský
e inconsistencies. Does BEAM-7269 affect all the runners? Thanks, Rahul On Fri, Aug 9, 2019 at 2:15 PM Jan Lukavský <je...@seznam.cz> wrote: Hi Rahul, what version of Beam are you using? There was a bug [1], which was fixed in 2.14.0. This bug could cause what yo

StateNamespaces.global() != StateNamespaces.window(GlobalWindowCoder, GlobalWindow)

2019-08-12 Thread Jan Lukavský
Hi, I noticed, that StateNamespaces.global() generates a different stringKey than StateNamespaces.window(GlobalWindowCoder, GlobalWindow). In the first case, the stringKey will be simply '/',  in the other it will be '//'. That has some other implications, like that StateNamespaces.fromString

[FLINK-12653] and system state

2019-08-12 Thread Jan Lukavský
Hi, I have come across an issue that is very likely caused by [1]. The issue is that Beam's state is (generally) created lazily, after an element is received (as Max described in Flink's JIRA). Max also created a workaround [2], but that seems to work for user state only (i.e. state that has

Re: Can not test Timer with processing time domain

2019-08-12 Thread Jan Lukavský
Hi, if I understand it correctly, the issue there was that OnTimerContext.timestamp() is currently used for two things: a) it tells the caller what timestamp it actually is b) it is used as the event element timestamp when an element is output from the OnTimer method That implies, that it canno

Re: [FLINK-12653] and system state

2019-08-12 Thread Jan Lukavský
eam/commit/1360fb6bd35192d443f97068c7ba2155f79e8802 On 8/12/19 4:00 PM, Jan Lukavský wrote: Hi, I have come across an issue that is very likely caused by [1]. The issue is that Beam's state is (generally) created lazily, after an element is received (as Max described in the Flink's JIRA

Re: [FLINK-12653] and system state

2019-08-13 Thread Jan Lukavský
t the single state I introduced. If this would be an issue for long term I think it would require some other solution. So - I will register the state(s) I have created and test that on Flink 1.9 when I have a little spare time. Will decide what to do next, ok? Jan Cheers, Max On 12.08.19 19:58

Re: StateNamespaces.global() != StateNamespaces.window(GlobalWindowCoder, GlobalWindow)

2019-08-16 Thread Jan Lukavský
nt, fetch '//'. Use '/' for storing. Any other ideas?  Jan On 8/16/19 12:14 PM, Maximilian Michels wrote: Hi Jan, I don't think this is intentional. It looks like an inconsistency which could result in unexpected behavior in UDFs, e.g. when storing state and timers. Wan

Re: StateNamespaces.global() != StateNamespaces.window(GlobalWindowCoder, GlobalWindow)

2019-08-17 Thread Jan Lukavský
" namespace is for storage that is per-key but not related to any window. So it is explicitly a separate domain from the global window. It is a concept just for runner implementors, not really part of the model and definitely not user facing. Kenn On Fri, Aug 16, 2019 at 4:22 AM Jan

[DISCUSS] Making consistent use of Optionals

2019-08-21 Thread Jan Lukavský
Hi, sorry if this discussion has already taken place, but I'd like to know others' opinions about how we use Optionals. The state in current master is as follows: $ git grep "import" | grep "java.util.Optional" | wc -l 85 $ git grep "import" | grep "Optional" | grep guava | wc -l 45 I'd like

Re: [DISCUSS] Making consistent use of Optionals

2019-08-21 Thread Jan Lukavský
Sorry, forgot to add a link to the Flink discussion [1]. [1] https://lists.apache.org/thread.html/f5f8ce92f94c9be6774340fbd7ae5e4afe07386b6765ad3cfb13aec0@%3Cdev.flink.apache.org%3E On 8/21/19 10:08 PM, Jan Lukavský wrote: Hi, sorry if this discussion has already taken place, but I'd li

Re: [ANNOUNCE] New committer: Valentyn Tymofieiev

2019-08-27 Thread Jan Lukavský
Congrats Valentyn! On 8/26/19 11:43 PM, Rui Wang wrote: Congratulations! -Rui On Mon, Aug 26, 2019 at 2:36 PM Hannah Jiang wrote: Congratulations Valentyn, well deserved! On Mon, Aug 26, 2019 at 2:34 PM Chamikara Jayalath mailto:chamik...@google

Re: Cassandra flaky on Jenkins?

2019-09-03 Thread Jan Lukavský
Hi Alex, I have seen it too. Jan On 9/3/19 2:19 PM, Alex Van Boxel wrote: Hi, am I the only one bumping into the flaky Cassandra tests on Jenkins? I'd like to get my PR approved but I can't get past the Cassandra error... * org.apache.beam.sdk.io.cassandra.CassandraIOTest.classMethod

Re: Multiple iterations after GroupByKey with SparkRunner

2019-09-27 Thread Jan Lukavský
imes. Jan On 9/27/19 9:27 AM, Jan Lukavský wrote: I pretty much think so, because that is how Spark works. The Iterable inside is really an Iterator, which cannot be iterated multiple times. Jan On 9/27/19 2:00 AM, Lukasz Cwik wrote: Jan, in Beam users expect to be able to iterate the
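
Note: the behavior described above (an Iterable that is really a one-shot Iterator) can be modeled in plain Java. This is a minimal sketch, not the Spark runner's code; `singlePass`, `count` and `materialize` are hypothetical helpers, and copying into a list is shown only as the obvious in-memory workaround, assuming the group fits in memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class SinglePassSketch {
  // Wraps an Iterator as an Iterable that always hands back the same iterator,
  // so a second loop finds it already exhausted.
  static <T> Iterable<T> singlePass(Iterator<T> source) {
    return () -> source;
  }

  // Counts elements by running one full iteration.
  static <T> int count(Iterable<T> values) {
    int n = 0;
    for (T ignored : values) {
      n++;
    }
    return n;
  }

  // Copies the elements into a list, which can be iterated any number of times.
  static <T> List<T> materialize(Iterable<T> values) {
    List<T> copy = new ArrayList<>();
    for (T value : values) {
      copy.add(value);
    }
    return copy;
  }

  public static void main(String[] args) {
    Iterable<String> grouped = singlePass(Arrays.asList("a", "b", "c").iterator());
    System.out.println(count(grouped)); // 3
    System.out.println(count(grouped)); // 0, the iterator is already consumed

    List<String> inMemory = materialize(singlePass(Arrays.asList("a", "b", "c").iterator()));
    System.out.println(count(inMemory)); // 3
    System.out.println(count(inMemory)); // 3
  }
}
```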

Re: Multiple iterations after GroupByKey with SparkRunner

2019-09-27 Thread Jan Lukavský
Reuven On Fri, Sep 27, 2019 at 3:04 AM Jan Lukavský <je...@seznam.cz> wrote: +dev <dev@beam.apache.org> Lukasz, why do you think that users expect to b

Re: Multiple iterations after GroupByKey with SparkRunner

2019-09-27 Thread Jan Lukavský
ot be interpreted as a value at all. Kenn On Fri, Sep 27, 2019 at 10:57 AM Jan Lukavský <je...@seznam.cz> wrote: I'd like to know the use-case. Why would you *need* to actually iterate the grouped elements twice? By definition the first iteration would have to extra

Re: Multiple iterations after GroupByKey with SparkRunner

2019-09-27 Thread Jan Lukavský
> But - does it imply that it is actually required O(n^2) I meant O(n) iterations, O(n^2) operations on elements. On 9/27/19 8:31 PM, Jan Lukavský wrote: Okay, the self-join example is understandable. But - does it imply that it is actually required O(n^2) iterations (maybe caching

Re: Multiple iterations after GroupByKey with SparkRunner

2019-09-27 Thread Jan Lukavský
notation on ParDo? That way the Spark runner can see this and switch to the in-memory version just for groupings consumed by those ParDos. Runners that already support reiteration can ignore this annotation, so it should be backwards compatible. Reuven On Fri, Sep 27, 2019 at 10:57 AM

Re: [ANNOUNCE] New committer: Alan Myrvold

2019-09-27 Thread Jan Lukavský
Congrats Alan! On 9/27/19 10:22 PM, Mark Liu wrote: Congratulations Alan!!! On Fri, Sep 27, 2019 at 12:55 PM Ning Kang wrote: Congrats Alan! On Fri, Sep 27, 2019 at 12:02 PM Ankur Goenka <goe...@google.com> wrote: Congratulations Alan!

Re: Multiple iterations after GroupByKey with SparkRunner

2019-09-27 Thread Jan Lukavský
rence in the programming API. Therefore I think it makes sense to specify the latter. The need to reiterate is a property of the downstream ParDo, so it should be specified there - not on the GBK. Reuven On Fri, Sep 27, 2019 at 12:05 PM Jan Lukavský <je...@seznam.cz>

Re: Multiple iterations after GroupByKey with SparkRunner

2019-09-28 Thread Jan Lukavský
to run the pipeline (as well as to optimize the pipeline). Reuven On Fri, Sep 27, 2019 at 2:35 PM Jan Lukavský <je...@seznam.cz> wrote: I'd suggest Stream instead of Iterator, it has the same semantics and a much better API. Still not sure what is wrong with letti

Re: Multiple iterations after GroupByKey with SparkRunner

2019-09-28 Thread Jan Lukavský
have different guarantees regarding various properties of stream processing - to mention one of the most obvious, sorting by timestamp in stateful processing, although supporting for that in core SDK probably would not be that much intrusive. Jan On 9/28/19 9:25 AM, Jan Lukavský wrote: I under

Re: Multiple iterations after GroupByKey with SparkRunner

2019-09-30 Thread Jan Lukavský
y a DSL as a substrate, but the DSL does not need to blindly mirror the semantics of the raw Beam API - at least in my opinion! Reuven On Sat, Sep 28, 2019 at 12:26 AM Jan Lukavský <je...@seznam.cz> wrote: I understand the concerns. Still, it looks a little like we want

Re: Multiple iterations after GroupByKey with SparkRunner

2019-10-01 Thread Jan Lukavský
e more powerful one for free. Kenn On Mon, Sep 30, 2019 at 2:02 AM Jan Lukavský <je...@seznam.cz> wrote: > The fact that the annotation on the ParDo "changes" the GroupByKey implementation is very specific to the Spark runner implementation. I don
