Re: DirectRunner timers are not strictly time ordered

2019-06-10 Thread Jan Lukavský
? Because I think you will need to do the former. On Mon, Jun 10, 2019 at 8:41 AM Jan Lukavský <mailto:je...@seznam.cz>> wrote: Hm, that would probably work, thanks! But, should the timers behave like that? I'm trying to fix tris by introducing a sequence of watermarks

Re: DirectRunner timers are not strictly time ordered

2019-06-10 Thread Jan Lukavský
), which results in state being cleared before it gets flushed, which means data loss.  Jan On 6/10/19 5:08 PM, Reuven Lax wrote: Do you mean for a single key or across keys? On Mon, Jun 10, 2019, 5:11 AM Jan Lukavský <mailto:je...@seznam.cz>> wrote: Hi, I have come across

DirectRunner timers are not strictly time ordered

2019-06-10 Thread Jan Lukavský
Hi, I have come across issue [1], where I'm not sure how to solve this in most elegant way. Any suggestions? Thanks,  Jan [1] https://issues.apache.org/jira/browse/BEAM-7520

Re: @RequireTimeSortedInput design draft

2019-06-10 Thread Jan Lukavský
the class in other places. Cheers Reza On Fri, 7 Jun 2019 at 14:50, Jan Lukavský <mailto:je...@seznam.cz>> wrote: Hi Reza, interesting suggestions, thanks. When you mentioned join, I recalled an older issue (which apparently was not yet transfered to Beam's JIRA)  [1].

Re: I'm thinking about new features, what do you think?

2019-06-07 Thread Jan Lukavský
Hi, that sounds interesting, but it seems to be computationally intensive and might not be well scalable, if I understand it correctly. It looks like it needs a transitive closure, am I right?  Jan On 6/7/19 11:17 AM, i.am.moai wrote: Hello everyone, nice to meet you I am Naoki Hyu(日宇尚記).

Re: @RequireTimeSortedInput design draft

2019-06-07 Thread Jan Lukavský
, but that is a compromise. /4) more tests (for batch and validatesRunner) are needed/ I just posted a question on the best way to make use of the @ValidateRunner annotation on this list, sounds like it might be useful to you as well :-) On Thu, 6 Jun 2019 at 23:03, Jan Lukavský <mailto

Re: @RequireTimeSortedInput design draft

2019-06-06 Thread Jan Lukavský
be (relatively) cheaply done in streaming and batch, it's done in very different ways, and also that it's hard to do via composition). On Thu, May 23, 2019 at 4:10 PM Jan Lukavský wrote: Hi, I have written a very brief draft of how it might be possible to implement @RequireTimeSortedInput discussed in [1

Re: Definition of Unified model

2019-05-30 Thread Jan Lukavský
parallelism in streaming sources and why batch sources are generally better parallelised. Jan On 5/30/19 1:35 PM, Reuven Lax wrote: Files can grow (depending on the filesystem), and tailing growing files is a valid use case. On Wed, May 29, 2019 at 3:23 PM Jan Lukavský <mailto:je...@seznam

Re: Definition of Unified model

2019-05-29 Thread Jan Lukavský
ime, which makes them useless for comparison. On 5/29/19 2:44 PM, Robert Bradshaw wrote: On Tue, May 28, 2019 at 12:18 PM Jan Lukavský wrote: As I understood it, Kenn was supporting the idea that sequence metadata is preferable over FIFO. I was trying to point out, that it even should provide the s

Re: Definition of Unified model

2019-05-28 Thread Jan Lukavský
eam provides this). It also gets awkward with Flatten - the sequence number is no longer enough, you must also encode which side of the flatten each element came from. On Tue, May 28, 2019 at 3:18 AM Jan Lukavský <mailto:je...@seznam.cz>> wrote: As I understood it, Kenn was support

Re: Definition of Unified model

2019-05-28 Thread Jan Lukavský
to sort by. I still think your proposal is very useful by the way. I'm merely pointing out that to solve the state-machine problem we probably need something more. Reuven On Thu, May 23, 2019 at 9:50 AM Jan Lukavský wrote: Hi, yes. It seems that ordering by user supplied UDF makes sens

Re: @RequireTimeSortedInput design draft

2019-05-27 Thread Jan Lukavský
be (relatively) cheaply done in streaming and batch, it's done in very different ways, and also that it's hard to do via composition). On Thu, May 23, 2019 at 4:10 PM Jan Lukavský wrote: Hi, I have written a very brief draft of how it might be possible to implement @RequireTimeSortedInput discussed

Re: DISCUSS: Sorted MapState API

2019-05-24 Thread Jan Lukavský
Hi, absolutely +1 to add this to the model, but does this imply that MapState can be dropped (or backed by this)? It can have different insert or delete time complexity (O(1)) instead of O(logn). Jan -- Původní e-mail -- Od: Aljoscha Krettek Komu: dev@beam.apache.org Datum: 24.

Re: Definition of Unified model

2019-05-23 Thread Jan Lukavský
chines you usually need some sort of FIFO ordering along with an ordered sources, such as Kafka, not timestamp ordering. > > Reuven > > On Thu, May 23, 2019 at 12:32 AM Jan Lukavský mailto:je...@seznam.cz)> wrote: >> >> Hi all, >> >> thanks everyone for this discussion.

Re: Definition of Unified model

2019-05-23 Thread Jan Lukavský
rdering. Especially given Beam's decision to have milliseconds timestamps this is possible, but even at microsecond or nanosecond precision this can happen at scale. To handle state machines you usually need some sort of FIFO ordering along with an ordered sources, such as Kafka, not timestamp ordering

@RequireTimeSortedInput design draft

2019-05-23 Thread Jan Lukavský
Hi, I have written a very brief draft of how it might be possible to implement @RequireTimeSortedInput discussed in [1]. I see the document [2] a starting point for a discussion. There are several open questions, which I believe can be resolved by this great community. :-) Jan [1]

Re: Definition of Unified model

2019-05-23 Thread Jan Lukavský
(and checkpoint) processed data and only emit it once a Flink checkpoint has completed. Cheers, Max On 21.05.19 16:49, Jan Lukavský wrote: Hi,  > Actually, I think it is a larger (open) question whether exactly once is guaranteed by the model or whether runners are allowed to relax that. I would th

Re: Jenkins not triggering test runs? Be careful with merges.

2019-05-22 Thread Jan Lukavský
Looks like something is wrong with triggers on github side. Triggers are not triggered on other projects, too. Jan -- Původní e-mail -- Od: Valentyn Tymofieiev Komu: dev Datum: 22. 5. 2019 21:23:19 Předmět: Jenkins not triggering test runs? Be careful with merges. " Is

Re: Definition of Unified model

2019-05-21 Thread Jan Lukavský
order to optimize. (That and, in practice, our annotation-style process methods don't lend themselves to easy composition.) I think it could work in specific cases though. More inline below. On Tue, May 21, 2019 at 11:38 AM Jan Lukavský wrote: Hi Robert, > Beam has an exactly-once model. If

Re: Definition of Unified model (WAS: Semantics of PCollection.isBounded)

2019-05-21 Thread Jan Lukavský
tually need sorting. Kenn On Mon, May 20, 2019 at 2:59 PM Jan Lukavský mailto:je...@seznam.cz>> wrote: Yes, the problem will arise probably mostly when you have not well distributed keys (or too few keys). I'm really not sure if a pure

Re: Definition of Unified model (WAS: Semantics of PCollection.isBounded)

2019-05-21 Thread Jan Lukavský
M, Robert Bradshaw wrote: On Mon, May 20, 2019 at 5:24 PM Jan Lukavský wrote: > I don't see batch vs. streaming as part of the model. One can have microbatch, or even a runner that alternates between different modes. Although I understand motivation of this statement, this project name is

Re: Definition of Unified model (WAS: Semantics of PCollection.isBounded)

2019-05-21 Thread Jan Lukavský
need sorting. Kenn On Mon, May 20, 2019 at 2:59 PM Jan Lukavský mailto:je...@seznam.cz>> wrote: Yes, the problem will arise probably mostly when you have not well distributed keys (or too few keys). I'm really not sure if a pure GBK with a trigger c

Re: Definition of Unified model (WAS: Semantics of PCollection.isBounded)

2019-05-21 Thread Jan Lukavský
is, only that it meets the spec. And other cases like user funnels are monotonic enough that you also don't actually need sorting. Kenn On Mon, May 20, 2019 at 2:59 PM Jan Lukavský <mailto:je...@seznam.cz>> wrote: Yes, the problem will arise probably mostly when you have not w

Re: Definition of Unified model (WAS: Semantics of PCollection.isBounded)

2019-05-20 Thread Jan Lukavský
/trigger that is missing that you feel would remove the need for you to use StatefulParDo? On Mon, May 20, 2019 at 12:54 PM Jan Lukavský <mailto:je...@seznam.cz>> wrote: Hi Lukasz, > Today, if you must have a strict order, you must guarantee that your StatefulParD

Re: Definition of Unified model (WAS: Semantics of PCollection.isBounded)

2019-05-20 Thread Jan Lukavský
cannot first read everything from distributed storage and then buffer it all into memory, just to read it again, but sorted. That will not work. And even if it would, it would be a terrible waste of resources. Jan On 5/20/19 8:39 PM, Lukasz Cwik wrote: On Mon, May 20, 2019 at 8:24 AM Jan Lu

Re: Definition of Unified model (WAS: Semantics of PCollection.isBounded)

2019-05-20 Thread Jan Lukavský
tch case (and do it for *all* batch stateful pardos without annotation), or introduce annotation, but then make the same guarantees for streaming case as well. Jan On 5/20/19 4:41 PM, Robert Bradshaw wrote: On Mon, May 20, 2019 at 1:19 PM Jan Lukavský wrote: Hi Robert, yes, I think you rephrased my po

Re: Definition of Unified model (WAS: Semantics of PCollection.isBounded)

2019-05-20 Thread Jan Lukavský
On 5/20/19 1:39 PM, Reuven Lax wrote: On Mon, May 20, 2019 at 4:19 AM Jan Lukavský <mailto:je...@seznam.cz>> wrote: Hi Robert, yes, I think you rephrased my point - although no *explicit* guarantees of ordering are given in either mode, there is *implicit*

Re: Definition of Unified model (WAS: Semantics of PCollection.isBounded)

2019-05-20 Thread Jan Lukavský
2:41 PM, Robert Bradshaw wrote: On Fri, May 17, 2019 at 4:48 PM Jan Lukavský wrote: Hi Reuven, How so? AFAIK stateful DoFns work just fine in batch runners. Stateful ParDo works in batch as far, as the logic inside the state works for absolutely unbounded out-of-orderness of elements. That ba

Re: Definition of Unified model (WAS: Semantics of PCollection.isBounded)

2019-05-17 Thread Jan Lukavský
a/src/main/java/org/apache/beam/runners/dataflow/BatchStatefulParDoOverrides.java#L64 On Thu, May 16, 2019 at 5:09 PM Jan Lukavský mailto:je...@seznam.cz>> wrote: Hi Max, answers inline. -- Původní e-mail -- Od: Maximilian Michels mailto:m...@apac

Re: Definition of Unified model (WAS: Semantics of PCollection.isBounded)

2019-05-17 Thread Jan Lukavský
p in the future (but you sometimes might) > > Those seem quite “poor”, but I think you can’t get better guarantees for general cases for the reasons mentioned above. Also, this is just of the top of my head and I might be wrong in my understanding of the Beam model. :-O > > Best, >

Re: Definition of Unified model (WAS: Semantics of PCollection.isBounded)

2019-05-16 Thread Jan Lukavský
p of my head and I might be wrong in my understanding of the Beam model. :-O > > Best, > Aljoscha > >> On 16. May 2019, at 13:53, Jan Lukavský wrote: >> >> Hi, >> >> this is starting to be really exciting. It seems to me that there is either some

Re: Definition of Unified model (WAS: Semantics of PCollection.isBounded)

2019-05-16 Thread Jan Lukavský
of the Beam model. :-O Best, Aljoscha On 16. May 2019, at 13:53, Jan Lukavský wrote: Hi, this is starting to be really exciting. It seems to me that there is either something wrong with my definition of "Unified model" or with how it is implemented inside (at least) Direct and Fli

Definition of Unified model (WAS: Semantics of PCollection.isBounded)

2019-05-16 Thread Jan Lukavský
ially global sorting. If you want to have sorting by key, you could do a GroupByKey and then sort the groups in memory. This only works, of course, if your groups are not too large. On 15. May 2019, at 21:01, Jan Lukavský wrote: Hmmm, looking into the code of FlinkRunner (and also by ob

Re: Semantics of PCollection.isBounded

2019-05-15 Thread Jan Lukavský
places where this might matter (which therefore enables some optimizations). The order would be relevant only in the stateful ParDo, I'd say. Jan On 5/15/19 8:34 PM, Jan Lukavský wrote: Just to clarify, I understand, that changing semantics of the PCollection.isBounded,  is probably impossible

Re: Semantics of PCollection.isBounded

2019-05-15 Thread Jan Lukavský
such a need. Jan On 5/15/19 5:20 PM, Jan Lukavský wrote: Hi, I have come across unexpected (at least for me) behavior of some apparent inconsistency of how a PCollection is processed in DirectRunner and what PCollection.isBounded signals. Let me explain:  - I have a stateful ParDo, which needs

Semantics of PCollection.isBounded

2019-05-15 Thread Jan Lukavský
Hi, I have come across unexpected (at least for me) behavior of some apparent inconsistency of how a PCollection is processed in DirectRunner and what PCollection.isBounded signals. Let me explain:  - I have a stateful ParDo, which needs to make sure that elements arrive in order - it

Re: Unexpected behavior of StateSpecs

2019-05-10 Thread Jan Lukavský
as a first pass for understanding the implications. On Fri, May 10, 2019 at 9:31 AM Jan Lukavský <mailto:je...@seznam.cz>> wrote: Hm, yes, the fix might be also in fixing hashCode and equals of SimpleStateTag, so that it doesn't hash and compare the StateSpec, but only the St

Re: Unexpected behavior of StateSpecs

2019-05-10 Thread Jan Lukavský
id and should only be using that id for comparisons/lookups. On Fri, May 10, 2019 at 1:07 AM Jan Lukavský mailto:je...@seznam.cz>> wrote: I'm not sure. Generally it affects any runner that uses HashMap to store StateSpec. Jan On 5/9/19 6:32 PM, Reuv

Re: Unexpected behavior of StateSpecs

2019-05-10 Thread Jan Lukavský
I'm not sure. Generally it affects any runner that uses HashMap to store StateSpec. Jan On 5/9/19 6:32 PM, Reuven Lax wrote: Is this specific to the DirectRunner, or does it affect other runners? On Thu, May 9, 2019 at 8:13 AM Jan Lukavský <mailto:je...@seznam.cz>> wrote:

Re: Unexpected behavior of StateSpecs

2019-05-10 Thread Jan Lukavský
but wasn't able to figure it out yet: https://lists.apache.org/thread.html/dae8b605a218532c085a0eea4e71338eae51922c26820f37b24875c0@%3Cdev.beam.apache.org%3E Regards, Anton *From: *Jan Lukavský mailto:je...@seznam.cz>> *Date: *Thu, May 9, 2019 at 8:13 AM *To: * mailto:dev@beam.apac

Re: Unexpected behavior of StateSpecs

2019-05-09 Thread Jan Lukavský
, 2019 at 7:42 AM Jan Lukavský <mailto:je...@seznam.cz>> wrote: Hi, I have spent several hour digging into strange issue with DirectRunner, that manifested as non-deterministic run of pipeline. The pipeline contains basically only single stateful ParDo, which adds

Unexpected behavior of StateSpecs

2019-05-09 Thread Jan Lukavský
Hi, I have spent several hour digging into strange issue with DirectRunner, that manifested as non-deterministic run of pipeline. The pipeline contains basically only single stateful ParDo, which adds elements into state and after some timeout flushes these elements into output. The issues

PubsubIO and projectId

2019-03-21 Thread Jan Lukavský
Hi, I have come across an issue using PubsubIO with flink runner. The problem is described at [1]. I also created PR for this: [2], but there are some doubts described in comment in the JIRA issue. Would someone have time to walk through it and/or provide some insights? Thanks,  Jan [1]

Inject URL into runner's ClassLoader

2019-02-12 Thread Jan Lukavský
Hi, I'm working on a project for which I need to supply URL from where runner should load classes (at least when it cannot find them locally). I'd say, that most runners already use URLClassLoader, so that should be definitely possible, but there seems to be no API for this (definitely not

Re: [DISCUSS] Structuring Java based DSLs

2018-12-12 Thread Jan Lukavský
a string).  To avoid that overhead, I would imagine that SDKs should keep SQL queries and wait for a later but shared processing (I don't know if Portability should handle SQL or if it could). -Rui On Tue, Dec 4, 2018 at 2:04 AM Jan Lukavský <mailto:je...@seznam.cz>> wrote:

Re: [DISCUSS] Structuring Java based DSLs

2018-12-04 Thread Jan Lukavský
win, but don't build any policy here or do big refactors right now. Kenn On Mon, Dec 3, 2018 at 9:31 AM Jan Lukavský <mailto:je...@seznam.cz>> wrote: Hi Robert, currently there is no actual proposal, I was just trying to gather feedback from the community. But my

Re: [DISCUSS] Structuring Java based DSLs

2018-12-03 Thread Jan Lukavský
of abstraction. This makes sense to me. (2) Nesting the directories (but not the gradle targets or module names?). Here I'm not so sure about the benefit, especially vs. the cost. On Sat, Dec 1, 2018 at 8:38 AM Jan Lukavský wrote: I think that the fact that SQL uses some other internal dependency should

Re: [DISCUSS] Structuring Java based DSLs

2018-11-30 Thread Jan Lukavský
relationship does not prohibit the layered approach to implementation that sounds like it makes sense. (As for merging Euphoria into core, my initial impression is that's probably a good idea, and something we should consider for 3.0 at the very least.) On Fri, Nov 30, 2018 at 11:06 PM Jan Lukavský wrote

Re: [DISCUSS] Structuring Java based DSLs

2018-11-30 Thread Jan Lukavský
'll be probably adding sketching extension dependency soon. > > D. > > On Fri, Nov 30, 2018 at 7:08 PM Jan Lukavský mailto:je...@seznam.cz)> wrote: >> >> Hi Anton, >> reactions inline. >> >> -- Původní e-mail -- >> Od: Anton Kedi

Re: [DISCUSS] Structuring Java based DSLs

2018-11-30 Thread Jan Lukavský
ll be probably adding sketching extension dependency soon. > > D. > > On Fri, Nov 30, 2018 at 7:08 PM Jan Lukavský wrote: >> >> Hi Anton, >> reactions inline. >> >> -- Původní e-mail -- >> Od: Anton Kedin >> Komu: dev@beam.apach

Re: [DISCUSS] Structuring Java based DSLs

2018-11-30 Thread Jan Lukavský
ationRel.java#L179) On Fri, Nov 30, 2018 at 6:29 AM Jan Lukavský mailto:je...@seznam.cz)> wrote: "Hi community, I'm part of Euphoria DSL team, and on behalf of this team, I'd like to discuss possible development of Java based DSLs currently present in Beam. In my know

[DISCUSS] Structuring Java based DSLs

2018-11-30 Thread Jan Lukavský
Hi community, I'm part of Euphoria DSL team, and on behalf of this team, I'd like to discuss possible development of Java based DSLs currently present in Beam. In my knowledge, there are currently two DSLs based on Java SDK - Euphoria and SQL. These DSLs currently share only the SDK itself,

Re: (java) stream & beam?

2018-03-14 Thread Jan Lukavský
Hi all, the are actually some steps taken in this direction - a few emails already went to this channel about donation of Euphoria API (https://github.com/seznam/euphoria) to Apache Beam. SGA has already been signed, currently there is work in progress for porting all Euphoria's features to

Re: Euphoria Java 8 DSL - proposal

2017-12-18 Thread Jan Lukavský
, Euphoria seems like a programming model/SDK "close" to Beam more than a DSL on top of an existing Beam SDK. Am I wrong ? Regards JB On 12/18/2017 03:44 PM, Jan Lukavský wrote: Hi Ismael, basically we adopted the Beam's design regarding partitioning (https://github.com/seznam

Re: Euphoria Java 8 DSL - proposal

2017-12-18 Thread Jan Lukavský
Hi Ismael, basically we adopted the Beam's design regarding partitioning (https://github.com/seznam/euphoria/issues/160) and implemented the sorting manually (https://github.com/seznam/euphoria/issues/158). I'm not aware of the time model differences (Euphoria supports ingestion and event

<    1   2   3   4   5   6