Re: [DISCUSS] UnboundedSource and the KafkaIO.

2016-10-10 Thread Amit Sela
Thanks Max! I'll try to explain Spark's stateful operators and how/why I used them with UnboundedSource. Spark has two stateful operators: *updateStateByKey* and *mapWithState*. Since updateStateByKey is bound to output the (updated) state itself - the CheckpointMark in our case - we're left

Re: [DISCUSS] UnboundedSource and the KafkaIO.

2016-10-10 Thread Maximilian Michels
Just to add a comment from the Flink side and its UnboundedSourceWrapper. We experienced the only way to guarantee deterministic splitting of the source, was to generate the splits upon creation of the source and then checkpoint the assignment during runtime. When restoring from a checkpoint, the

Re: [PROPOSAL] Introduce review mailing list and provide update on open discussion

2016-10-10 Thread Jean-Baptiste Onofré
Thanks for the update Frances. I will ping my infra contact to move forward quickly. Regards JB On 10/10/2016 07:27 PM, Frances Perry wrote: Related to #3-5: Also, as we discussed earlier [1], there will be an additional level of tracking in jira for deeper proposal-style conversations to

Re: [DISCUSS] UnboundedSource and the KafkaIO.

2016-10-10 Thread Amit Sela
That's a good question Robert, and I did. First of all, an UnboundedSource is split into splits that implement a sort of "BoundedReadFromUnboundedSource", with Restrictions on time and (optional) number of records - this seems to fit nicely into the *SDF* language. Taking a look at the diagram

Re: Introducing a Redistribute transform

2016-10-10 Thread Eugene Kirpichov
Hi Amit, The transform, the way it's implemented, actually does several things at the same time and that's why it's tricky to document it. Redistribute.arbitrarily(): - Introduces a fusion barrier (in runners that have it), making sure that the runner can fully parallelize processing the output

Re: Introducing a Redistribute transform

2016-10-10 Thread Amit Sela
On Mon, Oct 10, 2016 at 9:21 PM Robert Bradshaw wrote: > On Sat, Oct 8, 2016 at 7:31 AM, Amit Sela wrote: > > > Hi Eugene, > > > > > > This is very interesting. > > > Let me see if I get this right, the "Redistribute" transformation > assigns

Re: [DISCUSS] UnboundedSource and the KafkaIO.

2016-10-10 Thread Jean-Baptiste Onofré
Hi Amit, thanks for the explanation. For 4, you are right, it's slightly different from DataXchange (related to the elements in the PCollection). I think storing the "starting point" for a reader makes sense. Regards JB On 10/10/2016 10:33 AM, Amit Sela wrote: Inline, thanks JB! On Mon,

Re: [DISCUSS] UnboundedSource and the KafkaIO.

2016-10-10 Thread Amit Sela
Inline, thanks JB! On Mon, Oct 10, 2016 at 9:01 AM Jean-Baptiste Onofré wrote: > Hi Amit, > > > > For 1., the runner is responsible of the checkpoint storage (associated > > with the source). It's the way for the runner to retry and know the > > failed bundles. > True, this