Re: makes bundle concept usable?

2017-11-16 Thread Romain Manni-Bucau
Ok, let me try to step back and summarize what we have today and what I miss: 1. we can handle chunking in beam through group in batch (or equivalent) but: > it is not built-in into the transforms (IO) and it is controlled from outside the transforms so no way for a transform to do it properly

Re: makes bundle concept usable?

2017-11-16 Thread Jean-Baptiste Onofré
Thanks for the explanation. Agree, we might talk about different things using the same wording. I'm also struggling to understand the use case (for a generic DoFn). Regards JB On 11/17/2017 07:40 AM, Eugene Kirpichov wrote: To avoid spending a lot of time pursuing a false path, I'd like to sa

Re: makes bundle concept usable?

2017-11-16 Thread Eugene Kirpichov
To avoid spending a lot of time pursuing a false path, I'd like to say straight up that SDF is definitely not going to help here, despite the fact that its API includes the term "checkpoint". In SDF, the "checkpoint" captures the state of processing within a single element. If you're applying an SD

Re: Sink API question

2017-11-16 Thread Eugene Kirpichov
Hi Chet, It sounds like you want the following pattern: - Write data in parallel - Once all parallel writes have completed, gather their results and issue a commit The sink API once used to do something like that, but it turned out that the only thing that mapped well onto that API was files; and

Re: makes bundle concept usable?

2017-11-16 Thread Romain Manni-Bucau
@Eugene: yes and the other alternative of Reuven too but it is still 1. relying on timers, 2. not really checkpointed In other words it seems all solutions are to create a chunk of size 1 and replayable to fake the lack of chunking in the framework. This always implies a chunk handling outside the

Re: makes bundle concept usable?

2017-11-16 Thread Jean-Baptiste Onofré
Sorry, not trigger, I meant tracker (a bit early in the morning for me) ;) The tracker in the SDF controls the restriction/offset/etc. So I think it could be used to group elements no ? Regards JB On 11/17/2017 07:25 AM, Eugene Kirpichov wrote: JB, not sure what you mean? SDFs and triggers a

Re: makes bundle concept usable?

2017-11-16 Thread Eugene Kirpichov
JB, not sure what you mean? SDFs and triggers are unrelated, and the post doesn't mention the word. Did you mean something else, e.g. restriction perhaps? Either way I don't think SDFs are the solution here; SDFs have to do with the ability to split the processing of *a single element* over multipl

Re: makes bundle concept usable?

2017-11-16 Thread Jean-Baptiste Onofré
It sounds like the "Trigger" in the Splittable DoFn, no ? https://beam.apache.org/blog/2017/08/16/splittable-do-fn.html Regards JB On 11/17/2017 06:56 AM, Romain Manni-Bucau wrote: it gives the fn/transform the ability to save a state - it can get back on "restart" / whatever unit we can use,

[VOTE] Release 2.2.0, release candidate #4

2017-11-16 Thread Reuven Lax
Hi everyone, Please review and vote on the release candidate #4 for the version 2.2.0, as follows: [ ] +1, Approve the release [ ] -1, Do not approve the release (please provide specific comments) The complete staging area is available for your review, which includes: * JIRA release notes

Re: makes bundle concept usable?

2017-11-16 Thread Romain Manni-Bucau
it gives the fn/transform the ability to save a state - it can get back on "restart" / whatever unit we can use, probably runner dependent? Without that you need to rewrite all IO usage with something like the previous pattern which makes the IO not self sufficient and kind of makes the entry cost

Re: makes bundle concept usable?

2017-11-16 Thread Reuven Lax
Romain, Can you define what you mean by checkpoint? What are the semantics, what does it accomplish? Reuven On Fri, Nov 17, 2017 at 1:40 PM, Romain Manni-Bucau wrote: > Yes, what I propose earlier was: > > I. checkpoint marker: > > @AnyBeamAnnotation > @CheckpointAfter > public void someHook(S

Re: makes bundle concept usable?

2017-11-16 Thread Romain Manni-Bucau
Yes, what I propose earlier was: I. checkpoint marker: @AnyBeamAnnotation @CheckpointAfter public void someHook(SomeContext ctx); II. pipeline.apply(ParDo.of(new MyFn()).withCheckpointAlgorithm(new CountingAlgo())) III. (I like this one less) // in the dofn @CheckpointTester public boolean sh

Re: Sink API question

2017-11-16 Thread Jean-Baptiste Onofré
Hi, if you take a look on existing IO, most of them doesn't use the Sink API: they implement a Sink using a DoFn. I think Algolia would be the same for the Write. What do you think about updating the index when we finalize a bundle ? NB: what's the Algolia "client/API" license ? Just to dou

Re: makes bundle concept usable?

2017-11-16 Thread Raghu Angadi
How would you define it (rough API is fine)?. Without more details, it is not easy to see wider applicability and feasibility in runners. On Thu, Nov 16, 2017 at 1:13 PM, Romain Manni-Bucau wrote: > This is a fair summary of the current state but also where beam can have a > very strong added va

Sink API question

2017-11-16 Thread Chet Aldrich
Hello all, I’m in the process of implementing a way to write data using a PTransform to Algolia (https://www.algolia.com/ ). However, in the process of doing so I’ve run into a bit of a snag, and was curious if someone here would be able to help me figure this out.

Re: makes bundle concept usable?

2017-11-16 Thread Romain Manni-Bucau
This is a fair summary of the current state but also where beam can have a very strong added value and make big data great and smooth. Instead of this replay feature isnt checkpointing willable? In particular with SDF no? Le 16 nov. 2017 19:50, "Raghu Angadi" a écrit : > Core issue here is tha

Re: makes bundle concept usable?

2017-11-16 Thread Raghu Angadi
Core issue here is that there is no explicit concept of 'checkpoint' in Beam (UnboundedSource has a method 'getCheckpointMark' but that refers to the checkoint on external source). Runners do checkpoint internally as implementation detail. Flink's checkpoint model is entirely different from Dataflo

[Proposal] IOIT test parameters validation

2017-11-16 Thread Łukasz Gajowy
Hi all! We are currently working on the IO IT "test harness" that will allow to run the IOITs on various runners, filesystems and with changing amount of data. It is described in a doc some of you have probably seen and put comments in the doc [1] (in the context of BEAM-3060 [2] task). Part of t

Re: Memory consumption & ValueProvider.RuntimeValueProvider#optionsMap

2017-11-16 Thread Lukasz Cwik
I filed https://issues.apache.org/jira/browse/BEAM-3202 On Thu, Nov 16, 2017 at 9:19 AM, Lukasz Cwik wrote: > That seems like a bug since its expected that the options id is always set > to something when the PipelineOptions object is created so when > serialized/deserialized the same options id

Re: Memory consumption & ValueProvider.RuntimeValueProvider#optionsMap

2017-11-16 Thread Lukasz Cwik
That seems like a bug since its expected that the options id is always set to something when the PipelineOptions object is created so when serialized/deserialized the same options id is always returned. Seems like a trivial fix in PipelineOptionsFactory to always just call getOptionsId at least on

Re: makes bundle concept usable?

2017-11-16 Thread Romain Manni-Bucau
2017-11-16 12:18 GMT+01:00 Reuven Lax : > On Wed, Nov 15, 2017 at 9:16 PM, Romain Manni-Bucau > wrote: > >> @Reuven: it looks like a good workaround >> @Ken: thks a lot for the link! >> >> @all: >> >> 1. do you think it is doable without windowing usage (to have >> something more reliable in term

[VOTE] Choose the "new" Spark runner

2017-11-16 Thread Jean-Baptiste Onofré
Hi guys, To illustrate the current discussion about Spark versions support, you can take a look on: -- Spark 1 & Spark 2 Support Branch https://github.com/jbonofre/beam/tree/BEAM-1920-SPARK2-MODULES This branch contains a Spark runner common module compatible with both Spark 1.x and 2.x. Fo

Memory consumption & ValueProvider.RuntimeValueProvider#optionsMap

2017-11-16 Thread Stas Levin
Hi all, I'm investigating a memory consumption issue (leak, so it seems) and was wondering if it could be related to the way runtime options are handled. In particular, upon deserializing a PipelineOptions object, ProxyInvocationHandler.Deserializer calls ValueProvider.RuntimeValueProvider.setRunt

Re: makes bundle concept usable?

2017-11-16 Thread Reuven Lax
On Wed, Nov 15, 2017 at 9:16 PM, Romain Manni-Bucau wrote: > @Reuven: it looks like a good workaround > @Ken: thks a lot for the link! > > @all: > > 1. do you think it is doable without windowing usage (to have > something more reliable in term of runner since it will depend on less > primitives?

Re: [VOTE] Release 2.2.0, release candidate #3

2017-11-16 Thread Reuven Lax
Retrying the whole step succeeded, so somehow this was an ephemeral error. On Thu, Nov 16, 2017 at 6:28 PM, Reuven Lax wrote: > I've fixed the Python issue - turns out my local path got messed up. > > However, mvn release:prepare is now failing with the following. I haven't > seen this failure b

Re: [VOTE] Release 2.2.0, release candidate #3

2017-11-16 Thread Reuven Lax
I've fixed the Python issue - turns out my local path got messed up. However, mvn release:prepare is now failing with the following. I haven't seen this failure before - does anyone know what might be causing it? [*ERROR*] Failed to execute goal org.apache.maven.plugins:maven-release-plugin:2.5.3

Jenkins build is back to stable : beam_Release_NightlySnapshot #595

2017-11-16 Thread Apache Jenkins Server
See