Re: [VOTE] Release version 0.1.0-incubating

2016-06-08 Thread Jean-Baptiste Onofré
+1 (binding)
- all files have incubating
- signatures check out (and KEYS there)
- disclaimer exists
- LICENSE and NOTICE good
- No unexpected binary in source
- All ASF licensed files have ASF headers
- source distribution available and content is good
Improvements for next release:
- As the

[VOTE] Release version 0.1.0-incubating

2016-06-08 Thread Davor Bonaci
Hi everyone, Here's the first vote for the first release of Apache Beam -- version 0.1.0-incubating! As a reminder, we aren't looking for any specific new functionality, but would like to release the existing code, get something into our users' hands, and test the processes. Previous discussions

Re: 0.1.0-incubating release

2016-06-08 Thread Davor Bonaci
The third release candidate is now available for everyone's review [1], which should incorporate all feedback so far. Please comment if there's additional feedback, as we are about to start the voting process. [1] https://repository.apache.org/content/repositories/orgapachebeam-1002 On

Re: 0.1.0-incubating release

2016-06-08 Thread P. Taylor Goetz
Thanks for the clarification, JB. In the projects I’ve been involved with, I’ve not seen that practice. As long as the resulting release ends up on dist.a.o I don’t think it’s a problem. -Taylor
> On Jun 8, 2016, at 12:49 AM, Jean-Baptiste Onofré wrote:
> Hi Taylor,

Re: 0.1.0-incubating release

2016-06-08 Thread Amit Sela
To Davor, JB, and anyone else helping with the release: thanks! This looks great.
On Wed, Jun 8, 2016 at 9:11 PM Amit Sela wrote:
> Regarding Dan's questions:
> 1. I'm not sure - it is built with spark-*_2.10 but I honestly don't know if this matters for the runner

Re: 0.1.0-incubating release

2016-06-08 Thread Amit Sela
Regarding Dan's questions:
1. I'm not sure - it is built with spark-*_2.10, but I honestly don't know if this matters for the runner itself; it could be nice to have in order to be more informative. In addition, this will change to Scala 2.11 with Spark 2.0, AFAIK.
2. This is to allow running

Re: DoFn Reuse

2016-06-08 Thread Ben Chambers
On Wed, Jun 8, 2016 at 10:29 AM Raghu Angadi wrote:
> On Wed, Jun 8, 2016 at 10:13 AM, Ben Chambers wrote:
> > - If failure occurs after finishBundle() but before the consumption is committed, then the bundle may be

Re: DoFn Reuse

2016-06-08 Thread Raghu Angadi
On Wed, Jun 8, 2016 at 10:15 AM, Dan Halperin wrote:
> > I thought finishBundle() exists simply as a best-effort indication from the runner to the user that some chunk of records has been processed.. not part of processing guarantees. Also the term "bundle"

Re: DoFn Reuse

2016-06-08 Thread Thomas Groh
In the case of failure, a DoFn instance will not be reused; instead, either the inputs will be retried or the pipeline will fail. Retrying allows a newly deserialized instance of the DoFn to reprocess the inputs (which should produce the same result, meaning there is no data
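To make the "newly deserialized instance" point concrete, here is a purely illustrative Java sketch (made-up names, not actual runner code): a retry that starts from the serialized form of the DoFn sees none of the in-memory state the failed attempt had accumulated, which is why reprocessing must be able to produce the same output.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical helper: mimics a runner obtaining a fresh copy of a user fn for a retry
// by round-tripping it through Java serialization.
class FreshCopies {
  @SuppressWarnings("unchecked")
  static <T extends Serializable> T freshCopy(T fn) throws IOException, ClassNotFoundException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(fn);
    }
    try (ObjectInputStream in =
        new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
      // The copy carries only the serialized fields; any in-memory state
      // accumulated by the failed attempt is gone.
      return (T) in.readObject();
    }
  }
}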

Re: DoFn Reuse

2016-06-08 Thread Thomas Groh
A Bundle is an arbitrary collection of elements. A PCollection is divided into bundles at the discretion of the runner. However, the bundles must partition the input PCollection; each element is in exactly one bundle, and each bundle is successfully committed exactly once in a successful pipeline.
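As a purely illustrative check of that partition property (made-up names, not Beam APIs): flattening the runner-chosen bundles must reproduce exactly the elements of the original collection, each appearing exactly once.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration of "bundles partition the input": every element lands in
// exactly one bundle, so flattening the bundles yields the original multiset.
class BundlePartitionCheck {
  static <T> boolean isPartition(List<T> collection, List<List<T>> bundles) {
    List<T> flattened = new ArrayList<>();
    for (List<T> bundle : bundles) {
      flattened.addAll(bundle);
    }
    return counts(collection).equals(counts(flattened));
  }

  private static <T> Map<T, Integer> counts(List<T> elements) {
    Map<T, Integer> result = new HashMap<>();
    for (T element : elements) {
      result.merge(element, 1, Integer::sum);
    }
    return result;
  }
}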

Re: DoFn Reuse

2016-06-08 Thread Raghu Angadi
On Wed, Jun 8, 2016 at 10:13 AM, Ben Chambers wrote:
> - If failure occurs after finishBundle() but before the consumption is committed, then the bundle may be reprocessed, which leads to duplicated calls to processElement() and finishBundle().
>
> - If

Re: DoFn Reuse

2016-06-08 Thread Robert Bradshaw
The unit of commit is the bundle. Consider a DoFn that does batching (e.g. to interact with some external service less frequently). Items may be buffered during process() but these buffered items must be processed and the results emitted in finishBundle(). If inputs are committed as being
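To illustrate the batching pattern Robert describes, here is a rough sketch assuming the 0.1.0-era DoFn API (processElement/finishBundle as overridden methods); the class name, batch size, and callExternalService helper are hypothetical, not part of the SDK.

import java.util.ArrayList;
import java.util.List;

import org.apache.beam.sdk.transforms.DoFn;

// Buffers elements in processElement() and flushes them in batches, with a final
// flush in finishBundle() so nothing buffered outlives the bundle.
class BatchingDoFn extends DoFn<String, String> {
  private static final int MAX_BATCH_SIZE = 100;  // illustrative threshold
  private final List<String> buffer = new ArrayList<>();

  @Override
  public void processElement(ProcessContext c) {
    buffer.add(c.element());
    if (buffer.size() >= MAX_BATCH_SIZE) {
      flush(c);
    }
  }

  @Override
  public void finishBundle(Context c) {
    // Must run before the bundle's inputs are committed; emits any remaining results.
    flush(c);
  }

  private void flush(Context c) {
    for (String result : callExternalService(buffer)) {  // hypothetical batched call
      c.output(result);
    }
    buffer.clear();
  }

  private List<String> callExternalService(List<String> batch) {
    return new ArrayList<>(batch);  // stand-in for a real external service
  }
}

If those buffered items were dropped before finishBundle() ran but the inputs were still committed, their results would be lost, which is exactly the hazard discussed in the messages below.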

Re: DoFn Reuse

2016-06-08 Thread Dan Halperin
On Wed, Jun 8, 2016 at 10:05 AM, Raghu Angadi wrote:
> Such data loss can still occur if the worker dies after finishBundle() returns, but before the consumption is committed.
If the runner is correctly implemented, there will not be data loss in this case -- the

Re: DoFn Reuse

2016-06-08 Thread Raghu Angadi
Such data loss can still occur if the worker dies after finishBundle() returns, but before the consumption is committed. I thought finishBundle() exists simply as a best-effort indication from the runner to the user that some chunk of records has been processed.. not part of processing guarantees. Also the

Re: DoFn Reuse

2016-06-08 Thread Thomas Groh
finishBundle() **must** be called before any input consumption is committed (i.e. marking inputs as completed, which includes committing any elements they produced). Doing otherwise can cause data loss, as the state of the DoFn is lost if a worker dies, but the input elements will never be
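A simplified, hypothetical sketch of that ordering constraint (toy interfaces, not runner code): finishBundle() completes, and its outputs are handed off, before the inputs are marked as consumed.

import java.util.List;

// Toy illustration of the commit ordering: finishBundle() completes before the
// bundle's inputs are committed, so a crash cannot strand buffered results.
class ToyBundleExecutor<InputT> {
  interface BundleProcessor<InputT> {
    void startBundle();
    void processElement(InputT element);
    void finishBundle();  // may emit results buffered during processElement()
  }

  interface Source<InputT> {
    List<InputT> readBundle();
    void commit(List<InputT> bundle);  // marks these inputs as consumed
  }

  void runOneBundle(Source<InputT> source, BundleProcessor<InputT> fn) {
    List<InputT> bundle = source.readBundle();
    fn.startBundle();
    for (InputT element : bundle) {
      fn.processElement(element);
    }
    // If the worker dies anywhere above, the uncommitted inputs are simply retried.
    fn.finishBundle();
    source.commit(bundle);
  }
}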

Re: DoFn Reuse

2016-06-08 Thread Aljoscha Krettek
Ahh, what we could do is artificially induce bundles using either count or processing time or both, just so that finishBundle() is called once in a while.
On Wed, 8 Jun 2016 at 17:12 Aljoscha Krettek wrote:
> Pretty sure, yes. The Iterable in a MapPartitionFunction should
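A hedged sketch of that idea (made-up names and thresholds, not the Flink runner's actual code): close the current bundle after N elements or after it has been open for T milliseconds, whichever comes first.

import java.util.concurrent.TimeUnit;

// Toy wrapper that artificially induces bundles by count or processing time so
// finishBundle() fires periodically even when elements arrive one at a time.
class CountOrTimeBundler<T> {
  interface BundleHooks<T> {
    void startBundle();
    void processElement(T element);
    void finishBundle();
  }

  private static final int MAX_ELEMENTS = 1000;  // illustrative
  private static final long MAX_AGE_MILLIS = TimeUnit.SECONDS.toMillis(1);  // illustrative

  private final BundleHooks<T> hooks;
  private int elementsInBundle = 0;
  private long bundleStartMillis = 0;

  CountOrTimeBundler(BundleHooks<T> hooks) {
    this.hooks = hooks;
  }

  void onElement(T element) {
    if (elementsInBundle == 0) {
      hooks.startBundle();
      bundleStartMillis = System.currentTimeMillis();
    }
    hooks.processElement(element);
    elementsInBundle++;
    boolean countExceeded = elementsInBundle >= MAX_ELEMENTS;
    boolean tooOld = System.currentTimeMillis() - bundleStartMillis >= MAX_AGE_MILLIS;
    if (countExceeded || tooOld) {
      hooks.finishBundle();
      elementsInBundle = 0;
    }
  }
}

A real implementation would also need a processing-time timer so an idle stream still gets its open bundle flushed; this sketch only checks the age when a new element arrives.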

Re: DoFn Reuse

2016-06-08 Thread Aljoscha Krettek
Pretty sure, yes. The Iterable in a MapPartitionFunction should give you all the values in a given partition. I checked again for streaming execution. We're doing the opposite right now: every element is a bundle in itself; startBundle()/finishBundle() are called for every element, which seems a