Re: [DISCUSS] [BEAM-438] Rename one of PTransform.apply or PInput.apply

2016-12-07 Thread Thomas Groh
+1; This is probably the best way to make sure users don't reverse the polarity of the PCollection flow. This also brings PInput.expand(), POutput.expand(), and PTransform.expand(PInput) into line - namely, for some composite thing, "represent yourself as some collection of primitives" (potentiall

Re: [DISCUSS] Graduation to a top-level project

2016-11-22 Thread Thomas Groh
+1 It's been a thrilling experience thus far, and I'm excited for the future. On Tue, Nov 22, 2016 at 11:07 AM, Aljoscha Krettek wrote: > +1 > > I'm quite enthusiastic about the growth of the community and the open > discussions! > > On Tue, 22 Nov 2016 at 19:51 Jason Kuster > wrote: > > > An

Re: Flink runner. Wrapper for DoFn

2016-11-18 Thread Thomas Groh
I'm also going to comment on why you would use Start/FinishBundle or Setup/Teardown. Generally, StartBundle/FinishBundle is for processing behavior and correctness, while Setup/Teardown are about managing persistent resources (like a connection in your case). To be specific, FinishBundle must be calle
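A minimal sketch of the distinction drawn in the snippet above, written in plain Java rather than against the Beam SDK. The names `FakeConnection`, `BufferingFn`, and `run` are hypothetical and exist only for illustration; the sketch assumes that a single function instance is reused across bundles, with the connection created once in setup and per-bundle buffered output flushed in finishBundle.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java model of the lifecycle hooks; this is NOT the Beam API. */
final class LifecycleSketch {

    /** Hypothetical stand-in for an expensive, long-lived resource such as a connection. */
    static final class FakeConnection {
        final List<String> written = new ArrayList<>();
        boolean open = true;

        void write(List<String> batch) {
            if (!open) {
                throw new IllegalStateException("connection closed");
            }
            written.addAll(batch);
        }

        void close() {
            open = false;
        }
    }

    /** Hypothetical function object with the four hooks named in the thread. */
    static final class BufferingFn {
        FakeConnection connection;                              // persistent resource: setup/teardown
        private final List<String> buffer = new ArrayList<>();  // per-bundle state: start/finishBundle

        void setup() { connection = new FakeConnection(); }

        void startBundle() { buffer.clear(); }

        void processElement(String element) { buffer.add(element); }

        void finishBundle() {
            // Flush everything buffered during this bundle before the bundle is considered done.
            connection.write(buffer);
            buffer.clear();
        }

        void teardown() { connection.close(); }
    }

    /** Runs all bundles through one reused instance and returns what reached the connection. */
    static List<String> run(List<List<String>> bundles) {
        BufferingFn fn = new BufferingFn();
        fn.setup();
        FakeConnection connection = fn.connection;
        try {
            for (List<String> bundle : bundles) {
                fn.startBundle();
                for (String element : bundle) {
                    fn.processElement(element);
                }
                fn.finishBundle();
            }
        } finally {
            fn.teardown();
        }
        return connection.written;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList(Arrays.asList("a", "b"), Arrays.asList("c"))));
    }
}
```

In this model the connection is opened once and closed once no matter how many bundles are processed, while the buffer is reset and flushed for every bundle.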

Re: [PROPOSAL] Merge apex-runner to master branch

2016-11-08 Thread Thomas Groh
+1. Sweet (and congratulations) On Tue, Nov 8, 2016 at 9:57 AM, Kenneth Knowles wrote: > +1, with enthusiasm. > > On Tue, Nov 8, 2016 at 9:16 AM, Davor Bonaci > wrote: > > > +1 > > > > I'd treat this as an official vote on this procedural matter. > > > > On Tue, Nov 8, 2016 at 6:55 AM, Mukul Ja

Re: Timer and Window behavior

2016-11-07 Thread Thomas Groh
You're right in that the Watermark Hold was not being updated in the presence of empty updates - https://github.com/apache/incubator-beam/pull/1300 fixes that issue. You should have seen the same outputs (as your windows are properties of event time) but the trigger firing would be delayed. On Mon

Re: The Availability of PipelineOptions

2016-10-25 Thread Thomas Groh
> > > > > > >> +1 > > > >> > > > >> Agree > > > >> > > > >> Regards > > > >> JB > > > >> > > > >> > > > >> On Oct 25

The Availability of PipelineOptions

2016-10-24 Thread Thomas Groh
Hey everyone, I've been working on a declaration of intent for how we want to use PipelineOptions and an API change to be consistent with that intent. This is generally part of the move to the Runner API, specifically the desire to be able to reuse Pipelines and the ability to choose runner at the

Re: High CPU usage in DirectRunner

2016-09-28 Thread Thomas Groh
That line is indeed the root of the problem you're seeing - the monitor was written with an implicit assumption that there will generally be work available, so it's expected to do work every time it runs. However, for a low-throughput pipeline this isn't the case, and it will mostly spin and do use

Re: Should UnboundedSource provide a split identifier ?

2016-09-13 Thread Thomas Groh
this, or assigning a running id, is basically the same, as long as > > generateInitialSplits implementation is deterministic (KafkaIO actually > > notes this). > > > > So what if partitions were added at runtime to one (or more) of the > topics > > I'm consuming from ? >

Re: Should UnboundedSource provide a split identifier ?

2016-09-12 Thread Thomas Groh
I'm not sure if I've understood what the problem is - from what I can tell it's about associating UnboundedSource splits with Checkpoints in order to get consistent behavior from the sources. If I'm wrong, the following isn't really relevant to your problem - it's about the expected behavior of a r

Re: KafkaIO Windowing Fn

2016-08-31 Thread Thomas Groh
rote: > > > Hi, > > could the reason for the second part of the trigger never firing be that > > there are never at least 100 elements per key. The trigger would only > fire > > if it saw 100 elements and with only 540 elements that seems unlikely if > > you hav

Re: KafkaIO Windowing Fn

2016-08-31 Thread Thomas Groh
t; > 100) > > > > > ))) > > > > > .discardingFiredPanes()) > > > > > .apply("GroupByTenant", GroupByKey.create()) > > > > > .apply(ParDo.of(new DoFn>,

Re: KafkaIO Windowing Fn

2016-08-26 Thread Thomas Groh
If you use the DirectRunner, do you observe the same behavior? On Thu, Aug 25, 2016 at 4:32 PM, Chawla,Sumit wrote: > Hi Thomas > > I am using FlinkRunner. Yes the second part of trigger never fires for me, > > Regards > Sumit Chawla > > > On Thu, Aug 25, 2016 at 4:

Re: KafkaIO Windowing Fn

2016-08-25 Thread Thomas Groh
st 500 records are available immediately, > > but the remaining 40 don't pass through. I was expecting 2nd to > trigger to help here. > > > > > > > > Regards > Sumit Chawla > > > On Thu, Aug 25, 2016 at 1:13 PM, Thomas Groh > wrote: > > >

Re: KafkaIO Windowing Fn

2016-08-25 Thread Thomas Groh
You can adjust the trigger in the windowing transform if your sink can handle being written to multiple times for the same window. For example, if the sink appends to the output when it receives new data in a window, you could add something like Window.into(...).withAllowedLateness(...).triggering

Re: [Proposal] Add waitToFinish(), cancel(), waitToRunning() to PipelineResult.

2016-07-21 Thread Thomas Groh
TestPipeline is probably the one runner that can be expected to block, as certainly JUnit tests and likely other tests will run the Pipeline, and succeed, even if the PipelineRunner throws an exception. Luckily, this can be added to TestPipeline.run(), which already has additional behavior associat

Adding DoFn Setup and Teardown methods

2016-06-28 Thread Thomas Groh
Hey Everyone: We've recently started to be permitted to reuse DoFn instances in Beam[1]. Beyond the efficiency gains from not having to deserialize new DoFn instances for every bundle, DoFn reuse also provides the ability to minimize expensive setup work done per-bundle, which hasn't formerly been

Re: Running examples with different runners

2016-06-27 Thread Thomas Groh
There are a few tests that still use side inputs to implement the ; the big ones are anything that asserts on a singleton; these are `PAssert.thatSingleton`, `PAssert.thatMap` and `PAssert.thatMultimap`. All of the other asserts are implemented via GroupByKey, and these are by far the most common f

Re: How to control the parallelism when run ParDo on PCollection?

2016-06-24 Thread Thomas Groh
We also have an active JIRA issue to support limiting parallelism on a per-step basis, BEAM-68 (https://issues.apache.org/jira/browse/BEAM-68). As Kenn noted, this is not equivalent to controls over bundling, which is entirely determined by the runner. On Fri, Jun 24, 2016 at 1:25 PM, Shen Li w

Pipeline Runner Renames

2016-06-21 Thread Thomas Groh
Hey everyone; This is a heads-up that all PipelineRunners in the Beam SDK have been renamed to remove the word "Pipeline" from the name at github HEAD. The only change required from users is that command line invocations that set the runner will have to remove the word "Pipeline" from the runner n

Re: Using GroupAlsoByWindowViaWindowSetDoFn for stream of input element

2016-06-17 Thread Thomas Groh
Generally, the above code snippet will work, producing (after trigger firing) an output Iterable containing all of the input elements. It may be notable that timers (and TimerInternals) are also per-key, so that interface must also be updated per element. By specifying the ReduceFn of the ReduceFn

Re: Testing and the Capability Matrix

2016-06-14 Thread Thomas Groh
5 PM, Jean-Baptiste Onofré wrote: > Hi Thomas, > > it looks good to me. > > Just curious: the proposed annotations will be directly in the Java SDK > Test jar right ? > > Thanks, > Regards > JB > > > On 06/11/2016 01:34 AM, Thomas Groh wrote: > >

Changes to the Default Pipeline Runner

2016-06-13 Thread Thomas Groh
Hello everyone: As of github pull request #446 ( https://github.com/apache/incubator-beam/pull/446), we're going to replace the DirectPipelineRunner in the Core SDK with the InProcessPipelineRunner as the default runner (finishing BEAM-22), after which we will rename the InProcessPipelineRunner to

Testing and the Capability Matrix

2016-06-10 Thread Thomas Groh
Hey Beamers! We have a lovely Capability Matrix ( http://beam.incubator.apache.org/capability-matrix/) which describes what runners can do, and what's in the model. However, right now we only have one way to specify that a test is useful to be executed in a runner, the RunnableOnService category.

Re: DoFn Reuse

2016-06-08 Thread Thomas Groh
In the case of failure, a DoFn instance will not be reused; however, in the case of failure either the inputs will be retried, or the pipeline will fail, allowing a newly deserialized instance of the DoFn to reprocess the inputs (which should produce the same result, meaning there is no data loss).

Re: DoFn Reuse

2016-06-08 Thread Thomas Groh
A Bundle is an arbitrary collection of elements. A PCollection is divided into bundles at the discretion of the runner. However, the bundles must partition the input PCollection; each element is in exactly one bundle, and each bundle is successfully committed exactly once in a successful pipeline.
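The partition property described above can be sketched in plain Java. This is an illustration only, not how any runner forms bundles: the fixed-size splitting in `toBundles` is an arbitrary assumption (a runner may bundle however it likes), and `BundleSketch`, `toBundles`, and `isPartition` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java illustration of "bundles partition the input"; this is NOT the Beam API. */
final class BundleSketch {

    /** Splits the elements into consecutive bundles of at most bundleSize elements. */
    static <T> List<List<T>> toBundles(List<T> elements, int bundleSize) {
        if (bundleSize <= 0) {
            throw new IllegalArgumentException("bundleSize must be positive");
        }
        List<List<T>> bundles = new ArrayList<>();
        for (int start = 0; start < elements.size(); start += bundleSize) {
            int end = Math.min(start + bundleSize, elements.size());
            bundles.add(new ArrayList<>(elements.subList(start, end)));
        }
        return bundles;
    }

    /** Returns true if every input element appears in exactly one bundle and nothing extra does. */
    static <T> boolean isPartition(List<T> elements, List<List<T>> bundles) {
        List<T> remaining = new ArrayList<>(elements);
        for (List<T> bundle : bundles) {
            for (T element : bundle) {
                // remove(Object) drops one occurrence; false means a duplicate or foreign element.
                if (!remaining.remove(element)) {
                    return false;
                }
            }
        }
        return remaining.isEmpty();
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3, 4, 5);
        List<List<Integer>> bundles = toBundles(input, 2);
        System.out.println(bundles + " partition=" + isPartition(input, bundles));
    }
}
```

Any other bundling strategy would satisfy the description equally well, as long as `isPartition` still holds for its output.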

Re: DoFn Reuse

2016-06-08 Thread Thomas Groh
led back the system will handle > it. > >> If that is not allowed we should really update the javadocs around it to > >> explain the pitfalls of doing this. > >> - Bobby > >> > >>On Wednesday, June 8, 2016 4:24 AM, Aljoscha Krettek < > >>

DoFn Reuse

2016-06-07 Thread Thomas Groh
Hey everyone; I'm starting to work on BEAM-38 ( https://issues.apache.org/jira/browse/BEAM-38), which enables an optimization for runners with many small bundles. BEAM-38 allows runners to reuse DoFn instances so long as that DoFn has not terminated abnormally. This replaces the previous requireme

Re: Serialization for org.apache.beam.sdk.util.WindowedValue$*

2016-06-02 Thread Thomas Groh
The Beam Model ensures that all PCollections have a Coder; the PCollection Coder is the standard way to materialize the elements of a PCollection[1][2]. Most SDK-provided classes that will need to be transferred across the wire have an associated coder, and some additional default datatypes have co

Re: [PROPOSAL] Writing More Expressive Beam Tests

2016-05-23 Thread Thomas Groh
> > > > > > > > etc. > > > > > > Is there a p.run() at the end? > > > > > > > Almost certainly. > > > > > > > > We could also allow TestSource to work with multiple input pipelines > > like > >

Re: A question about windowed values

2016-04-13 Thread Thomas Groh
y any PTransform (including Sources) must be in a window, potentially the GlobalWindow. On Wed, Apr 13, 2016 at 8:52 AM, Thomas Groh wrote: > Values should almost always be part of at least one window. WindowFns > should place all elements in at least one window, as values that are in no > w

Re: A question about windowed values

2016-04-13 Thread Thomas Groh
Values should almost always be part of at least one window. WindowFns should place all elements in at least one window, as values that are in no windows will be dropped when they reach a GroupByKey. Elements in no windows, for example those created by WindowedValue.valueInEmptyWindows(T) are gener
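One way to picture why an element in no windows is dropped at a grouping step is a toy model in which grouping happens per key and per window. The sketch below is plain Java and only an assumption-laden illustration, not Beam's GroupByKey implementation; `WindowedElement` and `groupByKeyAndWindow` are hypothetical names, and windows are represented as plain strings.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of per-key, per-window grouping; this is NOT the Beam API. */
final class EmptyWindowsSketch {

    /** Hypothetical element carrying a key, a value, and the windows it was assigned to. */
    static final class WindowedElement {
        final String key;
        final int value;
        final List<String> windows;

        WindowedElement(String key, int value, List<String> windows) {
            this.key = key;
            this.value = value;
            this.windows = windows;
        }
    }

    /** Groups values under "key@window"; an element with no windows contributes to no group. */
    static Map<String, List<Integer>> groupByKeyAndWindow(List<WindowedElement> input) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (WindowedElement element : input) {
            for (String window : element.windows) {
                grouped.computeIfAbsent(element.key + "@" + window, k -> new ArrayList<>())
                        .add(element.value);
            }
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<WindowedElement> input = new ArrayList<>();
        input.add(new WindowedElement("k", 1, Collections.singletonList("w1")));
        input.add(new WindowedElement("k", 2, Collections.<String>emptyList()));
        // Only the element assigned to w1 appears in the grouped output.
        System.out.println(groupByKeyAndWindow(input));
    }
}
```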

Re: [PROPOSAL] Writing More Expressive Beam Tests

2016-03-25 Thread Thomas Groh
Hey everyone; I'd still be happy to get feedback. I'm going to start working on this early next week. Thanks, Thomas On Mon, Mar 21, 2016 at 5:38 PM, Thomas Groh wrote: > Hey everyone, > > I've been working on a proposal to expand the capabilities of our testing >

[PROPOSAL] Writing More Expressive Beam Tests

2016-03-21 Thread Thomas Groh
Hey everyone, I've been working on a proposal to expand the capabilities of our testing API, mostly around writing deterministic tests for pipelines that have interesting triggering behavior, especially speculative and late triggers. I've shared a doc here