[jira] [Updated] (BEAM-1170) Streaming watermark should be easier to read
[ https://issues.apache.org/jira/browse/BEAM-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Chambers updated BEAM-1170: --- Component/s: runner-dataflow > Streaming watermark should be easier to read > > > Key: BEAM-1170 > URL: https://issues.apache.org/jira/browse/BEAM-1170 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow >Reporter: Joshua Litt >Priority: Minor > > Currently, the only way to get at the streaming watermarks is through > scraping counter names. However, the watermarks are useful for determining if > a streaming job is 'done,' i.e., watermarks at infinity. We should consider > either exposing the watermarks through a GetWatermarks API; another > alternative might be a WATERMARKS_AT_INFINITY job state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
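To make the "done means watermarks at infinity" idea concrete, here is a minimal sketch. No GetWatermarks API exists today (that is precisely what this issue proposes), so the `StreamingJob` handle and its `getWatermark` accessor below are hypothetical; plain `Instant.MAX` stands in for Beam's maximum sentinel timestamp (`BoundedWindow.TIMESTAMP_MAX_VALUE`).

```java
import java.time.Instant;

// Sketch only: the StreamingJob interface is hypothetical, illustrating
// what a GetWatermarks-style check could look like. Instant.MAX stands in
// for BoundedWindow.TIMESTAMP_MAX_VALUE ("infinity").
public class WatermarkDoneCheck {

    interface StreamingJob {
        // Hypothetical accessor for a stage's output watermark.
        Instant getWatermark(String stage);
    }

    // A streaming job is "done" once every stage's watermark is at infinity.
    static boolean watermarksAtInfinity(StreamingJob job, String... stages) {
        for (String stage : stages) {
            if (job.getWatermark(stage).isBefore(Instant.MAX)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        StreamingJob job = stage -> Instant.MAX;  // all watermarks advanced
        System.out.println(watermarksAtInfinity(job, "read", "write"));  // true
    }
}
```

A WATERMARKS_AT_INFINITY job state would fold this polling check into the runner itself rather than requiring callers to scrape counters.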
[jira] [Commented] (BEAM-1169) MetricsTest matchers should loosen expectations on physical values
[ https://issues.apache.org/jira/browse/BEAM-1169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15755377#comment-15755377 ] Ben Chambers commented on BEAM-1169: Alternatively, we may need the Runner to define how Metrics are matched, and the test just sets up the "correct" values. That would allow different runners/tests to impose reasonable restrictions. > MetricsTest matchers should loosen expectations on physical values > -- > > Key: BEAM-1169 > URL: https://issues.apache.org/jira/browse/BEAM-1169 > Project: Beam > Issue Type: Sub-task > Components: beam-model, sdk-java-core, sdk-py >Reporter: Ben Chambers > > We could use `atLeast(N)` rather than `equals(N)` for the attempted values, > but even that expectation may fail without the behavior being violated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (BEAM-1169) MetricsTest matchers should loosen expectations on physical values
Ben Chambers created BEAM-1169: -- Summary: MetricsTest matchers should loosen expectations on physical values Key: BEAM-1169 URL: https://issues.apache.org/jira/browse/BEAM-1169 Project: Beam Issue Type: Sub-task Reporter: Ben Chambers We could use `atLeast(N)` rather than `equals(N)` for the attempted values, but even that expectation may fail without the behavior being violated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
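The `atLeast(N)` vs `equals(N)` distinction can be sketched as below. The method names are illustrative, not the actual MetricsTest matchers: when bundles are retried, an "attempted" counter legitimately exceeds the committed value, so an exact-equality matcher fails on a perfectly healthy run while a lower-bound matcher does not.

```java
// Illustrative sketch (not the real MetricsTest code) of why attempted
// metric values should be matched with a lower bound rather than equality.
public class AttemptedMetricMatcher {
    // equals(N): too strict — fails whenever a bundle is retried.
    static boolean matchesExactly(long expected, long attempted) {
        return attempted == expected;
    }

    // atLeast(N): every attempt counts at least the committed work.
    static boolean matchesAtLeast(long expected, long attempted) {
        return attempted >= expected;
    }

    public static void main(String[] args) {
        long committed = 100;   // value the pipeline logically produced
        long attempted = 130;   // includes 30 elements re-counted on retry
        System.out.println(matchesExactly(committed, attempted));  // false
        System.out.println(matchesAtLeast(committed, attempted));  // true
    }
}
```

As the comment on this issue notes, even `atLeast(N)` is not airtight, which is why delegating matching to the runner may be the cleaner design.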
[jira] [Resolved] (BEAM-1104) WordCount: Metrics error in the DirectRunner
[ https://issues.apache.org/jira/browse/BEAM-1104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Chambers resolved BEAM-1104. Resolution: Fixed Fix Version/s: 0.4.0-incubating > WordCount: Metrics error in the DirectRunner > > > Key: BEAM-1104 > URL: https://issues.apache.org/jira/browse/BEAM-1104 > Project: Beam > Issue Type: Bug > Components: runner-direct >Reporter: Daniel Halperin >Assignee: Ben Chambers > Fix For: 0.4.0-incubating > > > I'm following the Beam quickstart to analyze the pom.xml for the examples > archetype in the DirectRunner: > Generate the project: > {code} > mvn archetype:generate \ > > -DarchetypeRepository=https://repository.apache.org/content/groups/snapshots > \ > -DarchetypeGroupId=org.apache.beam \ > -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \ > -DarchetypeVersion=LATEST \ > -DgroupId=org.example \ > -DartifactId=word-count-beam \ > -Dversion="0.1" \ > -Dpackage=org.apache.beam.examples \ > -DinteractiveMode=false > {code} > Count words in the pom.xml: > {code} > mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \ > -Dexec.args="--inputFile=pom.xml --output=direct/counts" -Pdirect-runner > {code} > The logs: > {code} > INFO] --- exec-maven-plugin:1.4.0:java (default-cli) @ word-count-beam --- > Dec 07, 2016 9:42:03 PM org.apache.beam.sdk.io.FileBasedSource > expandFilePattern > INFO: Matched 1 files for pattern pom.xml > Dec 07, 2016 9:42:03 PM org.apache.beam.sdk.metrics.MetricsEnvironment > getCurrentContainer > SEVERE: Unable to update metrics on the current thread. Most likely caused by > using metrics outside the managed work-execution thread. 
> Dec 07, 2016 9:42:03 PM org.apache.beam.sdk.io.Write$Bound$1 processElement > INFO: Initializing write operation > org.apache.beam.sdk.io.TextIO$TextSink$TextWriteOperation@26bbd1cf > Dec 07, 2016 9:42:04 PM org.apache.beam.sdk.io.Write$Bound$WriteBundles > processElement > INFO: Opening writer for write operation > org.apache.beam.sdk.io.TextIO$TextSink$TextWriteOperation@19371061 > Dec 07, 2016 9:42:04 PM org.apache.beam.sdk.io.Write$Bound$WriteBundles > processElement > INFO: Opening writer for write operation > org.apache.beam.sdk.io.TextIO$TextSink$TextWriteOperation@19371061 > Dec 07, 2016 9:42:04 PM org.apache.beam.sdk.io.Write$Bound$WriteBundles > processElement > INFO: Opening writer for write operation > org.apache.beam.sdk.io.TextIO$TextSink$TextWriteOperation@19371061 > Dec 07, 2016 9:42:04 PM org.apache.beam.sdk.io.Write$Bound$WriteBundles > processElement > INFO: Opening writer for write operation > org.apache.beam.sdk.io.TextIO$TextSink$TextWriteOperation@19371061 > Dec 07, 2016 9:42:04 PM org.apache.beam.sdk.io.Write$Bound$2 processElement > INFO: Finalizing write operation > org.apache.beam.sdk.io.TextIO$TextSink$TextWriteOperation@3701012a. > {code} > Presumably, this {{SEVERE}} warning is indicative of a bug (or should be > masked). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-1126) Expose UnboundedSource split backlog in number of events
[ https://issues.apache.org/jira/browse/BEAM-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15749901#comment-15749901 ] Ben Chambers commented on BEAM-1126: The majority of the work is getting Metrics supported well-enough to start removing Aggregators and moving code/examples/documentation towards Metrics. This work (for Java) is tracked in this issue http://issues.apache.org/jira/browse/BEAM-147. I'm working on the Dataflow runner changes right now. The other runners could choose to implement Metrics either in a way similar to how they currently support Aggregators (providing the "committed" value across work that succeeded) or using their own Metrics mechanisms (providing an "attempted" value across all attempts at work). > Expose UnboundedSource split backlog in number of events > > > Key: BEAM-1126 > URL: https://issues.apache.org/jira/browse/BEAM-1126 > Project: Beam > Issue Type: Improvement > Components: sdk-java-core >Reporter: Aviem Zur >Assignee: Daniel Halperin >Priority: Minor > > Today {{UnboundedSource}} exposes split backlog in bytes via > {{getSplitBacklogBytes()}} > There is value in exposing backlog in number of events as well, since this > number can be more human comprehensible than bytes. something like > {{getSplitBacklogEvents()}} or {{getSplitBacklogCount()}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
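A rough shape for the proposal: `getSplitBacklogBytes()` exists on `UnboundedSource.UnboundedReader` today, while `getSplitBacklogEvents()` is the *proposed* addition, so the simplified, Beam-free interface below is a hypothetical sketch only.

```java
// Sketch of the proposed API shape; ReaderBacklog is an illustrative
// stand-in for UnboundedSource.UnboundedReader, and getSplitBacklogEvents()
// is the proposed (not yet existing) method.
public class BacklogSketch {
    static final long BACKLOG_UNKNOWN = -1L;  // mirrors UnboundedReader.BACKLOG_UNKNOWN

    interface ReaderBacklog {
        long getSplitBacklogBytes();          // existing API

        // Proposed: backlog as an event count, easier for humans to read
        // than raw bytes. Sources would opt in by overriding the default.
        default long getSplitBacklogEvents() {
            return BACKLOG_UNKNOWN;
        }
    }

    public static void main(String[] args) {
        ReaderBacklog reader = new ReaderBacklog() {
            public long getSplitBacklogBytes() { return 4096; }
            public long getSplitBacklogEvents() { return 32; }
        };
        System.out.println(reader.getSplitBacklogBytes() + " bytes, "
            + reader.getSplitBacklogEvents() + " events");
    }
}
```

A default returning a sentinel keeps the addition backwards-compatible: existing sources compile unchanged and simply report "unknown".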
[jira] [Created] (BEAM-1161) Update Javadoc/Examples/etc.
Ben Chambers created BEAM-1161: -- Summary: Update Javadoc/Examples/etc. Key: BEAM-1161 URL: https://issues.apache.org/jira/browse/BEAM-1161 Project: Beam Issue Type: Sub-task Reporter: Ben Chambers -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-1148) Port PAssert away from Aggregators
[ https://issues.apache.org/jira/browse/BEAM-1148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15749188#comment-15749188 ] Ben Chambers commented on BEAM-1148: In general: Aggregators are being replaced with Metrics for the "monitoring" use cases, as described in https://lists.apache.org/thread.html/08af5d8247c316f46f4dc1ec93173721f684109b8a9d41a4431558ec@%3Cdev.beam.apache.org%3E As noted in that thread, they may reappear in the future as a more general shorthand for "side-output + combine", but they need to address things like windowing and later use within the pipeline to be really useful in that role. Within PAssert: They may be replaced with either Metrics (if possible) or with an explicit side-output + combine which should be semantically equivalent. > Port PAssert away from Aggregators > -- > > Key: BEAM-1148 > URL: https://issues.apache.org/jira/browse/BEAM-1148 > Project: Beam > Issue Type: New Feature > Components: sdk-java-core >Reporter: Kenneth Knowles > > One step in the removal of Aggregators (in favor of Metrics) is to remove our > reliance on them for PAssert checking. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (BEAM-1122) Change Dataflow profiling options to support saving to GCS
[ https://issues.apache.org/jira/browse/BEAM-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Chambers resolved BEAM-1122. Resolution: Fixed Fix Version/s: 0.4.0-incubating > Change Dataflow profiling options to support saving to GCS > -- > > Key: BEAM-1122 > URL: https://issues.apache.org/jira/browse/BEAM-1122 > Project: Beam > Issue Type: New Feature > Components: runner-dataflow >Affects Versions: 0.4.0-incubating >Reporter: Ben Chambers >Assignee: Ben Chambers >Priority: Minor > Labels: backward-incompatible > Fix For: 0.4.0-incubating > > > Remove the `--enableProfilingAgent` flag and add a `--saveProfilesToGcs` flag > to the `DataflowProfilingOptions`. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (BEAM-1122) Change Dataflow profiling options to support saving to GCS
Ben Chambers created BEAM-1122: -- Summary: Change Dataflow profiling options to support saving to GCS Key: BEAM-1122 URL: https://issues.apache.org/jira/browse/BEAM-1122 Project: Beam Issue Type: New Feature Components: runner-dataflow Affects Versions: 0.4.0-incubating Reporter: Ben Chambers Assignee: Ben Chambers Priority: Minor Remove the `--enableProfilingAgent` flag and add a `--saveProfilesToGcs` flag to the `DataflowProfilingOptions`. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (BEAM-775) Remove Aggregators from the Java SDK
Ben Chambers created BEAM-775: - Summary: Remove Aggregators from the Java SDK Key: BEAM-775 URL: https://issues.apache.org/jira/browse/BEAM-775 Project: Beam Issue Type: Sub-task Components: sdk-java-core Reporter: Ben Chambers Assignee: Ben Chambers -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (BEAM-774) Implement Metrics support for Spark runner
Ben Chambers created BEAM-774: - Summary: Implement Metrics support for Spark runner Key: BEAM-774 URL: https://issues.apache.org/jira/browse/BEAM-774 Project: Beam Issue Type: Sub-task Components: runner-spark Reporter: Ben Chambers Assignee: Amit Sela -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (BEAM-458) Support for Flink Metrics
[ https://issues.apache.org/jira/browse/BEAM-458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Chambers closed BEAM-458. - Resolution: Duplicate Fix Version/s: Not applicable > Support for Flink Metrics > -- > > Key: BEAM-458 > URL: https://issues.apache.org/jira/browse/BEAM-458 > Project: Beam > Issue Type: New Feature > Components: beam-model >Reporter: Sumit Chawla > Fix For: Not applicable > > > Flink has added support for Codahale Metrics > (https://ci.apache.org/projects/flink/flink-docs-master/monitoring/metrics.html) > These metrics are more advanced than the current Accumulators. > Adding support for these at the Beam level would be a good addition. > https://github.com/apache/flink/pull/1947#issuecomment-233029166 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (BEAM-773) Implement Metrics support for Flink runner
Ben Chambers created BEAM-773: - Summary: Implement Metrics support for Flink runner Key: BEAM-773 URL: https://issues.apache.org/jira/browse/BEAM-773 Project: Beam Issue Type: Sub-task Components: runner-flink Reporter: Ben Chambers -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (BEAM-772) Implement Metrics support for Dataflow Runner
Ben Chambers created BEAM-772: - Summary: Implement Metrics support for Dataflow Runner Key: BEAM-772 URL: https://issues.apache.org/jira/browse/BEAM-772 Project: Beam Issue Type: Sub-task Components: runner-dataflow Reporter: Ben Chambers Assignee: Ben Chambers -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (BEAM-147) Introduce an easy API for pipeline metrics
[ https://issues.apache.org/jira/browse/BEAM-147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Chambers updated BEAM-147: -- Summary: Introduce an easy API for pipeline metrics (was: Introduce an easy API pipeline metrics) > Introduce an easy API for pipeline metrics > -- > > Key: BEAM-147 > URL: https://issues.apache.org/jira/browse/BEAM-147 > Project: Beam > Issue Type: Bug > Components: beam-model, sdk-java-core, sdk-py >Reporter: Robert Bradshaw >Assignee: Ben Chambers > > The existing Aggregators are confusing both because of their name and because > they serve multiple purposes. > Previous discussions around Aggregators/metrics/etc. > See discussion at > http://mail-archives.apache.org/mod_mbox/incubator-beam-user/201603.mbox/browser > and > http://mail-archives.apache.org/mod_mbox/incubator-beam-dev/201603.mbox/browser > . Exact name still being bikeshedded. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-726) Standardize naming of PipelineResult objects
[ https://issues.apache.org/jira/browse/BEAM-726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15553367#comment-15553367 ] Ben Chambers commented on BEAM-726: --- Great point. We could also consider choosing a different suffix. Perhaps "PipelineExecution" or "PipelineHandle" or something like that. That would need a little more discussion since any user-code mentioning PipelineResult would then break. If we decide to rename it, it seems like it would be good to do these together. > Standardize naming of PipelineResult objects > > > Key: BEAM-726 > URL: https://issues.apache.org/jira/browse/BEAM-726 > Project: Beam > Issue Type: Bug > Components: beam-model >Reporter: Ben Chambers >Assignee: Frances Perry >Priority: Minor > > Today: > PipelineResult is an interface returned by running a pipeline. > DataflowPipelineJob is the Dataflow implementation of that interface > FlinkRunnerResult is the Flink implementation > EvaluationContext is the Spark implementation > DirectPipelineResult is the DirectRunner implementation > Ideally, all the names would indicate that they are a PipelineResult, like > the DirectRunner does. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (BEAM-726) Standardize naming of PipelineResult objects
Ben Chambers created BEAM-726: - Summary: Standardize naming of PipelineResult objects Key: BEAM-726 URL: https://issues.apache.org/jira/browse/BEAM-726 Project: Beam Issue Type: Bug Components: beam-model Reporter: Ben Chambers Assignee: Frances Perry Priority: Minor Today: PipelineResult is an interface returned by running a pipeline. DataflowPipelineJob is the Dataflow implementation of that interface FlinkRunnerResult is the Flink implementation EvaluationContext is the Spark implementation DirectPipelineResult is the DirectRunner implementation Ideally, all the names would indicate that they are a PipelineResult, like the DirectRunner does. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (BEAM-696) Side-Inputs non-deterministic with merging main-input windows
Ben Chambers created BEAM-696: - Summary: Side-Inputs non-deterministic with merging main-input windows Key: BEAM-696 URL: https://issues.apache.org/jira/browse/BEAM-696 Project: Beam Issue Type: Bug Components: beam-model Reporter: Ben Chambers Assignee: Frances Perry Side-Inputs are non-deterministic for several reasons: 1. Because they depend on triggering of the side-input (this is acceptable because triggers are by their nature non-deterministic). 2. They depend on the current state of the main-input window in order to lookup the side-input. This means that with merging 3. Any runner optimizations that affect when the side-input is looked up may cause problems with either or both of these. This issue focuses on #2 -- the non-determinism of side-inputs that execute within a Merging WindowFn. Possible solution would be to defer running anything that looks up the side-input until we need to extract an output, and using the main-window at that point. Specifically, if the main-window is a MergingWindowFn, don't execute any kind of pre-combine, instead buffer all the inputs and combine later. This could still run into some non-determinism if there are triggers controlling when we extract output. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-694) TriggerTester doesn't test timer firings
[ https://issues.apache.org/jira/browse/BEAM-694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534161#comment-15534161 ] Ben Chambers commented on BEAM-694: --- I think this is partly intentional with the newer formulation of Triggers and how they interact with timers (e.g., the Triggers as predicates). Specifically, Triggers no longer "receive" timers. Instead, a timer is just an indication that the trigger would like to be re-evaluated at some point in time. So, what we should probably do is: 1. Test that triggers *set* reasonable timers (this ensures they get woken up at reasonable points in time) 2. Separately test that triggers behave correctly when they are woken up (via a call to `shouldFire`). It is important to actually do these separately, since `shouldFire` may be called for other reasons as well (such as when the watermark is passing the end of the window). There may be no timer from the trigger, but it may still get a chance to trigger or not. If I understand the problem, it is that we're missing the tests for 1. I don't think we should necessarily tie the two together in the tests since they are not coupled in the actual implementation. > TriggerTester doesn't test timer firings > > > Key: BEAM-694 > URL: https://issues.apache.org/jira/browse/BEAM-694 > Project: Beam > Issue Type: Bug >Reporter: Eugene Kirpichov > > TriggerTester exposes a `fireIfShouldFire(BoundedWindow)` method. This is > used to prompt a call to the trigger with the current state of the trigger > tester (Input Watermarks, elements present, etc.), and see if the trigger > should fire. > The TriggerTester should automatically call back to the trigger with the > current state whenever a Timer fires, as specified by the current watermarks > and any Timers set by the trigger under test. This ensures that Triggers set > underlying timers properly, so the trigger will fire even if no additional > elements arrive.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
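The two separate properties suggested in the comment can be sketched with a toy trigger. This is not the Beam Trigger API, just an illustration of testing "what timer does it set" and "does it fire when woken" independently.

```java
// Toy trigger (illustrative, not the Beam Trigger interface) showing the
// two properties to test separately: (1) it requests a sensible timer,
// (2) shouldFire answers correctly whenever it is woken, regardless of why.
public class ToyTriggerTest {
    static class AfterEndOfWindow {
        final long windowEnd;
        AfterEndOfWindow(long windowEnd) { this.windowEnd = windowEnd; }

        // Property 1: the trigger asks to be woken at the end of the window.
        long timerToSet() { return windowEnd; }

        // Property 2: woken early it declines; at/after the end it fires.
        boolean shouldFire(long inputWatermark) {
            return inputWatermark >= windowEnd;
        }
    }

    public static void main(String[] args) {
        AfterEndOfWindow trigger = new AfterEndOfWindow(100);
        System.out.println(trigger.timerToSet() == 100);  // true
        System.out.println(trigger.shouldFire(50));       // false: woken early
        System.out.println(trigger.shouldFire(100));      // true
    }
}
```

Keeping the two assertions separate mirrors the implementation: `shouldFire` may be invoked for reasons other than the trigger's own timer.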
[jira] [Commented] (BEAM-147) Introduce an easy API pipeline metrics
[ https://issues.apache.org/jira/browse/BEAM-147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15530952#comment-15530952 ] Ben Chambers commented on BEAM-147: --- Repurposing this issue to track the actual goal of having an easy-to-use API with an easy-to-discover name for reporting pipeline metrics. > Introduce an easy API pipeline metrics > -- > > Key: BEAM-147 > URL: https://issues.apache.org/jira/browse/BEAM-147 > Project: Beam > Issue Type: Bug > Components: beam-model, sdk-java-core, sdk-py >Reporter: Robert Bradshaw >Assignee: Ben Chambers > > The existing Aggregators are confusing both because of their name and because > they serve multiple purposes. > Previous discussions around Aggregators/metrics/etc. > See discussion at > http://mail-archives.apache.org/mod_mbox/incubator-beam-user/201603.mbox/browser > and > http://mail-archives.apache.org/mod_mbox/incubator-beam-dev/201603.mbox/browser > . Exact name still being bikeshedded. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (BEAM-147) Introduce an easy API pipeline metrics
[ https://issues.apache.org/jira/browse/BEAM-147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Chambers updated BEAM-147: -- Description: The existing Aggregators are confusing both because of their name and because they serve multiple purposes. Previous discussions around Aggregators/metrics/etc. See discussion at http://mail-archives.apache.org/mod_mbox/incubator-beam-user/201603.mbox/browser and http://mail-archives.apache.org/mod_mbox/incubator-beam-dev/201603.mbox/browser . Exact name still being bikeshedded. was: The existing Aggregators are confusing both because of their name and because they serve multiple purposes. See discussion at http://mail-archives.apache.org/mod_mbox/incubator-beam-user/201603.mbox/browser and http://mail-archives.apache.org/mod_mbox/incubator-beam-dev/201603.mbox/browser . Exact name still being bikeshedded. > Introduce an easy API pipeline metrics > -- > > Key: BEAM-147 > URL: https://issues.apache.org/jira/browse/BEAM-147 > Project: Beam > Issue Type: Bug > Components: beam-model, sdk-java-core, sdk-py >Reporter: Robert Bradshaw >Assignee: Ben Chambers > > The existing Aggregators are confusing both because of their name and because > they serve multiple purposes. > Previous discussions around Aggregators/metrics/etc. > See discussion at > http://mail-archives.apache.org/mod_mbox/incubator-beam-user/201603.mbox/browser > and > http://mail-archives.apache.org/mod_mbox/incubator-beam-dev/201603.mbox/browser > . Exact name still being bikeshedded. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (BEAM-147) Introduce an easy API pipeline metrics
[ https://issues.apache.org/jira/browse/BEAM-147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Chambers updated BEAM-147: -- Description: The existing Aggregators are confusing both because of their name and because they serve multiple purposes. See discussion at http://mail-archives.apache.org/mod_mbox/incubator-beam-user/201603.mbox/browser and http://mail-archives.apache.org/mod_mbox/incubator-beam-dev/201603.mbox/browser . Exact name still being bikeshedded. was: The name "Aggregator" is confusing. See discussion at http://mail-archives.apache.org/mod_mbox/incubator-beam-user/201603.mbox/browser and http://mail-archives.apache.org/mod_mbox/incubator-beam-dev/201603.mbox/browser . Exact name still being bikeshedded. > Introduce an easy API pipeline metrics > -- > > Key: BEAM-147 > URL: https://issues.apache.org/jira/browse/BEAM-147 > Project: Beam > Issue Type: Bug > Components: beam-model, sdk-java-core, sdk-py >Reporter: Robert Bradshaw >Assignee: Ben Chambers > > The existing Aggregators are confusing both because of their name and because > they serve multiple purposes. > See discussion at > http://mail-archives.apache.org/mod_mbox/incubator-beam-user/201603.mbox/browser > and > http://mail-archives.apache.org/mod_mbox/incubator-beam-dev/201603.mbox/browser > . Exact name still being bikeshedded. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (BEAM-147) Introduce an easy API pipeline metrics
[ https://issues.apache.org/jira/browse/BEAM-147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Chambers updated BEAM-147: -- Summary: Introduce an easy API pipeline metrics (was: Rename Aggregator to [P]Metric) > Introduce an easy API pipeline metrics > -- > > Key: BEAM-147 > URL: https://issues.apache.org/jira/browse/BEAM-147 > Project: Beam > Issue Type: Bug > Components: beam-model, sdk-java-core, sdk-py >Reporter: Robert Bradshaw >Assignee: Ben Chambers > > The name "Aggregator" is confusing. > See discussion at > http://mail-archives.apache.org/mod_mbox/incubator-beam-user/201603.mbox/browser > and > http://mail-archives.apache.org/mod_mbox/incubator-beam-dev/201603.mbox/browser > . Exact name still being bikeshedded. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (BEAM-681) DoFns should be serialized at apply time and deserialized when executing
Ben Chambers created BEAM-681: - Summary: DoFns should be serialized at apply time and deserialized when executing Key: BEAM-681 URL: https://issues.apache.org/jira/browse/BEAM-681 Project: Beam Issue Type: Improvement Components: sdk-py Reporter: Ben Chambers Assignee: Frances Perry 1. Serializing DoFns at application time ensures that any modifications of fields within the DoFn after application do not accidentally pollute the execution. This mirrors the approach taken in Java to provide an approximation of lexical-closure (e.g., you only need to know the state of the DoFn at the time it was applied, not afterwards, to understand its behavior). 2. Based on 1, the DirectRunner should also be deserializing DoFns before running them, which should also detect other classes of errors such as using the pipeline object (which is not pickleable) within the DoFn. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
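The snapshot-at-apply semantics described in point 1 can be demonstrated with plain Java serialization standing in for Python pickling (class and field names here are illustrative): the state captured at apply time is what executes, so later mutation of the live object cannot leak into the pipeline.

```java
import java.io.*;

// Sketch of serialize-at-apply semantics; CountingFn is an illustrative
// stand-in for a user DoFn, and java.io serialization stands in for pickling.
public class SnapshotAtApply {
    static class CountingFn implements Serializable {
        int threshold;
        CountingFn(int threshold) { this.threshold = threshold; }
    }

    // "apply": capture the function's state immediately.
    static byte[] apply(CountingFn fn) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(fn);
        }
        return bytes.toByteArray();
    }

    // "execute": deserialize the snapshot, not the live object.
    static CountingFn execute(byte[] snapshot) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(snapshot))) {
            return (CountingFn) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        CountingFn fn = new CountingFn(10);
        byte[] snapshot = apply(fn);   // state captured here
        fn.threshold = 999;            // mutation after apply...
        System.out.println(execute(snapshot).threshold);  // ...has no effect: prints 10
    }
}
```

Point 2 follows for free: the round trip through `execute` is exactly where a non-serializable field (such as a captured pipeline object) would fail fast.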
[jira] [Assigned] (BEAM-147) Rename Aggregator to [P]Metric
[ https://issues.apache.org/jira/browse/BEAM-147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Chambers reassigned BEAM-147: - Assignee: Ben Chambers (was: Frances Perry) > Rename Aggregator to [P]Metric > -- > > Key: BEAM-147 > URL: https://issues.apache.org/jira/browse/BEAM-147 > Project: Beam > Issue Type: Bug > Components: beam-model, sdk-java-core, sdk-py >Reporter: Robert Bradshaw >Assignee: Ben Chambers > > The name "Aggregator" is confusing. > See discussion at > http://mail-archives.apache.org/mod_mbox/incubator-beam-user/201603.mbox/browser > and > http://mail-archives.apache.org/mod_mbox/incubator-beam-dev/201603.mbox/browser > . Exact name still being bikeshedded. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (BEAM-661) CalendarWindows#isCompatibleWith should use equals instead of ==
[ https://issues.apache.org/jira/browse/BEAM-661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Chambers closed BEAM-661. - Resolution: Duplicate Fix Version/s: Not applicable > CalendarWindows#isCompatibleWith should use equals instead of == > > > Key: BEAM-661 > URL: https://issues.apache.org/jira/browse/BEAM-661 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Reporter: Ben Chambers >Assignee: Davor Bonaci >Priority: Minor > Fix For: Not applicable > > > http://stackoverflow.com/questions/39617897/inputs-to-flatten-had-incompatible-window-windowfns-when-cogroupbykey-with-calen > We're using `==` instead of `.equals` to compare objects, which causes > equivalent CalendarWindows to be incompatible. > https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/windowing/CalendarWindows.java#L143 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (BEAM-661) CalendarWindows#isCompatibleWith should use equals instead of ==
Ben Chambers created BEAM-661: - Summary: CalendarWindows#isCompatibleWith should use equals instead of == Key: BEAM-661 URL: https://issues.apache.org/jira/browse/BEAM-661 Project: Beam Issue Type: Bug Components: sdk-java-core Reporter: Ben Chambers Assignee: Davor Bonaci Priority: Minor http://stackoverflow.com/questions/39617897/inputs-to-flatten-had-incompatible-window-windowfns-when-cogroupbykey-with-calen We're using `==` instead of `.equals` to compare objects, which causes equivalent CalendarWindows to be incompatible. https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/windowing/CalendarWindows.java#L143 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
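A minimal illustration of this bug class (plain Strings stand in for the compared objects; this is not the CalendarWindows code itself): two logically equal but distinct instances compare unequal under `==`, so an identity-based compatibility check spuriously rejects compatible windows.

```java
// Illustrative sketch of the == vs .equals bug described in BEAM-661.
public class CompatibilityCheck {
    static boolean isCompatibleWithIdentity(String tzA, String tzB) {
        return tzA == tzB;       // buggy: reference comparison
    }

    static boolean isCompatibleWithEquals(String tzA, String tzB) {
        return tzA.equals(tzB);  // fixed: value comparison
    }

    public static void main(String[] args) {
        // Distinct instances with equal contents, as produced by two
        // independently-constructed windowing configurations.
        String tzA = new String("America/Los_Angeles");
        String tzB = new String("America/Los_Angeles");
        System.out.println(isCompatibleWithIdentity(tzA, tzB));  // false: spurious incompatibility
        System.out.println(isCompatibleWithEquals(tzA, tzB));    // true
    }
}
```

This is exactly the failure mode behind the "Inputs to Flatten had incompatible window WindowFns" error in the linked Stack Overflow question.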
[jira] [Commented] (BEAM-644) Primitive to shift the watermark while assigning timestamps
[ https://issues.apache.org/jira/browse/BEAM-644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15507379#comment-15507379 ] Ben Chambers commented on BEAM-644: --- Minor note on "A function from TimestampedElement to new timestamp that always falls within D of the original timestamp." Rather than "within D" I think the requirement is that for an input with timestamp t, the output timestamp is >= t+D. This ensures that the output timestamp's relation to the output watermark is no later than the input timestamp's relation to the input watermark. > Primitive to shift the watermark while assigning timestamps > --- > > Key: BEAM-644 > URL: https://issues.apache.org/jira/browse/BEAM-644 > Project: Beam > Issue Type: New Feature > Components: beam-model >Reporter: Kenneth Knowles >Assignee: Kenneth Knowles > > There is a general need, especially important in the presence of > SplittableDoFn, to be able to assign new timestamps to elements without > making them late or droppable. > - DoFn.withAllowedTimestampSkew is inadequate, because it simply allows one > to produce late data, but does not allow one to shift the watermark so the > new data is on-time. > - For a SplittableDoFn, one may receive an element such as the name of a log > file that contains elements for the day preceding the log file. The timestamp > on the filename must currently be the beginning of the log. If such elements > are constantly flowing, it may be OK, but since we don't know that element is > coming, in that absence of data, the watermark may advance. We need a way to > keep it far enough back even in the absence of data holding it back. > One idea is a new primitive ShiftWatermark / AdjustTimestamps with the > following pieces: > - A constant duration (positive or negative) D by which to shift the > watermark. > - A function from TimestampedElement to new timestamp that always falls > within D of the original timestamp.
> With this primitive added, outputWithTimestamp and withAllowedTimestampSkew > could be removed, simplifying DoFn. > Alternatively, all of this functionality could be bolted on to DoFn. > This ticket is not a proposal, but a record of the issue and ideas that were > mentioned. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
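The constraint from the comment (output timestamp >= t + D) can be sketched as a validation guard. Names here are illustrative, not the proposed primitive's API: with the watermark shifted by D (negative D holds it back), any reassigned timestamp earlier than t + D would be later relative to the shifted watermark than the input was to the original one.

```java
import java.time.Duration;
import java.time.Instant;

// Sketch of the t+D lower bound from the comment above; adjustTimestamp is
// an illustrative name, not part of the proposed ShiftWatermark primitive.
public class ShiftWatermarkCheck {
    static Instant adjustTimestamp(Instant original, Duration shift, Instant proposed) {
        Instant earliestAllowed = original.plus(shift);  // t + D
        if (proposed.isBefore(earliestAllowed)) {
            throw new IllegalArgumentException(
                "new timestamp " + proposed + " is before " + earliestAllowed);
        }
        return proposed;
    }

    public static void main(String[] args) {
        Instant t = Instant.parse("2016-09-20T00:00:00Z");
        Duration d = Duration.ofHours(-24);  // hold the watermark back one day
        // A log entry from earlier in the same day is fine...
        adjustTimestamp(t, d, t.minus(Duration.ofHours(3)));
        // ...but anything more than 24h before t would be rejected.
    }
}
```

This matches the log-file example in the issue: D = -24h lets every entry in a day-long log be restamped without becoming droppable.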
[jira] [Commented] (BEAM-37) Run DoFnWithContext without conversion to vanilla DoFn
[ https://issues.apache.org/jira/browse/BEAM-37?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379728#comment-15379728 ] Ben Chambers commented on BEAM-37: -- Notes on remaining tasks to complete this: 1. Modify DoFnRunner to allow executing a DoFnWithContext 2. Modify any runner specific code to use DoFnRunner when a DoFnWithContext is received. 3. Remove wrapping of DoFnWithContexts from DataflowRunner (and others) to use the modified code above. > Run DoFnWithContext without conversion to vanilla DoFn > -- > > Key: BEAM-37 > URL: https://issues.apache.org/jira/browse/BEAM-37 > Project: Beam > Issue Type: Sub-task > Components: runner-core >Reporter: Kenneth Knowles >Assignee: Ben Chambers > > DoFnWithContext is an enhanced DoFn where annotations and parameter lists are > inspected to determine whether it accesses windowing information, etc. > Today, each feature of DoFnWithContext requires implementation on DoFn, which > precludes the easy addition of features that we don't have designs for in > DoFn. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (BEAM-377) BigQueryIO should validate a table or query to read from
Ben Chambers created BEAM-377: - Summary: BigQueryIO should validate a table or query to read from Key: BEAM-377 URL: https://issues.apache.org/jira/browse/BEAM-377 Project: Beam Issue Type: Bug Components: sdk-java-extensions Reporter: Ben Chambers Assignee: Ben Chambers Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (BEAM-372) CoderProperties: Test that the coder doesn't consume more bytes than it produces
Ben Chambers created BEAM-372: - Summary: CoderProperties: Test that the coder doesn't consume more bytes than it produces Key: BEAM-372 URL: https://issues.apache.org/jira/browse/BEAM-372 Project: Beam Issue Type: Bug Components: sdk-java-core Reporter: Ben Chambers Assignee: Davor Bonaci Priority: Minor Add a test to CoderProperties that does the following: 1. Encode a value using the Coder 2. Add a byte at the end of the encoded array 3. Decode the value using the Coder 4. Verify the extra byte was not consumed (This could possibly just be an enhancement to the existing round-trip encode/decode test) When this fails it can lead to very difficult to debug situations in a coder wrapped around the problematic coder. This would be an easy test that would clearly fail *for the coder which was problematic*. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
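The four steps above can be sketched against a toy fixed-width int "coder" (plain stream methods, not the real Coder interface): encode, append one trailing byte, decode, then verify that exactly that byte remains unconsumed.

```java
import java.io.*;

// Sketch of the proposed CoderProperties check using a toy 4-byte int
// "coder" (illustrative; not the Beam Coder API).
public class CoderConsumesExactly {
    static void encode(int value, OutputStream out) throws IOException {
        new DataOutputStream(out).writeInt(value);  // writes exactly 4 bytes
    }

    static int decode(InputStream in) throws IOException {
        return new DataInputStream(in).readInt();   // must read exactly 4 bytes
    }

    static void checkDoesNotOverConsume(int value) throws IOException {
        // 1. Encode a value.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        encode(value, buf);
        // 2. Add a byte at the end of the encoded array.
        buf.write(0xAB);
        // 3. Decode the value.
        ByteArrayInputStream in = new ByteArrayInputStream(buf.toByteArray());
        if (decode(in) != value) {
            throw new AssertionError("round trip failed");
        }
        // 4. Verify the extra byte was not consumed.
        if (in.available() != 1) {
            throw new AssertionError("coder consumed bytes it did not produce");
        }
    }

    public static void main(String[] args) throws IOException {
        checkDoesNotOverConsume(42);
        System.out.println("ok");
    }
}
```

A coder that fails step 4 corrupts whatever value follows it in a composite encoding, which is why the failure is so hard to debug from the wrapping coder.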
[jira] [Closed] (BEAM-359) AvroCoder should be able to handle anonymous classes as schemas
[ https://issues.apache.org/jira/browse/BEAM-359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Chambers closed BEAM-359. - Resolution: Fixed Fix Version/s: 0.2.0-incubating > AvroCoder should be able to handle anonymous classes as schemas > --- > > Key: BEAM-359 > URL: https://issues.apache.org/jira/browse/BEAM-359 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Affects Versions: 0.1.0-incubating, 0.2.0-incubating >Reporter: Daniel Mills >Assignee: Ben Chambers > Fix For: 0.2.0-incubating > > > Currently, the determinism checker fails with: > java.lang.IllegalArgumentException: Unable to get field id from class null > at > com.google.cloud.dataflow.sdk.coders.AvroCoder$AvroDeterminismChecker.getField(AvroCoder.java:710) > at > com.google.cloud.dataflow.sdk.coders.AvroCoder$AvroDeterminismChecker.checkRecord(AvroCoder.java:548) > at > com.google.cloud.dataflow.sdk.coders.AvroCoder$AvroDeterminismChecker.doCheck(AvroCoder.java:477) > at > com.google.cloud.dataflow.sdk.coders.AvroCoder$AvroDeterminismChecker.recurse(AvroCoder.java:453) > at > com.google.cloud.dataflow.sdk.coders.AvroCoder$AvroDeterminismChecker.checkRecord(AvroCoder.java:567) > at > com.google.cloud.dataflow.sdk.coders.AvroCoder$AvroDeterminismChecker.doCheck(AvroCoder.java:477) > at > com.google.cloud.dataflow.sdk.coders.AvroCoder$AvroDeterminismChecker.recurse(AvroCoder.java:453) > at > com.google.cloud.dataflow.sdk.coders.AvroCoder$AvroDeterminismChecker.check(AvroCoder.java:430) > at > com.google.cloud.dataflow.sdk.coders.AvroCoder.<init>(AvroCoder.java:189) > at com.google.cloud.dataflow.sdk.coders.AvroCoder.of(AvroCoder.java:144) > at mypackage.GenericsTest$1.create(GenericsTest.java:102) > at > com.google.cloud.dataflow.sdk.coders.CoderRegistry.getDefaultCoderFromFactory(CoderRegistry.java:797) > at > com.google.cloud.dataflow.sdk.coders.CoderRegistry.getDefaultCoder(CoderRegistry.java:748) > at > 
com.google.cloud.dataflow.sdk.coders.CoderRegistry.getDefaultCoder(CoderRegistry.java:719) > at > com.google.cloud.dataflow.sdk.coders.CoderRegistry.getDefaultCoder(CoderRegistry.java:696) > at > com.google.cloud.dataflow.sdk.coders.CoderRegistry.getDefaultCoder(CoderRegistry.java:178) > at > com.google.cloud.dataflow.sdk.values.TypedPValue.inferCoderOrFail(TypedPValue.java:147) > at > com.google.cloud.dataflow.sdk.values.TypedPValue.getCoder(TypedPValue.java:48) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-366) Support Display Data on Composite Transforms
[ https://issues.apache.org/jira/browse/BEAM-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342229#comment-15342229 ] Ben Chambers commented on BEAM-366: --- As mentioned, representing composites explicitly in the Dataflow runner is currently dependent on the improved Runner API. > Support Display Data on Composite Transforms > > > Key: BEAM-366 > URL: https://issues.apache.org/jira/browse/BEAM-366 > Project: Beam > Issue Type: Bug > Components: runner-dataflow >Reporter: Ben Chambers > > Today, Dataflow doesn't represent composites all the way to the UI (it > reconstructs them from the name). This means it doesn't support attaching > Display Data to composites. > With the runner API refactoring, Dataflow should start supporting composites, > at which point we should make sure that Display Data is plumbed through > properly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (BEAM-366) Support Display Data on Composite Transforms
Ben Chambers created BEAM-366: - Summary: Support Display Data on Composite Transforms Key: BEAM-366 URL: https://issues.apache.org/jira/browse/BEAM-366 Project: Beam Issue Type: Bug Components: runner-dataflow Reporter: Ben Chambers Assignee: Davor Bonaci Today, Dataflow doesn't represent composites all the way to the UI (it reconstructs them from the name). This means it doesn't support attaching Display Data to composites. With the runner API refactoring, Dataflow should start supporting composites, at which point we should make sure that Display Data is plumbed through properly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (BEAM-344) Merge, Split, Delay, and Reorder bundles in the DirectRunner
[ https://issues.apache.org/jira/browse/BEAM-344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Chambers updated BEAM-344: -- Summary: Merge, Split, Delay, and Reorder bundles in the DirectRunner (was: Merge, Split, Delay, and Reorder bundles in the InProcessPipelineRunner) > Merge, Split, Delay, and Reorder bundles in the DirectRunner > > > Key: BEAM-344 > URL: https://issues.apache.org/jira/browse/BEAM-344 > Project: Beam > Issue Type: Improvement > Components: runner-direct >Reporter: Thomas Groh >Assignee: Thomas Groh >Priority: Minor > > PCollections have no guarantees about the ordering of elements between steps. Randomly reordering, splitting, and merging bundles will break user pipelines which assume some order will be maintained between steps and within a bundle. > This assumption is not correct, so it should break in the Direct Runner. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (BEAM-352) Add DisplayData to HDFS Sources
Ben Chambers created BEAM-352: - Summary: Add DisplayData to HDFS Sources Key: BEAM-352 URL: https://issues.apache.org/jira/browse/BEAM-352 Project: Beam Issue Type: Bug Components: sdk-java-extensions Reporter: Ben Chambers Assignee: James Malone Priority: Minor Any interesting parameters of the sources/sinks should be exposed as display data. See any of the sources/sinks that already export this (BigQuery, PubSub, etc.) for examples. Also look at the DisplayData builder and HasDisplayData interface for how to wire these up. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (BEAM-351) Add DisplayData to KafkaIO
Ben Chambers created BEAM-351: - Summary: Add DisplayData to KafkaIO Key: BEAM-351 URL: https://issues.apache.org/jira/browse/BEAM-351 Project: Beam Issue Type: Bug Components: sdk-java-extensions Reporter: Ben Chambers Assignee: James Malone Priority: Minor Any interesting parameters of the sources/sinks should be exposed as display data. See any of the sources/sinks that already export this (BigQuery, PubSub, etc.) for examples. Also look at the DisplayData builder and HasDisplayData interface for how to wire these up. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (BEAM-324) Improve TypeDescriptor inference of DoFn's created inside a generic PTransform
Ben Chambers created BEAM-324: - Summary: Improve TypeDescriptor inference of DoFn's created inside a generic PTransform Key: BEAM-324 URL: https://issues.apache.org/jira/browse/BEAM-324 Project: Beam Issue Type: Bug Components: beam-model Reporter: Ben Chambers Assignee: Frances Perry Priority: Minor Commit https://github.com/apache/incubator-beam/commit/aa7f07fa5b22f3656d52dc9e1d4557bceb87c013 introduced the ability to infer a {{TypeDescriptor}} from an object created inside a concrete instance of a {{PTransform}} and used it to simplify {{SimpleFunction}} usage. We should probably look at using the same mechanism elsewhere, such as when inferring the output type of a {{ParDo}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (BEAM-117) Implement the API for Static Display Metadata
[ https://issues.apache.org/jira/browse/BEAM-117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Chambers resolved BEAM-117. --- Resolution: Fixed Fix Version/s: 0.1.0-incubating > Implement the API for Static Display Metadata > - > > Key: BEAM-117 > URL: https://issues.apache.org/jira/browse/BEAM-117 > Project: Beam > Issue Type: New Feature > Components: sdk-java-core >Reporter: Ben Chambers >Assignee: Scott Wegner > Fix For: 0.1.0-incubating > > > As described in the following doc, we would like the SDK to allow associating > display metadata with PTransforms. > https://docs.google.com/document/d/11enEB9JwVp6vO0uOYYTMYTGkr3TdNfELwWqoiUg5ZxM/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-175) Leak garbage collection timers in GlobalWindow
[ https://issues.apache.org/jira/browse/BEAM-175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284759#comment-15284759 ] Ben Chambers commented on BEAM-175: --- I think in general this seems OK. I like that we're making the behavior explicit rather than trying to guess it. 1. This may be too much configuration. There may be other/better ways of getting pane indices (e.g., once we have a sink API or a state API). We should make sure we understand the use cases before exposing the knob. Especially since ReduceFnRunner is already complicated -- this is just adding more permutations of cases it needs to handle. 2. I worry that the default, at least in the Pane Index case, is actually *less* performant than the ZERO case, which is likely what was desired 90% of the time. If we go this direction, I would propose we change the default. 3. You should flesh this out to address error cases. When do we detect that the user is accessing the PaneIndex with the ZERO behavior specified? What kind of error message do they get? Etc. > Leak garbage collection timers in GlobalWindow > -- > > Key: BEAM-175 > URL: https://issues.apache.org/jira/browse/BEAM-175 > Project: Beam > Issue Type: Bug > Components: runner-core >Reporter: Mark Shields >Assignee: Mark Shields > > Consider the transform: > Window > .into(new GlobalWindows()) > .triggering( > Repeatedly.forever( > AfterProcessingTime.pastFirstElementInPane().plusDelayOf(...))) > .discardingFiredPanes() > This is a common idiom for 'process elements bunched by arrival time'. > Currently we create an end-of-window timer per key, which clearly will only > fire if the pipeline is drained. > Better would be to avoid creating end-of-window timers if there's no state > which needs to be processed at end-of-window (i.e., at drain for the Global > window). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (BEAM-247) CombineFn's only definable/usable inside sdk.transforms package
Ben Chambers created BEAM-247: - Summary: CombineFn's only definable/usable inside sdk.transforms package Key: BEAM-247 URL: https://issues.apache.org/jira/browse/BEAM-247 Project: Beam Issue Type: Bug Components: sdk-java-core Reporter: Ben Chambers Assignee: Pei He Priority: Critical {code:java} public abstract static class CombineFn<InputT, AccumT, OutputT> extends AbstractGlobalCombineFn<InputT, AccumT, OutputT> { /* ... */ } abstract static class AbstractGlobalCombineFn<InputT, AccumT, OutputT> implements GlobalCombineFn<InputT, AccumT, OutputT>, Serializable { /* ... */ } {code} Since {{AbstractGlobalCombineFn}} is package protected (and therefore not visible outside of the {{transform}} package), it is not possible to cast any class that extends {{CombineFn}} to a {{GlobalCombineFn}} outside of this package. This prevents applying existing {{CombineFn}}s directly (such as {{Combine.perKey(new Sum.SumIntegersFn())}}, as used in our documentation) and also means that a user cannot define their own {{CombineFn}} unless they put them in the {{transform}} package. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-210) Be consistent with emitting final empty panes
[ https://issues.apache.org/jira/browse/BEAM-210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248532#comment-15248532 ] Ben Chambers commented on BEAM-210: --- Most important, I think, is to get rid of the implicit "if you have a watermark trigger, you must want an ON_TIME pane even if it is empty". I don't know if it's "wins", so much as we produce an empty pane if either (the pane is ON_TIME and we think you want an empty ON_TIME pane) OR (the pane is final and we know you want an empty final pane). +1 to a clarifying rename, although I don't think "incremental" and "replacement" actually clarify that much. I would suggest that we standardize on a pane as "the elements since the last triggering". When we output, we produce either the current pane (DISCARDING mode), or all of the accumulated panes (ACCUMULATING mode). So calling these something like outputCurrentPane and outputCumulativePanes (or outputAccumulatedPanes or something like that) might be clearer? This is orthogonal, and should probably move out of this issue and on to the dev list when we want to perform said renaming. > Be consistent with emitting final empty panes > - > > Key: BEAM-210 > URL: https://issues.apache.org/jira/browse/BEAM-210 > Project: Beam > Issue Type: Bug > Components: runner-core >Reporter: Mark Shields >Assignee: Mark Shields > > Currently ReduceFnRunner.onTrigger uses shouldEmit to prevent empty final > panes unless the user has requested them. > The same check needs to be done in ReduceFnRunner.onTimer -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-210) Be consistent with emitting final empty panes
[ https://issues.apache.org/jira/browse/BEAM-210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248490#comment-15248490 ] Ben Chambers commented on BEAM-210: --- No. This is in a weird spot. When we say "empty pane" in terms of whether or not to produce an ON_TIME or final pane, we mean the elements that were new in that pane. I think we should standardize on that. When we "accumulate fired panes" it means we retain previous panes, and the *output* in this pane contains data from those earlier ones. But *this* pane may still be empty/non-empty. I tried to reproduce with a test, but was unable to: https://github.com/apache/incubator-beam/pull/211. Still probably worth checking in to improve test coverage of that case. > Be consistent with emitting final empty panes > - > > Key: BEAM-210 > URL: https://issues.apache.org/jira/browse/BEAM-210 > Project: Beam > Issue Type: Bug > Components: runner-core >Reporter: Mark Shields >Assignee: Mark Shields > > Currently ReduceFnRunner.onTrigger uses shouldEmit to prevent empty final > panes unless the user has requested them. > The same check needs to be done in ReduceFnRunner.onTimer -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (BEAM-184) Using Merging Windows and/or Triggers without a downstream aggregation should fail
Ben Chambers created BEAM-184: - Summary: Using Merging Windows and/or Triggers without a downstream aggregation should fail Key: BEAM-184 URL: https://issues.apache.org/jira/browse/BEAM-184 Project: Beam Issue Type: Bug Components: sdk-java-core Reporter: Ben Chambers Assignee: Davor Bonaci Priority: Minor Both merging windows (such as sessions) and triggering only actually happen at an aggregation (GroupByKey). We should produce errors in any of these cases: 1. Merging window used without a downstream GroupByKey 2. Triggers used without a downstream GroupByKey 3. Window inspected after inserting a merging window and before the GroupByKey -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (BEAM-117) Implement the API for Static Display Metadata
Ben Chambers created BEAM-117: - Summary: Implement the API for Static Display Metadata Key: BEAM-117 URL: https://issues.apache.org/jira/browse/BEAM-117 Project: Beam Issue Type: Bug Components: sdk-java-core Reporter: Ben Chambers Assignee: Scott Wegner As described in the following doc, we would like the SDK to allow associating display metadata with PTransforms. https://docs.google.com/document/d/11enEB9JwVp6vO0uOYYTMYTGkr3TdNfELwWqoiUg5ZxM/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (BEAM-121) Publish DisplayData from common PTransforms
[ https://issues.apache.org/jira/browse/BEAM-121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Chambers reassigned BEAM-121: - Assignee: Scott Wegner > Publish DisplayData from common PTransforms > --- > > Key: BEAM-121 > URL: https://issues.apache.org/jira/browse/BEAM-121 > Project: Beam > Issue Type: Sub-task > Components: sdk-java-core >Reporter: Scott Wegner >Assignee: Scott Wegner > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (BEAM-135) Utilities for "batching" elements in a DoFn
Ben Chambers created BEAM-135: - Summary: Utilities for "batching" elements in a DoFn Key: BEAM-135 URL: https://issues.apache.org/jira/browse/BEAM-135 Project: Beam Issue Type: Bug Components: sdk-java-core Reporter: Ben Chambers We regularly receive questions about how to write a {{DoFn}} that operates on batches of elements. Example answers include: http://stackoverflow.com/questions/35065109/can-datastore-input-in-google-dataflow-pipeline-be-processed-in-a-batch-of-n-ent/35068341#35068341 http://stackoverflow.com/questions/30177812/partition-data-coming-from-csv-so-i-can-process-larger-patches-rather-then-indiv/30178170#30178170 Possible APIs could be to wrap a {{DoFn}} and include a batch size, or to create a utility like {{Filter}}, {{Partition}}, etc. that takes a {{SerializableFunction}} or a {{SimpleFunction}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
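Neither proposed API exists yet; as a rough sketch of the buffering logic a batching {{DoFn}} wrapper would need, here is a plain-Java illustration (class and method names are hypothetical, not Beam APIs):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of the batching pattern: buffer in processElement,
// emit a batch when it is full, and flush the remainder in finishBundle.
public class BatchingBuffer<T> {
  private final int batchSize;
  private final Consumer<List<T>> onBatch;  // stands in for c.output(...)
  private final List<T> buffer = new ArrayList<>();

  public BatchingBuffer(int batchSize, Consumer<List<T>> onBatch) {
    this.batchSize = batchSize;
    this.onBatch = onBatch;
  }

  // Analogous to DoFn.processElement: buffer, flushing when full.
  public void add(T element) {
    buffer.add(element);
    if (buffer.size() >= batchSize) {
      flush();
    }
  }

  // Analogous to DoFn.finishBundle: emit any partial final batch.
  public void flush() {
    if (!buffer.isEmpty()) {
      onBatch.accept(new ArrayList<>(buffer));
      buffer.clear();
    }
  }
}
```

The final flush is the subtle part that both Stack Overflow answers highlight: without it, a bundle whose element count is not a multiple of the batch size silently drops its tail.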
[jira] [Updated] (BEAM-117) Implement the API for Static Display Metadata
[ https://issues.apache.org/jira/browse/BEAM-117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Chambers updated BEAM-117: -- Issue Type: New Feature (was: Bug) > Implement the API for Static Display Metadata > - > > Key: BEAM-117 > URL: https://issues.apache.org/jira/browse/BEAM-117 > Project: Beam > Issue Type: New Feature > Components: sdk-java-core >Reporter: Ben Chambers >Assignee: Scott Wegner > > As described in the following doc, we would like the SDK to allow associating > display metadata with PTransforms. > https://docs.google.com/document/d/11enEB9JwVp6vO0uOYYTMYTGkr3TdNfELwWqoiUg5ZxM/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (BEAM-36) TimestampedValueInMultipleWindows should use a more compact set representation
[ https://issues.apache.org/jira/browse/BEAM-36?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Chambers closed BEAM-36. Resolution: Won't Fix As per discussion on https://github.com/apache/incubator-beam/pull/43: "We looked some more at the original Jira issue and realized that it is likely a non-issue. It was created to track the fact we needed to examine our usage of a HashSet there, since we ran into problems with the over-allocation of a hash set (e.g., 64 slots to hold 23 items, etc.). When we have 1000 of these in memory at a time, the over-allocation starts to hurt. Upon further scrutiny, those WindowedValues should only be getting turned into a Set when we need to do equals or hashCode, to make sure we get an order-independent comparison. Assuming this is limited to tests, we can probably resolve the Jira issue as won't fix." > TimestampedValueInMultipleWindows should use a more compact set representation > -- > > Key: BEAM-36 > URL: https://issues.apache.org/jira/browse/BEAM-36 > Project: Beam > Issue Type: Improvement > Components: sdk-java-core >Reporter: Kenneth Knowles >Priority: Trivial > Labels: Windowing > > Today TimestampedValueInMultipleWindows converts its collection of windows to > a LinkedHashSet for comparisons and hashing. Since it is an immutable set, > more compact representations are available. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (BEAM-112) Using non-IntervalWindows seems to fail
Ben Chambers created BEAM-112: - Summary: Using non-IntervalWindows seems to fail Key: BEAM-112 URL: https://issues.apache.org/jira/browse/BEAM-112 Project: Beam Issue Type: Bug Components: runner-spark Reporter: Ben Chambers Assignee: Amit Sela Priority: Minor See here for more details: http://stackoverflow.com/questions/35993777/globalwindow-cannot-be-cast-to-intervalwindow The linked stack trace indicates this is using the Spark Runner: {noformat:title=Exception} java.lang.ClassCastException: com.google.cloud.dataflow.sdk.transforms.windowing.GlobalWindow cannot be cast to com.google.cloud.dataflow.sdk.transforms.windowing.IntervalWindow at com.google.cloud.dataflow.sdk.transforms.windowing.IntervalWindow$IntervalWindowCoder.encode(IntervalWindow.java:171) at com.google.cloud.dataflow.sdk.coders.IterableLikeCoder.encode(IterableLikeCoder.java:113) at com.google.cloud.dataflow.sdk.coders.IterableLikeCoder.encode(IterableLikeCoder.java:59) at com.google.cloud.dataflow.sdk.util.WindowedValue$FullWindowedValueCoder.encode(WindowedValue.java:599) at com.google.cloud.dataflow.sdk.util.WindowedValue$FullWindowedValueCoder.encode(WindowedValue.java:540) at com.cloudera.dataflow.spark.CoderHelpers.toByteArray(CoderHelpers.java:48) at com.cloudera.dataflow.spark.CoderHelpers$3.call(CoderHelpers.java:134) at com.cloudera.dataflow.spark.CoderHelpers$3.call(CoderHelpers.java:131) at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:1018) at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:1018) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:149) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:89) at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {noformat} It seems likely that at some point the Spark runner is assuming that all windows are IntervalWindows, which may not be true. Specifically the GlobalWindow+Triggers is valid, as is any custom implementation of BoundedWindow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (BEAM-85) DataflowAssert (BeamAssert ;) needs sanity check that it's used correctly
[ https://issues.apache.org/jira/browse/BEAM-85?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ben Chambers updated BEAM-85: - Description: We should validate two things: # DataflowAssert is not added to a pipeline that has already been run. # The pipeline is run after the DataflowAssert is added. If either of these is not validated, then it is possible that the test doesn't actually verify anything. This code should throw an assertion error or fail in some other way. {code} Pipeline p = TestPipeline.create(); PCollection<Boolean> value = p.apply(Create.of(Boolean.FALSE)); p.run(); DataflowAssert.thatSingleton(value).isEqualTo(true); {code} but it would pass silently. Similarly, this code will pass silently: {code} Pipeline p = TestPipeline.create(); PCollection<Boolean> value = p.apply(Create.of(Boolean.FALSE)); DataflowAssert.thatSingleton(value).isEqualTo(true); {code} was: It is important that assert is applied to the pipeline before the pipeline is run, otherwise it does not actually execute the test. This code should throw an assertion error or fail in some other way. {code} Pipeline p = TestPipeline.create(); PCollection<Boolean> value = p.apply(Create.of(Boolean.FALSE)); p.run(); DataflowAssert.thatSingleton(value).isEqualTo(true); {code} but it would pass silently. > DataflowAssert (BeamAssert ;) needs sanity check that it's used correctly > - > > Key: BEAM-85 > URL: https://issues.apache.org/jira/browse/BEAM-85 > Project: Beam > Issue Type: New Feature > Components: sdk-java-core >Reporter: Daniel Halperin > > We should validate two things: > # DataflowAssert is not added to a pipeline that has already been run. > # The pipeline is run after the DataflowAssert is added. > If either of these is not validated, then it is possible that the test > doesn't actually verify anything. > This code should throw an assertion error or fail in some other way. 
> {code} > Pipeline p = TestPipeline.create(); > PCollection<Boolean> value = p.apply(Create.of(Boolean.FALSE)); > p.run(); > DataflowAssert.thatSingleton(value).isEqualTo(true); > {code} > but it would pass silently. > Similarly, this code will pass silently: > {code} > Pipeline p = TestPipeline.create(); > PCollection<Boolean> value = p.apply(Create.of(Boolean.FALSE)); > DataflowAssert.thatSingleton(value).isEqualTo(true); > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
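The two validations described above could be sketched as state tracking on the pipeline object itself; this is a minimal illustration with stand-in classes, not the real {{TestPipeline}} / {{DataflowAssert}} APIs (all names hypothetical):

```java
// Hypothetical sketch: track whether the pipeline has run and how many
// asserts were attached, so misuse fails loudly instead of passing silently.
public class ValidatingPipeline {
  private boolean hasRun = false;
  private int assertsAdded = 0;

  // Check 1: reject asserts attached after the pipeline has already run,
  // since they would never execute.
  public void addAssert() {
    if (hasRun) {
      throw new IllegalStateException(
          "assert added after the pipeline was run; it will never execute");
    }
    assertsAdded++;
  }

  public void run() {
    hasRun = true;  // asserts execute as part of run()
  }

  // Check 2: at teardown (e.g. from a JUnit @Rule), fail if asserts were
  // attached but the pipeline was never run.
  public void tearDown() {
    if (assertsAdded > 0 && !hasRun) {
      throw new IllegalStateException(
          assertsAdded + " assert(s) added but the pipeline was never run");
    }
  }
}
```

The second check is the one that needs a hook at the end of the test, which is why a JUnit rule (as suggested in the comment below) is a natural home for it.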
[jira] [Commented] (BEAM-85) DataflowAssert (BeamAssert ;) needs sanity check that it's used correctly
[ https://issues.apache.org/jira/browse/BEAM-85?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174750#comment-15174750 ] Ben Chambers commented on BEAM-85: -- One possibility would be to use an {{@Rule}} instead. For instance: {code:java} @Rule public TestPipelineRule testPipeline; ... Pipeline p = testPipeline.create ... {code} Then the {{tearDown}} from the rule can validate proper usage. > DataflowAssert (BeamAssert ;) needs sanity check that it's used correctly > - > > Key: BEAM-85 > URL: https://issues.apache.org/jira/browse/BEAM-85 > Project: Beam > Issue Type: New Feature > Components: sdk-java-core >Reporter: Daniel Halperin >Assignee: Davor Bonaci > > It is important that assert is applied to the pipeline before the pipeline is > run, otherwise it does not actually execute the test. > This code should throw an assertion error or fail in some other way. > {code} > Pipeline p = TestPipeline.create(); > PCollection<Boolean> value = p.apply(Create.of(Boolean.FALSE)); > p.run(); > DataflowAssert.thatSingleton(value).isEqualTo(true); > {code} > but it would pass silently. -- This message was sent by Atlassian JIRA (v6.3.4#6332)