[jira] [Updated] (BEAM-1170) Streaming watermark should be easier to read

2016-12-16 Thread Ben Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Chambers updated BEAM-1170:
---
Component/s: runner-dataflow

> Streaming watermark should be easier to read
> 
>
> Key: BEAM-1170
> URL: https://issues.apache.org/jira/browse/BEAM-1170
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-dataflow
>Reporter: Joshua Litt
>Priority: Minor
>
> Currently, the only way to get at the streaming watermarks is through 
> scraping counter names. However, the watermarks are useful for determining if 
> a streaming job is 'done,' ie watermarks at infinity. We should consider 
> either exposing the watermarks through a GetWatermarks api or another 
> alternative might be a WATERMARKS_AT_INFINITY job state.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (BEAM-1169) MetricsTest matchers should loosen expectations on physical values

2016-12-16 Thread Ben Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15755377#comment-15755377
 ] 

Ben Chambers commented on BEAM-1169:


Alternatively, we may need the Runner to define how Metrics are matched, and 
the test just sets up the "correct" values. That would allow different 
runners/tests to impose reasonable restrictions.

> MetricsTest matchers should loosen expectations on physical values
> --
>
> Key: BEAM-1169
> URL: https://issues.apache.org/jira/browse/BEAM-1169
> Project: Beam
>  Issue Type: Sub-task
>  Components: beam-model, sdk-java-core, sdk-py
>Reporter: Ben Chambers
>
> We could use `atLeast(N)` rather than `equals(N)` for the attempted values, 
> but even that may be false without violating the behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (BEAM-1169) MetricsTest matchers should loosen expectations on physical values

2016-12-16 Thread Ben Chambers (JIRA)
Ben Chambers created BEAM-1169:
--

 Summary: MetricsTest matchers should loosen expectations on 
physical values
 Key: BEAM-1169
 URL: https://issues.apache.org/jira/browse/BEAM-1169
 Project: Beam
  Issue Type: Sub-task
Reporter: Ben Chambers


We could use `atLeast(N)` rather than `equals(N)` for the attempted values, but 
even that may be false without violating the behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (BEAM-1104) WordCount: Metrics error in the DirectRunner

2016-12-15 Thread Ben Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-1104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Chambers resolved BEAM-1104.

   Resolution: Fixed
Fix Version/s: 0.4.0-incubating

> WordCount: Metrics error in the DirectRunner
> 
>
> Key: BEAM-1104
> URL: https://issues.apache.org/jira/browse/BEAM-1104
> Project: Beam
>  Issue Type: Bug
>  Components: runner-direct
>Reporter: Daniel Halperin
>Assignee: Ben Chambers
> Fix For: 0.4.0-incubating
>
>
> I'm following the Beam quickstart to analyze the pom.xml for the examples 
> archetype in the DirectRunner:
> Generate the project:
> {code}
> mvn archetype:generate \
>   
> -DarchetypeRepository=https://repository.apache.org/content/groups/snapshots 
> \  
>   -DarchetypeGroupId=org.apache.beam \
>   -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \
>   -DarchetypeVersion=LATEST \
>   -DgroupId=org.example \
>   -DartifactId=word-count-beam \
>   -Dversion="0.1" \
>   -Dpackage=org.apache.beam.examples \
>   -DinteractiveMode=false
> {code}
> Count words in the pom.xml:
> {code}
> mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
>  -Dexec.args="--inputFile=pom.xml --output=direct/counts" -Pdirect-runner
> {code}
> The logs:
> {code}
> INFO] --- exec-maven-plugin:1.4.0:java (default-cli) @ word-count-beam ---
> Dec 07, 2016 9:42:03 PM org.apache.beam.sdk.io.FileBasedSource 
> expandFilePattern
> INFO: Matched 1 files for pattern pom.xml
> Dec 07, 2016 9:42:03 PM org.apache.beam.sdk.metrics.MetricsEnvironment 
> getCurrentContainer
> SEVERE: Unable to update metrics on the current thread. Most likely caused by 
> using metrics outside the managed work-execution thread.
> Dec 07, 2016 9:42:03 PM org.apache.beam.sdk.io.Write$Bound$1 processElement
> INFO: Initializing write operation 
> org.apache.beam.sdk.io.TextIO$TextSink$TextWriteOperation@26bbd1cf
> Dec 07, 2016 9:42:04 PM org.apache.beam.sdk.io.Write$Bound$WriteBundles 
> processElement
> INFO: Opening writer for write operation 
> org.apache.beam.sdk.io.TextIO$TextSink$TextWriteOperation@19371061
> Dec 07, 2016 9:42:04 PM org.apache.beam.sdk.io.Write$Bound$WriteBundles 
> processElement
> INFO: Opening writer for write operation 
> org.apache.beam.sdk.io.TextIO$TextSink$TextWriteOperation@19371061
> Dec 07, 2016 9:42:04 PM org.apache.beam.sdk.io.Write$Bound$WriteBundles 
> processElement
> INFO: Opening writer for write operation 
> org.apache.beam.sdk.io.TextIO$TextSink$TextWriteOperation@19371061
> Dec 07, 2016 9:42:04 PM org.apache.beam.sdk.io.Write$Bound$WriteBundles 
> processElement
> INFO: Opening writer for write operation 
> org.apache.beam.sdk.io.TextIO$TextSink$TextWriteOperation@19371061
> Dec 07, 2016 9:42:04 PM org.apache.beam.sdk.io.Write$Bound$2 processElement
> INFO: Finalizing write operation 
> org.apache.beam.sdk.io.TextIO$TextSink$TextWriteOperation@3701012a.
> {code}
> Presumably, this {{SEVERE}} warning is indicative of a bug (or should be 
> masked).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (BEAM-1126) Expose UnboundedSource split backlog in number of events

2016-12-14 Thread Ben Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15749901#comment-15749901
 ] 

Ben Chambers commented on BEAM-1126:


The majority of the work is getting Metrics supported well-enough to start 
removing Aggregators and moving code/examples/documentation towards Metrics. 
This work (for Java) is tracked in this issue 
http://issues.apache.org/jira/browse/BEAM-147.

I'm working on the Dataflow runner changes right now. The other runners could 
choose to implement Metrics either in a way similar to how they currently 
support Aggregators (providing the "committed" value across work that 
succeeded) or using their own Metrics mechanisms (providing an "attempted" 
value across all attempts at work).

> Expose UnboundedSource split backlog in number of events
> 
>
> Key: BEAM-1126
> URL: https://issues.apache.org/jira/browse/BEAM-1126
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-core
>Reporter: Aviem Zur
>Assignee: Daniel Halperin
>Priority: Minor
>
> Today {{UnboundedSource}} exposes split backlog in bytes via 
> {{getSplitBacklogBytes()}}
> There is value in exposing backlog in number of events as well, since this 
> number can be more human comprehensible than bytes. something like 
> {{getSplitBacklogEvents()}} or {{getSplitBacklogCount()}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (BEAM-1161) Update Javadoc/Examples/etc.

2016-12-14 Thread Ben Chambers (JIRA)
Ben Chambers created BEAM-1161:
--

 Summary: Update Javadoc/Examples/etc.
 Key: BEAM-1161
 URL: https://issues.apache.org/jira/browse/BEAM-1161
 Project: Beam
  Issue Type: Sub-task
Reporter: Ben Chambers






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (BEAM-1148) Port PAssert away from Aggregators

2016-12-14 Thread Ben Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15749188#comment-15749188
 ] 

Ben Chambers commented on BEAM-1148:


In general:

Aggregators are being replaced with Metrics for the "monitoring" use cases, as 
described in 
https://lists.apache.org/thread.html/08af5d8247c316f46f4dc1ec93173721f684109b8a9d41a4431558ec@%3Cdev.beam.apache.org%3E

As noted in that thread, they may reappear in the future as a more general 
shorthand for "side-output + combine", but they need to address things like 
windowing, and latter use within the pipeline to be really useful in that role.

Within PAssert:

They may be replaced with either Metrics (if possible) or with an explicit 
side-output + combine which should be semantically equivalent.

> Port PAssert away from Aggregators
> --
>
> Key: BEAM-1148
> URL: https://issues.apache.org/jira/browse/BEAM-1148
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-java-core
>Reporter: Kenneth Knowles
>
> One step in the removal of Aggregators (in favor of Metrics) is to remove our 
> reliance on them for PAssert checking.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (BEAM-1122) Change Dataflow profiling options to support saving to GCS

2016-12-09 Thread Ben Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Chambers resolved BEAM-1122.

   Resolution: Fixed
Fix Version/s: 0.4.0-incubating

> Change Dataflow profiling options to support saving to GCS
> --
>
> Key: BEAM-1122
> URL: https://issues.apache.org/jira/browse/BEAM-1122
> Project: Beam
>  Issue Type: New Feature
>  Components: runner-dataflow
>Affects Versions: 0.4.0-incubating
>Reporter: Ben Chambers
>Assignee: Ben Chambers
>Priority: Minor
>  Labels: backward-incompatible
> Fix For: 0.4.0-incubating
>
>
> Remove the `--enableProfilingAgent` flag and add a `--saveProfilesToGcs` flag 
> to the `DataflowProfilingOptions`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (BEAM-1122) Change Dataflow profiling options to support saving to GCS

2016-12-09 Thread Ben Chambers (JIRA)
Ben Chambers created BEAM-1122:
--

 Summary: Change Dataflow profiling options to support saving to GCS
 Key: BEAM-1122
 URL: https://issues.apache.org/jira/browse/BEAM-1122
 Project: Beam
  Issue Type: New Feature
  Components: runner-dataflow
Affects Versions: 0.4.0-incubating
Reporter: Ben Chambers
Assignee: Ben Chambers
Priority: Minor


Remove the `--enableProfilingAgent` flag and add a `--saveProfilesToGcs` flag 
to the `DataflowProfilingOptions`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (BEAM-775) Remove Aggregators from the Java SDK

2016-10-18 Thread Ben Chambers (JIRA)
Ben Chambers created BEAM-775:
-

 Summary: Remove Aggregators from the Java SDK
 Key: BEAM-775
 URL: https://issues.apache.org/jira/browse/BEAM-775
 Project: Beam
  Issue Type: Sub-task
  Components: sdk-java-core
Reporter: Ben Chambers
Assignee: Ben Chambers






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (BEAM-774) Implement Metrics support for Spark runenr

2016-10-18 Thread Ben Chambers (JIRA)
Ben Chambers created BEAM-774:
-

 Summary: Implement Metrics support for Spark runenr
 Key: BEAM-774
 URL: https://issues.apache.org/jira/browse/BEAM-774
 Project: Beam
  Issue Type: Sub-task
  Components: runner-spark
Reporter: Ben Chambers
Assignee: Amit Sela






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (BEAM-458) Support for Flink Metrics

2016-10-18 Thread Ben Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Chambers closed BEAM-458.
-
   Resolution: Duplicate
Fix Version/s: Not applicable

> Support for Flink Metrics 
> --
>
> Key: BEAM-458
> URL: https://issues.apache.org/jira/browse/BEAM-458
> Project: Beam
>  Issue Type: New Feature
>  Components: beam-model
>Reporter: Sumit Chawla
> Fix For: Not applicable
>
>
> Flink has added support for CodeHale Metrics 
> (https://ci.apache.org/projects/flink/flink-docs-master/monitoring/metrics.html)
> These metrics are more advanced then the current Accumulators. 
> Adding support for these to Beam level should be a good addition.
> https://github.com/apache/flink/pull/1947#issuecomment-233029166



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (BEAM-773) Implement Metrics support for Flink runner

2016-10-18 Thread Ben Chambers (JIRA)
Ben Chambers created BEAM-773:
-

 Summary: Implement Metrics support for Flink runner
 Key: BEAM-773
 URL: https://issues.apache.org/jira/browse/BEAM-773
 Project: Beam
  Issue Type: Sub-task
  Components: runner-flink
Reporter: Ben Chambers






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (BEAM-772) Implement Metrics support for Dataflow Runner

2016-10-18 Thread Ben Chambers (JIRA)
Ben Chambers created BEAM-772:
-

 Summary: Implement Metrics support for Dataflow Runner
 Key: BEAM-772
 URL: https://issues.apache.org/jira/browse/BEAM-772
 Project: Beam
  Issue Type: Sub-task
  Components: runner-dataflow
Reporter: Ben Chambers
Assignee: Ben Chambers






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (BEAM-147) Introduce an easy API for pipeline metrics

2016-10-13 Thread Ben Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Chambers updated BEAM-147:
--
Summary: Introduce an easy API for pipeline metrics  (was: Introduce an 
easy API pipeline metrics)

> Introduce an easy API for pipeline metrics
> --
>
> Key: BEAM-147
> URL: https://issues.apache.org/jira/browse/BEAM-147
> Project: Beam
>  Issue Type: Bug
>  Components: beam-model, sdk-java-core, sdk-py
>Reporter: Robert Bradshaw
>Assignee: Ben Chambers
>
> The existing Aggregators are confusing both because of their name and because 
> they serve multiple purposes.
> Previous discussions around Aggregators/metrics/etc.
> See discussion at 
> http://mail-archives.apache.org/mod_mbox/incubator-beam-user/201603.mbox/browser
>  and 
> http://mail-archives.apache.org/mod_mbox/incubator-beam-dev/201603.mbox/browser
>  . Exact name still being bikeshedded.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (BEAM-726) Standardize naming of PipelineResult objects

2016-10-06 Thread Ben Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15553367#comment-15553367
 ] 

Ben Chambers commented on BEAM-726:
---

Great point. We could consider also choosing a different suffix. Perhaps 
"PipelineExecution" or "PipelineHandle" or something like that. That would need 
a little more discussion since any user-code mentioning PipelineResult would 
then break. If we decide to rename it, it seems like it would be good to do 
these together.

> Standardize naming of PipelineResult objects
> 
>
> Key: BEAM-726
> URL: https://issues.apache.org/jira/browse/BEAM-726
> Project: Beam
>  Issue Type: Bug
>  Components: beam-model
>Reporter: Ben Chambers
>Assignee: Frances Perry
>Priority: Minor
>
> Today:
> PipelineResult is an interface returned by running a pipeline.
> DataflowPipelineJob is the Dataflow implementation of that interface
> FlinkRunnerResult is the Flink implementation
> EvaluationContext is the Spark implementation
> DirectPipelineResult is the DirectRunner implementation
> Ideally, all the names would indicate that they are a PipelineResult, like 
> the DirectRunner does.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (BEAM-726) Standardize naming of PipelineResult objects

2016-10-06 Thread Ben Chambers (JIRA)
Ben Chambers created BEAM-726:
-

 Summary: Standardize naming of PipelineResult objects
 Key: BEAM-726
 URL: https://issues.apache.org/jira/browse/BEAM-726
 Project: Beam
  Issue Type: Bug
  Components: beam-model
Reporter: Ben Chambers
Assignee: Frances Perry
Priority: Minor


Today:

PipelineResult is an interface returned by running a pipeline.
DataflowPipelineJob is the Dataflow implementation of that interface
FlinkRunnerResult is the Flink implementation
EvaluationContext is the Spark implementation
DirectPipelineResult is the DirectRunner implementation

Ideally, all the names would indicate that they are a PipelineResult, like the 
DirectRunner does.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (BEAM-696) Side-Inputs non-deterministic with merging main-input windows

2016-09-30 Thread Ben Chambers (JIRA)
Ben Chambers created BEAM-696:
-

 Summary: Side-Inputs non-deterministic with merging main-input 
windows
 Key: BEAM-696
 URL: https://issues.apache.org/jira/browse/BEAM-696
 Project: Beam
  Issue Type: Bug
  Components: beam-model
Reporter: Ben Chambers
Assignee: Frances Perry


Side-Inputs are non-deterministic for several reasons:
1. Because they depend on triggering of the side-input (this is acceptable 
because triggers are by their nature non-deterministic).
2. They depend on the current state of the main-input window in order to lookup 
the side-input. This means that with merging
3. Any runner optimizations that affect when the side-input is looked up may 
cause problems with either or both of these.

This issue focuses on #2 -- the non-determinism of side-inputs that execute 
within a Merging WindowFn.

Possible solution would be to defer running anything that looks up the 
side-input until we need to extract an output, and using the main-window at 
that point. Specifically, if the main-window is a MergingWindowFn, don't 
execute any kind of pre-combine, instead buffer all the inputs and combine 
later.

This could still run into some non-determinism if there are triggers 
controlling when we extract output.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (BEAM-694) TriggerTester doesn't test timer firings

2016-09-29 Thread Ben Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534161#comment-15534161
 ] 

Ben Chambers commented on BEAM-694:
---

I think this is partly intentional with the newer formulation of Triggers and 
how they interact with timers (eg., the Triggers as predicates). Specifically, 
Triggers no longer "receive" timers. Instead, a timer is just an indication 
that the trigger would like to be re-evaluated at some point in time. So, what 
we should probably do is:

1. Test that triggers *set* reasonable timers (this ensures they get woken up 
at reasonable points in time)
2. Separately test that triggers behave correctly when they are woken up (via a 
call to `shouldFire`).

It is important to actually do these separately, since `shouldFire` may be 
called for other reasons as well (such as when the watermark is passing the end 
of the window). There may be no timer from the trigger, but it may still get a 
chance to trigger or not.

If I understand the problem, it is that we're missing the tests for 1. I don't 
think we should necessarily tie the two together in the tests since they are 
not coupled in the actual implementation.

> TriggerTester doesn't test timer firings
> 
>
> Key: BEAM-694
> URL: https://issues.apache.org/jira/browse/BEAM-694
> Project: Beam
>  Issue Type: Bug
>Reporter: Eugene Kirpichov
>
> TriggerTester exposes a `fireIfShouldFire(BoundedWIndow)` method. This is 
> used to prompt a call to the trigger with the current state of the trigger 
> tester (Input Watermarks, elements present, etc), and see if the trigger 
> should fire.
> The TriggerTester should automatically call back to the trigger with the 
> current state whenever a Timer fires, as specified by the current watermarks 
> and any Timers set by the trigger under test. This ensures that Triggers set 
> underlying timers properly, so the trigger will fire even if no additional 
> elements arrive.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (BEAM-147) Introduce an easy API pipeline metrics

2016-09-28 Thread Ben Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15530952#comment-15530952
 ] 

Ben Chambers commented on BEAM-147:
---

Repurposing this issue to track the actual goal of having an easy-to-use API 
with an easy-to-discover name for reporting pipeline metrics.

> Introduce an easy API pipeline metrics
> --
>
> Key: BEAM-147
> URL: https://issues.apache.org/jira/browse/BEAM-147
> Project: Beam
>  Issue Type: Bug
>  Components: beam-model, sdk-java-core, sdk-py
>Reporter: Robert Bradshaw
>Assignee: Ben Chambers
>
> The existing Aggregators are confusing both because of their name and because 
> they serve multiple purposes.
> Previous discussions around Aggregators/metrics/etc.
> See discussion at 
> http://mail-archives.apache.org/mod_mbox/incubator-beam-user/201603.mbox/browser
>  and 
> http://mail-archives.apache.org/mod_mbox/incubator-beam-dev/201603.mbox/browser
>  . Exact name still being bikeshedded.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (BEAM-147) Introduce an easy API pipeline metrics

2016-09-28 Thread Ben Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Chambers updated BEAM-147:
--
Description: 
The existing Aggregators are confusing both because of their name and because 
they serve multiple purposes.

Previous discussions around Aggregators/metrics/etc.
See discussion at 
http://mail-archives.apache.org/mod_mbox/incubator-beam-user/201603.mbox/browser
 and 
http://mail-archives.apache.org/mod_mbox/incubator-beam-dev/201603.mbox/browser 
. Exact name still being bikeshedded.

  was:
The existing Aggregators are confusing both because of their name and because 
they serve multiple purposes.

See discussion at 
http://mail-archives.apache.org/mod_mbox/incubator-beam-user/201603.mbox/browser
 and 
http://mail-archives.apache.org/mod_mbox/incubator-beam-dev/201603.mbox/browser 
. Exact name still being bikeshedded.


> Introduce an easy API pipeline metrics
> --
>
> Key: BEAM-147
> URL: https://issues.apache.org/jira/browse/BEAM-147
> Project: Beam
>  Issue Type: Bug
>  Components: beam-model, sdk-java-core, sdk-py
>Reporter: Robert Bradshaw
>Assignee: Ben Chambers
>
> The existing Aggregators are confusing both because of their name and because 
> they serve multiple purposes.
> Previous discussions around Aggregators/metrics/etc.
> See discussion at 
> http://mail-archives.apache.org/mod_mbox/incubator-beam-user/201603.mbox/browser
>  and 
> http://mail-archives.apache.org/mod_mbox/incubator-beam-dev/201603.mbox/browser
>  . Exact name still being bikeshedded.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (BEAM-147) Introduce an easy API pipeline metrics

2016-09-28 Thread Ben Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Chambers updated BEAM-147:
--
Description: 
The existing Aggregators are confusing both because of their name and because 
they serve multiple purposes.

See discussion at 
http://mail-archives.apache.org/mod_mbox/incubator-beam-user/201603.mbox/browser
 and 
http://mail-archives.apache.org/mod_mbox/incubator-beam-dev/201603.mbox/browser 
. Exact name still being bikeshedded.

  was:
The name "Aggregator" is confusing.

See discussion at 
http://mail-archives.apache.org/mod_mbox/incubator-beam-user/201603.mbox/browser
 and 
http://mail-archives.apache.org/mod_mbox/incubator-beam-dev/201603.mbox/browser 
. Exact name still being bikeshedded.


> Introduce an easy API pipeline metrics
> --
>
> Key: BEAM-147
> URL: https://issues.apache.org/jira/browse/BEAM-147
> Project: Beam
>  Issue Type: Bug
>  Components: beam-model, sdk-java-core, sdk-py
>Reporter: Robert Bradshaw
>Assignee: Ben Chambers
>
> The existing Aggregators are confusing both because of their name and because 
> they serve multiple purposes.
> See discussion at 
> http://mail-archives.apache.org/mod_mbox/incubator-beam-user/201603.mbox/browser
>  and 
> http://mail-archives.apache.org/mod_mbox/incubator-beam-dev/201603.mbox/browser
>  . Exact name still being bikeshedded.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (BEAM-147) Introduce an easy API pipeline metrics

2016-09-28 Thread Ben Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Chambers updated BEAM-147:
--
Summary: Introduce an easy API pipeline metrics  (was: Rename Aggregator to 
[P]Metric)

> Introduce an easy API pipeline metrics
> --
>
> Key: BEAM-147
> URL: https://issues.apache.org/jira/browse/BEAM-147
> Project: Beam
>  Issue Type: Bug
>  Components: beam-model, sdk-java-core, sdk-py
>Reporter: Robert Bradshaw
>Assignee: Ben Chambers
>
> The name "Aggregator" is confusing.
> See discussion at 
> http://mail-archives.apache.org/mod_mbox/incubator-beam-user/201603.mbox/browser
>  and 
> http://mail-archives.apache.org/mod_mbox/incubator-beam-dev/201603.mbox/browser
>  . Exact name still being bikeshedded.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (BEAM-681) DoFns should be serialized at apply time and deserialized when executing

2016-09-26 Thread Ben Chambers (JIRA)
Ben Chambers created BEAM-681:
-

 Summary: DoFns should be serialized at apply time and deserialized 
when executing
 Key: BEAM-681
 URL: https://issues.apache.org/jira/browse/BEAM-681
 Project: Beam
  Issue Type: Improvement
  Components: sdk-py
Reporter: Ben Chambers
Assignee: Frances Perry


1. Serializing DoFns at application time ensures that any modifications of 
fields within the DoFn after application do not accidentally pollute the 
execution. This mirrors the approach taken in Java to provide an approximation 
of lexical-closure (eg., you only need to know the state of the DoFn at the 
time it was applied, not afterwards, to understand its behavior).

2. Based on 1, the DIrectRunner should also be deserializing DoFns before 
running them, which should also detect other classes of errors such as using 
the pipeline object (which is not pickleable) within the DoFn



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (BEAM-147) Rename Aggregator to [P]Metric

2016-09-22 Thread Ben Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Chambers reassigned BEAM-147:
-

Assignee: Ben Chambers  (was: Frances Perry)

> Rename Aggregator to [P]Metric
> --
>
> Key: BEAM-147
> URL: https://issues.apache.org/jira/browse/BEAM-147
> Project: Beam
>  Issue Type: Bug
>  Components: beam-model, sdk-java-core, sdk-py
>Reporter: Robert Bradshaw
>Assignee: Ben Chambers
>
> The name "Aggregator" is confusing.
> See discussion at 
> http://mail-archives.apache.org/mod_mbox/incubator-beam-user/201603.mbox/browser
>  and 
> http://mail-archives.apache.org/mod_mbox/incubator-beam-dev/201603.mbox/browser
>  . Exact name still being bikeshedded.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (BEAM-661) CalendarWindows#isCompatibleWith should use equals instead of ==

2016-09-21 Thread Ben Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Chambers closed BEAM-661.
-
   Resolution: Duplicate
Fix Version/s: Not applicable

> CalendarWindows#isCompatibleWith should use equals instead of ==
> 
>
> Key: BEAM-661
> URL: https://issues.apache.org/jira/browse/BEAM-661
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Ben Chambers
>Assignee: Davor Bonaci
>Priority: Minor
> Fix For: Not applicable
>
>
> http://stackoverflow.com/questions/39617897/inputs-to-flatten-had-incompatible-window-windowfns-when-cogroupbykey-with-calen
> We're using `==` instead of `.equals` to compare objects, which causes 
> equivalent CalendarWindows to be incompatible.
> https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/windowing/CalendarWindows.java#L143



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (BEAM-661) CalendarWindows#isCompatibleWith should use equals instead of ==

2016-09-21 Thread Ben Chambers (JIRA)
Ben Chambers created BEAM-661:
-

 Summary: CalendarWindows#isCompatibleWith should use equals 
instead of ==
 Key: BEAM-661
 URL: https://issues.apache.org/jira/browse/BEAM-661
 Project: Beam
  Issue Type: Bug
  Components: sdk-java-core
Reporter: Ben Chambers
Assignee: Davor Bonaci
Priority: Minor


http://stackoverflow.com/questions/39617897/inputs-to-flatten-had-incompatible-window-windowfns-when-cogroupbykey-with-calen

We're using `==` instead of `.equals` to compare objects, which causes 
equivalent CalendarWindows to be incompatible.

https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/windowing/CalendarWindows.java#L143



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (BEAM-644) Primitive to shift the watermark while assigning timestamps

2016-09-20 Thread Ben Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15507379#comment-15507379
 ] 

Ben Chambers commented on BEAM-644:
---

Minor note on "A function from TimestampedElement to new timestamp that 
always falls within D of the original timestamp."

Rather than "within D" I think the requirement is that for an input with 
timestamp t, the output timestamp is >= t+D. This ensures that the output 
timestamps relation to the output watermark is no later than the input 
timestamps relation to the input watermark.

> Primitive to shift the watermark while assigning timestamps
> ---
>
> Key: BEAM-644
> URL: https://issues.apache.org/jira/browse/BEAM-644
> Project: Beam
>  Issue Type: New Feature
>  Components: beam-model
>Reporter: Kenneth Knowles
>Assignee: Kenneth Knowles
>
> There is a general need, especially important in the presence of 
> SplittableDoFn, to be able to assign new timestamps to elements without 
> making them late or droppable.
>  - DoFn.withAllowedTimestampSkew is inadequate, because it simply allows one 
> to produce late data, but does not allow one to shift the watermark so the 
> new data is on-time.
>  - For a SplittableDoFn, one may receive an element such as the name of a log 
> file that contains elements for the day preceding the log file. The timestamp 
> on the filename must currently be the beginning of the log. If such elements 
> are constantly flowing, it may be OK, but since we don't know that element is 
> coming, in that absence of data, the watermark may advance. We need a way to 
> keep it far enough back even in the absence of data holding it back.
> One idea is a new primitive ShiftWatermark / AdjustTimestamps with the 
> following pieces:
>  - A constant duration (positive or negative) D by which to shift the 
> watermark.
>  - A function from TimestampedElement to new timestamp that always falls 
> within D of the original timestamp.
> With this primitive added, outputWithTimestamp and withAllowedTimestampSkew 
> could be removed, simplifying DoFn.
> Alternatively, all of this functionality could be bolted on to DoFn.
> This ticket is not a proposal, but a record of the issue and ideas that were 
> mentioned.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (BEAM-37) Run DoFnWithContext without conversion to vanilla DoFn

2016-07-15 Thread Ben Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-37?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15379728#comment-15379728
 ] 

Ben Chambers commented on BEAM-37:
--

Notes on remaining tasks to complete this:

1. Modify DoFnRunner to allow executing a DoFnWithContext
2. Modify any runner specific code to use DoFnRunner when a DoFnWithContext is 
received.
3. Remove wrapping of DoFnWithContexts from DataflowRunner (and others) to use 
the modified code above.

> Run DoFnWithContext without conversion to vanilla DoFn
> --
>
> Key: BEAM-37
> URL: https://issues.apache.org/jira/browse/BEAM-37
> Project: Beam
>  Issue Type: Sub-task
>  Components: runner-core
>Reporter: Kenneth Knowles
>Assignee: Ben Chambers
>
> DoFnWithContext is an enhanced DoFn where annotations and parameter lists are 
> inspected to determine whether it accesses windowing information, etc.
> Today, each feature of DoFnWithContext requires implementation on DoFn, which 
> precludes the easy addition of features that we don't have designs for in 
> DoFn.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (BEAM-377) BigQueryIO should validate a table or query to read from

2016-06-25 Thread Ben Chambers (JIRA)
Ben Chambers created BEAM-377:
-

 Summary: BigQueryIO should validate a table or query to read from
 Key: BEAM-377
 URL: https://issues.apache.org/jira/browse/BEAM-377
 Project: Beam
  Issue Type: Bug
  Components: sdk-java-extensions
Reporter: Ben Chambers
Assignee: Ben Chambers
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (BEAM-372) CoderProperties: Test that the coder doesn't consume more bytes than it produces

2016-06-23 Thread Ben Chambers (JIRA)
Ben Chambers created BEAM-372:
-

 Summary: CoderProperties: Test that the coder doesn't consume more 
bytes than it produces
 Key: BEAM-372
 URL: https://issues.apache.org/jira/browse/BEAM-372
 Project: Beam
  Issue Type: Bug
  Components: sdk-java-core
Reporter: Ben Chambers
Assignee: Davor Bonaci
Priority: Minor


Add a test to CoderProperties that does the following:

1. Encode a value using the Coder
2. Add a byte at the end of the encoded array
3. Decode the value using the Coder
4. Verify the extra byte was not consumed

(This could possibly just be an enhancement to the existing round-trip 
encode/decode test)

When this fails it can lead to very difficult to debug situations in a coder 
wrapped around the problematic coder. This would be an easy test that would 
clearly fail *for the coder which was problematic*.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (BEAM-359) AvroCoder should be able to handle anonymous classes as schemas

2016-06-22 Thread Ben Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Chambers closed BEAM-359.
-
   Resolution: Fixed
Fix Version/s: 0.2.0-incubating

> AvroCoder should be able to handle anonymous classes as schemas
> ---
>
> Key: BEAM-359
> URL: https://issues.apache.org/jira/browse/BEAM-359
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Affects Versions: 0.1.0-incubating, 0.2.0-incubating
>Reporter: Daniel Mills
>Assignee: Ben Chambers
> Fix For: 0.2.0-incubating
>
>
> Currently, the determinism checker NPEs with:
> java.lang.IllegalArgumentException: Unable to get field id from class null
> at 
> com.google.cloud.dataflow.sdk.coders.AvroCoder$AvroDeterminismChecker.getField(AvroCoder.java:710)
> at 
> com.google.cloud.dataflow.sdk.coders.AvroCoder$AvroDeterminismChecker.checkRecord(AvroCoder.java:548)
> at 
> com.google.cloud.dataflow.sdk.coders.AvroCoder$AvroDeterminismChecker.doCheck(AvroCoder.java:477)
> at 
> com.google.cloud.dataflow.sdk.coders.AvroCoder$AvroDeterminismChecker.recurse(AvroCoder.java:453)
> at 
> com.google.cloud.dataflow.sdk.coders.AvroCoder$AvroDeterminismChecker.checkRecord(AvroCoder.java:567)
> at 
> com.google.cloud.dataflow.sdk.coders.AvroCoder$AvroDeterminismChecker.doCheck(AvroCoder.java:477)
> at 
> com.google.cloud.dataflow.sdk.coders.AvroCoder$AvroDeterminismChecker.recurse(AvroCoder.java:453)
> at 
> com.google.cloud.dataflow.sdk.coders.AvroCoder$AvroDeterminismChecker.check(AvroCoder.java:430)
> at 
> com.google.cloud.dataflow.sdk.coders.AvroCoder.(AvroCoder.java:189)
> at com.google.cloud.dataflow.sdk.coders.AvroCoder.of(AvroCoder.java:144)
> at mypackage.GenericsTest$1.create(GenericsTest.java:102)
> at 
> com.google.cloud.dataflow.sdk.coders.CoderRegistry.getDefaultCoderFromFactory(CoderRegistry.java:797)
> at 
> com.google.cloud.dataflow.sdk.coders.CoderRegistry.getDefaultCoder(CoderRegistry.java:748)
> at 
> com.google.cloud.dataflow.sdk.coders.CoderRegistry.getDefaultCoder(CoderRegistry.java:719)
> at 
> com.google.cloud.dataflow.sdk.coders.CoderRegistry.getDefaultCoder(CoderRegistry.java:696)
> at 
> com.google.cloud.dataflow.sdk.coders.CoderRegistry.getDefaultCoder(CoderRegistry.java:178)
> at 
> com.google.cloud.dataflow.sdk.values.TypedPValue.inferCoderOrFail(TypedPValue.java:147)
> at 
> com.google.cloud.dataflow.sdk.values.TypedPValue.getCoder(TypedPValue.java:48)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (BEAM-366) Support Display Data on Composite Transforms

2016-06-21 Thread Ben Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342229#comment-15342229
 ] 

Ben Chambers commented on BEAM-366:
---

As mentioned, representing composites explicitly in the Dataflow runner is 
currently dependent on the improved Runner API.

> Support Display Data on Composite Transforms
> 
>
> Key: BEAM-366
> URL: https://issues.apache.org/jira/browse/BEAM-366
> Project: Beam
>  Issue Type: Bug
>  Components: runner-dataflow
>Reporter: Ben Chambers
>
> Today, Dataflow doesn't represent composites all the way to the UI (it 
> reconstructs them from the name). This means it doesn't support attaching 
> Display Data to composites.
> With the runner API refactoring, Dataflow should start supporting composites, 
> at which point we should make sure that Display Data is plumbed through 
> properly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (BEAM-366) Support Display Data on Composite Transforms

2016-06-21 Thread Ben Chambers (JIRA)
Ben Chambers created BEAM-366:
-

 Summary: Support Display Data on Composite Transforms
 Key: BEAM-366
 URL: https://issues.apache.org/jira/browse/BEAM-366
 Project: Beam
  Issue Type: Bug
  Components: runner-dataflow
Reporter: Ben Chambers
Assignee: Davor Bonaci


Today, Dataflow doesn't represent composites all the way to the UI (it 
reconstructs them from the name). This means it doesn't support attaching 
Display Data to composites.

With the runner API refactoring, Dataflow should start supporting composites, 
at which point we should make sure that Display Data is plumbed through 
properly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (BEAM-344) Merge, Split, Delay, and Reorder bundles in the DirectRunner

2016-06-16 Thread Ben Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Chambers updated BEAM-344:
--
Summary: Merge, Split, Delay, and Reorder bundles in the DirectRunner  
(was: Merge, Split, Delay, and Reorder bundles in the InProcessPipelineRunner)

> Merge, Split, Delay, and Reorder bundles in the DirectRunner
> 
>
> Key: BEAM-344
> URL: https://issues.apache.org/jira/browse/BEAM-344
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-direct
>Reporter: Thomas Groh
>Assignee: Thomas Groh
>Priority: Minor
>
> PCollections have no guarantees about the ordering of elements between steps
> Randomly reordering, splitting and merging bundles will break user pipelines 
> which assume some order will be maintained between steps and within a bundle. 
> This assumption is not correct, so should break in the Direct Runner.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (BEAM-352) Add DisplayData to HDFS Sources

2016-06-16 Thread Ben Chambers (JIRA)
Ben Chambers created BEAM-352:
-

 Summary: Add DisplayData to HDFS Sources
 Key: BEAM-352
 URL: https://issues.apache.org/jira/browse/BEAM-352
 Project: Beam
  Issue Type: Bug
  Components: sdk-java-extensions
Reporter: Ben Chambers
Assignee: James Malone
Priority: Minor


Any interesting parameters of the sources/sinks should be exposed as display 
data. See any of the sources/sinks that already export this (BigQuery, PubSub, 
etc.) for examples. Also look at the DisplayData builder and HasDisplayData 
interface for how to wire these up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (BEAM-351) Add DisplayData to KafkaIO

2016-06-16 Thread Ben Chambers (JIRA)
Ben Chambers created BEAM-351:
-

 Summary: Add DisplayData to KafkaIO
 Key: BEAM-351
 URL: https://issues.apache.org/jira/browse/BEAM-351
 Project: Beam
  Issue Type: Bug
  Components: sdk-java-extensions
Reporter: Ben Chambers
Assignee: James Malone
Priority: Minor


Any interesting parameters of the sources/sinks should be exposed as display 
data. See any of the sources/sinks that already export this (BigQuery, PubSub, 
etc.) for examples. Also look at the DisplayData builder and HasDisplayData 
interface for how to wire these up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (BEAM-324) Improve TypeDescriptor inference of DoFn's created inside a generic PTransform

2016-06-02 Thread Ben Chambers (JIRA)
Ben Chambers created BEAM-324:
-

 Summary: Improve TypeDescriptor inference of DoFn's created inside 
a generic PTransform
 Key: BEAM-324
 URL: https://issues.apache.org/jira/browse/BEAM-324
 Project: Beam
  Issue Type: Bug
  Components: beam-model
Reporter: Ben Chambers
Assignee: Frances Perry
Priority: Minor


Commit 
https://github.com/apache/incubator-beam/commit/aa7f07fa5b22f3656d52dc9e1d4557bceb87c013
 introduced the ability to infer a {{TypeDescriptor}} from an object created 
inside a concrete instance of a {{PTransform}} and used it to simplify 
{{SimpleFunction}} usage.

We should probably look at using the same mechanism elsewhere, such as when 
inferring the output type of a {{ParDo}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (BEAM-117) Implement the API for Static Display Metadata

2016-05-25 Thread Ben Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Chambers resolved BEAM-117.
---
   Resolution: Fixed
Fix Version/s: 0.1.0-incubating

> Implement the API for Static Display Metadata
> -
>
> Key: BEAM-117
> URL: https://issues.apache.org/jira/browse/BEAM-117
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-java-core
>Reporter: Ben Chambers
>Assignee: Scott Wegner
> Fix For: 0.1.0-incubating
>
>
> As described in the following doc, we would like the SDK to allow associating 
> display metadata with PTransforms.
> https://docs.google.com/document/d/11enEB9JwVp6vO0uOYYTMYTGkr3TdNfELwWqoiUg5ZxM/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (BEAM-175) Leak garbage collection timers in GlobalWindow

2016-05-16 Thread Ben Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284759#comment-15284759
 ] 

Ben Chambers commented on BEAM-175:
---

I think in general this seems OK. I like that we're making the behavior 
explicit rather than trying to guessi t.

1. This may be too much configuration. There may be other/better ways of 
getting pane indices (eg., once we have a sink API or a state API). We should 
make sure we understand the use cases before exposing the knob. Especially 
since ReduceFnRunner is already complicated -- this is just adding more 
permutations of cases it needs to handle.
2. I worry that the default, at least in the Pane Index case, is actually 
*less* performant than the ZERO case, which is likely what was desired 90% of 
the time. If we go this direction, I would propose we change the default.
3. You should flesh this out to address error cases. When do we detect that the 
user is accessing the PaneIndex with the ZERO behavior specified. What kind of 
error message do they get? Etc.

> Leak garbage collection timers in GlobalWindow
> --
>
> Key: BEAM-175
> URL: https://issues.apache.org/jira/browse/BEAM-175
> Project: Beam
>  Issue Type: Bug
>  Components: runner-core
>Reporter: Mark Shields
>Assignee: Mark Shields
>
> Consider the  transform:
>   Window
> .into(new GlobalWindows())
> .triggering(
>   Repeatedly.forever(
> AfterProcessingTime.pastFirstElementInPane().plusDelayOf(...)))
> .discardingFiredPanes()
> This is a common idiom for 'process elements bunched by arrival time'.
> Currently we create an end-of-window timer per key, which clearly will only 
> fire if the pipeline is drained.
> Better would be to avoid creating end-of-window timers if there's no state 
> which needs to be processed at end-of-window (ie at drain if the Global 
> window).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (BEAM-247) CombineFn's only definable/usable inside sdk.transforms package

2016-05-02 Thread Ben Chambers (JIRA)
Ben Chambers created BEAM-247:
-

 Summary: CombineFn's only definable/usable inside sdk.transforms 
package
 Key: BEAM-247
 URL: https://issues.apache.org/jira/browse/BEAM-247
 Project: Beam
  Issue Type: Bug
  Components: sdk-java-core
Reporter: Ben Chambers
Assignee: Pei He
Priority: Critical


{code:java}
public abstract static class CombineFn
  extends AbstractGlobalCombineFn { /* ... */ }
abstract static class AbstractGlobalCombineFn
  implements GlobalCombineFn, Serializable { /* ... */ 
}
{code}

Since {{AbstractGlobalCombineFn}} is package protected (and therefore not 
visible outside of the {{transform}} package, it is not possible to cast any 
class that extends {{CombineFn}} to a {{GlobalCombineFn}} outside of this 
package.

This prevents applying existing {{CombineFn}}s directly (such as 
{{Combine.perKey(new Sum.SumIntegersFn())}}, as used in our documentation) and 
also means that a user cannot define their own {{CombineFn}} unless they put 
them in the {{transform}} package.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (BEAM-210) Be consistent with emitting final empty panes

2016-04-19 Thread Ben Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248532#comment-15248532
 ] 

Ben Chambers commented on BEAM-210:
---

Most important, I think, is to get rid of the implicit "if you have a watermark 
trigger, you must want an ON_TIME pane even if it is empty".

I don't know if its "wins", so much as we produce an empty pane if either (the 
pane is ON_TIME and we think you want an empty ON_TIME pane) OR (the pane is 
final and we know you want an empty final pane).

+1 to a clarifying rename, although I don't think "incremental" and 
"replacement" actually clarify that much. I would suggest that we standardize 
on a pane as "the elements since the last triggering". When we output, we 
produce either the current pane (DISCARDING mode), or all of the accumulated 
panes (ACCUMULATING mode). So calling these something like outputCurrentPane 
and outputCumulativePanes (or outputAccumulatedPanes or something like that) 
may be clear? This is orthogonal, and should probably move out of this issue 
and on to the dev list when we want to perform said renaming.

> Be consistent with emitting final empty panes
> -
>
> Key: BEAM-210
> URL: https://issues.apache.org/jira/browse/BEAM-210
> Project: Beam
>  Issue Type: Bug
>  Components: runner-core
>Reporter: Mark Shields
>Assignee: Mark Shields
>
> Currently ReduceFnRunner.onTrigger uses shouldEmit to prevent empty final 
> panes unless the user has requested them.
> The same check needs to be done in ReduceFnRunner.onTimer



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (BEAM-210) Be consistent with emitting final empty panes

2016-04-19 Thread Ben Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248490#comment-15248490
 ] 

Ben Chambers commented on BEAM-210:
---

No. This is in the weird spot. When we say "empty pane" in terms of whether or 
not to produce an ON_TIME or final pane, we mean the elements that were new in 
that pane. I think we should standardize on that. When we "accumulate fired 
panes" it means we retain previous panes, and the *output* in this pane 
contains data from those earlier ones. But *this* pane may still be 
empty/non-empty.

I tried to reproduce with a test, but was unable to: 
https://github.com/apache/incubator-beam/pull/211. Still probably worth 
checking in to improve test coverage of that case.

> Be consistent with emitting final empty panes
> -
>
> Key: BEAM-210
> URL: https://issues.apache.org/jira/browse/BEAM-210
> Project: Beam
>  Issue Type: Bug
>  Components: runner-core
>Reporter: Mark Shields
>Assignee: Mark Shields
>
> Currently ReduceFnRunner.onTrigger uses shouldEmit to prevent empty final 
> panes unless the user has requested them.
> The same check needs to be done in ReduceFnRunner.onTimer



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (BEAM-184) Using Merging Windows and/or Triggers without a downstream aggregation should fail

2016-04-08 Thread Ben Chambers (JIRA)
Ben Chambers created BEAM-184:
-

 Summary: Using Merging Windows and/or Triggers without a 
downstream aggregation should fail
 Key: BEAM-184
 URL: https://issues.apache.org/jira/browse/BEAM-184
 Project: Beam
  Issue Type: Bug
  Components: sdk-java-core
Reporter: Ben Chambers
Assignee: Davor Bonaci
Priority: Minor


Both merging windows (such as sessions) and triggering only actually happen at 
an aggregation (GroupByKey). We should produce errors in any of these cases:

1. Merging window used without a downstream GroupByKey
2. Triggers used without a downstream GroupuByKey
3. Window inspected after inserting a merging window and before the GroupByKey



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (BEAM-117) Implement the API for Static Display Metadata

2016-03-20 Thread Ben Chambers (JIRA)
Ben Chambers created BEAM-117:
-

 Summary: Implement the API for Static Display Metadata
 Key: BEAM-117
 URL: https://issues.apache.org/jira/browse/BEAM-117
 Project: Beam
  Issue Type: Bug
  Components: sdk-java-core
Reporter: Ben Chambers
Assignee: Scott Wegner


As described in the following doc, we would like the SDK to allow associating 
display metadata with PTransforms.

https://docs.google.com/document/d/11enEB9JwVp6vO0uOYYTMYTGkr3TdNfELwWqoiUg5ZxM/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (BEAM-121) Publish DisplayData from common PTransforms

2016-03-20 Thread Ben Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Chambers reassigned BEAM-121:
-

Assignee: Scott Wegner

> Publish DisplayData from common PTransforms
> ---
>
> Key: BEAM-121
> URL: https://issues.apache.org/jira/browse/BEAM-121
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-java-core
>Reporter: Scott Wegner
>Assignee: Scott Wegner
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (BEAM-135) Utilities for "batching" elements in a DoFn

2016-03-19 Thread Ben Chambers (JIRA)
Ben Chambers created BEAM-135:
-

 Summary: Utilities for "batching" elements in a DoFn
 Key: BEAM-135
 URL: https://issues.apache.org/jira/browse/BEAM-135
 Project: Beam
  Issue Type: Bug
  Components: sdk-java-core
Reporter: Ben Chambers


We regularly receive questions about how to write a {{DoFn}} that operates on 
batches of elements. Example answers include:

http://stackoverflow.com/questions/35065109/can-datastore-input-in-google-dataflow-pipeline-be-processed-in-a-batch-of-n-ent/35068341#35068341

http://stackoverflow.com/questions/30177812/partition-data-coming-from-csv-so-i-can-process-larger-patches-rather-then-indiv/30178170#30178170

Possible APIs could be to wrap a {{DoFn}} and include a batch size, or to 
create a utility like {{Filter}}, {{Partition}}, etc. that takes a 
{{SerializableFunction}} or a {{SimpleFunction}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (BEAM-117) Implement the API for Static Display Metadata

2016-03-19 Thread Ben Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Chambers updated BEAM-117:
--
Issue Type: New Feature  (was: Bug)

> Implement the API for Static Display Metadata
> -
>
> Key: BEAM-117
> URL: https://issues.apache.org/jira/browse/BEAM-117
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-java-core
>Reporter: Ben Chambers
>Assignee: Scott Wegner
>
> As described in the following doc, we would like the SDK to allow associating 
> display metadata with PTransforms.
> https://docs.google.com/document/d/11enEB9JwVp6vO0uOYYTMYTGkr3TdNfELwWqoiUg5ZxM/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (BEAM-36) TimestampedValueInMultipleWindows should use a more compact set representation

2016-03-19 Thread Ben Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-36?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Chambers closed BEAM-36.

Resolution: Won't Fix

As per discussion on https://github.com/apache/incubator-beam/pull/43:

"We looked some more at the original Jira issue and realized that it is likely 
a non-issue. It was created to track the fact we needed to examine our usage of 
a HashSet there, since we ran into problems with the over-allocation of a hash 
set (eg., 64 slots to hold 23 items, etc.). When we have 1000 of these in 
memory at a time, the over-allocation starts to hurt.

Upon further scrutiny, those WindowedValues should only be getting turned into 
a Set when we need to do equals or hashCode, to make sure we get an 
order-independent comparison. Assuming this is limited to tests, we can 
probably resolve the Jira issue as won't fix."

> TimestampedValueInMultipleWindows should use a more compact set representation
> --
>
> Key: BEAM-36
> URL: https://issues.apache.org/jira/browse/BEAM-36
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-core
>Reporter: Kenneth Knowles
>Priority: Trivial
>  Labels: Windowing
>
> Today TimestampedValueInMultipleWindows converts its collection of windows to 
> a LinkedHashSet for comparisons and hashing. Since it is an immutable set, 
> more compact representations are available.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (BEAM-112) Using non-IntervalWindows seems to fail

2016-03-14 Thread Ben Chambers (JIRA)
Ben Chambers created BEAM-112:
-

 Summary: Using non-IntervalWindows seems to fail
 Key: BEAM-112
 URL: https://issues.apache.org/jira/browse/BEAM-112
 Project: Beam
  Issue Type: Bug
  Components: runner-spark
Reporter: Ben Chambers
Assignee: Amit Sela
Priority: Minor


See here for more details: 
http://stackoverflow.com/questions/35993777/globalwindow-cannot-be-cast-to-intervalwindow

The linked stack trace indicates this is using the Spark Runner:

{noformat:title=Exception}
java.lang.ClassCastException: 
com.google.cloud.dataflow.sdk.transforms.windowing.GlobalWindow cannot be cast 
to com.google.cloud.dataflow.sdk.transforms.windowing.IntervalWindow
at 
com.google.cloud.dataflow.sdk.transforms.windowing.IntervalWindow$IntervalWindowCoder.encode(IntervalWindow.java:171)
at 
com.google.cloud.dataflow.sdk.coders.IterableLikeCoder.encode(IterableLikeCoder.java:113)
at 
com.google.cloud.dataflow.sdk.coders.IterableLikeCoder.encode(IterableLikeCoder.java:59)
at 
com.google.cloud.dataflow.sdk.util.WindowedValue$FullWindowedValueCoder.encode(WindowedValue.java:599)
at 
com.google.cloud.dataflow.sdk.util.WindowedValue$FullWindowedValueCoder.encode(WindowedValue.java:540)
at 
com.cloudera.dataflow.spark.CoderHelpers.toByteArray(CoderHelpers.java:48)
at com.cloudera.dataflow.spark.CoderHelpers$3.call(CoderHelpers.java:134)
at com.cloudera.dataflow.spark.CoderHelpers$3.call(CoderHelpers.java:131)
at 
org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:1018)
at 
org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:1018)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:149)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{noformat}

It seems likely that at some point the Spark runner is assuming that all 
windows are IntervalWindows, which may not be true. Specifically the 
GlobalWindow+Triggers is valid, as is any custom implementation of 
BoundedWindow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (BEAM-85) DataflowAssert (BeamAssert ;) needs sanity check that it's used correctly

2016-03-02 Thread Ben Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-85?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Chambers updated BEAM-85:
-
Description: 
We should validate two things:

# DataflowAssert is not added to a pipeline that has already been run.
# The pipeline is run after the DataflowAssert is added.

If either of these are not validated, then it is possible that the test doesn't 
actually verify anything.


This code should throw an assertion error or fail in some other way.
{code}
Pipeline p = TestPipeline.create();
PCollection value = p.apply(Create.of(Boolean.FALSE));
p.run();

DataflowAssert.thatSingleton(value).isEqualTo(true);
{code}

but it would pass silently.


similarly, this code wills pass silently:
{code}
Pipeline p = TestPipeline.create();
PCollection value = p.apply(Create.of(Boolean.FALSE));
DataflowAssert.thatSingleton(value).isEqualTo(true);
{code}

  was:
It is important that assert is applied to pipeline before the pipeline is run, 
otherwise it does not actually execute the test.

This code should throw an assertion error or fail in some other way.

{code}
Pipeline p = TestPipeline.create();
PCollection value = p.apply(Create.of(Boolean.FALSE));
p.run();

DataflowAssert.thatSingleton(value).isEqualTo(true);
{code}

but it would pass silently.


> DataflowAssert (BeamAssert ;) needs sanity check that it's used correctly
> -
>
> Key: BEAM-85
> URL: https://issues.apache.org/jira/browse/BEAM-85
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-java-core
>Reporter: Daniel Halperin
>
> We should validate two things:
> # DataflowAssert is not added to a pipeline that has already been run.
> # The pipeline is run after the DataflowAssert is added.
> If either of these are not validated, then it is possible that the test 
> doesn't actually verify anything.
> This code should throw an assertion error or fail in some other way.
> {code}
> Pipeline p = TestPipeline.create();
> PCollection value = p.apply(Create.of(Boolean.FALSE));
> p.run();
> DataflowAssert.thatSingleton(value).isEqualTo(true);
> {code}
> but it would pass silently.
> similarly, this code wills pass silently:
> {code}
> Pipeline p = TestPipeline.create();
> PCollection value = p.apply(Create.of(Boolean.FALSE));
> DataflowAssert.thatSingleton(value).isEqualTo(true);
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (BEAM-85) DataflowAssert (BeamAssert ;) needs sanity check that it's used correctly

2016-03-01 Thread Ben Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-85?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174750#comment-15174750
 ] 

Ben Chambers commented on BEAM-85:
--

One possibility would be to use an `@Rule` instead. for instance:

```java
@Rule public TestPipelineRule testPipeline;

...

Pipeline p = testPipeline.create
...
```

Then the `tearDown` from the rule can validate proper usage.

> DataflowAssert (BeamAssert ;) needs sanity check that it's used correctly
> -
>
> Key: BEAM-85
> URL: https://issues.apache.org/jira/browse/BEAM-85
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-java-core
>Reporter: Daniel Halperin
>Assignee: Davor Bonaci
>
> It is important that assert is applied to pipeline before the pipeline is 
> run, otherwise it does not actually execute the test.
> This code should throw an assertion error or fail in some other way.
> ```java
> {
> Pipeline p = TestPipeline.create();
> PCollection value = p.apply(Create.of(Boolean.FALSE));
> p.run();
> DataflowAssert.thatSingleton(value).isEqualTo(true);
> }
> ```
> but it would pass silently.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)