from:"JIRA"

[jira] [Commented] (BEAM-469) NullableCoder optimized encoding via passthrough context

2016-12-21 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15768525#comment-15768525
 ] 

ASF GitHub Bot commented on BEAM-469:
-

GitHub user dhalperi opened a pull request:

https://github.com/apache/incubator-beam/pull/1680

[BEAM-XXX] Make KVCoder more efficient by removing unnecessary nesting

See [BEAM-469] for more information about why this is
correct.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dhalperi/incubator-beam 
efficient-nested-coders

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-beam/pull/1680.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1680


commit 621e8250c9535d773c4f4440a34ea0833912b51f
Author: Dan Halperin <dhalp...@google.com>
Date:   2016-12-21T23:37:49Z

[BEAM-XXX] Make KVCoder more efficient by removing unnecessary nesting

See [BEAM-469] for more information about why this is
correct.




> NullableCoder optimized encoding via passthrough context
> 
>
> Key: BEAM-469
> URL: https://issues.apache.org/jira/browse/BEAM-469
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-core
>Reporter: Luke Cwik
>Assignee: Thomas Groh
>Priority: Trivial
>  Labels: backward-incompatible
> Fix For: 0.3.0-incubating
>
>
> NullableCoder should encode using the context given and not always use the 
> nested context. For coders which can efficiently encode in the outer context 
> such as StringUtf8Coder or ByteArrayCoder, we are forcing them to prefix 
> themselves with their length.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (BEAM-1201) Remove producesSortedKeys from Source

2016-12-21 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15768483#comment-15768483
 ] 

ASF GitHub Bot commented on BEAM-1201:
--

GitHub user dhalperi opened a pull request:

https://github.com/apache/incubator-beam/pull/1679

[BEAM-1201] Remove BoundedSource.producesSortedKeys

R: @jkff

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dhalperi/incubator-beam 
remove-produces-sorted-keys

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-beam/pull/1679.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1679


commit ee15138543f8b9926466cf4e4dc6857b3173345e
Author: Dan Halperin <dhalp...@google.com>
Date:   2016-12-21T23:32:38Z

[BEAM-1201] Remove BoundedSource.producesSortedKeys

Unused and unclear; for more information see the linked JIRA.




> Remove producesSortedKeys from Source
> -
>
> Key: BEAM-1201
> URL: https://issues.apache.org/jira/browse/BEAM-1201
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-core
>Reporter: Daniel Halperin
>Assignee: Daniel Halperin
>Priority: Minor
>  Labels: backward-incompatible
>
> This is a holdover from a precursor of the old Dataflow SDK that we just 
> failed to delete before releasing Dataflow 1.0, but we can delete before the 
> first stable release of Beam.
> This function has never been used by any runner. It does not mean anything 
> obvious to implementors, as many sources produce {{T}}, not {{KV<K,V>}} -- 
> what does it mean in the former case? (And how do you get the latter case 
> correct?)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (BEAM-1201) Remove producesSortedKeys from Source

2016-12-21 Thread Daniel Halperin (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Halperin updated BEAM-1201:
--
Labels: backward-incompatible  (was: )

> Remove producesSortedKeys from Source
> -
>
> Key: BEAM-1201
> URL: https://issues.apache.org/jira/browse/BEAM-1201
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-core
>Reporter: Daniel Halperin
>Assignee: Daniel Halperin
>Priority: Minor
>  Labels: backward-incompatible
>
> This is a holdover from a precursor of the old Dataflow SDK that we just 
> failed to delete before releasing Dataflow 1.0, but we can delete before the 
> first stable release of Beam.
> This function has never been used by any runner. It does not mean anything 
> obvious to implementors, as many sources produce {{T}}, not {{KV<K,V>}} -- 
> what does it mean in the former case? (And how do you get the latter case 
> correct?)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (BEAM-1201) Remove producesSortedKeys from BoundedSource

2016-12-21 Thread Daniel Halperin (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Halperin updated BEAM-1201:
--
Summary: Remove producesSortedKeys from BoundedSource  (was: Remove 
producesSortedKeys from Source)

> Remove producesSortedKeys from BoundedSource
> 
>
> Key: BEAM-1201
> URL: https://issues.apache.org/jira/browse/BEAM-1201
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-core
>Reporter: Daniel Halperin
>Assignee: Daniel Halperin
>Priority: Minor
>  Labels: backward-incompatible
>
> This is a holdover from a precursor of the old Dataflow SDK that we just 
> failed to delete before releasing Dataflow 1.0, but we can delete before the 
> first stable release of Beam.
> This function has never been used by any runner. It does not mean anything 
> obvious to implementors, as many sources produce {{T}}, not {{KV<K,V>}} -- 
> what does it mean in the former case? (And how do you get the latter case 
> correct?)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (BEAM-469) NullableCoder optimized encoding via passthrough context

2016-12-21 Thread Daniel Halperin (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15768476#comment-15768476
 ] 

Daniel Halperin commented on BEAM-469:
--

Sorry I missed this JIRA comment, [~mariusz89016]! A bit late, but...

Say a coder C does not have the nested context. Then we actually have the 
guarantee that no one will put later elements.

So if {{NullableCoder}} does not have the nested context, then no one will put 
more elements after whatever the {{NullableCoder}} puts. If the NC puts {{0}} 
then that's it -- the element is null. But if the NC puts {{1}}, then we know 
that all remaining bytes in the encoded string belong to the inner coder. That 
is effectively saying that the inner coder also does not need to have the 
nested context, so it does not need to write its own length.

In your example, the {{NullableCoder}} is used in an inner context. So the 
inner coder needs to also use the inner context, because there may be more 
encoded elements later.

In either case: the context of the nullable coder can be the same as the 
context of the inner coder. This is why in the patch here, we simply pass the 
NC's context down into the inner coder. All we have removed is the _additional_ 
nesting that was used.

https://patch-diff.githubusercontent.com/raw/apache/incubator-beam/pull/992.patch
 

> NullableCoder optimized encoding via passthrough context
> 
>
> Key: BEAM-469
> URL: https://issues.apache.org/jira/browse/BEAM-469
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-core
>Reporter: Luke Cwik
>Assignee: Thomas Groh
>Priority: Trivial
>  Labels: backward-incompatible
> Fix For: 0.3.0-incubating
>
>
> NullableCoder should encode using the context given and not always use the 
> nested context. For coders which can efficiently encode in the outer context 
> such as StringUtf8Coder or ByteArrayCoder, we are forcing them to prefix 
> themselves with their length.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (BEAM-646) Get runners out of the apply()

2016-12-21 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15768470#comment-15768470
 ] 

ASF GitHub Bot commented on BEAM-646:
-

Github user asfgit closed the pull request at:

https://github.com/apache/incubator-beam/pull/1582


> Get runners out of the apply()
> --
>
> Key: BEAM-646
> URL: https://issues.apache.org/jira/browse/BEAM-646
> Project: Beam
>  Issue Type: Improvement
>  Components: beam-model-runner-api, sdk-java-core
>Reporter: Kenneth Knowles
>Assignee: Thomas Groh
>
> Right now, the runner intercepts calls to apply() and replaces transforms as 
> we go. This means that there is no "original" user graph. For portability and 
> misc architectural benefits, we would like to build the original graph first, 
> and have the runner override later.
> Some runners already work in this manner, but we could integrate it more 
> smoothly, with more validation, via some handy APIs on e.g. the Pipeline 
> object.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (BEAM-1112) Python E2E Integration Test Framework - Batch Only

2016-12-21 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15768461#comment-15768461
 ] 

ASF GitHub Bot commented on BEAM-1112:
--

Github user markflyhigh closed the pull request at:

https://github.com/apache/incubator-beam/pull/1639


> Python E2E Integration Test Framework - Batch Only
> --
>
> Key: BEAM-1112
> URL: https://issues.apache.org/jira/browse/BEAM-1112
> Project: Beam
>  Issue Type: Task
>  Components: sdk-py, testing
>Reporter: Mark Liu
>Assignee: Mark Liu
>
> Parity with Java. 
> Build e2e integration test framework that can configure and run batch 
> pipeline with specified test runner, wait for pipeline execution and verify 
> results with given verifiers in the end.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (BEAM-1112) Python E2E Integration Test Framework - Batch Only

2016-12-21 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15768462#comment-15768462
 ] 

ASF GitHub Bot commented on BEAM-1112:
--

GitHub user markflyhigh reopened a pull request:

https://github.com/apache/incubator-beam/pull/1639

[BEAM-1112] Python E2E Test Framework And Wordcount E2E Test

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

 - [x] Make sure the PR title is formatted like:
   `[BEAM-] Description of pull request`
 - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable
   Travis-CI on your fork and ensure the whole test matrix passes).
 - [ ] Replace `` in the title with the actual Jira issue
   number, if there is one.
 - [ ] If this contribution is large, please file an Apache
   [Individual Contributor License 
Agreement](https://www.apache.org/licenses/icla.txt).

---

 - E2e test framework that supports TestRunner start and verify pipeline 
job.
   - add `TestOptions` which defined `on_success_matcher` that is used to 
verify state/output of pipeline job.
   - validate `on_success_matcher` before pipeline execution to make sure 
it's unpicklable to a subclass of BaseMatcher.
   - create a `TestDataflowRunner` which provide functionalities of 
`DataflowRunner` plus result verification.
   - provide a test verifier `PipelineStateMatcher` that can verify 
pipeline job finished in DONE or not.
 - Add wordcount_it (it = integration test) that build e2e test based on 
existing wordcount pipeline.
   - include wordcount_it to nose collector, so that wordcount_it can be 
collected and run by nose.
   - skip ITs when running unit tests from tox in precommit and postcommit.

Current changes will not change behavior of existing pre/postcommit.
Test is done by running `tox -e py27 -c sdks/python/tox.ini` for unit test 
and running wordcount_it with `TestDataflowRunner` on service 
([link](https://pantheon.corp.google.com/dataflow/job/2016-12-15_17_36_16-3857167705491723621?project=google.com:clouddfe)).

TODO:
 - Output data verifier that verify pipeline output that stores in 
filesystem.
 - Add wordcount_it to precommit and replace existing wordcount execution 
command in postcommit with a better structured nose command.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/markflyhigh/incubator-beam e2e-testrunner

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-beam/pull/1639.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1639






> Python E2E Integration Test Framework - Batch Only
> --
>
> Key: BEAM-1112
> URL: https://issues.apache.org/jira/browse/BEAM-1112
> Project: Beam
>  Issue Type: Task
>  Components: sdk-py, testing
>Reporter: Mark Liu
>Assignee: Mark Liu
>
> Parity with Java. 
> Build e2e integration test framework that can configure and run batch 
> pipeline with specified test runner, wait for pipeline execution and verify 
> results with given verifiers in the end.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Created] (BEAM-1202) Coders should have meaningful equals methods

2016-12-21 Thread Thomas Groh (JIRA)

Thomas Groh created BEAM-1202:
-

 Summary: Coders should have meaningful equals methods
 Key: BEAM-1202
 URL: https://issues.apache.org/jira/browse/BEAM-1202
 Project: Beam
  Issue Type: Bug
  Components: sdk-java-core
Reporter: Thomas Groh
Assignee: Davor Bonaci


{{StandardCoder}} implements an equality check based on the component coders 
and equal classes. Any coder that is configured, or that does not extend 
{{StandardCoder}}, should have meaningful implementations of {{equals}} and 
{{hashCode}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (BEAM-1202) Coders should have meaningful equals methods

2016-12-21 Thread Thomas Groh (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Groh updated BEAM-1202:
--
Assignee: (was: Davor Bonaci)

> Coders should have meaningful equals methods
> 
>
> Key: BEAM-1202
> URL: https://issues.apache.org/jira/browse/BEAM-1202
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Thomas Groh
>
> {{StandardCoder}} implements an equality check based on the component coders 
> and equal classes. Any coder that is configured, or that does not extend 
> {{StandardCoder}}, should have meaningful implementations of {{equals}} and 
> {{hashCode}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Created] (BEAM-1201) Remove producesSortedKeys from Source

2016-12-21 Thread Daniel Halperin (JIRA)

Daniel Halperin created BEAM-1201:
-

 Summary: Remove producesSortedKeys from Source
 Key: BEAM-1201
 URL: https://issues.apache.org/jira/browse/BEAM-1201
 Project: Beam
  Issue Type: Improvement
  Components: sdk-java-core
Reporter: Daniel Halperin
Assignee: Daniel Halperin
Priority: Minor


This is a holdover from a precursor of the old Dataflow SDK that we just failed 
to delete before releasing Dataflow 1.0, but we can delete before the first 
stable release of Beam.

This function has never been used by any runner. It does not mean anything 
obvious to implementors, as many sources produce {{T}}, not {{KV<K,V>}} -- what 
does it mean in the former case? (And how do you get the latter case correct?)





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (BEAM-1198) ViewFn: explicitly decouple runner materialization of side inputs from SDK-specific mapping

2016-12-21 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15768387#comment-15768387
 ] 

ASF GitHub Bot commented on BEAM-1198:
--

GitHub user kennknowles opened a pull request:

https://github.com/apache/incubator-beam/pull/1678

[BEAM-1198, BEAM-846, BEAM-260] Refactor Dataflow translator to decouple 
input and output graphs more easily

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

 - [x] Make sure the PR title is formatted like:
   `[BEAM-] Description of pull request`
 - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable
   Travis-CI on your fork and ensure the whole test matrix passes).
 - [x] Replace `` in the title with the actual Jira issue
   number, if there is one.
 - [x] If this contribution is large, please file an Apache
   [Individual Contributor License 
Agreement](https://www.apache.org/licenses/icla.txt).

---

This is preparatory work to make it possible for the translator to have a 
more loosely coupled relationship between its input and output.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kennknowles/incubator-beam Dataflow-Translator

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-beam/pull/1678.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1678


commit 8ed4bb68660c537e4a12c1077ecfa104f9a82eaa
Author: Kenneth Knowles <k...@google.com>
Date:   2016-12-21T22:21:50Z

Inline needless interface DataflowTranslator.TranslationContext

The only implementation was DataflowTranslator.Translator. This class
needs some updating and the extra layer of the interface simply
obscures that work.

commit 272d06d7507ad7162616dd1b613efa7c8f5f4069
Author: Kenneth Knowles <k...@google.com>
Date:   2016-12-21T22:34:27Z

Explicitly pass Step to mutate in Dataflow translator

Previously, there was always a "current" step that was the most recent
step created. This makes it cumbersome or impossible to do things like
translate one primitive transform into a small subgraph of steps. Thus
we added hacks like CreatePCollectionView which are not actually part
of the model at all - in fact, we should be able to add the needed
CollectionToSingleton steps simply by looking at the side inputs of a
ParDo node.




> ViewFn: explicitly decouple runner materialization of side inputs from 
> SDK-specific mapping
> ---
>
> Key: BEAM-1198
> URL: https://issues.apache.org/jira/browse/BEAM-1198
> Project: Beam
>  Issue Type: New Feature
>  Components: beam-model-fn-api, beam-model-runner-api
>Reporter: Kenneth Knowles
>Assignee: Kenneth Knowles
>
> For side inputs, the field {{PCollectionView.fromIterableInternal}} implies 
> an "iterable" materialization of the contents of a PCollection, which is 
> adapted to the desired user-facing type within a UDF (today the 
> PCollectionView "is" the UDF)
> In practice, runners get adequate performance by special casing just a couple 
> of types of PCollectionView in a rather cumbersome manner.
> The design to improve this is https://s.apache.org/beam-side-inputs-1-pager 
> and this makes the de facto standard approach the actual model.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Assigned] (BEAM-1198) ViewFn: explicitly decouple runner materialization of side inputs from SDK-specific mapping

2016-12-21 Thread Kenneth Knowles (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kenneth Knowles reassigned BEAM-1198:
-

Assignee: Kenneth Knowles

> ViewFn: explicitly decouple runner materialization of side inputs from 
> SDK-specific mapping
> ---
>
> Key: BEAM-1198
> URL: https://issues.apache.org/jira/browse/BEAM-1198
> Project: Beam
>  Issue Type: New Feature
>  Components: beam-model-fn-api, beam-model-runner-api
>Reporter: Kenneth Knowles
>Assignee: Kenneth Knowles
>
> For side inputs, the field {{PCollectionView.fromIterableInternal}} implies 
> an "iterable" materialization of the contents of a PCollection, which is 
> adapted to the desired user-facing type within a UDF (today the 
> PCollectionView "is" the UDF)
> In practice, runners get adequate performance by special casing just a couple 
> of types of PCollectionView in a rather cumbersome manner.
> The design to improve this is https://s.apache.org/beam-side-inputs-1-pager 
> and this makes the de facto standard approach the actual model.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Created] (BEAM-1200) PubsubIO should allow for a user to supply the function which computes the watermark that is reported

2016-12-21 Thread Luke Cwik (JIRA)

Luke Cwik created BEAM-1200:
---

 Summary: PubsubIO should allow for a user to supply the function 
which computes the watermark that is reported
 Key: BEAM-1200
 URL: https://issues.apache.org/jira/browse/BEAM-1200
 Project: Beam
  Issue Type: Improvement
  Components: sdk-java-gcp
Reporter: Luke Cwik
Assignee: Daniel Halperin
Priority: Minor


A user wanted to build a watermark function which tracked the datas watermark 
but never falls behind current time more than Y minutes. PubsubIO does not 
support specifying the function which computes and reports the watermark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (BEAM-1117) Support for new Timer API in Direct runner

2016-12-21 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15768294#comment-15768294
 ] 

ASF GitHub Bot commented on BEAM-1117:
--

Github user asfgit closed the pull request at:

https://github.com/apache/incubator-beam/pull/1669


> Support for new Timer API in Direct runner
> --
>
> Key: BEAM-1117
> URL: https://issues.apache.org/jira/browse/BEAM-1117
> Project: Beam
>  Issue Type: New Feature
>  Components: runner-direct
>Reporter: Kenneth Knowles
>Assignee: Kenneth Knowles
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (BEAM-1194) DataflowRunner should test a variety of valid tempLocation/stagingLocation/etc formats.

2016-12-21 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15768284#comment-15768284
 ] 

ASF GitHub Bot commented on BEAM-1194:
--

Github user dhalperi closed the pull request at:

https://github.com/apache/incubator-beam/pull/1671


> DataflowRunner should test a variety of valid 
> tempLocation/stagingLocation/etc formats.
> ---
>
> Key: BEAM-1194
> URL: https://issues.apache.org/jira/browse/BEAM-1194
> Project: Beam
>  Issue Type: Test
>  Components: runner-dataflow
>Reporter: Daniel Halperin
>Assignee: Daniel Halperin
>Priority: Minor
>
> Cloud Dataflow has a minor history of small bugs related to various code 
> paths expecting there to be or not be a trailing forward-slash in these 
> location fields. The way that Beam's integration tests are set up, we are 
> likely to only have one of these two cases tested (there is a single set of 
> integration test pipeline options).
> We should add a dedicated DataflowRunner integration test to handle this case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Created] (BEAM-1199) Condense recordAsOutput, finishSpecifyingOutput from POutput

2016-12-21 Thread Thomas Groh (JIRA)

Thomas Groh created BEAM-1199:
-

 Summary: Condense recordAsOutput, finishSpecifyingOutput from 
POutput
 Key: BEAM-1199
 URL: https://issues.apache.org/jira/browse/BEAM-1199
 Project: Beam
  Issue Type: Improvement
  Components: sdk-java-core
Reporter: Thomas Groh
Assignee: Davor Bonaci
Priority: Minor


{{recordAsOutput}} and {{finishSpecifyingOutput}} are both methods which are 
called after an output has been attached to a PTransform application. They can 
be combined to only have one method that does any after-production work (such 
as the initial run of Coder inference)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Comment Edited] (BEAM-846) Decouple side input window mapping from WindowFn

2016-12-21 Thread Kenneth Knowles (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613384#comment-15613384
 ] 

Kenneth Knowles edited comment on BEAM-846 at 12/21/16 9:19 PM:


Design 1-pager is https://s.apache.org/beam-windowmappingfn-1-pager


was (Author: kenn):
Design 1-pager is https://s.apache.org/beam-side-inputs-1-pager and a couple 
PRs have been authored 
([#520|https://github.com/apache/incubator-beam/pull/520] and 
[#1076|https://github.com/apache/incubator-beam/pull/1076]) attributed to 
BEAM-115 (the "all of the Runner API" ticket)


> Decouple side input window mapping from WindowFn
> 
>
> Key: BEAM-846
> URL: https://issues.apache.org/jira/browse/BEAM-846
> Project: Beam
>  Issue Type: New Feature
>  Components: beam-model-runner-api, sdk-java-core
>Reporter: Robert Bradshaw
>Assignee: Kenneth Knowles
>  Labels: backward-incompatible
>
> Currently the main WindowFn provides as getSideInputWindow method. Instead, 
> this mapping should be specified per-side-input (thought the default mapping 
> would remain the same). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (BEAM-210) Allow control of empty ON_TIME panes analogous to final panes

2016-12-21 Thread Kenneth Knowles (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kenneth Knowles updated BEAM-210:
-
Component/s: (was: beam-model)

> Allow control of empty ON_TIME panes analogous to final panes
> -
>
> Key: BEAM-210
> URL: https://issues.apache.org/jira/browse/BEAM-210
> Project: Beam
>  Issue Type: Bug
>  Components: beam-model-runner-api, sdk-java-core
>Reporter: Mark Shields
>Assignee: Thomas Groh
>
> Today, ON_TIME panes are emitted whether or not they are empty. We had 
> decided that for final panes the user would want to control this behavior, to 
> control data volume. But for ON_TIME panes no such control exists. The 
> rationale is perhaps that the ON_TIME pane is a fundamental result that 
> should not be elided. To be considered: whether this is what we want.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (BEAM-260) WindowMappingFn: Know the getSideInputWindow upper bound to release side input resources

2016-12-21 Thread Kenneth Knowles (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kenneth Knowles updated BEAM-260:
-
Component/s: (was: beam-model)
 beam-model-fn-api

> WindowMappingFn: Know the getSideInputWindow upper bound to release side 
> input resources
> 
>
> Key: BEAM-260
> URL: https://issues.apache.org/jira/browse/BEAM-260
> Project: Beam
>  Issue Type: Bug
>  Components: beam-model-fn-api, beam-model-runner-api
>Reporter: Mark Shields
>Assignee: Kenneth Knowles
>
> We currently have no static knowledge about the getSideInputWindow function, 
> and runners are thus forced to hold on to all side input state / elements in 
> case a future element reaches back into an earlier side input element.
> Maybe we need an upper bound on lag from current to result of 
> getSideInputWindow so we can have a progressing gc horizon as we do for  GKB 
> window state. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (BEAM-846) Decouple side input window mapping from WindowFn

2016-12-21 Thread Kenneth Knowles (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kenneth Knowles updated BEAM-846:
-
Component/s: (was: beam-model)

> Decouple side input window mapping from WindowFn
> 
>
> Key: BEAM-846
> URL: https://issues.apache.org/jira/browse/BEAM-846
> Project: Beam
>  Issue Type: New Feature
>  Components: beam-model-runner-api, sdk-java-core
>Reporter: Robert Bradshaw
>Assignee: Kenneth Knowles
>  Labels: backward-incompatible
>
> Currently the main WindowFn provides as getSideInputWindow method. Instead, 
> this mapping should be specified per-side-input (thought the default mapping 
> would remain the same). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (BEAM-1193) Give Coders URNs and document their binary formats outside the Java code base

2016-12-21 Thread Kenneth Knowles (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kenneth Knowles updated BEAM-1193:
--
Component/s: (was: beam-model)

> Give Coders URNs and document their binary formats outside the Java code base
> -
>
> Key: BEAM-1193
> URL: https://issues.apache.org/jira/browse/BEAM-1193
> Project: Beam
>  Issue Type: New Feature
>  Components: beam-model-runner-api
>Reporter: Kenneth Knowles
>Assignee: Frances Perry
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (BEAM-653) Refine specification for WindowFn.isCompatible()

2016-12-21 Thread Kenneth Knowles (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kenneth Knowles updated BEAM-653:
-
Component/s: beam-model-runner-api

> Refine specification for WindowFn.isCompatible() 
> -
>
> Key: BEAM-653
> URL: https://issues.apache.org/jira/browse/BEAM-653
> Project: Beam
>  Issue Type: New Feature
>  Components: beam-model, beam-model-runner-api
>Reporter: Kenneth Knowles
>
> {{WindowFn#isCompatible}} doesn't really have a spec. In practice, it is used 
> primarily when flattening together multiple PCollections. All of the 
> WindowFns must be compatible, and then just a single WindowFn is selected 
> arbitrarily for the output PCollection.
> In consequence, downstream of the Flatten, the merging behavior will be taken 
> from this WindowFn.
> Currently, there are some mismatches:
>  - Sessions with different gap durations _are_ compatible today, but probably 
> shouldn't be since merging makes little sense. (The use of tiny proto-windows 
> is an implementation detail anyhow)
>  - SlidingWindows and FixedWindows _could_ reasonably be compatible if they 
> had the same duration, though it might be odd.
> Either way, we should just nail down what we actually mean so we can arrive 
> at a verdict in these cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (BEAM-1149) Side input access fails in direct runner (possibly others too) when input element in multiple windows

2016-12-21 Thread Kenneth Knowles (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kenneth Knowles updated BEAM-1149:
--
Component/s: runner-core

> Side input access fails in direct runner (possibly others too) when input 
> element in multiple windows
> -
>
> Key: BEAM-1149
> URL: https://issues.apache.org/jira/browse/BEAM-1149
> Project: Beam
>  Issue Type: Bug
>  Components: runner-core
>Reporter: Eugene Kirpichov
>Assignee: Kenneth Knowles
>Priority: Blocker
> Fix For: 0.4.0-incubating
>
>
> {code:java}
>   private static class FnWithSideInputs extends DoFn<String, String> {
> private final PCollectionView view;
> private FnWithSideInputs(PCollectionView view) {
>   this.view = view;
> }
> @ProcessElement
> public void processElement(ProcessContext c) {
>   c.output(c.element() + ":" + c.sideInput(view));
> }
>   }
>   @Test
>   public void testSideInputsWithMultipleWindows() {
> Pipeline p = TestPipeline.create();
> MutableDateTime mutableNow = Instant.now().toMutableDateTime();
> mutableNow.setMillisOfSecond(0);
> Instant now = mutableNow.toInstant();
> SlidingWindows windowFn =
> 
> SlidingWindows.of(Duration.standardSeconds(5)).every(Duration.standardSeconds(1));
> PCollectionView view = 
> p.apply(Create.of(1)).apply(View.asSingleton());
> PCollection res =
> p.apply(Create.timestamped(TimestampedValue.of("a", now)))
> .apply(Window.into(windowFn))
> .apply(ParDo.of(new FnWithSideInputs(view)).withSideInputs(view));
> PAssert.that(res).containsInAnyOrder("a:1");
> p.run();
>   }
> {code}
> This fails with the following exception:
> {code}
> org.apache.beam.sdk.Pipeline$PipelineExecutionException: 
> java.lang.IllegalStateException: sideInput called when main input element is 
> in multiple windows
>   at 
> org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:343)
>   at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:1)
>   at org.apache.beam.sdk.Pipeline.run(Pipeline.java:176)
>   at org.apache.beam.sdk.testing.TestPipeline.run(TestPipeline.java:112)
>   at 
> Caused by: java.lang.IllegalStateException: sideInput called when main input 
> element is in multiple windows
>   at 
> org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.sideInput(SimpleDoFnRunner.java:514)
>   at 
> org.apache.beam.sdk.transforms.ParDoTest$FnWithSideInputs.processElement(ParDoTest.java:738)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (BEAM-25) Add user-ready API for interacting with state

2016-12-21 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15768175#comment-15768175
 ] 

ASF GitHub Bot commented on BEAM-25:


Github user asfgit closed the pull request at:

https://github.com/apache/incubator-beam/pull/1670


> Add user-ready API for interacting with state
> -
>
> Key: BEAM-25
> URL: https://issues.apache.org/jira/browse/BEAM-25
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-java-core
>Reporter: Kenneth Knowles
>Assignee: Kenneth Knowles
>  Labels: State
>
> Our current state API is targeted at runner implementers, not pipeline 
> authors. As such it has many capabilities that are not necessary nor 
> desirable for simple use cases of stateful ParDo (such as dynamic state tag 
> creation). Implement a simple state intended for user access.
> (Details of our current thoughts in forthcoming design doc)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Created] (BEAM-1198) ViewFn: explicitly decouple runner materialization of side inputs from SDK-specific mapping

2016-12-21 Thread Kenneth Knowles (JIRA)

Kenneth Knowles created BEAM-1198:
-

 Summary: ViewFn: explicitly decouple runner materialization of 
side inputs from SDK-specific mapping
 Key: BEAM-1198
 URL: https://issues.apache.org/jira/browse/BEAM-1198
 Project: Beam
  Issue Type: New Feature
  Components: beam-model-fn-api, beam-model-runner-api
Reporter: Kenneth Knowles


For side inputs, the field {{PCollectionView.fromIterableInternal}} implies an 
"iterable" materialization of the contents of a PCollection, which is adapted 
to the desired user-facing type within a UDF (today the PCollectionView "is" 
the UDF)

In practice, runners get adequate performance by special casing just a couple 
of types of PCollectionView in a rather cumbersome manner.

The design to improve this is https://s.apache.org/beam-side-inputs-1-pager and 
this makes the de facto standard approach the actual model.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Closed] (BEAM-1003) Enable caching of side-input dependent computations

2016-12-21 Thread Kenneth Knowles (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-1003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kenneth Knowles closed BEAM-1003.
-
   Resolution: Duplicate
Fix Version/s: Not applicable

Looks like an exact duplicate, a la form resubmission.

> Enable caching of side-input dependent computations
> ---
>
> Key: BEAM-1003
> URL: https://issues.apache.org/jira/browse/BEAM-1003
> Project: Beam
>  Issue Type: New Feature
>  Components: beam-model
>Reporter: Robert Bradshaw
>Assignee: Frances Perry
> Fix For: Not applicable
>
>
> Sometimes the kind of computations one wants to perform in startBundle depend 
> on side inputs (and, implicitly, the window). For example, one might want to 
> initialize a (non-serializable) stateful object. In particular, this leads to 
> users incorrectly (in the case of triggered or non-globally-windowed side 
> inputs) memoizing this computation in the first processElement call. 
> One option would be to fold this into a customizable ViewFn. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (BEAM-260) WindowMappingFn: Know the getSideInputWindow upper bound to release side input resources

2016-12-21 Thread Kenneth Knowles (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kenneth Knowles updated BEAM-260:
-
Summary: WindowMappingFn: Know the getSideInputWindow upper bound to 
release side input resources  (was: Know the getSideInputWindow upper bound so 
can gc side input state)

> WindowMappingFn: Know the getSideInputWindow upper bound to release side 
> input resources
> 
>
> Key: BEAM-260
> URL: https://issues.apache.org/jira/browse/BEAM-260
> Project: Beam
>  Issue Type: Bug
>  Components: beam-model, beam-model-runner-api
>Reporter: Mark Shields
>Assignee: Kenneth Knowles
>
> We currently have no static knowledge about the getSideInputWindow function, 
> and runners are thus forced to hold on to all side input state / elements in 
> case a future element reaches back into an earlier side input element.
> Maybe we need an upper bound on lag from current to result of 
> getSideInputWindow so we can have a progressing gc horizon as we do for  GKB 
> window state. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (BEAM-79) Gearpump runner

2016-12-21 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-79?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15768076#comment-15768076
 ] 

ASF GitHub Bot commented on BEAM-79:


Github user asfgit closed the pull request at:

https://github.com/apache/incubator-beam/pull/1663


> Gearpump runner
> ---
>
> Key: BEAM-79
> URL: https://issues.apache.org/jira/browse/BEAM-79
> Project: Beam
>  Issue Type: New Feature
>  Components: runner-gearpump
>Reporter: Tyler Akidau
>Assignee: Manu Zhang
>
> Intel is submitting Gearpump (http://www.gearpump.io) to ASF 
> (https://wiki.apache.org/incubator/GearpumpProposal). Appears to be a mix of 
> low-level primitives a la MillWheel, with some higher level primitives like 
> non-merging windowing mixed in. Seems like it would make a nice Beam runner.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Created] (BEAM-1197) Slowly-changing external data as a side input

2016-12-21 Thread Eugene Kirpichov (JIRA)

Eugene Kirpichov created BEAM-1197:
--

 Summary: Slowly-changing external data as a side input
 Key: BEAM-1197
 URL: https://issues.apache.org/jira/browse/BEAM-1197
 Project: Beam
  Issue Type: Wish
  Components: beam-model
Reporter: Eugene Kirpichov
Assignee: Frances Perry


I've seen repeatedly the following pattern: a user wants to join a PCollection 
against a slowly-changing external dataset: e.g. a file on GCS, or a Bigtable, 
etc.

Side inputs come to mind, but current side input mechanisms don't allow for 
something like periodically reloading the side input.

The best hacky solution I came up with for one use case is documented here: 
http://stackoverflow.com/questions/41254028/can-dataflow-sideinput-be-updated-per-window-by-reading-a-gcs-bucket/41271159#41271159
 , we need to do better than this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (BEAM-430) Introducing gcpTempLocation that default to tempLocation

2016-12-21 Thread Luke Cwik (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke Cwik updated BEAM-430:
---
Labels: backward-incompatible  (was: )

> Introducing gcpTempLocation that default to tempLocation
> 
>
> Key: BEAM-430
> URL: https://issues.apache.org/jira/browse/BEAM-430
> Project: Beam
>  Issue Type: Improvement
>Reporter: Pei He
>Assignee: Pei He
>Priority: Minor
>  Labels: backward-incompatible
> Fix For: 0.2.0-incubating
>
>
> Currently, DataflowPipelineOptions.stagingLocation default to tempLocation. 
> And, it requires tempLocation to be a gcs path.
> Another case is BigQueryIO uses tempLocation and also requires it to be on 
> gcs.
> So, users cannot set tempLocation to a non-gcs path with DataflowRunner or 
> BigQueryIO.
> However, tempLocation could be on any file system. For example, WordCount 
> defaults to output to tempLocation.
> The proposal is to add gcpTempLocation. And, it defaults to tempLocation if 
> tempLocation is a gcs path.
> StagingLocation and BigQueryIO will use gcpTempLocation by default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Closed] (BEAM-430) Introducing gcpTempLocation that default to tempLocation

2016-12-21 Thread Luke Cwik (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke Cwik closed BEAM-430.
--

> Introducing gcpTempLocation that default to tempLocation
> 
>
> Key: BEAM-430
> URL: https://issues.apache.org/jira/browse/BEAM-430
> Project: Beam
>  Issue Type: Improvement
>Reporter: Pei He
>Assignee: Pei He
>Priority: Minor
>  Labels: backward-incompatible
> Fix For: 0.2.0-incubating
>
>
> Currently, DataflowPipelineOptions.stagingLocation default to tempLocation. 
> And, it requires tempLocation to be a gcs path.
> Another case is BigQueryIO uses tempLocation and also requires it to be on 
> gcs.
> So, users cannot set tempLocation to a non-gcs path with DataflowRunner or 
> BigQueryIO.
> However, tempLocation could be on any file system. For example, WordCount 
> defaults to output to tempLocation.
> The proposal is to add gcpTempLocation. And, it defaults to tempLocation if 
> tempLocation is a gcs path.
> StagingLocation and BigQueryIO will use gcpTempLocation by default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Resolved] (BEAM-430) Introducing gcpTempLocation that default to tempLocation

2016-12-21 Thread Luke Cwik (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke Cwik resolved BEAM-430.

Resolution: Fixed

> Introducing gcpTempLocation that default to tempLocation
> 
>
> Key: BEAM-430
> URL: https://issues.apache.org/jira/browse/BEAM-430
> Project: Beam
>  Issue Type: Improvement
>Reporter: Pei He
>Assignee: Pei He
>Priority: Minor
>  Labels: backward-incompatible
> Fix For: 0.2.0-incubating
>
>
> Currently, DataflowPipelineOptions.stagingLocation default to tempLocation. 
> And, it requires tempLocation to be a gcs path.
> Another case is BigQueryIO uses tempLocation and also requires it to be on 
> gcs.
> So, users cannot set tempLocation to a non-gcs path with DataflowRunner or 
> BigQueryIO.
> However, tempLocation could be on any file system. For example, WordCount 
> defaults to output to tempLocation.
> The proposal is to add gcpTempLocation. And, it defaults to tempLocation if 
> tempLocation is a gcs path.
> StagingLocation and BigQueryIO will use gcpTempLocation by default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Reopened] (BEAM-430) Introducing gcpTempLocation that default to tempLocation

2016-12-21 Thread Luke Cwik (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke Cwik reopened BEAM-430:


> Introducing gcpTempLocation that default to tempLocation
> 
>
> Key: BEAM-430
> URL: https://issues.apache.org/jira/browse/BEAM-430
> Project: Beam
>  Issue Type: Improvement
>Reporter: Pei He
>Assignee: Pei He
>Priority: Minor
>  Labels: backward-incompatible
> Fix For: 0.2.0-incubating
>
>
> Currently, DataflowPipelineOptions.stagingLocation default to tempLocation. 
> And, it requires tempLocation to be a gcs path.
> Another case is BigQueryIO uses tempLocation and also requires it to be on 
> gcs.
> So, users cannot set tempLocation to a non-gcs path with DataflowRunner or 
> BigQueryIO.
> However, tempLocation could be on any file system. For example, WordCount 
> defaults to output to tempLocation.
> The proposal is to add gcpTempLocation. And, it defaults to tempLocation if 
> tempLocation is a gcs path.
> StagingLocation and BigQueryIO will use gcpTempLocation by default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (BEAM-115) Beam Runner API

2016-12-21 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15767972#comment-15767972
 ] 

ASF GitHub Bot commented on BEAM-115:
-

GitHub user kennknowles reopened a pull request:

https://github.com/apache/incubator-beam/pull/662

[BEAM-115] WIP: JSON Schema definition of pipeline

This is a json-schema sketch of the concrete schema from the [Pipeline 
Runner API proposal document](https://s.apache.org/beam-runner-api). Because 
our [serialization tech 
discussion](http://mail-archives.apache.org/mod_mbox/beam-dev/201606.mbox/%3CCAN_Ypr2ZPQG3OgPWu==kf-zztg06k0v5i0ay3dabchjyver...@mail.gmail.com%3E)
 seemed to favor JSON on the front end and Proto on the backend, I made this 
quick port. The original Avro IDL definition is also on [a branch with a 
test](https://github.com/kennknowles/incubator-beam/blob/pipeline-model/model/pipeline/src/main/avro/org/apache/beam/model/pipeline/pipeline.avdl).

Notes & Caveats:
- I did not try to flesh out any more details; this was a straight port. 
There's plenty to add, but a PR seems like a place that will attract a desired 
kind of concrete discussion even in the current state.
- Typing this makes my hands hurt. Luckily, it should change exceedingly 
rarely. There are many libraries that can generate json-schema in various ways, 
including Jackson itself, but I'm not so sure any of them are applicable.
- Reading this makes my eyes hurt. This is a real problem. We need a 
readable spec, not just a test suite for validation.
- I am not so sure that [the schema 
library](https://github.com/daveclayton/json-schema-validator) I've used to 
build my smoke test is a good long term choice. I chose it because it was 
Jackson-based.
- I've left comments in the JSON even though that is frowned upon, and 
taken advantage of Jackson's feature to allow them. They can also go into 
`"description"` fields.
- Perhaps we could write YAML and convert to json-schema with no loss of 
precision?

Feel free to leave comments here about the schema or meta issues of e.g. 
where the schema should live and what libraries we might want to use.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kennknowles/incubator-beam 
pipeline-json-schema

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-beam/pull/662.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #662


commit c5843ce10e782056c76157169eb5516bf18ed9e4
Author: Kenneth Knowles <k...@google.com>
Date:   2016-06-10T15:51:02Z

WIP: add JSON Schema definition of pipeline




> Beam Runner API
> ---
>
> Key: BEAM-115
> URL: https://issues.apache.org/jira/browse/BEAM-115
> Project: Beam
>  Issue Type: Improvement
>  Components: beam-model-runner-api
>Reporter: Kenneth Knowles
>Assignee: Kenneth Knowles
>
> The PipelineRunner API from the SDK is not ideal for the Beam technical 
> vision.
> It has technical limitations:
>  - The user's DAG (even including library expansions) is never explicitly 
> represented, so it cannot be analyzed except incrementally, and cannot 
> necessarily be reconstructed (for example, to display it!).
>  - The flattened DAG of just primitive transforms isn't well-suited for 
> display or transform override.
>  - The TransformHierarchy isn't well-suited for optimizations.
>  - The user must realistically pre-commit to a runner, and its configuration 
> (batch vs streaming) prior to graph construction, since the runner will be 
> modifying the graph as it is built.
>  - It is fairly language- and SDK-specific.
> It has usability issues (these are not from intuition, but derived from 
> actual cases of failure to use according to the design)
>  - The interleaving of apply() methods in PTransform/Pipeline/PipelineRunner 
> is confusing.
>  - The TransformHierarchy, accessible only via visitor traversals, is 
> cumbersome.
>  - The staging of construction-time vs run-time is not always obvious.
> These are just examples. This ticket tracks designing, coming to consensus, 
> and building an API that more simply and directly supports the technical 
> vision.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (BEAM-1190) FileBasedSource should ignore files that matched the glob but don't exist

2016-12-21 Thread Daniel Halperin (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15767937#comment-15767937
 ] 

Daniel Halperin commented on BEAM-1190:
---

I do not think this is generally safe -- it may mask underlying bugs. For 
example, we should never invoke this code unless the filesystem is known be 
eventually list-consistent but consistent with stat.

This change does not obviate the need for [BEAM-60] -- because users may want 
to go the other way, and expand the inconsistent list they get. I propose you 
package this logic up in whatever the new name for IOChannelUtils is as one of 
the things users can do in the code they run at expand-time.

Bringing the user into the loop is also nice because it makes them deal with 
eventual consistency up front. We are burned a lot by users who don't realize 
what their globs really mean.

> FileBasedSource should ignore files that matched the glob but don't exist
> -
>
> Key: BEAM-1190
> URL: https://issues.apache.org/jira/browse/BEAM-1190
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Eugene Kirpichov
>Assignee: Eugene Kirpichov
>
> See user issue:
> http://stackoverflow.com/questions/41251741/coping-with-eventual-consistency-of-gcs-bucket-listing
> We should, after globbing the files in FileBasedSource, individually stat 
> every file and remove those that don't exist, to account for the possibility 
> that glob yielded non-existing files due to eventual consistency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Created] (BEAM-1196) Serialize/deserialize Pipeline/TransformHierarchy to JSON

2016-12-21 Thread Kenneth Knowles (JIRA)

Kenneth Knowles created BEAM-1196:
-

 Summary: Serialize/deserialize Pipeline/TransformHierarchy to JSON
 Key: BEAM-1196
 URL: https://issues.apache.org/jira/browse/BEAM-1196
 Project: Beam
  Issue Type: New Feature
  Components: beam-model-runner-api, sdk-java-core
Reporter: Kenneth Knowles


There are two sketches of a concrete format for a Pipeline:

1. The Avro schema in the design doc at https://s.apache.org/beam-runner-api 2. 
The JSON schema at https://github.com/apache/incubator-beam/pull/662

These should be pushed all the way through and added to the Java SDK.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Created] (BEAM-1195) Give triggers URNs / JSON formats

2016-12-21 Thread Kenneth Knowles (JIRA)

Kenneth Knowles created BEAM-1195:
-

 Summary: Give triggers URNs / JSON formats
 Key: BEAM-1195
 URL: https://issues.apache.org/jira/browse/BEAM-1195
 Project: Beam
  Issue Type: New Feature
  Components: beam-model-runner-api
Reporter: Kenneth Knowles
Assignee: Kenneth Knowles


We have recently gotten to the point where triggers are just syntax, but it is 
still shipped via Java serialization. To make it language-independent, we need 
a concrete syntax.

Something like the following is fairly concise, tag adjacent to payload. I 
haven't bothered making up fully verbose/namespaced URNs here.

{code}
{
"$urn": "OrFinally",
"main": {
  "$urn": "EndOfWindow",
  "early": 
},
"finally": {
  "$urn": "AfterCount",
  "count": 45
}
}
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (BEAM-1194) DataflowRunner should test a variety of valid tempLocation/stagingLocation/etc formats.

2016-12-21 Thread Daniel Halperin (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Halperin updated BEAM-1194:
--
Description: 
Cloud Dataflow has a minor history of small bugs related to various code paths 
expecting there to be or not be a trailing forward-slash in these location 
fields. The way that Beam's integration tests are set up, we are likely to only 
have one of these two cases tested (there is a single set of integration test 
pipeline options).

We should add a dedicated DataflowRunner integration test to handle this case.

  was:
Cloud Dataflow has a minor history of small bugs related to various code paths 
expecting there to be or not be a trailing forward-slash in these location 
fields. The way that Beam's integration tests are set up, we are likely to only 
have one of these two cases tested (there is a single set of integration test 
pipeline options).

We should add a dedicated integration test to handle this case.


> DataflowRunner should test a variety of valid 
> tempLocation/stagingLocation/etc formats.
> ---
>
> Key: BEAM-1194
> URL: https://issues.apache.org/jira/browse/BEAM-1194
> Project: Beam
>  Issue Type: Test
>  Components: runner-dataflow
>Reporter: Daniel Halperin
>Assignee: Daniel Halperin
>Priority: Minor
>
> Cloud Dataflow has a minor history of small bugs related to various code 
> paths expecting there to be or not be a trailing forward-slash in these 
> location fields. The way that Beam's integration tests are set up, we are 
> likely to only have one of these two cases tested (there is a single set of 
> integration test pipeline options).
> We should add a dedicated DataflowRunner integration test to handle this case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Created] (BEAM-1194) DataflowRunner should test a variety of valid tempLocation/stagingLocation/etc formats.

2016-12-21 Thread Daniel Halperin (JIRA)

Daniel Halperin created BEAM-1194:
-

 Summary: DataflowRunner should test a variety of valid 
tempLocation/stagingLocation/etc formats.
 Key: BEAM-1194
 URL: https://issues.apache.org/jira/browse/BEAM-1194
 Project: Beam
  Issue Type: Test
  Components: runner-dataflow
Reporter: Daniel Halperin
Assignee: Daniel Halperin
Priority: Minor


Cloud Dataflow has a minor history of small bugs related to various code paths 
expecting there to be or not be a trailing forward-slash in these location 
fields. The way that Beam's integration tests are set up, we are likely to only 
have one of these two cases tested (there is a single set of integration test 
pipeline options).

We should add a dedicated integration test to handle this case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (BEAM-1192) Give transforms URNs, use them instead of instanceof checks

2016-12-21 Thread Kenneth Knowles (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-1192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kenneth Knowles updated BEAM-1192:
--
Summary: Give transforms URNs, use them instead of instanceof checks  (was: 
Add URNs to known transforms instead of using instanceof checks)

> Give transforms URNs, use them instead of instanceof checks
> ---
>
> Key: BEAM-1192
> URL: https://issues.apache.org/jira/browse/BEAM-1192
> Project: Beam
>  Issue Type: Improvement
>  Components: beam-model-runner-api
>Reporter: Kenneth Knowles
>
> In the [Beam Runner AP|https://s.apache.org/beam-runner-api], transforms of 
> interest to runners are to be identified by URN.
> Currently, Java-based runners use `instanceof` checks to both override 
> transforms and to implement primitive transforms. This language and 
> SDK-specific behavior should be replaced by adding these URNs, and checking 
> them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (BEAM-1192) Give transforms URNs, use them instead of instanceof checks

2016-12-21 Thread Kenneth Knowles (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-1192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kenneth Knowles updated BEAM-1192:
--
Issue Type: New Feature  (was: Improvement)

> Give transforms URNs, use them instead of instanceof checks
> ---
>
> Key: BEAM-1192
> URL: https://issues.apache.org/jira/browse/BEAM-1192
> Project: Beam
>  Issue Type: New Feature
>  Components: beam-model-runner-api
>Reporter: Kenneth Knowles
>
> In the [Beam Runner AP|https://s.apache.org/beam-runner-api], transforms of 
> interest to runners are to be identified by URN.
> Currently, Java-based runners use `instanceof` checks to both override 
> transforms and to implement primitive transforms. This language and 
> SDK-specific behavior should be replaced by adding these URNs, and checking 
> them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Created] (BEAM-1193) Give Coders URNs and document their binary formats outside the Java code base

2016-12-21 Thread Kenneth Knowles (JIRA)

Kenneth Knowles created BEAM-1193:
-

 Summary: Give Coders URNs and document their binary formats 
outside the Java code base
 Key: BEAM-1193
 URL: https://issues.apache.org/jira/browse/BEAM-1193
 Project: Beam
  Issue Type: New Feature
  Components: beam-model, beam-model-runner-api
Reporter: Kenneth Knowles
Assignee: Frances Perry






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (BEAM-27) Add user-ready API for interacting with timers

2016-12-21 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-27?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15767897#comment-15767897
 ] 

ASF GitHub Bot commented on BEAM-27:


Github user asfgit closed the pull request at:

https://github.com/apache/incubator-beam/pull/1660


> Add user-ready API for interacting with timers
> --
>
> Key: BEAM-27
> URL: https://issues.apache.org/jira/browse/BEAM-27
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-java-core
>Reporter: Kenneth Knowles
>Assignee: Kenneth Knowles
>
> Pipeline authors will benefit from a different factorization of interaction 
> with underlying timers. The current APIs are targeted at runner implementers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (BEAM-1190) FileBasedSource should ignore files that matched the glob but don't exist

2016-12-21 Thread Eugene Kirpichov (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15767871#comment-15767871
 ] 

Eugene Kirpichov commented on BEAM-1190:


My proposal is to add a mandatory stat at glob-expand time (and omit the file 
from glob expansion if it doesn't exist), but still throw an error if the file 
doesn't exist at read time. I think this is safe and should not require opt-in, 
since it doesn't seem to introduce new failure modes: both before and after the 
proposed solution we'll fail if a file doesn't exist at read time; but without 
it we may also erroneously fail if the file is included in glob expansion but 
actually doesn't exist at glob expansion time.

When Filesystem APIs are able to tell whether the file system is strongly 
consistent, then we can eliminate the stat as an optimization.

> FileBasedSource should ignore files that matched the glob but don't exist
> -
>
> Key: BEAM-1190
> URL: https://issues.apache.org/jira/browse/BEAM-1190
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Eugene Kirpichov
>Assignee: Eugene Kirpichov
>
> See user issue:
> http://stackoverflow.com/questions/41251741/coping-with-eventual-consistency-of-gcs-bucket-listing
> We should, after globbing the files in FileBasedSource, individually stat 
> every file and remove those that don't exist, to account for the possibility 
> that glob yielded non-existing files due to eventual consistency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Comment Edited] (BEAM-1190) FileBasedSource should ignore files that matched the glob but don't exist

2016-12-21 Thread Daniel Halperin (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15767818#comment-15767818
 ] 

Daniel Halperin edited comment on BEAM-1190 at 12/21/16 6:46 PM:
-

Not for very long -- the stat at open-time is getting removed. We get the size 
information we need from the list call, but currently throw it away for silly 
reasons.

How would you feel about the ability to execute code in the worker when the 
glob is expanded. I think checking which files actually exist then and deciding 
in one centralized place in time which files you want to read (and committing 
to that decision for later) is probably a simpler and safer solution.


was (Author: dhalp...@google.com):
Not for very long -- the stat at open-time is getting removed as we get the 
information we need from the list call, but throw it away like we shouldn't be.

How would you feel about the ability to execute code in the worker when the 
glob is expanded. I think checking which files actually exist then and deciding 
in one centralized place in time which files you want to read (and committing 
to that decision for later) is probably a simpler and safer solution.

> FileBasedSource should ignore files that matched the glob but don't exist
> -
>
> Key: BEAM-1190
> URL: https://issues.apache.org/jira/browse/BEAM-1190
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Eugene Kirpichov
>Assignee: Eugene Kirpichov
>
> See user issue:
> http://stackoverflow.com/questions/41251741/coping-with-eventual-consistency-of-gcs-bucket-listing
> We should, after globbing the files in FileBasedSource, individually stat 
> every file and remove those that don't exist, to account for the possibility 
> that glob yielded non-existing files due to eventual consistency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (BEAM-1190) FileBasedSource should ignore files that matched the glob but don't exist

2016-12-21 Thread Daniel Halperin (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15767818#comment-15767818
 ] 

Daniel Halperin commented on BEAM-1190:
---

Not for very long -- the stat at open-time is getting removed as we get the 
information we need from the list call, but throw it away like we shouldn't be.

How would you feel about the ability to execute code in the worker when the 
glob is expanded. I think checking which files actually exist then and deciding 
in one centralized place in time which files you want to read (and committing 
to that decision for later) is probably a simpler and safer solution.

> FileBasedSource should ignore files that matched the glob but don't exist
> -
>
> Key: BEAM-1190
> URL: https://issues.apache.org/jira/browse/BEAM-1190
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Eugene Kirpichov
>Assignee: Eugene Kirpichov
>
> See user issue:
> http://stackoverflow.com/questions/41251741/coping-with-eventual-consistency-of-gcs-bucket-listing
> We should, after globbing the files in FileBasedSource, individually stat 
> every file and remove those that don't exist, to account for the possibility 
> that glob yielded non-existing files due to eventual consistency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (BEAM-362) Move shared runner functionality out of SDK and into runners/core-java

2016-12-21 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15767806#comment-15767806
 ] 

ASF GitHub Bot commented on BEAM-362:
-

Github user asfgit closed the pull request at:

https://github.com/apache/incubator-beam/pull/1666


> Move shared runner functionality out of SDK and into runners/core-java
> --
>
> Key: BEAM-362
> URL: https://issues.apache.org/jira/browse/BEAM-362
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-core
>Reporter: Kenneth Knowles
>Assignee: Kenneth Knowles
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Created] (BEAM-1192) Add URNs to known transforms instead of using instanceof checks

2016-12-21 Thread Kenneth Knowles (JIRA)

Kenneth Knowles created BEAM-1192:
-

 Summary: Add URNs to known transforms instead of using instanceof 
checks
 Key: BEAM-1192
 URL: https://issues.apache.org/jira/browse/BEAM-1192
 Project: Beam
  Issue Type: Improvement
  Components: beam-model-runner-api
Reporter: Kenneth Knowles


In the [Beam Runner AP|https://s.apache.org/beam-runner-api], transforms of 
interest to runners are to be identified by URN.

Currently, Java-based runners use `instanceof` checks to both override 
transforms and to implement primitive transforms. This language and 
SDK-specific behavior should be replaced by adding these URNs, and checking 
them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (BEAM-27) Add user-ready API for interacting with timers

2016-12-21 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-27?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15767732#comment-15767732
 ] 

ASF GitHub Bot commented on BEAM-27:


Github user asfgit closed the pull request at:

https://github.com/apache/incubator-beam/pull/1673


> Add user-ready API for interacting with timers
> --
>
> Key: BEAM-27
> URL: https://issues.apache.org/jira/browse/BEAM-27
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-java-core
>Reporter: Kenneth Knowles
>Assignee: Kenneth Knowles
>
> Pipeline authors will benefit from a different factorization of interaction 
> with underlying timers. The current APIs are targeted at runner implementers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Resolved] (BEAM-1178) Make naming of logger objects consistent

2016-12-21 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/BEAM-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismaël Mejía resolved BEAM-1178.

Resolution: Fixed

> Make naming of logger objects consistent
> 
>
> Key: BEAM-1178
> URL: https://issues.apache.org/jira/browse/BEAM-1178
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-core, sdk-java-extensions
>Affects Versions: 0.5.0-incubating
>Reporter: Ismaël Mejía
>Assignee: Ismaël Mejía
>Priority: Trivial
> Fix For: Not applicable
>
>
> Logger objects are used in different instances in Beam, around 90% of the 
> current classes that use loggers use the convention name 'LOG', however there 
> are instances that use 'logger' and others that uses 'LOGGER', this issue is 
> to make the logger naming consistent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Reopened] (BEAM-1178) Make naming of logger objects consistent

2016-12-21 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/BEAM-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismaël Mejía reopened BEAM-1178:


> Make naming of logger objects consistent
> 
>
> Key: BEAM-1178
> URL: https://issues.apache.org/jira/browse/BEAM-1178
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-core, sdk-java-extensions
>Affects Versions: 0.5.0-incubating
>Reporter: Ismaël Mejía
>Assignee: Ismaël Mejía
>Priority: Trivial
> Fix For: Not applicable
>
>
> Logger objects are used in different instances in Beam, around 90% of the 
> current classes that use loggers use the convention name 'LOG', however there 
> are instances that use 'logger' and others that uses 'LOGGER', this issue is 
> to make the logger naming consistent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Resolved] (BEAM-1165) Unexpected file created when checking dependencies on clean repo

2016-12-21 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/BEAM-1165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismaël Mejía resolved BEAM-1165.

Resolution: Fixed

> Unexpected file created when checking dependencies on clean repo
> 
>
> Key: BEAM-1165
> URL: https://issues.apache.org/jira/browse/BEAM-1165
> Project: Beam
>  Issue Type: Bug
>  Components: runner-flink
>Affects Versions: 0.5.0-incubating
>Reporter: Ismaël Mejía
>Assignee: Ismaël Mejía
>Priority: Minor
> Fix For: 0.5.0-incubating
>
>
> I just found a weird behavior when I was checking for the latest release,
> nothing breaking, but when I start with a clean repo clone and I do:
> mvn dependency:tree
> It creates a new file runners/flink/examples/wordcounts.txt with the
> dependencies.
> This error happens because maven-dependency-plugin asumes the property output
> used by the flink tests as the export file for the command.
> Ref.
> https://maven.apache.org/plugins/maven-dependency-plugin/tree-mojo.html#output



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Reopened] (BEAM-1165) Unexpected file created when checking dependencies on clean repo

2016-12-21 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/BEAM-1165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismaël Mejía reopened BEAM-1165:


> Unexpected file created when checking dependencies on clean repo
> 
>
> Key: BEAM-1165
> URL: https://issues.apache.org/jira/browse/BEAM-1165
> Project: Beam
>  Issue Type: Bug
>  Components: runner-flink
>Affects Versions: 0.5.0-incubating
>Reporter: Ismaël Mejía
>Assignee: Ismaël Mejía
>Priority: Minor
> Fix For: 0.5.0-incubating
>
>
> I just found a weird behavior when I was checking for the latest release,
> nothing breaking, but when I start with a clean repo clone and I do:
> mvn dependency:tree
> It creates a new file runners/flink/examples/wordcounts.txt with the
> dependencies.
> This error happens because maven-dependency-plugin asumes the property output
> used by the flink tests as the export file for the command.
> Ref.
> https://maven.apache.org/plugins/maven-dependency-plugin/tree-mojo.html#output



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (BEAM-27) Add user-ready API for interacting with timers

2016-12-21 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-27?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15767452#comment-15767452
 ] 

ASF GitHub Bot commented on BEAM-27:


Github user asfgit closed the pull request at:

https://github.com/apache/incubator-beam/pull/1668


> Add user-ready API for interacting with timers
> --
>
> Key: BEAM-27
> URL: https://issues.apache.org/jira/browse/BEAM-27
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-java-core
>Reporter: Kenneth Knowles
>Assignee: Kenneth Knowles
>
> Pipeline authors will benefit from a different factorization of interaction 
> with underlying timers. The current APIs are targeted at runner implementers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (BEAM-1146) Decrease spark runner startup overhead

2016-12-21 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15767378#comment-15767378
 ] 

ASF GitHub Bot commented on BEAM-1146:
--

GitHub user aviemzur opened a pull request:

https://github.com/apache/incubator-beam/pull/1674

[BEAM-1146] Decrease spark runner startup overhead

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

 - [ ] Make sure the PR title is formatted like:
   `[BEAM-] Description of pull request`
 - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable
   Travis-CI on your fork and ensure the whole test matrix passes).
 - [ ] Replace `` in the title with the actual Jira issue
   number, if there is one.
 - [ ] If this contribution is large, please file an Apache
   [Individual Contributor License 
Agreement](https://www.apache.org/licenses/icla.txt).

---
Replace finding all `Source` and `Coder` implementations for serialization 
registration with wrapper classes.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/aviemzur/incubator-beam 
decrease-spark-runner-startup-overhead

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-beam/pull/1674.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1674


commit 8501cdc88ee9c89f643120e34381ec9bc2562965
Author: Aviem Zur <aviem...@gmail.com>
Date:   2016-12-21T15:49:34Z

[BEAM-1146] Decrease spark runner startup overhead

Replace finding all `Source` and `Coder` implementations for serialization 
registration with wrapper classes.




> Decrease spark runner startup overhead
> --
>
> Key: BEAM-1146
> URL: https://issues.apache.org/jira/browse/BEAM-1146
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-spark
>Reporter: Aviem Zur
>Assignee: Aviem Zur
>
> BEAM-921 introduced a lazy singleton instantiated once in each machine 
> (driver & executors) which utilizes reflection to find all subclasses of 
> Source and Coder
> While this is beneficial in it's own right, the change added about one minute 
> of overhead in spark runner startup time (which cause the first job/stage to 
> take up to a minute).
> The change is in class {{BeamSparkRunnerRegistrator}}
> The reason reflection (specifically reflections library) was used here is 
> because  there is no current way of knowing all the source and coder classes 
> at runtime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Assigned] (BEAM-1145) Remove classifier from shaded spark runner artifact

2016-12-21 Thread Aviem Zur (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aviem Zur reassigned BEAM-1145:
---

Assignee: Aviem Zur  (was: Amit Sela)

> Remove classifier from shaded spark runner artifact
> ---
>
> Key: BEAM-1145
> URL: https://issues.apache.org/jira/browse/BEAM-1145
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-spark
>Reporter: Aviem Zur
>Assignee: Aviem Zur
>
> Shade plugin configured in spark runner's pom adds a classifier to spark 
> runner shaded jar
> {code:xml}
> true
> spark-app
> {code}
> This means, that in order for a user application that is dependent on 
> spark-runner to work in cluster mode, they have to add the classifier in 
> their dependency declaration, like so:
> {code:xml}
> 
> org.apache.beam
> beam-runners-spark
> 0.4.0-incubating-SNAPSHOT
> spark-app
> 
> {code}
> Otherwise, if they do not specify classifier, the jar they get is unshaded, 
> which in cluster mode, causes collisions between different guava versions.
> Example exception in cluster mode when adding the dependency without 
> classifier:
> {code}
> 16/12/12 06:58:56 WARN TaskSetManager: Lost task 4.0 in stage 8.0 (TID 153, 
> lvsriskng02.lvs.paypal.com): java.lang.NoSuchMethodError: 
> com.google.common.base.Stopwatch.createStarted()Lcom/google/common/base/Stopwatch;
>   at 
> org.apache.beam.runners.spark.stateful.StateSpecFunctions$1.apply(StateSpecFunctions.java:137)
>   at 
> org.apache.beam.runners.spark.stateful.StateSpecFunctions$1.apply(StateSpecFunctions.java:98)
>   at 
> org.apache.spark.streaming.StateSpec$$anonfun$1.apply(StateSpec.scala:180)
>   at 
> org.apache.spark.streaming.StateSpec$$anonfun$1.apply(StateSpec.scala:179)
>   at 
> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:57)
>   at 
> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:55)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at 
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>   at 
> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$.updateRecordWithData(MapWithStateRDD.scala:55)
>   at 
> org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:155)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:275)
>   at 
> org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:149)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:275)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:275)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> I would suggest that the classifier be removed from the shaded jar, to avoid 
> confusion among users, and have a better user experience.
> P.S. Looks like Dataflow runner does not add a classifier to its shaded jar.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Assigned] (BEAM-1144) Spark runner fails to deserialize MicrobatchSource in cluster mode

2016-12-21 Thread Aviem Zur (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aviem Zur reassigned BEAM-1144:
---

Assignee: Aviem Zur  (was: Amit Sela)

> Spark runner fails to deserialize MicrobatchSource in cluster mode
> --
>
> Key: BEAM-1144
> URL: https://issues.apache.org/jira/browse/BEAM-1144
> Project: Beam
>  Issue Type: Bug
>  Components: runner-spark
>Reporter: Aviem Zur
>Assignee: Aviem Zur
>
> When running in cluster mode (yarn), spark runner fails on deserialization of 
> {{MicrobatchSource}}
> After changes made in BEAM-921 spark runner fails in cluster mode with the 
> following:
> {code}
> 16/12/12 04:27:01 ERROR ApplicationMaster: User class threw exception: 
> org.apache.beam.sdk.Pipeline$PipelineExecutionException: 
> com.esotericsoftware.kryo.KryoException: Error during Java deserialization.
> org.apache.beam.sdk.Pipeline$PipelineExecutionException: 
> com.esotericsoftware.kryo.KryoException: Error during Java deserialization.
>   at 
> org.apache.beam.runners.spark.SparkPipelineResult.beamExceptionFrom(SparkPipelineResult.java:72)
>   at 
> org.apache.beam.runners.spark.SparkPipelineResult.waitUntilFinish(SparkPipelineResult.java:115)
>   at 
> org.apache.beam.runners.spark.SparkPipelineResult.waitUntilFinish(SparkPipelineResult.java:101)
>   at 
> com.paypal.risk.platform.aleph.example.MapOnlyExample.main(MapOnlyExample.java:38)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:559)
> Caused by: com.esotericsoftware.kryo.KryoException: Error during Java 
> deserialization.
>   at 
> com.esotericsoftware.kryo.serializers.JavaSerializer.read(JavaSerializer.java:42)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
>   at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:228)
>   at 
> org.apache.spark.serializer.DeserializationStream.readKey(Serializer.scala:169)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:201)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:198)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at 
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>   at 
> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$.updateRecordWithData(MapWithStateRDD.scala:55)
>   at 
> org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:155)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:275)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:275)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ClassNotFoundException: 
>

[jira] [Assigned] (BEAM-1146) Decrease spark runner startup overhead

2016-12-21 Thread Aviem Zur (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aviem Zur reassigned BEAM-1146:
---

Assignee: Aviem Zur  (was: Amit Sela)

> Decrease spark runner startup overhead
> --
>
> Key: BEAM-1146
> URL: https://issues.apache.org/jira/browse/BEAM-1146
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-spark
>Reporter: Aviem Zur
>Assignee: Aviem Zur
>
> BEAM-921 introduced a lazy singleton instantiated once in each machine 
> (driver & executors) which utilizes reflection to find all subclasses of 
> Source and Coder
> While this is beneficial in it's own right, the change added about one minute 
> of overhead in spark runner startup time (which cause the first job/stage to 
> take up to a minute).
> The change is in class {{BeamSparkRunnerRegistrator}}
> The reason reflection (specifically reflections library) was used here is 
> because  there is no current way of knowing all the source and coder classes 
> at runtime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Created] (BEAM-1191) Eliminate OldDoFn from the SDK

2016-12-20 Thread Kenneth Knowles (JIRA)

Kenneth Knowles created BEAM-1191:
-

 Summary: Eliminate OldDoFn from the SDK
 Key: BEAM-1191
 URL: https://issues.apache.org/jira/browse/BEAM-1191
 Project: Beam
  Issue Type: Improvement
  Components: sdk-java-core
Reporter: Kenneth Knowles
Priority: Minor


We are far enough along that now {{OldDoFn}} is not usable by users and 
BEAM-498 is closed out. The remaining occurrences are limited to runners and 
things that should end up in runners-core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Resolved] (BEAM-498) Make DoFnWithContext the new DoFn

2016-12-20 Thread Kenneth Knowles (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kenneth Knowles resolved BEAM-498.
--
   Resolution: Fixed
Fix Version/s: 0.3.0-incubating

> Make DoFnWithContext the new DoFn
> -
>
> Key: BEAM-498
> URL: https://issues.apache.org/jira/browse/BEAM-498
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-java-core
>Reporter: Kenneth Knowles
>Assignee: Kenneth Knowles
>  Labels: backward-incompatible
> Fix For: 0.3.0-incubating
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (BEAM-27) Add user-ready API for interacting with timers

2016-12-20 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-27?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15766056#comment-15766056
 ] 

ASF GitHub Bot commented on BEAM-27:


GitHub user kennknowles opened a pull request:

https://github.com/apache/incubator-beam/pull/1673

[BEAM-27] Require TimeDomain to delete a timer

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

 - [x] Make sure the PR title is formatted like:
   `[BEAM-] Description of pull request`
 - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable
   Travis-CI on your fork and ensure the whole test matrix passes).
 - [x] Replace `` in the title with the actual Jira issue
   number, if there is one.
 - [x] If this contribution is large, please file an Apache
   [Individual Contributor License 
Agreement](https://www.apache.org/licenses/icla.txt).

---

R: @aljoscha 

A bit of an oversight, I neglected the fact that runners generally store 
different sorts of timers in rather different ways. When a user sets a timer, 
the `DoFnSignature` is available, so this will be for free. And when system 
code deletes a timer, the domain will always be known.

This will require a Dataflow update, so don't worry if Dataflow-specific 
integration tests don't pass.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kennknowles/incubator-beam delete-by-domain

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-beam/pull/1673.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1673


commit 46dfd0fb4d2a1533d3ed053983faee6537d3ccf0
Author: Kenneth Knowles <k...@google.com>
Date:   2016-12-21T04:09:25Z

Require TimeDomain to delete a timer




> Add user-ready API for interacting with timers
> --
>
> Key: BEAM-27
> URL: https://issues.apache.org/jira/browse/BEAM-27
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-java-core
>Reporter: Kenneth Knowles
>Assignee: Kenneth Knowles
>
> Pipeline authors will benefit from a different factorization of interaction 
> with underlying timers. The current APIs are targeted at runner implementers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (BEAM-1190) FileBasedSource should ignore files that matched the glob but don't exist

2016-12-20 Thread Eugene Kirpichov (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15766004#comment-15766004
 ] 

Eugene Kirpichov commented on BEAM-1190:


Dan - please note I'm suggesting not to ignore non-existent files entirely, but 
only to ignore *files that were yielded by glob match operation but reported as 
non-existent by per-file stat operation* - i.e. not include such files into 
bundles produced by splitIntoBundles; effectively this is just increasing 
precision of the glob matching. Can you elaborate on how this is dangerous?

> FileBasedSource should ignore files that matched the glob but don't exist
> -
>
> Key: BEAM-1190
> URL: https://issues.apache.org/jira/browse/BEAM-1190
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Eugene Kirpichov
>Assignee: Eugene Kirpichov
>
> See user issue:
> http://stackoverflow.com/questions/41251741/coping-with-eventual-consistency-of-gcs-bucket-listing
> We should, after globbing the files in FileBasedSource, individually stat 
> every file and remove those that don't exist, to account for the possibility 
> that glob yielded non-existing files due to eventual consistency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (BEAM-79) Gearpump runner

2016-12-20 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-79?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765799#comment-15765799
 ] 

ASF GitHub Bot commented on BEAM-79:


GitHub user manuzhang reopened a pull request:

https://github.com/apache/incubator-beam/pull/1663

[BEAM-79] merge master into gearpump-runner branch

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

 - [x] Make sure the PR title is formatted like:
   `[BEAM-] Description of pull request`
 - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable
   Travis-CI on your fork and ensure the whole test matrix passes).
 - [x] Replace `` in the title with the actual Jira issue
   number, if there is one.
 - [x] If this contribution is large, please file an Apache
   [Individual Contributor License 
Agreement](https://www.apache.org/licenses/icla.txt).

---


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/manuzhang/incubator-beam gearpump-runner-sync

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-beam/pull/1663.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1663


commit e9f254ef2769a082c7fbb500c1c28c6224ac5a7f
Author: Jakob Homan <jgho...@gmail.com>
Date:   2016-12-07T00:59:50Z

[BEAM-1099] Minor typos in KafkaIO

commit afedd68e806830549724dfc0f2565d756cc6b46d
Author: Davor Bonaci <da...@google.com>
Date:   2016-12-07T01:03:54Z

This closes #1524

commit e8c9686a2e898d38afd692328eb171c542084747
Author: Pei He <pe...@google.com>
Date:   2016-11-23T23:59:56Z

[BEAM-1047] Add DataflowClient wrapper on top of JSON library.

commit ded58832ceaef487f4590d9396f09744288c955d
Author: Pei He <pe...@google.com>
Date:   2016-11-24T00:14:27Z

[Code Health] Remove redundant projectId from DataflowPipelineJob.

commit ce03f30c1ee0b84ad2e7f10a6272ffb25548244a
Author: Pei He <pe...@google.com>
Date:   2016-11-28T19:47:42Z

[BEAM-1047] Update dataflow runner code to use DataflowClient wrapper.

commit b2b570f27808b1671bf6cd0fc60f874da671d4ca
Author: bchambers <bchamb...@google.com>
Date:   2016-12-07T01:08:13Z

Closes #1434

commit 0a2ed832ce5af7556db605e99b985ed4ffc1b152
Author: Sam McVeety <s...@google.com>
Date:   2016-10-30T18:58:44Z

BigQueryIO.Read: support runtime options

commit 869b2710efdb90bc3ce5b6e9d4f3b49a3a804a63
Author: Aljoscha Krettek <aljoscha.kret...@gmail.com>
Date:   2016-12-07T05:28:13Z

[FLINK-1102] Fix Aggregator Registration in Flink Batch Runner

commit b41a46e86fd38c4a887f31bdf6cb75969f4750d3
Author: Aljoscha Krettek <aljoscha.kret...@gmail.com>
Date:   2016-12-07T07:26:02Z

This closes #1530

commit baf5e6bd9b1011f4c5c3974aa46393471b340c15
Author: Jean-Baptiste Onofré <jbono...@apache.org>
Date:   2016-12-07T07:37:33Z

[BEAM-1094] Set test scope for Kafka IO and junit

commit 9ccf6dbea0d3807fef6a7c0432906fffd2b8ec3f
Author: Sela <ans...@paypal.com>
Date:   2016-12-07T08:31:38Z

This closes #1531

commit dce3a196a3a26fdd42225520faf3d9084ee48123
Author: Sela <ans...@paypal.com>
Date:   2016-12-07T09:20:07Z

[BEAM-329] Update Spark runner README.

commit 02bb8c375c48847b1686d70184fb194500a62e8c
Author: Jean-Baptiste Onofré <jbono...@apache.org>
Date:   2016-12-07T11:51:09Z

[BEAM-329] This closes #1532

commit b2d72237b592e1dcb5cca30f5cbc9a11d2890c0f
Author: Kenneth Knowles <k...@google.com>
Date:   2016-12-06T23:20:28Z

Port most of DoFnRunner Javadoc to new DoFn

commit 1526184ae8be1f8ae6863f830653204157a584cd
Author: Thomas Groh <tg...@google.com>
Date:   2016-12-07T16:51:02Z

This closes #1527

commit 8e1e46e73edf9cce376ed7bd194db00edc3e60b4
Author: Kenneth Knowles <k...@google.com>
Date:   2016-12-07T05:01:37Z

Port ParDoTest from OldDoFn to new DoFn

commit ae52ec1bc3f3120e9f8e150e50c18564055a467c
Author: Kenneth Knowles <k...@google.com>
Date:   2016-12-07T17:00:18Z

This closes #1529

commit 55d333bff68809ff1a9154491ace02d2d16e3b85
Author: Thomas Groh <tg...@google.com>
Date:   2016-12-05T22:29:05Z

Only provide expanded Inputs and Outputs

This removes PInput and POutput from the immediate API Surface of
TransformHierarchy.Node, and forces Pipeline Visitors to access only
the expanded version of the output.

This is part of the move towards the runner-agnostic representation of a
graph.

commit 5b31a369962907e257de8019fbf6cde4c615b1c0
Author: Thomas Groh <tg...@google.com>
Date:   2016-12-07T17:14:38Z

This closes #1511

commit 43fef2775145f67def3ab8a246ecca192a7d650b
Author: Dan Halperin <

[jira] [Commented] (BEAM-1190) FileBasedSource should ignore files that matched the glob but don't exist

2016-12-20 Thread Paul Findlay (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765711#comment-15765711
 ] 

Paul Findlay commented on BEAM-1190:


[~dhalp...@google.com] Correct me if I'm wrong.. but isn't 
FileBasedSource.createReader basically already doing a stat for each file in 
the expanded list but swallowing the error if there is one, and leaving it for 
startImpl to blow up? We are just asking for the method to not be final so we 
can treat the different sub-classes of IOException appropriately (for our 
pipeline).

But would love to know if there is scary behaviour we haven't considered.

> FileBasedSource should ignore files that matched the glob but don't exist
> -
>
> Key: BEAM-1190
> URL: https://issues.apache.org/jira/browse/BEAM-1190
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Eugene Kirpichov
>Assignee: Eugene Kirpichov
>
> See user issue:
> http://stackoverflow.com/questions/41251741/coping-with-eventual-consistency-of-gcs-bucket-listing
> We should, after globbing the files in FileBasedSource, individually stat 
> every file and remove those that don't exist, to account for the possibility 
> that glob yielded non-existing files due to eventual consistency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Comment Edited] (BEAM-1190) FileBasedSource should ignore files that matched the glob but don't exist

2016-12-20 Thread Paul Findlay (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765711#comment-15765711
 ] 

Paul Findlay edited comment on BEAM-1190 at 12/21/16 12:52 AM:
---

[~dhalp...@google.com] Correct me if I'm wrong.. but isn't 
FileBasedSource.createReader basically already doing a stat for each file in 
the expanded list but swallowing the error if there is one, and leaving it for 
FileBasedReader.startImpl to blow up? We are just asking for the method to not 
be final so we can treat the different sub-classes of IOException appropriately 
(for our pipeline).

But would love to know if there is scary behaviour we haven't considered.


was (Author: p...@findlay.net.nz):
[~dhalp...@google.com] Correct me if I'm wrong.. but isn't 
FileBasedSource.createReader basically already doing a stat for each file in 
the expanded list but swallowing the error if there is one, and leaving it for 
startImpl to blow up? We are just asking for the method to not be final so we 
can treat the different sub-classes of IOException appropriately (for our 
pipeline).

But would love to know if there is scary behaviour we haven't considered.

> FileBasedSource should ignore files that matched the glob but don't exist
> -
>
> Key: BEAM-1190
> URL: https://issues.apache.org/jira/browse/BEAM-1190
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Eugene Kirpichov
>Assignee: Eugene Kirpichov
>
> See user issue:
> http://stackoverflow.com/questions/41251741/coping-with-eventual-consistency-of-gcs-bucket-listing
> We should, after globbing the files in FileBasedSource, individually stat 
> every file and remove those that don't exist, to account for the possibility 
> that glob yielded non-existing files due to eventual consistency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (BEAM-1190) FileBasedSource should ignore files that matched the glob but don't exist

2016-12-20 Thread Daniel Halperin (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765683#comment-15765683
 ] 

Daniel Halperin commented on BEAM-1190:
---

The two relevant JIRA issues: [BEAM-76] and [BEAM-60]

> FileBasedSource should ignore files that matched the glob but don't exist
> -
>
> Key: BEAM-1190
> URL: https://issues.apache.org/jira/browse/BEAM-1190
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Eugene Kirpichov
>Assignee: Eugene Kirpichov
>
> See user issue:
> http://stackoverflow.com/questions/41251741/coping-with-eventual-consistency-of-gcs-bucket-listing
> We should, after globbing the files in FileBasedSource, individually stat 
> every file and remove those that don't exist, to account for the possibility 
> that glob yielded non-existing files due to eventual consistency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (BEAM-1190) FileBasedSource should ignore files that matched the glob but don't exist

2016-12-20 Thread Daniel Halperin (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765675#comment-15765675
 ] 

Daniel Halperin commented on BEAM-1190:
---

I think this is a very scary default behavior, and something the user should 
implement on their own in pipeline construction.

Alternately, there's already a JIRA issue for giving the user a hook to run 
code at expansion time in order to, e.g., autocomplete sharding templates that 
eventual consistency chose not to show.

> FileBasedSource should ignore files that matched the glob but don't exist
> -
>
> Key: BEAM-1190
> URL: https://issues.apache.org/jira/browse/BEAM-1190
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Eugene Kirpichov
>Assignee: Eugene Kirpichov
>
> See user issue:
> http://stackoverflow.com/questions/41251741/coping-with-eventual-consistency-of-gcs-bucket-listing
> We should, after globbing the files in FileBasedSource, individually stat 
> every file and remove those that don't exist, to account for the possibility 
> that glob yielded non-existing files due to eventual consistency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Comment Edited] (BEAM-1190) FileBasedSource should ignore files that matched the glob but don't exist

2016-12-20 Thread Daniel Halperin (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765675#comment-15765675
 ] 

Daniel Halperin edited comment on BEAM-1190 at 12/21/16 12:33 AM:
--

I think this is a very scary proposal for a new default behavior, and something 
the user should implement on their own in pipeline construction.

Alternately, there's already a JIRA issue for giving the user a hook to run 
code at expansion time in order to, e.g., autocomplete sharding templates that 
eventual consistency chose not to show.


was (Author: dhalp...@google.com):
I think this is a very scary default behavior, and something the user should 
implement on their own in pipeline construction.

Alternately, there's already a JIRA issue for giving the user a hook to run 
code at expansion time in order to, e.g., autocomplete sharding templates that 
eventual consistency chose not to show.

> FileBasedSource should ignore files that matched the glob but don't exist
> -
>
> Key: BEAM-1190
> URL: https://issues.apache.org/jira/browse/BEAM-1190
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Eugene Kirpichov
>Assignee: Eugene Kirpichov
>
> See user issue:
> http://stackoverflow.com/questions/41251741/coping-with-eventual-consistency-of-gcs-bucket-listing
> We should, after globbing the files in FileBasedSource, individually stat 
> every file and remove those that don't exist, to account for the possibility 
> that glob yielded non-existing files due to eventual consistency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (BEAM-1190) FileBasedSource should ignore files that matched the glob but don't exist

2016-12-20 Thread Eugene Kirpichov (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765658#comment-15765658
 ] 

Eugene Kirpichov commented on BEAM-1190:


CC: [~dhalp...@google.com] [~pei...@gmail.com]

> FileBasedSource should ignore files that matched the glob but don't exist
> -
>
> Key: BEAM-1190
> URL: https://issues.apache.org/jira/browse/BEAM-1190
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Eugene Kirpichov
>Assignee: Eugene Kirpichov
>
> See user issue:
> http://stackoverflow.com/questions/41251741/coping-with-eventual-consistency-of-gcs-bucket-listing
> We should, after globbing the files in FileBasedSource, individually stat 
> every file and remove those that don't exist, to account for the possibility 
> that glob yielded non-existing files due to eventual consistency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Created] (BEAM-1190) FileBasedSource should ignore files that matched the glob but don't exist

2016-12-20 Thread Eugene Kirpichov (JIRA)

Eugene Kirpichov created BEAM-1190:
--

 Summary: FileBasedSource should ignore files that matched the glob 
but don't exist
 Key: BEAM-1190
 URL: https://issues.apache.org/jira/browse/BEAM-1190
 Project: Beam
  Issue Type: Bug
  Components: sdk-java-core
Reporter: Eugene Kirpichov
Assignee: Eugene Kirpichov


See user issue:
http://stackoverflow.com/questions/41251741/coping-with-eventual-consistency-of-gcs-bucket-listing

We should, after globbing the files in FileBasedSource, individually stat every 
file and remove those that don't exist, to account for the possibility that 
glob yielded non-existing files due to eventual consistency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (BEAM-1067) apex.examples.WordCountTest.testWordCountExample is flaky

2016-12-20 Thread Kenneth Knowles (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kenneth Knowles updated BEAM-1067:
--
Summary: apex.examples.WordCountTest.testWordCountExample is flaky  (was: 
apex.examples.WordCountTest.testWordCountExample is be flaky)

> apex.examples.WordCountTest.testWordCountExample is flaky
> -
>
> Key: BEAM-1067
> URL: https://issues.apache.org/jira/browse/BEAM-1067
> Project: Beam
>  Issue Type: Bug
>  Components: runner-apex
>Reporter: Stas Levin
>Assignee: Thomas Weise
>
> Seems that 
> {{org.apache.beam.runners.apex.examples.WordCountTest.testWordCountExample}} 
> is flaky.
> For example, 
> [this|https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/5408/org.apache.beam$beam-runners-apex/testReport/org.apache.beam.runners.apex.examples/WordCountTest/testWordCountExample/
>  ] run failed although no changes were made in {{runner-apex}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (BEAM-1067) apex.examples.WordCountTest.testWordCountExample is be flaky

2016-12-20 Thread Jason Kuster (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765632#comment-15765632
 ] 

Jason Kuster commented on BEAM-1067:


https://builds.apache.org/job/beam_PostCommit_Java_MavenInstall/2185/ just now

> apex.examples.WordCountTest.testWordCountExample is be flaky
> 
>
> Key: BEAM-1067
> URL: https://issues.apache.org/jira/browse/BEAM-1067
> Project: Beam
>  Issue Type: Bug
>  Components: runner-apex
>Reporter: Stas Levin
>Assignee: Thomas Weise
>
> Seems that 
> {{org.apache.beam.runners.apex.examples.WordCountTest.testWordCountExample}} 
> is flaky.
> For example, 
> [this|https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/5408/org.apache.beam$beam-runners-apex/testReport/org.apache.beam.runners.apex.examples/WordCountTest/testWordCountExample/
>  ] run failed although no changes were made in {{runner-apex}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (BEAM-25) Add user-ready API for interacting with state

2016-12-20 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765628#comment-15765628
 ] 

ASF GitHub Bot commented on BEAM-25:


GitHub user kennknowles opened a pull request:

https://github.com/apache/incubator-beam/pull/1670

[BEAM-25, BEAM-1117] Fixes for direct runner expansion and evaluation of 
stateful ParDo

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

 - [x] Make sure the PR title is formatted like:
   `[BEAM-] Description of pull request`
 - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable
   Travis-CI on your fork and ensure the whole test matrix passes).
 - [x] Replace `` in the title with the actual Jira issue
   number, if there is one.
 - [x] If this contribution is large, please file an Apache
   [Individual Contributor License 
Agreement](https://www.apache.org/licenses/icla.txt).

---

R: @tgroh also peeled off from the timers PR, these are fixes for the whole 
setup.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kennknowles/incubator-beam 
DirectRunner-Stateful

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-beam/pull/1670.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1670


commit 0615fc9749c3fd0012f4d5524ea8486413778636
Author: Kenneth Knowles <k...@google.com>
Date:   2016-12-20T21:58:29Z

Fix windowing in direct runner Stateful ParDo

commit 7bc23d6b53ed29ae565121df49180ad8d4aac653
Author: Kenneth Knowles <k...@google.com>
Date:   2016-12-20T23:59:45Z

Actually propagate and commit state in direct runner




> Add user-ready API for interacting with state
> -
>
> Key: BEAM-25
> URL: https://issues.apache.org/jira/browse/BEAM-25
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-java-core
>Reporter: Kenneth Knowles
>Assignee: Kenneth Knowles
>  Labels: State
>
> Our current state API is targeted at runner implementers, not pipeline 
> authors. As such it has many capabilities that are not necessary nor 
> desirable for simple use cases of stateful ParDo (such as dynamic state tag 
> creation). Implement a simple state intended for user access.
> (Details of our current thoughts in forthcoming design doc)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (BEAM-1117) Support for new Timer API in Direct runner

2016-12-20 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765551#comment-15765551
 ] 

ASF GitHub Bot commented on BEAM-1117:
--

GitHub user kennknowles opened a pull request:

https://github.com/apache/incubator-beam/pull/1669

[BEAM-1117] Direct runner timers prereqs

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

 - [x] Make sure the PR title is formatted like:
   `[BEAM-] Description of pull request`
 - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable
   Travis-CI on your fork and ensure the whole test matrix passes).
 - [x] Replace `` in the title with the actual Jira issue
   number, if there is one.
 - [x] If this contribution is large, please file an Apache
   [Individual Contributor License 
Agreement](https://www.apache.org/licenses/icla.txt).

---

Per request, here are some commits from #1667 broken out. I am happy to 
trim off more, etc, whatever is easiest for review.

R: @tgroh 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kennknowles/incubator-beam 
DirectRunner-timers-prereqs

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-beam/pull/1669.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1669


commit f64816e0cf2e4fcc9525f40ede01c2f8e4ecf28d
Author: Kenneth Knowles <k...@google.com>
Date:   2016-12-20T04:40:11Z

Add informative Instant formatter to BoundedWindow

commit 46c6a4f613629f09b48e3630aa344760b0ad46d4
Author: Kenneth Knowles <k...@google.com>
Date:   2016-12-20T04:40:47Z

Use informative Instant formatter in WatermarkHold

commit 92baa418fbe53c0e7c7afc81db31fc02ab7f3915
Author: Kenneth Knowles <k...@google.com>
Date:   2016-12-20T21:57:55Z

Add static Window.withOutputTimeFn to match build method

commit 7118c4ff85636a65431be54fa2e2f18fb52914cf
Author: Kenneth Knowles <k...@google.com>
Date:   2016-12-20T22:20:07Z

Add UsesTestStream for use with JUnit @Category

commit f667a3e8abcd95be7a235132219c936178ab6bc8
Author: Kenneth Knowles <k...@google.com>
Date:   2016-12-08T04:18:44Z

Allow setting timer by ID in DirectTimerInternals

commit 217e5245e59800d57aa36551fbbdb642a5b447a0
Author: Kenneth Knowles <k...@google.com>
Date:   2016-12-20T21:37:40Z

Hold output watermark according to pending timers




> Support for new Timer API in Direct runner
> --
>
> Key: BEAM-1117
> URL: https://issues.apache.org/jira/browse/BEAM-1117
> Project: Beam
>  Issue Type: New Feature
>  Components: runner-direct
>Reporter: Kenneth Knowles
>Assignee: Kenneth Knowles
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (BEAM-646) Get runners out of the apply()

2016-12-20 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765542#comment-15765542
 ] 

ASF GitHub Bot commented on BEAM-646:
-

Github user asfgit closed the pull request at:

https://github.com/apache/incubator-beam/pull/1569


> Get runners out of the apply()
> --
>
> Key: BEAM-646
> URL: https://issues.apache.org/jira/browse/BEAM-646
> Project: Beam
>  Issue Type: Improvement
>  Components: beam-model-runner-api, sdk-java-core
>Reporter: Kenneth Knowles
>Assignee: Thomas Groh
>
> Right now, the runner intercepts calls to apply() and replaces transforms as 
> we go. This means that there is no "original" user graph. For portability and 
> misc architectural benefits, we would like to build the original graph first, 
> and have the runner override later.
> Some runners already work in this manner, but we could integrate it more 
> smoothly, with more validation, via some handy APIs on e.g. the Pipeline 
> object.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (BEAM-362) Move shared runner functionality out of SDK and into runners/core-java

2016-12-20 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765531#comment-15765531
 ] 

ASF GitHub Bot commented on BEAM-362:
-

Github user asfgit closed the pull request at:

https://github.com/apache/incubator-beam/pull/1665


> Move shared runner functionality out of SDK and into runners/core-java
> --
>
> Key: BEAM-362
> URL: https://issues.apache.org/jira/browse/BEAM-362
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-core
>Reporter: Kenneth Knowles
>Assignee: Kenneth Knowles
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (BEAM-27) Add user-ready API for interacting with timers

2016-12-20 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-27?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765387#comment-15765387
 ] 

ASF GitHub Bot commented on BEAM-27:


GitHub user kennknowles opened a pull request:

https://github.com/apache/incubator-beam/pull/1668

[BEAM-27, BEAM-362] Remove deprecated InMemoryTimerInternals from SDK

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

 - [x] Make sure the PR title is formatted like:
   `[BEAM-] Description of pull request`
 - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable
   Travis-CI on your fork and ensure the whole test matrix passes).
 - [x] Replace `` in the title with the actual Jira issue
   number, if there is one.
 - [x] If this contribution is large, please file an Apache
   [Individual Contributor License 
Agreement](https://www.apache.org/licenses/icla.txt).

---


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kennknowles/incubator-beam 
InMemoryTimerInternals

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-beam/pull/1668.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1668


commit bbea8469912b23383a9ae5cf084b5801706e
Author: Kenneth Knowles <k...@google.com>
Date:   2016-12-20T22:07:00Z

Remove deprecated InMemoryTimerInternals from SDK




> Add user-ready API for interacting with timers
> --
>
> Key: BEAM-27
> URL: https://issues.apache.org/jira/browse/BEAM-27
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-java-core
>Reporter: Kenneth Knowles
>Assignee: Kenneth Knowles
>
> Pipeline authors will benefit from a different factorization of interaction 
> with underlying timers. The current APIs are targeted at runner implementers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Closed] (BEAM-1097) Dataflow error message for non-existing gcpTempLocation is misleading

2016-12-20 Thread Scott Wegner (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Scott Wegner closed BEAM-1097.
--
   Resolution: Fixed
Fix Version/s: 0.5.0-incubating

> Dataflow error message for non-existing gcpTempLocation is misleading
> -
>
> Key: BEAM-1097
> URL: https://issues.apache.org/jira/browse/BEAM-1097
> Project: Beam
>  Issue Type: Bug
>  Components: examples-java, runner-dataflow
>Reporter: Scott Wegner
>Assignee: Scott Wegner
>Priority: Minor
> Fix For: 0.5.0-incubating
>
>
> The error message for specifying a GCP tempLocation which doesn't exist is 
> misleading. Rather than mentioning the given path doesn't exist, it says none 
> ways specified.
> This is particularly frustrating because it's one of the few configuration 
> values necessary to get started with an example or starter archetype, and 
> it's easy to introduce a typo as it's specified on the commandline. In my 
> case, I was specifying "gs://swegner-tmp" instead of "gs://swegner-test".
> Repro:
> 1. Clone the starter archetype: {noformat}mvn archetype:generate 
> -DarchetypeGroupId=org.apache.beam 
> -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-starter{noformat}
> 2. Add beam-runners-google-cloud-dataflow-java as a dependency in the 
> generated pom.xml
> 3. Build: {noformat}mvn install{noformat}
> 4. Run: {noformat}mvn exec:java -DmainClass=swegner.StarterPipeline 
> -Dexec.args="--runner=DataflowRunner 
> --tempLocation=gs://swegner-tmp"{noformat}
> Expected: An error message along the lines of: "The specified GCP temp 
> location 'gs://swegner-tmp' does not exist under project 'myGcpProject'"
> bq. [ERROR] Failed to execute goal 
> org.codehaus.mojo:exec-maven-plugin:1.4.0:java (default-cli) on project 
> counter-names-test: An exception occured while executing the Java class. 
> null: InvocationTargetException: Failed to construct instance from factory 
> method DataflowRunner#fromOptions(interface 
> org.apache.beam.sdk.options.PipelineOptions): DataflowRunner requires 
> gcpTempLocation, and it is missing in PipelineOptions. -> [Help 1]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (BEAM-1097) Dataflow error message for non-existing gcpTempLocation is misleading

2016-12-20 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765244#comment-15765244
 ] 

ASF GitHub Bot commented on BEAM-1097:
--

Github user asfgit closed the pull request at:

https://github.com/apache/incubator-beam/pull/1522


> Dataflow error message for non-existing gcpTempLocation is misleading
> -
>
> Key: BEAM-1097
> URL: https://issues.apache.org/jira/browse/BEAM-1097
> Project: Beam
>  Issue Type: Bug
>  Components: examples-java, runner-dataflow
>Reporter: Scott Wegner
>Assignee: Scott Wegner
>Priority: Minor
>
> The error message for specifying a GCP tempLocation which doesn't exist is 
> misleading. Rather than mentioning the given path doesn't exist, it says none 
> ways specified.
> This is particularly frustrating because it's one of the few configuration 
> values necessary to get started with an example or starter archetype, and 
> it's easy to introduce a typo as it's specified on the commandline. In my 
> case, I was specifying "gs://swegner-tmp" instead of "gs://swegner-test".
> Repro:
> 1. Clone the starter archetype: {noformat}mvn archetype:generate 
> -DarchetypeGroupId=org.apache.beam 
> -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-starter{noformat}
> 2. Add beam-runners-google-cloud-dataflow-java as a dependency in the 
> generated pom.xml
> 3. Build: {noformat}mvn install{noformat}
> 4. Run: {noformat}mvn exec:java -DmainClass=swegner.StarterPipeline 
> -Dexec.args="--runner=DataflowRunner 
> --tempLocation=gs://swegner-tmp"{noformat}
> Expected: An error message along the lines of: "The specified GCP temp 
> location 'gs://swegner-tmp' does not exist under project 'myGcpProject'"
> bq. [ERROR] Failed to execute goal 
> org.codehaus.mojo:exec-maven-plugin:1.4.0:java (default-cli) on project 
> counter-names-test: An exception occured while executing the Java class. 
> null: InvocationTargetException: Failed to construct instance from factory 
> method DataflowRunner#fromOptions(interface 
> org.apache.beam.sdk.options.PipelineOptions): DataflowRunner requires 
> gcpTempLocation, and it is missing in PipelineOptions. -> [Help 1]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (BEAM-1112) Python E2E Integration Test Framework - Batch Only

2016-12-20 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765220#comment-15765220
 ] 

ASF GitHub Bot commented on BEAM-1112:
--

GitHub user markflyhigh reopened a pull request:

https://github.com/apache/incubator-beam/pull/1639

[BEAM-1112] Python E2E Test Framework And Wordcount E2E Test

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

 - [x] Make sure the PR title is formatted like:
   `[BEAM-] Description of pull request`
 - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable
   Travis-CI on your fork and ensure the whole test matrix passes).
 - [ ] Replace `` in the title with the actual Jira issue
   number, if there is one.
 - [ ] If this contribution is large, please file an Apache
   [Individual Contributor License 
Agreement](https://www.apache.org/licenses/icla.txt).

---

 - E2e test framework that supports TestRunner start and verify pipeline 
job.
   - add `TestOptions` which defined `on_success_matcher` that is used to 
verify state/output of pipeline job.
   - validate `on_success_matcher` before pipeline execution to make sure 
it's unpicklable to a subclass of BaseMatcher.
   - create a `TestDataflowRunner` which provide functionalities of 
`DataflowRunner` plus result verification.
   - provide a test verifier `PipelineStateMatcher` that can verify 
pipeline job finished in DONE or not.
 - Add wordcount_it (it = integration test) that build e2e test based on 
existing wordcount pipeline.
   - include wordcount_it to nose collector, so that wordcount_it can be 
collected and run by nose.
   - skip ITs when running unit tests from tox in precommit and postcommit.

Current changes will not change behavior of existing pre/postcommit.
Test is done by running `tox -e py27 -c sdks/python/tox.ini` for unit test 
and running wordcount_it with `TestDataflowRunner` on service 
([link](https://pantheon.corp.google.com/dataflow/job/2016-12-15_17_36_16-3857167705491723621?project=google.com:clouddfe)).

TODO:
 - Output data verifier that verify pipeline output that stores in 
filesystem.
 - Add wordcount_it to precommit and replace existing wordcount execution 
command in postcommit with a better structured nose command.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/markflyhigh/incubator-beam e2e-testrunner

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-beam/pull/1639.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1639


commit e1e1fa3a60e1fe234829432d144d6689e240b6f0
Author: Mark Liu <mark...@google.com>
Date:   2016-12-16T01:41:20Z

[BEAM-1112] Python E2E Test Framework And Wordcount E2E Test

commit 0e7007879ee082e3afe5db36107f51c03274f3f5
Author: Mark Liu <mark...@google.com>
Date:   2016-12-16T02:55:53Z

fixup! Fix Code Style

commit d6d71a717e8ed7ab32ffa02621c837c797f66cd7
Author: Mark Liu <mark...@google.com>
Date:   2016-12-20T19:15:59Z

fixup! Address Ahmet comments




> Python E2E Integration Test Framework - Batch Only
> --
>
> Key: BEAM-1112
>     URL: https://issues.apache.org/jira/browse/BEAM-1112
> Project: Beam
>  Issue Type: Task
>  Components: sdk-py, testing
>Reporter: Mark Liu
>Assignee: Mark Liu
>
> Parity with Java. 
> Build e2e integration test framework that can configure and run batch 
> pipeline with specified test runner, wait for pipeline execution and verify 
> results with given verifiers in the end.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (BEAM-1112) Python E2E Integration Test Framework - Batch Only

2016-12-20 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765219#comment-15765219
 ] 

ASF GitHub Bot commented on BEAM-1112:
--

Github user markflyhigh closed the pull request at:

https://github.com/apache/incubator-beam/pull/1639


> Python E2E Integration Test Framework - Batch Only
> --
>
> Key: BEAM-1112
> URL: https://issues.apache.org/jira/browse/BEAM-1112
> Project: Beam
>  Issue Type: Task
>  Components: sdk-py, testing
>Reporter: Mark Liu
>Assignee: Mark Liu
>
> Parity with Java. 
> Build e2e integration test framework that can configure and run batch 
> pipeline with specified test runner, wait for pipeline execution and verify 
> results with given verifiers in the end.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (BEAM-1117) Support for new Timer API in Direct runner

2016-12-20 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765203#comment-15765203
 ] 

ASF GitHub Bot commented on BEAM-1117:
--

GitHub user kennknowles opened a pull request:

https://github.com/apache/incubator-beam/pull/1667

[BEAM-1117] Support user timers for ParDo in the direct runner

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

 - [x] Make sure the PR title is formatted like:
   `[BEAM-] Description of pull request`
 - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable
   Travis-CI on your fork and ensure the whole test matrix passes).
 - [x] Replace `` in the title with the actual Jira issue
   number, if there is one.
 - [x] If this contribution is large, please file an Apache
   [Individual Contributor License 
Agreement](https://www.apache.org/licenses/icla.txt).

---

R: @tgroh 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kennknowles/incubator-beam DirectRunner-timers

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-beam/pull/1667.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1667


commit a3ac176cd7edb18d4f633682ee0e6ff30ab76f64
Author: Kenneth Knowles <k...@google.com>
Date:   2016-12-08T04:18:44Z

Allow setting timer by ID in DirectTimerInternals

commit 445750d6cf36f1eda1094531541788260c3fe229
Author: Kenneth Knowles <k...@google.com>
Date:   2016-12-08T18:27:23Z

No longer reject timers for ParDo in direct runner

commit d428abe9e12ddd2609773512a180589ff960d954
Author: Kenneth Knowles <k...@google.com>
Date:   2016-12-08T23:18:44Z

Deliver timers in the direct runner

commit 6915bbc550ad692656e8eeb1ba7161213c9a6ce6
Author: Kenneth Knowles <k...@google.com>
Date:   2016-12-20T04:40:11Z

Add informative Instant formatter to BoundedWindow

commit 2af3f93602b5299cc33c876310a784fc82ff4941
Author: Kenneth Knowles <k...@google.com>
Date:   2016-12-20T04:40:47Z

Use informative Instant formatter in WatermarkHold




> Support for new Timer API in Direct runner
> --
>
> Key: BEAM-1117
>     URL: https://issues.apache.org/jira/browse/BEAM-1117
> Project: Beam
>  Issue Type: New Feature
>  Components: runner-direct
>Reporter: Kenneth Knowles
>Assignee: Kenneth Knowles
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (BEAM-1117) Support for new Timer API in Direct runner

2016-12-20 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765191#comment-15765191
 ] 

ASF GitHub Bot commented on BEAM-1117:
--

Github user asfgit closed the pull request at:

https://github.com/apache/incubator-beam/pull/1581


> Support for new Timer API in Direct runner
> --
>
> Key: BEAM-1117
> URL: https://issues.apache.org/jira/browse/BEAM-1117
> Project: Beam
>  Issue Type: New Feature
>  Components: runner-direct
>Reporter: Kenneth Knowles
>Assignee: Kenneth Knowles
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (BEAM-1189) Add guide for release verifiers in the release guide

2016-12-20 Thread JIRA


[ 
https://issues.apache.org/jira/browse/BEAM-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765117#comment-15765117
 ] 

Jean-Baptiste Onofré commented on BEAM-1189:


+1

How to verify artifacts is something interesting indeed.

I did quick instructions for Apache Karaf: 
http://karaf.apache.org/download.html#verify

> Add guide for release verifiers in the release guide
> 
>
> Key: BEAM-1189
> URL: https://issues.apache.org/jira/browse/BEAM-1189
> Project: Beam
>  Issue Type: Improvement
>  Components: website
>Reporter: Kenneth Knowles
>Assignee: James Malone
>
> This came up during the 0.4.0-incubating release discussion.
> There is this checklist: 
> http://incubator.apache.org/guides/releasemanagement.html#check-list
> And we could point to that but make more detailed Beam-specific instructions 
> on 
> http://beam.apache.org/contribute/release-guide/#vote-on-the-release-candidate
> And the template for the vote email should include a link to suggested 
> verification steps.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Created] (BEAM-1189) Add guide for release verifiers in the release guide

2016-12-20 Thread Kenneth Knowles (JIRA)

Kenneth Knowles created BEAM-1189:
-

 Summary: Add guide for release verifiers in the release guide
 Key: BEAM-1189
 URL: https://issues.apache.org/jira/browse/BEAM-1189
 Project: Beam
  Issue Type: Improvement
  Components: website
Reporter: Kenneth Knowles
Assignee: James Malone


This came up during the 0.4.0-incubating release discussion.

There is this checklist: 
http://incubator.apache.org/guides/releasemanagement.html#check-list

And we could point to that but make more detailed Beam-specific instructions on 
http://beam.apache.org/contribute/release-guide/#vote-on-the-release-candidate

And the template for the vote email should include a link to suggested 
verification steps.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (BEAM-27) Add user-ready API for interacting with timers

2016-12-20 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-27?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765055#comment-15765055
 ] 

ASF GitHub Bot commented on BEAM-27:


Github user asfgit closed the pull request at:

https://github.com/apache/incubator-beam/pull/1652


> Add user-ready API for interacting with timers
> --
>
> Key: BEAM-27
> URL: https://issues.apache.org/jira/browse/BEAM-27
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-java-core
>Reporter: Kenneth Knowles
>Assignee: Kenneth Knowles
>
> Pipeline authors will benefit from a different factorization of interaction 
> with underlying timers. The current APIs are targeted at runner implementers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (BEAM-1188) More Verifiers For Python E2E Tests

2016-12-20 Thread Mark Liu (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Liu updated BEAM-1188:
---
Description: 
Add more basic verifiers in e2e test to verify output data in different 
storage/fs:

1. File verifier: compute and verify checksum of file(s) that’s stored on a 
filesystem (GCS / local fs). 
2. Bigquery verifier: query from Bigquery table and verify response content. 

Also update TestOptions.on_success_matcher to accept a list of matchers instead 
of single one.

Note: Have retry when doing IO to avoid test flacky that may come from 
inconsistency of the filesystem. This problem happened in Java integration 
tests.




> More Verifiers For Python E2E Tests
> ---
>
> Key: BEAM-1188
> URL: https://issues.apache.org/jira/browse/BEAM-1188
> Project: Beam
>  Issue Type: Task
>  Components: sdk-py, testing
>Reporter: Mark Liu
>Assignee: Mark Liu
>
> Add more basic verifiers in e2e test to verify output data in different 
> storage/fs:
> 1. File verifier: compute and verify checksum of file(s) that’s stored on a 
> filesystem (GCS / local fs). 
> 2. Bigquery verifier: query from Bigquery table and verify response content. 
> Also update TestOptions.on_success_matcher to accept a list of matchers 
> instead of single one.
> Note: Have retry when doing IO to avoid test flacky that may come from 
> inconsistency of the filesystem. This problem happened in Java integration 
> tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (BEAM-1188) More Verifiers For Python E2E Tests

2016-12-20 Thread Mark Liu (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Liu updated BEAM-1188:
---
Description: 
Add more basic verifiers in e2e test to verify output data in different 
storage/fs:

1. File verifier: compute and verify checksum of file(s) that’s stored on a 
filesystem (GCS / local fs). 
2. Bigquery verifier: query from Bigquery table and verify response content. 
...

Also update TestOptions.on_success_matcher to accept a list of matchers instead 
of single one.

Note: Have retry when doing IO to avoid test flacky that may come from 
inconsistency of the filesystem. This problem happened in Java integration 
tests.




  was:
Add more basic verifiers in e2e test to verify output data in different 
storage/fs:

1. File verifier: compute and verify checksum of file(s) that’s stored on a 
filesystem (GCS / local fs). 
2. Bigquery verifier: query from Bigquery table and verify response content. 

Also update TestOptions.on_success_matcher to accept a list of matchers instead 
of single one.

Note: Have retry when doing IO to avoid test flacky that may come from 
inconsistency of the filesystem. This problem happened in Java integration 
tests.





> More Verifiers For Python E2E Tests
> ---
>
> Key: BEAM-1188
> URL: https://issues.apache.org/jira/browse/BEAM-1188
> Project: Beam
>  Issue Type: Task
>  Components: sdk-py, testing
>Reporter: Mark Liu
>Assignee: Mark Liu
>
> Add more basic verifiers in e2e test to verify output data in different 
> storage/fs:
> 1. File verifier: compute and verify checksum of file(s) that’s stored on a 
> filesystem (GCS / local fs). 
> 2. Bigquery verifier: query from Bigquery table and verify response content. 
> ...
> Also update TestOptions.on_success_matcher to accept a list of matchers 
> instead of single one.
> Note: Have retry when doing IO to avoid test flacky that may come from 
> inconsistency of the filesystem. This problem happened in Java integration 
> tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Created] (BEAM-1188) More Verifiers For Python E2E Tests

2016-12-20 Thread Mark Liu (JIRA)

Mark Liu created BEAM-1188:
--

 Summary: More Verifiers For Python E2E Tests
 Key: BEAM-1188
 URL: https://issues.apache.org/jira/browse/BEAM-1188
 Project: Beam
  Issue Type: Task
  Components: sdk-py, testing
Reporter: Mark Liu
Assignee: Mark Liu






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (BEAM-1187) GCP Transport not performing timed backoff after connection failure

2016-12-20 Thread Davor Bonaci (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-1187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davor Bonaci updated BEAM-1187:
---
Assignee: Pei He  (was: Davor Bonaci)

> GCP Transport not performing timed backoff after connection failure
> ---
>
> Key: BEAM-1187
> URL: https://issues.apache.org/jira/browse/BEAM-1187
> Project: Beam
>  Issue Type: Bug
>  Components: runner-dataflow, sdk-java-core, sdk-java-gcp
>Reporter: Luke Cwik
>Assignee: Pei He
>Priority: Minor
>
> The http request retries are failing and seemingly being immediately retried 
> if there is a connection exception. Note that below all the times are the 
> same, and also that we are logging too much. This seems to be related to the 
> interaction by the chaining http request initializer combining the Credential 
> initializer followed by the RetryHttpRequestInitializer. Also, note that we 
> never log "Request failed with IOException, will NOT retry" which implies 
> that the retry logic never made it to the RetryHttpRequestInitializer.
> Action items are:
> 1) Ensure that the RetryHttpRequestInitializer is used
> 2) Ensure that calls do backoff
> 3) Reduce the logging to one terminal statement saying that we retried X 
> times and final failure was YYY.
> Dump of console output:
> Dec 20, 2016 9:12:20 AM 
> com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner fromOptions
> INFO: PipelineOptions.filesToStage was not specified. Defaulting to files 
> from the classpath: will stage 1 files. Enable logging at DEBUG level to see 
> which files will be staged.
> Dec 20, 2016 9:12:21 AM 
> com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner run
> INFO: Executing pipeline on the Dataflow Service, which will have billing 
> implications related to Google Compute Engine usage and other Google Cloud 
> Services.
> Dec 20, 2016 9:12:21 AM com.google.cloud.dataflow.sdk.util.PackageUtil 
> stageClasspathElements
> INFO: Uploading 1 files from PipelineOptions.filesToStage to staging location 
> to prepare for execution.
> Dec 20, 2016 9:12:21 AM com.google.cloud.dataflow.sdk.util.PackageUtil 
> stageClasspathElements
> INFO: Uploading PipelineOptions.filesToStage complete: 1 files newly 
> uploaded, 0 files cached
> Dec 20, 2016 9:12:22 AM com.google.api.client.http.HttpRequest execute
> WARNING: exception thrown while executing request
> java.net.ConnectException: Connection refused
>   at java.net.PlainSocketImpl.socketConnect(Native Method)
>   at 
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
>   at 
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
>   at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
>   at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>   at java.net.Socket.connect(Socket.java:589)
>   at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
>   at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
>   at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
>   at sun.net.www.http.HttpClient.(HttpClient.java:211)
>   at sun.net.www.http.HttpClient.New(HttpClient.java:308)
>   at sun.net.www.http.HttpClient.New(HttpClient.java:326)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getOutputStream0(HttpURLConnection.java:1283)
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1258)
>   at 
> com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:77)
>   at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:981)
>   at 
> com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419)
>   at 
> com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
>   at 
> com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
>   at 
> com.google.cloud.da

[jira] [Reopened] (BEAM-1186) Migrate the remaining tests to use TestPipeline as a JUnit rule.

2016-12-20 Thread Kenneth Knowles (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-1186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kenneth Knowles reopened BEAM-1186:
---

> Migrate the remaining tests to use TestPipeline as a JUnit rule.
> 
>
> Key: BEAM-1186
> URL: https://issues.apache.org/jira/browse/BEAM-1186
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-core
>Reporter: Stas Levin
>Assignee: Stas Levin
>Priority: Minor
> Fix For: 0.5.0-incubating
>
>
> Following up on [BEAM-1176|https://issues.apache.org/jira/browse/BEAM-1176], 
> the following tests still have direct calls to {{TestPipeline.create()}}:
> * {{AvroIOGeneratedClassTest#runTestRead}}
> * {{ApproximateUniqueTest#runApproximateUniqueWithDuplicates}}
> * {{ApproximateUniqueTest#runApproximateUniqueWithSkewedDistributions}}
> * {{SampleTest#runPickAnyTest}}
> * {{BigtableIOTest#runReadTest}}
> Consider using [parametrised 
> tests|https://github.com/Pragmatists/junitparams] as suggested by [~lcwik].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Resolved] (BEAM-1176) Make our test suites use @Rule TestPipeline

2016-12-20 Thread Kenneth Knowles (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kenneth Knowles resolved BEAM-1176.
---
   Resolution: Fixed
Fix Version/s: 0.5.0-incubating

> Make our test suites use @Rule TestPipeline
> ---
>
> Key: BEAM-1176
> URL: https://issues.apache.org/jira/browse/BEAM-1176
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-core
>Reporter: Kenneth Knowles
>Assignee: Stas Levin
>Priority: Minor
> Fix For: 0.5.0-incubating
>
>
> Now that [~staslev] has made {{TestPipeline}} a JUnit rule that performs 
> useful sanity checks, we should port all of our tests to it so that they set 
> a good example for users. Maybe we'll even catch some straggling tests with 
> errors :-)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Resolved] (BEAM-1186) Migrate the remaining tests to use TestPipeline as a JUnit rule.

2016-12-20 Thread Kenneth Knowles (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-1186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kenneth Knowles resolved BEAM-1186.
---
   Resolution: Fixed
Fix Version/s: 0.5.0-incubating

> Migrate the remaining tests to use TestPipeline as a JUnit rule.
> 
>
> Key: BEAM-1186
> URL: https://issues.apache.org/jira/browse/BEAM-1186
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-core
>Reporter: Stas Levin
>Assignee: Stas Levin
>Priority: Minor
> Fix For: 0.5.0-incubating
>
>
> Following up on [BEAM-1176|https://issues.apache.org/jira/browse/BEAM-1176], 
> the following tests still have direct calls to {{TestPipeline.create()}}:
> * {{AvroIOGeneratedClassTest#runTestRead}}
> * {{ApproximateUniqueTest#runApproximateUniqueWithDuplicates}}
> * {{ApproximateUniqueTest#runApproximateUniqueWithSkewedDistributions}}
> * {{SampleTest#runPickAnyTest}}
> * {{BigtableIOTest#runReadTest}}
> Consider using [parametrised 
> tests|https://github.com/Pragmatists/junitparams] as suggested by [~lcwik].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (BEAM-1176) Make our test suites use @Rule TestPipeline

2016-12-20 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15764973#comment-15764973
 ] 

ASF GitHub Bot commented on BEAM-1176:
--

Github user asfgit closed the pull request at:

https://github.com/apache/incubator-beam/pull/1664


> Make our test suites use @Rule TestPipeline
> ---
>
> Key: BEAM-1176
> URL: https://issues.apache.org/jira/browse/BEAM-1176
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-core
>Reporter: Kenneth Knowles
>Assignee: Stas Levin
>Priority: Minor
>
> Now that [~staslev] has made {{TestPipeline}} a JUnit rule that performs 
> useful sanity checks, we should port all of our tests to it so that they set 
> a good example for users. Maybe we'll even catch some straggling tests with 
> errors :-)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Created] (BEAM-1187) GCP Transport not performing timed backoff after connection failure

2016-12-20 Thread Luke Cwik (JIRA)

Luke Cwik created BEAM-1187:
---

 Summary: GCP Transport not performing timed backoff after 
connection failure
 Key: BEAM-1187
 URL: https://issues.apache.org/jira/browse/BEAM-1187
 Project: Beam
  Issue Type: Bug
  Components: runner-dataflow, sdk-java-core, sdk-java-gcp
Reporter: Luke Cwik
Assignee: Davor Bonaci
Priority: Minor


The http request retries are failing and seemingly being immediately retried if 
there is a connection exception. Note that below all the times are the same, 
and also that we are logging too much. This seems to be related to the 
interaction by the chaining http request initializer combining the Credential 
initializer followed by the RetryHttpRequestInitializer. Also, note that we 
never log "Request failed with IOException, will NOT retry" which implies that 
the retry logic never made it to the RetryHttpRequestInitializer.

Action items are:
1) Ensure that the RetryHttpRequestInitializer is used
2) Ensure that calls do backoff
3) Reduce the logging to one terminal statement saying that we retried X times 
and final failure was YYY.

Dump of console output:
Dec 20, 2016 9:12:20 AM 
com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner fromOptions
INFO: PipelineOptions.filesToStage was not specified. Defaulting to files from 
the classpath: will stage 1 files. Enable logging at DEBUG level to see which 
files will be staged.
Dec 20, 2016 9:12:21 AM 
com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner run
INFO: Executing pipeline on the Dataflow Service, which will have billing 
implications related to Google Compute Engine usage and other Google Cloud 
Services.
Dec 20, 2016 9:12:21 AM com.google.cloud.dataflow.sdk.util.PackageUtil 
stageClasspathElements
INFO: Uploading 1 files from PipelineOptions.filesToStage to staging location 
to prepare for execution.
Dec 20, 2016 9:12:21 AM com.google.cloud.dataflow.sdk.util.PackageUtil 
stageClasspathElements
INFO: Uploading PipelineOptions.filesToStage complete: 1 files newly uploaded, 
0 files cached
Dec 20, 2016 9:12:22 AM com.google.api.client.http.HttpRequest execute
WARNING: exception thrown while executing request
java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at 
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at 
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at 
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
at sun.net.www.http.HttpClient.(HttpClient.java:211)
at sun.net.www.http.HttpClient.New(HttpClient.java:308)
at sun.net.www.http.HttpClient.New(HttpClient.java:326)
at 
sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
at 
sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
at 
sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
at 
sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
at 
sun.net.www.protocol.http.HttpURLConnection.getOutputStream0(HttpURLConnection.java:1283)
at 
sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1258)
at 
com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:77)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:981)
at 
com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419)
at 
com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
at 
com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
at 
com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner.run(DataflowPipelineRunner.java:632)
at 
com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner.run(DataflowPipelineRunner.java:201)
at com.google.cloud.dataflow.sdk.Pipeline.run(Pipeline.java:181)
at 
com.google.cloud.dataflow.integration.NumbersStreaming.numbersStreamingFromPubsub(NumbersStreaming.java:378)
at 
com.google.cloud.dataflow.integration.NumbersStreaming.main(NumbersStreaming.java:831)

Dec 20, 2016 9:12:22 AM com.google.api.client.http.HttpRequest execute
WARNING: exception thrown while executi

[jira] [Commented] (BEAM-1176) Make our test suites use @Rule TestPipeline

2016-12-20 Thread Stas Levin (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15764919#comment-15764919
 ] 

Stas Levin commented on BEAM-1176:
--

Sounds good, enter [BEAM-1186|https://issues.apache.org/jira/browse/BEAM-1186].

> Make our test suites use @Rule TestPipeline
> ---
>
> Key: BEAM-1176
> URL: https://issues.apache.org/jira/browse/BEAM-1176
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-core
>Reporter: Kenneth Knowles
>Assignee: Stas Levin
>Priority: Minor
>
> Now that [~staslev] has made {{TestPipeline}} a JUnit rule that performs 
> useful sanity checks, we should port all of our tests to it so that they set 
> a good example for users. Maybe we'll even catch some straggling tests with 
> errors :-)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Created] (BEAM-1186) Migrate the remaining tests to use TestPipeline as a JUnit rule.

2016-12-20 Thread Stas Levin (JIRA)

Stas Levin created BEAM-1186:


 Summary: Migrate the remaining tests to use TestPipeline as a 
JUnit rule.
 Key: BEAM-1186
 URL: https://issues.apache.org/jira/browse/BEAM-1186
 Project: Beam
  Issue Type: Improvement
  Components: sdk-java-core
Reporter: Stas Levin
Assignee: Stas Levin
Priority: Minor


Following up on [BEAM-1176|https://issues.apache.org/jira/browse/BEAM-1176], 
the following tests still have direct calls to {{TestPipeline.create()}}:
* {{AvroIOGeneratedClassTest#runTestRead}}
* {{ApproximateUniqueTest#runApproximateUniqueWithDuplicates}}
* {{ApproximateUniqueTest#runApproximateUniqueWithSkewedDistributions}}
* {{SampleTest#runPickAnyTest}}
* {{BigtableIOTest#runReadTest}}

Consider using [parametrised tests|https://github.com/Pragmatists/junitparams] 
as suggested by [~lcwik].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (BEAM-362) Move shared runner functionality out of SDK and into runners/core-java

2016-12-20 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15764802#comment-15764802
 ] 

ASF GitHub Bot commented on BEAM-362:
-

GitHub user kennknowles opened a pull request:

https://github.com/apache/incubator-beam/pull/1666

[BEAM-362] Move ExecutionContext and related classes to runners-core

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

 - [x] Make sure the PR title is formatted like:
   `[BEAM-] Description of pull request`
 - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable
   Travis-CI on your fork and ensure the whole test matrix passes).
 - [x] Replace `` in the title with the actual Jira issue
   number, if there is one.
 - [x] If this contribution is large, please file an Apache
   [Individual Contributor License 
Agreement](https://www.apache.org/licenses/icla.txt).

---

R: @lukecwik 

This is built on top of #1665 and will require a new worker image.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kennknowles/incubator-beam ExecutionContext

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-beam/pull/1666.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1666


commit 03a85e82ca5f1dff5aae184907508d7c5309a404
Author: Kenneth Knowles <k...@google.com>
Date:   2016-12-16T04:13:25Z

Remove deprecated AggregatorFactory from SDK

commit 6d7a4b10ba74b1fc08d0ad6a759ca5e0ebffdbba
Author: Kenneth Knowles <k...@google.com>
Date:   2016-12-16T04:20:34Z

Move ExecutionContext and related classes to runners-core




> Move shared runner functionality out of SDK and into runners/core-java
> --
>
> Key: BEAM-362
> URL: https://issues.apache.org/jira/browse/BEAM-362
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-core
>Reporter: Kenneth Knowles
>Assignee: Kenneth Knowles
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (BEAM-980) Document how to configure the DAG created by Apex Runner

2016-12-20 Thread Thomas Weise (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15764783#comment-15764783
 ] 

Thomas Weise commented on BEAM-980:
---

The runner should probably have an option to specify a configuration file so 
that users can tune the execution attributes similar to what they can do when 
launching through the Apex CLI.

https://github.com/apache/incubator-beam/blob/release-0.4.0-incubating/runners/apex/src/main/java/org/apache/beam/runners/apex/ApexRunner.java#L158



> Document how to configure the DAG created by Apex Runner
> 
>
> Key: BEAM-980
> URL: https://issues.apache.org/jira/browse/BEAM-980
> Project: Beam
>  Issue Type: Task
>  Components: runner-apex
>Reporter: Thomas Weise
>Assignee: Sandeep Deshmukh
>
> The Beam pipeline is translated to an Apex DAG of operators that have names 
> that are derived from the transforms. In case of composite transforms those 
> look like path names. Apex lets the user configure things like memory, 
> vcores, parallelism through properties/attributes that reference the operator 
> names. The configuration approach needs to be documented and supplemented 
> with an example.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

1 2 3 4 5 6 7 8 9 10 >

1 - 100 of 5936 matches

Mail list logo