[jira] [Commented] (BEAM-469) NullableCoder optimized encoding via passthrough context
[ https://issues.apache.org/jira/browse/BEAM-469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15768525#comment-15768525 ] ASF GitHub Bot commented on BEAM-469: - GitHub user dhalperi opened a pull request: https://github.com/apache/incubator-beam/pull/1680 [BEAM-XXX] Make KVCoder more efficient by removing unnecessary nesting See [BEAM-469] for more information about why this is correct. You can merge this pull request into a Git repository by running: $ git pull https://github.com/dhalperi/incubator-beam efficient-nested-coders Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/1680.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1680 commit 621e8250c9535d773c4f4440a34ea0833912b51f Author: Dan Halperin <dhalp...@google.com> Date: 2016-12-21T23:37:49Z [BEAM-XXX] Make KVCoder more efficient by removing unnecessary nesting See [BEAM-469] for more information about why this is correct. > NullableCoder optimized encoding via passthrough context > > > Key: BEAM-469 > URL: https://issues.apache.org/jira/browse/BEAM-469 > Project: Beam > Issue Type: Improvement > Components: sdk-java-core >Reporter: Luke Cwik >Assignee: Thomas Groh >Priority: Trivial > Labels: backward-incompatible > Fix For: 0.3.0-incubating > > > NullableCoder should encode using the context given and not always use the > nested context. For coders which can efficiently encode in the outer context > such as StringUtf8Coder or ByteArrayCoder, we are forcing them to prefix > themselves with their length. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
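The nesting removal described in the pull request above can be illustrated with a simplified model of Beam's coder contexts. Everything below (the `Context` enum, `Coder` interface, and the coder classes) is a hypothetical sketch, not the real Beam API: in the nested context a value must be self-delimiting because more data follows it, while in the outer context the value owns the rest of the stream. The key of a KV is always nested (the value follows it), but the value can simply inherit the KvCoder's own context instead of being forced into an extra level of nesting:

```java
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical, simplified model of Beam's coder contexts -- not the real API.
// NESTED: more data follows, so the encoding must be self-delimiting.
// OUTER:  the value owns the rest of the stream; no length prefix is required.
enum Context { NESTED, OUTER }

interface Coder<T> {
  void encode(T value, DataOutputStream out, Context ctx) throws IOException;
}

// A byte[] coder that length-prefixes only when it must.
class BytesCoder implements Coder<byte[]> {
  public void encode(byte[] value, DataOutputStream out, Context ctx) throws IOException {
    if (ctx == Context.NESTED) {
      out.writeInt(value.length); // something follows us: delimit ourselves
    }
    out.write(value);
  }
}

class KV<K, V> {
  final K key;
  final V value;
  KV(K key, V value) { this.key = key; this.value = value; }
}

// The key is always NESTED (the value follows it); the value inherits the
// KvCoder's own context -- the unnecessary extra nesting is removed.
class KvCoder<K, V> implements Coder<KV<K, V>> {
  final Coder<K> keyCoder;
  final Coder<V> valueCoder;
  KvCoder(Coder<K> keyCoder, Coder<V> valueCoder) {
    this.keyCoder = keyCoder;
    this.valueCoder = valueCoder;
  }
  public void encode(KV<K, V> kv, DataOutputStream out, Context ctx) throws IOException {
    keyCoder.encode(kv.key, out, Context.NESTED);
    valueCoder.encode(kv.value, out, ctx); // passthrough, not forced to NESTED
  }
}
```

Under this model, encoding a KV of byte arrays {1,2} and {3,4,5} in the outer context costs 9 bytes (4-byte length prefix + key + raw value) instead of 12 bytes with both components nested.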
[jira] [Commented] (BEAM-1201) Remove producesSortedKeys from Source
[ https://issues.apache.org/jira/browse/BEAM-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15768483#comment-15768483 ] ASF GitHub Bot commented on BEAM-1201: -- GitHub user dhalperi opened a pull request: https://github.com/apache/incubator-beam/pull/1679 [BEAM-1201] Remove BoundedSource.producesSortedKeys R: @jkff You can merge this pull request into a Git repository by running: $ git pull https://github.com/dhalperi/incubator-beam remove-produces-sorted-keys Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/1679.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1679 commit ee15138543f8b9926466cf4e4dc6857b3173345e Author: Dan Halperin <dhalp...@google.com> Date: 2016-12-21T23:32:38Z [BEAM-1201] Remove BoundedSource.producesSortedKeys Unused and unclear; for more information see the linked JIRA. > Remove producesSortedKeys from Source > - > > Key: BEAM-1201 > URL: https://issues.apache.org/jira/browse/BEAM-1201 > Project: Beam > Issue Type: Improvement > Components: sdk-java-core >Reporter: Daniel Halperin >Assignee: Daniel Halperin >Priority: Minor > Labels: backward-incompatible > > This is a holdover from a precursor of the old Dataflow SDK that we just > failed to delete before releasing Dataflow 1.0, but we can delete before the > first stable release of Beam. > This function has never been used by any runner. It does not mean anything > obvious to implementors, as many sources produce {{T}}, not {{KV<K,V>}} -- > what does it mean in the former case? (And how do you get the latter case > correct?) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (BEAM-1201) Remove producesSortedKeys from Source
[ https://issues.apache.org/jira/browse/BEAM-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Halperin updated BEAM-1201: -- Labels: backward-incompatible (was: ) > Remove producesSortedKeys from Source > - > > Key: BEAM-1201 > URL: https://issues.apache.org/jira/browse/BEAM-1201 > Project: Beam > Issue Type: Improvement > Components: sdk-java-core >Reporter: Daniel Halperin >Assignee: Daniel Halperin >Priority: Minor > Labels: backward-incompatible > > This is a holdover from a precursor of the old Dataflow SDK that we just > failed to delete before releasing Dataflow 1.0, but we can delete before the > first stable release of Beam. > This function has never been used by any runner. It does not mean anything > obvious to implementors, as many sources produce {{T}}, not {{KV<K,V>}} -- > what does it mean in the former case? (And how do you get the latter case > correct?) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (BEAM-1201) Remove producesSortedKeys from BoundedSource
[ https://issues.apache.org/jira/browse/BEAM-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Halperin updated BEAM-1201: -- Summary: Remove producesSortedKeys from BoundedSource (was: Remove producesSortedKeys from Source) > Remove producesSortedKeys from BoundedSource > > > Key: BEAM-1201 > URL: https://issues.apache.org/jira/browse/BEAM-1201 > Project: Beam > Issue Type: Improvement > Components: sdk-java-core >Reporter: Daniel Halperin >Assignee: Daniel Halperin >Priority: Minor > Labels: backward-incompatible > > This is a holdover from a precursor of the old Dataflow SDK that we just > failed to delete before releasing Dataflow 1.0, but we can delete before the > first stable release of Beam. > This function has never been used by any runner. It does not mean anything > obvious to implementors, as many sources produce {{T}}, not {{KV<K,V>}} -- > what does it mean in the former case? (And how do you get the latter case > correct?) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-469) NullableCoder optimized encoding via passthrough context
[ https://issues.apache.org/jira/browse/BEAM-469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15768476#comment-15768476 ] Daniel Halperin commented on BEAM-469: -- Sorry I missed this JIRA comment, [~mariusz89016]! A bit late, but... Say a coder C is not used in the nested context. Then we actually have the guarantee that no one will write later elements into the same stream. So if {{NullableCoder}} is not in the nested context, then no one will write more elements after whatever the {{NullableCoder}} writes. If the NC writes {{0}} then that's it -- the element is null. But if the NC writes {{1}}, then we know that all remaining bytes in the encoded string belong to the inner coder. That is effectively saying that the inner coder also does not need to use the nested context, so it does not need to write its own length. In your example, the {{NullableCoder}} is used in a nested context. So the inner coder also needs to use the nested context, because there may be more encoded elements later. In either case: the context of the nullable coder can be the same as the context of the inner coder. This is why in the patch here, we simply pass the NC's context down into the inner coder. All we have removed is the _additional_ nesting that was used. https://patch-diff.githubusercontent.com/raw/apache/incubator-beam/pull/992.patch > NullableCoder optimized encoding via passthrough context > > > Key: BEAM-469 > URL: https://issues.apache.org/jira/browse/BEAM-469 > Project: Beam > Issue Type: Improvement > Components: sdk-java-core >Reporter: Luke Cwik >Assignee: Thomas Groh >Priority: Trivial > Labels: backward-incompatible > Fix For: 0.3.0-incubating > > > NullableCoder should encode using the context given and not always use the > nested context. For coders which can efficiently encode in the outer context > such as StringUtf8Coder or ByteArrayCoder, we are forcing them to prefix > themselves with their length. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
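The passthrough argument in the comment above can be sketched in code. As before, the `Context` enum and coder classes below are a hypothetical simplified model, not the real Beam API: after the 0/1 marker, every remaining byte in the stream belongs to the inner value, so the inner coder can reuse the NullableCoder's own context instead of always being forced into the nested one.

```java
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical, simplified model of coder contexts -- not the real Beam API.
enum Context { NESTED, OUTER }

interface Coder<T> {
  void encode(T value, DataOutputStream out, Context ctx) throws IOException;
}

class BytesCoder implements Coder<byte[]> {
  public void encode(byte[] value, DataOutputStream out, Context ctx) throws IOException {
    if (ctx == Context.NESTED) {
      out.writeInt(value.length); // must self-delimit: more data follows
    }
    out.write(value);
  }
}

// After the 0/1 marker, the rest of the stream belongs to the inner value,
// so the inner coder inherits this coder's context ("passthrough").
class NullableCoder<T> implements Coder<T> {
  final Coder<T> inner;
  NullableCoder(Coder<T> inner) { this.inner = inner; }
  public void encode(T value, DataOutputStream out, Context ctx) throws IOException {
    if (value == null) {
      out.writeByte(0); // the marker is the whole encoding
    } else {
      out.writeByte(1);
      inner.encode(value, out, ctx); // passthrough, not forced to NESTED
    }
  }
}
```

In the outer context a 3-byte value then encodes to 4 bytes (marker + raw bytes) rather than 8 (marker + 4-byte length + bytes).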
[jira] [Commented] (BEAM-646) Get runners out of the apply()
[ https://issues.apache.org/jira/browse/BEAM-646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15768470#comment-15768470 ] ASF GitHub Bot commented on BEAM-646: - Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/1582 > Get runners out of the apply() > -- > > Key: BEAM-646 > URL: https://issues.apache.org/jira/browse/BEAM-646 > Project: Beam > Issue Type: Improvement > Components: beam-model-runner-api, sdk-java-core >Reporter: Kenneth Knowles >Assignee: Thomas Groh > > Right now, the runner intercepts calls to apply() and replaces transforms as > we go. This means that there is no "original" user graph. For portability and > misc architectural benefits, we would like to build the original graph first, > and have the runner override later. > Some runners already work in this manner, but we could integrate it more > smoothly, with more validation, via some handy APIs on e.g. the Pipeline > object. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-1112) Python E2E Integration Test Framework - Batch Only
[ https://issues.apache.org/jira/browse/BEAM-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15768461#comment-15768461 ] ASF GitHub Bot commented on BEAM-1112: -- Github user markflyhigh closed the pull request at: https://github.com/apache/incubator-beam/pull/1639 > Python E2E Integration Test Framework - Batch Only > -- > > Key: BEAM-1112 > URL: https://issues.apache.org/jira/browse/BEAM-1112 > Project: Beam > Issue Type: Task > Components: sdk-py, testing >Reporter: Mark Liu >Assignee: Mark Liu > > Parity with Java. > Build e2e integration test framework that can configure and run batch > pipeline with specified test runner, wait for pipeline execution and verify > results with given verifiers in the end. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-1112) Python E2E Integration Test Framework - Batch Only
[ https://issues.apache.org/jira/browse/BEAM-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15768462#comment-15768462 ] ASF GitHub Bot commented on BEAM-1112: -- GitHub user markflyhigh reopened a pull request: https://github.com/apache/incubator-beam/pull/1639 [BEAM-1112] Python E2E Test Framework And Wordcount E2E Test Be sure to do all of the following to help us incorporate your contribution quickly and easily: - [x] Make sure the PR title is formatted like: `[BEAM-] Description of pull request` - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes). - [ ] Replace `` in the title with the actual Jira issue number, if there is one. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt). --- - E2e test framework in which a TestRunner starts and verifies the pipeline job. - Add `TestOptions`, which defines `on_success_matcher`, used to verify the state/output of the pipeline job. - Validate `on_success_matcher` before pipeline execution to make sure it can be unpickled to a subclass of BaseMatcher. - Create a `TestDataflowRunner` which provides the functionality of `DataflowRunner` plus result verification. - Provide a test verifier, `PipelineStateMatcher`, that can verify whether the pipeline job finished in the DONE state. - Add wordcount_it (it = integration test), which builds an e2e test based on the existing wordcount pipeline. - Include wordcount_it in the nose collector, so that wordcount_it can be collected and run by nose. - Skip ITs when running unit tests from tox in precommit and postcommit. The current changes will not change the behavior of existing pre/postcommit.
Test is done by running `tox -e py27 -c sdks/python/tox.ini` for unit test and running wordcount_it with `TestDataflowRunner` on service ([link](https://pantheon.corp.google.com/dataflow/job/2016-12-15_17_36_16-3857167705491723621?project=google.com:clouddfe)). TODO: - Output data verifier that verify pipeline output that stores in filesystem. - Add wordcount_it to precommit and replace existing wordcount execution command in postcommit with a better structured nose command. You can merge this pull request into a Git repository by running: $ git pull https://github.com/markflyhigh/incubator-beam e2e-testrunner Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/1639.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1639 > Python E2E Integration Test Framework - Batch Only > -- > > Key: BEAM-1112 > URL: https://issues.apache.org/jira/browse/BEAM-1112 > Project: Beam > Issue Type: Task > Components: sdk-py, testing >Reporter: Mark Liu >Assignee: Mark Liu > > Parity with Java. > Build e2e integration test framework that can configure and run batch > pipeline with specified test runner, wait for pipeline execution and verify > results with given verifiers in the end. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (BEAM-1202) Coders should have meaningful equals methods
Thomas Groh created BEAM-1202: - Summary: Coders should have meaningful equals methods Key: BEAM-1202 URL: https://issues.apache.org/jira/browse/BEAM-1202 Project: Beam Issue Type: Bug Components: sdk-java-core Reporter: Thomas Groh Assignee: Davor Bonaci {{StandardCoder}} implements an equality check based on equality of the component coders and of the coder classes. Any coder that is configured, or that does not extend {{StandardCoder}}, should have meaningful implementations of {{equals}} and {{hashCode}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
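A minimal sketch of what the ticket asks for. `FixedLengthCoder` is a hypothetical configured coder, not an existing Beam class: the point is that equality must reflect the configuration, not merely the class, because two instances are interchangeable only when they encode identically.

```java
import java.util.Objects;

// Hypothetical configured coder: two instances are interchangeable only if
// they encode identically, i.e. only if their configuration matches.
class FixedLengthCoder {
  private final int length;

  FixedLengthCoder(int length) { this.length = length; }

  @Override
  public boolean equals(Object other) {
    if (this == other) return true;
    if (!(other instanceof FixedLengthCoder)) return false;
    return length == ((FixedLengthCoder) other).length;
  }

  @Override
  public int hashCode() {
    // Include the class so coders of different types hash differently.
    return Objects.hash(FixedLengthCoder.class, length);
  }
}
```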
[jira] [Updated] (BEAM-1202) Coders should have meaningful equals methods
[ https://issues.apache.org/jira/browse/BEAM-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Groh updated BEAM-1202: -- Assignee: (was: Davor Bonaci) > Coders should have meaningful equals methods > > > Key: BEAM-1202 > URL: https://issues.apache.org/jira/browse/BEAM-1202 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Reporter: Thomas Groh > > {{StandardCoder}} implements an equality check based on equality of the > component coders and of the coder classes. Any coder that is configured, or that does not extend > {{StandardCoder}}, should have meaningful implementations of {{equals}} and > {{hashCode}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (BEAM-1201) Remove producesSortedKeys from Source
Daniel Halperin created BEAM-1201: - Summary: Remove producesSortedKeys from Source Key: BEAM-1201 URL: https://issues.apache.org/jira/browse/BEAM-1201 Project: Beam Issue Type: Improvement Components: sdk-java-core Reporter: Daniel Halperin Assignee: Daniel Halperin Priority: Minor This is a holdover from a precursor of the old Dataflow SDK that we just failed to delete before releasing Dataflow 1.0, but we can delete before the first stable release of Beam. This function has never been used by any runner. It does not mean anything obvious to implementors, as many sources produce {{T}}, not {{KV<K,V>}} -- what does it mean in the former case? (And how do you get the latter case correct?) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-1198) ViewFn: explicitly decouple runner materialization of side inputs from SDK-specific mapping
[ https://issues.apache.org/jira/browse/BEAM-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15768387#comment-15768387 ] ASF GitHub Bot commented on BEAM-1198: -- GitHub user kennknowles opened a pull request: https://github.com/apache/incubator-beam/pull/1678 [BEAM-1198, BEAM-846, BEAM-260] Refactor Dataflow translator to decouple input and output graphs more easily Be sure to do all of the following to help us incorporate your contribution quickly and easily: - [x] Make sure the PR title is formatted like: `[BEAM-] Description of pull request` - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes). - [x] Replace `` in the title with the actual Jira issue number, if there is one. - [x] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt). --- This is preparatory work to make it possible for the translator to have a more loosely coupled relationship between its input and output. You can merge this pull request into a Git repository by running: $ git pull https://github.com/kennknowles/incubator-beam Dataflow-Translator Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/1678.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1678 commit 8ed4bb68660c537e4a12c1077ecfa104f9a82eaa Author: Kenneth Knowles <k...@google.com> Date: 2016-12-21T22:21:50Z Inline needless interface DataflowTranslator.TranslationContext The only implementation was DataflowTranslator.Translator. This class needs some updating and the extra layer of the interface simply obscures that work. 
commit 272d06d7507ad7162616dd1b613efa7c8f5f4069 Author: Kenneth Knowles <k...@google.com> Date: 2016-12-21T22:34:27Z Explicitly pass Step to mutate in Dataflow translator Previously, there was always a "current" step that was the most recent step created. This makes it cumbersome or impossible to do things like translate one primitive transform into a small subgraph of steps. Thus we added hacks like CreatePCollectionView which are not actually part of the model at all - in fact, we should be able to add the needed CollectionToSingleton steps simply by looking at the side inputs of a ParDo node. > ViewFn: explicitly decouple runner materialization of side inputs from > SDK-specific mapping > --- > > Key: BEAM-1198 > URL: https://issues.apache.org/jira/browse/BEAM-1198 > Project: Beam > Issue Type: New Feature > Components: beam-model-fn-api, beam-model-runner-api >Reporter: Kenneth Knowles >Assignee: Kenneth Knowles > > For side inputs, the field {{PCollectionView.fromIterableInternal}} implies > an "iterable" materialization of the contents of a PCollection, which is > adapted to the desired user-facing type within a UDF (today the > PCollectionView "is" the UDF) > In practice, runners get adequate performance by special casing just a couple > of types of PCollectionView in a rather cumbersome manner. > The design to improve this is https://s.apache.org/beam-side-inputs-1-pager > and this makes the de facto standard approach the actual model. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (BEAM-1198) ViewFn: explicitly decouple runner materialization of side inputs from SDK-specific mapping
[ https://issues.apache.org/jira/browse/BEAM-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kenneth Knowles reassigned BEAM-1198: - Assignee: Kenneth Knowles > ViewFn: explicitly decouple runner materialization of side inputs from > SDK-specific mapping > --- > > Key: BEAM-1198 > URL: https://issues.apache.org/jira/browse/BEAM-1198 > Project: Beam > Issue Type: New Feature > Components: beam-model-fn-api, beam-model-runner-api >Reporter: Kenneth Knowles >Assignee: Kenneth Knowles > > For side inputs, the field {{PCollectionView.fromIterableInternal}} implies > an "iterable" materialization of the contents of a PCollection, which is > adapted to the desired user-facing type within a UDF (today the > PCollectionView "is" the UDF) > In practice, runners get adequate performance by special casing just a couple > of types of PCollectionView in a rather cumbersome manner. > The design to improve this is https://s.apache.org/beam-side-inputs-1-pager > and this makes the de facto standard approach the actual model. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (BEAM-1200) PubsubIO should allow for a user to supply the function which computes the watermark that is reported
Luke Cwik created BEAM-1200: --- Summary: PubsubIO should allow for a user to supply the function which computes the watermark that is reported Key: BEAM-1200 URL: https://issues.apache.org/jira/browse/BEAM-1200 Project: Beam Issue Type: Improvement Components: sdk-java-gcp Reporter: Luke Cwik Assignee: Daniel Halperin Priority: Minor A user wanted to build a watermark function that tracks the data's watermark but never falls behind the current time by more than Y minutes. PubsubIO does not support specifying the function which computes and reports the watermark. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
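The watermark policy the user describes is easy to state in code. The helper below is a hypothetical sketch (there is no such PubsubIO API; that is what the ticket requests): follow the data watermark, but never report a watermark more than `maxLag` behind the current time.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical user-supplied watermark function: report whichever is later,
// the observed data watermark or (now - maxLag).
class BoundedLagWatermarkFn {
  private final Duration maxLag;

  BoundedLagWatermarkFn(Duration maxLag) { this.maxLag = maxLag; }

  Instant watermark(Instant dataWatermark, Instant now) {
    Instant floor = now.minus(maxLag);
    return dataWatermark.isAfter(floor) ? dataWatermark : floor;
  }
}
```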
[jira] [Commented] (BEAM-1117) Support for new Timer API in Direct runner
[ https://issues.apache.org/jira/browse/BEAM-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15768294#comment-15768294 ] ASF GitHub Bot commented on BEAM-1117: -- Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/1669 > Support for new Timer API in Direct runner > -- > > Key: BEAM-1117 > URL: https://issues.apache.org/jira/browse/BEAM-1117 > Project: Beam > Issue Type: New Feature > Components: runner-direct >Reporter: Kenneth Knowles >Assignee: Kenneth Knowles > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-1194) DataflowRunner should test a variety of valid tempLocation/stagingLocation/etc formats.
[ https://issues.apache.org/jira/browse/BEAM-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15768284#comment-15768284 ] ASF GitHub Bot commented on BEAM-1194: -- Github user dhalperi closed the pull request at: https://github.com/apache/incubator-beam/pull/1671 > DataflowRunner should test a variety of valid > tempLocation/stagingLocation/etc formats. > --- > > Key: BEAM-1194 > URL: https://issues.apache.org/jira/browse/BEAM-1194 > Project: Beam > Issue Type: Test > Components: runner-dataflow >Reporter: Daniel Halperin >Assignee: Daniel Halperin >Priority: Minor > > Cloud Dataflow has a minor history of small bugs related to various code > paths expecting there to be or not be a trailing forward-slash in these > location fields. The way that Beam's integration tests are set up, we are > likely to only have one of these two cases tested (there is a single set of > integration test pipeline options). > We should add a dedicated DataflowRunner integration test to handle this case. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (BEAM-1199) Condense recordAsOutput, finishSpecifyingOutput from POutput
Thomas Groh created BEAM-1199: - Summary: Condense recordAsOutput, finishSpecifyingOutput from POutput Key: BEAM-1199 URL: https://issues.apache.org/jira/browse/BEAM-1199 Project: Beam Issue Type: Improvement Components: sdk-java-core Reporter: Thomas Groh Assignee: Davor Bonaci Priority: Minor {{recordAsOutput}} and {{finishSpecifyingOutput}} are both methods which are called after an output has been attached to a PTransform application. They can be combined to only have one method that does any after-production work (such as the initial run of Coder inference) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (BEAM-846) Decouple side input window mapping from WindowFn
[ https://issues.apache.org/jira/browse/BEAM-846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15613384#comment-15613384 ] Kenneth Knowles edited comment on BEAM-846 at 12/21/16 9:19 PM: Design 1-pager is https://s.apache.org/beam-windowmappingfn-1-pager was (Author: kenn): Design 1-pager is https://s.apache.org/beam-side-inputs-1-pager and a couple PRs have been authored ([#520|https://github.com/apache/incubator-beam/pull/520] and [#1076|https://github.com/apache/incubator-beam/pull/1076]) attributed to BEAM-115 (the "all of the Runner API" ticket) > Decouple side input window mapping from WindowFn > > > Key: BEAM-846 > URL: https://issues.apache.org/jira/browse/BEAM-846 > Project: Beam > Issue Type: New Feature > Components: beam-model-runner-api, sdk-java-core >Reporter: Robert Bradshaw >Assignee: Kenneth Knowles > Labels: backward-incompatible > > Currently the main WindowFn provides a getSideInputWindow method. Instead, > this mapping should be specified per-side-input (though the default mapping > would remain the same). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (BEAM-210) Allow control of empty ON_TIME panes analogous to final panes
[ https://issues.apache.org/jira/browse/BEAM-210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kenneth Knowles updated BEAM-210: - Component/s: (was: beam-model) > Allow control of empty ON_TIME panes analogous to final panes > - > > Key: BEAM-210 > URL: https://issues.apache.org/jira/browse/BEAM-210 > Project: Beam > Issue Type: Bug > Components: beam-model-runner-api, sdk-java-core >Reporter: Mark Shields >Assignee: Thomas Groh > > Today, ON_TIME panes are emitted whether or not they are empty. We had > decided that for final panes the user would want to control this behavior, to > control data volume. But for ON_TIME panes no such control exists. The > rationale is perhaps that the ON_TIME pane is a fundamental result that > should not be elided. To be considered: whether this is what we want. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (BEAM-260) WindowMappingFn: Know the getSideInputWindow upper bound to release side input resources
[ https://issues.apache.org/jira/browse/BEAM-260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kenneth Knowles updated BEAM-260: - Component/s: (was: beam-model) beam-model-fn-api > WindowMappingFn: Know the getSideInputWindow upper bound to release side > input resources > > > Key: BEAM-260 > URL: https://issues.apache.org/jira/browse/BEAM-260 > Project: Beam > Issue Type: Bug > Components: beam-model-fn-api, beam-model-runner-api >Reporter: Mark Shields >Assignee: Kenneth Knowles > > We currently have no static knowledge about the getSideInputWindow function, > and runners are thus forced to hold on to all side input state / elements in > case a future element reaches back into an earlier side input element. > Maybe we need an upper bound on lag from current to result of > getSideInputWindow so we can have a progressing gc horizon as we do for GBK > window state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
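The "upper bound on lag" idea in the ticket above can be sketched as follows. The class and its API are hypothetical, purely illustrative: if the side-input window mapping promises never to reach further back than some bound, a runner can garbage-collect side-input state that falls entirely before the main-input watermark minus that bound.

```java
// Hypothetical sketch: a window mapping that declares a maximum lookback,
// giving the runner a progressing gc horizon for side-input state.
class BoundedLookbackMapping {
  private final long maxLookbackMillis;

  BoundedLookbackMapping(long maxLookbackMillis) {
    this.maxLookbackMillis = maxLookbackMillis;
  }

  // Every side-input window that ends before this horizon can be released:
  // no future main-input element can reach back into it.
  long gcHorizonMillis(long mainInputWatermarkMillis) {
    return mainInputWatermarkMillis - maxLookbackMillis;
  }
}
```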
[jira] [Updated] (BEAM-846) Decouple side input window mapping from WindowFn
[ https://issues.apache.org/jira/browse/BEAM-846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kenneth Knowles updated BEAM-846: - Component/s: (was: beam-model) > Decouple side input window mapping from WindowFn > > > Key: BEAM-846 > URL: https://issues.apache.org/jira/browse/BEAM-846 > Project: Beam > Issue Type: New Feature > Components: beam-model-runner-api, sdk-java-core >Reporter: Robert Bradshaw >Assignee: Kenneth Knowles > Labels: backward-incompatible > > Currently the main WindowFn provides a getSideInputWindow method. Instead, > this mapping should be specified per-side-input (though the default mapping > would remain the same). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (BEAM-1193) Give Coders URNs and document their binary formats outside the Java code base
[ https://issues.apache.org/jira/browse/BEAM-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kenneth Knowles updated BEAM-1193: -- Component/s: (was: beam-model) > Give Coders URNs and document their binary formats outside the Java code base > - > > Key: BEAM-1193 > URL: https://issues.apache.org/jira/browse/BEAM-1193 > Project: Beam > Issue Type: New Feature > Components: beam-model-runner-api >Reporter: Kenneth Knowles >Assignee: Frances Perry > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (BEAM-653) Refine specification for WindowFn.isCompatible()
[ https://issues.apache.org/jira/browse/BEAM-653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kenneth Knowles updated BEAM-653: - Component/s: beam-model-runner-api > Refine specification for WindowFn.isCompatible() > - > > Key: BEAM-653 > URL: https://issues.apache.org/jira/browse/BEAM-653 > Project: Beam > Issue Type: New Feature > Components: beam-model, beam-model-runner-api >Reporter: Kenneth Knowles > > {{WindowFn#isCompatible}} doesn't really have a spec. In practice, it is used > primarily when flattening together multiple PCollections. All of the > WindowFns must be compatible, and then just a single WindowFn is selected > arbitrarily for the output PCollection. > In consequence, downstream of the Flatten, the merging behavior will be taken > from this WindowFn. > Currently, there are some mismatches: > - Sessions with different gap durations _are_ compatible today, but probably > shouldn't be since merging makes little sense. (The use of tiny proto-windows > is an implementation detail anyhow) > - SlidingWindows and FixedWindows _could_ reasonably be compatible if they > had the same duration, though it might be odd. > Either way, we should just nail down what we actually mean so we can arrive > at a verdict in these cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
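One way to nail the spec down, per the discussion in the ticket above (a hypothetical, simplified sketch, not Beam's actual implementation): window fns are compatible only when they assign every timestamp to the same windows, so fixed windows must agree on both size and offset.

```java
// Hypothetical, simplified FixedWindows: compatibility means the two fns
// would produce identical windows for every timestamp -- same size AND
// same offset -- so a Flatten may arbitrarily keep either one.
class FixedWindows {
  final long sizeMillis;
  final long offsetMillis;

  FixedWindows(long sizeMillis, long offsetMillis) {
    this.sizeMillis = sizeMillis;
    this.offsetMillis = offsetMillis;
  }

  boolean isCompatible(FixedWindows other) {
    return sizeMillis == other.sizeMillis && offsetMillis == other.offsetMillis;
  }
}
```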
[jira] [Updated] (BEAM-1149) Side input access fails in direct runner (possibly others too) when input element in multiple windows
[ https://issues.apache.org/jira/browse/BEAM-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kenneth Knowles updated BEAM-1149: -- Component/s: runner-core > Side input access fails in direct runner (possibly others too) when input > element in multiple windows > - > > Key: BEAM-1149 > URL: https://issues.apache.org/jira/browse/BEAM-1149 > Project: Beam > Issue Type: Bug > Components: runner-core >Reporter: Eugene Kirpichov >Assignee: Kenneth Knowles >Priority: Blocker > Fix For: 0.4.0-incubating > > > {code:java} > private static class FnWithSideInputs extends DoFn<String, String> { > private final PCollectionView<Integer> view; > private FnWithSideInputs(PCollectionView<Integer> view) { > this.view = view; > } > @ProcessElement > public void processElement(ProcessContext c) { > c.output(c.element() + ":" + c.sideInput(view)); > } > } > @Test > public void testSideInputsWithMultipleWindows() { > Pipeline p = TestPipeline.create(); > MutableDateTime mutableNow = Instant.now().toMutableDateTime(); > mutableNow.setMillisOfSecond(0); > Instant now = mutableNow.toInstant(); > SlidingWindows windowFn = > > SlidingWindows.of(Duration.standardSeconds(5)).every(Duration.standardSeconds(1)); > PCollectionView<Integer> view = > p.apply(Create.of(1)).apply(View.asSingleton()); > PCollection<String> res = > p.apply(Create.timestamped(TimestampedValue.of("a", now))) > .apply(Window.into(windowFn)) > .apply(ParDo.of(new FnWithSideInputs(view)).withSideInputs(view)); > PAssert.that(res).containsInAnyOrder("a:1"); > p.run(); > } > {code} > This fails with the following exception: > {code} > org.apache.beam.sdk.Pipeline$PipelineExecutionException: > java.lang.IllegalStateException: sideInput called when main input element is > in multiple windows > at > org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:343) > at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:1) > at org.apache.beam.sdk.Pipeline.run(Pipeline.java:176) > at 
org.apache.beam.sdk.testing.TestPipeline.run(TestPipeline.java:112) > at > Caused by: java.lang.IllegalStateException: sideInput called when main input > element is in multiple windows > at > org.apache.beam.runners.core.SimpleDoFnRunner$DoFnProcessContext.sideInput(SimpleDoFnRunner.java:514) > at > org.apache.beam.sdk.transforms.ParDoTest$FnWithSideInputs.processElement(ParDoTest.java:738) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-25) Add user-ready API for interacting with state
[ https://issues.apache.org/jira/browse/BEAM-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15768175#comment-15768175 ] ASF GitHub Bot commented on BEAM-25: Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/1670 > Add user-ready API for interacting with state > - > > Key: BEAM-25 > URL: https://issues.apache.org/jira/browse/BEAM-25 > Project: Beam > Issue Type: New Feature > Components: sdk-java-core >Reporter: Kenneth Knowles >Assignee: Kenneth Knowles > Labels: State > > Our current state API is targeted at runner implementers, not pipeline > authors. As such it has many capabilities that are not necessary nor > desirable for simple use cases of stateful ParDo (such as dynamic state tag > creation). Implement a simple state intended for user access. > (Details of our current thoughts in forthcoming design doc) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (BEAM-1198) ViewFn: explicitly decouple runner materialization of side inputs from SDK-specific mapping
Kenneth Knowles created BEAM-1198: - Summary: ViewFn: explicitly decouple runner materialization of side inputs from SDK-specific mapping Key: BEAM-1198 URL: https://issues.apache.org/jira/browse/BEAM-1198 Project: Beam Issue Type: New Feature Components: beam-model-fn-api, beam-model-runner-api Reporter: Kenneth Knowles For side inputs, the field {{PCollectionView.fromIterableInternal}} implies an "iterable" materialization of the contents of a PCollection, which is adapted to the desired user-facing type within a UDF (today the PCollectionView "is" the UDF) In practice, runners get adequate performance by special casing just a couple of types of PCollectionView in a rather cumbersome manner. The design to improve this is https://s.apache.org/beam-side-inputs-1-pager and this makes the de facto standard approach the actual model. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
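The proposed split can be sketched in plain Java: the runner produces a raw "iterable" materialization, and an SDK-owned mapping function adapts it to the user-facing type. The interface and method names below are illustrative only, not Beam's actual API.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of the decoupling BEAM-1198 describes: the runner
// hands over an iterable materialization, and a ViewFn (owned by the SDK)
// adapts it to the view type the user asked for.
public class ViewFnSketch {
    interface ViewFn<PrimitiveT, ViewT> {
        ViewT apply(PrimitiveT contents);
    }

    // Under this scheme, View.asSingleton() is just one such mapping.
    static <T> ViewFn<Iterable<T>, T> singletonView() {
        return contents -> {
            Iterator<T> it = contents.iterator();
            if (!it.hasNext()) throw new IllegalStateException("empty view");
            T value = it.next();
            if (it.hasNext()) throw new IllegalStateException("non-singleton view");
            return value;
        };
    }

    public static void main(String[] args) {
        ViewFn<Iterable<Integer>, Integer> fn = singletonView();
        System.out.println(fn.apply(List.of(42)));   // 42
    }
}
```

The point of the indirection is that runners only ever need to support the iterable materialization; new view types become new mappings, not new runner special cases.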
[jira] [Closed] (BEAM-1003) Enable caching of side-input dependent computations
[ https://issues.apache.org/jira/browse/BEAM-1003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kenneth Knowles closed BEAM-1003. - Resolution: Duplicate Fix Version/s: Not applicable Looks like an exact duplicate, a la form resubmission. > Enable caching of side-input dependent computations > --- > > Key: BEAM-1003 > URL: https://issues.apache.org/jira/browse/BEAM-1003 > Project: Beam > Issue Type: New Feature > Components: beam-model >Reporter: Robert Bradshaw >Assignee: Frances Perry > Fix For: Not applicable > > > Sometimes the kind of computations one wants to perform in startBundle depend > on side inputs (and, implicitly, the window). For example, one might want to > initialize a (non-serializable) stateful object. In particular, this leads to > users incorrectly (in the case of triggered or non-globally-windowed side > inputs) memoizing this computation in the first processElement call. > One option would be to fold this into a customizable ViewFn. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (BEAM-260) WindowMappingFn: Know the getSideInputWindow upper bound to release side input resources
[ https://issues.apache.org/jira/browse/BEAM-260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kenneth Knowles updated BEAM-260: - Summary: WindowMappingFn: Know the getSideInputWindow upper bound to release side input resources (was: Know the getSideInputWindow upper bound so can gc side input state) > WindowMappingFn: Know the getSideInputWindow upper bound to release side > input resources > > > Key: BEAM-260 > URL: https://issues.apache.org/jira/browse/BEAM-260 > Project: Beam > Issue Type: Bug > Components: beam-model, beam-model-runner-api >Reporter: Mark Shields >Assignee: Kenneth Knowles > > We currently have no static knowledge about the getSideInputWindow function, > and runners are thus forced to hold on to all side input state / elements in > case a future element reaches back into an earlier side input element. > Maybe we need an upper bound on lag from current to result of > getSideInputWindow so we can have a progressing gc horizon as we do for GBK > window state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
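One hedged way to express the proposed bound: the window mapping declares a maximum lookback lag, and the runner derives a moving garbage-collection horizon from the watermark, as it already does for GroupByKey state. The interface below is an illustration of the idea, not Beam's API.

```java
// Illustrative sketch (not Beam's WindowMappingFn): if the side-input
// window mapping declares a static maximum lag, side-input state for
// windows ending before (watermark - lag) can never be read again and
// may be released.
public class SideInputGcSketch {
    interface WindowMappingFn {
        long sideInputWindowEnd(long mainWindowEnd);
        long maximumLookback();   // the proposed static upper bound on lag
    }

    // Side-input windows ending before this horizon are unreachable.
    static long gcHorizon(long watermark, WindowMappingFn fn) {
        return watermark - fn.maximumLookback();
    }

    public static void main(String[] args) {
        WindowMappingFn sameWindow = new WindowMappingFn() {
            public long sideInputWindowEnd(long mainWindowEnd) { return mainWindowEnd; }
            public long maximumLookback() { return 0; }
        };
        System.out.println(gcHorizon(1000, sameWindow));   // 1000
    }
}
```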
[jira] [Commented] (BEAM-79) Gearpump runner
[ https://issues.apache.org/jira/browse/BEAM-79?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15768076#comment-15768076 ] ASF GitHub Bot commented on BEAM-79: Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/1663 > Gearpump runner > --- > > Key: BEAM-79 > URL: https://issues.apache.org/jira/browse/BEAM-79 > Project: Beam > Issue Type: New Feature > Components: runner-gearpump >Reporter: Tyler Akidau >Assignee: Manu Zhang > > Intel is submitting Gearpump (http://www.gearpump.io) to ASF > (https://wiki.apache.org/incubator/GearpumpProposal). Appears to be a mix of > low-level primitives a la MillWheel, with some higher level primitives like > non-merging windowing mixed in. Seems like it would make a nice Beam runner. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (BEAM-1197) Slowly-changing external data as a side input
Eugene Kirpichov created BEAM-1197: -- Summary: Slowly-changing external data as a side input Key: BEAM-1197 URL: https://issues.apache.org/jira/browse/BEAM-1197 Project: Beam Issue Type: Wish Components: beam-model Reporter: Eugene Kirpichov Assignee: Frances Perry I've repeatedly seen the following pattern: a user wants to join a PCollection against a slowly-changing external dataset: e.g. a file on GCS, or a Bigtable, etc. Side inputs come to mind, but current side input mechanisms don't allow for something like periodically reloading the side input. The best hacky solution I came up with for one use case is documented here: http://stackoverflow.com/questions/41254028/can-dataflow-sideinput-be-updated-per-window-by-reading-a-gcs-bucket/41271159#41271159 ; we need to do better than this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
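The workaround pattern in the linked answer boils down to memoizing the external dataset with a refresh interval. A plain-Java sketch of that TTL cache (illustrative only; a real pipeline also has to worry about consistency of reloads across workers and windows):

```java
import java.util.function.Supplier;

// Sketch of the "slowly-changing side input" workaround: reload an
// external dataset at most once every ttlMillis. The loader stands in
// for re-reading the GCS file / Bigtable; illustrative, not Beam API.
public class TtlCachedLoader<T> {
    private final Supplier<T> loader;
    private final long ttlMillis;
    private T cached;
    private long loadedAt = Long.MIN_VALUE;

    public TtlCachedLoader(Supplier<T> loader, long ttlMillis) {
        this.loader = loader;
        this.ttlMillis = ttlMillis;
    }

    // nowMillis is passed in so the refresh policy is testable.
    public synchronized T get(long nowMillis) {
        if (cached == null || nowMillis - loadedAt >= ttlMillis) {
            cached = loader.get();
            loadedAt = nowMillis;
        }
        return cached;
    }

    public static void main(String[] args) {
        int[] loads = {0};
        TtlCachedLoader<Integer> cache = new TtlCachedLoader<>(() -> ++loads[0], 60_000);
        System.out.println(cache.get(0));       // loads once -> 1
        System.out.println(cache.get(1_000));   // still cached -> 1
    }
}
```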
[jira] [Updated] (BEAM-430) Introducing gcpTempLocation that default to tempLocation
[ https://issues.apache.org/jira/browse/BEAM-430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke Cwik updated BEAM-430: --- Labels: backward-incompatible (was: ) > Introducing gcpTempLocation that default to tempLocation > > > Key: BEAM-430 > URL: https://issues.apache.org/jira/browse/BEAM-430 > Project: Beam > Issue Type: Improvement >Reporter: Pei He >Assignee: Pei He >Priority: Minor > Labels: backward-incompatible > Fix For: 0.2.0-incubating > > > Currently, DataflowPipelineOptions.stagingLocation default to tempLocation. > And, it requires tempLocation to be a gcs path. > Another case is BigQueryIO uses tempLocation and also requires it to be on > gcs. > So, users cannot set tempLocation to a non-gcs path with DataflowRunner or > BigQueryIO. > However, tempLocation could be on any file system. For example, WordCount > defaults to output to tempLocation. > The proposal is to add gcpTempLocation. And, it defaults to tempLocation if > tempLocation is a gcs path. > StagingLocation and BigQueryIO will use gcpTempLocation by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (BEAM-430) Introducing gcpTempLocation that default to tempLocation
[ https://issues.apache.org/jira/browse/BEAM-430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke Cwik closed BEAM-430. -- > Introducing gcpTempLocation that default to tempLocation > > > Key: BEAM-430 > URL: https://issues.apache.org/jira/browse/BEAM-430 > Project: Beam > Issue Type: Improvement >Reporter: Pei He >Assignee: Pei He >Priority: Minor > Labels: backward-incompatible > Fix For: 0.2.0-incubating > > > Currently, DataflowPipelineOptions.stagingLocation default to tempLocation. > And, it requires tempLocation to be a gcs path. > Another case is BigQueryIO uses tempLocation and also requires it to be on > gcs. > So, users cannot set tempLocation to a non-gcs path with DataflowRunner or > BigQueryIO. > However, tempLocation could be on any file system. For example, WordCount > defaults to output to tempLocation. > The proposal is to add gcpTempLocation. And, it defaults to tempLocation if > tempLocation is a gcs path. > StagingLocation and BigQueryIO will use gcpTempLocation by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (BEAM-430) Introducing gcpTempLocation that default to tempLocation
[ https://issues.apache.org/jira/browse/BEAM-430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke Cwik resolved BEAM-430. Resolution: Fixed > Introducing gcpTempLocation that default to tempLocation > > > Key: BEAM-430 > URL: https://issues.apache.org/jira/browse/BEAM-430 > Project: Beam > Issue Type: Improvement >Reporter: Pei He >Assignee: Pei He >Priority: Minor > Labels: backward-incompatible > Fix For: 0.2.0-incubating > > > Currently, DataflowPipelineOptions.stagingLocation default to tempLocation. > And, it requires tempLocation to be a gcs path. > Another case is BigQueryIO uses tempLocation and also requires it to be on > gcs. > So, users cannot set tempLocation to a non-gcs path with DataflowRunner or > BigQueryIO. > However, tempLocation could be on any file system. For example, WordCount > defaults to output to tempLocation. > The proposal is to add gcpTempLocation. And, it defaults to tempLocation if > tempLocation is a gcs path. > StagingLocation and BigQueryIO will use gcpTempLocation by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Reopened] (BEAM-430) Introducing gcpTempLocation that default to tempLocation
[ https://issues.apache.org/jira/browse/BEAM-430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke Cwik reopened BEAM-430: > Introducing gcpTempLocation that default to tempLocation > > > Key: BEAM-430 > URL: https://issues.apache.org/jira/browse/BEAM-430 > Project: Beam > Issue Type: Improvement >Reporter: Pei He >Assignee: Pei He >Priority: Minor > Labels: backward-incompatible > Fix For: 0.2.0-incubating > > > Currently, DataflowPipelineOptions.stagingLocation default to tempLocation. > And, it requires tempLocation to be a gcs path. > Another case is BigQueryIO uses tempLocation and also requires it to be on > gcs. > So, users cannot set tempLocation to a non-gcs path with DataflowRunner or > BigQueryIO. > However, tempLocation could be on any file system. For example, WordCount > defaults to output to tempLocation. > The proposal is to add gcpTempLocation. And, it defaults to tempLocation if > tempLocation is a gcs path. > StagingLocation and BigQueryIO will use gcpTempLocation by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-115) Beam Runner API
[ https://issues.apache.org/jira/browse/BEAM-115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15767972#comment-15767972 ] ASF GitHub Bot commented on BEAM-115: - GitHub user kennknowles reopened a pull request: https://github.com/apache/incubator-beam/pull/662 [BEAM-115] WIP: JSON Schema definition of pipeline This is a json-schema sketch of the concrete schema from the [Pipeline Runner API proposal document](https://s.apache.org/beam-runner-api). Because our [serialization tech discussion](http://mail-archives.apache.org/mod_mbox/beam-dev/201606.mbox/%3CCAN_Ypr2ZPQG3OgPWu==kf-zztg06k0v5i0ay3dabchjyver...@mail.gmail.com%3E) seemed to favor JSON on the front end and Proto on the backend, I made this quick port. The original Avro IDL definition is also on [a branch with a test](https://github.com/kennknowles/incubator-beam/blob/pipeline-model/model/pipeline/src/main/avro/org/apache/beam/model/pipeline/pipeline.avdl). Notes & Caveats: - I did not try to flesh out any more details; this was a straight port. There's plenty to add, but a PR seems like a place that will attract a desired kind of concrete discussion even in the current state. - Typing this makes my hands hurt. Luckily, it should change exceedingly rarely. There are many libraries that can generate json-schema in various ways, including Jackson itself, but I'm not so sure any of them are applicable. - Reading this makes my eyes hurt. This is a real problem. We need a readable spec, not just a test suite for validation. - I am not so sure that [the schema library](https://github.com/daveclayton/json-schema-validator) I've used to build my smoke test is a good long term choice. I chose it because it was Jackson-based. - I've left comments in the JSON even though that is frowned upon, and taken advantage of Jackson's feature to allow them. They can also go into `"description"` fields. - Perhaps we could write YAML and convert to json-schema with no loss of precision? 
Feel free to leave comments here about the schema or meta issues of e.g. where the schema should live and what libraries we might want to use. You can merge this pull request into a Git repository by running: $ git pull https://github.com/kennknowles/incubator-beam pipeline-json-schema Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/662.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #662 commit c5843ce10e782056c76157169eb5516bf18ed9e4 Author: Kenneth Knowles <k...@google.com> Date: 2016-06-10T15:51:02Z WIP: add JSON Schema definition of pipeline > Beam Runner API > --- > > Key: BEAM-115 > URL: https://issues.apache.org/jira/browse/BEAM-115 > Project: Beam > Issue Type: Improvement > Components: beam-model-runner-api >Reporter: Kenneth Knowles >Assignee: Kenneth Knowles > > The PipelineRunner API from the SDK is not ideal for the Beam technical > vision. > It has technical limitations: > - The user's DAG (even including library expansions) is never explicitly > represented, so it cannot be analyzed except incrementally, and cannot > necessarily be reconstructed (for example, to display it!). > - The flattened DAG of just primitive transforms isn't well-suited for > display or transform override. > - The TransformHierarchy isn't well-suited for optimizations. > - The user must realistically pre-commit to a runner, and its configuration > (batch vs streaming) prior to graph construction, since the runner will be > modifying the graph as it is built. > - It is fairly language- and SDK-specific. > It has usability issues (these are not from intuition, but derived from > actual cases of failure to use according to the design) > - The interleaving of apply() methods in PTransform/Pipeline/PipelineRunner > is confusing. > - The TransformHierarchy, accessible only via visitor traversals, is > cumbersome. 
> - The staging of construction-time vs run-time is not always obvious. > These are just examples. This ticket tracks designing, coming to consensus, > and building an API that more simply and directly supports the technical > vision. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-1190) FileBasedSource should ignore files that matched the glob but don't exist
[ https://issues.apache.org/jira/browse/BEAM-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15767937#comment-15767937 ] Daniel Halperin commented on BEAM-1190: --- I do not think this is generally safe -- it may mask underlying bugs. For example, we should never invoke this code unless the filesystem is known to be eventually list-consistent but consistent with stat. This change does not obviate the need for [BEAM-60] -- because users may want to go the other way, and expand the inconsistent list they get. I propose you package this logic up in whatever the new name for IOChannelUtils is as one of the things users can do in the code they run at expand-time. Bringing the user into the loop is also nice because it makes them deal with eventual consistency up front. We are burned a lot by users who don't realize what their globs really mean. > FileBasedSource should ignore files that matched the glob but don't exist > - > > Key: BEAM-1190 > URL: https://issues.apache.org/jira/browse/BEAM-1190 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Reporter: Eugene Kirpichov >Assignee: Eugene Kirpichov > > See user issue: > http://stackoverflow.com/questions/41251741/coping-with-eventual-consistency-of-gcs-bucket-listing > We should, after globbing the files in FileBasedSource, individually stat > every file and remove those that don't exist, to account for the possibility > that glob yielded non-existing files due to eventual consistency. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (BEAM-1196) Serialize/deserialize Pipeline/TransformHierarchy to JSON
Kenneth Knowles created BEAM-1196: - Summary: Serialize/deserialize Pipeline/TransformHierarchy to JSON Key: BEAM-1196 URL: https://issues.apache.org/jira/browse/BEAM-1196 Project: Beam Issue Type: New Feature Components: beam-model-runner-api, sdk-java-core Reporter: Kenneth Knowles There are two sketches of a concrete format for a Pipeline: 1. The Avro schema in the design doc at https://s.apache.org/beam-runner-api 2. The JSON schema at https://github.com/apache/incubator-beam/pull/662 These should be pushed all the way through and added to the Java SDK. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (BEAM-1195) Give triggers URNs / JSON formats
Kenneth Knowles created BEAM-1195: - Summary: Give triggers URNs / JSON formats Key: BEAM-1195 URL: https://issues.apache.org/jira/browse/BEAM-1195 Project: Beam Issue Type: New Feature Components: beam-model-runner-api Reporter: Kenneth Knowles Assignee: Kenneth Knowles We have recently gotten to the point where triggers are just syntax, but they are still shipped via Java serialization. To make it language-independent, we need a concrete syntax. Something like the following is fairly concise, tag adjacent to payload. I haven't bothered making up fully verbose/namespaced URNs here.
{code}
{
  "$urn": "OrFinally",
  "main": {
    "$urn": "EndOfWindow",
    "early": 
  },
  "finally": {
    "$urn": "AfterCount",
    "count": 45
  }
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
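The "tag adjacent to payload" shape can be sketched with plain Java maps: each trigger node carries its URN under "$urn" alongside its parameters. The helper and URN strings below are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the proposed tagged shape: each trigger node is a map with
// its URN under "$urn" plus its parameters. URNs here are illustrative.
public class TriggerUrnSketch {
    static Map<String, Object> node(String urn, Object... kv) {
        Map<String, Object> m = new LinkedHashMap<>();
        m.put("$urn", urn);
        for (int i = 0; i < kv.length; i += 2) {
            m.put((String) kv[i], kv[i + 1]);
        }
        return m;
    }

    public static void main(String[] args) {
        Map<String, Object> trigger = node("OrFinally",
                "main", node("EndOfWindow"),
                "finally", node("AfterCount", "count", 45));
        System.out.println(trigger.get("$urn"));   // OrFinally
    }
}
```

A consumer dispatches on the "$urn" value first, then reads the sibling keys as that trigger's payload; nesting falls out for free since values can themselves be nodes.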
[jira] [Updated] (BEAM-1194) DataflowRunner should test a variety of valid tempLocation/stagingLocation/etc formats.
[ https://issues.apache.org/jira/browse/BEAM-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Halperin updated BEAM-1194: -- Description: Cloud Dataflow has a minor history of small bugs related to various code paths expecting there to be or not be a trailing forward-slash in these location fields. The way that Beam's integration tests are set up, we are likely to only have one of these two cases tested (there is a single set of integration test pipeline options). We should add a dedicated DataflowRunner integration test to handle this case. was: Cloud Dataflow has a minor history of small bugs related to various code paths expecting there to be or not be a trailing forward-slash in these location fields. The way that Beam's integration tests are set up, we are likely to only have one of these two cases tested (there is a single set of integration test pipeline options). We should add a dedicated integration test to handle this case. > DataflowRunner should test a variety of valid > tempLocation/stagingLocation/etc formats. > --- > > Key: BEAM-1194 > URL: https://issues.apache.org/jira/browse/BEAM-1194 > Project: Beam > Issue Type: Test > Components: runner-dataflow >Reporter: Daniel Halperin >Assignee: Daniel Halperin >Priority: Minor > > Cloud Dataflow has a minor history of small bugs related to various code > paths expecting there to be or not be a trailing forward-slash in these > location fields. The way that Beam's integration tests are set up, we are > likely to only have one of these two cases tested (there is a single set of > integration test pipeline options). > We should add a dedicated DataflowRunner integration test to handle this case. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (BEAM-1194) DataflowRunner should test a variety of valid tempLocation/stagingLocation/etc formats.
Daniel Halperin created BEAM-1194: - Summary: DataflowRunner should test a variety of valid tempLocation/stagingLocation/etc formats. Key: BEAM-1194 URL: https://issues.apache.org/jira/browse/BEAM-1194 Project: Beam Issue Type: Test Components: runner-dataflow Reporter: Daniel Halperin Assignee: Daniel Halperin Priority: Minor Cloud Dataflow has a minor history of small bugs related to various code paths expecting there to be or not be a trailing forward-slash in these location fields. The way that Beam's integration tests are set up, we are likely to only have one of these two cases tested (there is a single set of integration test pipeline options). We should add a dedicated integration test to handle this case. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
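The class of bug the test should cover is simple to state: every code path must accept a location with or without a trailing slash. A hypothetical helper of the kind such paths flow through:

```java
// Illustrative helper (not Beam code) showing the invariant BEAM-1194
// wants tested: resolving a child against a location must give the same
// result whether or not the location has a trailing slash.
public class LocationPaths {
    static String resolve(String location, String child) {
        return location.endsWith("/") ? location + child : location + "/" + child;
    }

    public static void main(String[] args) {
        System.out.println(resolve("gs://bucket/tmp", "staging"));    // gs://bucket/tmp/staging
        System.out.println(resolve("gs://bucket/tmp/", "staging"));   // gs://bucket/tmp/staging
    }
}
```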
[jira] [Updated] (BEAM-1192) Give transforms URNs, use them instead of instanceof checks
[ https://issues.apache.org/jira/browse/BEAM-1192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kenneth Knowles updated BEAM-1192: -- Summary: Give transforms URNs, use them instead of instanceof checks (was: Add URNs to known transforms instead of using instanceof checks) > Give transforms URNs, use them instead of instanceof checks > --- > > Key: BEAM-1192 > URL: https://issues.apache.org/jira/browse/BEAM-1192 > Project: Beam > Issue Type: Improvement > Components: beam-model-runner-api >Reporter: Kenneth Knowles > > In the [Beam Runner API|https://s.apache.org/beam-runner-api], transforms of > interest to runners are to be identified by URN. > Currently, Java-based runners use `instanceof` checks to both override > transforms and to implement primitive transforms. This language and > SDK-specific behavior should be replaced by adding these URNs, and checking > them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (BEAM-1192) Give transforms URNs, use them instead of instanceof checks
[ https://issues.apache.org/jira/browse/BEAM-1192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kenneth Knowles updated BEAM-1192: -- Issue Type: New Feature (was: Improvement) > Give transforms URNs, use them instead of instanceof checks > --- > > Key: BEAM-1192 > URL: https://issues.apache.org/jira/browse/BEAM-1192 > Project: Beam > Issue Type: New Feature > Components: beam-model-runner-api >Reporter: Kenneth Knowles > > In the [Beam Runner API|https://s.apache.org/beam-runner-api], transforms of > interest to runners are to be identified by URN. > Currently, Java-based runners use `instanceof` checks to both override > transforms and to implement primitive transforms. This language and > SDK-specific behavior should be replaced by adding these URNs, and checking > them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (BEAM-1193) Give Coders URNs and document their binary formats outside the Java code base
Kenneth Knowles created BEAM-1193: - Summary: Give Coders URNs and document their binary formats outside the Java code base Key: BEAM-1193 URL: https://issues.apache.org/jira/browse/BEAM-1193 Project: Beam Issue Type: New Feature Components: beam-model, beam-model-runner-api Reporter: Kenneth Knowles Assignee: Frances Perry -- This message was sent by Atlassian JIRA (v6.3.4#6332)
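An example of the kind of spec this asks for: a coder's byte format written down independently of the Java implementation. Sketched here for a base-128 varint (the variable-length integer format that protobuf, and Beam's VarInt encoding, use): each byte holds 7 payload bits, least-significant group first, with the high bit marking that more bytes follow.

```java
import java.io.ByteArrayOutputStream;

// Reference sketch of base-128 varint encoding: low 7 bits per byte,
// least-significant group first, high bit = "more bytes follow".
public class VarIntFormat {
    static byte[] encode(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));   // continuation bit set
            value >>>= 7;
        }
        out.write((int) value);                          // final byte, high bit clear
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // 300 = 0b10_0101100 encodes as 0xAC 0x02:
        System.out.println(encode(300).length);   // 2
    }
}
```

Once the format is pinned down like this, any SDK can implement a bit-compatible coder and the URN names the contract rather than a Java class.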
[jira] [Commented] (BEAM-27) Add user-ready API for interacting with timers
[ https://issues.apache.org/jira/browse/BEAM-27?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15767897#comment-15767897 ] ASF GitHub Bot commented on BEAM-27: Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/1660 > Add user-ready API for interacting with timers > -- > > Key: BEAM-27 > URL: https://issues.apache.org/jira/browse/BEAM-27 > Project: Beam > Issue Type: New Feature > Components: sdk-java-core >Reporter: Kenneth Knowles >Assignee: Kenneth Knowles > > Pipeline authors will benefit from a different factorization of interaction > with underlying timers. The current APIs are targeted at runner implementers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-1190) FileBasedSource should ignore files that matched the glob but don't exist
[ https://issues.apache.org/jira/browse/BEAM-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15767871#comment-15767871 ] Eugene Kirpichov commented on BEAM-1190: My proposal is to add a mandatory stat at glob-expand time (and omit the file from glob expansion if it doesn't exist), but still throw an error if the file doesn't exist at read time. I think this is safe and should not require opt-in, since it doesn't seem to introduce new failure modes: both before and after the proposed solution we'll fail if a file doesn't exist at read time; but without it we may also erroneously fail if the file is included in glob expansion but actually doesn't exist at glob expansion time. When Filesystem APIs are able to tell whether the file system is strongly consistent, then we can eliminate the stat as an optimization. > FileBasedSource should ignore files that matched the glob but don't exist > - > > Key: BEAM-1190 > URL: https://issues.apache.org/jira/browse/BEAM-1190 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Reporter: Eugene Kirpichov >Assignee: Eugene Kirpichov > > See user issue: > http://stackoverflow.com/questions/41251741/coping-with-eventual-consistency-of-gcs-bucket-listing > We should, after globbing the files in FileBasedSource, individually stat > every file and remove those that don't exist, to account for the possibility > that glob yielded non-existing files due to eventual consistency. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
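The proposal above -- stat at glob-expand time, drop matches that no longer exist, but keep "missing at read time" a hard error -- can be sketched with the existence check injected as a predicate, so the idea stays independent of any particular filesystem API (names here are illustrative):

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Sketch of the proposed expand-time stat: filter glob matches down to
// files that actually exist, absorbing eventual list-consistency.
// Reads of a file that vanishes *after* expansion still fail loudly.
public class GlobStatSketch {
    static List<String> expand(List<String> globMatches, Predicate<String> exists) {
        return globMatches.stream().filter(exists).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> matched = List.of("gs://b/part-0", "gs://b/part-1");
        // Pretend part-1 was listed but had already been deleted:
        System.out.println(expand(matched, p -> p.endsWith("part-0")));
    }
}
```

On a filesystem known to be strongly consistent, `exists` can be the constant-true predicate and the stat disappears, which matches the "eliminate the stat as an optimization" note above.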
[jira] [Comment Edited] (BEAM-1190) FileBasedSource should ignore files that matched the glob but don't exist
[ https://issues.apache.org/jira/browse/BEAM-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15767818#comment-15767818 ] Daniel Halperin edited comment on BEAM-1190 at 12/21/16 6:46 PM: - Not for very long -- the stat at open-time is getting removed. We get the size information we need from the list call, but currently throw it away for silly reasons. How would you feel about the ability to execute code in the worker when the glob is expanded. I think checking which files actually exist then and deciding in one centralized place in time which files you want to read (and committing to that decision for later) is probably a simpler and safer solution. was (Author: dhalp...@google.com): Not for very long -- the stat at open-time is getting removed as we get the information we need from the list call, but throw it away like we shouldn't be. How would you feel about the ability to execute code in the worker when the glob is expanded. I think checking which files actually exist then and deciding in one centralized place in time which files you want to read (and committing to that decision for later) is probably a simpler and safer solution. > FileBasedSource should ignore files that matched the glob but don't exist > - > > Key: BEAM-1190 > URL: https://issues.apache.org/jira/browse/BEAM-1190 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Reporter: Eugene Kirpichov >Assignee: Eugene Kirpichov > > See user issue: > http://stackoverflow.com/questions/41251741/coping-with-eventual-consistency-of-gcs-bucket-listing > We should, after globbing the files in FileBasedSource, individually stat > every file and remove those that don't exist, to account for the possibility > that glob yielded non-existing files due to eventual consistency. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-1190) FileBasedSource should ignore files that matched the glob but don't exist
[ https://issues.apache.org/jira/browse/BEAM-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15767818#comment-15767818 ] Daniel Halperin commented on BEAM-1190: --- Not for very long -- the stat at open-time is getting removed as we get the information we need from the list call, but throw it away like we shouldn't be. How would you feel about the ability to execute code in the worker when the glob is expanded. I think checking which files actually exist then and deciding in one centralized place in time which files you want to read (and committing to that decision for later) is probably a simpler and safer solution. > FileBasedSource should ignore files that matched the glob but don't exist > - > > Key: BEAM-1190 > URL: https://issues.apache.org/jira/browse/BEAM-1190 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Reporter: Eugene Kirpichov >Assignee: Eugene Kirpichov > > See user issue: > http://stackoverflow.com/questions/41251741/coping-with-eventual-consistency-of-gcs-bucket-listing > We should, after globbing the files in FileBasedSource, individually stat > every file and remove those that don't exist, to account for the possibility > that glob yielded non-existing files due to eventual consistency. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-362) Move shared runner functionality out of SDK and into runners/core-java
[ https://issues.apache.org/jira/browse/BEAM-362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15767806#comment-15767806 ] ASF GitHub Bot commented on BEAM-362: - Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/1666 > Move shared runner functionality out of SDK and into runners/core-java > -- > > Key: BEAM-362 > URL: https://issues.apache.org/jira/browse/BEAM-362 > Project: Beam > Issue Type: Improvement > Components: runner-core >Reporter: Kenneth Knowles >Assignee: Kenneth Knowles > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (BEAM-1192) Add URNs to known transforms instead of using instanceof checks
Kenneth Knowles created BEAM-1192: - Summary: Add URNs to known transforms instead of using instanceof checks Key: BEAM-1192 URL: https://issues.apache.org/jira/browse/BEAM-1192 Project: Beam Issue Type: Improvement Components: beam-model-runner-api Reporter: Kenneth Knowles In the [Beam Runner API|https://s.apache.org/beam-runner-api], transforms of interest to runners are to be identified by URN. Currently, Java-based runners use `instanceof` checks to both override transforms and to implement primitive transforms. This language and SDK-specific behavior should be replaced by adding these URNs, and checking them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
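The proposed indirection can be sketched as a registry keyed by URN: a runner looks a transform's handler up by string rather than testing Java classes with `instanceof`. The URN and handler shape below are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch of URN-based dispatch replacing instanceof checks. Handlers are
// reduced to Function<String, String> to keep the example self-contained;
// the URN string shown is hypothetical.
public class UrnRegistrySketch {
    private final Map<String, Function<String, String>> handlers = new HashMap<>();

    void register(String urn, Function<String, String> handler) {
        handlers.put(urn, handler);
    }

    String evaluate(String urn, String input) {
        Function<String, String> h = handlers.get(urn);
        if (h == null) throw new IllegalArgumentException("unknown transform: " + urn);
        return h.apply(input);
    }

    public static void main(String[] args) {
        UrnRegistrySketch r = new UrnRegistrySketch();
        r.register("urn:beam:transform:pardo:v1", in -> "processed:" + in);
        System.out.println(r.evaluate("urn:beam:transform:pardo:v1", "x"));   // processed:x
    }
}
```

Because the key is a string, non-Java SDKs can claim the same URNs, which `instanceof` can never do.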
[jira] [Commented] (BEAM-27) Add user-ready API for interacting with timers
[ https://issues.apache.org/jira/browse/BEAM-27?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15767732#comment-15767732 ] ASF GitHub Bot commented on BEAM-27: Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/1673 > Add user-ready API for interacting with timers > -- > > Key: BEAM-27 > URL: https://issues.apache.org/jira/browse/BEAM-27 > Project: Beam > Issue Type: New Feature > Components: sdk-java-core >Reporter: Kenneth Knowles >Assignee: Kenneth Knowles > > Pipeline authors will benefit from a different factorization of interaction > with underlying timers. The current APIs are targeted at runner implementers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (BEAM-1178) Make naming of logger objects consistent
[ https://issues.apache.org/jira/browse/BEAM-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismaël Mejía resolved BEAM-1178. Resolution: Fixed > Make naming of logger objects consistent > > > Key: BEAM-1178 > URL: https://issues.apache.org/jira/browse/BEAM-1178 > Project: Beam > Issue Type: Improvement > Components: sdk-java-core, sdk-java-extensions >Affects Versions: 0.5.0-incubating >Reporter: Ismaël Mejía >Assignee: Ismaël Mejía >Priority: Trivial > Fix For: Not applicable > > > Logger objects are used in different instances in Beam; around 90% of the > current classes that use loggers use the convention name 'LOG', however there > are instances that use 'logger' and others that use 'LOGGER'. This issue is > to make the logger naming consistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
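The convention being standardized on, shown with java.util.logging so the snippet is dependency-free (Beam's own code obtains its loggers via SLF4J's LoggerFactory):

```java
import java.util.logging.Logger;

// The naming this issue converges on: a static final logger field named
// LOG (rather than 'logger' or 'LOGGER'). Class name is illustrative.
public class WordCountFn {
    private static final Logger LOG = Logger.getLogger(WordCountFn.class.getName());

    public static void main(String[] args) {
        LOG.info("logger is named: " + LOG.getName());
    }
}
```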
[jira] [Reopened] (BEAM-1178) Make naming of logger objects consistent
[ https://issues.apache.org/jira/browse/BEAM-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismaël Mejía reopened BEAM-1178: > Make naming of logger objects consistent > > > Key: BEAM-1178 > URL: https://issues.apache.org/jira/browse/BEAM-1178 > Project: Beam > Issue Type: Improvement > Components: sdk-java-core, sdk-java-extensions >Affects Versions: 0.5.0-incubating >Reporter: Ismaël Mejía >Assignee: Ismaël Mejía >Priority: Trivial > Fix For: Not applicable > > > Logger objects are used in many places in Beam; around 90% of the > current classes that use loggers follow the naming convention 'LOG'. However, there > are instances that use 'logger' and others that use 'LOGGER'. This issue is > to make the logger naming consistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
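For context, the majority convention referred to above declares the logger field as a static final named `LOG`. The sketch below illustrates it with `java.util.logging` from the JDK so it is self-contained (Beam itself uses SLF4J, but the naming question is identical); the class name is invented for the example.

```java
import java.util.logging.Logger;

// Minimal illustration of the naming convention at issue: the static
// logger field is named LOG, rather than 'logger' or 'LOGGER'.
public class WordCountFn {
    private static final Logger LOG = Logger.getLogger(WordCountFn.class.getName());

    public static String describeLogger() {
        return LOG.getName();
    }

    public static void main(String[] args) {
        LOG.info("logger field named LOG, matching the ~90% majority convention");
        System.out.println(describeLogger());
    }
}
```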
[jira] [Resolved] (BEAM-1165) Unexpected file created when checking dependencies on clean repo
[ https://issues.apache.org/jira/browse/BEAM-1165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismaël Mejía resolved BEAM-1165. Resolution: Fixed > Unexpected file created when checking dependencies on clean repo > > > Key: BEAM-1165 > URL: https://issues.apache.org/jira/browse/BEAM-1165 > Project: Beam > Issue Type: Bug > Components: runner-flink >Affects Versions: 0.5.0-incubating >Reporter: Ismaël Mejía >Assignee: Ismaël Mejía >Priority: Minor > Fix For: 0.5.0-incubating > > > I just found a weird behavior while checking the latest release. Nothing > breaking, but when I start with a clean repo clone and run: > mvn dependency:tree > it creates a new file runners/flink/examples/wordcounts.txt with the > dependencies. > This error happens because maven-dependency-plugin assumes the 'output' property > used by the flink tests is the export file for the command. > Ref. > https://maven.apache.org/plugins/maven-dependency-plugin/tree-mojo.html#output -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Reopened] (BEAM-1165) Unexpected file created when checking dependencies on clean repo
[ https://issues.apache.org/jira/browse/BEAM-1165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismaël Mejía reopened BEAM-1165: > Unexpected file created when checking dependencies on clean repo > > > Key: BEAM-1165 > URL: https://issues.apache.org/jira/browse/BEAM-1165 > Project: Beam > Issue Type: Bug > Components: runner-flink >Affects Versions: 0.5.0-incubating >Reporter: Ismaël Mejía >Assignee: Ismaël Mejía >Priority: Minor > Fix For: 0.5.0-incubating > > > I just found a weird behavior while checking the latest release. Nothing > breaking, but when I start with a clean repo clone and run: > mvn dependency:tree > it creates a new file runners/flink/examples/wordcounts.txt with the > dependencies. > This error happens because maven-dependency-plugin assumes the 'output' property > used by the flink tests is the export file for the command. > Ref. > https://maven.apache.org/plugins/maven-dependency-plugin/tree-mojo.html#output -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-27) Add user-ready API for interacting with timers
[ https://issues.apache.org/jira/browse/BEAM-27?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15767452#comment-15767452 ] ASF GitHub Bot commented on BEAM-27: Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/1668 > Add user-ready API for interacting with timers > -- > > Key: BEAM-27 > URL: https://issues.apache.org/jira/browse/BEAM-27 > Project: Beam > Issue Type: New Feature > Components: sdk-java-core >Reporter: Kenneth Knowles >Assignee: Kenneth Knowles > > Pipeline authors will benefit from a different factorization of interaction > with underlying timers. The current APIs are targeted at runner implementers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-1146) Decrease spark runner startup overhead
[ https://issues.apache.org/jira/browse/BEAM-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15767378#comment-15767378 ] ASF GitHub Bot commented on BEAM-1146: -- GitHub user aviemzur opened a pull request: https://github.com/apache/incubator-beam/pull/1674 [BEAM-1146] Decrease spark runner startup overhead Be sure to do all of the following to help us incorporate your contribution quickly and easily: - [ ] Make sure the PR title is formatted like: `[BEAM-] Description of pull request` - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes). - [ ] Replace `` in the title with the actual Jira issue number, if there is one. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt). --- Replace finding all `Source` and `Coder` implementations for serialization registration with wrapper classes. You can merge this pull request into a Git repository by running: $ git pull https://github.com/aviemzur/incubator-beam decrease-spark-runner-startup-overhead Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/1674.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1674 commit 8501cdc88ee9c89f643120e34381ec9bc2562965 Author: Aviem Zur <aviem...@gmail.com> Date: 2016-12-21T15:49:34Z [BEAM-1146] Decrease spark runner startup overhead Replace finding all `Source` and `Coder` implementations for serialization registration with wrapper classes. 
> Decrease spark runner startup overhead > -- > > Key: BEAM-1146 > URL: https://issues.apache.org/jira/browse/BEAM-1146 > Project: Beam > Issue Type: Improvement > Components: runner-spark >Reporter: Aviem Zur >Assignee: Aviem Zur > > BEAM-921 introduced a lazy singleton, instantiated once on each machine > (driver & executors), which uses reflection to find all subclasses of > Source and Coder. > While this is beneficial in its own right, the change added about one minute > of overhead to spark runner startup time (which causes the first job/stage to > take up to a minute). > The change is in class {{BeamSparkRunnerRegistrator}}. > The reason reflection (specifically the reflections library) was used here is > that there is no current way of knowing all the source and coder classes > at runtime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
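The fix described in the PR (wrapper classes instead of classpath scanning) can be sketched as follows. This is a simplified illustration with invented names, not Beam's actual `BeamSparkRunnerRegistrator`: because a serialization registrator only needs to name classes that are known statically, wrapping arbitrary `Source`/`Coder` objects in one known wrapper class removes the need to discover every subclass by reflection at startup.

```java
import java.io.Serializable;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch (invented names). Instead of scanning the classpath
// for every Source/Coder subclass at startup (roughly a minute of overhead),
// register a single wrapper class that carries any such object via standard
// Java serialization.
public class RegistratorSketch {
    // Wrapper whose class is known statically, so it alone needs registration.
    static class SerializedValueWrapper implements Serializable {
        final Serializable wrapped;
        SerializedValueWrapper(Serializable wrapped) {
            this.wrapped = wrapped;
        }
    }

    // Explicit, constant-size registration list replaces the reflection scan.
    static List<Class<?>> classesToRegister() {
        return Arrays.asList(SerializedValueWrapper.class);
    }

    public static void main(String[] args) {
        System.out.println(classesToRegister().size());
    }
}
```

The design tradeoff: the wrapped objects fall back to (slower) Java serialization, but registration becomes O(1) instead of a full classpath scan.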
[jira] [Assigned] (BEAM-1145) Remove classifier from shaded spark runner artifact
[ https://issues.apache.org/jira/browse/BEAM-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aviem Zur reassigned BEAM-1145: --- Assignee: Aviem Zur (was: Amit Sela) > Remove classifier from shaded spark runner artifact > --- > > Key: BEAM-1145 > URL: https://issues.apache.org/jira/browse/BEAM-1145 > Project: Beam > Issue Type: Improvement > Components: runner-spark >Reporter: Aviem Zur >Assignee: Aviem Zur > > The shade plugin configured in the spark runner's pom adds a classifier to the spark > runner shaded jar: > {code:xml} > <shadedArtifactAttached>true</shadedArtifactAttached> > <shadedClassifierName>spark-app</shadedClassifierName> > {code} > This means that, in order for a user application that is dependent on > spark-runner to work in cluster mode, they have to add the classifier in > their dependency declaration, like so: > {code:xml} > <dependency> > <groupId>org.apache.beam</groupId> > <artifactId>beam-runners-spark</artifactId> > <version>0.4.0-incubating-SNAPSHOT</version> > <classifier>spark-app</classifier> > </dependency> > {code} > Otherwise, if they do not specify the classifier, the jar they get is unshaded, > which in cluster mode causes collisions between different guava versions. 
> Example exception in cluster mode when adding the dependency without > classifier: > {code} > 16/12/12 06:58:56 WARN TaskSetManager: Lost task 4.0 in stage 8.0 (TID 153, > lvsriskng02.lvs.paypal.com): java.lang.NoSuchMethodError: > com.google.common.base.Stopwatch.createStarted()Lcom/google/common/base/Stopwatch; > at > org.apache.beam.runners.spark.stateful.StateSpecFunctions$1.apply(StateSpecFunctions.java:137) > at > org.apache.beam.runners.spark.stateful.StateSpecFunctions$1.apply(StateSpecFunctions.java:98) > at > org.apache.spark.streaming.StateSpec$$anonfun$1.apply(StateSpec.scala:180) > at > org.apache.spark.streaming.StateSpec$$anonfun$1.apply(StateSpec.scala:179) > at > org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:57) > at > org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:55) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at > org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) > at > org.apache.spark.streaming.rdd.MapWithStateRDDRecord$.updateRecordWithData(MapWithStateRDD.scala:55) > at > org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:155) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:275) > at > org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:149) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:275) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:275) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > I would suggest that the classifier be removed from the shaded jar, to avoid > confusion among users, and have a better user experience. > P.S. Looks like Dataflow runner does not add a classifier to its shaded jar. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (BEAM-1144) Spark runner fails to deserialize MicrobatchSource in cluster mode
[ https://issues.apache.org/jira/browse/BEAM-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aviem Zur reassigned BEAM-1144: --- Assignee: Aviem Zur (was: Amit Sela) > Spark runner fails to deserialize MicrobatchSource in cluster mode > -- > > Key: BEAM-1144 > URL: https://issues.apache.org/jira/browse/BEAM-1144 > Project: Beam > Issue Type: Bug > Components: runner-spark >Reporter: Aviem Zur >Assignee: Aviem Zur > > When running in cluster mode (yarn), spark runner fails on deserialization of > {{MicrobatchSource}} > After changes made in BEAM-921 spark runner fails in cluster mode with the > following: > {code} > 16/12/12 04:27:01 ERROR ApplicationMaster: User class threw exception: > org.apache.beam.sdk.Pipeline$PipelineExecutionException: > com.esotericsoftware.kryo.KryoException: Error during Java deserialization. > org.apache.beam.sdk.Pipeline$PipelineExecutionException: > com.esotericsoftware.kryo.KryoException: Error during Java deserialization. > at > org.apache.beam.runners.spark.SparkPipelineResult.beamExceptionFrom(SparkPipelineResult.java:72) > at > org.apache.beam.runners.spark.SparkPipelineResult.waitUntilFinish(SparkPipelineResult.java:115) > at > org.apache.beam.runners.spark.SparkPipelineResult.waitUntilFinish(SparkPipelineResult.java:101) > at > com.paypal.risk.platform.aleph.example.MapOnlyExample.main(MapOnlyExample.java:38) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:559) > Caused by: com.esotericsoftware.kryo.KryoException: Error during Java > deserialization. 
> at > com.esotericsoftware.kryo.serializers.JavaSerializer.read(JavaSerializer.java:42) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) > at > org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:228) > at > org.apache.spark.serializer.DeserializationStream.readKey(Serializer.scala:169) > at > org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:201) > at > org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:198) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at > org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) > at > org.apache.spark.streaming.rdd.MapWithStateRDDRecord$.updateRecordWithData(MapWithStateRDD.scala:55) > at > org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:155) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:275) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:275) > at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.ClassNotFoundException: >
[jira] [Assigned] (BEAM-1146) Decrease spark runner startup overhead
[ https://issues.apache.org/jira/browse/BEAM-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aviem Zur reassigned BEAM-1146: --- Assignee: Aviem Zur (was: Amit Sela) > Decrease spark runner startup overhead > -- > > Key: BEAM-1146 > URL: https://issues.apache.org/jira/browse/BEAM-1146 > Project: Beam > Issue Type: Improvement > Components: runner-spark >Reporter: Aviem Zur >Assignee: Aviem Zur > > BEAM-921 introduced a lazy singleton, instantiated once on each machine > (driver & executors), which uses reflection to find all subclasses of > Source and Coder. > While this is beneficial in its own right, the change added about one minute > of overhead to spark runner startup time (which causes the first job/stage to > take up to a minute). > The change is in class {{BeamSparkRunnerRegistrator}}. > The reason reflection (specifically the reflections library) was used here is > that there is no current way of knowing all the source and coder classes > at runtime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (BEAM-1191) Eliminate OldDoFn from the SDK
Kenneth Knowles created BEAM-1191: - Summary: Eliminate OldDoFn from the SDK Key: BEAM-1191 URL: https://issues.apache.org/jira/browse/BEAM-1191 Project: Beam Issue Type: Improvement Components: sdk-java-core Reporter: Kenneth Knowles Priority: Minor We are far enough along that {{OldDoFn}} is no longer usable by users and BEAM-498 is closed out. The remaining occurrences are limited to runners and things that should end up in runners-core. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (BEAM-498) Make DoFnWithContext the new DoFn
[ https://issues.apache.org/jira/browse/BEAM-498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kenneth Knowles resolved BEAM-498. -- Resolution: Fixed Fix Version/s: 0.3.0-incubating > Make DoFnWithContext the new DoFn > - > > Key: BEAM-498 > URL: https://issues.apache.org/jira/browse/BEAM-498 > Project: Beam > Issue Type: New Feature > Components: sdk-java-core >Reporter: Kenneth Knowles >Assignee: Kenneth Knowles > Labels: backward-incompatible > Fix For: 0.3.0-incubating > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-27) Add user-ready API for interacting with timers
[ https://issues.apache.org/jira/browse/BEAM-27?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15766056#comment-15766056 ] ASF GitHub Bot commented on BEAM-27: GitHub user kennknowles opened a pull request: https://github.com/apache/incubator-beam/pull/1673 [BEAM-27] Require TimeDomain to delete a timer Be sure to do all of the following to help us incorporate your contribution quickly and easily: - [x] Make sure the PR title is formatted like: `[BEAM-] Description of pull request` - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes). - [x] Replace `` in the title with the actual Jira issue number, if there is one. - [x] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt). --- R: @aljoscha A bit of an oversight, I neglected the fact that runners generally store different sorts of timers in rather different ways. When a user sets a timer, the `DoFnSignature` is available, so this will be for free. And when system code deletes a timer, the domain will always be known. This will require a Dataflow update, so don't worry if Dataflow-specific integration tests don't pass. 
You can merge this pull request into a Git repository by running: $ git pull https://github.com/kennknowles/incubator-beam delete-by-domain Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/1673.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1673 commit 46dfd0fb4d2a1533d3ed053983faee6537d3ccf0 Author: Kenneth Knowles <k...@google.com> Date: 2016-12-21T04:09:25Z Require TimeDomain to delete a timer > Add user-ready API for interacting with timers > -- > > Key: BEAM-27 > URL: https://issues.apache.org/jira/browse/BEAM-27 > Project: Beam > Issue Type: New Feature > Components: sdk-java-core >Reporter: Kenneth Knowles >Assignee: Kenneth Knowles > > Pipeline authors will benefit from a different factorization of interaction > with underlying timers. The current APIs are targeted at runner implementers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
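The motivation in the PR description (runners store different sorts of timers in rather different ways) can be illustrated with a sketch. Everything below is hypothetical and simplified, not the actual Beam timer internals: it just shows why a timer ID alone is not enough to delete a timer when a runner keeps event-time and processing-time timers in separate stores.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a runner that keeps one store per time domain.
// Deleting a timer must name the domain to reach the right store,
// which is why the API change requires a TimeDomain on deletion.
public class TimerStoreSketch {
    enum TimeDomain { EVENT_TIME, PROCESSING_TIME }

    private final Map<TimeDomain, Map<String, Long>> stores = new HashMap<>();

    TimerStoreSketch() {
        stores.put(TimeDomain.EVENT_TIME, new HashMap<>());
        stores.put(TimeDomain.PROCESSING_TIME, new HashMap<>());
    }

    void set(String timerId, TimeDomain domain, long timestamp) {
        stores.get(domain).put(timerId, timestamp);
    }

    // Without the domain, the runner would have to probe every store.
    void delete(String timerId, TimeDomain domain) {
        stores.get(domain).remove(timerId);
    }

    boolean isEmpty(TimeDomain domain) {
        return stores.get(domain).isEmpty();
    }

    public static void main(String[] args) {
        TimerStoreSketch timers = new TimerStoreSketch();
        timers.set("cleanup", TimeDomain.EVENT_TIME, 100L);
        timers.delete("cleanup", TimeDomain.EVENT_TIME);
        System.out.println(timers.isEmpty(TimeDomain.EVENT_TIME));
    }
}
```

As the PR notes, user code setting a timer has the `DoFnSignature` in hand, so the domain is available for free; only deletion needed the explicit parameter.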
[jira] [Commented] (BEAM-1190) FileBasedSource should ignore files that matched the glob but don't exist
[ https://issues.apache.org/jira/browse/BEAM-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15766004#comment-15766004 ] Eugene Kirpichov commented on BEAM-1190: Dan - please note I'm suggesting not to ignore non-existent files entirely, but only to ignore *files that were yielded by the glob match operation but reported as non-existent by the per-file stat operation* - i.e., not to include such files in the bundles produced by splitIntoBundles; effectively this just increases the precision of the glob matching. Can you elaborate on how this is dangerous? > FileBasedSource should ignore files that matched the glob but don't exist > - > > Key: BEAM-1190 > URL: https://issues.apache.org/jira/browse/BEAM-1190 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Reporter: Eugene Kirpichov >Assignee: Eugene Kirpichov > > See user issue: > http://stackoverflow.com/questions/41251741/coping-with-eventual-consistency-of-gcs-bucket-listing > We should, after globbing the files in FileBasedSource, individually stat > every file and remove those that don't exist, to account for the possibility > that glob yielded non-existing files due to eventual consistency. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-79) Gearpump runner
[ https://issues.apache.org/jira/browse/BEAM-79?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765799#comment-15765799 ] ASF GitHub Bot commented on BEAM-79: GitHub user manuzhang reopened a pull request: https://github.com/apache/incubator-beam/pull/1663 [BEAM-79] merge master into gearpump-runner branch Be sure to do all of the following to help us incorporate your contribution quickly and easily: - [x] Make sure the PR title is formatted like: `[BEAM-] Description of pull request` - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes). - [x] Replace `` in the title with the actual Jira issue number, if there is one. - [x] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt). --- You can merge this pull request into a Git repository by running: $ git pull https://github.com/manuzhang/incubator-beam gearpump-runner-sync Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/1663.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1663 commit e9f254ef2769a082c7fbb500c1c28c6224ac5a7f Author: Jakob Homan <jgho...@gmail.com> Date: 2016-12-07T00:59:50Z [BEAM-1099] Minor typos in KafkaIO commit afedd68e806830549724dfc0f2565d756cc6b46d Author: Davor Bonaci <da...@google.com> Date: 2016-12-07T01:03:54Z This closes #1524 commit e8c9686a2e898d38afd692328eb171c542084747 Author: Pei He <pe...@google.com> Date: 2016-11-23T23:59:56Z [BEAM-1047] Add DataflowClient wrapper on top of JSON library. commit ded58832ceaef487f4590d9396f09744288c955d Author: Pei He <pe...@google.com> Date: 2016-11-24T00:14:27Z [Code Health] Remove redundant projectId from DataflowPipelineJob. 
commit ce03f30c1ee0b84ad2e7f10a6272ffb25548244a Author: Pei He <pe...@google.com> Date: 2016-11-28T19:47:42Z [BEAM-1047] Update dataflow runner code to use DataflowClient wrapper. commit b2b570f27808b1671bf6cd0fc60f874da671d4ca Author: bchambers <bchamb...@google.com> Date: 2016-12-07T01:08:13Z Closes #1434 commit 0a2ed832ce5af7556db605e99b985ed4ffc1b152 Author: Sam McVeety <s...@google.com> Date: 2016-10-30T18:58:44Z BigQueryIO.Read: support runtime options commit 869b2710efdb90bc3ce5b6e9d4f3b49a3a804a63 Author: Aljoscha Krettek <aljoscha.kret...@gmail.com> Date: 2016-12-07T05:28:13Z [FLINK-1102] Fix Aggregator Registration in Flink Batch Runner commit b41a46e86fd38c4a887f31bdf6cb75969f4750d3 Author: Aljoscha Krettek <aljoscha.kret...@gmail.com> Date: 2016-12-07T07:26:02Z This closes #1530 commit baf5e6bd9b1011f4c5c3974aa46393471b340c15 Author: Jean-Baptiste Onofré <jbono...@apache.org> Date: 2016-12-07T07:37:33Z [BEAM-1094] Set test scope for Kafka IO and junit commit 9ccf6dbea0d3807fef6a7c0432906fffd2b8ec3f Author: Sela <ans...@paypal.com> Date: 2016-12-07T08:31:38Z This closes #1531 commit dce3a196a3a26fdd42225520faf3d9084ee48123 Author: Sela <ans...@paypal.com> Date: 2016-12-07T09:20:07Z [BEAM-329] Update Spark runner README. 
commit 02bb8c375c48847b1686d70184fb194500a62e8c Author: Jean-Baptiste Onofré <jbono...@apache.org> Date: 2016-12-07T11:51:09Z [BEAM-329] This closes #1532 commit b2d72237b592e1dcb5cca30f5cbc9a11d2890c0f Author: Kenneth Knowles <k...@google.com> Date: 2016-12-06T23:20:28Z Port most of DoFnRunner Javadoc to new DoFn commit 1526184ae8be1f8ae6863f830653204157a584cd Author: Thomas Groh <tg...@google.com> Date: 2016-12-07T16:51:02Z This closes #1527 commit 8e1e46e73edf9cce376ed7bd194db00edc3e60b4 Author: Kenneth Knowles <k...@google.com> Date: 2016-12-07T05:01:37Z Port ParDoTest from OldDoFn to new DoFn commit ae52ec1bc3f3120e9f8e150e50c18564055a467c Author: Kenneth Knowles <k...@google.com> Date: 2016-12-07T17:00:18Z This closes #1529 commit 55d333bff68809ff1a9154491ace02d2d16e3b85 Author: Thomas Groh <tg...@google.com> Date: 2016-12-05T22:29:05Z Only provide expanded Inputs and Outputs This removes PInput and POutput from the immediate API Surface of TransformHierarchy.Node, and forces Pipeline Visitors to access only the expanded version of the output. This is part of the move towards the runner-agnostic representation of a graph. commit 5b31a369962907e257de8019fbf6cde4c615b1c0 Author: Thomas Groh <tg...@google.com> Date: 2016-12-07T17:14:38Z This closes #1511 commit 43fef2775145f67def3ab8a246ecca192a7d650b Author: Dan Halperin <
[jira] [Commented] (BEAM-1190) FileBasedSource should ignore files that matched the glob but don't exist
[ https://issues.apache.org/jira/browse/BEAM-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765711#comment-15765711 ] Paul Findlay commented on BEAM-1190: [~dhalp...@google.com] Correct me if I'm wrong.. but isn't FileBasedSource.createReader basically already doing a stat for each file in the expanded list but swallowing the error if there is one, and leaving it for startImpl to blow up? We are just asking for the method to not be final so we can treat the different sub-classes of IOException appropriately (for our pipeline). But would love to know if there is scary behaviour we haven't considered. > FileBasedSource should ignore files that matched the glob but don't exist > - > > Key: BEAM-1190 > URL: https://issues.apache.org/jira/browse/BEAM-1190 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Reporter: Eugene Kirpichov >Assignee: Eugene Kirpichov > > See user issue: > http://stackoverflow.com/questions/41251741/coping-with-eventual-consistency-of-gcs-bucket-listing > We should, after globbing the files in FileBasedSource, individually stat > every file and remove those that don't exist, to account for the possibility > that glob yielded non-existing files due to eventual consistency. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (BEAM-1190) FileBasedSource should ignore files that matched the glob but don't exist
[ https://issues.apache.org/jira/browse/BEAM-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765711#comment-15765711 ] Paul Findlay edited comment on BEAM-1190 at 12/21/16 12:52 AM: --- [~dhalp...@google.com] Correct me if I'm wrong.. but isn't FileBasedSource.createReader basically already doing a stat for each file in the expanded list but swallowing the error if there is one, and leaving it for FileBasedReader.startImpl to blow up? We are just asking for the method to not be final so we can treat the different sub-classes of IOException appropriately (for our pipeline). But would love to know if there is scary behaviour we haven't considered. was (Author: p...@findlay.net.nz): [~dhalp...@google.com] Correct me if I'm wrong.. but isn't FileBasedSource.createReader basically already doing a stat for each file in the expanded list but swallowing the error if there is one, and leaving it for startImpl to blow up? We are just asking for the method to not be final so we can treat the different sub-classes of IOException appropriately (for our pipeline). But would love to know if there is scary behaviour we haven't considered. > FileBasedSource should ignore files that matched the glob but don't exist > - > > Key: BEAM-1190 > URL: https://issues.apache.org/jira/browse/BEAM-1190 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Reporter: Eugene Kirpichov >Assignee: Eugene Kirpichov > > See user issue: > http://stackoverflow.com/questions/41251741/coping-with-eventual-consistency-of-gcs-bucket-listing > We should, after globbing the files in FileBasedSource, individually stat > every file and remove those that don't exist, to account for the possibility > that glob yielded non-existing files due to eventual consistency. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-1190) FileBasedSource should ignore files that matched the glob but don't exist
[ https://issues.apache.org/jira/browse/BEAM-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765683#comment-15765683 ] Daniel Halperin commented on BEAM-1190: --- The two relevant JIRA issues: [BEAM-76] and [BEAM-60] > FileBasedSource should ignore files that matched the glob but don't exist > - > > Key: BEAM-1190 > URL: https://issues.apache.org/jira/browse/BEAM-1190 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Reporter: Eugene Kirpichov >Assignee: Eugene Kirpichov > > See user issue: > http://stackoverflow.com/questions/41251741/coping-with-eventual-consistency-of-gcs-bucket-listing > We should, after globbing the files in FileBasedSource, individually stat > every file and remove those that don't exist, to account for the possibility > that glob yielded non-existing files due to eventual consistency. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-1190) FileBasedSource should ignore files that matched the glob but don't exist
[ https://issues.apache.org/jira/browse/BEAM-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765675#comment-15765675 ] Daniel Halperin commented on BEAM-1190: --- I think this is a very scary default behavior, and something the user should implement on their own in pipeline construction. Alternately, there's already a JIRA issue for giving the user a hook to run code at expansion time in order to, e.g., autocomplete sharding templates that eventual consistency chose not to show. > FileBasedSource should ignore files that matched the glob but don't exist > - > > Key: BEAM-1190 > URL: https://issues.apache.org/jira/browse/BEAM-1190 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Reporter: Eugene Kirpichov >Assignee: Eugene Kirpichov > > See user issue: > http://stackoverflow.com/questions/41251741/coping-with-eventual-consistency-of-gcs-bucket-listing > We should, after globbing the files in FileBasedSource, individually stat > every file and remove those that don't exist, to account for the possibility > that glob yielded non-existing files due to eventual consistency. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (BEAM-1190) FileBasedSource should ignore files that matched the glob but don't exist
[ https://issues.apache.org/jira/browse/BEAM-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765675#comment-15765675 ] Daniel Halperin edited comment on BEAM-1190 at 12/21/16 12:33 AM: -- I think this is a very scary proposal for a new default behavior, and something the user should implement on their own in pipeline construction. Alternately, there's already a JIRA issue for giving the user a hook to run code at expansion time in order to, e.g., autocomplete sharding templates that eventual consistency chose not to show. was (Author: dhalp...@google.com): I think this is a very scary default behavior, and something the user should implement on their own in pipeline construction. Alternately, there's already a JIRA issue for giving the user a hook to run code at expansion time in order to, e.g., autocomplete sharding templates that eventual consistency chose not to show. > FileBasedSource should ignore files that matched the glob but don't exist > - > > Key: BEAM-1190 > URL: https://issues.apache.org/jira/browse/BEAM-1190 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Reporter: Eugene Kirpichov >Assignee: Eugene Kirpichov > > See user issue: > http://stackoverflow.com/questions/41251741/coping-with-eventual-consistency-of-gcs-bucket-listing > We should, after globbing the files in FileBasedSource, individually stat > every file and remove those that don't exist, to account for the possibility > that glob yielded non-existing files due to eventual consistency. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-1190) FileBasedSource should ignore files that matched the glob but don't exist
[ https://issues.apache.org/jira/browse/BEAM-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765658#comment-15765658 ] Eugene Kirpichov commented on BEAM-1190: CC: [~dhalp...@google.com] [~pei...@gmail.com] > FileBasedSource should ignore files that matched the glob but don't exist > - > > Key: BEAM-1190 > URL: https://issues.apache.org/jira/browse/BEAM-1190 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Reporter: Eugene Kirpichov >Assignee: Eugene Kirpichov > > See user issue: > http://stackoverflow.com/questions/41251741/coping-with-eventual-consistency-of-gcs-bucket-listing > We should, after globbing the files in FileBasedSource, individually stat > every file and remove those that don't exist, to account for the possibility > that glob yielded non-existing files due to eventual consistency. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (BEAM-1190) FileBasedSource should ignore files that matched the glob but don't exist
Eugene Kirpichov created BEAM-1190: -- Summary: FileBasedSource should ignore files that matched the glob but don't exist Key: BEAM-1190 URL: https://issues.apache.org/jira/browse/BEAM-1190 Project: Beam Issue Type: Bug Components: sdk-java-core Reporter: Eugene Kirpichov Assignee: Eugene Kirpichov See user issue: http://stackoverflow.com/questions/41251741/coping-with-eventual-consistency-of-gcs-bucket-listing We should, after globbing the files in FileBasedSource, individually stat every file and remove those that don't exist, to account for the possibility that glob yielded non-existing files due to eventual consistency. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
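The fix proposed in BEAM-1190 — after expanding the glob, individually stat every matched file and drop the ones that don't exist — can be sketched as follows. This is an illustrative sketch, not Beam's FileBasedSource code; `stat_fn` stands in for the real filesystem existence check.

```python
def filter_existing(globbed_paths, stat_fn):
    """Drop paths that matched the glob but no longer exist.

    Eventually consistent listings (as in the linked GCS question) can
    return paths that fail a subsequent stat; those are skipped.
    stat_fn(path) -> bool stands in for the real filesystem check.
    """
    return [path for path in globbed_paths if stat_fn(path)]


# Example against an in-memory "filesystem":
existing = {"gs://bucket/a.txt", "gs://bucket/b.txt"}
matched = ["gs://bucket/a.txt", "gs://bucket/ghost.txt", "gs://bucket/b.txt"]
print(filter_existing(matched, existing.__contains__))
# ['gs://bucket/a.txt', 'gs://bucket/b.txt']
```

Note that this is exactly the default behavior the comment above objects to: the filter silently hides listing inconsistency from the user.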
[jira] [Updated] (BEAM-1067) apex.examples.WordCountTest.testWordCountExample is flaky
[ https://issues.apache.org/jira/browse/BEAM-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kenneth Knowles updated BEAM-1067: -- Summary: apex.examples.WordCountTest.testWordCountExample is flaky (was: apex.examples.WordCountTest.testWordCountExample is be flaky) > apex.examples.WordCountTest.testWordCountExample is flaky > - > > Key: BEAM-1067 > URL: https://issues.apache.org/jira/browse/BEAM-1067 > Project: Beam > Issue Type: Bug > Components: runner-apex >Reporter: Stas Levin >Assignee: Thomas Weise > > Seems that > {{org.apache.beam.runners.apex.examples.WordCountTest.testWordCountExample}} > is flaky. > For example, > [this|https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/5408/org.apache.beam$beam-runners-apex/testReport/org.apache.beam.runners.apex.examples/WordCountTest/testWordCountExample/ > ] run failed although no changes were made in {{runner-apex}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-1067) apex.examples.WordCountTest.testWordCountExample is be flaky
[ https://issues.apache.org/jira/browse/BEAM-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765632#comment-15765632 ] Jason Kuster commented on BEAM-1067: https://builds.apache.org/job/beam_PostCommit_Java_MavenInstall/2185/ just now > apex.examples.WordCountTest.testWordCountExample is be flaky > > > Key: BEAM-1067 > URL: https://issues.apache.org/jira/browse/BEAM-1067 > Project: Beam > Issue Type: Bug > Components: runner-apex >Reporter: Stas Levin >Assignee: Thomas Weise > > Seems that > {{org.apache.beam.runners.apex.examples.WordCountTest.testWordCountExample}} > is flaky. > For example, > [this|https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/5408/org.apache.beam$beam-runners-apex/testReport/org.apache.beam.runners.apex.examples/WordCountTest/testWordCountExample/ > ] run failed although no changes were made in {{runner-apex}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-25) Add user-ready API for interacting with state
[ https://issues.apache.org/jira/browse/BEAM-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765628#comment-15765628 ] ASF GitHub Bot commented on BEAM-25: GitHub user kennknowles opened a pull request: https://github.com/apache/incubator-beam/pull/1670 [BEAM-25, BEAM-1117] Fixes for direct runner expansion and evaluation of stateful ParDo Be sure to do all of the following to help us incorporate your contribution quickly and easily: - [x] Make sure the PR title is formatted like: `[BEAM-] Description of pull request` - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes). - [x] Replace `` in the title with the actual Jira issue number, if there is one. - [x] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt). --- R: @tgroh also peeled off from the timers PR, these are fixes for the whole setup. 
You can merge this pull request into a Git repository by running: $ git pull https://github.com/kennknowles/incubator-beam DirectRunner-Stateful Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/1670.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1670 commit 0615fc9749c3fd0012f4d5524ea8486413778636 Author: Kenneth Knowles <k...@google.com> Date: 2016-12-20T21:58:29Z Fix windowing in direct runner Stateful ParDo commit 7bc23d6b53ed29ae565121df49180ad8d4aac653 Author: Kenneth Knowles <k...@google.com> Date: 2016-12-20T23:59:45Z Actually propagate and commit state in direct runner > Add user-ready API for interacting with state > - > > Key: BEAM-25 > URL: https://issues.apache.org/jira/browse/BEAM-25 > Project: Beam > Issue Type: New Feature > Components: sdk-java-core >Reporter: Kenneth Knowles >Assignee: Kenneth Knowles > Labels: State > > Our current state API is targeted at runner implementers, not pipeline > authors. As such it has many capabilities that are not necessary nor > desirable for simple use cases of stateful ParDo (such as dynamic state tag > creation). Implement a simple state intended for user access. > (Details of our current thoughts in forthcoming design doc) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
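The "simple state intended for user access" that BEAM-25 describes can be pictured as a per-key value cell read and written inside a DoFn, without the dynamic state-tag machinery runners need. The sketch below is illustrative only; the names loosely mirror the proposal, not Beam's actual API.

```python
class ValueState:
    """A per-key, single-value state cell of the kind a user-facing
    state API exposes. Illustrative sketch, not Beam's implementation."""

    def __init__(self):
        self._value = None

    def read(self):
        return self._value

    def write(self, value):
        self._value = value


# A stateful "DoFn" body: count elements seen per key.
state_by_key = {}

def process(key, element):
    state = state_by_key.setdefault(key, ValueState())
    state.write((state.read() or 0) + 1)
    return key, state.read()

print(process("a", "x"))  # ('a', 1)
print(process("a", "y"))  # ('a', 2)
```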
[jira] [Commented] (BEAM-1117) Support for new Timer API in Direct runner
[ https://issues.apache.org/jira/browse/BEAM-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765551#comment-15765551 ] ASF GitHub Bot commented on BEAM-1117: -- GitHub user kennknowles opened a pull request: https://github.com/apache/incubator-beam/pull/1669 [BEAM-1117] Direct runner timers prereqs Be sure to do all of the following to help us incorporate your contribution quickly and easily: - [x] Make sure the PR title is formatted like: `[BEAM-] Description of pull request` - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes). - [x] Replace `` in the title with the actual Jira issue number, if there is one. - [x] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt). --- Per request, here are some commits from #1667 broken out. I am happy to trim off more, etc, whatever is easiest for review. 
R: @tgroh You can merge this pull request into a Git repository by running: $ git pull https://github.com/kennknowles/incubator-beam DirectRunner-timers-prereqs Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/1669.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1669 commit f64816e0cf2e4fcc9525f40ede01c2f8e4ecf28d Author: Kenneth Knowles <k...@google.com> Date: 2016-12-20T04:40:11Z Add informative Instant formatter to BoundedWindow commit 46c6a4f613629f09b48e3630aa344760b0ad46d4 Author: Kenneth Knowles <k...@google.com> Date: 2016-12-20T04:40:47Z Use informative Instant formatter in WatermarkHold commit 92baa418fbe53c0e7c7afc81db31fc02ab7f3915 Author: Kenneth Knowles <k...@google.com> Date: 2016-12-20T21:57:55Z Add static Window.withOutputTimeFn to match build method commit 7118c4ff85636a65431be54fa2e2f18fb52914cf Author: Kenneth Knowles <k...@google.com> Date: 2016-12-20T22:20:07Z Add UsesTestStream for use with JUnit @Category commit f667a3e8abcd95be7a235132219c936178ab6bc8 Author: Kenneth Knowles <k...@google.com> Date: 2016-12-08T04:18:44Z Allow setting timer by ID in DirectTimerInternals commit 217e5245e59800d57aa36551fbbdb642a5b447a0 Author: Kenneth Knowles <k...@google.com> Date: 2016-12-20T21:37:40Z Hold output watermark according to pending timers > Support for new Timer API in Direct runner > -- > > Key: BEAM-1117 > URL: https://issues.apache.org/jira/browse/BEAM-1117 > Project: Beam > Issue Type: New Feature > Components: runner-direct >Reporter: Kenneth Knowles >Assignee: Kenneth Knowles > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-646) Get runners out of the apply()
[ https://issues.apache.org/jira/browse/BEAM-646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765542#comment-15765542 ] ASF GitHub Bot commented on BEAM-646: - Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/1569 > Get runners out of the apply() > -- > > Key: BEAM-646 > URL: https://issues.apache.org/jira/browse/BEAM-646 > Project: Beam > Issue Type: Improvement > Components: beam-model-runner-api, sdk-java-core >Reporter: Kenneth Knowles >Assignee: Thomas Groh > > Right now, the runner intercepts calls to apply() and replaces transforms as > we go. This means that there is no "original" user graph. For portability and > misc architectural benefits, we would like to build the original graph first, > and have the runner override later. > Some runners already work in this manner, but we could integrate it more > smoothly, with more validation, via some handy APIs on e.g. the Pipeline > object. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-362) Move shared runner functionality out of SDK and into runners/core-java
[ https://issues.apache.org/jira/browse/BEAM-362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765531#comment-15765531 ] ASF GitHub Bot commented on BEAM-362: - Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/1665 > Move shared runner functionality out of SDK and into runners/core-java > -- > > Key: BEAM-362 > URL: https://issues.apache.org/jira/browse/BEAM-362 > Project: Beam > Issue Type: Improvement > Components: runner-core >Reporter: Kenneth Knowles >Assignee: Kenneth Knowles > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-27) Add user-ready API for interacting with timers
[ https://issues.apache.org/jira/browse/BEAM-27?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765387#comment-15765387 ] ASF GitHub Bot commented on BEAM-27: GitHub user kennknowles opened a pull request: https://github.com/apache/incubator-beam/pull/1668 [BEAM-27, BEAM-362] Remove deprecated InMemoryTimerInternals from SDK Be sure to do all of the following to help us incorporate your contribution quickly and easily: - [x] Make sure the PR title is formatted like: `[BEAM-] Description of pull request` - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes). - [x] Replace `` in the title with the actual Jira issue number, if there is one. - [x] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt). --- You can merge this pull request into a Git repository by running: $ git pull https://github.com/kennknowles/incubator-beam InMemoryTimerInternals Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/1668.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1668 commit bbea8469912b23383a9ae5cf084b5801706e Author: Kenneth Knowles <k...@google.com> Date: 2016-12-20T22:07:00Z Remove deprecated InMemoryTimerInternals from SDK > Add user-ready API for interacting with timers > -- > > Key: BEAM-27 > URL: https://issues.apache.org/jira/browse/BEAM-27 > Project: Beam > Issue Type: New Feature > Components: sdk-java-core >Reporter: Kenneth Knowles >Assignee: Kenneth Knowles > > Pipeline authors will benefit from a different factorization of interaction > with underlying timers. The current APIs are targeted at runner implementers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (BEAM-1097) Dataflow error message for non-existing gcpTempLocation is misleading
[ https://issues.apache.org/jira/browse/BEAM-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Scott Wegner closed BEAM-1097. -- Resolution: Fixed Fix Version/s: 0.5.0-incubating > Dataflow error message for non-existing gcpTempLocation is misleading > - > > Key: BEAM-1097 > URL: https://issues.apache.org/jira/browse/BEAM-1097 > Project: Beam > Issue Type: Bug > Components: examples-java, runner-dataflow >Reporter: Scott Wegner >Assignee: Scott Wegner >Priority: Minor > Fix For: 0.5.0-incubating > > > The error message for specifying a GCP tempLocation which doesn't exist is > misleading. Rather than mentioning the given path doesn't exist, it says none > was specified. > This is particularly frustrating because it's one of the few configuration > values necessary to get started with an example or starter archetype, and > it's easy to introduce a typo as it's specified on the commandline. In my > case, I was specifying "gs://swegner-tmp" instead of "gs://swegner-test". > Repro: > 1. Clone the starter archetype: {noformat}mvn archetype:generate > -DarchetypeGroupId=org.apache.beam > -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-starter{noformat} > 2. Add beam-runners-google-cloud-dataflow-java as a dependency in the > generated pom.xml > 3. Build: {noformat}mvn install{noformat} > 4. Run: {noformat}mvn exec:java -DmainClass=swegner.StarterPipeline > -Dexec.args="--runner=DataflowRunner > --tempLocation=gs://swegner-tmp"{noformat} > Expected: An error message along the lines of: "The specified GCP temp > location 'gs://swegner-tmp' does not exist under project 'myGcpProject'" > bq. [ERROR] Failed to execute goal > org.codehaus.mojo:exec-maven-plugin:1.4.0:java (default-cli) on project > counter-names-test: An exception occured while executing the Java class. 
> null: InvocationTargetException: Failed to construct instance from factory > method DataflowRunner#fromOptions(interface > org.apache.beam.sdk.options.PipelineOptions): DataflowRunner requires > gcpTempLocation, and it is missing in PipelineOptions. -> [Help 1] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
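The improved behavior the issue asks for — fail with a message naming the nonexistent path rather than claiming no path was given — can be sketched as a construction-time check. This is a hypothetical helper, not DataflowRunner's actual validation; `bucket_exists(name) -> bool` stands in for a real GCS lookup.

```python
def validate_gcp_temp_location(path, bucket_exists):
    """Raise the clearer error proposed in the issue when the temp
    location's bucket is missing. All names here are illustrative;
    bucket_exists stands in for a real GCS existence check."""
    if not path:
        raise ValueError(
            "DataflowRunner requires gcpTempLocation, "
            "and it is missing in PipelineOptions.")
    if not path.startswith("gs://"):
        raise ValueError("gcpTempLocation must be a gs:// path, got %r" % path)
    bucket = path[len("gs://"):].split("/", 1)[0]
    if not bucket_exists(bucket):
        raise ValueError(
            "The specified GCP temp location %r does not exist" % path)


buckets = {"swegner-test"}
validate_gcp_temp_location("gs://swegner-test/tmp", buckets.__contains__)  # passes
try:
    validate_gcp_temp_location("gs://swegner-tmp", buckets.__contains__)
except ValueError as e:
    print(e)  # The specified GCP temp location 'gs://swegner-tmp' does not exist
```

Running the check before job submission turns the misleading "missing option" failure into an actionable typo report.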
[jira] [Commented] (BEAM-1097) Dataflow error message for non-existing gcpTempLocation is misleading
[ https://issues.apache.org/jira/browse/BEAM-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765244#comment-15765244 ] ASF GitHub Bot commented on BEAM-1097: -- Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/1522 > Dataflow error message for non-existing gcpTempLocation is misleading > - > > Key: BEAM-1097 > URL: https://issues.apache.org/jira/browse/BEAM-1097 > Project: Beam > Issue Type: Bug > Components: examples-java, runner-dataflow >Reporter: Scott Wegner >Assignee: Scott Wegner >Priority: Minor > > The error message for specifying a GCP tempLocation which doesn't exist is > misleading. Rather than mentioning the given path doesn't exist, it says none > was specified. > This is particularly frustrating because it's one of the few configuration > values necessary to get started with an example or starter archetype, and > it's easy to introduce a typo as it's specified on the commandline. In my > case, I was specifying "gs://swegner-tmp" instead of "gs://swegner-test". > Repro: > 1. Clone the starter archetype: {noformat}mvn archetype:generate > -DarchetypeGroupId=org.apache.beam > -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-starter{noformat} > 2. Add beam-runners-google-cloud-dataflow-java as a dependency in the > generated pom.xml > 3. Build: {noformat}mvn install{noformat} > 4. Run: {noformat}mvn exec:java -DmainClass=swegner.StarterPipeline > -Dexec.args="--runner=DataflowRunner > --tempLocation=gs://swegner-tmp"{noformat} > Expected: An error message along the lines of: "The specified GCP temp > location 'gs://swegner-tmp' does not exist under project 'myGcpProject'" > bq. [ERROR] Failed to execute goal > org.codehaus.mojo:exec-maven-plugin:1.4.0:java (default-cli) on project > counter-names-test: An exception occured while executing the Java class. 
> null: InvocationTargetException: Failed to construct instance from factory > method DataflowRunner#fromOptions(interface > org.apache.beam.sdk.options.PipelineOptions): DataflowRunner requires > gcpTempLocation, and it is missing in PipelineOptions. -> [Help 1] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-1112) Python E2E Integration Test Framework - Batch Only
[ https://issues.apache.org/jira/browse/BEAM-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765220#comment-15765220 ] ASF GitHub Bot commented on BEAM-1112: -- GitHub user markflyhigh reopened a pull request: https://github.com/apache/incubator-beam/pull/1639 [BEAM-1112] Python E2E Test Framework And Wordcount E2E Test Be sure to do all of the following to help us incorporate your contribution quickly and easily: - [x] Make sure the PR title is formatted like: `[BEAM-] Description of pull request` - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes). - [ ] Replace `` in the title with the actual Jira issue number, if there is one. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt). --- - E2e test framework that supports TestRunner start and verify pipeline job. - add `TestOptions` which defines `on_success_matcher` that is used to verify state/output of pipeline job. - validate `on_success_matcher` before pipeline execution to make sure it unpickles to a subclass of BaseMatcher. - create a `TestDataflowRunner` which provides the functionality of `DataflowRunner` plus result verification. - provide a test verifier `PipelineStateMatcher` that can verify whether a pipeline job finished in DONE or not. - Add wordcount_it (it = integration test) that builds an e2e test based on the existing wordcount pipeline. - include wordcount_it in the nose collector, so that wordcount_it can be collected and run by nose. - skip ITs when running unit tests from tox in precommit and postcommit. Current changes will not change behavior of existing pre/postcommit. 
Test is done by running `tox -e py27 -c sdks/python/tox.ini` for unit test and running wordcount_it with `TestDataflowRunner` on service ([link](https://pantheon.corp.google.com/dataflow/job/2016-12-15_17_36_16-3857167705491723621?project=google.com:clouddfe)). TODO: - Output data verifier that verify pipeline output that stores in filesystem. - Add wordcount_it to precommit and replace existing wordcount execution command in postcommit with a better structured nose command. You can merge this pull request into a Git repository by running: $ git pull https://github.com/markflyhigh/incubator-beam e2e-testrunner Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/1639.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1639 commit e1e1fa3a60e1fe234829432d144d6689e240b6f0 Author: Mark Liu <mark...@google.com> Date: 2016-12-16T01:41:20Z [BEAM-1112] Python E2E Test Framework And Wordcount E2E Test commit 0e7007879ee082e3afe5db36107f51c03274f3f5 Author: Mark Liu <mark...@google.com> Date: 2016-12-16T02:55:53Z fixup! Fix Code Style commit d6d71a717e8ed7ab32ffa02621c837c797f66cd7 Author: Mark Liu <mark...@google.com> Date: 2016-12-20T19:15:59Z fixup! Address Ahmet comments > Python E2E Integration Test Framework - Batch Only > -- > > Key: BEAM-1112 > URL: https://issues.apache.org/jira/browse/BEAM-1112 > Project: Beam > Issue Type: Task > Components: sdk-py, testing >Reporter: Mark Liu >Assignee: Mark Liu > > Parity with Java. > Build e2e integration test framework that can configure and run batch > pipeline with specified test runner, wait for pipeline execution and verify > results with given verifiers in the end. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
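The `PipelineStateMatcher` verifier described in this PR can be sketched as follows. In the real framework it subclasses hamcrest's BaseMatcher; the plain class below only illustrates the shape, and `FakeResult` is an invented stand-in for a pipeline result object.

```python
class PipelineStateMatcher:
    """Hamcrest-style matcher checking a pipeline job's terminal state.
    Illustrative sketch; the real class subclasses hamcrest BaseMatcher."""

    def __init__(self, expected_state="DONE"):
        self.expected_state = expected_state

    def matches(self, pipeline_result):
        # pipeline_result is assumed to expose current_state().
        return pipeline_result.current_state() == self.expected_state


class FakeResult:
    """Invented stand-in for a runner's pipeline result."""
    def __init__(self, state):
        self._state = state

    def current_state(self):
        return self._state


print(PipelineStateMatcher().matches(FakeResult("DONE")))    # True
print(PipelineStateMatcher().matches(FakeResult("FAILED")))  # False
```

The matcher has to be picklable (as the validation step above requires) because it is shipped to the test runner alongside the pipeline options.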
[jira] [Commented] (BEAM-1112) Python E2E Integration Test Framework - Batch Only
[ https://issues.apache.org/jira/browse/BEAM-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765219#comment-15765219 ] ASF GitHub Bot commented on BEAM-1112: -- Github user markflyhigh closed the pull request at: https://github.com/apache/incubator-beam/pull/1639 > Python E2E Integration Test Framework - Batch Only > -- > > Key: BEAM-1112 > URL: https://issues.apache.org/jira/browse/BEAM-1112 > Project: Beam > Issue Type: Task > Components: sdk-py, testing >Reporter: Mark Liu >Assignee: Mark Liu > > Parity with Java. > Build e2e integration test framework that can configure and run batch > pipeline with specified test runner, wait for pipeline execution and verify > results with given verifiers in the end. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-1117) Support for new Timer API in Direct runner
[ https://issues.apache.org/jira/browse/BEAM-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765203#comment-15765203 ] ASF GitHub Bot commented on BEAM-1117: -- GitHub user kennknowles opened a pull request: https://github.com/apache/incubator-beam/pull/1667 [BEAM-1117] Support user timers for ParDo in the direct runner Be sure to do all of the following to help us incorporate your contribution quickly and easily: - [x] Make sure the PR title is formatted like: `[BEAM-] Description of pull request` - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes). - [x] Replace `` in the title with the actual Jira issue number, if there is one. - [x] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt). --- R: @tgroh You can merge this pull request into a Git repository by running: $ git pull https://github.com/kennknowles/incubator-beam DirectRunner-timers Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/1667.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1667 commit a3ac176cd7edb18d4f633682ee0e6ff30ab76f64 Author: Kenneth Knowles <k...@google.com> Date: 2016-12-08T04:18:44Z Allow setting timer by ID in DirectTimerInternals commit 445750d6cf36f1eda1094531541788260c3fe229 Author: Kenneth Knowles <k...@google.com> Date: 2016-12-08T18:27:23Z No longer reject timers for ParDo in direct runner commit d428abe9e12ddd2609773512a180589ff960d954 Author: Kenneth Knowles <k...@google.com> Date: 2016-12-08T23:18:44Z Deliver timers in the direct runner commit 6915bbc550ad692656e8eeb1ba7161213c9a6ce6 Author: Kenneth Knowles <k...@google.com> Date: 2016-12-20T04:40:11Z Add informative Instant formatter to BoundedWindow commit 
2af3f93602b5299cc33c876310a784fc82ff4941 Author: Kenneth Knowles <k...@google.com> Date: 2016-12-20T04:40:47Z Use informative Instant formatter in WatermarkHold > Support for new Timer API in Direct runner > -- > > Key: BEAM-1117 > URL: https://issues.apache.org/jira/browse/BEAM-1117 > Project: Beam > Issue Type: New Feature > Components: runner-direct >Reporter: Kenneth Knowles >Assignee: Kenneth Knowles > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-1117) Support for new Timer API in Direct runner
[ https://issues.apache.org/jira/browse/BEAM-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765191#comment-15765191 ] ASF GitHub Bot commented on BEAM-1117: -- Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/1581 > Support for new Timer API in Direct runner > -- > > Key: BEAM-1117 > URL: https://issues.apache.org/jira/browse/BEAM-1117 > Project: Beam > Issue Type: New Feature > Components: runner-direct >Reporter: Kenneth Knowles >Assignee: Kenneth Knowles > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-1189) Add guide for release verifiers in the release guide
[ https://issues.apache.org/jira/browse/BEAM-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765117#comment-15765117 ] Jean-Baptiste Onofré commented on BEAM-1189: +1 How to verify artifacts is something interesting indeed. I did quick instructions for Apache Karaf: http://karaf.apache.org/download.html#verify > Add guide for release verifiers in the release guide > > > Key: BEAM-1189 > URL: https://issues.apache.org/jira/browse/BEAM-1189 > Project: Beam > Issue Type: Improvement > Components: website >Reporter: Kenneth Knowles >Assignee: James Malone > > This came up during the 0.4.0-incubating release discussion. > There is this checklist: > http://incubator.apache.org/guides/releasemanagement.html#check-list > And we could point to that but make more detailed Beam-specific instructions > on > http://beam.apache.org/contribute/release-guide/#vote-on-the-release-candidate > And the template for the vote email should include a link to suggested > verification steps. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (BEAM-1189) Add guide for release verifiers in the release guide
Kenneth Knowles created BEAM-1189: - Summary: Add guide for release verifiers in the release guide Key: BEAM-1189 URL: https://issues.apache.org/jira/browse/BEAM-1189 Project: Beam Issue Type: Improvement Components: website Reporter: Kenneth Knowles Assignee: James Malone This came up during the 0.4.0-incubating release discussion. There is this checklist: http://incubator.apache.org/guides/releasemanagement.html#check-list And we could point to that but make more detailed Beam-specific instructions on http://beam.apache.org/contribute/release-guide/#vote-on-the-release-candidate And the template for the vote email should include a link to suggested verification steps. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-27) Add user-ready API for interacting with timers
[ https://issues.apache.org/jira/browse/BEAM-27?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765055#comment-15765055 ] ASF GitHub Bot commented on BEAM-27: Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/1652 > Add user-ready API for interacting with timers > -- > > Key: BEAM-27 > URL: https://issues.apache.org/jira/browse/BEAM-27 > Project: Beam > Issue Type: New Feature > Components: sdk-java-core >Reporter: Kenneth Knowles >Assignee: Kenneth Knowles > > Pipeline authors will benefit from a different factorization of interaction > with underlying timers. The current APIs are targeted at runner implementers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (BEAM-1188) More Verifiers For Python E2E Tests
[ https://issues.apache.org/jira/browse/BEAM-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Liu updated BEAM-1188: --- Description: Add more basic verifiers in e2e test to verify output data in different storage/fs: 1. File verifier: compute and verify checksum of file(s) that’s stored on a filesystem (GCS / local fs). 2. Bigquery verifier: query from Bigquery table and verify response content. Also update TestOptions.on_success_matcher to accept a list of matchers instead of a single one. Note: Have retry when doing IO to avoid test flakiness that may come from inconsistency of the filesystem. This problem happened in Java integration tests. > More Verifiers For Python E2E Tests > --- > > Key: BEAM-1188 > URL: https://issues.apache.org/jira/browse/BEAM-1188 > Project: Beam > Issue Type: Task > Components: sdk-py, testing >Reporter: Mark Liu >Assignee: Mark Liu > > Add more basic verifiers in e2e test to verify output data in different > storage/fs: > 1. File verifier: compute and verify checksum of file(s) that’s stored on a > filesystem (GCS / local fs). > 2. Bigquery verifier: query from Bigquery table and verify response content. > Also update TestOptions.on_success_matcher to accept a list of matchers > instead of a single one. > Note: Have retry when doing IO to avoid test flakiness that may come from > inconsistency of the filesystem. This problem happened in Java integration > tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (BEAM-1188) More Verifiers For Python E2E Tests
[ https://issues.apache.org/jira/browse/BEAM-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Liu updated BEAM-1188: --- Description: Add more basic verifiers in e2e test to verify output data in different storage/fs: 1. File verifier: compute and verify checksum of file(s) that’s stored on a filesystem (GCS / local fs). 2. Bigquery verifier: query from Bigquery table and verify response content. ... Also update TestOptions.on_success_matcher to accept a list of matchers instead of a single one. Note: Have retry when doing IO to avoid test flakiness that may come from inconsistency of the filesystem. This problem happened in Java integration tests. was: Add more basic verifiers in e2e test to verify output data in different storage/fs: 1. File verifier: compute and verify checksum of file(s) that’s stored on a filesystem (GCS / local fs). 2. Bigquery verifier: query from Bigquery table and verify response content. Also update TestOptions.on_success_matcher to accept a list of matchers instead of a single one. Note: Have retry when doing IO to avoid test flakiness that may come from inconsistency of the filesystem. This problem happened in Java integration tests. > More Verifiers For Python E2E Tests > --- > > Key: BEAM-1188 > URL: https://issues.apache.org/jira/browse/BEAM-1188 > Project: Beam > Issue Type: Task > Components: sdk-py, testing >Reporter: Mark Liu >Assignee: Mark Liu > > Add more basic verifiers in e2e test to verify output data in different > storage/fs: > 1. File verifier: compute and verify checksum of file(s) that’s stored on a > filesystem (GCS / local fs). > 2. Bigquery verifier: query from Bigquery table and verify response content. > ... > Also update TestOptions.on_success_matcher to accept a list of matchers > instead of a single one. > Note: Have retry when doing IO to avoid test flakiness that may come from > inconsistency of the filesystem. This problem happened in Java integration > tests. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
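The file verifier described in BEAM-1188 computes a checksum over output files regardless of how the output was sharded. One sketch of such a scheme (illustrative only; the real verifier's exact algorithm may differ) is to hash each line, sort the digests, and hash their concatenation:

```python
import hashlib

def files_checksum(shard_contents):
    """Order-independent checksum over sharded output: hash every line,
    sort the digests, then hash their concatenation, so the result does
    not depend on how lines were split across shards. Illustrative
    sketch of the idea, not the framework's actual implementation."""
    line_hashes = sorted(
        hashlib.sha1(line.encode("utf-8")).hexdigest()
        for content in shard_contents
        for line in content.splitlines())
    return hashlib.sha1("".join(line_hashes).encode("utf-8")).hexdigest()


a = files_checksum(["apple: 1\nbanana: 2", "cherry: 3"])
b = files_checksum(["cherry: 3", "banana: 2\napple: 1"])  # same lines, resharded
print(a == b)  # True
```

An order-independent checksum matters here because runners are free to reshard output, so a byte-for-byte file comparison would be flaky even for a correct pipeline.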
[jira] [Created] (BEAM-1188) More Verifiers For Python E2E Tests
Mark Liu created BEAM-1188: -- Summary: More Verifiers For Python E2E Tests Key: BEAM-1188 URL: https://issues.apache.org/jira/browse/BEAM-1188 Project: Beam Issue Type: Task Components: sdk-py, testing Reporter: Mark Liu Assignee: Mark Liu -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (BEAM-1187) GCP Transport not performing timed backoff after connection failure
[ https://issues.apache.org/jira/browse/BEAM-1187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davor Bonaci updated BEAM-1187: --- Assignee: Pei He (was: Davor Bonaci) > GCP Transport not performing timed backoff after connection failure > --- > > Key: BEAM-1187 > URL: https://issues.apache.org/jira/browse/BEAM-1187 > Project: Beam > Issue Type: Bug > Components: runner-dataflow, sdk-java-core, sdk-java-gcp >Reporter: Luke Cwik >Assignee: Pei He >Priority: Minor > > The HTTP requests are failing and seemingly being immediately retried > if there is a connection exception. Note that below all the times are the > same, and also that we are logging too much. This seems to be related to the > interaction of the chained HTTP request initializers, where the Credential > initializer is followed by the RetryHttpRequestInitializer. Also, note that we > never log "Request failed with IOException, will NOT retry", which implies > that the retry logic never made it to the RetryHttpRequestInitializer. > Action items are: > 1) Ensure that the RetryHttpRequestInitializer is used > 2) Ensure that calls do backoff > 3) Reduce the logging to one terminal statement saying that we retried X > times and final failure was YYY. > Dump of console output: > Dec 20, 2016 9:12:20 AM > com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner fromOptions > INFO: PipelineOptions.filesToStage was not specified. Defaulting to files > from the classpath: will stage 1 files. Enable logging at DEBUG level to see > which files will be staged. > Dec 20, 2016 9:12:21 AM > com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner run > INFO: Executing pipeline on the Dataflow Service, which will have billing > implications related to Google Compute Engine usage and other Google Cloud > Services. 
> Dec 20, 2016 9:12:21 AM com.google.cloud.dataflow.sdk.util.PackageUtil > stageClasspathElements > INFO: Uploading 1 files from PipelineOptions.filesToStage to staging location > to prepare for execution. > Dec 20, 2016 9:12:21 AM com.google.cloud.dataflow.sdk.util.PackageUtil > stageClasspathElements > INFO: Uploading PipelineOptions.filesToStage complete: 1 files newly > uploaded, 0 files cached > Dec 20, 2016 9:12:22 AM com.google.api.client.http.HttpRequest execute > WARNING: exception thrown while executing request > java.net.ConnectException: Connection refused > at java.net.PlainSocketImpl.socketConnect(Native Method) > at > java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) > at > java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) > at > java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) > at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) > at java.net.Socket.connect(Socket.java:589) > at sun.net.NetworkClient.doConnect(NetworkClient.java:175) > at sun.net.www.http.HttpClient.openServer(HttpClient.java:432) > at sun.net.www.http.HttpClient.openServer(HttpClient.java:527) > at sun.net.www.http.HttpClient.&lt;init&gt;(HttpClient.java:211) > at sun.net.www.http.HttpClient.New(HttpClient.java:308) > at sun.net.www.http.HttpClient.New(HttpClient.java:326) > at > sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169) > at > sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105) > at > sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999) > at > sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933) > at > sun.net.www.protocol.http.HttpURLConnection.getOutputStream0(HttpURLConnection.java:1283) > at > sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1258) > at > 
com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:77) > at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:981) > at > com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419) > at > com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352) > at > com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469) > at > com.google.cloud.da
[jira] [Reopened] (BEAM-1186) Migrate the remaining tests to use TestPipeline as a JUnit rule.
[ https://issues.apache.org/jira/browse/BEAM-1186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kenneth Knowles reopened BEAM-1186: --- > Migrate the remaining tests to use TestPipeline as a JUnit rule. > > > Key: BEAM-1186 > URL: https://issues.apache.org/jira/browse/BEAM-1186 > Project: Beam > Issue Type: Improvement > Components: sdk-java-core >Reporter: Stas Levin >Assignee: Stas Levin >Priority: Minor > Fix For: 0.5.0-incubating > > > Following up on [BEAM-1176|https://issues.apache.org/jira/browse/BEAM-1176], > the following tests still have direct calls to {{TestPipeline.create()}}: > * {{AvroIOGeneratedClassTest#runTestRead}} > * {{ApproximateUniqueTest#runApproximateUniqueWithDuplicates}} > * {{ApproximateUniqueTest#runApproximateUniqueWithSkewedDistributions}} > * {{SampleTest#runPickAnyTest}} > * {{BigtableIOTest#runReadTest}} > Consider using [parametrised > tests|https://github.com/Pragmatists/junitparams] as suggested by [~lcwik]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (BEAM-1176) Make our test suites use @Rule TestPipeline
[ https://issues.apache.org/jira/browse/BEAM-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kenneth Knowles resolved BEAM-1176. --- Resolution: Fixed Fix Version/s: 0.5.0-incubating > Make our test suites use @Rule TestPipeline > --- > > Key: BEAM-1176 > URL: https://issues.apache.org/jira/browse/BEAM-1176 > Project: Beam > Issue Type: Improvement > Components: sdk-java-core >Reporter: Kenneth Knowles >Assignee: Stas Levin >Priority: Minor > Fix For: 0.5.0-incubating > > > Now that [~staslev] has made {{TestPipeline}} a JUnit rule that performs > useful sanity checks, we should port all of our tests to it so that they set > a good example for users. Maybe we'll even catch some straggling tests with > errors :-) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (BEAM-1186) Migrate the remaining tests to use TestPipeline as a JUnit rule.
[ https://issues.apache.org/jira/browse/BEAM-1186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kenneth Knowles resolved BEAM-1186. --- Resolution: Fixed Fix Version/s: 0.5.0-incubating > Migrate the remaining tests to use TestPipeline as a JUnit rule. > > > Key: BEAM-1186 > URL: https://issues.apache.org/jira/browse/BEAM-1186 > Project: Beam > Issue Type: Improvement > Components: sdk-java-core >Reporter: Stas Levin >Assignee: Stas Levin >Priority: Minor > Fix For: 0.5.0-incubating > > > Following up on [BEAM-1176|https://issues.apache.org/jira/browse/BEAM-1176], > the following tests still have direct calls to {{TestPipeline.create()}}: > * {{AvroIOGeneratedClassTest#runTestRead}} > * {{ApproximateUniqueTest#runApproximateUniqueWithDuplicates}} > * {{ApproximateUniqueTest#runApproximateUniqueWithSkewedDistributions}} > * {{SampleTest#runPickAnyTest}} > * {{BigtableIOTest#runReadTest}} > Consider using [parametrised > tests|https://github.com/Pragmatists/junitparams] as suggested by [~lcwik]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-1176) Make our test suites use @Rule TestPipeline
[ https://issues.apache.org/jira/browse/BEAM-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15764973#comment-15764973 ] ASF GitHub Bot commented on BEAM-1176: -- Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/1664 > Make our test suites use @Rule TestPipeline > --- > > Key: BEAM-1176 > URL: https://issues.apache.org/jira/browse/BEAM-1176 > Project: Beam > Issue Type: Improvement > Components: sdk-java-core >Reporter: Kenneth Knowles >Assignee: Stas Levin >Priority: Minor > > Now that [~staslev] has made {{TestPipeline}} a JUnit rule that performs > useful sanity checks, we should port all of our tests to it so that they set > a good example for users. Maybe we'll even catch some straggling tests with > errors :-) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (BEAM-1187) GCP Transport not performing timed backoff after connection failure
Luke Cwik created BEAM-1187: --- Summary: GCP Transport not performing timed backoff after connection failure Key: BEAM-1187 URL: https://issues.apache.org/jira/browse/BEAM-1187 Project: Beam Issue Type: Bug Components: runner-dataflow, sdk-java-core, sdk-java-gcp Reporter: Luke Cwik Assignee: Davor Bonaci Priority: Minor The HTTP requests are failing and seemingly being immediately retried if there is a connection exception. Note that below all the times are the same, and also that we are logging too much. This seems to be related to the interaction of the chained HTTP request initializers, where the Credential initializer is followed by the RetryHttpRequestInitializer. Also, note that we never log "Request failed with IOException, will NOT retry", which implies that the retry logic never made it to the RetryHttpRequestInitializer. Action items are: 1) Ensure that the RetryHttpRequestInitializer is used 2) Ensure that calls do backoff 3) Reduce the logging to one terminal statement saying that we retried X times and final failure was YYY. Dump of console output: Dec 20, 2016 9:12:20 AM com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner fromOptions INFO: PipelineOptions.filesToStage was not specified. Defaulting to files from the classpath: will stage 1 files. Enable logging at DEBUG level to see which files will be staged. Dec 20, 2016 9:12:21 AM com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner run INFO: Executing pipeline on the Dataflow Service, which will have billing implications related to Google Compute Engine usage and other Google Cloud Services. Dec 20, 2016 9:12:21 AM com.google.cloud.dataflow.sdk.util.PackageUtil stageClasspathElements INFO: Uploading 1 files from PipelineOptions.filesToStage to staging location to prepare for execution. 
Dec 20, 2016 9:12:21 AM com.google.cloud.dataflow.sdk.util.PackageUtil stageClasspathElements INFO: Uploading PipelineOptions.filesToStage complete: 1 files newly uploaded, 0 files cached Dec 20, 2016 9:12:22 AM com.google.api.client.http.HttpRequest execute WARNING: exception thrown while executing request java.net.ConnectException: Connection refused at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:589) at sun.net.NetworkClient.doConnect(NetworkClient.java:175) at sun.net.www.http.HttpClient.openServer(HttpClient.java:432) at sun.net.www.http.HttpClient.openServer(HttpClient.java:527) at sun.net.www.http.HttpClient.&lt;init&gt;(HttpClient.java:211) at sun.net.www.http.HttpClient.New(HttpClient.java:308) at sun.net.www.http.HttpClient.New(HttpClient.java:326) at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169) at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105) at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999) at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933) at sun.net.www.protocol.http.HttpURLConnection.getOutputStream0(HttpURLConnection.java:1283) at sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1258) at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:77) at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:981) at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419) at 
com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352) at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469) at com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner.run(DataflowPipelineRunner.java:632) at com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner.run(DataflowPipelineRunner.java:201) at com.google.cloud.dataflow.sdk.Pipeline.run(Pipeline.java:181) at com.google.cloud.dataflow.integration.NumbersStreaming.numbersStreamingFromPubsub(NumbersStreaming.java:378) at com.google.cloud.dataflow.integration.NumbersStreaming.main(NumbersStreaming.java:831) Dec 20, 2016 9:12:22 AM com.google.api.client.http.HttpRequest execute WARNING: exception thrown while executi
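The timed backoff that BEAM-1187's action item 2 asks for can be sketched in a few lines. This is an illustrative Python sketch of exponential backoff with jitter, not the RetryHttpRequestInitializer implementation; the function name `execute_with_backoff` and its parameters are hypothetical.

```python
import random
import time


def execute_with_backoff(request_fn, max_retries=5, initial_delay_secs=1.0,
                         multiplier=2.0, sleep_fn=time.sleep):
    """Run request_fn, retrying transient failures with exponential backoff.

    After each failed attempt we sleep for the current delay (plus
    randomized jitter, to avoid thundering herds), then double the delay.
    Once the retry budget is exhausted, the last failure is re-raised --
    a single terminal point for logging "retried N times, final failure
    was ...", instead of one warning per attempt.
    """
    delay = initial_delay_secs
    for attempt in range(max_retries + 1):
        try:
            return request_fn()
        except ConnectionError:
            if attempt == max_retries:
                raise  # terminal failure: report once, here
            sleep_fn(delay * (1 + random.random()))  # jittered wait
            delay *= multiplier
```

The key property being reported as broken is that successive retries should be spaced increasingly far apart, rather than fired at the same timestamp as in the console dump above.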
[jira] [Commented] (BEAM-1176) Make our test suites use @Rule TestPipeline
[ https://issues.apache.org/jira/browse/BEAM-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15764919#comment-15764919 ] Stas Levin commented on BEAM-1176: -- Sounds good, enter [BEAM-1186|https://issues.apache.org/jira/browse/BEAM-1186]. > Make our test suites use @Rule TestPipeline > --- > > Key: BEAM-1176 > URL: https://issues.apache.org/jira/browse/BEAM-1176 > Project: Beam > Issue Type: Improvement > Components: sdk-java-core >Reporter: Kenneth Knowles >Assignee: Stas Levin >Priority: Minor > > Now that [~staslev] has made {{TestPipeline}} a JUnit rule that performs > useful sanity checks, we should port all of our tests to it so that they set > a good example for users. Maybe we'll even catch some straggling tests with > errors :-) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (BEAM-1186) Migrate the remaining tests to use TestPipeline as a JUnit rule.
Stas Levin created BEAM-1186: Summary: Migrate the remaining tests to use TestPipeline as a JUnit rule. Key: BEAM-1186 URL: https://issues.apache.org/jira/browse/BEAM-1186 Project: Beam Issue Type: Improvement Components: sdk-java-core Reporter: Stas Levin Assignee: Stas Levin Priority: Minor Following up on [BEAM-1176|https://issues.apache.org/jira/browse/BEAM-1176], the following tests still have direct calls to {{TestPipeline.create()}}: * {{AvroIOGeneratedClassTest#runTestRead}} * {{ApproximateUniqueTest#runApproximateUniqueWithDuplicates}} * {{ApproximateUniqueTest#runApproximateUniqueWithSkewedDistributions}} * {{SampleTest#runPickAnyTest}} * {{BigtableIOTest#runReadTest}} Consider using [parametrised tests|https://github.com/Pragmatists/junitparams] as suggested by [~lcwik]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-362) Move shared runner functionality out of SDK and into runners/core-java
[ https://issues.apache.org/jira/browse/BEAM-362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15764802#comment-15764802 ] ASF GitHub Bot commented on BEAM-362: - GitHub user kennknowles opened a pull request: https://github.com/apache/incubator-beam/pull/1666 [BEAM-362] Move ExecutionContext and related classes to runners-core Be sure to do all of the following to help us incorporate your contribution quickly and easily: - [x] Make sure the PR title is formatted like: `[BEAM-] Description of pull request` - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes). - [x] Replace `` in the title with the actual Jira issue number, if there is one. - [x] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt). --- R: @lukecwik This is built on top of #1665 and will require a new worker image. You can merge this pull request into a Git repository by running: $ git pull https://github.com/kennknowles/incubator-beam ExecutionContext Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/1666.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1666 commit 03a85e82ca5f1dff5aae184907508d7c5309a404 Author: Kenneth Knowles <k...@google.com> Date: 2016-12-16T04:13:25Z Remove deprecated AggregatorFactory from SDK commit 6d7a4b10ba74b1fc08d0ad6a759ca5e0ebffdbba Author: Kenneth Knowles <k...@google.com> Date: 2016-12-16T04:20:34Z Move ExecutionContext and related classes to runners-core > Move shared runner functionality out of SDK and into runners/core-java > -- > > Key: BEAM-362 > URL: https://issues.apache.org/jira/browse/BEAM-362 > Project: Beam > Issue Type: Improvement > Components: runner-core >Reporter: Kenneth Knowles >Assignee: Kenneth 
Knowles > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-980) Document how to configure the DAG created by Apex Runner
[ https://issues.apache.org/jira/browse/BEAM-980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15764783#comment-15764783 ] Thomas Weise commented on BEAM-980: --- The runner should probably have an option to specify a configuration file so that users can tune the execution attributes similar to what they can do when launching through the Apex CLI. https://github.com/apache/incubator-beam/blob/release-0.4.0-incubating/runners/apex/src/main/java/org/apache/beam/runners/apex/ApexRunner.java#L158 > Document how to configure the DAG created by Apex Runner > > > Key: BEAM-980 > URL: https://issues.apache.org/jira/browse/BEAM-980 > Project: Beam > Issue Type: Task > Components: runner-apex >Reporter: Thomas Weise >Assignee: Sandeep Deshmukh > > The Beam pipeline is translated to an Apex DAG of operators that have names > that are derived from the transforms. In case of composite transforms those > look like path names. Apex lets the user configure things like memory, > vcores, parallelism through properties/attributes that reference the operator > names. The configuration approach needs to be documented and supplemented > with an example. -- This message was sent by Atlassian JIRA (v6.3.4#6332)