[jira] [Commented] (BEAM-570) Update AvroSource to support more compression types
[ https://issues.apache.org/jira/browse/BEAM-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15547423#comment-15547423 ]

ASF GitHub Bot commented on BEAM-570:
--------------------------------------

GitHub user katsiapis opened a pull request: https://github.com/apache/incubator-beam/pull/1053

[BEAM-570] Title of the pull request

- Getting rid of CompressionTypes.ZLIB and CompressionTypes.NO_COMPRESSION.
- Introducing BZIP2 compression, in analogy to Dataflow Java's BZIP2, towards resolution of https://issues.apache.org/jira/browse/BEAM-570.
- Introducing SNAPPY codec support for AVRO conciseness and in order to fully resolve https://issues.apache.org/jira/browse/BEAM-570.
- Moving avroio from compression_type to codec, as per various discussions.
- A few cleanups in avroio.
- Making textio more DRY and doing a few cleanups.
- Raising exceptions when splitting is requested for a compressed source, since that should never happen (guaranteed by the service for the supported compression types).
- Using cStringIO instead of StringIO in various places, as decided in some other discussions.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/katsiapis/incubator-beam bz2

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/1053.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #1053

commit bd44e76b80e4edf4f922b9a26f7b359c4ede2008
Author: Gus Katsiapis
Date: 2016-10-05T02:41:07Z

    Several enhancements to Dataflow (part 2 of 2).

> Update AvroSource to support more compression types
> ---------------------------------------------------
>
> Key: BEAM-570
> URL: https://issues.apache.org/jira/browse/BEAM-570
> Project: Beam
> Issue Type: Improvement
> Components: sdk-py
> Reporter: Chamikara Jayalath
> Assignee: Chamikara Jayalath
>
> Python AvroSource [1] currently only supports 'deflate' compression.
> We should update it to support the other compression types supported by the Avro library (e.g., snappy, bzip2).
>
> [1] https://github.com/apache/incubator-beam/blob/python-sdk/sdks/python/apache_beam/io/avroio.py

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
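The compression types discussed above can be sketched with standard-library codecs alone. This is an illustrative, standalone example (not Beam's avroio API): 'deflate' corresponds to zlib and 'bzip2' to bz2; 'snappy' needs the third-party python-snappy package, so it is omitted here.

```python
import bz2
import zlib

# Codecs discussed in BEAM-570: 'deflate' (zlib) and 'bzip2' (bz2).
# 'snappy' requires the third-party python-snappy package, so it is left out.
CODECS = {
    "deflate": (zlib.compress, zlib.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
}

def roundtrip(codec, payload):
    """Compress and decompress payload with the named codec."""
    compress, decompress = CODECS[codec]
    return decompress(compress(payload))

data = b"some avro block bytes" * 100
for name in CODECS:
    assert roundtrip(name, data) == data
```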
[GitHub] incubator-beam pull request #1053: [BEAM-570] Title of the pull request
GitHub user katsiapis opened a pull request: https://github.com/apache/incubator-beam/pull/1053

[BEAM-570] Title of the pull request

- Getting rid of CompressionTypes.ZLIB and CompressionTypes.NO_COMPRESSION.
- Introducing BZIP2 compression, in analogy to Dataflow Java's BZIP2, towards resolution of https://issues.apache.org/jira/browse/BEAM-570.
- Introducing SNAPPY codec support for AVRO conciseness and in order to fully resolve https://issues.apache.org/jira/browse/BEAM-570.
- Moving avroio from compression_type to codec, as per various discussions.
- A few cleanups in avroio.
- Making textio more DRY and doing a few cleanups.
- Raising exceptions when splitting is requested for a compressed source, since that should never happen (guaranteed by the service for the supported compression types).
- Using cStringIO instead of StringIO in various places, as decided in some other discussions.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/katsiapis/incubator-beam bz2

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/1053.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #1053

commit bd44e76b80e4edf4f922b9a26f7b359c4ede2008
Author: Gus Katsiapis
Date: 2016-10-05T02:41:07Z

    Several enhancements to Dataflow (part 2 of 2).

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
---
[GitHub] incubator-beam pull request #1052: Delete DatastoreWordCount
GitHub user dhalperi opened a pull request: https://github.com/apache/incubator-beam/pull/1052

Delete DatastoreWordCount

Be sure to do all of the following to help us incorporate your contribution quickly and easily:

- [ ] Make sure the PR title is formatted like: `[BEAM-<Jira issue #>] Description of pull request`
- [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes.)
- [ ] Replace `<Jira issue #>` in the title with the actual Jira issue number, if there is one.
- [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt).

---

This is the kind of example we do not need to have in Beam. It's just WordCount with a different data source.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dhalperi/incubator-beam patch-1

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/1052.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #1052

commit d5d3af967008e342530622500f252d7b108f39cf
Author: Daniel Halperin
Date: 2016-10-05T02:23:54Z

    Delete DatastoreWordCount

    This is the kind of example we do not need to have in Beam. It's just WordCount with a different data source.
[GitHub] incubator-beam pull request #1051: Add RootTransformEvaluatorFactory
GitHub user tgroh opened a pull request: https://github.com/apache/incubator-beam/pull/1051

Add RootTransformEvaluatorFactory

Be sure to do all of the following to help us incorporate your contribution quickly and easily:

- [ ] Make sure the PR title is formatted like: `[BEAM-<Jira issue #>] Description of pull request`
- [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes.)
- [ ] Replace `<Jira issue #>` in the title with the actual Jira issue number, if there is one.
- [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt).

---

Use for Root Transforms. These transforms generate their own initial inputs, which the Evaluator is responsible for providing back to them to generate elements from the root PCollections. Update ExecutorServiceParallelExecutor to schedule roots based on the provided transforms. Some tests which are no longer relevant are deleted within this PR.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tgroh/incubator-beam initial_inputs

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/1051.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #1051

commit 5b0fa8a8599cd0bce5ad60426206f6768b0b0d13
Author: Thomas Groh
Date: 2016-09-30T23:28:35Z

    Add RootTransformEvaluatorFactory

    Use for Root Transforms. These transforms generate their own initial inputs, which the Evaluator is responsible for providing back to them to generate elements from the root PCollections. Update ExecutorServiceParallelExecutor to schedule roots based on the provided transforms.
[GitHub] incubator-beam pull request #1050: Makes FileBasedSink use a temporary direc...
GitHub user jkff opened a pull request: https://github.com/apache/incubator-beam/pull/1050

Makes FileBasedSink use a temporary directory

When writing to `/path/to/foo`, temporary files would be written to `/path/to/foo-temp-$uid` (or something like that), i.e. as siblings of the final output. That could lead to issues like http://stackoverflow.com/q/39822859/278042

Now, temporary files are written to a path like `/path/to/temp-beam-foo-$date/$uid`. This way, the temporary files won't match the same glob as the final output files (even though they may still fail to be deleted due to eventual-consistency issues).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jkff/incubator-beam file-sink-tmp-dir

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/1050.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #1050

commit 1c34acdaf4a0c0697c9646934ac163788133347b
Author: Eugene Kirpichov
Date: 2016-10-04T22:23:27Z

    Makes FileBasedSink use a temporary directory

    When writing to /path/to/foo, temporary files would be written to /path/to/foo-temp-$uid (or something like that), i.e. as siblings of the final output. That could lead to issues like http://stackoverflow.com/q/39822859/278042

    Now, temporary files are written to a path like: /path/to/temp-beam-foo-$date/$uid.
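The glob-collision problem the PR fixes can be demonstrated in a few lines. This is a hedged sketch of the naming scheme described above, not the sink's actual implementation; the exact separators are assumptions.

```python
import fnmatch
import posixpath
import uuid
from datetime import date

def old_temp_path(output_path):
    # Old scheme: temp files were siblings of the final output,
    # so they matched the same glob as the output.
    return "%s-temp-%s" % (output_path, uuid.uuid4().hex)

def new_temp_path(output_path):
    # New scheme (as described in the PR): temp files live under a
    # separate temp-beam-<prefix>-<date>/ directory next to the output.
    base_dir, prefix = posixpath.split(output_path)
    temp_dir = "temp-beam-%s-%s" % (prefix, date.today().isoformat())
    return posixpath.join(base_dir, temp_dir, uuid.uuid4().hex)

out = "/path/to/foo"
glob = out + "*"  # the glob a user would read the final output back with
assert fnmatch.fnmatch(old_temp_path(out), glob)      # old: collision
assert not fnmatch.fnmatch(new_temp_path(out), glob)  # new: no collision
```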
[jira] [Commented] (BEAM-498) Make DoFnWithContext the new DoFn
[ https://issues.apache.org/jira/browse/BEAM-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546827#comment-15546827 ]

ASF GitHub Bot commented on BEAM-498:
--------------------------------------

GitHub user kennknowles opened a pull request: https://github.com/apache/incubator-beam/pull/1049

[BEAM-498] Add DoFnInvoker for OldDoFn, for migration ease

Be sure to do all of the following to help us incorporate your contribution quickly and easily:

- [x] Make sure the PR title is formatted like: `[BEAM-<Jira issue #>] Description of pull request`
- [x] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes.)
- [x] Replace `<Jira issue #>` in the title with the actual Jira issue number, if there is one.
- [x] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt).

---

This allows any runner to use `DoFnInvokers.invokerFor(Object)` to be agnostic as to whether it is running a `DoFn` or `OldDoFn`. Thus, the migration of the runner can occur in advance of further changes to the SDK, and deployment can be independent. For example, a backend need not know whether it is deserializing a `DoFn` or `OldDoFn`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kennknowles/incubator-beam DoFnInvokers

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/1049.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #1049

commit 52d74b9cc15268ad29d5f00a04843fa1aeda1c9e
Author: Kenneth Knowles
Date: 2016-10-04T20:56:13Z

    Add DoFnInvoker for OldDoFn, for migration ease

    This allows any runner to use DoFnInvokers.invokerFor(Object) to be agnostic as to whether it is running a DoFn or OldDoFn. Thus, the migration of the runner can occur in advance of further changes to the SDK, and deployment can be independent. For example, a backend need not know whether it is deserializing a DoFn or OldDoFn.

> Make DoFnWithContext the new DoFn
> ---------------------------------
>
> Key: BEAM-498
> URL: https://issues.apache.org/jira/browse/BEAM-498
> Project: Beam
> Issue Type: New Feature
> Components: sdk-java-core
> Reporter: Kenneth Knowles
> Assignee: Kenneth Knowles
[GitHub] incubator-beam pull request #1049: [BEAM-498] Add DoFnInvoker for OldDoFn, f...
GitHub user kennknowles opened a pull request: https://github.com/apache/incubator-beam/pull/1049

[BEAM-498] Add DoFnInvoker for OldDoFn, for migration ease

Be sure to do all of the following to help us incorporate your contribution quickly and easily:

- [x] Make sure the PR title is formatted like: `[BEAM-<Jira issue #>] Description of pull request`
- [x] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes.)
- [x] Replace `<Jira issue #>` in the title with the actual Jira issue number, if there is one.
- [x] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt).

---

This allows any runner to use `DoFnInvokers.invokerFor(Object)` to be agnostic as to whether it is running a `DoFn` or `OldDoFn`. Thus, the migration of the runner can occur in advance of further changes to the SDK, and deployment can be independent. For example, a backend need not know whether it is deserializing a `DoFn` or `OldDoFn`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kennknowles/incubator-beam DoFnInvokers

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/1049.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #1049

commit 52d74b9cc15268ad29d5f00a04843fa1aeda1c9e
Author: Kenneth Knowles
Date: 2016-10-04T20:56:13Z

    Add DoFnInvoker for OldDoFn, for migration ease

    This allows any runner to use DoFnInvokers.invokerFor(Object) to be agnostic as to whether it is running a DoFn or OldDoFn. Thus, the migration of the runner can occur in advance of further changes to the SDK, and deployment can be independent. For example, a backend need not know whether it is deserializing a DoFn or OldDoFn.
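The invoker idea above can be illustrated with a small, hypothetical Python analogy (the PR itself is Java; all class and method names below are illustrative, not Beam's API): the runner asks a factory for an invoker and gets a uniform call interface whichever DoFn flavor the user wrote.

```python
# Hypothetical analogy of DoFnInvokers.invokerFor: dispatch on the
# user function's shape so the runner never needs to know its flavor.
class OldStyleDoFn:
    def processElement(self, element):
        return element * 2

class NewStyleDoFn:
    def process(self, element):
        return element * 2

class Invoker:
    def __init__(self, invoke):
        self.invoke = invoke

def invoker_for(fn):
    """Return an Invoker regardless of which DoFn flavor fn is."""
    if hasattr(fn, "process"):
        return Invoker(fn.process)
    if hasattr(fn, "processElement"):
        return Invoker(fn.processElement)
    raise TypeError("not a DoFn: %r" % (fn,))

assert invoker_for(OldStyleDoFn()).invoke(21) == 42
assert invoker_for(NewStyleDoFn()).invoke(21) == 42
```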
[GitHub] incubator-beam pull request #1046: Add equality methods to range source.
Github user robertwb closed the pull request at: https://github.com/apache/incubator-beam/pull/1046
[1/2] incubator-beam git commit: Add equality methods to range source.
Repository: incubator-beam
Updated Branches: refs/heads/python-sdk 3a69db0c5 -> 99a33ecdb

Add equality methods to range source.

Project: http://git-wip-us.apache.org/repos/asf/incubator-beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-beam/commit/837c5aa3
Tree: http://git-wip-us.apache.org/repos/asf/incubator-beam/tree/837c5aa3
Diff: http://git-wip-us.apache.org/repos/asf/incubator-beam/diff/837c5aa3

Branch: refs/heads/python-sdk
Commit: 837c5aa31dd170a9c7e1ba6559e77457bf4a9f7f
Parents: 3a69db0
Author: Robert Bradshaw
Authored: Tue Oct 4 13:12:08 2016 -0700
Committer: Robert Bradshaw
Committed: Tue Oct 4 14:35:56 2016 -0700

----------------------------------------------------------------------
 sdks/python/apache_beam/io/concat_source_test.py | 8 ++++++++
 1 file changed, 8 insertions(+)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/incubator-beam/blob/837c5aa3/sdks/python/apache_beam/io/concat_source_test.py

diff --git a/sdks/python/apache_beam/io/concat_source_test.py b/sdks/python/apache_beam/io/concat_source_test.py
index 828bdb0..e4df472 100644
--- a/sdks/python/apache_beam/io/concat_source_test.py
+++ b/sdks/python/apache_beam/io/concat_source_test.py
@@ -70,6 +70,14 @@ class RangeSource(iobase.BoundedSource):
         return
       yield k
 
+  # For testing
+  def __eq__(self, other):
+    return (type(self) == type(other)
+            and self._start == other._start and self._end == other._end)
+
+  def __ne__(self, other):
+    return not self == other
+
 
 class ConcatSourceTest(unittest.TestCase):
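The patch above adds value equality so tests can compare sources directly. A minimal standalone sketch (dropping the iobase.BoundedSource base class, which is not needed to show the behavior):

```python
# Standalone version of the equality methods added in commit 837c5aa3:
# two RangeSources compare equal iff they have the same type and bounds,
# which is what lets a test assertEqual expected and actual sources.
class RangeSource(object):
    def __init__(self, start, end):
        self._start, self._end = start, end

    # For testing
    def __eq__(self, other):
        return (type(self) == type(other)
                and self._start == other._start and self._end == other._end)

    def __ne__(self, other):
        return not self == other

assert RangeSource(0, 10) == RangeSource(0, 10)
assert RangeSource(0, 10) != RangeSource(0, 11)
```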
[2/2] incubator-beam git commit: Closes #1046
Closes #1046

Project: http://git-wip-us.apache.org/repos/asf/incubator-beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-beam/commit/99a33ecd
Tree: http://git-wip-us.apache.org/repos/asf/incubator-beam/tree/99a33ecd
Diff: http://git-wip-us.apache.org/repos/asf/incubator-beam/diff/99a33ecd

Branch: refs/heads/python-sdk
Commit: 99a33ecdb68f66cbf98cdb145794df56ff469081
Parents: 3a69db0 837c5aa
Author: Robert Bradshaw
Authored: Tue Oct 4 14:35:57 2016 -0700
Committer: Robert Bradshaw
Committed: Tue Oct 4 14:35:57 2016 -0700

----------------------------------------------------------------------
 sdks/python/apache_beam/io/concat_source_test.py | 8 ++++++++
 1 file changed, 8 insertions(+)
----------------------------------------------------------------------
[jira] [Commented] (BEAM-540) Dataflow streaming jobs running on windmill do not need data disks
[ https://issues.apache.org/jira/browse/BEAM-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546638#comment-15546638 ]

Pei He commented on BEAM-540:
------------------------------

this can close?

> Dataflow streaming jobs running on windmill do not need data disks
> ------------------------------------------------------------------
>
> Key: BEAM-540
> URL: https://issues.apache.org/jira/browse/BEAM-540
> Project: Beam
> Issue Type: New Feature
> Components: runner-dataflow
> Reporter: David Rieber
> Assignee: Davor Bonaci
>
> Dataflow streaming jobs running on windmill do not need data disks.
[jira] [Updated] (BEAM-702) Simple pattern for per-bundle and per-DoFn Closeable resources
[ https://issues.apache.org/jira/browse/BEAM-702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pei He updated BEAM-702:
-------------------------
Component/s: beam-model

> Simple pattern for per-bundle and per-DoFn Closeable resources
> --------------------------------------------------------------
>
> Key: BEAM-702
> URL: https://issues.apache.org/jira/browse/BEAM-702
> Project: Beam
> Issue Type: Improvement
> Components: beam-model
> Reporter: Eugene Kirpichov
>
> Dealing with Closeable resources inside a processElement call is easy: simply use try-with-resources.
> However, bundle- or DoFn-scoped resources, such as long-lived database connections, are less convenient to deal with: you have to open them in startBundle and conditionally close them in finishBundle (likewise setup/teardown), taking special care, if there are multiple resources, to close all of them.
> Perhaps we should provide something like Guava's Closer to DoFn's: https://github.com/google/guava/wiki/ClosingResourcesExplained. Ideally, the user would need to only write a startBundle() or setup() method, but not write finishBundle() or teardown() - resources would be closed automatically.
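The pattern the ticket asks for has a standard-library analogue in Python: contextlib.ExitStack plays the role of Guava's Closer. This is a hedged sketch under that analogy (the DoFn and Resource classes are illustrative, not Beam's API): resources registered in setup are all closed by a single stack.close(), in reverse registration order, with no per-resource teardown logic.

```python
import contextlib

class Resource:
    """Stand-in for a long-lived resource such as a database connection."""
    closed_order = []

    def __init__(self, name):
        self.name = name

    def close(self):
        Resource.closed_order.append(self.name)

class MyDoFn:
    def setup(self):
        # Register every resource on one ExitStack (the Closer analogue).
        self._stack = contextlib.ExitStack()
        self.db = self._stack.enter_context(contextlib.closing(Resource("db")))
        self.cache = self._stack.enter_context(contextlib.closing(Resource("cache")))

    def teardown(self):
        # The only teardown logic needed: unwind everything registered,
        # even if some close() calls raise.
        self._stack.close()

fn = MyDoFn()
fn.setup()
fn.teardown()
assert Resource.closed_order == ["cache", "db"]  # reverse registration order
```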
[GitHub] incubator-beam pull request #1048: Converts KafkaIO builders to @AutoValue
GitHub user jkff opened a pull request: https://github.com/apache/incubator-beam/pull/1048

Converts KafkaIO builders to @AutoValue

This is in the same spirit as https://github.com/apache/incubator-beam/pull/1031, https://github.com/jbonofre/incubator-beam/pull/1 and https://github.com/apache/incubator-beam/pull/1033. The semantics are unchanged AFAICT. The only user-visible change is that TypedRead and TypedWrite no longer exist (they were unnecessary in the first place) - see the trivial changes in the tests.

R: @rangadi

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jkff/incubator-beam kafka-autovalue

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/1048.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #1048

commit ec3fdd030da1cc2ed2fdd0d16fad0396ef1855a9
Author: Eugene Kirpichov
Date: 2016-09-29T22:30:10Z

    Converts KafkaIO to AutoValue
[GitHub] incubator-beam pull request #1047: Little fixes to LatestFnTest
GitHub user kennknowles opened a pull request: https://github.com/apache/incubator-beam/pull/1047

Little fixes to LatestFnTest

Be sure to do all of the following to help us incorporate your contribution quickly and easily:

- [x] Make sure the PR title is formatted like: `[BEAM-<Jira issue #>] Description of pull request`
- [x] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes.)
- [x] Replace `<Jira issue #>` in the title with the actual Jira issue number, if there is one.
- [x] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt).

---

R: @swegner and any committer

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kennknowles/incubator-beam latest-fn-test

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/1047.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #1047

commit 8ae5861036f065e1240e319b7fe2aefe57376f21
Author: Kenneth Knowles
Date: 2016-10-04T20:23:17Z

    De-pluralize error message expectation in LatestFnTests

commit 9e4ba19082ee2d2f7276013e27cf9a5567436a98
Author: Kenneth Knowles
Date: 2016-10-04T20:23:37Z

    Move LatestFnTests to LatestFnTest
[jira] [Closed] (BEAM-545) Pipelines and their executions naming changes
[ https://issues.apache.org/jira/browse/BEAM-545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pei He closed BEAM-545.
------------------------
Resolution: Fixed
Fix Version/s: 0.3.0-incubating

> Pipelines and their executions naming changes
> ---------------------------------------------
>
> Key: BEAM-545
> URL: https://issues.apache.org/jira/browse/BEAM-545
> Project: Beam
> Issue Type: Improvement
> Components: sdk-java-core
> Reporter: Pei He
> Assignee: Pei He
> Priority: Minor
> Fix For: 0.3.0-incubating
>
> The purpose of the changes is to clarify the differences between the two, have consensus between runners, and unify the implementation.
> Current state:
> * PipelineOptions.appName defaults to mainClass name
> * DataflowPipelineOptions.jobName defaults to appName+user+datetime
> * FlinkPipelineOptions.jobName defaults to appName+user+datetime
> Proposal:
> 1. Replace PipelineOptions.appName with PipelineOptions.pipelineName.
> * It is the user-visible name for a specific graph.
> * Defaults to mainClass name.
> * Use cases: find all executions of a pipeline.
> 2. Add jobName to top-level PipelineOptions.
> * It is the unique name for an execution.
> * Defaults to pipelineName + user + datetime + random integer.
> * Use cases:
> -- Finding all executions by USER_A between TIME_X and TIME_Y
> -- Naming resources created by the execution, for example:
[jira] [Updated] (BEAM-545) Pipelines and their executions naming changes
[ https://issues.apache.org/jira/browse/BEAM-545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pei He updated BEAM-545:
-------------------------
Component/s: sdk-java-core

> Pipelines and their executions naming changes
> ---------------------------------------------
>
> Key: BEAM-545
> URL: https://issues.apache.org/jira/browse/BEAM-545
> Project: Beam
> Issue Type: Improvement
> Components: sdk-java-core
> Reporter: Pei He
> Assignee: Pei He
> Priority: Minor
>
> The purpose of the changes is to clarify the differences between the two, have consensus between runners, and unify the implementation.
> Current state:
> * PipelineOptions.appName defaults to mainClass name
> * DataflowPipelineOptions.jobName defaults to appName+user+datetime
> * FlinkPipelineOptions.jobName defaults to appName+user+datetime
> Proposal:
> 1. Replace PipelineOptions.appName with PipelineOptions.pipelineName.
> * It is the user-visible name for a specific graph.
> * Defaults to mainClass name.
> * Use cases: find all executions of a pipeline.
> 2. Add jobName to top-level PipelineOptions.
> * It is the unique name for an execution.
> * Defaults to pipelineName + user + datetime + random integer.
> * Use cases:
> -- Finding all executions by USER_A between TIME_X and TIME_Y
> -- Naming resources created by the execution, for example:
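The proposed jobName default (pipelineName + user + datetime + random integer) can be sketched as follows. The ticket only names the components; the separators, field formats, and character normalization below are assumptions for illustration.

```python
import os
import random
import re
from datetime import datetime

def default_job_name(pipeline_name, now=None, rng=random):
    """Sketch of the proposed default: pipelineName + user + datetime + random int."""
    now = now or datetime.utcnow()
    user = os.environ.get("USER", "nobody")  # assumption: user from environment
    name = "%s-%s-%s-%04d" % (pipeline_name, user,
                              now.strftime("%m%d%H%M%S"),
                              rng.randint(0, 9999))
    # Many services restrict job-name characters; normalize to [a-z0-9-].
    return re.sub(r"[^a-z0-9-]", "0", name.lower())

assert default_job_name("WordCount").startswith("wordcount-")
```

The trailing random integer addresses the collision problem tracked separately in BEAM-483: two jobs submitted by the same user in the same second still get distinct names.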
[jira] [Issue Comment Deleted] (BEAM-696) Side-Inputs non-deterministic with merging main-input windows
[ https://issues.apache.org/jira/browse/BEAM-696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amit Sela updated BEAM-696:
----------------------------
Comment: was deleted

(was: So it runs once, on the merged window? That happens at the bundle level, correct? Do bundles always behave the same in terms of the #elements they hold on to? If not, and a side input is called on the merged windows, won't the sideInput value be affected by things like network or something else that may affect the bundle's contents? And if the bundle always holds just one element, then merging never happens, correct?

I find this a bit confusing; I think the problem here has to do with the fact that applying a CombineFn with SideInputs on any PCollection is problematic. Sessions seem to be handled as any other BoundedWindow for that matter, but they are not.

BTW, isn't this: https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/windowing/Sessions.java#L84 saying that sideInputs are not allowed on Sessions? Fact is that they are allowed, but what does that mean?)

> Side-Inputs non-deterministic with merging main-input windows
> -------------------------------------------------------------
>
> Key: BEAM-696
> URL: https://issues.apache.org/jira/browse/BEAM-696
> Project: Beam
> Issue Type: Bug
> Components: beam-model
> Reporter: Ben Chambers
> Assignee: Pei He
>
> Side-Inputs are non-deterministic for several reasons:
> 1. Because they depend on triggering of the side-input (this is acceptable because triggers are by their nature non-deterministic).
> 2. They depend on the current state of the main-input window in order to look up the side-input. This means that with merging
> 3. Any runner optimizations that affect when the side-input is looked up may cause problems with either or both of these.
> This issue focuses on #2 -- the non-determinism of side-inputs that execute within a merging WindowFn.
> A possible solution would be to defer running anything that looks up the side-input until we need to extract an output, and use the main window at that point. Specifically, if the main window is a MergingWindowFn, don't execute any kind of pre-combine; instead buffer all the inputs and combine later.
> This could still run into some non-determinism if there are triggers controlling when we extract output.
[jira] [Closed] (BEAM-433) Make Beam examples runners agnostic
[ https://issues.apache.org/jira/browse/BEAM-433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pei He closed BEAM-433.
------------------------
Resolution: Fixed
Fix Version/s: 0.3.0-incubating

> Make Beam examples runners agnostic
> -----------------------------------
>
> Key: BEAM-433
> URL: https://issues.apache.org/jira/browse/BEAM-433
> Project: Beam
> Issue Type: Improvement
> Components: examples-java
> Reporter: Pei He
> Assignee: Pei He
> Fix For: 0.3.0-incubating
>
> Beam examples are ported from Dataflow, and they heavily reference Dataflow classes.
> There are the following cleanup tasks:
> 1. Remove Dataflow streaming and batch injector setup (Done).
> 2. Remove references to DataflowPipelineOptions.
> 3. Move cancel() from DataflowPipelineJob to PipelineResult.
[jira] [Commented] (BEAM-696) Side-Inputs non-deterministic with merging main-input windows
[ https://issues.apache.org/jira/browse/BEAM-696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546539#comment-15546539 ]

Amit Sela commented on BEAM-696:
---------------------------------

So it runs once, on the merged window? That happens at the bundle level, correct? Do bundles always behave the same in terms of the #elements they hold on to? If not, and a side input is called on the merged windows, won't the sideInput value be affected by things like network or something else that may affect the bundle's contents? And if the bundle always holds just one element, then merging never happens, correct?

I find this a bit confusing; I think the problem here has to do with the fact that applying a CombineFn with SideInputs on any PCollection is problematic. Sessions seem to be handled as any other BoundedWindow for that matter, but they are not.

BTW, isn't this: https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/windowing/Sessions.java#L84 saying that sideInputs are not allowed on Sessions? Fact is that they are allowed, but what does that mean?

> Side-Inputs non-deterministic with merging main-input windows
> -------------------------------------------------------------
>
> Key: BEAM-696
> URL: https://issues.apache.org/jira/browse/BEAM-696
> Project: Beam
> Issue Type: Bug
> Components: beam-model
> Reporter: Ben Chambers
> Assignee: Pei He
>
> Side-Inputs are non-deterministic for several reasons:
> 1. Because they depend on triggering of the side-input (this is acceptable because triggers are by their nature non-deterministic).
> 2. They depend on the current state of the main-input window in order to look up the side-input. This means that with merging
> 3. Any runner optimizations that affect when the side-input is looked up may cause problems with either or both of these.
> This issue focuses on #2 -- the non-determinism of side-inputs that execute within a merging WindowFn.
> A possible solution would be to defer running anything that looks up the side-input until we need to extract an output, and use the main window at that point. Specifically, if the main window is a MergingWindowFn, don't execute any kind of pre-combine; instead buffer all the inputs and combine later.
> This could still run into some non-determinism if there are triggers controlling when we extract output.
[jira] [Commented] (BEAM-696) Side-Inputs non-deterministic with merging main-input windows
[ https://issues.apache.org/jira/browse/BEAM-696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546538#comment-15546538 ] Amit Sela commented on BEAM-696: So it runs once, on the merged window ? That happens in the bundle level, correct ? Do bundles always behave the same ? in terms of #elements they hold on to ? If not, and a side input is called on the merged windows, won't the sideInput value be affected from things like network or something else that may affect the bundle's contents ? and if the bundle always holds just one element, then merging never happens, correct ? I find this a bit confusing, I think the problem here has to do with the fact that applying a CombineFn with SideInputs on any PCollection is problematic. Sessions seem to be handled as any other BoundedWindow for that matter, but they are not.. BTW, isn't this: https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/windowing/Sessions.java#L84 saying that sideInputs are not allowed on Sessions ? Fact is that they are allowed, but what does that mean ? > Side-Inputs non-deterministic with merging main-input windows > - > > Key: BEAM-696 > URL: https://issues.apache.org/jira/browse/BEAM-696 > Project: Beam > Issue Type: Bug > Components: beam-model >Reporter: Ben Chambers >Assignee: Pei He > > Side-Inputs are non-deterministic for several reasons: > 1. Because they depend on triggering of the side-input (this is acceptable > because triggers are by their nature non-deterministic). > 2. They depend on the current state of the main-input window in order to > lookup the side-input. This means that with merging > 3. Any runner optimizations that affect when the side-input is looked up may > cause problems with either or both of these. > This issue focuses on #2 -- the non-determinism of side-inputs that execute > within a Merging WindowFn. 
> Possible solution would be to defer running anything that looks up the > side-input until we need to extract an output, and using the main-window at > that point. Specifically, if the main-window is a MergingWindowFn, don't > execute any kind of pre-combine, instead buffer all the inputs and combine > later. > This could still run into some non-determinism if there are triggers > controlling when we extract output. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
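The deferral proposed above can be sketched in plain Python (hypothetical names, not the Beam API): elements are only buffered until output extraction, windows can keep merging freely, and the side input is read exactly once, against the final merged window.

```python
# Sketch of the proposed "defer the combine" approach (not Beam code):
# no pre-combine happens while windows may still merge, so the side-input
# lookup is deterministic with respect to the merged main-input window.

class DeferredCombine:
    def __init__(self, combine_fn, side_input_fn):
        self._combine_fn = combine_fn        # e.g. lambda xs, side: sum(xs) + side
        self._side_input_fn = side_input_fn  # window -> side-input value
        self._buffers = {}                   # window -> buffered elements

    def add(self, window, element):
        # No eager combining: just buffer, so merging stays cheap and safe.
        self._buffers.setdefault(window, []).append(element)

    def merge(self, windows, merged_window):
        # Merging only concatenates buffers; nothing has been combined yet.
        merged = []
        for w in windows:
            merged.extend(self._buffers.pop(w, []))
        self._buffers.setdefault(merged_window, []).extend(merged)

    def extract(self, window):
        # The side input is looked up exactly once, at output time,
        # against the fully merged main-input window.
        side = self._side_input_fn(window)
        return self._combine_fn(self._buffers.pop(window, []), side)
```

As the issue notes, triggers controlling extraction can still introduce non-determinism; this only removes the dependence on intermediate merge states.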
[jira] [Closed] (BEAM-483) Generated job names easily collide
[ https://issues.apache.org/jira/browse/BEAM-483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pei He closed BEAM-483. --- Resolution: Fixed Fix Version/s: 0.3.0-incubating > Generated job names easily collide > -- > > Key: BEAM-483 > URL: https://issues.apache.org/jira/browse/BEAM-483 > Project: Beam > Issue Type: Bug > Components: runner-dataflow >Reporter: Pei He >Assignee: Pei He >Priority: Minor > Fix For: 0.3.0-incubating > > > The current job name generation scheme may easily lead to duplicate job names > and cause DataflowJobAlreadyExistsException, especially when a series of jobs > are submitted at the same time. (e.g., from a single script). > It would be better to just add a random suffix like "-3275" or "-x1bh". -- This message was sent by Atlassian JIRA (v6.3.4#6332)
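The fix described in BEAM-483 amounts to something like the following sketch (function name and suffix length are assumptions for illustration, not the actual Dataflow runner code):

```python
import random
import string

def unique_job_name(base_name):
    """Append a short random suffix so jobs submitted back-to-back
    (e.g. from a single script) don't collide on the service with
    DataflowJobAlreadyExistsException."""
    alphabet = string.ascii_lowercase + string.digits
    suffix = ''.join(random.choice(alphabet) for _ in range(4))
    return '%s-%s' % (base_name, suffix)
```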
[jira] [Resolved] (BEAM-554) Dataflow runner to support bounded writes in streaming mode.
[ https://issues.apache.org/jira/browse/BEAM-554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pei He resolved BEAM-554. - Resolution: Fixed Fix Version/s: 0.3.0-incubating > Dataflow runner to support bounded writes in streaming mode. > > > Key: BEAM-554 > URL: https://issues.apache.org/jira/browse/BEAM-554 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow >Reporter: Pei He >Assignee: Pei He > Fix For: 0.3.0-incubating > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (BEAM-572) Remove Spark references in WordCount
[ https://issues.apache.org/jira/browse/BEAM-572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pei He resolved BEAM-572. - Resolution: Fixed Fix Version/s: 0.3.0-incubating > Remove Spark references in WordCount > > > Key: BEAM-572 > URL: https://issues.apache.org/jira/browse/BEAM-572 > Project: Beam > Issue Type: Bug > Components: examples-java >Reporter: Pei He >Assignee: Mark Liu > Fix For: 0.3.0-incubating > > > Examples should be runner agnostics. > We don't want to have Spark references in > https://github.com/apache/incubator-beam/blob/master/examples/java/src/main/java/org/apache/beam/examples/WordCount.java -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] incubator-beam pull request #1046: Add equality methods to range source.
GitHub user robertwb opened a pull request: https://github.com/apache/incubator-beam/pull/1046 Add equality methods to range source. Be sure to do all of the following to help us incorporate your contribution quickly and easily: - [ ] Make sure the PR title is formatted like: `[BEAM-<Jira issue #>] Description of pull request` - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes). - [ ] Replace `<Jira issue #>` in the title with the actual Jira issue number, if there is one. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt). --- You can merge this pull request into a Git repository by running: $ git pull https://github.com/robertwb/incubator-beam patch-2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/1046.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1046 commit 602f269bc9b6c198d9c79a006cca70528e696153 Author: Robert Bradshaw Date: 2016-10-04T20:12:08Z Add equality methods to range source. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
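The diff itself is not included in this email; the kind of change the PR title describes — value-based equality for a range source — looks roughly like this sketch (class and field names here are hypothetical, not the actual SDK code):

```python
class OffsetRangeSource:
    """Hypothetical stand-in for a range source, illustrating the equality
    methods such a PR adds: sources should compare by value so that tests
    and split-deduplication work as expected."""

    def __init__(self, start, stop, split_size=None):
        self.start = start
        self.stop = stop
        self.split_size = split_size

    def __eq__(self, other):
        return (isinstance(other, OffsetRangeSource) and
                (self.start, self.stop, self.split_size) ==
                (other.start, other.stop, other.split_size))

    def __ne__(self, other):
        # Python 2 does not derive __ne__ from __eq__, so define it explicitly.
        return not self == other

    def __hash__(self):
        # Required alongside __eq__ so instances stay usable in sets/dicts.
        return hash((self.start, self.stop, self.split_size))
```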
[jira] [Commented] (BEAM-586) CombineFns with no inputs should produce no outputs when used as a main input
[ https://issues.apache.org/jira/browse/BEAM-586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15546500#comment-15546500 ] Pei He commented on BEAM-586: - Is it fair to say that triggering defines whether and when CombineFns output, and CombineFns define what to output given a collection of inputs (which could be empty)? If this is what we want to do (I am not sure if it is right), possible options are: 1. triggering doesn't output an empty pane, or 2. CombineFns output nothing for an empty input collection. I would prefer not to differentiate between main input and side input. > CombineFns with no inputs should produce no outputs when used as a main input > - > > Key: BEAM-586 > URL: https://issues.apache.org/jira/browse/BEAM-586 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Reporter: Thomas Groh > > TestGloballyEmptyCollection seems to be violating this assumption > https://github.com/apache/incubator-beam/pull/862/files#diff-a305819710e8d79d2b045d6416184f21R65 > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
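Option 2 from the comment — CombineFns output nothing for an empty input collection — can be sketched as:

```python
def combine_pane(combine_fn, inputs):
    """Sketch of option 2: a trigger firing over an empty input collection
    produces no output pane at all, instead of a pane holding the
    combiner's default value (e.g. 0 for sum)."""
    if not inputs:
        return []                 # no inputs -> no output pane
    return [combine_fn(inputs)]   # otherwise one pane with the combined value
```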
[jira] [Commented] (BEAM-561) Add WindowedWordCountIT
[ https://issues.apache.org/jira/browse/BEAM-561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15546489#comment-15546489 ] ASF GitHub Bot commented on BEAM-561: - GitHub user markflyhigh opened a pull request: https://github.com/apache/incubator-beam/pull/1045 [BEAM-561] Add Streaming IT to Jenkins Pre-commit Be sure to do all of the following to help us incorporate your contribution quickly and easily: - [x] Make sure the PR title is formatted like: `[BEAM-<Jira issue #>] Description of pull request` - [x] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes). - [x] Replace `<Jira issue #>` in the title with the actual Jira issue number, if there is one. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt). --- Jenkins can add this profile to precommit, and WindowedWordCountIT will run in both batch and streaming mode in each build. This change increases the coverage of streaming pipelines as well as BigQuery. Currently, only DirectRunner and TestDataflowRunner are included. TestSparkRunner and TestFlinkRunner will be added later once they fully support streaming integration tests. Testing is done by running the following command with different configs: ``` mvn clean verify -P jenkins-precommit-streaming,include-runners ``` TODO: the Jenkins pre-commit command needs to be updated to add the "jenkins-precommit-streaming" profile.
You can merge this pull request into a Git repository by running: $ git pull https://github.com/markflyhigh/incubator-beam jenkins-precommit-streaming Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/1045.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1045 commit 3cf06363355d9d35bf3834ef97c4b0ee6b2f97a5 Author: Mark Liu Date: 2016-10-04T19:43:46Z [BEAM-561] Add Streaming IT in Jenkins Pre-commit > Add WindowedWordCountIT > --- > > Key: BEAM-561 > URL: https://issues.apache.org/jira/browse/BEAM-561 > Project: Beam > Issue Type: Bug >Reporter: Jason Kuster >Assignee: Mark Liu > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] incubator-beam pull request #1045: [BEAM-561] Add Streaming IT to Jenkins Pr...
GitHub user markflyhigh opened a pull request: https://github.com/apache/incubator-beam/pull/1045 [BEAM-561] Add Streaming IT to Jenkins Pre-commit Be sure to do all of the following to help us incorporate your contribution quickly and easily: - [x] Make sure the PR title is formatted like: `[BEAM-<Jira issue #>] Description of pull request` - [x] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes). - [x] Replace `<Jira issue #>` in the title with the actual Jira issue number, if there is one. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt). --- Jenkins can add this profile to precommit, and WindowedWordCountIT will run in both batch and streaming mode in each build. This change increases the coverage of streaming pipelines as well as BigQuery. Currently, only DirectRunner and TestDataflowRunner are included. TestSparkRunner and TestFlinkRunner will be added later once they fully support streaming integration tests. Testing is done by running the following command with different configs: ``` mvn clean verify -P jenkins-precommit-streaming,include-runners ``` TODO: the Jenkins pre-commit command needs to be updated to add the "jenkins-precommit-streaming" profile. You can merge this pull request into a Git repository by running: $ git pull https://github.com/markflyhigh/incubator-beam jenkins-precommit-streaming Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/1045.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1045 commit 3cf06363355d9d35bf3834ef97c4b0ee6b2f97a5 Author: Mark Liu Date: 2016-10-04T19:43:46Z [BEAM-561] Add Streaming IT in Jenkins Pre-commit --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well.
If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (BEAM-614) File compression/decompression should support auto detection
[ https://issues.apache.org/jira/browse/BEAM-614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546481#comment-15546481 ] ASF GitHub Bot commented on BEAM-614: - Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/1002 > File compression/decompression should support auto detection > > > Key: BEAM-614 > URL: https://issues.apache.org/jira/browse/BEAM-614 > Project: Beam > Issue Type: Bug >Reporter: Slaven Bilac > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] incubator-beam pull request #1002: [BEAM-614] Updates FileBasedSource to sup...
Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/1002 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[2/2] incubator-beam git commit: Closes #1002
Closes #1002 Project: http://git-wip-us.apache.org/repos/asf/incubator-beam/repo Commit: http://git-wip-us.apache.org/repos/asf/incubator-beam/commit/3a69db0c Tree: http://git-wip-us.apache.org/repos/asf/incubator-beam/tree/3a69db0c Diff: http://git-wip-us.apache.org/repos/asf/incubator-beam/diff/3a69db0c Branch: refs/heads/python-sdk Commit: 3a69db0c5bb9321c2082831b5c00d778ddf1b1d7 Parents: 731a771 2126a34 Author: Robert Bradshaw Authored: Tue Oct 4 13:00:21 2016 -0700 Committer: Robert Bradshaw Committed: Tue Oct 4 13:00:21 2016 -0700 -- sdks/python/apache_beam/io/filebasedsource.py | 44 .../apache_beam/io/filebasedsource_test.py | 104 +-- 2 files changed, 121 insertions(+), 27 deletions(-) --
[1/2] incubator-beam git commit: Updates filebasedsource to support CompressionType.AUTO.
Repository: incubator-beam Updated Branches: refs/heads/python-sdk 731a77152 -> 3a69db0c5 Updates filebasedsource to support CompressionType.AUTO. Project: http://git-wip-us.apache.org/repos/asf/incubator-beam/repo Commit: http://git-wip-us.apache.org/repos/asf/incubator-beam/commit/2126a34c Tree: http://git-wip-us.apache.org/repos/asf/incubator-beam/tree/2126a34c Diff: http://git-wip-us.apache.org/repos/asf/incubator-beam/diff/2126a34c Branch: refs/heads/python-sdk Commit: 2126a34c06d74c5ad44fbec8dd4e278f99ed473a Parents: 731a771 Author: chamik...@google.comAuthored: Sun Sep 25 21:44:34 2016 -0700 Committer: chamik...@google.com Committed: Mon Oct 3 10:20:46 2016 -0700 -- sdks/python/apache_beam/io/filebasedsource.py | 44 .../apache_beam/io/filebasedsource_test.py | 104 +-- 2 files changed, 121 insertions(+), 27 deletions(-) -- http://git-wip-us.apache.org/repos/asf/incubator-beam/blob/2126a34c/sdks/python/apache_beam/io/filebasedsource.py -- diff --git a/sdks/python/apache_beam/io/filebasedsource.py b/sdks/python/apache_beam/io/filebasedsource.py index 8ff69ca..e067833 100644 --- a/sdks/python/apache_beam/io/filebasedsource.py +++ b/sdks/python/apache_beam/io/filebasedsource.py @@ -42,8 +42,7 @@ class FileBasedSource(iobase.BoundedSource): def __init__(self, file_pattern, min_bundle_size=0, - # TODO(BEAM-614) - compression_type=fileio.CompressionTypes.UNCOMPRESSED, + compression_type=fileio.CompressionTypes.AUTO, splittable=True): """Initializes ``FileBasedSource``. @@ -72,13 +71,6 @@ class FileBasedSource(iobase.BoundedSource): '%s: file_pattern must be a string; got %r instead' % (self.__class__.__name__, file_pattern)) -if compression_type == fileio.CompressionTypes.AUTO: - raise ValueError('FileBasedSource currently does not support ' - 'CompressionTypes.AUTO. 
Please explicitly specify the ' - 'compression type or use ' - 'CompressionTypes.UNCOMPRESSED if file is ' - 'uncompressed.') - self._pattern = file_pattern self._concat_source = None self._min_bundle_size = min_bundle_size @@ -86,11 +78,12 @@ class FileBasedSource(iobase.BoundedSource): raise TypeError('compression_type must be CompressionType object but ' 'was %s' % type(compression_type)) self._compression_type = compression_type -if compression_type != fileio.CompressionTypes.UNCOMPRESSED: +if compression_type in (fileio.CompressionTypes.UNCOMPRESSED, +fileio.CompressionTypes.AUTO): + self._splittable = splittable +else: # We can't split compressed files efficiently so turn off splitting. self._splittable = False -else: - self._splittable = splittable def _get_concat_source(self): if self._concat_source is None: @@ -102,11 +95,21 @@ class FileBasedSource(iobase.BoundedSource): if sizes[index] == 0: continue # Ignoring empty file. +# We determine splittability of this specific file. +splittable = self.splittable +if (splittable and +self._compression_type == fileio.CompressionTypes.AUTO): + compression_type = fileio.CompressionTypes.detect_compression_type( + file_name) + if compression_type != fileio.CompressionTypes.UNCOMPRESSED: +splittable = False + single_file_source = _SingleFileSource( self, file_name, 0, sizes[index], -min_bundle_size=self._min_bundle_size) +min_bundle_size=self._min_bundle_size, +splittable=splittable) single_file_sources.append(single_file_source) self._concat_source = concat_source.ConcatSource(single_file_sources) return self._concat_source @@ -173,7 +176,7 @@ class _SingleFileSource(iobase.BoundedSource): """Denotes a source for a specific file type.""" def __init__(self, file_based_source, file_name, start_offset, stop_offset, - min_bundle_size=0): + min_bundle_size=0, splittable=True): if not isinstance(start_offset, (int, long)): raise TypeError( 'start_offset must be a number. 
Received: %r' % start_offset) @@ -193,6 +196,7 @@ class _SingleFileSource(iobase.BoundedSource): self._stop_offset = stop_offset self._min_bundle_size = min_bundle_size self._file_based_source = file_based_source +self._splittable = splittable def split(self, desired_bundle_size, start_offset=None,
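The AUTO mode added by the patch above hinges on `fileio.CompressionTypes.detect_compression_type` to decide per file whether it is splittable. A minimal extension-based sketch of what such detection could look like (the actual fileio implementation may differ):

```python
class CompressionTypes:
    """Hypothetical mirror of fileio.CompressionTypes, for illustration only."""
    AUTO = 'auto'
    BZIP2 = 'bzip2'
    GZIP = 'gzip'
    UNCOMPRESSED = 'uncompressed'

    # Assumed mapping from file extension to codec.
    _EXTENSIONS = {'.bz2': BZIP2, '.gz': GZIP}

    @classmethod
    def detect_compression_type(cls, file_name):
        """Guess the codec from the file extension; unknown extensions are
        treated as uncompressed (and therefore remain splittable)."""
        for ext, codec in cls._EXTENSIONS.items():
            if file_name.endswith(ext):
                return codec
        return cls.UNCOMPRESSED
```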
[jira] [Commented] (BEAM-696) Side-Inputs non-deterministic with merging main-input windows
[ https://issues.apache.org/jira/browse/BEAM-696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15546473#comment-15546473 ] Kenneth Knowles commented on BEAM-696: -- Ben points out (in person) that this latter spec is already the spec, because Combine is not a model primitive transform: Combine.perKey expands to "GroupByKey and then combine the groups with ParDo over the iterable". So semantically, it occurs in one single call to {{processElement}}, over an iterable that is in a single main input window, and reads the side input at just one point in its evolution. > Side-Inputs non-deterministic with merging main-input windows > - > > Key: BEAM-696 > URL: https://issues.apache.org/jira/browse/BEAM-696 > Project: Beam > Issue Type: Bug > Components: beam-model >Reporter: Ben Chambers >Assignee: Pei He > > Side-Inputs are non-deterministic for several reasons: > 1. Because they depend on triggering of the side-input (this is acceptable > because triggers are by their nature non-deterministic). > 2. They depend on the current state of the main-input window in order to > lookup the side-input. This means that with merging windows, the window used > for the lookup can change as merging proceeds. > 3. Any runner optimizations that affect when the side-input is looked up may > cause problems with either or both of these. > This issue focuses on #2 -- the non-determinism of side-inputs that execute > within a Merging WindowFn. > Possible solution would be to defer running anything that looks up the > side-input until we need to extract an output, and using the main-window at > that point. Specifically, if the main-window is a MergingWindowFn, don't > execute any kind of pre-combine, instead buffer all the inputs and combine > later. > This could still run into some non-determinism if there are triggers > controlling when we extract output. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (BEAM-605) Create BigQuery Verifier
[ https://issues.apache.org/jira/browse/BEAM-605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pei He updated BEAM-605: Affects Version/s: Not applicable Component/s: testing > Create BigQuery Verifier > > > Key: BEAM-605 > URL: https://issues.apache.org/jira/browse/BEAM-605 > Project: Beam > Issue Type: Task > Components: testing >Affects Versions: Not applicable >Reporter: Mark Liu >Assignee: Mark Liu > > Create BigQuery verifier that is used to verify output of integration test > which is using BigQuery as output source. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
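One common shape for such a verifier is an order-independent checksum over the query results, compared against a golden value. The sketch below assumes dict-shaped rows and a SHA-1 scheme; neither is confirmed by the ticket:

```python
import hashlib

def rows_checksum(rows):
    """Order-independent checksum over query results: rows are canonicalized,
    sorted, and fed into a single digest, so result ordering doesn't matter."""
    digest = hashlib.sha1()
    for line in sorted(repr(sorted(row.items())) for row in rows):
        digest.update(line.encode('utf-8'))
    return digest.hexdigest()

def verify_output(rows, expected_checksum):
    """Compare actual BigQuery output against a precomputed golden checksum."""
    return rows_checksum(rows) == expected_checksum
```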
[jira] [Issue Comment Deleted] (BEAM-702) Simple pattern for per-bundle and per-DoFn Closeable resources
[ https://issues.apache.org/jira/browse/BEAM-702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pei He updated BEAM-702: Comment: was deleted (was: 1. If there are failures in one bundle, we can easily call teardown()s of DoFns associated with this bundle. (I think we had plan to do it, I am not sure it has been done or not.) The tricky part is that if there are failures in other PTransforms or in the environment while we are processing the bundles. Before runners fail the whole job, we want runners be able to call teardown() for all running ParDos and their bundles in processing. We don't have a plan for this part yet. 2. We should document and clarify whether finishBundle() will be called or not if the bundle fails.) > Simple pattern for per-bundle and per-DoFn Closeable resources > -- > > Key: BEAM-702 > URL: https://issues.apache.org/jira/browse/BEAM-702 > Project: Beam > Issue Type: Improvement >Reporter: Eugene Kirpichov > > Dealing with Closeable resources inside a processElement call is easy: simply > use try-with-resources. > However, bundle- or DoFn-scoped resources, such as long-lived database > connections, are less convenient to deal with: you have to open them in > startBundle and conditionally close in finishBundle (likewise > setup/teardown), taking special care if there's multiple resources to close > all of them. > Perhaps we should provide something like Guava's Closer to DoFn's > https://github.com/google/guava/wiki/ClosingResourcesExplained. Ideally, the > user would need to only write a startBundle() or setup() method, but not > write finishBundle() or teardown() - resources would be closed automatically. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-702) Simple pattern for per-bundle and per-DoFn Closeable resources
[ https://issues.apache.org/jira/browse/BEAM-702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15546412#comment-15546412 ] Pei He commented on BEAM-702: - 1. If there are failures in one bundle, we can easily call the teardown()s of the DoFns associated with this bundle. (I think we had a plan to do this; I am not sure whether it has been done yet.) The tricky part is when there are failures in other PTransforms or in the environment while we are processing the bundles. Before runners fail the whole job, we want them to be able to call teardown() for all running ParDos and the bundles they are processing. We don't have a plan for this part yet. 2. We should document and clarify whether finishBundle() will be called or not if the bundle fails. > Simple pattern for per-bundle and per-DoFn Closeable resources > -- > > Key: BEAM-702 > URL: https://issues.apache.org/jira/browse/BEAM-702 > Project: Beam > Issue Type: Improvement >Reporter: Eugene Kirpichov > > Dealing with Closeable resources inside a processElement call is easy: simply > use try-with-resources. > However, bundle- or DoFn-scoped resources, such as long-lived database > connections, are less convenient to deal with: you have to open them in > startBundle and conditionally close in finishBundle (likewise > setup/teardown), taking special care if there's multiple resources to close > all of them. > Perhaps we should provide something like Guava's Closer to DoFn's > https://github.com/google/guava/wiki/ClosingResourcesExplained. Ideally, the > user would need to only write a startBundle() or setup() method, but not > write finishBundle() or teardown() - resources would be closed automatically. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
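The Python standard-library analogue of Guava's Closer is contextlib.ExitStack; a sketch of the bundle-scoped pattern the ticket asks for (the DoFn shape here is hypothetical, not the Beam API):

```python
import contextlib

class BundleScopedDoFn(object):
    """Sketch: resources registered during a bundle are all closed together
    in finish_bundle, in reverse registration order, even if some closers
    raise -- the user never writes per-resource cleanup code."""

    def start_bundle(self):
        # One stack per bundle; everything registered on it is bundle-scoped.
        self._resources = contextlib.ExitStack()

    def open_resource(self, cm):
        # Register any context manager (DB connection, file, ...); ExitStack
        # guarantees its __exit__ runs when the stack is closed.
        return self._resources.enter_context(cm)

    def finish_bundle(self):
        # Closes all registered resources in LIFO order.
        self._resources.close()
```

A runner that also calls `finish_bundle` (or an equivalent teardown hook) on failure would get the "conditionally close everything" behavior the ticket describes for free.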
[jira] [Resolved] (BEAM-703) SingletonViewFn might exhaust defaultValue if it's serialized after being used.
[ https://issues.apache.org/jira/browse/BEAM-703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kenneth Knowles resolved BEAM-703. -- Resolution: Fixed Fix Version/s: 0.3.0-incubating > SingletonViewFn might exhaust defaultValue if it's serialized after being > used. > > > Key: BEAM-703 > URL: https://issues.apache.org/jira/browse/BEAM-703 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Reporter: Amit Sela >Assignee: Amit Sela >Priority: Minor > Fix For: 0.3.0-incubating > > > In > https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/util/PCollectionViews.java#L267 > the defaultValue is set to null to avoid decoding over and over, I assume. > If the defaultValue is accessed before the SingletonViewFn is serialized, > it will exhaust the encoded value (assigned with null) while losing the > transient decoded value. > It'd probably be best to simply check if defaultValue is null before > decoding, so that decode will still happen just once, but the encoded data is > not lost. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] incubator-beam pull request #1040: [BEAM-703] SingletonViewFn might exhaust ...
Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/1040 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (BEAM-703) SingletonViewFn might exhaust defaultValue if it's serialized after being used.
[ https://issues.apache.org/jira/browse/BEAM-703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15546394#comment-15546394 ] ASF GitHub Bot commented on BEAM-703: - Github user asfgit closed the pull request at: https://github.com/apache/incubator-beam/pull/1040 > SingletonViewFn might exhaust defaultValue if it's serialized after being > used. > > > Key: BEAM-703 > URL: https://issues.apache.org/jira/browse/BEAM-703 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Reporter: Amit Sela >Assignee: Amit Sela >Priority: Minor > > In > https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/util/PCollectionViews.java#L267 > the defaultValue is set to null to avoid decoding over and over, I assume. > If the defaultValue is accessed before the SingletonViewFn is serialized, > it will exhaust the encoded value (assigned with null) while losing the > transient decoded value. > It'd probably be best to simply check if defaultValue is null before > decoding, so that decode will still happen just once, but the encoded data is > not lost. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[1/2] incubator-beam git commit: Avoid losing the encoded defaultValue.
Repository: incubator-beam Updated Branches: refs/heads/master 8462acbcb -> 087dcef1e Avoid losing the encoded defaultValue. Project: http://git-wip-us.apache.org/repos/asf/incubator-beam/repo Commit: http://git-wip-us.apache.org/repos/asf/incubator-beam/commit/0bd97efc Tree: http://git-wip-us.apache.org/repos/asf/incubator-beam/tree/0bd97efc Diff: http://git-wip-us.apache.org/repos/asf/incubator-beam/diff/0bd97efc Branch: refs/heads/master Commit: 0bd97efc0d764af17cdd8abdf43bff33bb21be2b Parents: 8462acb Author: SelaAuthored: Tue Oct 4 18:12:38 2016 +0300 Committer: Sela Committed: Tue Oct 4 18:12:38 2016 +0300 -- .../src/main/java/org/apache/beam/sdk/util/PCollectionViews.java | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/incubator-beam/blob/0bd97efc/sdks/java/core/src/main/java/org/apache/beam/sdk/util/PCollectionViews.java -- diff --git a/sdks/java/core/src/main/java/org/apache/beam/sdk/util/PCollectionViews.java b/sdks/java/core/src/main/java/org/apache/beam/sdk/util/PCollectionViews.java index 14ae5c8..3b1fde9 100644 --- a/sdks/java/core/src/main/java/org/apache/beam/sdk/util/PCollectionViews.java +++ b/sdks/java/core/src/main/java/org/apache/beam/sdk/util/PCollectionViews.java @@ -261,10 +261,9 @@ public class PCollectionViews { } // Lazily decode the default value once synchronized (this) { -if (encodedDefaultValue != null) { +if (encodedDefaultValue != null && defaultValue == null) { try { defaultValue = CoderUtils.decodeFromByteArray(valueCoder, encodedDefaultValue); -encodedDefaultValue = null; } catch (IOException e) { throw new RuntimeException("Unexpected IOException: ", e); }
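The same fix, sketched in Python for clarity: decode the default value lazily and only once, but keep the encoded bytes so the view can still be re-serialized after first use (the original bug nulled out the encoded value after decoding; names here are illustrative, not the SDK's):

```python
class LazySingletonDefault(object):
    """Sketch of the PCollectionViews fix: cache the decoded default value
    without discarding the encoded bytes it came from."""

    def __init__(self, encoded_default, decoder):
        self._encoded_default = encoded_default  # kept for serialization
        self._decoder = decoder
        self._default = None                     # transient decoded value

    def default_value(self):
        # Decode at most once; the guard on self._default is what the
        # one-line Java fix added (previously encoded_default was set to
        # None here, breaking later serialization).
        if self._default is None and self._encoded_default is not None:
            self._default = self._decoder(self._encoded_default)
        return self._default
```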
[2/2] incubator-beam git commit: This closes #1040
This closes #1040 Project: http://git-wip-us.apache.org/repos/asf/incubator-beam/repo Commit: http://git-wip-us.apache.org/repos/asf/incubator-beam/commit/087dcef1 Tree: http://git-wip-us.apache.org/repos/asf/incubator-beam/tree/087dcef1 Diff: http://git-wip-us.apache.org/repos/asf/incubator-beam/diff/087dcef1 Branch: refs/heads/master Commit: 087dcef1efc6b7765e203f8aebfd3caf8c4a77de Parents: 8462acb 0bd97ef Author: Kenneth Knowles Authored: Tue Oct 4 12:27:22 2016 -0700 Committer: Kenneth Knowles Committed: Tue Oct 4 12:27:22 2016 -0700 -- .../src/main/java/org/apache/beam/sdk/util/PCollectionViews.java | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) --
[jira] [Commented] (BEAM-702) Simple pattern for per-bundle and per-DoFn Closeable resources
[ https://issues.apache.org/jira/browse/BEAM-702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546300#comment-15546300 ] Eugene Kirpichov commented on BEAM-702: --- Hmm, I didn't realize that we don't call finishBundle and teardown in case the bundle fails. But yes, this makes a lot of sense. I'm not sure whether I personally would prefer annotation-based closeables or explicit calls (e.g. c.addCloseable(createDBWriter)). Explicit calls would make it easier to open a dynamic set of resources (e.g. lazily open connections to different shards of a database depending on the data). However that could be encapsulated into a single Closeable object, making these styles equivalent. > Simple pattern for per-bundle and per-DoFn Closeable resources > -- > > Key: BEAM-702 > URL: https://issues.apache.org/jira/browse/BEAM-702 > Project: Beam > Issue Type: Improvement >Reporter: Eugene Kirpichov > > Dealing with Closeable resources inside a processElement call is easy: simply > use try-with-resources. > However, bundle- or DoFn-scoped resources, such as long-lived database > connections, are less convenient to deal with: you have to open them in > startBundle and conditionally close in finishBundle (likewise > setup/teardown), taking special care if there's multiple resources to close > all of them. > Perhaps we should provide something like Guava's Closer to DoFn's > https://github.com/google/guava/wiki/ClosingResourcesExplained. Ideally, the > user would need to only write a startBundle() or setup() method, but not > write finishBundle() or teardown() - resources would be closed automatically. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (BEAM-25) Add user-ready API for interacting with state
[ https://issues.apache.org/jira/browse/BEAM-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15546294#comment-15546294 ] ASF GitHub Bot commented on BEAM-25: GitHub user kennknowles opened a pull request: https://github.com/apache/incubator-beam/pull/1044 [BEAM-25] Refactor StateSpec out of StateTag Be sure to do all of the following to help us incorporate your contribution quickly and easily: - [x] Make sure the PR title is formatted like: `[BEAM-<Jira issue #>] Description of pull request` - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes). - [x] Replace `<Jira issue #>` in the title with the actual Jira issue number, if there is one. - [x] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt). --- R: @aljoscha AND @tgroh AND @bjchambers This is a rebase of #793 without much cleanup. The same commentary applies - I'm not totally happy with it but I want to ask for feedback.
You can merge this pull request into a Git repository by running: $ git pull https://github.com/kennknowles/incubator-beam StateSpec Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/1044.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1044 commit 1f2de1e1212cbdb5eb3c8cb7caf30b94da5e0b00 Author: Kenneth Knowles Date: 2016-08-05T03:50:28Z Create StateSpec parallel to StateTag commit 6fa680858719eaf13edcc08aa3f3260584876fb5 Author: Kenneth Knowles Date: 2016-08-05T04:48:48Z Make StateTag carry a StateSpec separately from its id > Add user-ready API for interacting with state > - > > Key: BEAM-25 > URL: https://issues.apache.org/jira/browse/BEAM-25 > Project: Beam > Issue Type: Sub-task > Components: sdk-java-core >Reporter: Kenneth Knowles >Assignee: Kenneth Knowles > Labels: State > > Our current state API is targeted at runner implementers, not pipeline > authors. As such it has many capabilities that are not necessary nor > desirable for simple use cases of stateful ParDo (such as dynamic state tag > creation). Implement a simple state intended for user access. > (Details of our current thoughts in forthcoming design doc)
[jira] [Comment Edited] (BEAM-702) Simple pattern for per-bundle and per-DoFn Closeable resources
[ https://issues.apache.org/jira/browse/BEAM-702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546277#comment-15546277 ] Pei He edited comment on BEAM-702 at 10/4/16 6:47 PM:
--
The feature request is for something like this?

    class MyDoFn extends DoFn {
      @Closable(Scope.BUNDLE)
      private DBWriter writer = null;

      @StartBundle
      public void startBundle(Context c) {
        writer = createDBWriter();
      }

      @ProcessElement
      public void processElement(ProcessContext c) {
        writer.write(...);
      }
    }

I think runners can recognize Closable resources and their scopes through annotations, and they can generate the code to execute close() after the resources go out of scope. But I think the real usefulness is more about "closed automatically" in both success and failure conditions. Closing the resources manually in finishBundle() or teardown() works if the pipeline doesn't fail; otherwise the resources might be left open. This is the problem we have in BigQueryIO.Read, where we start external BQ extract jobs: if the pipeline fails, there is no hook to cancel them. In summary, my 2 cents: +1 for introducing mechanisms to recognize Closable resources in the model, and it needs to work in both success and failure conditions.
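Pei's "closed automatically in both success and failure conditions" requirement can be sketched with try-with-resources at the bundle boundary. DBWriter and runBundle below are invented for illustration; this is not Beam or runner code.

```java
// Sketch: try-with-resources guarantees close() runs even when an element
// throws mid-bundle, which manual close in finishBundle() does not.
class DBWriter implements AutoCloseable {
    boolean closed = false;
    void write(String record) { /* flush to the database */ }
    @Override public void close() { closed = true; }
}

class BundleRunner {
    // Returns the writer so a caller can observe that it was closed.
    static DBWriter runBundle(Iterable<String> elements, boolean failMidway) {
        DBWriter writer = new DBWriter();
        try (DBWriter w = writer) {            // closed on success AND on exception
            for (String e : elements) {
                if (failMidway) {
                    throw new RuntimeException("bundle failed");
                }
                w.write(e);
            }
        } catch (RuntimeException e) {
            // A runner would retry the bundle; the writer is already closed here.
        }
        return writer;
    }
}
```

The point of the sketch is that the success path and the failure path reach close() through the same mechanism, so nothing is left open when a bundle fails.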
[jira] [Commented] (BEAM-696) Side-Inputs non-deterministic with merging main-input windows
[ https://issues.apache.org/jira/browse/BEAM-696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546263#comment-15546263 ] Kenneth Knowles commented on BEAM-696: -- The SDK needs to lay out a spec, which I think is what Pei is saying, and the runner comes up with the execution plan. To clarify - are you suggesting that {{CombineFn}} should only be allowed side input access in {{extractOutput}}, or are you suggesting that runners be required to wait until {{extractOutput}} _will_ be called before running a sequence of {{addInput}}* {{mergeAccum}}* {{extractOutput}} that accesses side inputs? The latter sounds like it could be loosened to "give a consistent view of a side input to the sequence of {{addInput}}* {{mergeAccum}}* {{extractOutput}}" and your proposed execution plan is one obvious choice of how to achieve it. > Side-Inputs non-deterministic with merging main-input windows > - > > Key: BEAM-696 > URL: https://issues.apache.org/jira/browse/BEAM-696 > Project: Beam > Issue Type: Bug > Components: beam-model >Reporter: Ben Chambers >Assignee: Pei He > > Side-Inputs are non-deterministic for several reasons: > 1. Because they depend on triggering of the side-input (this is acceptable > because triggers are by their nature non-deterministic). > 2. They depend on the current state of the main-input window in order to > lookup the side-input. This means that with merging > 3. Any runner optimizations that affect when the side-input is looked up may > cause problems with either or both of these. > This issue focuses on #2 -- the non-determinism of side-inputs that execute > within a Merging WindowFn. > Possible solution would be to defer running anything that looks up the > side-input until we need to extract an output, and using the main-window at > that point. Specifically, if the main-window is a MergingWindowFn, don't > execute any kind of pre-combine, instead buffer all the inputs and combine > later.
> This could still run into some non-determinism if there are triggers controlling when we extract output.
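The deferral plan in the issue description - buffer inputs under a merging WindowFn and do the side-input read only at output time - can be sketched as follows. DeferredSumCombine and the integer "side input" are invented for illustration; a real combiner would implement Beam's CombineFn interface.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of "don't pre-combine under a merging WindowFn": buffer all inputs,
// merge by concatenating buffers, and read the side input exactly once in
// extractOutput, against the final merged window.
class DeferredSumCombine {
    private final List<Integer> buffered = new ArrayList<>();

    void addInput(int value) {
        buffered.add(value);              // no side-input lookup here
    }

    void mergeWith(DeferredSumCombine other) {
        buffered.addAll(other.buffered);  // window merging just concatenates buffers
    }

    int extractOutput(int sideInputOffset) {
        // The single side-input read happens here, so every buffered element
        // sees the same, consistent side-input value.
        int sum = 0;
        for (int v : buffered) {
            sum += v + sideInputOffset;
        }
        return sum;
    }
}
```

This is the weaker guarantee Kenneth describes: all of addInput/merge/extractOutput observe one consistent side-input view, because only extractOutput ever reads it.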
[jira] [Commented] (BEAM-703) SingletonViewFn might exhaust defaultValue if it's serialized after being used.
[ https://issues.apache.org/jira/browse/BEAM-703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546196#comment-15546196 ] Kenneth Knowles commented on BEAM-703: -- I think that might have been an attempt to save memory by freeing up the encoded bytes. I agree with you that it is a bug. > SingletonViewFn might exhaust defaultValue if it's serialized after being > used. > > > Key: BEAM-703 > URL: https://issues.apache.org/jira/browse/BEAM-703 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Reporter: Amit Sela >Assignee: Amit Sela >Priority: Minor > > In > https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/util/PCollectionViews.java#L267 > the defaultValue is set to null to avoid decoding over and over I assume. > If the defaultValue is accessed before the SingletonViewFn is serialized, > it will exhaust the encoded value (assigned with null) while losing the > transient decoded value. > It'd probably be best to simply check if defaultValue is null before > decoding, so that decode will still happen just once, but the encoded data is > not lost.
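Both the bug and the proposed fix hinge on a lazy-decode pattern. Here is a minimal sketch (class and field names invented, not the actual PCollectionViews code): nulling the encoded bytes after the first decode loses the default across a serialize/deserialize round trip, because the decoded copy is transient; checking the transient field instead still decodes only once but keeps the bytes.

```java
import java.io.Serializable;
import java.nio.charset.StandardCharsets;

// Minimal sketch of the lazy-decode pattern discussed in BEAM-703
// (names invented; not the actual PCollectionViews code).
class SingletonViewFnSketch implements Serializable {
    private byte[] encodedDefault;             // survives Java serialization
    private transient String decodedDefault;   // lost on serialization

    SingletonViewFnSketch(String defaultValue) {
        this.encodedDefault = defaultValue.getBytes(StandardCharsets.UTF_8);
    }

    String getDefaultValue() {
        if (decodedDefault == null) {          // decode at most once per instance
            decodedDefault = new String(encodedDefault, StandardCharsets.UTF_8);
            // Buggy variant: encodedDefault = null;  -- after a serialize/
            // deserialize round trip the transient decodedDefault is gone and
            // the default would then be unrecoverable.
        }
        return decodedDefault;
    }
}
```

Keeping encodedDefault costs one extra copy of the encoded bytes per instance, which is the memory trade-off Kenneth alludes to.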
[GitHub] incubator-beam pull request #1043: Datastore Connector prototype using googl...
GitHub user vikkyrk opened a pull request: https://github.com/apache/incubator-beam/pull/1043 Datastore Connector prototype using google-cloud python client Be sure to do all of the following to help us incorporate your contribution quickly and easily: - [ ] Make sure the PR title is formatted like: `[BEAM-] Description of pull request` - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes). - [ ] Replace `` in the title with the actual Jira issue number, if there is one. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt). --- You can merge this pull request into a Git repository by running: $ git pull https://github.com/vikkyrk/incubator-beam py_ds Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/1043.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1043 commit d553dd23ef346e74ab1eb48d6e29135bb8388b94 Author: Vikas KedigehalliDate: 2016-09-29T18:07:11Z DatastoreIO commit ea430d914d06ccac6b778a545c8da0cf0a2c1753 Author: Vikas Kedigehalli Date: 2016-10-04T17:19:25Z Datastore write --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-beam pull request #1042: BigQueryIO: port trivial fixes from Dataf...
GitHub user peihe opened a pull request: https://github.com/apache/incubator-beam/pull/1042 BigQueryIO: port trivial fixes from Dataflow version. Will rebase once PR-1039 is merged You can merge this pull request into a Git repository by running: $ git pull https://github.com/peihe/incubator-beam bq-beam-dataflow-consistency Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/1042.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1042 commit 95cbdea9f55395fa60f5795d88e1178804e22865 Author: Pei He Date: 2016-10-04T02:37:02Z Forward port Dataflow PR-431 to Beam commit e42a4df4a414284a201a150bc7754570740ed644 Author: Pei He Date: 2016-10-04T03:39:32Z Forward port Dataflow PR-454 to Beam commit 3659e99e46ee0420347de84ae6661c0fad410d94 Author: Pei He Date: 2016-10-04T04:05:44Z Forward port Dataflow PR-453 to Beam commit 6c69d05352427586817af51e6e5221b9ac1d7890 Author: Pei He Date: 2016-10-04T04:19:37Z BigQueryIO: port trivial fixes from Dataflow version.
[GitHub] incubator-beam pull request #1041: Add a default bucket to Dataflow runner
GitHub user sammcveety opened a pull request: https://github.com/apache/incubator-beam/pull/1041 Add a default bucket to Dataflow runner There is a dependency failure that I'm not sure how to debug; all other tests pass. @tgroh could you please take a look? --- You can merge this pull request into a Git repository by running: $ git pull https://github.com/sammcveety/incubator-beam sgmc/bucket_staging Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/1041.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1041 commit 28e935efb72b209da48981f55e4349f37597f34a Author: sammcveety Date: 2016-10-03T18:21:41Z Add a default bucket to Dataflow runner.
[jira] [Updated] (BEAM-703) SingletonViewFn might exhaust defaultValue if it's serialized after being used.
[ https://issues.apache.org/jira/browse/BEAM-703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amit Sela updated BEAM-703: --- Summary: SingletonViewFn might exhaust defaultValue if it's serialized after being used. (was: SingletonViewFn might exhaust defaultValue if it's serialized after used.)
[jira] [Commented] (BEAM-703) SingletonViewFn might exhaust defaultValue if it's serialized after being used.
[ https://issues.apache.org/jira/browse/BEAM-703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15545665#comment-15545665 ] ASF GitHub Bot commented on BEAM-703: - GitHub user amitsela opened a pull request: https://github.com/apache/incubator-beam/pull/1040 [BEAM-703] SingletonViewFn might exhaust defaultValue if it's serialized after being used. Be sure to do all of the following to help us incorporate your contribution quickly and easily: - [ ] Make sure the PR title is formatted like: `[BEAM-] Description of pull request` - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable Travis-CI on your fork and ensure the whole test matrix passes). - [ ] Replace `` in the title with the actual Jira issue number, if there is one. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.txt). --- You can merge this pull request into a Git repository by running: $ git pull https://github.com/amitsela/incubator-beam BEAM-703 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/incubator-beam/pull/1040.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1040 commit 0bd97efc0d764af17cdd8abdf43bff33bb21be2b Author: Sela Date: 2016-10-04T15:12:38Z Avoid losing the encoded defaultValue.
[jira] [Created] (BEAM-703) SingletonViewFn might exhaust defaultValue if it's serialized after used.
Amit Sela created BEAM-703: -- Summary: SingletonViewFn might exhaust defaultValue if it's serialized after used. Key: BEAM-703 URL: https://issues.apache.org/jira/browse/BEAM-703 Project: Beam Issue Type: Bug Components: sdk-java-core Reporter: Amit Sela Assignee: Amit Sela Priority: Minor In https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/util/PCollectionViews.java#L267 the defaultValue is set to null to avoid decoding over and over, I assume. If the defaultValue is accessed before the SingletonViewFn is serialized, it will exhaust the encoded value (assigned with null) while losing the transient decoded value. It'd probably be best to simply check if defaultValue is null before decoding, so that decode will still happen just once, but the encoded data is not lost.
[jira] [Commented] (BEAM-696) Side-Inputs non-deterministic with merging main-input windows
[ https://issues.apache.org/jira/browse/BEAM-696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15544926#comment-15544926 ] Amit Sela commented on BEAM-696: [~pei...@gmail.com] shouldn't the SDK take care of/enforce this behaviour rather than letting a runner decide? It sounds like for merging windows the CombineFns should look up sideInputs only at extractOutput; otherwise I don't see how DataflowRunner is deferring, since a local combine cannot guarantee the state of the merging window (in a specific bundle), correct? I think this issue is about the SDK / Runner API and should be resolved by either limiting the use of sideInputs with Sessions or enforcing/instructing runners to defer looking up sideInputs until extractOutput executes. Am I missing something here?
> This could still run into some non-determinism if there are triggers controlling when we extract output.