[jira] [Work logged] (BEAM-7274) Protobuf Beam Schema support
[ https://issues.apache.org/jira/browse/BEAM-7274?focusedWorklogId=345831&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345831 ] ASF GitHub Bot logged work on BEAM-7274: Author: ASF GitHub Bot Created on: 19/Nov/19 06:48 Start Date: 19/Nov/19 06:48 Worklog Time Spent: 10m Work Description: reuvenlax commented on issue #8690: [BEAM-7274] Implement the Protobuf schema provider URL: https://github.com/apache/beam/pull/8690#issuecomment-555359927 Another comment: this delegates everything to Proto's reflection API, which can be quite inefficient. Compare with AvroSchema where we delegate straight to generated classes when they exist. Reflection might be necessary for the case of DynamicMessage (though in that case I think we should use RowWithStorage instead of RowWithGetters), but shouldn't be necessary when we have generated classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345831) Time Spent: 9h 40m (was: 9.5h) > Protobuf Beam Schema support > > > Key: BEAM-7274 > URL: https://issues.apache.org/jira/browse/BEAM-7274 > Project: Beam > Issue Type: Improvement > Components: sdk-java-core >Reporter: Alex Van Boxel >Assignee: Alex Van Boxel >Priority: Minor > Time Spent: 9h 40m > Remaining Estimate: 0h > > Add support for the new Beam Schema to the Protobuf extension. -- This message was sent by Atlassian Jira (v8.3.4#803005)
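The reflection-versus-generated-code tradeoff raised in this comment can be sketched in plain Python (a conceptual sketch only; `RowWithStorage`, `RowWithGetters`, and `Person` here are simplified stand-ins for the Java classes and a generated protobuf class, not Beam's actual implementation):

```python
# Conceptual sketch of the two Row strategies discussed above:
# RowWithStorage copies field values into the row up front, while
# RowWithGetters keeps the source object and extracts fields on demand.
# The latter is cheap when the getters call directly into a generated
# class, but expensive when every access goes through a reflection API.

class RowWithStorage:
    """Row that owns its values; suited to DynamicMessage-style sources."""
    def __init__(self, values):
        self.values = list(values)

    def get_value(self, i):
        return self.values[i]


class RowWithGetters:
    """Row that defers to per-field getters on the wrapped object."""
    def __init__(self, target, getters):
        self.target = target
        self.getters = getters  # one callable per schema field

    def get_value(self, i):
        return self.getters[i](self.target)


class Person:  # stands in for a generated protobuf message class
    def __init__(self, name, age):
        self.name, self.age = name, age


row = RowWithGetters(Person("ada", 36), [lambda p: p.name, lambda p: p.age])
assert row.get_value(0) == "ada"
assert RowWithStorage(["ada", 36]).get_value(1) == 36
```

The comment's suggestion amounts to picking `RowWithStorage` for `DynamicMessage` (where reflection is unavoidable anyway) and direct getters for generated classes.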
[jira] [Work logged] (BEAM-7274) Protobuf Beam Schema support
[ https://issues.apache.org/jira/browse/BEAM-7274?focusedWorklogId=345812&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345812 ] ASF GitHub Bot logged work on BEAM-7274: Author: ASF GitHub Bot Created on: 19/Nov/19 06:02 Start Date: 19/Nov/19 06:02 Worklog Time Spent: 10m Work Description: reuvenlax commented on pull request #8690: [BEAM-7274] Implement the Protobuf schema provider URL: https://github.com/apache/beam/pull/8690#discussion_r347741250 ## File path: sdks/java/extensions/protobuf/src/main/java/org/apache/beam/sdk/extensions/protobuf/ProtoSchemaProvider.java ## @@ -0,0 +1,84 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.beam.sdk.extensions.protobuf; + +import com.google.protobuf.DynamicMessage; +import javax.annotation.Nullable; +import org.apache.beam.sdk.annotations.Experimental; +import org.apache.beam.sdk.schemas.Schema; +import org.apache.beam.sdk.schemas.SchemaProvider; +import org.apache.beam.sdk.transforms.SerializableFunction; +import org.apache.beam.sdk.values.Row; +import org.apache.beam.sdk.values.TypeDescriptor; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** + * Schema provider for Protobuf messages. 
The provider is able to handle pre-compiled Message classes + * without external help. For Dynamic Messages, a Descriptor needs to be registered up front under a + * specific URN. + * + * This class can be subclassed for a specific implementation that communicates with an + * external registry that maps those URNs to Descriptors. + */ +@Experimental(Experimental.Kind.SCHEMAS) +public class ProtoSchemaProvider implements SchemaProvider { Review comment: I'm not sure I understand this comment. Issue Time Tracking --- Worklog Id: (was: 345812) Time Spent: 9h 20m (was: 9h 10m)
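The registration scheme the Javadoc describes (Descriptors registered up front under a URN, with a subclass free to consult an external registry) can be sketched as follows. All names here are hypothetical illustrations, not Beam's actual API:

```python
class DescriptorRegistry:
    """Minimal URN -> descriptor mapping for DynamicMessage-style schemas.

    A subclass could override `lookup` to consult an external schema
    registry instead of this local dict, which is the extension point
    the Javadoc hints at.
    """
    def __init__(self):
        self._by_urn = {}

    def register(self, urn, descriptor):
        self._by_urn[urn] = descriptor

    def lookup(self, urn):
        try:
            return self._by_urn[urn]
        except KeyError:
            raise KeyError("no descriptor registered for URN: " + urn)


registry = DescriptorRegistry()
registry.register("urn:example:person:v1", {"fields": ["name", "age"]})
assert registry.lookup("urn:example:person:v1")["fields"] == ["name", "age"]
```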
[jira] [Work logged] (BEAM-7116) Remove KV from Schema transforms
[ https://issues.apache.org/jira/browse/BEAM-7116?focusedWorklogId=345814&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345814 ] ASF GitHub Bot logged work on BEAM-7116: Author: ASF GitHub Bot Created on: 19/Nov/19 06:03 Start Date: 19/Nov/19 06:03 Worklog Time Spent: 10m Work Description: reuvenlax commented on issue #10151: [BEAM-7116] Remove use of KV in Schema transforms URL: https://github.com/apache/beam/pull/10151#issuecomment-555347866 run sql postcommit Issue Time Tracking --- Worklog Id: (was: 345814) Time Spent: 0.5h (was: 20m) > Remove KV from Schema transforms > > > Key: BEAM-7116 > URL: https://issues.apache.org/jira/browse/BEAM-7116 > Project: Beam > Issue Type: Sub-task > Components: sdk-java-core >Reporter: Reuven Lax >Assignee: Brian Hulette >Priority: Major > Time Spent: 0.5h > Remaining Estimate: 0h > > Instead of returning KV objects, we should return a Schema with two fields. > The Convert transform should be able to convert these to KV objects.
[jira] [Work logged] (BEAM-7274) Protobuf Beam Schema support
[ https://issues.apache.org/jira/browse/BEAM-7274?focusedWorklogId=345813&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345813 ] ASF GitHub Bot logged work on BEAM-7274: Author: ASF GitHub Bot Created on: 19/Nov/19 06:02 Start Date: 19/Nov/19 06:02 Worklog Time Spent: 10m Work Description: reuvenlax commented on pull request #8690: [BEAM-7274] Implement the Protobuf schema provider URL: https://github.com/apache/beam/pull/8690#discussion_r347743622 ## File path: sdks/java/core/src/main/java/org/apache/beam/sdk/values/Row.java ## @@ -554,6 +555,12 @@ public Builder withFieldValueGetters( return this; } +/** The FieldValueGetters will handle the conversion for Arrays, Maps and Rows. */ +public Builder withFieldValueGettersHandleCollections(boolean collectionHandledByGetter) { + this.collectionHandledByGetter = collectionHandledByGetter; + return this; +} Review comment: Can you help me understand this a bit more? Why does it not work to cache lists for protocol buffers? We saw repeated array conversion to be a big problem (which is why we cache them). I'm wondering if we could instead cache a lazy array like we do with iterables. I'll take a closer look at this code to figure it out. Issue Time Tracking --- Worklog Id: (was: 345813) Time Spent: 9.5h (was: 9h 20m)
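The "cache a lazy array like we do with iterables" idea from the review comment can be sketched as a sequence that converts each element on first access and memoizes the result (a sketch of the technique, not Beam's implementation; `LazyConvertingList` is a hypothetical name):

```python
from collections.abc import Sequence

class LazyConvertingList(Sequence):
    """Wraps a source list and applies `convert` per element, at most once.

    This avoids eagerly converting a whole protobuf repeated field while
    still caching each converted element, so repeated reads stay cheap.
    """
    _MISSING = object()

    def __init__(self, source, convert):
        self._source = source
        self._convert = convert
        self._cache = [self._MISSING] * len(source)

    def __len__(self):
        return len(self._source)

    def __getitem__(self, i):
        if self._cache[i] is self._MISSING:
            self._cache[i] = self._convert(self._source[i])
        return self._cache[i]


calls = []
def convert(x):
    calls.append(x)
    return x * 2

xs = LazyConvertingList([1, 2, 3], convert)
assert xs[1] == 4 and xs[1] == 4   # second read hits the cache
assert calls == [2]                # only the accessed element was converted
```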
[jira] [Work logged] (BEAM-8654) [Java] beam_Dependency_Check's not getting correct report from Gradle dependencyUpdates
[ https://issues.apache.org/jira/browse/BEAM-8654?focusedWorklogId=345801&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345801 ] ASF GitHub Bot logged work on BEAM-8654: Author: ASF GitHub Bot Created on: 19/Nov/19 04:58 Start Date: 19/Nov/19 04:58 Worklog Time Spent: 10m Work Description: suztomo commented on issue #10127: [BEAM-8654] Fixes resolutionStrategy's interference with dependency check URL: https://github.com/apache/beam/pull/10127#issuecomment-555333591 Run Python PreCommit Issue Time Tracking --- Worklog Id: (was: 345801) Time Spent: 2h 20m (was: 2h 10m) > [Java] beam_Dependency_Check's not getting correct report from Gradle > dependencyUpdates > --- > > Key: BEAM-8654 > URL: https://issues.apache.org/jira/browse/BEAM-8654 > Project: Beam > Issue Type: Improvement > Components: build-system >Reporter: Tomo Suzuki >Assignee: Tomo Suzuki >Priority: Major > Time Spent: 2h 20m > Remaining Estimate: 0h > > Cont. of https://issues.apache.org/jira/browse/BEAM-8621 > https://builds.apache.org/view/A-D/view/Beam/view/All/job/beam_Dependency_Check/234/consoleFull > says > {noformat} > 18:20:07 > Task :dependencyUpdates > ... > 18:23:12 The following dependencies are using the latest release version: > ... > 18:23:12 - com.google.cloud.bigdataoss:util:1.9.16 > 18:23:12 - com.google.cloud.bigtable:bigtable-client-core:1.8.0 > {noformat} > But they are not the latest release. > * > https://search.maven.org/artifact/com.google.cloud.bigdataoss/util/2.0.0/jar > * > https://search.maven.org/artifact/com.google.cloud.bigtable/bigtable-client-core/1.12.1/jar > Why does Gradle think they're the latest release? > It seems that the "-Drevision=release" flag plays some role here. Without the > flag, Gradle reports these artifacts are not the latest. > https://gist.github.com/suztomo/1460f2be48025c8ea764e86a2c6e39a8 > Even with the flag, it should report the following > {noformat} > The following dependencies have later release versions: > - com.google.cloud.bigtable:bigtable-client-core [1.8.0 -> 1.12.1] > https://cloud.google.com/bigtable/ > {noformat} > https://gist.github.com/suztomo/13473e6b9765c0e96c22aeffab18ef66
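The expected behavior the issue describes (even with a release-only revision filter, 1.8.0 should be flagged because 1.12.1 is a newer release) can be sketched with a simple qualifier-based filter. This is a heuristic sketch only; the Gradle versions plugin's actual resolution logic is more involved:

```python
import re

# Heuristic: versions ending in a pre-release qualifier are not "release".
PRERELEASE = re.compile(r"(?i)[.-](alpha|beta|rc|m|snapshot|cr|preview)[.\d]*$")

def is_release(version):
    return PRERELEASE.search(version) is None

def latest_release(versions):
    """Pick the highest release version by numeric component comparison."""
    releases = [v for v in versions if is_release(v)]
    return max(releases, key=lambda v: [int(p) for p in v.split(".") if p.isdigit()])

# bigtable-client-core: 1.8.0 is the pinned version, but 1.12.1 is a
# newer *release*, so a correct "release" report should still flag it.
assert latest_release(["1.8.0", "1.12.1", "2.0.0-alpha1"]) == "1.12.1"
```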
[jira] [Work logged] (BEAM-7116) Remove KV from Schema transforms
[ https://issues.apache.org/jira/browse/BEAM-7116?focusedWorklogId=345792&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345792 ] ASF GitHub Bot logged work on BEAM-7116: Author: ASF GitHub Bot Created on: 19/Nov/19 04:35 Start Date: 19/Nov/19 04:35 Worklog Time Spent: 10m Work Description: reuvenlax commented on issue #10151: [BEAM-7116] Remove use of KV in Schema transforms URL: https://github.com/apache/beam/pull/10151#issuecomment-555328927 R: @TheNeuralBit Issue Time Tracking --- Worklog Id: (was: 345792) Time Spent: 20m (was: 10m)
[jira] [Work logged] (BEAM-7116) Remove KV from Schema transforms
[ https://issues.apache.org/jira/browse/BEAM-7116?focusedWorklogId=345790&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345790 ] ASF GitHub Bot logged work on BEAM-7116: Author: ASF GitHub Bot Created on: 19/Nov/19 04:32 Start Date: 19/Nov/19 04:32 Worklog Time Spent: 10m Work Description: reuvenlax commented on pull request #10151: [BEAM-7116] Remove use of KV in Schema transforms URL: https://github.com/apache/beam/pull/10151 Beam's KV type has no schema, and due to special casing of KvCoder in Beam it is difficult to give it one. Here we modify the Beam schema transforms that return PCollection<KV> to instead return PCollection<Row>, where the Row contains key and value fields. This is possible now that we support large iterables in schemas. Issue Time Tracking --- Worklog Id: (was: 345790) Remaining Estimate: 0h Time Spent: 10m
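The change this PR describes (returning a two-field row instead of a KV, with a Convert-style transform able to recover the KV view) can be sketched as follows. These are plain-Python stand-ins; Beam's Row and Convert are far richer:

```python
# A grouped result modeled as a two-field "row" (dict stand-in) rather
# than a KV pair. A Convert-style helper can still recover the KV view.

def group_as_row(key, values):
    """Schema-style result: one row with explicit key and value fields."""
    return {"key": key, "value": list(values)}

def row_to_kv(row):
    """Convert-transform analogue: project the row back to a (k, v) pair."""
    return row["key"], row["value"]


row = group_as_row("user1", [3, 1, 4])
assert row == {"key": "user1", "value": [3, 1, 4]}
assert row_to_kv(row) == ("user1", [3, 1, 4])
```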
[jira] [Work logged] (BEAM-6756) Support lazy iterables in schemas
[ https://issues.apache.org/jira/browse/BEAM-6756?focusedWorklogId=345789&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345789 ] ASF GitHub Bot logged work on BEAM-6756: Author: ASF GitHub Bot Created on: 19/Nov/19 04:15 Start Date: 19/Nov/19 04:15 Worklog Time Spent: 10m Work Description: reuvenlax commented on pull request #10003: [BEAM-6756] Create Iterable type for Schema URL: https://github.com/apache/beam/pull/10003 Issue Time Tracking --- Worklog Id: (was: 345789) Time Spent: 2.5h (was: 2h 20m) > Support lazy iterables in schemas > - > > Key: BEAM-6756 > URL: https://issues.apache.org/jira/browse/BEAM-6756 > Project: Beam > Issue Type: Sub-task > Components: sdk-java-core >Reporter: Reuven Lax >Assignee: Reuven Lax >Priority: Major > Time Spent: 2.5h > Remaining Estimate: 0h > > The iterables returned by GroupByKey and CoGroupByKey are lazy; this allows a > runner to page data into memory if the full iterable is too large. We > currently don't support this in Schemas, so the Schema Group and CoGroup > transforms materialize all data into memory. We should add support for this.
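The lazy-iterable behavior the issue describes (pages fetched only as the consumer advances, so a huge grouped result is never fully materialized) can be sketched with a generator. This is a sketch of the concept; Beam runners page via the runner's state API, not a helper like this:

```python
def paged_values(fetch_page, page_size):
    """Yield elements page by page, requesting each page only when needed.

    `fetch_page(start, count)` returns the next chunk, empty when done.
    """
    start = 0
    while True:
        page = fetch_page(start, page_size)
        if not page:
            return
        yield from page
        start += len(page)


DATA = list(range(10))
fetched = []
def fetch_page(start, count):
    fetched.append(start)
    return DATA[start:start + count]

it = paged_values(fetch_page, page_size=4)
assert next(it) == 0
assert fetched == [0]            # only the first page has been fetched so far
assert list(it) == list(range(1, 10))
```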
[jira] [Work logged] (BEAM-8251) Add worker_region and worker_zone options
[ https://issues.apache.org/jira/browse/BEAM-8251?focusedWorklogId=345786&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345786 ] ASF GitHub Bot logged work on BEAM-8251: Author: ASF GitHub Bot Created on: 19/Nov/19 04:05 Start Date: 19/Nov/19 04:05 Worklog Time Spent: 10m Work Description: ibzib commented on pull request #10150: [BEAM-8251] plumb worker_(region|zone) to Environment proto URL: https://github.com/apache/beam/pull/10150 Added these pipeline options a while back, but they need to be present in the `Environment` to actually take effect. [GitHub PR template checklist and post-commit build status badge table omitted; entry truncated]
[jira] [Work logged] (BEAM-7951) Allow runner to configure customization WindowedValue coder such as ValueOnlyWindowedValueCoder
[ https://issues.apache.org/jira/browse/BEAM-7951?focusedWorklogId=345774&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345774 ] ASF GitHub Bot logged work on BEAM-7951: Author: ASF GitHub Bot Created on: 19/Nov/19 02:43 Start Date: 19/Nov/19 02:43 Worklog Time Spent: 10m Work Description: sunjincheng121 commented on issue #9979: [BEAM-7951] Allow runner to configure customization WindowedValue coder. URL: https://github.com/apache/beam/pull/9979#issuecomment-555306699 Run Python PreCommit Issue Time Tracking --- Worklog Id: (was: 345774) Time Spent: 2h 10m (was: 2h) > Allow runner to configure customization WindowedValue coder such as > ValueOnlyWindowedValueCoder > --- > > Key: BEAM-7951 > URL: https://issues.apache.org/jira/browse/BEAM-7951 > Project: Beam > Issue Type: Sub-task > Components: java-fn-execution >Reporter: sunjincheng >Assignee: sunjincheng >Priority: Major > Time Spent: 2h 10m > Remaining Estimate: 0h > > The coder of WindowedValue cannot be configured and it’s always > FullWindowedValueCoder. We don't need to serialize the timestamp, window and > pane properties in Flink and so it will be better to make the coder > configurable (i.e. allowing to use ValueOnlyWindowedValueCoder)
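The difference between the two coders the issue contrasts can be sketched as follows. This is a conceptual sketch using JSON in place of Beam's binary coders, with hypothetical helper names:

```python
import json

def encode_full(value, timestamp, window, pane):
    """FullWindowedValueCoder analogue: all metadata goes on the wire."""
    return json.dumps({"v": value, "ts": timestamp, "w": window, "p": pane})

def encode_value_only(value, timestamp, window, pane):
    """ValueOnlyWindowedValueCoder analogue: metadata is dropped, which is
    fine when the runner (e.g. Flink in this issue) tracks it itself."""
    return json.dumps({"v": value})


full = encode_full("a", 123, "global", "ON_TIME")
slim = encode_value_only("a", 123, "global", "ON_TIME")
assert len(slim) < len(full)        # smaller payload per element
assert json.loads(slim) == {"v": "a"}
```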
[jira] [Work logged] (BEAM-8629) WithTypeHints._get_or_create_type_hints may return a mutable copy of the class type hints.
[ https://issues.apache.org/jira/browse/BEAM-8629?focusedWorklogId=345769&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345769 ] ASF GitHub Bot logged work on BEAM-8629: Author: ASF GitHub Bot Created on: 19/Nov/19 02:29 Start Date: 19/Nov/19 02:29 Worklog Time Spent: 10m Work Description: udim commented on pull request #10080: [BEAM-8629] Don't return mutable class type hints. URL: https://github.com/apache/beam/pull/10080#discussion_r347705356 ## File path: sdks/python/apache_beam/typehints/decorators.py ## @@ -340,6 +340,22 @@ def __repr__(self): return 'IOTypeHints[inputs=%s, outputs=%s]' % ( self.input_types, self.output_types) + def __eq__(self, other): Review comment: Why were these 3 methods needed? This is an automated message from the Apache Git Service. Issue Time Tracking --- Worklog Id: (was: 345769) Time Spent: 0.5h (was: 20m) > WithTypeHints._get_or_create_type_hints may return a mutable copy of the > class type hints. > -- > > Key: BEAM-8629 > URL: https://issues.apache.org/jira/browse/BEAM-8629 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Robert Bradshaw >Assignee: Robert Bradshaw >Priority: Major > Time Spent: 0.5h > Remaining Estimate: 0h
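For context on the review question: a class that overrides `__eq__` usually needs a matching `__hash__` (and, on Python 2, an explicit `__ne__`) so that instances compare by value in tests and remain usable as dict/set keys. A minimal sketch of the pattern, using a simplified stand-in rather than the real `IOTypeHints`:

```python
class IOTypeHintsLike:
    """Value object compared by its input/output type tuples."""
    def __init__(self, input_types, output_types):
        self.input_types = input_types
        self.output_types = output_types

    def __eq__(self, other):
        return (isinstance(other, IOTypeHintsLike)
                and self.input_types == other.input_types
                and self.output_types == other.output_types)

    def __ne__(self, other):   # needed explicitly on Python 2
        return not self == other

    def __hash__(self):        # keep instances hashable alongside __eq__
        return hash((self.input_types, self.output_types))


a = IOTypeHintsLike((int,), (str,))
b = IOTypeHintsLike((int,), (str,))
assert a == b and not (a != b)
assert len({a, b}) == 1        # equal values collapse in a set
```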
[jira] [Work logged] (BEAM-8646) PR #9814 appears to cause failures in fnapi_runner tests on Windows
[ https://issues.apache.org/jira/browse/BEAM-8646?focusedWorklogId=345763&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345763 ] ASF GitHub Bot logged work on BEAM-8646: Author: ASF GitHub Bot Created on: 19/Nov/19 02:18 Start Date: 19/Nov/19 02:18 Worklog Time Spent: 10m Work Description: violalyu commented on issue #10110: [BEAM-8646] Restore original behavior of evaluating worker host on Windows until a better solution is available. URL: https://github.com/apache/beam/pull/10110#issuecomment-555300788 Thanks @tvalentyn ! Issue Time Tracking --- Worklog Id: (was: 345763) Time Spent: 0.5h (was: 20m) > PR #9814 appears to cause failures in fnapi_runner tests on Windows > --- > > Key: BEAM-8646 > URL: https://issues.apache.org/jira/browse/BEAM-8646 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Valentyn Tymofieiev >Assignee: Wanqi Lyu >Priority: Major > Time Spent: 0.5h > Remaining Estimate: 0h > > It appears that changes in > > https://github.com/apache/beam/commit/d6bcb03f586b5430c30f6ca4a1af9e42711e529c > cause test failures in Beam test suite on Windows, for example: > python setup.py nosetests --tests > apache_beam/runners/portability/portable_runner_test.py:PortableRunnerTestWithExternalEnv.test_callbacks_with_exception > > does not finish on a Windows VM machine within at least 60 seconds but passes > within a second if we change host_from_worker to return 'localhost' in [1]. > [~violalyu] , do you think you could take a look? Thanks! > cc: [~chadrik] [~thw] > [1] > https://github.com/apache/beam/blob/808cb35018cd228a59b152234b655948da2455fa/sdks/python/apache_beam/runners/portability/fn_api_runner.py#L1377.
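The workaround referenced in the issue (return 'localhost' for the worker host on Windows rather than resolving the machine's hostname) can be sketched as below. The function mirrors the `host_from_worker` name from fn_api_runner.py but is a simplified stand-in, not the real implementation:

```python
import platform
import socket

def host_from_worker():
    """Pick the address workers use to reach the runner.

    On Windows, resolving the machine's own hostname was observed to hang
    these tests, so fall back to the loopback name there.
    """
    if platform.system() == "Windows":
        return "localhost"
    return socket.getfqdn()


host = host_from_worker()
assert isinstance(host, str)
```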
[jira] [Work logged] (BEAM-8575) Add more Python validates runner tests
[ https://issues.apache.org/jira/browse/BEAM-8575?focusedWorklogId=345758&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345758 ] ASF GitHub Bot logged work on BEAM-8575: Author: ASF GitHub Bot Created on: 19/Nov/19 02:12 Start Date: 19/Nov/19 02:12 Worklog Time Spent: 10m Work Description: liumomo315 commented on issue #10145: [BEAM-8575] Add a Python test to test windowing in DoFn finish_bundle() URL: https://github.com/apache/beam/pull/10145#issuecomment-555299364 Thanks Yichi for the review:) Issue Time Tracking --- Worklog Id: (was: 345758) Time Spent: 11h 10m (was: 11h) > Add more Python validates runner tests > -- > > Key: BEAM-8575 > URL: https://issues.apache.org/jira/browse/BEAM-8575 > Project: Beam > Issue Type: Test > Components: sdk-py-core, testing >Reporter: wendy liu >Assignee: wendy liu >Priority: Major > Time Spent: 11h 10m > Remaining Estimate: 0h > > This is the umbrella issue to track the work of adding more Python tests to > improve test coverage.
[jira] [Work logged] (BEAM-6756) Support lazy iterables in schemas
[ https://issues.apache.org/jira/browse/BEAM-6756?focusedWorklogId=345754&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345754 ] ASF GitHub Bot logged work on BEAM-6756: Author: ASF GitHub Bot Created on: 19/Nov/19 02:07 Start Date: 19/Nov/19 02:07 Worklog Time Spent: 10m Work Description: reuvenlax commented on issue #10003: [BEAM-6756] Create Iterable type for Schema URL: https://github.com/apache/beam/pull/10003#issuecomment-555298098 Run Java PreCommit Issue Time Tracking --- Worklog Id: (was: 345754) Time Spent: 2h 20m (was: 2h 10m)
[jira] [Work logged] (BEAM-8016) Render Beam Pipeline as DOT with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-8016?focusedWorklogId=345751&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345751 ] ASF GitHub Bot logged work on BEAM-8016: Author: ASF GitHub Bot Created on: 19/Nov/19 01:57 Start Date: 19/Nov/19 01:57 Worklog Time Spent: 10m Work Description: KevinGG commented on issue #10132: [BEAM-8016] Pipeline Graph URL: https://github.com/apache/beam/pull/10132#issuecomment-555295902 Run Portable_Python PreCommit Issue Time Tracking --- Worklog Id: (was: 345751) Time Spent: 3h 50m (was: 3h 40m) > Render Beam Pipeline as DOT with Interactive Beam > --- > > Key: BEAM-8016 > URL: https://issues.apache.org/jira/browse/BEAM-8016 > Project: Beam > Issue Type: Improvement > Components: runner-py-interactive >Reporter: Ning Kang >Assignee: Ning Kang >Priority: Major > Time Spent: 3h 50m > Remaining Estimate: 0h > > With work in https://issues.apache.org/jira/browse/BEAM-7760, Beam pipeline > converted to DOT then rendered should mark user defined variables on edges. > With work in https://issues.apache.org/jira/browse/BEAM-7926, it might be > redundant or confusing to render arbitrary random sample PCollection data on > edges. > We'll also make sure edges in the graph correspond to the output -> input > relationship in the user defined pipeline. Each edge is one output. If > multiple downstream inputs take the same output, it should be rendered as > one edge diverging into two instead of two edges. > For advanced interactivity highlighting, where each execution highlights the part > of the pipeline really executed from the original pipeline, we'll also > provide the support in beta.
[jira] [Work logged] (BEAM-8575) Add more Python validates runner tests
[ https://issues.apache.org/jira/browse/BEAM-8575?focusedWorklogId=345748&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345748 ] ASF GitHub Bot logged work on BEAM-8575: Author: ASF GitHub Bot Created on: 19/Nov/19 01:48 Start Date: 19/Nov/19 01:48 Worklog Time Spent: 10m Work Description: y1chi commented on issue #10145: [BEAM-8575] Add a Python test to test windowing in DoFn finish_bundle() URL: https://github.com/apache/beam/pull/10145#issuecomment-555293777 @tvalentyn could you help to merge? Issue Time Tracking --- Worklog Id: (was: 345748) Time Spent: 11h (was: 10h 50m)
[jira] [Work logged] (BEAM-8586) [SQL] Add a server for MongoDb Integration Test
[ https://issues.apache.org/jira/browse/BEAM-8586?focusedWorklogId=345747&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345747 ] ASF GitHub Bot logged work on BEAM-8586: Author: ASF GitHub Bot Created on: 19/Nov/19 01:39 Start Date: 19/Nov/19 01:39 Worklog Time Spent: 10m Work Description: 11moon11 commented on issue #10061: [BEAM-8586] [SQL] Fix MongoDb integration tests URL: https://github.com/apache/beam/pull/10061#issuecomment-553659742 Run sql postcommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345747) Time Spent: 50m (was: 40m) > [SQL] Add a server for MongoDb Integration Test > --- > > Key: BEAM-8586 > URL: https://issues.apache.org/jira/browse/BEAM-8586 > Project: Beam > Issue Type: Test > Components: dsl-sql >Reporter: Kirill Kozlov >Priority: Minor > Time Spent: 50m > Remaining Estimate: 0h > > We need to pass pipeline options with server information to the > MongoDbReadWriteIT. > For now that test is ignored and excluded from the build.gradle file. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8586) [SQL] Add a server for MongoDb Integration Test
[ https://issues.apache.org/jira/browse/BEAM-8586?focusedWorklogId=345746&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345746 ] ASF GitHub Bot logged work on BEAM-8586: Author: ASF GitHub Bot Created on: 19/Nov/19 01:39 Start Date: 19/Nov/19 01:39 Worklog Time Spent: 10m Work Description: 11moon11 commented on issue #10061: [BEAM-8586] [SQL] Fix MongoDb integration tests URL: https://github.com/apache/beam/pull/10061#issuecomment-555291639 Run sql postcommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345746) Time Spent: 40m (was: 0.5h)
[jira] [Work logged] (BEAM-8016) Render Beam Pipeline as DOT with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-8016?focusedWorklogId=345745&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345745 ] ASF GitHub Bot logged work on BEAM-8016: Author: ASF GitHub Bot Created on: 19/Nov/19 01:34 Start Date: 19/Nov/19 01:34 Worklog Time Spent: 10m Work Description: KevinGG commented on issue #10132: [BEAM-8016] Pipeline Graph URL: https://github.com/apache/beam/pull/10132#issuecomment-555276401 Thanks for the comments! > * Are we getting rid of the tooltips displaying the intermediate results? Do they not fit in the new model? We are offering a `show` API so that users can visualize a larger set of their data dynamically, instead of peeking at a random static sample (which we still offer if the user calls `show` in an IPython terminal rather than a notebook web frontend). With the new Beam pipeline graph proposal and a pipeline graph library in progress, the tooltip might in the future show other metadata, such as elapsed time and throughput, to provide a user experience consistent with what users have on Dataflow. > * What do the PCollections look like if the user did not specify the PCollection name as a variable? Those PCollections will not be cached. The idea is that when building a pipeline, if the user does not assign a PCollection to a variable, they cannot build the pipeline further from it, and they cannot invoke `show(pcoll)` because they don't have access to `pcoll` in their code. Previously, we had the `leaf pcollection` concept for PCollections that have never been used as inputs. It doesn't work for PCollections consumed by sinks (with input but no output), which look like dangling PCollections even when the user has assigned them to a variable. It also doesn't work in a notebook environment, where new transforms can be added at different locations and PCollections can be re-evaluated due to cell re-execution. This is an automated message from the Apache Git Service. 
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345745) Time Spent: 3h 40m (was: 3.5h)
[jira] [Work logged] (BEAM-8613) Add environment variable support to Docker environment
[ https://issues.apache.org/jira/browse/BEAM-8613?focusedWorklogId=345742&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345742 ] ASF GitHub Bot logged work on BEAM-8613: Author: ASF GitHub Bot Created on: 19/Nov/19 01:20 Start Date: 19/Nov/19 01:20 Worklog Time Spent: 10m Work Description: nrusch commented on issue #10064: [BEAM-8613] Add environment variable support to Docker environment URL: https://github.com/apache/beam/pull/10064#issuecomment-555287194 Rebased to current master. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345742) Time Spent: 1h (was: 50m) > Add environment variable support to Docker environment > -- > > Key: BEAM-8613 > URL: https://issues.apache.org/jira/browse/BEAM-8613 > Project: Beam > Issue Type: Improvement > Components: java-fn-execution, runner-core, runner-direct >Reporter: Nathan Rusch >Priority: Trivial > Time Spent: 1h > Remaining Estimate: 0h > > The Process environment allows specifying environment variables via a map > field on its payload message. The Docker environment should support this same > pattern, and forward the contents of the map through to the container runtime. -- This message was sent by Atlassian Jira (v8.3.4#803005)
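The BEAM-8613 change above forwards an environment-variable map from the Docker environment payload to the container runtime. A minimal, generic sketch of that forwarding is below; the helper name and flag layout are illustrative assumptions, not the actual Beam implementation:

```python
def docker_env_flags(env_map):
    """Turn an environment-variable map (as carried on the Docker
    environment's payload message) into `docker run` command-line flags,
    one `-e KEY=VALUE` pair per entry. Sorted for deterministic output."""
    flags = []
    for key, value in sorted(env_map.items()):
        flags.extend(['-e', '{}={}'.format(key, value)])
    return flags

# Usage: ['docker', 'run'] + docker_env_flags({'LOG_LEVEL': 'debug'}) + [image]
```

The same map-valued field already exists on the Process environment payload; reusing it keeps the two environment types symmetric.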
[jira] [Work logged] (BEAM-8016) Render Beam Pipeline as DOT with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-8016?focusedWorklogId=345741&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345741 ] ASF GitHub Bot logged work on BEAM-8016: Author: ASF GitHub Bot Created on: 19/Nov/19 01:18 Start Date: 19/Nov/19 01:18 Worklog Time Spent: 10m Work Description: KevinGG commented on pull request #10132: [BEAM-8016] Pipeline Graph URL: https://github.com/apache/beam/pull/10132#discussion_r347682578 ## File path: sdks/python/apache_beam/runners/interactive/interactive_runner.py ## @@ -117,123 +119,43 @@ def apply(self, transform, pvalueish, options): return self._underlying_runner.apply(transform, pvalueish, options) def run_pipeline(self, pipeline, options): -if not hasattr(self, '_desired_cache_labels'): - self._desired_cache_labels = set() - -# Invoke a round trip through the runner API. This makes sure the Pipeline -# proto is stable. -pipeline = beam.pipeline.Pipeline.from_runner_api( -pipeline.to_runner_api(use_fake_coders=True), -pipeline.runner, -options) - -# Snapshot the pipeline in a portable proto before mutating it. -pipeline_proto, original_context = pipeline.to_runner_api( -return_context=True, use_fake_coders=True) -pcolls_to_pcoll_id = self._pcolls_to_pcoll_id(pipeline, original_context) - -analyzer = pipeline_analyzer.PipelineAnalyzer(self._cache_manager, - pipeline_proto, - self._underlying_runner, - options, - self._desired_cache_labels) -# Should be only accessed for debugging purpose. 
-self._analyzer = analyzer +pin = inst.pin(pipeline, options) pipeline_to_execute = beam.pipeline.Pipeline.from_runner_api( -analyzer.pipeline_proto_to_execute(), +pin.instrumented_pipeline_proto(), self._underlying_runner, options) if not self._skip_display: - display = display_manager.DisplayManager( - pipeline_proto=pipeline_proto, - pipeline_analyzer=analyzer, - cache_manager=self._cache_manager, - pipeline_graph_renderer=self._renderer) - display.start_periodic_update() + pg = pipeline_graph.PipelineGraph(pin.original_pipeline, Review comment: Sure! I'll use the full name `a_pipeline_graph`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345741) Time Spent: 3.5h (was: 3h 20m) > Render Beam Pipeline as DOT with Interactive Beam > --- > > Key: BEAM-8016 > URL: https://issues.apache.org/jira/browse/BEAM-8016 > Project: Beam > Issue Type: Improvement > Components: runner-py-interactive >Reporter: Ning Kang >Assignee: Ning Kang >Priority: Major > Time Spent: 3.5h > Remaining Estimate: 0h > > With work in https://issues.apache.org/jira/browse/BEAM-7760, Beam pipeline > converted to DOT then rendered should mark user defined variables on edges. > With work in https://issues.apache.org/jira/browse/BEAM-7926, it might be > redundant or confusing to render arbitrary random sample PCollection data on > edges. > We'll also make sure edges in the graph corresponds to output -> input > relationship in the user defined pipeline. Each edge is one output. If > multiple down stream inputs take the same output, it should be rendered as > one edge diverging into two instead of two edges. 
[jira] [Work logged] (BEAM-8016) Render Beam Pipeline as DOT with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-8016?focusedWorklogId=345740&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345740 ] ASF GitHub Bot logged work on BEAM-8016: Author: ASF GitHub Bot Created on: 19/Nov/19 01:16 Start Date: 19/Nov/19 01:16 Worklog Time Spent: 10m Work Description: KevinGG commented on pull request #10132: [BEAM-8016] Pipeline Graph URL: https://github.com/apache/beam/pull/10132#discussion_r347031607 ## File path: sdks/python/apache_beam/runners/interactive/display/pipeline_graph.py ## @@ -136,14 +159,50 @@ def _generate_graph_dicts(self):
           vertex_dict[invisible_leaf] = {'style': 'invis'}
           self._edge_to_vertex_pairs[pcoll_id].append(
               (transform.unique_name, invisible_leaf))
-          edge_dict[(transform.unique_name, invisible_leaf)] = {}
+          if self._pin:
+            edge_label = {'label':
+                          self._pin.cacheable_var_by_pcoll_id(pcoll_id)}
+            edge_dict[(transform.unique_name, invisible_leaf)] = edge_label
+          else:
+            edge_dict[(transform.unique_name, invisible_leaf)] = {}
+        # For PCollections with more than one consuming PTransform, we also
+        # add an invisible dummy node to diverge the edge in the middle, as
+        # the single output is used by multiple downstream PTransforms as
+        # inputs, instead of emitting multiple edges.
+        elif len(self._consumers[pcoll_id]) > 1:
+          intermediate_dummy = 'diverge{}'.format(
+              hash(pcoll_id) % 1)
+          vertex_dict[intermediate_dummy] = {'shape': 'point',
+                                             'width': '0'}
+          for consumer in self._consumers[pcoll_id]:
+            producer_name = transform.unique_name
+            consumer_name = transforms[consumer].unique_name
+            self._edge_to_vertex_pairs[pcoll_id].append(
+                (producer_name, intermediate_dummy))
+            if self._pin:
+              edge_dict[(producer_name, intermediate_dummy)] = {
+                  'arrowhead': 'none',
+                  'label':
+                      self._pin.cacheable_var_by_pcoll_id(pcoll_id)}
+            else:
+              edge_dict[(producer_name, intermediate_dummy)] = {
+                  'arrowhead': 'none'}
+            self._edge_to_vertex_pairs[pcoll_id].append(
+                (intermediate_dummy, consumer_name))
+            edge_dict[(intermediate_dummy, consumer_name)] = {}
         else:
           for consumer in self._consumers[pcoll_id]:
             producer_name = transform.unique_name
             consumer_name = transforms[consumer].unique_name
             self._edge_to_vertex_pairs[pcoll_id].append(
                 (producer_name, consumer_name))
-            edge_dict[(producer_name, consumer_name)] = {}
+            if self._pin:
+              edge_dict[(producer_name, consumer_name)] = {
+                  'label':
+                      self._pin.cacheable_var_by_pcoll_id(pcoll_id)
+              }
+            else:
+              edge_dict[(producer_name, consumer_name)] = {}
Review comment: The difference introduced: Before: ![1](https://user-images.githubusercontent.com/4423149/68979012-364f6480-07b1-11ea-9ec3-8022154a8499.png) After (edges diverge for a single output; PCollections labelled with user variable names): ![2](https://user-images.githubusercontent.com/4423149/69100500-2bdfd580-0a12-11ea-9ef3-cb735fe69775.png) In the future, with the data-centric user flow, notebook cell execution metadata, and the new Beam pipeline graph proposal: ![3](https://user-images.githubusercontent.com/4423149/68979139-93e3b100-07b1-11ea-8931-a6825b39b7a4.png) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345740) Time Spent: 3h 20m (was: 3h 10m)
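The edge-diverging rendering discussed in this thread (one output feeding several consumers drawn as one edge splitting at an invisible point node, rather than N parallel edges) can be illustrated with a small standalone sketch. Plain Python emitting DOT text; the function and node-naming scheme are illustrative, not Beam's actual `PipelineGraph` API:

```python
def to_dot(producers):
    """Render a {producer: [consumers]} mapping as Graphviz DOT.

    When one output has multiple consumers, route the edges through an
    invisible 'point' node so the graph shows a single edge diverging,
    mirroring the PipelineGraph change described above.
    """
    lines = ['digraph G {']
    for producer, consumers in producers.items():
        if len(consumers) > 1:
            dummy = 'diverge_%s' % producer  # invisible mid-point node
            lines.append('  "%s" [shape=point, width=0];' % dummy)
            lines.append('  "%s" -> "%s" [arrowhead=none];' % (producer, dummy))
            for consumer in consumers:
                lines.append('  "%s" -> "%s";' % (dummy, consumer))
        else:
            for consumer in consumers:
                lines.append('  "%s" -> "%s";' % (producer, consumer))
    lines.append('}')
    return '\n'.join(lines)

# 'Create' has two consumers, so its single output diverges at a point node:
dot = to_dot({'Create': ['CountA', 'CountB'], 'CountA': ['Write']})
```

Rendering `dot` with Graphviz shows one edge leaving `Create` and splitting into two, which is exactly the "one edge diverging into two instead of two edges" behavior the issue asks for.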
[jira] [Work logged] (BEAM-8016) Render Beam Pipeline as DOT with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-8016?focusedWorklogId=345739&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345739 ] ASF GitHub Bot logged work on BEAM-8016: Author: ASF GitHub Bot Created on: 19/Nov/19 01:15 Start Date: 19/Nov/19 01:15 Worklog Time Spent: 10m Work Description: KevinGG commented on pull request #10132: [BEAM-8016] Pipeline Graph URL: https://github.com/apache/beam/pull/10132#discussion_r347689574 ## File path: sdks/python/apache_beam/runners/interactive/display/pipeline_graph.py ## @@ -92,9 +104,20 @@ def __init__(self,
                         default_vertex_attrs,
                         default_edge_attrs)
+    self._renderer = pipeline_graph_renderer.get_renderer(render_option)
+
   def get_dot(self):
     return self._get_graph().to_string()
+
+  def display_graph(self):
+    rendered_graph = self._renderer.render_pipeline_graph(self)
+    if ie.current_env().is_in_notebook:
+      try:
+        from IPython.core import display
+        display.display(display.HTML(rendered_graph))
+      except ImportError:  # Unlikely to happen when is_in_notebook.
+        pass
Review comment: Added warning-level logging here. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345739) Time Spent: 3h 10m (was: 3h)
[jira] [Work logged] (BEAM-8016) Render Beam Pipeline as DOT with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-8016?focusedWorklogId=345738&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345738 ] ASF GitHub Bot logged work on BEAM-8016: Author: ASF GitHub Bot Created on: 19/Nov/19 01:11 Start Date: 19/Nov/19 01:11 Worklog Time Spent: 10m Work Description: KevinGG commented on pull request #10132: [BEAM-8016] Pipeline Graph URL: https://github.com/apache/beam/pull/10132#discussion_r347688641 ## File path: sdks/python/apache_beam/runners/interactive/interactive_runner.py ## @@ -117,123 +119,43 @@ def apply(self, transform, pvalueish, options): return self._underlying_runner.apply(transform, pvalueish, options) def run_pipeline(self, pipeline, options): -if not hasattr(self, '_desired_cache_labels'): - self._desired_cache_labels = set() - -# Invoke a round trip through the runner API. This makes sure the Pipeline -# proto is stable. -pipeline = beam.pipeline.Pipeline.from_runner_api( -pipeline.to_runner_api(use_fake_coders=True), -pipeline.runner, -options) - -# Snapshot the pipeline in a portable proto before mutating it. -pipeline_proto, original_context = pipeline.to_runner_api( -return_context=True, use_fake_coders=True) -pcolls_to_pcoll_id = self._pcolls_to_pcoll_id(pipeline, original_context) - -analyzer = pipeline_analyzer.PipelineAnalyzer(self._cache_manager, - pipeline_proto, - self._underlying_runner, - options, - self._desired_cache_labels) -# Should be only accessed for debugging purpose. 
-self._analyzer = analyzer +pin = inst.pin(pipeline, options) pipeline_to_execute = beam.pipeline.Pipeline.from_runner_api( -analyzer.pipeline_proto_to_execute(), +pin.instrumented_pipeline_proto(), self._underlying_runner, options) if not self._skip_display: - display = display_manager.DisplayManager( - pipeline_proto=pipeline_proto, - pipeline_analyzer=analyzer, - cache_manager=self._cache_manager, - pipeline_graph_renderer=self._renderer) - display.start_periodic_update() + pg = pipeline_graph.PipelineGraph(pin.original_pipeline, +render_option=self._render_option) + pg.display_graph() result = pipeline_to_execute.run() result.wait_until_finish() -if not self._skip_display: - display.stop_periodic_update() - -return PipelineResult(result, self, self._analyzer.pipeline_info(), - self._cache_manager, pcolls_to_pcoll_id) - - def _pcolls_to_pcoll_id(self, pipeline, original_context): -"""Returns a dict mapping PCollections string to PCollection IDs. - -Using a PipelineVisitor to iterate over every node in the pipeline, -records the mapping from PCollections to PCollections IDs. This mapping -will be used to query cached PCollections. - -Args: - pipeline: (pipeline.Pipeline) - original_context: (pipeline_context.PipelineContext) - -Returns: - (dict from str to str) a dict mapping str(pcoll) to pcoll_id. -""" -pcolls_to_pcoll_id = {} - -from apache_beam.pipeline import PipelineVisitor # pylint: disable=import-error - -class PCollVisitor(PipelineVisitor): # pylint: disable=used-before-assignment - A visitor that records input and output values to be replaced. - - Input and output values that should be updated are recorded in maps - input_replacements and output_replacements respectively. - - We cannot update input and output values while visiting since that - results in validation errors. 
- """ - - def enter_composite_transform(self, transform_node): -self.visit_transform(transform_node) - - def visit_transform(self, transform_node): -for pcoll in transform_node.outputs.values(): - pcolls_to_pcoll_id[str(pcoll)] = original_context.pcollections.get_id( - pcoll) - -pipeline.visit(PCollVisitor()) -return pcolls_to_pcoll_id +return PipelineResult(result, pin) class PipelineResult(beam.runners.runner.PipelineResult): """Provides access to information about a pipeline.""" - def __init__(self, underlying_result, runner, pipeline_info, cache_manager, - pcolls_to_pcoll_id): + def __init__(self, underlying_result, pin): super(PipelineResult, self).__init__(underlying_result.state) -self._runner = runner -self._pipeline_info = pipeline_info -self._cache_manager = cache_manager -self._pcolls_to_pcoll_id = pcolls_to_pcoll_id - - def _cache_label(self, pcoll): -
[jira] [Work logged] (BEAM-8016) Render Beam Pipeline as DOT with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-8016?focusedWorklogId=345737&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345737 ] ASF GitHub Bot logged work on BEAM-8016: Author: ASF GitHub Bot Created on: 19/Nov/19 01:07 Start Date: 19/Nov/19 01:07 Worklog Time Spent: 10m Work Description: KevinGG commented on pull request #10132: [BEAM-8016] Pipeline Graph URL: https://github.com/apache/beam/pull/10132#discussion_r347687605 ## File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py ## @@ -337,6 +358,7 @@ def cacheables(pcolls_to_pcoll_id): """ pcoll_version_map = {} cacheables = {} + cacheable_var_by_pcoll_id = {} Review comment: Yes, it is actually PCollections that need to be cached. I thought it was a little bit wordy and chose the shorter form. I'll rephrase it to "PCollections that need to be cached" instead for clarity. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345737) Time Spent: 2h 50m (was: 2h 40m) > Render Beam Pipeline as DOT with Interactive Beam > --- > > Key: BEAM-8016 > URL: https://issues.apache.org/jira/browse/BEAM-8016 > Project: Beam > Issue Type: Improvement > Components: runner-py-interactive >Reporter: Ning Kang >Assignee: Ning Kang >Priority: Major > Time Spent: 2h 50m > Remaining Estimate: 0h > > With work in https://issues.apache.org/jira/browse/BEAM-7760, Beam pipeline > converted to DOT then rendered should mark user defined variables on edges. > With work in https://issues.apache.org/jira/browse/BEAM-7926, it might be > redundant or confusing to render arbitrary random sample PCollection data on > edges. > We'll also make sure edges in the graph corresponds to output -> input > relationship in the user defined pipeline. 
[jira] [Work logged] (BEAM-8629) WithTypeHints._get_or_create_type_hints may return a mutable copy of the class type hints.
[ https://issues.apache.org/jira/browse/BEAM-8629?focusedWorklogId=345734&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345734 ] ASF GitHub Bot logged work on BEAM-8629: Author: ASF GitHub Bot Created on: 19/Nov/19 01:02 Start Date: 19/Nov/19 01:02 Worklog Time Spent: 10m Work Description: robertwb commented on issue #10080: [BEAM-8629] Don't return mutable class type hints. URL: https://github.com/apache/beam/pull/10080#issuecomment-555282487 Ping @udim This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345734) Time Spent: 20m (was: 10m) > WithTypeHints._get_or_create_type_hints may return a mutable copy of the > class type hints. > -- > > Key: BEAM-8629 > URL: https://issues.apache.org/jira/browse/BEAM-8629 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Robert Bradshaw >Assignee: Robert Bradshaw >Priority: Major > Time Spent: 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
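The hazard BEAM-8629 describes, a lazy accessor handing back shared class-level state instead of a per-instance copy, can be shown with a generic illustration. This is not Beam's actual `WithTypeHints` code; the class and method names below are invented for the example:

```python
class WithHints:
    """Generic illustration (not Beam's WithTypeHints) of BEAM-8629's bug."""
    _class_hints = {'input': 'Any'}  # shared, class-level default hints

    def get_hints_buggy(self):
        # Returns the shared class-level dict: any caller mutation silently
        # changes the hints seen by every other instance of the class.
        return self._class_hints

    def get_hints_fixed(self):
        # Lazily create an instance-level copy, cache it, and return that,
        # so mutation stays local to this instance.
        if 'hints' not in self.__dict__:
            self.hints = dict(self._class_hints)
        return self.hints

a, b = WithHints(), WithHints()
a.get_hints_fixed()['output'] = 'int'  # mutates only a's copy
assert 'output' not in b.get_hints_fixed()
```

The fix in the linked PR follows the same principle: never return the class-level hints object where the caller might mutate it.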
[jira] [Work logged] (BEAM-8016) Render Beam Pipeline as DOT with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-8016?focusedWorklogId=345735&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345735 ] ASF GitHub Bot logged work on BEAM-8016: Author: ASF GitHub Bot Created on: 19/Nov/19 01:02 Start Date: 19/Nov/19 01:02 Worklog Time Spent: 10m Work Description: KevinGG commented on pull request #10132: [BEAM-8016] Pipeline Graph URL: https://github.com/apache/beam/pull/10132#discussion_r347686447 ## File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py ## @@ -314,10 +316,29 @@ def cache_key(self, pcoll): cacheable['producer_version'])) return '' + def cacheable_var_by_pcoll_id(self, pcoll_id): +"""Retrieves the variable name of a PCollection. + +In source code, PCollection variables are defined in the user pipeline. When +it's converted to the runner api representation, each PCollection referenced +in the user pipeline is assigned a unique-within-pipeline pcoll_id. Given +such pcoll_id, retrieves the str variable name defined in user pipeline for +that referenced PCollection. If the PCollection is anonymous, return ''. +""" +return self._cacheable_var_by_pcoll_id.get(pcoll_id, '') + def pin(pipeline, options=None): - """Creates PipelineInstrument for a pipeline and its options with cache.""" + """Creates PipelineInstrument for a pipeline and its options with cache. + + This is the shorthand for doing 3 steps: 1) compute once for metadata of given Review comment: Thanks, done. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345735) Time Spent: 2.5h (was: 2h 20m)
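The lookup documented in `cacheable_var_by_pcoll_id` (runner-API pcoll_id → user-defined variable name, with `''` for anonymous PCollections) can be sketched independently of Beam. The helper and its fake inputs below are assumptions for illustration, not the real `PipelineInstrument` API:

```python
def cacheable_var_by_pcoll_id(watched, pcolls_to_pcoll_id):
    """Map each runner-API pcoll_id to the user variable name that holds it.

    `watched` is a {variable_name: pcollection} namespace (what Interactive
    Beam "watches" from the user's pipeline); `pcolls_to_pcoll_id` maps
    str(pcollection) to its unique-within-pipeline pcoll_id. Anonymous
    PCollections never appear, so callers use .get(pcoll_id, '').
    """
    var_by_id = {}
    for var_name, pcoll in watched.items():
        pcoll_id = pcolls_to_pcoll_id.get(str(pcoll))
        if pcoll_id is not None:
            var_by_id[pcoll_id] = var_name
    return var_by_id

class FakePColl:  # stand-in for apache_beam.pvalue.PCollection
    def __init__(self, name):
        self._name = name
    def __str__(self):
        return self._name

squares = FakePColl('PCollection[Square.None]')
mapping = cacheable_var_by_pcoll_id(
    {'squares': squares}, {str(squares): 'ref_PCollection_1'})
```

Looking up a known id returns the variable name (`'squares'`); any unknown or anonymous id falls back to `''` via `mapping.get(pcoll_id, '')`, matching the docstring added in the diff above.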
[jira] [Work logged] (BEAM-8016) Render Beam Pipeline as DOT with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-8016?focusedWorklogId=345736&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345736 ] ASF GitHub Bot logged work on BEAM-8016: Author: ASF GitHub Bot Created on: 19/Nov/19 01:02 Start Date: 19/Nov/19 01:02 Worklog Time Spent: 10m Work Description: KevinGG commented on pull request #10132: [BEAM-8016] Pipeline Graph URL: https://github.com/apache/beam/pull/10132#discussion_r347686453 ## File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py ## @@ -314,10 +316,29 @@ def cache_key(self, pcoll): cacheable['producer_version'])) return '' + def cacheable_var_by_pcoll_id(self, pcoll_id): +"""Retrieves the variable name of a PCollection. + +In source code, PCollection variables are defined in the user pipeline. When +it's converted to the runner api representation, each PCollection referenced +in the user pipeline is assigned a unique-within-pipeline pcoll_id. Given +such pcoll_id, retrieves the str variable name defined in user pipeline for +that referenced PCollection. If the PCollection is anonymous, return ''. +""" +return self._cacheable_var_by_pcoll_id.get(pcoll_id, '') + def pin(pipeline, options=None): - """Creates PipelineInstrument for a pipeline and its options with cache.""" + """Creates PipelineInstrument for a pipeline and its options with cache. + + This is the shorthand for doing 3 steps: 1) compute once for metadata of given + runner pipeline and everything watched from user pipelines; 2) associate info + between runner pipeline and its corresponding user pipeline, eliminate data Review comment: Thanks, done. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345736) Time Spent: 2h 40m (was: 2.5h)
[jira] [Work logged] (BEAM-8016) Render Beam Pipeline as DOT with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-8016?focusedWorklogId=345732&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345732 ] ASF GitHub Bot logged work on BEAM-8016: Author: ASF GitHub Bot Created on: 19/Nov/19 00:52 Start Date: 19/Nov/19 00:52 Worklog Time Spent: 10m Work Description: KevinGG commented on pull request #10132: [BEAM-8016] Pipeline Graph URL: https://github.com/apache/beam/pull/10132#discussion_r347684208 ## File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py ## @@ -80,7 +80,10 @@ def __init__(self, pipeline, options=None): # A mapping from PCollection id to python id() value in user defined # pipeline instance. (self._pcoll_version_map, - self._cacheables) = cacheables(self.pcolls_to_pcoll_id) + self._cacheables, Review comment: Yes, I agree! It's similar to what pipeline_analyzer was doing but using `PipelineVisitor` and some `globals` instead of pipeline protos. We haven't achieved a consensus about what this should be called. We also have `cache augmented pipeline` and etc.. But none of the names describe what the module does accurately because the implementation details cover so many different execution routes and scenarios. We'll definitely get back to the documentation here once we make the final naming decision. Thanks! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345732) Time Spent: 2h 20m (was: 2h 10m) > Render Beam Pipeline as DOT with Interactive Beam > --- > > Key: BEAM-8016 > URL: https://issues.apache.org/jira/browse/BEAM-8016 > Project: Beam > Issue Type: Improvement > Components: runner-py-interactive >Reporter: Ning Kang >Assignee: Ning Kang >Priority: Major > Time Spent: 2h 20m > Remaining Estimate: 0h > > With work in https://issues.apache.org/jira/browse/BEAM-7760, Beam pipeline > converted to DOT then rendered should mark user defined variables on edges. > With work in https://issues.apache.org/jira/browse/BEAM-7926, it might be > redundant or confusing to render arbitrary random sample PCollection data on > edges. > We'll also make sure edges in the graph corresponds to output -> input > relationship in the user defined pipeline. Each edge is one output. If > multiple down stream inputs take the same output, it should be rendered as > one edge diverging into two instead of two edges. > For advanced interactivity highlight where each execution highlights the part > of the pipeline really executed from the original pipeline, we'll also > provide the support in beta. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8739) Consistently use with Pipeline(...) syntax
[ https://issues.apache.org/jira/browse/BEAM-8739?focusedWorklogId=345726&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345726 ] ASF GitHub Bot logged work on BEAM-8739: Author: ASF GitHub Bot Created on: 19/Nov/19 00:51 Start Date: 19/Nov/19 00:51 Worklog Time Spent: 10m Work Description: robertwb commented on pull request #10149: [BEAM-8739] Consistently use with Pipeline(...) syntax URL: https://github.com/apache/beam/pull/10149 Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily: - [ ] [**Choose reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and mention them in a comment (`R: @username`). - [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf). See the [Contributor Guide](https://beam.apache.org/contribute) for more tips on [how to make review process smoother](https://beam.apache.org/contribute/#make-reviewers-job-easier). 
[jira] [Created] (BEAM-8739) Consistently use with Pipeline(...) syntax
Robert Bradshaw created BEAM-8739: - Summary: Consistently use with Pipeline(...) syntax Key: BEAM-8739 URL: https://issues.apache.org/jira/browse/BEAM-8739 Project: Beam Issue Type: Bug Components: sdk-py-core Reporter: Robert Bradshaw I've run into a couple of tests that forgot to do p.run(). In addition, I'm seeing new tests written in this old style. We should consistently use the with syntax where possible for our examples and tests. -- This message was sent by Atlassian Jira (v8.3.4#803005)
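The issue above argues for the `with` form because it cannot forget `p.run()`. A stand-alone sketch (deliberately not importing `apache_beam`; this toy `Pipeline` only mimics the relevant context-manager semantics) contrasting the two styles:

```python
# Toy Pipeline illustrating why `with Pipeline(...)` is preferred: the
# context manager guarantees run() on normal exit, while the explicit
# style silently does nothing if run() is forgotten.
class Pipeline:
    def __init__(self):
        self.ran = False

    def run(self):
        self.ran = True
        return self

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            self.run()  # run automatically when the block exits cleanly

# Old style: run() must be called explicitly; tests sometimes forget it.
p = Pipeline()
# ... apply transforms ...
p.run()

# Preferred style: run() happens automatically at the end of the block.
with Pipeline() as p2:
    pass  # ... apply transforms ...
```

Beam's real `Pipeline` follows the same pattern, which is why converting tests to the `with` form fixes the forgotten-`run()` class of bugs.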
[jira] [Work logged] (BEAM-8016) Render Beam Pipeline as DOT with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-8016?focusedWorklogId=345724&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345724 ] ASF GitHub Bot logged work on BEAM-8016: Author: ASF GitHub Bot Created on: 19/Nov/19 00:44 Start Date: 19/Nov/19 00:44 Worklog Time Spent: 10m Work Description: KevinGG commented on pull request #10132: [BEAM-8016] Pipeline Graph URL: https://github.com/apache/beam/pull/10132#discussion_r347682578 ## File path: sdks/python/apache_beam/runners/interactive/interactive_runner.py ## @@ -117,123 +119,43 @@ def apply(self, transform, pvalueish, options): return self._underlying_runner.apply(transform, pvalueish, options) def run_pipeline(self, pipeline, options): -if not hasattr(self, '_desired_cache_labels'): - self._desired_cache_labels = set() - -# Invoke a round trip through the runner API. This makes sure the Pipeline -# proto is stable. -pipeline = beam.pipeline.Pipeline.from_runner_api( -pipeline.to_runner_api(use_fake_coders=True), -pipeline.runner, -options) - -# Snapshot the pipeline in a portable proto before mutating it. -pipeline_proto, original_context = pipeline.to_runner_api( -return_context=True, use_fake_coders=True) -pcolls_to_pcoll_id = self._pcolls_to_pcoll_id(pipeline, original_context) - -analyzer = pipeline_analyzer.PipelineAnalyzer(self._cache_manager, - pipeline_proto, - self._underlying_runner, - options, - self._desired_cache_labels) -# Should be only accessed for debugging purpose. 
-self._analyzer = analyzer +pin = inst.pin(pipeline, options) pipeline_to_execute = beam.pipeline.Pipeline.from_runner_api( -analyzer.pipeline_proto_to_execute(), +pin.instrumented_pipeline_proto(), self._underlying_runner, options) if not self._skip_display: - display = display_manager.DisplayManager( - pipeline_proto=pipeline_proto, - pipeline_analyzer=analyzer, - cache_manager=self._cache_manager, - pipeline_graph_renderer=self._renderer) - display.start_periodic_update() + pg = pipeline_graph.PipelineGraph(pin.original_pipeline, Review comment: Sure! I'll use the full name `pipeline_graph`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345724) Time Spent: 2h 10m (was: 2h) > Render Beam Pipeline as DOT with Interactive Beam > --- > > Key: BEAM-8016 > URL: https://issues.apache.org/jira/browse/BEAM-8016 > Project: Beam > Issue Type: Improvement > Components: runner-py-interactive >Reporter: Ning Kang >Assignee: Ning Kang >Priority: Major > Time Spent: 2h 10m > Remaining Estimate: 0h > > With work in https://issues.apache.org/jira/browse/BEAM-7760, Beam pipeline > converted to DOT then rendered should mark user defined variables on edges. > With work in https://issues.apache.org/jira/browse/BEAM-7926, it might be > redundant or confusing to render arbitrary random sample PCollection data on > edges. > We'll also make sure edges in the graph corresponds to output -> input > relationship in the user defined pipeline. Each edge is one output. If > multiple down stream inputs take the same output, it should be rendered as > one edge diverging into two instead of two edges. 
> For advanced interactivity highlight where each execution highlights the part > of the pipeline really executed from the original pipeline, we'll also > provide the support in beta. -- This message was sent by Atlassian Jira (v8.3.4#803005)
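The diff in the message above collapses the separate analyzer, display manager, and cache-label bookkeeping into one `pin(...)` call. A condensed, hypothetical view of that control flow (names like `FakeInstrument` are stand-ins, not the real interactive runner API):

```python
# Condensed sketch of the refactored run_pipeline: one instrument object
# produced by a pin()-style shorthand supplies the instrumented proto,
# replacing the multi-object setup that the diff deletes.
class FakeInstrument:
    def __init__(self, pipeline):
        self.original_pipeline = pipeline  # kept for graph rendering

    def instrumented_pipeline_proto(self):
        # In the real code this returns a mutated runner-api proto.
        return ('instrumented', self.original_pipeline)

def run_pipeline(pipeline):
    pin = FakeInstrument(pipeline)           # single instrumentation step
    proto = pin.instrumented_pipeline_proto()
    # ... hand proto to the underlying runner, render pin.original_pipeline ...
    return proto
```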
[jira] [Work logged] (BEAM-8575) Add more Python validates runner tests
[ https://issues.apache.org/jira/browse/BEAM-8575?focusedWorklogId=345723&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345723 ] ASF GitHub Bot logged work on BEAM-8575: Author: ASF GitHub Bot Created on: 19/Nov/19 00:43 Start Date: 19/Nov/19 00:43 Worklog Time Spent: 10m Work Description: liumomo315 commented on issue #10145: [BEAM-8575] Add a Python test to test windowing in DoFn finish_bundle() URL: https://github.com/apache/beam/pull/10145#issuecomment-555277935 Thanks for the quick review! >move the timestamp assignment to beam.Create() This can be done, but does not make much difference, since we still want both the process() and finish_bundle() to do something in this test, and see the reason below. >and combine the map function into the ParDo. I think it's clearer to separate the Map from the test DoFn. The purpose of this test is to verify that after a DoFn with finish_bundle() implemented, it will produce results both from process() and finish_bundle(). More specifically, it wants to make sure that when windowing is involved, the output will be correct after the DoFn. The last Map is simply to print out all outputs from the test DoFn. Thoughts? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345723) Time Spent: 10h 50m (was: 10h 40m) > Add more Python validates runner tests > -- > > Key: BEAM-8575 > URL: https://issues.apache.org/jira/browse/BEAM-8575 > Project: Beam > Issue Type: Test > Components: sdk-py-core, testing >Reporter: wendy liu >Assignee: wendy liu >Priority: Major > Time Spent: 10h 50m > Remaining Estimate: 0h > > This is the umbrella issue to track the work of adding more Python tests to > improve test coverage. -- This message was sent by Atlassian Jira (v8.3.4#803005)
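The pattern under discussion — a DoFn whose downstream sees output from both `process()` and `finish_bundle()` — can be sketched without an `apache_beam` dependency. This stand-in omits windowing (in Beam, `finish_bundle()` must emit `WindowedValue`s) but shows the two output paths:

```python
# Stand-in for a Beam DoFn with finish_bundle(): process() emits per-element
# results, finish_bundle() emits a final per-bundle result, and a bundle
# runner collects output from both.
class MyDoFn:
    def process(self, element):
        yield element * 2

    def finish_bundle(self):
        yield 'finish'  # Beam would require a WindowedValue here

def run_bundle(dofn, elements):
    out = []
    for e in elements:
        out.extend(dofn.process(e))   # per-element outputs
    out.extend(dofn.finish_bundle())  # bundle-final outputs
    return out
```

The test being reviewed asserts the Beam analogue of this: after such a DoFn, the PCollection contains both kinds of output with correct window assignment.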
[jira] [Work logged] (BEAM-8016) Render Beam Pipeline as DOT with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-8016?focusedWorklogId=345720&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345720 ] ASF GitHub Bot logged work on BEAM-8016: Author: ASF GitHub Bot Created on: 19/Nov/19 00:37 Start Date: 19/Nov/19 00:37 Worklog Time Spent: 10m Work Description: KevinGG commented on issue #10132: [BEAM-8016] Pipeline Graph URL: https://github.com/apache/beam/pull/10132#issuecomment-555276401 Thanks for the comments! > * Are we getting rid of the tooltips displaying the intermediate results? Do they not fit in the new model? We are offering a `show` API to users so that they can visualize a larger set of their data dynamically instead of peeking through a random static sample (which we still offer if the user calls `show` in an ipython terminal not a notebook web frontend). And with new Beam pipeline graph proposal and some pipeline graph library WIP, the tooltip in the future might show other metadata such as elapse/throughput to provide a consistent user experience that is similar to what users have on Dataflow. > * What do the PCollections look like if the user did not specify the PCollection name as a variable? Those PCollections will not be cached. The idea is when building a pipeline, if the user does not assign a PCollection to a variable, they would not be able to build the pipeline further from it and they cannot invoke `show(pcoll)` because they don't have access to `pcoll` in their code. Before, we have had the `leaf pcollection` concept for PCollections who have never been used as inputs. It doesn't work for PCollections consumed by sinks (with input but no output) even if the user has assigned them to a variable and they look like hanging PCollections. It also doesn't work in a notebook environment where new transforms can be added at different locations and PCollections can be re-evaluated due to cell-re-execution. This is an automated message from the Apache Git Service. 
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345720) Time Spent: 2h (was: 1h 50m) > Render Beam Pipeline as DOT with Interactive Beam > --- > > Key: BEAM-8016 > URL: https://issues.apache.org/jira/browse/BEAM-8016 > Project: Beam > Issue Type: Improvement > Components: runner-py-interactive >Reporter: Ning Kang >Assignee: Ning Kang >Priority: Major > Time Spent: 2h > Remaining Estimate: 0h > > With work in https://issues.apache.org/jira/browse/BEAM-7760, Beam pipeline > converted to DOT then rendered should mark user defined variables on edges. > With work in https://issues.apache.org/jira/browse/BEAM-7926, it might be > redundant or confusing to render arbitrary random sample PCollection data on > edges. > We'll also make sure edges in the graph corresponds to output -> input > relationship in the user defined pipeline. Each edge is one output. If > multiple down stream inputs take the same output, it should be rendered as > one edge diverging into two instead of two edges. > For advanced interactivity highlight where each execution highlights the part > of the pipeline really executed from the original pipeline, we'll also > provide the support in beta. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8575) Add more Python validates runner tests
[ https://issues.apache.org/jira/browse/BEAM-8575?focusedWorklogId=345715&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345715 ] ASF GitHub Bot logged work on BEAM-8575: Author: ASF GitHub Bot Created on: 19/Nov/19 00:30 Start Date: 19/Nov/19 00:30 Worklog Time Spent: 10m Work Description: bumblebee-coming commented on pull request #10070: [BEAM-8575] Added a unit test for Reshuffle to test that Reshuffle pr… URL: https://github.com/apache/beam/pull/10070#discussion_r347679000 ## File path: sdks/python/apache_beam/transforms/util_test.py ## @@ -276,6 +279,13 @@ def process(self, element): with self.assertRaisesRegex(ValueError, r'window.*None.*add_timestamps2'): pipeline.run() +class AddTimestamp(beam.DoFn): Review comment: Done. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345715) Time Spent: 10h 20m (was: 10h 10m) > Add more Python validates runner tests > -- > > Key: BEAM-8575 > URL: https://issues.apache.org/jira/browse/BEAM-8575 > Project: Beam > Issue Type: Test > Components: sdk-py-core, testing >Reporter: wendy liu >Assignee: wendy liu >Priority: Major > Time Spent: 10h 20m > Remaining Estimate: 0h > > This is the umbrella issue to track the work of adding more Python tests to > improve test coverage. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8575) Add more Python validates runner tests
[ https://issues.apache.org/jira/browse/BEAM-8575?focusedWorklogId=345716&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345716 ] ASF GitHub Bot logged work on BEAM-8575: Author: ASF GitHub Bot Created on: 19/Nov/19 00:30 Start Date: 19/Nov/19 00:30 Worklog Time Spent: 10m Work Description: bumblebee-coming commented on pull request #10070: [BEAM-8575] Added a unit test for Reshuffle to test that Reshuffle pr… URL: https://github.com/apache/beam/pull/10070#discussion_r347679045 ## File path: sdks/python/apache_beam/transforms/util_test.py ## @@ -276,6 +279,13 @@ def process(self, element): with self.assertRaisesRegex(ValueError, r'window.*None.*add_timestamps2'): pipeline.run() +class AddTimestamp(beam.DoFn): + def process(self, element, timestamp=beam.DoFn.TimestampParam): +yield beam.window.TimestampedValue(element, timestamp) + +class GetTimestamp(beam.DoFn): + def process(self, element, timestamp=beam.DoFn.TimestampParam): +yield '{} - {}'.format(timestamp, element['name']) Review comment: Done. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345716) Time Spent: 10.5h (was: 10h 20m) > Add more Python validates runner tests > -- > > Key: BEAM-8575 > URL: https://issues.apache.org/jira/browse/BEAM-8575 > Project: Beam > Issue Type: Test > Components: sdk-py-core, testing >Reporter: wendy liu >Assignee: wendy liu >Priority: Major > Time Spent: 10.5h > Remaining Estimate: 0h > > This is the umbrella issue to track the work of adding more Python tests to > improve test coverage. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8575) Add more Python validates runner tests
[ https://issues.apache.org/jira/browse/BEAM-8575?focusedWorklogId=345713&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345713 ] ASF GitHub Bot logged work on BEAM-8575: Author: ASF GitHub Bot Created on: 19/Nov/19 00:29 Start Date: 19/Nov/19 00:29 Worklog Time Spent: 10m Work Description: bumblebee-coming commented on pull request #10070: [BEAM-8575] Added a unit test for Reshuffle to test that Reshuffle pr… URL: https://github.com/apache/beam/pull/10070#discussion_r347678737 ## File path: sdks/python/apache_beam/transforms/util_test.py ## @@ -276,6 +279,13 @@ def process(self, element): with self.assertRaisesRegex(ValueError, r'window.*None.*add_timestamps2'): pipeline.run() +class AddTimestamp(beam.DoFn): + def process(self, element, timestamp=beam.DoFn.TimestampParam): +yield beam.window.TimestampedValue(element, timestamp) Review comment: My understanding is that it should be a no-op, if Reshuffle preserves timestamps. This is what this test is testing. Its Java parity is the testReshufflePreservesTimestamps in file beam/sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/ReshuffleTest.java which wrapped the string element into TimestampedValue twice. The first time the element becomes TimestampedValue; the second time the element becomes TimestampedValue>. Python doesn't have nested TimestampedValue type and doesn't have getTimestamp() either, so I used beam.DoFn.TimestampParam to get the timestamp twice, before and after Reshuffle. Assuming beam.DoFn.TimestampParam is always the current timestamp bounded with an element, then it should work. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345713) Time Spent: 10h (was: 9h 50m) > Add more Python validates runner tests > -- > > Key: BEAM-8575 > URL: https://issues.apache.org/jira/browse/BEAM-8575 > Project: Beam > Issue Type: Test > Components: sdk-py-core, testing >Reporter: wendy liu >Assignee: wendy liu >Priority: Major > Time Spent: 10h > Remaining Estimate: 0h > > This is the umbrella issue to track the work of adding more Python tests to > improve test coverage. -- This message was sent by Atlassian Jira (v8.3.4#803005)
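The AddTimestamp/GetTimestamp pair in the snippet above captures an element's timestamp before and after Reshuffle so the two can be compared. A dependency-free sketch of that check (plain tuples stand in for `TimestampedValue` and `beam.DoFn.TimestampParam`):

```python
# Stand-ins for the two DoFns above: add_timestamp re-wraps an element with
# its current timestamp, and get_timestamp formats "<timestamp> - <name>"
# so equal labels before and after a reshuffle prove timestamps survived.
def add_timestamp(element, timestamp):
    return (element, timestamp)  # TimestampedValue stand-in

def get_timestamp(pair):
    element, timestamp = pair
    return '{} - {}'.format(timestamp, element['name'])

records = [({'name': 'foo'}, 1.0), ({'name': 'bar'}, 2.0)]
labels = [get_timestamp(add_timestamp(e, t)) for e, t in records]
```

In the real test, the same labels computed after `beam.Reshuffle()` are asserted equal to these, mirroring Java's testReshufflePreservesTimestamps.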
[jira] [Work logged] (BEAM-8575) Add more Python validates runner tests
[ https://issues.apache.org/jira/browse/BEAM-8575?focusedWorklogId=345712&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345712 ] ASF GitHub Bot logged work on BEAM-8575: Author: ASF GitHub Bot Created on: 19/Nov/19 00:28 Start Date: 19/Nov/19 00:28 Worklog Time Spent: 10m Work Description: y1chi commented on issue #10145: [BEAM-8575] Add a Python test to test windowing in DoFn finish_bundle() URL: https://github.com/apache/beam/pull/10145#issuecomment-555273207 LGTM, I wonder would it be better to move the timestamp assignment to beam.Create() and combine the map function into the MyDoFn. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345712) Time Spent: 9h 50m (was: 9h 40m) > Add more Python validates runner tests > -- > > Key: BEAM-8575 > URL: https://issues.apache.org/jira/browse/BEAM-8575 > Project: Beam > Issue Type: Test > Components: sdk-py-core, testing >Reporter: wendy liu >Assignee: wendy liu >Priority: Major > Time Spent: 9h 50m > Remaining Estimate: 0h > > This is the umbrella issue to track the work of adding more Python tests to > improve test coverage. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-6756) Support lazy iterables in schemas
[ https://issues.apache.org/jira/browse/BEAM-6756?focusedWorklogId=345710&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345710 ] ASF GitHub Bot logged work on BEAM-6756: Author: ASF GitHub Bot Created on: 19/Nov/19 00:28 Start Date: 19/Nov/19 00:28 Worklog Time Spent: 10m Work Description: reuvenlax commented on pull request #10003: [BEAM-6756] Create Iterable type for Schema URL: https://github.com/apache/beam/pull/10003#discussion_r347678482 ## File path: sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryUtils.java ## @@ -292,7 +293,7 @@ private static Schema fromTableFieldSchema(List tableFieldSche if (!schemaField.getType().getNullable()) { field.setMode(Mode.REQUIRED.toString()); } - if (TypeName.ARRAY == type.getTypeName()) { + if (type.getTypeName().isCollectionType()) { Review comment: I did another deep search - I found one case in SQL's codegen which I fixed, but can't find anything else. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345710) Time Spent: 2h (was: 1h 50m) > Support lazy iterables in schemas > - > > Key: BEAM-6756 > URL: https://issues.apache.org/jira/browse/BEAM-6756 > Project: Beam > Issue Type: Sub-task > Components: sdk-java-core >Reporter: Reuven Lax >Assignee: Reuven Lax >Priority: Major > Time Spent: 2h > Remaining Estimate: 0h > > The iterables returned by GroupByKey and CoGroupByKey are lazy; this allows a > runner to page data into memory if the full iterable is too large. We > currently don't support this in Schemas, so the Schema Group and CoGroup > transforms materialize all data into memory. We should add support for this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-6756) Support lazy iterables in schemas
[ https://issues.apache.org/jira/browse/BEAM-6756?focusedWorklogId=345711&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345711 ] ASF GitHub Bot logged work on BEAM-6756: Author: ASF GitHub Bot Created on: 19/Nov/19 00:28 Start Date: 19/Nov/19 00:28 Worklog Time Spent: 10m Work Description: reuvenlax commented on pull request #10003: [BEAM-6756] Create Iterable type for Schema URL: https://github.com/apache/beam/pull/10003#discussion_r347678490 ## File path: sdks/java/core/src/main/java/org/apache/beam/sdk/values/Row.java ## @@ -450,28 +469,28 @@ static int deepHashCodeForMap( return h; } -static boolean deepEqualsForList(List a, List b, Schema.FieldType elementType) { +static boolean deepEqualsForIterable( +Iterable a, Iterable b, Schema.FieldType elementType) { if (a == b) { return true; } - if (a.size() != b.size()) { -return false; - } Review comment: good catch - fixed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345711) Time Spent: 2h 10m (was: 2h) > Support lazy iterables in schemas > - > > Key: BEAM-6756 > URL: https://issues.apache.org/jira/browse/BEAM-6756 > Project: Beam > Issue Type: Sub-task > Components: sdk-java-core >Reporter: Reuven Lax >Assignee: Reuven Lax >Priority: Major > Time Spent: 2h 10m > Remaining Estimate: 0h > > The iterables returned by GroupByKey and CoGroupByKey are lazy; this allows a > runner to page data into memory if the full iterable is too large. We > currently don't support this in Schemas, so the Schema Group and CoGroup > transforms materialize all data into memory. We should add support for this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8575) Add more Python validates runner tests
[ https://issues.apache.org/jira/browse/BEAM-8575?focusedWorklogId=345708&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345708 ] ASF GitHub Bot logged work on BEAM-8575: Author: ASF GitHub Bot Created on: 19/Nov/19 00:25 Start Date: 19/Nov/19 00:25 Worklog Time Spent: 10m Work Description: y1chi commented on issue #10145: [BEAM-8575] Add a Python test to test windowing in DoFn finish_bundle() URL: https://github.com/apache/beam/pull/10145#issuecomment-555273207 LGTM, I wonder would it be better to move the timestamp assignment to beam.Create() and combine the map function into the ParDo. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345708) Time Spent: 9h 40m (was: 9.5h) > Add more Python validates runner tests > -- > > Key: BEAM-8575 > URL: https://issues.apache.org/jira/browse/BEAM-8575 > Project: Beam > Issue Type: Test > Components: sdk-py-core, testing >Reporter: wendy liu >Assignee: wendy liu >Priority: Major > Time Spent: 9h 40m > Remaining Estimate: 0h > > This is the umbrella issue to track the work of adding more Python tests to > improve test coverage. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-6756) Support lazy iterables in schemas
[ https://issues.apache.org/jira/browse/BEAM-6756?focusedWorklogId=345706&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345706 ] ASF GitHub Bot logged work on BEAM-6756: Author: ASF GitHub Bot Created on: 19/Nov/19 00:24 Start Date: 19/Nov/19 00:24 Worklog Time Spent: 10m Work Description: reuvenlax commented on issue #10003: [BEAM-6756] Create Iterable type for Schema URL: https://github.com/apache/beam/pull/10003#issuecomment-555273059 run dataflow validatesrunner This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345706) Time Spent: 1h 40m (was: 1.5h) > Support lazy iterables in schemas > - > > Key: BEAM-6756 > URL: https://issues.apache.org/jira/browse/BEAM-6756 > Project: Beam > Issue Type: Sub-task > Components: sdk-java-core >Reporter: Reuven Lax >Assignee: Reuven Lax >Priority: Major > Time Spent: 1h 40m > Remaining Estimate: 0h > > The iterables returned by GroupByKey and CoGroupByKey are lazy; this allows a > runner to page data into memory if the full iterable is too large. We > currently don't support this in Schemas, so the Schema Group and CoGroup > transforms materialize all data into memory. We should add support for this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-6756) Support lazy iterables in schemas
[ https://issues.apache.org/jira/browse/BEAM-6756?focusedWorklogId=345707&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345707 ] ASF GitHub Bot logged work on BEAM-6756: Author: ASF GitHub Bot Created on: 19/Nov/19 00:24 Start Date: 19/Nov/19 00:24 Worklog Time Spent: 10m Work Description: reuvenlax commented on issue #10003: [BEAM-6756] Create Iterable type for Schema URL: https://github.com/apache/beam/pull/10003#issuecomment-555273094 run sql postcommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345707) Time Spent: 1h 50m (was: 1h 40m) > Support lazy iterables in schemas > - > > Key: BEAM-6756 > URL: https://issues.apache.org/jira/browse/BEAM-6756 > Project: Beam > Issue Type: Sub-task > Components: sdk-java-core >Reporter: Reuven Lax >Assignee: Reuven Lax >Priority: Major > Time Spent: 1h 50m > Remaining Estimate: 0h > > The iterables returned by GroupByKey and CoGroupByKey are lazy; this allows a > runner to page data into memory if the full iterable is too large. We > currently don't support this in Schemas, so the Schema Group and CoGroup > transforms materialize all data into memory. We should add support for this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8016) Render Beam Pipeline as DOT with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-8016?focusedWorklogId=345699&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345699 ] ASF GitHub Bot logged work on BEAM-8016: Author: ASF GitHub Bot Created on: 19/Nov/19 00:22 Start Date: 19/Nov/19 00:22 Worklog Time Spent: 10m Work Description: qinyeli commented on pull request #10132: [BEAM-8016] Pipeline Graph URL: https://github.com/apache/beam/pull/10132#discussion_r347667912 ## File path: sdks/python/apache_beam/runners/interactive/interactive_runner.py ## @@ -117,123 +119,43 @@ def apply(self, transform, pvalueish, options): return self._underlying_runner.apply(transform, pvalueish, options) def run_pipeline(self, pipeline, options): -if not hasattr(self, '_desired_cache_labels'): - self._desired_cache_labels = set() - -# Invoke a round trip through the runner API. This makes sure the Pipeline -# proto is stable. -pipeline = beam.pipeline.Pipeline.from_runner_api( -pipeline.to_runner_api(use_fake_coders=True), -pipeline.runner, -options) - -# Snapshot the pipeline in a portable proto before mutating it. -pipeline_proto, original_context = pipeline.to_runner_api( -return_context=True, use_fake_coders=True) -pcolls_to_pcoll_id = self._pcolls_to_pcoll_id(pipeline, original_context) - -analyzer = pipeline_analyzer.PipelineAnalyzer(self._cache_manager, - pipeline_proto, - self._underlying_runner, - options, - self._desired_cache_labels) -# Should be only accessed for debugging purpose. 
-self._analyzer = analyzer +pin = inst.pin(pipeline, options) pipeline_to_execute = beam.pipeline.Pipeline.from_runner_api( -analyzer.pipeline_proto_to_execute(), +pin.instrumented_pipeline_proto(), self._underlying_runner, options) if not self._skip_display: - display = display_manager.DisplayManager( - pipeline_proto=pipeline_proto, - pipeline_analyzer=analyzer, - cache_manager=self._cache_manager, - pipeline_graph_renderer=self._renderer) - display.start_periodic_update() + pg = pipeline_graph.PipelineGraph(pin.original_pipeline, +render_option=self._render_option) + pg.display_graph() result = pipeline_to_execute.run() result.wait_until_finish() -if not self._skip_display: - display.stop_periodic_update() - -return PipelineResult(result, self, self._analyzer.pipeline_info(), - self._cache_manager, pcolls_to_pcoll_id) - - def _pcolls_to_pcoll_id(self, pipeline, original_context): -"""Returns a dict mapping PCollections string to PCollection IDs. - -Using a PipelineVisitor to iterate over every node in the pipeline, -records the mapping from PCollections to PCollections IDs. This mapping -will be used to query cached PCollections. - -Args: - pipeline: (pipeline.Pipeline) - original_context: (pipeline_context.PipelineContext) - -Returns: - (dict from str to str) a dict mapping str(pcoll) to pcoll_id. -""" -pcolls_to_pcoll_id = {} - -from apache_beam.pipeline import PipelineVisitor # pylint: disable=import-error - -class PCollVisitor(PipelineVisitor): # pylint: disable=used-before-assignment - A visitor that records input and output values to be replaced. - - Input and output values that should be updated are recorded in maps - input_replacements and output_replacements respectively. - - We cannot update input and output values while visiting since that - results in validation errors. 
- """ - - def enter_composite_transform(self, transform_node): -self.visit_transform(transform_node) - - def visit_transform(self, transform_node): -for pcoll in transform_node.outputs.values(): - pcolls_to_pcoll_id[str(pcoll)] = original_context.pcollections.get_id( - pcoll) - -pipeline.visit(PCollVisitor()) -return pcolls_to_pcoll_id +return PipelineResult(result, pin) class PipelineResult(beam.runners.runner.PipelineResult): """Provides access to information about a pipeline.""" - def __init__(self, underlying_result, runner, pipeline_info, cache_manager, - pcolls_to_pcoll_id): + def __init__(self, underlying_result, pin): super(PipelineResult, self).__init__(underlying_result.state) -self._runner = runner -self._pipeline_info = pipeline_info -self._cache_manager = cache_manager -self._pcolls_to_pcoll_id = pcolls_to_pcoll_id - - def _cache_label(self, pcoll): -
[jira] [Work logged] (BEAM-8016) Render Beam Pipeline as DOT with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-8016?focusedWorklogId=345702&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345702 ] ASF GitHub Bot logged work on BEAM-8016: Author: ASF GitHub Bot Created on: 19/Nov/19 00:22 Start Date: 19/Nov/19 00:22 Worklog Time Spent: 10m Work Description: qinyeli commented on pull request #10132: [BEAM-8016] Pipeline Graph URL: https://github.com/apache/beam/pull/10132#discussion_r347674135 ## File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py ## @@ -314,10 +316,29 @@ def cache_key(self, pcoll): cacheable['producer_version'])) return '' + def cacheable_var_by_pcoll_id(self, pcoll_id): +"""Retrieves the variable name of a PCollection. + +In source code, PCollection variables are defined in the user pipeline. When +it's converted to the runner api representation, each PCollection referenced +in the user pipeline is assigned a unique-within-pipeline pcoll_id. Given +such pcoll_id, retrieves the str variable name defined in user pipeline for +that referenced PCollection. If the PCollection is anonymous, return ''. +""" +return self._cacheable_var_by_pcoll_id.get(pcoll_id, '') + def pin(pipeline, options=None): - """Creates PipelineInstrument for a pipeline and its options with cache.""" + """Creates PipelineInstrument for a pipeline and its options with cache. + + This is the shorthand for doing 3 steps: 1) compute once for metadata of given Review comment: nit: of the given This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345702) Time Spent: 1h 40m (was: 1.5h) > Render Beam Pipeline as DOT with Interactive Beam > --- > > Key: BEAM-8016 > URL: https://issues.apache.org/jira/browse/BEAM-8016 > Project: Beam > Issue Type: Improvement > Components: runner-py-interactive >Reporter: Ning Kang >Assignee: Ning Kang >Priority: Major > Time Spent: 1h 40m > Remaining Estimate: 0h > > With work in https://issues.apache.org/jira/browse/BEAM-7760, Beam pipeline > converted to DOT then rendered should mark user defined variables on edges. > With work in https://issues.apache.org/jira/browse/BEAM-7926, it might be > redundant or confusing to render arbitrary random sample PCollection data on > edges. > We'll also make sure edges in the graph corresponds to output -> input > relationship in the user defined pipeline. Each edge is one output. If > multiple down stream inputs take the same output, it should be rendered as > one edge diverging into two instead of two edges. > For advanced interactivity highlight where each execution highlights the part > of the pipeline really executed from the original pipeline, we'll also > provide the support in beta. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8016) Render Beam Pipeline as DOT with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-8016?focusedWorklogId=345704&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345704 ] ASF GitHub Bot logged work on BEAM-8016: Author: ASF GitHub Bot Created on: 19/Nov/19 00:22 Start Date: 19/Nov/19 00:22 Worklog Time Spent: 10m Work Description: qinyeli commented on pull request #10132: [BEAM-8016] Pipeline Graph URL: https://github.com/apache/beam/pull/10132#discussion_r347674214 ## File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py ## @@ -314,10 +316,29 @@ def cache_key(self, pcoll): cacheable['producer_version'])) return '' + def cacheable_var_by_pcoll_id(self, pcoll_id): +"""Retrieves the variable name of a PCollection. + +In source code, PCollection variables are defined in the user pipeline. When +it's converted to the runner api representation, each PCollection referenced +in the user pipeline is assigned a unique-within-pipeline pcoll_id. Given +such pcoll_id, retrieves the str variable name defined in user pipeline for +that referenced PCollection. If the PCollection is anonymous, return ''. +""" +return self._cacheable_var_by_pcoll_id.get(pcoll_id, '') + def pin(pipeline, options=None): - """Creates PipelineInstrument for a pipeline and its options with cache.""" + """Creates PipelineInstrument for a pipeline and its options with cache. + + This is the shorthand for doing 3 steps: 1) compute once for metadata of given + runner pipeline and everything watched from user pipelines; 2) associate info + between runner pipeline and its corresponding user pipeline, eliminate data Review comment: nit: between the runner pipeline This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345704) > Render Beam Pipeline as DOT with Interactive Beam > --- > > Key: BEAM-8016 > URL: https://issues.apache.org/jira/browse/BEAM-8016 > Project: Beam > Issue Type: Improvement > Components: runner-py-interactive >Reporter: Ning Kang >Assignee: Ning Kang >Priority: Major > Time Spent: 1h 40m > Remaining Estimate: 0h > > With work in https://issues.apache.org/jira/browse/BEAM-7760, Beam pipeline > converted to DOT then rendered should mark user defined variables on edges. > With work in https://issues.apache.org/jira/browse/BEAM-7926, it might be > redundant or confusing to render arbitrary random sample PCollection data on > edges. > We'll also make sure edges in the graph corresponds to output -> input > relationship in the user defined pipeline. Each edge is one output. If > multiple down stream inputs take the same output, it should be rendered as > one edge diverging into two instead of two edges. > For advanced interactivity highlight where each execution highlights the part > of the pipeline really executed from the original pipeline, we'll also > provide the support in beta. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8016) Render Beam Pipeline as DOT with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-8016?focusedWorklogId=345701&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345701 ] ASF GitHub Bot logged work on BEAM-8016: Author: ASF GitHub Bot Created on: 19/Nov/19 00:22 Start Date: 19/Nov/19 00:22 Worklog Time Spent: 10m Work Description: qinyeli commented on pull request #10132: [BEAM-8016] Pipeline Graph URL: https://github.com/apache/beam/pull/10132#discussion_r347673119 ## File path: sdks/python/apache_beam/runners/interactive/interactive_runner.py ## @@ -117,123 +119,43 @@ def apply(self, transform, pvalueish, options): return self._underlying_runner.apply(transform, pvalueish, options) def run_pipeline(self, pipeline, options): -if not hasattr(self, '_desired_cache_labels'): - self._desired_cache_labels = set() - -# Invoke a round trip through the runner API. This makes sure the Pipeline -# proto is stable. -pipeline = beam.pipeline.Pipeline.from_runner_api( -pipeline.to_runner_api(use_fake_coders=True), -pipeline.runner, -options) - -# Snapshot the pipeline in a portable proto before mutating it. -pipeline_proto, original_context = pipeline.to_runner_api( -return_context=True, use_fake_coders=True) -pcolls_to_pcoll_id = self._pcolls_to_pcoll_id(pipeline, original_context) - -analyzer = pipeline_analyzer.PipelineAnalyzer(self._cache_manager, - pipeline_proto, - self._underlying_runner, - options, - self._desired_cache_labels) -# Should be only accessed for debugging purpose. 
-self._analyzer = analyzer +pin = inst.pin(pipeline, options) pipeline_to_execute = beam.pipeline.Pipeline.from_runner_api( -analyzer.pipeline_proto_to_execute(), +pin.instrumented_pipeline_proto(), self._underlying_runner, options) if not self._skip_display: - display = display_manager.DisplayManager( - pipeline_proto=pipeline_proto, - pipeline_analyzer=analyzer, - cache_manager=self._cache_manager, - pipeline_graph_renderer=self._renderer) - display.start_periodic_update() + pg = pipeline_graph.PipelineGraph(pin.original_pipeline, Review comment: Let's rename this into something more descriptive. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345701) > Render Beam Pipeline as DOT with Interactive Beam > --- > > Key: BEAM-8016 > URL: https://issues.apache.org/jira/browse/BEAM-8016 > Project: Beam > Issue Type: Improvement > Components: runner-py-interactive >Reporter: Ning Kang >Assignee: Ning Kang >Priority: Major > Time Spent: 1.5h > Remaining Estimate: 0h > > With work in https://issues.apache.org/jira/browse/BEAM-7760, Beam pipeline > converted to DOT then rendered should mark user defined variables on edges. > With work in https://issues.apache.org/jira/browse/BEAM-7926, it might be > redundant or confusing to render arbitrary random sample PCollection data on > edges. > We'll also make sure edges in the graph corresponds to output -> input > relationship in the user defined pipeline. Each edge is one output. If > multiple down stream inputs take the same output, it should be rendered as > one edge diverging into two instead of two edges. 
> For advanced interactivity highlight where each execution highlights the part > of the pipeline really executed from the original pipeline, we'll also > provide the support in beta. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8016) Render Beam Pipeline as DOT with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-8016?focusedWorklogId=345703&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345703 ] ASF GitHub Bot logged work on BEAM-8016: Author: ASF GitHub Bot Created on: 19/Nov/19 00:22 Start Date: 19/Nov/19 00:22 Worklog Time Spent: 10m Work Description: qinyeli commented on pull request #10132: [BEAM-8016] Pipeline Graph URL: https://github.com/apache/beam/pull/10132#discussion_r347675402 ## File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py ## @@ -337,6 +358,7 @@ def cacheables(pcolls_to_pcoll_id): """ pcoll_version_map = {} cacheables = {} + cacheable_var_by_pcoll_id = {} Review comment: Can't comment on already-committed lines, so commenting here. I didn't get what a "cache desired PCollection" is. cache-desired PCollection looks slightly better, but still doesn't tell what it is. PCollections that need to be cached? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345703) Time Spent: 1h 40m (was: 1.5h) > Render Beam Pipeline as DOT with Interactive Beam > --- > > Key: BEAM-8016 > URL: https://issues.apache.org/jira/browse/BEAM-8016 > Project: Beam > Issue Type: Improvement > Components: runner-py-interactive >Reporter: Ning Kang >Assignee: Ning Kang >Priority: Major > Time Spent: 1h 40m > Remaining Estimate: 0h > > With work in https://issues.apache.org/jira/browse/BEAM-7760, Beam pipeline > converted to DOT then rendered should mark user defined variables on edges. > With work in https://issues.apache.org/jira/browse/BEAM-7926, it might be > redundant or confusing to render arbitrary random sample PCollection data on > edges. 
> We'll also make sure edges in the graph corresponds to output -> input > relationship in the user defined pipeline. Each edge is one output. If > multiple down stream inputs take the same output, it should be rendered as > one edge diverging into two instead of two edges. > For advanced interactivity highlight where each execution highlights the part > of the pipeline really executed from the original pipeline, we'll also > provide the support in beta. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8016) Render Beam Pipeline as DOT with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-8016?focusedWorklogId=345705&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345705 ] ASF GitHub Bot logged work on BEAM-8016: Author: ASF GitHub Bot Created on: 19/Nov/19 00:22 Start Date: 19/Nov/19 00:22 Worklog Time Spent: 10m Work Description: qinyeli commented on pull request #10132: [BEAM-8016] Pipeline Graph URL: https://github.com/apache/beam/pull/10132#discussion_r347676584 ## File path: sdks/python/apache_beam/runners/interactive/pipeline_instrument.py ## @@ -80,7 +80,10 @@ def __init__(self, pipeline, options=None): # A mapping from PCollection id to python id() value in user defined # pipeline instance. (self._pcoll_version_map, - self._cacheables) = cacheables(self.pcolls_to_pcoll_id) + self._cacheables, Review comment: Adding comment here since GitHub does not allow comments on unchanged lines. It's probably not obvious what it means to "instrument" a pipeline. From my context back when working on Interactive Beam, I would guess it's something similar to the original pipeline_analyzer, (and I have to admit that "analyze" pipeline was not a much better.), but it's still fuzzy. Maybe adding some comment summarizing what pipeline_instrument does would be good. In a separate PR This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345705) Time Spent: 1h 50m (was: 1h 40m) > Render Beam Pipeline as DOT with Interactive Beam > --- > > Key: BEAM-8016 > URL: https://issues.apache.org/jira/browse/BEAM-8016 > Project: Beam > Issue Type: Improvement > Components: runner-py-interactive >Reporter: Ning Kang >Assignee: Ning Kang >Priority: Major > Time Spent: 1h 50m > Remaining Estimate: 0h > > With work in https://issues.apache.org/jira/browse/BEAM-7760, Beam pipeline > converted to DOT then rendered should mark user defined variables on edges. > With work in https://issues.apache.org/jira/browse/BEAM-7926, it might be > redundant or confusing to render arbitrary random sample PCollection data on > edges. > We'll also make sure edges in the graph corresponds to output -> input > relationship in the user defined pipeline. Each edge is one output. If > multiple down stream inputs take the same output, it should be rendered as > one edge diverging into two instead of two edges. > For advanced interactivity highlight where each execution highlights the part > of the pipeline really executed from the original pipeline, we'll also > provide the support in beta. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8016) Render Beam Pipeline as DOT with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-8016?focusedWorklogId=345700&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345700 ] ASF GitHub Bot logged work on BEAM-8016: Author: ASF GitHub Bot Created on: 19/Nov/19 00:22 Start Date: 19/Nov/19 00:22 Worklog Time Spent: 10m Work Description: qinyeli commented on pull request #10132: [BEAM-8016] Pipeline Graph URL: https://github.com/apache/beam/pull/10132#discussion_r347666766 ## File path: sdks/python/apache_beam/runners/interactive/display/pipeline_graph.py ## @@ -92,9 +104,20 @@ def __init__(self, default_vertex_attrs, default_edge_attrs) +self._renderer = pipeline_graph_renderer.get_renderer(render_option) + def get_dot(self): return self._get_graph().to_string() + def display_graph(self): +rendered_graph = self._renderer.render_pipeline_graph(self) +if ie.current_env().is_in_notebook: + try: +from IPython.core import display +display.display(display.HTML(rendered_graph)) + except ImportError: # Unlikely to happen when is_in_notebook. +pass Review comment: Maybe print some error message instead of silently failing? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345700) Time Spent: 1.5h (was: 1h 20m) > Render Beam Pipeline as DOT with Interactive Beam > --- > > Key: BEAM-8016 > URL: https://issues.apache.org/jira/browse/BEAM-8016 > Project: Beam > Issue Type: Improvement > Components: runner-py-interactive >Reporter: Ning Kang >Assignee: Ning Kang >Priority: Major > Time Spent: 1.5h > Remaining Estimate: 0h > > With work in https://issues.apache.org/jira/browse/BEAM-7760, Beam pipeline > converted to DOT then rendered should mark user defined variables on edges. 
> With work in https://issues.apache.org/jira/browse/BEAM-7926, it might be > redundant or confusing to render arbitrary random sample PCollection data on > edges. > We'll also make sure edges in the graph corresponds to output -> input > relationship in the user defined pipeline. Each edge is one output. If > multiple down stream inputs take the same output, it should be rendered as > one edge diverging into two instead of two edges. > For advanced interactivity highlight where each execution highlights the part > of the pipeline really executed from the original pipeline, we'll also > provide the support in beta. -- This message was sent by Atlassian Jira (v8.3.4#803005)
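The reviewer asks that the `ImportError` branch in `display_graph` not fail silently. One way to address that, sketched here with stand-in names (the real method lives on `PipelineGraph` and checks `ie.current_env().is_in_notebook`): log a warning explaining why rich display was skipped, then fall back to plain output.

```python
import logging

def display_rendered_graph(rendered_graph, in_notebook):
    """Show an HTML-rendered pipeline graph, or degrade loudly.

    Returns True if rich (IPython) display was used. Instead of a bare
    `pass` on ImportError, the failure is logged so the user knows why
    no graph appeared in the notebook.
    """
    if in_notebook:
        try:
            from IPython.core import display
            display.display(display.HTML(rendered_graph))
            return True
        except ImportError as e:
            logging.warning(
                'Failed to import IPython in a notebook environment; '
                'falling back to plain output: %s', e)
    # Non-notebook (or fallback) path: plain text.
    print(rendered_graph)
    return False

used_rich = display_rendered_graph('<svg>...</svg>', in_notebook=False)
```

Returning a flag also makes the behavior testable, which a silent `pass` is not.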
[jira] [Work logged] (BEAM-8016) Render Beam Pipeline as DOT with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-8016?focusedWorklogId=345698&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345698 ] ASF GitHub Bot logged work on BEAM-8016: Author: ASF GitHub Bot Created on: 19/Nov/19 00:21 Start Date: 19/Nov/19 00:21 Worklog Time Spent: 10m Work Description: qinyeli commented on issue #10132: [BEAM-8016] Pipeline Graph URL: https://github.com/apache/beam/pull/10132#issuecomment-555272354 Thanks Ning! I haven't been catching up with the latest changes in Interactive Beam, so consider my comments as from someone with somewhat context but not all. A lot of my comments are "This is not very obvious, can you elaborate?" So it looks good in general. Some comments in line. Also a couple of questions out of curiosity. * Are we getting rid of the tooltips displaying the intermediate results? Do they not fit in the new model? * What do the PCollections look like if the user did not specify the PCollection name as a variable? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345698) Time Spent: 1h 20m (was: 1h 10m) > Render Beam Pipeline as DOT with Interactive Beam > --- > > Key: BEAM-8016 > URL: https://issues.apache.org/jira/browse/BEAM-8016 > Project: Beam > Issue Type: Improvement > Components: runner-py-interactive >Reporter: Ning Kang >Assignee: Ning Kang >Priority: Major > Time Spent: 1h 20m > Remaining Estimate: 0h > > With work in https://issues.apache.org/jira/browse/BEAM-7760, Beam pipeline > converted to DOT then rendered should mark user defined variables on edges. > With work in https://issues.apache.org/jira/browse/BEAM-7926, it might be > redundant or confusing to render arbitrary random sample PCollection data on > edges. 
> We'll also make sure edges in the graph corresponds to output -> input > relationship in the user defined pipeline. Each edge is one output. If > multiple down stream inputs take the same output, it should be rendered as > one edge diverging into two instead of two edges. > For advanced interactivity highlight where each execution highlights the part > of the pipeline really executed from the original pipeline, we'll also > provide the support in beta. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8335) Add streaming support to Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-8335?focusedWorklogId=345697&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345697 ] ASF GitHub Bot logged work on BEAM-8335: Author: ASF GitHub Bot Created on: 19/Nov/19 00:17 Start Date: 19/Nov/19 00:17 Worklog Time Spent: 10m Work Description: rohdesamuel commented on issue #10119: [BEAM-8335] Adds the StreamingCache URL: https://github.com/apache/beam/pull/10119#issuecomment-555271279 Cool thank you, I pulled out one more PR from the previous mega-PR into https://github.com/apache/beam/pull/10146. It adds the timestamp proto translation utils. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345697) Time Spent: 32h 40m (was: 32.5h) > Add streaming support to Interactive Beam > - > > Key: BEAM-8335 > URL: https://issues.apache.org/jira/browse/BEAM-8335 > Project: Beam > Issue Type: Improvement > Components: runner-py-interactive >Reporter: Sam Rohde >Assignee: Sam Rohde >Priority: Major > Time Spent: 32h 40m > Remaining Estimate: 0h > > This issue tracks the work items to introduce streaming support to the > Interactive Beam experience. This will allow users to: > * Write and run a streaming job in IPython > * Automatically cache records from unbounded sources > * Add a replay experience that replays all cached records to simulate the > original pipeline execution > * Add controls to play/pause/stop/step individual elements from the cached > records > * Add ability to inspect/visualize unbounded PCollections -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8335) Add streaming support to Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-8335?focusedWorklogId=345677&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345677 ] ASF GitHub Bot logged work on BEAM-8335: Author: ASF GitHub Bot Created on: 18/Nov/19 23:53 Start Date: 18/Nov/19 23:53 Worklog Time Spent: 10m Work Description: rohdesamuel commented on issue #10146: [BEAM-8335] Add timestamp to/from protos to Python SDK URL: https://github.com/apache/beam/pull/10146#issuecomment-555265250 > It would also be useful to cover the duration proto as well since both are used. Do you mind doing it in this or a follow-up PR? I don't mind, it's a small enough change. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345677) Time Spent: 32.5h (was: 32h 20m) > Add streaming support to Interactive Beam > - > > Key: BEAM-8335 > URL: https://issues.apache.org/jira/browse/BEAM-8335 > Project: Beam > Issue Type: Improvement > Components: runner-py-interactive >Reporter: Sam Rohde >Assignee: Sam Rohde >Priority: Major > Time Spent: 32.5h > Remaining Estimate: 0h > > This issue tracks the work items to introduce streaming support to the > Interactive Beam experience. This will allow users to: > * Write and run a streaming job in IPython > * Automatically cache records from unbounded sources > * Add a replay experience that replays all cached records to simulate the > original pipeline execution > * Add controls to play/pause/stop/step individual elements from the cached > records > * Add ability to inspect/visualize unbounded PCollections -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8335) Add streaming support to Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-8335?focusedWorklogId=345676&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345676 ] ASF GitHub Bot logged work on BEAM-8335: Author: ASF GitHub Bot Created on: 18/Nov/19 23:52 Start Date: 18/Nov/19 23:52 Worklog Time Spent: 10m Work Description: rohdesamuel commented on pull request #10146: [BEAM-8335] Add timestamp to/from protos to Python SDK URL: https://github.com/apache/beam/pull/10146#discussion_r347669663 ## File path: sdks/python/apache_beam/utils/timestamp.py ## @@ -140,6 +141,23 @@ def to_rfc3339(self): # Append 'Z' for UTC timezone. return self.to_utc_datetime().isoformat() + 'Z' + def to_proto(self): +"""Returns the `google.protobuf.timestamp_pb2` representation.""" +secs = self.micros // 100 +nanos = (self.micros - (secs * 100)) * 1000 +return timestamp_pb2.Timestamp(seconds=secs, nanos=nanos) + + @staticmethod + def from_proto(timestamp_proto): +"""Creates a Timestamp from a `google.protobuf.timestamp_pb2`. + +Note that the google has a sub-second resolution of nanoseconds whereas this +class has a resolution of microsends. This class will truncate the +nanosecond resolution down to the microsecond. +""" +return Timestamp(seconds=timestamp_proto.seconds, + micros=timestamp_proto.nanos // 1000) Review comment: Ack, it now raises a ValueError with a reference to a JIRA as well as a TODO comment. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345676) Time Spent: 32h 20m (was: 32h 10m) > Add streaming support to Interactive Beam > - > > Key: BEAM-8335 > URL: https://issues.apache.org/jira/browse/BEAM-8335 > Project: Beam > Issue Type: Improvement > Components: runner-py-interactive >Reporter: Sam Rohde >Assignee: Sam Rohde >Priority: Major > Time Spent: 32h 20m > Remaining Estimate: 0h > > This issue tracks the work items to introduce streaming support to the > Interactive Beam experience. This will allow users to: > * Write and run a streaming job in IPython > * Automatically cache records from unbounded sources > * Add a replay experience that replays all cached records to simulate the > original pipeline execution > * Add controls to play/pause/stop/step individual elements from the cached > records > * Add ability to inspect/visualize unbounded PCollections -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8335) Add streaming support to Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-8335?focusedWorklogId=345675&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345675 ] ASF GitHub Bot logged work on BEAM-8335: Author: ASF GitHub Bot Created on: 18/Nov/19 23:51 Start Date: 18/Nov/19 23:51 Worklog Time Spent: 10m Work Description: rohdesamuel commented on pull request #10146: [BEAM-8335] Add timestamp to/from protos to Python SDK URL: https://github.com/apache/beam/pull/10146#discussion_r347669525 ## File path: sdks/python/apache_beam/utils/timestamp.py ## @@ -140,6 +141,23 @@ def to_rfc3339(self): # Append 'Z' for UTC timezone. return self.to_utc_datetime().isoformat() + 'Z' + def to_proto(self): +"""Returns the `google.protobuf.timestamp_pb2` representation.""" +secs = self.micros // 100 +nanos = (self.micros - (secs * 100)) * 1000 Review comment: Done. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345675) Time Spent: 32h 10m (was: 32h) > Add streaming support to Interactive Beam > - > > Key: BEAM-8335 > URL: https://issues.apache.org/jira/browse/BEAM-8335 > Project: Beam > Issue Type: Improvement > Components: runner-py-interactive >Reporter: Sam Rohde >Assignee: Sam Rohde >Priority: Major > Time Spent: 32h 10m > Remaining Estimate: 0h > > This issue tracks the work items to introduce streaming support to the > Interactive Beam experience. 
This will allow users to: > * Write and run a streaming job in IPython > * Automatically cache records from unbounded sources > * Add a replay experience that replays all cached records to simulate the > original pipeline execution > * Add controls to play/pause/stop/step individual elements from the cached > records > * Add ability to inspect/visualize unbounded PCollections -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-8738) Revisit timestamp and duration representation
[ https://issues.apache.org/jira/browse/BEAM-8738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sam Rohde updated BEAM-8738: Description: The current proto representation of timesetamp and durations in Beam is either raw int64s or the well-known Google protobuf types "google.protobuf.timestamp" and "google.protobuf.duration". Apache Beam uses int64 MAX and MIN as sentinel values for an +inf watermark and -inf watermark. However, the google.protobuf.timestamp is compliant with RFC3339, meaning it can only represent date-times between 0001-01-01 and -12-31. This is not the same as the int64 MAX and MIN representation. The questions remain: * What should the timestamp and duration representations be? * What units should the timestamps and duration be? Nanos? Micros? * What algebra is allowed when dealing with timestamps and durations? What is needed? See: * [https://lists.apache.org/thread.html/c8e7d8dc7d94487fae23fa2b74ee61aec93c94abbcbef3329ffbf3bd@%3Cdev.beam.apache.org%3E] * [https://lists.apache.org/thread.html/27fe9aa5b33dbee97db1bc924ee410048137e4fe97d9c79d131c010c@%3Cdev.beam.apache.org%3E] was: The current proto representation of timesetamp and durations in Beam is either raw int64s or the well-known Google protobuf types "google.protobuf.timestamp" and "google.protobuf.duration". Apache Beam uses int64 MAX and MIN as sentinel values for an +inf watermark and -inf watermark. However, the google.protobuf.timestamp is compliant with RFC3339, meaning it can only represent date-times between 0001-01-01 and -12-31. This is not the same as the int64 MAX and MIN representation. The questions remain: * What should the timestamp and duration representations be? * What units should the timestamps and duration be? Nanos? Micros? * What algebra is allowed when dealing with timestamps and durations? What is needed? 
> Revisit timestamp and duration representation > - > > Key: BEAM-8738 > URL: https://issues.apache.org/jira/browse/BEAM-8738 > Project: Beam > Issue Type: Improvement > Components: beam-model >Reporter: Sam Rohde >Priority: Minor > > The current proto representation of timesetamp and durations in Beam is > either raw int64s or the well-known Google protobuf types > "google.protobuf.timestamp" and "google.protobuf.duration". Apache Beam uses > int64 MAX and MIN as sentinel values for an +inf watermark and -inf > watermark. However, the google.protobuf.timestamp is compliant with RFC3339, > meaning it can only represent date-times between 0001-01-01 and -12-31. > This is not the same as the int64 MAX and MIN representation. The questions > remain: > * What should the timestamp and duration representations be? > * What units should the timestamps and duration be? Nanos? Micros? > * What algebra is allowed when dealing with timestamps and durations? What > is needed? > See: > * > [https://lists.apache.org/thread.html/c8e7d8dc7d94487fae23fa2b74ee61aec93c94abbcbef3329ffbf3bd@%3Cdev.beam.apache.org%3E] > * > [https://lists.apache.org/thread.html/27fe9aa5b33dbee97db1bc924ee410048137e4fe97d9c79d131c010c@%3Cdev.beam.apache.org%3E] > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-8738) Revisit timestamp and duration representation
[ https://issues.apache.org/jira/browse/BEAM-8738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sam Rohde updated BEAM-8738: Description: The current proto representation of timesetamp and durations in Beam is either raw int64s or the well-known Google protobuf types "google.protobuf.timestamp" and "google.protobuf.duration". Apache Beam uses int64 MAX and MIN as sentinel values for an +inf watermark and -inf watermark. However, the google.protobuf.timestamp is compliant with RFC3339, meaning it can only represent date-times between 0001-01-01 and -12-31. This is not the same as the int64 MAX and MIN representation. The questions remain: * What should the timestamp and duration representations be? * What units should the timestamps and duration be? Nanos? Micros? * What algebra is allowed when dealing with timestamps and durations? What is needed? See: * [https://lists.apache.org/thread.html/c8e7d8dc7d94487fae23fa2b74ee61aec93c94abbcbef3329ffbf3bd@%3Cdev.beam.apache.org%3E] * [https://lists.apache.org/thread.html/27fe9aa5b33dbee97db1bc924ee410048137e4fe97d9c79d131c010c@%3Cdev.beam.apache.org%3E] was: The current proto representation of timesetamp and durations in Beam is either raw int64s or the well-known Google protobuf types "google.protobuf.timestamp" and "google.protobuf.duration". Apache Beam uses int64 MAX and MIN as sentinel values for an +inf watermark and -inf watermark. However, the google.protobuf.timestamp is compliant with RFC3339, meaning it can only represent date-times between 0001-01-01 and -12-31. This is not the same as the int64 MAX and MIN representation. The questions remain: * What should the timestamp and duration representations be? * What units should the timestamps and duration be? Nanos? Micros? * What algebra is allowed when dealing with timestamps and durations? What is needed? 
See: * [https://lists.apache.org/thread.html/c8e7d8dc7d94487fae23fa2b74ee61aec93c94abbcbef3329ffbf3bd@%3Cdev.beam.apache.org%3E] * [https://lists.apache.org/thread.html/27fe9aa5b33dbee97db1bc924ee410048137e4fe97d9c79d131c010c@%3Cdev.beam.apache.org%3E] > Revisit timestamp and duration representation > - > > Key: BEAM-8738 > URL: https://issues.apache.org/jira/browse/BEAM-8738 > Project: Beam > Issue Type: Improvement > Components: beam-model >Reporter: Sam Rohde >Priority: Minor > > The current proto representation of timesetamp and durations in Beam is > either raw int64s or the well-known Google protobuf types > "google.protobuf.timestamp" and "google.protobuf.duration". Apache Beam uses > int64 MAX and MIN as sentinel values for an +inf watermark and -inf > watermark. However, the google.protobuf.timestamp is compliant with RFC3339, > meaning it can only represent date-times between 0001-01-01 and -12-31. > This is not the same as the int64 MAX and MIN representation. The questions > remain: > * What should the timestamp and duration representations be? > * What units should the timestamps and duration be? Nanos? Micros? > * What algebra is allowed when dealing with timestamps and durations? What > is needed? > See: > * > [https://lists.apache.org/thread.html/c8e7d8dc7d94487fae23fa2b74ee61aec93c94abbcbef3329ffbf3bd@%3Cdev.beam.apache.org%3E] > * > [https://lists.apache.org/thread.html/27fe9aa5b33dbee97db1bc924ee410048137e4fe97d9c79d131c010c@%3Cdev.beam.apache.org%3E] > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-8738) Revisit timestamp and duration representation
Sam Rohde created BEAM-8738: --- Summary: Revisit timestamp and duration representation Key: BEAM-8738 URL: https://issues.apache.org/jira/browse/BEAM-8738 Project: Beam Issue Type: Improvement Components: beam-model Reporter: Sam Rohde The current proto representation of timesetamp and durations in Beam is either raw int64s or the well-known Google protobuf types "google.protobuf.timestamp" and "google.protobuf.duration". Apache Beam uses int64 MAX and MIN as sentinel values for an +inf watermark and -inf watermark. However, the google.protobuf.timestamp is compliant with RFC3339, meaning it can only represent date-times between 0001-01-01 and -12-31. This is not the same as the int64 MAX and MIN representation. The questions remain: * What should the timestamp and duration representations be? * What units should the timestamps and duration be? Nanos? Micros? * What algebra is allowed when dealing with timestamps and durations? What is needed? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8645) TimestampCombiner incorrect in beam python
[ https://issues.apache.org/jira/browse/BEAM-8645?focusedWorklogId=345665&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345665 ] ASF GitHub Bot logged work on BEAM-8645: Author: ASF GitHub Bot Created on: 18/Nov/19 23:24 Start Date: 18/Nov/19 23:24 Worklog Time Spent: 10m Work Description: HuangLED commented on issue #10143: [BEAM-8645] To test state backed iterable coder in py sdk. URL: https://github.com/apache/beam/pull/10143#issuecomment-555257322 Run Python PreCommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345665) Time Spent: 3h 40m (was: 3.5h) > TimestampCombiner incorrect in beam python > -- > > Key: BEAM-8645 > URL: https://issues.apache.org/jira/browse/BEAM-8645 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Ruoyun Huang >Priority: Major > Time Spent: 3h 40m > Remaining Estimate: 0h > > When we have a TimestampValue on combine: > {code:java} > main_stream = (p > | 'main TestStream' >> TestStream() > .add_elements([window.TimestampedValue(('k', 100), 0)]) > .add_elements([window.TimestampedValue(('k', 400), 9)]) > .advance_watermark_to_infinity() > | 'main windowInto' >> beam.WindowInto( > window.FixedWindows(10), > timestamp_combiner=TimestampCombiner.OUTPUT_AT_LATEST) | > 'Combine' >> beam.CombinePerKey(sum)) > The expect timestamp should be: > LATEST: (('k', 500), Timestamp(9)), > EARLIEST: (('k', 500), Timestamp(0)), > END_OF_WINDOW: (('k', 500), Timestamp(10)), > But current py streaming gives following results: > LATEST: (('k', 500), Timestamp(10)), > EARLIEST: (('k', 500), Timestamp(10)), > END_OF_WINDOW: (('k', 500), Timestamp(9.)), > More details and discussions: > 
https://lists.apache.org/thread.html/d3af1f2f84a2e59a747196039eae77812b78a991f0f293c717e5f4e1@%3Cdev.beam.apache.org%3E > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8575) Add more Python validates runner tests
[ https://issues.apache.org/jira/browse/BEAM-8575?focusedWorklogId=345664&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345664 ] ASF GitHub Bot logged work on BEAM-8575: Author: ASF GitHub Bot Created on: 18/Nov/19 23:24 Start Date: 18/Nov/19 23:24 Worklog Time Spent: 10m Work Description: liumomo315 commented on issue #10102: [BEAM-8575] Test flatten a single PC and test flatten a flattened PC URL: https://github.com/apache/beam/pull/10102#issuecomment-555257137 Thanks, Robert! PTAL:) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345664) Time Spent: 9.5h (was: 9h 20m) > Add more Python validates runner tests > -- > > Key: BEAM-8575 > URL: https://issues.apache.org/jira/browse/BEAM-8575 > Project: Beam > Issue Type: Test > Components: sdk-py-core, testing >Reporter: wendy liu >Assignee: wendy liu >Priority: Major > Time Spent: 9.5h > Remaining Estimate: 0h > > This is the umbrella issue to track the work of adding more Python tests to > improve test coverage. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8575) Add more Python validates runner tests
[ https://issues.apache.org/jira/browse/BEAM-8575?focusedWorklogId=345658&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345658 ] ASF GitHub Bot logged work on BEAM-8575: Author: ASF GitHub Bot Created on: 18/Nov/19 23:22 Start Date: 18/Nov/19 23:22 Worklog Time Spent: 10m Work Description: liumomo315 commented on pull request #10102: [BEAM-8575] Test flatten a single PC and test flatten a flattened PC URL: https://github.com/apache/beam/pull/10102#discussion_r347661431 ## File path: sdks/python/apache_beam/transforms/ptransform_test.py ## @@ -550,11 +563,23 @@ def test_flatten_pcollections_in_iterable(self): assert_that(result, equal_to([0, 1, 2, 3, 4, 5, 6, 7])) pipeline.run() + @attr('ValidatesRunner') + def test_flatten_a_flattened_pcollection(self): +pipeline = TestPipeline() +pcoll_1 = pipeline | 'Start 1' >> beam.Create([0, 1, 2, 3]) +pcoll_2 = pipeline | 'Start 2' >> beam.Create([4, 5, 6, 7]) +result = ((pcoll_1, pcoll_2) | 'Flatten' >> beam.Flatten(), Review comment: Refactored as suggested:) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345658) Time Spent: 9h (was: 8h 50m) > Add more Python validates runner tests > -- > > Key: BEAM-8575 > URL: https://issues.apache.org/jira/browse/BEAM-8575 > Project: Beam > Issue Type: Test > Components: sdk-py-core, testing >Reporter: wendy liu >Assignee: wendy liu >Priority: Major > Time Spent: 9h > Remaining Estimate: 0h > > This is the umbrella issue to track the work of adding more Python tests to > improve test coverage. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8335) Add streaming support to Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-8335?focusedWorklogId=345661&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345661 ] ASF GitHub Bot logged work on BEAM-8335: Author: ASF GitHub Bot Created on: 18/Nov/19 23:22 Start Date: 18/Nov/19 23:22 Worklog Time Spent: 10m Work Description: lukecwik commented on pull request #10146: [BEAM-8335] Add timestamp to/from protos to Python SDK URL: https://github.com/apache/beam/pull/10146#discussion_r347661383 ## File path: sdks/python/apache_beam/utils/timestamp.py ## @@ -140,6 +141,23 @@ def to_rfc3339(self): # Append 'Z' for UTC timezone. return self.to_utc_datetime().isoformat() + 'Z' + def to_proto(self): +"""Returns the `google.protobuf.timestamp_pb2` representation.""" +secs = self.micros // 100 +nanos = (self.micros - (secs * 100)) * 1000 +return timestamp_pb2.Timestamp(seconds=secs, nanos=nanos) + + @staticmethod + def from_proto(timestamp_proto): +"""Creates a Timestamp from a `google.protobuf.timestamp_pb2`. + +Note that the google has a sub-second resolution of nanoseconds whereas this +class has a resolution of microsends. This class will truncate the +nanosecond resolution down to the microsecond. +""" +return Timestamp(seconds=timestamp_proto.seconds, + micros=timestamp_proto.nanos // 1000) Review comment: I think we should fail here if we ever perform any rounding with a TODO stating that we should support nanos. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345661) > Add streaming support to Interactive Beam > - > > Key: BEAM-8335 > URL: https://issues.apache.org/jira/browse/BEAM-8335 > Project: Beam > Issue Type: Improvement > Components: runner-py-interactive >Reporter: Sam Rohde >Assignee: Sam Rohde >Priority: Major > Time Spent: 32h > Remaining Estimate: 0h > > This issue tracks the work items to introduce streaming support to the > Interactive Beam experience. This will allow users to: > * Write and run a streaming job in IPython > * Automatically cache records from unbounded sources > * Add a replay experience that replays all cached records to simulate the > original pipeline execution > * Add controls to play/pause/stop/step individual elements from the cached > records > * Add ability to inspect/visualize unbounded PCollections -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8575) Add more Python validates runner tests
[ https://issues.apache.org/jira/browse/BEAM-8575?focusedWorklogId=345659&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345659 ] ASF GitHub Bot logged work on BEAM-8575: Author: ASF GitHub Bot Created on: 18/Nov/19 23:22 Start Date: 18/Nov/19 23:22 Worklog Time Spent: 10m Work Description: liumomo315 commented on pull request #10102: [BEAM-8575] Test flatten a single PC and test flatten a flattened PC URL: https://github.com/apache/beam/pull/10102#discussion_r347661504 ## File path: sdks/python/apache_beam/transforms/ptransform_test.py ## @@ -550,11 +563,23 @@ def test_flatten_pcollections_in_iterable(self): assert_that(result, equal_to([0, 1, 2, 3, 4, 5, 6, 7])) pipeline.run() + @attr('ValidatesRunner') + def test_flatten_a_flattened_pcollection(self): +pipeline = TestPipeline() +pcoll_1 = pipeline | 'Start 1' >> beam.Create([0, 1, 2, 3]) +pcoll_2 = pipeline | 'Start 2' >> beam.Create([4, 5, 6, 7]) +result = ((pcoll_1, pcoll_2) | 'Flatten' >> beam.Flatten(), + ) |'Flatten again' >> beam.Flatten() +assert_that(result, equal_to([x for x in range(8)])) +pipeline.run() + + @attr('ValidatesRunner') Review comment: I see. Reverted this one too. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345659) Time Spent: 9h 10m (was: 9h) > Add more Python validates runner tests > -- > > Key: BEAM-8575 > URL: https://issues.apache.org/jira/browse/BEAM-8575 > Project: Beam > Issue Type: Test > Components: sdk-py-core, testing >Reporter: wendy liu >Assignee: wendy liu >Priority: Major > Time Spent: 9h 10m > Remaining Estimate: 0h > > This is the umbrella issue to track the work of adding more Python tests to > improve test coverage. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8335) Add streaming support to Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-8335?focusedWorklogId=345660&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345660 ] ASF GitHub Bot logged work on BEAM-8335: Author: ASF GitHub Bot Created on: 18/Nov/19 23:22 Start Date: 18/Nov/19 23:22 Worklog Time Spent: 10m Work Description: lukecwik commented on pull request #10146: [BEAM-8335] Add timestamp to/from protos to Python SDK URL: https://github.com/apache/beam/pull/10146#discussion_r347661181 ## File path: sdks/python/apache_beam/utils/timestamp.py ## @@ -140,6 +141,23 @@ def to_rfc3339(self): # Append 'Z' for UTC timezone. return self.to_utc_datetime().isoformat() + 'Z' + def to_proto(self): +"""Returns the `google.protobuf.timestamp_pb2` representation.""" +secs = self.micros // 100 +nanos = (self.micros - (secs * 100)) * 1000 Review comment: ```suggestion nanos = (self.micros % 100) * 1000 ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345660) Time Spent: 32h (was: 31h 50m) > Add streaming support to Interactive Beam > - > > Key: BEAM-8335 > URL: https://issues.apache.org/jira/browse/BEAM-8335 > Project: Beam > Issue Type: Improvement > Components: runner-py-interactive >Reporter: Sam Rohde >Assignee: Sam Rohde >Priority: Major > Time Spent: 32h > Remaining Estimate: 0h > > This issue tracks the work items to introduce streaming support to the > Interactive Beam experience. 
This will allow users to: > * Write and run a streaming job in IPython > * Automatically cache records from unbounded sources > * Add a replay experience that replays all cached records to simulate the > original pipeline execution > * Add controls to play/pause/stop/step individual elements from the cached > records > * Add ability to inspect/visualize unbounded PCollections -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8575) Add more Python validates runner tests
[ https://issues.apache.org/jira/browse/BEAM-8575?focusedWorklogId=345662&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345662 ] ASF GitHub Bot logged work on BEAM-8575: Author: ASF GitHub Bot Created on: 18/Nov/19 23:22 Start Date: 18/Nov/19 23:22 Worklog Time Spent: 10m Work Description: liumomo315 commented on pull request #10102: [BEAM-8575] Test flatten a single PC and test flatten a flattened PC URL: https://github.com/apache/beam/pull/10102#discussion_r347661571 ## File path: sdks/python/apache_beam/transforms/ptransform_test.py ## @@ -550,11 +563,23 @@ def test_flatten_pcollections_in_iterable(self): assert_that(result, equal_to([0, 1, 2, 3, 4, 5, 6, 7])) pipeline.run() + @attr('ValidatesRunner') + def test_flatten_a_flattened_pcollection(self): +pipeline = TestPipeline() +pcoll_1 = pipeline | 'Start 1' >> beam.Create([0, 1, 2, 3]) +pcoll_2 = pipeline | 'Start 2' >> beam.Create([4, 5, 6, 7]) +result = ((pcoll_1, pcoll_2) | 'Flatten' >> beam.Flatten(), + ) |'Flatten again' >> beam.Flatten() +assert_that(result, equal_to([x for x in range(8)])) +pipeline.run() + + @attr('ValidatesRunner') def test_flatten_input_type_must_be_iterable(self): # Inputs to flatten *must* be an iterable. with self.assertRaises(ValueError): 4 | beam.Flatten() + @attr('ValidatesRunner') Review comment: Done. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345662) Time Spent: 9h 20m (was: 9h 10m) > Add more Python validates runner tests > -- > > Key: BEAM-8575 > URL: https://issues.apache.org/jira/browse/BEAM-8575 > Project: Beam > Issue Type: Test > Components: sdk-py-core, testing >Reporter: wendy liu >Assignee: wendy liu >Priority: Major > Time Spent: 9h 20m > Remaining Estimate: 0h > > This is the umbrella issue to track the work of adding more Python tests to > improve test coverage. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8575) Add more Python validates runner tests
[ https://issues.apache.org/jira/browse/BEAM-8575?focusedWorklogId=345657&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345657 ] ASF GitHub Bot logged work on BEAM-8575: Author: ASF GitHub Bot Created on: 18/Nov/19 23:21 Start Date: 18/Nov/19 23:21 Worklog Time Spent: 10m Work Description: liumomo315 commented on pull request #10102: [BEAM-8575] Test flatten a single PC and test flatten a flattened PC URL: https://github.com/apache/beam/pull/10102#discussion_r347661278 ## File path: sdks/python/apache_beam/transforms/ptransform_test.py ## @@ -536,12 +538,23 @@ def test_flatten_no_pcollections(self): assert_that(result, equal_to([])) pipeline.run() + @attr('ValidatesRunner') + def test_flatten_one_single_pcollection(self): +pipeline = TestPipeline() +input = [0, 1, 2, 3] +pcoll = pipeline | 'Input' >> beam.Create(input) +result = (pcoll,)| 'Single Flatten' >> beam.Flatten() +assert_that(result, equal_to(input)) +pipeline.run() + + @attr('ValidatesRunner') def test_flatten_same_pcollections(self): pipeline = TestPipeline() pc = pipeline | beam.Create(['a', 'b']) assert_that((pc, pc, pc) | beam.Flatten(), equal_to(['a', 'b'] * 3)) pipeline.run() + @attr('ValidatesRunner') Review comment: Reverted. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345657) Time Spent: 8h 50m (was: 8h 40m) > Add more Python validates runner tests > -- > > Key: BEAM-8575 > URL: https://issues.apache.org/jira/browse/BEAM-8575 > Project: Beam > Issue Type: Test > Components: sdk-py-core, testing >Reporter: wendy liu >Assignee: wendy liu >Priority: Major > Time Spent: 8h 50m > Remaining Estimate: 0h > > This is the umbrella issue to track the work of adding more Python tests to > improve test coverage. 
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-7886) Make row coder a standard coder and implement in python
[ https://issues.apache.org/jira/browse/BEAM-7886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16976961#comment-16976961 ] Brian Hulette commented on BEAM-7886: - I don't think we have a jira for it yet. The way it works in Java is that every logical type needs to have a base type, and there's some well known way to convert between the logical type and the base type (e.g. timestamp would be converted to int64 millis since the epoch and encoded with varint coder). Java also currently has some support for Java-only logical types that rely on putting a serialized class in the pipeline graph, but I'd like to move away from that. > Make row coder a standard coder and implement in python > --- > > Key: BEAM-7886 > URL: https://issues.apache.org/jira/browse/BEAM-7886 > Project: Beam > Issue Type: Improvement > Components: beam-model, sdk-java-core, sdk-py-core >Reporter: Brian Hulette >Assignee: Brian Hulette >Priority: Major > Time Spent: 16h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8016) Render Beam Pipeline as DOT with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-8016?focusedWorklogId=345641&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345641 ] ASF GitHub Bot logged work on BEAM-8016: Author: ASF GitHub Bot Created on: 18/Nov/19 22:46 Start Date: 18/Nov/19 22:46 Worklog Time Spent: 10m Work Description: KevinGG commented on pull request #10132: [BEAM-8016] Pipeline Graph URL: https://github.com/apache/beam/pull/10132#discussion_r347031607 ## File path: sdks/python/apache_beam/runners/interactive/display/pipeline_graph.py ## @@ -136,14 +159,50 @@ def _generate_graph_dicts(self): vertex_dict[invisible_leaf] = {'style': 'invis'} self._edge_to_vertex_pairs[pcoll_id].append( (transform.unique_name, invisible_leaf)) - edge_dict[(transform.unique_name, invisible_leaf)] = {} + if self._pin: +edge_label = {'label': +self._pin.cacheable_var_by_pcoll_id(pcoll_id)} +edge_dict[(transform.unique_name, invisible_leaf)] = edge_label + else: +edge_dict[(transform.unique_name, invisible_leaf)] = {} +# For PCollections with more than one consuming PTransform, we also add +# an invisible dummy node to diverge the edge in the middle as the +# single output is used by multiple down stream PTransforms as inputs +# instead of emitting multiple edges. 
+elif len(self._consumers[pcoll_id]) > 1: + intermediate_dummy = 'diverge{}'.format( + hash(pcoll_id) % 1) + vertex_dict[intermediate_dummy] = {'shape': 'point', + 'width': '0'} + for consumer in self._consumers[pcoll_id]: +producer_name = transform.unique_name +consumer_name = transforms[consumer].unique_name +self._edge_to_vertex_pairs[pcoll_id].append( +(producer_name, intermediate_dummy)) +if self._pin: + edge_dict[(producer_name, intermediate_dummy)] = { + 'arrowhead': 'none', + 'label': +self._pin.cacheable_var_by_pcoll_id(pcoll_id)} +else: + edge_dict[(producer_name, intermediate_dummy)] = { + 'arrowhead': 'none'} +self._edge_to_vertex_pairs[pcoll_id].append( +(intermediate_dummy, consumer_name)) +edge_dict[(intermediate_dummy, consumer_name)] = {} else: for consumer in self._consumers[pcoll_id]: producer_name = transform.unique_name consumer_name = transforms[consumer].unique_name self._edge_to_vertex_pairs[pcoll_id].append( (producer_name, consumer_name)) -edge_dict[(producer_name, consumer_name)] = {} +if self._pin: + edge_dict[(producer_name, consumer_name)] = { + 'label': +self._pin.cacheable_var_by_pcoll_id(pcoll_id) + } +else: + edge_dict[(producer_name, consumer_name)] = {} Review comment: The entire logic will be simplified once data-centric user flow is in-place and pcollections are rendered as nodes rather than labels on edges. The difference introduced: Before: ![1](https://user-images.githubusercontent.com/4423149/68979012-364f6480-07b1-11ea-9ec3-8022154a8499.png) After (edge diverge for single output, user variable names labelled PCollections): ![2](https://user-images.githubusercontent.com/4423149/69100500-2bdfd580-0a12-11ea-9ef3-cb735fe69775.png) In the future: With data-centric user flow, notebook cell execution metadata and new Beam pipeline graph proposal: ![3](https://user-images.githubusercontent.com/4423149/68979139-93e3b100-07b1-11ea-8931-a6825b39b7a4.png) This is an automated message from the Apache Git Service. 
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345641) Time Spent: 1h 10m (was: 1h) > Render Beam Pipeline as DOT with Interactive Beam > --- > > Key: BEAM-8016 > URL: https://issues.apache.org/jira/browse/BEAM-8016 > Project: Beam > Issue Type: Improvement > Components: runner-py-interactive >Reporter: Ning Kang >Assignee: Ning Kang >Priority: Major > Time Spent: 1h 10m > Remaining Estimate: 0h > > With work in https://issues.apache.org/jira/browse/BEAM-7760, Beam pipe
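Editor's note: the "intermediate dummy" technique in the diff above — an invisible point-shaped node that diverges one producer edge into several consumer edges — can be sketched as plain DOT generation. The helper name and node-naming scheme below are hypothetical; this is not the actual `pipeline_graph.py` code.

```python
def render_diverged_edges(producer, consumers, pcoll_label):
    """Emit DOT that splits one producer output into many consumer edges
    via a point-shaped dummy node, so the shared PCollection is drawn as
    a single diverging line instead of several parallel edges."""
    dummy = 'diverge_{}'.format(abs(hash(pcoll_label)) % 10000)
    lines = ['digraph G {']
    # The dummy vertex is rendered as a zero-width point, i.e. invisible.
    lines.append('  "{}" [shape=point, width=0];'.format(dummy))
    # One labelled, arrowhead-free edge into the dummy node...
    lines.append('  "{}" -> "{}" [arrowhead=none, label="{}"];'.format(
        producer, dummy, pcoll_label))
    # ...then one plain edge out to each consumer.
    for consumer in consumers:
        lines.append('  "{}" -> "{}";'.format(dummy, consumer))
    lines.append('}')
    return '\n'.join(lines)

dot = render_diverged_edges('Create', ['Sum', 'Count'], 'pcoll')
print(dot)
```

Rendering the printed DOT with Graphviz draws a single labelled line out of the producer that forks at the invisible point.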
[jira] [Work logged] (BEAM-8645) TimestampCombiner incorrect in beam python
[ https://issues.apache.org/jira/browse/BEAM-8645?focusedWorklogId=345640&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345640 ] ASF GitHub Bot logged work on BEAM-8645: Author: ASF GitHub Bot Created on: 18/Nov/19 22:44 Start Date: 18/Nov/19 22:44 Worklog Time Spent: 10m Work Description: HuangLED commented on issue #10135: [BEAM-8645] Create a py test case for Re-iteration on GBK result. URL: https://github.com/apache/beam/pull/10135#issuecomment-555245258 Thanks for reviewing. Updated. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345640) Time Spent: 3.5h (was: 3h 20m) > TimestampCombiner incorrect in beam python > -- > > Key: BEAM-8645 > URL: https://issues.apache.org/jira/browse/BEAM-8645 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Ruoyun Huang >Priority: Major > Time Spent: 3.5h > Remaining Estimate: 0h > > When we have a TimestampValue on combine: > {code:java} > main_stream = (p > | 'main TestStream' >> TestStream() > .add_elements([window.TimestampedValue(('k', 100), 0)]) > .add_elements([window.TimestampedValue(('k', 400), 9)]) > .advance_watermark_to_infinity() > | 'main windowInto' >> beam.WindowInto( > window.FixedWindows(10), > timestamp_combiner=TimestampCombiner.OUTPUT_AT_LATEST) | > 'Combine' >> beam.CombinePerKey(sum)) > The expect timestamp should be: > LATEST: (('k', 500), Timestamp(9)), > EARLIEST: (('k', 500), Timestamp(0)), > END_OF_WINDOW: (('k', 500), Timestamp(10)), > But current py streaming gives following results: > LATEST: (('k', 500), Timestamp(10)), > EARLIEST: (('k', 500), Timestamp(10)), > END_OF_WINDOW: (('k', 500), Timestamp(9.)), > More details and discussions: > 
https://lists.apache.org/thread.html/d3af1f2f84a2e59a747196039eae77812b78a991f0f293c717e5f4e1@%3Cdev.beam.apache.org%3E > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
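Editor's note: the expected semantics in the report reduce to simple min/max/boundary arithmetic; a plain-Python restatement (not Beam code) for the report's two elements, with timestamps 0 and 9 falling in the FixedWindows(10) window [0, 10):

```python
# Element timestamps from the bug report's TestStream.
timestamps = [0, 9]
window_start, window_size = 0, 10  # FixedWindows(10), window [0, 10)

# Expected combined output timestamp per TimestampCombiner mode.
combined = {
    'EARLIEST': min(timestamps),              # earliest input: Timestamp(0)
    'LATEST': max(timestamps),                # latest input: Timestamp(9)
    'END_OF_WINDOW': window_start + window_size,  # window end: Timestamp(10)
}
print(combined)
```

The reported bug is that Python streaming instead stamps all three modes at or near the window end, ignoring the input timestamps.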
[jira] [Work logged] (BEAM-8645) TimestampCombiner incorrect in beam python
[ https://issues.apache.org/jira/browse/BEAM-8645?focusedWorklogId=345638&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345638 ] ASF GitHub Bot logged work on BEAM-8645: Author: ASF GitHub Bot Created on: 18/Nov/19 22:40 Start Date: 18/Nov/19 22:40 Worklog Time Spent: 10m Work Description: HuangLED commented on pull request #10135: [BEAM-8645] Create a py test case for Re-iteration on GBK result. URL: https://github.com/apache/beam/pull/10135#discussion_r347647686 ## File path: sdks/python/apache_beam/transforms/ptransform_test.py ## @@ -471,6 +471,24 @@ def test_group_by_key(self):

     assert_that(result, equal_to([(1, [1, 2, 3]), (2, [1, 2]), (3, [1])]))
     pipeline.run()

+  def test_group_by_key_reiteration(self):
+    class MyDoFn(beam.DoFn):
+      def process(self, gbk_result):
+        value_list = gbk_result[1]
+        sum_val = 0
+        # Iterate the GBK result for multiple times.
+        for _ in range(0, 17):
+          sum_val += sum(value_list)
+        return (gbk_result[0], sum_val)
+
+    pipeline = TestPipeline()
+    pcoll = pipeline | 'start' >> beam.Create(
+        [(1, 1), (1, 2), (1, 3), (1, 4)])
+    result = (pcoll | 'Group' >> beam.GroupByKey()
+              | 'Reiteration-Sum' >> beam.ParDo(MyDoFn()))
+    assert_that(result, equal_to([1, 170]))

Review comment: Ah, turns out the DoFn above is supposed to return a list in py SDK. Thanks for catching this. Fixed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345638) Time Spent: 3h 20m (was: 3h 10m) > TimestampCombiner incorrect in beam python > -- > > Key: BEAM-8645 > URL: https://issues.apache.org/jira/browse/BEAM-8645 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Ruoyun Huang >Priority: Major > Time Spent: 3h 20m > Remaining Estimate: 0h > > When we have a TimestampValue on combine: > {code:java} > main_stream = (p > | 'main TestStream' >> TestStream() > .add_elements([window.TimestampedValue(('k', 100), 0)]) > .add_elements([window.TimestampedValue(('k', 400), 9)]) > .advance_watermark_to_infinity() > | 'main windowInto' >> beam.WindowInto( > window.FixedWindows(10), > timestamp_combiner=TimestampCombiner.OUTPUT_AT_LATEST) | > 'Combine' >> beam.CombinePerKey(sum)) > The expect timestamp should be: > LATEST: (('k', 500), Timestamp(9)), > EARLIEST: (('k', 500), Timestamp(0)), > END_OF_WINDOW: (('k', 500), Timestamp(10)), > But current py streaming gives following results: > LATEST: (('k', 500), Timestamp(10)), > EARLIEST: (('k', 500), Timestamp(10)), > END_OF_WINDOW: (('k', 500), Timestamp(9.)), > More details and discussions: > https://lists.apache.org/thread.html/d3af1f2f84a2e59a747196039eae77812b78a991f0f293c717e5f4e1@%3Cdev.beam.apache.org%3E > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
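Editor's note: the fix discussed above ("the DoFn above is supposed to return a list") follows from the runner iterating whatever `process()` returns; each item of the iterable becomes one output element. A minimal model, with a hypothetical `expand_outputs` helper rather than the SDK's actual expansion logic:

```python
def expand_outputs(process_result):
    """Model of how a runner treats a DoFn.process return value:
    the result is iterated, and each item becomes an output element."""
    if process_result is None:
        return []
    return list(process_result)

key, total = 1, 170
# Returning a bare tuple is iterated element-wise: two outputs, 1 and 170.
wrong = expand_outputs((key, total))
# Wrapping the pair in a list yields one output element, (1, 170).
right = expand_outputs([(key, total)])
print(wrong, right)
```

This also explains the bare `equal_to([1, 170])` in the draft test: a returned tuple `(1, 170)` is flattened into the two elements 1 and 170.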
[jira] [Work logged] (BEAM-8645) TimestampCombiner incorrect in beam python
[ https://issues.apache.org/jira/browse/BEAM-8645?focusedWorklogId=345637&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345637 ] ASF GitHub Bot logged work on BEAM-8645: Author: ASF GitHub Bot Created on: 18/Nov/19 22:39 Start Date: 18/Nov/19 22:39 Worklog Time Spent: 10m Work Description: HuangLED commented on pull request #10135: [BEAM-8645] Create a py test case for Re-iteration on GBK result. URL: https://github.com/apache/beam/pull/10135#discussion_r347647417 ## File path: sdks/python/apache_beam/transforms/ptransform_test.py ## @@ -471,6 +471,24 @@ def test_group_by_key(self): assert_that(result, equal_to([(1, [1, 2, 3]), (2, [1, 2]), (3, [1])])) pipeline.run() + def test_group_by_key_reiteration(self): +class MyDoFn(beam.DoFn): + def process(self, gbk_result): +value_list = gbk_result[1] Review comment: Thanks! Fixed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345637) Time Spent: 3h 10m (was: 3h) > TimestampCombiner incorrect in beam python > -- > > Key: BEAM-8645 > URL: https://issues.apache.org/jira/browse/BEAM-8645 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Ruoyun Huang >Priority: Major > Time Spent: 3h 10m > Remaining Estimate: 0h > > When we have a TimestampValue on combine: > {code:java} > main_stream = (p > | 'main TestStream' >> TestStream() > .add_elements([window.TimestampedValue(('k', 100), 0)]) > .add_elements([window.TimestampedValue(('k', 400), 9)]) > .advance_watermark_to_infinity() > | 'main windowInto' >> beam.WindowInto( > window.FixedWindows(10), > timestamp_combiner=TimestampCombiner.OUTPUT_AT_LATEST) | > 'Combine' >> beam.CombinePerKey(sum)) > The expect timestamp should be: > LATEST: (('k', 500), Timestamp(9)), > EARLIEST: (('k', 500), Timestamp(0)), > END_OF_WINDOW: (('k', 500), Timestamp(10)), > But current py streaming gives following results: > LATEST: (('k', 500), Timestamp(10)), > EARLIEST: (('k', 500), Timestamp(10)), > END_OF_WINDOW: (('k', 500), Timestamp(9.)), > More details and discussions: > https://lists.apache.org/thread.html/d3af1f2f84a2e59a747196039eae77812b78a991f0f293c717e5f4e1@%3Cdev.beam.apache.org%3E > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8592) DataCatalogTableProvider should not squash table components together into a string
[ https://issues.apache.org/jira/browse/BEAM-8592?focusedWorklogId=345635&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345635 ] ASF GitHub Bot logged work on BEAM-8592: Author: ASF GitHub Bot Created on: 18/Nov/19 22:32 Start Date: 18/Nov/19 22:32 Worklog Time Spent: 10m Work Description: kennknowles commented on issue #10021: [BEAM-8592] Adjusting ZetaSQL table resolution to standard URL: https://github.com/apache/beam/pull/10021#issuecomment-555240238 run sql postcommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345635) Time Spent: 1h 10m (was: 1h) > DataCatalogTableProvider should not squash table components together into a > string > -- > > Key: BEAM-8592 > URL: https://issues.apache.org/jira/browse/BEAM-8592 > Project: Beam > Issue Type: Bug > Components: dsl-sql, dsl-sql-zetasql >Reporter: Kenneth Knowles >Assignee: Kenneth Knowles >Priority: Major > Time Spent: 1h 10m > Remaining Estimate: 0h > > Currently, if a user writes a table name like \{{foo.`baz.bar`.bizzle}} > representing the components \{{"foo", "baz.bar", "bizzle"}} the > DataCatalogTableProvider will concatenate the components into a string and > resolve the identifier as if it represented \{{"foo", "baz", "bar", > "bizzle"}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
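Editor's note: the squashing bug described above can be illustrated with a small backtick-aware splitter. This is a sketch of the quoting rules shown in the issue, not `DataCatalogTableProvider`'s real resolution code:

```python
def split_table_name(name):
    """Split a dotted table name into components, honoring backtick
    quoting so that a quoted component may itself contain dots."""
    components, buf, quoted = [], [], False
    for ch in name:
        if ch == '`':
            quoted = not quoted      # toggle quoted mode, drop the backtick
        elif ch == '.' and not quoted:
            components.append(''.join(buf))  # unquoted dot ends a component
            buf = []
        else:
            buf.append(ch)
    components.append(''.join(buf))
    return components

# Correct: the quoted middle component keeps its dot.
print(split_table_name('foo.`baz.bar`.bizzle'))
# The reported bug: concatenating to a string and re-splitting on dots
# loses the quoting, yielding four components instead of three.
print('foo.baz.bar.bizzle'.split('.'))
```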
[jira] [Work logged] (BEAM-8575) Add more Python validates runner tests
[ https://issues.apache.org/jira/browse/BEAM-8575?focusedWorklogId=345632&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345632 ] ASF GitHub Bot logged work on BEAM-8575: Author: ASF GitHub Bot Created on: 18/Nov/19 22:25 Start Date: 18/Nov/19 22:25 Worklog Time Spent: 10m Work Description: robertwb commented on pull request #10070: [BEAM-8575] Added a unit test for Reshuffle to test that Reshuffle pr… URL: https://github.com/apache/beam/pull/10070#discussion_r347641647 ## File path: sdks/python/apache_beam/transforms/util_test.py ## @@ -276,6 +279,13 @@ def process(self, element):

     with self.assertRaisesRegex(ValueError, r'window.*None.*add_timestamps2'):
       pipeline.run()

+class AddTimestamp(beam.DoFn):

Review comment: As these are used only locally, define them locally (i.e. in the test method). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345632) Time Spent: 8.5h (was: 8h 20m) > Add more Python validates runner tests > -- > > Key: BEAM-8575 > URL: https://issues.apache.org/jira/browse/BEAM-8575 > Project: Beam > Issue Type: Test > Components: sdk-py-core, testing >Reporter: wendy liu >Assignee: wendy liu >Priority: Major > Time Spent: 8.5h > Remaining Estimate: 0h > > This is the umbrella issue to track the work of adding more Python tests to > improve test coverage. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8575) Add more Python validates runner tests
[ https://issues.apache.org/jira/browse/BEAM-8575?focusedWorklogId=345631&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345631 ] ASF GitHub Bot logged work on BEAM-8575: Author: ASF GitHub Bot Created on: 18/Nov/19 22:25 Start Date: 18/Nov/19 22:25 Worklog Time Spent: 10m Work Description: robertwb commented on pull request #10070: [BEAM-8575] Added a unit test for Reshuffle to test that Reshuffle pr… URL: https://github.com/apache/beam/pull/10070#discussion_r347641178 ## File path: sdks/python/apache_beam/transforms/util_test.py ## @@ -276,6 +279,13 @@ def process(self, element): with self.assertRaisesRegex(ValueError, r'window.*None.*add_timestamps2'): pipeline.run() +class AddTimestamp(beam.DoFn): + def process(self, element, timestamp=beam.DoFn.TimestampParam): +yield beam.window.TimestampedValue(element, timestamp) Review comment: This is a no-op. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345631) Time Spent: 8.5h (was: 8h 20m) > Add more Python validates runner tests > -- > > Key: BEAM-8575 > URL: https://issues.apache.org/jira/browse/BEAM-8575 > Project: Beam > Issue Type: Test > Components: sdk-py-core, testing >Reporter: wendy liu >Assignee: wendy liu >Priority: Major > Time Spent: 8.5h > Remaining Estimate: 0h > > This is the umbrella issue to track the work of adding more Python tests to > improve test coverage. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8575) Add more Python validates runner tests
[ https://issues.apache.org/jira/browse/BEAM-8575?focusedWorklogId=345633&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345633 ] ASF GitHub Bot logged work on BEAM-8575: Author: ASF GitHub Bot Created on: 18/Nov/19 22:25 Start Date: 18/Nov/19 22:25 Worklog Time Spent: 10m Work Description: robertwb commented on pull request #10070: [BEAM-8575] Added a unit test for Reshuffle to test that Reshuffle pr… URL: https://github.com/apache/beam/pull/10070#discussion_r347642132 ## File path: sdks/python/apache_beam/transforms/util_test.py ## @@ -276,6 +279,13 @@ def process(self, element): with self.assertRaisesRegex(ValueError, r'window.*None.*add_timestamps2'): pipeline.run() +class AddTimestamp(beam.DoFn): + def process(self, element, timestamp=beam.DoFn.TimestampParam): +yield beam.window.TimestampedValue(element, timestamp) + +class GetTimestamp(beam.DoFn): + def process(self, element, timestamp=beam.DoFn.TimestampParam): +yield '{} - {}'.format(timestamp, element['name']) Review comment: (Optional) I would either call this FormatWithTimestamp, or have it return a tuple. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345633) Time Spent: 8h 40m (was: 8.5h) > Add more Python validates runner tests > -- > > Key: BEAM-8575 > URL: https://issues.apache.org/jira/browse/BEAM-8575 > Project: Beam > Issue Type: Test > Components: sdk-py-core, testing >Reporter: wendy liu >Assignee: wendy liu >Priority: Major > Time Spent: 8h 40m > Remaining Estimate: 0h > > This is the umbrella issue to track the work of adding more Python tests to > improve test coverage. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8335) Add streaming support to Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-8335?focusedWorklogId=345625&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345625 ] ASF GitHub Bot logged work on BEAM-8335: Author: ASF GitHub Bot Created on: 18/Nov/19 22:23 Start Date: 18/Nov/19 22:23 Worklog Time Spent: 10m Work Description: rohdesamuel commented on issue #10146: [BEAM-8335] Add timestamp to/from protos to Python SDK URL: https://github.com/apache/beam/pull/10146#issuecomment-555237183 R: @lukecwik Hey Luke, are you able to review this please? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345625) Time Spent: 31h 50m (was: 31h 40m) > Add streaming support to Interactive Beam > - > > Key: BEAM-8335 > URL: https://issues.apache.org/jira/browse/BEAM-8335 > Project: Beam > Issue Type: Improvement > Components: runner-py-interactive >Reporter: Sam Rohde >Assignee: Sam Rohde >Priority: Major > Time Spent: 31h 50m > Remaining Estimate: 0h > > This issue tracks the work items to introduce streaming support to the > Interactive Beam experience. This will allow users to: > * Write and run a streaming job in IPython > * Automatically cache records from unbounded sources > * Add a replay experience that replays all cached records to simulate the > original pipeline execution > * Add controls to play/pause/stop/step individual elements from the cached > records > * Add ability to inspect/visualize unbounded PCollections -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8575) Add more Python validates runner tests
[ https://issues.apache.org/jira/browse/BEAM-8575?focusedWorklogId=345622&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345622 ] ASF GitHub Bot logged work on BEAM-8575: Author: ASF GitHub Bot Created on: 18/Nov/19 22:22 Start Date: 18/Nov/19 22:22 Worklog Time Spent: 10m Work Description: robertwb commented on pull request #10102: [BEAM-8575] Test flatten a single PC and test flatten a flattened PC URL: https://github.com/apache/beam/pull/10102#discussion_r347640781 ## File path: sdks/python/apache_beam/transforms/ptransform_test.py ## @@ -550,11 +563,23 @@ def test_flatten_pcollections_in_iterable(self): assert_that(result, equal_to([0, 1, 2, 3, 4, 5, 6, 7])) pipeline.run() + @attr('ValidatesRunner') + def test_flatten_a_flattened_pcollection(self): +pipeline = TestPipeline() +pcoll_1 = pipeline | 'Start 1' >> beam.Create([0, 1, 2, 3]) +pcoll_2 = pipeline | 'Start 2' >> beam.Create([4, 5, 6, 7]) +result = ((pcoll_1, pcoll_2) | 'Flatten' >> beam.Flatten(), + ) |'Flatten again' >> beam.Flatten() +assert_that(result, equal_to([x for x in range(8)])) +pipeline.run() + + @attr('ValidatesRunner') def test_flatten_input_type_must_be_iterable(self): # Inputs to flatten *must* be an iterable. with self.assertRaises(ValueError): 4 | beam.Flatten() + @attr('ValidatesRunner') Review comment: Same. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345622) Time Spent: 8h 20m (was: 8h 10m) > Add more Python validates runner tests > -- > > Key: BEAM-8575 > URL: https://issues.apache.org/jira/browse/BEAM-8575 > Project: Beam > Issue Type: Test > Components: sdk-py-core, testing >Reporter: wendy liu >Assignee: wendy liu >Priority: Major > Time Spent: 8h 20m > Remaining Estimate: 0h > > This is the umbrella issue to track the work of adding more Python tests to > improve test coverage. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8575) Add more Python validates runner tests
[ https://issues.apache.org/jira/browse/BEAM-8575?focusedWorklogId=345620&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345620 ] ASF GitHub Bot logged work on BEAM-8575: Author: ASF GitHub Bot Created on: 18/Nov/19 22:22 Start Date: 18/Nov/19 22:22 Worklog Time Spent: 10m Work Description: robertwb commented on pull request #10102: [BEAM-8575] Test flatten a single PC and test flatten a flattened PC URL: https://github.com/apache/beam/pull/10102#discussion_r347639407 ## File path: sdks/python/apache_beam/transforms/ptransform_test.py ## @@ -536,12 +538,23 @@ def test_flatten_no_pcollections(self):

     assert_that(result, equal_to([]))
     pipeline.run()

+  @attr('ValidatesRunner')
+  def test_flatten_one_single_pcollection(self):
+    pipeline = TestPipeline()
+    input = [0, 1, 2, 3]
+    pcoll = pipeline | 'Input' >> beam.Create(input)
+    result = (pcoll,) | 'Single Flatten' >> beam.Flatten()
+    assert_that(result, equal_to(input))
+    pipeline.run()
+
+  @attr('ValidatesRunner')
   def test_flatten_same_pcollections(self):
     pipeline = TestPipeline()
     pc = pipeline | beam.Create(['a', 'b'])
     assert_that((pc, pc, pc) | beam.Flatten(), equal_to(['a', 'b'] * 3))
     pipeline.run()

+  @attr('ValidatesRunner')

Review comment: This one doesn't need to be ValidatesRunner as it's an SDK feature. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345620) Time Spent: 8h 10m (was: 8h) > Add more Python validates runner tests > -- > > Key: BEAM-8575 > URL: https://issues.apache.org/jira/browse/BEAM-8575 > Project: Beam > Issue Type: Test > Components: sdk-py-core, testing >Reporter: wendy liu >Assignee: wendy liu >Priority: Major > Time Spent: 8h 10m > Remaining Estimate: 0h > > This is the umbrella issue to track the work of adding more Python tests to > improve test coverage. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8575) Add more Python validates runner tests
[ https://issues.apache.org/jira/browse/BEAM-8575?focusedWorklogId=345621&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345621 ] ASF GitHub Bot logged work on BEAM-8575: Author: ASF GitHub Bot Created on: 18/Nov/19 22:22 Start Date: 18/Nov/19 22:22 Worklog Time Spent: 10m Work Description: robertwb commented on pull request #10102: [BEAM-8575] Test flatten a single PC and test flatten a flattened PC URL: https://github.com/apache/beam/pull/10102#discussion_r347640740 ## File path: sdks/python/apache_beam/transforms/ptransform_test.py ## @@ -550,11 +563,23 @@ def test_flatten_pcollections_in_iterable(self): assert_that(result, equal_to([0, 1, 2, 3, 4, 5, 6, 7])) pipeline.run() + @attr('ValidatesRunner') + def test_flatten_a_flattened_pcollection(self): +pipeline = TestPipeline() +pcoll_1 = pipeline | 'Start 1' >> beam.Create([0, 1, 2, 3]) +pcoll_2 = pipeline | 'Start 2' >> beam.Create([4, 5, 6, 7]) +result = ((pcoll_1, pcoll_2) | 'Flatten' >> beam.Flatten(), + ) |'Flatten again' >> beam.Flatten() +assert_that(result, equal_to([x for x in range(8)])) +pipeline.run() + + @attr('ValidatesRunner') Review comment: This raises an error on construction, should not be ValidatesRunner. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345621) Time Spent: 8h 20m (was: 8h 10m) > Add more Python validates runner tests > -- > > Key: BEAM-8575 > URL: https://issues.apache.org/jira/browse/BEAM-8575 > Project: Beam > Issue Type: Test > Components: sdk-py-core, testing >Reporter: wendy liu >Assignee: wendy liu >Priority: Major > Time Spent: 8h 20m > Remaining Estimate: 0h > > This is the umbrella issue to track the work of adding more Python tests to > improve test coverage. 
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8575) Add more Python validates runner tests
[ https://issues.apache.org/jira/browse/BEAM-8575?focusedWorklogId=345619&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345619 ] ASF GitHub Bot logged work on BEAM-8575: Author: ASF GitHub Bot Created on: 18/Nov/19 22:22 Start Date: 18/Nov/19 22:22 Worklog Time Spent: 10m Work Description: robertwb commented on pull request #10102: [BEAM-8575] Test flatten a single PC and test flatten a flattened PC URL: https://github.com/apache/beam/pull/10102#discussion_r347640596 ## File path: sdks/python/apache_beam/transforms/ptransform_test.py ## @@ -550,11 +563,23 @@ def test_flatten_pcollections_in_iterable(self): assert_that(result, equal_to([0, 1, 2, 3, 4, 5, 6, 7])) pipeline.run() + @attr('ValidatesRunner') + def test_flatten_a_flattened_pcollection(self): +pipeline = TestPipeline() +pcoll_1 = pipeline | 'Start 1' >> beam.Create([0, 1, 2, 3]) +pcoll_2 = pipeline | 'Start 2' >> beam.Create([4, 5, 6, 7]) +result = ((pcoll_1, pcoll_2) | 'Flatten' >> beam.Flatten(), Review comment: This is a bit hard to read. It also would probably be good to have a third collection in the mix, e.g. ``` pcoll_12 = (pcoll_1, pcoll_2) | beam.Flatten() pcoll_123 = (pcoll_12, pcoll_3) | 'FlattenAgain' >> beam.Flatten() ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345619) Time Spent: 8h 10m (was: 8h) > Add more Python validates runner tests > -- > > Key: BEAM-8575 > URL: https://issues.apache.org/jira/browse/BEAM-8575 > Project: Beam > Issue Type: Test > Components: sdk-py-core, testing >Reporter: wendy liu >Assignee: wendy liu >Priority: Major > Time Spent: 8h 10m > Remaining Estimate: 0h > > This is the umbrella issue to track the work of adding more Python tests to > improve test coverage. 
-- This message was sent by Atlassian Jira (v8.3.4#803005)
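Editor's note: the three-collection restructuring suggested in the review above can be modelled on in-memory lists, with `itertools.chain` standing in for `beam.Flatten` (an analogy, not a Beam pipeline):

```python
from itertools import chain

pcoll_1 = [0, 1, 2, 3]
pcoll_2 = [4, 5, 6, 7]
pcoll_3 = [8, 9]

# Flatten two collections, then flatten the result with a third...
pcoll_12 = list(chain(pcoll_1, pcoll_2))
pcoll_123 = list(chain(pcoll_12, pcoll_3))

# ...which is equivalent to flattening all three at once.
assert pcoll_123 == list(chain(pcoll_1, pcoll_2, pcoll_3))
print(pcoll_123)
```

The point of the proposed test is that a flattened PCollection is itself an ordinary PCollection and can be fed into a further Flatten, as the list model shows.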
[jira] [Work logged] (BEAM-8335) Add streaming support to Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-8335?focusedWorklogId=345623&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345623 ] ASF GitHub Bot logged work on BEAM-8335: Author: ASF GitHub Bot Created on: 18/Nov/19 22:22 Start Date: 18/Nov/19 22:22 Worklog Time Spent: 10m Work Description: rohdesamuel commented on pull request #10146: [BEAM-8335] Add timestamp to/from protos to Python SDK URL: https://github.com/apache/beam/pull/10146 Adds extra utility in the Python SDK Timestamp object to convert between it and the well-known google.protobuf.timestamp.Timestamp object. Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily: - [ ] [**Choose reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and mention them in a comment (`R: @username`). - [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf). See the [Contributor Guide](https://beam.apache.org/contribute) for more tips on [how to make review process smoother](https://beam.apache.org/contribute/#make-reviewers-job-easier). 
Post-Commit Tests Status (on master branch): a table of build-status badges for the Go, Java, and Python SDKs across the Apex, Dataflow, Flink, Gearpump, Samza, and Spark runners (links to builds.apache.org jobs; the table is truncated in this archive).
[jira] [Work logged] (BEAM-8691) Beam Dependency Update Request: com.google.cloud.bigtable:bigtable-client-core
[ https://issues.apache.org/jira/browse/BEAM-8691?focusedWorklogId=345616&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345616 ] ASF GitHub Bot logged work on BEAM-8691: Author: ASF GitHub Bot Created on: 18/Nov/19 22:19 Start Date: 18/Nov/19 22:19 Worklog Time Spent: 10m Work Description: suztomo commented on issue #10144: [BEAM-8691] Upgrading bigtable-client-core to latest 1.12.1 URL: https://github.com/apache/beam/pull/10144#issuecomment-555235926 R: @chamikaramj, @lukecwik This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345616) Time Spent: 1.5h (was: 1h 20m) > Beam Dependency Update Request: com.google.cloud.bigtable:bigtable-client-core > -- > > Key: BEAM-8691 > URL: https://issues.apache.org/jira/browse/BEAM-8691 > Project: Beam > Issue Type: Sub-task > Components: dependencies >Reporter: Beam JIRA Bot >Assignee: Tomo Suzuki >Priority: Major > Time Spent: 1.5h > Remaining Estimate: 0h > > - 2019-11-15 19:39:51.523448 > - > Please consider upgrading the dependency > com.google.cloud.bigtable:bigtable-client-core. > The current version is 1.8.0. The latest version is 1.12.1 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-7473) Update RestrictionTracker within Python to not be required to be thread safe
[ https://issues.apache.org/jira/browse/BEAM-7473?focusedWorklogId=345615&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345615 ] ASF GitHub Bot logged work on BEAM-7473: Author: ASF GitHub Bot Created on: 18/Nov/19 22:17 Start Date: 18/Nov/19 22:17 Worklog Time Spent: 10m Work Description: robertwb commented on pull request #10118: [BEAM-7473] Pack RangeTracker into restriction URL: https://github.com/apache/beam/pull/10118#discussion_r347636628 ## File path: sdks/python/apache_beam/io/iobase.py ## @@ -1402,20 +1402,36 @@ class _SDFBoundedSourceWrapper(ptransform.PTransform): NOTE: This transform can only be used with beam_fn_api enabled. """ + + class _SDFBoundedSourceRestriction(object): +""" A restriction wraps SourceBundle and RangeTracker. """ +def __init__(self, source_bundle, range_tracker=None): + self.source_bundle = source_bundle + self.range_tracker = range_tracker + +def __reduce__(self): + # The instance of RangeTracker shouldn't be serialized. + return (self.__class__, (self.source_bundle, )) + + class _SDFBoundedSourceRestrictionTracker(RestrictionTracker): """An `iobase.RestrictionTracker` implementations for wrapping BoundedSource -with SDF. +with SDF. The tracked restriction is a (SourceBundle, RangeTracker) pair. +In order to save bytes sent across the wire, the RangeTracker is set as +system tracking RangeTracker only when current_restriction is called. Delegated RangeTracker guarantees synchronization safety. 
""" def __init__(self, restriction): - if not isinstance(restriction, SourceBundle): + if not isinstance(restriction, +_SDFBoundedSourceWrapper._SDFBoundedSourceRestriction): raise ValueError('Initializing SDFBoundedSourceRestrictionTracker' - 'requires a SourceBundle') - self._delegate_range_tracker = restriction.source.get_range_tracker( - restriction.start_position, restriction.stop_position) - self._source = restriction.source - self._weight = restriction.weight + ' requires a _SDFBoundedSourceRestriction') + self._source = restriction.source_bundle.source + self._weight = restriction.source_bundle.weight + self._delegate_range_tracker = self._source.get_range_tracker( Review comment: How about just adding a range_tracker() method to _SDFBoundedSourceRestriction? Then _delegate_range_tracker would not have to be tracked here at all. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345615) Time Spent: 1h 20m (was: 1h 10m) > Update RestrictionTracker within Python to not be required to be thread safe > > > Key: BEAM-7473 > URL: https://issues.apache.org/jira/browse/BEAM-7473 > Project: Beam > Issue Type: Improvement > Components: sdk-py-core, sdk-py-harness >Reporter: Luke Cwik >Assignee: Boyuan Zhang >Priority: Major > Fix For: 2.17.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > The commit > [https://github.com/apache/beam/commit/8faffb2bcf28ccab5e9a95322743cc60df65077c#diff-ed95abb6bc30a9ed07faef5c3fea93f0] > modified the Java SDK removed the need for users to be thread safe and > instead made the framework provide the necessary locking around a restriction > tracker. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-7473) Update RestrictionTracker within Python to not be required to be thread safe
[ https://issues.apache.org/jira/browse/BEAM-7473?focusedWorklogId=345614&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345614 ] ASF GitHub Bot logged work on BEAM-7473: Author: ASF GitHub Bot Created on: 18/Nov/19 22:17 Start Date: 18/Nov/19 22:17 Worklog Time Spent: 10m Work Description: robertwb commented on pull request #10118: [BEAM-7473] Pack RangeTracker into restriction URL: https://github.com/apache/beam/pull/10118#discussion_r347636857 ## File path: sdks/python/apache_beam/io/iobase.py ## @@ -1424,11 +1440,12 @@ def current_progress(self): def current_restriction(self): start_pos = self._delegate_range_tracker.start_position() stop_pos = self._delegate_range_tracker.stop_position() - return SourceBundle( - self._weight, - self._source, - start_pos, - stop_pos) + return _SDFBoundedSourceWrapper._SDFBoundedSourceRestriction( Review comment: Rather than re-creating it here, how about just storing (and returning) the initial _SDFBoundedSourceRestriction object (which does lazy initialization in range_tracker())? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345614) Time Spent: 1h 10m (was: 1h) > Update RestrictionTracker within Python to not be required to be thread safe > > > Key: BEAM-7473 > URL: https://issues.apache.org/jira/browse/BEAM-7473 > Project: Beam > Issue Type: Improvement > Components: sdk-py-core, sdk-py-harness >Reporter: Luke Cwik >Assignee: Boyuan Zhang >Priority: Major > Fix For: 2.17.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > The commit > [https://github.com/apache/beam/commit/8faffb2bcf28ccab5e9a95322743cc60df65077c#diff-ed95abb6bc30a9ed07faef5c3fea93f0] > modified the Java SDK removed the need for users to be thread safe and > instead made the framework provide the necessary locking around a restriction > tracker. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-7473) Update RestrictionTracker within Python to not be required to be thread safe
[ https://issues.apache.org/jira/browse/BEAM-7473?focusedWorklogId=345613&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345613 ] ASF GitHub Bot logged work on BEAM-7473: Author: ASF GitHub Bot Created on: 18/Nov/19 22:17 Start Date: 18/Nov/19 22:17 Worklog Time Spent: 10m Work Description: robertwb commented on pull request #10118: [BEAM-7473] Pack RangeTracker into restriction URL: https://github.com/apache/beam/pull/10118#discussion_r347637047 ## File path: sdks/python/apache_beam/io/iobase.py ## @@ -1456,10 +1473,17 @@ def try_split(self, fraction_of_remainder): residual_weight = self._weight - primary_weight # Update self._weight to primary weight self._weight = primary_weight -return (SourceBundle(primary_weight, self._source, start_pos, - split_pos), -SourceBundle(residual_weight, self._source, split_pos, - stop_pos)) +return ( Review comment: You could add a split method to the _SDFBoundedSourceRestriction that would contain most of the logic of this method. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345613) Time Spent: 1h 10m (was: 1h) > Update RestrictionTracker within Python to not be required to be thread safe > > > Key: BEAM-7473 > URL: https://issues.apache.org/jira/browse/BEAM-7473 > Project: Beam > Issue Type: Improvement > Components: sdk-py-core, sdk-py-harness >Reporter: Luke Cwik >Assignee: Boyuan Zhang >Priority: Major > Fix For: 2.17.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > The commit > [https://github.com/apache/beam/commit/8faffb2bcf28ccab5e9a95322743cc60df65077c#diff-ed95abb6bc30a9ed07faef5c3fea93f0] > modified the Java SDK removed the need for users to be thread safe and > instead made the framework provide the necessary locking around a restriction > tracker. -- This message was sent by Atlassian Jira (v8.3.4#803005)
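The suggested refactor above — moving most of `try_split`'s logic into a `split` method on the restriction itself — might look roughly like this sketch. `SourceBundle` is again a hypothetical stand-in, and positions/weights are simplified to plain numbers:

```python
from collections import namedtuple

# Hypothetical stand-in for apache_beam.io.iobase.SourceBundle.
SourceBundle = namedtuple(
    'SourceBundle', ['weight', 'source', 'start_position', 'stop_position'])

class SDFBoundedSourceRestriction(object):
    def __init__(self, source_bundle):
        self.source_bundle = source_bundle

    def split(self, split_pos, primary_weight):
        """Split into (primary, residual) restrictions around split_pos."""
        bundle = self.source_bundle
        residual_weight = bundle.weight - primary_weight
        primary = SDFBoundedSourceRestriction(SourceBundle(
            primary_weight, bundle.source, bundle.start_position, split_pos))
        residual = SDFBoundedSourceRestriction(SourceBundle(
            residual_weight, bundle.source, split_pos, bundle.stop_position))
        return primary, residual
```

The tracker's `try_split` would then only compute `split_pos` and `primary_weight` from `fraction_of_remainder`, delegate to `restriction.split(...)`, and update its own weight to the primary's.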
[jira] [Work logged] (BEAM-7473) Update RestrictionTracker within Python to not be required to be thread safe
[ https://issues.apache.org/jira/browse/BEAM-7473?focusedWorklogId=345612&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345612 ] ASF GitHub Bot logged work on BEAM-7473: Author: ASF GitHub Bot Created on: 18/Nov/19 22:17 Start Date: 18/Nov/19 22:17 Worklog Time Spent: 10m Work Description: robertwb commented on pull request #10118: [BEAM-7473] Pack RangeTracker into restriction URL: https://github.com/apache/beam/pull/10118#discussion_r347635742 ## File path: sdks/python/apache_beam/io/iobase.py ## @@ -1404,18 +1404,21 @@ class _SDFBoundedSourceWrapper(ptransform.PTransform): """ class _SDFBoundedSourceRestrictionTracker(RestrictionTracker): """An `iobase.RestrictionTracker` implementations for wrapping BoundedSource -with SDF. +with SDF. The tracked restriction is a (SourceBundle, RangeTracker) pair. +In order to save bytes sent across the wire, the RangeTracker is set as +system tracking RangeTracker only when current_restriction is called. Review comment: Update comment? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345612) Time Spent: 1h 10m (was: 1h) > Update RestrictionTracker within Python to not be required to be thread safe > > > Key: BEAM-7473 > URL: https://issues.apache.org/jira/browse/BEAM-7473 > Project: Beam > Issue Type: Improvement > Components: sdk-py-core, sdk-py-harness >Reporter: Luke Cwik >Assignee: Boyuan Zhang >Priority: Major > Fix For: 2.17.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > The commit > [https://github.com/apache/beam/commit/8faffb2bcf28ccab5e9a95322743cc60df65077c#diff-ed95abb6bc30a9ed07faef5c3fea93f0] > modified the Java SDK removed the need for users to be thread safe and > instead made the framework provide the necessary locking around a restriction > tracker. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (BEAM-4775) JobService should support returning metrics
[ https://issues.apache.org/jira/browse/BEAM-4775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lukasz Gajowy resolved BEAM-4775. - Fix Version/s: Not applicable Resolution: Fixed the JobServer part is done - all kinds of MonitoringInfos are forwarded through grpc to PortableRunners (sdk side). The PortableRunners can decide how to digest them (e.g. create MetricsResult from monitoring infos where possible). > JobService should support returning metrics > --- > > Key: BEAM-4775 > URL: https://issues.apache.org/jira/browse/BEAM-4775 > Project: Beam > Issue Type: Bug > Components: beam-model >Reporter: Eugene Kirpichov >Assignee: Lukasz Gajowy >Priority: Major > Fix For: Not applicable > > Time Spent: 55h > Remaining Estimate: 0h > > Design doc: [https://s.apache.org/get-metrics-api]. > Further discussion is ongoing on [this > doc|https://docs.google.com/document/d/1m83TsFvJbOlcLfXVXprQm1B7vUakhbLZMzuRrOHWnTg/edit?ts=5c826bb4#heading=h.faqan9rjc6dm]. > We want to report job metrics back to the portability harness from the runner > harness, for displaying to users. > h1. Relevant PRs in flight: > h2. Ready for Review: > * [#8022|https://github.com/apache/beam/pull/8022]: correct the Job RPC > protos from [#8018|https://github.com/apache/beam/pull/8018]. > h2. Iterating / Discussing: > * [#7971|https://github.com/apache/beam/pull/7971]: Flink portable metrics: > get ptransform from MonitoringInfo, not stage name > ** this is a simpler, Flink-specific PR that is basically duplicated inside > each of the following two, so may be worth trying to merge in first > * #[7915|https://github.com/apache/beam/pull/7915]: use MonitoringInfo data > model in Java SDK metrics > * [#7868|https://github.com/apache/beam/pull/7868]: MonitoringInfo URN tweaks > h2. 
Merged > * [#8018|https://github.com/apache/beam/pull/8018]: add job metrics RPC > protos > * [#7867|https://github.com/apache/beam/pull/7867]: key MetricResult by a > MetricKey > * [#7938|https://github.com/apache/beam/pull/7938]: move MonitoringInfo > protos to model/pipeline module > * [#7883|https://github.com/apache/beam/pull/7883]: Add > MetricQueryResults.allMetrics() helper > * [#7866|https://github.com/apache/beam/pull/7866]: move function helpers > from fn-harness to sdks/java/core > * [#7890|https://github.com/apache/beam/pull/7890]: consolidate MetricResult > implementations > h2. Closed > * [#7934|https://github.com/apache/beam/pull/7934]: job metrics RPC + SDK > support > * [#7876|https://github.com/apache/beam/pull/7876]: Clean up metric protos; > support integer distributions, gauges -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-8651) Python 3 portable pipelines sometimes fail with errors in StockUnpickler.find_class()
[ https://issues.apache.org/jira/browse/BEAM-8651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16976938#comment-16976938 ] Valentyn Tymofieiev commented on BEAM-8651: --- It is possible that the error is caused by https://bugs.python.org/issue35943, which is fixed on cpython master, but not on the 3.7 branch. This would explain why I can still reproduce the error on Python 3.7.5rc1. Also, as per https://bugs.python.org/issue34572, pickling fixes will not be backported to Python 3.5, 3.6. > Python 3 portable pipelines sometimes fail with errors in > StockUnpickler.find_class() > - > > Key: BEAM-8651 > URL: https://issues.apache.org/jira/browse/BEAM-8651 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-core >Reporter: Valentyn Tymofieiev >Assignee: Valentyn Tymofieiev >Priority: Major > Attachments: beam8651.py > > > Several Beam users [1,2] reported an error which happens on Python 3 in > StockUnpickler.find_class. > So far I've seen reports of the error on Python 3.5, 3.6, and 3.7.1, on Flink > and Dataflow runners. On Dataflow runner so far I have seen this in streaming > pipelines only, which use portable SDK worker. 
> Typical stack trace: > {noformat} > File > "python3.5/site-packages/apache_beam/runners/worker/bundle_processor.py", > line 1148, in _create_pardo_operation > dofn_data = pickler.loads(serialized_fn) > > File "python3.5/site-packages/apache_beam/internal/pickler.py", line 265, > in loads > return dill.loads(s) > > File "python3.5/site-packages/dill/_dill.py", line 317, in loads > > return load(file, ignore) > > File "python3.5/site-packages/dill/_dill.py", line 305, in load > > obj = pik.load() > > File "python3.5/site-packages/dill/_dill.py", line 474, in find_class > > return StockUnpickler.find_class(self, module, name) > > AttributeError: Can't get attribute 'ClassName' on <module 'filename' from > 'python3.5/site-packages/filename.py'> > {noformat} > According to Guenther from [1]: > {quote} > This looks exactly like a race condition that we've encountered on Python > 3.7.1: There's a bug in some older 3.7.x releases that breaks the > thread-safety of the unpickler, as concurrent unpickle threads can access a > module before it has been fully imported. See > https://bugs.python.org/issue34572 for more information. > The traceback shows a Python 3.6 venv so this could be a different issue > (the unpickle bug was introduced in version 3.7). If it's the same bug then > upgrading to Python 3.7.3 or higher should fix that issue. One potential > workaround is to ensure that all of the modules get imported during the > initialization of the sdk_worker, as this bug only affects imports done by > the unpickler. > {quote} > Opening this for visibility. Current open questions are: > 1. Find a minimal example to reproduce this issue. > 2. Figure out whether users are still affected by this issue on Python 3.7.3. > 3. Communicate workarounds for 3.5, 3.6 users affected by this. > [1] > https://lists.apache.org/thread.html/5581ddfcf6d2ae10d25b834b8a61ebee265ffbcf650c6ec8d1e69408@%3Cdev.beam.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
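The workaround quoted above — eagerly importing every module the pipeline needs while the sdk_worker initializes, so the non-thread-safe unpickler never triggers a concurrent first import — can be sketched like this. The module list is hypothetical; a real worker would derive it from the pipeline's dependencies:

```python
import importlib
import sys

def preimport(module_names):
    """Import modules up front so later concurrent unpickling finds them
    fully initialized in sys.modules instead of racing on first import."""
    for name in module_names:
        importlib.import_module(name)

# Hypothetical dependency list; in practice this would be computed from the
# pipeline's staged code before any worker threads start unpickling.
preimport(['json', 'decimal'])
assert 'json' in sys.modules and 'decimal' in sys.modules
```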
[jira] [Assigned] (BEAM-4775) JobService should support returning metrics
[ https://issues.apache.org/jira/browse/BEAM-4775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lukasz Gajowy reassigned BEAM-4775: --- Assignee: Lukasz Gajowy (was: Kamil Wasilewski) > JobService should support returning metrics > --- > > Key: BEAM-4775 > URL: https://issues.apache.org/jira/browse/BEAM-4775 > Project: Beam > Issue Type: Bug > Components: beam-model >Reporter: Eugene Kirpichov >Assignee: Lukasz Gajowy >Priority: Major > Time Spent: 55h > Remaining Estimate: 0h > > Design doc: [https://s.apache.org/get-metrics-api]. > Further discussion is ongoing on [this > doc|https://docs.google.com/document/d/1m83TsFvJbOlcLfXVXprQm1B7vUakhbLZMzuRrOHWnTg/edit?ts=5c826bb4#heading=h.faqan9rjc6dm]. > We want to report job metrics back to the portability harness from the runner > harness, for displaying to users. > h1. Relevant PRs in flight: > h2. Ready for Review: > * [#8022|https://github.com/apache/beam/pull/8022]: correct the Job RPC > protos from [#8018|https://github.com/apache/beam/pull/8018]. > h2. Iterating / Discussing: > * [#7971|https://github.com/apache/beam/pull/7971]: Flink portable metrics: > get ptransform from MonitoringInfo, not stage name > ** this is a simpler, Flink-specific PR that is basically duplicated inside > each of the following two, so may be worth trying to merge in first > * #[7915|https://github.com/apache/beam/pull/7915]: use MonitoringInfo data > model in Java SDK metrics > * [#7868|https://github.com/apache/beam/pull/7868]: MonitoringInfo URN tweaks > h2. 
Merged > * [#8018|https://github.com/apache/beam/pull/8018]: add job metrics RPC > protos > * [#7867|https://github.com/apache/beam/pull/7867]: key MetricResult by a > MetricKey > * [#7938|https://github.com/apache/beam/pull/7938]: move MonitoringInfo > protos to model/pipeline module > * [#7883|https://github.com/apache/beam/pull/7883]: Add > MetricQueryResults.allMetrics() helper > * [#7866|https://github.com/apache/beam/pull/7866]: move function helpers > from fn-harness to sdks/java/core > * [#7890|https://github.com/apache/beam/pull/7890]: consolidate MetricResult > implementations > h2. Closed > * [#7934|https://github.com/apache/beam/pull/7934]: job metrics RPC + SDK > support > * [#7876|https://github.com/apache/beam/pull/7876]: Clean up metric protos; > support integer distributions, gauges -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (BEAM-4777) Python PortableRunner should support metrics
[ https://issues.apache.org/jira/browse/BEAM-4777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lukasz Gajowy reassigned BEAM-4777: --- Assignee: Kamil Wasilewski (was: Lukasz Gajowy) > Python PortableRunner should support metrics > > > Key: BEAM-4777 > URL: https://issues.apache.org/jira/browse/BEAM-4777 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Eugene Kirpichov >Assignee: Kamil Wasilewski >Priority: Major > > BEAM-4775 concerns adding metrics to the JobService API; the current issue is > about making Python PortableRunner understand them. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (BEAM-5994) Publish metrics from load tests to BigQuery database
[ https://issues.apache.org/jira/browse/BEAM-5994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kamil Wasilewski resolved BEAM-5994. Resolution: Fixed > Publish metrics from load tests to BigQuery database > > > Key: BEAM-5994 > URL: https://issues.apache.org/jira/browse/BEAM-5994 > Project: Beam > Issue Type: Sub-task > Components: testing >Reporter: Kasia Kucharczyk >Assignee: Kasia Kucharczyk >Priority: Major > Fix For: Not applicable > > Time Spent: 2h 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8645) TimestampCombiner incorrect in beam python
[ https://issues.apache.org/jira/browse/BEAM-8645?focusedWorklogId=345610&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345610 ] ASF GitHub Bot logged work on BEAM-8645: Author: ASF GitHub Bot Created on: 18/Nov/19 22:08 Start Date: 18/Nov/19 22:08 Worklog Time Spent: 10m Work Description: robertwb commented on pull request #10135: [BEAM-8645] Create a py test case for Re-iteration on GBK result. URL: https://github.com/apache/beam/pull/10135#discussion_r347634526 ## File path: sdks/python/apache_beam/transforms/ptransform_test.py ## @@ -471,6 +471,24 @@ def test_group_by_key(self): assert_that(result, equal_to([(1, [1, 2, 3]), (2, [1, 2]), (3, [1])])) pipeline.run() + def test_group_by_key_reiteration(self): +class MyDoFn(beam.DoFn): + def process(self, gbk_result): +value_list = gbk_result[1] Review comment: More idiomatic to do `key, value_list = gbk_result` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345610) Time Spent: 2h 50m (was: 2h 40m) > TimestampCombiner incorrect in beam python > -- > > Key: BEAM-8645 > URL: https://issues.apache.org/jira/browse/BEAM-8645 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Ruoyun Huang >Priority: Major > Time Spent: 2h 50m > Remaining Estimate: 0h > > When we have a TimestampValue on combine: > {code:java} > main_stream = (p > | 'main TestStream' >> TestStream() > .add_elements([window.TimestampedValue(('k', 100), 0)]) > .add_elements([window.TimestampedValue(('k', 400), 9)]) > .advance_watermark_to_infinity() > | 'main windowInto' >> beam.WindowInto( > window.FixedWindows(10), > timestamp_combiner=TimestampCombiner.OUTPUT_AT_LATEST) | > 'Combine' >> beam.CombinePerKey(sum)) > The expect timestamp should be: > LATEST: (('k', 500), Timestamp(9)), > EARLIEST: (('k', 500), Timestamp(0)), > END_OF_WINDOW: (('k', 500), Timestamp(10)), > But current py streaming gives following results: > LATEST: (('k', 500), Timestamp(10)), > EARLIEST: (('k', 500), Timestamp(10)), > END_OF_WINDOW: (('k', 500), Timestamp(9.)), > More details and discussions: > https://lists.apache.org/thread.html/d3af1f2f84a2e59a747196039eae77812b78a991f0f293c717e5f4e1@%3Cdev.beam.apache.org%3E > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
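The unpacking suggested in the review comment above is plain tuple destructuring of the `(key, values)` element a GroupByKey produces, shown here outside Beam:

```python
# A GroupByKey output element is a (key, iterable_of_values) pair.
gbk_result = (1, [1, 2, 3, 4])

# Indexed access, as written in the patch under review:
value_list = gbk_result[1]

# The more idiomatic tuple unpacking the reviewer suggests:
key, values = gbk_result

assert key == 1
assert values is value_list
```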
[jira] [Work logged] (BEAM-8645) TimestampCombiner incorrect in beam python
[ https://issues.apache.org/jira/browse/BEAM-8645?focusedWorklogId=345611&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345611 ] ASF GitHub Bot logged work on BEAM-8645: Author: ASF GitHub Bot Created on: 18/Nov/19 22:08 Start Date: 18/Nov/19 22:08 Worklog Time Spent: 10m Work Description: robertwb commented on pull request #10135: [BEAM-8645] Create a py test case for Re-iteration on GBK result. URL: https://github.com/apache/beam/pull/10135#discussion_r347635063 ## File path: sdks/python/apache_beam/transforms/ptransform_test.py ## @@ -471,6 +471,24 @@ def test_group_by_key(self): assert_that(result, equal_to([(1, [1, 2, 3]), (2, [1, 2]), (3, [1])])) pipeline.run() + def test_group_by_key_reiteration(self): +class MyDoFn(beam.DoFn): + def process(self, gbk_result): +value_list = gbk_result[1] +sum_val = 0 +# Iterate the GBK result for multiple times. +for _ in range(0, 17): + sum_val += sum(value_list) +return (gbk_result[0], sum_val) + +pipeline = TestPipeline() +pcoll = pipeline | 'start' >> beam.Create( +[(1, 1), (1, 2), (1, 3), (1, 4)]) +result = (pcoll | 'Group' >> beam.GroupByKey() + | 'Reiteration-Sum' >> beam.ParDo(MyDoFn())) +assert_that(result, equal_to([1, 170])) Review comment: `equal_to` takes a list of PCollection elements. This should be `equal_to([(1, 170)])`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345611) Time Spent: 3h (was: 2h 50m) > TimestampCombiner incorrect in beam python > -- > > Key: BEAM-8645 > URL: https://issues.apache.org/jira/browse/BEAM-8645 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Ruoyun Huang >Priority: Major > Time Spent: 3h > Remaining Estimate: 0h > > When we have a TimestampValue on combine: > {code:java} > main_stream = (p > | 'main TestStream' >> TestStream() > .add_elements([window.TimestampedValue(('k', 100), 0)]) > .add_elements([window.TimestampedValue(('k', 400), 9)]) > .advance_watermark_to_infinity() > | 'main windowInto' >> beam.WindowInto( > window.FixedWindows(10), > timestamp_combiner=TimestampCombiner.OUTPUT_AT_LATEST) | > 'Combine' >> beam.CombinePerKey(sum)) > The expect timestamp should be: > LATEST: (('k', 500), Timestamp(9)), > EARLIEST: (('k', 500), Timestamp(0)), > END_OF_WINDOW: (('k', 500), Timestamp(10)), > But current py streaming gives following results: > LATEST: (('k', 500), Timestamp(10)), > EARLIEST: (('k', 500), Timestamp(10)), > END_OF_WINDOW: (('k', 500), Timestamp(9.)), > More details and discussions: > https://lists.apache.org/thread.html/d3af1f2f84a2e59a747196039eae77812b78a991f0f293c717e5f4e1@%3Cdev.beam.apache.org%3E > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
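The fix asked for in the review above hinges on `equal_to` taking a list of PCollection elements: the pipeline emits one element, the tuple `(1, 170)` (17 iterations of `sum([1, 2, 3, 4]) == 10` gives 170), so the expectation must be wrapped as `[(1, 170)]`, not `[1, 170]`. A plain-Python illustration of the difference:

```python
# What the Reiteration-Sum DoFn computes for key 1: sum the grouped values
# 17 times, i.e. 17 * (1 + 2 + 3 + 4) = 170.
values = [1, 2, 3, 4]
sum_val = 0
for _ in range(0, 17):
    sum_val += sum(values)
result = [(1, sum_val)]          # one tuple element in the PCollection

assert result == [(1, 170)]      # correct expectation: a list of elements
assert result != [1, 170]        # the buggy form: two scalar "elements"
```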
[jira] [Work logged] (BEAM-8575) Add more Python validates runner tests
[ https://issues.apache.org/jira/browse/BEAM-8575?focusedWorklogId=345603&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345603 ] ASF GitHub Bot logged work on BEAM-8575: Author: ASF GitHub Bot Created on: 18/Nov/19 21:43 Start Date: 18/Nov/19 21:43 Worklog Time Spent: 10m Work Description: liumomo315 commented on issue #10145: [BEAM-8575] Add a Python test to test windowing in DoFn finish_bundle() URL: https://github.com/apache/beam/pull/10145#issuecomment-555223170 R: @y1chi PTAL Yichi! Thanks:) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345603) Time Spent: 8h (was: 7h 50m) > Add more Python validates runner tests > -- > > Key: BEAM-8575 > URL: https://issues.apache.org/jira/browse/BEAM-8575 > Project: Beam > Issue Type: Test > Components: sdk-py-core, testing >Reporter: wendy liu >Assignee: wendy liu >Priority: Major > Time Spent: 8h > Remaining Estimate: 0h > > This is the umbrella issue to track the work of adding more Python tests to > improve test coverage. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-4132) Element type inference doesn't work for multi-output DoFns
[ https://issues.apache.org/jira/browse/BEAM-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16976906#comment-16976906 ] Udi Meiri commented on BEAM-4132: - PCollections have element_type, inherited from PValue. The default value is None, which is not a valid value for a type. Currently, the solution for DoOutputsTuple is to always set the type to Any. The types should be derived from one of: 1. Tagged type hints. This doesn't exist yet. 2. Inferred result/output type. Pipeline._infer_result_type doesn't seem to work on undeclared tags (when DoOutputsTuple._tags is empty). > Element type inference doesn't work for multi-output DoFns > -- > > Key: BEAM-4132 > URL: https://issues.apache.org/jira/browse/BEAM-4132 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Affects Versions: 2.4.0 >Reporter: Chuan Yu Foo >Assignee: Udi Meiri >Priority: Major > Time Spent: 2.5h > Remaining Estimate: 0h > > TLDR: if you have a multi-output DoFn, then the non-main PCollections will > incorrectly have their element types set to None. This affects type checking > for pipelines involving these PCollections. 
> Minimal example: > {code} > import apache_beam as beam > class TripleDoFn(beam.DoFn): > def process(self, elem): > yield elem > if elem % 2 == 0: > yield beam.pvalue.TaggedOutput('ten_times', elem * 10) > if elem % 3 == 0: > yield beam.pvalue.TaggedOutput('hundred_times', elem * 100) > > @beam.typehints.with_input_types(int) > @beam.typehints.with_output_types(int) > class MultiplyBy(beam.DoFn): > def __init__(self, multiplier): > self._multiplier = multiplier > def process(self, elem): > return elem * self._multiplier > > def main(): > with beam.Pipeline() as p: > x, a, b = ( > p > | 'Create' >> beam.Create([1, 2, 3]) > | 'TripleDo' >> beam.ParDo(TripleDoFn()).with_outputs( > 'ten_times', 'hundred_times', main='main_output')) > _ = a | 'MultiplyBy2' >> beam.ParDo(MultiplyBy(2)) > if __name__ == '__main__': > main() > {code} > Running this yields the following error: > {noformat} > apache_beam.typehints.decorators.TypeCheckError: Type hint violation for > 'MultiplyBy2': requires <class 'int'> but got None for elem > {noformat} > Replacing {{a}} with {{b}} yields the same error. Replacing {{a}} with {{x}} > instead yields the following error: > {noformat} > apache_beam.typehints.decorators.TypeCheckError: Type hint violation for > 'MultiplyBy2': requires <class 'int'> but got Union[TaggedOutput, int] for elem > {noformat} > I would expect Beam to correctly infer that {{a}} and {{b}} have element > types of {{int}} rather than {{None}}, and I would also expect Beam to > correctly figure out that the element types of {{x}} are compatible with > {{int}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
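The Union[TaggedOutput, int] inference reported for {{x}} follows from TripleDoFn's process() yielding both bare ints and TaggedOutput wrappers: naive output-type inference unions the yield types instead of unwrapping tagged values. A hypothetical, Beam-free model of that observation:

```python
# Hypothetical stand-in for beam.pvalue.TaggedOutput, for illustration only.
class TaggedOutput(object):
    def __init__(self, tag, value):
        self.tag, self.value = tag, value

def triple_do(elem):
    # Mirrors TripleDoFn.process from the minimal example above.
    yield elem
    if elem % 2 == 0:
        yield TaggedOutput('ten_times', elem * 10)
    if elem % 3 == 0:
        yield TaggedOutput('hundred_times', elem * 100)

# Collect the types actually yielded across the inputs [1, 2, 3]: a mix of
# int and TaggedOutput, which naive inference folds into a Union.
yielded_types = {type(v).__name__ for e in [1, 2, 3] for v in triple_do(e)}
assert yielded_types == {'int', 'TaggedOutput'}
```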
[jira] [Work logged] (BEAM-8575) Add more Python validates runner tests
[ https://issues.apache.org/jira/browse/BEAM-8575?focusedWorklogId=345597&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345597 ] ASF GitHub Bot logged work on BEAM-8575: Author: ASF GitHub Bot Created on: 18/Nov/19 21:34 Start Date: 18/Nov/19 21:34 Worklog Time Spent: 10m Work Description: liumomo315 commented on pull request #10145: [BEAM-8575] Add a Python test to test windowing in DoFn finish_bundle() URL: https://github.com/apache/beam/pull/10145 This test is the Python parity for the Java ParDo test testWindowingInStartAndFinishBundle. Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily: - [ ] [**Choose reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and mention them in a comment (`R: @username`). - [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf). See the [Contributor Guide](https://beam.apache.org/contribute) for more tips on [how to make review process smoother](https://beam.apache.org/contribute/#make-reviewers-job-easier). 
Post-Commit Tests Status (on master branch): [build-status badge table for the Go, Java, and Python SDKs across the Apex, Dataflow, Flink, Gearpump, Samza, and Spark runners; the table was truncated in the original message]
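The PR above tests windowing behavior in a DoFn's finish_bundle(). The underlying issue the Java and Python tests exercise is that process() sees one element at a time, each carrying its own window, while finish_bundle() runs after the last element with no element context, so any output there must be assigned a timestamp and window explicitly. A minimal pure-Python sketch of that idea (a simplified model, not the real Beam API; all class and value names here are hypothetical):

```python
class WindowedValue:
    """Minimal stand-in for Beam's windowed-value wrapper."""

    def __init__(self, value, timestamp, windows):
        self.value = value
        self.timestamp = timestamp
        self.windows = windows


class BufferingDoFn:
    """Buffers elements in process() and flushes them in finish_bundle()."""

    def __init__(self):
        self._buffer = []

    def process(self, element, window):
        # Element context, including the element's window, is available here.
        self._buffer.append(element)

    def finish_bundle(self):
        # No element context here: the timestamp and window of the flushed
        # output have to be chosen explicitly, which is the behavior the
        # windowing-in-finish_bundle tests probe.
        yield WindowedValue(list(self._buffer), 0, ["GlobalWindow"])
        self._buffer = []


fn = BufferingDoFn()
fn.process("a", "w1")
fn.process("b", "w2")
flushed = list(fn.finish_bundle())
```

In the real SDK the analogous step is wrapping finish_bundle() outputs in Beam's windowed-value type rather than yielding raw elements; the sketch only illustrates why that wrapping is required.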
[jira] [Work logged] (BEAM-8335) Add streaming support to Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-8335?focusedWorklogId=345589&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-345589 ] ASF GitHub Bot logged work on BEAM-8335: Author: ASF GitHub Bot Created on: 18/Nov/19 21:22 Start Date: 18/Nov/19 21:22 Worklog Time Spent: 10m Work Description: robertwb commented on issue #10119: [BEAM-8335] Adds the StreamingCache URL: https://github.com/apache/beam/pull/10119#issuecomment-555215039 Run Python PreCommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 345589) Time Spent: 31.5h (was: 31h 20m) > Add streaming support to Interactive Beam > - > > Key: BEAM-8335 > URL: https://issues.apache.org/jira/browse/BEAM-8335 > Project: Beam > Issue Type: Improvement > Components: runner-py-interactive >Reporter: Sam Rohde >Assignee: Sam Rohde >Priority: Major > Time Spent: 31.5h > Remaining Estimate: 0h > > This issue tracks the work items to introduce streaming support to the > Interactive Beam experience. This will allow users to: > * Write and run a streaming job in IPython > * Automatically cache records from unbounded sources > * Add a replay experience that replays all cached records to simulate the > original pipeline execution > * Add controls to play/pause/stop/step individual elements from the cached > records > * Add ability to inspect/visualize unbounded PCollections -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-8668) TypeError: _create_function() takes from 2 to 6 positional arguments but 7 were given
[ https://issues.apache.org/jira/browse/BEAM-8668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16976890#comment-16976890 ] Valentyn Tymofieiev commented on BEAM-8668: --- Correction: This happens when SDK uses dill >=0.3.1.1 for pickling and worker uses dill <= 0.3.0 for unpickling. > TypeError: _create_function() takes from 2 to 6 positional arguments but 7 > were given > - > > Key: BEAM-8668 > URL: https://issues.apache.org/jira/browse/BEAM-8668 > Project: Beam > Issue Type: Bug > Components: sdk-py-harness >Reporter: Kyle Weaver >Assignee: Kyle Weaver >Priority: Major > > This error looks to me a lot like a mismatch between SDK versions (Dill > dependencies), but I've had a couple reports of it so I thought it worth > making it discoverable here. > > Traceback (most recent call last): > File > "/usr/local/lib/python3.6/site-packages/apache_beam/runners/worker/sdk_worker.py", > line 158, in _execute > response = task() > File > "/usr/local/lib/python3.6/site-packages/apache_beam/runners/worker/sdk_worker.py", > line 191, in > self._execute(lambda: worker.do_instruction(work), work) > File > "/usr/local/lib/python3.6/site-packages/apache_beam/runners/worker/sdk_worker.py", > line 343, in do_instruction > request.instruction_id) > File > "/usr/local/lib/python3.6/site-packages/apache_beam/runners/worker/sdk_worker.py", > line 363, in process_bundle > instruction_id, request.process_bundle_descriptor_reference) > File > "/usr/local/lib/python3.6/site-packages/apache_beam/runners/worker/sdk_worker.py", > line 306, in get > self.data_channel_factory) > File > "/usr/local/lib/python3.6/site-packages/apache_beam/runners/worker/bundle_processor.py", > line 578, in __init__ > self.ops = self.create_execution_tree(self.process_bundle_descriptor) > File > "/usr/local/lib/python3.6/site-packages/apache_beam/runners/worker/bundle_processor.py", > line 622, in create_execution_tree > descriptor.transforms, key=topological_height, 
reverse=True)]) > File > "/usr/local/lib/python3.6/site-packages/apache_beam/runners/worker/bundle_processor.py", > line 621, in > for transform_id in sorted( > File > "/usr/local/lib/python3.6/site-packages/apache_beam/runners/worker/bundle_processor.py", > line 546, in wrapper > result = cache[args] = func(*args) > File > "/usr/local/lib/python3.6/site-packages/apache_beam/runners/worker/bundle_processor.py", > line 605, in get_operation > in descriptor.transforms[transform_id].outputs.items() > File > "/usr/local/lib/python3.6/site-packages/apache_beam/runners/worker/bundle_processor.py", > line 604, in > for tag, pcoll_id > File > "/usr/local/lib/python3.6/site-packages/apache_beam/runners/worker/bundle_processor.py", > line 603, in > tag: [get_operation(op) for op in pcoll_consumers[pcoll_id]] > File > "/usr/local/lib/python3.6/site-packages/apache_beam/runners/worker/bundle_processor.py", > line 546, in wrapper > result = cache[args] = func(*args) > File > "/usr/local/lib/python3.6/site-packages/apache_beam/runners/worker/bundle_processor.py", > line 608, in get_operation > transform_id, transform_consumers) > File > "/usr/local/lib/python3.6/site-packages/apache_beam/runners/worker/bundle_processor.py", > line 867, in create_operation > return creator(self, transform_id, transform_proto, payload, consumers) > File > "/usr/local/lib/python3.6/site-packages/apache_beam/runners/worker/bundle_processor.py", > line 1110, in create > serialized_fn, parameter) > File > "/usr/local/lib/python3.6/site-packages/apache_beam/runners/worker/bundle_processor.py", > line 1148, in _create_pardo_operation > dofn_data = pickler.loads(serialized_fn) > File > "/usr/local/lib/python3.6/site-packages/apache_beam/internal/pickler.py", > line 265, in loads > return dill.loads(s) > File "/usr/local/lib/python3.6/site-packages/dill/_dill.py", line 317, in > loads > return load(file, ignore) > File "/usr/local/lib/python3.6/site-packages/dill/_dill.py", line 305, in load > obj = 
pik.load() > TypeError: _create_function() takes from 2 to 6 positional arguments but 7 > were given > -- This message was sent by Atlassian Jira (v8.3.4#803005)
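Per the correction above, the TypeError appears when the pickling side runs dill >= 0.3.1.1 while the unpickling worker runs dill <= 0.3.0. A hedged sketch of a pre-flight compatibility check built on that reported boundary (the helper names are hypothetical and not part of the Beam SDK; release-style version strings are assumed):

```python
def _parse(version):
    """Parse a dotted release version string into a comparable int tuple.

    Assumes purely numeric components ("0.3.1.1"); pre-release suffixes
    would need extra handling.
    """
    return tuple(int(part) for part in version.split("."))


def dill_versions_compatible(pickler_version, unpickler_version):
    """Return False for the skew reported above: pickles written by
    dill >= 0.3.1.1 are not readable by dill <= 0.3.0, surfacing as
    "TypeError: _create_function() takes from 2 to 6 positional
    arguments but 7 were given" during unpickling on the worker.
    """
    boundary = _parse("0.3.1.1")
    return not (_parse(pickler_version) >= boundary
                and _parse(unpickler_version) < boundary)
```

Pinning the same dill version in both the submitting environment and the worker container sidesteps the problem entirely; the check above only makes the failure mode explicit before a pipeline is launched.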