[jira] [Work logged] (BEAM-8415) Improve error message when adding a PTransform with a name that already exists in the pipeline
[ https://issues.apache.org/jira/browse/BEAM-8415?focusedWorklogId=330844&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330844 ]

ASF GitHub Bot logged work on BEAM-8415:
----------------------------------------
Author: ASF GitHub Bot
Created on: 19/Oct/19 02:21
Start Date: 19/Oct/19 02:21
Worklog Time Spent: 10m

Work Description: pabloem commented on issue #9812: [BEAM-8415] Improving error message when applying PTransform with a n…
URL: https://github.com/apache/beam/pull/9812#issuecomment-544055770

Thanks David!

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
-------------------
Worklog Id: (was: 330844)
Time Spent: 1.5h (was: 1h 20m)

> Improve error message when adding a PTransform with a name that already
> exists in the pipeline
> -----------------------------------------------------------------------
>
> Key: BEAM-8415
> URL: https://issues.apache.org/jira/browse/BEAM-8415
> Project: Beam
> Issue Type: Improvement
> Components: sdk-py-core
> Reporter: David Yan
> Priority: Major
> Time Spent: 1.5h
> Remaining Estimate: 0h
>
> Currently, when trying to apply a PTransform with a name that already exists
> in the pipeline, it returns a confusing error:
>
> Transform "XXX" does not have a stable unique label. This will prevent
> updating of pipelines. To apply a transform with a specified label write
> pvalue | "label" >> transform
>
> We'd like to improve this error message.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
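The error message under discussion recommends the `pvalue | "label" >> transform` idiom. As a rough, Beam-free sketch of why an explicit label resolves the duplicate-name problem (the `Transform` and `Pipeline` classes below are hypothetical stand-ins, not Beam's implementation, though Beam's `>>` labeling does work through `__rrshift__` in much this way):

```python
class Transform:
    """Hypothetical stand-in for a PTransform with a default label."""
    def __init__(self, label=None):
        self.label = label or type(self).__name__

    def __rrshift__(self, label):
        # '"label" >> transform' attaches an explicit, unique label.
        self.label = label
        return self


class Pipeline:
    """Rejects duplicate labels with an explicit error message."""
    def __init__(self):
        self.applied = {}

    def apply(self, transform):
        if transform.label in self.applied:
            raise RuntimeError(
                'A transform with label "%s" already exists in the pipeline. '
                'To apply a transform with a specified label, write: '
                'pvalue | "label" >> transform' % transform.label)
        self.applied[transform.label] = transform


p = Pipeline()
p.apply(Transform())                  # label defaults to "Transform"
p.apply("CountWords" >> Transform())  # explicit label avoids the clash
```

Applying a second unlabeled `Transform()` to `p` would now raise an error that names the clashing label, which is the kind of message the issue asks for.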
[jira] [Work logged] (BEAM-8415) Improve error message when adding a PTransform with a name that already exists in the pipeline
[ https://issues.apache.org/jira/browse/BEAM-8415?focusedWorklogId=330845&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330845 ]

ASF GitHub Bot logged work on BEAM-8415:
----------------------------------------
Author: ASF GitHub Bot
Created on: 19/Oct/19 02:21
Start Date: 19/Oct/19 02:21
Worklog Time Spent: 10m

Work Description: pabloem commented on pull request #9812: [BEAM-8415] Improving error message when applying PTransform with a n…
URL: https://github.com/apache/beam/pull/9812

Issue Time Tracking
-------------------
Worklog Id: (was: 330845)
Time Spent: 1h 40m (was: 1.5h)
[jira] [Work logged] (BEAM-7886) Make row coder a standard coder and implement in python
[ https://issues.apache.org/jira/browse/BEAM-7886?focusedWorklogId=330838&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330838 ]

ASF GitHub Bot logged work on BEAM-7886:
----------------------------------------
Author: ASF GitHub Bot
Created on: 19/Oct/19 01:20
Start Date: 19/Oct/19 01:20
Worklog Time Spent: 10m

Work Description: TheNeuralBit commented on pull request #9188: [BEAM-7886] Make row coder a standard coder and implement in Python
URL: https://github.com/apache/beam/pull/9188#discussion_r336716237

File path: sdks/python/apache_beam/coders/standard_coders_test.py

@@ -133,11 +171,17 @@ def parse_coder(self, spec):
          for c in spec.get('components', ())]
     context.coders.put_proto(coder_id, beam_runner_api_pb2.Coder(
         spec=beam_runner_api_pb2.FunctionSpec(
-            urn=spec['urn'], payload=spec.get('payload')),
+            urn=spec['urn'], payload=spec.get('payload', '').encode()),

Review comment: Done

Issue Time Tracking
-------------------
Worklog Id: (was: 330838)
Time Spent: 13h 50m (was: 13h 40m)

> Make row coder a standard coder and implement in python
> -------------------------------------------------------
>
> Key: BEAM-7886
> URL: https://issues.apache.org/jira/browse/BEAM-7886
> Project: Beam
> Issue Type: Improvement
> Components: beam-model, sdk-java-core, sdk-py-core
> Reporter: Brian Hulette
> Assignee: Brian Hulette
> Priority: Major
> Time Spent: 13h 50m
> Remaining Estimate: 0h
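The one-line change in the diff above defends against a missing `payload` key: a protobuf `bytes` field such as `FunctionSpec.payload` needs `bytes`, while the test YAML supplies either a `str` or nothing. A small self-contained sketch of the two cases (the spec dicts are illustrative, not taken from the real YAML file):

```python
# Coder spec as decoded from YAML; 'payload' may be absent entirely.
spec = {'urn': 'beam:coder:row:v1'}

# spec.get('payload') alone would return None, which a protobuf bytes
# field rejects; defaulting to '' and encoding always yields bytes.
payload = spec.get('payload', '').encode()

# When a payload is present, the same expression encodes it to bytes.
spec_with_payload = {'urn': 'beam:coder:row:v1', 'payload': 'abc'}
payload2 = spec_with_payload.get('payload', '').encode()
```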
[jira] [Commented] (BEAM-8438) Update Python/Streaming IO Documentation
[ https://issues.apache.org/jira/browse/BEAM-8438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955026#comment-16955026 ]

Brian Hulette commented on BEAM-8438:
-------------------------------------

[~altay] do we still support a different set of IOs for Python streaming vs. batch? Pablo pointed out on Slack that at least apache_beam.io.fileio should work for streaming now. I'm happy to update this documentation; I'm just not sure what the correct set of IOs is for Python/Streaming.

> Update Python/Streaming IO Documentation
> ----------------------------------------
>
> Key: BEAM-8438
> URL: https://issues.apache.org/jira/browse/BEAM-8438
> Project: Beam
> Issue Type: Task
> Components: website
> Reporter: Brian Hulette
> Priority: Major
>
> Built-in IO documentation states that Python/Streaming only supports pubsub
> and BQ, which is out of date.
> https://beam.apache.org/documentation/io/built-in/
> This came up on
> [slack|https://the-asf.slack.com/archives/CBDNLQZM1/p157141041000]
[jira] [Updated] (BEAM-8438) Update Python/Streaming IO Documentation
[ https://issues.apache.org/jira/browse/BEAM-8438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brian Hulette updated BEAM-8438:
--------------------------------
Description:
Built-in IO documentation states that Python/Streaming only supports pubsub and BQ, which is out of date.
https://beam.apache.org/documentation/io/built-in/
This came up on [slack|https://the-asf.slack.com/archives/CBDNLQZM1/p157141041000]

was:
Built-in IO documentation states that Python/Streaming only supports pubsub and BQ, which is out of date.
https://beam.apache.org/documentation/io/built-in/
This came up on [slack| https://the-asf.slack.com/archives/CBDNLQZM1/p157141041000]
[jira] [Created] (BEAM-8438) Update Python/Streaming IO Documentation
Brian Hulette created BEAM-8438:
-----------------------------------
Summary: Update Python/Streaming IO Documentation
Key: BEAM-8438
URL: https://issues.apache.org/jira/browse/BEAM-8438
Project: Beam
Issue Type: Task
Components: website
Reporter: Brian Hulette

Built-in IO documentation states that Python/Streaming only supports pubsub and BQ, which is out of date.
https://beam.apache.org/documentation/io/built-in/
This came up on [slack| https://the-asf.slack.com/archives/CBDNLQZM1/p157141041000]
[jira] [Work logged] (BEAM-8434) Allow trigger transcript tests to be run as ValidatesRunner tests.
[ https://issues.apache.org/jira/browse/BEAM-8434?focusedWorklogId=330835&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330835 ]

ASF GitHub Bot logged work on BEAM-8434:
----------------------------------------
Author: ASF GitHub Bot
Created on: 19/Oct/19 00:32
Start Date: 19/Oct/19 00:32
Worklog Time Spent: 10m

Work Description: robertwb commented on issue #9832: [BEAM-8434] Translate trigger transcripts into validates runner tests.
URL: https://github.com/apache/beam/pull/9832#issuecomment-544023904

R: @liumomo315

Issue Time Tracking
-------------------
Worklog Id: (was: 330835)
Time Spent: 20m (was: 10m)

> Allow trigger transcript tests to be run as ValidatesRunner tests.
> ------------------------------------------------------------------
>
> Key: BEAM-8434
> URL: https://issues.apache.org/jira/browse/BEAM-8434
> Project: Beam
> Issue Type: Improvement
> Components: testing
> Reporter: Robert Bradshaw
> Assignee: Robert Bradshaw
> Priority: Major
> Time Spent: 20m
> Remaining Estimate: 0h
[jira] [Work logged] (BEAM-8435) Allow access to PaneInfo from Python DoFns
[ https://issues.apache.org/jira/browse/BEAM-8435?focusedWorklogId=330834&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330834 ]

ASF GitHub Bot logged work on BEAM-8435:
----------------------------------------
Author: ASF GitHub Bot
Created on: 19/Oct/19 00:27
Start Date: 19/Oct/19 00:27
Worklog Time Spent: 10m

Work Description: robertwb commented on pull request #9836: [BEAM-8435] Implement PaneInfo computation for Python.
URL: https://github.com/apache/beam/pull/9836

Also enable retrieval as a DoFn parameter.
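PR #9836 implements PaneInfo computation for Python and enables retrieving it as a DoFn parameter. As context for what a pane carries, here is a Beam-free sketch of the pane metadata defined by the Beam model (first/last firing flags, a timing of EARLY, ON_TIME, or LATE, and a firing index); the `PaneInfo` namedtuple and `describe_pane` helper below are illustrative stand-ins, and the concrete Python API should be checked against the merged PR:

```python
from collections import namedtuple

# Pane metadata from the Beam model: whether this is the first/last firing
# for a window, why it fired, and the zero-based firing index.
PaneInfo = namedtuple('PaneInfo', ['is_first', 'is_last', 'timing', 'index'])

# Illustrative timing constants.
EARLY, ON_TIME, LATE = range(3)


def describe_pane(pane):
    """Example consumer: label an element by its pane's timing and index."""
    if pane.timing == ON_TIME:
        return 'on-time firing #%d' % pane.index
    kind = 'early' if pane.timing == EARLY else 'late'
    return kind + ' firing #%d' % pane.index


first_pane = PaneInfo(is_first=True, is_last=False, timing=EARLY, index=0)
```

In a real DoFn, such a value would arrive as a declared parameter of `process`, letting the function distinguish early, on-time, and late firings of the same window.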
[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=330832&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330832 ]

ASF GitHub Bot logged work on BEAM-7926:
----------------------------------------
Author: ASF GitHub Bot
Created on: 19/Oct/19 00:16
Start Date: 19/Oct/19 00:16
Worklog Time Spent: 10m

Work Description: KevinGG commented on issue #9741: [BEAM-7926] Visualize PCollection
URL: https://github.com/apache/beam/pull/9741#issuecomment-544019379

isort and pylint are crazy and contradictory about where you should put `import apache_beam as beam`. And the suggested order changes over time.

Issue Time Tracking
-------------------
Worklog Id: (was: 330832)
Time Spent: 6h 40m (was: 6.5h)

> Visualize PCollection with Interactive Beam
> -------------------------------------------
>
> Key: BEAM-7926
> URL: https://issues.apache.org/jira/browse/BEAM-7926
> Project: Beam
> Issue Type: New Feature
> Components: runner-py-interactive
> Reporter: Ning Kang
> Assignee: Ning Kang
> Priority: Major
> Time Spent: 6h 40m
> Remaining Estimate: 0h
>
> Support auto plotting / charting of materialized data of a given PCollection
> with Interactive Beam.
>
> Say an Interactive Beam pipeline is defined as
>
> p = create_pipeline()
> pcoll = p | 'Transform' >> transform()
>
> The user can call a single function and get auto-magical charting of the
> materialized data of pcoll, e.g., visualize(pcoll)
[jira] [Work logged] (BEAM-7886) Make row coder a standard coder and implement in python
[ https://issues.apache.org/jira/browse/BEAM-7886?focusedWorklogId=330831&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330831 ]

ASF GitHub Bot logged work on BEAM-7886:
----------------------------------------
Author: ASF GitHub Bot
Created on: 19/Oct/19 00:13
Start Date: 19/Oct/19 00:13
Worklog Time Spent: 10m

Work Description: TheNeuralBit commented on pull request #9188: [BEAM-7886] Make row coder a standard coder and implement in Python
URL: https://github.com/apache/beam/pull/9188#discussion_r336712062

File path: runners/core-construction-java/src/test/java/org/apache/beam/runners/core/construction/CommonCoderTest.java

@@ -278,41 +290,90 @@ private static Object convertValue(Object value, CommonCoder coderSpec, Coder co
       return WindowedValue.of(windowValue, timestamp, windows, paneInfo);
     } else if (s.equals(getUrn(StandardCoders.Enum.DOUBLE))) {
       return Double.parseDouble((String) value);
+    } else if (s.equals(getUrn(StandardCoders.Enum.ROW))) {
+      Schema schema;
+      try {
+        schema = SchemaTranslation.fromProto(SchemaApi.Schema.parseFrom(coderSpec.getPayload()));
+      } catch (InvalidProtocolBufferException e) {
+        throw new RuntimeException("Failed to parse schema payload for row coder", e);
+      }
+
+      return parseField(value, Schema.FieldType.row(schema));
     } else {
       throw new IllegalStateException("Unknown coder URN: " + coderSpec.getUrn());
     }
   }

+  private static Object parseField(Object value, Schema.FieldType fieldType) {
+    switch (fieldType.getTypeName()) {
+      case BYTE:
+        return ((Number) value).byteValue();
+      case INT16:
+        return ((Number) value).shortValue();
+      case INT32:
+        return ((Number) value).intValue();
+      case INT64:
+        return ((Number) value).longValue();
+      case FLOAT:
+        return Float.parseFloat((String) value);
+      case DOUBLE:
+        return Double.parseDouble((String) value);

Review comment: Ok, I created [BEAM-8437](https://issues.apache.org/jira/browse/BEAM-8437); let's take this conversation over there.

Issue Time Tracking
-------------------
Worklog Id: (was: 330831)
Time Spent: 13h 40m (was: 13.5h)
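The `parseField` switch in the diff above coerces values decoded from the test YAML (which arrive as generic JSON types) into the widths the schema field declares. A Python sketch of the same idea, with illustrative type names rather than Beam's actual `FieldType` enum, might look like:

```python
def parse_field(value, type_name):
    """Coerce a YAML/JSON-decoded value to the declared schema field type.

    Mirrors the shape of the Java parseField switch: integral types collapse
    to Python's arbitrary-width int, and floating-point values (which the
    test YAML encodes as strings to avoid precision loss) are parsed.
    """
    if type_name in ('BYTE', 'INT16', 'INT32', 'INT64'):
        return int(value)
    if type_name in ('FLOAT', 'DOUBLE'):
        return float(value)
    if type_name == 'STRING':
        return value
    raise ValueError('Unknown field type: %s' % type_name)
```

Note that Python needs no per-width cases: unlike Java's `byteValue()`/`shortValue()` family, one `int` covers BYTE through INT64.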
[jira] [Created] (BEAM-8437) Consider using native doubles in standard_coders_test.yaml
Brian Hulette created BEAM-8437:
-----------------------------------
Summary: Consider using native doubles in standard_coders_test.yaml
Key: BEAM-8437
URL: https://issues.apache.org/jira/browse/BEAM-8437
Project: Beam
Issue Type: Task
Components: testing
Reporter: Brian Hulette

Context: https://github.com/apache/beam/pull/9188#discussion_r332233044
[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=330829&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330829 ]

ASF GitHub Bot logged work on BEAM-7926:
----------------------------------------
Author: ASF GitHub Bot
Created on: 19/Oct/19 00:10
Start Date: 19/Oct/19 00:10
Worklog Time Spent: 10m

Work Description: KevinGG commented on pull request #9741: [BEAM-7926] Visualize PCollection
URL: https://github.com/apache/beam/pull/9741#discussion_r336711762

File path: sdks/python/setup.py

@@ -219,11 +227,15 @@ def run(self):
     install_requires=REQUIRED_PACKAGES,
     python_requires=python_requires,
     test_suite='nose.collector',
-    tests_require=REQUIRED_TEST_PACKAGES,
+    tests_require=[
+        REQUIRED_TEST_PACKAGES,
+        INTERACTIVE_BEAM,
+    ],
     extras_require={
         'docs': ['Sphinx>=1.5.2,<2.0'],
         'test': REQUIRED_TEST_PACKAGES,
         'gcp': GCP_REQUIREMENTS,
+        'ib': INTERACTIVE_BEAM,

Review comment: Agreed!

Issue Time Tracking
-------------------
Worklog Id: (was: 330829)
Time Spent: 6.5h (was: 6h 20m)
[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=330828&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330828 ]

ASF GitHub Bot logged work on BEAM-7926:
----------------------------------------
Author: ASF GitHub Bot
Created on: 19/Oct/19 00:10
Start Date: 19/Oct/19 00:10
Worklog Time Spent: 10m

Work Description: KevinGG commented on pull request #9741: [BEAM-7926] Visualize PCollection
URL: https://github.com/apache/beam/pull/9741#discussion_r336711748

File path: sdks/python/setup.py

@@ -219,11 +227,15 @@ def run(self):
     install_requires=REQUIRED_PACKAGES,
     python_requires=python_requires,
     test_suite='nose.collector',
-    tests_require=REQUIRED_TEST_PACKAGES,
+    tests_require=[
+        REQUIRED_TEST_PACKAGES,
+        INTERACTIVE_BEAM,

Review comment: Because there are unit tests around interactive beam packages. When those packages are under REQUIRED_PACKAGES, they are always installed and available during tests. Once we move them into an extras_require, they are no longer installed or available during tests. For example, *-gcp tests install [gcp,test] to pick up the GCP_REQUIREMENTS extras group. Since we don't have/need *-interactive test suites, we can make the group always be installed as required test packages for tests.

Issue Time Tracking
-------------------
Worklog Id: (was: 330828)
Time Spent: 6h 20m (was: 6h 10m)
[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=330826&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330826 ]

ASF GitHub Bot logged work on BEAM-7926:
----------------------------------------
Author: ASF GitHub Bot
Created on: 18/Oct/19 23:56
Start Date: 18/Oct/19 23:56
Worklog Time Spent: 10m

Work Description: aaltay commented on pull request #9741: [BEAM-7926] Visualize PCollection
URL: https://github.com/apache/beam/pull/9741#discussion_r336710275

File path: sdks/python/setup.py

@@ -219,11 +227,15 @@ def run(self):
     install_requires=REQUIRED_PACKAGES,
     python_requires=python_requires,
     test_suite='nose.collector',
-    tests_require=REQUIRED_TEST_PACKAGES,
+    tests_require=[
+        REQUIRED_TEST_PACKAGES,
+        INTERACTIVE_BEAM,

Review comment: Why would tests require interactive beam packages?

Issue Time Tracking
-------------------
Worklog Id: (was: 330826)
Time Spent: 6h 10m (was: 6h)
[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=330824&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330824 ]

ASF GitHub Bot logged work on BEAM-7926:
----------------------------------------
Author: ASF GitHub Bot
Created on: 18/Oct/19 23:55
Start Date: 18/Oct/19 23:55
Worklog Time Spent: 10m

Work Description: aaltay commented on pull request #9741: [BEAM-7926] Visualize PCollection
URL: https://github.com/apache/beam/pull/9741#discussion_r336710191

File path: sdks/python/setup.py

@@ -219,11 +227,15 @@ def run(self):
     install_requires=REQUIRED_PACKAGES,
     python_requires=python_requires,
     test_suite='nose.collector',
-    tests_require=REQUIRED_TEST_PACKAGES,
+    tests_require=[
+        REQUIRED_TEST_PACKAGES,
+        INTERACTIVE_BEAM,
+    ],
     extras_require={
         'docs': ['Sphinx>=1.5.2,<2.0'],
         'test': REQUIRED_TEST_PACKAGES,
         'gcp': GCP_REQUIREMENTS,
+        'ib': INTERACTIVE_BEAM,

Review comment: How about `interactive` instead of `ib`? This leads to the more readable `pip install apache_beam[interactive]` instead of `pip install apache_beam[ib]`.

Issue Time Tracking
-------------------
Worklog Id: (was: 330824)
Time Spent: 6h (was: 5h 50m)
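Putting the review thread above together, the arrangement under discussion looks roughly like the following sketch (the package lists are illustrative placeholders, not Beam's actual pinned dependencies; note the real diff nested the lists inside `tests_require`, while concatenating them with `+` is what setuptools actually expects there):

```python
# Illustrative dependency groups, not Beam's real pinned versions.
INTERACTIVE_BEAM = ['facets-overview>=1.0.0,<2', 'ipython>=5.8.0,<6']
REQUIRED_TEST_PACKAGES = ['nose>=1.3.7', 'mock>=2.0.0,<3.0.0']

setup_kwargs = dict(
    # Interactive deps are folded into the test requirements so unit tests
    # around the interactive runner can always import them ...
    tests_require=REQUIRED_TEST_PACKAGES + INTERACTIVE_BEAM,
    extras_require={
        'test': REQUIRED_TEST_PACKAGES,
        # ... while end users opt in via `pip install apache_beam[interactive]`
        # ('interactive' reads better than the terser 'ib').
        'interactive': INTERACTIVE_BEAM,
    },
)
```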
[jira] [Work logged] (BEAM-8146) SchemaCoder/RowCoder have no equals() function
[ https://issues.apache.org/jira/browse/BEAM-8146?focusedWorklogId=330819&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330819 ]

ASF GitHub Bot logged work on BEAM-8146:
----------------------------------------
Author: ASF GitHub Bot
Created on: 18/Oct/19 23:36
Start Date: 18/Oct/19 23:36
Worklog Time Spent: 10m

Work Description: TheNeuralBit commented on issue #9493: [BEAM-8146,BEAM-8204,BEAM-8205] Add equals and hashCode to SchemaCoder and RowCoder
URL: https://github.com/apache/beam/pull/9493#issuecomment-544007362

That's a good point, I don't think so. Initially I added the type descriptor as something we can verify as a proxy for `toRow`/`fromRow`, but now that I've made the functions created by our `SchemaProvider` implementations comparable, that check is redundant. Do you think I should back out the type descriptor changes to make this simpler? It's mostly its own commit, so it wouldn't be too hard.

Issue Time Tracking
-------------------
Worklog Id: (was: 330819)
Time Spent: 2h 50m (was: 2h 40m)

> SchemaCoder/RowCoder have no equals() function
> ----------------------------------------------
>
> Key: BEAM-8146
> URL: https://issues.apache.org/jira/browse/BEAM-8146
> Project: Beam
> Issue Type: Bug
> Components: sdk-java-core
> Affects Versions: 2.15.0
> Reporter: Brian Hulette
> Assignee: Brian Hulette
> Priority: Major
> Time Spent: 2h 50m
> Remaining Estimate: 0h
>
> SchemaCoder has no equals function, so it can't be compared in tests, like
> CloudComponentsTests$DefaultCoders, which is being re-enabled in
> https://github.com/apache/beam/pull/9446
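The fix being reviewed is Java, but the equality contract it adds is easy to illustrate in miniature: two row coders should compare equal when they encode the same schema (and, per the discussion, have comparable to/from-row functions). A Beam-free Python sketch of that contract (class and field names are illustrative, not Beam's API):

```python
class RowCoder:
    """Toy coder whose identity is its schema, illustrating the
    equals/hashCode contract discussed above."""

    def __init__(self, schema):
        # Schema as a tuple of (field_name, field_type) pairs; tuples are
        # hashable, which lets the coder participate in sets/dict keys.
        self.schema = tuple(schema)

    def __eq__(self, other):
        return type(self) is type(other) and self.schema == other.schema

    def __hash__(self):
        # Must be consistent with __eq__: equal coders hash equally.
        return hash((type(self), self.schema))
```

With this in place, a test suite can assert that independently constructed coders for the same schema are interchangeable, which is exactly what `CloudComponentsTests$DefaultCoders` needs.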
[jira] [Work logged] (BEAM-8396) Default to LOOPBACK mode for local flink (spark, ...) runner.
[ https://issues.apache.org/jira/browse/BEAM-8396?focusedWorklogId=330815&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330815 ]

ASF GitHub Bot logged work on BEAM-8396:
----------------------------------------
Author: ASF GitHub Bot
Created on: 18/Oct/19 23:10
Start Date: 18/Oct/19 23:10
Worklog Time Spent: 10m

Work Description: robertwb commented on issue #9833: [BEAM-8396] Default to LOOPBACK mode for local flink runner.
URL: https://github.com/apache/beam/pull/9833#issuecomment-543997995

Thanks for the quick review!

Issue Time Tracking
-------------------
Worklog Id: (was: 330815)
Time Spent: 20m (was: 10m)

> Default to LOOPBACK mode for local flink (spark, ...) runner.
> -------------------------------------------------------------
>
> Key: BEAM-8396
> URL: https://issues.apache.org/jira/browse/BEAM-8396
> Project: Beam
> Issue Type: Improvement
> Components: sdk-py-core
> Reporter: Robert Bradshaw
> Assignee: Robert Bradshaw
> Priority: Major
> Time Spent: 20m
> Remaining Estimate: 0h
>
> As well as being lower overhead, this will avoid surprises about workers
> operating within the docker filesystem, etc.
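The behavior described above is a defaulting rule: when no `environment_type` was requested and the runner targets a local cluster, prefer LOOPBACK (the SDK worker runs in the submitting process) over DOCKER (the worker runs in a container with its own filesystem). A rough sketch of that rule, with a hypothetical helper name and an illustrative notion of "local" master addresses, not Beam's actual option-parsing code:

```python
def choose_environment_type(flink_master, environment_type=None):
    """Pick an SDK-worker environment for a portable runner.

    An explicit user choice always wins; otherwise default to LOOPBACK for
    local masters so the worker shares the submitting process's filesystem,
    and fall back to DOCKER for remote clusters.
    """
    if environment_type:
        return environment_type
    if flink_master in ('[local]', '[auto]'):
        return 'LOOPBACK'
    return 'DOCKER'
```

The point of the change is that `choose_environment_type('[local]')` no longer silently spins up Docker, which is both slower and confusing when the pipeline writes to local paths.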
[jira] [Updated] (BEAM-8436) Interactive runner incompatible with experiments=beam_fn_api
[ https://issues.apache.org/jira/browse/BEAM-8436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Bradshaw updated BEAM-8436:
----------------------------------
Description:

When this is enabled one gets

{code}
ERROR: test_wordcount (apache_beam.runners.interactive.interactive_runner_test.InteractiveRunnerTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/robertwb/Work/beam/incubator-beam/sdks/python/apache_beam/runners/interactive/interactive_runner_test.py", line 85, in test_wordcount
    result = p.run()
  File "/Users/robertwb/Work/beam/incubator-beam/sdks/python/apache_beam/pipeline.py", line 406, in run
    self._options).run(False)
  File "/Users/robertwb/Work/beam/incubator-beam/sdks/python/apache_beam/pipeline.py", line 419, in run
    return self.runner.run_pipeline(self, self._options)
  File "/Users/robertwb/Work/beam/incubator-beam/sdks/python/apache_beam/runners/interactive/interactive_runner.py", line 136, in run_pipeline
    self._desired_cache_labels)
  File "/Users/robertwb/Work/beam/incubator-beam/sdks/python/apache_beam/runners/interactive/pipeline_analyzer.py", line 73, in __init__
    self._analyze_pipeline()
  File "/Users/robertwb/Work/beam/incubator-beam/sdks/python/apache_beam/runners/interactive/pipeline_analyzer.py", line 93, in _analyze_pipeline
    desired_pcollections = self._desired_pcollections(self._pipeline_info)
  File "/Users/robertwb/Work/beam/incubator-beam/sdks/python/apache_beam/runners/interactive/pipeline_analyzer.py", line 313, in _desired_pcollections
    cache_label = pipeline_info.cache_label(pcoll_id)
  File "/Users/robertwb/Work/beam/incubator-beam/sdks/python/apache_beam/runners/interactive/pipeline_analyzer.py", line 397, in cache_label
    return self._derivation(pcoll_id).cache_label()
  File "/Users/robertwb/Work/beam/incubator-beam/sdks/python/apache_beam/runners/interactive/pipeline_analyzer.py", line 405, in _derivation
    for input_tag, input_id in transform_proto.inputs.items()
  ...
  File "/Users/robertwb/Work/beam/incubator-beam/sdks/python/apache_beam/runners/interactive/pipeline_analyzer.py", line 405, in _derivation
    for input_tag, input_id in transform_proto.inputs.items()
  File "/Users/robertwb/Work/beam/venv3/bin/../lib/python3.6/_collections_abc.py", line 678, in items
    return ItemsView(self)
RecursionError: maximum recursion depth exceeded while calling a Python object
{code}

> Interactive runner incompatible with experiments=beam_fn_api
> ------------------------------------------------------------
>
[jira] [Updated] (BEAM-8436) Interactive runner incompatible with experiments=beam_fn_api
[ https://issues.apache.org/jira/browse/BEAM-8436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Bradshaw updated BEAM-8436: -- Description: When this is enabled one gets {{ERROR: test_wordcount (apache_beam.runners.interactive.interactive_runner_test.InteractiveRunnerTest) -- Traceback (most recent call last): File "/Users/robertwb/Work/beam/incubator-beam/sdks/python/apache_beam/runners/interactive/interactive_runner_test.py", line 85, in test_wordcount result = p.run() File "/Users/robertwb/Work/beam/incubator-beam/sdks/python/apache_beam/pipeline.py", line 406, in run self._options).run(False) File "/Users/robertwb/Work/beam/incubator-beam/sdks/python/apache_beam/pipeline.py", line 419, in run return self.runner.run_pipeline(self, self._options) File "/Users/robertwb/Work/beam/incubator-beam/sdks/python/apache_beam/runners/interactive/interactive_runner.py", line 136, in run_pipeline self._desired_cache_labels) File "/Users/robertwb/Work/beam/incubator-beam/sdks/python/apache_beam/runners/interactive/pipeline_analyzer.py", line 73, in __init__ self._analyze_pipeline() File "/Users/robertwb/Work/beam/incubator-beam/sdks/python/apache_beam/runners/interactive/pipeline_analyzer.py", line 93, in _analyze_pipeline desired_pcollections = self._desired_pcollections(self._pipeline_info) File "/Users/robertwb/Work/beam/incubator-beam/sdks/python/apache_beam/runners/interactive/pipeline_analyzer.py", line 313, in _desired_pcollections cache_label = pipeline_info.cache_label(pcoll_id) File "/Users/robertwb/Work/beam/incubator-beam/sdks/python/apache_beam/runners/interactive/pipeline_analyzer.py", line 397, in cache_label return self._derivation(pcoll_id).cache_label() File "/Users/robertwb/Work/beam/incubator-beam/sdks/python/apache_beam/runners/interactive/pipeline_analyzer.py", line 405, in _derivation for input_tag, input_id in transform_proto.inputs.items() ... 
File "/Users/robertwb/Work/beam/incubator-beam/sdks/python/apache_beam/runners/interactive/pipeline_analyzer.py", line 405, in _derivation for input_tag, input_id in transform_proto.inputs.items() File "/Users/robertwb/Work/beam/venv3/bin/../lib/python3.6/_collections_abc.py", line 678, in items return ItemsView(self) RecursionError: maximum recursion depth exceeded while calling a Python object }} was: When this is enabled one gets {{ ERROR: test_wordcount (apache_beam.runners.interactive.interactive_runner_test.InteractiveRunnerTest) -- Traceback (most recent call last): File "/Users/robertwb/Work/beam/incubator-beam/sdks/python/apache_beam/runners/interactive/interactive_runner_test.py", line 85, in test_wordcount result = p.run() File "/Users/robertwb/Work/beam/incubator-beam/sdks/python/apache_beam/pipeline.py", line 406, in run self._options).run(False) File "/Users/robertwb/Work/beam/incubator-beam/sdks/python/apache_beam/pipeline.py", line 419, in run return self.runner.run_pipeline(self, self._options) File "/Users/robertwb/Work/beam/incubator-beam/sdks/python/apache_beam/runners/interactive/interactive_runner.py", line 136, in run_pipeline self._desired_cache_labels) File "/Users/robertwb/Work/beam/incubator-beam/sdks/python/apache_beam/runners/interactive/pipeline_analyzer.py", line 73, in __init__ self._analyze_pipeline() File "/Users/robertwb/Work/beam/incubator-beam/sdks/python/apache_beam/runners/interactive/pipeline_analyzer.py", line 93, in _analyze_pipeline desired_pcollections = self._desired_pcollections(self._pipeline_info) File "/Users/robertwb/Work/beam/incubator-beam/sdks/python/apache_beam/runners/interactive/pipeline_analyzer.py", line 313, in _desired_pcollections cache_label = pipeline_info.cache_label(pcoll_id) File "/Users/robertwb/Work/beam/incubator-beam/sdks/python/apache_beam/runners/interactive/pipeline_analyzer.py", line 397, in cache_label return self._derivation(pcoll_id).cache_label() File 
"/Users/robertwb/Work/beam/incubator-beam/sdks/python/apache_beam/runners/interactive/pipeline_analyzer.py", line 405, in _derivation for input_tag, input_id in transform_proto.inputs.items() ... File "/Users/robertwb/Work/beam/incubator-beam/sdks/python/apache_beam/runners/interactive/pipeline_analyzer.py", line 405, in _derivation for input_tag, input_id in transform_proto.inputs.items() File "/Users/robertwb/Work/beam/venv3/bin/../lib/python3.6/_collections_abc.py", line 678, in items return ItemsView(self) RecursionError: maximum recursion depth exceeded while calling a Python object }} > Interactive runner incompatible with experiments=beam_fn_api > > > Key:
[jira] [Created] (BEAM-8436) Interactive runner incompatible with experiments=beam_fn_api
Robert Bradshaw created BEAM-8436: - Summary: Interactive runner incompatible with experiments=beam_fn_api Key: BEAM-8436 URL: https://issues.apache.org/jira/browse/BEAM-8436 Project: Beam Issue Type: Bug Components: sdk-py-core Reporter: Robert Bradshaw When this is enabled one gets {{ ERROR: test_wordcount (apache_beam.runners.interactive.interactive_runner_test.InteractiveRunnerTest) -- Traceback (most recent call last): File "/Users/robertwb/Work/beam/incubator-beam/sdks/python/apache_beam/runners/interactive/interactive_runner_test.py", line 85, in test_wordcount result = p.run() File "/Users/robertwb/Work/beam/incubator-beam/sdks/python/apache_beam/pipeline.py", line 406, in run self._options).run(False) File "/Users/robertwb/Work/beam/incubator-beam/sdks/python/apache_beam/pipeline.py", line 419, in run return self.runner.run_pipeline(self, self._options) File "/Users/robertwb/Work/beam/incubator-beam/sdks/python/apache_beam/runners/interactive/interactive_runner.py", line 136, in run_pipeline self._desired_cache_labels) File "/Users/robertwb/Work/beam/incubator-beam/sdks/python/apache_beam/runners/interactive/pipeline_analyzer.py", line 73, in __init__ self._analyze_pipeline() File "/Users/robertwb/Work/beam/incubator-beam/sdks/python/apache_beam/runners/interactive/pipeline_analyzer.py", line 93, in _analyze_pipeline desired_pcollections = self._desired_pcollections(self._pipeline_info) File "/Users/robertwb/Work/beam/incubator-beam/sdks/python/apache_beam/runners/interactive/pipeline_analyzer.py", line 313, in _desired_pcollections cache_label = pipeline_info.cache_label(pcoll_id) File "/Users/robertwb/Work/beam/incubator-beam/sdks/python/apache_beam/runners/interactive/pipeline_analyzer.py", line 397, in cache_label return self._derivation(pcoll_id).cache_label() File "/Users/robertwb/Work/beam/incubator-beam/sdks/python/apache_beam/runners/interactive/pipeline_analyzer.py", line 405, in _derivation for input_tag, input_id in 
transform_proto.inputs.items() ... File "/Users/robertwb/Work/beam/incubator-beam/sdks/python/apache_beam/runners/interactive/pipeline_analyzer.py", line 405, in _derivation for input_tag, input_id in transform_proto.inputs.items() File "/Users/robertwb/Work/beam/venv3/bin/../lib/python3.6/_collections_abc.py", line 678, in items return ItemsView(self) RecursionError: maximum recursion depth exceeded while calling a Python object }} -- This message was sent by Atlassian Jira (v8.3.4#803005)
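The traceback above ends in a RecursionError because `_derivation` recurses through `transform_proto.inputs` for every upstream PCollection, so a sufficiently deep (or cyclic) derivation graph exhausts Python's default stack. A minimal sketch of the failure mode and an iterative alternative, using an invented dict-based graph rather than Beam's actual proto types:

```python
# Hypothetical sketch (names and graph shape are illustrative, not Beam's
# actual API): a naive recursive walk over transform inputs overflows the
# stack on deep or cyclic graphs; an explicit stack plus a visited set
# does not.

def derivation_labels(transform_inputs, start):
    """Walk input edges iteratively instead of recursively.

    transform_inputs: dict mapping a pcollection id to its upstream ids.
    Returns the set of pcollection ids reachable upstream of `start`.
    """
    seen = set()
    stack = [start]
    while stack:
        pcoll = stack.pop()
        if pcoll in seen:
            continue  # guards against cycles that would recurse forever
        seen.add(pcoll)
        stack.extend(transform_inputs.get(pcoll, []))
    return seen

# A two-node cycle like this would blow the recursion limit in a purely
# recursive implementation:
graph = {"a": ["b"], "b": ["a"]}
print(sorted(derivation_labels(graph, "a")))  # -> ['a', 'b']
```

The same effect can also be had by memoizing the recursive version, but an explicit stack sidesteps the interpreter's recursion limit entirely.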
[jira] [Work logged] (BEAM-8433) DataCatalogBigQueryIT runs for both Calcite and ZetaSQL dialects
[ https://issues.apache.org/jira/browse/BEAM-8433?focusedWorklogId=330814=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330814 ] ASF GitHub Bot logged work on BEAM-8433: Author: ASF GitHub Bot Created on: 18/Oct/19 22:53 Start Date: 18/Oct/19 22:53 Worklog Time Spent: 10m Work Description: kennknowles commented on pull request #9831: [BEAM-8433] DataCatalogBigQueryIT runs for both Calcite and ZetaSQL d… URL: https://github.com/apache/beam/pull/9831 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330814) Time Spent: 1h 20m (was: 1h 10m) > DataCatalogBigQueryIT runs for both Calcite and ZetaSQL dialects > > > Key: BEAM-8433 > URL: https://issues.apache.org/jira/browse/BEAM-8433 > Project: Beam > Issue Type: Improvement > Components: dsl-sql >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Time Spent: 1h 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8415) Improve error message when adding a PTransform with a name that already exists in the pipeline
[ https://issues.apache.org/jira/browse/BEAM-8415?focusedWorklogId=330813=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330813 ] ASF GitHub Bot logged work on BEAM-8415: Author: ASF GitHub Bot Created on: 18/Oct/19 22:47 Start Date: 18/Oct/19 22:47 Worklog Time Spent: 10m Work Description: davidyan74 commented on issue #9812: [BEAM-8415] Improving error message when applying PTransform with a n… URL: https://github.com/apache/beam/pull/9812#issuecomment-543990209 Fixed. Thanks Robert! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330813) Time Spent: 1h 20m (was: 1h 10m) > Improve error message when adding a PTransform with a name that already > exists in the pipeline > -- > > Key: BEAM-8415 > URL: https://issues.apache.org/jira/browse/BEAM-8415 > Project: Beam > Issue Type: Improvement > Components: sdk-py-core >Reporter: David Yan >Priority: Major > Time Spent: 1h 20m > Remaining Estimate: 0h > > Currently, when trying to apply a PTransform with a name that already exists > in the pipeline, it returns a confusing error: > Transform "XXX" does not have a stable unique label. This will prevent > updating of pipelines. To apply a transform with a specified label write > pvalue | "label" >> transform > We'd like to improve this error message. -- This message was sent by Atlassian Jira (v8.3.4#803005)
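For context on the fix being reviewed above: the confusing "stable unique label" message fires when a second transform is applied under a label the pipeline has already recorded. A minimal sketch (illustrative bookkeeping only, not Beam's actual implementation) of the duplicate-label check and a clearer error:

```python
# Illustrative sketch: a pipeline keeps the set of applied transform
# labels; reusing one should fail with a message naming the duplicate,
# rather than a vague "does not have a stable unique label" error.

class LabelRegistry:
    def __init__(self):
        self._labels = set()

    def apply(self, label):
        if label in self._labels:
            raise RuntimeError(
                'A transform with label "%s" already exists in the '
                'pipeline. To apply a transform with a specified label, '
                'write pvalue | "label" >> transform' % label)
        self._labels.add(label)

reg = LabelRegistry()
reg.apply("Count")
try:
    reg.apply("Count")  # second application of the same label
except RuntimeError as e:
    print(e)
```

The suggested remedy in the message itself (`pvalue | "label" >> transform`) is the real Beam idiom for giving a transform an explicit, unique label.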
[jira] [Work logged] (BEAM-8396) Default to LOOPBACK mode for local flink (spark, ...) runner.
[ https://issues.apache.org/jira/browse/BEAM-8396?focusedWorklogId=330808=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330808 ] ASF GitHub Bot logged work on BEAM-8396: Author: ASF GitHub Bot Created on: 18/Oct/19 22:12 Start Date: 18/Oct/19 22:12 Worklog Time Spent: 10m Work Description: robertwb commented on pull request #9833: [BEAM-8396] Default to LOOPBACK mode for local flink runner. URL: https://github.com/apache/beam/pull/9833 Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily: - [ ] [**Choose reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and mention them in a comment (`R: @username`). - [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf). See the [Contributor Guide](https://beam.apache.org/contribute) for more tips on [how to make review process smoother](https://beam.apache.org/contribute/#make-reviewers-job-easier). 
[jira] [Assigned] (BEAM-8396) Default to LOOPBACK mode for local flink (spark, ...) runner.
[ https://issues.apache.org/jira/browse/BEAM-8396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Bradshaw reassigned BEAM-8396: - Assignee: Robert Bradshaw > Default to LOOPBACK mode for local flink (spark, ...) runner. > - > > Key: BEAM-8396 > URL: https://issues.apache.org/jira/browse/BEAM-8396 > Project: Beam > Issue Type: Improvement > Components: sdk-py-core >Reporter: Robert Bradshaw >Assignee: Robert Bradshaw >Priority: Major > > As well as being lower overhead, this will avoid surprises about workers > operating within the docker filesystem, etc. -- This message was sent by Atlassian Jira (v8.3.4#803005)
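To make the proposal concrete: in LOOPBACK mode the runner calls back into the submitting process to execute user code, so there is no docker container and the worker sees the local filesystem. A hedged sketch of the defaulting logic (the helper name and exact default rules are assumptions for illustration; only the environment names follow Beam's portable options):

```python
# Sketch of "default to LOOPBACK for the local runner": if the user gave
# no explicit environment and the target is the embedded local cluster
# ('[local]'), prefer LOOPBACK over DOCKER.

def effective_environment(flink_master, environment_type=None):
    """Pick a worker environment for a local flink/spark-style runner."""
    if environment_type:
        return environment_type  # an explicit choice always wins
    return "LOOPBACK" if flink_master == "[local]" else "DOCKER"

print(effective_environment("[local]"))            # -> LOOPBACK
print(effective_environment("host:8081"))          # -> DOCKER
print(effective_environment("[local]", "DOCKER"))  # -> DOCKER
```

This is what avoids the surprise mentioned in the issue: with DOCKER as the local default, workers silently operate inside the container's filesystem rather than the user's.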
[jira] [Work logged] (BEAM-8372) Allow submission of Flink UberJar directly to flink cluster.
[ https://issues.apache.org/jira/browse/BEAM-8372?focusedWorklogId=330799&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330799 ] ASF GitHub Bot logged work on BEAM-8372: Author: ASF GitHub Bot Created on: 18/Oct/19 21:57 Start Date: 18/Oct/19 21:57 Worklog Time Spent: 10m Work Description: robertwb commented on pull request #9803: [BEAM-8372] Follow-up to Flink UberJar submission. URL: https://github.com/apache/beam/pull/9803#discussion_r336690127 ## File path: sdks/python/apache_beam/runners/portability/flink_runner.py ## @@ -32,18 +32,18 @@ class FlinkRunner(portable_runner.PortableRunner): def default_job_server(self, options): -flink_master_url = options.view_as(FlinkRunnerOptions).flink_master_url -if flink_master_url == '[local]' or sys.version_info < (3, 6): +flink_master = options.view_as(FlinkRunnerOptions).flink_master Review comment: Thanks, let me know when you have a PR. I'd like to get this in by 2.17 so we can advertise this as a release that's really easy to use with Flink. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330799) Time Spent: 5h 10m (was: 5h) > Allow submission of Flink UberJar directly to flink cluster. > > > Key: BEAM-8372 > URL: https://issues.apache.org/jira/browse/BEAM-8372 > Project: Beam > Issue Type: New Feature > Components: sdk-py-core >Reporter: Robert Bradshaw >Assignee: Robert Bradshaw >Priority: Major > Time Spent: 5h 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
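The diff quoted in this review renames `flink_master_url` to `flink_master` inside `default_job_server`, which decides between a local job server and direct uber-jar submission. A hedged sketch of that decision (return values are illustrative labels, not Beam classes): direct REST submission of the uber jar is only used for a remote master on Python 3.6+, mirroring the `'[local]' or sys.version_info < (3, 6)` condition in the diff.

```python
import sys

# Illustrative decision function modeled on the quoted diff: fall back to
# a local job server for the embedded cluster or on Python < 3.6,
# otherwise submit an uber jar straight to the remote Flink master.

def choose_job_server(flink_master, version_info=sys.version_info):
    if flink_master == "[local]" or version_info < (3, 6):
        return "local-job-server"
    return "flink-uber-jar:" + flink_master

print(choose_job_server("localhost:8081", (3, 7)))  # -> flink-uber-jar:localhost:8081
print(choose_job_server("[local]", (3, 7)))         # -> local-job-server
print(choose_job_server("localhost:8081", (2, 7)))  # -> local-job-server
```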
[jira] [Work logged] (BEAM-8418) Fix handling of Impulse transform in Dataflow runner.
[ https://issues.apache.org/jira/browse/BEAM-8418?focusedWorklogId=330793&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330793 ] ASF GitHub Bot logged work on BEAM-8418: Author: ASF GitHub Bot Created on: 18/Oct/19 21:45 Start Date: 18/Oct/19 21:45 Worklog Time Spent: 10m Work Description: tvalentyn commented on issue #9822: [BEAM-8418] Use base64 string for representing impulse payload in DF runner legacy codepath. URL: https://github.com/apache/beam/pull/9822#issuecomment-543410624 Run Python 3.6 PostCommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330793) Time Spent: 2h 20m (was: 2h 10m) > Fix handling of Impulse transform in Dataflow runner. > -- > > Key: BEAM-8418 > URL: https://issues.apache.org/jira/browse/BEAM-8418 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Valentyn Tymofieiev >Assignee: Robert Bradshaw >Priority: Major > Time Spent: 2h 20m > Remaining Estimate: 0h > > The following pipeline fails on the Dataflow runner unless we use the beam_fn_api > experiment. > {noformat} > class NoOpDoFn(beam.DoFn): > def process(self, element): > return element > p = beam.Pipeline(options=pipeline_options) > _ = p | beam.Impulse() | beam.ParDo(NoOpDoFn()) > result = p.run() > {noformat} > The reason is that we encode the Impulse payload using URL escaping in [1], while the > Dataflow runner expects base64 encoding in non-FnAPI mode. In FnAPI mode, the DF > runner expects URL escaping. > We should fix or reconcile the encoding in the non-FnAPI path, and add a > ValidatesRunner test that catches this error. 
> [1] > https://github.com/apache/beam/blob/12d07745835e1b9c1e824b83beeeadf63ab4b234/sdks/python/apache_beam/runners/dataflow/dataflow_runner.py#L633 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8418) Fix handling of Impulse transform in Dataflow runner.
[ https://issues.apache.org/jira/browse/BEAM-8418?focusedWorklogId=330792&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330792 ] ASF GitHub Bot logged work on BEAM-8418: Author: ASF GitHub Bot Created on: 18/Oct/19 21:45 Start Date: 18/Oct/19 21:45 Worklog Time Spent: 10m Work Description: tvalentyn commented on issue #9822: [BEAM-8418] Use base64 string for representing impulse payload in DF runner legacy codepath. URL: https://github.com/apache/beam/pull/9822#issuecomment-543440776 Run Python 3.6 PostCommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330792) Time Spent: 2h 10m (was: 2h) > Fix handling of Impulse transform in Dataflow runner. > -- > > Key: BEAM-8418 > URL: https://issues.apache.org/jira/browse/BEAM-8418 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Valentyn Tymofieiev >Assignee: Robert Bradshaw >Priority: Major > Time Spent: 2h 10m > Remaining Estimate: 0h > > The following pipeline fails on the Dataflow runner unless we use the beam_fn_api > experiment. > {noformat} > class NoOpDoFn(beam.DoFn): > def process(self, element): > return element > p = beam.Pipeline(options=pipeline_options) > _ = p | beam.Impulse() | beam.ParDo(NoOpDoFn()) > result = p.run() > {noformat} > The reason is that we encode the Impulse payload using URL escaping in [1], while the > Dataflow runner expects base64 encoding in non-FnAPI mode. In FnAPI mode, the DF > runner expects URL escaping. > We should fix or reconcile the encoding in the non-FnAPI path, and add a > ValidatesRunner test that catches this error. 
> [1] > https://github.com/apache/beam/blob/12d07745835e1b9c1e824b83beeeadf63ab4b234/sdks/python/apache_beam/runners/dataflow/dataflow_runner.py#L633 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8418) Fix handling of Impulse transform in Dataflow runner.
[ https://issues.apache.org/jira/browse/BEAM-8418?focusedWorklogId=330791&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330791 ] ASF GitHub Bot logged work on BEAM-8418: Author: ASF GitHub Bot Created on: 18/Oct/19 21:44 Start Date: 18/Oct/19 21:44 Worklog Time Spent: 10m Work Description: tvalentyn commented on issue #9822: [BEAM-8418] Use base64 string for representing impulse payload in DF runner legacy codepath. URL: https://github.com/apache/beam/pull/9822#issuecomment-543477168 Run Python 3.6 PostCommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330791) Time Spent: 2h (was: 1h 50m) > Fix handling of Impulse transform in Dataflow runner. > -- > > Key: BEAM-8418 > URL: https://issues.apache.org/jira/browse/BEAM-8418 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Valentyn Tymofieiev >Assignee: Robert Bradshaw >Priority: Major > Time Spent: 2h > Remaining Estimate: 0h > > The following pipeline fails on the Dataflow runner unless we use the beam_fn_api > experiment. > {noformat} > class NoOpDoFn(beam.DoFn): > def process(self, element): > return element > p = beam.Pipeline(options=pipeline_options) > _ = p | beam.Impulse() | beam.ParDo(NoOpDoFn()) > result = p.run() > {noformat} > The reason is that we encode the Impulse payload using URL escaping in [1], while the > Dataflow runner expects base64 encoding in non-FnAPI mode. In FnAPI mode, the DF > runner expects URL escaping. > We should fix or reconcile the encoding in the non-FnAPI path, and add a > ValidatesRunner test that catches this error. 
> [1] > https://github.com/apache/beam/blob/12d07745835e1b9c1e824b83beeeadf63ab4b234/sdks/python/apache_beam/runners/dataflow/dataflow_runner.py#L633 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8418) Fix handling of Impulse transform in Dataflow runner.
[ https://issues.apache.org/jira/browse/BEAM-8418?focusedWorklogId=330790&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330790 ] ASF GitHub Bot logged work on BEAM-8418: Author: ASF GitHub Bot Created on: 18/Oct/19 21:44 Start Date: 18/Oct/19 21:44 Worklog Time Spent: 10m Work Description: tvalentyn commented on issue #9822: [BEAM-8418] Use base64 string for representing impulse payload in DF runner legacy codepath. URL: https://github.com/apache/beam/pull/9822#issuecomment-543507060 Run Python 3.6 PostCommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330790) Time Spent: 1h 50m (was: 1h 40m) > Fix handling of Impulse transform in Dataflow runner. > -- > > Key: BEAM-8418 > URL: https://issues.apache.org/jira/browse/BEAM-8418 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Valentyn Tymofieiev >Assignee: Robert Bradshaw >Priority: Major > Time Spent: 1h 50m > Remaining Estimate: 0h > > The following pipeline fails on the Dataflow runner unless we use the beam_fn_api > experiment. > {noformat} > class NoOpDoFn(beam.DoFn): > def process(self, element): > return element > p = beam.Pipeline(options=pipeline_options) > _ = p | beam.Impulse() | beam.ParDo(NoOpDoFn()) > result = p.run() > {noformat} > The reason is that we encode the Impulse payload using URL escaping in [1], while the > Dataflow runner expects base64 encoding in non-FnAPI mode. In FnAPI mode, the DF > runner expects URL escaping. > We should fix or reconcile the encoding in the non-FnAPI path, and add a > ValidatesRunner test that catches this error. 
> [1] > https://github.com/apache/beam/blob/12d07745835e1b9c1e824b83beeeadf63ab4b234/sdks/python/apache_beam/runners/dataflow/dataflow_runner.py#L633 -- This message was sent by Atlassian Jira (v8.3.4#803005)
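The encoding mismatch described in this issue is easy to demonstrate without any runner: the same impulse payload bytes produce entirely different strings under URL escaping and base64, so a consumer expecting one format cannot parse the other. A small self-contained illustration:

```python
import base64
import urllib.parse

# One payload, two wire encodings. The SDK emitted a URL-escaped string
# where the legacy (non-FnAPI) Dataflow path expected base64.
payload = b"\x7f\xff impulse"

url_escaped = urllib.parse.quote(payload)  # '%7F%FF%20impulse'
b64 = base64.b64encode(payload).decode("ascii")

# Each encoding round-trips with its own decoder...
assert urllib.parse.unquote_to_bytes(url_escaped) == payload
assert base64.b64decode(b64) == payload
# ...but the two representations are not interchangeable.
assert url_escaped != b64
print(url_escaped)
print(b64)
```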
[jira] [Work logged] (BEAM-8434) Allow trigger transcript tests to be run as ValidatesRunner tests.
[ https://issues.apache.org/jira/browse/BEAM-8434?focusedWorklogId=330780=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330780 ] ASF GitHub Bot logged work on BEAM-8434: Author: ASF GitHub Bot Created on: 18/Oct/19 21:31 Start Date: 18/Oct/19 21:31 Worklog Time Spent: 10m Work Description: robertwb commented on pull request #9832: [BEAM-8434] Translate trigger transcripts into validates runner tests. URL: https://github.com/apache/beam/pull/9832 Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily: - [ ] [**Choose reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and mention them in a comment (`R: @username`). - [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf). 
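On the idea in this PR title, "translate trigger transcripts into ValidatesRunner tests": a common Python pattern for this is to generate one test method per named transcript on a TestCase at import time, so each transcript is reported and filtered as its own test. A hedged sketch with an invented transcript format (Beam's real transcripts are YAML descriptions of trigger behavior):

```python
import unittest

# Illustrative only: each entry stands in for a trigger transcript; the
# "replay" here is a trivial sum rather than running a real pipeline.
TRANSCRIPTS = {
    "default_trigger": {"input": [1, 2, 3], "expected": [6]},
    "empty_input": {"input": [], "expected": [0]},
}

class TranscriptTest(unittest.TestCase):
    pass

def _make_test(spec):
    def test(self):
        # Stand-in for replaying the transcript through a runner and
        # comparing the fired panes against the expected output.
        self.assertEqual([sum(spec["input"])], spec["expected"])
    return test

# Attach one test method per transcript so runners see them individually.
for name, spec in TRANSCRIPTS.items():
    setattr(TranscriptTest, "test_" + name, _make_test(spec))
```

The closure in `_make_test` matters: binding `spec` per call avoids the classic late-binding bug where every generated test would see the last loop value.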
[jira] [Work logged] (BEAM-8433) DataCatalogBigQueryIT runs for both Calcite and ZetaSQL dialects
[ https://issues.apache.org/jira/browse/BEAM-8433?focusedWorklogId=330778=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330778 ] ASF GitHub Bot logged work on BEAM-8433: Author: ASF GitHub Bot Created on: 18/Oct/19 21:25 Start Date: 18/Oct/19 21:25 Worklog Time Spent: 10m Work Description: amaliujia commented on issue #9831: [BEAM-8433] DataCatalogBigQueryIT runs for both Calcite and ZetaSQL d… URL: https://github.com/apache/beam/pull/9831#issuecomment-543926793 I tried the `Junit Parameterized Test` idea. It turned out that the framework adds `[` and `]` to the test name, and our testing infra uses the test name as a temporary BQ table name. The problem is that, I believe, BQ table names do not allow `[` and `]`. That's why I ended up with the implementation in this PR. Duplicating more tests for both dialects in the future will require taking care of such details. A sample of the error message:
```
{ "code" : 400,
  "errors" : [ {
    "domain" : "global",
    "message" : "Invalid table ID \"DataCatalogBigQueryIT_testReadWrite[org.apache.beam.sdk.extensions.sql.impl.CalciteQueryPlanner]_2019_10_18_20_10_21_376_5037052097078144632\".",
    "reason" : "invalid"
```
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330778) Time Spent: 1h 10m (was: 1h) > DataCatalogBigQueryIT runs for both Calcite and ZetaSQL dialects > > > Key: BEAM-8433 > URL: https://issues.apache.org/jira/browse/BEAM-8433 > Project: Beam > Issue Type: Improvement > Components: dsl-sql >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Time Spent: 1h 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
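The naming clash described above — JUnit's Parameterized runner appends `[...]` to the test name, while BigQuery table IDs only allow letters, digits, and underscores — can also be worked around by sanitizing the generated name before using it as a table ID. A minimal sketch of that idea (the `make_table_id` helper is hypothetical, not Beam code):

```python
import re

def make_table_id(test_name, suffix):
    """Turn a (possibly parameterized) test name into a valid BigQuery table ID.

    BigQuery table IDs may only contain letters, digits, and underscores, so
    characters such as the '[' and ']' added by JUnit's Parameterized runner
    are replaced with '_'.
    """
    raw = '%s_%s' % (test_name, suffix)
    return re.sub(r'[^A-Za-z0-9_]', '_', raw)

# A parameterized test name like the one in the error message above:
name = make_table_id(
    'DataCatalogBigQueryIT_testReadWrite[CalciteQueryPlanner]',
    '2019_10_18')
```

With such a sanitizer in the test infra, the parameterized-runner approach would not trip BigQuery's table-ID validation.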
[jira] [Created] (BEAM-8435) Allow access to PaneInfo from Python DoFns
Robert Bradshaw created BEAM-8435: - Summary: Allow access to PaneInfo from Python DoFns Key: BEAM-8435 URL: https://issues.apache.org/jira/browse/BEAM-8435 Project: Beam Issue Type: Bug Components: sdk-py-core Reporter: Robert Bradshaw PaneInfoParam exists, but the plumbing to actually populate it at runtime was never added. (Nor, clearly, were any tests...) -- This message was sent by Atlassian Jira (v8.3.4#803005)
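For context, Python SDK `DoFn` parameters such as `PaneInfoParam` work as sentinel default values that the runner detects in the `process` signature and substitutes at call time; the missing piece this issue tracks is exactly that runtime substitution. A toy illustration of the mechanism (names and plumbing here are illustrative, not Beam's actual internals):

```python
import inspect

class _DoFnParam(object):
    """Sentinel used as a default value to request a runtime-provided value."""
    def __init__(self, param_name):
        self.param_name = param_name

PaneInfoParam = _DoFnParam('pane_info')

def invoke_process(process_fn, element, runtime_values):
    """Call process_fn, filling any _DoFnParam defaults from runtime_values."""
    kwargs = {}
    for name, param in inspect.signature(process_fn).parameters.items():
        if isinstance(param.default, _DoFnParam):
            # This is the "plumbing": replace the sentinel with the real value.
            kwargs[name] = runtime_values[param.default.param_name]
    return process_fn(element, **kwargs)

def process(element, pane_info=PaneInfoParam):
    return (element, pane_info)

result = invoke_process(process, 42, {'pane_info': 'PaneInfo(first=True)'})
```

The bug report says the sentinel (`PaneInfoParam`) exists in the SDK but the substitution step was never wired up at runtime.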
[jira] [Created] (BEAM-8434) Allow trigger transcript tests to be run as ValidatesRunner tests.
Robert Bradshaw created BEAM-8434: - Summary: Allow trigger transcript tests to be run as ValidatesRunner tests. Key: BEAM-8434 URL: https://issues.apache.org/jira/browse/BEAM-8434 Project: Beam Issue Type: Improvement Components: testing Reporter: Robert Bradshaw Assignee: Robert Bradshaw -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8433) DataCatalogBigQueryIT runs for both Calcite and ZetaSQL dialects
[ https://issues.apache.org/jira/browse/BEAM-8433?focusedWorklogId=330767=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330767 ] ASF GitHub Bot logged work on BEAM-8433: Author: ASF GitHub Bot Created on: 18/Oct/19 20:41 Start Date: 18/Oct/19 20:41 Worklog Time Spent: 10m Work Description: amaliujia commented on issue #9831: [BEAM-8433] DataCatalogBigQueryIT runs for both Calcite and ZetaSQL d… URL: https://github.com/apache/beam/pull/9831#issuecomment-543934812 Run Java_Examples_Dataflow PreCommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330767) Time Spent: 1h (was: 50m) > DataCatalogBigQueryIT runs for both Calcite and ZetaSQL dialects > > > Key: BEAM-8433 > URL: https://issues.apache.org/jira/browse/BEAM-8433 > Project: Beam > Issue Type: Improvement > Components: dsl-sql >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Time Spent: 1h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8433) DataCatalogBigQueryIT runs for both Calcite and ZetaSQL dialects
[ https://issues.apache.org/jira/browse/BEAM-8433?focusedWorklogId=330758=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330758 ] ASF GitHub Bot logged work on BEAM-8433: Author: ASF GitHub Bot Created on: 18/Oct/19 20:31 Start Date: 18/Oct/19 20:31 Worklog Time Spent: 10m Work Description: amaliujia commented on issue #9831: [BEAM-8433] DataCatalogBigQueryIT runs for both Calcite and ZetaSQL d… URL: https://github.com/apache/beam/pull/9831#issuecomment-543926793 I tried the `Junit Parameterized Test` idea. It turned out that the framework adds `[` and `]` to the test name, and our testing infra uses the test name as a temporary BQ table name. The problem is that, I believe, BQ table names do not allow `[` and `]`. That's why I ended up with the implementation in this PR. Duplicating more tests for both dialects in the future will require taking care of such details. A sample of the error message:
```
{ "code" : 400,
  "errors" : [ {
    "domain" : "global",
    "message" : "Invalid table ID \"DataCatalogBigQueryIT_testReadWrite[org.apache.beam.sdk.extensions.sql.impl.CalciteQueryPlanner]_2019_10_18_20_10_21_376_5037052097078144632\".",
    "reason" : "invalid"
```
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330758) Time Spent: 50m (was: 40m) > DataCatalogBigQueryIT runs for both Calcite and ZetaSQL dialects > > > Key: BEAM-8433 > URL: https://issues.apache.org/jira/browse/BEAM-8433 > Project: Beam > Issue Type: Improvement > Components: dsl-sql >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Time Spent: 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8433) DataCatalogBigQueryIT runs for both Calcite and ZetaSQL dialects
[ https://issues.apache.org/jira/browse/BEAM-8433?focusedWorklogId=330750=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330750 ] ASF GitHub Bot logged work on BEAM-8433: Author: ASF GitHub Bot Created on: 18/Oct/19 20:25 Start Date: 18/Oct/19 20:25 Worklog Time Spent: 10m Work Description: amaliujia commented on issue #9831: [BEAM-8433] DataCatalogBigQueryIT runs for both Calcite and ZetaSQL d… URL: https://github.com/apache/beam/pull/9831#issuecomment-543926793 I tried the `Junit Parameterized Test` idea. It turned out that the framework adds `[` and `]` to the test name, and our testing infra uses the test name as a temporary BQ table name. The problem is that, I believe, BQ table names do not allow `[` and `]`. That's why I ended up with the implementation in this PR. Duplicating more tests for both dialects in the future will require taking care of such details. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330750) Time Spent: 0.5h (was: 20m) > DataCatalogBigQueryIT runs for both Calcite and ZetaSQL dialects > > > Key: BEAM-8433 > URL: https://issues.apache.org/jira/browse/BEAM-8433 > Project: Beam > Issue Type: Improvement > Components: dsl-sql >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Time Spent: 0.5h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8433) DataCatalogBigQueryIT runs for both Calcite and ZetaSQL dialects
[ https://issues.apache.org/jira/browse/BEAM-8433?focusedWorklogId=330752=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330752 ] ASF GitHub Bot logged work on BEAM-8433: Author: ASF GitHub Bot Created on: 18/Oct/19 20:25 Start Date: 18/Oct/19 20:25 Worklog Time Spent: 10m Work Description: amaliujia commented on issue #9831: [BEAM-8433] DataCatalogBigQueryIT runs for both Calcite and ZetaSQL d… URL: https://github.com/apache/beam/pull/9831#issuecomment-543926877 R: @kennknowles This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330752) Time Spent: 40m (was: 0.5h) > DataCatalogBigQueryIT runs for both Calcite and ZetaSQL dialects > > > Key: BEAM-8433 > URL: https://issues.apache.org/jira/browse/BEAM-8433 > Project: Beam > Issue Type: Improvement > Components: dsl-sql >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Time Spent: 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8433) DataCatalogBigQueryIT runs for both Calcite and ZetaSQL dialects
[ https://issues.apache.org/jira/browse/BEAM-8433?focusedWorklogId=330747=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330747 ] ASF GitHub Bot logged work on BEAM-8433: Author: ASF GitHub Bot Created on: 18/Oct/19 20:23 Start Date: 18/Oct/19 20:23 Worklog Time Spent: 10m Work Description: amaliujia commented on issue #9831: [BEAM-8433] DataCatalogBigQueryIT runs for both Calcite and ZetaSQL d… URL: https://github.com/apache/beam/pull/9831#issuecomment-543925505 Run SQL Postcommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330747) Time Spent: 20m (was: 10m) > DataCatalogBigQueryIT runs for both Calcite and ZetaSQL dialects > > > Key: BEAM-8433 > URL: https://issues.apache.org/jira/browse/BEAM-8433 > Project: Beam > Issue Type: Improvement > Components: dsl-sql >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Time Spent: 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8433) DataCatalogBigQueryIT runs for both Calcite and ZetaSQL dialects
[ https://issues.apache.org/jira/browse/BEAM-8433?focusedWorklogId=330746=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330746 ] ASF GitHub Bot logged work on BEAM-8433: Author: ASF GitHub Bot Created on: 18/Oct/19 20:22 Start Date: 18/Oct/19 20:22 Worklog Time Spent: 10m Work Description: amaliujia commented on pull request #9831: [BEAM-8433] DataCatalogBigQueryIT runs for both Calcite and ZetaSQL d… URL: https://github.com/apache/beam/pull/9831 …ialects. Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily: - [ ] [**Choose reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and mention them in a comment (`R: @username`). - [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf). 
[jira] [Created] (BEAM-8433) DataCatalogBigQueryIT runs for both Calcite and ZetaSQL dialects
Rui Wang created BEAM-8433: -- Summary: DataCatalogBigQueryIT runs for both Calcite and ZetaSQL dialects Key: BEAM-8433 URL: https://issues.apache.org/jira/browse/BEAM-8433 Project: Beam Issue Type: Improvement Components: dsl-sql Reporter: Rui Wang Assignee: Rui Wang -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8424) java 11 dataflow validates runner tests are timeouting
[ https://issues.apache.org/jira/browse/BEAM-8424?focusedWorklogId=330740=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330740 ] ASF GitHub Bot logged work on BEAM-8424: Author: ASF GitHub Bot Created on: 18/Oct/19 20:08 Start Date: 18/Oct/19 20:08 Worklog Time Spent: 10m Work Description: lgajowy commented on issue #9819: [BEAM-8424] Prevent Java 11 VR tests from timeouting URL: https://github.com/apache/beam/pull/9819#issuecomment-543918486 Sounds good. Thanks! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330740) Time Spent: 1h (was: 50m) > java 11 dataflow validates runner tests are timeouting > -- > > Key: BEAM-8424 > URL: https://issues.apache.org/jira/browse/BEAM-8424 > Project: Beam > Issue Type: Bug > Components: test-failures >Reporter: Lukasz Gajowy >Priority: Major > Time Spent: 1h > Remaining Estimate: 0h > > [https://builds.apache.org/view/A-D/view/Beam/view/All/job/beam_PostCommit_Java11_ValidatesRunner_Dataflow/] > [https://builds.apache.org/view/A-D/view/Beam/view/All/job/beam_PostCommit_Java11_ValidatesRunner_PortabilityApi_Dataflow/] > these jobs take more than currently set timeout (3h). > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=330700=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330700 ] ASF GitHub Bot logged work on BEAM-7926: Author: ASF GitHub Bot Created on: 18/Oct/19 18:26 Start Date: 18/Oct/19 18:26 Worklog Time Spent: 10m Work Description: KevinGG commented on pull request #9741: [BEAM-7926] Visualize PCollection URL: https://github.com/apache/beam/pull/9741#discussion_r336621115 ## File path: sdks/python/apache_beam/runners/interactive/display/pcoll_visualization_test.py
```
@@ -0,0 +1,133 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Tests for apache_beam.runners.interactive.display.pcoll_visualization."""
+from __future__ import absolute_import
+
+import sys
+import time
+import unittest
+
+import timeloop
+
+import apache_beam as beam
+from apache_beam.runners import runner
+from apache_beam.runners.interactive import interactive_environment as ie
+from apache_beam.runners.interactive.display import pcoll_visualization as pv
+
+# Work around nose tests using Python2 without unittest.mock module.
+try:
+  from unittest.mock import patch
+except ImportError:
+  from mock import patch
+
+
+class PCollVisualizationTest(unittest.TestCase):
+
+  def setUp(self):
+    self._p = beam.Pipeline()
+    # pylint: disable=range-builtin-not-iterating
+    self._pcoll = self._p | 'Create' >> beam.Create(range(1000))
+
+  @unittest.skipIf(sys.version_info < (3, 5, 3),
+                   'PCollVisualization is not supported on Python 2.')
+  def test_raise_error_for_non_pcoll_input(self):
+    class Foo(object):
+      pass
+
+    with self.assertRaises(ValueError) as ctx:
+      pv.PCollVisualization(Foo())
+    self.assertTrue('pcoll should be apache_beam.pvalue.PCollection' in
+                    ctx.exception)
+
+  @unittest.skipIf(sys.version_info < (3, 5, 3),
+                   'PCollVisualization is not supported on Python 2.')
+  def test_pcoll_visualization_generate_unique_display_id(self):
+    pv_1 = pv.PCollVisualization(self._pcoll)
+    pv_2 = pv.PCollVisualization(self._pcoll)
+    self.assertNotEqual(pv_1._dive_display_id, pv_2._dive_display_id)
+    self.assertNotEqual(pv_1._overview_display_id, pv_2._overview_display_id)
+    self.assertNotEqual(pv_1._df_display_id, pv_2._df_display_id)
+
+  @unittest.skipIf(sys.version_info < (3, 5, 3),
+                   'PCollVisualization is not supported on Python 2.')
+  @patch('apache_beam.runners.interactive.display.pcoll_visualization'
+         '.PCollVisualization._to_element_list', lambda x: [1, 2, 3])
+  def test_one_shot_visualization_not_return_handle(self):
+    self.assertIsNone(pv.visualize(self._pcoll))
+
+  def _mock_to_element_list(self):
+    yield [1, 2, 3]
+    yield [1, 2, 3, 4]
+
+  @unittest.skipIf(sys.version_info < (3, 5, 3),
+                   'PCollVisualization is not supported on Python 2.')
+  @patch('apache_beam.runners.interactive.display.pcoll_visualization'
+         '.PCollVisualization._to_element_list', _mock_to_element_list)
+  def test_dynamical_plotting_return_handle(self):
+    h = pv.visualize(self._pcoll, dynamical_plotting_interval=1)
+    self.assertIsInstance(h, timeloop.Timeloop)
+    h.stop()
+
+  @unittest.skipIf(sys.version_info < (3, 5, 3),
+                   'PCollVisualization is not supported on Python 2.')
+  @patch('apache_beam.runners.interactive.display.pcoll_visualization'
+         '.PCollVisualization._to_element_list', _mock_to_element_list)
+  @patch('apache_beam.runners.interactive.display.pcoll_visualization'
+         '.PCollVisualization.display_facets')
+  def test_dynamical_plotting_update_same_display(self,
+                                                  mocked_display_facets):
+    # Starts async dynamical plotting.
+    h = pv.visualize(self._pcoll, dynamical_plotting_interval=0.001)
+    # Blocking so the above async task can execute a few iterations.
+    time.sleep(0.1)
```
Review comment: Using a while loop that queries for an ending condition to block instead. Also added a timeout to ensure the while loop never exceeds 0.1
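The review suggestion above — replace a fixed `time.sleep` with a condition-polling loop bounded by a timeout — is a common way to make asynchronous tests both faster and less flaky. A generic sketch of the pattern (not the actual Beam test code; the helper name is made up):

```python
import time

def wait_until(condition, timeout=0.1, interval=0.001):
    """Poll condition() until it returns True or `timeout` seconds elapse.

    Returns True as soon as the condition holds; returns False if the
    deadline passes first, so the caller can fail the test with a clear
    message instead of hanging forever.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    # One final check, so a condition that became true exactly at the
    # deadline is still reported as success.
    return condition()

# Usage: instead of sleep(0.1) and hoping the async task ran, wait for
# observable evidence that it did.
events = []
events.append('display_updated')  # stands in for the async task's side effect
ok = wait_until(lambda: 'display_updated' in events, timeout=0.05)
```

The timeout bounds the worst case (the test never blocks longer than it), while the common case returns as soon as the condition is observed.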
[jira] [Work logged] (BEAM-8428) [SQL] BigQuery should support project push-down in DIRECT_READ mode
[ https://issues.apache.org/jira/browse/BEAM-8428?focusedWorklogId=330694=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330694 ] ASF GitHub Bot logged work on BEAM-8428: Author: ASF GitHub Bot Created on: 18/Oct/19 18:03 Start Date: 18/Oct/19 18:03 Worklog Time Spent: 10m Work Description: 11moon11 commented on issue #9823: [BEAM-8428] [SQL] Add project push-down for BigQuery URL: https://github.com/apache/beam/pull/9823#issuecomment-543864425 Run Direct Runner Nexmark Tests This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330694) Time Spent: 50m (was: 40m) > [SQL] BigQuery should support project push-down in DIRECT_READ mode > --- > > Key: BEAM-8428 > URL: https://issues.apache.org/jira/browse/BEAM-8428 > Project: Beam > Issue Type: Improvement > Components: dsl-sql >Reporter: Kirill Kozlov >Assignee: Kirill Kozlov >Priority: Major > Time Spent: 50m > Remaining Estimate: 0h > > BigQuery should perform project push-down for read pipelines when applicable. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8428) [SQL] BigQuery should support project push-down in DIRECT_READ mode
[ https://issues.apache.org/jira/browse/BEAM-8428?focusedWorklogId=330695=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330695 ] ASF GitHub Bot logged work on BEAM-8428: Author: ASF GitHub Bot Created on: 18/Oct/19 18:03 Start Date: 18/Oct/19 18:03 Worklog Time Spent: 10m Work Description: 11moon11 commented on issue #9823: [BEAM-8428] [SQL] Add project push-down for BigQuery URL: https://github.com/apache/beam/pull/9823#issuecomment-543864425 Run Direct Runner Nexmark Tests This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330695) Time Spent: 1h (was: 50m) > [SQL] BigQuery should support project push-down in DIRECT_READ mode > --- > > Key: BEAM-8428 > URL: https://issues.apache.org/jira/browse/BEAM-8428 > Project: Beam > Issue Type: Improvement > Components: dsl-sql >Reporter: Kirill Kozlov >Assignee: Kirill Kozlov >Priority: Major > Time Spent: 1h > Remaining Estimate: 0h > > BigQuery should perform project push-down for read pipelines when applicable. -- This message was sent by Atlassian Jira (v8.3.4#803005)
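Project push-down, as requested for the BigQuery DIRECT_READ path above, means the source itself drops unneeded fields instead of reading full rows and discarding columns in a later step. A simplified, runner-agnostic sketch of the idea (plain dicts standing in for rows; this is not the Beam SQL implementation):

```python
def read_rows(rows, selected_fields=None):
    """Simulate a source that supports project push-down: when
    selected_fields is given, only those fields ever leave the source,
    so less data is transferred and deserialized."""
    if selected_fields is None:
        return [dict(row) for row in rows]
    return [{f: row[f] for f in selected_fields} for row in rows]

table = [
    {'id': 1, 'name': 'a', 'payload': 'x' * 10},
    {'id': 2, 'name': 'b', 'payload': 'y' * 10},
]

# SELECT id, name FROM table  ->  push the projection into the read itself.
projected = read_rows(table, selected_fields=['id', 'name'])
```

In the real IO, `selected_fields` would be handed to the BigQuery Storage (direct read) API so the projection happens server-side.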
[jira] [Work logged] (BEAM-4132) Element type inference doesn't work for multi-output DoFns
[ https://issues.apache.org/jira/browse/BEAM-4132?focusedWorklogId=330691=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330691 ] ASF GitHub Bot logged work on BEAM-4132: Author: ASF GitHub Bot Created on: 18/Oct/19 17:55 Start Date: 18/Oct/19 17:55 Worklog Time Spent: 10m Work Description: udim commented on pull request #9810: [BEAM-4132] Support multi-output type inference URL: https://github.com/apache/beam/pull/9810#discussion_r336608361 ## File path: sdks/python/apache_beam/pipeline.py
```
@@ -571,6 +573,10 @@ def _infer_result_type(self, transform, inputs, result_pcollection):
       else:
         result_pcollection.element_type = transform.infer_output_type(
             input_element_type)
+    elif isinstance(result_pcollection, pvalue.DoOutputsTuple):
+      # Single-input, multi-output inference.
+      for pcoll in result_pcollection:
+        self._infer_result_type(transform, inputs, pcoll)
```
Review comment: This indeed does set the same output type for each pcoll. Maybe there's a way to infer more deeply. I can look at this later if this would be of value (I'm thinking that we should come up with a way to specify multi-output type hints). Currently this is better than leaving pcoll.element_type as None, which fails type checking. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330691) Time Spent: 1h 10m (was: 1h)
> Element type inference doesn't work for multi-output DoFns
> --
>
> Key: BEAM-4132
> URL: https://issues.apache.org/jira/browse/BEAM-4132
> Project: Beam
> Issue Type: Bug
> Components: sdk-py-core
> Affects Versions: 2.4.0
> Reporter: Chuan Yu Foo
> Assignee: Udi Meiri
> Priority: Major
> Time Spent: 1h 10m
> Remaining Estimate: 0h
>
> TLDR: if you have a multi-output DoFn, then the non-main PCollections will
> incorrectly have their element types set to None. This affects type checking
> for pipelines involving these PCollections.
> Minimal example:
> {code}
> import apache_beam as beam
>
> class TripleDoFn(beam.DoFn):
>   def process(self, elem):
>     yield elem
>     if elem % 2 == 0:
>       yield beam.pvalue.TaggedOutput('ten_times', elem * 10)
>     if elem % 3 == 0:
>       yield beam.pvalue.TaggedOutput('hundred_times', elem * 100)
>
> @beam.typehints.with_input_types(int)
> @beam.typehints.with_output_types(int)
> class MultiplyBy(beam.DoFn):
>   def __init__(self, multiplier):
>     self._multiplier = multiplier
>
>   def process(self, elem):
>     return elem * self._multiplier
>
> def main():
>   with beam.Pipeline() as p:
>     x, a, b = (
>         p
>         | 'Create' >> beam.Create([1, 2, 3])
>         | 'TripleDo' >> beam.ParDo(TripleDoFn()).with_outputs(
>             'ten_times', 'hundred_times', main='main_output'))
>     _ = a | 'MultiplyBy2' >> beam.ParDo(MultiplyBy(2))
>
> if __name__ == '__main__':
>   main()
> {code}
> Running this yields the following error:
> {noformat}
> apache_beam.typehints.decorators.TypeCheckError: Type hint violation for
> 'MultiplyBy2': requires but got None for elem
> {noformat}
> Replacing {{a}} with {{b}} yields the same error. Replacing {{a}} with {{x}}
> instead yields the following error:
> {noformat}
> apache_beam.typehints.decorators.TypeCheckError: Type hint violation for
> 'MultiplyBy2': requires but got Union[TaggedOutput, int] for elem
> {noformat}
> I would expect Beam to correctly infer that {{a}} and {{b}} have element
> types of {{int}} rather than {{None}}, and I would also expect Beam to
> correctly figure out that the element types of {{x}} are compatible with
> {{int}}.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
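The fix discussed in this thread propagates the one inferred element type of a multi-output transform to each tagged output, instead of leaving the non-main outputs untyped (None). Stripped of Beam specifics, the core of the change is this mapping (illustrative only):

```python
def infer_output_types(inferred_type, output_tags):
    """Assign the single inferred element type to every tagged output,
    rather than leaving non-main outputs as None (which fails downstream
    type checking, as in the TypeCheckError quoted above)."""
    return {tag: inferred_type for tag in output_tags}

types = infer_output_types(int, ['main_output', 'ten_times', 'hundred_times'])
```

This is deliberately coarse: all outputs share one type. Per-tag type hints (so `ten_times` and `hundred_times` could differ) are the deeper inference the reviewer mentions.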
[jira] [Work logged] (BEAM-8405) Python: Datastore: add support for embedded entities
[ https://issues.apache.org/jira/browse/BEAM-8405?focusedWorklogId=330686=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330686 ] ASF GitHub Bot logged work on BEAM-8405: Author: ASF GitHub Bot Created on: 18/Oct/19 17:49 Start Date: 18/Oct/19 17:49 Worklog Time Spent: 10m Work Description: chamikaramj commented on pull request #9805: [BEAM-8405] Support embedded Datastore entities URL: https://github.com/apache/beam/pull/9805#discussion_r336605472 ## File path: sdks/python/apache_beam/io/gcp/datastore/v1new/types_test.py
```
@@ -47,30 +49,52 @@ def setUp(self):
         # Don't do any network requests.
         _http=mock.MagicMock())
 
+  def _assert_keys_equal(self, beam_type, client_type, expected_project):
+    self.assertEqual(beam_type.path_elements[0], client_type.kind)
+    self.assertEqual(beam_type.path_elements[1], client_type.id)
+    self.assertEqual(expected_project, client_type.project)
+
   def testEntityToClientEntity(self):
+    # Test conversion from Beam type to client type.
     k = Key(['kind', 1234], project=self._PROJECT)
     kc = k.to_client_key()
-    exclude_from_indexes = ('efi1', 'efi2')
+    exclude_from_indexes = ('datetime', 'key')
     e = Entity(k, exclude_from_indexes=exclude_from_indexes)
-    ref = Key(['kind2', 1235])
-    e.set_properties({'efi1': 'value', 'property': 'value', 'ref': ref})
+    properties = {
+        'datetime': datetime.datetime.utcnow(),
+        'key_ref': Key(['kind2', 1235]),
+        'bool': True,
+        'float': 1.21,
+        'int': 1337,
+        'unicode': 'text',
+        'bytes': b'bytes',
+        'geopoint': GeoPoint(0.123, 0.456),
+        'none': None,
+        'list': [1, 2, 3],
+        'entity': Entity(Key(['kind', 111])),
+        'dict': {'property': 5},
+    }
+    e.set_properties(properties)
     ec = e.to_client_entity()
     self.assertEqual(kc, ec.key)
     self.assertSetEqual(set(exclude_from_indexes), ec.exclude_from_indexes)
     self.assertEqual('kind', ec.kind)
     self.assertEqual(1234, ec.id)
-    self.assertEqual('kind2', ec['ref'].kind)
-    self.assertEqual(1235, ec['ref'].id)
-    self.assertEqual(self._PROJECT, ec['ref'].project)
-
-  def testEntityFromClientEntity(self):
```
Review comment: We don't need this test anymore? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330686) Time Spent: 0.5h (was: 20m) > Python: Datastore: add support for embedded entities > - > > Key: BEAM-8405 > URL: https://issues.apache.org/jira/browse/BEAM-8405 > Project: Beam > Issue Type: Bug > Components: io-py-gcp >Reporter: Udi Meiri >Assignee: Udi Meiri >Priority: Major > Time Spent: 0.5h > Remaining Estimate: 0h > > The conversion methods to/from the client entity type should be updated to > support an embedded Entity. > https://github.com/apache/beam/blob/603d68aafe9bdcd124d28ad62ad36af01e7a7403/sdks/python/apache_beam/io/gcp/datastore/v1new/types.py#L216-L240 -- This message was sent by Atlassian Jira (v8.3.4#803005)
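Supporting embedded entities (the subject of BEAM-8405) means the Beam-to-client conversion has to recurse: a property value may itself be an entity, or a list or dict containing one. A schematic, dict-based sketch of that recursion — the `Entity` class and `to_client_value` helper below are illustrative stand-ins, not the real `v1new.types` API:

```python
class Entity(object):
    """Minimal stand-in for a Datastore entity: a key plus properties."""
    def __init__(self, key, properties=None):
        self.key = key
        self.properties = properties or {}

def to_client_value(value):
    """Recursively convert property values, descending into embedded
    entities, lists, and dicts; scalars pass through unchanged."""
    if isinstance(value, Entity):
        return {'key': value.key,
                'properties': {k: to_client_value(v)
                               for k, v in value.properties.items()}}
    if isinstance(value, list):
        return [to_client_value(v) for v in value]
    if isinstance(value, dict):
        return {k: to_client_value(v) for k, v in value.items()}
    return value

nested = Entity(('kind', 111), {'n': 5})
outer = Entity(('kind', 1234), {'entity': nested, 'list': [1, 2, 3]})
client = to_client_value(outer)
```

Without the `Entity` branch, an embedded entity would be passed to the client library unconverted, which is exactly the gap the PR closes.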
[jira] [Commented] (BEAM-8406) TextTable support JSON format
[ https://issues.apache.org/jira/browse/BEAM-8406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16954862#comment-16954862 ] Kirill Kozlov commented on BEAM-8406: - I think utilizing JsonToRow.withSchema(schema) transform may be useful here. > TextTable support JSON format > - > > Key: BEAM-8406 > URL: https://issues.apache.org/jira/browse/BEAM-8406 > Project: Beam > Issue Type: New Feature > Components: dsl-sql >Reporter: Rui Wang >Priority: Major > > Have a JSON table implementation similar to [1]. > [1]: > https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/meta/provider/text/TextTable.java -- This message was sent by Atlassian Jira (v8.3.4#803005)
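The suggested `JsonToRow.withSchema(schema)` transform (a Java SDK API) amounts to parsing each JSON text line and coercing its fields to a declared row schema. A language-neutral sketch of the same idea, in Python for illustration only (this is not the Beam API):

```python
import json

def json_to_row(line, schema):
    """Parse one JSON text line into a row dict whose field names and
    types are dictated by `schema`, a {field_name: type} mapping.
    Fields absent from the schema are dropped; values are coerced."""
    obj = json.loads(line)
    return {name: typ(obj[name]) for name, typ in schema.items()}

# A table provider would apply this per line of the text file.
schema = {'id': int, 'name': str, 'score': float}
row = json_to_row('{"id": "7", "name": "beam", "score": 3}', schema)
```

A JSON-backed `TextTable` would wire such a per-line conversion into the same read path the CSV implementation uses.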
[jira] [Work logged] (BEAM-4132) Element type inference doesn't work for multi-output DoFns
[ https://issues.apache.org/jira/browse/BEAM-4132?focusedWorklogId=330684=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330684 ] ASF GitHub Bot logged work on BEAM-4132: Author: ASF GitHub Bot Created on: 18/Oct/19 17:44 Start Date: 18/Oct/19 17:44 Worklog Time Spent: 10m Work Description: udim commented on pull request #9810: [BEAM-4132] Support multi-output type inference URL: https://github.com/apache/beam/pull/9810#discussion_r336603704 ## File path: sdks/python/apache_beam/typehints/typed_pipeline_test.py ## @@ -131,6 +131,41 @@ def filter_fn(data): self.assertEqual([1, 3], [1, 2, 3] | beam.Filter(filter_fn)) + def test_partition(self): +p = TestPipeline() +even, odd = (p + | beam.Create([1, 2, 3]) + | 'even_odd' >> beam.Partition(lambda e, _: e % 2, 2)) +# Test that the element type of even and odd is int. +res_even = (even +| 'id_even' >> beam.ParDo(lambda e: [e]).with_input_types(int)) +res_odd = (odd + | 'id_odd' >> beam.ParDo(lambda e: [e]).with_input_types(int)) +assert_that(res_even, equal_to([2]), label='even_check') +assert_that(res_odd, equal_to([1, 3]), label='odd_check') +p.run() + + def test_typed_dofn_multi_output(self): +class MyDoFn(beam.DoFn): + def process(self, element): +if element % 2: + yield beam.pvalue.TaggedOutput('odd', element) +else: + yield beam.pvalue.TaggedOutput('even', element) + +p = TestPipeline() +res = (p + | beam.Create([1, 2, 3]) + | beam.ParDo(MyDoFn()).with_outputs('odd', 'even')) +# Test that the element type of even and odd is int. Review comment: More precisely it should read that we're testing it's not None and consistent with int. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330684) Time Spent: 1h (was: 50m) > Element type inference doesn't work for multi-output DoFns > -- > > Key: BEAM-4132 > URL: https://issues.apache.org/jira/browse/BEAM-4132 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Affects Versions: 2.4.0 >Reporter: Chuan Yu Foo >Assignee: Udi Meiri >Priority: Major > Time Spent: 1h > Remaining Estimate: 0h > > TLDR: if you have a multi-output DoFn, then the non-main PCollections will > incorrectly have their element types set to None. This affects type checking > for pipelines involving these PCollections. > Minimal example: > {code} > import apache_beam as beam > class TripleDoFn(beam.DoFn): > def process(self, elem): > yield elem > if elem % 2 == 0: > yield beam.pvalue.TaggedOutput('ten_times', elem * 10) > if elem % 3 == 0: > yield beam.pvalue.TaggedOutput('hundred_times', elem * 100) > > @beam.typehints.with_input_types(int) > @beam.typehints.with_output_types(int) > class MultiplyBy(beam.DoFn): > def __init__(self, multiplier): > self._multiplier = multiplier > def process(self, elem): > return elem * self._multiplier > > def main(): > with beam.Pipeline() as p: > x, a, b = ( > p > | 'Create' >> beam.Create([1, 2, 3]) > | 'TripleDo' >> beam.ParDo(TripleDoFn()).with_outputs( > 'ten_times', 'hundred_times', main='main_output')) > _ = a | 'MultiplyBy2' >> beam.ParDo(MultiplyBy(2)) > if __name__ == '__main__': > main() > {code} > Running this yields the following error: > {noformat} > apache_beam.typehints.decorators.TypeCheckError: Type hint violation for > 'MultiplyBy2': requires <type 'int'> but got None for elem > {noformat} > Replacing {{a}} with {{b}} yields the same error. 
Replacing {{a}} with {{x}} > instead yields the following error: > {noformat} > apache_beam.typehints.decorators.TypeCheckError: Type hint violation for > 'MultiplyBy2': requires <type 'int'> but got Union[TaggedOutput, int] for elem > {noformat} > I would expect Beam to correctly infer that {{a}} and {{b}} have element > types of {{int}} rather than {{None}}, and I would also expect Beam to > correctly figure out that the element types of {{x}} are compatible with > {{int}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
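The fix in PR #9810 makes the inferred output type of a multi-output DoFn available per tag instead of falling back to None. An illustrative sketch of that idea (not Beam's actual inference internals, which work from type hints rather than sample runs): run the fn over sample elements and collect the element type observed on each tagged output.

```python
# Stand-in for beam.pvalue.TaggedOutput.
class TaggedOutput:
    def __init__(self, tag, value):
        self.tag, self.value = tag, value

def infer_output_types(fn, samples, main='main'):
    """Collect the set of element types seen on each output tag."""
    types_by_tag = {}
    for elem in samples:
        for out in fn(elem):
            tag, value = ((out.tag, out.value)
                          if isinstance(out, TaggedOutput) else (main, out))
            types_by_tag.setdefault(tag, set()).add(type(value))
    return types_by_tag

def triple_do_fn(elem):
    # Mirrors the ticket's TripleDoFn shape: main output plus a tagged one.
    yield elem
    if elem % 2 == 0:
        yield TaggedOutput('ten_times', elem * 10)

inferred = infer_output_types(triple_do_fn, [1, 2, 3])
```

Under this scheme the tagged PCollection gets `int` rather than None, and the main output's `Union[TaggedOutput, int]` problem disappears because TaggedOutput wrappers are unwrapped before their value's type is recorded.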
[jira] [Work logged] (BEAM-7732) Allow access to SpannerOptions in Beam
[ https://issues.apache.org/jira/browse/BEAM-7732?focusedWorklogId=330669=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330669 ] ASF GitHub Bot logged work on BEAM-7732: Author: ASF GitHub Bot Created on: 18/Oct/19 17:24 Start Date: 18/Oct/19 17:24 Worklog Time Spent: 10m Work Description: chamikaramj commented on issue #9048: [BEAM-7732] Enable setting custom SpannerOptions. URL: https://github.com/apache/beam/pull/9048#issuecomment-543847351 Can we get additional options as KVs and generate SpannerOptions from that ? I would like not to leak client library classes through Beam SpannerIO API if possible. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330669) Time Spent: 3.5h (was: 3h 20m) > Allow access to SpannerOptions in Beam > -- > > Key: BEAM-7732 > URL: https://issues.apache.org/jira/browse/BEAM-7732 > Project: Beam > Issue Type: Improvement > Components: io-java-gcp >Affects Versions: 2.12.0, 2.13.0 >Reporter: Niel Markwick >Priority: Minor > Time Spent: 3.5h > Remaining Estimate: 0h > > Beam hides the > [SpannerOptions|https://github.com/googleapis/google-cloud-java/blob/master/google-cloud-clients/google-cloud-spanner/src/main/java/com/google/cloud/spanner/SpannerOptions.java] > object behind a > [SpannerConfig|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/SpannerConfig.java] > object because the SpannerOptions object is not serializable. > This means that the only options that can be set are those that can be > specified in SpannerConfig - limited to host, project, instance, database. 
> Suggestion: add the possibility to set a SpannerOptionsFactory in > SpannerConfig: > {code:java} > public interface SpannerOptionsFactory extends Serializable { > public SpannerOptions create(); > } > {code} > This would allow the user use this factory class to specify custom > SpannerOptions before they are passed onto the connectToSpanner() method; > connectToSpanner() would then become: > {code:java} > public SpannerAccessor connectToSpanner() { > > SpannerOptions.Builder builder = spannerOptionsFactory.create().toBuilder(); > // rest of connectToSpanner follows, setting project, host, etc. > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
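The review comment on #9048 suggests an alternative to both the factory and to exposing `SpannerOptions` directly: carry the extra settings as plain key/value pairs (which serialize trivially) and only build the client options object inside `connectToSpanner()`. A hedged Python sketch of that shape — the builder and `connect` names are stand-ins, not the SpannerIO API:

```python
class FakeClientOptionsBuilder:
    """Stand-in for a SpannerOptions.Builder-style fluent builder."""
    def __init__(self):
        self.settings = {}

    def set(self, key, value):
        self.settings[key] = value
        return self  # fluent chaining, as client builders usually allow

    def build(self):
        return dict(self.settings)

def connect(config_kvs):
    # config_kvs is the serializable piece that travels with the pipeline;
    # the non-serializable client options object only exists here.
    builder = FakeClientOptionsBuilder()
    for key, value in config_kvs.items():
        builder.set(key, value)
    return builder.build()

options = connect({'host': 'spanner.example.com', 'project': 'p1'})
```

The design point: the pipeline-level config stays a dumb, serializable value, so no client-library class leaks through the IO's public API.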
[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=330668=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330668 ] ASF GitHub Bot logged work on BEAM-7926: Author: ASF GitHub Bot Created on: 18/Oct/19 17:23 Start Date: 18/Oct/19 17:23 Worklog Time Spent: 10m Work Description: KevinGG commented on pull request #9741: [BEAM-7926] Visualize PCollection URL: https://github.com/apache/beam/pull/9741#discussion_r336595299 ## File path: sdks/python/apache_beam/runners/interactive/display/pcoll_visualization.py ## @@ -0,0 +1,258 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +"""Module visualizes PCollection data. + +For internal use only; no backwards-compatibility guarantees. +Only works with Python 3.5+. 
+""" +from __future__ import absolute_import + +import base64 +import logging +from datetime import timedelta + +from facets_overview.generic_feature_statistics_generator import GenericFeatureStatisticsGenerator +from IPython.core.display import HTML +from IPython.core.display import Javascript +from IPython.core.display import display +from IPython.core.display import display_javascript +from IPython.core.display import update_display +from pandas.io.json import json_normalize +from timeloop import Timeloop + +from apache_beam import pvalue +from apache_beam.runners.interactive import interactive_environment as ie +from apache_beam.runners.interactive import pipeline_instrument as instr + +# jsons doesn't support < Python 3.5. Work around with json for legacy tests. +try: + import jsons + _pv_jsons_load = jsons.load + _pv_jsons_dump = jsons.dump +except ImportError: + import json + _pv_jsons_load = json.load + _pv_jsons_dump = json.dump + +# 1-d types that need additional normalization to be compatible with DataFrame. 
+_one_dimension_types = (int, float, str, bool, list, tuple) + +_DIVE_SCRIPT_TEMPLATE = """ +document.querySelector("#{display_id}").data = {jsonstr};""" +_DIVE_HTML_TEMPLATE = """ +https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"> +https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html;> + + + document.querySelector("#{display_id}").data = {jsonstr}; +""" +_OVERVIEW_SCRIPT_TEMPLATE = """ + document.querySelector("#{display_id}").protoInput = "{protostr}"; + """ +_OVERVIEW_HTML_TEMPLATE = """ +https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"> +https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html;> + + + document.querySelector("#{display_id}").protoInput = "{protostr}"; +""" +_DATAFRAME_PAGINATION_TEMPLATE = """ +https://ajax.googleapis.com/ajax/libs/jquery/2.2.2/jquery.min.js"> +https://cdn.datatables.net/1.10.16/js/jquery.dataTables.js"> +https://cdn.datatables.net/1.10.16/css/jquery.dataTables.css;> +{dataframe_html} + + $("#{table_id}").DataTable(); +""" + + +def visualize(pcoll, dynamical_plotting_interval=None): Review comment: It automatically checks the running job's pipeline result and stop itself when the job is in an terminated state. The end user could also assign the returned handle from visualize() to a variable, say `handle = visualize(pcoll)` and call `handle.stop()` to explicitly manually stop the visualization from anywhere in the notebook. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330668) Time Spent: 5.5h (was: 5h 20m) > Visualize PCollection with Interactive
[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=330670=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330670 ] ASF GitHub Bot logged work on BEAM-7926: Author: ASF GitHub Bot Created on: 18/Oct/19 17:25 Start Date: 18/Oct/19 17:25 Worklog Time Spent: 10m Work Description: KevinGG commented on pull request #9741: [BEAM-7926] Visualize PCollection URL: https://github.com/apache/beam/pull/9741#discussion_r336592169 ## File path: sdks/python/apache_beam/runners/interactive/interactive_environment.py ## @@ -105,3 +113,33 @@ def set_cache_manager(self, cache_manager): def cache_manager(self): """Gets the cache manager held by current Interactive Environment.""" return self._cache_manager + + def set_pipeline_result(self, pipeline, result): Review comment: It's not used yet. It should be invoked by the InteractiveRunner once we have all the building blocks ready. What we are planning to do is to only allow one job for one user pipeline running at the same time within a runner instance. So if the user re-executes p.run() for an async running job. The runner should cancel the running job using current tracked pipeline_result and start a new one, then track the new pipeline_result. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330670) Time Spent: 5h 40m (was: 5.5h) > Visualize PCollection with Interactive Beam > --- > > Key: BEAM-7926 > URL: https://issues.apache.org/jira/browse/BEAM-7926 > Project: Beam > Issue Type: New Feature > Components: runner-py-interactive >Reporter: Ning Kang >Assignee: Ning Kang >Priority: Major > Time Spent: 5h 40m > Remaining Estimate: 0h > > Support auto plotting / charting of materialized data of a given PCollection > with Interactive Beam. 
> Say an Interactive Beam pipeline defined as > p = create_pipeline() > pcoll = p | 'Transform' >> transform() > The use can call a single function and get auto-magical charting of the data > as materialized pcoll. > e.g., visualize(pcoll) -- This message was sent by Atlassian Jira (v8.3.4#803005)
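The review thread above describes the intended `visualize(pcoll)` contract: with dynamic plotting on, it returns a handle immediately, refreshes the display on an interval, stops itself once the pipeline reaches a terminal state, and supports an explicit `handle.stop()`. A minimal threading sketch of that behavior, with assumed names (this is not the Interactive Beam API):

```python
import threading

class VisualizationHandle:
    """Background poller: re-render on an interval until done or stopped."""
    def __init__(self, is_done, render, interval):
        self._is_done, self._render, self._interval = is_done, render, interval
        self._stopped = threading.Event()
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def _loop(self):
        while not self._stopped.is_set():
            self._render()
            if self._is_done():   # auto-stop when the pipeline is terminal
                break
            self._stopped.wait(self._interval)

    def stop(self):
        """Explicit manual stop, callable from any notebook cell."""
        self._stopped.set()
        self._thread.join()

renders = []
handle = VisualizationHandle(is_done=lambda: len(renders) >= 3,
                             render=lambda: renders.append('frame'),
                             interval=0.01)
handle._thread.join(timeout=2)  # here the fake "pipeline" finishes quickly
```

Using `Event.wait(interval)` instead of `time.sleep` is what makes `stop()` responsive mid-interval rather than only at the next tick.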
[jira] [Work logged] (BEAM-8430) sdk_worker_parallelism default is inconsistent between Py and Java SDKs
[ https://issues.apache.org/jira/browse/BEAM-8430?focusedWorklogId=330666=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330666 ] ASF GitHub Bot logged work on BEAM-8430: Author: ASF GitHub Bot Created on: 18/Oct/19 17:21 Start Date: 18/Oct/19 17:21 Worklog Time Spent: 10m Work Description: tweise commented on issue #9829: [BEAM-8430] Change py default sdk_worker_parallelism to 1 URL: https://github.com/apache/beam/pull/9829#issuecomment-543845796 The mismatch actually goes back to https://github.com/apache/beam/commit/f3623e8ba2257f7659ccb312dc2574f862ef41b5#diff-525d5d65bedd7ea5e6fce6e4cd57e153 and was masked by the job server option that got removed now. Please also add a comment to `PortableOptions` that they need to remain in sync with the Java version. Longer term it would be nice to define such options in proto. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330666) Time Spent: 0.5h (was: 20m) > sdk_worker_parallelism default is inconsistent between Py and Java SDKs > --- > > Key: BEAM-8430 > URL: https://issues.apache.org/jira/browse/BEAM-8430 > Project: Beam > Issue Type: Improvement > Components: sdk-java-core, sdk-py-core >Reporter: Kyle Weaver >Assignee: Kyle Weaver >Priority: Major > Time Spent: 0.5h > Remaining Estimate: 0h > > Default is currently 1 in Java, 0 in Python. > https://github.com/apache/beam/blob/7b67a926b8939ede8f2e33c85579b540d18afccf/sdks/java/core/src/main/java/org/apache/beam/sdk/options/PortablePipelineOptions.java#L73 > https://github.com/apache/beam/blob/7b67a926b8939ede8f2e33c85579b540d18afccf/sdks/python/apache_beam/options/pipeline_options.py#L848 -- This message was sent by Atlassian Jira (v8.3.4#803005)
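The comment on #9829 notes the longer-term fix for BEAM-8430: define portable option defaults once (ideally in proto) so the Java and Python SDKs cannot drift. A toy sketch of that single-source-of-truth idea, with a plain dict standing in for a proto definition; the values mirror the ticket's mismatch (Java defaulted to 1, Python to 0):

```python
# One shared declaration of defaults, consumed by every "SDK".
PORTABLE_OPTION_DEFAULTS = {'sdk_worker_parallelism': 1}

def make_options(overrides=None):
    """Build an options map from the shared defaults plus user overrides."""
    options = dict(PORTABLE_OPTION_DEFAULTS)
    options.update(overrides or {})
    return options

java_like = make_options()
py_like = make_options()
```

With a shared declaration, the "please keep this comment in sync with the Java version" convention becomes unnecessary.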
[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=330665=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330665 ] ASF GitHub Bot logged work on BEAM-7926: Author: ASF GitHub Bot Created on: 18/Oct/19 17:21 Start Date: 18/Oct/19 17:21 Worklog Time Spent: 10m Work Description: KevinGG commented on pull request #9741: [BEAM-7926] Visualize PCollection URL: https://github.com/apache/beam/pull/9741#discussion_r336594249 ## File path: sdks/python/apache_beam/runners/interactive/display/pcoll_visualization.py ## @@ -0,0 +1,258 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +"""Module visualizes PCollection data. + +For internal use only; no backwards-compatibility guarantees. +Only works with Python 3.5+. 
+""" +from __future__ import absolute_import + +import base64 +import logging +from datetime import timedelta + +from facets_overview.generic_feature_statistics_generator import GenericFeatureStatisticsGenerator +from IPython.core.display import HTML +from IPython.core.display import Javascript +from IPython.core.display import display +from IPython.core.display import display_javascript +from IPython.core.display import update_display +from pandas.io.json import json_normalize +from timeloop import Timeloop + +from apache_beam import pvalue +from apache_beam.runners.interactive import interactive_environment as ie +from apache_beam.runners.interactive import pipeline_instrument as instr + +# jsons doesn't support < Python 3.5. Work around with json for legacy tests. +try: + import jsons + _pv_jsons_load = jsons.load + _pv_jsons_dump = jsons.dump +except ImportError: + import json + _pv_jsons_load = json.load + _pv_jsons_dump = json.dump + +# 1-d types that need additional normalization to be compatible with DataFrame. 
+_one_dimension_types = (int, float, str, bool, list, tuple) + +_DIVE_SCRIPT_TEMPLATE = """ +document.querySelector("#{display_id}").data = {jsonstr};""" +_DIVE_HTML_TEMPLATE = """ +https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"> +https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html;> + + + document.querySelector("#{display_id}").data = {jsonstr}; +""" +_OVERVIEW_SCRIPT_TEMPLATE = """ + document.querySelector("#{display_id}").protoInput = "{protostr}"; + """ +_OVERVIEW_HTML_TEMPLATE = """ +https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"> +https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html;> + + + document.querySelector("#{display_id}").protoInput = "{protostr}"; +""" +_DATAFRAME_PAGINATION_TEMPLATE = """ +https://ajax.googleapis.com/ajax/libs/jquery/2.2.2/jquery.min.js"> +https://cdn.datatables.net/1.10.16/js/jquery.dataTables.js"> +https://cdn.datatables.net/1.10.16/css/jquery.dataTables.css;> +{dataframe_html} + + $("#{table_id}").DataTable(); +""" + + +def visualize(pcoll, dynamical_plotting_interval=None): + """Visualizes the data of a given PCollection. Optionally enables dynamical + plotting with interval in seconds if the PCollection is being produced by a + running pipeline or the pipeline is streaming indefinitely. The function + always returns immediately and is asynchronous when dynamical plotting is on. + + If dynamical plotting enabled, the visualization is updated continuously until + the pipeline producing the PCollection is in an end state. The visualization + would be anchored to the notebook cell output area. The function + asynchronously returns a handle to the visualization job immediately. The user + could manually do:: + +# In one notebook cell, enable dynamical plotting every 1 second: +handle = visualize(pcoll, dynamical_plotting_interval=1) +# Visualization anchored to the cell's
[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=330663=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330663 ] ASF GitHub Bot logged work on BEAM-7926: Author: ASF GitHub Bot Created on: 18/Oct/19 17:18 Start Date: 18/Oct/19 17:18 Worklog Time Spent: 10m Work Description: KevinGG commented on pull request #9741: [BEAM-7926] Visualize PCollection URL: https://github.com/apache/beam/pull/9741#discussion_r336593305 ## File path: sdks/python/apache_beam/runners/interactive/display/pcoll_visualization.py ## @@ -0,0 +1,258 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +"""Module visualizes PCollection data. + +For internal use only; no backwards-compatibility guarantees. +Only works with Python 3.5+. 
+""" +from __future__ import absolute_import + +import base64 +import logging +from datetime import timedelta + +from facets_overview.generic_feature_statistics_generator import GenericFeatureStatisticsGenerator +from IPython.core.display import HTML +from IPython.core.display import Javascript +from IPython.core.display import display +from IPython.core.display import display_javascript +from IPython.core.display import update_display +from pandas.io.json import json_normalize +from timeloop import Timeloop + +from apache_beam import pvalue +from apache_beam.runners.interactive import interactive_environment as ie +from apache_beam.runners.interactive import pipeline_instrument as instr + +# jsons doesn't support < Python 3.5. Work around with json for legacy tests. +try: + import jsons + _pv_jsons_load = jsons.load + _pv_jsons_dump = jsons.dump +except ImportError: + import json + _pv_jsons_load = json.load + _pv_jsons_dump = json.dump + +# 1-d types that need additional normalization to be compatible with DataFrame. 
+_one_dimension_types = (int, float, str, bool, list, tuple) + +_DIVE_SCRIPT_TEMPLATE = """ +document.querySelector("#{display_id}").data = {jsonstr};""" +_DIVE_HTML_TEMPLATE = """ +https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"> +https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html;> + + + document.querySelector("#{display_id}").data = {jsonstr}; +""" +_OVERVIEW_SCRIPT_TEMPLATE = """ + document.querySelector("#{display_id}").protoInput = "{protostr}"; + """ +_OVERVIEW_HTML_TEMPLATE = """ +https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"> +https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html;> + + + document.querySelector("#{display_id}").protoInput = "{protostr}"; +""" +_DATAFRAME_PAGINATION_TEMPLATE = """ +https://ajax.googleapis.com/ajax/libs/jquery/2.2.2/jquery.min.js"> +https://cdn.datatables.net/1.10.16/js/jquery.dataTables.js"> +https://cdn.datatables.net/1.10.16/css/jquery.dataTables.css;> +{dataframe_html} + + $("#{table_id}").DataTable(); +""" + + +def visualize(pcoll, dynamical_plotting_interval=None): + """Visualizes the data of a given PCollection. Optionally enables dynamical + plotting with interval in seconds if the PCollection is being produced by a + running pipeline or the pipeline is streaming indefinitely. The function + always returns immediately and is asynchronous when dynamical plotting is on. + + If dynamical plotting enabled, the visualization is updated continuously until + the pipeline producing the PCollection is in an end state. The visualization + would be anchored to the notebook cell output area. The function + asynchronously returns a handle to the visualization job immediately. The user + could manually do:: + +# In one notebook cell, enable dynamical plotting every 1 second: +handle = visualize(pcoll, dynamical_plotting_interval=1) +# Visualization anchored to the cell's
[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=330661=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330661 ] ASF GitHub Bot logged work on BEAM-7926: Author: ASF GitHub Bot Created on: 18/Oct/19 17:17 Start Date: 18/Oct/19 17:17 Worklog Time Spent: 10m Work Description: KevinGG commented on pull request #9741: [BEAM-7926] Visualize PCollection URL: https://github.com/apache/beam/pull/9741#discussion_r336592842 ## File path: sdks/python/apache_beam/runners/interactive/display/pcoll_visualization_test.py ## @@ -0,0 +1,133 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +"""Tests for apache_beam.runners.interactive.display.pcoll_visualization.""" +from __future__ import absolute_import + +import sys +import time +import unittest + +import timeloop + +import apache_beam as beam +from apache_beam.runners import runner +from apache_beam.runners.interactive import interactive_environment as ie +from apache_beam.runners.interactive.display import pcoll_visualization as pv + +# Work around nose tests using Python2 without unittest.mock module. 
+try: + from unittest.mock import patch +except ImportError: + from mock import patch + + +class PCollVisualizationTest(unittest.TestCase): + + def setUp(self): +self._p = beam.Pipeline() +# pylint: disable=range-builtin-not-iterating +self._pcoll = self._p | 'Create' >> beam.Create(range(1000)) + + @unittest.skipIf(sys.version_info < (3, 5, 3), + 'PCollVisualization is not supported on Python 2.') + def test_raise_error_for_non_pcoll_input(self): +class Foo(object): + pass + +with self.assertRaises(ValueError) as ctx: + pv.PCollVisualization(Foo()) + self.assertTrue('pcoll should be apache_beam.pvalue.PCollection' in + ctx.exception) + + @unittest.skipIf(sys.version_info < (3, 5, 3), Review comment: Yes, you can do it by putting skipIf at the class level. It's just one of the tests is depending on (3,6,3) (where assert_called is introduced in unittest). So not all tests are depending on the same version. Decorating at function level gives more flexibility. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330661) Time Spent: 4h 50m (was: 4h 40m) > Visualize PCollection with Interactive Beam > --- > > Key: BEAM-7926 > URL: https://issues.apache.org/jira/browse/BEAM-7926 > Project: Beam > Issue Type: New Feature > Components: runner-py-interactive >Reporter: Ning Kang >Assignee: Ning Kang >Priority: Major > Time Spent: 4h 50m > Remaining Estimate: 0h > > Support auto plotting / charting of materialized data of a given PCollection > with Interactive Beam. > Say an Interactive Beam pipeline defined as > p = create_pipeline() > pcoll = p | 'Transform' >> transform() > The use can call a single function and get auto-magical charting of the data > as materialized pcoll. 
> e.g., visualize(pcoll) -- This message was sent by Atlassian Jira (v8.3.4#803005)
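The skipIf discussion above (class-level guard vs per-test decorators) can be shown concretely: a class-level `skipIf` covers every test with one version floor, while a per-test decorator lets an individual test require a higher version, e.g. one that needs a 3.6-only `unittest.mock` helper. The version thresholds below copy the ones from the review; the test names are illustrative.

```python
import io
import sys
import unittest

@unittest.skipIf(sys.version_info < (3, 5, 3),
                 'PCollVisualization is not supported on Python 2.')
class ClassLevelSkip(unittest.TestCase):
    def test_common(self):
        # Covered by the class-level guard alone.
        self.assertTrue(True)

    @unittest.skipIf(sys.version_info < (3, 6, 3),
                     'needs assert_called, added to mock in 3.6')
    def test_newer_feature(self):
        # Stacks a stricter per-test floor on top of the class guard.
        self.assertTrue(True)

suite = unittest.defaultTestLoader.loadTestsFromTestCase(ClassLevelSkip)
result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
```

Both decorators compose: on an interpreter below 3.5.3 the whole class is skipped; between 3.5.3 and 3.6.3 only `test_newer_feature` is.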
[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=330660=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330660 ] ASF GitHub Bot logged work on BEAM-7926: Author: ASF GitHub Bot Created on: 18/Oct/19 17:15 Start Date: 18/Oct/19 17:15 Worklog Time Spent: 10m Work Description: KevinGG commented on pull request #9741: [BEAM-7926] Visualize PCollection URL: https://github.com/apache/beam/pull/9741#discussion_r336592169 ## File path: sdks/python/apache_beam/runners/interactive/interactive_environment.py ## @@ -105,3 +113,33 @@ def set_cache_manager(self, cache_manager): def cache_manager(self): """Gets the cache manager held by current Interactive Environment.""" return self._cache_manager + + def set_pipeline_result(self, pipeline, result): Review comment: It's not used yet. It should be invoked by the InteractiveRunner once we have all the building blocks ready. What we are planning to do is to only allow one job for one user pipeline running at the same time within a runner instance. So if the user re-executes p.run() for an async running job. The runner should cancel the running job using current tracked pipeline_result and starts a new one, then track the new pipeline_result. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330660) Time Spent: 4h 40m (was: 4.5h) > Visualize PCollection with Interactive Beam > --- > > Key: BEAM-7926 > URL: https://issues.apache.org/jira/browse/BEAM-7926 > Project: Beam > Issue Type: New Feature > Components: runner-py-interactive >Reporter: Ning Kang >Assignee: Ning Kang >Priority: Major > Time Spent: 4h 40m > Remaining Estimate: 0h > > Support auto plotting / charting of materialized data of a given PCollection > with Interactive Beam. 
> Say an Interactive Beam pipeline defined as > p = create_pipeline() > pcoll = p | 'Transform' >> transform() > The use can call a single function and get auto-magical charting of the data > as materialized pcoll. > e.g., visualize(pcoll) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=330658=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330658 ] ASF GitHub Bot logged work on BEAM-7926: Author: ASF GitHub Bot Created on: 18/Oct/19 17:13 Start Date: 18/Oct/19 17:13 Worklog Time Spent: 10m Work Description: KevinGG commented on pull request #9741: [BEAM-7926] Visualize PCollection URL: https://github.com/apache/beam/pull/9741#discussion_r336591010 ## File path: sdks/python/setup.py ## @@ -107,13 +107,17 @@ def get_version(): 'crcmod>=1.7,<2.0', # Dill doesn't guarantee comatibility between releases within minor version. 'dill>=0.3.0,<0.3.1', +'facets-overview>=1.0.0,<2', Review comment: Thanks for the suggestions! Yes, I feel the same way but didn't know what to do about it. Will make the change. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330658) Time Spent: 4.5h (was: 4h 20m) > Visualize PCollection with Interactive Beam > --- > > Key: BEAM-7926 > URL: https://issues.apache.org/jira/browse/BEAM-7926 > Project: Beam > Issue Type: New Feature > Components: runner-py-interactive >Reporter: Ning Kang >Assignee: Ning Kang >Priority: Major > Time Spent: 4.5h > Remaining Estimate: 0h > > Support auto plotting / charting of materialized data of a given PCollection > with Interactive Beam. > Say an Interactive Beam pipeline defined as > p = create_pipeline() > pcoll = p | 'Transform' >> transform() > The use can call a single function and get auto-magical charting of the data > as materialized pcoll. > e.g., visualize(pcoll) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=330653&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330653 ]
ASF GitHub Bot logged work on BEAM-7926:
Author: ASF GitHub Bot
Created on: 18/Oct/19 17:04
Start Date: 18/Oct/19 17:04
Worklog Time Spent: 10m
Work Description: rohdesamuel commented on pull request #9741: [BEAM-7926] Visualize PCollection
URL: https://github.com/apache/beam/pull/9741#discussion_r336587840

## File path: sdks/python/apache_beam/runners/interactive/display/pcoll_visualization_test.py ##

@@ -0,0 +1,133 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Tests for apache_beam.runners.interactive.display.pcoll_visualization."""
+from __future__ import absolute_import
+
+import sys
+import time
+import unittest
+
+import timeloop
+
+import apache_beam as beam
+from apache_beam.runners import runner
+from apache_beam.runners.interactive import interactive_environment as ie
+from apache_beam.runners.interactive.display import pcoll_visualization as pv
+
+# Work around nose tests using Python2 without unittest.mock module.
+try:
+  from unittest.mock import patch
+except ImportError:
+  from mock import patch
+
+
+class PCollVisualizationTest(unittest.TestCase):
+
+  def setUp(self):
+    self._p = beam.Pipeline()
+    # pylint: disable=range-builtin-not-iterating
+    self._pcoll = self._p | 'Create' >> beam.Create(range(1000))
+
+  @unittest.skipIf(sys.version_info < (3, 5, 3),
+                   'PCollVisualization is not supported on Python 2.')
+  def test_raise_error_for_non_pcoll_input(self):
+    class Foo(object):
+      pass
+
+    with self.assertRaises(ValueError) as ctx:
+      pv.PCollVisualization(Foo())
+    self.assertTrue('pcoll should be apache_beam.pvalue.PCollection' in
+                    ctx.exception)
+
+  @unittest.skipIf(sys.version_info < (3, 5, 3),
+                   'PCollVisualization is not supported on Python 2.')
+  def test_pcoll_visualization_generate_unique_display_id(self):
+    pv_1 = pv.PCollVisualization(self._pcoll)
+    pv_2 = pv.PCollVisualization(self._pcoll)
+    self.assertNotEqual(pv_1._dive_display_id, pv_2._dive_display_id)
+    self.assertNotEqual(pv_1._overview_display_id, pv_2._overview_display_id)
+    self.assertNotEqual(pv_1._df_display_id, pv_2._df_display_id)
+
+  @unittest.skipIf(sys.version_info < (3, 5, 3),
+                   'PCollVisualization is not supported on Python 2.')
+  @patch('apache_beam.runners.interactive.display.pcoll_visualization'
+         '.PCollVisualization._to_element_list', lambda x: [1, 2, 3])
+  def test_one_shot_visualization_not_return_handle(self):
+    self.assertIsNone(pv.visualize(self._pcoll))
+
+  def _mock_to_element_list(self):
+    yield [1, 2, 3]
+    yield [1, 2, 3, 4]
+
+  @unittest.skipIf(sys.version_info < (3, 5, 3),
+                   'PCollVisualization is not supported on Python 2.')
+  @patch('apache_beam.runners.interactive.display.pcoll_visualization'
+         '.PCollVisualization._to_element_list', _mock_to_element_list)
+  def test_dynamical_plotting_return_handle(self):
+    h = pv.visualize(self._pcoll, dynamical_plotting_interval=1)
+    self.assertIsInstance(h, timeloop.Timeloop)
+    h.stop()
+
+  @unittest.skipIf(sys.version_info < (3, 5, 3),
+                   'PCollVisualization is not supported on Python 2.')
+  @patch('apache_beam.runners.interactive.display.pcoll_visualization'
+         '.PCollVisualization._to_element_list', _mock_to_element_list)
+  @patch('apache_beam.runners.interactive.display.pcoll_visualization'
+         '.PCollVisualization.display_facets')
+  def test_dynamical_plotting_update_same_display(self,
+                                                  mocked_display_facets):
+    # Starts async dynamical plotting.
+    h = pv.visualize(self._pcoll, dynamical_plotting_interval=0.001)
+    # Blocking so the above async task can execute a few iterations.
+    time.sleep(0.1)

Review comment: Please don't use time.sleep() in unit tests. Not only does it slow down the testing, but it also generally leads to flaky tests,
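One common way to address this review comment — a sketch only, not the fix the PR actually adopted — is to drive the update loop deterministically and synchronize on an event rather than sleeping for a fixed interval. `PeriodicUpdater` below is a hypothetical stand-in for the dynamic-plotting loop; every name in it is illustrative:

```python
import threading
import unittest
from unittest.mock import MagicMock


class PeriodicUpdater:
    """Hypothetical stand-in for a background update loop, for illustration."""

    def __init__(self, callback):
        self._callback = callback
        self._updated = threading.Event()

    def tick(self):
        """Run one iteration of the update loop synchronously."""
        self._callback()
        self._updated.set()

    def wait_for_update(self, timeout=5):
        """Block until at least one update has happened, not for a fixed time."""
        return self._updated.wait(timeout)


class UpdaterTest(unittest.TestCase):
    def test_update_called_without_sleep(self):
        display = MagicMock()
        updater = PeriodicUpdater(display.update)
        updater.tick()  # drive the loop deterministically instead of time.sleep()
        self.assertTrue(updater.wait_for_update())
        display.update.assert_called_once()
```

The test completes as soon as the update fires, so it is both faster and immune to timing-dependent flakiness.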
[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=330649&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330649 ]
ASF GitHub Bot logged work on BEAM-7926:
Author: ASF GitHub Bot
Created on: 18/Oct/19 17:02
Start Date: 18/Oct/19 17:02
Worklog Time Spent: 10m
Work Description: rohdesamuel commented on pull request #9741: [BEAM-7926] Visualize PCollection
URL: https://github.com/apache/beam/pull/9741#discussion_r336582418

## File path: sdks/python/apache_beam/runners/interactive/display/pcoll_visualization_test.py ##

+  @unittest.skipIf(sys.version_info < (3, 5, 3),

Review comment: Is there a way to do this check only once? Maybe in the module's "if __name__ == "main""?

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking --- Worklog Id: (was: 330649) Time Spent: 4h 10m (was: 4h)
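For reference, `unittest` does support doing such a check exactly once: `skipIf` can decorate the whole `TestCase` class, or the module can raise `unittest.SkipTest` at import time. A minimal sketch — the class name mirrors the PR's test class, but the body is illustrative:

```python
import sys
import unittest


# Option 1: a single class-level decorator skips every test method at once.
@unittest.skipIf(sys.version_info < (3, 5, 3),
                 'PCollVisualization is not supported on Python 2.')
class PCollVisualizationTest(unittest.TestCase):
    def test_something(self):
        self.assertTrue(True)


# Option 2: skip the entire module at import time instead.
# if sys.version_info < (3, 5, 3):
#     raise unittest.SkipTest('PCollVisualization requires Python 3.5.3+')
```

On an interpreter that satisfies the version check, the decorated class behaves exactly as if undecorated.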
[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=330648&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330648 ]
ASF GitHub Bot logged work on BEAM-7926:
Author: ASF GitHub Bot
Created on: 18/Oct/19 17:01
Start Date: 18/Oct/19 17:01
Worklog Time Spent: 10m
Work Description: rohdesamuel commented on pull request #9741: [BEAM-7926] Visualize PCollection
URL: https://github.com/apache/beam/pull/9741#discussion_r336586553

## File path: sdks/python/apache_beam/runners/interactive/display/pcoll_visualization.py ##

@@ -0,0 +1,258 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module visualizes PCollection data.
+
+For internal use only; no backwards-compatibility guarantees.
+Only works with Python 3.5+.
+"""
+from __future__ import absolute_import
+
+import base64
+import logging
+from datetime import timedelta
+
+from facets_overview.generic_feature_statistics_generator import GenericFeatureStatisticsGenerator
+from IPython.core.display import HTML
+from IPython.core.display import Javascript
+from IPython.core.display import display
+from IPython.core.display import display_javascript
+from IPython.core.display import update_display
+from pandas.io.json import json_normalize
+from timeloop import Timeloop
+
+from apache_beam import pvalue
+from apache_beam.runners.interactive import interactive_environment as ie
+from apache_beam.runners.interactive import pipeline_instrument as instr
+
+# jsons doesn't support < Python 3.5. Work around with json for legacy tests.
+try:
+  import jsons
+  _pv_jsons_load = jsons.load
+  _pv_jsons_dump = jsons.dump
+except ImportError:
+  import json
+  _pv_jsons_load = json.load
+  _pv_jsons_dump = json.dump
+
+# 1-d types that need additional normalization to be compatible with DataFrame.
+_one_dimension_types = (int, float, str, bool, list, tuple)
+
+_DIVE_SCRIPT_TEMPLATE = """
+document.querySelector("#{display_id}").data = {jsonstr};"""
+_DIVE_HTML_TEMPLATE = """
+<script src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"></script>
+<link rel="import" href="https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html">
+<facets-dive id="{display_id}"></facets-dive>
+<script>
+  document.querySelector("#{display_id}").data = {jsonstr};
+</script>"""
+_OVERVIEW_SCRIPT_TEMPLATE = """
+  document.querySelector("#{display_id}").protoInput = "{protostr}";
+  """
+_OVERVIEW_HTML_TEMPLATE = """
+<script src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"></script>
+<link rel="import" href="https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html">
+<facets-overview id="{display_id}"></facets-overview>
+<script>
+  document.querySelector("#{display_id}").protoInput = "{protostr}";
+</script>"""
+_DATAFRAME_PAGINATION_TEMPLATE = """
+<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.2.2/jquery.min.js"></script>
+<script src="https://cdn.datatables.net/1.10.16/js/jquery.dataTables.js"></script>
+<link rel="stylesheet" href="https://cdn.datatables.net/1.10.16/css/jquery.dataTables.css">
+{dataframe_html}
+<script>
+  $("#{table_id}").DataTable();
+</script>"""
+
+
+def visualize(pcoll, dynamical_plotting_interval=None):

Review comment: How does one stop visualizing a PCollection?

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking --- Worklog Id: (was: 330648) Time Spent: 4h (was: 3h 50m)
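The tests quoted earlier in this thread suggest the answer to this review question: with dynamic plotting on, `visualize` returns a `timeloop.Timeloop` handle, and `h.stop()` ends it. Below is a toy sketch of that handle contract using a plain thread as a stand-in for `timeloop` — assumed behavior for illustration, not Beam's actual implementation:

```python
import threading


class VisualizationHandle:
    """Toy stand-in for the timeloop.Timeloop handle returned by visualize()."""

    def __init__(self, interval, update_fn):
        self._interval = interval
        self._update_fn = update_fn
        self._stopped = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()
        return self

    def _run(self):
        # wait() returns False on each timeout tick, True once stop() fires.
        while not self._stopped.wait(self._interval):
            self._update_fn()  # re-render the visualization output

    def stop(self):
        # Mirrors Timeloop.stop(): ends the periodic re-rendering.
        self._stopped.set()
        self._thread.join()
```

This mirrors the usage in the PR's own test: `h = pv.visualize(self._pcoll, dynamical_plotting_interval=1)` followed by `h.stop()`.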
[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=330647&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330647 ]
ASF GitHub Bot logged work on BEAM-7926:
Author: ASF GitHub Bot
Created on: 18/Oct/19 16:59
Start Date: 18/Oct/19 16:59
Worklog Time Spent: 10m
Work Description: rohdesamuel commented on pull request #9741: [BEAM-7926] Visualize PCollection
URL: https://github.com/apache/beam/pull/9741#discussion_r336586086

## File path: sdks/python/apache_beam/runners/interactive/display/pcoll_visualization.py ##
[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=330646&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330646 ]
ASF GitHub Bot logged work on BEAM-7926:
Author: ASF GitHub Bot
Created on: 18/Oct/19 16:57
Start Date: 18/Oct/19 16:57
Worklog Time Spent: 10m
Work Description: rohdesamuel commented on pull request #9741: [BEAM-7926] Visualize PCollection
URL: https://github.com/apache/beam/pull/9741#discussion_r336585148

## File path: sdks/python/apache_beam/runners/interactive/display/pcoll_visualization.py ##
[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=330645&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330645 ]
ASF GitHub Bot logged work on BEAM-7926:
Author: ASF GitHub Bot
Created on: 18/Oct/19 16:54
Start Date: 18/Oct/19 16:54
Worklog Time Spent: 10m
Work Description: rohdesamuel commented on pull request #9741: [BEAM-7926] Visualize PCollection
URL: https://github.com/apache/beam/pull/9741#discussion_r336583985

## File path: sdks/python/apache_beam/runners/interactive/display/pcoll_visualization.py ##
[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=330642&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330642 ]
ASF GitHub Bot logged work on BEAM-7926:
Author: ASF GitHub Bot
Created on: 18/Oct/19 16:52
Start Date: 18/Oct/19 16:52
Worklog Time Spent: 10m
Work Description: rohdesamuel commented on pull request #9741: [BEAM-7926] Visualize PCollection
URL: https://github.com/apache/beam/pull/9741#discussion_r336583208

## File path: sdks/python/apache_beam/runners/interactive/display/pcoll_visualization.py ##

+    # In one notebook cell, enable dynamical plotting every 1 second:
+    handle = visualize(pcoll, dynamical_plotting_interval=1)

Review comment: Please change
[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=330639=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330639 ] ASF GitHub Bot logged work on BEAM-7926: Author: ASF GitHub Bot Created on: 18/Oct/19 16:50 Start Date: 18/Oct/19 16:50 Worklog Time Spent: 10m Work Description: rohdesamuel commented on pull request #9741: [BEAM-7926] Visualize PCollection URL: https://github.com/apache/beam/pull/9741#discussion_r336582418 ## File path: sdks/python/apache_beam/runners/interactive/display/pcoll_visualization_test.py ## @@ -0,0 +1,133 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +"""Tests for apache_beam.runners.interactive.display.pcoll_visualization.""" +from __future__ import absolute_import + +import sys +import time +import unittest + +import timeloop + +import apache_beam as beam +from apache_beam.runners import runner +from apache_beam.runners.interactive import interactive_environment as ie +from apache_beam.runners.interactive.display import pcoll_visualization as pv + +# Work around nose tests using Python2 without unittest.mock module. 
+try:
+  from unittest.mock import patch
+except ImportError:
+  from mock import patch
+
+
+class PCollVisualizationTest(unittest.TestCase):
+
+  def setUp(self):
+    self._p = beam.Pipeline()
+    # pylint: disable=range-builtin-not-iterating
+    self._pcoll = self._p | 'Create' >> beam.Create(range(1000))
+
+  @unittest.skipIf(sys.version_info < (3, 5, 3),
+                   'PCollVisualization is not supported on Python 2.')
+  def test_raise_error_for_non_pcoll_input(self):
+    class Foo(object):
+      pass
+
+    with self.assertRaises(ValueError) as ctx:
+      pv.PCollVisualization(Foo())
+    self.assertTrue('pcoll should be apache_beam.pvalue.PCollection' in
+                    ctx.exception)
+
+  @unittest.skipIf(sys.version_info < (3, 5, 3),

Review comment:
   Is there a way to do this check only once? Maybe in the module's "if __name__ == "main""?
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330639) Time Spent: 3h 10m (was: 3h) > Visualize PCollection with Interactive Beam > --- > > Key: BEAM-7926 > URL: https://issues.apache.org/jira/browse/BEAM-7926 > Project: Beam > Issue Type: New Feature > Components: runner-py-interactive >Reporter: Ning Kang >Assignee: Ning Kang >Priority: Major > Time Spent: 3h 10m > Remaining Estimate: 0h > > Support auto plotting / charting of materialized data of a given PCollection > with Interactive Beam. > Say an Interactive Beam pipeline defined as > p = create_pipeline() > pcoll = p | 'Transform' >> transform() > The user can call a single function and get auto-magical charting of the data > as materialized pcoll. > e.g., visualize(pcoll) -- This message was sent by Atlassian Jira (v8.3.4#803005)
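One way to do the version check only once, as the reviewer asks, is to hoist the `skipIf` decorator to class level so it applies to every test method. A minimal sketch (the class body here is illustrative, not the Beam test itself):

```python
import sys
import unittest


# Applying skipIf once at class level skips every test method in the
# class when the condition holds, instead of repeating the decorator.
@unittest.skipIf(sys.version_info < (3, 5, 3),
                 'PCollVisualization is not supported on Python 2.')
class PCollVisualizationTest(unittest.TestCase):

  def test_example(self):
    # Inherits the class-level skip condition automatically.
    self.assertTrue(True)
```

Running this under Python 3.5.3+ executes the test normally; under older interpreters the whole class is reported as skipped.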
[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=330638=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330638 ] ASF GitHub Bot logged work on BEAM-7926: Author: ASF GitHub Bot Created on: 18/Oct/19 16:48 Start Date: 18/Oct/19 16:48 Worklog Time Spent: 10m Work Description: rohdesamuel commented on pull request #9741: [BEAM-7926] Visualize PCollection URL: https://github.com/apache/beam/pull/9741#discussion_r336581554 ## File path: sdks/python/apache_beam/runners/interactive/interactive_environment.py ## @@ -105,3 +113,33 @@ def set_cache_manager(self, cache_manager): def cache_manager(self): """Gets the cache manager held by current Interactive Environment.""" return self._cache_manager + + def set_pipeline_result(self, pipeline, result): Review comment: Why do we need this method? It looks like it's only being used in tests. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330638) Time Spent: 3h (was: 2h 50m) > Visualize PCollection with Interactive Beam > --- > > Key: BEAM-7926 > URL: https://issues.apache.org/jira/browse/BEAM-7926 > Project: Beam > Issue Type: New Feature > Components: runner-py-interactive >Reporter: Ning Kang >Assignee: Ning Kang >Priority: Major > Time Spent: 3h > Remaining Estimate: 0h > > Support auto plotting / charting of materialized data of a given PCollection > with Interactive Beam. > Say an Interactive Beam pipeline defined as > p = create_pipeline() > pcoll = p | 'Transform' >> transform() > The user can call a single function and get auto-magical charting of the data > as materialized pcoll. > e.g., visualize(pcoll) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-4132) Element type inference doesn't work for multi-output DoFns
[ https://issues.apache.org/jira/browse/BEAM-4132?focusedWorklogId=330637=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330637 ] ASF GitHub Bot logged work on BEAM-4132: Author: ASF GitHub Bot Created on: 18/Oct/19 16:46 Start Date: 18/Oct/19 16:46 Worklog Time Spent: 10m Work Description: aaltay commented on pull request #9810: [BEAM-4132] Support multi-output type inference URL: https://github.com/apache/beam/pull/9810#discussion_r336580986 ## File path: sdks/python/apache_beam/pipeline.py ## @@ -571,6 +573,10 @@ def _infer_result_type(self, transform, inputs, result_pcollection):
     else:
       result_pcollection.element_type = transform.infer_output_type(
           input_element_type)
+    elif isinstance(result_pcollection, pvalue.DoOutputsTuple):
+      # Single-input, multi-output inference.
+      for pcoll in result_pcollection:
+        self._infer_result_type(transform, inputs, pcoll)

Review comment:
   I do not believe it does, I missed this while reviewing. I have a more fundamental question. Is it possible to infer types for multiple outputs without explicit typehints from the user? A DoFn chooses its tagged output dynamically based on the element, and I am not sure how we can infer different output types knowing only the input type and the list of all possible return types from a function. In other words, is it possible to infer output types (without explicit hints) to be anything other than the union of all possible output types from that transform?
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330637) Time Spent: 50m (was: 40m) > Element type inference doesn't work for multi-output DoFns > -- > > Key: BEAM-4132 > URL: https://issues.apache.org/jira/browse/BEAM-4132 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Affects Versions: 2.4.0 >Reporter: Chuan Yu Foo >Assignee: Udi Meiri >Priority: Major > Time Spent: 50m > Remaining Estimate: 0h > > TLDR: if you have a multi-output DoFn, then the non-main PCollections will > incorrectly have their element types set to None. This affects type checking > for pipelines involving these PCollections.
> Minimal example:
> {code}
> import apache_beam as beam
>
> class TripleDoFn(beam.DoFn):
>   def process(self, elem):
>     yield elem
>     if elem % 2 == 0:
>       yield beam.pvalue.TaggedOutput('ten_times', elem * 10)
>     if elem % 3 == 0:
>       yield beam.pvalue.TaggedOutput('hundred_times', elem * 100)
>
> @beam.typehints.with_input_types(int)
> @beam.typehints.with_output_types(int)
> class MultiplyBy(beam.DoFn):
>   def __init__(self, multiplier):
>     self._multiplier = multiplier
>
>   def process(self, elem):
>     return elem * self._multiplier
>
> def main():
>   with beam.Pipeline() as p:
>     x, a, b = (
>         p
>         | 'Create' >> beam.Create([1, 2, 3])
>         | 'TripleDo' >> beam.ParDo(TripleDoFn()).with_outputs(
>             'ten_times', 'hundred_times', main='main_output'))
>     _ = a | 'MultiplyBy2' >> beam.ParDo(MultiplyBy(2))
>
> if __name__ == '__main__':
>   main()
> {code}
> Running this yields the following error:
> {noformat}
> apache_beam.typehints.decorators.TypeCheckError: Type hint violation for
> 'MultiplyBy2': requires but got None for elem
> {noformat}
> Replacing {{a}} with {{b}} yields the same error. 
Replacing {{a}} with {{x}} > instead yields the following error: > {noformat} > apache_beam.typehints.decorators.TypeCheckError: Type hint violation for > 'MultiplyBy2': requires but got Union[TaggedOutput, int] for elem > {noformat} > I would expect Beam to correctly infer that {{a}} and {{b}} have element > types of {{int}} rather than {{None}}, and I would also expect Beam to > correctly figure out that the element types of {{x}} are compatible with > {{int}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
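The question aaltay raises can be made concrete with a small sketch: without per-tag typehints, the best static inference for any single tagged output is the union of every element type the DoFn might emit. `infer_tagged_output_type` below is a hypothetical helper for illustration only, not a Beam API:

```python
from typing import Union


def infer_tagged_output_type(possible_return_types):
    """Hypothetical helper illustrating the inference limit: lacking
    explicit per-tag hints, the only sound element type for any single
    tagged output is the union of everything the DoFn can emit."""
    distinct = tuple(set(possible_return_types))
    if len(distinct) == 1:
        return distinct[0]
    # typing.Union accepts a tuple subscript and deduplicates members.
    return Union[distinct]
```

For the TripleDoFn example above, every tagged output would infer as `int` (all branches yield ints), but a DoFn emitting mixed types could only infer `Union[...]` for each tag.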
[jira] [Work logged] (BEAM-8432) Parametrize source & target compatibility for beam Java modules
[ https://issues.apache.org/jira/browse/BEAM-8432?focusedWorklogId=330636=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330636 ] ASF GitHub Bot logged work on BEAM-8432: Author: ASF GitHub Bot Created on: 18/Oct/19 16:45 Start Date: 18/Oct/19 16:45 Worklog Time Spent: 10m Work Description: lgajowy commented on pull request #9830: [BEAM-8432] Move javaVersion to gradle.properties URL: https://github.com/apache/beam/pull/9830 This is done in order to be able to change this parameter easily using the command line, eg: ./gradlew clean build -PjavaVersion=11 It's needed for providing java 11 support. @kennknowles could you take a look? Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily: - [ ] [**Choose reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and mention them in a comment (`R: @username`). - [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf). 
[jira] [Created] (BEAM-8432) Parametrize source & target compatibility for beam Java modules
Lukasz Gajowy created BEAM-8432: --- Summary: Parametrize source & target compatibility for beam Java modules Key: BEAM-8432 URL: https://issues.apache.org/jira/browse/BEAM-8432 Project: Beam Issue Type: Improvement Components: build-system Reporter: Lukasz Gajowy Assignee: Lukasz Gajowy Fix For: Not applicable Currently, "javaVersion" property is hardcoded in BeamModulePlugin in [JavaNatureConfiguration|https://github.com/apache/beam/blob/master/buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy#L82]. For the sake of migrating the project to Java 11 we could use a mechanism that will allow parametrizing the version from the command line, e.g: {code:java} // this could set source and target compatibility to 11: ./gradlew clean build -PjavaVersion=11{code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-5690) Issue with GroupByKey in BeamSql using SparkRunner
[ https://issues.apache.org/jira/browse/BEAM-5690?focusedWorklogId=330597=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330597 ] ASF GitHub Bot logged work on BEAM-5690: Author: ASF GitHub Bot Created on: 18/Oct/19 16:13 Start Date: 18/Oct/19 16:13 Worklog Time Spent: 10m Work Description: bmv126 commented on issue #9567: [BEAM-5690] Fix Zero value issue with GroupByKey/CountByKey in SparkRunner URL: https://github.com/apache/beam/pull/9567#issuecomment-543815179 Run Java PreCommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330597) Time Spent: 3h (was: 2h 50m) > Issue with GroupByKey in BeamSql using SparkRunner > -- > > Key: BEAM-5690 > URL: https://issues.apache.org/jira/browse/BEAM-5690 > Project: Beam > Issue Type: Task > Components: runner-spark >Reporter: Kenneth Knowles >Priority: Major > Time Spent: 3h > Remaining Estimate: 0h > > Reported on user@ > {quote}We are trying to setup a pipeline with using BeamSql and the trigger > used is default (AfterWatermark crosses the window). > Below is the pipeline: > >KafkaSource (KafkaIO) >---> Windowing (FixedWindow 1min) >---> BeamSql >---> KafkaSink (KafkaIO) > > We are using Spark Runner for this. > The BeamSql query is: > {code}select Col3, count(*) as count_col1 from PCOLLECTION GROUP BY Col3{code} > We are grouping by Col3 which is a string. It can hold values string[0-9]. > > The records are getting emitted out at 1 min to kafka sink, but the output > record in kafka is not as expected. 
> Below is the output observed: (WST and WET are indicators for window start > time and window end time) > {code} > {"count_col1":1,"Col3":"string5","WST":"2018-10-09 09-55-00 > +","WET":"2018-10-09 09-56-00 +"} > {"count_col1":3,"Col3":"string7","WST":"2018-10-09 09-55-00 > +","WET":"2018-10-09 09-56-00 +"} > {"count_col1":2,"Col3":"string8","WST":"2018-10-09 09-55-00 > +","WET":"2018-10-09 09-56-00 +"} > {"count_col1":1,"Col3":"string2","WST":"2018-10-09 09-55-00 > +","WET":"2018-10-09 09-56-00 +"} > {"count_col1":1,"Col3":"string6","WST":"2018-10-09 09-55-00 > +","WET":"2018-10-09 09-56-00 +"} > {"count_col1":0,"Col3":"string6","WST":"2018-10-09 09-55-00 > +","WET":"2018-10-09 09-56-00 +"} > {"count_col1":0,"Col3":"string6","WST":"2018-10-09 09-55-00 > +","WET":"2018-10-09 09-56-00 +"} > {"count_col1":0,"Col3":"string6","WST":"2018-10-09 09-55-00 > +","WET":"2018-10-09 09-56-00 +"} > {"count_col1":0,"Col3":"string6","WST":"2018-10-09 09-55-00 > +","WET":"2018-10-09 09-56-00 +"} > {"count_col1":0,"Col3":"string6","WST":"2018-10-09 09-55-00 > +","WET":"2018-10-09 09-56-00 +"} > {"count_col1":0,"Col3":"string6","WST":"2018-10-09 09-55-00 > +","WET":"2018-10-09 09-56-00 0} > {code} > {quote} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8430) sdk_worker_parallelism default is inconsistent between Py and Java SDKs
[ https://issues.apache.org/jira/browse/BEAM-8430?focusedWorklogId=330589=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330589 ] ASF GitHub Bot logged work on BEAM-8430: Author: ASF GitHub Bot Created on: 18/Oct/19 15:55 Start Date: 18/Oct/19 15:55 Worklog Time Spent: 10m Work Description: mxm commented on issue #9829: [BEAM-8430] Change py default sdk_worker_parallelism to 1 URL: https://github.com/apache/beam/pull/9829#issuecomment-543808313 Run Python2_PVR_Flink PreCommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330589) Time Spent: 20m (was: 10m) > sdk_worker_parallelism default is inconsistent between Py and Java SDKs > --- > > Key: BEAM-8430 > URL: https://issues.apache.org/jira/browse/BEAM-8430 > Project: Beam > Issue Type: Improvement > Components: sdk-java-core, sdk-py-core >Reporter: Kyle Weaver >Assignee: Kyle Weaver >Priority: Major > Time Spent: 20m > Remaining Estimate: 0h > > Default is currently 1 in Java, 0 in Python. > https://github.com/apache/beam/blob/7b67a926b8939ede8f2e33c85579b540d18afccf/sdks/java/core/src/main/java/org/apache/beam/sdk/options/PortablePipelineOptions.java#L73 > https://github.com/apache/beam/blob/7b67a926b8939ede8f2e33c85579b540d18afccf/sdks/python/apache_beam/options/pipeline_options.py#L848 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8348) Portable Python job name hard-coded to "job"
[ https://issues.apache.org/jira/browse/BEAM-8348?focusedWorklogId=330581=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330581 ] ASF GitHub Bot logged work on BEAM-8348: Author: ASF GitHub Bot Created on: 18/Oct/19 15:47 Start Date: 18/Oct/19 15:47 Worklog Time Spent: 10m Work Description: ibzib commented on pull request #9789: [BEAM-8348] set job_name in portable_runner.py job request URL: https://github.com/apache/beam/pull/9789#discussion_r336557297 ## File path: sdks/python/apache_beam/runners/portability/portable_runner.py ## @@ -296,7 +297,8 @@
     prepare_response = job_service.Prepare(
         beam_job_api_pb2.PrepareJobRequest(
-            job_name='job', pipeline=proto_pipeline,
+            job_name=options.view_as(GoogleCloudOptions).job_name or 'job',

Review comment:
   It's a good idea. Unfortunately, it looks like the override of `__setattr__` would take precedence over the property's setter. I'm not too familiar with how all these Python internals work, though, so I could be missing something obvious? https://github.com/apache/beam/blob/16b4afa4469ee2d2fd5a5f4aafd77a156ee14e7b/sdks/python/apache_beam/options/pipeline_options.py#L336
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330581) Time Spent: 2h 20m (was: 2h 10m) > Portable Python job name hard-coded to "job" > > > Key: BEAM-8348 > URL: https://issues.apache.org/jira/browse/BEAM-8348 > Project: Beam > Issue Type: Improvement > Components: sdk-py-core >Reporter: Kyle Weaver >Assignee: Kyle Weaver >Priority: Minor > Time Spent: 2h 20m > Remaining Estimate: 0h > > See [1]. `job_name` is already taken by Google Cloud options [2], so I guess > we should create a new option (maybe `portable_job_name` to avoid disruption). 
> [[1] > https://github.com/apache/beam/blob/55588e91ed8e3e25bb661a6202c31e99297e0e79/sdks/python/apache_beam/runners/portability/portable_runner.py#L294|https://github.com/apache/beam/blob/55588e91ed8e3e25bb661a6202c31e99297e0e79/sdks/python/apache_beam/runners/portability/portable_runner.py#L294] > [2] > [https://github.com/apache/beam/blob/c5bbb51014f7506a2651d6070f27fb3c3dc0da8f/sdks/python/apache_beam/options/pipeline_options.py#L438] -- This message was sent by Atlassian Jira (v8.3.4#803005)
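The Python mechanics ibzib describes can be demonstrated in isolation: when a class overrides `__setattr__` (as `PipelineOptions` does at the linked line), the override intercepts every assignment before a property's setter runs. The `Options` class below is a minimal illustrative sketch, not Beam's actual implementation:

```python
class Options(object):
    def __init__(self):
        # Bypass our own __setattr__ to create the backing dict.
        object.__setattr__(self, '_store', {})

    @property
    def job_name(self):
        return self._store.get('job_name')

    @job_name.setter
    def job_name(self, value):
        # Never reached: the __setattr__ override below wins, because
        # object.__setattr__ (which honors data descriptors) is never
        # called for ordinary assignments on this class.
        self._store['job_name'] = 'via-property:' + value

    def __setattr__(self, name, value):
        # Mirrors the pattern of an options class that routes every
        # assignment straight into a backing dict.
        self._store[name] = value
```

Assigning `opts.job_name = 'wordcount'` stores the raw value, never the `'via-property:'`-prefixed one, confirming the property setter is bypassed.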
[jira] [Work logged] (BEAM-8382) Add polling interval to KinesisIO.Read
[ https://issues.apache.org/jira/browse/BEAM-8382?focusedWorklogId=330578=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330578 ] ASF GitHub Bot logged work on BEAM-8382: Author: ASF GitHub Bot Created on: 18/Oct/19 15:40 Start Date: 18/Oct/19 15:40 Worklog Time Spent: 10m Work Description: aromanenko-dev commented on issue #9765: [BEAM-8382] Add polling interval to KinesisIO.Read URL: https://github.com/apache/beam/pull/9765#issuecomment-543801925 @jfarr Regarding the backoff strategy - I think it's a good way in case your backend doesn't provide any statistics in advance about the rate at which you have to perform the next queries (we don't know about other consumers of the same shard). We know only the maximum limits and the result of the last `kinesis.getRecords()` execution. So, starting with the recommended default value and using a proper backoff strategy in case of failures could be an optimal solution when you need to limit the rate of requests. Otherwise, there is a big chance that backend resources won't be utilised efficiently. `FluentBackoff` is used in many Beam IOs for similar reasons. I'd like to ask you some questions about your use case. Let's suppose we have the ability to set a static polling interval with `withPollingInterval(Duration)` (as implemented in the current PR). What would the default value be in your case? 1 sec? What would you do in case it won't be enough for an already running pipeline? Would you need to change this value and restart the pipeline manually? Btw, a default delay of 1 sec looks quite pessimistic since according to the AWS doc `each shard can support up to five read transactions per second.` So, in case of one consumer per shard (I can guess this is quite a common case) it should be 200 ms as the default value. PS: The goal of minimizing the number of tuning knobs is to reduce the number of possible config combinations (which grows exponentially) when it's possible to configure this automatically [1]. 
If it's not possible, then the better solution will be to provide a flexible API, especially for unbounded sources (such as Kinesis, Kafka, etc.) that are used in long-running pipelines. [1] https://beam.apache.org/contribute/ptransform-style-guide/#configuration This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330578) Time Spent: 2h 10m (was: 2h) > Add polling interval to KinesisIO.Read > -- > > Key: BEAM-8382 > URL: https://issues.apache.org/jira/browse/BEAM-8382 > Project: Beam > Issue Type: Improvement > Components: io-java-kinesis >Affects Versions: 2.13.0, 2.14.0, 2.15.0 >Reporter: Jonothan Farr >Assignee: Jonothan Farr >Priority: Major > Time Spent: 2h 10m > Remaining Estimate: 0h > > With the current implementation we are observing Kinesis throttling due to > ReadProvisionedThroughputExceeded on the order of hundreds of times per > second, regardless of the actual Kinesis throughput. This is because the > ShardReadersPool readLoop() method is polling getRecords() as fast as > possible. > From the KDS documentation: > {quote}Each shard can support up to five read transactions per second. > {quote} > and > {quote}For best results, sleep for at least 1 second (1,000 milliseconds) > between calls to getRecords to avoid exceeding the limit on getRecords > frequency. > {quote} > [https://docs.aws.amazon.com/streams/latest/dev/service-sizes-and-limits.html] > [https://docs.aws.amazon.com/streams/latest/dev/developing-consumers-with-sdk.html] -- This message was sent by Atlassian Jira (v8.3.4#803005)
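The backoff strategy discussed above can be sketched generically. Below is an illustrative Python sketch of an exponential backoff schedule in the spirit of Beam's Java `FluentBackoff`; every parameter name and default here is an assumption for the example, not Beam's or KinesisIO's actual configuration:

```python
def backoff_delays(initial=0.2, multiplier=1.5, max_delay=10.0,
                   max_retries=8):
    """Yield successive sleep intervals in seconds, growing
    exponentially and capped at max_delay. A poll loop would sleep
    for the next delay after each throttling error (for example
    ReadProvisionedThroughputExceeded) and restart the schedule on
    a successful getRecords call."""
    delay = initial
    for _ in range(max_retries):
        yield min(delay, max_delay)
        delay *= multiplier
```

With the 0.2 s starting point matching the "five read transactions per second per shard" limit mentioned above, the schedule backs off quickly under contention while keeping the common single-consumer case near the shard's allowed rate.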
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=330565=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330565 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 18/Oct/19 15:20 Start Date: 18/Oct/19 15:20 Worklog Time Spent: 10m Work Description: zfraa commented on pull request #9778: [BEAM-7013] Update BigQueryHllSketchCompatibilityIT to cover empty sketch cases URL: https://github.com/apache/beam/pull/9778#discussion_r336525227 ## File path: sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/BigQueryHllSketchCompatibilityIT.java ## @@ -126,22 +145,49 @@ public static void deleteDataset() throws Exception { } /** - * Test that HLL++ sketch computed in BigQuery can be processed by Beam. Hll sketch is computed by - * {@code HLL_COUNT.INIT} in BigQuery and read into Beam; the test verifies that we can run {@link - * HllCount.MergePartial} and {@link HllCount.Extract} on the sketch in Beam to get the correct - * estimated count. + * Test that non-empty HLL++ sketch computed in BigQuery can be processed by Beam. + * + * Hll sketch is computed by {@code HLL_COUNT.INIT} in BigQuery and read into Beam; the test + * verifies that we can run {@link HllCount.MergePartial} and {@link HllCount.Extract} on the + * sketch in Beam to get the correct estimated count. + */ + @Test + public void testReadNonEmptySketchFromBigQuery() { +readSketchFromBigQuery(DATA_TABLE_ID_NON_EMPTY, EXPECTED_COUNT_NON_EMPTY); + } + + /** + * Test that empty HLL++ sketch computed in BigQuery can be processed by Beam. Review comment: (same here: "Test that an empty...", "The HLL sketch...") This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330565) Time Spent: 35h 20m (was: 35h 10m) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 35h 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=330563=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330563 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 18/Oct/19 15:20 Start Date: 18/Oct/19 15:20 Worklog Time Spent: 10m Work Description: zfraa commented on pull request #9778: [BEAM-7013] Update BigQueryHllSketchCompatibilityIT to cover empty sketch cases URL: https://github.com/apache/beam/pull/9778#discussion_r336535846 ## File path: sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/BigQueryHllSketchCompatibilityIT.java ## @@ -96,28 +106,37 @@ public static void prepareDatasetAndDataTable() throws Exception { BIGQUERY_CLIENT.createNewDataset(PROJECT_ID, DATASET_ID); -// Create Data Table TableSchema dataTableSchema = new TableSchema() .setFields( Collections.singletonList( new TableFieldSchema().setName(DATA_FIELD_NAME).setType(DATA_FIELD_TYPE))); -Table dataTable = + +Table dataTableNonEmpty = new Table() .setSchema(dataTableSchema) .setTableReference( new TableReference() .setProjectId(PROJECT_ID) .setDatasetId(DATASET_ID) -.setTableId(DATA_TABLE_ID)); -BIGQUERY_CLIENT.createNewTable(PROJECT_ID, DATASET_ID, dataTable); - +.setTableId(DATA_TABLE_ID_NON_EMPTY)); +BIGQUERY_CLIENT.createNewTable(PROJECT_ID, DATASET_ID, dataTableNonEmpty); // Prepopulate test data to Data Table Review comment: "Prepopulate data tables with test data" This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330563) Time Spent: 35h 10m (was: 35h) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 35h 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=330564=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330564 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 18/Oct/19 15:20 Start Date: 18/Oct/19 15:20 Worklog Time Spent: 10m Work Description: zfraa commented on pull request #9778: [BEAM-7013] Update BigQueryHllSketchCompatibilityIT to cover empty sketch cases URL: https://github.com/apache/beam/pull/9778#discussion_r336534833 ## File path: sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/BigQueryHllSketchCompatibilityIT.java ## @@ -96,28 +106,37 @@ public static void prepareDatasetAndDataTable() throws Exception { Review comment: Maybe rename to "...AndDataTables()"? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330564) Time Spent: 35h 20m (was: 35h 10m) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 35h 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=330566=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330566 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 18/Oct/19 15:20 Start Date: 18/Oct/19 15:20 Worklog Time Spent: 10m Work Description: zfraa commented on pull request #9778: [BEAM-7013] Update BigQueryHllSketchCompatibilityIT to cover empty sketch cases URL: https://github.com/apache/beam/pull/9778#discussion_r336524398 ## File path: sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/BigQueryHllSketchCompatibilityIT.java ## @@ -126,22 +145,49 @@ public static void deleteDataset() throws Exception { } /** - * Test that HLL++ sketch computed in BigQuery can be processed by Beam. Hll sketch is computed by - * {@code HLL_COUNT.INIT} in BigQuery and read into Beam; the test verifies that we can run {@link - * HllCount.MergePartial} and {@link HllCount.Extract} on the sketch in Beam to get the correct - * estimated count. + * Test that non-empty HLL++ sketch computed in BigQuery can be processed by Beam. + * + * Hll sketch is computed by {@code HLL_COUNT.INIT} in BigQuery and read into Beam; the test Review comment: Nit: "Test that a HLL++ sketch...", and "The HLL sketch is computed...". Otherwise, LGTM! Also, all Javadoc should be in third person ("Tests that..." instead of "Test that"; see https://www.oracle.com/technetwork/articles/java/index-137868.html, "Use 3rd person..."). Sorry that I missed this in the first version of this code! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330566) Time Spent: 35.5h (was: 35h 20m) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 35.5h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=330561&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330561 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 18/Oct/19 15:20 Start Date: 18/Oct/19 15:20 Worklog Time Spent: 10m Work Description: zfraa commented on pull request #9778: [BEAM-7013] Update BigQueryHllSketchCompatibilityIT to cover empty sketch cases URL: https://github.com/apache/beam/pull/9778#discussion_r336542489
## File path: sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/BigQueryHllSketchCompatibilityIT.java
##
@@ -65,23 +66,32 @@
   private static final List TEST_DATA =
       Arrays.asList("Apple", "Orange", "Banana", "Orange");

-  // Data Table: used by testReadSketchFromBigQuery())
+  // Data Table: used by tests reading sketches from BigQuery
   // Schema: only one STRING field named "data".
-  // Content: prepopulated with 4 rows: "Apple", "Orange", "Banana", "Orange"
-  private static final String DATA_TABLE_ID = "hll_data";
   private static final String DATA_FIELD_NAME = "data";
   private static final String DATA_FIELD_TYPE = "STRING";
   private static final String QUERY_RESULT_FIELD_NAME = "sketch";
-  private static final Long EXPECTED_COUNT = 3L;

-  // Sketch Table: used by testWriteSketchToBigQuery()
+  // Content: prepopulated with 4 rows: "Apple", "Orange", "Banana", "Orange"
+  private static final String DATA_TABLE_ID_NON_EMPTY = "hll_data_non_empty";
+  private static final Long EXPECTED_COUNT_NON_EMPTY = 3L;
+
+  // Content: empty
+  private static final String DATA_TABLE_ID_EMPTY = "hll_data_empty";
Review comment: Does the aggregation (HLL_COUNT.INIT) over an empty table return a NULL sketch, as expected? I.e., did the test fail before you modified 'parseQueryResultToByteArray' to deal with NULLs? I think it should (according to https://plx.corp.google.com/scripts2/script_5d._6fe78c__2144_8a38_883d24fc4a60, last SELECT), but double-checking. 
I'd wish we could add an assert, but that doesn't work with the utility method. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330561) Time Spent: 34h 50m (was: 34h 40m) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 34h 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
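The NULL-versus-empty question above is the crux of the empty-sketch support: BigQuery represents an empty HLL++ sketch as NULL, while Beam's HllCount represents it as an empty byte array, so the parse function must translate between the two. A self-contained sketch of that mapping (the class and method names here are illustrative, not Beam's API):

```java
import java.nio.ByteBuffer;

public class EmptySketchMapping {
    // Map BigQuery's representation of a sketch to Beam's:
    // NULL (null ByteBuffer) becomes an empty byte array.
    static byte[] toBeamSketch(ByteBuffer bigQueryBytes) {
        if (bigQueryBytes == null) {
            return new byte[0]; // NULL in BigQuery -> empty sketch in Beam
        }
        byte[] result = new byte[bigQueryBytes.remaining()];
        bigQueryBytes.get(result); // copy the readable bytes
        return result;
    }

    public static void main(String[] args) {
        System.out.println(toBeamSketch(null).length); // 0
        System.out.println(toBeamSketch(ByteBuffer.wrap(new byte[] {1, 2, 3})).length); // 3
    }
}
```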
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=330562&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330562 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 18/Oct/19 15:20 Start Date: 18/Oct/19 15:20 Worklog Time Spent: 10m Work Description: zfraa commented on pull request #9778: [BEAM-7013] Update BigQueryHllSketchCompatibilityIT to cover empty sketch cases URL: https://github.com/apache/beam/pull/9778#discussion_r336529565
## File path: sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/BigQueryHllSketchCompatibilityIT.java
##
@@ -126,22 +145,49 @@ public static void deleteDataset() throws Exception {
   }

   /**
-   * Test that HLL++ sketch computed in BigQuery can be processed by Beam. Hll sketch is computed by
-   * {@code HLL_COUNT.INIT} in BigQuery and read into Beam; the test verifies that we can run {@link
-   * HllCount.MergePartial} and {@link HllCount.Extract} on the sketch in Beam to get the correct
-   * estimated count.
+   * Test that non-empty HLL++ sketch computed in BigQuery can be processed by Beam.
+   *
+   * Hll sketch is computed by {@code HLL_COUNT.INIT} in BigQuery and read into Beam; the test
+   * verifies that we can run {@link HllCount.MergePartial} and {@link HllCount.Extract} on the
+   * sketch in Beam to get the correct estimated count.
+   */
+  @Test
+  public void testReadNonEmptySketchFromBigQuery() {
+    readSketchFromBigQuery(DATA_TABLE_ID_NON_EMPTY, EXPECTED_COUNT_NON_EMPTY);
+  }
+
+  /**
+   * Test that empty HLL++ sketch computed in BigQuery can be processed by Beam.
+   *
+   * Hll sketch is computed by {@code HLL_COUNT.INIT} in BigQuery and read into Beam; the test
+   * verifies that we can run {@link HllCount.MergePartial} and {@link HllCount.Extract} on the
+   * sketch in Beam to get the correct estimated count.
   */
   @Test
-  public void testReadSketchFromBigQuery() {
-    String tableSpec = String.format("%s.%s", DATASET_ID, DATA_TABLE_ID);
+  public void testReadEmptySketchFromBigQuery() {
+    readSketchFromBigQuery(DATA_TABLE_ID_EMPTY, EXPECTED_COUNT_EMPTY);
+  }
+
+  private void readSketchFromBigQuery(String tableId, Long expectedCount) {
+    String tableSpec = String.format("%s.%s", DATASET_ID, tableId);
     String query =
         String.format(
             "SELECT HLL_COUNT.INIT(%s) AS %s FROM %s",
             DATA_FIELD_NAME, QUERY_RESULT_FIELD_NAME, tableSpec);
+
     SerializableFunction parseQueryResultToByteArray =
-        (SchemaAndRecord schemaAndRecord) ->
-            // BigQuery BYTES type corresponds to Java java.nio.ByteBuffer type
-            ((ByteBuffer) schemaAndRecord.getRecord().get(QUERY_RESULT_FIELD_NAME)).array();
+        input -> {
+          // BigQuery BYTES type corresponds to Java java.nio.ByteBuffer type
+          ByteBuffer sketch = (ByteBuffer) input.getRecord().get(QUERY_RESULT_FIELD_NAME);
+          if (sketch == null) {
+            // Empty sketch is represented by null in BigQuery and by empty byte array in Beam
+            return new byte[0];
+          } else {
+            byte[] result = new byte[sketch.remaining()];
Review comment: why not `return sketch.array()` as previously? Since we can't be 100% sure that the ByteBuffer is backed by an accessible array? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330562) Time Spent: 35h (was: 34h 50m) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 35h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
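The `sketch.array()` concern in the review above rests on a documented `ByteBuffer` property: `array()` only works when the buffer is backed by an accessible array, and read-only (or direct) buffers throw instead, while copying via `get()` always succeeds. A minimal standalone illustration (the class and helper names are hypothetical):

```java
import java.nio.ByteBuffer;
import java.nio.ReadOnlyBufferException;

public class ByteBufferCopy {
    // Copy the readable bytes without assuming a backing array.
    static byte[] copyRemaining(ByteBuffer buf) {
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    public static void main(String[] args) {
        ByteBuffer readOnly = ByteBuffer.wrap(new byte[] {1, 2, 3}).asReadOnlyBuffer();
        System.out.println(readOnly.hasArray()); // false: no accessible backing array
        System.out.println(copyRemaining(readOnly).length); // 3
        try {
            readOnly.array(); // throws on read-only buffers
        } catch (ReadOnlyBufferException e) {
            System.out.println("array() not available");
        }
    }
}
```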
[jira] [Updated] (BEAM-8431) Option job_name should be in StandardOptions
[ https://issues.apache.org/jira/browse/BEAM-8431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kyle Weaver updated BEAM-8431: -- Status: Open (was: Triage Needed) > Option job_name should be in StandardOptions > > > Key: BEAM-8431 > URL: https://issues.apache.org/jira/browse/BEAM-8431 > Project: Beam > Issue Type: Improvement > Components: sdk-py-core >Reporter: Kyle Weaver >Assignee: Kyle Weaver >Priority: Major > > jobName is used by multiple runners on the Java side of things, so it should > not be limited to GoogleCloudOptions in Python. > https://github.com/apache/beam/blob/16b4afa4469ee2d2fd5a5f4aafd77a156ee14e7b/sdks/python/apache_beam/options/pipeline_options.py#L438 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-8430) sdk_worker_parallelism default is inconsistent between Py and Java SDKs
[ https://issues.apache.org/jira/browse/BEAM-8430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kyle Weaver updated BEAM-8430: -- Status: Open (was: Triage Needed) > sdk_worker_parallelism default is inconsistent between Py and Java SDKs > --- > > Key: BEAM-8430 > URL: https://issues.apache.org/jira/browse/BEAM-8430 > Project: Beam > Issue Type: Improvement > Components: sdk-java-core, sdk-py-core >Reporter: Kyle Weaver >Assignee: Kyle Weaver >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > Default is currently 1 in Java, 0 in Python. > https://github.com/apache/beam/blob/7b67a926b8939ede8f2e33c85579b540d18afccf/sdks/java/core/src/main/java/org/apache/beam/sdk/options/PortablePipelineOptions.java#L73 > https://github.com/apache/beam/blob/7b67a926b8939ede8f2e33c85579b540d18afccf/sdks/python/apache_beam/options/pipeline_options.py#L848 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-8431) Option job_name should be in StandardOptions
Kyle Weaver created BEAM-8431: - Summary: Option job_name should be in StandardOptions Key: BEAM-8431 URL: https://issues.apache.org/jira/browse/BEAM-8431 Project: Beam Issue Type: Improvement Components: sdk-py-core Reporter: Kyle Weaver Assignee: Kyle Weaver jobName is used by multiple runners on the Java side of things, so it should not be limited to GoogleCloudOptions in Python. https://github.com/apache/beam/blob/16b4afa4469ee2d2fd5a5f4aafd77a156ee14e7b/sdks/python/apache_beam/options/pipeline_options.py#L438 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8410) JdbcIO should support setConnectionInitSqls in its DataSource
[ https://issues.apache.org/jira/browse/BEAM-8410?focusedWorklogId=330539=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330539 ] ASF GitHub Bot logged work on BEAM-8410: Author: ASF GitHub Bot Created on: 18/Oct/19 14:36 Start Date: 18/Oct/19 14:36 Worklog Time Spent: 10m Work Description: aromanenko-dev commented on issue #9808: [BEAM-8410] JdbcIO should support setConnectionInitSqls in its DataSource URL: https://github.com/apache/beam/pull/9808#issuecomment-543609627 I think having a test for failing case should be enough. Please, do just two things before it will be ready for merging: - Jenkins jobs fails because of `spotless` - run `gradlew spotlessApply` before commit to fix it. - Squash commits into one with appropriate commit message. Hint: run `./gradlew -p path/to/changed/package check` every time before pushing to verify that build is green. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330539) Time Spent: 1h 20m (was: 1h 10m) > JdbcIO should support setConnectionInitSqls in its DataSource > - > > Key: BEAM-8410 > URL: https://issues.apache.org/jira/browse/BEAM-8410 > Project: Beam > Issue Type: Improvement > Components: io-java-jdbc >Reporter: Cam Mach >Assignee: Cam Mach >Priority: Minor > Time Spent: 1h 20m > Remaining Estimate: 0h > > This property, connectionInitSqls, is very handy for anyone who use MySql and > Mariadb, to set any init sql statements to be executed at connection time. > Note: but it's not applicable across databases -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-8430) sdk_worker_parallelism default is inconsistent between Py and Java SDKs
Kyle Weaver created BEAM-8430: - Summary: sdk_worker_parallelism default is inconsistent between Py and Java SDKs Key: BEAM-8430 URL: https://issues.apache.org/jira/browse/BEAM-8430 Project: Beam Issue Type: Improvement Components: sdk-java-core, sdk-py-core Reporter: Kyle Weaver Assignee: Kyle Weaver Default is currently 1 in Java, 0 in Python. https://github.com/apache/beam/blob/7b67a926b8939ede8f2e33c85579b540d18afccf/sdks/java/core/src/main/java/org/apache/beam/sdk/options/PortablePipelineOptions.java#L73 https://github.com/apache/beam/blob/7b67a926b8939ede8f2e33c85579b540d18afccf/sdks/python/apache_beam/options/pipeline_options.py#L848 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-5690) Issue with GroupByKey in BeamSql using SparkRunner
[ https://issues.apache.org/jira/browse/BEAM-5690?focusedWorklogId=330497=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330497 ] ASF GitHub Bot logged work on BEAM-5690: Author: ASF GitHub Bot Created on: 18/Oct/19 13:31 Start Date: 18/Oct/19 13:31 Worklog Time Spent: 10m Work Description: bmv126 commented on issue #9567: [BEAM-5690] Fix Zero value issue with GroupByKey/CountByKey in SparkRunner URL: https://github.com/apache/beam/pull/9567#issuecomment-543747540 @echauchot I have addressed the review comment, can you have a look. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330497) Time Spent: 2h 50m (was: 2h 40m) > Issue with GroupByKey in BeamSql using SparkRunner > -- > > Key: BEAM-5690 > URL: https://issues.apache.org/jira/browse/BEAM-5690 > Project: Beam > Issue Type: Task > Components: runner-spark >Reporter: Kenneth Knowles >Priority: Major > Time Spent: 2h 50m > Remaining Estimate: 0h > > Reported on user@ > {quote}We are trying to setup a pipeline with using BeamSql and the trigger > used is default (AfterWatermark crosses the window). > Below is the pipeline: > >KafkaSource (KafkaIO) >---> Windowing (FixedWindow 1min) >---> BeamSql >---> KafkaSink (KafkaIO) > > We are using Spark Runner for this. > The BeamSql query is: > {code}select Col3, count(*) as count_col1 from PCOLLECTION GROUP BY Col3{code} > We are grouping by Col3 which is a string. It can hold values string[0-9]. > > The records are getting emitted out at 1 min to kafka sink, but the output > record in kafka is not as expected. 
> Below is the output observed: (WST and WET are indicators for window start > time and window end time) > {code} > {"count_col1":1,"Col3":"string5","WST":"2018-10-09 09-55-00 > +","WET":"2018-10-09 09-56-00 +"} > {"count_col1":3,"Col3":"string7","WST":"2018-10-09 09-55-00 > +","WET":"2018-10-09 09-56-00 +"} > {"count_col1":2,"Col3":"string8","WST":"2018-10-09 09-55-00 > +","WET":"2018-10-09 09-56-00 +"} > {"count_col1":1,"Col3":"string2","WST":"2018-10-09 09-55-00 > +","WET":"2018-10-09 09-56-00 +"} > {"count_col1":1,"Col3":"string6","WST":"2018-10-09 09-55-00 > +","WET":"2018-10-09 09-56-00 +"} > {"count_col1":0,"Col3":"string6","WST":"2018-10-09 09-55-00 > +","WET":"2018-10-09 09-56-00 +"} > {"count_col1":0,"Col3":"string6","WST":"2018-10-09 09-55-00 > +","WET":"2018-10-09 09-56-00 +"} > {"count_col1":0,"Col3":"string6","WST":"2018-10-09 09-55-00 > +","WET":"2018-10-09 09-56-00 +"} > {"count_col1":0,"Col3":"string6","WST":"2018-10-09 09-55-00 > +","WET":"2018-10-09 09-56-00 +"} > {"count_col1":0,"Col3":"string6","WST":"2018-10-09 09-55-00 > +","WET":"2018-10-09 09-56-00 +"} > {"count_col1":0,"Col3":"string6","WST":"2018-10-09 09-55-00 > +","WET":"2018-10-09 09-56-00 0} > {code} > {quote} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-5690) Issue with GroupByKey in BeamSql using SparkRunner
[ https://issues.apache.org/jira/browse/BEAM-5690?focusedWorklogId=330496=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330496 ] ASF GitHub Bot logged work on BEAM-5690: Author: ASF GitHub Bot Created on: 18/Oct/19 13:29 Start Date: 18/Oct/19 13:29 Worklog Time Spent: 10m Work Description: bmv126 commented on pull request #9567: [BEAM-5690] Fix Zero value issue with GroupByKey/CountByKey in SparkRunner URL: https://github.com/apache/beam/pull/9567#discussion_r336490749 ## File path: runners/spark/src/main/java/org/apache/beam/runners/spark/stateful/SparkGroupAlsoByWindowViaWindowSet.java ## @@ -338,6 +339,18 @@ public void outputWindowedValue( outputHolder.getWindowedValues(); if (!outputs.isEmpty() || !stateInternals.getState().isEmpty()) { +Collection filteredTimers = +timerInternals.getTimers().stream() +.filter( +timer -> +timer +.getTimestamp() +.plus(windowingStrategy.getAllowedLateness()) + .isBefore(timerInternals.currentInputWatermarkTime())) +.collect(Collectors.toList()); + +filteredTimers.forEach(timerInternals::deleteTimer); + Review comment: I have moved the code to a method in LateDataUtils. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330496) Time Spent: 2h 40m (was: 2.5h) > Issue with GroupByKey in BeamSql using SparkRunner > -- > > Key: BEAM-5690 > URL: https://issues.apache.org/jira/browse/BEAM-5690 > Project: Beam > Issue Type: Task > Components: runner-spark >Reporter: Kenneth Knowles >Priority: Major > Time Spent: 2h 40m > Remaining Estimate: 0h > > Reported on user@ > {quote}We are trying to setup a pipeline with using BeamSql and the trigger > used is default (AfterWatermark crosses the window). 
> Below is the pipeline: > >KafkaSource (KafkaIO) >---> Windowing (FixedWindow 1min) >---> BeamSql >---> KafkaSink (KafkaIO) > > We are using Spark Runner for this. > The BeamSql query is: > {code}select Col3, count(*) as count_col1 from PCOLLECTION GROUP BY Col3{code} > We are grouping by Col3 which is a string. It can hold values string[0-9]. > > The records are getting emitted out at 1 min to kafka sink, but the output > record in kafka is not as expected. > Below is the output observed: (WST and WET are indicators for window start > time and window end time) > {code} > {"count_col1":1,"Col3":"string5","WST":"2018-10-09 09-55-00 > +","WET":"2018-10-09 09-56-00 +"} > {"count_col1":3,"Col3":"string7","WST":"2018-10-09 09-55-00 > +","WET":"2018-10-09 09-56-00 +"} > {"count_col1":2,"Col3":"string8","WST":"2018-10-09 09-55-00 > +","WET":"2018-10-09 09-56-00 +"} > {"count_col1":1,"Col3":"string2","WST":"2018-10-09 09-55-00 > +","WET":"2018-10-09 09-56-00 +"} > {"count_col1":1,"Col3":"string6","WST":"2018-10-09 09-55-00 > +","WET":"2018-10-09 09-56-00 +"} > {"count_col1":0,"Col3":"string6","WST":"2018-10-09 09-55-00 > +","WET":"2018-10-09 09-56-00 +"} > {"count_col1":0,"Col3":"string6","WST":"2018-10-09 09-55-00 > +","WET":"2018-10-09 09-56-00 +"} > {"count_col1":0,"Col3":"string6","WST":"2018-10-09 09-55-00 > +","WET":"2018-10-09 09-56-00 +"} > {"count_col1":0,"Col3":"string6","WST":"2018-10-09 09-55-00 > +","WET":"2018-10-09 09-56-00 +"} > {"count_col1":0,"Col3":"string6","WST":"2018-10-09 09-55-00 > +","WET":"2018-10-09 09-56-00 +"} > {"count_col1":0,"Col3":"string6","WST":"2018-10-09 09-55-00 > +","WET":"2018-10-09 09-56-00 0} > {code} > {quote} -- This message was sent by Atlassian Jira (v8.3.4#803005)
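The stream filter in the diff above deletes timers whose timestamp plus the allowed lateness already lies before the input watermark. A toy model of that predicate, with plain longs standing in for Beam's Instant and Duration (illustrative only, not the runner code):

```java
import java.util.List;
import java.util.stream.Collectors;

public class ExpiredTimerFilter {
    // Keep only timers that are still live: a timer is expired when
    // timestamp + allowedLateness falls before the input watermark.
    static List<Long> dropExpired(List<Long> timers, long allowedLateness, long watermark) {
        return timers.stream()
                .filter(t -> t + allowedLateness >= watermark)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // 10 and 50 are expired relative to watermark 60 with lateness 5; 90 survives.
        System.out.println(dropExpired(List.of(10L, 50L, 90L), 5L, 60L)); // [90]
    }
}
```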
[jira] [Commented] (BEAM-3718) ClassNotFoundException on CloudResourceManager$Builder
[ https://issues.apache.org/jira/browse/BEAM-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16954587#comment-16954587 ] Luis commented on BEAM-3718: It looks like this issue still happens on Beam SDK 2.16 with JVM 11. {code:java} java.lang.RuntimeException: Failed to construct instance from factory method DataflowRunner#fromOptions(interface org.apache.beam.sdk.options.PipelineOptions) at org.apache.beam.sdk.util.InstanceBuilder.buildFromMethod(InstanceBuilder.java:224) at org.apache.beam.sdk.util.InstanceBuilder.build(InstanceBuilder.java:155) at org.apache.beam.sdk.PipelineRunner.fromOptions(PipelineRunner.java:55) at org.apache.beam.sdk.Pipeline.create(Pipeline.java:147) at com.spotify.dbeam.jobs.JdbcAvroJob.create(JdbcAvroJob.java:70) at com.spotify.dbeam.jobs.JdbcAvroJob.create(JdbcAvroJob.java:77) at com.spotify.dbeam.jobs.JdbcAvroJob.create(JdbcAvroJob.java:83) at com.spotify.dbeam.jobs.JdbcAvroJob.main(JdbcAvroJob.java:153) Caused by: java.lang.reflect.InvocationTargetException at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:566) at org.apache.beam.sdk.util.InstanceBuilder.buildFromMethod(InstanceBuilder.java:214) ... 7 more Caused by: java.lang.IllegalArgumentException: Unable to use ClassLoader to detect classpath elements. Current ClassLoader is jdk.internal.loader.ClassLoaders$AppClassLoader@799f7e29, only URLClassLoaders are supported. at org.apache.beam.runners.core.construction.PipelineResources.detectClassPathResourcesToStage(PipelineResources.java:58) at org.apache.beam.runners.dataflow.DataflowRunner.fromOptions(DataflowRunner.java:285) ... 
12 more {code} > ClassNotFoundException on CloudResourceManager$Builder > -- > > Key: BEAM-3718 > URL: https://issues.apache.org/jira/browse/BEAM-3718 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Reporter: Yunis Arif Said >Priority: Trivial > > In a spring boot application running google cloud dataflow code. The dataflow > takes data from google PubSub, transform incoming data and output result to > bigquery for storage. The code does not have any syntax errors. The problem > is when the application is run, the following exception is thrown. > > {code:java} > Exception in thread "main" java.lang.RuntimeException: Failed to construct > instance from factory method DataflowRunner#fromOptions(interface > org.apache.beam.sdk.options.PipelineOptions) > at > org.apache.beam.sdk.util.InstanceBuilder.buildFromMethod(InstanceBuilder.java:233) > at org.apache.beam.sdk.util.InstanceBuilder.build(InstanceBuilder.java:162) > at org.apache.beam.sdk.PipelineRunner.fromOptions(PipelineRunner.java:52) > at org.apache.beam.sdk.Pipeline.create(Pipeline.java:142) > at com.trackers.exlon.ExlonApplication.main(ExlonApplication.java:69) > > Caused by: java.lang.reflect.InvocationTargetException > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.beam.sdk.util.InstanceBuilder.buildFromMethod(InstanceBuilder.java:222) > ... 
4 more > Caused by: java.lang.NoClassDefFoundError: > com/google/api/services/cloudresourcemanager/CloudResourceManager$Builder > at > org.apache.beam.sdk.extensions.gcp.options.GcpOptions$GcpTempLocationFactory.newCloudResourceManagerClient(GcpOptions.java:369) > at > org.apache.beam.sdk.extensions.gcp.options.GcpOptions$GcpTempLocationFactory.create(GcpOptions.java:240) > at > org.apache.beam.sdk.extensions.gcp.options.GcpOptions$GcpTempLocationFactory.create(GcpOptions.java:228) > at > org.apache.beam.sdk.options.ProxyInvocationHandler.returnDefaultHelper(ProxyInvocationHandler.java:592) > at > org.apache.beam.sdk.options.ProxyInvocationHandler.getDefault(ProxyInvocationHandler.java:533) > at > org.apache.beam.sdk.options.ProxyInvocationHandler.invoke(ProxyInvocationHandler.java:156) > at com.sun.proxy.$Proxy85.getGcpTempLocation(Unknown Source) > at > org.apache.beam.runners.dataflow.DataflowRunner.fromOptions(DataflowRunner.java:223) > ... 9 more > Caused by: java.lang.ClassNotFoundException: > com.google.api.services.cloudresourcemanager.CloudResourceManager$Builder > at
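The root cause in the stack trace above is that on Java 9+ the application class loader is no longer a `URLClassLoader`, so `PipelineResources.detectClassPathResourcesToStage` cannot cast it. A common workaround pattern in that situation is to fall back to the `java.class.path` system property; the helper below is a hypothetical sketch of that pattern, not Beam's implementation:

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ClasspathFallback {
    // Read classpath entries from the JVM-defined system property instead of
    // casting the context class loader to URLClassLoader (which fails on Java 9+,
    // where the loader is jdk.internal.loader.ClassLoaders$AppClassLoader).
    static List<String> classpathEntries() {
        String cp = System.getProperty("java.class.path", "");
        return Arrays.stream(cp.split(File.pathSeparator))
                .filter(entry -> !entry.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // java.class.path is always defined by the JVM, so this never throws.
        System.out.println(classpathEntries().size() + " classpath entries");
    }
}
```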
[jira] [Work logged] (BEAM-7520) DirectRunner timers are not strictly time ordered
[ https://issues.apache.org/jira/browse/BEAM-7520?focusedWorklogId=330482=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330482 ] ASF GitHub Bot logged work on BEAM-7520: Author: ASF GitHub Bot Created on: 18/Oct/19 12:52 Start Date: 18/Oct/19 12:52 Worklog Time Spent: 10m Work Description: je-ik commented on issue #9190: [BEAM-7520] Fix timer firing order in DirectRunner URL: https://github.com/apache/beam/pull/9190#issuecomment-543727429 Run Java_Examples_Dataflow PreCommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330482) Time Spent: 15h 40m (was: 15.5h) > DirectRunner timers are not strictly time ordered > - > > Key: BEAM-7520 > URL: https://issues.apache.org/jira/browse/BEAM-7520 > Project: Beam > Issue Type: Bug > Components: runner-direct >Affects Versions: 2.13.0 >Reporter: Jan Lukavský >Assignee: Jan Lukavský >Priority: Major > Time Spent: 15h 40m > Remaining Estimate: 0h > > Let's suppose we have the following situation: > - statful ParDo with two timers - timerA and timerB > - timerA is set for window.maxTimestamp() + 1 > - timerB is set anywhere between timerB.timestamp > - input watermark moves to BoundedWindow.TIMESTAMP_MAX_VALUE > Then the order of timers is as follows (correct): > - timerB > - timerA > But, if timerB sets another timer (say for timerB.timestamp + 1), then the > order of timers will be: > - timerB (timerB.timestamp) > - timerA (BoundedWindow.TIMESTAMP_MAX_VALUE) > - timerB (timerB.timestamp + 1) > Which is not ordered by timestamp. The reason for this is that when the input > watermark update is evaluated, the WatermarkManager,extractFiredTimers() will > produce both timerA and timerB. 
That would be correct, but when timerB sets > another timer, that breaks this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
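The scenario described above can be modelled with a small standalone example: extracting all fired timers in one batch lets a timer set from inside a firing callback come out after later-timestamped timers, while polling the queue one timer at a time keeps delivery time-ordered. (A `PriorityQueue` of longs stands in for Beam's timers; this illustrates the bug, not the actual DirectRunner fix.)

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class TimerOrderToy {
    // Firing callback: the timer at t=5 sets a follow-up timer at t=6.
    static void onFire(long t, PriorityQueue<Long> timers) {
        if (t == 5L) {
            timers.add(6L);
        }
    }

    // Batch extraction: snapshot all fired timers first, then run callbacks.
    // Timers set during the batch only fire afterwards, out of order.
    static List<Long> fireAsBatch(PriorityQueue<Long> timers) {
        List<Long> fired = new ArrayList<>();
        List<Long> batch = new ArrayList<>(timers);
        batch.sort(null);
        timers.clear();
        for (long t : batch) {
            fired.add(t);
            onFire(t, timers);
        }
        while (!timers.isEmpty()) {
            fired.add(timers.poll());
        }
        return fired;
    }

    // One-at-a-time extraction: re-check the queue after every firing,
    // so callback-set timers slot in at their proper timestamp.
    static List<Long> fireOneAtATime(PriorityQueue<Long> timers) {
        List<Long> fired = new ArrayList<>();
        while (!timers.isEmpty()) {
            long t = timers.poll();
            fired.add(t);
            onFire(t, timers);
        }
        return fired;
    }

    public static void main(String[] args) {
        System.out.println(fireAsBatch(new PriorityQueue<>(List.of(5L, 100L))));    // [5, 100, 6]
        System.out.println(fireOneAtATime(new PriorityQueue<>(List.of(5L, 100L)))); // [5, 6, 100]
    }
}
```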
[jira] [Work logged] (BEAM-7520) DirectRunner timers are not strictly time ordered
[ https://issues.apache.org/jira/browse/BEAM-7520?focusedWorklogId=330483=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330483 ] ASF GitHub Bot logged work on BEAM-7520: Author: ASF GitHub Bot Created on: 18/Oct/19 12:52 Start Date: 18/Oct/19 12:52 Worklog Time Spent: 10m Work Description: je-ik commented on issue #9190: [BEAM-7520] Fix timer firing order in DirectRunner URL: https://github.com/apache/beam/pull/9190#issuecomment-543727505 Run Flink ValidatesRunner This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330483) Time Spent: 15h 50m (was: 15h 40m) > DirectRunner timers are not strictly time ordered > - > > Key: BEAM-7520 > URL: https://issues.apache.org/jira/browse/BEAM-7520 > Project: Beam > Issue Type: Bug > Components: runner-direct >Affects Versions: 2.13.0 >Reporter: Jan Lukavský >Assignee: Jan Lukavský >Priority: Major > Time Spent: 15h 50m > Remaining Estimate: 0h > > Let's suppose we have the following situation: > - statful ParDo with two timers - timerA and timerB > - timerA is set for window.maxTimestamp() + 1 > - timerB is set anywhere between timerB.timestamp > - input watermark moves to BoundedWindow.TIMESTAMP_MAX_VALUE > Then the order of timers is as follows (correct): > - timerB > - timerA > But, if timerB sets another timer (say for timerB.timestamp + 1), then the > order of timers will be: > - timerB (timerB.timestamp) > - timerA (BoundedWindow.TIMESTAMP_MAX_VALUE) > - timerB (timerB.timestamp + 1) > Which is not ordered by timestamp. The reason for this is that when the input > watermark update is evaluated, the WatermarkManager,extractFiredTimers() will > produce both timerA and timerB. That would be correct, but when timerB sets > another timer, that breaks this. 
-- This message was sent by Atlassian Jira (v8.3.4#803005)
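The reordering described in BEAM-7520 can be modelled in a few lines of plain Python. This is an illustrative sketch of an extract-then-fire scheme under the assumptions above, not DirectRunner's actual code: when every eligible timer is extracted in one batch, a timer set while the batch is firing can only fire afterwards, even if its timestamp is earlier than a timer already in the batch.

```python
import heapq

def fire_eligible_timers(timers, watermark, on_fire):
    """Pop every timer at or below `watermark` up front, then fire the batch
    in order. Timers that `on_fire` sets while the batch is firing are only
    seen on the next extraction pass, so they can fire after timers with
    later timestamps -- the bug described in the ticket."""
    fired = []
    batch = []
    while timers and timers[0][0] <= watermark:
        batch.append(heapq.heappop(timers))
    for ts, name in batch:
        fired.append((ts, name))
        for new_timer in on_fire(ts, name):
            heapq.heappush(timers, new_timer)
    # Second pass picks up timers set during the first batch.
    while timers and timers[0][0] <= watermark:
        fired.append(heapq.heappop(timers))
    return fired

# timerA at window.maxTimestamp() + 1 (modelled as 100), timerB earlier at 10.
timers = [(10, "timerB"), (100, "timerA")]
heapq.heapify(timers)

def on_fire(ts, name):
    # timerB re-sets itself for ts + 1 when it fires.
    return [(ts + 1, "timerB")] if name == "timerB" else []

print(fire_eligible_timers(timers, watermark=200, on_fire=on_fire))
# -> [(10, 'timerB'), (100, 'timerA'), (11, 'timerB')]  (not timestamp-ordered)
```

A strictly time-ordered variant would re-check the head of the queue after each firing instead of extracting the whole batch up front.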
[jira] [Work logged] (BEAM-8410) JdbcIO should support setConnectionInitSqls in its DataSource
[ https://issues.apache.org/jira/browse/BEAM-8410?focusedWorklogId=330470=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330470 ] ASF GitHub Bot logged work on BEAM-8410: Author: ASF GitHub Bot Created on: 18/Oct/19 12:28 Start Date: 18/Oct/19 12:28 Worklog Time Spent: 10m Work Description: aromanenko-dev commented on issue #9808: [BEAM-8410] JdbcIO should support setConnectionInitSqls in its DataSource URL: https://github.com/apache/beam/pull/9808#issuecomment-543609627 I think having a test for the failing case should be enough. Please do just two things before it is ready for merging: - The Jenkins job fails because of `spotless` - run `gradlew spotlessApply` before committing to fix it. - Squash the commits into one with an appropriate commit message. Hint: run `./gradlew -p path/to/changed/package check` every time before pushing to verify that the build is green. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330470) Time Spent: 1h 10m (was: 1h) > JdbcIO should support setConnectionInitSqls in its DataSource > - > > Key: BEAM-8410 > URL: https://issues.apache.org/jira/browse/BEAM-8410 > Project: Beam > Issue Type: Improvement > Components: io-java-jdbc >Reporter: Cam Mach >Assignee: Cam Mach >Priority: Minor > Time Spent: 1h 10m > Remaining Estimate: 0h > > This property, connectionInitSqls, is very handy for anyone who uses MySQL and > MariaDB to set init SQL statements to be executed at connection time. > Note: it is not applicable across all databases -- This message was sent by Atlassian Jira (v8.3.4#803005)
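For readers unfamiliar with the property: connectionInitSqls comes from Apache Commons DBCP's BasicDataSource and holds SQL statements run once per physical connection before it is handed out. A minimal Python sketch of the same idea using sqlite3 follows; the helper name is illustrative, not Beam or DBCP API.

```python
import sqlite3

def connect_with_init_sqls(database, init_sqls):
    """Open a connection and run each init statement before returning it to
    the caller -- the behaviour connectionInitSqls provides in DBCP."""
    conn = sqlite3.connect(database)
    for stmt in init_sqls:
        conn.execute(stmt)
    return conn

# e.g. a session-level setting applied at connection time
conn = connect_with_init_sqls(":memory:", ["PRAGMA foreign_keys = ON"])
assert conn.execute("PRAGMA foreign_keys").fetchone() == (1,)
```

In JdbcIO the analogous statements would be things like MySQL/MariaDB session variables, which is why the ticket notes the property is not applicable across all databases.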
[jira] [Work logged] (BEAM-8383) Add metrics to Python state cache
[ https://issues.apache.org/jira/browse/BEAM-8383?focusedWorklogId=330417=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330417 ] ASF GitHub Bot logged work on BEAM-8383: Author: ASF GitHub Bot Created on: 18/Oct/19 10:38 Start Date: 18/Oct/19 10:38 Worklog Time Spent: 10m Work Description: mxm commented on issue #9769: [BEAM-8383] Add metrics to the Python state cache URL: https://github.com/apache/beam/pull/9769#issuecomment-543664340 Run Java PreCommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330417) Time Spent: 40m (was: 0.5h) > Add metrics to Python state cache > - > > Key: BEAM-8383 > URL: https://issues.apache.org/jira/browse/BEAM-8383 > Project: Beam > Issue Type: New Feature > Components: runner-flink, sdk-py-core >Reporter: Maximilian Michels >Assignee: Maximilian Michels >Priority: Major > Time Spent: 40m > Remaining Estimate: 0h > > For more insight into how effectively the cache works, metrics should be > added to the Python SDK. All the state operations should be counted, as well > as metrics like the current size of the cache, cache hits/misses, and the > capacity of the cache. -- This message was sent by Atlassian Jira (v8.3.4#803005)
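The metrics listed in the ticket (state-operation counts aside: current size, hits/misses, capacity) can be sketched with a small LRU cache in plain Python. This is an illustration of the requested counters, not Beam's actual statecache module; all names here are made up for the example.

```python
from collections import OrderedDict

class MeteredLruCache:
    """LRU cache that tracks the metrics the ticket asks for: hits, misses,
    current size, and capacity."""

    def __init__(self, capacity):
        self._capacity = capacity
        self._data = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._data:
            self.hits += 1
            self._data.move_to_end(key)  # mark as most recently used
            return self._data[key]
        self.misses += 1
        return None

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self._capacity:
            self._data.popitem(last=False)  # evict least recently used

    def metrics(self):
        return {
            "capacity": self._capacity,
            "size": len(self._data),
            "hits": self.hits,
            "misses": self.misses,
        }

cache = MeteredLruCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")          # hit
cache.put("c", 3)       # evicts "b"
cache.get("b")          # miss
print(cache.metrics())  # {'capacity': 2, 'size': 2, 'hits': 1, 'misses': 1}
```

In a runner, such counters would typically be reported through the SDK's metrics API rather than a plain dict.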
[jira] [Work logged] (BEAM-8383) Add metrics to Python state cache
[ https://issues.apache.org/jira/browse/BEAM-8383?focusedWorklogId=330416=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330416 ] ASF GitHub Bot logged work on BEAM-8383: Author: ASF GitHub Bot Created on: 18/Oct/19 10:38 Start Date: 18/Oct/19 10:38 Worklog Time Spent: 10m Work Description: mxm commented on issue #9769: [BEAM-8383] Add metrics to the Python state cache URL: https://github.com/apache/beam/pull/9769#issuecomment-543664281 Run Python2_PVR_Flink PreCommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330416) Time Spent: 0.5h (was: 20m) > Add metrics to Python state cache > Key: BEAM-8383 > URL: https://issues.apache.org/jira/browse/BEAM-8383 > Project: Beam > Issue Type: New Feature > Components: runner-flink, sdk-py-core > Reporter: Maximilian Michels > Assignee: Maximilian Michels > Priority: Major > Time Spent: 0.5h > Remaining Estimate: 0h -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8412) Py2 post-commit test crossLanguagePortableWordCount failing
[ https://issues.apache.org/jira/browse/BEAM-8412?focusedWorklogId=330410=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330410 ] ASF GitHub Bot logged work on BEAM-8412: Author: ASF GitHub Bot Created on: 18/Oct/19 10:30 Start Date: 18/Oct/19 10:30 Worklog Time Spent: 10m Work Description: mxm commented on pull request #9817: [BEAM-8412] xlang test: set sdk worker parallelism to 1 URL: https://github.com/apache/beam/pull/9817#discussion_r336424962 ## File path: sdks/python/test-suites/portable/py2/build.gradle ## @@ -128,6 +128,7 @@ task crossLanguagePortableWordCount { "--shutdown_sources_on_final_watermark", "--environment_cache_millis=1", "--expansion_service_jar=${testServiceExpansionJar}", +"--sdk_worker_parallelism=1" Review comment: Let's add a comment here why this is necessary. Otherwise LGTM. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330410) Time Spent: 1h 40m (was: 1.5h) > Py2 post-commit test crossLanguagePortableWordCount failing > > > Key: BEAM-8412 > URL: https://issues.apache.org/jira/browse/BEAM-8412 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-harness, test-failures >Reporter: Ahmet Altay >Assignee: Chamikara Madhusanka Jayalath >Priority: Major > Time Spent: 1h 40m > Remaining Estimate: 0h > > Error is: 13:09:02 RuntimeError: IOError: [Errno 2] No such file or > directory: > '/tmp/beam-temp-py-wordcount-portable-67ce50e0eebe11e9892842010a80009c/fe77e28c-2e44-4914-8460-76ab4dbb8579.py-wordcount-portable' > [while running 'write/Write/WriteImpl/WriteBundles'] > Log: > https://builds.apache.org/view/A-D/view/Beam/view/All/job/beam_PostCommit_Python2/706/consoleFull > Last passing run was 705. 
Between 705 and 706 the change is https://github.com/apache/beam/pull/9785 (see https://builds.apache.org/view/A-D/view/Beam/view/All/job/beam_PostCommit_Python2/706/), although this PR looks unrelated. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8281) Resize file-based IOITs
[ https://issues.apache.org/jira/browse/BEAM-8281?focusedWorklogId=330405=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330405 ] ASF GitHub Bot logged work on BEAM-8281: Author: ASF GitHub Bot Created on: 18/Oct/19 10:13 Start Date: 18/Oct/19 10:13 Worklog Time Spent: 10m Work Description: lgajowy commented on pull request #9638: [BEAM-8281] Resize IOITs datasets URL: https://github.com/apache/beam/pull/9638 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330405) Time Spent: 10h (was: 9h 50m) > Resize file-based IOITs > > > Key: BEAM-8281 > URL: https://issues.apache.org/jira/browse/BEAM-8281 > Project: Beam > Issue Type: Sub-task > Components: testing >Reporter: Michal Walenia >Assignee: Michal Walenia >Priority: Minor > Time Spent: 10h > Remaining Estimate: 0h > > Resize the IOITs to use known amounts of data saved on GCS so that it's > possible to report dataset size and measure throughput as part of the test > metrics -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-7520) DirectRunner timers are not strictly time ordered
[ https://issues.apache.org/jira/browse/BEAM-7520?focusedWorklogId=330394=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330394 ] ASF GitHub Bot logged work on BEAM-7520: Author: ASF GitHub Bot Created on: 18/Oct/19 09:51 Start Date: 18/Oct/19 09:51 Worklog Time Spent: 10m Work Description: je-ik commented on issue #9190: [BEAM-7520] Fix timer firing order in DirectRunner URL: https://github.com/apache/beam/pull/9190#issuecomment-543640928 Run Samza ValidatesRunner This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330394) Time Spent: 15h 20m (was: 15h 10m) > DirectRunner timers are not strictly time ordered > Key: BEAM-7520 > URL: https://issues.apache.org/jira/browse/BEAM-7520 > Project: Beam > Issue Type: Bug > Components: runner-direct > Affects Versions: 2.13.0 > Reporter: Jan Lukavský > Assignee: Jan Lukavský > Priority: Major > Time Spent: 15h 20m > Remaining Estimate: 0h
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-7520) DirectRunner timers are not strictly time ordered
[ https://issues.apache.org/jira/browse/BEAM-7520?focusedWorklogId=330395=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330395 ] ASF GitHub Bot logged work on BEAM-7520: Author: ASF GitHub Bot Created on: 18/Oct/19 09:51 Start Date: 18/Oct/19 09:51 Worklog Time Spent: 10m Work Description: je-ik commented on issue #9190: [BEAM-7520] Fix timer firing order in DirectRunner URL: https://github.com/apache/beam/pull/9190#issuecomment-543641135 Run Flink ValidatesRunner This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330395) Time Spent: 15.5h (was: 15h 20m) > DirectRunner timers are not strictly time ordered > Key: BEAM-7520 > URL: https://issues.apache.org/jira/browse/BEAM-7520 > Project: Beam > Issue Type: Bug > Components: runner-direct > Affects Versions: 2.13.0 > Reporter: Jan Lukavský > Assignee: Jan Lukavský > Priority: Major > Time Spent: 15.5h > Remaining Estimate: 0h
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-7520) DirectRunner timers are not strictly time ordered
[ https://issues.apache.org/jira/browse/BEAM-7520?focusedWorklogId=330393=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330393 ] ASF GitHub Bot logged work on BEAM-7520: Author: ASF GitHub Bot Created on: 18/Oct/19 09:50 Start Date: 18/Oct/19 09:50 Worklog Time Spent: 10m Work Description: je-ik commented on issue #9190: [BEAM-7520] Fix timer firing order in DirectRunner URL: https://github.com/apache/beam/pull/9190#issuecomment-543640860 Run Flink ValidaatesRunner This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330393) Time Spent: 15h 10m (was: 15h) > DirectRunner timers are not strictly time ordered > Key: BEAM-7520 > URL: https://issues.apache.org/jira/browse/BEAM-7520 > Project: Beam > Issue Type: Bug > Components: runner-direct > Affects Versions: 2.13.0 > Reporter: Jan Lukavský > Assignee: Jan Lukavský > Priority: Major > Time Spent: 15h 10m > Remaining Estimate: 0h
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-7520) DirectRunner timers are not strictly time ordered
[ https://issues.apache.org/jira/browse/BEAM-7520?focusedWorklogId=330392=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330392 ] ASF GitHub Bot logged work on BEAM-7520: Author: ASF GitHub Bot Created on: 18/Oct/19 09:50 Start Date: 18/Oct/19 09:50 Worklog Time Spent: 10m Work Description: je-ik commented on issue #9190: [BEAM-7520] Fix timer firing order in DirectRunner URL: https://github.com/apache/beam/pull/9190#issuecomment-543640775 Run Dataflow ValidatesRunner This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330392) Time Spent: 15h (was: 14h 50m) > DirectRunner timers are not strictly time ordered > Key: BEAM-7520 > URL: https://issues.apache.org/jira/browse/BEAM-7520 > Project: Beam > Issue Type: Bug > Components: runner-direct > Affects Versions: 2.13.0 > Reporter: Jan Lukavský > Assignee: Jan Lukavský > Priority: Major > Time Spent: 15h > Remaining Estimate: 0h
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-8399) Python HDFS implementation should support filenames of the format "hdfs://namenodehost/parent/child"
[ https://issues.apache.org/jira/browse/BEAM-8399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16954414#comment-16954414 ] alionkun commented on BEAM-8399: Facing the same problem on a machine learning task based on TensorFlow and Beam. TensorFlow and other components working with HDFS accept URLs like "hdfs://namenodehost/parent/child", which is semantically complete, but Beam accepts URLs like "hdfs://parent/child". The problem is that we have to keep two styles for each HDFS URL. > Python HDFS implementation should support filenames of the format > "hdfs://namenodehost/parent/child" > > > Key: BEAM-8399 > URL: https://issues.apache.org/jira/browse/BEAM-8399 > Project: Beam > Issue Type: Improvement > Components: sdk-py-core >Reporter: Chamikara Madhusanka Jayalath >Priority: Major > > "hdfs://namenodehost/parent/child" and "/parent/child" seem to be the > correct filename formats for HDFS based on [1], but we currently support the > format "hdfs://parent/child". > To avoid breaking existing users, we have to either (1) somehow support both > versions by default (based on [2], HDFS does not seem to allow colons in > file paths, so this might be possible) or (2) make > "hdfs://namenodehost/parent/child" optional for now and change it to the > default after a few versions. > We should also make sure that the Beam Java and Python HDFS file-system > implementations are consistent in this regard. > > [1] [https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html] > [2] https://issues.apache.org/jira/browse/HDFS-13 > > cc: [~udim] -- This message was sent by Atlassian Jira (v8.3.4#803005)
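The ambiguity behind BEAM-8399 is visible with standard URL parsing: in a well-formed URL the component after "//" is the authority (host), so in the short form Beam currently accepts, the first path segment parses as a host. A quick illustration with Python's urllib.parse:

```python
from urllib.parse import urlparse

# Full form: the authority is the namenode host, the path is the file path.
full = urlparse("hdfs://namenodehost/parent/child")
print(full.netloc, full.path)    # namenodehost /parent/child

# Short form: "parent" is parsed as the host, so the same string means
# different things to Beam and to clients that honour the authority.
short = urlparse("hdfs://parent/child")
print(short.netloc, short.path)  # parent /child
```

This is why supporting both forms requires deciding whether the first component after "//" is a namenode or the start of the path.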