[jira] [Closed] (BEAM-9705) Bug on database.io for writing with batch size 1
[ https://issues.apache.org/jira/browse/BEAM-9705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrian Eka Sanjaya closed BEAM-9705.
Resolution: Fixed

> Bug on database.io for writing with batch size 1
> ------------------------------------------------
>
>                  Key: BEAM-9705
>                  URL: https://issues.apache.org/jira/browse/BEAM-9705
>              Project: Beam
>           Issue Type: Bug
>           Components: dsl-sql
>     Affects Versions: 2.19.0
>          Environment: Ubuntu 18.04
>             Reporter: Adrian Eka Sanjaya
>             Priority: Major
>               Labels: patch
>              Fix For: Not applicable
>
>    Original Estimate: 1h
>           Time Spent: 1h 40m
>   Remaining Estimate: 0h
>
> Setting the batch size to 1 breaks the database io library when it tries
> to write the last element.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
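The bug report describes a batching writer that breaks on the final flush when the batch size is 1. A minimal pure-Python sketch of that failure mode and its guard (the function and names are hypothetical illustrations, not the Go SDK's actual databaseio code):

```python
def write_batched(rows, batch_size, write_fn):
    """Buffer rows and flush a batch whenever it reaches batch_size.

    With batch_size == 1 every row is flushed inside the loop, so the
    final flush must be guarded: an unconditional trailing write_fn(batch)
    would emit an empty (invalid) batch, which is the class of bug the
    issue describes.
    """
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            write_fn(batch)
            batch = []
    if batch:  # guard: skip the empty trailing batch
        write_fn(batch)


batches = []
write_batched(["a", "b", "c"], 1, batches.append)
# Every emitted batch holds exactly one row; no empty batch is written.
```

The `if batch:` length check is the kind of value-length validation the linked PR title refers to.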
[jira] [Work logged] (BEAM-9705) Bug on database.io for writing with batch size 1
[ https://issues.apache.org/jira/browse/BEAM-9705?focusedWorklogId=417459&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417459 ]

ASF GitHub Bot logged work on BEAM-9705:
Author: ASF GitHub Bot
Created on: 07/Apr/20 05:33
Worklog Time Spent: 10m
Work Description: youngoli commented on pull request #11323: [BEAM-9705] Go sdk add value length validation checking on write to d…
URL: https://github.com/apache/beam/pull/11323

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Worklog Id: (was: 417459)
Time Spent: 1h 40m (was: 1.5h)
[jira] [Work logged] (BEAM-9147) [Java] PTransform that integrates Video Intelligence functionality
[ https://issues.apache.org/jira/browse/BEAM-9147?focusedWorklogId=417448&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417448 ]

ASF GitHub Bot logged work on BEAM-9147:
Author: ASF GitHub Bot
Created on: 07/Apr/20 04:58
Worklog Time Spent: 10m
Work Description: Ardagan commented on issue #11261: [BEAM-9147] Add a VideoIntelligence transform to Java SDK
URL: https://github.com/apache/beam/pull/11261#issuecomment-610173883

The Community Metrics failure should be irrelevant.

Worklog Id: (was: 417448)
Time Spent: 2.5h (was: 2h 20m)

> [Java] PTransform that integrates Video Intelligence functionality
> ------------------------------------------------------------------
>
>          Key: BEAM-9147
>          URL: https://issues.apache.org/jira/browse/BEAM-9147
>      Project: Beam
>   Issue Type: Sub-task
>   Components: io-java-gcp
>     Reporter: Kamil Wasilewski
>     Assignee: Michał Walenia
>     Priority: Major
>   Time Spent: 2.5h
>   Remaining Estimate: 0h
>
> The goal is to create a PTransform that integrates Google Cloud Video
> Intelligence functionality [1]. The transform should accept either a
> video's GCS location or raw video bytes as input. A module with the
> transform should be put into the `sdks/java/extensions` folder.
>
> [1] https://cloud.google.com/video-intelligence/
[jira] [Work logged] (BEAM-9147) [Java] PTransform that integrates Video Intelligence functionality
[ https://issues.apache.org/jira/browse/BEAM-9147?focusedWorklogId=417441&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417441 ]

ASF GitHub Bot logged work on BEAM-9147:
Author: ASF GitHub Bot
Created on: 07/Apr/20 04:50
Worklog Time Spent: 10m
Work Description: Ardagan commented on issue #11261: [BEAM-9147] Add a VideoIntelligence transform to Java SDK
URL: https://github.com/apache/beam/pull/11261#issuecomment-610171762

Run CommunityMetrics PreCommit

Worklog Id: (was: 417441)
Time Spent: 2h 20m (was: 2h 10m)
[jira] [Work logged] (BEAM-9639) Abstract bundle execution logic from stage execution logic
[ https://issues.apache.org/jira/browse/BEAM-9639?focusedWorklogId=417426&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417426 ]

ASF GitHub Bot logged work on BEAM-9639:
Author: ASF GitHub Bot
Created on: 07/Apr/20 03:17
Worklog Time Spent: 10m
Work Description: pabloem commented on issue #11270: [BEAM-9639][BEAM-9608] Improvements for FnApiRunner
URL: https://github.com/apache/beam/pull/11270#issuecomment-610150292

@robertwb ptal

Worklog Id: (was: 417426)
Time Spent: 0.5h (was: 20m)

> Abstract bundle execution logic from stage execution logic
> ----------------------------------------------------------
>
>          Key: BEAM-9639
>          URL: https://issues.apache.org/jira/browse/BEAM-9639
>      Project: Beam
>   Issue Type: Sub-task
>   Components: sdk-py-core
>     Reporter: Pablo Estrada
>     Assignee: Pablo Estrada
>     Priority: Major
>   Time Spent: 0.5h
>   Remaining Estimate: 0h
>
> The FnApiRunner currently works in a per-stage manner and does not abstract
> single-bundle execution much. This work item is to clearly define the code
> that executes a single bundle.
[jira] [Work logged] (BEAM-9650) Add consistent slowly changing side inputs support
[ https://issues.apache.org/jira/browse/BEAM-9650?focusedWorklogId=417425&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417425 ]

ASF GitHub Bot logged work on BEAM-9650:
Author: ASF GitHub Bot
Created on: 07/Apr/20 03:16
Worklog Time Spent: 10m
Work Description: Ardagan commented on issue #11182: [BEAM-9650] Add PeriodicImpulse Transform and slowly changing side input documentation
URL: https://github.com/apache/beam/pull/11182#issuecomment-610150038

> What is the expected behavior around lifecycle events for runners that support drain / update? Does it need to be explicitly documented?

Processing will stop on drain, so it should not cause any issues.

Worklog Id: (was: 417425)
Time Spent: 1h 20m (was: 1h 10m)

> Add consistent slowly changing side inputs support
> --------------------------------------------------
>
>          Key: BEAM-9650
>          URL: https://issues.apache.org/jira/browse/BEAM-9650
>      Project: Beam
>   Issue Type: Bug
>   Components: io-ideas
>     Reporter: Mikhail Gryzykhin
>     Assignee: Mikhail Gryzykhin
>     Priority: Major
>   Time Spent: 1h 20m
>   Remaining Estimate: 0h
>
> Add an implementation for slowly changing dimensions based on the
> [design doc](https://docs.google.com/document/d/1LDY_CtsOJ8Y_zNv1QtkP6AGFrtzkj1q5EW_gSChOIvg/edit)
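The PR above pairs a PeriodicImpulse transform with a side input that is re-read on every tick. A minimal pure-Python analogue of that refresh pattern (the tick generator and `load_lookup` loader are hypothetical stand-ins for the Beam transform, not its implementation):

```python
def periodic_impulse(start, stop, interval):
    """Yield firing timestamps from start to stop every interval seconds,
    mimicking the role of a periodic impulse source."""
    t = start
    while t <= stop:
        yield t
        t += interval


def enrich(events, ticks, load_lookup):
    """Re-load the slowly changing lookup on each tick, then join events
    (as (timestamp, key) pairs, in timestamp order) against the freshest
    snapshot seen so far."""
    lookup = {}
    tick_iter = iter(ticks)
    next_tick = next(tick_iter, None)
    for ts, key in events:
        # Refresh the side input for every tick at or before this event.
        while next_tick is not None and next_tick <= ts:
            lookup = load_lookup(next_tick)
            next_tick = next(tick_iter, None)
        yield key, lookup.get(key)


snapshots = list(enrich([(0, "k"), (15, "k")],
                        periodic_impulse(0, 20, 10),
                        lambda t: {"k": "v%d" % t}))
```

The drain question quoted in the comment maps onto the tick stream simply ending: once `periodic_impulse` stops yielding, the last-loaded snapshot is used for all remaining events.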
[jira] [Work logged] (BEAM-9705) Bug on database.io for writing with batch size 1
[ https://issues.apache.org/jira/browse/BEAM-9705?focusedWorklogId=417401&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417401 ]

ASF GitHub Bot logged work on BEAM-9705:
Author: ASF GitHub Bot
Created on: 07/Apr/20 02:51
Worklog Time Spent: 10m
Work Description: adrian3ka commented on issue #11323: [BEAM-9705] Go sdk add value length validation checking on write to d…
URL: https://github.com/apache/beam/pull/11323#issuecomment-610142249

@youngoli I think it's better to merge this earlier because it's the main IO for writing to a database, so the databaseio will be more stable in the next release. Especially with unbounded input, we can't control how much data will be processed in each batch. Thank you.

Worklog Id: (was: 417401)
Time Spent: 1.5h (was: 1h 20m)
[jira] [Work logged] (BEAM-9705) Bug on database.io for writing with batch size 1
[ https://issues.apache.org/jira/browse/BEAM-9705?focusedWorklogId=417396&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417396 ]

ASF GitHub Bot logged work on BEAM-9705:
Author: ASF GitHub Bot
Created on: 07/Apr/20 02:45
Worklog Time Spent: 10m
Work Description: adrian3ka commented on issue #11323: [BEAM-9705] Go sdk add value length validation checking on write to d…
URL: https://github.com/apache/beam/pull/11323#issuecomment-610142249

Worklog Id: (was: 417396)
Time Spent: 1h 20m (was: 1h 10m)
[jira] [Work logged] (BEAM-9691) Ensure Dataflow BQ Native sinks are not used on FnApi
[ https://issues.apache.org/jira/browse/BEAM-9691?focusedWorklogId=417370&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417370 ]

ASF GitHub Bot logged work on BEAM-9691:
Author: ASF GitHub Bot
Created on: 07/Apr/20 02:16
Worklog Time Spent: 10m
Work Description: pabloem commented on issue #11309: [BEAM-9691] Ensuring BQ Native Sink is avoided on FnApi pipelines
URL: https://github.com/apache/beam/pull/11309#issuecomment-610134389

Run Python 3.5 PostCommit

Worklog Id: (was: 417370)
Time Spent: 2h (was: 1h 50m)

> Ensure Dataflow BQ Native sinks are not used on FnApi
> -----------------------------------------------------
>
>          Key: BEAM-9691
>          URL: https://issues.apache.org/jira/browse/BEAM-9691
>      Project: Beam
>   Issue Type: Bug
>   Components: io-py-gcp
>     Reporter: Pablo Estrada
>     Assignee: Pablo Estrada
>     Priority: Major
>   Time Spent: 2h
>   Remaining Estimate: 0h
[jira] [Work logged] (BEAM-9715) annotations_test fails in some environments
[ https://issues.apache.org/jira/browse/BEAM-9715?focusedWorklogId=417365&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417365 ]

ASF GitHub Bot logged work on BEAM-9715:
Author: ASF GitHub Bot
Created on: 07/Apr/20 02:00
Worklog Time Spent: 10m
Work Description: pabloem commented on issue #11329: [BEAM-9715] Ensuring annotations_test passes in all environments
URL: https://github.com/apache/beam/pull/11329#issuecomment-610130043

r: @udim

Worklog Id: (was: 417365)
Time Spent: 20m (was: 10m)

> annotations_test fails in some environments
> -------------------------------------------
>
>          Key: BEAM-9715
>          URL: https://issues.apache.org/jira/browse/BEAM-9715
>      Project: Beam
>   Issue Type: Bug
>   Components: sdk-py-core
>     Reporter: Pablo Estrada
>     Assignee: Pablo Estrada
>     Priority: Major
>   Time Spent: 20m
>   Remaining Estimate: 0h
[jira] [Updated] (BEAM-9715) annotations_test fails in some environments
[ https://issues.apache.org/jira/browse/BEAM-9715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pablo Estrada updated BEAM-9715:
Priority: Minor (was: Major)
[jira] [Updated] (BEAM-9715) annotations_test fails in some environments
[ https://issues.apache.org/jira/browse/BEAM-9715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pablo Estrada updated BEAM-9715:
Status: Open (was: Triage Needed)
[jira] [Work logged] (BEAM-9715) annotations_test fails in some environments
[ https://issues.apache.org/jira/browse/BEAM-9715?focusedWorklogId=417364&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417364 ]

ASF GitHub Bot logged work on BEAM-9715:
Author: ASF GitHub Bot
Created on: 07/Apr/20 01:59
Worklog Time Spent: 10m
Work Description: pabloem commented on pull request #11329: [BEAM-9715] Ensuring annotations_test passes in all environments
URL: https://github.com/apache/beam/pull/11329
[jira] [Created] (BEAM-9715) annotations_test fails in some environments
Pablo Estrada created BEAM-9715:
-----------------------------------
Summary: annotations_test fails in some environments
Key: BEAM-9715
URL: https://issues.apache.org/jira/browse/BEAM-9715
Project: Beam
Issue Type: Bug
Components: sdk-py-core
Reporter: Pablo Estrada
Assignee: Pablo Estrada
[jira] [Commented] (BEAM-9147) [Java] PTransform that integrates Video Intelligence functionality
[ https://issues.apache.org/jira/browse/BEAM-9147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17076811#comment-17076811 ]

Masud Hasan commented on BEAM-9147:
-----------------------------------
Yes, in a Beam context: FileIO watch or PubSubIO.

My understanding is that the client SDK does not fully support live streaming yet; not sure if that has changed.
https://cloud.google.com/video-intelligence/docs/streaming/live-streaming-overview

Let's say I have a 50 MB file (GCS URI, feature config, and context received as a Pub/Sub or JSON message) and can call the API with at most 3 concurrent requests. Using the streaming VI API, I would think I could send out 15 MB chunks in parallel for faster performance. Do you think I can build such a request using this PTransform?
https://cloud.google.com/video-intelligence/docs/streaming/label-analysis
[jira] [Work logged] (BEAM-8910) Use AVRO instead of JSON in BigQuery bounded source.
[ https://issues.apache.org/jira/browse/BEAM-8910?focusedWorklogId=417355&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417355 ]

ASF GitHub Bot logged work on BEAM-8910:
Author: ASF GitHub Bot
Created on: 07/Apr/20 01:26
Worklog Time Spent: 10m
Work Description: pabloem commented on pull request #11086: [BEAM-8910] Make custom BQ source read from Avro
URL: https://github.com/apache/beam/pull/11086#discussion_r404483110

File path: sdks/python/apache_beam/io/gcp/bigquery.py

    @@ -663,14 +662,10 @@ def split(self, desired_bundle_size, start_position=None, stop_position=None):
           self._setup_temporary_dataset(bq)
           self.table_reference = self._execute_query(bq)
    -      schema, metadata_list = self._export_files(bq)
    +      unused_schema, metadata_list = self._export_files(bq)

Review comment: We may do that, but if we end up keeping a backwards-compatibility flag, we'll need to keep the coder as-is.

Worklog Id: (was: 417355)
Time Spent: 4h (was: 3h 50m)

> Use AVRO instead of JSON in BigQuery bounded source.
> ----------------------------------------------------
>
>          Key: BEAM-8910
>          URL: https://issues.apache.org/jira/browse/BEAM-8910
>      Project: Beam
>   Issue Type: Improvement
>   Components: sdk-py-core
>     Reporter: Kamil Wasilewski
>     Assignee: Pablo Estrada
>     Priority: Minor
>   Time Spent: 4h
>   Remaining Estimate: 0h
>
> The proposed BigQuery bounded source in the Python SDK (see PR:
> https://github.com/apache/beam/pull/9772) uses a BigQuery export job to
> take a snapshot of the table and read from each produced JSON file. A
> performance improvement can be gained by switching to AVRO instead.
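Part of the motivation for the Avro switch is that a JSON table snapshot arrives as text and must be parsed and re-cast field by field, while Avro files carry the schema's types. A small pure-Python illustration of the decode work a JSON-based reader has to do (the sample row and field types are made up for illustration, not taken from Beam's reader code):

```python
import json

# In a JSON export, values such as 64-bit integers arrive as strings,
# so the reader must re-cast each field against the table schema.
json_line = '{"user_id": "42", "score": "3.5"}'
schema = {"user_id": int, "score": float}


def parse_json_row(line, schema):
    """Parse one exported JSON line and cast each field to its schema type."""
    raw = json.loads(line)
    return {name: cast(raw[name]) for name, cast in schema.items()}


row = parse_json_row(json_line, schema)
# An Avro-based reader would hand back typed records directly,
# skipping both the text parse and the per-field casts.
```

This per-row parse-and-cast cost, multiplied over a large export, is the overhead the issue's proposed Avro path avoids.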
[jira] [Work logged] (BEAM-9468) Add Google Cloud Healthcare API IO Connectors
[ https://issues.apache.org/jira/browse/BEAM-9468?focusedWorklogId=417350&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417350 ]

ASF GitHub Bot logged work on BEAM-9468:
Author: ASF GitHub Bot
Created on: 07/Apr/20 01:13
Worklog Time Spent: 10m
Work Description: jaketf commented on issue #11151: [BEAM-9468] Hl7v2 io
URL: https://github.com/apache/beam/pull/11151#issuecomment-610050634

Next Steps (based on offline feedback):
- [x] Improve API for users:
  - [x] Add static methods for common patterns with `ListHL7v2Messages`
  - [x] Add `ValueProvider` support to ease use in the DataflowTemplates
    - [x] `ListHL7v2Messages` (hl7v2Store and filter)
    - [x] `Write` (hl7v2Store)
- [ ] "standardize" integration tests
  - [x] Refactor ITs to create / destroy the HL7v2 store under a parameterized dataset in `@BeforeClass` / `@AfterClass` to avoid issues with parallel test runs.
  - [x] Remove hard-coding of my HL7v2 store / project in integration tests.
  - [ ] Add a Healthcare API dataset to the Beam integration test project (pending permissions in [this dev list thread](https://lists.apache.org/thread.html/rebe5cd40a40a9fc7f2c1d563b48ee1ce4ff9cac3dfdc0258006cc686%40%3Cdev.beam.apache.org%3E))

Worklog Id: (was: 417350)
Time Spent: 13h (was: 12h 50m)

> Add Google Cloud Healthcare API IO Connectors
> ---------------------------------------------
>
>          Key: BEAM-9468
>          URL: https://issues.apache.org/jira/browse/BEAM-9468
>      Project: Beam
>   Issue Type: New Feature
>   Components: io-java-gcp
>     Reporter: Jacob Ferriero
>     Assignee: Jacob Ferriero
>     Priority: Minor
>   Time Spent: 13h
>   Remaining Estimate: 0h
>
> Add IO transforms for the HL7v2, FHIR and DICOM stores in the
> [Google Cloud Healthcare API](https://cloud.google.com/healthcare/docs/):
> HL7v2IO, FHIRIO, DICOM.
[jira] [Work logged] (BEAM-9468) Add Google Cloud Healthcare API IO Connectors
[ https://issues.apache.org/jira/browse/BEAM-9468?focusedWorklogId=417349&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417349 ]

ASF GitHub Bot logged work on BEAM-9468:
Author: ASF GitHub Bot
Created on: 07/Apr/20 01:12
Worklog Time Spent: 10m
Work Description: jaketf commented on issue #11151: [BEAM-9468] Hl7v2 io
URL: https://github.com/apache/beam/pull/11151#issuecomment-610050634

Worklog Id: (was: 417349)
Time Spent: 12h 50m (was: 12h 40m)
[jira] [Work logged] (BEAM-9468) Add Google Cloud Healthcare API IO Connectors
[ https://issues.apache.org/jira/browse/BEAM-9468?focusedWorklogId=417348&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417348 ]

ASF GitHub Bot logged work on BEAM-9468:
Author: ASF GitHub Bot
Created on: 07/Apr/20 01:11
Worklog Time Spent: 10m
Work Description: jaketf commented on issue #11151: [BEAM-9468] Hl7v2 io
URL: https://github.com/apache/beam/pull/11151#issuecomment-610050634

Worklog Id: (was: 417348)
Time Spent: 12h 40m (was: 12.5h)
[jira] [Work logged] (BEAM-4374) Update existing metrics in the FN API to use new Metric Schema
[ https://issues.apache.org/jira/browse/BEAM-4374?focusedWorklogId=417343&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417343 ]

ASF GitHub Bot logged work on BEAM-4374:
Author: ASF GitHub Bot
Created on: 07/Apr/20 00:51
Worklog Time Spent: 10m
Work Description: lukecwik commented on pull request #11325: [BEAM-4374, BEAM-6189] Delete and remove deprecated Metrics proto
URL: https://github.com/apache/beam/pull/11325

Worklog Id: (was: 417343)
Time Spent: 41h 10m (was: 41h)

> Update existing metrics in the FN API to use new Metric Schema
> --------------------------------------------------------------
>
>          Key: BEAM-4374
>          URL: https://issues.apache.org/jira/browse/BEAM-4374
>      Project: Beam
>   Issue Type: New Feature
>   Components: beam-model
>     Reporter: Alex Amato
>     Priority: Major
>   Time Spent: 41h 10m
>   Remaining Estimate: 0h
>
> Update existing metrics to use the new proto and cataloging schema defined in
> https://s.apache.org/beam-fn-api-metrics:
> * Check in the new protos
> * Define a catalog file for metrics
> * Port existing metrics to this new format, based on catalog names + metadata
[jira] [Updated] (BEAM-9714) [Go SDK] Require --region flag in Dataflow runner
[ https://issues.apache.org/jira/browse/BEAM-9714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kyle Weaver updated BEAM-9714: -- Status: Open (was: Triage Needed) > [Go SDK] Require --region flag in Dataflow runner > - > > Key: BEAM-9714 > URL: https://issues.apache.org/jira/browse/BEAM-9714 > Project: Beam > Issue Type: Improvement > Components: sdk-go >Reporter: Kyle Weaver >Assignee: Kyle Weaver >Priority: Major > > We already require --region for Java and Python, so we should require it for Go > as well. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-9714) [Go SDK] Require --region flag in Dataflow runner
Kyle Weaver created BEAM-9714: - Summary: [Go SDK] Require --region flag in Dataflow runner Key: BEAM-9714 URL: https://issues.apache.org/jira/browse/BEAM-9714 Project: Beam Issue Type: Improvement Components: sdk-go Reporter: Kyle Weaver Assignee: Kyle Weaver We already require --region for Java and Python, so we should require it for Go as well. -- This message was sent by Atlassian Jira (v8.3.4#803005)
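The change requested above amounts to failing fast when the flag is absent, as the Java and Python SDKs already do. A minimal sketch of that kind of required-option check (hypothetical helper and option names, shown in Python for brevity; the real Go SDK validation lives in the Dataflow runner's option parsing):

```python
def validate_dataflow_options(options):
    """Fail fast when required Dataflow options are missing.

    Hypothetical sketch mirroring the --region requirement already
    enforced by the Java and Python SDKs (BEAM-9199).
    """
    if not options.get("project"):
        raise ValueError("--project is required when using the Dataflow runner")
    if not options.get("region"):
        raise ValueError("--region is required when using the Dataflow runner")
    return options
```

A pipeline launched without `--region` would then fail at submission time rather than falling back to an implicit default region.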
[jira] [Work logged] (BEAM-9008) Add readAll() method to CassandraIO
[ https://issues.apache.org/jira/browse/BEAM-9008?focusedWorklogId=417341=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417341 ] ASF GitHub Bot logged work on BEAM-9008: Author: ASF GitHub Bot Created on: 07/Apr/20 00:43 Start Date: 07/Apr/20 00:43 Worklog Time Spent: 10m Work Description: vmarquez commented on issue #10546: [BEAM-9008] Add CassandraIO readAll method URL: https://github.com/apache/beam/pull/10546#issuecomment-610109654 @iemejia is that a spurious failure or did something I do break the Flink test? I tested locally and all seems to work... LMK if you need anything from me. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 417341) Time Spent: 7h 50m (was: 7h 40m) > Add readAll() method to CassandraIO > --- > > Key: BEAM-9008 > URL: https://issues.apache.org/jira/browse/BEAM-9008 > Project: Beam > Issue Type: New Feature > Components: io-java-cassandra >Affects Versions: 2.16.0 >Reporter: vincent marquez >Assignee: vincent marquez >Priority: Minor > Time Spent: 7h 50m > Remaining Estimate: 0h > > When querying a large Cassandra database, it's often *much* more useful to > programmatically generate the queries needed to be run rather than reading > all partitions and attempting some filtering. > As an example: > {code:java} > public class Event { >@PartitionKey(0) public UUID accountId; >@PartitionKey(1)public String yearMonthDay; >@ClusteringKey public UUID eventId; >//other data... > }{code} > If there are ten years' worth of data, you may want to only query one year's > worth. Here each token range would represent one 'token' but all events for > the day. 
> {code:java} > Set<UUID> accounts = getRelevantAccounts(); > Set<String> dateRange = generateDateRange("2018-01-01", "2019-01-01"); > PCollection tokens = generateTokens(accounts, dateRange); > {code} > > I propose an additional _readAll()_ PTransform that can take a PCollection > of token ranges and can return a PCollection of what the query would > return. > *Question: How much code should be in common between both methods?* > Currently the read connector already groups all partitions into a List of > Token Ranges, so it would be simple to refactor the current read() based > method to a 'ParDo' based one and have them both share the same function. > Reasons against sharing code between read and readAll > * Not having the read based method return a BoundedSource connector would > mean losing the ability to know the size of the data returned > * Currently the CassandraReader executes all the grouped TokenRange queries > *asynchronously* which is (maybe?) fine when all that's happening is > splitting up all the partition ranges but terrible for executing potentially > millions of queries. > Reasons _for_ sharing code would be simplified code base and that both of > the above issues would most likely have a negligible performance impact. > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
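The proposal above fans a targeted query out per (accountId, yearMonthDay) partition instead of scanning the whole table. A rough sketch of the two helper functions named in the snippet, shown in Python for brevity (the original is Java; `generate_date_range` and `generate_tokens` are the snippet's own names, everything else is illustrative):

```python
from datetime import date, timedelta

def generate_date_range(start, end):
    # Yield "yyyy-MM-dd" strings for each day in [start, end).
    d, stop = date.fromisoformat(start), date.fromisoformat(end)
    while d < stop:
        yield d.isoformat()
        d += timedelta(days=1)

def generate_tokens(accounts, days):
    # One (accountId, yearMonthDay) pair per Cassandra partition,
    # matching the composite partition key on the Event table above.
    return [(account, day) for account in accounts for day in days]

days = list(generate_date_range("2018-01-01", "2019-01-01"))
# 365 day strings; with e.g. 100 accounts that is 36,500 targeted
# partition queries instead of a full-table read plus filtering.
```

A `readAll()`-style PTransform would then consume a PCollection of such keys (or token ranges derived from them) and issue one query per element.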
[jira] [Work logged] (BEAM-9685) Don't release Go SDK container until Go is officially supported.
[ https://issues.apache.org/jira/browse/BEAM-9685?focusedWorklogId=417340=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417340 ] ASF GitHub Bot logged work on BEAM-9685: Author: ASF GitHub Bot Created on: 07/Apr/20 00:43 Start Date: 07/Apr/20 00:43 Worklog Time Spent: 10m Work Description: Hannah-Jiang commented on pull request #11308: [BEAM-9685] remove Go SDK container from release process from 2.22.0 URL: https://github.com/apache/beam/pull/11308 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 417340) Time Spent: 2h 10m (was: 2h) > Don't release Go SDK container until Go is officially supported. > > > Key: BEAM-9685 > URL: https://issues.apache.org/jira/browse/BEAM-9685 > Project: Beam > Issue Type: Task > Components: build-system >Reporter: Hannah Jiang >Assignee: Hannah Jiang >Priority: Major > Fix For: 2.21.0 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > 1. Remove Go SDK container from release process. > 2. Update document about it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9685) Don't release Go SDK container until Go is officially supported.
[ https://issues.apache.org/jira/browse/BEAM-9685?focusedWorklogId=417339=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417339 ] ASF GitHub Bot logged work on BEAM-9685: Author: ASF GitHub Bot Created on: 07/Apr/20 00:43 Start Date: 07/Apr/20 00:43 Worklog Time Spent: 10m Work Description: Hannah-Jiang commented on issue #11308: [BEAM-9685] remove Go SDK container from release process from 2.22.0 URL: https://github.com/apache/beam/pull/11308#issuecomment-610109509 > Yes, the Go Postcommit is failing in general right now, not due to this PR. See: https://builds.apache.org/job/beam_PostCommit_Go/ Thanks Daniel for confirming. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 417339) Time Spent: 2h (was: 1h 50m) > Don't release Go SDK container until Go is officially supported. > > > Key: BEAM-9685 > URL: https://issues.apache.org/jira/browse/BEAM-9685 > Project: Beam > Issue Type: Task > Components: build-system >Reporter: Hannah Jiang >Assignee: Hannah Jiang >Priority: Major > Fix For: 2.21.0 > > Time Spent: 2h > Remaining Estimate: 0h > > 1. Remove Go SDK container from release process. > 2. Update document about it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9685) Don't release Go SDK container until Go is officially supported.
[ https://issues.apache.org/jira/browse/BEAM-9685?focusedWorklogId=417338=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417338 ] ASF GitHub Bot logged work on BEAM-9685: Author: ASF GitHub Bot Created on: 07/Apr/20 00:39 Start Date: 07/Apr/20 00:39 Worklog Time Spent: 10m Work Description: youngoli commented on issue #11308: [BEAM-9685] remove Go SDK container from release process from 2.22.0 URL: https://github.com/apache/beam/pull/11308#issuecomment-610108628 Yes, the Go Postcommit is failing in general right now, not due to this PR. See: https://builds.apache.org/job/beam_PostCommit_Go/ This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 417338) Time Spent: 1h 50m (was: 1h 40m) > Don't release Go SDK container until Go is officially supported. > > > Key: BEAM-9685 > URL: https://issues.apache.org/jira/browse/BEAM-9685 > Project: Beam > Issue Type: Task > Components: build-system >Reporter: Hannah Jiang >Assignee: Hannah Jiang >Priority: Major > Fix For: 2.21.0 > > Time Spent: 1h 50m > Remaining Estimate: 0h > > 1. Remove Go SDK container from release process. > 2. Update document about it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9674) "Selected fields list too long" error when calling tables.get in BigQueryStorageTableSource
[ https://issues.apache.org/jira/browse/BEAM-9674?focusedWorklogId=417335=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417335 ] ASF GitHub Bot logged work on BEAM-9674: Author: ASF GitHub Bot Created on: 07/Apr/20 00:33 Start Date: 07/Apr/20 00:33 Worklog Time Spent: 10m Work Description: aaltay commented on pull request #11292: [BEAM-9674] Don't specify selected fields when fetching BigQuery table size URL: https://github.com/apache/beam/pull/11292#discussion_r404468600 ## File path: sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryServices.java ## @@ -101,10 +101,6 @@ JobStatistics dryRunQuery(String projectId, JobConfigurationQuery queryConfig, S @Nullable Table getTable(TableReference tableRef) throws InterruptedException, IOException; -@Nullable Review comment: Is this deleted a public API? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 417335) Time Spent: 0.5h (was: 20m) > "Selected fields list too long" error when calling tables.get in > BigQueryStorageTableSource > --- > > Key: BEAM-9674 > URL: https://issues.apache.org/jira/browse/BEAM-9674 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.19.0 >Reporter: Kenneth Jung >Assignee: Kenneth Jung >Priority: Minor > Time Spent: 0.5h > Remaining Estimate: 0h > > Customers experience errors similar to the following: > Caused by: > com.google.api.client.googleapis.json.GoogleJsonResponseException: 400 Bad > Request { "code" : 400, "errors" : [ > { "domain" : "global", "message" : "Selected fields too long: must > be less than 16384 characters.", "reason" : "invalid" } > ], "message" : "Selected fields too long: must be less than 16384 > characters.", "status" : "INVALID_ARGUMENT" } > com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:146) > > com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113) > > com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40) > > com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:321) > com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1097) > com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419) > > com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352) > > com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469) > > 
org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl.executeWithRetries(BigQueryServicesImpl.java:938) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9674) "Selected fields list too long" error when calling tables.get in BigQueryStorageTableSource
[ https://issues.apache.org/jira/browse/BEAM-9674?focusedWorklogId=417336=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417336 ] ASF GitHub Bot logged work on BEAM-9674: Author: ASF GitHub Bot Created on: 07/Apr/20 00:33 Start Date: 07/Apr/20 00:33 Worklog Time Spent: 10m Work Description: aaltay commented on pull request #11292: [BEAM-9674] Don't specify selected fields when fetching BigQuery table size URL: https://github.com/apache/beam/pull/11292#discussion_r404468600 ## File path: sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryServices.java ## @@ -101,10 +101,6 @@ JobStatistics dryRunQuery(String projectId, JobConfigurationQuery queryConfig, S @Nullable Table getTable(TableReference tableRef) throws InterruptedException, IOException; -@Nullable Review comment: Is this removing a public API? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 417336) Time Spent: 40m (was: 0.5h) > "Selected fields list too long" error when calling tables.get in > BigQueryStorageTableSource > --- > > Key: BEAM-9674 > URL: https://issues.apache.org/jira/browse/BEAM-9674 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.19.0 >Reporter: Kenneth Jung >Assignee: Kenneth Jung >Priority: Minor > Time Spent: 40m > Remaining Estimate: 0h > > Customers experience errors similar to the following: > Caused by: > com.google.api.client.googleapis.json.GoogleJsonResponseException: 400 Bad > Request { "code" : 400, "errors" : [ > { "domain" : "global", "message" : "Selected fields too long: must > be less than 16384 characters.", "reason" : "invalid" } > ], "message" : "Selected fields too long: must be less than 16384 > characters.", "status" : "INVALID_ARGUMENT" } > com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:146) > > com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113) > > com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40) > > com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:321) > com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1097) > com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419) > > com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352) > > com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469) > > 
org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl.executeWithRetries(BigQueryServicesImpl.java:938) -- This message was sent by Atlassian Jira (v8.3.4#803005)
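For context on the bug above: BigQuery's tables.get accepts an optional selectedFields parameter (a comma-separated field list), and the service rejects values of 16,384 characters or more, so for very wide tables the full field list overflows the limit. The fix in the PR is to omit the parameter entirely when only table metadata such as size is needed. A hedged sketch of that request-building logic (hypothetical helper, not the actual BigQueryServicesImpl code):

```python
SELECTED_FIELDS_LIMIT = 16384  # per the API error message above

def table_get_params(selected_fields=None):
    # When fetching only table size/metadata, pass no selectedFields at
    # all -- the full schema comes back, but the request can never trip
    # the length limit no matter how wide the table is.
    params = {}
    if selected_fields:
        value = ",".join(selected_fields)
        if len(value) >= SELECTED_FIELDS_LIMIT:
            raise ValueError(
                "Selected fields too long: must be less than 16384 characters.")
        params["selectedFields"] = value
    return params
```

The trade-off is a larger response payload for wide tables, which is acceptable when the caller only reads `numBytes` from the result.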
[jira] [Work logged] (BEAM-9650) Add consistent slowly changing side inputs support
[ https://issues.apache.org/jira/browse/BEAM-9650?focusedWorklogId=417332=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417332 ] ASF GitHub Bot logged work on BEAM-9650: Author: ASF GitHub Bot Created on: 07/Apr/20 00:31 Start Date: 07/Apr/20 00:31 Worklog Time Spent: 10m Work Description: rezarokni commented on issue #11182: [BEAM-9650] Add PeriodicImpulse Transform and slowly changing side input documentation URL: https://github.com/apache/beam/pull/11182#issuecomment-610106616 What is the expected behaviour around lifecycle events for runners that support drain / update? Does it need to be explicitly documented? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 417332) Time Spent: 1h 10m (was: 1h) > Add consistent slowly changing side inputs support > -- > > Key: BEAM-9650 > URL: https://issues.apache.org/jira/browse/BEAM-9650 > Project: Beam > Issue Type: Bug > Components: io-ideas >Reporter: Mikhail Gryzykhin >Assignee: Mikhail Gryzykhin >Priority: Major > Time Spent: 1h 10m > Remaining Estimate: 0h > > Add implementation for slowly changing dimensions based on [design > doc](https://docs.google.com/document/d/1LDY_CtsOJ8Y_zNv1QtkP6AGFrtzkj1q5EW_gSChOIvg/edit) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9685) Don't release Go SDK container until Go is officially supported.
[ https://issues.apache.org/jira/browse/BEAM-9685?focusedWorklogId=417327=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417327 ] ASF GitHub Bot logged work on BEAM-9685: Author: ASF GitHub Bot Created on: 07/Apr/20 00:21 Start Date: 07/Apr/20 00:21 Worklog Time Spent: 10m Work Description: Hannah-Jiang commented on issue #11308: [BEAM-9685] remove Go SDK container from release process from 2.22.0 URL: https://github.com/apache/beam/pull/11308#issuecomment-610104116 PostCommit is failing with some tests. The job was able to create, push, run tests and delete the Go SDK container, so I don't think the failures are related to current PR. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 417327) Time Spent: 1h 40m (was: 1.5h) > Don't release Go SDK container until Go is officially supported. > > > Key: BEAM-9685 > URL: https://issues.apache.org/jira/browse/BEAM-9685 > Project: Beam > Issue Type: Task > Components: build-system >Reporter: Hannah Jiang >Assignee: Hannah Jiang >Priority: Major > Fix For: 2.21.0 > > Time Spent: 1h 40m > Remaining Estimate: 0h > > 1. Remove Go SDK container from release process. > 2. Update document about it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (BEAM-9199) Make --region a required flag for DataflowRunner
[ https://issues.apache.org/jira/browse/BEAM-9199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kyle Weaver resolved BEAM-9199. --- Fix Version/s: 2.21.0 Resolution: Fixed > Make --region a required flag for DataflowRunner > > > Key: BEAM-9199 > URL: https://issues.apache.org/jira/browse/BEAM-9199 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow >Reporter: Kyle Weaver >Assignee: Kyle Weaver >Priority: Major > Fix For: 2.21.0 > > Time Spent: 3.5h > Remaining Estimate: 0h > > We've been warning users since Beam 2.15.0 that --region will be required. > That is sufficient time, so now we can start requiring the flag. > While this is a small change in and of itself, I'm guessing many tests and > examples will need to be updated to add --region. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9691) Ensure Dataflow BQ Native sink are not used on FnApi
[ https://issues.apache.org/jira/browse/BEAM-9691?focusedWorklogId=417326=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417326 ] ASF GitHub Bot logged work on BEAM-9691: Author: ASF GitHub Bot Created on: 07/Apr/20 00:18 Start Date: 07/Apr/20 00:18 Worklog Time Spent: 10m Work Description: pabloem commented on issue #11309: [BEAM-9691] Ensuring BQ Native Sink is avoided on FnApi pipelines URL: https://github.com/apache/beam/pull/11309#issuecomment-610103170 Run Python 3.5 PostCommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 417326) Time Spent: 1h 50m (was: 1h 40m) > Ensure Dataflow BQ Native sink are not used on FnApi > > > Key: BEAM-9691 > URL: https://issues.apache.org/jira/browse/BEAM-9691 > Project: Beam > Issue Type: Bug > Components: io-py-gcp >Reporter: Pablo Estrada >Assignee: Pablo Estrada >Priority: Major > Time Spent: 1h 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9468) Add Google Cloud Healthcare API IO Connectors
[ https://issues.apache.org/jira/browse/BEAM-9468?focusedWorklogId=417324=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417324 ] ASF GitHub Bot logged work on BEAM-9468: Author: ASF GitHub Bot Created on: 07/Apr/20 00:15 Start Date: 07/Apr/20 00:15 Worklog Time Spent: 10m Work Description: jaketf commented on issue #11151: [BEAM-9468] Hl7v2 io URL: https://github.com/apache/beam/pull/11151#issuecomment-610050634 Next Steps (based on offline feed): - [x] Improve API for users: - [x] Add static methods for common patterns with `ListHL7v2Messages` - [x] Add `ValueProvider` support to ease use in the DataflowTemplates - [x] `ListHL7v2Messages` (hl7v2Store and filter) - [x] `Write` (hl7v2Store) - [ ] "standardize" integration tests - [x] Refactor ITs to create / destroy HL7v2 Store under a parameterized dataset in `@BeforeClass` `@AfterClass` to avoid issues with parallel tests runs. - [ ] Remove hard coding of my HL7v2Store / project in integration tests. - [ ] Add Healthcare API Dataset to Beam integration test project This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 417324) Time Spent: 12.5h (was: 12h 20m) > Add Google Cloud Healthcare API IO Connectors > - > > Key: BEAM-9468 > URL: https://issues.apache.org/jira/browse/BEAM-9468 > Project: Beam > Issue Type: New Feature > Components: io-java-gcp >Reporter: Jacob Ferriero >Assignee: Jacob Ferriero >Priority: Minor > Time Spent: 12.5h > Remaining Estimate: 0h > > Add IO Transforms for the HL7v2, FHIR and DICOM stores in the [Google Cloud > Healthcare API|https://cloud.google.com/healthcare/docs/] > HL7v2IO > FHIRIO > DICOM -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9199) Make --region a required flag for DataflowRunner
[ https://issues.apache.org/jira/browse/BEAM-9199?focusedWorklogId=417323=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417323 ] ASF GitHub Bot logged work on BEAM-9199: Author: ASF GitHub Bot Created on: 07/Apr/20 00:12 Start Date: 07/Apr/20 00:12 Worklog Time Spent: 10m Work Description: ibzib commented on pull request #11281: [BEAM-9199] Require --region option for Dataflow in Java SDK. URL: https://github.com/apache/beam/pull/11281 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 417323) Time Spent: 3.5h (was: 3h 20m) > Make --region a required flag for DataflowRunner > > > Key: BEAM-9199 > URL: https://issues.apache.org/jira/browse/BEAM-9199 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow >Reporter: Kyle Weaver >Assignee: Kyle Weaver >Priority: Major > Time Spent: 3.5h > Remaining Estimate: 0h > > We've been warning users since Beam 2.15.0 that --region will be required. > That is sufficient time, so now we can start requiring the flag. > While this is a small change in and of itself, I'm guessing many tests and > examples will need to be updated to add --region. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9199) Make --region a required flag for DataflowRunner
[ https://issues.apache.org/jira/browse/BEAM-9199?focusedWorklogId=417321=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417321 ] ASF GitHub Bot logged work on BEAM-9199: Author: ASF GitHub Bot Created on: 07/Apr/20 00:11 Start Date: 07/Apr/20 00:11 Worklog Time Spent: 10m Work Description: ibzib commented on issue #11269: [BEAM-9199] Require Dataflow --region in Python SDK. URL: https://github.com/apache/beam/pull/11269#issuecomment-610101073 Failure in `hdfsIntegrationTest` looks like known flake (BEAM-7405 et al): `docker-credential-gcloud not installed or not available in PATH` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 417321) Time Spent: 3h 10m (was: 3h) > Make --region a required flag for DataflowRunner > > > Key: BEAM-9199 > URL: https://issues.apache.org/jira/browse/BEAM-9199 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow >Reporter: Kyle Weaver >Assignee: Kyle Weaver >Priority: Major > Time Spent: 3h 10m > Remaining Estimate: 0h > > We've been warning users since Beam 2.15.0 that --region will be required. > That is sufficient time, so now we can start requiring the flag. > While this is a small change in and of itself, I'm guessing many tests and > examples will need to be updated to add --region. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9199) Make --region a required flag for DataflowRunner
[ https://issues.apache.org/jira/browse/BEAM-9199?focusedWorklogId=417322=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417322 ] ASF GitHub Bot logged work on BEAM-9199: Author: ASF GitHub Bot Created on: 07/Apr/20 00:11 Start Date: 07/Apr/20 00:11 Worklog Time Spent: 10m Work Description: ibzib commented on pull request #11269: [BEAM-9199] Require Dataflow --region in Python SDK. URL: https://github.com/apache/beam/pull/11269 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 417322) Time Spent: 3h 20m (was: 3h 10m) > Make --region a required flag for DataflowRunner > > > Key: BEAM-9199 > URL: https://issues.apache.org/jira/browse/BEAM-9199 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow >Reporter: Kyle Weaver >Assignee: Kyle Weaver >Priority: Major > Time Spent: 3h 20m > Remaining Estimate: 0h > > We've been warning users since Beam 2.15.0 that --region will be required. > That is sufficient time, so now we can start requiring the flag. > While this is a small change in and of itself, I'm guessing many tests and > examples will need to be updated to add --region. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-9713) hints should be rejected
[ https://issues.apache.org/jira/browse/BEAM-9713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Pilloud updated BEAM-9713: - Status: Open (was: Triage Needed) > hints should be rejected > > > Key: BEAM-9713 > URL: https://issues.apache.org/jira/browse/BEAM-9713 > Project: Beam > Issue Type: Bug > Components: dsl-sql-zetasql >Reporter: Andrew Pilloud >Priority: Trivial > Labels: zetasql-compliance > > five failures in shard 32 > {code} > Expected: ERROR: generic::invalid_argument: Unsupported hint: invalid_hint > Actual: ARRAY>[{123}] > {code} > {code} > @{ invalid_hint=5 } select i from t > > select @{ invalid_hint=5 } i from t > > select i from t @{ invalid_hint=5 } > > select i from t group @{ invalid_hint=5 } by 1 > > select i from t group @{ num_shards='abc' } by 1 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-9713) hints should be rejected
Andrew Pilloud created BEAM-9713: Summary: hints should be rejected Key: BEAM-9713 URL: https://issues.apache.org/jira/browse/BEAM-9713 Project: Beam Issue Type: Bug Components: dsl-sql-zetasql Reporter: Andrew Pilloud five failures in shard 32 {code} Expected: ERROR: generic::invalid_argument: Unsupported hint: invalid_hint Actual: ARRAY>[{123}] {code} {code} @{ invalid_hint=5 } select i from t select @{ invalid_hint=5 } i from t select i from t @{ invalid_hint=5 } select i from t group @{ invalid_hint=5 } by 1 select i from t group @{ num_shards='abc' } by 1 {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-9712) setting default timezone doesn't work
Andrew Pilloud created BEAM-9712: Summary: setting default timezone doesn't work Key: BEAM-9712 URL: https://issues.apache.org/jira/browse/BEAM-9712 Project: Beam Issue Type: Bug Components: dsl-sql-zetasql Reporter: Andrew Pilloud several failures in shard 14 (note: fixing the internal tests requires plumbing through the timezone config.) {code} [name=timestamp_to_string_1] select [cast(timestamp "2015-01-28" as string), cast(timestamp "2015-01-28 00:00:00" as string), cast(timestamp "2015-01-28 00:00:00.0" as string), cast(timestamp "2015-01-28 00:00:00.00" as string), cast(timestamp "2015-01-28 00:00:00.000" as string), cast(timestamp "2015-01-28 00:00:00." as string), cast(timestamp "2015-01-28 00:00:00.0" as string), cast(timestamp "2015-01-28 00:00:00.00" as string)] -- ARRAY>>[ {ARRAY[ "2015-01-28 00:00:00+13:45", "2015-01-28 00:00:00+13:45", "2015-01-28 00:00:00+13:45", "2015-01-28 00:00:00+13:45", "2015-01-28 00:00:00+13:45", "2015-01-28 00:00:00+13:45", "2015-01-28 00:00:00+13:45", "2015-01-28 00:00:00+13:45" ]} ] {code} {code} [default_time_zone=Pacific/Chatham] [name=timestamp_to_string_1] select [cast(timestamp "2015-01-28" as string), cast(timestamp "2015-01-28 00:00:00" as string), cast(timestamp "2015-01-28 00:00:00.0" as string), cast(timestamp "2015-01-28 00:00:00.00" as string), cast(timestamp "2015-01-28 00:00:00.000" as string), cast(timestamp "2015-01-28 00:00:00." as string), cast(timestamp "2015-01-28 00:00:00.0" as string), cast(timestamp "2015-01-28 00:00:00.00" as string)] -- ARRAY>>[ {ARRAY[ "2015-01-28 00:00:00+13:45", "2015-01-28 00:00:00+13:45", "2015-01-28 00:00:00+13:45", "2015-01-28 00:00:00+13:45", "2015-01-28 00:00:00+13:45", "2015-01-28 00:00:00+13:45", "2015-01-28 00:00:00+13:45", "2015-01-28 00:00:00+13:45" ]} ] {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-9712) setting default timezone doesn't work
[ https://issues.apache.org/jira/browse/BEAM-9712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Pilloud updated BEAM-9712: Status: Open (was: Triage Needed) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-9709) timezone off by 8 hours
[ https://issues.apache.org/jira/browse/BEAM-9709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Pilloud updated BEAM-9709:
---------------------------------
Description:
two failures in shard 13, one failure in shard 19

{code}
Expected: ARRAY<STRUCT<TIMESTAMP>>[{2014-01-31 00:00:00+00}]
Actual: ARRAY<STRUCT<TIMESTAMP>>[{2014-01-31 08:00:00+00}],
{code}
{code}
select timestamp(date '2014-01-31')
{code}

was:
one failure in shard 19
(It is possible this test is attempting to change the default timezone before running)

{code}
Expected: ARRAY<STRUCT<TIMESTAMP>>[
  {2000-01-02 18:20:30+00},
  {2000-01-02 09:02:03+00}
]
Actual: ARRAY<STRUCT<TIMESTAMP>>[
  {2000-01-02 10:20:30+00},
  {2000-01-02 01:02:03+00}
],
{code}
{code}
SELECT x FROM UNNEST([TIMESTAMP '2000-01-02 10:20:30', '2000-01-02 01:02:03']) x;
{code}

> timezone off by 8 hours
> -----------------------
>
> Key: BEAM-9709
> URL: https://issues.apache.org/jira/browse/BEAM-9709
> Project: Beam
> Issue Type: Bug
> Components: dsl-sql-zetasql
> Reporter: Andrew Pilloud
> Priority: Trivial
> Labels: zetasql-compliance

-- This message was sent by Atlassian Jira (v8.3.4#803005)
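The 8-hour shift is consistent with the civil timestamp being interpreted in America/Los_Angeles (UTC-8 in January) rather than UTC. A minimal Python sketch of that hypothesis (not Beam code; assumes system tzdata is available):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Interpret the civil time 2014-01-31 00:00:00 in America/Los_Angeles
# (PST, UTC-8 in January) and convert to UTC: this reproduces the
# "Actual" value above, 2014-01-31 08:00:00+00.
local = datetime(2014, 1, 31, 0, 0, 0, tzinfo=ZoneInfo("America/Los_Angeles"))
as_utc = local.astimezone(timezone.utc)
print(as_utc)  # 2014-01-31 08:00:00+00:00
```

If the environment's default timezone were honored as UTC, the conversion would be the identity and the expected `2014-01-31 00:00:00+00` would come back instead.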
[jira] [Updated] (BEAM-9709) timezone off by 8 hours
[ https://issues.apache.org/jira/browse/BEAM-9709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Pilloud updated BEAM-9709: Summary: timezone off by 8 hours (was: unnest timezone off by 8 hours) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-9708) count with no elements returns no value instead of 0
[ https://issues.apache.org/jira/browse/BEAM-9708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Pilloud updated BEAM-9708:
---------------------------------
Description:
two failures in shard 3, one failure in shard 37

{code:java}
Expected: ARRAY<STRUCT<INT64>>[{0}]
Actual: ARRAY<STRUCT<INT64>>[],
Details: Number of array elements is {1} and {0} in respective arrays {[unordered: {0}]} and {[]}
{code}
{code}
[prepare_database]
CREATE TABLE TableEmpty AS SELECT val FROM (SELECT 1 as val) WHERE false
--
ARRAY<STRUCT<val INT64>>[]
==
[name=aggregation_count_6]
SELECT COUNT(*) FROM TableEmpty
--
ARRAY<STRUCT<INT64>>[{0}]
==
[name=aggregation_count_7]
SELECT COUNT(val) FROM TableEmpty
--
ARRAY<STRUCT<INT64>>[{0}]
{code}
{code}
SELECT COUNT(a) FROM (
  SELECT a FROM (SELECT 1 a UNION ALL SELECT 2 UNION ALL SELECT 3) LIMIT 0 OFFSET 0)
{code}

was:
one failure in shard 37

{code:java}
Expected: ARRAY<STRUCT<INT64>>[{0}]
Actual: ARRAY<STRUCT<INT64>>[],
Details: Number of array elements is {1} and {0} in respective arrays {[unordered: {0}]} and {[]}
{code}
{code}
SELECT COUNT(a) FROM (
  SELECT a FROM (SELECT 1 a UNION ALL SELECT 2 UNION ALL SELECT 3) LIMIT 0 OFFSET 0)
{code}

> count with no elements returns no value instead of 0
> ----------------------------------------------------
>
> Key: BEAM-9708
> URL: https://issues.apache.org/jira/browse/BEAM-9708
> Project: Beam
> Issue Type: Bug
> Components: dsl-sql-zetasql
> Reporter: Andrew Pilloud
> Priority: Trivial
> Labels: zetasql-compliance
-- This message was sent by Atlassian Jira (v8.3.4#803005)
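The expected semantics above — a global `COUNT` over an empty input emits exactly one row containing 0, never an empty result — can be demonstrated with any conventional SQL engine. A minimal sketch using Python's built-in `sqlite3` (standing in for ZetaSQL, which the compliance suite actually targets):

```python
import sqlite3

# COUNT over an empty table must still produce a single row containing 0;
# the compliance failure above shows Beam returning no rows at all.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TableEmpty (val INTEGER)")
rows = conn.execute("SELECT COUNT(val) FROM TableEmpty").fetchall()
print(rows)  # [(0,)]
```

A likely source of this class of bug is implementing a global aggregation as a per-key aggregation: with zero input elements there are zero keys, so no output row is ever produced.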
[jira] [Updated] (BEAM-9711) sum(null) should be null not 0
[ https://issues.apache.org/jira/browse/BEAM-9711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Pilloud updated BEAM-9711:
---------------------------------
Status: Open (was: Triage Needed)

> sum(null) should be null not 0
> ------------------------------
>
> Key: BEAM-9711
> URL: https://issues.apache.org/jira/browse/BEAM-9711
> Project: Beam
> Issue Type: Bug
> Components: dsl-sql-zetasql
> Reporter: Andrew Pilloud
> Priority: Trivial
> Labels: zetasql-compliance
>
> one failure in shard 3
> {code}
> Expected: ARRAY<STRUCT<row_id INT64, int64_sum INT64>>[
>   {1, NULL},
>   {2, NULL},
>   {3, NULL},
>   {4, 3},
>   {5, 4},
>   {6, 5},
>   {7, 6},
>   {8, 7},
>   {9, 8},
>   {10, 9},
>   {11, 10},
>   {12, 11},
>   {13, 12},
>   {14, 13}
> ]
> Actual: ARRAY<STRUCT<row_id INT64, int64_sum INT64>>[
>   {1, 0},
>   {10, 9},
>   {7, 6},
>   {2, 0},
>   {13, 12},
>   {5, 4},
>   {4, 3},
>   {14, 13},
>   {6, 5},
>   {11, 10},
>   {12, 11},
>   {8, 7},
>   {3, 0},
>   {9, 8}
> ],
> {code}
> {code}
> [prepare_database]
> CREATE TABLE TableLarge AS
> SELECT CAST(1 AS int64) as row_id,
>        CAST(NULL AS bool) as bool_val, CAST(NULL AS double) as double_val,
>        CAST(NULL AS int64) as int64_val, CAST(NULL AS uint64) as uint64_val,
>        CAST(NULL AS string) as str_val UNION ALL
> SELECT 2, true, NULL, NULL, NULL, NULL UNION ALL
> SELECT 3, false, 0.2, NULL, NULL, NULL UNION ALL
> SELECT 4, true, 0.3, 3, NULL, NULL UNION ALL
> SELECT 5, false, 0.4, 4, 15, "4" UNION ALL
> SELECT 6, true, 0.5, 5, 17, "5" UNION ALL
> SELECT 7, false, 0.6, 6, 19, "6" UNION ALL
> SELECT 8, true, 0.7, 7, 21, "7" UNION ALL
> SELECT 9, false, 0.8, 8, 23, "8" UNION ALL
> SELECT 10, true, 0.9, 9, 25, "9" UNION ALL
> SELECT 11, false, 1.0, 10, 27, "10" UNION ALL
> SELECT 12, true, IEEE_DIVIDE(1, 0), 11, 29, "11" UNION ALL
> SELECT 13, false, IEEE_DIVIDE(-1, 0), 12, 31, "12" UNION ALL
> SELECT 14, true, IEEE_DIVIDE(0, 0), 13, 33, "13"
> --
> ARRAY<STRUCT<row_id INT64,
>              bool_val BOOL,
>              double_val DOUBLE,
>              int64_val INT64,
>              uint64_val UINT64,
>              str_val STRING>>
> [
>   {1, NULL, NULL, NULL, NULL, NULL},
>   {2, true, NULL, NULL, NULL, NULL},
>   {3, false, 0.2, NULL, NULL, NULL},
>   {4, true, 0.3, 3, NULL, NULL},
>   {5, false, 0.4, 4, 15, "4"},
>   {6, true, 0.5, 5, 17, "5"},
>   {7, false, 0.6, 6, 19, "6"},
>   {8, true, 0.7, 7, 21, "7"},
>   {9, false, 0.8, 8, 23, "8"},
>   {10, true, 0.9, 9, 25, "9"},
>   {11, false, 1, 10, 27, "10"},
>   {12, true, inf, 11, 29, "11"},
>   {13, false, -inf, 12, 31, "12"},
>   {14, true, nan, 13, 33, "13"}
> ]
> ==
> # SUM should work with GROUP BY.
> [name=aggregation_sum_group_by]
> SELECT row_id, SUM(int64_val) int64_sum FROM TableLarge GROUP BY row_id
> --
> ARRAY<STRUCT<row_id INT64, int64_sum INT64>>[
>   {1, NULL},
>   {2, NULL},
>   {3, NULL},
>   {4, 3},
>   {5, 4},
>   {6, 5},
>   {7, 6},
>   {8, 7},
>   {9, 8},
>   {10, 9},
>   {11, 10},
>   {12, 11},
>   {13, 12},
>   {14, 13}
> ]
> {code}

-- This message was sent by Atlassian Jira (v8.3.4#803005)
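The rule being violated — `SUM` over a group whose values are all NULL yields NULL, not 0 — is standard SQL aggregate behavior. A minimal sketch using Python's built-in `sqlite3` (standing in for ZetaSQL, which the compliance suite actually targets):

```python
import sqlite3

# Groups 1 and 2 contain only NULL int64_val: standard SQL says their
# SUM is NULL. The compliance failure above shows Beam returning 0.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (row_id INTEGER, int64_val INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(1, None), (2, None), (4, 3), (5, 4)])
result = dict(conn.execute(
    "SELECT row_id, SUM(int64_val) FROM t GROUP BY row_id"))
print(result)  # {1: None, 2: None, 4: 3, 5: 4}
```

The 0-vs-NULL confusion typically comes from initializing the accumulator to 0 and never tracking whether any non-NULL value was actually seen.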
[jira] [Work logged] (BEAM-9618) Allow SDKs to pull process bundle descriptors.
[ https://issues.apache.org/jira/browse/BEAM-9618?focusedWorklogId=417302=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417302 ] ASF GitHub Bot logged work on BEAM-9618: Author: ASF GitHub Bot Created on: 06/Apr/20 23:28 Start Date: 06/Apr/20 23:28 Worklog Time Spent: 10m Work Description: robertwb commented on pull request #11328: [BEAM-9618] Java SDK worker support for pulling bundle descriptors. URL: https://github.com/apache/beam/pull/11328 Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily: - [ ] [**Choose reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and mention them in a comment (`R: @username`). - [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue. - [ ] Update `CHANGES.md` with noteworthy changes. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf). See the [Contributor Guide](https://beam.apache.org/contribute) for more tips on [how to make review process smoother](https://beam.apache.org/contribute/#make-reviewers-job-easier). 
Post-Commit Tests Status (on master branch)
[jira] [Created] (BEAM-9711) sum(null) should be null not 0
Andrew Pilloud created BEAM-9711: Summary: sum(null) should be null not 0 Key: BEAM-9711 URL: https://issues.apache.org/jira/browse/BEAM-9711 Project: Beam Issue Type: Bug Components: dsl-sql-zetasql Reporter: Andrew Pilloud -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-9710) Got current time instead of timestamp value
[ https://issues.apache.org/jira/browse/BEAM-9710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Pilloud updated BEAM-9710:
---------------------------------
Status: Open (was: Triage Needed)

> Got current time instead of timestamp value
> -------------------------------------------
>
> Key: BEAM-9710
> URL: https://issues.apache.org/jira/browse/BEAM-9710
> Project: Beam
> Issue Type: Bug
> Components: dsl-sql-zetasql
> Reporter: Andrew Pilloud
> Priority: Trivial
> Labels: zetasql-compliance
>
> one failure in shard 13
> {code}
> Expected: ARRAY<STRUCT<timestamp_val TIMESTAMP>>[{2014-12-01 00:00:00+00}]
> Actual: ARRAY<STRUCT<timestamp_val TIMESTAMP>>[{2020-04-06 00:20:40.052+00}],
> {code}
> {code}
> [prepare_database]
> CREATE TABLE Table1 AS
> SELECT timestamp '2014-12-01' as timestamp_val
> --
> ARRAY<STRUCT<timestamp_val TIMESTAMP>>[{2014-12-01 00:00:00+00}]
> ==
> [name=timestamp_type_2]
> SELECT timestamp_val
> FROM Table1
> --
> ARRAY<STRUCT<timestamp_val TIMESTAMP>>[{2014-12-01 00:00:00+00}]
> {code}

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-9710) Got current time instead of timestamp value
Andrew Pilloud created BEAM-9710: Summary: Got current time instead of timestamp value Key: BEAM-9710 URL: https://issues.apache.org/jira/browse/BEAM-9710 Project: Beam Issue Type: Bug Components: dsl-sql-zetasql Reporter: Andrew Pilloud -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9468) Add Google Cloud Healthcare API IO Connectors
[ https://issues.apache.org/jira/browse/BEAM-9468?focusedWorklogId=417301=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417301 ]

ASF GitHub Bot logged work on BEAM-9468:
Author: ASF GitHub Bot
Created on: 06/Apr/20 23:21
Start Date: 06/Apr/20 23:21
Worklog Time Spent: 10m
Work Description: jaketf commented on issue #11151: [BEAM-9468] Hl7v2 io
URL: https://github.com/apache/beam/pull/11151#issuecomment-610050634

Next steps (based on offline feedback):
- [x] Improve API for users:
  - [x] Add static methods for common patterns with `ListHL7v2Messages`
  - [x] Add `ValueProvider` support to ease use in the DataflowTemplates
    - [x] `ListHL7v2Messages` (hl7v2Store and filter)
    - [x] `Write` (hl7v2Store)
- [ ] "Standardize" integration tests:
  - [ ] Refactor ITs to create / destroy the HL7v2 store under a parameterized dataset in `@BeforeClass` / `@AfterClass` to avoid issues with parallel test runs.
  - [ ] Remove hard coding of my HL7v2 store / project in integration tests.
  - [ ] Add a Healthcare API dataset to the Beam integration test project.

Issue Time Tracking
-------------------
Worklog Id: (was: 417301) Time Spent: 12h 20m (was: 12h 10m)

> Add Google Cloud Healthcare API IO Connectors
> ---------------------------------------------
>
> Key: BEAM-9468
> URL: https://issues.apache.org/jira/browse/BEAM-9468
> Project: Beam
> Issue Type: New Feature
> Components: io-java-gcp
> Reporter: Jacob Ferriero
> Assignee: Jacob Ferriero
> Priority: Minor
> Time Spent: 12h 20m
> Remaining Estimate: 0h
>
> Add IO transforms for the HL7v2, FHIR and DICOM stores in the [Google Cloud Healthcare API|https://cloud.google.com/healthcare/docs/]:
> HL7v2IO
> FHIRIO
> DICOM

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-9709) unnest timezone off by 8 hours
[ https://issues.apache.org/jira/browse/BEAM-9709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Pilloud updated BEAM-9709: Status: Open (was: Triage Needed) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-9709) unnest timezone off by 8 hours
Andrew Pilloud created BEAM-9709: Summary: unnest timezone off by 8 hours Key: BEAM-9709 URL: https://issues.apache.org/jira/browse/BEAM-9709 Project: Beam Issue Type: Bug Components: dsl-sql-zetasql Reporter: Andrew Pilloud -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-9708) count with no elements returns no value instead of 0
[ https://issues.apache.org/jira/browse/BEAM-9708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Pilloud updated BEAM-9708: Labels: zetasql-compliance (was: ) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-9708) count with no elements returns no value instead of 0
[ https://issues.apache.org/jira/browse/BEAM-9708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Pilloud updated BEAM-9708: Status: Open (was: Triage Needed) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-9708) count with no elements returns no value instead of 0
Andrew Pilloud created BEAM-9708: Summary: count with no elements returns no value instead of 0 Key: BEAM-9708 URL: https://issues.apache.org/jira/browse/BEAM-9708 Project: Beam Issue Type: Bug Components: dsl-sql-zetasql Reporter: Andrew Pilloud -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417295=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417295 ]

ASF GitHub Bot logged work on BEAM-9434:
Author: ASF GitHub Bot
Created on: 06/Apr/20 23:03
Start Date: 06/Apr/20 23:03
Worklog Time Spent: 10m
Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3
URL: https://github.com/apache/beam/pull/11037#issuecomment-610080905
Run Go Spark ValidatesRunner

Issue Time Tracking
-------------------
Worklog Id: (was: 417295) Time Spent: 11h (was: 10h 50m)

> Improve Spark runner reshuffle translation to maximize parallelism
> ------------------------------------------------------------------
>
> Key: BEAM-9434
> URL: https://issues.apache.org/jira/browse/BEAM-9434
> Project: Beam
> Issue Type: Improvement
> Components: runner-spark
> Affects Versions: 2.19.0
> Reporter: Emiliano Capoccia
> Assignee: Emiliano Capoccia
> Priority: Minor
> Fix For: 2.21.0
> Time Spent: 11h
> Remaining Estimate: 0h
>
> There is a performance issue when processing a large number of small Avro files in Spark on k8s (tens of thousands or more).
> The recommended way of reading a pattern of Avro files in Beam is by means of:
>
> {code:java}
> PCollection<AvroGenClass> records = p.apply(AvroIO.read(AvroGenClass.class)
>     .from("s3://my-bucket/path-to/*.avro").withHintMatchesManyFiles())
> {code}
> However, in the case of many small files, the above results in the entire reading taking place in a single task/node, which is considerably slow and has scalability issues.
> The option of omitting the hint is not viable, as it results in too many tasks being spawned, and the cluster being busy doing coordination of tiny tasks with high overhead.
> There are a few workarounds on the internet which mainly revolve around compacting the input files before processing, so that a reduced number of bulky files is processed in parallel.
> It seems the Spark runner is using the parallelism of the input distributed collection (RDD) to calculate the number of partitions in Reshuffle. In the case of FileIO/AvroIO, if the input pattern is a regex, the size of the input is 1, which is far from an optimal parallelism value. We may fix this by improving the translation of reshuffle to maximize parallelism.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
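The compaction workaround described above amounts to bounding the number of read units: distribute many small files into a fixed number of groups, so each task reads several files instead of one task reading everything (or thousands of tiny tasks reading one file each). A hypothetical language-agnostic sketch in Python (`batch_files` is an illustrative helper, not a Beam or Spark API):

```python
def batch_files(files, num_batches):
    """Round-robin a file list into at most num_batches non-empty groups.

    Each group becomes one unit of parallel work, bounding both the
    single-task bottleneck and the tiny-task coordination overhead.
    """
    batches = [[] for _ in range(num_batches)]
    for i, name in enumerate(files):
        batches[i % num_batches].append(name)
    return [b for b in batches if b]

# 10 small Avro files spread across 4 read groups.
groups = batch_files([f"part-{i:05d}.avro" for i in range(10)], 4)
print([len(g) for g in groups])  # [3, 3, 2, 2]
```

The proposed fix in the issue addresses the same imbalance inside the runner itself, by giving Reshuffle a sensible partition count instead of inheriting the input RDD's parallelism of 1.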
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417294=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417294 ] ASF GitHub Bot logged work on BEAM-9434: Author: ASF GitHub Bot Created on: 06/Apr/20 23:03 Start Date: 06/Apr/20 23:03 Worklog Time Spent: 10m Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-610080809 Run Go Spark ValidatesRunner Issue Time Tracking --- Worklog Id: (was: 417294) Time Spent: 10h 50m (was: 10h 40m) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417293=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417293 ] ASF GitHub Bot logged work on BEAM-9434: Author: ASF GitHub Bot Created on: 06/Apr/20 23:03 Start Date: 06/Apr/20 23:03 Worklog Time Spent: 10m Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-610080530 Run Python Spark ValidatesRunner Issue Time Tracking --- Worklog Id: (was: 417293) Time Spent: 10h 40m (was: 10.5h) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417291&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417291 ] ASF GitHub Bot logged work on BEAM-9434: Author: ASF GitHub Bot Created on: 06/Apr/20 23:02 Start Date: 06/Apr/20 23:02 Worklog Time Spent: 10m Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-610080530 Run Python Spark ValidatesRunner This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 417291) Time Spent: 10h 20m (was: 10h 10m)
> Improve Spark runner reshuffle translation to maximize parallelism
> ------------------------------------------------------------------
>
> Key: BEAM-9434
> URL: https://issues.apache.org/jira/browse/BEAM-9434
> Project: Beam
> Issue Type: Improvement
> Components: runner-spark
> Affects Versions: 2.19.0
> Reporter: Emiliano Capoccia
> Assignee: Emiliano Capoccia
> Priority: Minor
> Fix For: 2.21.0
>
> Time Spent: 10h 20m
> Remaining Estimate: 0h
>
> There is a performance issue when processing a large number of small Avro files (tens of thousands or more) in Spark on k8s.
> The recommended way of reading a pattern of Avro files in Beam is by means of:
>
> {code:java}
> PCollection<AvroGenClass> records = p.apply(AvroIO.read(AvroGenClass.class)
>     .from("s3://my-bucket/path-to/*.avro").withHintMatchesManyFiles());
> {code}
> However, in the case of many small files, the above results in the entire read taking place in a single task/node, which is considerably slow and has scalability issues.
> The option of omitting the hint is not viable either, as it results in too many tasks being spawned and the cluster staying busy coordinating tiny tasks with high overhead.
> There are a few workarounds on the internet, which mainly revolve around compacting the input files before processing, so that a reduced number of bulky files is processed in parallel.
> It seems the Spark runner uses the parallelism of the input distributed collection (RDD) to calculate the number of partitions in Reshuffle. In the case of FileIO/AvroIO, if the input pattern is a regex, the size of the input is 1, which is far from an optimal parallelism value. We may fix this by improving the translation of Reshuffle to maximize parallelism.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
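For context, a sketch of what the hinted read roughly corresponds to in the Beam SDK (this is an approximation, not the exact expansion, and may differ between Beam versions; `AvroGenClass` is the generated Avro class from the snippet in the description). The file list produced by the match step is what the runner redistributes via Reshuffle before reading, which is where the partition count discussed in this issue is chosen:

{code:java}
// Hedged sketch: AvroIO.read(...).withHintMatchesManyFiles() behaves roughly like
// matching the pattern first and then reading the matched files, with a Reshuffle
// in between (inserted by the SDK) to spread the file list across workers.
PCollection<AvroGenClass> records = p
    .apply(FileIO.match().filepattern("s3://my-bucket/path-to/*.avro"))
    .apply(FileIO.readMatches())
    .apply(AvroIO.readFiles(AvroGenClass.class));
{code}

If the runner sizes the Reshuffle's partitions from the parallelism of the tiny matched-file collection, the subsequent read collapses onto too few tasks, which matches the behaviour reported above.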
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417288&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417288 ] ASF GitHub Bot logged work on BEAM-9434: Author: ASF GitHub Bot Created on: 06/Apr/20 23:02 Start Date: 06/Apr/20 23:02 Worklog Time Spent: 10m Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-610078890 Run Java Spark PortableValidatesRunner Batch Issue Time Tracking --- Worklog Id: (was: 417288) Time Spent: 10h 10m (was: 10h)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417292&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417292 ] ASF GitHub Bot logged work on BEAM-9434: Author: ASF GitHub Bot Created on: 06/Apr/20 23:02 Start Date: 06/Apr/20 23:02 Worklog Time Spent: 10m Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-610080657 Run Python Spark ValidatesRunner Issue Time Tracking --- Worklog Id: (was: 417292) Time Spent: 10.5h (was: 10h 20m)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417287&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417287 ] ASF GitHub Bot logged work on BEAM-9434: Author: ASF GitHub Bot Created on: 06/Apr/20 23:01 Start Date: 06/Apr/20 23:01 Worklog Time Spent: 10m Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-610080380 Run Java Spark PortableValidatesRunner Batch Issue Time Tracking --- Worklog Id: (was: 417287) Time Spent: 10h (was: 9h 50m)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417283&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417283 ] ASF GitHub Bot logged work on BEAM-9434: Author: ASF GitHub Bot Created on: 06/Apr/20 23:01 Start Date: 06/Apr/20 23:01 Worklog Time Spent: 10m Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-610080182 Run Spark Runner Nexmark Tests Issue Time Tracking --- Worklog Id: (was: 417283) Time Spent: 9h 20m (was: 9h 10m)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417281&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417281 ] ASF GitHub Bot logged work on BEAM-9434: Author: ASF GitHub Bot Created on: 06/Apr/20 23:01 Start Date: 06/Apr/20 23:01 Worklog Time Spent: 10m Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-610080127 Run Spark Runner Nexmark Tests Issue Time Tracking --- Worklog Id: (was: 417281) Time Spent: 9h 10m (was: 9h)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417284&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417284 ] ASF GitHub Bot logged work on BEAM-9434: Author: ASF GitHub Bot Created on: 06/Apr/20 23:01 Start Date: 06/Apr/20 23:01 Worklog Time Spent: 10m Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-610078236 Run Spark Runner Nexmark Tests Issue Time Tracking --- Worklog Id: (was: 417284) Time Spent: 9.5h (was: 9h 20m)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417285&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417285 ] ASF GitHub Bot logged work on BEAM-9434: Author: ASF GitHub Bot Created on: 06/Apr/20 23:01 Start Date: 06/Apr/20 23:01 Worklog Time Spent: 10m Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-610080065 Run Spark Runner Nexmark Tests Issue Time Tracking --- Worklog Id: (was: 417285) Time Spent: 9h 40m (was: 9.5h)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417286&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417286 ] ASF GitHub Bot logged work on BEAM-9434: Author: ASF GitHub Bot Created on: 06/Apr/20 23:01 Start Date: 06/Apr/20 23:01 Worklog Time Spent: 10m Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-610080127 Run Spark Runner Nexmark Tests Issue Time Tracking --- Worklog Id: (was: 417286) Time Spent: 9h 50m (was: 9h 40m)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417275&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417275 ] ASF GitHub Bot logged work on BEAM-9434: Author: ASF GitHub Bot Created on: 06/Apr/20 23:00 Start Date: 06/Apr/20 23:00 Worklog Time Spent: 10m Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-610079874 Run Spark ValidatesRunner Issue Time Tracking --- Worklog Id: (was: 417275) Time Spent: 8h 20m (was: 8h 10m)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417280&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417280 ] ASF GitHub Bot logged work on BEAM-9434: Author: ASF GitHub Bot Created on: 06/Apr/20 23:00 Start Date: 06/Apr/20 23:00 Worklog Time Spent: 10m Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-610080065 Run Spark Runner Nexmark Tests Issue Time Tracking --- Worklog Id: (was: 417280) Time Spent: 9h (was: 8h 50m)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417276&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417276 ] ASF GitHub Bot logged work on BEAM-9434: Author: ASF GitHub Bot Created on: 06/Apr/20 23:00 Start Date: 06/Apr/20 23:00 Worklog Time Spent: 10m Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-610079624 Run Spark ValidatesRunner Issue Time Tracking --- Worklog Id: (was: 417276) Time Spent: 8.5h (was: 8h 20m)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417278=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417278 ] ASF GitHub Bot logged work on BEAM-9434: Author: ASF GitHub Bot Created on: 06/Apr/20 23:00 Start Date: 06/Apr/20 23:00 Worklog Time Spent: 10m Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-610079752 Run Spark ValidatesRunner This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 417278) Time Spent: 8h 40m (was: 8.5h) > Improve Spark runner reshuffle translation to maximize parallelism > -- > > Key: BEAM-9434 > URL: https://issues.apache.org/jira/browse/BEAM-9434 > Project: Beam > Issue Type: Improvement > Components: runner-spark >Affects Versions: 2.19.0 >Reporter: Emiliano Capoccia >Assignee: Emiliano Capoccia >Priority: Minor > Fix For: 2.21.0 > > Time Spent: 8h 40m > Remaining Estimate: 0h > > There is a performance issue when processing a large number of small Avro > files in Spark on k8s (tens of thousands or more). > The recommended way of reading a pattern of Avro files in Beam is by means of: > > {code:java} > PCollection records = p.apply(AvroIO.read(AvroGenClass.class) > .from("s3://my-bucket/path-to/*.avro").withHintMatchesManyFiles()) > {code} > However, in the case of many small files, the above results in the entire > reading taking place in a single task/node, which is considerably slow and > has scalability issues. > The option of omitting the hint is not viable, as it results in too many > tasks being spawn, and the cluster being busy doing coordination of tiny > tasks with high overhead. 
> There are a few workarounds on the internet, which mainly revolve around > compacting the input files before processing, so that a reduced number of > bulky files is processed in parallel. > It seems the Spark runner is using the parallelism of the input distributed > collection (RDD) to calculate the number of partitions in Reshuffle. In the > case of FileIO/AvroIO, if the input pattern is a regex, the size of the input > is 1, which is far from an optimal parallelism value. We may fix this by > improving the translation of reshuffle to maximize parallelism. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
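The partition-sizing behaviour described in the issue can be illustrated with a small, self-contained sketch (hypothetical class and method names, not the actual Spark runner code): if Reshuffle sizes its partitions from the input RDD's parallelism, a single glob pattern yields a parallelism of 1, whereas taking the maximum of the input parallelism and the cluster's default parallelism restores fan-out, which is the direction of the proposed fix.

```java
// Hypothetical sketch of the Reshuffle partition-sizing logic discussed in
// BEAM-9434. Not actual Beam/Spark runner code: names and values are illustrative.
public class ReshufflePartitioning {

    // Behaviour described in the issue: partitions follow the parallelism of
    // the input RDD, which is 1 when the input is a single file pattern.
    static int partitionsFromInput(int inputRddParallelism) {
        return inputRddParallelism;
    }

    // Proposed direction: never shuffle into fewer partitions than the
    // cluster's default parallelism, so a 1-element input still fans out.
    static int partitionsMaximized(int inputRddParallelism, int defaultParallelism) {
        return Math.max(inputRddParallelism, defaultParallelism);
    }

    public static void main(String[] args) {
        int inputParallelism = 1;      // one glob string, e.g. "s3://bucket/*.avro"
        int defaultParallelism = 200;  // e.g. spark.default.parallelism on the cluster

        System.out.println(partitionsFromInput(inputParallelism));                     // 1
        System.out.println(partitionsMaximized(inputParallelism, defaultParallelism)); // 200
    }
}
```

With the first rule, tens of thousands of small Avro files read from one pattern all land in a single task; with the second, the shuffle spreads them across the cluster regardless of the input's apparent size.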
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417279&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417279 ] ASF GitHub Bot logged work on BEAM-9434: Author: ASF GitHub Bot Created on: 06/Apr/20 23:00 Start Date: 06/Apr/20 23:00 Worklog Time Spent: 10m Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-610077992 Run Spark ValidatesRunner Issue Time Tracking --- Worklog Id: (was: 417279) Time Spent: 8h 50m (was: 8h 40m)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417272&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417272 ] ASF GitHub Bot logged work on BEAM-9434: Author: ASF GitHub Bot Created on: 06/Apr/20 22:59 Start Date: 06/Apr/20 22:59 Worklog Time Spent: 10m Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-610079437 retest this please Issue Time Tracking --- Worklog Id: (was: 417272) Time Spent: 7h 50m (was: 7h 40m)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417274&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417274 ] ASF GitHub Bot logged work on BEAM-9434: Author: ASF GitHub Bot Created on: 06/Apr/20 22:59 Start Date: 06/Apr/20 22:59 Worklog Time Spent: 10m Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-610079752 Run Spark ValidatesRunner Issue Time Tracking --- Worklog Id: (was: 417274) Time Spent: 8h 10m (was: 8h)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417271&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417271 ] ASF GitHub Bot logged work on BEAM-9434: Author: ASF GitHub Bot Created on: 06/Apr/20 22:59 Start Date: 06/Apr/20 22:59 Worklog Time Spent: 10m Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-610079366 retest this please Issue Time Tracking --- Worklog Id: (was: 417271) Time Spent: 7h 40m (was: 7.5h)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417273&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417273 ] ASF GitHub Bot logged work on BEAM-9434: Author: ASF GitHub Bot Created on: 06/Apr/20 22:59 Start Date: 06/Apr/20 22:59 Worklog Time Spent: 10m Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-610079624 Run Spark ValidatesRunner Issue Time Tracking --- Worklog Id: (was: 417273) Time Spent: 8h (was: 7h 50m)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417270&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417270 ] ASF GitHub Bot logged work on BEAM-9434: Author: ASF GitHub Bot Created on: 06/Apr/20 22:58 Start Date: 06/Apr/20 22:58 Worklog Time Spent: 10m Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-610079437 retest this please Issue Time Tracking --- Worklog Id: (was: 417270) Time Spent: 7.5h (was: 7h 20m)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417269&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417269 ] ASF GitHub Bot logged work on BEAM-9434: Author: ASF GitHub Bot Created on: 06/Apr/20 22:58 Start Date: 06/Apr/20 22:58 Worklog Time Spent: 10m Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-610079366 retest this please Issue Time Tracking --- Worklog Id: (was: 417269) Time Spent: 7h 20m (was: 7h 10m)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417266&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417266 ] ASF GitHub Bot logged work on BEAM-9434: Author: ASF GitHub Bot Created on: 06/Apr/20 22:57 Start Date: 06/Apr/20 22:57 Worklog Time Spent: 10m Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-610078527 Run Java Spark PortableValidatesRunner Batch Issue Time Tracking --- Worklog Id: (was: 417266) Time Spent: 6h 50m (was: 6h 40m)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417268&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417268 ] ASF GitHub Bot logged work on BEAM-9434: Author: ASF GitHub Bot Created on: 06/Apr/20 22:57 Start Date: 06/Apr/20 22:57 Worklog Time Spent: 10m Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-610078731 Run Java Spark PortableValidatesRunner Batch Issue Time Tracking --- Worklog Id: (was: 417268) Time Spent: 7h 10m (was: 7h)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417267&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417267 ] ASF GitHub Bot logged work on BEAM-9434: Author: ASF GitHub Bot Created on: 06/Apr/20 22:57 Start Date: 06/Apr/20 22:57 Worklog Time Spent: 10m Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-610078632 Run Java Spark PortableValidatesRunner Batch Issue Time Tracking --- Worklog Id: (was: 417267) Time Spent: 7h (was: 6h 50m)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417265&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417265 ] ASF GitHub Bot logged work on BEAM-9434: Author: ASF GitHub Bot Created on: 06/Apr/20 22:57 Start Date: 06/Apr/20 22:57 Worklog Time Spent: 10m Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-610078890 Run Java Spark PortableValidatesRunner Batch Issue Time Tracking --- Worklog Id: (was: 417265) Time Spent: 6h 40m (was: 6.5h)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417263&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417263 ]

ASF GitHub Bot logged work on BEAM-9434:

Author: ASF GitHub Bot
Created on: 06/Apr/20 22:56
Start Date: 06/Apr/20 22:56
Worklog Time Spent: 10m
Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3
URL: https://github.com/apache/beam/pull/11037#issuecomment-610078632

   Run Java Spark PortableValidatesRunner Batch

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
-------------------

Worklog Id: (was: 417263)
Time Spent: 6h 20m (was: 6h 10m)

> Improve Spark runner reshuffle translation to maximize parallelism
> ------------------------------------------------------------------
>
> Key: BEAM-9434
> URL: https://issues.apache.org/jira/browse/BEAM-9434
> Project: Beam
> Issue Type: Improvement
> Components: runner-spark
> Affects Versions: 2.19.0
> Reporter: Emiliano Capoccia
> Assignee: Emiliano Capoccia
> Priority: Minor
> Fix For: 2.21.0
>
> Time Spent: 6h 20m
> Remaining Estimate: 0h
>
> There is a performance issue when processing a large number of small Avro
> files (tens of thousands or more) in Spark on k8s.
> The recommended way to read a pattern of Avro files in Beam is:
>
> {code:java}
> PCollection<AvroGenClass> records = p.apply(AvroIO.read(AvroGenClass.class)
>     .from("s3://my-bucket/path-to/*.avro")
>     .withHintMatchesManyFiles());
> {code}
> However, with many small files the above results in the entire read taking
> place in a single task/node, which is considerably slow and has scalability
> issues.
> Omitting the hint is not viable either, as it results in too many tasks
> being spawned and the cluster staying busy coordinating tiny tasks with
> high overhead.
> There are a few workarounds on the internet, which mainly revolve around
> compacting the input files before processing, so that a reduced number of
> bulky files is processed in parallel.
> It seems the Spark runner uses the parallelism of the input distributed
> collection (RDD) to calculate the number of partitions in Reshuffle. In the
> case of FileIO/AvroIO, if the input pattern is a regex, the size of the
> input is 1, which is far from an optimal parallelism value. We may fix this
> by improving the translation of Reshuffle to maximize parallelism.

-- 
This message was sent by Atlassian Jira
(v8.3.4#803005)
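The partition-count problem described above can be illustrated outside of Beam or Spark. The sketch below is a hypothetical, conceptual model (not Beam's actual Reshuffle implementation): hash-partitioning elements into N buckets, which is roughly what a shuffle boundary with an explicit partition count does. When the partition count is taken from the input (1 for a single matched pattern), all elements land on one worker; with a cluster-sized count, the work spreads out.

```python
# Conceptual sketch only -- not Beam/Spark code. Illustrates why the
# partition count chosen at a reshuffle boundary controls parallelism.

def reshuffle(elements, num_partitions):
    """Assign each element to a partition by hash, like a shuffle boundary."""
    partitions = [[] for _ in range(num_partitions)]
    for element in elements:
        partitions[hash(element) % num_partitions].append(element)
    return partitions

# Hypothetical workload: tens of thousands of small Avro files.
files = [f"part-{i:05d}.avro" for i in range(10_000)]

# Parallelism inherited from the input (1 partition): one worker reads
# every file serially.
single = reshuffle(files, 1)
assert len(single[0]) == len(files)

# Partition count sized to the cluster: the read is spread across workers.
spread = reshuffle(files, 64)
assert sum(len(p) for p in spread) == len(files)
assert max(len(p) for p in spread) < len(files)
```

The file names and the partition count 64 are invented for illustration; the point is only that the downstream parallelism equals the number of partitions produced at the shuffle, not the size of the data.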
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417264&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417264 ]

ASF GitHub Bot logged work on BEAM-9434:

Author: ASF GitHub Bot
Created on: 06/Apr/20 22:56
Start Date: 06/Apr/20 22:56
Worklog Time Spent: 10m
Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3
URL: https://github.com/apache/beam/pull/11037#issuecomment-610078731

   Run Java Spark PortableValidatesRunner Batch

Issue Time Tracking
-------------------

Worklog Id: (was: 417264)
Time Spent: 6.5h (was: 6h 20m)
[jira] [Work logged] (BEAM-9557) Error setting processing time timers near end-of-window
[ https://issues.apache.org/jira/browse/BEAM-9557?focusedWorklogId=417262&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417262 ]

ASF GitHub Bot logged work on BEAM-9557:

Author: ASF GitHub Bot
Created on: 06/Apr/20 22:56
Start Date: 06/Apr/20 22:56
Worklog Time Spent: 10m
Work Description: amaliujia commented on issue #11226: [BEAM-9557] Fix timer window boundary checking
URL: https://github.com/apache/beam/pull/11226#issuecomment-610078537

   @reuvenlax do you need help on the failed java tests?

Issue Time Tracking
-------------------

Worklog Id: (was: 417262)
Time Spent: 7h 10m (was: 7h)

> Error setting processing time timers near end-of-window
> -------------------------------------------------------
>
> Key: BEAM-9557
> URL: https://issues.apache.org/jira/browse/BEAM-9557
> Project: Beam
> Issue Type: Bug
> Components: runner-core
> Reporter: Steve Niemitz
> Assignee: Reuven Lax
> Priority: Critical
> Fix For: 2.20.0
>
> Time Spent: 7h 10m
> Remaining Estimate: 0h
>
> Previously, it was possible to set a processing time timer past the end of
> a window, and it would simply not fire.
> Now, however, this results in an error:
> {code:java}
> java.lang.IllegalArgumentException: Attempted to set event time timer that
> outputs for 2020-03-19T18:01:35.000Z but that is after the expiration of
> window 2020-03-19T17:59:59.999Z
>   org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Preconditions.checkArgument(Preconditions.java:440)
>   org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner$TimerInternalsTimer.setAndVerifyOutputTimestamp(SimpleDoFnRunner.java:1011)
>   org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner$TimerInternalsTimer.setRelative(SimpleDoFnRunner.java:934)
>   .processElement(???.scala:187)
> {code}
> I think the regression was introduced in commit
> a005fd765a762183ca88df90f261f6d4a20cf3e0. Note also that the error message
> itself is wrong: it says "event time timer", but the timer is in the
> processing time domain.

-- 
This message was sent by Atlassian Jira
(v8.3.4#803005)
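The boundary check at issue can be sketched in a few lines. This is a hypothetical model, not Beam's actual `SimpleDoFnRunner` code: an event-time timer whose output timestamp falls after the window's expiration is rejected, while a processing-time timer may legitimately target a wall-clock instant past the window end (it simply fires without visible output once the window has expired). The timestamps and zero allowed lateness are illustrative assumptions.

```python
# Hypothetical sketch of a timer boundary check -- not Beam's actual
# implementation. Timestamps are epoch milliseconds.

WINDOW_END_MS = 1_584_640_799_999   # e.g. 2020-03-19T17:59:59.999Z
ALLOWED_LATENESS_MS = 0             # assumed zero for this sketch

def set_timer(target_ms, time_domain):
    """Validate and 'set' a timer; returns (target, domain) on success.

    Only event-time targets are compared against the window's expiry:
    an event-time timer past expiry can never produce valid output, so
    it is rejected. A processing-time timer is wall-clock based and is
    accepted regardless of the window's event-time bounds.
    """
    window_expiry_ms = WINDOW_END_MS + ALLOWED_LATENESS_MS
    if time_domain == "EVENT_TIME" and target_ms > window_expiry_ms:
        raise ValueError(
            f"Attempted to set event time timer for {target_ms} "
            f"but that is after the expiration of window {window_expiry_ms}")
    return (target_ms, time_domain)

# A processing-time timer ~95s past the window end is accepted.
set_timer(WINDOW_END_MS + 95_001, "PROCESSING_TIME")
```

Under this model, the pre-regression behavior corresponds to routing processing-time timers around the event-time check, which also explains why the reported error message misleadingly says "event time timer".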
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417258&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417258 ]

ASF GitHub Bot logged work on BEAM-9434:

Author: ASF GitHub Bot
Created on: 06/Apr/20 22:55
Start Date: 06/Apr/20 22:55
Worklog Time Spent: 10m
Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3
URL: https://github.com/apache/beam/pull/11037#issuecomment-610078304

   Run Spark Runner Nexmark Tests

Issue Time Tracking
-------------------

Worklog Id: (was: 417258)
Time Spent: 5h 50m (was: 5h 40m)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417261&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417261 ]

ASF GitHub Bot logged work on BEAM-9434:

Author: ASF GitHub Bot
Created on: 06/Apr/20 22:55
Start Date: 06/Apr/20 22:55
Worklog Time Spent: 10m
Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3
URL: https://github.com/apache/beam/pull/11037#issuecomment-610078527

   Run Java Spark PortableValidatesRunner Batch

Issue Time Tracking
-------------------

Worklog Id: (was: 417261)
Time Spent: 6h 10m (was: 6h)
[jira] [Work logged] (BEAM-9557) Error setting processing time timers near end-of-window
[ https://issues.apache.org/jira/browse/BEAM-9557?focusedWorklogId=417256&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417256 ]

ASF GitHub Bot logged work on BEAM-9557:

Author: ASF GitHub Bot
Created on: 06/Apr/20 22:55
Start Date: 06/Apr/20 22:55
Worklog Time Spent: 10m
Work Description: amaliujia commented on issue #11226: [BEAM-9557] Fix timer window boundary checking
URL: https://github.com/apache/beam/pull/11226#issuecomment-610078277

   LGTM

Issue Time Tracking
-------------------

Worklog Id: (was: 417256)
Time Spent: 7h (was: 6h 50m)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417257&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417257 ]

ASF GitHub Bot logged work on BEAM-9434:

Author: ASF GitHub Bot
Created on: 06/Apr/20 22:55
Start Date: 06/Apr/20 22:55
Worklog Time Spent: 10m
Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3
URL: https://github.com/apache/beam/pull/11037#issuecomment-610078304

   Run Spark Runner Nexmark Tests

Issue Time Tracking
-------------------

Worklog Id: (was: 417257)
Time Spent: 5h 40m (was: 5.5h)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417255&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417255 ]

ASF GitHub Bot logged work on BEAM-9434:

Author: ASF GitHub Bot
Created on: 06/Apr/20 22:55
Start Date: 06/Apr/20 22:55
Worklog Time Spent: 10m
Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3
URL: https://github.com/apache/beam/pull/11037#issuecomment-610078236

   Run Spark Runner Nexmark Tests

Issue Time Tracking
-------------------

Worklog Id: (was: 417255)
Time Spent: 5.5h (was: 5h 20m)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417259&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417259 ]

ASF GitHub Bot logged work on BEAM-9434:

Author: ASF GitHub Bot
Created on: 06/Apr/20 22:55
Start Date: 06/Apr/20 22:55
Worklog Time Spent: 10m
Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3
URL: https://github.com/apache/beam/pull/11037#issuecomment-610077933

   Run Spark Runner Nexmark Tests

Issue Time Tracking
-------------------

Worklog Id: (was: 417259)
Time Spent: 6h (was: 5h 50m)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417254&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417254 ]

ASF GitHub Bot logged work on BEAM-9434:

Author: ASF GitHub Bot
Created on: 06/Apr/20 22:54
Start Date: 06/Apr/20 22:54
Worklog Time Spent: 10m
Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3
URL: https://github.com/apache/beam/pull/11037#issuecomment-610078155

   Run Spark ValidatesRunner

Issue Time Tracking
-------------------

Worklog Id: (was: 417254)
Time Spent: 5h 20m (was: 5h 10m)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417252&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417252 ]

ASF GitHub Bot logged work on BEAM-9434:

Author: ASF GitHub Bot
Created on: 06/Apr/20 22:54
Start Date: 06/Apr/20 22:54
Worklog Time Spent: 10m
Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3
URL: https://github.com/apache/beam/pull/11037#issuecomment-610077676

   Run Spark ValidatesRunner

Issue Time Tracking
-------------------

Worklog Id: (was: 417252)
Time Spent: 5h (was: 4h 50m)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417253&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417253 ]

ASF GitHub Bot logged work on BEAM-9434:

Author: ASF GitHub Bot
Created on: 06/Apr/20 22:54
Start Date: 06/Apr/20 22:54
Worklog Time Spent: 10m
Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3
URL: https://github.com/apache/beam/pull/11037#issuecomment-610078155

   Run Spark ValidatesRunner

Issue Time Tracking
-------------------

Worklog Id: (was: 417253)
Time Spent: 5h 10m (was: 5h)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417251&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417251 ]

ASF GitHub Bot logged work on BEAM-9434:

Author: ASF GitHub Bot
Created on: 06/Apr/20 22:54
Start Date: 06/Apr/20 22:54
Worklog Time Spent: 10m
Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3
URL: https://github.com/apache/beam/pull/11037#issuecomment-610077992

   Run Spark ValidatesRunner

Issue Time Tracking
-------------------

Worklog Id: (was: 417251)
Time Spent: 4h 50m (was: 4h 40m)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417248&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417248 ]
ASF GitHub Bot logged work on BEAM-9434:
Author: ASF GitHub Bot
Created on: 06/Apr/20 22:53
Start Date: 06/Apr/20 22:53
Worklog Time Spent: 10m
Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3
URL: https://github.com/apache/beam/pull/11037#issuecomment-610077189
retest this please
Issue Time Tracking
---
Worklog Id: (was: 417248) Time Spent: 4h 20m (was: 4h 10m)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417247&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417247 ]
ASF GitHub Bot logged work on BEAM-9434:
Author: ASF GitHub Bot
Created on: 06/Apr/20 22:53
Start Date: 06/Apr/20 22:53
Worklog Time Spent: 10m
Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3
URL: https://github.com/apache/beam/pull/11037#issuecomment-610077676
Run Spark ValidatesRunner
Issue Time Tracking
---
Worklog Id: (was: 417247) Time Spent: 4h 10m (was: 4h)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417250&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417250 ]
ASF GitHub Bot logged work on BEAM-9434:
Author: ASF GitHub Bot
Created on: 06/Apr/20 22:53
Start Date: 06/Apr/20 22:53
Worklog Time Spent: 10m
Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3
URL: https://github.com/apache/beam/pull/11037#issuecomment-610077933
Run Spark Runner Nexmark Tests
Issue Time Tracking
---
Worklog Id: (was: 417250) Time Spent: 4h 40m (was: 4.5h)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417249&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417249 ]
ASF GitHub Bot logged work on BEAM-9434:
Author: ASF GitHub Bot
Created on: 06/Apr/20 22:53
Start Date: 06/Apr/20 22:53
Worklog Time Spent: 10m
Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3
URL: https://github.com/apache/beam/pull/11037#issuecomment-610077560
retest this please
Issue Time Tracking
---
Worklog Id: (was: 417249) Time Spent: 4.5h (was: 4h 20m)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417246&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417246 ]
ASF GitHub Bot logged work on BEAM-9434:
Author: ASF GitHub Bot
Created on: 06/Apr/20 22:52
Start Date: 06/Apr/20 22:52
Worklog Time Spent: 10m
Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3
URL: https://github.com/apache/beam/pull/11037#issuecomment-610077560
retest this please
Issue Time Tracking
---
Worklog Id: (was: 417246) Time Spent: 4h (was: 3h 50m)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417245&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417245 ]
ASF GitHub Bot logged work on BEAM-9434:
Author: ASF GitHub Bot
Created on: 06/Apr/20 22:51
Start Date: 06/Apr/20 22:51
Worklog Time Spent: 10m
Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3
URL: https://github.com/apache/beam/pull/11037#issuecomment-610077189
retest this please
Issue Time Tracking
---
Worklog Id: (was: 417245) Time Spent: 3h 50m (was: 3h 40m)