[jira] [Closed] (BEAM-9705) Bug on database.io for writing with batch size 1
[ https://issues.apache.org/jira/browse/BEAM-9705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrian Eka Sanjaya closed BEAM-9705.
Resolution: Fixed

> Bug on database.io for writing with batch size 1
> ------------------------------------------------
>
>                  Key: BEAM-9705
>                  URL: https://issues.apache.org/jira/browse/BEAM-9705
>              Project: Beam
>           Issue Type: Bug
>           Components: dsl-sql
>     Affects Versions: 2.19.0
>          Environment: Ubuntu 18.04
>             Reporter: Adrian Eka Sanjaya
>             Priority: Major
>               Labels: patch
>              Fix For: Not applicable
>
>    Original Estimate: 1h
>           Time Spent: 1h 40m
>   Remaining Estimate: 0h
>
> Setting the batch size to 1 breaks the database io library when it tries
> to write the last element.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
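The bug report describes a batching writer that breaks on the final flush when the batch size is 1. A minimal pure-Python sketch of that failure mode and its guard (the function and names are hypothetical illustrations, not the Go SDK's actual databaseio code):

```python
def write_batched(rows, batch_size, write_fn):
    """Buffer rows and flush a batch whenever it reaches batch_size.

    With batch_size == 1 every row is flushed inside the loop, so the
    final flush must be guarded: an unconditional trailing write_fn(batch)
    would emit an empty (invalid) batch, which is the class of bug the
    issue describes.
    """
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            write_fn(batch)
            batch = []
    if batch:  # guard: skip the empty trailing batch
        write_fn(batch)


batches = []
write_batched(["a", "b", "c"], 1, batches.append)
# Every emitted batch holds exactly one row; no empty batch is written.
```

The `if batch:` length check is the kind of value-length validation the linked PR title refers to.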
[jira] [Work logged] (BEAM-9705) Bug on database.io for writing with batch size 1
[ https://issues.apache.org/jira/browse/BEAM-9705?focusedWorklogId=417459&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417459 ]

ASF GitHub Bot logged work on BEAM-9705:
Author: ASF GitHub Bot
Created on: 07/Apr/20 05:33
Worklog Time Spent: 10m
Work Description: youngoli commented on pull request #11323: [BEAM-9705] Go sdk add value length validation checking on write to d…
URL: https://github.com/apache/beam/pull/11323

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Worklog Id: (was: 417459)
Time Spent: 1h 40m (was: 1.5h)
[jira] [Work logged] (BEAM-9147) [Java] PTransform that integrates Video Intelligence functionality
[ https://issues.apache.org/jira/browse/BEAM-9147?focusedWorklogId=417448&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417448 ]

ASF GitHub Bot logged work on BEAM-9147:
Author: ASF GitHub Bot
Created on: 07/Apr/20 04:58
Worklog Time Spent: 10m
Work Description: Ardagan commented on issue #11261: [BEAM-9147] Add a VideoIntelligence transform to Java SDK
URL: https://github.com/apache/beam/pull/11261#issuecomment-610173883

The Community Metrics failure should be irrelevant.

Worklog Id: (was: 417448)
Time Spent: 2.5h (was: 2h 20m)

> [Java] PTransform that integrates Video Intelligence functionality
> ------------------------------------------------------------------
>
>          Key: BEAM-9147
>          URL: https://issues.apache.org/jira/browse/BEAM-9147
>      Project: Beam
>   Issue Type: Sub-task
>   Components: io-java-gcp
>     Reporter: Kamil Wasilewski
>     Assignee: Michał Walenia
>     Priority: Major
>   Time Spent: 2.5h
>   Remaining Estimate: 0h
>
> The goal is to create a PTransform that integrates Google Cloud Video
> Intelligence functionality [1]. The transform should accept either a
> video's GCS location or raw video bytes as input. A module with the
> transform should be put into the `sdks/java/extensions` folder.
>
> [1] https://cloud.google.com/video-intelligence/
[jira] [Work logged] (BEAM-9147) [Java] PTransform that integrates Video Intelligence functionality
[ https://issues.apache.org/jira/browse/BEAM-9147?focusedWorklogId=417441&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417441 ]

ASF GitHub Bot logged work on BEAM-9147:
Author: ASF GitHub Bot
Created on: 07/Apr/20 04:50
Worklog Time Spent: 10m
Work Description: Ardagan commented on issue #11261: [BEAM-9147] Add a VideoIntelligence transform to Java SDK
URL: https://github.com/apache/beam/pull/11261#issuecomment-610171762

Run CommunityMetrics PreCommit

Worklog Id: (was: 417441)
Time Spent: 2h 20m (was: 2h 10m)
[jira] [Work logged] (BEAM-9639) Abstract bundle execution logic from stage execution logic
[ https://issues.apache.org/jira/browse/BEAM-9639?focusedWorklogId=417426&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417426 ]

ASF GitHub Bot logged work on BEAM-9639:
Author: ASF GitHub Bot
Created on: 07/Apr/20 03:17
Worklog Time Spent: 10m
Work Description: pabloem commented on issue #11270: [BEAM-9639][BEAM-9608] Improvements for FnApiRunner
URL: https://github.com/apache/beam/pull/11270#issuecomment-610150292

@robertwb ptal

Worklog Id: (was: 417426)
Time Spent: 0.5h (was: 20m)

> Abstract bundle execution logic from stage execution logic
> ----------------------------------------------------------
>
>          Key: BEAM-9639
>          URL: https://issues.apache.org/jira/browse/BEAM-9639
>      Project: Beam
>   Issue Type: Sub-task
>   Components: sdk-py-core
>     Reporter: Pablo Estrada
>     Assignee: Pablo Estrada
>     Priority: Major
>   Time Spent: 0.5h
>   Remaining Estimate: 0h
>
> The FnApiRunner currently works in a per-stage manner and does not abstract
> single-bundle execution much. This work item is to clearly define the code
> that executes a single bundle.
[jira] [Work logged] (BEAM-9650) Add consistent slowly changing side inputs support
[ https://issues.apache.org/jira/browse/BEAM-9650?focusedWorklogId=417425&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417425 ]

ASF GitHub Bot logged work on BEAM-9650:
Author: ASF GitHub Bot
Created on: 07/Apr/20 03:16
Worklog Time Spent: 10m
Work Description: Ardagan commented on issue #11182: [BEAM-9650] Add PeriodicImpulse Transform and slowly changing side input documentation
URL: https://github.com/apache/beam/pull/11182#issuecomment-610150038

> What is the expected behavior around lifecycle events for runners that support drain / update? Does it need to be explicitly documented?

Processing will stop on drain, so it should not cause any issues.

Worklog Id: (was: 417425)
Time Spent: 1h 20m (was: 1h 10m)

> Add consistent slowly changing side inputs support
> --------------------------------------------------
>
>          Key: BEAM-9650
>          URL: https://issues.apache.org/jira/browse/BEAM-9650
>      Project: Beam
>   Issue Type: Bug
>   Components: io-ideas
>     Reporter: Mikhail Gryzykhin
>     Assignee: Mikhail Gryzykhin
>     Priority: Major
>   Time Spent: 1h 20m
>   Remaining Estimate: 0h
>
> Add an implementation for slowly changing dimensions based on the
> [design doc](https://docs.google.com/document/d/1LDY_CtsOJ8Y_zNv1QtkP6AGFrtzkj1q5EW_gSChOIvg/edit)
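The PR above pairs a PeriodicImpulse transform with a side input that is re-read on every tick. A minimal pure-Python analogue of that refresh pattern (the tick generator and `load_lookup` loader are hypothetical stand-ins for the Beam transform, not its implementation):

```python
def periodic_impulse(start, stop, interval):
    """Yield firing timestamps from start to stop every interval seconds,
    mimicking the role of a periodic impulse source."""
    t = start
    while t <= stop:
        yield t
        t += interval


def enrich(events, ticks, load_lookup):
    """Re-load the slowly changing lookup on each tick, then join events
    (as (timestamp, key) pairs, in timestamp order) against the freshest
    snapshot seen so far."""
    lookup = {}
    tick_iter = iter(ticks)
    next_tick = next(tick_iter, None)
    for ts, key in events:
        # Refresh the side input for every tick at or before this event.
        while next_tick is not None and next_tick <= ts:
            lookup = load_lookup(next_tick)
            next_tick = next(tick_iter, None)
        yield key, lookup.get(key)


snapshots = list(enrich([(0, "k"), (15, "k")],
                        periodic_impulse(0, 20, 10),
                        lambda t: {"k": "v%d" % t}))
```

The drain question quoted in the comment maps onto the tick stream simply ending: once `periodic_impulse` stops yielding, the last-loaded snapshot is used for all remaining events.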
[jira] [Work logged] (BEAM-9705) Bug on database.io for writing with batch size 1
[ https://issues.apache.org/jira/browse/BEAM-9705?focusedWorklogId=417401&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417401 ]

ASF GitHub Bot logged work on BEAM-9705:
Author: ASF GitHub Bot
Created on: 07/Apr/20 02:51
Worklog Time Spent: 10m
Work Description: adrian3ka commented on issue #11323: [BEAM-9705] Go sdk add value length validation checking on write to d…
URL: https://github.com/apache/beam/pull/11323#issuecomment-610142249

@youngoli I think it's better to merge this earlier because it's the main IO for writing to a database, so the databaseio will be more stable in the next release. Especially with unbounded input, we can't control how much data will be processed in each batch. Thank you.

Worklog Id: (was: 417401)
Time Spent: 1.5h (was: 1h 20m)
[jira] [Work logged] (BEAM-9705) Bug on database.io for writing with batch size 1
[ https://issues.apache.org/jira/browse/BEAM-9705?focusedWorklogId=417396&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417396 ]

ASF GitHub Bot logged work on BEAM-9705:
Author: ASF GitHub Bot
Created on: 07/Apr/20 02:45
Worklog Time Spent: 10m
Work Description: adrian3ka commented on issue #11323: [BEAM-9705] Go sdk add value length validation checking on write to d…
URL: https://github.com/apache/beam/pull/11323#issuecomment-610142249

Worklog Id: (was: 417396)
Time Spent: 1h 20m (was: 1h 10m)
[jira] [Work logged] (BEAM-9691) Ensure Dataflow BQ Native sinks are not used on FnApi
[ https://issues.apache.org/jira/browse/BEAM-9691?focusedWorklogId=417370&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417370 ]

ASF GitHub Bot logged work on BEAM-9691:
Author: ASF GitHub Bot
Created on: 07/Apr/20 02:16
Worklog Time Spent: 10m
Work Description: pabloem commented on issue #11309: [BEAM-9691] Ensuring BQ Native Sink is avoided on FnApi pipelines
URL: https://github.com/apache/beam/pull/11309#issuecomment-610134389

Run Python 3.5 PostCommit

Worklog Id: (was: 417370)
Time Spent: 2h (was: 1h 50m)

> Ensure Dataflow BQ Native sinks are not used on FnApi
> -----------------------------------------------------
>
>          Key: BEAM-9691
>          URL: https://issues.apache.org/jira/browse/BEAM-9691
>      Project: Beam
>   Issue Type: Bug
>   Components: io-py-gcp
>     Reporter: Pablo Estrada
>     Assignee: Pablo Estrada
>     Priority: Major
>   Time Spent: 2h
>   Remaining Estimate: 0h
[jira] [Work logged] (BEAM-9715) annotations_test fails in some environments
[ https://issues.apache.org/jira/browse/BEAM-9715?focusedWorklogId=417365&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417365 ]

ASF GitHub Bot logged work on BEAM-9715:
Author: ASF GitHub Bot
Created on: 07/Apr/20 02:00
Worklog Time Spent: 10m
Work Description: pabloem commented on issue #11329: [BEAM-9715] Ensuring annotations_test passes in all environments
URL: https://github.com/apache/beam/pull/11329#issuecomment-610130043

r: @udim

Worklog Id: (was: 417365)
Time Spent: 20m (was: 10m)

> annotations_test fails in some environments
> -------------------------------------------
>
>          Key: BEAM-9715
>          URL: https://issues.apache.org/jira/browse/BEAM-9715
>      Project: Beam
>   Issue Type: Bug
>   Components: sdk-py-core
>     Reporter: Pablo Estrada
>     Assignee: Pablo Estrada
>     Priority: Major
>   Time Spent: 20m
>   Remaining Estimate: 0h
[jira] [Updated] (BEAM-9715) annotations_test fails in some environments
[ https://issues.apache.org/jira/browse/BEAM-9715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pablo Estrada updated BEAM-9715:
Priority: Minor (was: Major)
[jira] [Updated] (BEAM-9715) annotations_test fails in some environments
[ https://issues.apache.org/jira/browse/BEAM-9715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pablo Estrada updated BEAM-9715:
Status: Open (was: Triage Needed)
[jira] [Work logged] (BEAM-9715) annotations_test fails in some environments
[ https://issues.apache.org/jira/browse/BEAM-9715?focusedWorklogId=417364&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417364 ]

ASF GitHub Bot logged work on BEAM-9715:
Author: ASF GitHub Bot
Created on: 07/Apr/20 01:59
Worklog Time Spent: 10m
Work Description: pabloem commented on pull request #11329: [BEAM-9715] Ensuring annotations_test passes in all environments
URL: https://github.com/apache/beam/pull/11329
[jira] [Created] (BEAM-9715) annotations_test fails in some environments
Pablo Estrada created BEAM-9715:
-----------------------------------
Summary: annotations_test fails in some environments
Key: BEAM-9715
URL: https://issues.apache.org/jira/browse/BEAM-9715
Project: Beam
Issue Type: Bug
Components: sdk-py-core
Reporter: Pablo Estrada
Assignee: Pablo Estrada
[jira] [Commented] (BEAM-9147) [Java] PTransform that integrates Video Intelligence functionality
[ https://issues.apache.org/jira/browse/BEAM-9147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17076811#comment-17076811 ]

Masud Hasan commented on BEAM-9147:
-----------------------------------
Yes, in a Beam context: FileIO watch or PubSubIO.

My understanding is that the client SDK does not fully support live streaming yet; not sure if that has changed.
https://cloud.google.com/video-intelligence/docs/streaming/live-streaming-overview

Let's say I have a 50 MB file (GCS URI, feature config, and context received as a Pub/Sub or JSON message) and can call the API with at most 3 concurrent requests. Using the streaming VI API, I would think I could send out 15 MB chunks in parallel for faster performance. Do you think I can build such a request using this PTransform?
https://cloud.google.com/video-intelligence/docs/streaming/label-analysis
[jira] [Work logged] (BEAM-8910) Use AVRO instead of JSON in BigQuery bounded source.
[ https://issues.apache.org/jira/browse/BEAM-8910?focusedWorklogId=417355&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417355 ]

ASF GitHub Bot logged work on BEAM-8910:
Author: ASF GitHub Bot
Created on: 07/Apr/20 01:26
Worklog Time Spent: 10m
Work Description: pabloem commented on pull request #11086: [BEAM-8910] Make custom BQ source read from Avro
URL: https://github.com/apache/beam/pull/11086#discussion_r404483110

File path: sdks/python/apache_beam/io/gcp/bigquery.py

    @@ -663,14 +662,10 @@ def split(self, desired_bundle_size, start_position=None, stop_position=None):
           self._setup_temporary_dataset(bq)
           self.table_reference = self._execute_query(bq)
    -      schema, metadata_list = self._export_files(bq)
    +      unused_schema, metadata_list = self._export_files(bq)

Review comment: We may do that, but if we end up keeping a backwards-compatibility flag, we'll need to keep the coder as-is.

Worklog Id: (was: 417355)
Time Spent: 4h (was: 3h 50m)

> Use AVRO instead of JSON in BigQuery bounded source.
> ----------------------------------------------------
>
>          Key: BEAM-8910
>          URL: https://issues.apache.org/jira/browse/BEAM-8910
>      Project: Beam
>   Issue Type: Improvement
>   Components: sdk-py-core
>     Reporter: Kamil Wasilewski
>     Assignee: Pablo Estrada
>     Priority: Minor
>   Time Spent: 4h
>   Remaining Estimate: 0h
>
> The proposed BigQuery bounded source in the Python SDK (see PR:
> https://github.com/apache/beam/pull/9772) uses a BigQuery export job to
> take a snapshot of the table and read from each produced JSON file. A
> performance improvement can be gained by switching to AVRO instead.
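Part of the motivation for the Avro switch is that a JSON table snapshot arrives as text and must be parsed and re-cast field by field, while Avro files carry the schema's types. A small pure-Python illustration of the decode work a JSON-based reader has to do (the sample row and field types are made up for illustration, not taken from Beam's reader code):

```python
import json

# In a JSON export, values such as 64-bit integers arrive as strings,
# so the reader must re-cast each field against the table schema.
json_line = '{"user_id": "42", "score": "3.5"}'
schema = {"user_id": int, "score": float}


def parse_json_row(line, schema):
    """Parse one exported JSON line and cast each field to its schema type."""
    raw = json.loads(line)
    return {name: cast(raw[name]) for name, cast in schema.items()}


row = parse_json_row(json_line, schema)
# An Avro-based reader would hand back typed records directly,
# skipping both the text parse and the per-field casts.
```

This per-row parse-and-cast cost, multiplied over a large export, is the overhead the issue's proposed Avro path avoids.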
[jira] [Work logged] (BEAM-9468) Add Google Cloud Healthcare API IO Connectors
[ https://issues.apache.org/jira/browse/BEAM-9468?focusedWorklogId=417350&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417350 ]

ASF GitHub Bot logged work on BEAM-9468:
Author: ASF GitHub Bot
Created on: 07/Apr/20 01:13
Worklog Time Spent: 10m
Work Description: jaketf commented on issue #11151: [BEAM-9468] Hl7v2 io
URL: https://github.com/apache/beam/pull/11151#issuecomment-610050634

Next Steps (based on offline feedback):
- [x] Improve API for users:
  - [x] Add static methods for common patterns with `ListHL7v2Messages`
  - [x] Add `ValueProvider` support to ease use in the DataflowTemplates
    - [x] `ListHL7v2Messages` (hl7v2Store and filter)
    - [x] `Write` (hl7v2Store)
- [ ] "standardize" integration tests
  - [x] Refactor ITs to create / destroy the HL7v2 store under a parameterized dataset in `@BeforeClass` / `@AfterClass` to avoid issues with parallel test runs.
  - [x] Remove hard-coding of my HL7v2 store / project in integration tests.
  - [ ] Add a Healthcare API dataset to the Beam integration test project (pending permissions in [this dev list thread](https://lists.apache.org/thread.html/rebe5cd40a40a9fc7f2c1d563b48ee1ce4ff9cac3dfdc0258006cc686%40%3Cdev.beam.apache.org%3E))

Worklog Id: (was: 417350)
Time Spent: 13h (was: 12h 50m)

> Add Google Cloud Healthcare API IO Connectors
> ---------------------------------------------
>
>          Key: BEAM-9468
>          URL: https://issues.apache.org/jira/browse/BEAM-9468
>      Project: Beam
>   Issue Type: New Feature
>   Components: io-java-gcp
>     Reporter: Jacob Ferriero
>     Assignee: Jacob Ferriero
>     Priority: Minor
>   Time Spent: 13h
>   Remaining Estimate: 0h
>
> Add IO transforms for the HL7v2, FHIR and DICOM stores in the
> [Google Cloud Healthcare API](https://cloud.google.com/healthcare/docs/):
> HL7v2IO, FHIRIO, DICOM.
[jira] [Work logged] (BEAM-9468) Add Google Cloud Healthcare API IO Connectors
[ https://issues.apache.org/jira/browse/BEAM-9468?focusedWorklogId=417349&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417349 ]

ASF GitHub Bot logged work on BEAM-9468:
Author: ASF GitHub Bot
Created on: 07/Apr/20 01:12
Worklog Time Spent: 10m
Work Description: jaketf commented on issue #11151: [BEAM-9468] Hl7v2 io
URL: https://github.com/apache/beam/pull/11151#issuecomment-610050634

Worklog Id: (was: 417349)
Time Spent: 12h 50m (was: 12h 40m)
[jira] [Work logged] (BEAM-9468) Add Google Cloud Healthcare API IO Connectors
[ https://issues.apache.org/jira/browse/BEAM-9468?focusedWorklogId=417348&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417348 ]

ASF GitHub Bot logged work on BEAM-9468:
Author: ASF GitHub Bot
Created on: 07/Apr/20 01:11
Worklog Time Spent: 10m
Work Description: jaketf commented on issue #11151: [BEAM-9468] Hl7v2 io
URL: https://github.com/apache/beam/pull/11151#issuecomment-610050634

Worklog Id: (was: 417348)
Time Spent: 12h 40m (was: 12.5h)
[jira] [Work logged] (BEAM-4374) Update existing metrics in the FN API to use new Metric Schema
[ https://issues.apache.org/jira/browse/BEAM-4374?focusedWorklogId=417343&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417343 ]

ASF GitHub Bot logged work on BEAM-4374:
Author: ASF GitHub Bot
Created on: 07/Apr/20 00:51
Worklog Time Spent: 10m
Work Description: lukecwik commented on pull request #11325: [BEAM-4374, BEAM-6189] Delete and remove deprecated Metrics proto
URL: https://github.com/apache/beam/pull/11325

Worklog Id: (was: 417343)
Time Spent: 41h 10m (was: 41h)

> Update existing metrics in the FN API to use new Metric Schema
> --------------------------------------------------------------
>
>          Key: BEAM-4374
>          URL: https://issues.apache.org/jira/browse/BEAM-4374
>      Project: Beam
>   Issue Type: New Feature
>   Components: beam-model
>     Reporter: Alex Amato
>     Priority: Major
>   Time Spent: 41h 10m
>   Remaining Estimate: 0h
>
> Update existing metrics to use the new proto and cataloging schema defined in
> https://s.apache.org/beam-fn-api-metrics:
> * Check in the new protos
> * Define a catalog file for metrics
> * Port existing metrics to this new format, based on catalog names + metadata
[jira] [Updated] (BEAM-9714) [Go SDK] Require --region flag in Dataflow runner
[ https://issues.apache.org/jira/browse/BEAM-9714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kyle Weaver updated BEAM-9714: -- Status: Open (was: Triage Needed) > [Go SDK] Require --region flag in Dataflow runner > - > > Key: BEAM-9714 > URL: https://issues.apache.org/jira/browse/BEAM-9714 > Project: Beam > Issue Type: Improvement > Components: sdk-go >Reporter: Kyle Weaver >Assignee: Kyle Weaver >Priority: Major > > We already require --region for Java and Python, so we should require it for Go > as well. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-9714) [Go SDK] Require --region flag in Dataflow runner
Kyle Weaver created BEAM-9714: - Summary: [Go SDK] Require --region flag in Dataflow runner Key: BEAM-9714 URL: https://issues.apache.org/jira/browse/BEAM-9714 Project: Beam Issue Type: Improvement Components: sdk-go Reporter: Kyle Weaver Assignee: Kyle Weaver We already require --region for Java and Python, so we should require it for Go as well. -- This message was sent by Atlassian Jira (v8.3.4#803005)
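The change requested above amounts to failing fast when the flag is absent, as the Java and Python SDKs already do. A minimal sketch of that kind of required-option check (hypothetical helper and option names, shown in Python for brevity; the real Go SDK validation lives in the Dataflow runner's option parsing):

```python
def validate_dataflow_options(options):
    """Fail fast when required Dataflow options are missing.

    Hypothetical sketch mirroring the --region requirement already
    enforced by the Java and Python SDKs (BEAM-9199).
    """
    if not options.get("project"):
        raise ValueError("--project is required when using the Dataflow runner")
    if not options.get("region"):
        raise ValueError("--region is required when using the Dataflow runner")
    return options
```

A pipeline launched without `--region` would then fail at submission time rather than falling back to an implicit default region.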
[jira] [Work logged] (BEAM-9008) Add readAll() method to CassandraIO
[ https://issues.apache.org/jira/browse/BEAM-9008?focusedWorklogId=417341=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417341 ] ASF GitHub Bot logged work on BEAM-9008: Author: ASF GitHub Bot Created on: 07/Apr/20 00:43 Start Date: 07/Apr/20 00:43 Worklog Time Spent: 10m Work Description: vmarquez commented on issue #10546: [BEAM-9008] Add CassandraIO readAll method URL: https://github.com/apache/beam/pull/10546#issuecomment-610109654 @iemejia is that a spurious failure or did something I do break the Flink test? I tested locally and all seems to work... LMK if you need anything from me. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 417341) Time Spent: 7h 50m (was: 7h 40m) > Add readAll() method to CassandraIO > --- > > Key: BEAM-9008 > URL: https://issues.apache.org/jira/browse/BEAM-9008 > Project: Beam > Issue Type: New Feature > Components: io-java-cassandra >Affects Versions: 2.16.0 >Reporter: vincent marquez >Assignee: vincent marquez >Priority: Minor > Time Spent: 7h 50m > Remaining Estimate: 0h > > When querying a large Cassandra database, it's often *much* more useful to > programmatically generate the queries needed to be run rather than reading > all partitions and attempting some filtering. > As an example: > {code:java} > public class Event { >@PartitionKey(0) public UUID accountId; >@PartitionKey(1)public String yearMonthDay; >@ClusteringKey public UUID eventId; >//other data... > }{code} > If there are ten years' worth of data, you may want to only query one year's > worth. Here each token range would represent one 'token' but all events for > the day. 
> {code:java} > Set<UUID> accounts = getRelevantAccounts(); > Set<String> dateRange = generateDateRange("2018-01-01", "2019-01-01"); > PCollection tokens = generateTokens(accounts, dateRange); > {code} > > I propose an additional _readAll()_ PTransform that can take a PCollection > of token ranges and can return a PCollection of what the query would > return. > *Question: How much code should be in common between both methods?* > Currently the read connector already groups all partitions into a List of > Token Ranges, so it would be simple to refactor the current read() based > method to a 'ParDo' based one and have them both share the same function. > Reasons against sharing code between read and readAll > * Not having the read based method return a BoundedSource connector would > mean losing the ability to know the size of the data returned > * Currently the CassandraReader executes all the grouped TokenRange queries > *asynchronously* which is (maybe?) fine when all that's happening is > splitting up all the partition ranges but terrible for executing potentially > millions of queries. > Reasons _for_ sharing code would be simplified code base and that both of > the above issues would most likely have a negligible performance impact. > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
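The proposal above fans a targeted query out per (accountId, yearMonthDay) partition instead of scanning the whole table. A rough sketch of the two helper functions named in the snippet, shown in Python for brevity (the original is Java; `generate_date_range` and `generate_tokens` are the snippet's own names, everything else is illustrative):

```python
from datetime import date, timedelta

def generate_date_range(start, end):
    # Yield "yyyy-MM-dd" strings for each day in [start, end).
    d, stop = date.fromisoformat(start), date.fromisoformat(end)
    while d < stop:
        yield d.isoformat()
        d += timedelta(days=1)

def generate_tokens(accounts, days):
    # One (accountId, yearMonthDay) pair per Cassandra partition,
    # matching the composite partition key on the Event table above.
    return [(account, day) for account in accounts for day in days]

days = list(generate_date_range("2018-01-01", "2019-01-01"))
# 365 day strings; with e.g. 100 accounts that is 36,500 targeted
# partition queries instead of a full-table read plus filtering.
```

A `readAll()`-style PTransform would then consume a PCollection of such keys (or token ranges derived from them) and issue one query per element.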
[jira] [Work logged] (BEAM-9685) Don't release Go SDK container until Go is officially supported.
[ https://issues.apache.org/jira/browse/BEAM-9685?focusedWorklogId=417340=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417340 ] ASF GitHub Bot logged work on BEAM-9685: Author: ASF GitHub Bot Created on: 07/Apr/20 00:43 Start Date: 07/Apr/20 00:43 Worklog Time Spent: 10m Work Description: Hannah-Jiang commented on pull request #11308: [BEAM-9685] remove Go SDK container from release process from 2.22.0 URL: https://github.com/apache/beam/pull/11308 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 417340) Time Spent: 2h 10m (was: 2h) > Don't release Go SDK container until Go is officially supported. > > > Key: BEAM-9685 > URL: https://issues.apache.org/jira/browse/BEAM-9685 > Project: Beam > Issue Type: Task > Components: build-system >Reporter: Hannah Jiang >Assignee: Hannah Jiang >Priority: Major > Fix For: 2.21.0 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > 1. Remove Go SDK container from release process. > 2. Update document about it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9685) Don't release Go SDK container until Go is officially supported.
[ https://issues.apache.org/jira/browse/BEAM-9685?focusedWorklogId=417339=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417339 ] ASF GitHub Bot logged work on BEAM-9685: Author: ASF GitHub Bot Created on: 07/Apr/20 00:43 Start Date: 07/Apr/20 00:43 Worklog Time Spent: 10m Work Description: Hannah-Jiang commented on issue #11308: [BEAM-9685] remove Go SDK container from release process from 2.22.0 URL: https://github.com/apache/beam/pull/11308#issuecomment-610109509 > Yes, the Go Postcommit is failing in general right now, not due to this PR. See: https://builds.apache.org/job/beam_PostCommit_Go/ Thanks Daniel for confirming. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 417339) Time Spent: 2h (was: 1h 50m) > Don't release Go SDK container until Go is officially supported. > > > Key: BEAM-9685 > URL: https://issues.apache.org/jira/browse/BEAM-9685 > Project: Beam > Issue Type: Task > Components: build-system >Reporter: Hannah Jiang >Assignee: Hannah Jiang >Priority: Major > Fix For: 2.21.0 > > Time Spent: 2h > Remaining Estimate: 0h > > 1. Remove Go SDK container from release process. > 2. Update document about it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9685) Don't release Go SDK container until Go is officially supported.
[ https://issues.apache.org/jira/browse/BEAM-9685?focusedWorklogId=417338=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417338 ] ASF GitHub Bot logged work on BEAM-9685: Author: ASF GitHub Bot Created on: 07/Apr/20 00:39 Start Date: 07/Apr/20 00:39 Worklog Time Spent: 10m Work Description: youngoli commented on issue #11308: [BEAM-9685] remove Go SDK container from release process from 2.22.0 URL: https://github.com/apache/beam/pull/11308#issuecomment-610108628 Yes, the Go Postcommit is failing in general right now, not due to this PR. See: https://builds.apache.org/job/beam_PostCommit_Go/ This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 417338) Time Spent: 1h 50m (was: 1h 40m) > Don't release Go SDK container until Go is officially supported. > > > Key: BEAM-9685 > URL: https://issues.apache.org/jira/browse/BEAM-9685 > Project: Beam > Issue Type: Task > Components: build-system >Reporter: Hannah Jiang >Assignee: Hannah Jiang >Priority: Major > Fix For: 2.21.0 > > Time Spent: 1h 50m > Remaining Estimate: 0h > > 1. Remove Go SDK container from release process. > 2. Update document about it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9674) "Selected fields list too long" error when calling tables.get in BigQueryStorageTableSource
[ https://issues.apache.org/jira/browse/BEAM-9674?focusedWorklogId=417335=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417335 ] ASF GitHub Bot logged work on BEAM-9674: Author: ASF GitHub Bot Created on: 07/Apr/20 00:33 Start Date: 07/Apr/20 00:33 Worklog Time Spent: 10m Work Description: aaltay commented on pull request #11292: [BEAM-9674] Don't specify selected fields when fetching BigQuery table size URL: https://github.com/apache/beam/pull/11292#discussion_r404468600 ## File path: sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryServices.java ## @@ -101,10 +101,6 @@ JobStatistics dryRunQuery(String projectId, JobConfigurationQuery queryConfig, S @Nullable Table getTable(TableReference tableRef) throws InterruptedException, IOException; -@Nullable Review comment: Is this deleted a public API? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 417335) Time Spent: 0.5h (was: 20m) > "Selected fields list too long" error when calling tables.get in > BigQueryStorageTableSource > --- > > Key: BEAM-9674 > URL: https://issues.apache.org/jira/browse/BEAM-9674 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.19.0 >Reporter: Kenneth Jung >Assignee: Kenneth Jung >Priority: Minor > Time Spent: 0.5h > Remaining Estimate: 0h > > Customers experience errors similar to the following: > Caused by: > com.google.api.client.googleapis.json.GoogleJsonResponseException: 400 Bad > Request { "code" : 400, "errors" : [ > { "domain" : "global", "message" : "Selected fields too long: must > be less than 16384 characters.", "reason" : "invalid" } > ], "message" : "Selected fields too long: must be less than 16384 > characters.", "status" : "INVALID_ARGUMENT" } > com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:146) > > com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113) > > com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40) > > com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:321) > com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1097) > com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419) > > com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352) > > com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469) > > 
org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl.executeWithRetries(BigQueryServicesImpl.java:938) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9674) "Selected fields list too long" error when calling tables.get in BigQueryStorageTableSource
[ https://issues.apache.org/jira/browse/BEAM-9674?focusedWorklogId=417336=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417336 ] ASF GitHub Bot logged work on BEAM-9674: Author: ASF GitHub Bot Created on: 07/Apr/20 00:33 Start Date: 07/Apr/20 00:33 Worklog Time Spent: 10m Work Description: aaltay commented on pull request #11292: [BEAM-9674] Don't specify selected fields when fetching BigQuery table size URL: https://github.com/apache/beam/pull/11292#discussion_r404468600 ## File path: sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryServices.java ## @@ -101,10 +101,6 @@ JobStatistics dryRunQuery(String projectId, JobConfigurationQuery queryConfig, S @Nullable Table getTable(TableReference tableRef) throws InterruptedException, IOException; -@Nullable Review comment: Is this removing a public API? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 417336) Time Spent: 40m (was: 0.5h) > "Selected fields list too long" error when calling tables.get in > BigQueryStorageTableSource > --- > > Key: BEAM-9674 > URL: https://issues.apache.org/jira/browse/BEAM-9674 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.19.0 >Reporter: Kenneth Jung >Assignee: Kenneth Jung >Priority: Minor > Time Spent: 40m > Remaining Estimate: 0h > > Customers experience errors similar to the following: > Caused by: > com.google.api.client.googleapis.json.GoogleJsonResponseException: 400 Bad > Request { "code" : 400, "errors" : [ > { "domain" : "global", "message" : "Selected fields too long: must > be less than 16384 characters.", "reason" : "invalid" } > ], "message" : "Selected fields too long: must be less than 16384 > characters.", "status" : "INVALID_ARGUMENT" } > com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:146) > > com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113) > > com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40) > > com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:321) > com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1097) > com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419) > > com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352) > > com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469) > > 
org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl.executeWithRetries(BigQueryServicesImpl.java:938) -- This message was sent by Atlassian Jira (v8.3.4#803005)
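For context on the bug above: BigQuery's tables.get accepts an optional selectedFields parameter (a comma-separated field list), and the service rejects values of 16,384 characters or more, so for very wide tables the full field list overflows the limit. The fix in the PR is to omit the parameter entirely when only table metadata such as size is needed. A hedged sketch of that request-building logic (hypothetical helper, not the actual BigQueryServicesImpl code):

```python
SELECTED_FIELDS_LIMIT = 16384  # per the API error message above

def table_get_params(selected_fields=None):
    # When fetching only table size/metadata, pass no selectedFields at
    # all -- the full schema comes back, but the request can never trip
    # the length limit no matter how wide the table is.
    params = {}
    if selected_fields:
        value = ",".join(selected_fields)
        if len(value) >= SELECTED_FIELDS_LIMIT:
            raise ValueError(
                "Selected fields too long: must be less than 16384 characters.")
        params["selectedFields"] = value
    return params
```

The trade-off is a larger response payload for wide tables, which is acceptable when the caller only reads `numBytes` from the result.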
[jira] [Work logged] (BEAM-9650) Add consistent slowly changing side inputs support
[ https://issues.apache.org/jira/browse/BEAM-9650?focusedWorklogId=417332=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417332 ] ASF GitHub Bot logged work on BEAM-9650: Author: ASF GitHub Bot Created on: 07/Apr/20 00:31 Start Date: 07/Apr/20 00:31 Worklog Time Spent: 10m Work Description: rezarokni commented on issue #11182: [BEAM-9650] Add PeriodicImpulse Transform and slowly changing side input documentation URL: https://github.com/apache/beam/pull/11182#issuecomment-610106616 What is the expected behaviour around lifecycle events for runners that support drain / update? Does it need to be explicitly documented? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 417332) Time Spent: 1h 10m (was: 1h) > Add consistent slowly changing side inputs support > -- > > Key: BEAM-9650 > URL: https://issues.apache.org/jira/browse/BEAM-9650 > Project: Beam > Issue Type: Bug > Components: io-ideas >Reporter: Mikhail Gryzykhin >Assignee: Mikhail Gryzykhin >Priority: Major > Time Spent: 1h 10m > Remaining Estimate: 0h > > Add implementation for slowly changing dimensions based on [design > doc](https://docs.google.com/document/d/1LDY_CtsOJ8Y_zNv1QtkP6AGFrtzkj1q5EW_gSChOIvg/edit) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9685) Don't release Go SDK container until Go is officially supported.
[ https://issues.apache.org/jira/browse/BEAM-9685?focusedWorklogId=417327=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417327 ] ASF GitHub Bot logged work on BEAM-9685: Author: ASF GitHub Bot Created on: 07/Apr/20 00:21 Start Date: 07/Apr/20 00:21 Worklog Time Spent: 10m Work Description: Hannah-Jiang commented on issue #11308: [BEAM-9685] remove Go SDK container from release process from 2.22.0 URL: https://github.com/apache/beam/pull/11308#issuecomment-610104116 PostCommit is failing with some tests. The job was able to create, push, run tests and delete the Go SDK container, so I don't think the failures are related to current PR. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 417327) Time Spent: 1h 40m (was: 1.5h) > Don't release Go SDK container until Go is officially supported. > > > Key: BEAM-9685 > URL: https://issues.apache.org/jira/browse/BEAM-9685 > Project: Beam > Issue Type: Task > Components: build-system >Reporter: Hannah Jiang >Assignee: Hannah Jiang >Priority: Major > Fix For: 2.21.0 > > Time Spent: 1h 40m > Remaining Estimate: 0h > > 1. Remove Go SDK container from release process. > 2. Update document about it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (BEAM-9199) Make --region a required flag for DataflowRunner
[ https://issues.apache.org/jira/browse/BEAM-9199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kyle Weaver resolved BEAM-9199. --- Fix Version/s: 2.21.0 Resolution: Fixed > Make --region a required flag for DataflowRunner > > > Key: BEAM-9199 > URL: https://issues.apache.org/jira/browse/BEAM-9199 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow >Reporter: Kyle Weaver >Assignee: Kyle Weaver >Priority: Major > Fix For: 2.21.0 > > Time Spent: 3.5h > Remaining Estimate: 0h > > We've been warning users since Beam 2.15.0 that --region will be required. > That is sufficient time, so now we can start requiring the flag. > While this is a small change in and of itself, I'm guessing many tests and > examples will need to be updated to add --region. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9691) Ensure Dataflow BQ Native sink are not used on FnApi
[ https://issues.apache.org/jira/browse/BEAM-9691?focusedWorklogId=417326=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417326 ] ASF GitHub Bot logged work on BEAM-9691: Author: ASF GitHub Bot Created on: 07/Apr/20 00:18 Start Date: 07/Apr/20 00:18 Worklog Time Spent: 10m Work Description: pabloem commented on issue #11309: [BEAM-9691] Ensuring BQ Native Sink is avoided on FnApi pipelines URL: https://github.com/apache/beam/pull/11309#issuecomment-610103170 Run Python 3.5 PostCommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 417326) Time Spent: 1h 50m (was: 1h 40m) > Ensure Dataflow BQ Native sink are not used on FnApi > > > Key: BEAM-9691 > URL: https://issues.apache.org/jira/browse/BEAM-9691 > Project: Beam > Issue Type: Bug > Components: io-py-gcp >Reporter: Pablo Estrada >Assignee: Pablo Estrada >Priority: Major > Time Spent: 1h 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9468) Add Google Cloud Healthcare API IO Connectors
[ https://issues.apache.org/jira/browse/BEAM-9468?focusedWorklogId=417324=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417324 ] ASF GitHub Bot logged work on BEAM-9468: Author: ASF GitHub Bot Created on: 07/Apr/20 00:15 Start Date: 07/Apr/20 00:15 Worklog Time Spent: 10m Work Description: jaketf commented on issue #11151: [BEAM-9468] Hl7v2 io URL: https://github.com/apache/beam/pull/11151#issuecomment-610050634 Next Steps (based on offline feed): - [x] Improve API for users: - [x] Add static methods for common patterns with `ListHL7v2Messages` - [x] Add `ValueProvider` support to ease use in the DataflowTemplates - [x] `ListHL7v2Messages` (hl7v2Store and filter) - [x] `Write` (hl7v2Store) - [ ] "standardize" integration tests - [x] Refactor ITs to create / destroy HL7v2 Store under a parameterized dataset in `@BeforeClass` `@AfterClass` to avoid issues with parallel tests runs. - [ ] Remove hard coding of my HL7v2Store / project in integration tests. - [ ] Add Healthcare API Dataset to Beam integration test project This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 417324) Time Spent: 12.5h (was: 12h 20m) > Add Google Cloud Healthcare API IO Connectors > - > > Key: BEAM-9468 > URL: https://issues.apache.org/jira/browse/BEAM-9468 > Project: Beam > Issue Type: New Feature > Components: io-java-gcp >Reporter: Jacob Ferriero >Assignee: Jacob Ferriero >Priority: Minor > Time Spent: 12.5h > Remaining Estimate: 0h > > Add IO Transforms for the HL7v2, FHIR and DICOM stores in the [Google Cloud > Healthcare API|https://cloud.google.com/healthcare/docs/] > HL7v2IO > FHIRIO > DICOM -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9199) Make --region a required flag for DataflowRunner
[ https://issues.apache.org/jira/browse/BEAM-9199?focusedWorklogId=417323=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417323 ] ASF GitHub Bot logged work on BEAM-9199: Author: ASF GitHub Bot Created on: 07/Apr/20 00:12 Start Date: 07/Apr/20 00:12 Worklog Time Spent: 10m Work Description: ibzib commented on pull request #11281: [BEAM-9199] Require --region option for Dataflow in Java SDK. URL: https://github.com/apache/beam/pull/11281 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 417323) Time Spent: 3.5h (was: 3h 20m) > Make --region a required flag for DataflowRunner > > > Key: BEAM-9199 > URL: https://issues.apache.org/jira/browse/BEAM-9199 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow >Reporter: Kyle Weaver >Assignee: Kyle Weaver >Priority: Major > Time Spent: 3.5h > Remaining Estimate: 0h > > We've been warning users since Beam 2.15.0 that --region will be required. > That is sufficient time, so now we can start requiring the flag. > While this is a small change in and of itself, I'm guessing many tests and > examples will need to be updated to add --region. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9199) Make --region a required flag for DataflowRunner
[ https://issues.apache.org/jira/browse/BEAM-9199?focusedWorklogId=417321=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417321 ] ASF GitHub Bot logged work on BEAM-9199: Author: ASF GitHub Bot Created on: 07/Apr/20 00:11 Start Date: 07/Apr/20 00:11 Worklog Time Spent: 10m Work Description: ibzib commented on issue #11269: [BEAM-9199] Require Dataflow --region in Python SDK. URL: https://github.com/apache/beam/pull/11269#issuecomment-610101073 Failure in `hdfsIntegrationTest` looks like known flake (BEAM-7405 et al): `docker-credential-gcloud not installed or not available in PATH` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 417321) Time Spent: 3h 10m (was: 3h) > Make --region a required flag for DataflowRunner > > > Key: BEAM-9199 > URL: https://issues.apache.org/jira/browse/BEAM-9199 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow >Reporter: Kyle Weaver >Assignee: Kyle Weaver >Priority: Major > Time Spent: 3h 10m > Remaining Estimate: 0h > > We've been warning users since Beam 2.15.0 that --region will be required. > That is sufficient time, so now we can start requiring the flag. > While this is a small change in and of itself, I'm guessing many tests and > examples will need to be updated to add --region. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9199) Make --region a required flag for DataflowRunner
[ https://issues.apache.org/jira/browse/BEAM-9199?focusedWorklogId=417322=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417322 ] ASF GitHub Bot logged work on BEAM-9199: Author: ASF GitHub Bot Created on: 07/Apr/20 00:11 Start Date: 07/Apr/20 00:11 Worklog Time Spent: 10m Work Description: ibzib commented on pull request #11269: [BEAM-9199] Require Dataflow --region in Python SDK. URL: https://github.com/apache/beam/pull/11269 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 417322) Time Spent: 3h 20m (was: 3h 10m) > Make --region a required flag for DataflowRunner > > > Key: BEAM-9199 > URL: https://issues.apache.org/jira/browse/BEAM-9199 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow >Reporter: Kyle Weaver >Assignee: Kyle Weaver >Priority: Major > Time Spent: 3h 20m > Remaining Estimate: 0h > > We've been warning users since Beam 2.15.0 that --region will be required. > That is sufficient time, so now we can start requiring the flag. > While this is a small change in and of itself, I'm guessing many tests and > examples will need to be updated to add --region. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-9713) hints should be rejected
[ https://issues.apache.org/jira/browse/BEAM-9713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Pilloud updated BEAM-9713: - Status: Open (was: Triage Needed) > hints should be rejected > > > Key: BEAM-9713 > URL: https://issues.apache.org/jira/browse/BEAM-9713 > Project: Beam > Issue Type: Bug > Components: dsl-sql-zetasql >Reporter: Andrew Pilloud >Priority: Trivial > Labels: zetasql-compliance > > five failures in shard 32 > {code} > Expected: ERROR: generic::invalid_argument: Unsupported hint: invalid_hint > Actual: ARRAY>[{123}] > {code} > {code} > @{ invalid_hint=5 } select i from t > > select @{ invalid_hint=5 } i from t > > select i from t @{ invalid_hint=5 } > > select i from t group @{ invalid_hint=5 } by 1 > > select i from t group @{ num_shards='abc' } by 1 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-9713) hints should be rejected
Andrew Pilloud created BEAM-9713: Summary: hints should be rejected Key: BEAM-9713 URL: https://issues.apache.org/jira/browse/BEAM-9713 Project: Beam Issue Type: Bug Components: dsl-sql-zetasql Reporter: Andrew Pilloud five failures in shard 32 {code} Expected: ERROR: generic::invalid_argument: Unsupported hint: invalid_hint Actual: ARRAY>[{123}] {code} {code} @{ invalid_hint=5 } select i from t select @{ invalid_hint=5 } i from t select i from t @{ invalid_hint=5 } select i from t group @{ invalid_hint=5 } by 1 select i from t group @{ num_shards='abc' } by 1 {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-9712) setting default timezone doesn't work
Andrew Pilloud created BEAM-9712: Summary: setting default timezone doesn't work Key: BEAM-9712 URL: https://issues.apache.org/jira/browse/BEAM-9712 Project: Beam Issue Type: Bug Components: dsl-sql-zetasql Reporter: Andrew Pilloud several failures in shard 14 (note: fixing the internal tests requires plumbing through the timezone config.) {code} [name=timestamp_to_string_1] select [cast(timestamp "2015-01-28" as string), cast(timestamp "2015-01-28 00:00:00" as string), cast(timestamp "2015-01-28 00:00:00.0" as string), cast(timestamp "2015-01-28 00:00:00.00" as string), cast(timestamp "2015-01-28 00:00:00.000" as string), cast(timestamp "2015-01-28 00:00:00." as string), cast(timestamp "2015-01-28 00:00:00.0" as string), cast(timestamp "2015-01-28 00:00:00.00" as string)] -- ARRAY>>[ {ARRAY[ "2015-01-28 00:00:00+13:45", "2015-01-28 00:00:00+13:45", "2015-01-28 00:00:00+13:45", "2015-01-28 00:00:00+13:45", "2015-01-28 00:00:00+13:45", "2015-01-28 00:00:00+13:45", "2015-01-28 00:00:00+13:45", "2015-01-28 00:00:00+13:45" ]} ] {code} {code} [default_time_zone=Pacific/Chatham] [name=timestamp_to_string_1] select [cast(timestamp "2015-01-28" as string), cast(timestamp "2015-01-28 00:00:00" as string), cast(timestamp "2015-01-28 00:00:00.0" as string), cast(timestamp "2015-01-28 00:00:00.00" as string), cast(timestamp "2015-01-28 00:00:00.000" as string), cast(timestamp "2015-01-28 00:00:00." as string), cast(timestamp "2015-01-28 00:00:00.0" as string), cast(timestamp "2015-01-28 00:00:00.00" as string)] -- ARRAY>>[ {ARRAY[ "2015-01-28 00:00:00+13:45", "2015-01-28 00:00:00+13:45", "2015-01-28 00:00:00+13:45", "2015-01-28 00:00:00+13:45", "2015-01-28 00:00:00+13:45", "2015-01-28 00:00:00+13:45", "2015-01-28 00:00:00+13:45", "2015-01-28 00:00:00+13:45" ]} ] {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-9712) setting default timezone doesn't work
[ https://issues.apache.org/jira/browse/BEAM-9712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Pilloud updated BEAM-9712: Status: Open (was: Triage Needed) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-9709) timezone off by 8 hours
[ https://issues.apache.org/jira/browse/BEAM-9709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Pilloud updated BEAM-9709:
---------------------------------
Description:
two failures in shard 13, one failure in shard 19

{code}
Expected: ARRAY<STRUCT<TIMESTAMP>>[{2014-01-31 00:00:00+00}]
Actual: ARRAY<STRUCT<TIMESTAMP>>[{2014-01-31 08:00:00+00}],
{code}
{code}
select timestamp(date '2014-01-31')
{code}

was:
one failure in shard 19
(It is possible this test is attempting to change the default timezone before running)

{code}
Expected: ARRAY<STRUCT<TIMESTAMP>>[
  {2000-01-02 18:20:30+00},
  {2000-01-02 09:02:03+00}
]
Actual: ARRAY<STRUCT<TIMESTAMP>>[
  {2000-01-02 10:20:30+00},
  {2000-01-02 01:02:03+00}
],
{code}
{code}
SELECT x FROM UNNEST([TIMESTAMP '2000-01-02 10:20:30', '2000-01-02 01:02:03']) x;
{code}

> timezone off by 8 hours
> -----------------------
>
> Key: BEAM-9709
> URL: https://issues.apache.org/jira/browse/BEAM-9709
> Project: Beam
> Issue Type: Bug
> Components: dsl-sql-zetasql
> Reporter: Andrew Pilloud
> Priority: Trivial
> Labels: zetasql-compliance

-- This message was sent by Atlassian Jira (v8.3.4#803005)
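The 8-hour shift is consistent with the civil timestamp being interpreted in America/Los_Angeles (UTC-8 in January) rather than UTC. A minimal Python sketch of that hypothesis (not Beam code; assumes system tzdata is available):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Interpret the civil time 2014-01-31 00:00:00 in America/Los_Angeles
# (PST, UTC-8 in January) and convert to UTC: this reproduces the
# "Actual" value above, 2014-01-31 08:00:00+00.
local = datetime(2014, 1, 31, 0, 0, 0, tzinfo=ZoneInfo("America/Los_Angeles"))
as_utc = local.astimezone(timezone.utc)
print(as_utc)  # 2014-01-31 08:00:00+00:00
```

If the environment's default timezone were honored as UTC, the conversion would be the identity and the expected `2014-01-31 00:00:00+00` would come back instead.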
[jira] [Updated] (BEAM-9709) timezone off by 8 hours
[ https://issues.apache.org/jira/browse/BEAM-9709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Pilloud updated BEAM-9709: Summary: timezone off by 8 hours (was: unnest timezone off by 8 hours) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-9708) count with no elements returns no value instead of 0
[ https://issues.apache.org/jira/browse/BEAM-9708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Pilloud updated BEAM-9708:
---------------------------------
Description:
two failures in shard 3, one failure in shard 37

{code:java}
Expected: ARRAY<STRUCT<INT64>>[{0}]
Actual: ARRAY<STRUCT<INT64>>[],
Details: Number of array elements is {1} and {0} in respective arrays {[unordered: {0}]} and {[]}
{code}
{code}
[prepare_database]
CREATE TABLE TableEmpty AS SELECT val FROM (SELECT 1 as val) WHERE false
--
ARRAY<STRUCT<val INT64>>[]
==
[name=aggregation_count_6]
SELECT COUNT(*) FROM TableEmpty
--
ARRAY<STRUCT<INT64>>[{0}]
==
[name=aggregation_count_7]
SELECT COUNT(val) FROM TableEmpty
--
ARRAY<STRUCT<INT64>>[{0}]
{code}
{code}
SELECT COUNT(a) FROM (
  SELECT a FROM (SELECT 1 a UNION ALL SELECT 2 UNION ALL SELECT 3) LIMIT 0 OFFSET 0)
{code}

was:
one failure in shard 37

{code:java}
Expected: ARRAY<STRUCT<INT64>>[{0}]
Actual: ARRAY<STRUCT<INT64>>[],
Details: Number of array elements is {1} and {0} in respective arrays {[unordered: {0}]} and {[]}
{code}
{code}
SELECT COUNT(a) FROM (
  SELECT a FROM (SELECT 1 a UNION ALL SELECT 2 UNION ALL SELECT 3) LIMIT 0 OFFSET 0)
{code}

> count with no elements returns no value instead of 0
> ----------------------------------------------------
>
> Key: BEAM-9708
> URL: https://issues.apache.org/jira/browse/BEAM-9708
> Project: Beam
> Issue Type: Bug
> Components: dsl-sql-zetasql
> Reporter: Andrew Pilloud
> Priority: Trivial
> Labels: zetasql-compliance
-- This message was sent by Atlassian Jira (v8.3.4#803005)
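The expected semantics above — a global `COUNT` over an empty input emits exactly one row containing 0, never an empty result — can be demonstrated with any conventional SQL engine. A minimal sketch using Python's built-in `sqlite3` (standing in for ZetaSQL, which the compliance suite actually targets):

```python
import sqlite3

# COUNT over an empty table must still produce a single row containing 0;
# the compliance failure above shows Beam returning no rows at all.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TableEmpty (val INTEGER)")
rows = conn.execute("SELECT COUNT(val) FROM TableEmpty").fetchall()
print(rows)  # [(0,)]
```

A likely source of this class of bug is implementing a global aggregation as a per-key aggregation: with zero input elements there are zero keys, so no output row is ever produced.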
[jira] [Updated] (BEAM-9711) sum(null) should be null not 0
[ https://issues.apache.org/jira/browse/BEAM-9711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Pilloud updated BEAM-9711:
---------------------------------
Status: Open (was: Triage Needed)

> sum(null) should be null not 0
> ------------------------------
>
> Key: BEAM-9711
> URL: https://issues.apache.org/jira/browse/BEAM-9711
> Project: Beam
> Issue Type: Bug
> Components: dsl-sql-zetasql
> Reporter: Andrew Pilloud
> Priority: Trivial
> Labels: zetasql-compliance
>
> one failure in shard 3
> {code}
> Expected: ARRAY<STRUCT<row_id INT64, int64_sum INT64>>[
>   {1, NULL},
>   {2, NULL},
>   {3, NULL},
>   {4, 3},
>   {5, 4},
>   {6, 5},
>   {7, 6},
>   {8, 7},
>   {9, 8},
>   {10, 9},
>   {11, 10},
>   {12, 11},
>   {13, 12},
>   {14, 13}
> ]
> Actual: ARRAY<STRUCT<row_id INT64, int64_sum INT64>>[
>   {1, 0},
>   {10, 9},
>   {7, 6},
>   {2, 0},
>   {13, 12},
>   {5, 4},
>   {4, 3},
>   {14, 13},
>   {6, 5},
>   {11, 10},
>   {12, 11},
>   {8, 7},
>   {3, 0},
>   {9, 8}
> ],
> {code}
> {code}
> [prepare_database]
> CREATE TABLE TableLarge AS
> SELECT CAST(1 AS int64) as row_id,
>        CAST(NULL AS bool) as bool_val, CAST(NULL AS double) as double_val,
>        CAST(NULL AS int64) as int64_val, CAST(NULL AS uint64) as uint64_val,
>        CAST(NULL AS string) as str_val UNION ALL
> SELECT 2, true, NULL, NULL, NULL, NULL UNION ALL
> SELECT 3, false, 0.2, NULL, NULL, NULL UNION ALL
> SELECT 4, true, 0.3, 3, NULL, NULL UNION ALL
> SELECT 5, false, 0.4, 4, 15, "4" UNION ALL
> SELECT 6, true, 0.5, 5, 17, "5" UNION ALL
> SELECT 7, false, 0.6, 6, 19, "6" UNION ALL
> SELECT 8, true, 0.7, 7, 21, "7" UNION ALL
> SELECT 9, false, 0.8, 8, 23, "8" UNION ALL
> SELECT 10, true, 0.9, 9, 25, "9" UNION ALL
> SELECT 11, false, 1.0, 10, 27, "10" UNION ALL
> SELECT 12, true, IEEE_DIVIDE(1, 0), 11, 29, "11" UNION ALL
> SELECT 13, false, IEEE_DIVIDE(-1, 0), 12, 31, "12" UNION ALL
> SELECT 14, true, IEEE_DIVIDE(0, 0), 13, 33, "13"
> --
> ARRAY<STRUCT<row_id INT64,
>              bool_val BOOL,
>              double_val DOUBLE,
>              int64_val INT64,
>              uint64_val UINT64,
>              str_val STRING>>
> [
>   {1, NULL, NULL, NULL, NULL, NULL},
>   {2, true, NULL, NULL, NULL, NULL},
>   {3, false, 0.2, NULL, NULL, NULL},
>   {4, true, 0.3, 3, NULL, NULL},
>   {5, false, 0.4, 4, 15, "4"},
>   {6, true, 0.5, 5, 17, "5"},
>   {7, false, 0.6, 6, 19, "6"},
>   {8, true, 0.7, 7, 21, "7"},
>   {9, false, 0.8, 8, 23, "8"},
>   {10, true, 0.9, 9, 25, "9"},
>   {11, false, 1, 10, 27, "10"},
>   {12, true, inf, 11, 29, "11"},
>   {13, false, -inf, 12, 31, "12"},
>   {14, true, nan, 13, 33, "13"}
> ]
> ==
> # SUM should work with GROUP BY.
> [name=aggregation_sum_group_by]
> SELECT row_id, SUM(int64_val) int64_sum FROM TableLarge GROUP BY row_id
> --
> ARRAY<STRUCT<row_id INT64, int64_sum INT64>>[
>   {1, NULL},
>   {2, NULL},
>   {3, NULL},
>   {4, 3},
>   {5, 4},
>   {6, 5},
>   {7, 6},
>   {8, 7},
>   {9, 8},
>   {10, 9},
>   {11, 10},
>   {12, 11},
>   {13, 12},
>   {14, 13}
> ]
> {code}

-- This message was sent by Atlassian Jira (v8.3.4#803005)
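The rule being violated — `SUM` over a group whose values are all NULL yields NULL, not 0 — is standard SQL aggregate behavior. A minimal sketch using Python's built-in `sqlite3` (standing in for ZetaSQL, which the compliance suite actually targets):

```python
import sqlite3

# Groups 1 and 2 contain only NULL int64_val: standard SQL says their
# SUM is NULL. The compliance failure above shows Beam returning 0.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (row_id INTEGER, int64_val INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(1, None), (2, None), (4, 3), (5, 4)])
result = dict(conn.execute(
    "SELECT row_id, SUM(int64_val) FROM t GROUP BY row_id"))
print(result)  # {1: None, 2: None, 4: 3, 5: 4}
```

The 0-vs-NULL confusion typically comes from initializing the accumulator to 0 and never tracking whether any non-NULL value was actually seen.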
[jira] [Work logged] (BEAM-9618) Allow SDKs to pull process bundle descriptors.
[ https://issues.apache.org/jira/browse/BEAM-9618?focusedWorklogId=417302=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417302 ] ASF GitHub Bot logged work on BEAM-9618: Author: ASF GitHub Bot Created on: 06/Apr/20 23:28 Start Date: 06/Apr/20 23:28 Worklog Time Spent: 10m Work Description: robertwb commented on pull request #11328: [BEAM-9618] Java SDK worker support for pulling bundle descriptors. URL: https://github.com/apache/beam/pull/11328 Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily: - [ ] [**Choose reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and mention them in a comment (`R: @username`). - [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue. - [ ] Update `CHANGES.md` with noteworthy changes. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf). See the [Contributor Guide](https://beam.apache.org/contribute) for more tips on [how to make review process smoother](https://beam.apache.org/contribute/#make-reviewers-job-easier). 
Post-Commit Tests Status (on master branch)
[jira] [Created] (BEAM-9711) sum(null) should be null not 0
Andrew Pilloud created BEAM-9711: Summary: sum(null) should be null not 0 Key: BEAM-9711 URL: https://issues.apache.org/jira/browse/BEAM-9711 Project: Beam Issue Type: Bug Components: dsl-sql-zetasql Reporter: Andrew Pilloud -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-9710) Got current time instead of timestamp value
[ https://issues.apache.org/jira/browse/BEAM-9710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Pilloud updated BEAM-9710:
---------------------------------
Status: Open (was: Triage Needed)

> Got current time instead of timestamp value
> -------------------------------------------
>
> Key: BEAM-9710
> URL: https://issues.apache.org/jira/browse/BEAM-9710
> Project: Beam
> Issue Type: Bug
> Components: dsl-sql-zetasql
> Reporter: Andrew Pilloud
> Priority: Trivial
> Labels: zetasql-compliance
>
> one failure in shard 13
> {code}
> Expected: ARRAY<STRUCT<timestamp_val TIMESTAMP>>[{2014-12-01 00:00:00+00}]
> Actual: ARRAY<STRUCT<timestamp_val TIMESTAMP>>[{2020-04-06 00:20:40.052+00}],
> {code}
> {code}
> [prepare_database]
> CREATE TABLE Table1 AS
> SELECT timestamp '2014-12-01' as timestamp_val
> --
> ARRAY<STRUCT<timestamp_val TIMESTAMP>>[{2014-12-01 00:00:00+00}]
> ==
> [name=timestamp_type_2]
> SELECT timestamp_val
> FROM Table1
> --
> ARRAY<STRUCT<timestamp_val TIMESTAMP>>[{2014-12-01 00:00:00+00}]
> {code}

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-9710) Got current time instead of timestamp value
Andrew Pilloud created BEAM-9710: Summary: Got current time instead of timestamp value Key: BEAM-9710 URL: https://issues.apache.org/jira/browse/BEAM-9710 Project: Beam Issue Type: Bug Components: dsl-sql-zetasql Reporter: Andrew Pilloud -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9468) Add Google Cloud Healthcare API IO Connectors
[ https://issues.apache.org/jira/browse/BEAM-9468?focusedWorklogId=417301=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417301 ]

ASF GitHub Bot logged work on BEAM-9468:
Author: ASF GitHub Bot
Created on: 06/Apr/20 23:21
Start Date: 06/Apr/20 23:21
Worklog Time Spent: 10m
Work Description: jaketf commented on issue #11151: [BEAM-9468] Hl7v2 io
URL: https://github.com/apache/beam/pull/11151#issuecomment-610050634

Next steps (based on offline feedback):
- [x] Improve API for users:
  - [x] Add static methods for common patterns with `ListHL7v2Messages`
  - [x] Add `ValueProvider` support to ease use in the DataflowTemplates
    - [x] `ListHL7v2Messages` (hl7v2Store and filter)
    - [x] `Write` (hl7v2Store)
- [ ] "Standardize" integration tests:
  - [ ] Refactor ITs to create / destroy the HL7v2 store under a parameterized dataset in `@BeforeClass` / `@AfterClass` to avoid issues with parallel test runs.
  - [ ] Remove hard coding of my HL7v2 store / project in integration tests.
  - [ ] Add a Healthcare API dataset to the Beam integration test project.

Issue Time Tracking
-------------------
Worklog Id: (was: 417301) Time Spent: 12h 20m (was: 12h 10m)

> Add Google Cloud Healthcare API IO Connectors
> ---------------------------------------------
>
> Key: BEAM-9468
> URL: https://issues.apache.org/jira/browse/BEAM-9468
> Project: Beam
> Issue Type: New Feature
> Components: io-java-gcp
> Reporter: Jacob Ferriero
> Assignee: Jacob Ferriero
> Priority: Minor
> Time Spent: 12h 20m
> Remaining Estimate: 0h
>
> Add IO transforms for the HL7v2, FHIR and DICOM stores in the [Google Cloud Healthcare API|https://cloud.google.com/healthcare/docs/]:
> HL7v2IO
> FHIRIO
> DICOM

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-9709) unnest timezone off by 8 hours
[ https://issues.apache.org/jira/browse/BEAM-9709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Pilloud updated BEAM-9709: Status: Open (was: Triage Needed) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-9709) unnest timezone off by 8 hours
Andrew Pilloud created BEAM-9709: Summary: unnest timezone off by 8 hours Key: BEAM-9709 URL: https://issues.apache.org/jira/browse/BEAM-9709 Project: Beam Issue Type: Bug Components: dsl-sql-zetasql Reporter: Andrew Pilloud -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-9708) count with no elements returns no value instead of 0
[ https://issues.apache.org/jira/browse/BEAM-9708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Pilloud updated BEAM-9708: Labels: zetasql-compliance (was: ) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-9708) count with no elements returns no value instead of 0
[ https://issues.apache.org/jira/browse/BEAM-9708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Pilloud updated BEAM-9708: Status: Open (was: Triage Needed) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-9708) count with no elements returns no value instead of 0
Andrew Pilloud created BEAM-9708: Summary: count with no elements returns no value instead of 0 Key: BEAM-9708 URL: https://issues.apache.org/jira/browse/BEAM-9708 Project: Beam Issue Type: Bug Components: dsl-sql-zetasql Reporter: Andrew Pilloud -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417295=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417295 ]

ASF GitHub Bot logged work on BEAM-9434:
Author: ASF GitHub Bot
Created on: 06/Apr/20 23:03
Start Date: 06/Apr/20 23:03
Worklog Time Spent: 10m
Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3
URL: https://github.com/apache/beam/pull/11037#issuecomment-610080905
Run Go Spark ValidatesRunner

Issue Time Tracking
-------------------
Worklog Id: (was: 417295) Time Spent: 11h (was: 10h 50m)

> Improve Spark runner reshuffle translation to maximize parallelism
> ------------------------------------------------------------------
>
> Key: BEAM-9434
> URL: https://issues.apache.org/jira/browse/BEAM-9434
> Project: Beam
> Issue Type: Improvement
> Components: runner-spark
> Affects Versions: 2.19.0
> Reporter: Emiliano Capoccia
> Assignee: Emiliano Capoccia
> Priority: Minor
> Fix For: 2.21.0
> Time Spent: 11h
> Remaining Estimate: 0h
>
> There is a performance issue when processing a large number of small Avro files in Spark on k8s (tens of thousands or more).
> The recommended way of reading a pattern of Avro files in Beam is by means of:
>
> {code:java}
> PCollection<AvroGenClass> records = p.apply(AvroIO.read(AvroGenClass.class)
>     .from("s3://my-bucket/path-to/*.avro").withHintMatchesManyFiles())
> {code}
> However, in the case of many small files, the above results in the entire reading taking place in a single task/node, which is considerably slow and has scalability issues.
> The option of omitting the hint is not viable, as it results in too many tasks being spawned, and the cluster being busy doing coordination of tiny tasks with high overhead.
> There are a few workarounds on the internet which mainly revolve around compacting the input files before processing, so that a reduced number of bulky files is processed in parallel.
> It seems the Spark runner is using the parallelism of the input distributed collection (RDD) to calculate the number of partitions in Reshuffle. In the case of FileIO/AvroIO, if the input pattern is a regex, the size of the input is 1, which is far from an optimal parallelism value. We may fix this by improving the translation of reshuffle to maximize parallelism.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
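The compaction workaround described above amounts to bounding the number of read units: distribute many small files into a fixed number of groups, so each task reads several files instead of one task reading everything (or thousands of tiny tasks reading one file each). A hypothetical language-agnostic sketch in Python (`batch_files` is an illustrative helper, not a Beam or Spark API):

```python
def batch_files(files, num_batches):
    """Round-robin a file list into at most num_batches non-empty groups.

    Each group becomes one unit of parallel work, bounding both the
    single-task bottleneck and the tiny-task coordination overhead.
    """
    batches = [[] for _ in range(num_batches)]
    for i, name in enumerate(files):
        batches[i % num_batches].append(name)
    return [b for b in batches if b]

# 10 small Avro files spread across 4 read groups.
groups = batch_files([f"part-{i:05d}.avro" for i in range(10)], 4)
print([len(g) for g in groups])  # [3, 3, 2, 2]
```

The proposed fix in the issue addresses the same imbalance inside the runner itself, by giving Reshuffle a sensible partition count instead of inheriting the input RDD's parallelism of 1.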
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417294=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417294 ] ASF GitHub Bot logged work on BEAM-9434: Author: ASF GitHub Bot Created on: 06/Apr/20 23:03 Start Date: 06/Apr/20 23:03 Worklog Time Spent: 10m Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-610080809 Run Go Spark ValidatesRunner Issue Time Tracking --- Worklog Id: (was: 417294) Time Spent: 10h 50m (was: 10h 40m) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417293=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417293 ] ASF GitHub Bot logged work on BEAM-9434: Author: ASF GitHub Bot Created on: 06/Apr/20 23:03 Start Date: 06/Apr/20 23:03 Worklog Time Spent: 10m Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-610080530 Run Python Spark ValidatesRunner Issue Time Tracking --- Worklog Id: (was: 417293) Time Spent: 10h 40m (was: 10.5h) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417291&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417291 ] ASF GitHub Bot logged work on BEAM-9434: Author: ASF GitHub Bot Created on: 06/Apr/20 23:02 Start Date: 06/Apr/20 23:02 Worklog Time Spent: 10m Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-610080530 Run Python Spark ValidatesRunner This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 417291) Time Spent: 10h 20m (was: 10h 10m)
> Improve Spark runner reshuffle translation to maximize parallelism
> ------------------------------------------------------------------
>
> Key: BEAM-9434
> URL: https://issues.apache.org/jira/browse/BEAM-9434
> Project: Beam
> Issue Type: Improvement
> Components: runner-spark
> Affects Versions: 2.19.0
> Reporter: Emiliano Capoccia
> Assignee: Emiliano Capoccia
> Priority: Minor
> Fix For: 2.21.0
>
> Time Spent: 10h 20m
> Remaining Estimate: 0h
>
> There is a performance issue when processing a large number of small Avro files (tens of thousands or more) in Spark on k8s.
> The recommended way of reading a pattern of Avro files in Beam is by means of:
>
> {code:java}
> PCollection<AvroGenClass> records = p.apply(AvroIO.read(AvroGenClass.class)
>     .from("s3://my-bucket/path-to/*.avro").withHintMatchesManyFiles());
> {code}
> However, in the case of many small files, the above results in the entire read taking place in a single task/node, which is considerably slow and has scalability issues.
> The option of omitting the hint is not viable either, as it results in too many tasks being spawned and the cluster staying busy coordinating tiny tasks with high overhead.
> There are a few workarounds on the internet, which mainly revolve around compacting the input files before processing, so that a reduced number of bulky files is processed in parallel.
> It seems the Spark runner uses the parallelism of the input distributed collection (RDD) to calculate the number of partitions in Reshuffle. In the case of FileIO/AvroIO, if the input pattern is a regex, the size of the input is 1, which is far from an optimal parallelism value. We may fix this by improving the translation of Reshuffle to maximize parallelism.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
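For context, a sketch of what the hinted read roughly corresponds to in the Beam SDK (this is an approximation, not the exact expansion, and may differ between Beam versions; `AvroGenClass` is the generated Avro class from the snippet in the description). The file list produced by the match step is what the runner redistributes via Reshuffle before reading, which is where the partition count discussed in this issue is chosen:

{code:java}
// Hedged sketch: AvroIO.read(...).withHintMatchesManyFiles() behaves roughly like
// matching the pattern first and then reading the matched files, with a Reshuffle
// in between (inserted by the SDK) to spread the file list across workers.
PCollection<AvroGenClass> records = p
    .apply(FileIO.match().filepattern("s3://my-bucket/path-to/*.avro"))
    .apply(FileIO.readMatches())
    .apply(AvroIO.readFiles(AvroGenClass.class));
{code}

If the runner sizes the Reshuffle's partitions from the parallelism of the tiny matched-file collection, the subsequent read collapses onto too few tasks, which matches the behaviour reported above.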
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417288&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417288 ] ASF GitHub Bot logged work on BEAM-9434: Author: ASF GitHub Bot Created on: 06/Apr/20 23:02 Start Date: 06/Apr/20 23:02 Worklog Time Spent: 10m Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-610078890 Run Java Spark PortableValidatesRunner Batch Issue Time Tracking --- Worklog Id: (was: 417288) Time Spent: 10h 10m (was: 10h)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417292&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417292 ] ASF GitHub Bot logged work on BEAM-9434: Author: ASF GitHub Bot Created on: 06/Apr/20 23:02 Start Date: 06/Apr/20 23:02 Worklog Time Spent: 10m Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-610080657 Run Python Spark ValidatesRunner Issue Time Tracking --- Worklog Id: (was: 417292) Time Spent: 10.5h (was: 10h 20m)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417287&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417287 ] ASF GitHub Bot logged work on BEAM-9434: Author: ASF GitHub Bot Created on: 06/Apr/20 23:01 Start Date: 06/Apr/20 23:01 Worklog Time Spent: 10m Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-610080380 Run Java Spark PortableValidatesRunner Batch Issue Time Tracking --- Worklog Id: (was: 417287) Time Spent: 10h (was: 9h 50m)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417283&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417283 ] ASF GitHub Bot logged work on BEAM-9434: Author: ASF GitHub Bot Created on: 06/Apr/20 23:01 Start Date: 06/Apr/20 23:01 Worklog Time Spent: 10m Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-610080182 Run Spark Runner Nexmark Tests Issue Time Tracking --- Worklog Id: (was: 417283) Time Spent: 9h 20m (was: 9h 10m)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417281&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417281 ] ASF GitHub Bot logged work on BEAM-9434: Author: ASF GitHub Bot Created on: 06/Apr/20 23:01 Start Date: 06/Apr/20 23:01 Worklog Time Spent: 10m Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-610080127 Run Spark Runner Nexmark Tests Issue Time Tracking --- Worklog Id: (was: 417281) Time Spent: 9h 10m (was: 9h)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417284&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417284 ] ASF GitHub Bot logged work on BEAM-9434: Author: ASF GitHub Bot Created on: 06/Apr/20 23:01 Start Date: 06/Apr/20 23:01 Worklog Time Spent: 10m Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-610078236 Run Spark Runner Nexmark Tests Issue Time Tracking --- Worklog Id: (was: 417284) Time Spent: 9.5h (was: 9h 20m)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417285&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417285 ] ASF GitHub Bot logged work on BEAM-9434: Author: ASF GitHub Bot Created on: 06/Apr/20 23:01 Start Date: 06/Apr/20 23:01 Worklog Time Spent: 10m Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-610080065 Run Spark Runner Nexmark Tests Issue Time Tracking --- Worklog Id: (was: 417285) Time Spent: 9h 40m (was: 9.5h)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417286&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417286 ] ASF GitHub Bot logged work on BEAM-9434: Author: ASF GitHub Bot Created on: 06/Apr/20 23:01 Start Date: 06/Apr/20 23:01 Worklog Time Spent: 10m Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-610080127 Run Spark Runner Nexmark Tests Issue Time Tracking --- Worklog Id: (was: 417286) Time Spent: 9h 50m (was: 9h 40m)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417275&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417275 ] ASF GitHub Bot logged work on BEAM-9434: Author: ASF GitHub Bot Created on: 06/Apr/20 23:00 Start Date: 06/Apr/20 23:00 Worklog Time Spent: 10m Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-610079874 Run Spark ValidatesRunner Issue Time Tracking --- Worklog Id: (was: 417275) Time Spent: 8h 20m (was: 8h 10m)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417280&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417280 ] ASF GitHub Bot logged work on BEAM-9434: Author: ASF GitHub Bot Created on: 06/Apr/20 23:00 Start Date: 06/Apr/20 23:00 Worklog Time Spent: 10m Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-610080065 Run Spark Runner Nexmark Tests Issue Time Tracking --- Worklog Id: (was: 417280) Time Spent: 9h (was: 8h 50m)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417276&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417276 ] ASF GitHub Bot logged work on BEAM-9434: Author: ASF GitHub Bot Created on: 06/Apr/20 23:00 Start Date: 06/Apr/20 23:00 Worklog Time Spent: 10m Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-610079624 Run Spark ValidatesRunner Issue Time Tracking --- Worklog Id: (was: 417276) Time Spent: 8.5h (was: 8h 20m)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417278=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417278 ] ASF GitHub Bot logged work on BEAM-9434: Author: ASF GitHub Bot Created on: 06/Apr/20 23:00 Start Date: 06/Apr/20 23:00 Worklog Time Spent: 10m Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-610079752 Run Spark ValidatesRunner This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 417278) Time Spent: 8h 40m (was: 8.5h) > Improve Spark runner reshuffle translation to maximize parallelism > -- > > Key: BEAM-9434 > URL: https://issues.apache.org/jira/browse/BEAM-9434 > Project: Beam > Issue Type: Improvement > Components: runner-spark >Affects Versions: 2.19.0 >Reporter: Emiliano Capoccia >Assignee: Emiliano Capoccia >Priority: Minor > Fix For: 2.21.0 > > Time Spent: 8h 40m > Remaining Estimate: 0h > > There is a performance issue when processing a large number of small Avro > files in Spark on k8s (tens of thousands or more). > The recommended way of reading a pattern of Avro files in Beam is by means of: > > {code:java} > PCollection records = p.apply(AvroIO.read(AvroGenClass.class) > .from("s3://my-bucket/path-to/*.avro").withHintMatchesManyFiles()) > {code} > However, in the case of many small files, the above results in the entire > reading taking place in a single task/node, which is considerably slow and > has scalability issues. > The option of omitting the hint is not viable, as it results in too many > tasks being spawn, and the cluster being busy doing coordination of tiny > tasks with high overhead. 
> There are a few workarounds on the internet, which mainly revolve around > compacting the input files before processing, so that a reduced number of > bulky files is processed in parallel. > It seems the Spark runner is using the parallelism of the input distributed > collection (RDD) to calculate the number of partitions in Reshuffle. In the > case of FileIO/AvroIO, if the input pattern is a regex, the size of the input > is 1, which is far from an optimal parallelism value. We may fix this by > improving the translation of reshuffle to maximize parallelism. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
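The partition-sizing behaviour described in the issue can be illustrated with a small, self-contained sketch (hypothetical class and method names, not the actual Spark runner code): if Reshuffle sizes its partitions from the input RDD's parallelism, a single glob pattern yields a parallelism of 1, whereas taking the maximum of the input parallelism and the cluster's default parallelism restores fan-out, which is the direction of the proposed fix.

```java
// Hypothetical sketch of the Reshuffle partition-sizing logic discussed in
// BEAM-9434. Not actual Beam/Spark runner code: names and values are illustrative.
public class ReshufflePartitioning {

    // Behaviour described in the issue: partitions follow the parallelism of
    // the input RDD, which is 1 when the input is a single file pattern.
    static int partitionsFromInput(int inputRddParallelism) {
        return inputRddParallelism;
    }

    // Proposed direction: never shuffle into fewer partitions than the
    // cluster's default parallelism, so a 1-element input still fans out.
    static int partitionsMaximized(int inputRddParallelism, int defaultParallelism) {
        return Math.max(inputRddParallelism, defaultParallelism);
    }

    public static void main(String[] args) {
        int inputParallelism = 1;      // one glob string, e.g. "s3://bucket/*.avro"
        int defaultParallelism = 200;  // e.g. spark.default.parallelism on the cluster

        System.out.println(partitionsFromInput(inputParallelism));                     // 1
        System.out.println(partitionsMaximized(inputParallelism, defaultParallelism)); // 200
    }
}
```

With the first rule, tens of thousands of small Avro files read from one pattern all land in a single task; with the second, the shuffle spreads them across the cluster regardless of the input's apparent size.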
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417279&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417279 ] ASF GitHub Bot logged work on BEAM-9434: Author: ASF GitHub Bot Created on: 06/Apr/20 23:00 Start Date: 06/Apr/20 23:00 Worklog Time Spent: 10m Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-610077992 Run Spark ValidatesRunner Issue Time Tracking --- Worklog Id: (was: 417279) Time Spent: 8h 50m (was: 8h 40m)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417272&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417272 ] ASF GitHub Bot logged work on BEAM-9434: Author: ASF GitHub Bot Created on: 06/Apr/20 22:59 Start Date: 06/Apr/20 22:59 Worklog Time Spent: 10m Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-610079437 retest this please Issue Time Tracking --- Worklog Id: (was: 417272) Time Spent: 7h 50m (was: 7h 40m)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417274&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417274 ] ASF GitHub Bot logged work on BEAM-9434: Author: ASF GitHub Bot Created on: 06/Apr/20 22:59 Start Date: 06/Apr/20 22:59 Worklog Time Spent: 10m Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-610079752 Run Spark ValidatesRunner Issue Time Tracking --- Worklog Id: (was: 417274) Time Spent: 8h 10m (was: 8h)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417271&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417271 ] ASF GitHub Bot logged work on BEAM-9434: Author: ASF GitHub Bot Created on: 06/Apr/20 22:59 Start Date: 06/Apr/20 22:59 Worklog Time Spent: 10m Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-610079366 retest this please Issue Time Tracking --- Worklog Id: (was: 417271) Time Spent: 7h 40m (was: 7.5h)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417273&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417273 ] ASF GitHub Bot logged work on BEAM-9434: Author: ASF GitHub Bot Created on: 06/Apr/20 22:59 Start Date: 06/Apr/20 22:59 Worklog Time Spent: 10m Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-610079624 Run Spark ValidatesRunner Issue Time Tracking --- Worklog Id: (was: 417273) Time Spent: 8h (was: 7h 50m)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417270&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417270 ] ASF GitHub Bot logged work on BEAM-9434: Author: ASF GitHub Bot Created on: 06/Apr/20 22:58 Start Date: 06/Apr/20 22:58 Worklog Time Spent: 10m Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-610079437 retest this please Issue Time Tracking --- Worklog Id: (was: 417270) Time Spent: 7.5h (was: 7h 20m)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417269&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417269 ] ASF GitHub Bot logged work on BEAM-9434: Author: ASF GitHub Bot Created on: 06/Apr/20 22:58 Start Date: 06/Apr/20 22:58 Worklog Time Spent: 10m Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-610079366 retest this please Issue Time Tracking --- Worklog Id: (was: 417269) Time Spent: 7h 20m (was: 7h 10m)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417266&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417266 ] ASF GitHub Bot logged work on BEAM-9434: Author: ASF GitHub Bot Created on: 06/Apr/20 22:57 Start Date: 06/Apr/20 22:57 Worklog Time Spent: 10m Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-610078527 Run Java Spark PortableValidatesRunner Batch Issue Time Tracking --- Worklog Id: (was: 417266) Time Spent: 6h 50m (was: 6h 40m)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417268&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417268 ] ASF GitHub Bot logged work on BEAM-9434: Author: ASF GitHub Bot Created on: 06/Apr/20 22:57 Start Date: 06/Apr/20 22:57 Worklog Time Spent: 10m Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-610078731 Run Java Spark PortableValidatesRunner Batch Issue Time Tracking --- Worklog Id: (was: 417268) Time Spent: 7h 10m (was: 7h)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417267&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417267 ] ASF GitHub Bot logged work on BEAM-9434: Author: ASF GitHub Bot Created on: 06/Apr/20 22:57 Start Date: 06/Apr/20 22:57 Worklog Time Spent: 10m Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-610078632 Run Java Spark PortableValidatesRunner Batch Issue Time Tracking --- Worklog Id: (was: 417267) Time Spent: 7h (was: 6h 50m)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417265&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417265 ] ASF GitHub Bot logged work on BEAM-9434: Author: ASF GitHub Bot Created on: 06/Apr/20 22:57 Start Date: 06/Apr/20 22:57 Worklog Time Spent: 10m Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3 URL: https://github.com/apache/beam/pull/11037#issuecomment-610078890 Run Java Spark PortableValidatesRunner Batch Issue Time Tracking --- Worklog Id: (was: 417265) Time Spent: 6h 40m (was: 6.5h)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417263&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417263 ]

ASF GitHub Bot logged work on BEAM-9434:

Author: ASF GitHub Bot
Created on: 06/Apr/20 22:56
Start Date: 06/Apr/20 22:56
Worklog Time Spent: 10m
Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3
URL: https://github.com/apache/beam/pull/11037#issuecomment-610078632

   Run Java Spark PortableValidatesRunner Batch

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
-------------------

Worklog Id: (was: 417263)
Time Spent: 6h 20m (was: 6h 10m)

> Improve Spark runner reshuffle translation to maximize parallelism
> ------------------------------------------------------------------
>
> Key: BEAM-9434
> URL: https://issues.apache.org/jira/browse/BEAM-9434
> Project: Beam
> Issue Type: Improvement
> Components: runner-spark
> Affects Versions: 2.19.0
> Reporter: Emiliano Capoccia
> Assignee: Emiliano Capoccia
> Priority: Minor
> Fix For: 2.21.0
>
> Time Spent: 6h 20m
> Remaining Estimate: 0h
>
> There is a performance issue when processing a large number of small Avro
> files (tens of thousands or more) in Spark on k8s.
> The recommended way to read a pattern of Avro files in Beam is:
>
> {code:java}
> PCollection<AvroGenClass> records = p.apply(AvroIO.read(AvroGenClass.class)
>     .from("s3://my-bucket/path-to/*.avro")
>     .withHintMatchesManyFiles());
> {code}
> However, with many small files the above results in the entire read taking
> place in a single task/node, which is considerably slow and has scalability
> issues.
> Omitting the hint is not viable either, as it results in too many tasks
> being spawned and the cluster staying busy coordinating tiny tasks with
> high overhead.
> There are a few workarounds on the internet, which mainly revolve around
> compacting the input files before processing, so that a reduced number of
> bulky files is processed in parallel.
> It seems the Spark runner uses the parallelism of the input distributed
> collection (RDD) to calculate the number of partitions in Reshuffle. In the
> case of FileIO/AvroIO, if the input pattern is a regex, the size of the
> input is 1, which is far from an optimal parallelism value. We may fix this
> by improving the translation of Reshuffle to maximize parallelism.

-- 
This message was sent by Atlassian Jira
(v8.3.4#803005)
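The partition-count problem described above can be illustrated outside of Beam or Spark. The sketch below is a hypothetical, conceptual model (not Beam's actual Reshuffle implementation): hash-partitioning elements into N buckets, which is roughly what a shuffle boundary with an explicit partition count does. When the partition count is taken from the input (1 for a single matched pattern), all elements land on one worker; with a cluster-sized count, the work spreads out.

```python
# Conceptual sketch only -- not Beam/Spark code. Illustrates why the
# partition count chosen at a reshuffle boundary controls parallelism.

def reshuffle(elements, num_partitions):
    """Assign each element to a partition by hash, like a shuffle boundary."""
    partitions = [[] for _ in range(num_partitions)]
    for element in elements:
        partitions[hash(element) % num_partitions].append(element)
    return partitions

# Hypothetical workload: tens of thousands of small Avro files.
files = [f"part-{i:05d}.avro" for i in range(10_000)]

# Parallelism inherited from the input (1 partition): one worker reads
# every file serially.
single = reshuffle(files, 1)
assert len(single[0]) == len(files)

# Partition count sized to the cluster: the read is spread across workers.
spread = reshuffle(files, 64)
assert sum(len(p) for p in spread) == len(files)
assert max(len(p) for p in spread) < len(files)
```

The file names and the partition count 64 are invented for illustration; the point is only that the downstream parallelism equals the number of partitions produced at the shuffle, not the size of the data.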
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417264&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417264 ]

ASF GitHub Bot logged work on BEAM-9434:

Author: ASF GitHub Bot
Created on: 06/Apr/20 22:56
Start Date: 06/Apr/20 22:56
Worklog Time Spent: 10m
Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3
URL: https://github.com/apache/beam/pull/11037#issuecomment-610078731

   Run Java Spark PortableValidatesRunner Batch

Issue Time Tracking
-------------------

Worklog Id: (was: 417264)
Time Spent: 6.5h (was: 6h 20m)
[jira] [Work logged] (BEAM-9557) Error setting processing time timers near end-of-window
[ https://issues.apache.org/jira/browse/BEAM-9557?focusedWorklogId=417262&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417262 ]

ASF GitHub Bot logged work on BEAM-9557:

Author: ASF GitHub Bot
Created on: 06/Apr/20 22:56
Start Date: 06/Apr/20 22:56
Worklog Time Spent: 10m
Work Description: amaliujia commented on issue #11226: [BEAM-9557] Fix timer window boundary checking
URL: https://github.com/apache/beam/pull/11226#issuecomment-610078537

   @reuvenlax do you need help on the failed java tests?

Issue Time Tracking
-------------------

Worklog Id: (was: 417262)
Time Spent: 7h 10m (was: 7h)

> Error setting processing time timers near end-of-window
> -------------------------------------------------------
>
> Key: BEAM-9557
> URL: https://issues.apache.org/jira/browse/BEAM-9557
> Project: Beam
> Issue Type: Bug
> Components: runner-core
> Reporter: Steve Niemitz
> Assignee: Reuven Lax
> Priority: Critical
> Fix For: 2.20.0
>
> Time Spent: 7h 10m
> Remaining Estimate: 0h
>
> Previously, it was possible to set a processing time timer past the end of
> a window, and it would simply not fire.
> Now, however, this results in an error:
> {code:java}
> java.lang.IllegalArgumentException: Attempted to set event time timer that
> outputs for 2020-03-19T18:01:35.000Z but that is after the expiration of
> window 2020-03-19T17:59:59.999Z
>   org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Preconditions.checkArgument(Preconditions.java:440)
>   org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner$TimerInternalsTimer.setAndVerifyOutputTimestamp(SimpleDoFnRunner.java:1011)
>   org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner$TimerInternalsTimer.setRelative(SimpleDoFnRunner.java:934)
>   .processElement(???.scala:187)
> {code}
> I think the regression was introduced in commit
> a005fd765a762183ca88df90f261f6d4a20cf3e0. Note also that the error message
> itself is wrong: it says "event time timer", but the timer is in the
> processing time domain.

-- 
This message was sent by Atlassian Jira
(v8.3.4#803005)
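The boundary check at issue can be sketched in a few lines. This is a hypothetical model, not Beam's actual `SimpleDoFnRunner` code: an event-time timer whose output timestamp falls after the window's expiration is rejected, while a processing-time timer may legitimately target a wall-clock instant past the window end (it simply fires without visible output once the window has expired). The timestamps and zero allowed lateness are illustrative assumptions.

```python
# Hypothetical sketch of a timer boundary check -- not Beam's actual
# implementation. Timestamps are epoch milliseconds.

WINDOW_END_MS = 1_584_640_799_999   # e.g. 2020-03-19T17:59:59.999Z
ALLOWED_LATENESS_MS = 0             # assumed zero for this sketch

def set_timer(target_ms, time_domain):
    """Validate and 'set' a timer; returns (target, domain) on success.

    Only event-time targets are compared against the window's expiry:
    an event-time timer past expiry can never produce valid output, so
    it is rejected. A processing-time timer is wall-clock based and is
    accepted regardless of the window's event-time bounds.
    """
    window_expiry_ms = WINDOW_END_MS + ALLOWED_LATENESS_MS
    if time_domain == "EVENT_TIME" and target_ms > window_expiry_ms:
        raise ValueError(
            f"Attempted to set event time timer for {target_ms} "
            f"but that is after the expiration of window {window_expiry_ms}")
    return (target_ms, time_domain)

# A processing-time timer ~95s past the window end is accepted.
set_timer(WINDOW_END_MS + 95_001, "PROCESSING_TIME")
```

Under this model, the pre-regression behavior corresponds to routing processing-time timers around the event-time check, which also explains why the reported error message misleadingly says "event time timer".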
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417258&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417258 ]

ASF GitHub Bot logged work on BEAM-9434:

Author: ASF GitHub Bot
Created on: 06/Apr/20 22:55
Start Date: 06/Apr/20 22:55
Worklog Time Spent: 10m
Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3
URL: https://github.com/apache/beam/pull/11037#issuecomment-610078304

   Run Spark Runner Nexmark Tests

Issue Time Tracking
-------------------

Worklog Id: (was: 417258)
Time Spent: 5h 50m (was: 5h 40m)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417261&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417261 ]

ASF GitHub Bot logged work on BEAM-9434:

Author: ASF GitHub Bot
Created on: 06/Apr/20 22:55
Start Date: 06/Apr/20 22:55
Worklog Time Spent: 10m
Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3
URL: https://github.com/apache/beam/pull/11037#issuecomment-610078527

   Run Java Spark PortableValidatesRunner Batch

Issue Time Tracking
-------------------

Worklog Id: (was: 417261)
Time Spent: 6h 10m (was: 6h)
[jira] [Work logged] (BEAM-9557) Error setting processing time timers near end-of-window
[ https://issues.apache.org/jira/browse/BEAM-9557?focusedWorklogId=417256&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417256 ]

ASF GitHub Bot logged work on BEAM-9557:

Author: ASF GitHub Bot
Created on: 06/Apr/20 22:55
Start Date: 06/Apr/20 22:55
Worklog Time Spent: 10m
Work Description: amaliujia commented on issue #11226: [BEAM-9557] Fix timer window boundary checking
URL: https://github.com/apache/beam/pull/11226#issuecomment-610078277

   LGTM

Issue Time Tracking
-------------------

Worklog Id: (was: 417256)
Time Spent: 7h (was: 6h 50m)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417257&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417257 ]

ASF GitHub Bot logged work on BEAM-9434:

Author: ASF GitHub Bot
Created on: 06/Apr/20 22:55
Start Date: 06/Apr/20 22:55
Worklog Time Spent: 10m
Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3
URL: https://github.com/apache/beam/pull/11037#issuecomment-610078304

   Run Spark Runner Nexmark Tests

Issue Time Tracking
-------------------

Worklog Id: (was: 417257)
Time Spent: 5h 40m (was: 5.5h)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417255&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417255 ]

ASF GitHub Bot logged work on BEAM-9434:

Author: ASF GitHub Bot
Created on: 06/Apr/20 22:55
Start Date: 06/Apr/20 22:55
Worklog Time Spent: 10m
Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3
URL: https://github.com/apache/beam/pull/11037#issuecomment-610078236

   Run Spark Runner Nexmark Tests

Issue Time Tracking
-------------------

Worklog Id: (was: 417255)
Time Spent: 5.5h (was: 5h 20m)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417259&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417259 ]

ASF GitHub Bot logged work on BEAM-9434:

Author: ASF GitHub Bot
Created on: 06/Apr/20 22:55
Start Date: 06/Apr/20 22:55
Worklog Time Spent: 10m
Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3
URL: https://github.com/apache/beam/pull/11037#issuecomment-610077933

   Run Spark Runner Nexmark Tests

Issue Time Tracking
-------------------

Worklog Id: (was: 417259)
Time Spent: 6h (was: 5h 50m)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417254&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417254 ]

ASF GitHub Bot logged work on BEAM-9434:

Author: ASF GitHub Bot
Created on: 06/Apr/20 22:54
Start Date: 06/Apr/20 22:54
Worklog Time Spent: 10m
Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3
URL: https://github.com/apache/beam/pull/11037#issuecomment-610078155

   Run Spark ValidatesRunner

Issue Time Tracking
-------------------

Worklog Id: (was: 417254)
Time Spent: 5h 20m (was: 5h 10m)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417252&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417252 ]

ASF GitHub Bot logged work on BEAM-9434:

Author: ASF GitHub Bot
Created on: 06/Apr/20 22:54
Start Date: 06/Apr/20 22:54
Worklog Time Spent: 10m
Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3
URL: https://github.com/apache/beam/pull/11037#issuecomment-610077676

   Run Spark ValidatesRunner

Issue Time Tracking
-------------------

Worklog Id: (was: 417252)
Time Spent: 5h (was: 4h 50m)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417253&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417253 ]

ASF GitHub Bot logged work on BEAM-9434:

Author: ASF GitHub Bot
Created on: 06/Apr/20 22:54
Start Date: 06/Apr/20 22:54
Worklog Time Spent: 10m
Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3
URL: https://github.com/apache/beam/pull/11037#issuecomment-610078155

   Run Spark ValidatesRunner

Issue Time Tracking
-------------------

Worklog Id: (was: 417253)
Time Spent: 5h 10m (was: 5h)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417251&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417251 ]

ASF GitHub Bot logged work on BEAM-9434:

Author: ASF GitHub Bot
Created on: 06/Apr/20 22:54
Start Date: 06/Apr/20 22:54
Worklog Time Spent: 10m
Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3
URL: https://github.com/apache/beam/pull/11037#issuecomment-610077992

   Run Spark ValidatesRunner

Issue Time Tracking
-------------------

Worklog Id: (was: 417251)
Time Spent: 4h 50m (was: 4h 40m)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417248&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417248 ]
ASF GitHub Bot logged work on BEAM-9434:
Author: ASF GitHub Bot
Created on: 06/Apr/20 22:53
Start Date: 06/Apr/20 22:53
Worklog Time Spent: 10m
Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3
URL: https://github.com/apache/beam/pull/11037#issuecomment-610077189
retest this please
Issue Time Tracking
---
Worklog Id: (was: 417248) Time Spent: 4h 20m (was: 4h 10m)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417247&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417247 ]
ASF GitHub Bot logged work on BEAM-9434:
Author: ASF GitHub Bot
Created on: 06/Apr/20 22:53
Start Date: 06/Apr/20 22:53
Worklog Time Spent: 10m
Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3
URL: https://github.com/apache/beam/pull/11037#issuecomment-610077676
Run Spark ValidatesRunner
Issue Time Tracking
---
Worklog Id: (was: 417247) Time Spent: 4h 10m (was: 4h)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417250&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417250 ]
ASF GitHub Bot logged work on BEAM-9434:
Author: ASF GitHub Bot
Created on: 06/Apr/20 22:53
Start Date: 06/Apr/20 22:53
Worklog Time Spent: 10m
Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3
URL: https://github.com/apache/beam/pull/11037#issuecomment-610077933
Run Spark Runner Nexmark Tests
Issue Time Tracking
---
Worklog Id: (was: 417250) Time Spent: 4h 40m (was: 4.5h)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417249&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417249 ]
ASF GitHub Bot logged work on BEAM-9434:
Author: ASF GitHub Bot
Created on: 06/Apr/20 22:53
Start Date: 06/Apr/20 22:53
Worklog Time Spent: 10m
Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3
URL: https://github.com/apache/beam/pull/11037#issuecomment-610077560
retest this please
Issue Time Tracking
---
Worklog Id: (was: 417249) Time Spent: 4.5h (was: 4h 20m)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417246&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417246 ]
ASF GitHub Bot logged work on BEAM-9434:
Author: ASF GitHub Bot
Created on: 06/Apr/20 22:52
Start Date: 06/Apr/20 22:52
Worklog Time Spent: 10m
Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3
URL: https://github.com/apache/beam/pull/11037#issuecomment-610077560
retest this please
Issue Time Tracking
---
Worklog Id: (was: 417246) Time Spent: 4h (was: 3h 50m)
[jira] [Work logged] (BEAM-9434) Improve Spark runner reshuffle translation to maximize parallelism
[ https://issues.apache.org/jira/browse/BEAM-9434?focusedWorklogId=417245&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-417245 ]
ASF GitHub Bot logged work on BEAM-9434:
Author: ASF GitHub Bot
Created on: 06/Apr/20 22:51
Start Date: 06/Apr/20 22:51
Worklog Time Spent: 10m
Work Description: iemejia commented on issue #11037: [BEAM-9434] performance improvements reading many Avro files in S3
URL: https://github.com/apache/beam/pull/11037#issuecomment-610077189
retest this please
Issue Time Tracking
---
Worklog Id: (was: 417245) Time Spent: 3h 50m (was: 3h 40m)