[jira] [Updated] (BEAM-9685) Don't release Go SDK container until Go is officially supported.

2020-05-20 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang updated BEAM-9685:
---
Description: 
We decided to remove Go SDK images from the release process starting with 2.21.0, because the Go SDK is not mature enough to release.

We can add the images back to the release process when the following items happen.
 # The Go SDK is mature enough for users. [~lostluck] should have ideas about when it will be ready.
 # Licenses/notices/source code are added to the images to avoid legal issues. There was a PR that tried to do this, but it was closed when we decided not to release Go images. PR: [https://github.com/apache/beam/pull/11246]

To remove the Go SDK from the release, we need to do the following.
 # Remove the Go SDK container from the release process.
 # Update the documentation accordingly.

PR for removing: [https://github.com/apache/beam/pull/11308]

To add the Go SDK images back to the release process, we need to revert, and possibly improve, the above two items.

  was:
We decided to remove Go SDK images from the release process starting with 2.21.0, because the Go SDK is not mature enough to release.

We can add the images back to the release process when the following items happen.
 # The Go SDK is mature enough for users. [~lostluck] should have ideas about when it will be ready.
 # Licenses/notices/source code are added to the images to avoid legal issues. There was a PR that tried to do this, but it was closed when we decided not to release Go images. PR: [https://github.com/apache/beam/pull/11246]

To remove the Go SDK from the release, we need to do the following.
 # Remove the Go SDK container from the release process.
 # Update the documentation accordingly.

To add the Go SDK images back to the release process, we need to revert, and possibly improve, the above two items.


> Don't release Go SDK container until Go is officially supported.
> 
>
> Key: BEAM-9685
> URL: https://issues.apache.org/jira/browse/BEAM-9685
> Project: Beam
>  Issue Type: Task
>  Components: build-system
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: P2
> Fix For: 2.21.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> We decided to remove Go SDK images from the release process starting with 2.21.0, because the Go SDK is not mature enough to release.
>  
> We can add the images back to the release process when the following items happen.
>  # The Go SDK is mature enough for users. [~lostluck] should have ideas about when it will be ready.
>  # Licenses/notices/source code are added to the images to avoid legal issues. There was a PR that tried to do this, but it was closed when we decided not to release Go images. PR: [https://github.com/apache/beam/pull/11246]
>  
> To remove the Go SDK from the release, we need to do the following.
>  # Remove the Go SDK container from the release process.
>  # Update the documentation accordingly.
> PR for removing: [https://github.com/apache/beam/pull/11308]
>  
> To add the Go SDK images back to the release process, we need to revert, and possibly improve, the above two items.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-9685) Don't release Go SDK container until Go is officially supported.

2020-05-20 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang updated BEAM-9685:
---
Description: 
We decided to remove Go SDK images from the release process starting with 2.21.0, because the Go SDK is not mature enough to release.

We can add the images back to the release process when the following items happen.
 # The Go SDK is mature enough for users. [~lostluck] should have ideas about when it will be ready.
 # Licenses/notices/source code are added to the images to avoid legal issues. There was a PR that tried to do this, but it was closed when we decided not to release Go images. PR: [https://github.com/apache/beam/pull/11246]

To remove the Go SDK from the release, we need to do the following.
 # Remove the Go SDK container from the release process.
 # Update the documentation accordingly.

To add the Go SDK images back to the release process, we need to revert, and possibly improve, the above two items.

  was:
1. Remove the Go SDK container from the release process.
2. Update the documentation about it.


> Don't release Go SDK container until Go is officially supported.
> 
>
> Key: BEAM-9685
> URL: https://issues.apache.org/jira/browse/BEAM-9685
> Project: Beam
>  Issue Type: Task
>  Components: build-system
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: P2
> Fix For: 2.21.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> We decided to remove Go SDK images from the release process starting with 2.21.0, because the Go SDK is not mature enough to release.
> We can add the images back to the release process when the following items happen.
>  # The Go SDK is mature enough for users. [~lostluck] should have ideas about when it will be ready.
>  # Licenses/notices/source code are added to the images to avoid legal issues. There was a PR that tried to do this, but it was closed when we decided not to release Go images. PR: [https://github.com/apache/beam/pull/11246]
>  
> To remove the Go SDK from the release, we need to do the following.
>  # Remove the Go SDK container from the release process.
>  # Update the documentation accordingly.
>  
> To add the Go SDK images back to the release process, we need to revert, and possibly improve, the above two items.





[jira] [Updated] (BEAM-9685) Don't release Go SDK container until Go is officially supported.

2020-05-20 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang updated BEAM-9685:
---
Description: 
We decided to remove Go SDK images from the release process starting with 2.21.0, because the Go SDK is not mature enough to release.

We can add the images back to the release process when the following items happen.
 # The Go SDK is mature enough for users. [~lostluck] should have ideas about when it will be ready.
 # Licenses/notices/source code are added to the images to avoid legal issues. There was a PR that tried to do this, but it was closed when we decided not to release Go images. PR: [https://github.com/apache/beam/pull/11246]

To remove the Go SDK from the release, we need to do the following.
 # Remove the Go SDK container from the release process.
 # Update the documentation accordingly.

To add the Go SDK images back to the release process, we need to revert, and possibly improve, the above two items.

  was:
We decided to remove Go SDK images from the release process starting with 2.21.0, because the Go SDK is not mature enough to release.

We can add the images back to the release process when the following items happen.
 # The Go SDK is mature enough for users. [~lostluck] should have ideas about when it will be ready.
 # Licenses/notices/source code are added to the images to avoid legal issues. There was a PR that tried to do this, but it was closed when we decided not to release Go images. PR: [https://github.com/apache/beam/pull/11246]

To remove the Go SDK from the release, we need to do the following.
 # Remove the Go SDK container from the release process.
 # Update the documentation accordingly.

To add the Go SDK images back to the release process, we need to revert, and possibly improve, the above two items.


> Don't release Go SDK container until Go is officially supported.
> 
>
> Key: BEAM-9685
> URL: https://issues.apache.org/jira/browse/BEAM-9685
> Project: Beam
>  Issue Type: Task
>  Components: build-system
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: P2
> Fix For: 2.21.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> We decided to remove Go SDK images from the release process starting with 2.21.0, because the Go SDK is not mature enough to release.
>  
> We can add the images back to the release process when the following items happen.
>  # The Go SDK is mature enough for users. [~lostluck] should have ideas about when it will be ready.
>  # Licenses/notices/source code are added to the images to avoid legal issues. There was a PR that tried to do this, but it was closed when we decided not to release Go images. PR: [https://github.com/apache/beam/pull/11246]
>  
> To remove the Go SDK from the release, we need to do the following.
>  # Remove the Go SDK container from the release process.
>  # Update the documentation accordingly.
>  
> To add the Go SDK images back to the release process, we need to revert, and possibly improve, the above two items.





[jira] [Work started] (BEAM-9913) Cross-language ValidatesRunner tests are failing due to failure of ':sdks:java:container:pullLicenses'

2020-05-07 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on BEAM-9913 started by Hannah Jiang.
--
> Cross-language ValidatesRunner tests are failing due to failure of 
> ':sdks:java:container:pullLicenses'
> --
>
> Key: BEAM-9913
> URL: https://issues.apache.org/jira/browse/BEAM-9913
> Project: Beam
>  Issue Type: Bug
>  Components: test-failures
>Reporter: Chamikara Madhusanka Jayalath
>Assignee: Hannah Jiang
>Priority: Major
>
> Both beam_PostCommit_XVR_Flink and beam_PostCommit_XVR_Spark are perma red.
> For example,
> [https://builds.apache.org/view/A-D/view/Beam/view/PostCommit/job/beam_PostCommit_XVR_Flink/2487/]
> [https://scans.gradle.com/s/rydkawcamxtm4/console-log?task=:sdks:java:container:pullLicenses]
>  
> Caused by: 
> org.gradle.process.internal.ExecException
> Process 'command './sdks/java/container/license_scripts/license_script.sh'' 
> finished with non-zero exit value 2
>  
> at 
> org.gradle.process.internal.DefaultExecHandle$ExecResultImpl.assertNormalExitValue(DefaultExecHandle.java:396)
> at 
> org.gradle.process.internal.DefaultExecAction.execute(DefaultExecAction.java:37)
>  
> Probably due to [https://github.com/apache/beam/pull/11548]
>  
> Hannah, can you please take a look ?





[jira] [Updated] (BEAM-9913) Cross-language ValidatesRunner tests are failing due to failure of ':sdks:java:container:pullLicenses'

2020-05-07 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang updated BEAM-9913:
---
Status: Open  (was: Triage Needed)

> Cross-language ValidatesRunner tests are failing due to failure of 
> ':sdks:java:container:pullLicenses'
> --
>
> Key: BEAM-9913
> URL: https://issues.apache.org/jira/browse/BEAM-9913
> Project: Beam
>  Issue Type: Bug
>  Components: test-failures
>Reporter: Chamikara Madhusanka Jayalath
>Assignee: Hannah Jiang
>Priority: Major
>
> Both beam_PostCommit_XVR_Flink and beam_PostCommit_XVR_Spark are perma red.
> For example,
> [https://builds.apache.org/view/A-D/view/Beam/view/PostCommit/job/beam_PostCommit_XVR_Flink/2487/]
> [https://scans.gradle.com/s/rydkawcamxtm4/console-log?task=:sdks:java:container:pullLicenses]
>  
> Caused by: 
> org.gradle.process.internal.ExecException
> Process 'command './sdks/java/container/license_scripts/license_script.sh'' 
> finished with non-zero exit value 2
>  
> at 
> org.gradle.process.internal.DefaultExecHandle$ExecResultImpl.assertNormalExitValue(DefaultExecHandle.java:396)
> at 
> org.gradle.process.internal.DefaultExecAction.execute(DefaultExecAction.java:37)
>  
> Probably due to [https://github.com/apache/beam/pull/11548]
>  
> Hannah, can you please take a look ?





[jira] [Commented] (BEAM-9880) touch: build/target/third_party_licenses/skip: No such file or directory

2020-05-04 Thread Hannah Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17099365#comment-17099365
 ] 

Hannah Jiang commented on BEAM-9880:


Created a PR that should solve the issue. 
[https://github.com/apache/beam/pull/11606]

> touch: build/target/third_party_licenses/skip: No such file or directory
> 
>
> Key: BEAM-9880
> URL: https://issues.apache.org/jira/browse/BEAM-9880
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-harness
>Reporter: Kyle Weaver
>Assignee: Hannah Jiang
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When I run ./gradlew 
> :sdks:python:test-suites:portable:py2:crossLanguageTests, I get the following 
> error:
> > Task :sdks:java:container:createFile FAILED
> touch: build/target/third_party_licenses/skip: No such file or directory
> When I do `ls build`, the only thing it outputs is `gradleenv`. So it looks 
> like it's assuming the directory exists, when it might not.





[jira] [Work started] (BEAM-9880) touch: build/target/third_party_licenses/skip: No such file or directory

2020-05-04 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on BEAM-9880 started by Hannah Jiang.
--
> touch: build/target/third_party_licenses/skip: No such file or directory
> 
>
> Key: BEAM-9880
> URL: https://issues.apache.org/jira/browse/BEAM-9880
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-harness
>Reporter: Kyle Weaver
>Assignee: Hannah Jiang
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When I run ./gradlew 
> :sdks:python:test-suites:portable:py2:crossLanguageTests, I get the following 
> error:
> > Task :sdks:java:container:createFile FAILED
> touch: build/target/third_party_licenses/skip: No such file or directory
> When I do `ls build`, the only thing it outputs is `gradleenv`. So it looks 
> like it's assuming the directory exists, when it might not.





[jira] [Commented] (BEAM-9880) touch: build/target/third_party_licenses/skip: No such file or directory

2020-05-04 Thread Hannah Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17099344#comment-17099344
 ] 

Hannah Jiang commented on BEAM-9880:


The directory should be created by 
[https://github.com/apache/beam/blob/master/sdks/java/container/build.gradle#L110]

I cannot reproduce it. Can you try creating only the Java image?

By the way, you should use docker-pull-licenses or isRelease (which hasn't been 
merged yet) to pull licenses.

 

> touch: build/target/third_party_licenses/skip: No such file or directory
> 
>
> Key: BEAM-9880
> URL: https://issues.apache.org/jira/browse/BEAM-9880
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-harness
>Reporter: Kyle Weaver
>Assignee: Hannah Jiang
>Priority: Major
>
> When I run ./gradlew 
> :sdks:python:test-suites:portable:py2:crossLanguageTests, I get the following 
> error:
> > Task :sdks:java:container:createFile FAILED
> touch: build/target/third_party_licenses/skip: No such file or directory
> When I do `ls build`, the only thing it outputs is `gradleenv`. So it looks 
> like it's assuming the directory exists, when it might not.





[jira] [Commented] (BEAM-9880) touch: build/target/third_party_licenses/skip: No such file or directory

2020-05-04 Thread Hannah Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17099324#comment-17099324
 ] 

Hannah Jiang commented on BEAM-9880:


Did you pull from the head and still see the issue? 
Can you check if sdks/java/container/build/target/third_party_licenses exists?

> touch: build/target/third_party_licenses/skip: No such file or directory
> 
>
> Key: BEAM-9880
> URL: https://issues.apache.org/jira/browse/BEAM-9880
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-harness
>Reporter: Kyle Weaver
>Assignee: Hannah Jiang
>Priority: Major
>
> When I run ./gradlew 
> :sdks:python:test-suites:portable:py2:crossLanguageTests, I get the following 
> error:
> > Task :sdks:java:container:createFile FAILED
> touch: build/target/third_party_licenses/skip: No such file or directory
> When I do `ls build`, the only thing it outputs is `gradleenv`. So it looks 
> like it's assuming the directory exists, when it might not.





[jira] [Closed] (BEAM-8209) Document custom docker containers

2020-04-30 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang closed BEAM-8209.
--
Fix Version/s: Not applicable
   Resolution: Fixed

> Document custom docker containers
> -
>
> Key: BEAM-8209
> URL: https://issues.apache.org/jira/browse/BEAM-8209
> Project: Beam
>  Issue Type: Sub-task
>  Components: website
>Reporter: Cyrus Maden
>Assignee: Cyrus Maden
>Priority: Minor
> Fix For: Not applicable
>
>  Time Spent: 9h 10m
>  Remaining Estimate: 0h
>






[jira] [Updated] (BEAM-9867) Status of custom container as of 04/2020

2020-04-30 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang updated BEAM-9867:
---
Parent: BEAM-7907
Issue Type: Sub-task  (was: Task)

> Status of custom container as of 04/2020
> 
>
> Key: BEAM-9867
> URL: https://issues.apache.org/jira/browse/BEAM-9867
> Project: Beam
>  Issue Type: Sub-task
>  Components: build-system
>Reporter: Hannah Jiang
>Priority: Major
>
> Here is a link to a doc that summarizes the status of custom containers as of 
> 04/2020: https://s.apache.org/ygrub





[jira] [Created] (BEAM-9867) Status of custom container as of 04/2020

2020-04-30 Thread Hannah Jiang (Jira)
Hannah Jiang created BEAM-9867:
--

 Summary: Status of custom container as of 04/2020
 Key: BEAM-9867
 URL: https://issues.apache.org/jira/browse/BEAM-9867
 Project: Beam
  Issue Type: Task
  Components: build-system
Reporter: Hannah Jiang


Here is a link to a doc that summarizes the status of custom containers as of 
04/2020: https://s.apache.org/ygrub





[jira] [Created] (BEAM-9866) Improve docker-pull-licenses trigger rule

2020-04-30 Thread Hannah Jiang (Jira)
Hannah Jiang created BEAM-9866:
--

 Summary: Improve docker-pull-licenses trigger rule
 Key: BEAM-9866
 URL: https://issues.apache.org/jira/browse/BEAM-9866
 Project: Beam
  Issue Type: Task
  Components: build-system
Reporter: Hannah Jiang


Currently, license pulling is triggered in Jenkins tests whenever a Java or Python SDK 
docker image is created. We can improve this by triggering the job only when there 
are dependency changes. 
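
One possible shape for such a trigger rule, sketched below. This is only an illustration; the dependency file names checked here are hypothetical and not Beam's actual build layout.

```python
# Sketch: decide whether to trigger license pulling based on which files
# changed. Only files that can affect third-party dependencies count.
from pathlib import PurePath

def should_pull_licenses(changed_files):
    """Return True if any changed file can affect third-party dependencies."""
    # Hypothetical set of dependency-declaring files; adjust per project.
    dep_basenames = {"build.gradle", "setup.py", "pom.xml"}
    for path in changed_files:
        name = PurePath(path).name
        if name in dep_basenames or name.startswith("requirements"):
            return True
    return False
```

A CI job could call this with the commit's changed-file list and skip the license-pulling task when it returns False.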





[jira] [Created] (BEAM-9849) Caching license files for license pulling

2020-04-28 Thread Hannah Jiang (Jira)
Hannah Jiang created BEAM-9849:
--

 Summary: Caching license files for license pulling
 Key: BEAM-9849
 URL: https://issues.apache.org/jira/browse/BEAM-9849
 Project: Beam
  Issue Type: Task
  Components: build-system
Reporter: Hannah Jiang


Licenses are pulled every time a docker image is created.
We need to come up with a caching approach so that the same 
file is pulled only once.
This caching approach should be usable by all images released by Beam, 
including SDK docker images, Flink & Spark job server images, etc.





[jira] [Resolved] (BEAM-9443) support direct_num_workers=0

2020-04-27 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang resolved BEAM-9443.

Resolution: Fixed

> support direct_num_workers=0 
> -
>
> Key: BEAM-9443
> URL: https://issues.apache.org/jira/browse/BEAM-9443
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py-core
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.22.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> when direct_num_workers=0, set it to the number of cores.
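
The behavior described above can be sketched as follows; this is an illustration of the fallback rule, not the SDK's actual code, and the function name is hypothetical.

```python
# Sketch: treat direct_num_workers=0 as "auto" and fall back to the
# machine's core count.
import os

def resolve_num_workers(direct_num_workers):
    """Return the requested worker count, or the core count when 0 is given."""
    return direct_num_workers or os.cpu_count()
```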





[jira] [Resolved] (BEAM-9797) license_script.sh calls pip install/uninstall in local env

2020-04-27 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang resolved BEAM-9797.

Fix Version/s: 2.21.0
   Resolution: Fixed

> license_script.sh calls pip install/uninstall in local env
> --
>
> Key: BEAM-9797
> URL: https://issues.apache.org/jira/browse/BEAM-9797
> Project: Beam
>  Issue Type: Bug
>  Components: build-system
>Reporter: Udi Meiri
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.21.0
>
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> File is: 
> https://github.com/apache/beam/blob/master/sdks/java/container/license_scripts/license_script.sh
> The problem is with the code that does pip install and uninstall.
> 1. It is not okay to modify the local environment.
> 2. Running this script in parallel with itself (on Jenkins) has a chance to 
> cause a race.
> The solution is to use a tox environment to run this script in. Tox will take 
> care of creating a virtualenv with the required dependencies.
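
A minimal sketch of what such a tox environment might look like; the environment name and dependency list here are hypothetical, not Beam's actual configuration.

```ini
# Hypothetical tox environment: runs the license script inside an isolated
# virtualenv so it never installs or uninstalls packages in the local env.
[testenv:pull-licenses]
deps =
    beautifulsoup4
    pyyaml
    tenacity
commands =
    python {toxinidir}/sdks/java/container/license_scripts/pull_licenses_java.py
```

Because tox creates a fresh virtualenv per environment, parallel Jenkins runs no longer race on a shared local Python installation.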





[jira] [Resolved] (BEAM-9778) beam_PostCommit_XVR_Spark failing

2020-04-27 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang resolved BEAM-9778.

Fix Version/s: 2.21.0
   Resolution: Fixed

> beam_PostCommit_XVR_Spark failing
> -
>
> Key: BEAM-9778
> URL: https://issues.apache.org/jira/browse/BEAM-9778
> Project: Beam
>  Issue Type: Bug
>  Components: test-failures
>Reporter: Kyle Weaver
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.21.0
>
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> https://builds.apache.org/job/beam_PostCommit_XVR_Spark/
> 17:59:09 Execution failed for task 
> ':sdks:java:container:generateThirdPartyLicenses'.
> 17:59:09 > Process 'command 
> './sdks/java/container/license_scripts/license_script.sh'' finished with 
> non-zero exit value 1





[jira] [Comment Edited] (BEAM-9778) beam_PostCommit_XVR_Spark failing

2020-04-21 Thread Hannah Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088999#comment-17088999
 ] 

Hannah Jiang edited comment on BEAM-9778 at 4/21/20, 7:48 PM:
--

It fails because sdks:java:container:generateThirdPartyLicenses was executed 
twice by the :runners:spark:job-server:validatesCrossLanguageRunner job.
Error: mkdir: cannot create directory 
‘sdks/java/container/third_party_licenses’: File exists


was (Author: hannahjiang):
It fails because the java docker image is created twice by the 
:runners:spark:job-server:validatesCrossLanguageRunner job.
Error: mkdir: cannot create directory 
‘sdks/java/container/third_party_licenses’: File exists

> beam_PostCommit_XVR_Spark failing
> -
>
> Key: BEAM-9778
> URL: https://issues.apache.org/jira/browse/BEAM-9778
> Project: Beam
>  Issue Type: Bug
>  Components: test-failures
>Reporter: Kyle Weaver
>Assignee: Hannah Jiang
>Priority: Major
>
> https://builds.apache.org/job/beam_PostCommit_XVR_Spark/
> 17:59:09 Execution failed for task 
> ':sdks:java:container:generateThirdPartyLicenses'.
> 17:59:09 > Process 'command 
> './sdks/java/container/license_scripts/license_script.sh'' finished with 
> non-zero exit value 1





[jira] [Commented] (BEAM-9778) beam_PostCommit_XVR_Spark failing

2020-04-21 Thread Hannah Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088999#comment-17088999
 ] 

Hannah Jiang commented on BEAM-9778:


It fails because the java docker image is created twice by the 
:runners:spark:job-server:validatesCrossLanguageRunner job.
Error: mkdir: cannot create directory 
‘sdks/java/container/third_party_licenses’: File exists
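
This "File exists" failure mode can be avoided by making directory creation idempotent, i.e. the equivalent of `mkdir -p`. A sketch (illustrative only, not the actual fix in the Beam build):

```python
# Sketch: idempotent directory creation, so a task that runs twice does not
# fail with "File exists" the way a bare `mkdir` does.
import os
import tempfile

def ensure_dir(path):
    """Create path if needed; succeed silently if it already exists."""
    os.makedirs(path, exist_ok=True)

demo = os.path.join(tempfile.mkdtemp(), "third_party_licenses")
ensure_dir(demo)
ensure_dir(demo)  # second call is a no-op rather than an error
```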

> beam_PostCommit_XVR_Spark failing
> -
>
> Key: BEAM-9778
> URL: https://issues.apache.org/jira/browse/BEAM-9778
> Project: Beam
>  Issue Type: Bug
>  Components: test-failures
>Reporter: Kyle Weaver
>Assignee: Hannah Jiang
>Priority: Major
>
> https://builds.apache.org/job/beam_PostCommit_XVR_Spark/
> 17:59:09 Execution failed for task 
> ':sdks:java:container:generateThirdPartyLicenses'.
> 17:59:09 > Process 'command 
> './sdks/java/container/license_scripts/license_script.sh'' finished with 
> non-zero exit value 1





[jira] [Resolved] (BEAM-9764) :sdks:java:container:generateThirdPartyLicenses failing

2020-04-21 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang resolved BEAM-9764.

Fix Version/s: 2.21.0
   Resolution: Fixed

> :sdks:java:container:generateThirdPartyLicenses failing
> ---
>
> Key: BEAM-9764
> URL: https://issues.apache.org/jira/browse/BEAM-9764
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core, test-failures
>Reporter: Udi Meiri
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.21.0
>
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> https://builds.apache.org/job/beam_PreCommit_Python2_PVR_Flink_Cron/774/console
> The traceback is interspersed with other logs:
> {code}
> Traceback (most recent call last):
> Successfully pulled 
> java_third_party_licenses/protobuf-java-util-3.11.1.jar/LICENSE from 
> https://opensource.org/licenses/BSD-3-Clause
> Successfully pulled java_third_party_licenses/protoc-3.11.0.jar/LICENSE from 
> http://www.apache.org/licenses/LICENSE-2.0.txt
>   File "sdks/java/container/license_scripts/pull_licenses_java.py", line 138, 
> in <module>
> Successfully pulled java_third_party_licenses/protoc-3.11.1.jar/LICENSE from 
> http://www.apache.org/licenses/LICENSE-2.0.txt
> license_url = dep['moduleLicenseUrl']
> Successfully pulled java_third_party_licenses/zetasketch-0.1.0.jar/LICENSE 
> from http://www.apache.org/licenses/LICENSE-2.0.txt
> KeyError: 'moduleLicenseUrl'
> {code}
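
The KeyError above suggests some dependency entries lack a `moduleLicenseUrl` field. A defensive lookup avoids crashing the whole pull; the sketch below mirrors the entry format implied by the error, but skipping and reporting such entries is a hypothetical policy choice, not necessarily what pull_licenses_java.py does.

```python
# Sketch: tolerate dependency entries without a 'moduleLicenseUrl' key
# instead of raising KeyError on the first missing entry.
def license_urls(deps):
    """Yield (name, url) for entries that declare a license URL."""
    for dep in deps:
        url = dep.get("moduleLicenseUrl")
        if url is None:
            print("No license URL declared for", dep.get("moduleName", "<unknown>"))
            continue
        yield dep.get("moduleName", "<unknown>"), url

deps = [
    {"moduleName": "protoc-3.11.0",
     "moduleLicenseUrl": "http://www.apache.org/licenses/LICENSE-2.0.txt"},
    {"moduleName": "zetasketch-0.1.0"},  # no moduleLicenseUrl key
]
resolved = list(license_urls(deps))
```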





[jira] [Work started] (BEAM-9797) license_script.sh calls pip install/uninstall in local env

2020-04-21 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on BEAM-9797 started by Hannah Jiang.
--
> license_script.sh calls pip install/uninstall in local env
> --
>
> Key: BEAM-9797
> URL: https://issues.apache.org/jira/browse/BEAM-9797
> Project: Beam
>  Issue Type: Bug
>  Components: build-system
>Reporter: Udi Meiri
>Assignee: Hannah Jiang
>Priority: Major
>
> File is: 
> https://github.com/apache/beam/blob/master/sdks/java/container/license_scripts/license_script.sh
> The problem is with the code that does pip install and uninstall.
> 1. It is not okay to modify the local environment.
> 2. Running this script in parallel with itself (on Jenkins) has a chance to 
> cause a race.
> The solution is to use a tox environment to run this script in. Tox will take 
> care of creating a virtualenv with the required dependencies.





[jira] [Assigned] (BEAM-9797) license_script.sh calls pip install/uninstall in local env

2020-04-21 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang reassigned BEAM-9797:
--

Assignee: Hannah Jiang

> license_script.sh calls pip install/uninstall in local env
> --
>
> Key: BEAM-9797
> URL: https://issues.apache.org/jira/browse/BEAM-9797
> Project: Beam
>  Issue Type: Bug
>  Components: build-system
>Reporter: Udi Meiri
>Assignee: Hannah Jiang
>Priority: Major
>
> File is: 
> https://github.com/apache/beam/blob/master/sdks/java/container/license_scripts/license_script.sh
> The problem is with the code that does pip install and uninstall.
> 1. It is not okay to modify the local environment.
> 2. Running this script in parallel with itself (on Jenkins) has a chance to 
> cause a race.
> The solution is to use a tox environment to run this script in. Tox will take 
> care of creating a virtualenv with the required dependencies.





[jira] [Updated] (BEAM-9797) license_script.sh calls pip install/uninstall in local env

2020-04-21 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang updated BEAM-9797:
---
Status: Open  (was: Triage Needed)

> license_script.sh calls pip install/uninstall in local env
> --
>
> Key: BEAM-9797
> URL: https://issues.apache.org/jira/browse/BEAM-9797
> Project: Beam
>  Issue Type: Bug
>  Components: build-system
>Reporter: Udi Meiri
>Priority: Major
>
> File is: 
> https://github.com/apache/beam/blob/master/sdks/java/container/license_scripts/license_script.sh
> The problem is with the code that does pip install and uninstall.
> 1. It is not okay to modify the local environment.
> 2. Running this script in parallel with itself (on Jenkins) has a chance to 
> cause a race.
> The solution is to use a tox environment to run this script in. Tox will take 
> care of creating a virtualenv with the required dependencies.





[jira] [Commented] (BEAM-6586) Design and implement a release process for Beam SDK harness containers.

2020-04-20 Thread Hannah Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-6586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088226#comment-17088226
 ] 

Hannah Jiang commented on BEAM-6586:


It is a duplicate of BEAM-8105. 

> Design and implement a release process for Beam SDK harness containers.
> ---
>
> Key: BEAM-6586
> URL: https://issues.apache.org/jira/browse/BEAM-6586
> Project: Beam
>  Issue Type: New Feature
>  Components: build-system
>Reporter: Valentyn Tymofieiev
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: Not applicable
>
>
> Related discussion: 
> [https://lists.apache.org/thread.html/770496ee9cf1096d78806fece8dd37716279b51ca5bb600dfa263c55@%3Cdev.beam.apache.org%3E]
> cc: [~angoenka]





[jira] [Work started] (BEAM-9778) beam_PostCommit_XVR_Spark failing

2020-04-20 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on BEAM-9778 started by Hannah Jiang.
--
> beam_PostCommit_XVR_Spark failing
> -
>
> Key: BEAM-9778
> URL: https://issues.apache.org/jira/browse/BEAM-9778
> Project: Beam
>  Issue Type: Bug
>  Components: test-failures
>Reporter: Kyle Weaver
>Assignee: Hannah Jiang
>Priority: Major
>
> https://builds.apache.org/job/beam_PostCommit_XVR_Spark/
> 17:59:09 Execution failed for task 
> ':sdks:java:container:generateThirdPartyLicenses'.
> 17:59:09 > Process 'command 
> './sdks/java/container/license_scripts/license_script.sh'' finished with 
> non-zero exit value 1





[jira] [Commented] (BEAM-9685) Don't release Go SDK container until Go is officially supported.

2020-04-16 Thread Hannah Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17085132#comment-17085132
 ] 

Hannah Jiang commented on BEAM-9685:


I think it's OK to keep the already-released images. 
I added a comment on the Docker page noting that the Go SDK is experimental and 
will not be released starting from 2.21.0.

> Don't release Go SDK container until Go is officially supported.
> 
>
> Key: BEAM-9685
> URL: https://issues.apache.org/jira/browse/BEAM-9685
> Project: Beam
>  Issue Type: Task
>  Components: build-system
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.21.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> 1. Remove Go SDK container from release process.
> 2. Update document about it.





[jira] [Work started] (BEAM-9764) :sdks:java:container:generateThirdPartyLicenses failing

2020-04-15 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on BEAM-9764 started by Hannah Jiang.
--
> :sdks:java:container:generateThirdPartyLicenses failing
> ---
>
> Key: BEAM-9764
> URL: https://issues.apache.org/jira/browse/BEAM-9764
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core, test-failures
>Reporter: Udi Meiri
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.21.0
>
>
> https://builds.apache.org/job/beam_PreCommit_Python2_PVR_Flink_Cron/774/console
> The traceback is interspersed with other logs:
> {code}
> Traceback (most recent call last):
> Successfully pulled 
> java_third_party_licenses/protobuf-java-util-3.11.1.jar/LICENSE from 
> https://opensource.org/licenses/BSD-3-Clause
> Successfully pulled java_third_party_licenses/protoc-3.11.0.jar/LICENSE from 
> http://www.apache.org/licenses/LICENSE-2.0.txt
>   File "sdks/java/container/license_scripts/pull_licenses_java.py", line 138, 
> in 
> Successfully pulled java_third_party_licenses/protoc-3.11.1.jar/LICENSE from 
> http://www.apache.org/licenses/LICENSE-2.0.txt
> license_url = dep['moduleLicenseUrl']
> Successfully pulled java_third_party_licenses/zetasketch-0.1.0.jar/LICENSE 
> from http://www.apache.org/licenses/LICENSE-2.0.txt
> KeyError: 'moduleLicenseUrl'
> {code}
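The KeyError above occurs when a dependency entry in the license report has no
'moduleLicenseUrl' field. A minimal sketch of a defensive lookup (hypothetical
helper and data, not the actual pull_licenses_java.py code):

```python
def get_license_url(dep):
    """Return the dependency's license URL, or None when the report
    omits the 'moduleLicenseUrl' field (the cause of the KeyError above)."""
    return dep.get('moduleLicenseUrl')  # dict.get avoids raising KeyError

# Illustrative report entries: one with a license URL, one without.
deps = [
    {'moduleName': 'protoc-3.11.0',
     'moduleLicenseUrl': 'http://www.apache.org/licenses/LICENSE-2.0.txt'},
    {'moduleName': 'xz-1.8'},  # no license URL reported
]

# Dependencies that would need a manual entry in dep_urls_java.yaml.
missing = [d['moduleName'] for d in deps if get_license_url(d) is None]
print(missing)  # → ['xz-1.8']
```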





[jira] [Comment Edited] (BEAM-9764) :sdks:java:container:generateThirdPartyLicenses failing

2020-04-15 Thread Hannah Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17084338#comment-17084338
 ] 

Hannah Jiang edited comment on BEAM-9764 at 4/15/20, 8:32 PM:
--

Log:
{code:java}
05:19:17 Invalid url: 
https://git.tukaani.org/?p=xz-java.git;a=blob_plain;f=COPYING;h=c1d404dc7a6f06a0437bf1055fedaa4a4c89d728;hb=HEAD
05:19:17 Invalid url: 
https://git.tukaani.org/?p=xz-java.git;a=blob_plain;f=COPYING;h=c1d404dc7a6f06a0437bf1055fedaa4a4c89d728;hb=HEAD
{code}

Error: 
{code:java}
05:19:21 Traceback (most recent call last):
05:19:21   File "sdks/java/container/license_scripts/pull_licenses_java.py", 
line 225, in 
05:19:21 error_msg)
05:19:21 RuntimeError: ('1 error(s) occurred.', 
[' Licenses were not able to be pulled 
automatically for some dependencies. Please search source code of the 
dependencies on the internet and add "license" and "notice" (if available) 
field to sdks/java/container/license_scripts/dep_urls_java.yaml for each 
missing license. Dependency List: [xz-1.5,xz-1.8]'])
{code}

The URLs are valid and have worked fine several times. Need to investigate why 
they were reported as invalid in this run.
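If the intermittent "Invalid url" failures turn out to be transient network
errors, a simple retry wrapper is one way to harden the pull. This is a sketch
under that assumption; `fetch` is a stand-in for the real download call, not
the script's actual API:

```python
import time

def pull_with_retries(fetch, url, attempts=3, delay=0.0):
    """Call fetch(url), retrying on failure; re-raise after the last attempt."""
    for i in range(attempts):
        try:
            return fetch(url)
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(delay)  # back off before the next try

# Example: a flaky fetch that fails twice, then succeeds on the third call.
calls = {'n': 0}
def flaky_fetch(url):
    calls['n'] += 1
    if calls['n'] < 3:
        raise IOError('Invalid url: %s' % url)
    return 'COPYING text'

result = pull_with_retries(flaky_fetch, 'https://git.tukaani.org/?p=xz-java.git')
```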


was (Author: hannahjiang):
Log:
{code:java}
05:19:17 Invalid url: 
https://git.tukaani.org/?p=xz-java.git;a=blob_plain;f=COPYING;h=c1d404dc7a6f06a0437bf1055fedaa4a4c89d728;hb=HEAD
05:19:17 Invalid url: 
https://git.tukaani.org/?p=xz-java.git;a=blob_plain;f=COPYING;h=c1d404dc7a6f06a0437bf1055fedaa4a4c89d728;hb=HEAD
{code}
Error: 
{code:java}
05:19:21 Traceback (most recent call last):
05:19:21   File "sdks/java/container/license_scripts/pull_licenses_java.py", 
line 225, in 
05:19:21 error_msg)
05:19:21 RuntimeError: ('1 error(s) occurred.', 
[' Licenses were not able to be pulled 
automatically for some dependencies. Please search source code of the 
dependencies on the internet and add "license" and "notice" (if available) 
field to sdks/java/container/license_scripts/dep_urls_java.yaml for each 
missing license. Dependency List: [xz-1.5,xz-1.8]'])
{code}

The URLs are valid and have worked fine several times. Need to investigate why 
they were reported as invalid in this run.

> :sdks:java:container:generateThirdPartyLicenses failing
> ---
>
> Key: BEAM-9764
> URL: https://issues.apache.org/jira/browse/BEAM-9764
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core, test-failures
>Reporter: Udi Meiri
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.21.0
>
>
> https://builds.apache.org/job/beam_PreCommit_Python2_PVR_Flink_Cron/774/console
> The traceback is interspersed with other logs:
> {code}
> Traceback (most recent call last):
> Successfully pulled 
> java_third_party_licenses/protobuf-java-util-3.11.1.jar/LICENSE from 
> https://opensource.org/licenses/BSD-3-Clause
> Successfully pulled java_third_party_licenses/protoc-3.11.0.jar/LICENSE from 
> http://www.apache.org/licenses/LICENSE-2.0.txt
>   File "sdks/java/container/license_scripts/pull_licenses_java.py", line 138, 
> in 
> Successfully pulled java_third_party_licenses/protoc-3.11.1.jar/LICENSE from 
> http://www.apache.org/licenses/LICENSE-2.0.txt
> license_url = dep['moduleLicenseUrl']
> Successfully pulled java_third_party_licenses/zetasketch-0.1.0.jar/LICENSE 
> from http://www.apache.org/licenses/LICENSE-2.0.txt
> KeyError: 'moduleLicenseUrl'
> {code}





[jira] [Commented] (BEAM-9764) :sdks:java:container:generateThirdPartyLicenses failing

2020-04-15 Thread Hannah Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17084348#comment-17084348
 ] 

Hannah Jiang commented on BEAM-9764:


The next run pulled from the same URLs successfully.  
https://builds.apache.org/job/beam_PreCommit_Python2_PVR_Flink_Cron/775/console

{code}
11:46:40 Successfully pulled java_third_party_licenses/xz-1.5.jar/LICENSE from 
https://git.tukaani.org/?p=xz-java.git;a=blob_plain;f=COPYING;h=c1d404dc7a6f06a0437bf1055fedaa4a4c89d728;hb=HEAD
11:46:40 Successfully pulled java_third_party_licenses/xz-1.8.jar/LICENSE from 
https://git.tukaani.org/?p=xz-java.git;a=blob_plain;f=COPYING;h=c1d404dc7a6f06a0437bf1055fedaa4a4c89d728;hb=HEAD
{code}

I tried pulling from the URLs locally and it worked more than 20 times.
Will add traceback printing to capture more detailed error messages.
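Capturing the traceback alongside the "Invalid url" message could look like
the following sketch (hedged: a simulated failure, not the actual change to
the license script):

```python
import traceback

def pull_license(url):
    # Simulated failure standing in for the real license download.
    raise IOError('Invalid url: %s' % url)

errors = []
try:
    pull_license('https://git.tukaani.org/?p=xz-java.git')
except Exception:
    # Capture the full traceback instead of only 'Invalid url: ...',
    # so the next failure shows where and why the pull failed.
    errors.append(traceback.format_exc())

print(errors[0])
```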

> :sdks:java:container:generateThirdPartyLicenses failing
> ---
>
> Key: BEAM-9764
> URL: https://issues.apache.org/jira/browse/BEAM-9764
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core, test-failures
>Reporter: Udi Meiri
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.21.0
>
>
> https://builds.apache.org/job/beam_PreCommit_Python2_PVR_Flink_Cron/774/console
> The traceback is interspersed with other logs:
> {code}
> Traceback (most recent call last):
> Successfully pulled 
> java_third_party_licenses/protobuf-java-util-3.11.1.jar/LICENSE from 
> https://opensource.org/licenses/BSD-3-Clause
> Successfully pulled java_third_party_licenses/protoc-3.11.0.jar/LICENSE from 
> http://www.apache.org/licenses/LICENSE-2.0.txt
>   File "sdks/java/container/license_scripts/pull_licenses_java.py", line 138, 
> in 
> Successfully pulled java_third_party_licenses/protoc-3.11.1.jar/LICENSE from 
> http://www.apache.org/licenses/LICENSE-2.0.txt
> license_url = dep['moduleLicenseUrl']
> Successfully pulled java_third_party_licenses/zetasketch-0.1.0.jar/LICENSE 
> from http://www.apache.org/licenses/LICENSE-2.0.txt
> KeyError: 'moduleLicenseUrl'
> {code}





[jira] [Commented] (BEAM-9764) :sdks:java:container:generateThirdPartyLicenses failing

2020-04-15 Thread Hannah Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17084338#comment-17084338
 ] 

Hannah Jiang commented on BEAM-9764:


Log:
{code:java}
05:19:17 Invalid url: 
https://git.tukaani.org/?p=xz-java.git;a=blob_plain;f=COPYING;h=c1d404dc7a6f06a0437bf1055fedaa4a4c89d728;hb=HEAD
05:19:17 Invalid url: 
https://git.tukaani.org/?p=xz-java.git;a=blob_plain;f=COPYING;h=c1d404dc7a6f06a0437bf1055fedaa4a4c89d728;hb=HEAD
{code}
Error: 
{code:java}
05:19:21 Traceback (most recent call last):
05:19:21   File "sdks/java/container/license_scripts/pull_licenses_java.py", 
line 225, in 
05:19:21 error_msg)
05:19:21 RuntimeError: ('1 error(s) occurred.', 
[' Licenses were not able to be pulled 
automatically for some dependencies. Please search source code of the 
dependencies on the internet and add "license" and "notice" (if available) 
field to sdks/java/container/license_scripts/dep_urls_java.yaml for each 
missing license. Dependency List: [xz-1.5,xz-1.8]'])
{code}

The URLs are valid and have worked fine several times. Need to investigate why 
they were reported as invalid in this run.

> :sdks:java:container:generateThirdPartyLicenses failing
> ---
>
> Key: BEAM-9764
> URL: https://issues.apache.org/jira/browse/BEAM-9764
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core, test-failures
>Reporter: Udi Meiri
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.21.0
>
>
> https://builds.apache.org/job/beam_PreCommit_Python2_PVR_Flink_Cron/774/console
> The traceback is interspersed with other logs:
> {code}
> Traceback (most recent call last):
> Successfully pulled 
> java_third_party_licenses/protobuf-java-util-3.11.1.jar/LICENSE from 
> https://opensource.org/licenses/BSD-3-Clause
> Successfully pulled java_third_party_licenses/protoc-3.11.0.jar/LICENSE from 
> http://www.apache.org/licenses/LICENSE-2.0.txt
>   File "sdks/java/container/license_scripts/pull_licenses_java.py", line 138, 
> in 
> Successfully pulled java_third_party_licenses/protoc-3.11.1.jar/LICENSE from 
> http://www.apache.org/licenses/LICENSE-2.0.txt
> license_url = dep['moduleLicenseUrl']
> Successfully pulled java_third_party_licenses/zetasketch-0.1.0.jar/LICENSE 
> from http://www.apache.org/licenses/LICENSE-2.0.txt
> KeyError: 'moduleLicenseUrl'
> {code}





[jira] [Commented] (BEAM-9764) :sdks:java:container:generateThirdPartyLicenses failing

2020-04-15 Thread Hannah Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17084334#comment-17084334
 ] 

Hannah Jiang commented on BEAM-9764:


Looking.

> :sdks:java:container:generateThirdPartyLicenses failing
> ---
>
> Key: BEAM-9764
> URL: https://issues.apache.org/jira/browse/BEAM-9764
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core, test-failures
>Reporter: Udi Meiri
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.21.0
>
>
> https://builds.apache.org/job/beam_PreCommit_Python2_PVR_Flink_Cron/774/console
> The traceback is interspersed with other logs:
> {code}
> Traceback (most recent call last):
> Successfully pulled 
> java_third_party_licenses/protobuf-java-util-3.11.1.jar/LICENSE from 
> https://opensource.org/licenses/BSD-3-Clause
> Successfully pulled java_third_party_licenses/protoc-3.11.0.jar/LICENSE from 
> http://www.apache.org/licenses/LICENSE-2.0.txt
>   File "sdks/java/container/license_scripts/pull_licenses_java.py", line 138, 
> in 
> Successfully pulled java_third_party_licenses/protoc-3.11.1.jar/LICENSE from 
> http://www.apache.org/licenses/LICENSE-2.0.txt
> license_url = dep['moduleLicenseUrl']
> Successfully pulled java_third_party_licenses/zetasketch-0.1.0.jar/LICENSE 
> from http://www.apache.org/licenses/LICENSE-2.0.txt
> KeyError: 'moduleLicenseUrl'
> {code}





[jira] [Updated] (BEAM-9764) :sdks:java:container:generateThirdPartyLicenses failing

2020-04-15 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang updated BEAM-9764:
---
Status: Open  (was: Triage Needed)

> :sdks:java:container:generateThirdPartyLicenses failing
> ---
>
> Key: BEAM-9764
> URL: https://issues.apache.org/jira/browse/BEAM-9764
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core, test-failures
>Reporter: Udi Meiri
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.21.0
>
>
> https://builds.apache.org/job/beam_PreCommit_Python2_PVR_Flink_Cron/774/console
> The traceback is interspersed with other logs:
> {code}
> Traceback (most recent call last):
> Successfully pulled 
> java_third_party_licenses/protobuf-java-util-3.11.1.jar/LICENSE from 
> https://opensource.org/licenses/BSD-3-Clause
> Successfully pulled java_third_party_licenses/protoc-3.11.0.jar/LICENSE from 
> http://www.apache.org/licenses/LICENSE-2.0.txt
>   File "sdks/java/container/license_scripts/pull_licenses_java.py", line 138, 
> in 
> Successfully pulled java_third_party_licenses/protoc-3.11.1.jar/LICENSE from 
> http://www.apache.org/licenses/LICENSE-2.0.txt
> license_url = dep['moduleLicenseUrl']
> Successfully pulled java_third_party_licenses/zetasketch-0.1.0.jar/LICENSE 
> from http://www.apache.org/licenses/LICENSE-2.0.txt
> KeyError: 'moduleLicenseUrl'
> {code}





[jira] [Resolved] (BEAM-9136) Add LICENSES and NOTICES to docker images

2020-04-13 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang resolved BEAM-9136.

Resolution: Fixed

> Add LICENSES and NOTICES to docker images
> -
>
> Key: BEAM-9136
> URL: https://issues.apache.org/jira/browse/BEAM-9136
> Project: Beam
>  Issue Type: Task
>  Components: build-system
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.21.0
>
>  Time Spent: 21.5h
>  Remaining Estimate: 0h
>
> Scan dependencies and add licenses and notices of the dependencies to SDK 
> docker images.





[jira] [Updated] (BEAM-9443) support direct_num_workers=0

2020-04-10 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang updated BEAM-9443:
---
Issue Type: Improvement  (was: Bug)

> support direct_num_workers=0 
> -
>
> Key: BEAM-9443
> URL: https://issues.apache.org/jira/browse/BEAM-9443
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py-core
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.22.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> when direct_num_workers=0, set it to number of cores.
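The behavior described above might be sketched as follows (a sketch assuming
`os.cpu_count()` as the source of the core count; not the actual sdk-py-core
change):

```python
import os

def resolve_num_workers(direct_num_workers):
    """Treat direct_num_workers=0 as 'use all cores';
    pass positive values through unchanged."""
    if direct_num_workers == 0:
        return os.cpu_count() or 1  # cpu_count() can return None
    return direct_num_workers

print(resolve_num_workers(0))  # number of cores on this machine
print(resolve_num_workers(4))  # → 4
```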





[jira] [Work started] (BEAM-9443) support direct_num_workers=0

2020-04-09 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on BEAM-9443 started by Hannah Jiang.
--
> support direct_num_workers=0 
> -
>
> Key: BEAM-9443
> URL: https://issues.apache.org/jira/browse/BEAM-9443
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.22.0
>
>
> when direct_num_workers=0, set it to number of cores.





[jira] [Updated] (BEAM-9136) Add LICENSES and NOTICES to docker images

2020-04-09 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang updated BEAM-9136:
---
Fix Version/s: (was: 2.22.0)
   2.21.0

> Add LICENSES and NOTICES to docker images
> -
>
> Key: BEAM-9136
> URL: https://issues.apache.org/jira/browse/BEAM-9136
> Project: Beam
>  Issue Type: Task
>  Components: build-system
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.21.0
>
>  Time Spent: 19h 40m
>  Remaining Estimate: 0h
>
> Scan dependencies and add licenses and notices of the dependencies to SDK 
> docker images.





[jira] [Closed] (BEAM-9685) Don't release Go SDK container until Go is officially supported.

2020-04-08 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang closed BEAM-9685.
--
Resolution: Fixed

> Don't release Go SDK container until Go is officially supported.
> 
>
> Key: BEAM-9685
> URL: https://issues.apache.org/jira/browse/BEAM-9685
> Project: Beam
>  Issue Type: Task
>  Components: build-system
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.21.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> 1. Remove Go SDK container from release process.
> 2. Update document about it.





[jira] [Commented] (BEAM-9443) support direct_num_workers=0

2020-04-08 Thread Hannah Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17078809#comment-17078809
 ] 

Hannah Jiang commented on BEAM-9443:


Moved this to 2.22.0.

> support direct_num_workers=0 
> -
>
> Key: BEAM-9443
> URL: https://issues.apache.org/jira/browse/BEAM-9443
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.22.0
>
>
> when direct_num_workers=0, set it to number of cores.





[jira] [Updated] (BEAM-9443) support direct_num_workers=0

2020-04-08 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang updated BEAM-9443:
---
Fix Version/s: (was: 2.21.0)
   2.22.0

> support direct_num_workers=0 
> -
>
> Key: BEAM-9443
> URL: https://issues.apache.org/jira/browse/BEAM-9443
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.22.0
>
>
> when direct_num_workers=0, set it to number of cores.





[jira] [Commented] (BEAM-9136) Add LICENSES and NOTICES to docker images

2020-04-08 Thread Hannah Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17078808#comment-17078808
 ] 

Hannah Jiang commented on BEAM-9136:


The Python part is complete. We can add Python to 2.21 and Java to 2.22.

It's OK to cut the release branch as it is in master for this ticket.

> Add LICENSES and NOTICES to docker images
> -
>
> Key: BEAM-9136
> URL: https://issues.apache.org/jira/browse/BEAM-9136
> Project: Beam
>  Issue Type: Task
>  Components: build-system
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.21.0
>
>  Time Spent: 19h 40m
>  Remaining Estimate: 0h
>
> Scan dependencies and add licenses and notices of the dependencies to SDK 
> docker images.





[jira] [Updated] (BEAM-9063) Migrate docker images to apache namespace.

2020-04-08 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang updated BEAM-9063:
---
Component/s: (was: beam-community)
 build-system

> Migrate docker images to apache namespace.
> --
>
> Key: BEAM-9063
> URL: https://issues.apache.org/jira/browse/BEAM-9063
> Project: Beam
>  Issue Type: Task
>  Components: build-system
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.20.0
>
>  Time Spent: 6h 40m
>  Remaining Estimate: 0h
>
> https://hub.docker.com/u/apache





[jira] [Commented] (BEAM-9719) Cross-language test suites failing due to missing nose plugin

2020-04-07 Thread Hannah Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17077643#comment-17077643
 ] 

Hannah Jiang commented on BEAM-9719:


How do we want to fix it? Add nose back, or fix the failing tests?

> Cross-language test suites failing due to missing nose plugin
> -
>
> Key: BEAM-9719
> URL: https://issues.apache.org/jira/browse/BEAM-9719
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-harness, test-failures
>Reporter: Chamikara Madhusanka Jayalath
>Priority: Major
>
> Seems like due to [https://github.com/apache/beam/pull/11307]
>  
> [https://builds.apache.org/view/A-D/view/Beam/view/PostCommit/job/beam_PostCommit_XVR_Flink/]
>  
> [https://builds.apache.org/view/A-D/view/Beam/view/PostCommit/job/beam_PostCommit_XVR_Spark/]
>  
> 16:00:54   File 
> "/usr/local/lib/python2.7/site-packages/apache_beam/transforms/validate_runner_xlang_test.py",
>  line 24, in 
> 16:00:54 from nose.plugins.attrib import attr
> 16:00:54 ImportError: No module named nose.plugins.attrib
>  
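One possible mitigation for the import failure quoted above (a sketch only,
not necessarily the fix that was chosen) is a no-op fallback decorator when
nose is absent, so the test module can still be imported:

```python
try:
    from nose.plugins.attrib import attr
except ImportError:
    # nose is not installed in the container; fall back to a no-op
    # decorator so importing the test module does not fail.
    def attr(*args, **kwargs):
        def decorator(fn):
            return fn
        return decorator

@attr('ValidatesRunner')  # hypothetical attribute label
def test_something():
    return 'ran'
```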





[jira] [Work started] (BEAM-9685) Don't release Go SDK container until Go is officially supported.

2020-04-03 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on BEAM-9685 started by Hannah Jiang.
--
> Don't release Go SDK container until Go is officially supported.
> 
>
> Key: BEAM-9685
> URL: https://issues.apache.org/jira/browse/BEAM-9685
> Project: Beam
>  Issue Type: Task
>  Components: build-system
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.21.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> 1. Remove Go SDK container from release process.
> 2. Update document about it.





[jira] [Updated] (BEAM-9685) Don't release Go SDK container until Go is officially supported.

2020-04-03 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang updated BEAM-9685:
---
Status: Open  (was: Triage Needed)

> Don't release Go SDK container until Go is officially supported.
> 
>
> Key: BEAM-9685
> URL: https://issues.apache.org/jira/browse/BEAM-9685
> Project: Beam
>  Issue Type: Task
>  Components: build-system
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.21.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> 1. Remove Go SDK container from release process.
> 2. Update document about it.





[jira] [Updated] (BEAM-9685) Don't release Go SDK container until Go is officially supported.

2020-04-02 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang updated BEAM-9685:
---
Issue Type: Task  (was: Bug)

> Don't release Go SDK container until Go is officially supported.
> 
>
> Key: BEAM-9685
> URL: https://issues.apache.org/jira/browse/BEAM-9685
> Project: Beam
>  Issue Type: Task
>  Components: build-system
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.21.0
>
>
> 1. Remove Go SDK container from release process.
> 2. Update document about it.





[jira] [Commented] (BEAM-9685) Don't release Go SDK container until Go is officially supported.

2020-04-02 Thread Hannah Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074067#comment-17074067
 ] 

Hannah Jiang commented on BEAM-9685:


[~lostluck], could you please add links to the tickets that should be resolved 
before we add Go containers back to the release process?


> Don't release Go SDK container until Go is officially supported.
> 
>
> Key: BEAM-9685
> URL: https://issues.apache.org/jira/browse/BEAM-9685
> Project: Beam
>  Issue Type: Bug
>  Components: build-system
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.21.0
>
>
> 1. Remove Go SDK container from release process.
> 2. Update document about it.





[jira] [Created] (BEAM-9685) Don't release Go SDK container until Go is officially supported.

2020-04-02 Thread Hannah Jiang (Jira)
Hannah Jiang created BEAM-9685:
--

 Summary: Don't release Go SDK container until Go is officially 
supported.
 Key: BEAM-9685
 URL: https://issues.apache.org/jira/browse/BEAM-9685
 Project: Beam
  Issue Type: Bug
  Components: build-system
Reporter: Hannah Jiang
Assignee: Hannah Jiang
 Fix For: 2.21.0


1. Remove Go SDK container from release process.
2. Update document about it.





[jira] [Updated] (BEAM-9136) Add LICENSES and NOTICES to docker images

2020-04-02 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang updated BEAM-9136:
---
Fix Version/s: 2.21.0

> Add LICENSES and NOTICES to docker images
> -
>
> Key: BEAM-9136
> URL: https://issues.apache.org/jira/browse/BEAM-9136
> Project: Beam
>  Issue Type: Task
>  Components: build-system
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.21.0
>
>  Time Spent: 15h 50m
>  Remaining Estimate: 0h
>
> Scan dependencies and add licenses and notices of the dependencies to SDK 
> docker images.





[jira] [Updated] (BEAM-8551) Beam Python containers should include all Beam SDK dependencies, and do not have conflicting dependencies

2020-03-24 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang updated BEAM-8551:
---
Status: Open  (was: Triage Needed)

> Beam Python containers should include all Beam SDK dependencies, and do not 
> have conflicting dependencies
> -
>
> Key: BEAM-8551
> URL: https://issues.apache.org/jira/browse/BEAM-8551
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Hannah Jiang
>Priority: Major
>
> Checks could be introduced during container creation, and be enforced by 
> ValidatesContainer test suites. We could:
> - Check pip output or status code for incompatible dependency errors.
> - Remove internet access when installing apache-beam in the container, to 
> make sure all dependencies are installed.
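Checking pip's output and status code during container creation could be
sketched as below (a hypothetical helper; `pip check` exits non-zero when the
installed set has broken or conflicting requirements):

```python
import subprocess
import sys

def dependencies_consistent():
    """Run 'pip check' and report whether the installed set is conflict-free."""
    proc = subprocess.run(
        [sys.executable, '-m', 'pip', 'check'],
        capture_output=True, text=True,
    )
    # pip check returns a non-zero exit code on incompatible dependencies.
    return proc.returncode == 0, proc.stdout + proc.stderr

ok, report = dependencies_consistent()
print(ok)
```

A check like this could run as the last step of the Docker build, failing the
image build instead of surfacing conflicts later at pipeline runtime.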





[jira] [Assigned] (BEAM-8551) Beam Python containers should include all Beam SDK dependencies, and do not have conflicting dependencies

2020-03-24 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang reassigned BEAM-8551:
--

Assignee: Hannah Jiang

> Beam Python containers should include all Beam SDK dependencies, and do not 
> have conflicting dependencies
> -
>
> Key: BEAM-8551
> URL: https://issues.apache.org/jira/browse/BEAM-8551
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Hannah Jiang
>Priority: Major
>
> Checks could be introduced during container creation, and be enforced by 
> ValidatesContainer test suites. We could:
> - Check pip output or status code for incompatible dependency errors.
> - Remove internet access when installing apache-beam in the container, to 
> make sure all dependencies are installed.





[jira] [Resolved] (BEAM-9413) [beam_PostCommit_Py_ValCont] build failed

2020-03-09 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang resolved BEAM-9413.

Resolution: Fixed

> [beam_PostCommit_Py_ValCont] build failed
> -
>
> Key: BEAM-9413
> URL: https://issues.apache.org/jira/browse/BEAM-9413
> Project: Beam
>  Issue Type: Bug
>  Components: test-failures
>Reporter: Yueyang Qiu
>Assignee: Hannah Jiang
>Priority: Major
>  Labels: currently-failing
> Fix For: 2.20.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> See [https://builds.apache.org/job/beam_PostCommit_Py_ValCont/5706/]
> Error:
>  
> 16:12:13 The push refers to repository 
> [us.gcr.io/apache-beam-testing/jenkins/python2.7_sdk]
> 16:12:13 An image does not exist locally with the tag: 
> us.gcr.io/apache-beam-testing/jenkins/python2.7_sdk
> 16:12:14 Build step 'Execute shell' marked build as failure
> 16:12:15 Sending e-mails to: bui...@beam.apache.org
> 16:12:15 Recording test results
> 16:12:16 ERROR: Step 'Publish JUnit test result report' failed: No test 
> report files were found. Configuration error?
> 16:12:18 No emails were triggered.
> 16:12:18 Finished: FAILURE





[jira] [Created] (BEAM-9443) support direct_num_workers=0

2020-03-04 Thread Hannah Jiang (Jira)
Hannah Jiang created BEAM-9443:
--

 Summary: support direct_num_workers=0 
 Key: BEAM-9443
 URL: https://issues.apache.org/jira/browse/BEAM-9443
 Project: Beam
  Issue Type: Bug
  Components: sdk-py-core
Reporter: Hannah Jiang
Assignee: Hannah Jiang
 Fix For: 2.21.0


when direct_num_workers=0, set it to number of cores.





[jira] [Commented] (BEAM-9413) [beam_PostCommit_Py_ValCont] build failed

2020-03-03 Thread Hannah Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17050653#comment-17050653
 ] 

Hannah Jiang commented on BEAM-9413:


This was merged. [~amaliujia]

> [beam_PostCommit_Py_ValCont] build failed
> -
>
> Key: BEAM-9413
> URL: https://issues.apache.org/jira/browse/BEAM-9413
> Project: Beam
>  Issue Type: Bug
>  Components: test-failures
>Reporter: Yueyang Qiu
>Assignee: Hannah Jiang
>Priority: Major
>  Labels: currently-failing
> Fix For: 2.20.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> See [https://builds.apache.org/job/beam_PostCommit_Py_ValCont/5706/]
> Error:
>  
> 16:12:13 The push refers to repository 
> [us.gcr.io/apache-beam-testing/jenkins/python2.7_sdk]
> 16:12:13 An image does not exist locally with the tag: 
> us.gcr.io/apache-beam-testing/jenkins/python2.7_sdk
> 16:12:14 Build step 'Execute shell' marked build as failure
> 16:12:15 Sending e-mails to: bui...@beam.apache.org
> 16:12:15 Recording test results
> 16:12:16 ERROR: Step 'Publish JUnit test result report' failed: No test 
> report files were found. Configuration error?
> 16:12:18 No emails were triggered.
> 16:12:18 Finished: FAILURE





[jira] [Commented] (BEAM-9413) [beam_PostCommit_Py_ValCont] build failed

2020-03-03 Thread Hannah Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17050639#comment-17050639
 ] 

Hannah Jiang commented on BEAM-9413:


It's not a blocker. However, I am almost done with this ticket. If it can be 
merged before you create RC0, I would be happy to include it. If you're ready 
to cut the RC, please go ahead.

> [beam_PostCommit_Py_ValCont] build failed
> -
>
> Key: BEAM-9413
> URL: https://issues.apache.org/jira/browse/BEAM-9413
> Project: Beam
>  Issue Type: Bug
>  Components: test-failures
>Reporter: Yueyang Qiu
>Assignee: Hannah Jiang
>Priority: Major
>  Labels: currently-failing
> Fix For: 2.20.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> See [https://builds.apache.org/job/beam_PostCommit_Py_ValCont/5706/]
> Error:
>  
> *16:12:13* The push refers to repository us.gcr.io/apache-beam-testing/jenkins/python2.7_sdk
> *16:12:13* An image does not exist locally with the tag: us.gcr.io/apache-beam-testing/jenkins/python2.7_sdk
> *16:12:14* Build step 'Execute shell' marked build as failure
> *16:12:15* Sending e-mails to: bui...@beam.apache.org
> *16:12:15* Recording test results
> *16:12:16* ERROR: Step 'Publish JUnit test result report' failed: No test report files were found. Configuration error?
> *16:12:18* No emails were triggered.
> *16:12:18* Finished: FAILURE





[jira] [Work started] (BEAM-9413) [beam_PostCommit_Py_ValCont] build failed

2020-03-02 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on BEAM-9413 started by Hannah Jiang.
--
> [beam_PostCommit_Py_ValCont] build failed
> -
>
> Key: BEAM-9413
> URL: https://issues.apache.org/jira/browse/BEAM-9413
> Project: Beam
>  Issue Type: Bug
>  Components: test-failures
>Reporter: Yueyang Qiu
>Assignee: Hannah Jiang
>Priority: Major
>  Labels: currently-failing
> Fix For: 2.20.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> See [https://builds.apache.org/job/beam_PostCommit_Py_ValCont/5706/]
> Error:
>  
> *16:12:13* The push refers to repository us.gcr.io/apache-beam-testing/jenkins/python2.7_sdk
> *16:12:13* An image does not exist locally with the tag: us.gcr.io/apache-beam-testing/jenkins/python2.7_sdk
> *16:12:14* Build step 'Execute shell' marked build as failure
> *16:12:15* Sending e-mails to: bui...@beam.apache.org
> *16:12:15* Recording test results
> *16:12:16* ERROR: Step 'Publish JUnit test result report' failed: No test report files were found. Configuration error?
> *16:12:18* No emails were triggered.
> *16:12:18* Finished: FAILURE





[jira] [Updated] (BEAM-9413) [beam_PostCommit_Py_ValCont] build failed

2020-03-02 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang updated BEAM-9413:
---
Status: Open  (was: Triage Needed)

> [beam_PostCommit_Py_ValCont] build failed
> -
>
> Key: BEAM-9413
> URL: https://issues.apache.org/jira/browse/BEAM-9413
> Project: Beam
>  Issue Type: Bug
>  Components: test-failures
>Reporter: Yueyang Qiu
>Assignee: Hannah Jiang
>Priority: Major
>  Labels: currently-failing
> Fix For: 2.20.0
>
>
> See [https://builds.apache.org/job/beam_PostCommit_Py_ValCont/5706/]
> Error:
>  
> *16:12:13* The push refers to repository us.gcr.io/apache-beam-testing/jenkins/python2.7_sdk
> *16:12:13* An image does not exist locally with the tag: us.gcr.io/apache-beam-testing/jenkins/python2.7_sdk
> *16:12:14* Build step 'Execute shell' marked build as failure
> *16:12:15* Sending e-mails to: bui...@beam.apache.org
> *16:12:15* Recording test results
> *16:12:16* ERROR: Step 'Publish JUnit test result report' failed: No test report files were found. Configuration error?
> *16:12:18* No emails were triggered.
> *16:12:18* Finished: FAILURE





[jira] [Updated] (BEAM-9413) [beam_PostCommit_Py_ValCont] build failed

2020-03-02 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang updated BEAM-9413:
---
Fix Version/s: 2.20.0

> [beam_PostCommit_Py_ValCont] build failed
> -
>
> Key: BEAM-9413
> URL: https://issues.apache.org/jira/browse/BEAM-9413
> Project: Beam
>  Issue Type: Bug
>  Components: test-failures
>Reporter: Yueyang Qiu
>Assignee: Hannah Jiang
>Priority: Major
>  Labels: currently-failing
> Fix For: 2.20.0
>
>
> See [https://builds.apache.org/job/beam_PostCommit_Py_ValCont/5706/]
> Error:
>  
> *16:12:13* The push refers to repository us.gcr.io/apache-beam-testing/jenkins/python2.7_sdk
> *16:12:13* An image does not exist locally with the tag: us.gcr.io/apache-beam-testing/jenkins/python2.7_sdk
> *16:12:14* Build step 'Execute shell' marked build as failure
> *16:12:15* Sending e-mails to: bui...@beam.apache.org
> *16:12:15* Recording test results
> *16:12:16* ERROR: Step 'Publish JUnit test result report' failed: No test report files were found. Configuration error?
> *16:12:18* No emails were triggered.
> *16:12:18* Finished: FAILURE





[jira] [Assigned] (BEAM-9413) [beam_PostCommit_Py_ValCont] build failed

2020-03-02 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang reassigned BEAM-9413:
--

Assignee: Hannah Jiang

> [beam_PostCommit_Py_ValCont] build failed
> -
>
> Key: BEAM-9413
> URL: https://issues.apache.org/jira/browse/BEAM-9413
> Project: Beam
>  Issue Type: Bug
>  Components: test-failures
>Reporter: Yueyang Qiu
>Assignee: Hannah Jiang
>Priority: Major
>  Labels: currently-failing
>
> See [https://builds.apache.org/job/beam_PostCommit_Py_ValCont/5706/]
> Error:
>  
> *16:12:13* The push refers to repository us.gcr.io/apache-beam-testing/jenkins/python2.7_sdk
> *16:12:13* An image does not exist locally with the tag: us.gcr.io/apache-beam-testing/jenkins/python2.7_sdk
> *16:12:14* Build step 'Execute shell' marked build as failure
> *16:12:15* Sending e-mails to: bui...@beam.apache.org
> *16:12:15* Recording test results
> *16:12:16* ERROR: Step 'Publish JUnit test result report' failed: No test report files were found. Configuration error?
> *16:12:18* No emails were triggered.
> *16:12:18* Finished: FAILURE





[jira] [Closed] (BEAM-9228) _SDFBoundedSourceWrapper doesn't distribute data to multiple workers

2020-02-26 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang closed BEAM-9228.
--

> _SDFBoundedSourceWrapper doesn't distribute data to multiple workers
> 
>
> Key: BEAM-9228
> URL: https://issues.apache.org/jira/browse/BEAM-9228
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Affects Versions: 2.16.0, 2.18.0, 2.19.0
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.20.0
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> A user reported the following issue.
> -
> I have a set of tfrecord files, obtained by converting parquet files with 
> Spark. Each file is roughly 1GB and I have 11 of those.
> I would expect simple statistics gathering (i.e. counting the number of items 
> in all files) to scale linearly with respect to the number of cores on my system.
> I am able to reproduce the issue with the minimal snippet below
> {code:java}
> import apache_beam as beam
> from apache_beam.options.pipeline_options import PipelineOptions
> from apache_beam.runners.portability import fn_api_runner
> from apache_beam.portability.api import beam_runner_api_pb2
> from apache_beam.portability import python_urns
> import sys
> pipeline_options = PipelineOptions(['--direct_num_workers', '4'])
> file_pattern = 'part-r-00*'
> runner=fn_api_runner.FnApiRunner(
>   default_environment=beam_runner_api_pb2.Environment(
>   urn=python_urns.SUBPROCESS_SDK,
>   payload=b'%s -m apache_beam.runners.worker.sdk_worker_main'
> % sys.executable.encode('ascii')))
> p = beam.Pipeline(runner=runner, options=pipeline_options)
> lines = (p | 'read' >> beam.io.tfrecordio.ReadFromTFRecord(file_pattern)
>  | beam.combiners.Count.Globally()
>  | beam.io.WriteToText('/tmp/output'))
> p.run()
> {code}
> Only one combination of apache_beam revision / worker type seems to work (I 
> refer to https://beam.apache.org/documentation/runners/direct/ for the worker 
> types)
> * beam 2.16: neither multithread nor multiprocess achieves high CPU usage on 
> multiple cores
> * beam 2.17: able to achieve high CPU usage on all 4 cores
> * beam 2.18: did not test the multithreaded mode, but the multiprocess mode fails 
> when trying to serialize the Environment instance, most likely because of a 
> change from 2.17 to 2.18.
> I also briefly tried SparkRunner with version 2.16 but was not able to achieve 
> any throughput.
> What is the recommended way to achieve what I am trying to do? How can I 
> troubleshoot?
> --
> This is caused by [this 
> PR|https://github.com/apache/beam/commit/02f8ad4eee3ec0ea8cbdc0f99c1dad29f00a9f60].
> A [workaround|https://github.com/apache/beam/pull/10729] was tried: rolling 
> back iobase.py so that it no longer uses _SDFBoundedSourceWrapper. This confirmed 
> that data is distributed to multiple workers; however, there are some 
> regressions with the SDF wrapper tests.
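As an aside, the worker-command payload in the quoted snippet is just a bytes format string; it can be reproduced standalone without Beam installed (illustrative only):

```python
import sys

# Build the SUBPROCESS_SDK payload exactly as in the snippet above:
# the current interpreter plus the SDK worker entry point, as ASCII bytes.
payload = (b'%s -m apache_beam.runners.worker.sdk_worker_main'
           % sys.executable.encode('ascii'))
print(payload)
```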





[jira] [Resolved] (BEAM-9228) _SDFBoundedSourceWrapper doesn't distribute data to multiple workers

2020-02-26 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang resolved BEAM-9228.

Resolution: Fixed

> _SDFBoundedSourceWrapper doesn't distribute data to multiple workers
> 
>
> Key: BEAM-9228
> URL: https://issues.apache.org/jira/browse/BEAM-9228
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Affects Versions: 2.16.0, 2.18.0, 2.19.0
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.20.0
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> A user reported the following issue.
> -
> I have a set of tfrecord files, obtained by converting parquet files with 
> Spark. Each file is roughly 1GB and I have 11 of those.
> I would expect simple statistics gathering (i.e. counting the number of items 
> in all files) to scale linearly with respect to the number of cores on my system.
> I am able to reproduce the issue with the minimal snippet below
> {code:java}
> import apache_beam as beam
> from apache_beam.options.pipeline_options import PipelineOptions
> from apache_beam.runners.portability import fn_api_runner
> from apache_beam.portability.api import beam_runner_api_pb2
> from apache_beam.portability import python_urns
> import sys
> pipeline_options = PipelineOptions(['--direct_num_workers', '4'])
> file_pattern = 'part-r-00*'
> runner=fn_api_runner.FnApiRunner(
>   default_environment=beam_runner_api_pb2.Environment(
>   urn=python_urns.SUBPROCESS_SDK,
>   payload=b'%s -m apache_beam.runners.worker.sdk_worker_main'
> % sys.executable.encode('ascii')))
> p = beam.Pipeline(runner=runner, options=pipeline_options)
> lines = (p | 'read' >> beam.io.tfrecordio.ReadFromTFRecord(file_pattern)
>  | beam.combiners.Count.Globally()
>  | beam.io.WriteToText('/tmp/output'))
> p.run()
> {code}
> Only one combination of apache_beam revision / worker type seems to work (I 
> refer to https://beam.apache.org/documentation/runners/direct/ for the worker 
> types)
> * beam 2.16: neither multithread nor multiprocess achieves high CPU usage on 
> multiple cores
> * beam 2.17: able to achieve high CPU usage on all 4 cores
> * beam 2.18: did not test the multithreaded mode, but the multiprocess mode fails 
> when trying to serialize the Environment instance, most likely because of a 
> change from 2.17 to 2.18.
> I also briefly tried SparkRunner with version 2.16 but was not able to achieve 
> any throughput.
> What is the recommended way to achieve what I am trying to do? How can I 
> troubleshoot?
> --
> This is caused by [this 
> PR|https://github.com/apache/beam/commit/02f8ad4eee3ec0ea8cbdc0f99c1dad29f00a9f60].
> A [workaround|https://github.com/apache/beam/pull/10729] was tried: rolling 
> back iobase.py so that it no longer uses _SDFBoundedSourceWrapper. This confirmed 
> that data is distributed to multiple workers; however, there are some 
> regressions with the SDF wrapper tests.





[jira] [Commented] (BEAM-9228) _SDFBoundedSourceWrapper doesn't distribute data to multiple workers

2020-02-26 Thread Hannah Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17045235#comment-17045235
 ] 

Hannah Jiang commented on BEAM-9228:


Yes, it was done. Will close it. 

> _SDFBoundedSourceWrapper doesn't distribute data to multiple workers
> 
>
> Key: BEAM-9228
> URL: https://issues.apache.org/jira/browse/BEAM-9228
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Affects Versions: 2.16.0, 2.18.0, 2.19.0
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.20.0
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> A user reported the following issue.
> -
> I have a set of tfrecord files, obtained by converting parquet files with 
> Spark. Each file is roughly 1GB and I have 11 of those.
> I would expect simple statistics gathering (i.e. counting the number of items 
> in all files) to scale linearly with respect to the number of cores on my system.
> I am able to reproduce the issue with the minimal snippet below
> {code:java}
> import apache_beam as beam
> from apache_beam.options.pipeline_options import PipelineOptions
> from apache_beam.runners.portability import fn_api_runner
> from apache_beam.portability.api import beam_runner_api_pb2
> from apache_beam.portability import python_urns
> import sys
> pipeline_options = PipelineOptions(['--direct_num_workers', '4'])
> file_pattern = 'part-r-00*'
> runner=fn_api_runner.FnApiRunner(
>   default_environment=beam_runner_api_pb2.Environment(
>   urn=python_urns.SUBPROCESS_SDK,
>   payload=b'%s -m apache_beam.runners.worker.sdk_worker_main'
> % sys.executable.encode('ascii')))
> p = beam.Pipeline(runner=runner, options=pipeline_options)
> lines = (p | 'read' >> beam.io.tfrecordio.ReadFromTFRecord(file_pattern)
>  | beam.combiners.Count.Globally()
>  | beam.io.WriteToText('/tmp/output'))
> p.run()
> {code}
> Only one combination of apache_beam revision / worker type seems to work (I 
> refer to https://beam.apache.org/documentation/runners/direct/ for the worker 
> types)
> * beam 2.16: neither multithread nor multiprocess achieves high CPU usage on 
> multiple cores
> * beam 2.17: able to achieve high CPU usage on all 4 cores
> * beam 2.18: did not test the multithreaded mode, but the multiprocess mode fails 
> when trying to serialize the Environment instance, most likely because of a 
> change from 2.17 to 2.18.
> I also briefly tried SparkRunner with version 2.16 but was not able to achieve 
> any throughput.
> What is the recommended way to achieve what I am trying to do? How can I 
> troubleshoot?
> --
> This is caused by [this 
> PR|https://github.com/apache/beam/commit/02f8ad4eee3ec0ea8cbdc0f99c1dad29f00a9f60].
> A [workaround|https://github.com/apache/beam/pull/10729] was tried: rolling 
> back iobase.py so that it no longer uses _SDFBoundedSourceWrapper. This confirmed 
> that data is distributed to multiple workers; however, there are some 
> regressions with the SDF wrapper tests.





[jira] [Resolved] (BEAM-9063) Migrate docker images to apache namespace.

2020-02-21 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang resolved BEAM-9063.

Fix Version/s: (was: Not applicable)
   2.20.0
   Resolution: Fixed

> Migrate docker images to apache namespace.
> --
>
> Key: BEAM-9063
> URL: https://issues.apache.org/jira/browse/BEAM-9063
> Project: Beam
>  Issue Type: Task
>  Components: beam-community
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.20.0
>
>  Time Spent: 6h 10m
>  Remaining Estimate: 0h
>
> https://hub.docker.com/u/apache





[jira] [Updated] (BEAM-9136) Add LICENSES and NOTICES to docker images

2020-02-19 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang updated BEAM-9136:
---
Fix Version/s: (was: 2.20.0)

> Add LICENSES and NOTICES to docker images
> -
>
> Key: BEAM-9136
> URL: https://issues.apache.org/jira/browse/BEAM-9136
> Project: Beam
>  Issue Type: Task
>  Components: build-system
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Major
>
> Scan dependencies and add licenses and notices of the dependencies to SDK 
> docker images.
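A stdlib-only sketch of the first step of such a scan (illustrative, not the tooling Beam actually adopted; it only reads the license string each installed distribution declares):

```python
from importlib import metadata

def collect_declared_licenses() -> dict:
    # Map each installed distribution to its declared license string.
    # A real scan would also copy LICENSE/NOTICE files into the image.
    licenses = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name") or "unknown"
        licenses[name] = dist.metadata.get("License") or "UNKNOWN"
    return licenses

print(len(collect_declared_licenses()), "distributions scanned")
```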





[jira] [Work started] (BEAM-9136) Add LICENSES and NOTICES to docker images

2020-02-19 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on BEAM-9136 started by Hannah Jiang.
--
> Add LICENSES and NOTICES to docker images
> -
>
> Key: BEAM-9136
> URL: https://issues.apache.org/jira/browse/BEAM-9136
> Project: Beam
>  Issue Type: Task
>  Components: build-system
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.20.0
>
>
> Scan dependencies and add licenses and notices of the dependencies to SDK 
> docker images.





[jira] [Work started] (BEAM-9228) _SDFBoundedSourceWrapper doesn't distribute data to multiple workers

2020-02-19 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on BEAM-9228 started by Hannah Jiang.
--
> _SDFBoundedSourceWrapper doesn't distribute data to multiple workers
> 
>
> Key: BEAM-9228
> URL: https://issues.apache.org/jira/browse/BEAM-9228
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Affects Versions: 2.16.0, 2.18.0, 2.19.0
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.20.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> A user reported the following issue.
> -
> I have a set of tfrecord files, obtained by converting parquet files with 
> Spark. Each file is roughly 1GB and I have 11 of those.
> I would expect simple statistics gathering (i.e. counting the number of items 
> in all files) to scale linearly with respect to the number of cores on my system.
> I am able to reproduce the issue with the minimal snippet below
> {code:java}
> import apache_beam as beam
> from apache_beam.options.pipeline_options import PipelineOptions
> from apache_beam.runners.portability import fn_api_runner
> from apache_beam.portability.api import beam_runner_api_pb2
> from apache_beam.portability import python_urns
> import sys
> pipeline_options = PipelineOptions(['--direct_num_workers', '4'])
> file_pattern = 'part-r-00*'
> runner=fn_api_runner.FnApiRunner(
>   default_environment=beam_runner_api_pb2.Environment(
>   urn=python_urns.SUBPROCESS_SDK,
>   payload=b'%s -m apache_beam.runners.worker.sdk_worker_main'
> % sys.executable.encode('ascii')))
> p = beam.Pipeline(runner=runner, options=pipeline_options)
> lines = (p | 'read' >> beam.io.tfrecordio.ReadFromTFRecord(file_pattern)
>  | beam.combiners.Count.Globally()
>  | beam.io.WriteToText('/tmp/output'))
> p.run()
> {code}
> Only one combination of apache_beam revision / worker type seems to work (I 
> refer to https://beam.apache.org/documentation/runners/direct/ for the worker 
> types)
> * beam 2.16: neither multithread nor multiprocess achieves high CPU usage on 
> multiple cores
> * beam 2.17: able to achieve high CPU usage on all 4 cores
> * beam 2.18: did not test the multithreaded mode, but the multiprocess mode fails 
> when trying to serialize the Environment instance, most likely because of a 
> change from 2.17 to 2.18.
> I also briefly tried SparkRunner with version 2.16 but was not able to achieve 
> any throughput.
> What is the recommended way to achieve what I am trying to do? How can I 
> troubleshoot?
> --
> This is caused by [this 
> PR|https://github.com/apache/beam/commit/02f8ad4eee3ec0ea8cbdc0f99c1dad29f00a9f60].
> A [workaround|https://github.com/apache/beam/pull/10729] was tried: rolling 
> back iobase.py so that it no longer uses _SDFBoundedSourceWrapper. This confirmed 
> that data is distributed to multiple workers; however, there are some 
> regressions with the SDF wrapper tests.





[jira] [Updated] (BEAM-9228) _SDFBoundedSourceWrapper doesn't distribute data to multiple workers

2020-02-13 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang updated BEAM-9228:
---
Fix Version/s: 2.20.0

> _SDFBoundedSourceWrapper doesn't distribute data to multiple workers
> 
>
> Key: BEAM-9228
> URL: https://issues.apache.org/jira/browse/BEAM-9228
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Affects Versions: 2.16.0, 2.18.0
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.20.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> A user reported the following issue.
> -
> I have a set of tfrecord files, obtained by converting parquet files with 
> Spark. Each file is roughly 1GB and I have 11 of those.
> I would expect simple statistics gathering (i.e. counting the number of items 
> in all files) to scale linearly with respect to the number of cores on my system.
> I am able to reproduce the issue with the minimal snippet below
> {code:java}
> import apache_beam as beam
> from apache_beam.options.pipeline_options import PipelineOptions
> from apache_beam.runners.portability import fn_api_runner
> from apache_beam.portability.api import beam_runner_api_pb2
> from apache_beam.portability import python_urns
> import sys
> pipeline_options = PipelineOptions(['--direct_num_workers', '4'])
> file_pattern = 'part-r-00*'
> runner=fn_api_runner.FnApiRunner(
>   default_environment=beam_runner_api_pb2.Environment(
>   urn=python_urns.SUBPROCESS_SDK,
>   payload=b'%s -m apache_beam.runners.worker.sdk_worker_main'
> % sys.executable.encode('ascii')))
> p = beam.Pipeline(runner=runner, options=pipeline_options)
> lines = (p | 'read' >> beam.io.tfrecordio.ReadFromTFRecord(file_pattern)
>  | beam.combiners.Count.Globally()
>  | beam.io.WriteToText('/tmp/output'))
> p.run()
> {code}
> Only one combination of apache_beam revision / worker type seems to work (I 
> refer to https://beam.apache.org/documentation/runners/direct/ for the worker 
> types)
> * beam 2.16: neither multithread nor multiprocess achieves high CPU usage on 
> multiple cores
> * beam 2.17: able to achieve high CPU usage on all 4 cores
> * beam 2.18: did not test the multithreaded mode, but the multiprocess mode fails 
> when trying to serialize the Environment instance, most likely because of a 
> change from 2.17 to 2.18.
> I also briefly tried SparkRunner with version 2.16 but was not able to achieve 
> any throughput.
> What is the recommended way to achieve what I am trying to do? How can I 
> troubleshoot?
> --
> This is caused by [this 
> PR|https://github.com/apache/beam/commit/02f8ad4eee3ec0ea8cbdc0f99c1dad29f00a9f60].
> A [workaround|https://github.com/apache/beam/pull/10729] was tried: rolling 
> back iobase.py so that it no longer uses _SDFBoundedSourceWrapper. This confirmed 
> that data is distributed to multiple workers; however, there are some 
> regressions with the SDF wrapper tests.





[jira] [Updated] (BEAM-9228) _SDFBoundedSourceWrapper doesn't distribute data to multiple workers

2020-02-13 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang updated BEAM-9228:
---
Affects Version/s: 2.19.0

> _SDFBoundedSourceWrapper doesn't distribute data to multiple workers
> 
>
> Key: BEAM-9228
> URL: https://issues.apache.org/jira/browse/BEAM-9228
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Affects Versions: 2.16.0, 2.18.0, 2.19.0
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.20.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> A user reported the following issue.
> -
> I have a set of tfrecord files, obtained by converting parquet files with 
> Spark. Each file is roughly 1GB and I have 11 of those.
> I would expect simple statistics gathering (i.e. counting the number of items 
> in all files) to scale linearly with respect to the number of cores on my system.
> I am able to reproduce the issue with the minimal snippet below
> {code:java}
> import apache_beam as beam
> from apache_beam.options.pipeline_options import PipelineOptions
> from apache_beam.runners.portability import fn_api_runner
> from apache_beam.portability.api import beam_runner_api_pb2
> from apache_beam.portability import python_urns
> import sys
> pipeline_options = PipelineOptions(['--direct_num_workers', '4'])
> file_pattern = 'part-r-00*'
> runner=fn_api_runner.FnApiRunner(
>   default_environment=beam_runner_api_pb2.Environment(
>   urn=python_urns.SUBPROCESS_SDK,
>   payload=b'%s -m apache_beam.runners.worker.sdk_worker_main'
> % sys.executable.encode('ascii')))
> p = beam.Pipeline(runner=runner, options=pipeline_options)
> lines = (p | 'read' >> beam.io.tfrecordio.ReadFromTFRecord(file_pattern)
>  | beam.combiners.Count.Globally()
>  | beam.io.WriteToText('/tmp/output'))
> p.run()
> {code}
> Only one combination of apache_beam revision / worker type seems to work (I 
> refer to https://beam.apache.org/documentation/runners/direct/ for the worker 
> types)
> * beam 2.16: neither multithread nor multiprocess achieves high CPU usage on 
> multiple cores
> * beam 2.17: able to achieve high CPU usage on all 4 cores
> * beam 2.18: did not test the multithreaded mode, but the multiprocess mode fails 
> when trying to serialize the Environment instance, most likely because of a 
> change from 2.17 to 2.18.
> I also briefly tried SparkRunner with version 2.16 but was not able to achieve 
> any throughput.
> What is the recommended way to achieve what I am trying to do? How can I 
> troubleshoot?
> --
> This is caused by [this 
> PR|https://github.com/apache/beam/commit/02f8ad4eee3ec0ea8cbdc0f99c1dad29f00a9f60].
> A [workaround|https://github.com/apache/beam/pull/10729] was tried: rolling 
> back iobase.py so that it no longer uses _SDFBoundedSourceWrapper. This confirmed 
> that data is distributed to multiple workers; however, there are some 
> regressions with the SDF wrapper tests.





[jira] [Assigned] (BEAM-9228) _SDFBoundedSourceWrapper doesn't distribute data to multiple workers

2020-02-05 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang reassigned BEAM-9228:
--

Assignee: Hannah Jiang

> _SDFBoundedSourceWrapper doesn't distribute data to multiple workers
> 
>
> Key: BEAM-9228
> URL: https://issues.apache.org/jira/browse/BEAM-9228
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Affects Versions: 2.16.0, 2.18.0
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Major
>
> A user reported the following issue.
> -
> I have a set of tfrecord files, obtained by converting parquet files with 
> Spark. Each file is roughly 1GB and I have 11 of those.
> I would expect simple statistics gathering (i.e. counting the number of items 
> in all files) to scale linearly with respect to the number of cores on my system.
> I am able to reproduce the issue with the minimal snippet below
> {code:java}
> import apache_beam as beam
> from apache_beam.options.pipeline_options import PipelineOptions
> from apache_beam.runners.portability import fn_api_runner
> from apache_beam.portability.api import beam_runner_api_pb2
> from apache_beam.portability import python_urns
> import sys
> pipeline_options = PipelineOptions(['--direct_num_workers', '4'])
> file_pattern = 'part-r-00*'
> runner=fn_api_runner.FnApiRunner(
>   default_environment=beam_runner_api_pb2.Environment(
>   urn=python_urns.SUBPROCESS_SDK,
>   payload=b'%s -m apache_beam.runners.worker.sdk_worker_main'
> % sys.executable.encode('ascii')))
> p = beam.Pipeline(runner=runner, options=pipeline_options)
> lines = (p | 'read' >> beam.io.tfrecordio.ReadFromTFRecord(file_pattern)
>  | beam.combiners.Count.Globally()
>  | beam.io.WriteToText('/tmp/output'))
> p.run()
> {code}
> Only one combination of apache_beam revision / worker type seems to work (I 
> refer to https://beam.apache.org/documentation/runners/direct/ for the worker 
> types)
> * beam 2.16: neither multithread nor multiprocess achieves high CPU usage on 
> multiple cores
> * beam 2.17: able to achieve high CPU usage on all 4 cores
> * beam 2.18: did not test the multithreaded mode, but the multiprocess mode fails 
> when trying to serialize the Environment instance, most likely because of a 
> change from 2.17 to 2.18.
> I also briefly tried SparkRunner with version 2.16 but was not able to achieve 
> any throughput.
> What is the recommended way to achieve what I am trying to do? How can I 
> troubleshoot?
> --
> This is caused by [this 
> PR|https://github.com/apache/beam/commit/02f8ad4eee3ec0ea8cbdc0f99c1dad29f00a9f60].
> A [workaround|https://github.com/apache/beam/pull/10729] was tried: rolling 
> back iobase.py so that it no longer uses _SDFBoundedSourceWrapper. This confirmed 
> that data is distributed to multiple workers; however, there are some 
> regressions with the SDF wrapper tests.





[jira] [Updated] (BEAM-9228) _SDFBoundedSourceWrapper doesn't distribute data to multiple workers

2020-02-05 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang updated BEAM-9228:
---
Status: Open  (was: Triage Needed)

> _SDFBoundedSourceWrapper doesn't distribute data to multiple workers
> 
>
> Key: BEAM-9228
> URL: https://issues.apache.org/jira/browse/BEAM-9228
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Affects Versions: 2.16.0, 2.18.0
>Reporter: Hannah Jiang
>Priority: Major
>
> A user reported the following issue.
> -
> I have a set of tfrecord files, obtained by converting parquet files with 
> Spark. Each file is roughly 1GB and I have 11 of those.
> I would expect simple statistics gathering (i.e. counting the number of items 
> in all files) to scale linearly with respect to the number of cores on my system.
> I am able to reproduce the issue with the minimal snippet below
> {code:java}
> import apache_beam as beam
> from apache_beam.options.pipeline_options import PipelineOptions
> from apache_beam.runners.portability import fn_api_runner
> from apache_beam.portability.api import beam_runner_api_pb2
> from apache_beam.portability import python_urns
> import sys
> pipeline_options = PipelineOptions(['--direct_num_workers', '4'])
> file_pattern = 'part-r-00*'
> runner=fn_api_runner.FnApiRunner(
>   default_environment=beam_runner_api_pb2.Environment(
>   urn=python_urns.SUBPROCESS_SDK,
>   payload=b'%s -m apache_beam.runners.worker.sdk_worker_main'
> % sys.executable.encode('ascii')))
> p = beam.Pipeline(runner=runner, options=pipeline_options)
> lines = (p | 'read' >> beam.io.tfrecordio.ReadFromTFRecord(file_pattern)
>  | beam.combiners.Count.Globally()
>  | beam.io.WriteToText('/tmp/output'))
> p.run()
> {code}
> Only one combination of apache_beam revision / worker type seems to work (I 
> refer to https://beam.apache.org/documentation/runners/direct/ for the worker 
> types):
> * beam 2.16: neither multithread nor multiprocess achieves high CPU usage on 
> multiple cores
> * beam 2.17: able to achieve high CPU usage on all 4 cores
> * beam 2.18: did not test the multithreaded mode, but the multiprocess mode 
> fails when trying to serialize the Environment instance, most likely because 
> of a change from 2.17 to 2.18.
> I also briefly tried SparkRunner with version 2.16 but was not able to 
> achieve any throughput.
> What is the recommended way to achieve what I am trying to do? How can I 
> troubleshoot?
> --
> This is caused by [this 
> PR|https://github.com/apache/beam/commit/02f8ad4eee3ec0ea8cbdc0f99c1dad29f00a9f60].
> A [workaround|https://github.com/apache/beam/pull/10729] was tried, which 
> rolls back iobase.py to not use _SDFBoundedSourceWrapper. This confirmed 
> that data is distributed to multiple workers; however, there are some 
> regressions with the SDF wrapper tests.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-9228) _SDFBoundedSourceWrapper doesn't distribute data to multiple workers

2020-01-30 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang updated BEAM-9228:
---
Description: 
A user reported the following issue.

-
I have a set of tfrecord files, obtained by converting parquet files with 
Spark. Each file is roughly 1GB and I have 11 of those.

I would expect simple statistics gathering (i.e. counting number of items of all 
files) to scale linearly with respect to the number of cores on my system.

I am able to reproduce the issue with the minimal snippet below

{code:java}
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.runners.portability import fn_api_runner
from apache_beam.portability.api import beam_runner_api_pb2
from apache_beam.portability import python_urns
import sys

pipeline_options = PipelineOptions(['--direct_num_workers', '4'])

file_pattern = 'part-r-00*'
runner=fn_api_runner.FnApiRunner(
  default_environment=beam_runner_api_pb2.Environment(
  urn=python_urns.SUBPROCESS_SDK,
  payload=b'%s -m apache_beam.runners.worker.sdk_worker_main'
% sys.executable.encode('ascii')))

p = beam.Pipeline(runner=runner, options=pipeline_options)

lines = (p | 'read' >> beam.io.tfrecordio.ReadFromTFRecord(file_pattern)
 | beam.combiners.Count.Globally()
 | beam.io.WriteToText('/tmp/output'))

p.run()
{code}


Only one combination of apache_beam revision / worker type seems to work (I 
refer to https://beam.apache.org/documentation/runners/direct/ for the worker 
types):
* beam 2.16: neither multithread nor multiprocess achieves high CPU usage on 
multiple cores
* beam 2.17: able to achieve high CPU usage on all 4 cores
* beam 2.18: did not test the multithreaded mode, but the multiprocess mode 
fails when trying to serialize the Environment instance, most likely because 
of a change from 2.17 to 2.18.

I also briefly tried SparkRunner with version 2.16 but was not able to achieve 
any throughput.

What is the recommended way to achieve what I am trying to do? How can I 
troubleshoot?
--

This is caused by [this 
PR|https://github.com/apache/beam/commit/02f8ad4eee3ec0ea8cbdc0f99c1dad29f00a9f60].

A [workaround|https://github.com/apache/beam/pull/10729] was tried, which rolls 
back iobase.py to not use _SDFBoundedSourceWrapper. This confirmed that data is 
distributed to multiple workers; however, there are some regressions with the 
SDF wrapper tests.

  was:
A user reported the following issue.

-
I have a set of tfrecord files, obtained by converting parquet files with 
Spark. Each file is roughly 1GB and I have 11 of those.

I would expect simple statistics gathering (i.e. counting number of items of all 
files) to scale linearly with respect to the number of cores on my system.

I am able to reproduce the issue with the minimal snippet below

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.runners.portability import fn_api_runner
from apache_beam.portability.api import beam_runner_api_pb2
from apache_beam.portability import python_urns
import sys

pipeline_options = PipelineOptions(['--direct_num_workers', '4'])

file_pattern = 'part-r-00*'
runner=fn_api_runner.FnApiRunner(
  default_environment=beam_runner_api_pb2.Environment(
  urn=python_urns.SUBPROCESS_SDK,
  payload=b'%s -m apache_beam.runners.worker.sdk_worker_main'
% sys.executable.encode('ascii')))

p = beam.Pipeline(runner=runner, options=pipeline_options)

lines = (p | 'read' >> beam.io.tfrecordio.ReadFromTFRecord(file_pattern)
 | beam.combiners.Count.Globally()
 | beam.io.WriteToText('/tmp/output'))

p.run()

Only one combination of apache_beam revision / worker type seems to work (I 
refer to https://beam.apache.org/documentation/runners/direct/ for the worker 
types):
* beam 2.16: neither multithread nor multiprocess achieves high CPU usage on 
multiple cores
* beam 2.17: able to achieve high CPU usage on all 4 cores
* beam 2.18: did not test the multithreaded mode, but the multiprocess mode 
fails when trying to serialize the Environment instance, most likely because 
of a change from 2.17 to 2.18.

I also briefly tried SparkRunner with version 2.16 but was not able to achieve 
any throughput.

What is the recommended way to achieve what I am trying to do? How can I 
troubleshoot?
--

This is caused by [this 

[jira] [Assigned] (BEAM-9228) _SDFBoundedSourceWrapper doesn't distribute data to multiple workers

2020-01-30 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang reassigned BEAM-9228:
--

Assignee: Hannah Jiang

> _SDFBoundedSourceWrapper doesn't distribute data to multiple workers
> 
>
> Key: BEAM-9228
> URL: https://issues.apache.org/jira/browse/BEAM-9228
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Affects Versions: 2.16.0, 2.18.0
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Major
>
> A user reported the following issue.
> -
> I have a set of tfrecord files, obtained by converting parquet files with 
> Spark. Each file is roughly 1GB and I have 11 of those.
> I would expect simple statistics gathering (i.e. counting number of items of 
> all files) to scale linearly with respect to the number of cores on my system.
> I am able to reproduce the issue with the minimal snippet below
> import apache_beam as beam
> from apache_beam.options.pipeline_options import PipelineOptions
> from apache_beam.runners.portability import fn_api_runner
> from apache_beam.portability.api import beam_runner_api_pb2
> from apache_beam.portability import python_urns
> import sys
> pipeline_options = PipelineOptions(['--direct_num_workers', '4'])
> file_pattern = 'part-r-00*'
> runner=fn_api_runner.FnApiRunner(
>   default_environment=beam_runner_api_pb2.Environment(
>   urn=python_urns.SUBPROCESS_SDK,
>   payload=b'%s -m apache_beam.runners.worker.sdk_worker_main'
> % sys.executable.encode('ascii')))
> p = beam.Pipeline(runner=runner, options=pipeline_options)
> lines = (p | 'read' >> beam.io.tfrecordio.ReadFromTFRecord(file_pattern)
>  | beam.combiners.Count.Globally()
>  | beam.io.WriteToText('/tmp/output'))
> p.run()
> Only one combination of apache_beam revision / worker type seems to work (I 
> refer to https://beam.apache.org/documentation/runners/direct/ for the worker 
> types):
> * beam 2.16: neither multithread nor multiprocess achieves high CPU usage on 
> multiple cores
> * beam 2.17: able to achieve high CPU usage on all 4 cores
> * beam 2.18: did not test the multithreaded mode, but the multiprocess mode 
> fails when trying to serialize the Environment instance, most likely because 
> of a change from 2.17 to 2.18.
> I also briefly tried SparkRunner with version 2.16 but was not able to 
> achieve any throughput.
> What is the recommended way to achieve what I am trying to do? How can I 
> troubleshoot?
> --
> This is caused by [this 
> PR|https://github.com/apache/beam/commit/02f8ad4eee3ec0ea8cbdc0f99c1dad29f00a9f60].
> A [workaround|https://github.com/apache/beam/pull/10729] was tried, which 
> rolls back iobase.py to not use _SDFBoundedSourceWrapper. This confirmed 
> that data is shuffled to multiple workers; however, there are some 
> regressions with the SDF wrapper tests.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (BEAM-9228) _SDFBoundedSourceWrapper doesn't distribute data to multiple workers

2020-01-30 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang reassigned BEAM-9228:
--

Assignee: (was: Hannah Jiang)

> _SDFBoundedSourceWrapper doesn't distribute data to multiple workers
> 
>
> Key: BEAM-9228
> URL: https://issues.apache.org/jira/browse/BEAM-9228
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Affects Versions: 2.16.0, 2.18.0
>Reporter: Hannah Jiang
>Priority: Major
>
> A user reported the following issue.
> -
> I have a set of tfrecord files, obtained by converting parquet files with 
> Spark. Each file is roughly 1GB and I have 11 of those.
> I would expect simple statistics gathering (i.e. counting number of items of 
> all files) to scale linearly with respect to the number of cores on my system.
> I am able to reproduce the issue with the minimal snippet below
> import apache_beam as beam
> from apache_beam.options.pipeline_options import PipelineOptions
> from apache_beam.runners.portability import fn_api_runner
> from apache_beam.portability.api import beam_runner_api_pb2
> from apache_beam.portability import python_urns
> import sys
> pipeline_options = PipelineOptions(['--direct_num_workers', '4'])
> file_pattern = 'part-r-00*'
> runner=fn_api_runner.FnApiRunner(
>   default_environment=beam_runner_api_pb2.Environment(
>   urn=python_urns.SUBPROCESS_SDK,
>   payload=b'%s -m apache_beam.runners.worker.sdk_worker_main'
> % sys.executable.encode('ascii')))
> p = beam.Pipeline(runner=runner, options=pipeline_options)
> lines = (p | 'read' >> beam.io.tfrecordio.ReadFromTFRecord(file_pattern)
>  | beam.combiners.Count.Globally()
>  | beam.io.WriteToText('/tmp/output'))
> p.run()
> Only one combination of apache_beam revision / worker type seems to work (I 
> refer to https://beam.apache.org/documentation/runners/direct/ for the worker 
> types):
> * beam 2.16: neither multithread nor multiprocess achieves high CPU usage on 
> multiple cores
> * beam 2.17: able to achieve high CPU usage on all 4 cores
> * beam 2.18: did not test the multithreaded mode, but the multiprocess mode 
> fails when trying to serialize the Environment instance, most likely because 
> of a change from 2.17 to 2.18.
> I also briefly tried SparkRunner with version 2.16 but was not able to 
> achieve any throughput.
> What is the recommended way to achieve what I am trying to do? How can I 
> troubleshoot?
> --
> This is caused by [this 
> PR|https://github.com/apache/beam/commit/02f8ad4eee3ec0ea8cbdc0f99c1dad29f00a9f60].
> A [workaround|https://github.com/apache/beam/pull/10729] was tried, which 
> rolls back iobase.py to not use _SDFBoundedSourceWrapper. This confirmed 
> that data is shuffled to multiple workers; however, there are some 
> regressions with the SDF wrapper tests.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-9228) _SDFBoundedSourceWrapper doesn't distribute data to multiple workers

2020-01-30 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang updated BEAM-9228:
---
Description: 
A user reported the following issue.

-
I have a set of tfrecord files, obtained by converting parquet files with 
Spark. Each file is roughly 1GB and I have 11 of those.

I would expect simple statistics gathering (i.e. counting number of items of all 
files) to scale linearly with respect to the number of cores on my system.

I am able to reproduce the issue with the minimal snippet below

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.runners.portability import fn_api_runner
from apache_beam.portability.api import beam_runner_api_pb2
from apache_beam.portability import python_urns
import sys

pipeline_options = PipelineOptions(['--direct_num_workers', '4'])

file_pattern = 'part-r-00*'
runner=fn_api_runner.FnApiRunner(
  default_environment=beam_runner_api_pb2.Environment(
  urn=python_urns.SUBPROCESS_SDK,
  payload=b'%s -m apache_beam.runners.worker.sdk_worker_main'
% sys.executable.encode('ascii')))

p = beam.Pipeline(runner=runner, options=pipeline_options)

lines = (p | 'read' >> beam.io.tfrecordio.ReadFromTFRecord(file_pattern)
 | beam.combiners.Count.Globally()
 | beam.io.WriteToText('/tmp/output'))

p.run()

Only one combination of apache_beam revision / worker type seems to work (I 
refer to https://beam.apache.org/documentation/runners/direct/ for the worker 
types):
* beam 2.16: neither multithread nor multiprocess achieves high CPU usage on 
multiple cores
* beam 2.17: able to achieve high CPU usage on all 4 cores
* beam 2.18: did not test the multithreaded mode, but the multiprocess mode 
fails when trying to serialize the Environment instance, most likely because 
of a change from 2.17 to 2.18.

I also briefly tried SparkRunner with version 2.16 but was not able to achieve 
any throughput.

What is the recommended way to achieve what I am trying to do? How can I 
troubleshoot?
--

This is caused by [this 
PR|https://github.com/apache/beam/commit/02f8ad4eee3ec0ea8cbdc0f99c1dad29f00a9f60].

A [workaround|https://github.com/apache/beam/pull/10729] was tried, which rolls 
back iobase.py to not use _SDFBoundedSourceWrapper. This confirmed that data is 
distributed to multiple workers; however, there are some regressions with the 
SDF wrapper tests.

  was:
A user reported the following issue.

-
I have a set of tfrecord files, obtained by converting parquet files with 
Spark. Each file is roughly 1GB and I have 11 of those.

I would expect simple statistics gathering (i.e. counting number of items of all 
files) to scale linearly with respect to the number of cores on my system.

I am able to reproduce the issue with the minimal snippet below

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.runners.portability import fn_api_runner
from apache_beam.portability.api import beam_runner_api_pb2
from apache_beam.portability import python_urns
import sys

pipeline_options = PipelineOptions(['--direct_num_workers', '4'])

file_pattern = 'part-r-00*'
runner=fn_api_runner.FnApiRunner(
  default_environment=beam_runner_api_pb2.Environment(
  urn=python_urns.SUBPROCESS_SDK,
  payload=b'%s -m apache_beam.runners.worker.sdk_worker_main'
% sys.executable.encode('ascii')))

p = beam.Pipeline(runner=runner, options=pipeline_options)

lines = (p | 'read' >> beam.io.tfrecordio.ReadFromTFRecord(file_pattern)
 | beam.combiners.Count.Globally()
 | beam.io.WriteToText('/tmp/output'))

p.run()

Only one combination of apache_beam revision / worker type seems to work (I 
refer to https://beam.apache.org/documentation/runners/direct/ for the worker 
types):
* beam 2.16: neither multithread nor multiprocess achieves high CPU usage on 
multiple cores
* beam 2.17: able to achieve high CPU usage on all 4 cores
* beam 2.18: did not test the multithreaded mode, but the multiprocess mode 
fails when trying to serialize the Environment instance, most likely because 
of a change from 2.17 to 2.18.

I also briefly tried SparkRunner with version 2.16 but was not able to achieve 
any throughput.

What is the recommended way to achieve what I am trying to do? How can I 
troubleshoot?
--

This is caused by [this 
PR|https://github.com/apache/beam/commit/02f8ad4eee3ec0ea8cbdc0f99c1dad29f00a9f60].

[jira] [Updated] (BEAM-9228) _SDFBoundedSourceWrapper doesn't distribute data to multiple workers

2020-01-30 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang updated BEAM-9228:
---
Description: 
A user reported the following issue.

-
I have a set of tfrecord files, obtained by converting parquet files with 
Spark. Each file is roughly 1GB and I have 11 of those.

I would expect simple statistics gathering (i.e. counting number of items of all 
files) to scale linearly with respect to the number of cores on my system.

I am able to reproduce the issue with the minimal snippet below

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.runners.portability import fn_api_runner
from apache_beam.portability.api import beam_runner_api_pb2
from apache_beam.portability import python_urns
import sys

pipeline_options = PipelineOptions(['--direct_num_workers', '4'])

file_pattern = 'part-r-00*'
runner=fn_api_runner.FnApiRunner(
  default_environment=beam_runner_api_pb2.Environment(
  urn=python_urns.SUBPROCESS_SDK,
  payload=b'%s -m apache_beam.runners.worker.sdk_worker_main'
% sys.executable.encode('ascii')))

p = beam.Pipeline(runner=runner, options=pipeline_options)

lines = (p | 'read' >> beam.io.tfrecordio.ReadFromTFRecord(file_pattern)
 | beam.combiners.Count.Globally()
 | beam.io.WriteToText('/tmp/output'))

p.run()

Only one combination of apache_beam revision / worker type seems to work (I 
refer to https://beam.apache.org/documentation/runners/direct/ for the worker 
types):
* beam 2.16: neither multithread nor multiprocess achieves high CPU usage on 
multiple cores
* beam 2.17: able to achieve high CPU usage on all 4 cores
* beam 2.18: did not test the multithreaded mode, but the multiprocess mode 
fails when trying to serialize the Environment instance, most likely because 
of a change from 2.17 to 2.18.

I also briefly tried SparkRunner with version 2.16 but was not able to achieve 
any throughput.

What is the recommended way to achieve what I am trying to do? How can I 
troubleshoot?
--

This is caused by [this 
PR|https://github.com/apache/beam/commit/02f8ad4eee3ec0ea8cbdc0f99c1dad29f00a9f60].

A [workaround|https://github.com/apache/beam/pull/10729] was tried, which rolls 
back iobase.py to not use _SDFBoundedSourceWrapper. This confirmed that data is 
shuffled to multiple workers; however, there are some regressions with the SDF 
wrapper tests.

  was:
A user reported the following issue.

-
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.runners.portability import fn_api_runner
from apache_beam.portability.api import beam_runner_api_pb2
from apache_beam.portability import python_urns
import sys

pipeline_options = PipelineOptions(['--direct_num_workers', '4'])

file_pattern = 'part-r-00*'
runner=fn_api_runner.FnApiRunner(
  default_environment=beam_runner_api_pb2.Environment(
  urn=python_urns.SUBPROCESS_SDK,
  payload=b'%s -m apache_beam.runners.worker.sdk_worker_main'
% sys.executable.encode('ascii')))

p = beam.Pipeline(runner=runner, options=pipeline_options)

lines = (p | 'read' >> beam.io.tfrecordio.ReadFromTFRecord(file_pattern)
   | beam.combiners.Count.Globally()
   | beam.io.WriteToText('/tmp/output'))

p.run()
--

This is caused by [this 
PR|https://github.com/apache/beam/commit/02f8ad4eee3ec0ea8cbdc0f99c1dad29f00a9f60].

A [workaround|https://github.com/apache/beam/pull/10729] was tried, which rolls 
back iobase.py to not use _SDFBoundedSourceWrapper. This confirmed that data is 
shuffled to multiple workers; however, there are some regressions with the SDF 
wrapper tests.


> _SDFBoundedSourceWrapper doesn't distribute data to multiple workers
> 
>
> Key: BEAM-9228
> URL: https://issues.apache.org/jira/browse/BEAM-9228
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Affects Versions: 2.16.0, 2.18.0
>Reporter: Hannah Jiang
>Priority: Major
>
> A user reported the following issue.
> -
> I have a set of tfrecord files, obtained by converting parquet files with 
> Spark. Each file is roughly 1GB and I have 11 of those.
> I would expect simple statistics gathering (i.e. counting number of items of 
> all files) to scale linearly with respect to the number of cores on my system.
> I am able to 

[jira] [Created] (BEAM-9228) _SDFBoundedSourceWrapper doesn't distribute data to multiple workers

2020-01-30 Thread Hannah Jiang (Jira)
Hannah Jiang created BEAM-9228:
--

 Summary: _SDFBoundedSourceWrapper doesn't distribute data to 
multiple workers
 Key: BEAM-9228
 URL: https://issues.apache.org/jira/browse/BEAM-9228
 Project: Beam
  Issue Type: Bug
  Components: sdk-py-core
Affects Versions: 2.18.0, 2.16.0
Reporter: Hannah Jiang


A user reported the following issue.

-
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.runners.portability import fn_api_runner
from apache_beam.portability.api import beam_runner_api_pb2
from apache_beam.portability import python_urns
import sys

pipeline_options = PipelineOptions(['--direct_num_workers', '4'])

file_pattern = 'part-r-00*'
runner=fn_api_runner.FnApiRunner(
  default_environment=beam_runner_api_pb2.Environment(
  urn=python_urns.SUBPROCESS_SDK,
  payload=b'%s -m apache_beam.runners.worker.sdk_worker_main'
% sys.executable.encode('ascii')))

p = beam.Pipeline(runner=runner, options=pipeline_options)

lines = (p | 'read' >> beam.io.tfrecordio.ReadFromTFRecord(file_pattern)
   | beam.combiners.Count.Globally()
   | beam.io.WriteToText('/tmp/output'))

p.run()
--

This is caused by [this 
PR|https://github.com/apache/beam/commit/02f8ad4eee3ec0ea8cbdc0f99c1dad29f00a9f60].

A [workaround|https://github.com/apache/beam/pull/10729] was tried, which rolls 
back iobase.py to not use _SDFBoundedSourceWrapper. This confirmed that data is 
shuffled to multiple workers; however, there are some regressions with the SDF 
wrapper tests.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-9207) Create a script to define all variables used by release scripts

2020-01-28 Thread Hannah Jiang (Jira)
Hannah Jiang created BEAM-9207:
--

 Summary: Create a script to define all variables used by release 
scripts
 Key: BEAM-9207
 URL: https://issues.apache.org/jira/browse/BEAM-9207
 Project: Beam
  Issue Type: Task
  Components: dependencies
Reporter: Hannah Jiang


Currently each release script defines its own variables, so the definitions are 
duplicated across scripts. We should have a single place that defines all these 
variables and is shared by all the release scripts.
* Put it in the dependencies component, because there is no release component.
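
The "define once, share everywhere" idea can be sketched as follows (the real
release scripts are shell, and the variable names below are purely
illustrative assumptions):

```python
# release_vars.py -- a hypothetical single source of truth for the
# release scripts, instead of each script redefining these values.
RELEASE_VERSION = "2.20.0"
NEXT_VERSION = "2.21.0.dev"
RC_TAG = "v%s-RC1" % RELEASE_VERSION

def tag_name():
    """Build the release-candidate tag from the shared version,
    rather than from a per-script copy that can drift."""
    return RC_TAG

# Every release step imports the shared definitions:
print(tag_name())
```

Each script then picks up version changes from one file, so bumping a release 
version no longer requires editing every script.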



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-9136) Add LICENSES and NOTICES to docker images

2020-01-24 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang updated BEAM-9136:
---
Description: Scan dependencies and add licenses and notices of the 
dependencies to SDK docker images.

> Add LICENSES and NOTICES to docker images
> -
>
> Key: BEAM-9136
> URL: https://issues.apache.org/jira/browse/BEAM-9136
> Project: Beam
>  Issue Type: Task
>  Components: build-system
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.20.0
>
>
> Scan dependencies and add licenses and notices of the dependencies to SDK 
> docker images.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-9136) Add LICENSES and NOTICES to docker images

2020-01-16 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang updated BEAM-9136:
---
Status: Open  (was: Triage Needed)

> Add LICENSES and NOTICES to docker images
> -
>
> Key: BEAM-9136
> URL: https://issues.apache.org/jira/browse/BEAM-9136
> Project: Beam
>  Issue Type: Task
>  Components: build-system
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.20.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-9135) Add an example with Java sdk container image.

2020-01-16 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang updated BEAM-9135:
---
Status: Open  (was: Triage Needed)

> Add an example with Java sdk container image.
> -
>
> Key: BEAM-9135
> URL: https://issues.apache.org/jira/browse/BEAM-9135
> Project: Beam
>  Issue Type: Task
>  Components: build-system
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: Not applicable
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (BEAM-8250) Make docker images registry root consistent

2020-01-16 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang closed BEAM-8250.
--
Fix Version/s: Not applicable
   Resolution: Duplicate

> Make docker images registry root consistent
> ---
>
> Key: BEAM-8250
> URL: https://issues.apache.org/jira/browse/BEAM-8250
> Project: Beam
>  Issue Type: Improvement
>  Components: build-system
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: Not applicable
>
>
> Docker images use two different docker repository root:
> *${System.properties["user.name"]}-docker-apache.bintray.io/beam (all other 
> images) vs docker.io/apachebeam (SDK images).*
> Make it consistent to use apachebeam for all images.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (BEAM-9084) Cleaning up SDK docker image tagging

2020-01-16 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang closed BEAM-9084.
--
Fix Version/s: 2.20.0
   Resolution: Fixed

> Cleaning up SDK docker image tagging
> 
>
> Key: BEAM-9084
> URL: https://issues.apache.org/jira/browse/BEAM-9084
> Project: Beam
>  Issue Type: Task
>  Components: build-system
>Affects Versions: 2.16.0, 2.17.0
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.20.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-9136) Add LICENSES and NOTICES to docker images

2020-01-16 Thread Hannah Jiang (Jira)
Hannah Jiang created BEAM-9136:
--

 Summary: Add LICENSES and NOTICES to docker images
 Key: BEAM-9136
 URL: https://issues.apache.org/jira/browse/BEAM-9136
 Project: Beam
  Issue Type: Task
  Components: build-system
Reporter: Hannah Jiang
Assignee: Hannah Jiang
 Fix For: 2.20.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-9135) Add an example with Java sdk container image.

2020-01-16 Thread Hannah Jiang (Jira)
Hannah Jiang created BEAM-9135:
--

 Summary: Add an example with Java sdk container image.
 Key: BEAM-9135
 URL: https://issues.apache.org/jira/browse/BEAM-9135
 Project: Beam
  Issue Type: Task
  Components: build-system
Reporter: Hannah Jiang
Assignee: Hannah Jiang
 Fix For: Not applicable






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work started] (BEAM-9063) Migrate docker images to apache namespace.

2020-01-15 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on BEAM-9063 started by Hannah Jiang.
--
> Migrate docker images to apache namespace.
> --
>
> Key: BEAM-9063
> URL: https://issues.apache.org/jira/browse/BEAM-9063
> Project: Beam
>  Issue Type: Task
>  Components: beam-community
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: Not applicable
>
>
> https://hub.docker.com/u/apache



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-9115) Is the release script set_version.sh in use?

2020-01-13 Thread Hannah Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014736#comment-17014736
 ] 

Hannah Jiang commented on BEAM-9115:


This script is used to update versions that are hardcoded in the code. 
For example, https://github.com/apache/beam/blob/master/gradle.properties#L27 
is updated with the script.
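
A minimal sketch of that kind of in-place version rewrite (the actual script is 
shell-based; the property text below is an assumed example, not the real file 
contents):

```python
import re

def set_version(props_text, new_version):
    """Rewrite a hardcoded 'version=...' line in a properties file,
    leaving every other property untouched."""
    return re.sub(r"(?m)^(version=).*$",
                  lambda m: m.group(1) + new_version,
                  props_text)

# Assumed gradle.properties-style contents for illustration.
props = "offlineRepositoryRoot=offline-repository\nversion=2.19.0-SNAPSHOT\n"
updated = set_version(props, "2.20.0-SNAPSHOT")
print(updated)
```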

> Is the release script set_version.sh in use?
> 
>
> Key: BEAM-9115
> URL: https://issues.apache.org/jira/browse/BEAM-9115
> Project: Beam
>  Issue Type: Bug
>  Components: build-system
>Reporter: Udi Meiri
>Assignee: Kenneth Knowles
>Priority: Minor
>
> I can't find any references to it in the Beam github repo.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (BEAM-9068) Use local docker image if available

2020-01-10 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang closed BEAM-9068.
--
Fix Version/s: Not applicable
   Resolution: Won't Fix

> Use local docker image if available
> ---
>
> Key: BEAM-9068
> URL: https://issues.apache.org/jira/browse/BEAM-9068
> Project: Beam
>  Issue Type: Task
>  Components: build-system
>Affects Versions: 2.17.0
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: Not applicable
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work started] (BEAM-9084) Cleaning up SDK docker image tagging

2020-01-10 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on BEAM-9084 started by Hannah Jiang.
--
> Cleaning up SDK docker image tagging
> 
>
> Key: BEAM-9084
> URL: https://issues.apache.org/jira/browse/BEAM-9084
> Project: Beam
>  Issue Type: Task
>  Components: build-system
>Affects Versions: 2.16.0, 2.17.0
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-9063) Migrate docker images to apache namespace.

2020-01-10 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang updated BEAM-9063:
---
Summary: Migrate docker images to apache namespace.  (was: Make docker 
images official images)

> Migrate docker images to apache namespace.
> --
>
> Key: BEAM-9063
> URL: https://issues.apache.org/jira/browse/BEAM-9063
> Project: Beam
>  Issue Type: Task
>  Components: beam-community
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: Not applicable
>
>
> Documentation: [https://docs.docker.com/docker-hub/official_images/]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-9063) Migrate docker images to apache namespace.

2020-01-10 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang updated BEAM-9063:
---
Description: https://hub.docker.com/u/apache  (was: Documentation: 
[https://docs.docker.com/docker-hub/official_images/])

> Migrate docker images to apache namespace.
> --
>
> Key: BEAM-9063
> URL: https://issues.apache.org/jira/browse/BEAM-9063
> Project: Beam
>  Issue Type: Task
>  Components: beam-community
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: Not applicable
>
>
> https://hub.docker.com/u/apache





[jira] [Comment Edited] (BEAM-7861) Make it easy to change between multi-process and multi-thread mode for Python Direct runners

2020-01-10 Thread Hannah Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-7861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17013105#comment-17013105
 ] 

Hannah Jiang edited comment on BEAM-7861 at 1/10/20 8:17 PM:
-

We can use --direct_running_mode to switch between multi_threading and 
multi_processing. direct_running_mode can be set to one of ['in_memory', 
'multi_threading', 'multi_processing']. The default mode is in_memory.

*in_memory*: multi-threading mode; the workers and the runner communicate in 
memory (not through gRPC).
*multi_threading*: multi-threading mode; the workers and the runner communicate 
through gRPC.
*multi_processing*: multi-processing mode; the workers and the runner 
communicate through gRPC.

Here is how to set the direct_running_mode.
 *Option 1*: set it with PipelineOptions().
{code:java}
 pipeline_options = PipelineOptions(direct_num_workers=2, 
direct_running_mode='multi_threading')
 p = beam.Pipeline(
         runner=fn_api_runner.FnApiRunner(),
         options=pipeline_options)
{code}

*Option 2*: pass it with CLI.
{code:java}
 python xxx --direct_num_workers 2 --direct_running_mode multi_threading

known_args, pipeline_args = parser.parse_known_args(argv)
pipeline_options = PipelineOptions(pipeline_args)
p = beam.Pipeline(
        runner=fn_api_runner.FnApiRunner(),
        options=pipeline_options)
{code}
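As a rough stdlib-only sketch (not Beam code) of what the modes mean in practice: in in_memory and multi_threading mode all workers share one Python interpreter and can share state directly, while in multi_processing mode each worker is a separate process with its own memory, which is why worker/runner communication must go over gRPC there. The names below (record, counts) are illustrative, not Beam APIs.

```python
# Stdlib sketch only -- not Beam code. Threads share one interpreter and one
# address space, so they can all mutate the same dict; separate processes
# (as in multi_processing mode) could not, which is why that mode needs gRPC.
from concurrent.futures import ThreadPoolExecutor
import threading

counts = {}          # shared state, visible to every worker thread
lock = threading.Lock()

def record(x):
    # Count how many elements each worker thread handled.
    name = threading.current_thread().name
    with lock:
        counts[name] = counts.get(name, 0) + 1
    return x * 2

with ThreadPoolExecutor(max_workers=2) as pool:
    doubled = list(pool.map(record, range(8)))
```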


was (Author: hannahjiang):
We can use --direct_running_mode to switch between multi_threading and 
multi_processing.
 We can set direct_running_mode to one of ['in_memory',  'multi_threading', 
'multi_processing']. Default mode is in_memory.

*in_memory*: it is multi threading mode, worker and runners' communication 
happens in the memory (not through gRPC).
 *multi_threading*: it is multi threading mode, worker and runners communicate 
through gRPC.
 *multi_processing*: it is multi processing, worker and runners communicate 
through gRPC.

Here is how to set the direct_running_mode.
 *Option 1*: set it with PipelineOptions().
{code:java}
 pipeline_options = PipelineOptions(direct_num_workers=2, 
direct_running_mode=multi_threading)
 p = beam.Pipeline(
         runner=fn_api_runner.FnApiRunner(),
         options=pipeline_options)
{code}

*Option 2*: pass it with CLI.
{code:java}
 python xxx --direct_num_workers 2 - -direct_running_mode 'multi_threading'

known_args, pipeline_args = parser.parse_known_args(argv)
 p = beam.Pipeline(
         runner=fn_api_runner.FnApiRunner(),
         options=pipeline_options)
{code}

> Make it easy to change between multi-process and multi-thread mode for Python 
> Direct runners
> 
>
> Key: BEAM-7861
> URL: https://issues.apache.org/jira/browse/BEAM-7861
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.19.0
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> BEAM-3645 makes it possible to run a map task in parallel.
> However, users need to change the runner when switching between 
> multithreading and multiprocessing mode.
> We want to add a flag (e.g. --use-multiprocess) to make the switch easy 
> without changing the runner each time.





[jira] [Comment Edited] (BEAM-7861) Make it easy to change between multi-process and multi-thread mode for Python Direct runners

2020-01-10 Thread Hannah Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-7861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17013105#comment-17013105
 ] 

Hannah Jiang edited comment on BEAM-7861 at 1/10/20 8:16 PM:
-

We can use --direct_running_mode to switch between multi_threading and 
multi_processing. direct_running_mode can be set to one of ['in_memory', 
'multi_threading', 'multi_processing']. The default mode is in_memory.

*in_memory*: multi-threading mode; the workers and the runner communicate in 
memory (not through gRPC).
*multi_threading*: multi-threading mode; the workers and the runner communicate 
through gRPC.
*multi_processing*: multi-processing mode; the workers and the runner 
communicate through gRPC.

Here is how to set the direct_running_mode.
 *Option 1*: set it with PipelineOptions().
{code:java}
 pipeline_options = PipelineOptions(direct_num_workers=2, 
direct_running_mode='multi_threading')
 p = beam.Pipeline(
         runner=fn_api_runner.FnApiRunner(),
         options=pipeline_options)
{code}

*Option 2*: pass it with CLI.
{code:java}
 python xxx --direct_num_workers 2 --direct_running_mode 'multi_threading'

known_args, pipeline_args = parser.parse_known_args(argv)
pipeline_options = PipelineOptions(pipeline_args)
p = beam.Pipeline(
        runner=fn_api_runner.FnApiRunner(),
        options=pipeline_options)
{code}


was (Author: hannahjiang):
We can use --direct_running_mode to switch between multi_threading and 
multi_processing.
 We can set direct_running_mode to one of ['in_memory',  'multi_threading', 
'multi_processing']. Default mode is in_memory.

*in_memory*: it is multi threading mode, worker and runners' communication 
happens in the memory (not through gRPC).
 *multi_threading*: it is multi threading mode, worker and runners communicate 
through gRPC.
 *multi_processing*: it is multi processing, worker and runners communicate 
through gRPC.

Here is how to set the direct_running_mode.
 *Option 1*: set it with PipelineOptions().
{code:java}
 pipeline_options = PipelineOptions(direct_num_workers=2, 
direct_running_mode='multi_threading')
 p = beam.Pipeline(
         runner=fn_api_runner.FnApiRunner(),
         options=pipeline_options)
{code}

*Option 2*: pass it with CLI.
{code:java}
 python xxx --direct_num_workers 2 - -direct_running_mode 'multi_threading'

known_args, pipeline_args = parser.parse_known_args(argv)
 p = beam.Pipeline(
         runner=fn_api_runner.FnApiRunner(),
         options=pipeline_options)
{code}

> Make it easy to change between multi-process and multi-thread mode for Python 
> Direct runners
> 
>
> Key: BEAM-7861
> URL: https://issues.apache.org/jira/browse/BEAM-7861
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.19.0
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> BEAM-3645 makes it possible to run a map task in parallel.
> However, users need to change the runner when switching between 
> multithreading and multiprocessing mode.
> We want to add a flag (e.g. --use-multiprocess) to make the switch easy 
> without changing the runner each time.





[jira] [Comment Edited] (BEAM-3645) Support multi-process execution on the FnApiRunner

2020-01-10 Thread Hannah Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-3645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896575#comment-16896575
 ] 

Hannah Jiang edited comment on BEAM-3645 at 1/10/20 8:15 PM:
-

*{color:#ff}Update on 01/10/2020{color}*

We can use --direct_running_mode to switch between multi_threading and 
multi_processing. direct_running_mode can be set to one of ['in_memory', 
'multi_threading', 'multi_processing']. The default mode is in_memory.

*in_memory*: multi-threading mode; the workers and the runner communicate in 
memory (not through gRPC).
*multi_threading*: multi-threading mode; the workers and the runner communicate 
through gRPC.
*multi_processing*: multi-processing mode; the workers and the runner 
communicate through gRPC.

Here is how to set the direct_running_mode.
 *Option 1*: set it with PipelineOptions().
{code:java}
 pipeline_options = PipelineOptions(direct_num_workers=2, 
direct_running_mode='multi_threading')
 p = beam.Pipeline(
         runner=fn_api_runner.FnApiRunner(),
         options=pipeline_options)
{code}

*Option 2*: pass it with CLI.
{code:java}
 python xxx --direct_num_workers 2 --direct_running_mode multi_threading

known_args, pipeline_args = parser.parse_known_args(argv)
pipeline_options = PipelineOptions(pipeline_args)
p = beam.Pipeline(
        runner=fn_api_runner.FnApiRunner(),
        options=pipeline_options)
{code}
 

{color:#ff}*Update on 30/06/2018.*{color}

Direct runner can now process map tasks across multiple workers. Depending on 
running environment, these workers are running in multithreading or 
multiprocessing mode.

_*It is supported from Beam 2.15.*_

*Run with multiprocessing mode*:
{code:java}
# using subprocess runner
p = beam.Pipeline(options=pipeline_options,
  runner=fn_api_runner.FnApiRunner(
default_environment=beam_runner_api_pb2.Environment(
  urn=python_urns.SUBPROCESS_SDK,
  payload=b'%s -m apache_beam.runners.worker.sdk_worker_main' %
  sys.executable.encode('ascii'
{code}
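For readers unfamiliar with the payload above: it is a command-line template in which %s is replaced by the path of the current Python interpreter, so each worker is started as a separate Python process running the Beam worker module. A stdlib-only sketch of the same launch pattern, with json.tool standing in for the Beam worker module (json.tool is not part of Beam, just a convenient stdlib module that reads stdin):

```python
# Stdlib sketch of the '%s -m <module>' launch pattern used by the
# SUBPROCESS_SDK payload above; json.tool stands in for the Beam worker module.
import subprocess
import sys

# The payload is a template like this, filled with the interpreter path:
payload = '%s -m json.tool' % sys.executable

# Build the argv list directly (safer than splitting the string, in case the
# interpreter path contains spaces) and run the module as a child process.
result = subprocess.run([sys.executable, '-m', 'json.tool'],
                        input='{"a": 1}', capture_output=True,
                        text=True, check=True)
```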
 

*Run with multithreading mode:*
{code:java}
# using embedded grpc runner
p = beam.Pipeline(options=pipeline_options,
  runner=fn_api_runner.FnApiRunner(
    default_environment=beam_runner_api_pb2.Environment(
      urn=python_urns.EMBEDDED_PYTHON_GRPC,
      payload=b'1'))) # payload is # of threads of each worker.{code}
 

*--direct_num_workers* option is used to control parallelism. Default value is 
1. 
{code:java}
# an example to pass it from CLI.
python wordcount.py --input xx --output xx --direct_num_workers 2

# an example to set it with PipelineOptions.
from apache_beam.options.pipeline_options import PipelineOptions
pipeline_options = PipelineOptions(['--direct_num_workers', '2'])

# an example to add it to existing pipeline options.
from apache_beam.options.pipeline_options import DirectOptions
pipeline_options = xxx
pipeline_options.view_as(DirectOptions).direct_num_workers = 2{code}
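The CLI snippets above split argv with parse_known_args; as a stdlib-only sketch (the flag values here are made up), this is how the script's own flags get separated from the flags that are forwarded to the pipeline options:

```python
# Stdlib sketch: parse_known_args keeps the flags this parser knows about and
# returns everything else untouched, so unknown flags such as
# --direct_num_workers can be forwarded to PipelineOptions.
import argparse

argv = ['--input', 'in.txt', '--output', 'out.txt',
        '--direct_num_workers', '2',
        '--direct_running_mode', 'multi_threading']

parser = argparse.ArgumentParser()
parser.add_argument('--input')
parser.add_argument('--output')

# known_args holds --input/--output; pipeline_args keeps the rest, in order.
known_args, pipeline_args = parser.parse_known_args(argv)
```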


was (Author: hannahjiang):
*{color:#ff}Update on 01/10/2019{color}*

We can use --direct_running_mode to switch between multi_threading and 
multi_processing.
direct_running_mode can be one of ['in_memory',  'multi_threading', 
'multi_processing']. Default mode is in_memory.

*in_memory*: it is multi threading mode, worker and runners' communication 
happens in the memory (not through gRPC).
 *multi_threading*: it is multi threading mode, worker and runners communicate 
through gRPC.
 *multi_processing*: it is multi processing, worker and runners communicate 
through gRPC.

Here is how to set the direct_running_mode.
 *Option 1*: set it with PipelineOptions().
{code:java}
 pipeline_options = PipelineOptions(direct_num_workers=2, 
direct_running_mode='multi_threading')
 p = beam.Pipeline(
         runner=fn_api_runner.FnApiRunner(),
         options=pipeline_options)
{code}

*Option 2*: pass it with CLI.
{code:java}
 python xxx --direct_num_workers 2 - -direct_running_mode 'multi_threading'

known_args, pipeline_args = parser.parse_known_args(argv)
 p = beam.Pipeline(
         runner=fn_api_runner.FnApiRunner(),
         options=pipeline_options)
{code}
 

{color:#ff}*Update on 30/06/2018.*{color}

Direct runner can now process map tasks across multiple workers. Depending on 
running environment, these workers are running in multithreading or 
multiprocessing mode.

_*It is supported from Beam 2.15.*_

*Run with multiprocessing mode*:
{code:java}
# using subprocess runner
p = beam.Pipeline(options=pipeline_options,
  runner=fn_api_runner.FnApiRunner(
default_environment=beam_runner_api_pb2.Environment(
  urn=python_urns.SUBPROCESS_SDK,
  payload=b'%s -m apache_beam.runners.worker.sdk_worker_main' %
  sys.executable.encode('ascii'
{code}
 

*Run with multithreading mode:*
{code:java}
# using embedded grpc runner
p = beam.Pipeline(options=pipeline_options,
  runner=fn_api_runner.FnApiRunner(
    default_environment=beam_runner_api_pb2.Environment(
      

[jira] [Comment Edited] (BEAM-3645) Support multi-process execution on the FnApiRunner

2020-01-10 Thread Hannah Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-3645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896575#comment-16896575
 ] 

Hannah Jiang edited comment on BEAM-3645 at 1/10/20 6:24 PM:
-

*{color:#ff}Update on 01/10/2020{color}*

We can use --direct_running_mode to switch between multi_threading and 
multi_processing. direct_running_mode can be set to one of ['in_memory', 
'multi_threading', 'multi_processing']. The default mode is in_memory.

*in_memory*: multi-threading mode; the workers and the runner communicate in 
memory (not through gRPC).
*multi_threading*: multi-threading mode; the workers and the runner communicate 
through gRPC.
*multi_processing*: multi-processing mode; the workers and the runner 
communicate through gRPC.

Here is how to set the direct_running_mode.
 *Option 1*: set it with PipelineOptions().
{code:java}
 pipeline_options = PipelineOptions(direct_num_workers=2, 
direct_running_mode='multi_threading')
 p = beam.Pipeline(
         runner=fn_api_runner.FnApiRunner(),
         options=pipeline_options)
{code}

*Option 2*: pass it with CLI.
{code:java}
 python xxx --direct_num_workers 2 --direct_running_mode 'multi_threading'

known_args, pipeline_args = parser.parse_known_args(argv)
pipeline_options = PipelineOptions(pipeline_args)
p = beam.Pipeline(
        runner=fn_api_runner.FnApiRunner(),
        options=pipeline_options)
{code}
 

{color:#ff}*Update on 30/06/2018.*{color}

Direct runner can now process map tasks across multiple workers. Depending on 
running environment, these workers are running in multithreading or 
multiprocessing mode.

_*It is supported from Beam 2.15.*_

*Run with multiprocessing mode*:
{code:java}
# using subprocess runner
p = beam.Pipeline(options=pipeline_options,
  runner=fn_api_runner.FnApiRunner(
default_environment=beam_runner_api_pb2.Environment(
  urn=python_urns.SUBPROCESS_SDK,
  payload=b'%s -m apache_beam.runners.worker.sdk_worker_main' %
  sys.executable.encode('ascii'
{code}
 

*Run with multithreading mode:*
{code:java}
# using embedded grpc runner
p = beam.Pipeline(options=pipeline_options,
  runner=fn_api_runner.FnApiRunner(
    default_environment=beam_runner_api_pb2.Environment(
      urn=python_urns.EMBEDDED_PYTHON_GRPC,
      payload=b'1'))) # payload is # of threads of each worker.{code}
 

*--direct_num_workers* option is used to control parallelism. Default value is 
1. 
{code:java}
# an example to pass it from CLI.
python wordcount.py --input xx --output xx --direct_num_workers 2

# an example to set it with PipelineOptions.
from apache_beam.options.pipeline_options import PipelineOptions
pipeline_options = PipelineOptions(['--direct_num_workers', '2'])

# an example to add it to existing pipeline options.
from apache_beam.options.pipeline_options import DirectOptions
pipeline_options = xxx
pipeline_options.view_as(DirectOptions).direct_num_workers = 2{code}


was (Author: hannahjiang):
*{color:#ff}Update on 01/10/2019{color}*

We added a new option (–direct_running_mode) to make it easy to switch between 
multi_threading and multi_processing mode. It is available from *v2.19.0*.

We can use --direct_running_mode to switch between multi_threading and 
multi_processing.
 The direct_running_mode can be set to one of ['in_memory',  'multi_threading', 
'multi_processing']. Default mode is in_memory.

*in_memory*: it is multi threading mode, worker and runners' communication 
happens in the memory (not through gRPC).
 *multi_threading*: it is multi threading mode, worker and runners communicate 
through gRPC.
 *multi_processing*: it is multi processing, worker and runners communicate 
through gRPC.

Here is how to set the direct_running_mode.
 *Option 1*: set it with pipeline options.
{code:java}
 pipeline_options = PipelineOptions(direct_num_workers=2, 
direct_running_mode='multi_threading')
 p = beam.Pipeline(
         runner=fn_api_runner.FnApiRunner(),
         options=pipeline_options)
{code}
*Option 2*: pass it with CLI.
{code:java}
 python xxx --direct_num_workers 2 - -direct_running_mode 'multi_threading'
 p = beam.Pipeline(runner=fn_api_runner.FnApiRunner())
{code}
 

{color:#ff}*Update on 30/06/2018.*{color}

Direct runner can now process map tasks across multiple workers. Depending on 
running environment, these workers are running in multithreading or 
multiprocessing mode.

_*It is supported from Beam 2.15.*_

*Run with multiprocessing mode*:
{code:java}
# using subprocess runner
p = beam.Pipeline(options=pipeline_options,
  runner=fn_api_runner.FnApiRunner(
default_environment=beam_runner_api_pb2.Environment(
  urn=python_urns.SUBPROCESS_SDK,
  payload=b'%s -m apache_beam.runners.worker.sdk_worker_main' %
  sys.executable.encode('ascii'
{code}
 

*Run with multithreading mode:*
{code:java}
# using embedded grpc runner
p = beam.Pipeline(options=pipeline_options,
  runner=fn_api_runner.FnApiRunner(
    

[jira] [Comment Edited] (BEAM-7861) Make it easy to change between multi-process and multi-thread mode for Python Direct runners

2020-01-10 Thread Hannah Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-7861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17013105#comment-17013105
 ] 

Hannah Jiang edited comment on BEAM-7861 at 1/10/20 6:22 PM:
-

We can use --direct_running_mode to switch between multi_threading and 
multi_processing. direct_running_mode can be set to one of ['in_memory', 
'multi_threading', 'multi_processing']. The default mode is in_memory.

*in_memory*: multi-threading mode; the workers and the runner communicate in 
memory (not through gRPC).
*multi_threading*: multi-threading mode; the workers and the runner communicate 
through gRPC.
*multi_processing*: multi-processing mode; the workers and the runner 
communicate through gRPC.

Here is how to set the direct_running_mode.
 *Option 1*: set it with PipelineOptions().
{code:java}
 pipeline_options = PipelineOptions(direct_num_workers=2, 
direct_running_mode='multi_threading')
 p = beam.Pipeline(
         runner=fn_api_runner.FnApiRunner(),
         options=pipeline_options)
{code}

*Option 2*: pass it with CLI.
{code:java}
 python xxx --direct_num_workers 2 --direct_running_mode 'multi_threading'

known_args, pipeline_args = parser.parse_known_args(argv)
pipeline_options = PipelineOptions(pipeline_args)
p = beam.Pipeline(
        runner=fn_api_runner.FnApiRunner(),
        options=pipeline_options)
{code}


was (Author: hannahjiang):
We can use --direct_running_mode to switch between multi_threading and 
multi_processing.
 We can set direct_running_mode to one of ['in_memory',  'multi_threading', 
'multi_processing']. Default mode is in_memory.

*in_memory*: it is multi threading mode, worker and runners' communication 
happens in the memory (not through gRPC).
 *multi_threading*: it is multi threading mode, worker and runners communicate 
through gRPC.
 *multi_processing*: it is multi processing, worker and runners communicate 
through gRPC.

Here is how to set the direct_running_mode.
 *Option 1*: set it with pipeline options.
{code:java}
 pipeline_options = PipelineOptions(direct_num_workers=2, 
direct_running_mode='multi_threading')
 p = beam.Pipeline(
         runner=fn_api_runner.FnApiRunner(),
         options=pipeline_options)
{code}

*Option 2*: pass it with CLI.
{code:java}
 python xxx --direct_num_workers 2 - -direct_running_mode 'multi_threading'
 p = beam.Pipeline(runner=fn_api_runner.FnApiRunner())
{code}

> Make it easy to change between multi-process and multi-thread mode for Python 
> Direct runners
> 
>
> Key: BEAM-7861
> URL: https://issues.apache.org/jira/browse/BEAM-7861
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.19.0
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> BEAM-3645 makes it possible to run a map task in parallel.
> However, users need to change the runner when switching between 
> multithreading and multiprocessing mode.
> We want to add a flag (e.g. --use-multiprocess) to make the switch easy 
> without changing the runner each time.





[jira] [Comment Edited] (BEAM-3645) Support multi-process execution on the FnApiRunner

2020-01-10 Thread Hannah Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-3645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896575#comment-16896575
 ] 

Hannah Jiang edited comment on BEAM-3645 at 1/10/20 6:17 PM:
-

*{color:#ff}Update on 01/10/2020{color}*

We added a new option (--direct_running_mode) to make it easy to switch between 
multi_threading and multi_processing mode. It is available from *v2.19.0*.

direct_running_mode can be set to one of ['in_memory', 'multi_threading', 
'multi_processing']. The default mode is in_memory.

*in_memory*: multi-threading mode; the workers and the runner communicate in 
memory (not through gRPC).
*multi_threading*: multi-threading mode; the workers and the runner communicate 
through gRPC.
*multi_processing*: multi-processing mode; the workers and the runner 
communicate through gRPC.

Here is how to set the direct_running_mode.
 *Option 1*: set it with pipeline options.
{code:java}
 pipeline_options = PipelineOptions(direct_num_workers=2, 
direct_running_mode='multi_threading')
 p = beam.Pipeline(
         runner=fn_api_runner.FnApiRunner(),
         options=pipeline_options)
{code}
*Option 2*: pass it with CLI.
{code:java}
 python xxx --direct_num_workers 2 --direct_running_mode 'multi_threading'
 p = beam.Pipeline(runner=fn_api_runner.FnApiRunner())
{code}
 

{color:#ff}*Update on 30/06/2018.*{color}

Direct runner can now process map tasks across multiple workers. Depending on 
running environment, these workers are running in multithreading or 
multiprocessing mode.

_*It is supported from Beam 2.15.*_

*Run with multiprocessing mode*:
{code:java}
# using subprocess runner
p = beam.Pipeline(options=pipeline_options,
  runner=fn_api_runner.FnApiRunner(
default_environment=beam_runner_api_pb2.Environment(
  urn=python_urns.SUBPROCESS_SDK,
  payload=b'%s -m apache_beam.runners.worker.sdk_worker_main' %
  sys.executable.encode('ascii'
{code}
 

*Run with multithreading mode:*
{code:java}
# using embedded grpc runner
p = beam.Pipeline(options=pipeline_options,
  runner=fn_api_runner.FnApiRunner(
    default_environment=beam_runner_api_pb2.Environment(
      urn=python_urns.EMBEDDED_PYTHON_GRPC,
      payload=b'1'))) # payload is # of threads of each worker.{code}
 

*--direct_num_workers* option is used to control parallelism. Default value is 
1. 
{code:java}
# an example to pass it from CLI.
python wordcount.py --input xx --output xx --direct_num_workers 2

# an example to set it with PipelineOptions.
from apache_beam.options.pipeline_options import PipelineOptions
pipeline_options = PipelineOptions(['--direct_num_workers', '2'])

# an example to add it to existing pipeline options.
from apache_beam.options.pipeline_options import DirectOptions
pipeline_options = xxx
pipeline_options.view_as(DirectOptions).direct_num_workers = 2{code}


was (Author: hannahjiang):
*{color:#ff}Update on 01/10/2019{color}*

We added a new option (–direct_running_mode) to make it easy to switch between 
multi_threading and multi_processing mode. It is available from *v2.19.0*.

We can use --direct_running_mode to switch between multi_threading and 
multi_processing.
 We can set direct_running_mode to one of ['in_memory',  'multi_threading', 
'multi_processing']. Default mode is in_memory.

*in_memory*: it is multi threading mode, worker and runners' communication 
happens in the memory (not through gRPC).
 *multi_threading*: it is multi threading mode, worker and runners communicate 
through gRPC.
 *multi_processing*: it is multi processing, worker and runners communicate 
through gRPC.

Here is how to set the direct_running_mode.
 *Option 1*: set it with pipeline options.
{code:java}
 pipeline_options = PipelineOptions(direct_num_workers=2, 
direct_running_mode='multi_threading')
 p = beam.Pipeline(
         runner=fn_api_runner.FnApiRunner(),
         options=pipeline_options)
{code}
*Option 2*: pass it with CLI.
{code:java}
 python xxx --direct_num_workers 2 - -direct_running_mode 'multi_threading'
 p = beam.Pipeline(runner=fn_api_runner.FnApiRunner())
{code}
 

{color:#FF}*Update on 30/06/2018.*{color}

Direct runner can now process map tasks across multiple workers. Depending on 
running environment, these workers are running in multithreading or 
multiprocessing mode.

_*It is supported from Beam 2.15.*_

*Run with multiprocessing mode*:
{code:java}
# using subprocess runner
p = beam.Pipeline(options=pipeline_options,
  runner=fn_api_runner.FnApiRunner(
default_environment=beam_runner_api_pb2.Environment(
  urn=python_urns.SUBPROCESS_SDK,
  payload=b'%s -m apache_beam.runners.worker.sdk_worker_main' %
  sys.executable.encode('ascii'
{code}
 

*Run with multithreading mode:*
{code:java}
# using embedded grpc runner
p = 

[jira] [Comment Edited] (BEAM-3645) Support multi-process execution on the FnApiRunner

2020-01-10 Thread Hannah Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-3645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896575#comment-16896575
 ] 

Hannah Jiang edited comment on BEAM-3645 at 1/10/20 6:16 PM:
-

*{color:#ff}Update on 01/10/2020{color}*

We added a new option (--direct_running_mode) to make it easy to switch between 
multi_threading and multi_processing mode. It is available from *v2.19.0*.

direct_running_mode can be set to one of ['in_memory', 'multi_threading', 
'multi_processing']. The default mode is in_memory.

*in_memory*: multi-threading mode; the workers and the runner communicate in 
memory (not through gRPC).
*multi_threading*: multi-threading mode; the workers and the runner communicate 
through gRPC.
*multi_processing*: multi-processing mode; the workers and the runner 
communicate through gRPC.

Here is how to set the direct_running_mode.
 *Option 1*: set it with pipeline options.
{code:java}
 pipeline_options = PipelineOptions(direct_num_workers=2, 
direct_running_mode='multi_threading')
 p = beam.Pipeline(
         runner=fn_api_runner.FnApiRunner(),
         options=pipeline_options)
{code}
*Option 2*: pass it with CLI.
{code:java}
 python xxx --direct_num_workers 2 --direct_running_mode 'multi_threading'
 p = beam.Pipeline(runner=fn_api_runner.FnApiRunner())
{code}
 

{color:#FF}*Update on 30/06/2018.*{color}

Direct runner can now process map tasks across multiple workers. Depending on 
running environment, these workers are running in multithreading or 
multiprocessing mode.

_*It is supported from Beam 2.15.*_

*Run with multiprocessing mode*:
{code:java}
# using subprocess runner
p = beam.Pipeline(options=pipeline_options,
  runner=fn_api_runner.FnApiRunner(
default_environment=beam_runner_api_pb2.Environment(
  urn=python_urns.SUBPROCESS_SDK,
  payload=b'%s -m apache_beam.runners.worker.sdk_worker_main' %
  sys.executable.encode('ascii'
{code}
 

*Run with multithreading mode:*
{code:java}
# using embedded grpc runner
p = beam.Pipeline(options=pipeline_options,
  runner=fn_api_runner.FnApiRunner(
    default_environment=beam_runner_api_pb2.Environment(
      urn=python_urns.EMBEDDED_PYTHON_GRPC,
      payload=b'1'))) # payload is # of threads of each worker.{code}
 

*--direct_num_workers* option is used to control parallelism. Default value is 
1. 
{code:java}
# an example to pass it from CLI.
python wordcount.py --input xx --output xx --direct_num_workers 2

# an example to set it with PipelineOptions.
from apache_beam.options.pipeline_options import PipelineOptions
pipeline_options = PipelineOptions(['--direct_num_workers', '2'])

# an example to add it to existing pipeline options.
from apache_beam.options.pipeline_options import DirectOptions
pipeline_options = xxx
pipeline_options.view_as(DirectOptions).direct_num_workers = 2{code}


was (Author: hannahjiang):
*{color:#FF}Update on 01/10/2019{color}*

We added a new option (–direct_running_mode) to make it easy to switch between 
multi_threading and multi_processing mode. It is available from *v2.19.0*.

We can use --direct_running_mode to switch between multi_threading and 
multi_processing.
 We can set direct_running_mode to one of ['in_memory',  'multi_threading', 
'multi_processing']. Default mode is in_memory.

*in_memory*: it is multi threading mode, worker and runners' communication 
happens in the memory (not through gRPC).
 *multi_threading*: it is multi threading mode, worker and runners communicate 
through gRPC.
 *multi_processing*: it is multi processing, worker and runners communicate 
through gRPC.

Here is how to set the direct_running_mode.
 *Option 1*: set it with pipeline options.
{code:java}
 pipeline_options = PipelineOptions(direct_num_workers=2, 
direct_running_mode='multi_threading')
 p = beam.Pipeline(
         runner=fn_api_runner.FnApiRunner(),
         options=pipeline_options)
{code}

*Option 2*: pass it with CLI.
{code:java}
 python xxx --direct_num_workers 2 - -direct_running_mode 'multi_threading'
 p = beam.Pipeline(runner=fn_api_runner.FnApiRunner())
{code}

 

Update on 30/06/2018.

Direct runner can now process map tasks across multiple workers. Depending on 
running environment, these workers are running in multithreading or 
multiprocessing mode.

_*It is supported from Beam 2.15.*_

*Run with multiprocessing mode*:
{code:java}
# using subprocess runner
p = beam.Pipeline(options=pipeline_options,
  runner=fn_api_runner.FnApiRunner(
default_environment=beam_runner_api_pb2.Environment(
  urn=python_urns.SUBPROCESS_SDK,
  payload=b'%s -m apache_beam.runners.worker.sdk_worker_main' %
  sys.executable.encode('ascii'
{code}
 

*Run with multithreading mode:*
{code:java}
# using embedded grpc runner
p = beam.Pipeline(options=pipeline_options,
  

[jira] [Comment Edited] (BEAM-3645) Support multi-process execution on the FnApiRunner

2020-01-10 Thread Hannah Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-3645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896575#comment-16896575
 ] 

Hannah Jiang edited comment on BEAM-3645 at 1/10/20 6:16 PM:
-

*{color:#FF}Update on 01/10/2020{color}*

We added a new option (--direct_running_mode) to make it easy to switch between 
multi_threading and multi_processing mode. It is available from *v2.19.0*.

direct_running_mode can be set to one of ['in_memory', 'multi_threading', 
'multi_processing']. The default mode is in_memory.

*in_memory*: multi-threading mode; the workers and the runner communicate in 
memory (not through gRPC).
*multi_threading*: multi-threading mode; the workers and the runner communicate 
through gRPC.
*multi_processing*: multi-processing mode; the workers and the runner 
communicate through gRPC.

Here is how to set the direct_running_mode.
 *Option 1*: set it with pipeline options.
{code:java}
 pipeline_options = PipelineOptions(direct_num_workers=2, 
direct_running_mode='multi_threading')
 p = beam.Pipeline(
         runner=fn_api_runner.FnApiRunner(),
         options=pipeline_options)
{code}

*Option 2*: pass it with CLI.
{code:java}
 python xxx --direct_num_workers 2 --direct_running_mode 'multi_threading'
 p = beam.Pipeline(runner=fn_api_runner.FnApiRunner())
{code}

 

Update on 30/06/2018.

Direct runner can now process map tasks across multiple workers. Depending on 
running environment, these workers are running in multithreading or 
multiprocessing mode.

_*It is supported from Beam 2.15.*_

*Run with multiprocessing mode*:
{code:java}
# using subprocess runner
p = beam.Pipeline(options=pipeline_options,
  runner=fn_api_runner.FnApiRunner(
default_environment=beam_runner_api_pb2.Environment(
  urn=python_urns.SUBPROCESS_SDK,
  payload=b'%s -m apache_beam.runners.worker.sdk_worker_main' %
  sys.executable.encode('ascii'))))
{code}
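As an aside, the SUBPROCESS_SDK payload in the example above is just the command line used to launch each worker process. A standalone sketch of how that bytes payload is built (only the payload expression; the surrounding Beam objects are omitted):

```python
import sys

# Build the worker launch command the same way the subprocess-runner
# example does: the current Python interpreter plus "-m <worker module>".
payload = b'%s -m apache_beam.runners.worker.sdk_worker_main' % (
    sys.executable.encode('ascii'))

# The payload is bytes; decoding it yields the command the runner would
# execute to start each worker.
command = payload.decode('ascii')
```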
 

*Run with multithreading mode:*
{code:java}
# using embedded grpc runner
p = beam.Pipeline(options=pipeline_options,
  runner=fn_api_runner.FnApiRunner(
    default_environment=beam_runner_api_pb2.Environment(
      urn=python_urns.EMBEDDED_PYTHON_GRPC,
      payload=b'1'))) # payload is # of threads of each worker.{code}
 

The *--direct_num_workers* option controls parallelism. The default value is 
1. 
{code:java}
# an example to pass it from CLI.
python wordcount.py --input xx --output xx --direct_num_workers 2

# an example to set it with PipelineOptions.
from apache_beam.options.pipeline_options import PipelineOptions
pipeline_options = PipelineOptions(['--direct_num_workers', '2'])

# an example to add it to existing pipeline options.
from apache_beam.options.pipeline_options import DirectOptions
pipeline_options = xxx
pipeline_options.view_as(DirectOptions).direct_num_workers = 2{code}
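For reference, the list-of-strings form accepted by PipelineOptions follows ordinary command-line parsing conventions. A minimal stdlib sketch of parsing the same flag (illustrative only; PipelineOptions does much more than this):

```python
import argparse

# Minimal stand-in for parsing a flag like --direct_num_workers from an
# argument list; PipelineOptions accepts the same list-of-strings form.
parser = argparse.ArgumentParser()
parser.add_argument('--direct_num_workers', type=int, default=1)

opts = parser.parse_args(['--direct_num_workers', '2'])
```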


was (Author: hannahjiang):
The Direct runner can now process map tasks across multiple workers. Depending 
on the running environment, these workers run in multithreading or 
multiprocessing mode.

_*It is supported from Beam 2.15.*_

*Run with multiprocessing mode*:
{code:java}
# using subprocess runner
p = beam.Pipeline(options=pipeline_options,
  runner=fn_api_runner.FnApiRunner(
default_environment=beam_runner_api_pb2.Environment(
  urn=python_urns.SUBPROCESS_SDK,
  payload=b'%s -m apache_beam.runners.worker.sdk_worker_main' %
  sys.executable.encode('ascii'))))
{code}
 

*Run with multithreading mode:*
{code:java}
# using embedded grpc runner
p = beam.Pipeline(options=pipeline_options,
  runner=fn_api_runner.FnApiRunner(
    default_environment=beam_runner_api_pb2.Environment(
      urn=python_urns.EMBEDDED_PYTHON_GRPC,
      payload=b'1'))) # payload is # of threads of each worker.{code}
 

The *--direct_num_workers* option controls parallelism. The default value is 
1. 
{code:java}
# an example to pass it from CLI.
python wordcount.py --input xx --output xx --direct_num_workers 2

# an example to set it with PipelineOptions.
from apache_beam.options.pipeline_options import PipelineOptions
pipeline_options = PipelineOptions(['--direct_num_workers', '2'])

# an example to add it to existing pipeline options.
from apache_beam.options.pipeline_options import DirectOptions
pipeline_options = xxx
pipeline_options.view_as(DirectOptions).direct_num_workers = 2{code}

> Support multi-process execution on the FnApiRunner
> --
>
> Key: BEAM-3645
> URL: https://issues.apache.org/jira/browse/BEAM-3645
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py-core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Charles Chen
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 

[jira] [Comment Edited] (BEAM-7861) Make it easy to change between multi-process and multi-thread mode for Python Direct runners

2020-01-10 Thread Hannah Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-7861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17013105#comment-17013105
 ] 

Hannah Jiang edited comment on BEAM-7861 at 1/10/20 6:11 PM:
-

We can use --direct_running_mode to switch between multi_threading and 
multi_processing.
We can set direct_running_mode to one of ['in_memory', 'multi_threading', 
'multi_processing']. The default mode is in_memory.

*in_memory*: multi-threading mode; worker and runner communication happens in 
memory (not through gRPC).
 *multi_threading*: multi-threading mode; the worker and runner communicate 
through gRPC.
 *multi_processing*: multi-processing mode; the worker and runner communicate 
through gRPC.

Here is how to set the direct_running_mode.
 *Option 1*: set it with pipeline options.
{code:java}
 pipeline_options = PipelineOptions(direct_num_workers=2, 
direct_running_mode='multi_threading')
 p = beam.Pipeline(
         runner=fn_api_runner.FnApiRunner(),
         options=pipeline_options)
{code}

*Option 2*: pass it with CLI.
{code:java}
 python xxx --direct_num_workers 2 --direct_running_mode 'multi_threading'
 p = beam.Pipeline(runner=fn_api_runner.FnApiRunner())
{code}


was (Author: hannahjiang):
We can use --direct_running_mode to switch between multi_threading and 
multi_processing.
We can set direct_running_mode to one of ['in_memory', 'multi_threading', 
'multi_processing']. The default mode is in_memory.

*in_memory*: multi-threading mode; worker and runner communication happens in 
memory (not through gRPC).
*multi_threading*: multi-threading mode; the worker and runner communicate 
through gRPC.
*multi_processing*: multi-processing mode; the worker and runner communicate 
through gRPC.

Here is how to set the direct_running_mode.
*Option 1*: set it with pipeline options.
pipeline_options = PipelineOptions(direct_num_workers=2, 
direct_running_mode='multi_threading')
p = beam.Pipeline(
        runner=fn_api_runner.FnApiRunner(),
        options=pipeline_options)

*Option 2*: pass it with CLI.
python xxx --direct_num_workers 2  --direct_running_mode 'multi_threading'
p = beam.Pipeline(runner=fn_api_runner.FnApiRunner())

> Make it easy to change between multi-process and multi-thread mode for Python 
> Direct runners
> 
>
> Key: BEAM-7861
> URL: https://issues.apache.org/jira/browse/BEAM-7861
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.19.0
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> BEAM-3645 makes it possible to run a map task in parallel.
> However, users need to change the runner when switching between 
> multithreading and multiprocessing mode.
> We want to add a flag (e.g. --use-multiprocess) to make the switch easy 
> without changing the runner each time.
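The convenience requested in this issue can be sketched as a single mode flag that selects an execution backend, rather than having users swap runner classes by hand (a standard-library analogy with invented names, not Beam internals):

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

# Map a single user-facing mode string to an execution backend, the way a
# flag like --use-multiprocess (or, later, --direct_running_mode) would.
_BACKENDS = {
    'multi_threading': ThreadPoolExecutor,
    'multi_processing': ProcessPoolExecutor,
}

def pick_backend(mode):
    """Return the executor class for a mode string, rejecting unknown modes."""
    try:
        return _BACKENDS[mode]
    except KeyError:
        raise ValueError('unknown direct running mode: %r' % mode)
```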



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (BEAM-7861) Make it easy to change between multi-process and multi-thread mode for Python Direct runners

2020-01-10 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang resolved BEAM-7861.

Resolution: Fixed

We can use --direct_running_mode to switch between multi_threading and 
multi_processing.
We can set direct_running_mode to one of ['in_memory', 'multi_threading', 
'multi_processing']. The default mode is in_memory.

*in_memory*: multi-threading mode; worker and runner communication happens in 
memory (not through gRPC).
*multi_threading*: multi-threading mode; the worker and runner communicate 
through gRPC.
*multi_processing*: multi-processing mode; the worker and runner communicate 
through gRPC.

Here is how to set the direct_running_mode.
*Option 1*: set it with pipeline options.
pipeline_options = PipelineOptions(direct_num_workers=2, 
direct_running_mode='multi_threading')
p = beam.Pipeline(
        runner=fn_api_runner.FnApiRunner(),
        options=pipeline_options)

*Option 2*: pass it with CLI.
python xxx --direct_num_workers 2  --direct_running_mode 'multi_threading'
p = beam.Pipeline(runner=fn_api_runner.FnApiRunner())

> Make it easy to change between multi-process and multi-thread mode for Python 
> Direct runners
> 
>
> Key: BEAM-7861
> URL: https://issues.apache.org/jira/browse/BEAM-7861
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Major
> Fix For: 2.19.0
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> BEAM-3645 makes it possible to run a map task in parallel.
> However, users need to change the runner when switching between 
> multithreading and multiprocessing mode.
> We want to add a flag (e.g. --use-multiprocess) to make the switch easy 
> without changing the runner each time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-9084) Cleaning up SDK docker image tagging

2020-01-09 Thread Hannah Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hannah Jiang updated BEAM-9084:
---
Status: Open  (was: Triage Needed)

> Cleaning up SDK docker image tagging
> 
>
> Key: BEAM-9084
> URL: https://issues.apache.org/jira/browse/BEAM-9084
> Project: Beam
>  Issue Type: Task
>  Components: build-system
>Affects Versions: 2.16.0, 2.17.0
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

