[jira] [Commented] (BEAM-8979) protoc-gen-mypy: program not found or is not executable

2020-04-05 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17075858#comment-17075858
 ] 

Chad Dombrova commented on BEAM-8979:
-

I thought this issue was resolved.  Can you provide the error you are getting 
and how to reproduce it?  What version of beam are you using?

> protoc-gen-mypy: program not found or is not executable
> ---
>
> Key: BEAM-8979
> URL: https://issues.apache.org/jira/browse/BEAM-8979
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core, test-failures
>Reporter: Kamil Wasilewski
>Assignee: Chad Dombrova
>Priority: Major
> Fix For: Not applicable
>
>  Time Spent: 12h 10m
>  Remaining Estimate: 0h
>
> In some tests, `:sdks:python:sdist:` task fails due to problems in finding 
> protoc-gen-mypy. The following tests are affected (there might be more):
>  * 
> [https://builds.apache.org/job/beam_LoadTests_Python_37_ParDo_Dataflow_Batch_PR/]
>  * 
> [https://builds.apache.org/job/beam_BiqQueryIO_Write_Performance_Test_Python_Batch/]
> Relevant logs:
> {code:java}
> 10:46:32 > Task :sdks:python:sdist FAILED
> 10:46:32 Requirement already satisfied: mypy-protobuf==1.12 in 
> /home/jenkins/jenkins-slave/workspace/beam_LoadTests_Python_37_ParDo_Dataflow_Batch_PR/src/build/gradleenv/192237/lib/python3.7/site-packages
>  (1.12)
> 10:46:32 beam_fn_api.proto: warning: Import google/protobuf/descriptor.proto 
> but not used.
> 10:46:32 beam_fn_api.proto: warning: Import google/protobuf/wrappers.proto 
> but not used.
> 10:46:32 protoc-gen-mypy: program not found or is not executable
> 10:46:32 --mypy_out: protoc-gen-mypy: Plugin failed with status code 1.
> 10:46:32 
> /home/jenkins/jenkins-slave/workspace/beam_LoadTests_Python_37_ParDo_Dataflow_Batch_PR/src/build/gradleenv/192237/lib/python3.7/site-packages/setuptools/dist.py:476:
>  UserWarning: Normalizing '2.19.0.dev' to '2.19.0.dev0'
> 10:46:32   normalized_version,
> 10:46:32 Traceback (most recent call last):
> 10:46:32   File "setup.py", line 295, in <module>
> 10:46:32 'mypy': generate_protos_first(mypy),
> 10:46:32   File 
> "/home/jenkins/jenkins-slave/workspace/beam_LoadTests_Python_37_ParDo_Dataflow_Batch_PR/src/build/gradleenv/192237/lib/python3.7/site-packages/setuptools/__init__.py",
>  line 145, in setup
> 10:46:32 return distutils.core.setup(**attrs)
> 10:46:32   File "/usr/lib/python3.7/distutils/core.py", line 148, in setup
> 10:46:32 dist.run_commands()
> 10:46:32   File "/usr/lib/python3.7/distutils/dist.py", line 966, in 
> run_commands
> 10:46:32 self.run_command(cmd)
> 10:46:32   File "/usr/lib/python3.7/distutils/dist.py", line 985, in 
> run_command
> 10:46:32 cmd_obj.run()
> 10:46:32   File 
> "/home/jenkins/jenkins-slave/workspace/beam_LoadTests_Python_37_ParDo_Dataflow_Batch_PR/src/build/gradleenv/192237/lib/python3.7/site-packages/setuptools/command/sdist.py",
>  line 44, in run
> 10:46:32 self.run_command('egg_info')
> 10:46:32   File "/usr/lib/python3.7/distutils/cmd.py", line 313, in 
> run_command
> 10:46:32 self.distribution.run_command(command)
> 10:46:32   File "/usr/lib/python3.7/distutils/dist.py", line 985, in 
> run_command
> 10:46:32 cmd_obj.run()
> 10:46:32   File "setup.py", line 220, in run
> 10:46:32 gen_protos.generate_proto_files(log=log)
> 10:46:32   File 
> "/home/jenkins/jenkins-slave/workspace/beam_LoadTests_Python_37_ParDo_Dataflow_Batch_PR/src/sdks/python/gen_protos.py",
>  line 144, in generate_proto_files
> 10:46:32 '%s' % ret_code)
> 10:46:32 RuntimeError: Protoc returned non-zero status (see logs for 
> details): 1
> {code}
>  
> This is what I have tried so far to resolve this (without being successful):
>  * Including _--plugin=protoc-gen-mypy=\{abs_path_to_executable}_ parameter 
> to the _protoc_ call in gen_protos.py:131
>  * Appending protoc-gen-mypy's directory to the PATH variable
> I wasn't able to reproduce this error locally.
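>
> For reference, a minimal sketch of that first attempt (plugin discovery and
> output paths here are hypothetical), assuming gen_protos.py invokes protoc
> via {{grpc_tools.protoc}}:
> {code}
> # Sketch only: hand protoc an absolute path to the mypy plugin.
> import distutils.spawn
>
> from grpc_tools import protoc
>
> plugin = distutils.spawn.find_executable('protoc-gen-mypy')  # may be None
> ret_code = protoc.main([
>     'protoc',
>     '--plugin=protoc-gen-mypy=%s' % plugin,
>     '--mypy_out=./gen',
>     '-I.',
>     'beam_fn_api.proto',
> ])
> {code}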
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-9626) pymongo should be an optional requirement

2020-03-27 Thread Chad Dombrova (Jira)
Chad Dombrova created BEAM-9626:
---

 Summary: pymongo should be an optional requirement
 Key: BEAM-9626
 URL: https://issues.apache.org/jira/browse/BEAM-9626
 Project: Beam
  Issue Type: Improvement
  Components: sdk-py-core
Reporter: Chad Dombrova
Assignee: Chad Dombrova


The pymongo driver is installed by default, but as the number of IO connectors 
in the python sdk grows, I don't think this is the precedent we want to set.  
We already have "extra" packages for gcp, aws, and interactive; we should also 
add one for mongo. 
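
A sketch of what that could look like in setup.py (package name and version 
pin here are illustrative, not Beam's actual configuration):

{code}
# Illustrative only: expose pymongo via an extra instead of a default dep,
# installed with `pip install example-sdk[mongodb]`.
from setuptools import setup

setup(
    name='example-sdk',  # hypothetical package name
    install_requires=[
        # pymongo removed from the default requirements
    ],
    extras_require={
        # existing extras ('gcp', 'aws', 'interactive') elided
        'mongodb': ['pymongo>=3.8.0'],
    },
)
{code}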

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-9479) Provide option to run pylint in local git pre-commit

2020-03-09 Thread Chad Dombrova (Jira)
Chad Dombrova created BEAM-9479:
---

 Summary: Provide option to run pylint in local git pre-commit
 Key: BEAM-9479
 URL: https://issues.apache.org/jira/browse/BEAM-9479
 Project: Beam
  Issue Type: Improvement
  Components: testing
Reporter: Chad Dombrova
Assignee: Chad Dombrova


Now that we have support for running yapf in pre-commit, it would be nice to 
add pylint. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-9427) Python precommits flaky with ImportError: cannot import name 'ContextManager'

2020-03-02 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17049827#comment-17049827
 ] 

Chad Dombrova commented on BEAM-9427:
-

hmmm it would be nice to prevent these transitive deps from floating 
between versions.  Tools like poetry and pipenv provide a lockfile mechanism to 
help with that.  But we could run {{pip freeze}} and use that as a constraints 
file when installing. 
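
Roughly (file name hypothetical):

{code}
# Sketch: snapshot the resolved environment once, then feed it back to pip
# as a constraints file so transitive deps can't float between runs.
import subprocess

frozen = subprocess.check_output(['pip', 'freeze'])
with open('constraints.txt', 'wb') as f:
    f.write(frozen)

# Later installs resolve against the pinned snapshot.
subprocess.check_call(
    ['pip', 'install', '-c', 'constraints.txt', 'apache-beam[gcp,test]'])
{code}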

> Python precommits flaky with ImportError: cannot import name 'ContextManager'
> -
>
> Key: BEAM-9427
> URL: https://issues.apache.org/jira/browse/BEAM-9427
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core, test-failures
>Reporter: Udi Meiri
>Assignee: Udi Meiri
>Priority: Major
> Fix For: Not applicable
>
>
> https://builds.apache.org/job/beam_PreCommit_Python_Commit/11469/
> {code}
> 11:51:34  > Task :sdks:python:test-suites:tox:py35:testPy35Gcp FAILED
> 11:51:34  py35-gcp-pytest create: 
> /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/sdks/python/test-suites/tox/py35/build/srcs/sdks/python/target/.tox-py35-gcp-pytest/py35-gcp-pytest
> 11:51:34  ERROR: invocation failed (exit code 1), logfile: 
> /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/sdks/python/test-suites/tox/py35/build/srcs/sdks/python/target/.tox-py35-gcp-pytest/py35-gcp-pytest/log/py35-gcp-pytest-0.log
> 11:51:34  == log start 
> ===
> 11:51:34  ImportError: cannot import name 'ContextManager'
> 11:51:34  
> 11:51:34  === log end 
> 
> 11:51:34  ERROR: InvocationError for command 
> /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/build/gradleenv/-1227304285/bin/python3.5
>  -m virtualenv --no-download --python 
> /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/build/gradleenv/-1227304285/bin/python3.5
>  py35-gcp-pytest (exited with code 1)
> 11:51:34  ___ summary 
> 
> 11:51:34  ERROR:   py35-gcp-pytest: InvocationError for command 
> /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/build/gradleenv/-1227304285/bin/python3.5
>  -m virtualenv --no-download --python 
> /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/build/gradleenv/-1227304285/bin/python3.5
>  py35-gcp-pytest (exited with code 1)
> {code}
> {code}
> 11:51:35  > Task :sdks:python:test-suites:tox:py35:testPy35Cython FAILED
> 11:51:35  py35-cython-pytest create: 
> /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/sdks/python/test-suites/tox/py35/build/srcs/sdks/python/target/.tox-py35-cython-pytest/py35-cython-pytest
> 11:51:35  ERROR: invocation failed (exit code 1), logfile: 
> /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/sdks/python/test-suites/tox/py35/build/srcs/sdks/python/target/.tox-py35-cython-pytest/py35-cython-pytest/log/py35-cython-pytest-0.log
> 11:51:35  == log start 
> ===
> 11:51:35  ImportError: cannot import name 'ContextManager'
> 11:51:35  
> 11:51:35  === log end 
> 
> 11:51:35  ERROR: InvocationError for command 
> /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/build/gradleenv/-1227304285/bin/python3.5
>  -m virtualenv --no-download --python 
> /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/build/gradleenv/-1227304285/bin/python3.5
>  py35-cython-pytest (exited with code 1)
> 11:51:35  ___ summary 
> 
> 11:51:35  ERROR:   py35-cython-pytest: InvocationError for command 
> /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/build/gradleenv/-1227304285/bin/python3.5
>  -m virtualenv --no-download --python 
> /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/build/gradleenv/-1227304285/bin/python3.5
>  py35-cython-pytest (exited with code 1)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-9427) Python precommits flaky with ImportError: cannot import name 'ContextManager'

2020-03-02 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17049714#comment-17049714
 ] 

Chad Dombrova commented on BEAM-9427:
-

Is there a traceback anywhere that can tell us what's importing 
{{ContextManager}}?

Relates to: 

- testing  on specific minor/micro releases: BEAM-8152
- dropping support for 3.5.2: BEAM-9372
- stop installing typing for python versions that include it:  
https://github.com/apache/beam/pull/10821

{quote}Looks like typing.ContextManager is new in 3.5.4{quote}

Could it be new in 3.5.3?  I don't see a tag for 3.5.4 in the typing releases 
(though I'm not 100% certain how strongly the typing releases relate to python 
releases)

https://github.com/python/typing/releases
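
If it does come down to a micro-version difference, a guarded import is one 
way to sidestep it (sketch only; uses the typing_extensions backport):

{code}
# Tolerate 3.5.x micro releases whose typing module lacks ContextManager.
try:
    from typing import ContextManager
except ImportError:
    from typing_extensions import ContextManager  # backport package
{code}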





> Python precommits flaky with ImportError: cannot import name 'ContextManager'
> -
>
> Key: BEAM-9427
> URL: https://issues.apache.org/jira/browse/BEAM-9427
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core, test-failures
>Reporter: Udi Meiri
>Priority: Major
>
> https://builds.apache.org/job/beam_PreCommit_Python_Commit/11469/
> {code}
> 11:51:34  > Task :sdks:python:test-suites:tox:py35:testPy35Gcp FAILED
> 11:51:34  py35-gcp-pytest create: 
> /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/sdks/python/test-suites/tox/py35/build/srcs/sdks/python/target/.tox-py35-gcp-pytest/py35-gcp-pytest
> 11:51:34  ERROR: invocation failed (exit code 1), logfile: 
> /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/sdks/python/test-suites/tox/py35/build/srcs/sdks/python/target/.tox-py35-gcp-pytest/py35-gcp-pytest/log/py35-gcp-pytest-0.log
> 11:51:34  == log start 
> ===
> 11:51:34  ImportError: cannot import name 'ContextManager'
> 11:51:34  
> 11:51:34  === log end 
> 
> 11:51:34  ERROR: InvocationError for command 
> /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/build/gradleenv/-1227304285/bin/python3.5
>  -m virtualenv --no-download --python 
> /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/build/gradleenv/-1227304285/bin/python3.5
>  py35-gcp-pytest (exited with code 1)
> 11:51:34  ___ summary 
> 
> 11:51:34  ERROR:   py35-gcp-pytest: InvocationError for command 
> /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/build/gradleenv/-1227304285/bin/python3.5
>  -m virtualenv --no-download --python 
> /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/build/gradleenv/-1227304285/bin/python3.5
>  py35-gcp-pytest (exited with code 1)
> {code}
> {code}
> 11:51:35  > Task :sdks:python:test-suites:tox:py35:testPy35Cython FAILED
> 11:51:35  py35-cython-pytest create: 
> /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/sdks/python/test-suites/tox/py35/build/srcs/sdks/python/target/.tox-py35-cython-pytest/py35-cython-pytest
> 11:51:35  ERROR: invocation failed (exit code 1), logfile: 
> /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/sdks/python/test-suites/tox/py35/build/srcs/sdks/python/target/.tox-py35-cython-pytest/py35-cython-pytest/log/py35-cython-pytest-0.log
> 11:51:35  == log start 
> ===
> 11:51:35  ImportError: cannot import name 'ContextManager'
> 11:51:35  
> 11:51:35  === log end 
> 
> 11:51:35  ERROR: InvocationError for command 
> /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/build/gradleenv/-1227304285/bin/python3.5
>  -m virtualenv --no-download --python 
> /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/build/gradleenv/-1227304285/bin/python3.5
>  py35-cython-pytest (exited with code 1)
> 11:51:35  ___ summary 
> 
> 11:51:35  ERROR:   py35-cython-pytest: InvocationError for command 
> /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/build/gradleenv/-1227304285/bin/python3.5
>  -m virtualenv --no-download --python 
> /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/build/gradleenv/-1227304285/bin/python3.5
>  py35-cython-pytest (exited with code 1)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-9058) Line-too-long Python lint checks are no longer working.

2020-02-29 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17048437#comment-17048437
 ] 

Chad Dombrova commented on BEAM-9058:
-

This is resolved. 

> Line-too-long Python lint checks are no longer working.
> ---
>
> Key: BEAM-9058
> URL: https://issues.apache.org/jira/browse/BEAM-9058
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Chad Dombrova
>Priority: Major
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Without getting into the question of whether 80 is a reasonable limit, what are 
> the reasons to treat type annotations differently from other parts of Python 
> code (including comments), which have an 80 character limit?  
>  
> Places that need to be fixed (PR coming):
> {noformat}
> > Task :sdks:python:test-suites:tox:py37:lintPy37
>  
>  * Module apache_beam.testing.test_stream_it_test
>  apache_beam/testing/test_stream_it_test.py:50:0: C0301: Line too long 
> (106/80) (line-too-long)
>  * Module apache_beam.io.filesystems
>  apache_beam/io/filesystems.py:99:0: C0301: Line too long (101/80) 
> (line-too-long)
>  apache_beam/io/filesystems.py:100:0: C0301: Line too long (111/80) 
> (line-too-long)
>  * Module apache_beam.runners.portability.fn_api_runner
>  apache_beam/runners/portability/fn_api_runner.py:115:0: C0301: Line too long 
> (114/80) (line-too-long)
>  * Module apache_beam.transforms.core
>  apache_beam/transforms/core.py:1307:0: C0301: Line too long (116/80) 
> (line-too-long)
>  apache_beam/transforms/core.py:2271:0: C0301: Line too long (95/80) 
> (line-too-long)
>  apache_beam/transforms/core.py:2272:0: C0301: Line too long (90/80) 
> (line-too-long)
>  * Module setup
>  setup.py:141:0: C0301: Line too long (81/80) (line-too-long)
> 
>  {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-8979) protoc-gen-mypy: program not found or is not executable

2020-02-29 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17048436#comment-17048436
 ] 

Chad Dombrova commented on BEAM-8979:
-

This is resolved. 

> protoc-gen-mypy: program not found or is not executable
> ---
>
> Key: BEAM-8979
> URL: https://issues.apache.org/jira/browse/BEAM-8979
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core, test-failures
>Reporter: Kamil Wasilewski
>Assignee: Chad Dombrova
>Priority: Major
>  Time Spent: 11h
>  Remaining Estimate: 0h
>
> In some tests, `:sdks:python:sdist:` task fails due to problems in finding 
> protoc-gen-mypy. The following tests are affected (there might be more):
>  * 
> [https://builds.apache.org/job/beam_LoadTests_Python_37_ParDo_Dataflow_Batch_PR/]
>  * 
> [https://builds.apache.org/job/beam_BiqQueryIO_Write_Performance_Test_Python_Batch/]
> Relevant logs:
> {code:java}
> 10:46:32 > Task :sdks:python:sdist FAILED
> 10:46:32 Requirement already satisfied: mypy-protobuf==1.12 in 
> /home/jenkins/jenkins-slave/workspace/beam_LoadTests_Python_37_ParDo_Dataflow_Batch_PR/src/build/gradleenv/192237/lib/python3.7/site-packages
>  (1.12)
> 10:46:32 beam_fn_api.proto: warning: Import google/protobuf/descriptor.proto 
> but not used.
> 10:46:32 beam_fn_api.proto: warning: Import google/protobuf/wrappers.proto 
> but not used.
> 10:46:32 protoc-gen-mypy: program not found or is not executable
> 10:46:32 --mypy_out: protoc-gen-mypy: Plugin failed with status code 1.
> 10:46:32 
> /home/jenkins/jenkins-slave/workspace/beam_LoadTests_Python_37_ParDo_Dataflow_Batch_PR/src/build/gradleenv/192237/lib/python3.7/site-packages/setuptools/dist.py:476:
>  UserWarning: Normalizing '2.19.0.dev' to '2.19.0.dev0'
> 10:46:32   normalized_version,
> 10:46:32 Traceback (most recent call last):
> 10:46:32   File "setup.py", line 295, in <module>
> 10:46:32 'mypy': generate_protos_first(mypy),
> 10:46:32   File 
> "/home/jenkins/jenkins-slave/workspace/beam_LoadTests_Python_37_ParDo_Dataflow_Batch_PR/src/build/gradleenv/192237/lib/python3.7/site-packages/setuptools/__init__.py",
>  line 145, in setup
> 10:46:32 return distutils.core.setup(**attrs)
> 10:46:32   File "/usr/lib/python3.7/distutils/core.py", line 148, in setup
> 10:46:32 dist.run_commands()
> 10:46:32   File "/usr/lib/python3.7/distutils/dist.py", line 966, in 
> run_commands
> 10:46:32 self.run_command(cmd)
> 10:46:32   File "/usr/lib/python3.7/distutils/dist.py", line 985, in 
> run_command
> 10:46:32 cmd_obj.run()
> 10:46:32   File 
> "/home/jenkins/jenkins-slave/workspace/beam_LoadTests_Python_37_ParDo_Dataflow_Batch_PR/src/build/gradleenv/192237/lib/python3.7/site-packages/setuptools/command/sdist.py",
>  line 44, in run
> 10:46:32 self.run_command('egg_info')
> 10:46:32   File "/usr/lib/python3.7/distutils/cmd.py", line 313, in 
> run_command
> 10:46:32 self.distribution.run_command(command)
> 10:46:32   File "/usr/lib/python3.7/distutils/dist.py", line 985, in 
> run_command
> 10:46:32 cmd_obj.run()
> 10:46:32   File "setup.py", line 220, in run
> 10:46:32 gen_protos.generate_proto_files(log=log)
> 10:46:32   File 
> "/home/jenkins/jenkins-slave/workspace/beam_LoadTests_Python_37_ParDo_Dataflow_Batch_PR/src/sdks/python/gen_protos.py",
>  line 144, in generate_proto_files
> 10:46:32 '%s' % ret_code)
> 10:46:32 RuntimeError: Protoc returned non-zero status (see logs for 
> details): 1
> {code}
>  
> This is what I have tried so far to resolve this (without being successful):
>  * Including _--plugin=protoc-gen-mypy=\{abs_path_to_executable}_ parameter 
> to the _protoc_ call in gen_protos.py:131
>  * Appending protoc-gen-mypy's directory to the PATH variable
> I wasn't able to reproduce this error locally.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-9414) [beam_PostCommit_Python2] build failed

2020-02-29 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17048423#comment-17048423
 ] 

Chad Dombrova commented on BEAM-9414:
-

This should have been fixed in BEAM-9405, in this PR: 
https://github.com/apache/beam/pull/11001

[~chamikara] the link you posted is the _fix_ to the problem, rather than the 
cause.  Well, that's the intent anyway.

{{PortableRunner.create_job_service}} was renamed, which caused the break in 
that test; in the PR linked above it was renamed back to 
{{create_job_service}}, so everything should be good again.  Let me 
know if it is not. 
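
(As a general pattern, and purely a sketch with a hypothetical new name, 
keeping a deprecated alias around when renaming a public method avoids this 
class of breakage:)

{code}
class PortableRunner(object):
    def default_job_server(self, options):  # hypothetical post-rename name
        pass  # ...create and return the job service...

    # Deprecated alias: external callers of the old name keep working.
    create_job_service = default_job_server
{code}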



> [beam_PostCommit_Python2] build failed
> --
>
> Key: BEAM-9414
> URL: https://issues.apache.org/jira/browse/BEAM-9414
> Project: Beam
>  Issue Type: Bug
>  Components: test-failures
>Reporter: Yueyang Qiu
>Assignee: Chad Dombrova
>Priority: Major
>  Labels: currently-failing
>
> See [https://builds.apache.org/job/beam_PostCommit_Python2/1844/]
>  
> Error:
>  
> {noformat}
> 16:07:07 FAILURE: Build failed with an exception.
> 16:07:07
> 16:07:07 * Where:
> 16:07:07 Build file 
> '/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python2/src/sdks/python/test-suites/portable/py2/build.gradle'
>  line: 143
> 16:07:07
> 16:07:07 * What went wrong:
> 16:07:07 Execution failed for task 
> ':sdks:python:test-suites:portable:py2:crossLanguagePortableWordCount'.
> 16:07:07 > Process 'command 'sh'' finished with non-zero exit value 1
> {noformat}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-9405) Python PostCommit is flaky: 'PortableRunner' object has no attribute 'create_job_service'

2020-02-28 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17047888#comment-17047888
 ] 

Chad Dombrova commented on BEAM-9405:
-

PR is up: [https://github.com/apache/beam/pull/11001]

> Python PostCommit is flaky: 'PortableRunner' object has no attribute 
> 'create_job_service'
> -
>
> Key: BEAM-9405
> URL: https://issues.apache.org/jira/browse/BEAM-9405
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core, test-failures
>Reporter: Kamil Wasilewski
>Assignee: Chad Dombrova
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> See: [https://builds.apache.org/job/beam_PostCommit_Python2/]
> It seems that it is caused by [this 
> commit|https://github.com/apache/beam/commit/1856d8533c879ab236d0593be1f9c7fff41edd7f].
> An example log:
> {code:java}
> :sdks:python:test-suites:portable:py2:crossLanguagePortableWordCount FAILED
> DEPRECATION: Python 2.7 reached the end of its life on January 1st, 2020. 
> Please upgrade your Python as Python 2.7 is no longer maintained. A future 
> version of pip will drop support for Python 2.7. More details about Python 2 
> support in pip, can be found at 
> https://pip.pypa.io/en/latest/development/release-process/#python-2-support
> apache_beam/__init__.py:82: UserWarning: You are using Apache Beam with 
> Python 2. New releases of Apache Beam will soon support Python 3 only.
>   'You are using Apache Beam with Python 2. '
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> ERROR StatusLogger Log4j2 could not find a logging implementation. Please add 
> log4j-core to the classpath. Using SimpleLogger to log to the console...
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
> "__main__", fname, loader, pkg_name)
>   File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
> exec code in run_globals
>   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python2_PR/src/sdks/python/apache_beam/examples/wordcount_xlang.py",
>  line 137, in <module>
> main()
>   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python2_PR/src/sdks/python/apache_beam/examples/wordcount_xlang.py",
>  line 128, in main
> p.runner.create_job_service(pipeline_options)
> AttributeError: 'PortableRunner' object has no attribute 'create_job_service'
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-9405) Python PostCommit is flaky: 'PortableRunner' object has no attribute 'create_job_service'

2020-02-28 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17047731#comment-17047731
 ] 

Chad Dombrova commented on BEAM-9405:
-

I will have a look at this first thing in the morning.




> Python PostCommit is flaky: 'PortableRunner' object has no attribute 
> 'create_job_service'
> -
>
> Key: BEAM-9405
> URL: https://issues.apache.org/jira/browse/BEAM-9405
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core, test-failures
>Reporter: Kamil Wasilewski
>Assignee: Chad Dombrova
>Priority: Major
>
> See: [https://builds.apache.org/job/beam_PostCommit_Python2/]
> It seems that it is caused by [this 
> commit|https://github.com/apache/beam/commit/1856d8533c879ab236d0593be1f9c7fff41edd7f].
> An example log:
> {code:java}
> :sdks:python:test-suites:portable:py2:crossLanguagePortableWordCount FAILED
> DEPRECATION: Python 2.7 reached the end of its life on January 1st, 2020. 
> Please upgrade your Python as Python 2.7 is no longer maintained. A future 
> version of pip will drop support for Python 2.7. More details about Python 2 
> support in pip, can be found at 
> https://pip.pypa.io/en/latest/development/release-process/#python-2-support
> apache_beam/__init__.py:82: UserWarning: You are using Apache Beam with 
> Python 2. New releases of Apache Beam will soon support Python 3 only.
>   'You are using Apache Beam with Python 2. '
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> ERROR StatusLogger Log4j2 could not find a logging implementation. Please add 
> log4j-core to the classpath. Using SimpleLogger to log to the console...
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
> "__main__", fname, loader, pkg_name)
>   File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
> exec code in run_globals
>   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python2_PR/src/sdks/python/apache_beam/examples/wordcount_xlang.py",
>  line 137, in <module>
> main()
>   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python2_PR/src/sdks/python/apache_beam/examples/wordcount_xlang.py",
>  line 128, in main
> p.runner.create_job_service(pipeline_options)
> AttributeError: 'PortableRunner' object has no attribute 'create_job_service'
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-3788) Implement a Kafka IO for Python SDK

2020-02-19 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17040558#comment-17040558
 ] 

Chad Dombrova commented on BEAM-3788:
-

Unless something has changed recently, 
https://issues.apache.org/jira/browse/BEAM-7870 is still a blocker for using 
KafkaIO in python out of the box.  As the title suggests, it's also blocking 
PubSubIO in python and conceptually any external transform with a non-trivial 
coder. 

[~mxm], [~bhulette] has anything changed on that issue lately?

 

> Implement a Kafka IO for Python SDK
> ---
>
> Key: BEAM-3788
> URL: https://issues.apache.org/jira/browse/BEAM-3788
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-py-core
>Reporter: Chamikara Madhusanka Jayalath
>Assignee: Chamikara Madhusanka Jayalath
>Priority: Major
>
> Java KafkaIO will be made available to Python users as a cross-language 
> transform.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-8402) Create a class hierarchy to represent environments

2020-02-16 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17037961#comment-17037961
 ] 

Chad Dombrova commented on BEAM-8402:
-

This is resolved in 2.18

> Create a class hierarchy to represent environments
> --
>
> Key: BEAM-8402
> URL: https://issues.apache.org/jira/browse/BEAM-8402
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-py-core
>Reporter: Chad Dombrova
>Assignee: Chad Dombrova
>Priority: Major
>  Time Spent: 5h 50m
>  Remaining Estimate: 0h
>
> As a first step towards making it possible to assign different environments 
> to sections of a pipeline, we first need to expose environment classes to the 
> pipeline API.  Unlike PTransforms, PCollections, Coders, and Windowings,  
> environments exists solely in the portability framework as protobuf objects.  
>  By creating a hierarchy of "native" classes that represent the various 
> environment types -- external, docker, process, etc -- users will be able to 
> instantiate these and assign them to parts of the pipeline.  The assignment 
> portion will be covered in a follow-up issue/PR.
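>
> A rough sketch of the shape this could take (class and field names are 
> illustrative; the URNs are the ones defined in the portability protos):
> {code}
> class Environment(object):
>     """Base class for portable pipeline environments."""
>     urn = None
>
> class DockerEnvironment(Environment):
>     urn = 'beam:env:docker:v1'
>     def __init__(self, container_image):
>         self.container_image = container_image
>
> class ProcessEnvironment(Environment):
>     urn = 'beam:env:process:v1'
>     def __init__(self, command, env=None):
>         self.command = command
>         self.env = env or {}
>
> class ExternalEnvironment(Environment):
>     urn = 'beam:env:external:v1'
>     def __init__(self, url):
>         self.url = url
> {code}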



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-9276) python: create a class to encapsulate the work required to submit a pipeline to a job service

2020-02-09 Thread Chad Dombrova (Jira)
Chad Dombrova created BEAM-9276:
---

 Summary: python: create a class to encapsulate the work required 
to submit a pipeline to a job service
 Key: BEAM-9276
 URL: https://issues.apache.org/jira/browse/BEAM-9276
 Project: Beam
  Issue Type: Improvement
  Components: sdk-py-core
Reporter: Chad Dombrova
Assignee: Chad Dombrova


{{PortableRunner.run_pipeline}} is somewhat of a monolithic method for 
submitting a pipeline.  It would be useful to factor out the code responsible 
for interacting with the job and artifact services (prepare, stage, run) to 
make it easier to modify this behavior in portable runner subclasses, as well 
as in tests. 
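
Roughly (all names hypothetical; just sketching the shape):

{code}
class JobServiceHandle(object):
    """Encapsulates the prepare/stage/run conversation with a job service,
    so runner subclasses and tests can override individual steps."""

    def __init__(self, job_service, options):
        self.job_service = job_service
        self.options = options

    def prepare(self, proto_pipeline):
        """Call Prepare on the job service and return the response."""
        raise NotImplementedError

    def stage(self, prepare_response):
        """Stage artifacts and return a retrieval token."""
        raise NotImplementedError

    def run(self, preparation_id, retrieval_token):
        """Call Run and return a handle to the running job."""
        raise NotImplementedError

    def submit(self, proto_pipeline):
        prepare_response = self.prepare(proto_pipeline)
        retrieval_token = self.stage(prepare_response)
        return self.run(prepare_response.preparation_id, retrieval_token)
{code}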

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-9274) Support running yapf in a git pre-commit hook

2020-02-08 Thread Chad Dombrova (Jira)
Chad Dombrova created BEAM-9274:
---

 Summary: Support running yapf in a git pre-commit hook
 Key: BEAM-9274
 URL: https://issues.apache.org/jira/browse/BEAM-9274
 Project: Beam
  Issue Type: Improvement
  Components: sdk-py-core
Reporter: Chad Dombrova
Assignee: Chad Dombrova


As a developer I want to be able to automatically run yapf before I make a 
commit so that I don't waste time with failures on jenkins. 
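
A minimal sketch of such a hook (assumes yapf is on PATH and that, per its 
docs, --diff exits non-zero when reformatting is needed):

{code}
#!/usr/bin/env python
# Sketch of a .git/hooks/pre-commit script: run yapf over staged .py files.
import subprocess
import sys

staged = subprocess.check_output(
    ['git', 'diff', '--cached', '--name-only', '--diff-filter=ACM']
).decode().splitlines()
py_files = [f for f in staged if f.endswith('.py')]

if py_files and subprocess.call(['yapf', '--diff'] + py_files) != 0:
    print('yapf would reformat the files above; run `yapf -i` and re-stage.')
    sys.exit(1)
{code}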



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-9202) lintPy37 precommit broken

2020-01-27 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17024800#comment-17024800
 ] 

Chad Dombrova commented on BEAM-9202:
-

Looks like this issue was already fixed by Ankur 

> lintPy37 precommit broken
> -
>
> Key: BEAM-9202
> URL: https://issues.apache.org/jira/browse/BEAM-9202
> Project: Beam
>  Issue Type: Bug
>  Components: test-failures
>Reporter: Udi Meiri
>Assignee: Chad Dombrova
>Priority: Major
>
> Culprit: https://github.com/apache/beam/pull/10683
> Jenkins tests are not started automatically.
> {code}
> 09:47:37 > Task :sdks:python:test-suites:tox:py37:lintPy37
> 09:47:37 * Module apache_beam.io.gcp.datastore.v1new.types
> 09:47:37 apache_beam/io/gcp/datastore/v1new/types.py:47:0: C0301: Line too 
> long (87/80) (line-too-long)
> {code}
> https://builds.apache.org/job/beam_PreCommit_PythonLint_Commit/2033/console



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-9130) sdks:python:test-suites:direct:py2:hdfsIntegrationTest is failing with ImportError: No module named google.protobuf.message

2020-01-17 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17018227#comment-17018227
 ] 

Chad Dombrova commented on BEAM-9130:
-

{quote}
My solution doesn't require the workaround in setup.py, which TBH I still don't 
fully follow.
{quote}

It's some seriously esoteric python stuff, but here's a terse summary.  The 
protobuf package in python2 is installed as "protobuf", but it uses a 
protobuf-3.11.0-py3.7-nspkg.pth file (living next to the installed package) to 
mount the package under the "google" namespace package.  This was an effort by 
google to allow all of their various python libs to install into one parent 
"google" namespace.  All well and good except that .pth files don't work unless 
they are in one of the special directories returned by 
{{site.getsitepackages()}}, which is by default just the python interpreter's 
native lib and site-packages directories.  So if you ever do a non-standard 
install of protobuf (i.e. using {{pip install --target}}) then the .pth file is 
not read and executed by the python interpreter and the package cannot be 
imported as {{google.protobuf}}. 

Got all that?  Well you can forget it all now, because they added official 
support for namespace packages in python3 that don't require this .pth hack  :)
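
For the record, a sketch of the manual workaround (the target path is 
hypothetical):

{code}
# A `pip install --target` directory is not a site directory, so its .pth
# files are ignored; site.addsitedir() processes them after the fact.
import site

site.addsitedir('/app/build/deps')  # hypothetical --target install location

# The nspkg .pth hook has now mounted protobuf under the google namespace.
import google.protobuf.message
{code}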

{quote}
WDYT about my fix, which essentially matches what setupVirtualenv does?
{quote}

Making this match setupVirtualenv is definitely the right fix.  Obvious in 
retrospect!

This has gotten me motivated to revisit my pep517 changes again.  I would love 
some help on it if you have time. 




> sdks:python:test-suites:direct:py2:hdfsIntegrationTest is failing with 
> ImportError: No module named google.protobuf.message
> ---
>
> Key: BEAM-9130
> URL: https://issues.apache.org/jira/browse/BEAM-9130
> Project: Beam
>  Issue Type: Improvement
>  Components: test-failures
>Reporter: Valentyn Tymofieiev
>Priority: Major
>  Labels: currently-failing
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> From logs:
> {noformat}
> 16:33:50File "/usr/local/lib/python2.7/multiprocessing/process.py", line 
> 267, in _bootstrap
> 16:33:50  self.run()
> 16:33:50File "/usr/local/lib/python2.7/multiprocessing/process.py", line 
> 114, in run
> 16:33:50  self._target(*self._args, **self._kwargs)
> 16:33:50File "/app/sdks/python/gen_protos.py", line 357, in 
> _install_grpcio_tools_and_generate_proto_files
> 16:33:50  generate_proto_files(force=force)
> 16:33:50File "/app/sdks/python/gen_protos.py", line 324, in 
> generate_proto_files
> 16:33:50  generate_urn_files(log, out_dir)
> 16:33:50File "/app/sdks/python/gen_protos.py", line 65, in 
> generate_urn_files
> 16:33:50  import google.protobuf.message as message
> 16:33:50  ImportError: No module named google.protobuf.message
> 16:33:50  Traceback (most recent call last):
> 16:33:50File "setup.py", line 305, in <module>
> 16:33:50  'mypy': generate_protos_first(mypy),
> 16:33:50File 
> "/usr/local/lib/python2.7/site-packages/setuptools/__init__.py", line 145, in 
> setup
> 16:33:50  return distutils.core.setup(**attrs)
> 16:33:50File "/usr/local/lib/python2.7/distutils/core.py", line 151, in 
> setup
> 16:33:50  dist.run_commands()
> 16:33:50File "/usr/local/lib/python2.7/distutils/dist.py", line 953, in 
> run_commands
> 16:33:50  self.run_command(cmd)
> 16:33:50File "/usr/local/lib/python2.7/distutils/dist.py", line 972, in 
> run_command
> 16:33:50  cmd_obj.run()
> 16:33:50File 
> "/usr/local/lib/python2.7/site-packages/setuptools/command/sdist.py", line 
> 44, in run
> 16:33:50  self.run_command('egg_info')
> 16:33:50File "/usr/local/lib/python2.7/distutils/cmd.py", line 326, in 
> run_command
> 16:33:50  self.distribution.run_command(command)
> 16:33:50File "/usr/local/lib/python2.7/distutils/dist.py", line 972, in 
> run_command
> 16:33:50  cmd_obj.run()
> 16:33:50File "setup.py", line 229, in run
> 16:33:50  gen_protos.generate_proto_files(log=log)
> 16:33:50File "/app/sdks/python/gen_protos.py", line 291, in 
> generate_proto_files
> 16:33:50  raise ValueError("Proto generation failed (see log for 
> details).")
> 16:33:50  ValueError: Proto generation failed (see log for 
> details).
> {noformat}
> {noformat}
> import google.protobuf.message as message
> ImportError: No module named google.protobuf.message
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-9130) sdks:python:test-suites:direct:py2:hdfsIntegrationTest is failing with ImportError: No module named google.protobuf.message

2020-01-16 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17017732#comment-17017732
 ] 

Chad Dombrova commented on BEAM-9130:
-

I updated my PR.  There are still other problems with it, but I solved the 
google.protobuf import problem.  

There are a few pieces to the solution:
 * use a pyproject.toml file to specify build requirements: 
[https://github.com/apache/beam/pull/10038/files#diff-71ede7e13a310beaedc4466b48096e26R18]
 * use pep517 module to build source: 
[https://github.com/apache/beam/pull/10038/files#diff-d3cc66016c37760e131f228abbee6667R34]
 * add pep517's temp build directory to the site dir to allow the google 
package's .pth file to work:  
[https://github.com/apache/beam/pull/10038/commits/ed06bf1b17e5f60d7f6bcf7766b93e4eaf4b86ff]

I'm sure there are easier ways to hack this to work without such dramatic 
changes, but this approach sets us on a course to untangle the mess that we 
have now. 
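
A condensed sketch of the second piece above (directories hypothetical), using 
the pep517 helper module:

{code}
# Build the sdist in an isolated env whose build deps come from
# pyproject.toml, so grpcio-tools/protobuf exist before setup.py runs.
from pep517.envbuild import build_sdist

build_sdist(source_dir='sdks/python', sdist_dir='build/sdist')
{code}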

 

 

> sdks:python:test-suites:direct:py2:hdfsIntegrationTest is failing with 
> ImportError: No module named google.protobuf.message
> ---
>
> Key: BEAM-9130
> URL: https://issues.apache.org/jira/browse/BEAM-9130
> Project: Beam
>  Issue Type: Improvement
>  Components: test-failures
>Reporter: Valentyn Tymofieiev
>Priority: Major
>  Labels: currently-failing
>
> From logs:
> {noformat}
> 16:33:50File "/usr/local/lib/python2.7/multiprocessing/process.py", line 
> 267, in _bootstrap
> 16:33:50  self.run()
> 16:33:50File "/usr/local/lib/python2.7/multiprocessing/process.py", line 
> 114, in run
> 16:33:50  self._target(*self._args, **self._kwargs)
> 16:33:50File "/app/sdks/python/gen_protos.py", line 357, in 
> _install_grpcio_tools_and_generate_proto_files
> 16:33:50  generate_proto_files(force=force)
> 16:33:50File "/app/sdks/python/gen_protos.py", line 324, in 
> generate_proto_files
> 16:33:50  generate_urn_files(log, out_dir)
> 16:33:50File "/app/sdks/python/gen_protos.py", line 65, in 
> generate_urn_files
> 16:33:50  import google.protobuf.message as message
> 16:33:50  ImportError: No module named google.protobuf.message
> 16:33:50  Traceback (most recent call last):
> 16:33:50File "setup.py", line 305, in <module>
> 16:33:50  'mypy': generate_protos_first(mypy),
> 16:33:50File 
> "/usr/local/lib/python2.7/site-packages/setuptools/__init__.py", line 145, in 
> setup
> 16:33:50  return distutils.core.setup(**attrs)
> 16:33:50File "/usr/local/lib/python2.7/distutils/core.py", line 151, in 
> setup
> 16:33:50  dist.run_commands()
> 16:33:50File "/usr/local/lib/python2.7/distutils/dist.py", line 953, in 
> run_commands
> 16:33:50  self.run_command(cmd)
> 16:33:50File "/usr/local/lib/python2.7/distutils/dist.py", line 972, in 
> run_command
> 16:33:50  cmd_obj.run()
> 16:33:50File 
> "/usr/local/lib/python2.7/site-packages/setuptools/command/sdist.py", line 
> 44, in run
> 16:33:50  self.run_command('egg_info')
> 16:33:50File "/usr/local/lib/python2.7/distutils/cmd.py", line 326, in 
> run_command
> 16:33:50  self.distribution.run_command(command)
> 16:33:50File "/usr/local/lib/python2.7/distutils/dist.py", line 972, in 
> run_command
> 16:33:50  cmd_obj.run()
> 16:33:50File "setup.py", line 229, in run
> 16:33:50  gen_protos.generate_proto_files(log=log)
> 16:33:50File "/app/sdks/python/gen_protos.py", line 291, in 
> generate_proto_files
> 16:33:50  raise ValueError("Proto generation failed (see log for 
> details).")
> 16:33:50  ValueError: Proto generation failed (see log for 
> details).
> {noformat}
> {noformat}
> import google.protobuf.message as message
> ImportError: No module named google.protobuf.message
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-9130) sdks:python:test-suites:direct:py2:hdfsIntegrationTest is failing with ImportError: No module named google.protobuf.message

2020-01-16 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17017699#comment-17017699
 ] 

Chad Dombrova commented on BEAM-9130:
-

Just in case it's not clear, I think you'd want to edit the Dockerfile to use 
tox, here:

 
{code:java}
CMD holdup -t 45 http://namenode:50070 http://datanode:50075 && \
echo "Waiting for safe mode to end." && \
sleep 45 && \
hdfscli -v -v -v upload -f kinglear.txt / && \
python -m apache_beam.examples.wordcount \
--input hdfs://kinglear* \
--output hdfs://py-wordcount-integration \
--hdfs_host namenode --hdfs_port 50070 --hdfs_user root
 {code}

> sdks:python:test-suites:direct:py2:hdfsIntegrationTest is failing with 
> ImportError: No module named google.protobuf.message
> ---
>
> Key: BEAM-9130
> URL: https://issues.apache.org/jira/browse/BEAM-9130
> Project: Beam
>  Issue Type: Improvement
>  Components: test-failures
>Reporter: Valentyn Tymofieiev
>Priority: Major
>  Labels: currently-failing
>
> From logs:
> {noformat}
> 16:33:50File "/usr/local/lib/python2.7/multiprocessing/process.py", line 
> 267, in _bootstrap
> 16:33:50  self.run()
> 16:33:50File "/usr/local/lib/python2.7/multiprocessing/process.py", line 
> 114, in run
> 16:33:50  self._target(*self._args, **self._kwargs)
> 16:33:50File "/app/sdks/python/gen_protos.py", line 357, in 
> _install_grpcio_tools_and_generate_proto_files
> 16:33:50  generate_proto_files(force=force)
> 16:33:50File "/app/sdks/python/gen_protos.py", line 324, in 
> generate_proto_files
> 16:33:50  generate_urn_files(log, out_dir)
> 16:33:50File "/app/sdks/python/gen_protos.py", line 65, in 
> generate_urn_files
> 16:33:50  import google.protobuf.message as message
> 16:33:50  ImportError: No module named google.protobuf.message
> 16:33:50  Traceback (most recent call last):
> 16:33:50File "setup.py", line 305, in <module>
> 16:33:50  'mypy': generate_protos_first(mypy),
> 16:33:50File 
> "/usr/local/lib/python2.7/site-packages/setuptools/__init__.py", line 145, in 
> setup
> 16:33:50  return distutils.core.setup(**attrs)
> 16:33:50File "/usr/local/lib/python2.7/distutils/core.py", line 151, in 
> setup
> 16:33:50  dist.run_commands()
> 16:33:50File "/usr/local/lib/python2.7/distutils/dist.py", line 953, in 
> run_commands
> 16:33:50  self.run_command(cmd)
> 16:33:50File "/usr/local/lib/python2.7/distutils/dist.py", line 972, in 
> run_command
> 16:33:50  cmd_obj.run()
> 16:33:50File 
> "/usr/local/lib/python2.7/site-packages/setuptools/command/sdist.py", line 
> 44, in run
> 16:33:50  self.run_command('egg_info')
> 16:33:50File "/usr/local/lib/python2.7/distutils/cmd.py", line 326, in 
> run_command
> 16:33:50  self.distribution.run_command(command)
> 16:33:50File "/usr/local/lib/python2.7/distutils/dist.py", line 972, in 
> run_command
> 16:33:50  cmd_obj.run()
> 16:33:50File "setup.py", line 229, in run
> 16:33:50  gen_protos.generate_proto_files(log=log)
> 16:33:50File "/app/sdks/python/gen_protos.py", line 291, in 
> generate_proto_files
> 16:33:50  raise ValueError("Proto generation failed (see log for 
> details).")
> 16:33:50  ValueError: Proto generation failed (see log for 
> details).
> {noformat}
> {noformat}
> import google.protobuf.message as message
> ImportError: No module named google.protobuf.message
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-9130) sdks:python:test-suites:direct:py2:hdfsIntegrationTest is failing with ImportError: No module named google.protobuf.message

2020-01-16 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17017691#comment-17017691
 ] 

Chad Dombrova commented on BEAM-9130:
-

The protobuf package is a dependency of grpcio-tools, which is part of 
build_requirements.txt.   Did you include build_requirements.txt in the deps 
for your tox env?   That could help, but it also might not:  I can't remember 
whether the deps are installed in time to affect setup.py (which calls 
gen_protos).  I think they may not be, which is where pyproject.toml and 
pep517 come into play:  they specify the deps that need to be installed 
before setup.py runs. 

I'm currently rebasing my old testing refactor branch to see if it helps.  

Can you push your current WIP solution for this so that I can try to help out?  
I'm on vacation tomorrow and Monday, but I may be able to chip away at some 
things on the plane. 

 

 

> sdks:python:test-suites:direct:py2:hdfsIntegrationTest is failing with 
> ImportError: No module named google.protobuf.message
> ---
>
> Key: BEAM-9130
> URL: https://issues.apache.org/jira/browse/BEAM-9130
> Project: Beam
>  Issue Type: Improvement
>  Components: test-failures
>Reporter: Valentyn Tymofieiev
>Priority: Major
>  Labels: currently-failing
>
> From logs:
> {noformat}
> 16:33:50File "/usr/local/lib/python2.7/multiprocessing/process.py", line 
> 267, in _bootstrap
> 16:33:50  self.run()
> 16:33:50File "/usr/local/lib/python2.7/multiprocessing/process.py", line 
> 114, in run
> 16:33:50  self._target(*self._args, **self._kwargs)
> 16:33:50File "/app/sdks/python/gen_protos.py", line 357, in 
> _install_grpcio_tools_and_generate_proto_files
> 16:33:50  generate_proto_files(force=force)
> 16:33:50File "/app/sdks/python/gen_protos.py", line 324, in 
> generate_proto_files
> 16:33:50  generate_urn_files(log, out_dir)
> 16:33:50File "/app/sdks/python/gen_protos.py", line 65, in 
> generate_urn_files
> 16:33:50  import google.protobuf.message as message
> 16:33:50  ImportError: No module named google.protobuf.message
> 16:33:50  Traceback (most recent call last):
> 16:33:50File "setup.py", line 305, in <module>
> 16:33:50  'mypy': generate_protos_first(mypy),
> 16:33:50File 
> "/usr/local/lib/python2.7/site-packages/setuptools/__init__.py", line 145, in 
> setup
> 16:33:50  return distutils.core.setup(**attrs)
> 16:33:50File "/usr/local/lib/python2.7/distutils/core.py", line 151, in 
> setup
> 16:33:50  dist.run_commands()
> 16:33:50File "/usr/local/lib/python2.7/distutils/dist.py", line 953, in 
> run_commands
> 16:33:50  self.run_command(cmd)
> 16:33:50File "/usr/local/lib/python2.7/distutils/dist.py", line 972, in 
> run_command
> 16:33:50  cmd_obj.run()
> 16:33:50File 
> "/usr/local/lib/python2.7/site-packages/setuptools/command/sdist.py", line 
> 44, in run
> 16:33:50  self.run_command('egg_info')
> 16:33:50File "/usr/local/lib/python2.7/distutils/cmd.py", line 326, in 
> run_command
> 16:33:50  self.distribution.run_command(command)
> 16:33:50File "/usr/local/lib/python2.7/distutils/dist.py", line 972, in 
> run_command
> 16:33:50  cmd_obj.run()
> 16:33:50File "setup.py", line 229, in run
> 16:33:50  gen_protos.generate_proto_files(log=log)
> 16:33:50File "/app/sdks/python/gen_protos.py", line 291, in 
> generate_proto_files
> 16:33:50  raise ValueError("Proto generation failed (see log for 
> details).")
> 16:33:50  ValueError: Proto generation failed (see log for 
> details).
> {noformat}
> {noformat}
> import google.protobuf.message as message
> ImportError: No module named google.protobuf.message
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (BEAM-9130) sdks:python:test-suites:direct:py2:hdfsIntegrationTest is failing with ImportError: No module named google.protobuf.message

2020-01-16 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17017633#comment-17017633
 ] 

Chad Dombrova edited comment on BEAM-9130 at 1/17/20 1:37 AM:
--

IIRC, the reason it only happens for python2 is that the `google` package in 
python2 uses .pth files to find sub-packages, and .pth files must be in a 
"site" directory, so if it's installed in a non-standard way (into a directory 
that isn't one of the python interpreter's official site-packages directories), 
it won't be able to find the sub-packages.   It is possible to mark a directory 
as a "site" directory using site.addsitedir().  For more info see 
[https://docs.python.org/2/library/site.html]

In python3 namespace packages were redesigned in order to avoid this brittle 
workflow:  [https://www.python.org/dev/peps/pep-0420/]

If this test went through tox, I'm pretty sure this error would not occur.

What are the chances that we can work on getting all tests running through tox 
in the near term?

was (Author: chadrik):
IIRC, the reason it only happens for python2 is that the `google` package in 
python2 uses .pth files to find sub-packages, and .pth files must be in a 
"site" directory, so if it's installed in a non-standard way (one that doesn't 
end up in the python interpreter's official site-packages directory), it won't 
be able to find the sub-packages.   It is possible to mark a directory as a 
"site" directory using site.addsitedir().  For more info see 
[https://docs.python.org/2/library/site.html]

In python3 namespace packages were redesigned in order to avoid this brittle 
workflow:  [https://www.python.org/dev/peps/pep-0420/]

If this test went through tox, I'm pretty sure this error would not occur.

What are the chances that we can work on getting all tests running through tox 
in the near term?

> sdks:python:test-suites:direct:py2:hdfsIntegrationTest is failing with 
> ImportError: No module named google.protobuf.message
> ---
>
> Key: BEAM-9130
> URL: https://issues.apache.org/jira/browse/BEAM-9130
> Project: Beam
>  Issue Type: Improvement
>  Components: test-failures
>Reporter: Valentyn Tymofieiev
>Priority: Major
>  Labels: currently-failing
>
> From logs:
> {noformat}
> 16:33:50File "/usr/local/lib/python2.7/multiprocessing/process.py", line 
> 267, in _bootstrap
> 16:33:50  self.run()
> 16:33:50File "/usr/local/lib/python2.7/multiprocessing/process.py", line 
> 114, in run
> 16:33:50  self._target(*self._args, **self._kwargs)
> 16:33:50File "/app/sdks/python/gen_protos.py", line 357, in 
> _install_grpcio_tools_and_generate_proto_files
> 16:33:50  generate_proto_files(force=force)
> 16:33:50File "/app/sdks/python/gen_protos.py", line 324, in 
> generate_proto_files
> 16:33:50  generate_urn_files(log, out_dir)
> 16:33:50File "/app/sdks/python/gen_protos.py", line 65, in 
> generate_urn_files
> 16:33:50  import google.protobuf.message as message
> 16:33:50  ImportError: No module named google.protobuf.message
> 16:33:50  Traceback (most recent call last):
> 16:33:50File "setup.py", line 305, in <module>
> 16:33:50  'mypy': generate_protos_first(mypy),
> 16:33:50File 
> "/usr/local/lib/python2.7/site-packages/setuptools/__init__.py", line 145, in 
> setup
> 16:33:50  return distutils.core.setup(**attrs)
> 16:33:50File "/usr/local/lib/python2.7/distutils/core.py", line 151, in 
> setup
> 16:33:50  dist.run_commands()
> 16:33:50File "/usr/local/lib/python2.7/distutils/dist.py", line 953, in 
> run_commands
> 16:33:50  self.run_command(cmd)
> 16:33:50File "/usr/local/lib/python2.7/distutils/dist.py", line 972, in 
> run_command
> 16:33:50  cmd_obj.run()
> 16:33:50File 
> "/usr/local/lib/python2.7/site-packages/setuptools/command/sdist.py", line 
> 44, in run
> 16:33:50  self.run_command('egg_info')
> 16:33:50File "/usr/local/lib/python2.7/distutils/cmd.py", line 326, in 
> run_command
> 16:33:50  self.distribution.run_command(command)
> 16:33:50File "/usr/local/lib/python2.7/distutils/dist.py", line 972, in 
> run_command
> 16:33:50  cmd_obj.run()
> 16:33:50File "setup.py", line 229, in run
> 16:33:50  gen_protos.generate_proto_files(log=log)
> 16:33:50File "/app/sdks/python/gen_protos.py", line 291, in 
> generate_proto_files
> 16:33:50  raise ValueError("Proto generation failed (see log for 
> details).")
> 16:33:50  ValueError: Proto generation failed (see log for 
> details).
> {noformat}
> {noformat}
> import google.protobuf.message as message
> ImportError: No module named google.protobuf.message
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (BEAM-9130) sdks:python:test-suites:direct:py2:hdfsIntegrationTest is failing with ImportError: No module named google.protobuf.message

2020-01-16 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17017633#comment-17017633
 ] 

Chad Dombrova commented on BEAM-9130:
-

IIRC, the reason it only happens for python2 is that the `google` package in 
python2 uses .pth files to find sub-packages, and .pth files must be in a 
"site" directory, so if it's installed in a non-standard way (one that doesn't 
end up in the python interpreter's official site-packages directory), it won't 
be able to find the sub-packages.   It is possible to mark a directory as a 
"site" directory using site.addsitedir().  For more info see 
[https://docs.python.org/2/library/site.html]

In python3 namespace packages were redesigned in order to avoid this brittle 
workflow:  [https://www.python.org/dev/peps/pep-0420/]

If this test went through tox, I'm pretty sure this error would not occur.

What are the chances that we can work on getting all tests running through tox 
in the near term?

> sdks:python:test-suites:direct:py2:hdfsIntegrationTest is failing with 
> ImportError: No module named google.protobuf.message
> ---
>
> Key: BEAM-9130
> URL: https://issues.apache.org/jira/browse/BEAM-9130
> Project: Beam
>  Issue Type: Improvement
>  Components: test-failures
>Reporter: Valentyn Tymofieiev
>Priority: Major
>  Labels: currently-failing
>
> From logs:
> {noformat}
> 16:33:50File "/usr/local/lib/python2.7/multiprocessing/process.py", line 
> 267, in _bootstrap
> 16:33:50  self.run()
> 16:33:50File "/usr/local/lib/python2.7/multiprocessing/process.py", line 
> 114, in run
> 16:33:50  self._target(*self._args, **self._kwargs)
> 16:33:50File "/app/sdks/python/gen_protos.py", line 357, in 
> _install_grpcio_tools_and_generate_proto_files
> 16:33:50  generate_proto_files(force=force)
> 16:33:50File "/app/sdks/python/gen_protos.py", line 324, in 
> generate_proto_files
> 16:33:50  generate_urn_files(log, out_dir)
> 16:33:50File "/app/sdks/python/gen_protos.py", line 65, in 
> generate_urn_files
> 16:33:50  import google.protobuf.message as message
> 16:33:50  ImportError: No module named google.protobuf.message
> 16:33:50  Traceback (most recent call last):
> 16:33:50File "setup.py", line 305, in <module>
> 16:33:50  'mypy': generate_protos_first(mypy),
> 16:33:50File 
> "/usr/local/lib/python2.7/site-packages/setuptools/__init__.py", line 145, in 
> setup
> 16:33:50  return distutils.core.setup(**attrs)
> 16:33:50File "/usr/local/lib/python2.7/distutils/core.py", line 151, in 
> setup
> 16:33:50  dist.run_commands()
> 16:33:50File "/usr/local/lib/python2.7/distutils/dist.py", line 953, in 
> run_commands
> 16:33:50  self.run_command(cmd)
> 16:33:50File "/usr/local/lib/python2.7/distutils/dist.py", line 972, in 
> run_command
> 16:33:50  cmd_obj.run()
> 16:33:50File 
> "/usr/local/lib/python2.7/site-packages/setuptools/command/sdist.py", line 
> 44, in run
> 16:33:50  self.run_command('egg_info')
> 16:33:50File "/usr/local/lib/python2.7/distutils/cmd.py", line 326, in 
> run_command
> 16:33:50  self.distribution.run_command(command)
> 16:33:50File "/usr/local/lib/python2.7/distutils/dist.py", line 972, in 
> run_command
> 16:33:50  cmd_obj.run()
> 16:33:50File "setup.py", line 229, in run
> 16:33:50  gen_protos.generate_proto_files(log=log)
> 16:33:50File "/app/sdks/python/gen_protos.py", line 291, in 
> generate_proto_files
> 16:33:50  raise ValueError("Proto generation failed (see log for 
> details).")
> 16:33:50  ValueError: Proto generation failed (see log for 
> details).
> {noformat}
> {noformat}
> import google.protobuf.message as message
> ImportError: No module named google.protobuf.message
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-9130) sdks:python:test-suites:direct:py2:hdfsIntegrationTest is failing with ImportError: No module named google.protobuf.message

2020-01-16 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17017600#comment-17017600
 ] 

Chad Dombrova commented on BEAM-9130:
-

It works for pre-commit, so I'm assuming that this goes back, once again, to 
the inconsistent way that the beam python sdk is built.   

Here's my old PR attempting to get everything using tox and pep517:  
[https://github.com/apache/beam/pull/10038]


> sdks:python:test-suites:direct:py2:hdfsIntegrationTest is failing with 
> ImportError: No module named google.protobuf.message
> ---
>
> Key: BEAM-9130
> URL: https://issues.apache.org/jira/browse/BEAM-9130
> Project: Beam
>  Issue Type: Improvement
>  Components: test-failures
>Reporter: Valentyn Tymofieiev
>Priority: Major
>  Labels: currently-failing
>
> From logs:
> {noformat}
> 16:33:50File "/usr/local/lib/python2.7/multiprocessing/process.py", line 
> 267, in _bootstrap
> 16:33:50  self.run()
> 16:33:50File "/usr/local/lib/python2.7/multiprocessing/process.py", line 
> 114, in run
> 16:33:50  self._target(*self._args, **self._kwargs)
> 16:33:50File "/app/sdks/python/gen_protos.py", line 357, in 
> _install_grpcio_tools_and_generate_proto_files
> 16:33:50  generate_proto_files(force=force)
> 16:33:50File "/app/sdks/python/gen_protos.py", line 324, in 
> generate_proto_files
> 16:33:50  generate_urn_files(log, out_dir)
> 16:33:50File "/app/sdks/python/gen_protos.py", line 65, in 
> generate_urn_files
> 16:33:50  import google.protobuf.message as message
> 16:33:50  ImportError: No module named google.protobuf.message
> 16:33:50  Traceback (most recent call last):
> 16:33:50File "setup.py", line 305, in 
> 16:33:50  'mypy': generate_protos_first(mypy),
> 16:33:50File 
> "/usr/local/lib/python2.7/site-packages/setuptools/__init__.py", line 145, in 
> setup
> 16:33:50  return distutils.core.setup(**attrs)
> 16:33:50File "/usr/local/lib/python2.7/distutils/core.py", line 151, in 
> setup
> 16:33:50  dist.run_commands()
> 16:33:50File "/usr/local/lib/python2.7/distutils/dist.py", line 953, in 
> run_commands
> 16:33:50  self.run_command(cmd)
> 16:33:50File "/usr/local/lib/python2.7/distutils/dist.py", line 972, in 
> run_command
> 16:33:50  cmd_obj.run()
> 16:33:50File 
> "/usr/local/lib/python2.7/site-packages/setuptools/command/sdist.py", line 
> 44, in run
> 16:33:50  self.run_command('egg_info')
> 16:33:50File "/usr/local/lib/python2.7/distutils/cmd.py", line 326, in 
> run_command
> 16:33:50  self.distribution.run_command(command)
> 16:33:50File "/usr/local/lib/python2.7/distutils/dist.py", line 972, in 
> run_command
> 16:33:50  cmd_obj.run()
> 16:33:50File "setup.py", line 229, in run
> 16:33:50  gen_protos.generate_proto_files(log=log)
> 16:33:50File "/app/sdks/python/gen_protos.py", line 291, in 
> generate_proto_files
> 16:33:50  raise ValueError("Proto generation failed (see log for 
> details).")
> 16:33:50  ValueError: Proto generation failed (see log for 
> details
> {noformat}
> {noformat}
> import google.protobuf.message as message
> ImportError: No module named google.protobuf.message
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-9064) Add pytype to lint checks

2020-01-07 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010307#comment-17010307
 ] 

Chad Dombrova commented on BEAM-9064:
-

Thoughts (which I will repeat in the mailing list discussion):

I've gotten mypy completely passing on my own branch. It was a lot of work, and 
the longer it lingers the more work it becomes.  So let's focus our efforts on 
getting that merged and integrated into the lint jobs _first._  From there we 
will learn two things: 1) how difficult it is for end users to keep the tests 
passing (i.e. the time and frustration levels with mypy alone), and 2) how much 
more work it will be to also get pytype passing (i.e. how divergent the two 
tools are).  Based on that feedback we can decide whether we want a pytype 
failure to _prevent_ a PR from merging, or merely be informative (failures 
could be something that the google team tracks on the side through 
metrics/analytics). 

tl;dr   If both tools are actually quite similar, then it shouldn't be much 
more of a burden for users to maintain both.  But if they're not, or users are 
already struggling with mypy on its own, then we should hold off introducing 
pytype as a requirement.

 

> Add pytype to lint checks
> -
>
> Key: BEAM-9064
> URL: https://issues.apache.org/jira/browse/BEAM-9064
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py-core, testing
>Reporter: Udi Meiri
>Assignee: Udi Meiri
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [~chadrik]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-9064) Add pytype to lint checks

2020-01-07 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010258#comment-17010258
 ] 

Chad Dombrova commented on BEAM-9064:
-

I am _very_ reluctant to implement this.  I haven't used pytype, but my 
experience working with mypy over the past few years, and following various 
issues and PEPs related to it and typing in general, has taught me that there's 
still a lot of room for interpretation, and thus variation, between type 
checkers.  As a user, it can be quite challenging to solve certain typing 
issues, and I would not be the least bit surprised to find that certain 
problems cannot be solved in a way that satisfies both linters, at least not 
without some seriously ugly workarounds.  We're already asking our contributors 
to gain a whole new area of expertise, one with a fairly steep learning curve, 
in order to get their PRs merged; I wouldn't want to burden them with an 
additional linter with its own idiosyncrasies.  

 

> Add pytype to lint checks
> -
>
> Key: BEAM-9064
> URL: https://issues.apache.org/jira/browse/BEAM-9064
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py-core, testing
>Reporter: Udi Meiri
>Assignee: Udi Meiri
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [~chadrik]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-9012) Include `-> None` on Pipeline and PipelineOptions `__init__` methods for pytype compatibility

2019-12-20 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17001295#comment-17001295
 ] 

Chad Dombrova commented on BEAM-9012:
-

I imagine there are going to be _lots_ of little differences between mypy and 
pytype.  I'm curious about your motivation for using pytype.  Do you think we 
should aim to support both?  I'd be a bit wary of doing so, since getting mypy 
to pass can be challenging enough on its own.  I can imagine scenarios where 
there is no solution that appeases both mypy and pytype (I'm thinking 
particularly of overloads, whose semantics seem to vary between tools).
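
For concreteness, the annotation at issue looks like this (a sketch on a 
hypothetical class, using the comment-style annotations the python sdk 
currently uses):

{code:python}
from typing import Any, Optional


class Pipeline(object):
  # mypy lets __init__ omit its return annotation (it infers None), but
  # pytype has no such special case, so '-> None' must be spelled out.
  def __init__(self, options=None):
    # type: (Optional[Any]) -> None
    self._options = options
{code}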

 

> Include `-> None` on Pipeline and PipelineOptions `__init__` methods for 
> pytype compatibility
> -
>
> Key: BEAM-9012
> URL: https://issues.apache.org/jira/browse/BEAM-9012
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py-core
>Reporter: Brian Hulette
>Assignee: Brian Hulette
>Priority: Major
> Fix For: 2.19.0
>
>
> mypy [made a decision|https://github.com/python/mypy/issues/604] to allow 
> init methods to omit {{\-> None}} return type annotations, but pytype has no 
> such feature. I think we should include {{\-> None}} annotations for pytype 
> compatibility.
> cc: [~chadrik]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-9012) Include `-> None` on Pipeline and PipelineOptions `__init__` methods for pytype compatibility

2019-12-20 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17001279#comment-17001279
 ] 

Chad Dombrova commented on BEAM-9012:
-

Fine by me.  Brian, if you're into the static typing thing, you may want to 
poke in over at my second PR, which is waiting on some feedback:  
[https://github.com/apache/beam/pull/10367]

There will probably be a third (and hopefully final) PR after that one to get 
the project to a point where mypy is fully passing.  We can take care of this 
issue in that final PR. 

 

> Include `-> None` on Pipeline and PipelineOptions `__init__` methods for 
> pytype compatibility
> -
>
> Key: BEAM-9012
> URL: https://issues.apache.org/jira/browse/BEAM-9012
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py-core
>Reporter: Brian Hulette
>Assignee: Brian Hulette
>Priority: Major
> Fix For: 2.19.0
>
>
> mypy [made a decision|https://github.com/python/mypy/issues/604] to allow 
> init methods to omit {{\-> None}} return type annotations, but pytype has no 
> such feature. I think we should include {{\-> None}} annotations for pytype 
> compatibility.
> cc: [~chadrik]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (BEAM-8613) Add environment variable support to Docker environment

2019-12-16 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992035#comment-16992035
 ] 

Chad Dombrova edited comment on BEAM-8613 at 12/17/19 12:34 AM:


{quote}What kind of environment variables are you trying to pass here?
{quote}
We're primarily interested in configuring various libraries and applications 
used by our UDFs. These each have their own set of environment variables which 
typically need to be configured before modules are imported. 

Another use case which we intend to explore soon is passing env vars to control 
the behavior of pip in {{boot}}. For example, to point it at our internal pypi 
mirror. Do you think this falls into the category of "building too much into 
these (unstructured) string fields"?
{quote}Is there not another way to pass this data to the operations being 
performed in this container?
{quote}
Let's frame this as a user story:

"As a developer, I want to set library- and application-specific env variables 
(usually third-party) in the SDK process before any affected modules are 
imported, so that I can bind a particular configuration to a job."

Let's evaluate a few options:
 - custom PipelineOptions:  this runs too late.  by the time we can read the 
pipeline options, our UDF and its pcollection element types have been 
unpickled, thereby importing many dependent modules. 
 - custom config file uploaded to artifact service: same problem as above.
 - custom docker container:  this is too slow.  we don't want to create a new 
docker container for every permutation that we might need. we want this to be 
user controlled at job submission time
 - custom docker ARGS:  this is needlessly complicated.  theoretically if we 
had a custom docker container with a custom entrypoint script and the ability 
to configure docker args via the DOCKER environment we could get this to work, 
but that's a lot of work just to set some env vars.  besides, we already have 
the ability to set env vars for PROCESS environment type, so doing the same for 
DOCKER seems natural. 

I'm not sure what other good options there are. Environment variables seem like 
the most direct and generally useful approach. 
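
For reference, this is roughly how env vars already work for the PROCESS 
environment type (a sketch; the JSON shape and values are illustrative, and the 
proposal is for DOCKER to honor the same kind of "env" map):

{code:python}
# Hypothetical job submission; --environment_config is the relevant part.
pipeline_args = [
    '--runner=PortableRunner',
    '--environment_type=PROCESS',
    '--environment_config={"command": "/opt/apache/beam/boot",'
    ' "env": {"MY_LIB_CONFIG": "/configs/job-a.ini",'
    ' "PIP_INDEX_URL": "https://pypi.example.com/simple"}}',
]
{code}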

 


was (Author: chadrik):
{quote}What kind of environment variables are you trying to pass here?
{quote}
We're primarily interested in configuring various libraries and applications 
used by our UDFs. These each have their own set of environment variables which 
typically need to be configured before modules are imported. 

Another use case which we intend to explore soon is passing env vars to control 
the behavior of pip in {{boot}}. For example, to point it at our internal pypi 
mirror. Do you think this falls into the category of "building too much into 
these (unstructured) string fields"?
{quote}Is there not another way to pass this data to the operations being 
performed in this container?
{quote}
Let's frame this as a user story:

"As a developer, I want to set library- and application-specific env variables 
(usually third-party) in the SDK process before any affected modules are 
imported, so that I can bind a particular configuration to a job."

Let's evaluate a few options:
 - custom PipelineOptions: by the time we can read the pipeline options, our 
UDF and its pcollection element types have been unpickled, thereby importing 
many dependent modules.
 - custom config file uploaded to artifact service: same problem as above.
 - custom docker container: we don't want to create a new docker container for 
every permutation that we might need. we want this to be user controlled at job 
submission time
 - custom docker ARGS: theoretically if we had a custom docker container with a 
custom entrypoint script and the ability to configure docker args via the 
DOCKER environment we could get this to work. this just seems needlessly 
complicated.  we already have the ability to set env vars for PROCESS 
environment type, so doing the same for DOCKER seems natural. 

I'm not sure what other good options there are. Environment variables seem like 
the most direct and generally useful approach. 

 

> Add environment variable support to Docker environment
> --
>
> Key: BEAM-8613
> URL: https://issues.apache.org/jira/browse/BEAM-8613
> Project: Beam
>  Issue Type: Improvement
>  Components: java-fn-execution, runner-core, runner-direct
>Reporter: Nathan Rusch
>Assignee: Nathan Rusch
>Priority: Trivial
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> The Process environment allows specifying environment variables via a map 
> field on its payload message. The Docker environment should support this same 
> pattern, and forward the contents of the map through to the container runtime.



--
This message was sent 

[jira] [Commented] (BEAM-8613) Add environment variable support to Docker environment

2019-12-16 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16997743#comment-16997743
 ] 

Chad Dombrova commented on BEAM-8613:
-

[~robertwb] any additional thoughts on this?

> Add environment variable support to Docker environment
> --
>
> Key: BEAM-8613
> URL: https://issues.apache.org/jira/browse/BEAM-8613
> Project: Beam
>  Issue Type: Improvement
>  Components: java-fn-execution, runner-core, runner-direct
>Reporter: Nathan Rusch
>Assignee: Nathan Rusch
>Priority: Trivial
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> The Process environment allows specifying environment variables via a map 
> field on its payload message. The Docker environment should support this same 
> pattern, and forward the contents of the map through to the container runtime.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-8966) failure in :sdks:python:test-suites:direct:py37:hdfsIntegrationTest

2019-12-16 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16997506#comment-16997506
 ] 

Chad Dombrova commented on BEAM-8966:
-

Ok, I'm working on trying to reproduce this, but running into errors.  This 
test copies the entirety of the sdks/python directory, including all of my 
intermediate build, target, and tox env directories, and that's tripping 
something up.  Seems like another example of a highly inefficient approach to 
creating build isolation.  Is there a reason the ITs don't go through tox?


> failure in :sdks:python:test-suites:direct:py37:hdfsIntegrationTest
> ---
>
> Key: BEAM-8966
> URL: https://issues.apache.org/jira/browse/BEAM-8966
> Project: Beam
>  Issue Type: Bug
>  Components: test-failures
>Reporter: Udi Meiri
>Assignee: Chad Dombrova
>Priority: Major
>
> I believe this is due to https://github.com/apache/beam/pull/9915
> {code}
> Collecting mypy-protobuf==1.12
>   Using cached 
> https://files.pythonhosted.org/packages/b6/28/041dea47c93564bfc0ece050362894292ec4f173caa92fa82994a6d061d1/mypy_protobuf-1.12-py3-none-any.whl
> Installing collected packages: mypy-protobuf
> Successfully installed mypy-protobuf-1.12
> beam_fn_api.proto: warning: Import google/protobuf/descriptor.proto but 
> not used.
> beam_fn_api.proto: warning: Import google/protobuf/wrappers.proto but not 
> used.
> Traceback (most recent call last):
>   File "/usr/local/bin/protoc-gen-mypy", line 13, in 
> import google.protobuf.descriptor_pb2 as d
> ModuleNotFoundError: No module named 'google'
> --mypy_out: protoc-gen-mypy: Plugin failed with status code 1.
> Process Process-1:
> Traceback (most recent call last):
>   File "/app/sdks/python/gen_protos.py", line 104, in generate_proto_files
> from grpc_tools import protoc
> ModuleNotFoundError: No module named 'grpc_tools'
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File "/usr/local/lib/python3.7/multiprocessing/process.py", line 297, in 
> _bootstrap
> self.run()
>   File "/usr/local/lib/python3.7/multiprocessing/process.py", line 99, in run
> self._target(*self._args, **self._kwargs)
>   File "/app/sdks/python/gen_protos.py", line 189, in 
> _install_grpcio_tools_and_generate_proto_files
> generate_proto_files()
>   File "/app/sdks/python/gen_protos.py", line 144, in generate_proto_files
> '%s' % ret_code)
> RuntimeError: Protoc returned non-zero status (see logs for details): 1
> Traceback (most recent call last):
>   File "/app/sdks/python/gen_protos.py", line 104, in generate_proto_files
> from grpc_tools import protoc
> ModuleNotFoundError: No module named 'grpc_tools'
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File "setup.py", line 295, in 
> 'mypy': generate_protos_first(mypy),
>   File "/usr/local/lib/python3.7/site-packages/setuptools/__init__.py", line 
> 145, in setup
> return distutils.core.setup(**attrs)
>   File "/usr/local/lib/python3.7/distutils/core.py", line 148, in setup
> dist.run_commands()
>   File "/usr/local/lib/python3.7/distutils/dist.py", line 966, in run_commands
> self.run_command(cmd)
>   File "/usr/local/lib/python3.7/distutils/dist.py", line 985, in run_command
> cmd_obj.run()
>   File "/usr/local/lib/python3.7/site-packages/setuptools/command/sdist.py", 
> line 44, in run
> self.run_command('egg_info')
>   File "/usr/local/lib/python3.7/distutils/cmd.py", line 313, in run_command
> self.distribution.run_command(command)
>   File "/usr/local/lib/python3.7/distutils/dist.py", line 985, in run_command
> cmd_obj.run()
>   File "setup.py", line 220, in run
> gen_protos.generate_proto_files(log=log)
>   File "/app/sdks/python/gen_protos.py", line 121, in generate_proto_files
> raise ValueError("Proto generation failed (see log for details).")
> ValueError: Proto generation failed (see log for details).
> Service 'test' failed to build: The command '/bin/sh -c cd sdks/python && 
> python setup.py sdist && pip install --no-cache-dir $(ls 
> dist/apache-beam-*.tar.gz | tail -n1)[gcp]' returned a non-zero code: 1
> {code}
> https://builds.apache.org/job/beam_PostCommit_Python37/1114/consoleText



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-8966) failure in :sdks:python:test-suites:direct:py37:hdfsIntegrationTest

2019-12-16 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16997438#comment-16997438
 ] 

Chad Dombrova commented on BEAM-8966:
-

hmmm...  the errors show two different problems.  

In Udi's, mypy-protobuf is newly installed and protoc-gen-mypy is found by 
protoc, but there's a failure to import google.protobuf.  In Kamil's, 
mypy-protobuf is already installed, but protoc-gen-mypy is not found by protoc. 

Does anyone have a way to reproduce this reliably?  What about locally?

I have a hunch this is the kind of problem that would be solved by the pep517 
efforts I've been working on.  In the meantime, the mypy tests are not required 
– they don't actually pass yet, they're simply there for reference – so we 
could remove the mypy-protobuf change and disable the tests until after the 
pep517 changes. 

 

> failure in :sdks:python:test-suites:direct:py37:hdfsIntegrationTest
> ---
>
> Key: BEAM-8966
> URL: https://issues.apache.org/jira/browse/BEAM-8966
> Project: Beam
>  Issue Type: Bug
>  Components: test-failures
>Reporter: Udi Meiri
>Assignee: Chad Dombrova
>Priority: Major
>
> I believe this is due to https://github.com/apache/beam/pull/9915
> {code}
> Collecting mypy-protobuf==1.12
>   Using cached 
> https://files.pythonhosted.org/packages/b6/28/041dea47c93564bfc0ece050362894292ec4f173caa92fa82994a6d061d1/mypy_protobuf-1.12-py3-none-any.whl
> Installing collected packages: mypy-protobuf
> Successfully installed mypy-protobuf-1.12
> beam_fn_api.proto: warning: Import google/protobuf/descriptor.proto but 
> not used.
> beam_fn_api.proto: warning: Import google/protobuf/wrappers.proto but not 
> used.
> Traceback (most recent call last):
>   File "/usr/local/bin/protoc-gen-mypy", line 13, in 
> import google.protobuf.descriptor_pb2 as d
> ModuleNotFoundError: No module named 'google'
> --mypy_out: protoc-gen-mypy: Plugin failed with status code 1.
> Process Process-1:
> Traceback (most recent call last):
>   File "/app/sdks/python/gen_protos.py", line 104, in generate_proto_files
> from grpc_tools import protoc
> ModuleNotFoundError: No module named 'grpc_tools'
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File "/usr/local/lib/python3.7/multiprocessing/process.py", line 297, in 
> _bootstrap
> self.run()
>   File "/usr/local/lib/python3.7/multiprocessing/process.py", line 99, in run
> self._target(*self._args, **self._kwargs)
>   File "/app/sdks/python/gen_protos.py", line 189, in 
> _install_grpcio_tools_and_generate_proto_files
> generate_proto_files()
>   File "/app/sdks/python/gen_protos.py", line 144, in generate_proto_files
> '%s' % ret_code)
> RuntimeError: Protoc returned non-zero status (see logs for details): 1
> Traceback (most recent call last):
>   File "/app/sdks/python/gen_protos.py", line 104, in generate_proto_files
> from grpc_tools import protoc
> ModuleNotFoundError: No module named 'grpc_tools'
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File "setup.py", line 295, in 
> 'mypy': generate_protos_first(mypy),
>   File "/usr/local/lib/python3.7/site-packages/setuptools/__init__.py", line 
> 145, in setup
> return distutils.core.setup(**attrs)
>   File "/usr/local/lib/python3.7/distutils/core.py", line 148, in setup
> dist.run_commands()
>   File "/usr/local/lib/python3.7/distutils/dist.py", line 966, in run_commands
> self.run_command(cmd)
>   File "/usr/local/lib/python3.7/distutils/dist.py", line 985, in run_command
> cmd_obj.run()
>   File "/usr/local/lib/python3.7/site-packages/setuptools/command/sdist.py", 
> line 44, in run
> self.run_command('egg_info')
>   File "/usr/local/lib/python3.7/distutils/cmd.py", line 313, in run_command
> self.distribution.run_command(command)
>   File "/usr/local/lib/python3.7/distutils/dist.py", line 985, in run_command
> cmd_obj.run()
>   File "setup.py", line 220, in run
> gen_protos.generate_proto_files(log=log)
>   File "/app/sdks/python/gen_protos.py", line 121, in generate_proto_files
> raise ValueError("Proto generation failed (see log for details).")
> ValueError: Proto generation failed (see log for details).
> Service 'test' failed to build: The command '/bin/sh -c cd sdks/python && 
> python setup.py sdist && pip install --no-cache-dir $(ls 
> dist/apache-beam-*.tar.gz | tail -n1)[gcp]' returned a non-zero code: 1
> {code}
> https://builds.apache.org/job/beam_PostCommit_Python37/1114/consoleText



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-8613) Add environment variable support to Docker environment

2019-12-09 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16992035#comment-16992035
 ] 

Chad Dombrova commented on BEAM-8613:
-

{quote}What kind of environment variables are you trying to pass here?
{quote}
We're primarily interested in configuring various libraries and applications 
used by our UDFs. These each have their own set of environment variables which 
typically need to be configured before modules are imported. 

Another use case which we intend to explore soon is passing env vars to control 
the behavior of pip in {{boot}}. For example, to point it at our internal pypi 
mirror. Do you think this falls into the category of "building too much into 
these (unstructured) string fields"?
{quote}Is there not another way to pass this data to the operations being 
performed in this container?
{quote}
Let's frame this as a user story:

"As a developer, I want to set library- and application-specific env variables 
(usually third-party) in the SDK process before any affected modules are 
imported, so that I can bind a particular configuration to a job."

Let's evaluate a few options:
 - custom PipelineOptions: by the time we can read the pipeline options, our 
UDF and its pcollection element types have been unpickled, thereby importing 
many dependent modules.
 - custom config file uploaded to artifact service: same problem as above.
 - custom docker container: we don't want to create a new docker container for 
every permutation that we might need. we want this to be user controlled at job 
submission time
 - custom docker ARGS: theoretically if we had a custom docker container with a 
custom entrypoint script and the ability to configure docker args via the 
DOCKER environment we could get this to work. this just seems needlessly 
complicated.  we already have the ability to set env vars for PROCESS 
environment type, so doing the same for DOCKER seems natural. 

I'm not sure what other good options there are. Environment variables seem like 
the most direct and generally useful approach. 

 

> Add environment variable support to Docker environment
> --
>
> Key: BEAM-8613
> URL: https://issues.apache.org/jira/browse/BEAM-8613
> Project: Beam
>  Issue Type: Improvement
>  Components: java-fn-execution, runner-core, runner-direct
>Reporter: Nathan Rusch
>Assignee: Nathan Rusch
>Priority: Trivial
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> The Process environment allows specifying environment variables via a map 
> field on its payload message. The Docker environment should support this same 
> pattern, and forward the contents of the map through to the container runtime.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-8930) External workers should receive artifact endpoint when started from python

2019-12-09 Thread Chad Dombrova (Jira)
Chad Dombrova created BEAM-8930:
---

 Summary: External workers should receive artifact endpoint when 
started from python
 Key: BEAM-8930
 URL: https://issues.apache.org/jira/browse/BEAM-8930
 Project: Beam
  Issue Type: Improvement
  Components: sdk-py-core
Reporter: Chad Dombrova
Assignee: Chad Dombrova


{{ExternalWorkerHandler}} does not pass the artifact and provision endpoints, 
making it impossible to provision artifacts when the external worker is started 
from python.  The Java code is properly sending this information.
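
Concretely, the fields in question on {{StartWorkerRequest}} (a sketch; the 
endpoint URLs and call site are illustrative):

{code:python}
from apache_beam.portability.api import beam_fn_api_pb2, endpoints_pb2


def _descriptor(url):
  return endpoints_pb2.ApiServiceDescriptor(url=url)

# The python side currently populates only the control/logging endpoints;
# the artifact and provision endpoints below are the missing pieces.
request = beam_fn_api_pb2.StartWorkerRequest(
    worker_id='worker-0',
    control_endpoint=_descriptor('localhost:50000'),
    logging_endpoint=_descriptor('localhost:50001'),
    artifact_endpoint=_descriptor('localhost:50002'),
    provision_endpoint=_descriptor('localhost:50003'),
)
{code}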

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-8746) Allow the local job service to work from inside docker

2019-11-19 Thread Chad Dombrova (Jira)
Chad Dombrova created BEAM-8746:
---

 Summary: Allow the local job service to work from inside docker
 Key: BEAM-8746
 URL: https://issues.apache.org/jira/browse/BEAM-8746
 Project: Beam
  Issue Type: Improvement
  Components: sdk-py-core
Reporter: Chad Dombrova
Assignee: Chad Dombrova


Currently the connection is refused.  It's a simple fix. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-7886) Make row coder a standard coder and implement in python

2019-11-17 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-7886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976165#comment-16976165
 ] 

Chad Dombrova commented on BEAM-7886:
-

Is there a Jira for adding support for logical types?  

Is the idea with logical types that we would defer to the stock set of python 
coders for types outside of the native atomic types?  e.g.  timestamp, pickle, 
etc.

 

> Make row coder a standard coder and implement in python
> ---
>
> Key: BEAM-7886
> URL: https://issues.apache.org/jira/browse/BEAM-7886
> Project: Beam
>  Issue Type: Improvement
>  Components: beam-model, sdk-java-core, sdk-py-core
>Reporter: Brian Hulette
>Assignee: Brian Hulette
>Priority: Major
>  Time Spent: 16h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-8732) Add support for additional structured types to Schemas/RowCoders

2019-11-16 Thread Chad Dombrova (Jira)
Chad Dombrova created BEAM-8732:
---

 Summary: Add support for additional structured types to 
Schemas/RowCoders
 Key: BEAM-8732
 URL: https://issues.apache.org/jira/browse/BEAM-8732
 Project: Beam
  Issue Type: New Feature
  Components: sdk-py-core
Reporter: Chad Dombrova


Currently we can convert between a {{NamedTuple}} type and its {{Schema}} 
protos using {{named_tuple_from_schema}} and {{named_tuple_to_schema}}. I'd 
like to introduce a system to support additional types, starting with 
structured types like {{attrs}}, {{dataclasses}}, and {{TypedDict}}.

I've only just started digesting the code, but this task seems pretty 
straightforward. For example, I think the type-to-schema code would look 
roughly like this:
{code:python}
from typing import Type
from uuid import uuid4

from apache_beam.portability.api import schema_pb2


def typing_to_runner_api(type_):
  # type: (Type) -> schema_pb2.FieldType
  structured_handler = _get_structured_handler(type_)
  if structured_handler:
    schema = None
    # Reuse the schema we've already generated for this type, if any.
    if hasattr(type_, 'id'):
      schema = SCHEMA_REGISTRY.get_schema_by_id(type_.id)
    if schema is None:
      # First encounter: build a schema from the type's fields and
      # register it so the mapping round-trips within this session.
      fields = structured_handler.get_fields()
      type_id = str(uuid4())
      schema = schema_pb2.Schema(fields=fields, id=type_id)
      SCHEMA_REGISTRY.add(type_, schema)

    return schema_pb2.FieldType(
        row_type=schema_pb2.RowType(
            schema=schema))
{code}
The rest of the work would be in implementing a class hierarchy for working 
with structured types, covering things like getting a list of fields from an 
instance and instantiating the type from a list of field values. Eventually we 
can extend this behavior to arbitrary, unstructured types.  
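
Roughly, the handler hierarchy might look like this (a sketch; all names are 
hypothetical):

{code:python}
from typing import Any, List, Sequence

from apache_beam.portability.api import schema_pb2


class StructuredTypeHandler(object):
  """Adapts one family of structured types (attrs, dataclasses, ...)."""

  def __init__(self, type_):
    self._type = type_

  def get_fields(self):
    # type: () -> List[schema_pb2.Field]
    """Return the schema fields describing the wrapped type."""
    raise NotImplementedError

  def from_field_values(self, values):
    # type: (Sequence[Any]) -> Any
    """Instantiate the wrapped type from a sequence of field values."""
    raise NotImplementedError
{code}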

Going in the schema-to-type direction, we have the problem of choosing which 
type to use for a given schema. I believe that as long as 
{{typing_to_runner_api()}} has been called on our structured type in the 
current python session, it should be added to the registry and thus round trip 
ok, so I think we just need a public function for registering schemas for 
structured types.

[~bhulette] Did you want to tackle this or are you ok with me going after it?

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-8646) PR #9814 appears to cause failures in fnapi_runner tests on Windows

2019-11-13 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973881#comment-16973881
 ] 

Chad Dombrova commented on BEAM-8646:
-

Hi all.  The simplest solution would be to make this new behavior optional (via 
a pipeline option?), though it's a bit annoying to do so without a better 
understanding of why this is happening.  Unfortunately, we don't have a windows 
machine to test on.   Open to suggestions.


> PR #9814 appears to cause failures in fnapi_runner tests on Windows
> ---
>
> Key: BEAM-8646
> URL: https://issues.apache.org/jira/browse/BEAM-8646
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Wanqi Lyu
>Priority: Major
>
> It appears that changes in 
>  
> https://github.com/apache/beam/commit/d6bcb03f586b5430c30f6ca4a1af9e42711e529c
>  cause test failures in Beam test suite on Windows, for example:
> python setup.py nosetests --tests 
> apache_beam/runners/portability/portable_runner_test.py:PortableRunnerTestWithExternalEnv.test_callbacks_with_exception
>  
> does not finish on a Windows VM machine within at least 60 seconds but passes 
> within a second if  we change host_from_worker to return 'localhost'  in [1].
>  [~violalyu] , do you think you could take a look? Thanks! 
> cc: [~chadrik] [~thw]
> [1] 
> https://github.com/apache/beam/blob/808cb35018cd228a59b152234b655948da2455fa/sdks/python/apache_beam/runners/portability/fn_api_runner.py#L1377.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-4441) Incorrect coder inference for List and Tuple typehints.

2019-11-07 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-4441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969430#comment-16969430
 ] 

Chad Dombrova commented on BEAM-4441:
-

Quick thought on this:  the {{Tuple[int, ...]}} case will be harder to solve 
than {{List[int]}} because if we use the standard {{IterableCoder}} (as we did 
for list) the value will be round-tripped as a list.  So we either accept that 
limitation or we need to use non-standard coders.   

I think it would also be interesting to support set and frozenset.
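
A minimal sketch of the round-trip limitation described above:

{code:python}
from apache_beam.coders import coders

coder = coders.IterableCoder(coders.VarIntCoder())
decoded = coder.decode(coder.encode((1, 2, 3)))
print(type(decoded))  # a list, not a tuple: the tuple-ness is lost in transit
{code}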


> Incorrect coder inference for List and Tuple typehints.
> ---
>
> Key: BEAM-4441
> URL: https://issues.apache.org/jira/browse/BEAM-4441
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Udi Meiri
>Priority: Minor
>
> We seem to use a FastPrimitivesCoder for List and Tuple typehints with 
> homogenous element types, and fail to do the type checking:
> {code:python}
> inputs = (1, "a string")
> coder = typecoders.registry.get_coder(typehints.Tuple[int, str])
> print(type(coder)) # TupleCoder
> encoded = coder.encode(inputs)
> # Fails: TypeError: an integer is required - correct behaviour
> coder = typecoders.registry.get_coder(typehints.Tuple[int, ...]) # A tuple of integers.
> print(type(coder)) # FastPrimitivesCoder - wrong coder?
> encoded = coder.encode(inputs)
> # No errors - incorrect behavior.
> coder = typecoders.registry.get_coder(typehints.List[int]) # A list of integers.
> print(type(coder)) # FastPrimitivesCoder - wrong coder?
> encoded = coder.encode(inputs)
> # No errors - incorrect behavior.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-4441) Incorrect coder inference for List and Tuple typehints.

2019-11-07 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-4441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969422#comment-16969422
 ] 

Chad Dombrova commented on BEAM-4441:
-

I fixed the list case, but not the homogenous tuple case.

> Incorrect coder inference for List and Tuple typehints.
> ---
>
> Key: BEAM-4441
> URL: https://issues.apache.org/jira/browse/BEAM-4441
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Udi Meiri
>Priority: Minor
>
> We seem to use a FastPrimitivesCoder for List and Tuple typehints with 
> homogenous element types, and fail to do the type checking:
> {code:python}
> inputs = (1, "a string")
> coder = typecoders.registry.get_coder(typehints.Tuple[int, str])
> print(type(coder)) # TupleCoder
> encoded = coder.encode(inputs)
> # Fails: TypeError: an integer is required - correct behaviour
> coder = typecoders.registry.get_coder(typehints.Tuple[int, ...]) # A tuple of integers.
> print(type(coder)) # FastPrimitivesCoder - wrong coder?
> encoded = coder.encode(inputs)
> # No errors - incorrect behavior.
> coder = typecoders.registry.get_coder(typehints.List[int]) # A list of integers.
> print(type(coder)) # FastPrimitivesCoder - wrong coder?
> encoded = coder.encode(inputs)
> # No errors - incorrect behavior.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-8552) Spark runner should not use STOPPED state as a terminal state

2019-11-04 Thread Chad Dombrova (Jira)
Chad Dombrova created BEAM-8552:
---

 Summary: Spark runner should not use STOPPED state as a terminal 
state
 Key: BEAM-8552
 URL: https://issues.apache.org/jira/browse/BEAM-8552
 Project: Beam
  Issue Type: Bug
  Components: runner-spark
Reporter: Chad Dombrova


Over at BEAM-8539 we're trying to iron out the definitions of the job state 
enums.  One thing that is clear is that STOPPED is not supposed to be a 
terminal state, though since it was poorly documented, it slipped into use as 
one in a few places.

There are two places in {{SparkPipelineResult}} where STOPPED seems to be used 
as a terminal event.  These should be DONE, FAILED, or CANCELLED instead.

Here's one example:

[https://github.com/apache/beam/blob/master/runners/spark/src/main/java/org/apache/beam/runners/spark/SparkPipelineResult.java#L133]




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-7870) Externally configured KafkaIO / PubsubIO consumer causes coder problems

2019-11-03 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-7870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16966373#comment-16966373
 ] 

Chad Dombrova commented on BEAM-7870:
-

Note that there is not yet support in python for registering a native type to 
replace the row, as there is in Java.  In the example above we'd end up with 
the {{TypedDict}} subclass instead of {{gcp.pubsub.PubsubMessage}}.  This could 
be a follow up task to row coder support.  [~bhulette], do we have a Jira for 
registering native types for row converters yet?

 

> Externally configured KafkaIO / PubsubIO consumer causes coder problems
> ---
>
> Key: BEAM-7870
> URL: https://issues.apache.org/jira/browse/BEAM-7870
> Project: Beam
>  Issue Type: Bug
>  Components: runner-flink, sdk-java-core
>Reporter: Maximilian Michels
>Assignee: Maximilian Michels
>Priority: Major
>
> There are limitations for the consumer to work correctly. The biggest issue 
> is the structure of KafkaIO itself, which uses a combination of the source 
> interface and DoFns to generate the desired output. The problem is that the 
> source interface is natively translated by the Flink Runner to support 
> unbounded sources in portability, while the DoFn runs in a Java environment.
> To transfer data between the two a coder needs to be involved. It happens to 
> be that the initial read does not immediately drop the KafakRecord structure 
> which does not work together well with our current assumption of only 
> supporting "standard coders" present in all SDKs. Only the subsequent DoFn 
> converts the KafkaRecord structure into a raw KV[byte, byte], but the DoFn 
> won't have the coder available in its environment.
> There are several possible solutions:
>  1. Make the DoFn which drops the KafkaRecordCoder a native Java transform in 
> the Flink Runner
>  2. Modify KafkaIO to immediately drop the KafkaRecord structure
>  3. Add the KafkaRecordCoder to all SDKs
>  4. Add a generic coder, e.g. AvroCoder to all SDKs
> For a workaround which uses (3), please see this patch which is not a proper 
> fix but adds KafkaRecordCoder to the SDK such that it can be used 
> encode/decode records: 
> [https://github.com/mxm/beam/commit/b31cf99c75b3972018180d8ccc7e73d311f4cfed]
>  
> See also 
> [https://github.com/apache/beam/pull/8251|https://github.com/apache/beam/pull/8251:]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-6857) Support dynamic timers

2019-11-01 Thread Chad Dombrova (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-6857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chad Dombrova updated BEAM-6857:

Description: 
The Beam timers API currently requires each timer to be statically specified in 
the DoFn. The user must provide a separate callback method per timer. For 
example:

 
{code:java}
DoFn()
{   
  @TimerId("timer1") 
  private final TimerSpec timer1 = TimerSpecs.timer(...);   
  @TimerId("timer2") 
  private final TimerSpec timer2 = TimerSpecs.timer(...);                 
  .. set timers in processElement    
  @OnTimer("timer1") 
  public void onTimer1() { .}
  @OnTimer("timer2") 
  public void onTimer2() {}
}
{code}
 

However there are many cases where the user does not know the set of timers 
statically when writing their code. This happens when the timer tag should be 
based on the data. It also happens when writing a DSL on top of Beam, where the 
DSL author has to create DoFns but does not know statically which timers their 
users will want to set (e.g. Scio).

 

The goal is to support dynamic timers. Something as follows;

 
{code:java}
DoFn() 
{
  @TimerId("timer") 
  private final TimerSpec timer1 = TimerSpecs.dynamicTimer(...);
  @ProcessElement process(@TimerId("timer") DynamicTimer timer)
  {
       timer.set("tag1'", ts);       
   timer.set("tag2", ts);     
  }
  @OnTimer("timer") 
  public void onTimer1(@TimerTag String tag) { .}
}
{code}
 

  was:
The Beam timers API currently requires each timer to be statically specified in 
the DoFn. The user must provide a separate callback method per timer. For 
example:

 
{code:java}
DoFn()
{   @TimerId("timer1") private final TimerSpec timer1 = TimerSpecs.timer(...);  
 @TimerId("timer2") private final TimerSpec timer2 = TimerSpecs.timer(...);     
            .. set timers in processElement    @OnTimer("timer1") public 
void onTimer1() \{ .}
   @OnTimer("timer2") public void onTimer2() {}
}
{code}
 

However there are many cases where the user does not know the set of timers 
statically when writing their code. This happens when the timer tag should be 
based on the data. It also happens when writing a DSL on top of Beam, where the 
DSL author has to create DoFns but does not know statically which timers their 
users will want to set (e.g. Scio).

 

The goal is to support dynamic timers. Something as follows;

 
{code:java}
DoFn() {
  @TimerId("timer") private final TimerSpec timer1 = 
TimerSpecs.dynamicTimer(...);
   @ProcessElement process(@TimerId("timer") DynamicTimer timer)
{       timer.set("tag1'", ts);       timer.set("tag2", ts);     }
   @OnTimer("timer") public void onTimer1(@TimerTag String tag) { .}
}
{code}
 


> Support dynamic timers
> --
>
> Key: BEAM-6857
> URL: https://issues.apache.org/jira/browse/BEAM-6857
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Reuven Lax
>Assignee: Shehzaad Nakhoda
>Priority: Major
>
> The Beam timers API currently requires each timer to be statically specified 
> in the DoFn. The user must provide a separate callback method per timer. For 
> example:
>  
> {code:java}
> DoFn()
> {   
>   @TimerId("timer1") 
>   private final TimerSpec timer1 = TimerSpecs.timer(...);   
>   @TimerId("timer2") 
>   private final TimerSpec timer2 = TimerSpecs.timer(...);                 
>   .. set timers in processElement    
>   @OnTimer("timer1") 
>   public void onTimer1() { .}
>   @OnTimer("timer2") 
>   public void onTimer2() {}
> }
> {code}
>  
> However there are many cases where the user does not know the set of timers 
> statically when writing their code. This happens when the timer tag should be 
> based on the data. It also happens when writing a DSL on top of Beam, where 
> the DSL author has to create DoFns but does not know statically which timers 
> their users will want to set (e.g. Scio).
>  
> The goal is to support dynamic timers. Something as follows;
>  
> {code:java}
> DoFn() 
> {
>   @TimerId("timer") 
>   private final TimerSpec timer1 = TimerSpecs.dynamicTimer(...);
>   @ProcessElement process(@TimerId("timer") DynamicTimer timer)
>   {
>        timer.set("tag1'", ts);       
>timer.set("tag2", ts);     
>   }
>   @OnTimer("timer") 
>   public void onTimer1(@TimerTag String tag) { .}
> }
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-6857) Support dynamic timers

2019-11-01 Thread Chad Dombrova (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-6857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chad Dombrova updated BEAM-6857:

Description: 
The Beam timers API currently requires each timer to be statically specified in 
the DoFn. The user must provide a separate callback method per timer. For 
example:

 
{code:java}
DoFn()
{   @TimerId("timer1") private final TimerSpec timer1 = TimerSpecs.timer(...);  
 @TimerId("timer2") private final TimerSpec timer2 = TimerSpecs.timer(...);     
            .. set timers in processElement    @OnTimer("timer1") public 
void onTimer1() \{ .}
   @OnTimer("timer2") public void onTimer2() {}
}
{code}
 

However there are many cases where the user does not know the set of timers 
statically when writing their code. This happens when the timer tag should be 
based on the data. It also happens when writing a DSL on top of Beam, where the 
DSL author has to create DoFns but does not know statically which timers their 
users will want to set (e.g. Scio).

 

The goal is to support dynamic timers. Something as follows;

 
{code:java}
DoFn() {
  @TimerId("timer") private final TimerSpec timer1 = 
TimerSpecs.dynamicTimer(...);
   @ProcessElement process(@TimerId("timer") DynamicTimer timer)
{       timer.set("tag1'", ts);       timer.set("tag2", ts);     }
   @OnTimer("timer") public void onTimer1(@TimerTag String tag) { .}
}
{code}
 

  was:
The Beam timers API currently requires each timer to be statically specified in 
the DoFn. The user must provide a separate callback method per timer. For 
example:

DoFn() {

  @TimerId("timer1") private final TimerSpec timer1 = TimerSpecs.timer(...);

  @TimerId("timer2") private final TimerSpec timer2 = TimerSpecs.timer(...);

                .. set timers in processElement

   @OnTimer("timer1") public void onTimer1() \{ .}

   @OnTimer("timer2") public void onTimer2() \{}

}

However there are many cases where the user does not know the set of timers 
statically when writing their code. This happens when the timer tag should be 
based on the data. It also happens when writing a DSL on top of Beam, where the 
DSL author has to create DoFns but does not know statically which timers their 
users will want to set (e.g. Scio).

 

The goal is to support dynamic timers. Something as follows;

DoFn() {

  @TimerId("timer") private final TimerSpec timer1 = 
TimerSpecs.dynamicTimer(...);

   @ProcessElement process(@TimerId("timer") DynamicTimer timer) {

      timer.set("tag1'", ts);

      timer.set("tag2", ts);

    }

   @OnTimer("timer") public void onTimer1(@TimerTag String tag) \{ .}

}


> Support dynamic timers
> --
>
> Key: BEAM-6857
> URL: https://issues.apache.org/jira/browse/BEAM-6857
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Reuven Lax
>Assignee: Shehzaad Nakhoda
>Priority: Major
>
> The Beam timers API currently requires each timer to be statically specified 
> in the DoFn. The user must provide a separate callback method per timer. For 
> example:
>  
> {code:java}
> DoFn()
> {   @TimerId("timer1") private final TimerSpec timer1 = 
> TimerSpecs.timer(...);   @TimerId("timer2") private final TimerSpec timer2 = 
> TimerSpecs.timer(...);                 .. set timers in processElement    
> @OnTimer("timer1") public void onTimer1() \{ .}
>    @OnTimer("timer2") public void onTimer2() {}
> }
> {code}
>  
> However there are many cases where the user does not know the set of timers 
> statically when writing their code. This happens when the timer tag should be 
> based on the data. It also happens when writing a DSL on top of Beam, where 
> the DSL author has to create DoFns but does not know statically which timers 
> their users will want to set (e.g. Scio).
>  
> The goal is to support dynamic timers. Something as follows;
>  
> {code:java}
> DoFn() {
>   @TimerId("timer") private final TimerSpec timer1 = 
> TimerSpecs.dynamicTimer(...);
>    @ProcessElement process(@TimerId("timer") DynamicTimer timer)
> {       timer.set("tag1'", ts);       timer.set("tag2", ts);     }
>    @OnTimer("timer") public void onTimer1(@TimerTag String tag) { .}
> }
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-8539) Clearly define the valid job state transitions

2019-11-01 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964981#comment-16964981
 ] 

Chad Dombrova commented on BEAM-8539:
-

I searched the java code for STOPPED and it seems to be used correctly 
everywhere except for possibly one spot in the spark runner:

[https://github.com/apache/beam/blob/master/runners/spark/src/main/java/org/apache/beam/runners/spark/SparkPipelineResult.java#L133]

It's hard to tell from the [spark 
docs|https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/api/java/JavaSparkContext.html]
 if {{JavaSparkContext.stop()}} is considered terminal (i.e. can it be 
resumed?).  Who would know for sure?
{quote}We should add documentation to the JobState enum in the Job API (e.g. 
state machine diagram).
{quote}
I'm happy to do this, but I'm not that familiar with the documentation 
generation, so I have a few questions:
 * Where should this go? I don't see the generated code for it in the python or 
java docs. Also, since Java and Python use different documentation generators, 
I'm not sure if the diagram can be universally rendered. If not there, then 
where? JobInvocation?
 * Can you give me an example of somewhere else in the code that is currently 
generating a diagram, so that I can see how it's done?
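
For reference, the per-language utility proposed in the description might look 
something like this in python (a sketch; the helper name is hypothetical, and 
the terminal set assumes the Java semantics described in this issue):

{code:python}
from apache_beam.portability.api import beam_job_api_pb2

_TERMINAL_STATES = frozenset([
    beam_job_api_pb2.JobState.DONE,
    beam_job_api_pb2.JobState.FAILED,
    beam_job_api_pb2.JobState.CANCELLED,
    beam_job_api_pb2.JobState.DRAINED,
])


def is_terminal(state):
  # type: (int) -> bool
  return state in _TERMINAL_STATES
{code}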

> Clearly define the valid job state transitions
> --
>
> Key: BEAM-8539
> URL: https://issues.apache.org/jira/browse/BEAM-8539
> Project: Beam
>  Issue Type: Improvement
>  Components: beam-model, runner-core, sdk-java-core, sdk-py-core
>Reporter: Chad Dombrova
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The Beam job state transitions are ill-defined, which is a big problem for 
> anything that relies on the values coming from JobAPI.GetStateStream.
> I was hoping to find something like a state transition diagram in the docs so 
> that I could determine the start state, the terminal states, and the valid 
> transitions, but I could not find this. The code reveals that the SDKs differ 
> on the fundamentals:
> Java InMemoryJobService:
>  * start state: *STOPPED*
>  * run - about to submit to executor:  STARTING
>  * run - actually running on executor:  RUNNING
>  * terminal states: DONE, FAILED, CANCELLED, DRAINED
> Python AbstractJobServiceServicer / LocalJobServicer:
>  * start state: STARTING
>  * terminal states: DONE, FAILED, CANCELLED, *STOPPED*
> I think it would be good to make python work like Java, so that there is a 
> difference in state between a job that has been prepared and one that has 
> additionally been run.
> It's hard to tell how far this problem has spread within the various runners. 
>  I think a simple thing that can be done to help standardize behavior is to 
> implement the terminal states as an enum in the beam_job_api.proto, or create 
> a utility function in each language for checking if a state is terminal, so 
> that it's not left up to each runner to reimplement this logic.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-8539) Clearly define the valid job state transitions

2019-10-31 Thread Chad Dombrova (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chad Dombrova updated BEAM-8539:

Description: 
The Beam job state transitions are ill-defined, which is a big problem for 
anything that relies on the values coming from JobAPI.GetStateStream.

I was hoping to find something like a state transition diagram in the docs so 
that I could determine the start state, the terminal states, and the valid 
transitions, but I could not find this. The code reveals that the SDKs differ 
on the fundamentals:

Java InMemoryJobService:
 * start state: *STOPPED*
 * run: about to submit to executor:  STARTING
 * run: actually running on executor:  RUNNING
 * terminal states: DONE, FAILED, CANCELLED, DRAINED

Python AbstractJobServiceServicer / LocalJobServicer:
 * start state: STARTING
 * terminal states: DONE, FAILED, CANCELLED, *STOPPED*

I think it would be good to make python work like Java, so that there is a 
difference in state between a job that has been prepared and one that has 
additionally been run.

It's hard to tell how far this problem has spread within the various runners.  
I think a simple thing that can be done to help standardize behavior is to 
implement the terminal states as an enum in the beam_job_api.proto, or create a 
utility function in each language for checking if a state is terminal, so that 
it's not left up to each runner to reimplement this logic.

 

  was:
The Beam job state transitions are ill-defined, which is a big problem for 
anything that relies on the values coming from JobAPI.GetStateStream.

I was hoping to find something like a state transition diagram in the docs so 
that I could determine the start state, the terminal states, and the valid 
transitions, but I could not find this. The code reveals that the SDKs differ 
on the fundamentals:

Java InMemoryJobService:
 * start state (prepared): STOPPED
 * about to submit to executor:  STARTING
 * actually running on executor:  RUNNING
 * terminal states: DONE, FAILED, CANCELLED, DRAINED

Python AbstractJobServiceServicer / LocalJobServicer:
 * start state: STARTING
 * terminal states: DONE, FAILED, CANCELLED, STOPPED

I think it would be good to make python work like Java, so that there is a 
difference in state between a job that has been prepared and one that has been 
run.

It's hard to tell how far this problem has spread within the various runners.  
I think a simple thing that can be done to help standardize behavior is to 
implement the terminal states as an enum in the beam_job_api.proto, or create a 
utility function in each language for checking if a state is terminal, so that 
it's not left up to each runner to reimplement this logic.

 


> Clearly define the valid job state transitions
> --
>
> Key: BEAM-8539
> URL: https://issues.apache.org/jira/browse/BEAM-8539
> Project: Beam
>  Issue Type: Improvement
>  Components: beam-model, runner-core, sdk-java-core, sdk-py-core
>Reporter: Chad Dombrova
>Priority: Major
>
> The Beam job state transitions are ill-defined, which is a big problem for 
> anything that relies on the values coming from JobAPI.GetStateStream.
> I was hoping to find something like a state transition diagram in the docs so 
> that I could determine the start state, the terminal states, and the valid 
> transitions, but I could not find this. The code reveals that the SDKs differ 
> on the fundamentals:
> Java InMemoryJobService:
>  * start state: *STOPPED*
>  * run: about to submit to executor:  STARTING
>  * run: actually running on executor:  RUNNING
>  * terminal states: DONE, FAILED, CANCELLED, DRAINED
> Python AbstractJobServiceServicer / LocalJobServicer:
>  * start state: STARTING
>  * terminal states: DONE, FAILED, CANCELLED, *STOPPED*
> I think it would be good to make python work like Java, so that there is a 
> difference in state between a job that has been prepared and one that has 
> additionally been run.
> It's hard to tell how far this problem has spread within the various runners. 
>  I think a simple thing that can be done to help standardize behavior is to 
> implement the terminal states as an enum in the beam_job_api.proto, or create 
> a utility function in each language for checking if a state is terminal, so 
> that it's not left up to each runner to reimplement this logic.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-8539) Clearly define the valid job state transitions

2019-10-31 Thread Chad Dombrova (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chad Dombrova updated BEAM-8539:

Description: 
The Beam job state transitions are ill-defined, which is a big problem for 
anything that relies on the values coming from JobAPI.GetStateStream.

I was hoping to find something like a state transition diagram in the docs so 
that I could determine the start state, the terminal states, and the valid 
transitions, but I could not find this. The code reveals that the SDKs differ 
on the fundamentals:

Java InMemoryJobService:
 * start state: *STOPPED*
 * run - about to submit to executor:  STARTING
 * run - actually running on executor:  RUNNING
 * terminal states: DONE, FAILED, CANCELLED, DRAINED

Python AbstractJobServiceServicer / LocalJobServicer:
 * start state: STARTING
 * terminal states: DONE, FAILED, CANCELLED, *STOPPED*

I think it would be good to make python work like Java, so that there is a 
difference in state between a job that has been prepared and one that has 
additionally been run.

It's hard to tell how far this problem has spread within the various runners.  
I think a simple thing that can be done to help standardize behavior is to 
implement the terminal states as an enum in the beam_job_api.proto, or create a 
utility function in each language for checking if a state is terminal, so that 
it's not left up to each runner to reimplement this logic.

 

  was:
The Beam job state transitions are ill-defined, which is a big problem for 
anything that relies on the values coming from JobAPI.GetStateStream.

I was hoping to find something like a state transition diagram in the docs so 
that I could determine the start state, the terminal states, and the valid 
transitions, but I could not find this. The code reveals that the SDKs differ 
on the fundamentals:

Java InMemoryJobService:
 * start state: *STOPPED*
 * run: about to submit to executor:  STARTING
 * run: actually running on executor:  RUNNING
 * terminal states: DONE, FAILED, CANCELLED, DRAINED

Python AbstractJobServiceServicer / LocalJobServicer:
 * start state: STARTING
 * terminal states: DONE, FAILED, CANCELLED, *STOPPED*

I think it would be good to make python work like Java, so that there is a 
difference in state between a job that has been prepared and one that has 
additionally been run.

It's hard to tell how far this problem has spread within the various runners.  
I think a simple thing that can be done to help standardize behavior is to 
implement the terminal states as an enum in the beam_job_api.proto, or create a 
utility function in each language for checking if a state is terminal, so that 
it's not left up to each runner to reimplement this logic.

 


> Clearly define the valid job state transitions
> --
>
> Key: BEAM-8539
> URL: https://issues.apache.org/jira/browse/BEAM-8539
> Project: Beam
>  Issue Type: Improvement
>  Components: beam-model, runner-core, sdk-java-core, sdk-py-core
>Reporter: Chad Dombrova
>Priority: Major
>
> The Beam job state transitions are ill-defined, which is a big problem for 
> anything that relies on the values coming from JobAPI.GetStateStream.
> I was hoping to find something like a state transition diagram in the docs so 
> that I could determine the start state, the terminal states, and the valid 
> transitions, but I could not find this. The code reveals that the SDKs differ 
> on the fundamentals:
> Java InMemoryJobService:
>  * start state: *STOPPED*
>  * run - about to submit to executor:  STARTING
>  * run - actually running on executor:  RUNNING
>  * terminal states: DONE, FAILED, CANCELLED, DRAINED
> Python AbstractJobServiceServicer / LocalJobServicer:
>  * start state: STARTING
>  * terminal states: DONE, FAILED, CANCELLED, *STOPPED*
> I think it would be good to make python work like Java, so that there is a 
> difference in state between a job that has been prepared and one that has 
> additionally been run.
> It's hard to tell how far this problem has spread within the various runners. 
>  I think a simple thing that can be done to help standardize behavior is to 
> implement the terminal states as an enum in the beam_job_api.proto, or create 
> a utility function in each language for checking if a state is terminal, so 
> that it's not left up to each runner to reimplement this logic.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-8539) Clearly define the valid job state transitions

2019-10-31 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964580#comment-16964580
 ] 

Chad Dombrova commented on BEAM-8539:
-

[~lcwik], [~mxm] you might be interested in this.

> Clearly define the valid job state transitions
> --
>
> Key: BEAM-8539
> URL: https://issues.apache.org/jira/browse/BEAM-8539
> Project: Beam
>  Issue Type: Improvement
>  Components: beam-model, runner-core, sdk-java-core, sdk-py-core
>Reporter: Chad Dombrova
>Priority: Major
>
> The Beam job state transitions are ill-defined, which is a big problem for 
> anything that relies on the values coming from JobAPI.GetStateStream.
> I was hoping to find something like a state transition diagram in the docs so 
> that I could determine the start state, the terminal states, and the valid 
> transitions, but I could not find this. The code reveals that the SDKs differ 
> on the fundamentals:
> Java InMemoryJobService:
>  * start state: *STOPPED*
>  * run: about to submit to executor:  STARTING
>  * run: actually running on executor:  RUNNING
>  * terminal states: DONE, FAILED, CANCELLED, DRAINED
> Python AbstractJobServiceServicer / LocalJobServicer:
>  * start state: STARTING
>  * terminal states: DONE, FAILED, CANCELLED, *STOPPED*
> I think it would be good to make python work like Java, so that there is a 
> difference in state between a job that has been prepared and one that has 
> additionally been run.
> It's hard to tell how far this problem has spread within the various runners. 
>  I think a simple thing that can be done to help standardize behavior is to 
> implement the terminal states as an enum in the beam_job_api.proto, or create 
> a utility function in each language for checking if a state is terminal, so 
> that it's not left up to each runner to reimplement this logic.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-8539) Clearly define the valid job state transitions

2019-10-31 Thread Chad Dombrova (Jira)
Chad Dombrova created BEAM-8539:
---

 Summary: Clearly define the valid job state transitions
 Key: BEAM-8539
 URL: https://issues.apache.org/jira/browse/BEAM-8539
 Project: Beam
  Issue Type: Improvement
  Components: beam-model, runner-core, sdk-java-core, sdk-py-core
Reporter: Chad Dombrova


The Beam job state transitions are ill-defined, which is a big problem for 
anything that relies on the values coming from JobAPI.GetStateStream.

I was hoping to find something like a state transition diagram in the docs so 
that I could determine the start state, the terminal states, and the valid 
transitions, but I could not find this. The code reveals that the SDKs differ 
on the fundamentals:

Java InMemoryJobService:
 * start state (prepared): STOPPED
 * about to submit to executor:  STARTING
 * actually running on executor:  RUNNING
 * terminal states: DONE, FAILED, CANCELLED, DRAINED

Python AbstractJobServiceServicer / LocalJobServicer:
 * start state: STARTING
 * terminal states: DONE, FAILED, CANCELLED, STOPPED

I think it would be good to make python work like Java, so that there is a 
difference in state between a job that has been prepared and one that has been 
run.

It's hard to tell how far this problem has spread within the various runners.  
I think a simple thing that can be done to help standardize behavior is to 
implement the terminal states as an enum in the beam_job_api.proto, or create a 
utility function in each language for checking if a state is terminal, so that 
it's not left up to each runner to reimplement this logic.
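
As a rough illustration, the per-language utility could be as small as the 
hedged Python sketch below (it uses the Java service's terminal set; whether 
STOPPED belongs in that set is exactly the ambiguity this issue describes):

{code:python}
from apache_beam.portability.api import beam_job_api_pb2

TERMINAL_STATES = frozenset([
    beam_job_api_pb2.JobState.DONE,
    beam_job_api_pb2.JobState.FAILED,
    beam_job_api_pb2.JobState.CANCELLED,
    beam_job_api_pb2.JobState.DRAINED,
])

def is_terminal(state):
  # True once a job can no longer change state.
  return state in TERMINAL_STATES
{code}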

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-8523) Add useful timestamp to job servicer GetJobs

2019-10-31 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964406#comment-16964406
 ] 

Chad Dombrova commented on BEAM-8523:
-

WIP: https://github.com/apache/beam/pull/9959

> Add useful timestamp to job servicer GetJobs
> 
>
> Key: BEAM-8523
> URL: https://issues.apache.org/jira/browse/BEAM-8523
> Project: Beam
>  Issue Type: New Feature
>  Components: beam-model
>Reporter: Chad Dombrova
>Assignee: Chad Dombrova
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> As a user querying jobs with JobService.GetJobs, it would be useful if the 
> JobInfo result contained timestamps indicating various state changes that may 
> have been missed by a client.   Useful timestamps include:
>  
>  * submitted (prepared to the job service)
>  * started (executor enters the RUNNING state)
>  * completed (executor enters a terminal state)
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (BEAM-8523) Add useful timestamp to job servicer GetJobs

2019-10-31 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964223#comment-16964223
 ] 

Chad Dombrova edited comment on BEAM-8523 at 10/31/19 5:22 PM:
---

Looking at this a little more, it seems it would be pretty easy to stream out 
the past state history from GetMessageStream after the initial request is made. 
This would make it easier on clients, since they would not have to concern 
themselves with ordering and race conditions between calls to a hypothetical 
GetMessageHistory and GetMessageStream (i.e. what happens if a new event 
arrives between calls?  If GetMessageHistory is called first we get the new 
event in both results; in the other order we miss the event).  
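
To make that race concrete, here is a minimal sketch of the single-endpoint 
pattern in plain Python (not Beam's actual servicer API; all names are 
illustrative): the history snapshot and the live subscription happen 
atomically under one lock, so a late subscriber neither misses nor 
double-receives an event.

{code:python}
import threading
import queue  # 'Queue' on python 2

class MessageStream(object):
  def __init__(self):
    self._lock = threading.Lock()
    self._history = []      # every message published so far
    self._subscribers = []  # one live queue per open stream

  def publish(self, msg):
    with self._lock:
      self._history.append(msg)
      for q in self._subscribers:
        q.put(msg)

  def get_message_stream(self):
    q = queue.Queue()
    with self._lock:
      # Snapshot the history and register for live messages in one
      # atomic step, so no event is missed or duplicated in between.
      backlog = list(self._history)
      self._subscribers.append(q)
    for msg in backlog:
      yield msg
    while True:
      yield q.get()
{code}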

For tracking the state transition events on the JobInvocation, what's the 
preferred object for 2-tuples?   I noticed javafx.Pair is not used anywhere in 
the Beam code.  Should I use beam.sdk.values.KV?




was (Author: chadrik):
Looking at this a little more, it seems it would be pretty easy to stream out 
the past state history from GetMessageStream after the connection is made.  
This would make it easier on clients, since they would not have to concern 
themselves with ordering and race conditions between calls to a hypothetical 
GetMessageHistory and GetMessageStream (i.e. what happens if a new event 
arrives between calls?  If GetMessageHistory is called first we get the new 
event in both results; in the other order we miss the event).  

For tracking the state transition events on the JobInvocation, what's the 
preferred object for 2-tuples?   I noticed javafx.Pair is not used anywhere in 
the Beam code.  Should I use beam.sdk.values.KV?



> Add useful timestamp to job servicer GetJobs
> 
>
> Key: BEAM-8523
> URL: https://issues.apache.org/jira/browse/BEAM-8523
> Project: Beam
>  Issue Type: New Feature
>  Components: beam-model
>Reporter: Chad Dombrova
>Assignee: Chad Dombrova
>Priority: Major
>
> As a user querying jobs with JobService.GetJobs, it would be useful if the 
> JobInfo result contained timestamps indicating various state changes that may 
> have been missed by a client.   Useful timestamps include:
>  
>  * submitted (prepared to the job service)
>  * started (executor enters the RUNNING state)
>  * completed (executor enters a terminal state)
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-8523) Add useful timestamp to job servicer GetJobs

2019-10-31 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964223#comment-16964223
 ] 

Chad Dombrova commented on BEAM-8523:
-

Looking at this a little more, it seems it would be pretty easy to stream out 
the past state history from GetMessageStream after the connection is made.  
This would make it easier on clients, since they would not have to concern 
themselves with ordering and race conditions between calls to a hypothetical 
GetMessageHistory and GetMessageStream (i.e. what happens if a new event 
arrives between calls?  If GetMessageHistory is called first we get the new 
event in both results; in the other order we miss the event).  

For tracking the state transition events on the JobInvocation, what's the 
preferred object for 2-tuples?   I noticed javafx.Pair is not used anywhere in 
the Beam code.  Should I use beam.sdk.values.KV?



> Add useful timestamp to job servicer GetJobs
> 
>
> Key: BEAM-8523
> URL: https://issues.apache.org/jira/browse/BEAM-8523
> Project: Beam
>  Issue Type: New Feature
>  Components: beam-model
>Reporter: Chad Dombrova
>Assignee: Chad Dombrova
>Priority: Major
>
> As a user querying jobs with JobService.GetJobs, it would be useful if the 
> JobInfo result contained timestamps indicating various state changes that may 
> have been missed by a client.   Useful timestamps include:
>  
>  * submitted (prepared to the job service)
>  * started (executor enters the RUNNING state)
>  * completed (executor enters a terminal state)
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-8523) Add useful timestamp to job servicer GetJobs

2019-10-30 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16963645#comment-16963645
 ] 

Chad Dombrova commented on BEAM-8523:
-

{quote}
Seems to make sense but we might as well just have all the job states have 
timestamps and just have a list of state transitions returned as part of the 
job info. The last element within the list would be the "latest" state.
{quote}

That's an interesting idea.  In that case, I think it might make sense to have 
a GetStateHistory method.  This could be a pattern that we extend to 
GetMessageStream as well.  We're currently working on creating a beam log 
service, e.g. backed by StackDriver, so GetMessageHistory could fit in really 
well there.

That makes me wonder, is there an established gRPC pattern for getting stream 
history?  For example, is there a way for a server to automatically backfill 
the missed messages when a late stream connection is established?  Or is it 
better to keep these as two separate endpoints?
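
Either way, once a list of (state, timestamp) transitions exists, deriving 
the timestamps this issue asks for is mechanical.  A hedged sketch (the 
function and constants are illustrative, not a Beam API):

{code:python}
TERMINAL_STATES = frozenset(['DONE', 'FAILED', 'CANCELLED', 'DRAINED'])

def summarize(transitions):
  """transitions: list of (state_name, timestamp) pairs, oldest first."""
  submitted = transitions[0][1] if transitions else None
  started = next(
      (ts for state, ts in transitions if state == 'RUNNING'), None)
  completed = next(
      (ts for state, ts in transitions if state in TERMINAL_STATES), None)
  return submitted, started, completed
{code}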


> Add useful timestamp to job servicer GetJobs
> 
>
> Key: BEAM-8523
> URL: https://issues.apache.org/jira/browse/BEAM-8523
> Project: Beam
>  Issue Type: New Feature
>  Components: beam-model
>Reporter: Chad Dombrova
>Assignee: Chad Dombrova
>Priority: Major
>
> As a user querying jobs with JobService.GetJobs, it would be useful if the 
> JobInfo result contained timestamps indicating various state changes that may 
> have been missed by a client.   Useful timestamps include:
>  
>  * submitted (prepared to the job service)
>  * started (executor enters the RUNNING state)
>  * completed (executor enters a terminal state)
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-8523) Add useful timestamp to job servicer GetJobs

2019-10-30 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16963522#comment-16963522
 ] 

Chad Dombrova commented on BEAM-8523:
-

[~lcwik] does this change sound reasonable to you?

> Add useful timestamp to job servicer GetJobs
> 
>
> Key: BEAM-8523
> URL: https://issues.apache.org/jira/browse/BEAM-8523
> Project: Beam
>  Issue Type: New Feature
>  Components: beam-model
>Reporter: Chad Dombrova
>Assignee: Chad Dombrova
>Priority: Major
>
> As a user querying jobs with JobService.GetJobs, it would be useful if the 
> JobInfo result contained timestamps indicating various state changes that may 
> have been missed by a client.   Useful timestamps include:
>  
>  * submitted (prepared to the job service)
>  * started (executor enters the RUNNING state)
>  * completed (executor enters a terminal state)
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-8523) Add useful timestamp to job servicer GetJobs

2019-10-30 Thread Chad Dombrova (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chad Dombrova updated BEAM-8523:

Description: 
As a user querying jobs with JobService.GetJobs, it would be useful if the 
JobInfo result contained timestamps indicating various state changes that may 
have been missed by a client.   Useful timestamps include:

 
 * submitted (prepared to the job service)
 * started (executor enters the RUNNING state)
 * completed (executor enters a terminal state)

 

 

  was:
As a user querying jobs with JobService.GetJobs, it would be useful if the 
JobInfo result contained timestamps indicating various state changes that may 
have been missed by a client.   Useful event timestamps include:

 
 * submitted (prepared to the job service)
 * started (executor enters the RUNNING state)
 * completed (executor enters a terminal state)

 

 


> Add useful timestamp to job servicer GetJobs
> 
>
> Key: BEAM-8523
> URL: https://issues.apache.org/jira/browse/BEAM-8523
> Project: Beam
>  Issue Type: New Feature
>  Components: beam-model
>Reporter: Chad Dombrova
>Assignee: Chad Dombrova
>Priority: Major
>
> As a user querying jobs with JobService.GetJobs, it would be useful if the 
> JobInfo result contained timestamps indicating various state changes that may 
> have been missed by a client.   Useful timestamps include:
>  
>  * submitted (prepared to the job service)
>  * started (executor enters the RUNNING state)
>  * completed (executor enters a terminal state)
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-8523) Add useful timestamp to job servicer GetJobs

2019-10-30 Thread Chad Dombrova (Jira)
Chad Dombrova created BEAM-8523:
---

 Summary: Add useful timestamp to job servicer GetJobs
 Key: BEAM-8523
 URL: https://issues.apache.org/jira/browse/BEAM-8523
 Project: Beam
  Issue Type: New Feature
  Components: beam-model
Reporter: Chad Dombrova
Assignee: Chad Dombrova


As a user querying jobs with JobService.GetJobs, it would be useful if the 
JobInfo result contained timestamps indicating various state changes that may 
have been missed by a client.   Useful event timestamps include:

 
 * submitted (prepared to the job service)
 * started (executor enters the RUNNING state)
 * completed (executor enters a terminal state)

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-8411) Re-enable pylint checks disabled due to pylint bug

2019-10-30 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16963296#comment-16963296
 ] 

Chad Dombrova commented on BEAM-8411:
-

No, this was introduced with 2.4.2.  We need to switch to 2.4.3+.


> Re-enable pylint checks disabled due to pylint bug
> --
>
> Key: BEAM-8411
> URL: https://issues.apache.org/jira/browse/BEAM-8411
> Project: Beam
>  Issue Type: Task
>  Components: sdk-py-core
>Reporter: Chad Dombrova
>Priority: Minor
>  Labels: beginner
>
> The pylint github issue is here:  https://github.com/PyCQA/pylint/issues/3152
> Once that's fixed, remove the exclusions in setup.py and 
> apache_beam/examples/complete/juliaset/setup.py



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-8486) Reference to ParallelBundleManager attr "_skip_registration" should be "_registered"

2019-10-25 Thread Chad Dombrova (Jira)
Chad Dombrova created BEAM-8486:
---

 Summary: Reference to ParallelBundleManager attr 
"_skip_registration" should be "_registered"
 Key: BEAM-8486
 URL: https://issues.apache.org/jira/browse/BEAM-8486
 Project: Beam
  Issue Type: Bug
  Components: sdk-py-core
Reporter: Chad Dombrova


It appears there was a mistake in the multiprocess refactor of the 
BundleManager.  I came across this issue while getting mypy type analysis set up 
(score one for static type checking!)
 
The offending line is here: 
https://github.com/apache/beam/blob/8d2997f8d7ad84649b8ecb2f7e2ca2eceb91b6d0/sdks/python/apache_beam/runners/portability/fn_api_runner.py#L933

I think it should be using "_registered", as seen here: 
https://github.com/apache/beam/blob/8d2997f8d7ad84649b8ecb2f7e2ca2eceb91b6d0/sdks/python/apache_beam/runners/portability/fn_api_runner.py#L1909

I can easily change the attribute name, but presumably this will change the 
behavior.  Hopefully for the better, but I don't know enough to say.  





--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-7870) Externally configured KafkaIO / PubsubIO consumer causes coder problems

2019-10-24 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-7870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16959135#comment-16959135
 ] 

Chad Dombrova commented on BEAM-7870:
-


The plan taking shape is to use DefaultSchema with either JavaBeanSchema or 
JavaFieldSchema, whichever works with fewer changes to PubsubMessage and 
KafkaRecord.

Then on the python side, we do something like this:

{code:python}
# this is py3 syntax for clarity, but we'd probably
# need to use the TypedDict('PubsubMessage', ...) version
from typing import ByteString, Mapping
from typing_extensions import TypedDict  # in stdlib typing from py3.8

class PubsubMessage(TypedDict):
   message: ByteString
   attributes: Mapping[str, str]  # py2: unicode
   messageId: str                 # py2: unicode

coders.registry.register_coder(PubsubMessage, coders.RowCoder)

pcoll
  | 'make some messages' >>
      beam.Map(makeMessage).with_output_types(PubsubMessage)
  | 'write to pubsub' >> beam.io.WriteToPubsub(project, topic)  # or something
{code}




> Externally configured KafkaIO / PubsubIO consumer causes coder problems
> ---
>
> Key: BEAM-7870
> URL: https://issues.apache.org/jira/browse/BEAM-7870
> Project: Beam
>  Issue Type: Bug
>  Components: runner-flink, sdk-java-core
>Reporter: Maximilian Michels
>Assignee: Maximilian Michels
>Priority: Major
>
> There are limitations for the consumer to work correctly. The biggest issue 
> is the structure of KafkaIO itself, which uses a combination of the source 
> interface and DoFns to generate the desired output. The problem is that the 
> source interface is natively translated by the Flink Runner to support 
> unbounded sources in portability, while the DoFn runs in a Java environment.
> To transfer data between the two, a coder needs to be involved. It happens to 
> be that the initial read does not immediately drop the KafkaRecord structure, 
> which does not work well with our current assumption of only 
> supporting "standard coders" present in all SDKs. Only the subsequent DoFn 
> converts the KafkaRecord structure into a raw KV[byte, byte], but the DoFn 
> won't have the coder available in its environment.
> There are several possible solutions:
>  1. Make the DoFn which drops the KafkaRecordCoder a native Java transform in 
> the Flink Runner
>  2. Modify KafkaIO to immediately drop the KafkaRecord structure
>  3. Add the KafkaRecordCoder to all SDKs
>  4. Add a generic coder, e.g. AvroCoder to all SDKs
> For a workaround which uses (3), please see this patch which is not a proper 
> fix but adds KafkaRecordCoder to the SDK such that it can be used 
> encode/decode records: 
> [https://github.com/mxm/beam/commit/b31cf99c75b3972018180d8ccc7e73d311f4cfed]
>  
> See also 
> [https://github.com/apache/beam/pull/8251]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-7870) Externally configured KafkaIO / PubsubIO consumer causes coder problems

2019-10-24 Thread Chad Dombrova (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chad Dombrova updated BEAM-7870:

Summary: Externally configured KafkaIO / PubsubIO consumer causes coder 
problems  (was: Externally configured KafkaIO consumer causes coder problems)

> Externally configured KafkaIO / PubsubIO consumer causes coder problems
> ---
>
> Key: BEAM-7870
> URL: https://issues.apache.org/jira/browse/BEAM-7870
> Project: Beam
>  Issue Type: Bug
>  Components: runner-flink, sdk-java-core
>Reporter: Maximilian Michels
>Assignee: Maximilian Michels
>Priority: Major
>
> There are limitations for the consumer to work correctly. The biggest issue 
> is the structure of KafkaIO itself, which uses a combination of the source 
> interface and DoFns to generate the desired output. The problem is that the 
> source interface is natively translated by the Flink Runner to support 
> unbounded sources in portability, while the DoFn runs in a Java environment.
> To transfer data between the two, a coder needs to be involved. It happens to 
> be that the initial read does not immediately drop the KafkaRecord structure, 
> which does not work well with our current assumption of only 
> supporting "standard coders" present in all SDKs. Only the subsequent DoFn 
> converts the KafkaRecord structure into a raw KV[byte, byte], but the DoFn 
> won't have the coder available in its environment.
> There are several possible solutions:
>  1. Make the DoFn which drops the KafkaRecordCoder a native Java transform in 
> the Flink Runner
>  2. Modify KafkaIO to immediately drop the KafkaRecord structure
>  3. Add the KafkaRecordCoder to all SDKs
>  4. Add a generic coder, e.g. AvroCoder to all SDKs
> For a workaround which uses (3), please see this patch which is not a proper 
> fix but adds KafkaRecordCoder to the SDK such that it can be used 
> encode/decode records: 
> [https://github.com/mxm/beam/commit/b31cf99c75b3972018180d8ccc7e73d311f4cfed]
>  
> See also 
> [https://github.com/apache/beam/pull/8251]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-8350) Upgrade to pylint 2.4

2019-10-23 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958027#comment-16958027
 ] 

Chad Dombrova commented on BEAM-8350:
-

done!

> Upgrade to pylint 2.4
> -
>
> Key: BEAM-8350
> URL: https://issues.apache.org/jira/browse/BEAM-8350
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py-core
>Reporter: Chad Dombrova
>Assignee: Chad Dombrova
>Priority: Major
>  Time Spent: 8h 50m
>  Remaining Estimate: 0h
>
> pylint 2.4 provides a number of new features and fixes, but the most 
> important/pressing one for me is that 2.4 adds support for understanding 
> python type annotations, which fixes a bunch of spurious unused import errors 
> in the PR I'm working on for BEAM-7746.
> As of 2.0, pylint dropped support for running tests in python2, so to make 
> the upgrade we have to move our lint jobs to python3.  Doing so will put 
> pylint into "python3-mode" and there is not an option to run in 
> python2-compatible mode.  That said, the beam code is intended to be python3 
> compatible, so in practice, performing a python3 lint on the Beam code-base 
> is perfectly safe.  The primary risk of doing this is that someone introduces 
> a python-3 only change that breaks python2, but these would largely be syntax 
> errors that would be immediately caught by the unit and integration tests.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-8411) Re-enable pylint checks disabled due to pylint bug

2019-10-16 Thread Chad Dombrova (Jira)
Chad Dombrova created BEAM-8411:
---

 Summary: Re-enable pylint checks disabled due to pylint bug
 Key: BEAM-8411
 URL: https://issues.apache.org/jira/browse/BEAM-8411
 Project: Beam
  Issue Type: Task
  Components: sdk-py-core
Reporter: Chad Dombrova


The pylint github issue is here:  https://github.com/PyCQA/pylint/issues/3152

Once that's fixed, remove the exclusions in setup.py and 
apache_beam/examples/complete/juliaset/setup.py




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-8172) Add a label field to FunctionSpec proto

2019-10-15 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16952312#comment-16952312
 ] 

Chad Dombrova commented on BEAM-8172:
-

The goal of clarifying intent is a fine one, but I'm concerned that 
"debug_string" could result in developers using this field for large text 
blocks like stack traces, when the intent is a brief descriptor, like a 
human-readable ID.  

Other ideas: hint, descriptor, debug_label, debug_identifier.
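
Whatever the field ends up being called, the dotted-path case from the 
description could be derived mechanically.  A hedged sketch (helper name 
hypothetical):

{code:python}
def payload_label(payload):
  # Dotted path of the payload's class, e.g. 'mymodule.MyClass'.
  cls = type(payload)
  return '%s.%s' % (cls.__module__, cls.__name__)
{code}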

> Add a label field to FunctionSpec proto
> ---
>
> Key: BEAM-8172
> URL: https://issues.apache.org/jira/browse/BEAM-8172
> Project: Beam
>  Issue Type: Improvement
>  Components: beam-model
>Reporter: Chad Dombrova
>Assignee: Chad Dombrova
>Priority: Major
>
> The FunctionSpec payload is opaque outside of the native environment, which 
> can make debugging pipeline protos difficult.  It would be very useful if the 
> FunctionSpec had an optional human readable "label", for debugging and 
> presenting in UIs and error messages.
> For example, in python, if the payload is an instance of a class, we could 
> attempt to provide a string that represents the dotted path to that class, 
> "mymodule.MyClass".  In the case of coders, we could use the label to hold 
> the type hint:  "Optional[Iterable[mymodule.MyClass]]".



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-8402) Create a class hierarchy to represent environments

2019-10-14 Thread Chad Dombrova (Jira)
Chad Dombrova created BEAM-8402:
---

 Summary: Create a class hierarchy to represent environments
 Key: BEAM-8402
 URL: https://issues.apache.org/jira/browse/BEAM-8402
 Project: Beam
  Issue Type: New Feature
  Components: sdk-py-core
Reporter: Chad Dombrova
Assignee: Chad Dombrova


As a first step towards making it possible to assign different environments to 
sections of a pipeline, we first need to expose environment classes to the 
pipeline API.  Unlike PTransforms, PCollections, Coders, and Windowings,  
environments exist solely in the portability framework as protobuf objects.   
By creating a hierarchy of "native" classes that represent the various 
environment types -- external, docker, process, etc -- users will be able to 
instantiate these and assign them to parts of the pipeline.  The assignment 
portion will be covered in a follow-up issue/PR.
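
A hedged sketch of what that hierarchy might look like (class and field names 
are illustrative, loosely mirroring the docker, process, and external 
environment types in the portability protos):

{code:python}
class Environment(object):
  """Base class; each subclass maps to a portability proto environment."""

class DockerEnvironment(Environment):
  def __init__(self, container_image):
    self.container_image = container_image

class ProcessEnvironment(Environment):
  def __init__(self, command, env=None):
    self.command = command
    self.env = env or {}

class ExternalEnvironment(Environment):
  def __init__(self, url):
    self.url = url
{code}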





--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-8355) Make BooleanCoder a standard coder

2019-10-05 Thread Chad Dombrova (Jira)
Chad Dombrova created BEAM-8355:
---

 Summary: Make BooleanCoder a standard coder
 Key: BEAM-8355
 URL: https://issues.apache.org/jira/browse/BEAM-8355
 Project: Beam
  Issue Type: New Feature
  Components: beam-model, sdk-java-core, sdk-py-core
Reporter: Chad Dombrova
Assignee: Chad Dombrova


This involves making the current java BooleanCoder a standard coder, and 
implementing an equivalent coder in python
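
The python half is small on the wire.  A hedged sketch (not the actual 
implementation), assuming java's BooleanCoder writes a single byte, 0x01 for 
true and 0x00 for false:

{code:python}
class BooleanCoder(object):
  """Minimal sketch: a bool is one byte on the wire."""

  def encode(self, value):
    return b'\x01' if value else b'\x00'

  def decode(self, encoded):
    return encoded == b'\x01'
{code}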



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-8350) Upgrade to pylint 2.4

2019-10-03 Thread Chad Dombrova (Jira)
Chad Dombrova created BEAM-8350:
---

 Summary: Upgrade to pylint 2.4
 Key: BEAM-8350
 URL: https://issues.apache.org/jira/browse/BEAM-8350
 Project: Beam
  Issue Type: Improvement
  Components: sdk-py-core
Reporter: Chad Dombrova
Assignee: Chad Dombrova


pylint 2.4 provides a number of new features and fixes, but the most 
important/pressing one for me is that 2.4 adds support for understanding python 
type annotations, which fixes a bunch of spurious unused import errors in the 
PR I'm working on for BEAM-7746.

As of 2.0, pylint dropped support for running tests in python2, so to make the 
upgrade we have to move our lint jobs to python3.  Doing so will put pylint 
into "python3-mode" and there is not an option to run in python2-compatible 
mode.  That said, the beam code is intended to be python3 compatible, so in 
practice, performing a python3 lint on the Beam code-base is perfectly safe.  
The primary risk of doing this is that someone introduces a python-3 only 
change that breaks python2, but these would largely be syntax errors that would 
be immediately caught by the unit and integration tests.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-7768) Formalize Python external transform API

2019-09-17 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932021#comment-16932021
 ] 

Chad Dombrova commented on BEAM-7768:
-

This is resolved with https://github.com/apache/beam/pull/9098

> Formalize Python external transform API
> ---
>
> Key: BEAM-7768
> URL: https://issues.apache.org/jira/browse/BEAM-7768
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-py-core
>Reporter: Chad Dombrova
>Priority: Major
>
> As a python developer I'd like to have a more opinionated API for external 
> transforms to A) guide me toward more standardized solutions and B) reduce 
> the amount of boilerplate code that needs to be written. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-7984) [python] The coder returned for typehints.List should be IterableCoder

2019-09-17 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-7984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932020#comment-16932020
 ] 

Chad Dombrova commented on BEAM-7984:
-

this is resolved

> [python] The coder returned for typehints.List should be IterableCoder
> --
>
> Key: BEAM-7984
> URL: https://issues.apache.org/jira/browse/BEAM-7984
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py-core
>Reporter: Chad Dombrova
>Assignee: Chad Dombrova
>Priority: Major
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> IterableCoder encodes a list and decodes to a list, but 
> typecoders.registry.get_coder(typehints.List[bytes]) returns a 
> FastPrimitivesCoder.  I don't see any reason why this would be advantageous. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-8238) Improve python gen_protos script

2019-09-17 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932018#comment-16932018
 ] 

Chad Dombrova commented on BEAM-8238:
-

this is resolved

> Improve python gen_protos script
> 
>
> Key: BEAM-8238
> URL: https://issues.apache.org/jira/browse/BEAM-8238
> Project: Beam
>  Issue Type: Improvement
>  Components: build-system, sdk-py-core
>Reporter: Chad Dombrova
>Assignee: Chad Dombrova
>Priority: Major
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Issues that need to be solved:
> - doesn't print output to shell when run during `python setup.py`
> - runs futurize twice when grpcio-tools is not present and needs to be 
> installed
> - does not account for the case where there are more output files than .proto 
> files (just keeps triggering out-of-date rebuilds due to timestamp of 
> orphaned output file)
> - plus it would be nice for debugging if the script printed why it is 
> regenerating pb2 files, because there are a number of possible reasons



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-8148) [python] allow tests to be specified when using tox

2019-09-17 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932019#comment-16932019
 ] 

Chad Dombrova commented on BEAM-8148:
-

this is resolved

> [python] allow tests to be specified when using tox
> ---
>
> Key: BEAM-8148
> URL: https://issues.apache.org/jira/browse/BEAM-8148
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py-core
>Reporter: Chad Dombrova
>Assignee: Chad Dombrova
>Priority: Major
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> I don't know how to specify individual tests using gradle, and as a python 
> developer, it's way too opaque for me to figure out on my own.  
> The developer wiki suggest calling the tests directly:
> {noformat}
> python setup.py nosetests --tests :.
> {noformat}
> But this defeats the purpose of using tox, which is there to help users 
> manage the myriad virtual envs required to run the different tests.
> Luckily there is an easier way!  You just add {posargs} to the tox commands.  
> PR is coming shortly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-8271) StateGetRequest/Response continuation_token should be string

2019-09-17 Thread Chad Dombrova (Jira)
Chad Dombrova created BEAM-8271:
---

 Summary: StateGetRequest/Response continuation_token should be 
string
 Key: BEAM-8271
 URL: https://issues.apache.org/jira/browse/BEAM-8271
 Project: Beam
  Issue Type: Improvement
  Components: beam-model
Reporter: Chad Dombrova
Assignee: Chad Dombrova


I've been working on adding typing to the python code and I came across a 
discrepancy regarding the type of the continuation token.  The .proto 
defines it as bytes, but the code treats it as a string (i.e. unicode):

 
{code:java}
// A request to get state.
message StateGetRequest {
  // (Optional) If specified, signals to the runner that the response
  // should resume from the following continuation token.
  //
  // If unspecified, signals to the runner that the response should start
  // from the beginning of the logical continuable stream.
  bytes continuation_token = 1;
}

// A response to get state representing a logical byte stream which can be
// continued using the state API.
message StateGetResponse {
  // (Optional) If specified, represents a token which can be used with the
  // state API to get the next chunk of this logical byte stream. The end of
  // the logical byte stream is signalled by this field being unset.
  bytes continuation_token = 1;

  // Represents a part of a logical byte stream. Elements within
  // the logical byte stream are encoded in the nested context and
  // concatenated together.
  bytes data = 2;
} 
{code}
From FnApiRunner.StateServicer:
{code:python}
def blocking_get(self, state_key, continuation_token=None):
  with self._lock:
full_state = self._state[self._to_key(state_key)]
if self._use_continuation_tokens:
  # The token is "nonce:index".
  if not continuation_token:
token_base = 'token_%x' % len(self._continuations)
self._continuations[token_base] = tuple(full_state)
return b'', '%s:0' % token_base
  else:
token_base, index = continuation_token.split(':')
ix = int(index)
full_state = self._continuations[token_base]
if ix == len(full_state):
  return b'', None
else:
  return full_state[ix], '%s:%d' % (token_base, ix + 1)
else:
  assert not continuation_token
  return b''.join(full_state), None
{code}

This could be a problem in python3.  
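
Concretely, a minimal repro of the mismatch outside of Beam:

{code:python}
token = 'token_1a:0'   # what the servicer builds and returns (str)
token.split(':')       # fine: ['token_1a', '0']

wire = b'token_1a:0'   # what the proto field actually declares (bytes)
# wire.split(':')      # py3: TypeError, a bytes separator is required
wire.split(b':')       # works, but the servicer code never does this
{code}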

All other id values are string, whereas bytes is reserved for data, so I think 
that the proto should be changed to string. 





--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-8213) Run and report python tox tasks separately within Jenkins

2019-09-16 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930767#comment-16930767
 ] 

Chad Dombrova commented on BEAM-8213:
-

I should also mention I'm fine with lumping the IT jobs in with their 
respective preCommitPy* jobs, if we want to reduce the total number of jenkins 
jobs. So the final list of jobs/tasks would be:

{code}
pythonPreCommit()
  dependsOn ":sdks:python:test-suites:tox:py2:preCommitPy2"
dependsOn "testPy2Cython"
dependsOn "testPython2"
dependsOn "testPy2Gcp"
dependsOn ":sdks:python:test-suites:dataflow:py2:preCommitIT"
  dependsOn ":sdks:python:test-suites:tox:py35:preCommitPy35"
dependsOn "testPython35"
dependsOn "testPy35Gcp"
dependsOn "testPy35Cython"
  dependsOn ":sdks:python:test-suites:tox:py36:preCommitPy36"
dependsOn "testPython36"
dependsOn "testPy36Gcp"
dependsOn "testPy36Cython"
  dependsOn ":sdks:python:test-suites:tox:py37:preCommitPy37"
dependsOn "testPython37"
dependsOn "testPy37Gcp"
dependsOn "testPy37Cython"
dependsOn ":sdks:python:test-suites:dataflow:py37:preCommitIT"
  dependsOn ":sdks:python:test-suites:tox:preCommitLint"
dependsOn ":sdks:python:test-suites:tox:py2:lint"
dependsOn ":sdks:python:test-suites:tox:py2:docs"
dependsOn ":sdks:python:test-suites:tox:py35:lint"
{code}

> Run and report python tox tasks separately within Jenkins
> -
>
> Key: BEAM-8213
> URL: https://issues.apache.org/jira/browse/BEAM-8213
> Project: Beam
>  Issue Type: Improvement
>  Components: build-system
>Reporter: Chad Dombrova
>Priority: Major
>
> As a python developer, the speed and comprehensibility of the jenkins 
> PreCommit job could be greatly improved.
> Here are some of the problems
> - when a lint job fails, it's not reported in the test results summary, so 
> even though the job is marked as failed, I see "Test Result (no failures)" 
> which is quite confusing
> - I have to wait for over an hour to discover the lint failed, which takes 
> about a minute to run on its own
> - The logs are a jumbled mess of all the different tasks running on top of 
> each other
> - The test results give no indication of which version of python they use.  I 
> click on Test results, then the test module, then the test class, then I see 
> 4 tests named the same thing.  I assume that the first is python 2.7, the 
> second is 3.5 and so on.   It takes 5 clicks and then reading the log output 
> to know which version of python a single error pertains to, then I need to 
> repeat for each failure.  This makes it very difficult to discover problems, 
> and deduce that they may have something to do with python version mismatches.
> I believe the solution to this is to split up the single monolithic python 
> PreCommit job into sub-jobs (possibly using a pipeline with steps).  This 
> would give us the following benefits:
> - sub job results should become available as they finish, so for example, 
> lint results should be available very early on
> - sub job results will be reported separately, and there will be a job for 
> each py2, py35, py36 and so on, so it will be clear when an error is related 
> to a particular python version
> - sub jobs without reports, like docs and lint, will have their own failure 
> status and logs, so when they fail it will be more obvious what went wrong.
> I'm happy to help out once I get some feedback on the desired way forward.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (BEAM-8213) Run and report python tox tasks separately within Jenkins

2019-09-16 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930762#comment-16930762
 ] 

Chad Dombrova commented on BEAM-8213:
-

Here's a hierarchy of the relevant pre-commit tasks:

{code}
pythonPreCommit()
  dependsOn ":sdks:python:test-suites:tox:py2:preCommitPy2"
dependsOn "docs"
dependsOn "testPy2Cython"
dependsOn "testPython2"
dependsOn "testPy2Gcp"
dependsOn "lint"
  dependsOn ":sdks:python:test-suites:tox:py35:preCommitPy35"
dependsOn "testPython35"
dependsOn "testPy35Gcp"
dependsOn "testPy35Cython"
dependsOn "lint"
  dependsOn ":sdks:python:test-suites:tox:py36:preCommitPy36"
dependsOn "testPython36"
dependsOn "testPy36Gcp"
dependsOn "testPy36Cython"
  dependsOn ":sdks:python:test-suites:tox:py37:preCommitPy37"
dependsOn "testPython37"
dependsOn "testPy37Gcp"
dependsOn "testPy37Cython"
  dependsOn ":sdks:python:test-suites:dataflow:py2:preCommitIT"
  dependsOn ":sdks:python:test-suites:dataflow:py37:preCommitIT"
{code}

I think an easy place to start is to split up pythonPreCommit into 6 separate 
jobs

- preCommitPy2
- preCommitPy35
- preCommitPy36
- preCommitPy37
- preCommitITPy2
- preCommitITPy37

Then we pull out all the lint and doc tasks into another task:
 
- preCommitPyLint

It's a lot of jobs, but each one will finish faster, and the output for each 
will be much clearer.  



> Run and report python tox tasks separately within Jenkins
> -
>
> Key: BEAM-8213
> URL: https://issues.apache.org/jira/browse/BEAM-8213
> Project: Beam
>  Issue Type: Improvement
>  Components: build-system
>Reporter: Chad Dombrova
>Priority: Major
>
> As a python developer, the speed and comprehensibility of the jenkins 
> PreCommit job could be greatly improved.
> Here are some of the problems
> - when a lint job fails, it's not reported in the test results summary, so 
> even though the job is marked as failed, I see "Test Result (no failures)" 
> which is quite confusing
> - I have to wait for over an hour to discover the lint failed, which takes 
> about a minute to run on its own
> - The logs are a jumbled mess of all the different tasks running on top of 
> each other
> - The test results give no indication of which version of python they use.  I 
> click on Test results, then the test module, then the test class, then I see 
> 4 tests named the same thing.  I assume that the first is python 2.7, the 
> second is 3.5 and so on.   It takes 5 clicks and then reading the log output 
> to know which version of python a single error pertains to, then I need to 
> repeat for each failure.  This makes it very difficult to discover problems, 
> and deduce that they may have something to do with python version mismatches.
> I believe the solution to this is to split up the single monolithic python 
> PreCommit job into sub-jobs (possibly using a pipeline with steps).  This 
> would give us the following benefits:
> - sub job results should become available as they finish, so for example, 
> lint results should be available very early on
> - sub job results will be reported separately, and there will be a job for 
> each py2, py35, py36 and so on, so it will be clear when an error is related 
> to a particular python version
> - sub jobs without reports, like docs and lint, will have their own failure 
> status and logs, so when they fail it will be more obvious what went wrong.
> I'm happy to help out once I get some feedback on the desired way forward.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (BEAM-8235) Python Post Commit 3.6 issue

2019-09-16 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930747#comment-16930747
 ] 

Chad Dombrova commented on BEAM-8235:
-

this is resolved

> Python Post Commit 3.6 issue
> 
>
> Key: BEAM-8235
> URL: https://issues.apache.org/jira/browse/BEAM-8235
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Reporter: Ahmet Altay
>Assignee: Chad Dombrova
>Priority: Major
>
> https://builds.apache.org/job/beam_PostCommit_Python36/469/console
> 12:04:37 
> ==
> 12:04:37 ERROR: Failure: ModuleNotFoundError (No module named 'dataclasses')
> 12:04:37 
> --
> 12:04:37 Traceback (most recent call last):
> 12:04:37   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python36/src/build/gradleenv/-1734967053/lib/python3.6/site-packages/nose/failure.py",
>  line 39, in runTest
> 12:04:37 raise self.exc_val.with_traceback(self.tb)
> 12:04:37   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python36/src/build/gradleenv/-1734967053/lib/python3.6/site-packages/nose/loader.py",
>  line 418, in loadTestsFromName
> 12:04:37 addr.filename, addr.module)
> 12:04:37   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python36/src/build/gradleenv/-1734967053/lib/python3.6/site-packages/nose/importer.py",
>  line 47, in importFromPath
> 12:04:37 return self.importFromDir(dir_path, fqname)
> 12:04:37   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python36/src/build/gradleenv/-1734967053/lib/python3.6/site-packages/nose/importer.py",
>  line 94, in importFromDir
> 12:04:37 mod = load_module(part_fqname, fh, filename, desc)
> 12:04:37   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python36/src/build/gradleenv/-1734967053/lib/python3.6/imp.py",
>  line 235, in load_module
> 12:04:37 return load_source(name, filename, file)
> 12:04:37   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python36/src/build/gradleenv/-1734967053/lib/python3.6/imp.py",
>  line 172, in load_source
> 12:04:37 module = _load(spec)
> 12:04:37   File "<frozen importlib._bootstrap>", line 684, in _load
> 12:04:37   File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
> 12:04:37   File "<frozen importlib._bootstrap_external>", line 678, in 
> exec_module
> 12:04:37   File "<frozen importlib._bootstrap>", line 219, in 
> _call_with_frames_removed
> 12:04:37   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python36/src/sdks/python/apache_beam/transforms/external_test_py37.py",
>  line 22, in 
> 12:04:37 import dataclasses
> I am guessing that https://github.com/apache/beam/pull/9570 will suppress 
> this. However there might be additional changes needed after 
> https://github.com/apache/beam/pull/9098 for python 3.6.
> /cc [~tvalentyn]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (BEAM-8229) Python postcommit 2.7 tests are failing since we try to run a test with Python 3.6+ syntax on Python 2.

2019-09-16 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930749#comment-16930749
 ] 

Chad Dombrova commented on BEAM-8229:
-

this is resolved

> Python postcommit 2.7 tests are failing since we try to run a test with 
> Python 3.6+ syntax on Python 2.
> ---
>
> Key: BEAM-8229
> URL: https://issues.apache.org/jira/browse/BEAM-8229
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Ahmet Altay
>Assignee: Udi Meiri
>Priority: Blocker
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Root cause: https://github.com/apache/beam/pull/9098
> 20:12:14 Traceback (most recent call last):
> 20:12:14   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python2_PR/src/build/gradleenv/-194514014/local/lib/python2.7/site-packages/nose/loader.py",
>  line 418, in loadTestsFromName
> 20:12:14 addr.filename, addr.module)
> 20:12:14   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python2_PR/src/build/gradleenv/-194514014/local/lib/python2.7/site-packages/nose/importer.py",
>  line 47, in importFromPath
> 20:12:14 return self.importFromDir(dir_path, fqname)
> 20:12:14   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python2_PR/src/build/gradleenv/-194514014/local/lib/python2.7/site-packages/nose/importer.py",
>  line 94, in importFromDir
> 20:12:14 mod = load_module(part_fqname, fh, filename, desc)
> 20:12:14 SyntaxError: invalid syntax (external_test_py37.py, line 46)



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (BEAM-8238) Improve python gen_protos script

2019-09-15 Thread Chad Dombrova (Jira)
Chad Dombrova created BEAM-8238:
---

 Summary: Improve python gen_protos script
 Key: BEAM-8238
 URL: https://issues.apache.org/jira/browse/BEAM-8238
 Project: Beam
  Issue Type: Improvement
  Components: build-system, sdk-py-core
Reporter: Chad Dombrova
Assignee: Chad Dombrova


Issues that need to be solved:

- doesn't print output to shell when run during `python setup.py`
- runs futurize twice when grpcio-tools is not present and needs to be installed
- does not account for the case where there are more output files than .proto 
files (just keeps triggering out-of-date rebuilds due to timestamp of orphaned 
output file)
- plus it would be nice for debugging if the script printed why it is 
regenerating pb2 files, because there are a number of possible reasons




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (BEAM-8213) Run and report python tox tasks separately within Jenkins

2019-09-13 Thread Chad Dombrova (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chad Dombrova updated BEAM-8213:

Description: 
As a python developer, the speed and comprehensibility of the jenkins PreCommit 
job could be greatly improved.

Here are some of the problems

- when a lint job fails, it's not reported in the test results summary, so even 
though the job is marked as failed, I see "Test Result (no failures)" which is 
quite confusing
- I have to wait for over an hour to discover the lint failed, which takes 
about a minute to run on its own
- The logs are a jumbled mess of all the different tasks running on top of each 
other
- The test results give no indication of which version of python they use.  I 
click on Test results, then the test module, then the test class, then I see 4 
tests named the same thing.  I assume that the first is python 2.7, the second 
is 3.5 and so on.   It takes 5 clicks and then reading the log output to know 
which version of python a single error pertains to, then I need to repeat for 
each failure.  This makes it very difficult to discover problems, and deduce 
that they may have something to do with python version mismatches.

I believe the solution to this is to split up the single monolithic python 
PreCommit job into sub-jobs (possibly using a pipeline with steps).  This would 
give us the following benefits:

- sub job results should become available as they finish, so for example, lint 
results should be available very early on
- sub job results will be reported separately, and there will be a job for each 
py2, py35, py36 and so on, so it will be clear when an error is related to a 
particular python version
- sub jobs without reports, like docs and lint, will have their own failure 
status and logs, so when they fail it will be more obvious what went wrong.

I'm happy to help out once I get some feedback on the desired way forward.


  was:
As a python developer, the speed and comprehensibility of the jenkins PreCommit 
job could be greatly improved.

Here are some of the problems

- when a lint job fails, it's not reported in the test results summary, so even 
though the job is marked as failed, I see "Test Result (no failures)" which is 
quite confusing
- I have to wait for over an hour to discover the lint failed, which takes 
about a minute to run on its own
- The logs are a jumbled mess of all the different tasks running on top of each 
other
- The test results give no indication of which version of python they use.  I 
click on Test results, then the test module, then the test class, then I see 4 
tests named the same thing.  I assume that the first is python 2.7, the second 
is 3.5 and so on.   It takes 5 clicks and then reading the log output to know 
which version of python a single error pertains to.  This makes it very 
difficult to discover problems, and deduce that they may have something to do 
with python version mismatches.

I believe the solution to this is to split up the single monolithic python 
PreCommit job into sub-jobs (possibly using a pipeline with steps).  This would 
give us the following benefits:

- sub job results should become available as they finish, so for example, lint 
results should be available very early on
- sub job results will be reported separately, and there will be a job for each 
py2, py35, py36 and so on, so it will be clear when an error is related to a 
particular python version
- sub jobs without reports, like docs and lint, will have their own failure 
status and logs, so when they fail it will be more obvious what went wrong.

I'm happy to help out once I get some feedback on the desired way forward.



> Run and report python tox tasks separately within Jenkins
> -
>
> Key: BEAM-8213
> URL: https://issues.apache.org/jira/browse/BEAM-8213
> Project: Beam
>  Issue Type: Improvement
>  Components: build-system
>Reporter: Chad Dombrova
>Priority: Major
>
> As a python developer, the speed and comprehensibility of the jenkins 
> PreCommit job could be greatly improved.
> Here are some of the problems
> - when a lint job fails, it's not reported in the test results summary, so 
> even though the job is marked as failed, I see "Test Result (no failures)" 
> which is quite confusing
> - I have to wait for over an hour to discover the lint failed, which takes 
> about a minute to run on its own
> - The logs are a jumbled mess of all the different tasks running on top of 
> each other
> - The test results give no indication of which version of python they use.  I 
> click on Test results, then the test module, then the test class, then I see 
> 4 tests named the same thing.  I assume that the first is python 2.7, the 
> second is 3.5 and so on.   It takes 5 clicks and then reading the log 

[jira] [Updated] (BEAM-8213) Run and report python tox tasks separately within Jenkins

2019-09-13 Thread Chad Dombrova (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chad Dombrova updated BEAM-8213:

Description: 
As a python developer, the speed and comprehensibility of the jenkins PreCommit 
job could be greatly improved.

Here are some of the problems

- when a lint job fails, it's not reported in the test results summary, so even 
though the job is marked as failed, I see "Test Result (no failures)" which is 
quite confusing
- I have to wait for over an hour to discover the lint failed, which takes 
about a minute to run on its own
- The logs are a jumbled mess of all the different tasks running on top of each 
other
- The test results give no indication of which version of python they use.  I 
click on Test results, then the test module, then the test class, then I see 4 
tests named the same thing.  I assume that the first is python 2.7, the second 
is 3.5 and so on.   It takes 5 clicks and then reading the log output to know 
which version of python a single error pertains to.  This makes it very 
difficult to discover problems, and deduce that they may have something to do 
with python version mismatches.

I believe the solution to this is to split up the single monolithic python 
PreCommit job into sub-jobs (possibly using a pipeline with steps).  This would 
give us the following benefits:

- sub job results should become available as they finish, so for example, lint 
results should be available very early on
- sub job results will be reported separately, and there will be a job for each 
py2, py35, py36 and so on, so it will be clear when an error is related to a 
particular python version
- sub jobs without reports, like docs and lint, will have their own failure 
status and logs, so when they fail it will be more obvious what went wrong.

I'm happy to help out once I get some feedback on the desired way forward.


  was:
As a python developer, the speed and comprehensibility of the jenkins PreCommit 
job could be greatly improved.

Here are some of the problems

- when a lint job fails, it's not reported in the test results summary, so even 
though the job is marked as failed, I see "Test Result (no failures)" which is 
quite confusing
- I have to wait for over an hour to discover the lint failed, which takes 
about a minute to run on its own
- The logs are a jumbled mess of all the different tasks running on top of each 
other
- The test results give no indication of which version of python they use.  I 
click on Test results, then the test module, then the test class, then I see 4 
tests named the same thing.  I assume that the first is python 2.7, the second 
is 3.5 and so on.   It takes 5 clicks and then reading the log output to know 
which version a single error pertains to.  This makes it very difficult to 
discover problems, and deduce that they may have something to do with python 
version mismatches.

I believe the solution to this is to split up the single monolithic python 
PreCommit job into sub-jobs (possibly using a pipeline with steps).  This would 
give us the following benefits:

- sub job results should become available as they finish, so for example, lint 
results should be available very early on
- sub job results will be reported separately, and there will be a job for each 
py2, py35, py36 and so on, so it will be clear when an error is related to a 
particular python version
- sub jobs without reports, like docs and lint, will have their own failure 
status and logs, so when they fail it will be more obvious what went wrong.

I'm happy to help out once I get some feedback on the desired way forward.



> Run and report python tox tasks separately within Jenkins
> -
>
> Key: BEAM-8213
> URL: https://issues.apache.org/jira/browse/BEAM-8213
> Project: Beam
>  Issue Type: Improvement
>  Components: build-system
>Reporter: Chad Dombrova
>Priority: Major
>
> As a python developer, the speed and comprehensibility of the jenkins 
> PreCommit job could be greatly improved.
> Here are some of the problems
> - when a lint job fails, it's not reported in the test results summary, so 
> even though the job is marked as failed, I see "Test Result (no failures)" 
> which is quite confusing
> - I have to wait for over an hour to discover the lint failed, which takes 
> about a minute to run on its own
> - The logs are a jumbled mess of all the different tasks running on top of 
> each other
> - The test results give no indication of which version of python they use.  I 
> click on Test results, then the test module, then the test class, then I see 
> 4 tests named the same thing.  I assume that the first is python 2.7, the 
> second is 3.5 and so on.   It takes 5 clicks and then reading the log output 
> to know which version of python a single 

[jira] [Updated] (BEAM-8213) Run and report python tox tasks separately within Jenkins

2019-09-13 Thread Chad Dombrova (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chad Dombrova updated BEAM-8213:

Description: 
As a python developer, the speed and comprehensibility of the jenkins PreCommit 
job could be greatly improved.

Here are some of the problems

- when a lint job fails, it's not reported in the test results summary, so even 
though the job is marked as failed, I see "Test Result (no failures)" which is 
quite confusing
- I have to wait for over an hour to discover the lint failed, which takes 
about a minute to run on its own
- The logs are a jumbled mess of all the different tasks running on top of each 
other
- The test results give no indication of which version of python they use.  I 
click on Test results, then the test module, then the test class, then I see 4 
tests named the same thing.  I assume that the first is python 2.7, the second 
is 3.5 and so on.   It takes 5 clicks and then reading the log output to know 
which version a single error pertains to.  This makes it very difficult to 
discover problems, and deduce that they may have something to do with python 
version mismatches.

I believe the solution to this is to split up the single monolithic python 
PreCommit job into sub-jobs (possibly using a pipeline with steps).  This would 
give us the following benefits:

- sub job results should become available as they finish, so for example, lint 
results should be available very early on
- sub job results will be reported separately, and there will be a job for each 
py2, py35, py36 and so on, so it will be clear when an error is related to a 
particular python version
- sub jobs without reports, like docs and lint, will have their own failure 
status and logs, so when they fail it will be more obvious what went wrong.

I'm happy to help out once I get some feedback on the desired way forward.


  was:
As a python developer, the speed and comprehensibility of the jenkins PreCommit 
job could be greatly improved.

Here are some of the problems

- when a lint job fails, it's not reported in the Test results, so even though 
the job failed, I see "Test Result (no failures)"
- I have to wait for over an hour to discover the lint failed, which takes 
about a minute to run
- The logs are a jumbled mess of all the different tasks running on top of each 
other
- The test results give no indication of whether they come from python 27, 35, 
etc.  I click on Test results, then the test module, then the test class, then 
I see 4 tests named the same thing.  I assume that the first is python 2.7, the 
second is 3.5 and so on.   This makes it very difficult to discover problems, 
and deduce that they may have something to do with python version mismatches.

I believe the solution to this is to split up the single monolithic python 
PreCommit job into sub-jobs (possibly using a pipeline with steps).  This would 
give us the following benefits:

- sub-job results should become available as they finish, so lint results 
should be available very early on
- sub-job results will be reported separately, so it will be clear when an 
error is related to a particular python version
- sub-jobs without reports, like docs and lint, will have their own failure 
status and logs, so when they fail it will be more obvious what went wrong.

I'm happy to help out once I get some feedback on the desired way forward.



> Run and report python tox tasks separately within Jenkins
> -
>
> Key: BEAM-8213
> URL: https://issues.apache.org/jira/browse/BEAM-8213
> Project: Beam
>  Issue Type: Improvement
>  Components: build-system
>Reporter: Chad Dombrova
>Priority: Major
>
> As a python developer, the speed and comprehensibility of the jenkins 
> PreCommit job could be greatly improved.
> Here are some of the problems
> - when a lint job fails, it's not reported in the test results summary, so 
> even though the job is marked as failed, I see "Test Result (no failures)" 
> which is quite confusing
> - I have to wait for over an hour to discover the lint failed, which takes 
> about a minute to run on its own
> - The logs are a jumbled mess of all the different tasks running on top of 
> each other
> - The test results give no indication of which version of python they use.  I 
> click on Test results, then the test module, then the test class, then I see 
> 4 tests named the same thing.  I assume that the first is python 2.7, the 
> second is 3.5 and so on.   It takes 5 clicks and then reading the log output 
> to know which version a single error pertains to.  This makes it very 
> difficult to discover problems, and deduce that they may have something to do 
> with python version mismatches.
> I believe the solution to this is to split up the single monolithic python 
> PreCommit job 

[jira] [Commented] (BEAM-8229) Python postcommit 2.7 tests are failing since we try to run a test with Python 3.6+ syntax on Python 2.

2019-09-13 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16929412#comment-16929412
 ] 

Chad Dombrova commented on BEAM-8229:
-

Based on feedback from Robert, I'm going to split my PR into two:  the first 
which excludes the python3.6-only feature, and the other which includes it and 
the additional test filtering features.



> Python postcommit 2.7 tests are failing since we try to run a test with 
> Python 3.6+ syntax on Python 2.
> ---
>
> Key: BEAM-8229
> URL: https://issues.apache.org/jira/browse/BEAM-8229
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Ahmet Altay
>Assignee: Udi Meiri
>Priority: Blocker
>
> Root cause: https://github.com/apache/beam/pull/9098
> 20:12:14 Traceback (most recent call last):
> 20:12:14   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python2_PR/src/build/gradleenv/-194514014/local/lib/python2.7/site-packages/nose/loader.py",
>  line 418, in loadTestsFromName
> 20:12:14 addr.filename, addr.module)
> 20:12:14   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python2_PR/src/build/gradleenv/-194514014/local/lib/python2.7/site-packages/nose/importer.py",
>  line 47, in importFromPath
> 20:12:14 return self.importFromDir(dir_path, fqname)
> 20:12:14   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python2_PR/src/build/gradleenv/-194514014/local/lib/python2.7/site-packages/nose/importer.py",
>  line 94, in importFromDir
> 20:12:14 mod = load_module(part_fqname, fh, filename, desc)
> 20:12:14 SyntaxError: invalid syntax (external_test_py37.py, line 46)



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (BEAM-8229) Python postcommit 2.7 tests are failing since we try to run a test with Python 3.6+ syntax on Python 2.

2019-09-13 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16929363#comment-16929363
 ] 

Chad Dombrova commented on BEAM-8229:
-

{quote}
I am not familiar with pytest, but suspected it may have some solutions for 
this problem. I thought that you might know if this is the case.
{quote}

I've used pytest quite a bit and I like it better than nose.  With pytest you 
can create a conftest.py file which can set up exclusions programmatically 
(e.g. based on the current sys.executable), which would help consolidate the 
exclusions by moving them from the nosetests cli args currently in various 
shell scripts and tox.ini into the test code itself.  It's possible to do this 
with nose as well, but IIRC it takes a bit more boilerplate.
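
To make that concrete, here's a minimal conftest.py sketch (the globs assume 
the *_py3*.py module naming convention and are illustrative, not an actual 
Beam config):

{code:python}
# conftest.py -- exclude modules whose syntax the current interpreter
# cannot parse, before collection tries to import them.
import sys

collect_ignore_glob = []
if sys.version_info[0] < 3:
    collect_ignore_glob.append('*_py3*.py')   # py3-only modules
elif sys.version_info < (3, 7):
    collect_ignore_glob.append('*_py37.py')   # modules needing 3.7+ syntax
{code}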



> Python postcommit 2.7 tests are failing since we try to run a test with 
> Python 3.6+ syntax on Python 2.
> ---
>
> Key: BEAM-8229
> URL: https://issues.apache.org/jira/browse/BEAM-8229
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Ahmet Altay
>Assignee: Udi Meiri
>Priority: Blocker
>
> Root cause: https://github.com/apache/beam/pull/9098
> 20:12:14 Traceback (most recent call last):
> 20:12:14   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python2_PR/src/build/gradleenv/-194514014/local/lib/python2.7/site-packages/nose/loader.py",
>  line 418, in loadTestsFromName
> 20:12:14 addr.filename, addr.module)
> 20:12:14   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python2_PR/src/build/gradleenv/-194514014/local/lib/python2.7/site-packages/nose/importer.py",
>  line 47, in importFromPath
> 20:12:14 return self.importFromDir(dir_path, fqname)
> 20:12:14   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python2_PR/src/build/gradleenv/-194514014/local/lib/python2.7/site-packages/nose/importer.py",
>  line 94, in importFromDir
> 20:12:14 mod = load_module(part_fqname, fh, filename, desc)
> 20:12:14 SyntaxError: invalid syntax (external_test_py37.py, line 46)



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (BEAM-8229) Python postcommit 2.7 tests are failing since we try to run a test with Python 3.6+ syntax on Python 2.

2019-09-13 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16929360#comment-16929360
 ] 

Chad Dombrova commented on BEAM-8229:
-

Back with more insight.

The integration tests don't go through tox, which means the logic regarding 
virtualenv setup and test exclusion is spread around the project a bit.  
Outside of tox.ini, test exclusion is performed in build.gradle, 
run_integration_tests.sh, and run_validatescontainer.sh.

Currently it looks like the integration tests don't run the python3-specific 
tests at all:

Here's run_integration_tests.sh:

{code:java}
python setup.py nosetests \
  --test-pipeline-options="$PIPELINE_OPTS" \
  --with-xunitmp --xunitmp-file=$XUNIT_FILE \
  --ignore-files '.*py3.py$' \
  $TEST_OPTS
{code}

Since this is already a bit of a sub-par situation (not running the 
py3-specific tests), I think the most reasonable solution is to extend this 
regex to py3[0-9]*, which will exclude my py37 tests as well.  (Also, note 
that since this is a regex and not a glob, the period should have been 
escaped; as written, it matches any character.)
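
To make the escaping point concrete, a quick sketch of how the two patterns 
behave (the file names are just examples):

{code:python}
import re

current = re.compile(r'.*py3.py$')          # unescaped '.' matches any character
proposed = re.compile(r'.*py3[0-9]*\.py$')

print(bool(current.match('external_test_py3.py')))    # True
print(bool(current.match('external_test_py37.py')))   # False -- py37 slips through
print(bool(proposed.match('external_test_py3.py')))   # True
print(bool(proposed.match('external_test_py37.py')))  # True
{code}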

I think the ideal long-term solution is to make all tests pass through tox, 
which will increase their discoverability, runnability, and standardization.


> Python postcommit 2.7 tests are failing since we try to run a test with 
> Python 3.6+ syntax on Python 2.
> ---
>
> Key: BEAM-8229
> URL: https://issues.apache.org/jira/browse/BEAM-8229
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Ahmet Altay
>Assignee: Udi Meiri
>Priority: Blocker
>
> Root cause: https://github.com/apache/beam/pull/9098
> 20:12:14 Traceback (most recent call last):
> 20:12:14   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python2_PR/src/build/gradleenv/-194514014/local/lib/python2.7/site-packages/nose/loader.py",
>  line 418, in loadTestsFromName
> 20:12:14 addr.filename, addr.module)
> 20:12:14   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python2_PR/src/build/gradleenv/-194514014/local/lib/python2.7/site-packages/nose/importer.py",
>  line 47, in importFromPath
> 20:12:14 return self.importFromDir(dir_path, fqname)
> 20:12:14   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python2_PR/src/build/gradleenv/-194514014/local/lib/python2.7/site-packages/nose/importer.py",
>  line 94, in importFromDir
> 20:12:14 mod = load_module(part_fqname, fh, filename, desc)
> 20:12:14 SyntaxError: invalid syntax (external_test_py37.py, line 46)



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (BEAM-8229) Python postcommit 3.7 tests are failing with syntax error

2019-09-13 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16929346#comment-16929346
 ] 

Chad Dombrova commented on BEAM-8229:
-

This is getting a bit difficult to communicate as the conversation is spread 
across 2 JIRA tickets and 2 github PRs. Reposting this from the others:

Pre-commit does run the python 3.x tests, but it runs many of the tests on 
python 3.5. This PR introduced tests that can only run on python 3.6 or higher, 
and I had to do some extra work to the beam test framework to make that 
possible: 
https://github.com/apache/beam/pull/9098/commits/0c31f7c8b1c108f7df4cbedfaf9da0ce30092ce0

It appears that same work to properly exclude/include tests at the major + 
minor version needs to be done for post-commit tests. I didn't do that simply 
because I didn't know it could be a problem. I assumed everything went through 
tox.

I'm currently looking for the post-commit entry point to see what else needs to 
be filtered.  

Note that the _other_ solution to this is dropping python 3.5 support, but I 
figured that'd be a much bigger conversation.




> Python postcommit 3.7 tests are failing with syntax error
> -
>
> Key: BEAM-8229
> URL: https://issues.apache.org/jira/browse/BEAM-8229
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Ahmet Altay
>Assignee: Udi Meiri
>Priority: Blocker
>
> Root cause: https://github.com/apache/beam/pull/9098
> 20:12:14 Traceback (most recent call last):
> 20:12:14   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python2_PR/src/build/gradleenv/-194514014/local/lib/python2.7/site-packages/nose/loader.py",
>  line 418, in loadTestsFromName
> 20:12:14 addr.filename, addr.module)
> 20:12:14   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python2_PR/src/build/gradleenv/-194514014/local/lib/python2.7/site-packages/nose/importer.py",
>  line 47, in importFromPath
> 20:12:14 return self.importFromDir(dir_path, fqname)
> 20:12:14   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python2_PR/src/build/gradleenv/-194514014/local/lib/python2.7/site-packages/nose/importer.py",
>  line 94, in importFromDir
> 20:12:14 mod = load_module(part_fqname, fh, filename, desc)
> 20:12:14 SyntaxError: invalid syntax (external_test_py37.py, line 46)



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (BEAM-8230) Python pre-commits need to include python 3.x tests

2019-09-13 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16929335#comment-16929335
 ] 

Chad Dombrova commented on BEAM-8230:
-

Reposting this from the PR:

There's a misunderstanding here.  Pre-commit _does_ run the python 3.x tests, 
but it runs many of the tests on python 3.5.  This PR introduced tests that can 
only run on python 3.6 or higher, and I had to do some extra work to the beam 
test framework to make that possible:  
https://github.com/apache/beam/pull/9098/commits/0c31f7c8b1c108f7df4cbedfaf9da0ce30092ce0

It appears that same work to properly exclude/include tests at the major + 
minor version needs to be done for post-commit tests.  I didn't do that simply 
because I didn't know it could be a problem.  I assumed everything went through 
tox.


> Python pre-commits need to include python 3.x tests
> ---
>
> Key: BEAM-8230
> URL: https://issues.apache.org/jira/browse/BEAM-8230
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core, testing
>Reporter: Ahmet Altay
>Assignee: Valentyn Tymofieiev
>Priority: Major
>
> Python pre-commit does not seem to run python 3.x tests. This results in 
> post-commit tests catching simple issues like syntax errors.
> Example: https://issues.apache.org/jira/browse/BEAM-8229



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (BEAM-8213) Run and report python tox tasks separately within Jenkins

2019-09-11 Thread Chad Dombrova (Jira)
Chad Dombrova created BEAM-8213:
---

 Summary: Run and report python tox tasks separately within Jenkins
 Key: BEAM-8213
 URL: https://issues.apache.org/jira/browse/BEAM-8213
 Project: Beam
  Issue Type: Improvement
  Components: build-system
Reporter: Chad Dombrova


As a python developer, the speed and comprehensibility of the jenkins PreCommit 
job could be greatly improved.

Here are some of the problems

- when a lint job fails, it's not reported in the Test results, so even though 
the job failed, I see "Test Result (no failures)"
- I have to wait for over an hour to discover the lint failed, which takes 
about a minute to run
- The logs are a jumbled mess of all the different tasks running on top of each 
other
- The test results give no indication of whether they come from python 27, 35, 
etc.  I click on Test results, then the test module, then the test class, then 
I see 4 tests named the same thing.  I assume that the first is python 2.7, the 
second is 3.5 and so on.   This makes it very difficult to discover problems, 
and deduce that they may have something to do with python version mismatches.

I believe the solution to this is to split up the single monolithic python 
PreCommit job into sub-jobs (possibly using a pipeline with steps).  This would 
give us the following benefits:

- sub-job results should become available as they finish, so lint results 
should be available very early on
- sub-job results will be reported separately, so it will be clear when an 
error is related to a particular python version
- sub-jobs without reports, like docs and lint, will have their own failure 
status and logs, so when they fail it will be more obvious what went wrong.

I'm happy to help out once I get some feedback on the desired way forward.




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (BEAM-8172) Add a label field to FunctionSpec proto

2019-09-06 Thread Chad Dombrova (Jira)
Chad Dombrova created BEAM-8172:
---

 Summary: Add a label field to FunctionSpec proto
 Key: BEAM-8172
 URL: https://issues.apache.org/jira/browse/BEAM-8172
 Project: Beam
  Issue Type: Improvement
  Components: beam-model
Reporter: Chad Dombrova
Assignee: Chad Dombrova


The FunctionSpec payload is opaque outside of the native environment, which can 
make debugging pipeline protos difficult.  It would be very useful if the 
FunctionSpec had an optional human-readable "label" for debugging and 
presenting in UIs and error messages.

For example, in python, if the payload is an instance of a class, we could 
attempt to provide a string that represents the dotted path to that class, 
"mymodule.MyClass".  In the case of coders, we could use the label to hold the 
type hint:  "Optional[Iterable[mymodule.MyClass]]".
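
As a quick sketch of how such a label could be derived in python (the helper 
below is hypothetical):

{code:python}
def dotted_path(obj):
    # Hypothetical helper: dotted path to an instance's class.
    cls = obj if isinstance(obj, type) else type(obj)
    return '%s.%s' % (cls.__module__, cls.__qualname__)

# dotted_path(MyClass()) -> 'mymodule.MyClass'
{code}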




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (BEAM-8088) PCollection boundedness should be tracked and propagated in python sdk

2019-09-05 Thread Chad Dombrova (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chad Dombrova updated BEAM-8088:

Summary: PCollection boundedness should be tracked and propagated in python 
sdk  (was: [python] PCollection boundedness should be tracked and propagated)

> PCollection boundedness should be tracked and propagated in python sdk
> --
>
> Key: BEAM-8088
> URL: https://issues.apache.org/jira/browse/BEAM-8088
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py-core
>Reporter: Chad Dombrova
>Assignee: Chad Dombrova
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> As far as I can tell Python does not care about boundedness of PCollections 
> even in streaming mode, but external transforms _do_.  In my ongoing effort 
> to get PubsubIO external transforms working I discovered that I could not 
> generate an unbounded write. 
> My pipeline looks like this:
> {code:python}
> (
> pipe
> | 'PubSubInflow' >> 
> external.pubsub.ReadFromPubSub(subscription=subscription, 
> with_attributes=True)
> | 'PubSubOutflow' >> 
> external.pubsub.WriteToPubSub(topic=OUTPUT_TOPIC, with_attributes=True)
> )
> {code}
> The PCollections returned from the external Read are Unbounded, as expected, 
> but python is responsible for creating the intermediate PCollection, which is 
> always Bounded, and thus external Write is always Bounded. 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (BEAM-8088) [python] PCollection boundedness should be tracked and propagated

2019-09-05 Thread Chad Dombrova (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chad Dombrova updated BEAM-8088:

Summary: [python] PCollection boundedness should be tracked and propagated  
(was: PCollection boundedness should be tracked and propagated)

> [python] PCollection boundedness should be tracked and propagated
> -
>
> Key: BEAM-8088
> URL: https://issues.apache.org/jira/browse/BEAM-8088
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py-core
>Reporter: Chad Dombrova
>Assignee: Chad Dombrova
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> As far as I can tell Python does not care about boundedness of PCollections 
> even in streaming mode, but external transforms _do_.  In my ongoing effort 
> to get PubsubIO external transforms working I discovered that I could not 
> generate an unbounded write. 
> My pipeline looks like this:
> {code:python}
> (
> pipe
> | 'PubSubInflow' >> 
> external.pubsub.ReadFromPubSub(subscription=subscription, 
> with_attributes=True)
> | 'PubSubOutflow' >> 
> external.pubsub.WriteToPubSub(topic=OUTPUT_TOPIC, with_attributes=True)
> )
> {code}
> The PCollections returned from the external Read are Unbounded, as expected, 
> but python is responsible for creating the intermediate PCollection, which is 
> always Bounded, and thus external Write is always Bounded. 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.

2019-09-05 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-7060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923739#comment-16923739
 ] 

Chad Dombrova commented on BEAM-7060:
-

Note that work on type annotations is continuing in BEAM-7746


> Design Py3-compatible typehints annotation support in Beam 3.
> -
>
> Key: BEAM-7060
> URL: https://issues.apache.org/jira/browse/BEAM-7060
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Udi Meiri
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 15.5h
>  Remaining Estimate: 0h
>
> Existing [Typehints implementation in 
> Beam|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/typehints/]
> heavily relies on internal details of the CPython implementation, and some of 
> the assumptions of this implementation broke as of Python 3.6, see for 
> example: https://issues.apache.org/jira/browse/BEAM-6877, which makes  
> typehints support unusable on Python 3.6 as of now. [Python 3 Kanban 
> Board|https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245=detail]
>  lists several specific typehints-related breakages, prefixed with "TypeHints 
> Py3 Error".
> We need to decide whether to:
> - Deprecate in-house typehints implementation.
> - Continue to support in-house implementation, which at this point is a stale 
> code and has other known issues.
> - Attempt to use some off-the-shelf libraries for supporting 
> type-annotations, like  Pytype, Mypy, PyAnnotate.
> WRT to this decision we also need to plan on immediate next steps to unblock 
> adoption of Beam for  Python 3.6+ users. One potential option may be to have 
> Beam SDK ignore any typehint annotations on Py 3.6+.
> cc: [~udim], [~altay], [~robertwb].



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (BEAM-8148) [python] allow tests to be specified when using tox

2019-09-05 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923730#comment-16923730
 ] 

Chad Dombrova commented on BEAM-8148:
-

PR is merged, so this is fixed!

> [python] allow tests to be specified when using tox
> ---
>
> Key: BEAM-8148
> URL: https://issues.apache.org/jira/browse/BEAM-8148
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py-core
>Reporter: Chad Dombrova
>Assignee: Chad Dombrova
>Priority: Major
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> I don't know how to specify individual tests using gradle, and as a python 
> developer, it's way too opaque for me to figure out on my own.  
> The developer wiki suggests calling the tests directly:
> {noformat}
> python setup.py nosetests --tests :.
> {noformat}
> But this defeats the purpose of using tox, which is there to help users 
> manage the myriad virtual envs required to run the different tests.
> Luckily there is an easier way!  You just add {posargs} to the tox commands.  
> PR is coming shortly.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (BEAM-8148) [python] allow tests to be specified when using tox

2019-09-04 Thread Chad Dombrova (Jira)
Chad Dombrova created BEAM-8148:
---

 Summary: [python] allow tests to be specified when using tox
 Key: BEAM-8148
 URL: https://issues.apache.org/jira/browse/BEAM-8148
 Project: Beam
  Issue Type: Improvement
  Components: sdk-py-core
Reporter: Chad Dombrova
Assignee: Chad Dombrova


I don't know how to specify individual tests using gradle, and as a python 
developer, it's way too opaque for me to figure out on my own.  

The developer wiki suggests calling the tests directly:

{noformat}
python setup.py nosetests --tests :.
{noformat}

But this defeats the purpose of using tox, which is there to help users manage 
the myriad virtual envs required to run the different tests.

Luckily there is an easier way!  You just add {posargs} to the tox commands.  
PR is coming shortly.
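
For reference, a rough sketch of what that looks like (the tox.ini excerpt is 
illustrative, not the exact Beam config):

{noformat}
# tox.ini: forward positional args to the test runner
[testenv]
commands = python setup.py nosetests {posargs}
{noformat}

With that in place, anything after -- on the tox command line is substituted 
for {posargs}, e.g.:

{noformat}
tox -e py27 -- --tests apache_beam.pipeline_test
{noformat}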







--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (BEAM-8088) PCollection boundedness should be tracked and propagated

2019-08-24 Thread Chad Dombrova (Jira)
Chad Dombrova created BEAM-8088:
---

 Summary: PCollection boundedness should be tracked and propagated
 Key: BEAM-8088
 URL: https://issues.apache.org/jira/browse/BEAM-8088
 Project: Beam
  Issue Type: Improvement
  Components: sdk-py-core
Reporter: Chad Dombrova


As far as I can tell Python does not care about boundedness of PCollections 
even in streaming mode, but external transforms _do_.  In my ongoing effort to 
get PubsubIO external transforms working I discovered that I could not generate 
an unbounded write. 

My pipeline looks like this:

{code:python}
(
pipe
| 'PubSubInflow' >> 
external.pubsub.ReadFromPubSub(subscription=subscription, with_attributes=True)
| 'PubSubOutflow' >> external.pubsub.WriteToPubSub(topic=OUTPUT_TOPIC, 
with_attributes=True)
)
{code}

The PCollections returned from the external Read are Unbounded, as expected, 
but python is responsible for creating the intermediate PCollection, which is 
always Bounded, and thus external Write is always Bounded. 




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (BEAM-8085) PubsubMessageWithAttributesCoder should not produce null attributes map

2019-08-23 Thread Chad Dombrova (Jira)
Chad Dombrova created BEAM-8085:
---

 Summary: PubsubMessageWithAttributesCoder should not produce null 
attributes map
 Key: BEAM-8085
 URL: https://issues.apache.org/jira/browse/BEAM-8085
 Project: Beam
  Issue Type: Improvement
  Components: io-java-gcp
Reporter: Chad Dombrova


Hi, I just got caught by an issue where PubsubMessage.getAttributeMap() 
returned null, because the message was created by 
PubsubMessageWithAttributesCoder which uses a NullableCoder for attributes.

Here are the relevant code snippets:

{code:java}
public class PubsubMessageWithAttributesCoder extends CustomCoder<PubsubMessage> {
  // A message's payload can not be null
  private static final Coder<byte[]> PAYLOAD_CODER = ByteArrayCoder.of();
  // A message's attributes can be null.
  private static final Coder<Map<String, String>> ATTRIBUTES_CODER =
      NullableCoder.of(MapCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of()));

  @Override
  public PubsubMessage decode(InputStream inStream) throws IOException {
    return decode(inStream, Context.NESTED);
  }

  @Override
  public PubsubMessage decode(InputStream inStream, Context context)
      throws IOException {
    byte[] payload = PAYLOAD_CODER.decode(inStream);
    Map<String, String> attributes = ATTRIBUTES_CODER.decode(inStream, context);
    return new PubsubMessage(payload, attributes);
  }
}

{code}
 
{code:java}
public class PubsubMessage {

  private byte[] message;
  private Map<String, String> attributes;

  public PubsubMessage(byte[] payload, Map<String, String> attributes) {
    this.message = payload;
    this.attributes = attributes;
  }

  /** Returns the main PubSub message. */
  public byte[] getPayload() {
    return message;
  }

  /** Returns the given attribute value. If no such attribute exists, returns null. */
  @Nullable
  public String getAttribute(String attribute) {
    checkNotNull(attribute, "attribute");
    return attributes.get(attribute);
  }

  /** Returns the full map of attributes. This is an unmodifiable map. */
  public Map<String, String> getAttributeMap() {
    return attributes;
  }
}
{code}

There are a handful of potential solutions:

# Remove the NullableCoder
# In PubsubMessageWithAttributesCoder.decode, check for null and create an 
empty Map before instantiating PubsubMessage
# Allow attributes to be null for the PubsubMessage constructor, but create an 
empty Map if it is (similar to above, but handle it in PubsubMessage)
# Allow PubsubMessage.attributes to be nullable, and indicate it as such




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (BEAM-7870) Externally configured KafkaIO consumer causes coder problems

2019-08-21 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-7870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912720#comment-16912720
 ] 

Chad Dombrova commented on BEAM-7870:
-

Another possible solution:  replace PubsubMessage & KafkaRecord with 
sdk.values.Row, so that they are compatible with the new portable RowCoder 
feature being added in BEAM-7886.  The main downside I see is that the API 
would be considerably uglier on the Java side. 



> Externally configured KafkaIO consumer causes coder problems
> 
>
> Key: BEAM-7870
> URL: https://issues.apache.org/jira/browse/BEAM-7870
> Project: Beam
>  Issue Type: Bug
>  Components: runner-flink, sdk-java-core
>Reporter: Maximilian Michels
>Assignee: Maximilian Michels
>Priority: Major
>
> There are limitations for the consumer to work correctly. The biggest issue 
> is the structure of KafkaIO itself, which uses a combination of the source 
> interface and DoFns to generate the desired output. The problem is that the 
> source interface is natively translated by the Flink Runner to support 
> unbounded sources in portability, while the DoFn runs in a Java environment.
> To transfer data between the two a coder needs to be involved. It happens to 
> be that the initial read does not immediately drop the KafkaRecord structure 
> which does not work together well with our current assumption of only 
> supporting "standard coders" present in all SDKs. Only the subsequent DoFn 
> converts the KafkaRecord structure into a raw KV[byte, byte], but the DoFn 
> won't have the coder available in its environment.
> There are several possible solutions:
>  1. Make the DoFn which drops the KafkaRecordCoder a native Java transform in 
> the Flink Runner
>  2. Modify KafkaIO to immediately drop the KafkaRecord structure
>  3. Add the KafkaRecordCoder to all SDKs
>  4. Add a generic coder, e.g. AvroCoder to all SDKs
> For a workaround which uses (3), please see this patch which is not a proper 
> fix but adds KafkaRecordCoder to the SDK such that it can be used 
> encode/decode records: 
> [https://github.com/mxm/beam/commit/b31cf99c75b3972018180d8ccc7e73d311f4cfed]
>  
> See also 
> [https://github.com/apache/beam/pull/8251|https://github.com/apache/beam/pull/8251:]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Comment Edited] (BEAM-7870) Externally configured KafkaIO consumer causes coder problems

2019-08-20 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-7870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911560#comment-16911560
 ] 

Chad Dombrova edited comment on BEAM-7870 at 8/20/19 10:04 PM:
---

To follow up from the ML thread, I'm having a very similar problem when 
converting PubsubIO to an external transform.

{quote}
To transfer data between the two a coder needs to be involved. It happens to be 
that the initial read does not immediately drop the KafkaRecord structure which 
does not work together well with our current assumption of only supporting 
"standard coders" present in all SDKs. Only the subsequent DoFn converts the 
KafkaRecord structure into a raw KV[byte, byte], but the DoFn won't have the 
coder available in its environment.
{quote}

PubsubIO behaves much the same way as you've described -- a Read produces a 
PubsubMessage structure which is converted to bytes (in protobufs format) by a 
subsequent DoFn -- but the problem I'm seeing is not that "the DoFn won't have 
the coder available in its environment", it's that the Read does not use the 
proper coder because FlinkStreamingPortablePipelineTranslator.translateRead 
replaces all non-standard coders with ByteArrayCoder.  So in my tests, the job 
fails with the first read message, complaining that PubsubMessage cannot be 
cast to bytes.

Some thoughts on the proposed solutions:

{quote}
1. Make the DoFn which drops the KafkaRecordCoder a native Java transform in 
the Flink Runner
{quote}

I don't see how this will help PubsubIO, because as I stated I'm fairly 
confident the error happens when flink tries to store the pubsub message 
immediately after it's been read, before the DoFn is even called.  

{quote}
2. Modify KafkaIO to immediately drop the KafkaRecord structure
{quote}

This is the simplest solution if we look at it superficially, but it's severely 
limited in usefulness.  In PubsubIO, the next transform is actually a DoFn that 
wants a PubsubMessage, to update some stats metrics (see my overview of the 
pipeline below).  This solution would either prevent that from working or 
require that every transform decode and encode the payload itself, which I 
think is self-defeating.

{quote}
3. Add the KafkaRecordCoder to all SDKs
{quote}

Thanks for the patch example.  I'll see if something similar works for PubsubIO.

{quote}
4. Add a generic coder, e.g. AvroCoder to all SDKs
{quote}

What about protobufs as a compromise between options 3 and 4?  

In the case of PubsubIO and PubsubMessages, java has its own native coder 
(actually a pair, one for messages w/ attributes and one for w/out), while 
python is using protobufs.  It would be simpler to write a protobufs coder for 
Java than to write a python version of PubsubMessageWithAttributesCoder, so I 
would favor protobufs as a pattern to apply generally to this kind of 
portability problem.

The other thing I like about a protobufs-based approach is that you could write 
an abstract ProtoCoder base class which can easily be specialized by simply 
providing the native class and the proto class.
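
To sketch what I mean on the python side (ProtoCoder here is hypothetical -- 
it doesn't exist in Beam -- and I'm assuming the standard protobuf 
SerializeToString/ParseFromString API):

{code:python}
from apache_beam.coders import Coder


class ProtoCoder(Coder):
    """Hypothetical generic coder pairing a native class with a proto class.

    The protobuf wire format is what crosses the language boundary.
    """

    def __init__(self, proto_cls, to_proto, from_proto):
        self._proto_cls = proto_cls
        self._to_proto = to_proto      # native object -> proto message
        self._from_proto = from_proto  # proto message -> native object

    def encode(self, value):
        return self._to_proto(value).SerializeToString()

    def decode(self, encoded):
        msg = self._proto_cls()
        msg.ParseFromString(encoded)
        return self._from_proto(msg)

    def is_deterministic(self):
        # protobuf serialization of map fields is not order-stable
        return False
{code}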

My current external pubsub test pipeline looks like this after it's been 
expanded:

{noformat}
--- java ---
Read
ParDo  # stats
ParDo  # convert to pubsub protobuf byte array
--- python --
ParDo  # convert from pubsub protobuf byte array
ParDo  # custom logger
{noformat}

If we created and registered a protobufs-based coder for PubsubMessage, we 
could cut out the converter transforms, and it would look like this:

{noformat}
--- java ---
Read
ParDo  # stats
--- python --
ParDo  # custom logger
{noformat}

That's appealing to me, though it's worth noting that we would get this benefit 
using any cross-language coder, whether it's protobufs, avro, or just rewriting 
PubsubMessageWithAttributesCoder in python.



was (Author: chadrik):
To follow up from the ML thread, I'm having a very similar problem when 
converting PubsubIO to an external transform.

{quote}
To transfer data between the two a coder needs to be involved. It happens to be 
that the initial read does not immediately drop the KafkaRecord structure which 
does not work together well with our current assumption of only supporting 
"standard coders" present in all SDKs. Only the subsequent DoFn converts the 
KafkaRecord structure into a raw KV[byte, byte], but the DoFn won't have the 
coder available in its environment.
{quote}

PubsubIO behaves much the same way as you've described -- a Read produces a 
PubsubMessage structure which is converted to bytes (in protobufs format) by a 
subsequent DoFn -- but the problem I'm seeing is not that "the DoFn won't have 
the coder available in its environment", it's that the Read does not use the 
proper coder because FlinkStreamingPortablePipelineTranslator.translateRead 
replaces all non-standard coders with ByteArrayCoder.  So in my tests, the job 
fails with the first read message, complaining 

[jira] [Comment Edited] (BEAM-7870) Externally configured KafkaIO consumer causes coder problems

2019-08-20 Thread Chad Dombrova (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-7870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911560#comment-16911560
 ] 

Chad Dombrova edited comment on BEAM-7870 at 8/20/19 7:01 PM:
--

To follow up from the ML thread, I'm having a very similar problem when 
converting PubsubIO to an external transform.

{quote}
To transfer data between the two a coder needs to be involved. It happens to be 
that the initial read does not immediately drop the KafkaRecord structure which 
does not work together well with our current assumption of only supporting 
"standard coders" present in all SDKs. Only the subsequent DoFn converts the 
KafkaRecord structure into a raw KV[byte, byte], but the DoFn won't have the 
coder available in its environment.
{quote}

PubsubIO behaves much the same way as you've described -- a Read produces a 
PubsubMessage structure which is converted to bytes (in protobufs format) by a 
subsequent DoFn -- but the problem I'm seeing is not that "the DoFn won't have 
the coder available in its environment", it's that the Read does not use the 
proper coder because FlinkStreamingPortablePipelineTranslator.translateRead 
replaces all non-standard coders with ByteArrayCoder.  So in my tests, the job 
fails with the first read message, complaining that PubsubMessage cannot be 
cast to bytes.

Some thoughts on the proposed solutions:

{quote}
1. Make the DoFn which drops the KafkaRecordCoder a native Java transform in 
the Flink Runner
{quote}

I don't see how this will help PubsubIO, because as I stated I'm fairly 
confident the error happens when flink tries to store the read pubsub message, 
before the DoFn is even called.  

{quote}
2. Modify KafkaIO to immediately drop the KafkaRecord structure
{quote}

This is the simplest solution if we look at it superficially, but it's severely 
limited in usefulness.  In PubsubIO, the next transform is actually a DoFn that 
wants a PubsubMessage, to update some stats metrics (see my overview of the 
pipeline below).  This solution would either prevent that from working or 
require that every transform decode and encode the payload itself, which I 
think is self-defeating.

{quote}
3. Add the KafkaRecordCoder to all SDKs
{quote}

Thanks for the patch example.  I'll see if something similar works for PubsubIO.

{quote}
4. Add a generic coder, e.g. AvroCoder to all SDKs
{quote}

What about protobufs as a compromise between options 3 and 4?  

In the case of PubsubIO and PubsubMessages, java has its own native coder 
(actually a pair, one for messages w/ attributes and one for w/out), while 
python is using protobufs.  It would be simpler to write a protobufs coder for 
Java than to write a python version of PubsubMessageWithAttributesCoder, so I 
would favor protobufs as a pattern to apply generally to this kind of 
portability problem.

The other thing I like about a protobufs-based approach is that you could write 
an abstract ProtoCoder base class which can easily be specialized by simply 
providing the native class and the proto class.

My current external pubsub test pipeline looks like this after it's been 
expanded:

{noformat}
--- java ---
Read
ParDo  # stats
ParDo  # convert to pubsub protobuf byte array
--- python --
ParDo  # convert from pubsub protobuf byte array
ParDo  # custom logger
{noformat}

If we created and registered a protobufs-based coder for PubsubMessage, we 
could cut out the converter transforms, and it would look like this:

{noformat}
--- java ---
Read
ParDo  # stats
--- python --
ParDo  # custom logger
{noformat}

That's appealing to me, though it's worth noting that we would get this benefit 
using any cross-language coder, whether it's protobufs, avro, or just rewriting 
PubsubMessageWithAttributesCoder in python.



was (Author: chadrik):
To follow up from the ML thread, I'm having a very similar problem when 
converting PubsubIO to an external transform.

{quote}
To transfer data between the two a coder needs to be involved. It happens to be 
that the initial read does not immediately drop the KafkaRecord structure which 
does not work together well with our current assumption of only supporting 
"standard coders" present in all SDKs. Only the subsequent DoFn converts the 
KafkaRecord structure into a raw KV[byte, byte], but the DoFn won't have the 
coder available in its environment.
{quote}

PubsubIO behaves much the same way as you've described -- a Read produces a 
PubsubMessage structure which is converted to bytes (in protobufs format) by a 
subsequent DoFn -- but the problem I'm seeing is not that "the DoFn won't have 
the coder available in its environment", it's that the Read does not use the 
proper coder because FlinkStreamingPortablePipelineTranslator.translateRead 
replaces all non-standard coders with ByteArrayCoder.  So in my tests, the job 
fails with the first read message, complaining that PubsubMessage cannot be 

[jira] [Created] (BEAM-8000) Add Delete method to gRPC JobService

2019-08-18 Thread Chad Dombrova (JIRA)
Chad Dombrova created BEAM-8000:
---

 Summary: Add Delete method to gRPC JobService
 Key: BEAM-8000
 URL: https://issues.apache.org/jira/browse/BEAM-8000
 Project: Beam
  Issue Type: Improvement
  Components: beam-model
Reporter: Chad Dombrova
Assignee: Chad Dombrova


As a user of the InMemoryJobService, I want a method to purge jobs from memory 
when they are no longer needed, so that the service does not balloon in memory 
usage over time.

I was planning to name this Delete.  Also considering the name Purge.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Assigned] (BEAM-7720) Fix the exception type of InMemoryJobService when job id not found

2019-08-14 Thread Chad Dombrova (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chad Dombrova reassigned BEAM-7720:
---

Assignee: Chad Dombrova

> Fix the exception type of InMemoryJobService when job id not found
> --
>
> Key: BEAM-7720
> URL: https://issues.apache.org/jira/browse/BEAM-7720
> Project: Beam
>  Issue Type: Bug
>  Components: beam-model
>Reporter: Chad Dombrova
>Assignee: Chad Dombrova
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The contract in beam_job_api.proto for `CancelJobRequest`, 
> `GetJobStateRequest`, and `GetJobPipelineRequest` states:
>   
> {noformat}
> // Throws error NOT_FOUND if the jobId is not found{noformat}
>   
> However, `InMemoryJobService` is handling this exception incorrectly by 
> rethrowing `NOT_FOUND` exceptions as `INTERNAL`.
> Neither `JobMessagesRequest` nor `GetJobMetricsRequest` states its contract 
> with respect to exceptions, but they should probably be updated to handle 
> `NOT_FOUND` in the same way.
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (BEAM-7984) [python] The coder returned for typehints.List should be IterableCoder

2019-08-14 Thread Chad Dombrova (JIRA)
Chad Dombrova created BEAM-7984:
---

 Summary: [python] The coder returned for typehints.List should be 
IterableCoder
 Key: BEAM-7984
 URL: https://issues.apache.org/jira/browse/BEAM-7984
 Project: Beam
  Issue Type: Improvement
  Components: sdk-py-core
Reporter: Chad Dombrova
Assignee: Chad Dombrova


IterableCoder encodes a list and decodes to a list, but 
typecoders.registry.get_coder(typehints.List[bytes]) returns a 
FastPrimitivesCoder.  I don't see any reason why this would be advantageous.
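
To reproduce (a quick sketch using the registry call above):

{code:python}
from apache_beam import typehints
from apache_beam.coders import typecoders

coder = typecoders.registry.get_coder(typehints.List[bytes])
print(type(coder).__name__)  # the fast-primitives fallback today, not IterableCoder
{code}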




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (BEAM-7931) Fix gradlew to allow offline builds

2019-08-08 Thread Chad Dombrova (JIRA)
Chad Dombrova created BEAM-7931:
---

 Summary: Fix gradlew to allow offline builds
 Key: BEAM-7931
 URL: https://issues.apache.org/jira/browse/BEAM-7931
 Project: Beam
  Issue Type: Improvement
  Components: build-system
Reporter: Chad Dombrova


When running `./gradlew` the first thing it tries to do is download gradle:

{noformat}
Downloading https://services.gradle.org/distributions/gradle-5.2.1-all.zip
{noformat}

There seems to be no way to skip this, even if the correct version of gradle 
has already been downloaded and exists in the correct place.  The tool should 
be smart enough to do a version check and skip downloading if it exists.  
Unfortunately, the logic is wrapped up inside gradle/wrapper/gradle-wrapper.jar 
so there does not seem to be any way to fix this.  Where is the code for this 
jar?

This is the first step of several to allow beam to be built offline.




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (BEAM-7927) Add ability to get the list of submitted jobs from gRPC JobService

2019-08-07 Thread Chad Dombrova (JIRA)
Chad Dombrova created BEAM-7927:
---

 Summary: Add ability to get the list of submitted jobs from gRPC 
JobService
 Key: BEAM-7927
 URL: https://issues.apache.org/jira/browse/BEAM-7927
 Project: Beam
  Issue Type: New Feature
  Components: beam-model
Reporter: Chad Dombrova


As a developer building a client for monitoring running jobs via the 
JobService, I want the ability to get a list of jobs – particularly their job 
ids – so that I can use this as an entry point for getting other information 
about a job already offered by the JobService, such as the pipeline definition, 
a stream of status changes, etc. 

Currently, the JobService is only useful if you already have a valid job id. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Assigned] (BEAM-7738) Support PubSubIO to be configured externally for use with other SDKs

2019-08-05 Thread Chad Dombrova (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chad Dombrova reassigned BEAM-7738:
---

Assignee: Chad Dombrova

> Support PubSubIO to be configured externally for use with other SDKs
> 
>
> Key: BEAM-7738
> URL: https://issues.apache.org/jira/browse/BEAM-7738
> Project: Beam
>  Issue Type: New Feature
>  Components: io-java-gcp, runner-flink, sdk-py-core
>Reporter: Chad Dombrova
>Assignee: Chad Dombrova
>Priority: Major
>  Labels: portability
>
> Now that KafkaIO is supported via the external transform API (BEAM-7029) we 
> should add support for PubSub.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

