[jira] [Commented] (BEAM-8286) Python precommit (:sdks:python:test-suites:tox:py2:docs) failing
[ https://issues.apache.org/jira/browse/BEAM-8286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933891#comment-16933891 ] Chamikara Jayalath commented on BEAM-8286:
--

Seems like Kyle's PR fixed the failure. Closing.

> Python precommit (:sdks:python:test-suites:tox:py2:docs) failing
> ---
>
> Key: BEAM-8286
> URL: https://issues.apache.org/jira/browse/BEAM-8286
> Project: Beam
> Issue Type: Bug
> Components: test-failures
> Reporter: Kyle Weaver
> Priority: Major
> Time Spent: 40m
> Remaining Estimate: 0h
>
> Example failure:
> [https://builds.apache.org/job/beam_PreCommit_Python_Commit/8638/console]
> 17:29:13 * What went wrong:
> 17:29:13 Execution failed for task ':sdks:python:test-suites:tox:py2:docs'.
> 17:29:13 > Process 'command 'sh'' finished with non-zero exit value 1
> Fails on my local machine (on head) as well. Can't determine exact cause.
> ERROR: InvocationError for command /usr/bin/time scripts/generate_pydoc.sh (exited with code 1)

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (BEAM-8286) Python precommit (:sdks:python:test-suites:tox:py2:docs) failing
[ https://issues.apache.org/jira/browse/BEAM-8286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath resolved BEAM-8286.
--

Fix Version/s: Not applicable
Resolution: Fixed
[jira] [Commented] (BEAM-8286) Python precommit (:sdks:python:test-suites:tox:py2:docs) failing
[ https://issues.apache.org/jira/browse/BEAM-8286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933748#comment-16933748 ] Chamikara Jayalath commented on BEAM-8286:
--

Hmm, looking at another Gradle build I see only the following warnings.

WARNING: intersphinx inventory 'https://googleapis.github.io/google-cloud-python/latest/objects.inv' not fetchable due to : 404 Client Error: Not Found for url: https://googleapis.github.io/google-cloud-python/latest/objects.inv
/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Phrase/src/sdks/python/test-suites/tox/py2/build/srcs/sdks/python/apache_beam/io/gcp/datastore/v1new/helper.py:docstring of apache_beam.io.gcp.datastore.v1new.helper.write_mutations:: WARNING: py:class reference target not found: google.cloud.datastore.batch.Batch
/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Phrase/src/sdks/python/test-suites/tox/py2/build/srcs/sdks/python/apache_beam/io/gcp/datastore/v1new/types.py:docstring of apache_beam.io.gcp.datastore.v1new.types.Key:: WARNING: py:class reference target not found: google.cloud.datastore.key.Key
/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Phrase/src/sdks/python/test-suites/tox/py2/build/srcs/sdks/python/apache_beam/io/gcp/datastore/v1new/types.py:docstring of apache_beam.io.gcp.datastore.v1new.types.Key.to_client_key:1: WARNING: py:class reference target not found: google.cloud.datastore.key.Key
/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Phrase/src/sdks/python/test-suites/tox/py2/build/srcs/sdks/python/apache_beam/io/gcp/datastore/v1new/types.py:docstring of apache_beam.io.gcp.datastore.v1new.types.Entity.set_properties:: WARNING: py:class reference target not found: google.cloud.datastore.entity.Entity
/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Phrase/src/sdks/python/test-suites/tox/py2/build/srcs/sdks/python/apache_beam/io/gcp/datastore/v1new/types.py:docstring of apache_beam.io.gcp.datastore.v1new.types.Entity.to_client_entity:1: WARNING: py:class reference target not found: google.cloud.datastore.entity.Entity

All these warnings are related to Datastore. The first one is interesting, since that link does not actually seem to be available. Looking at a passing ':sdks:python:test-suites:tox:py2:docs' run, I don't see any warnings.
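The "not fetchable ... 404" warning comes from Sphinx's intersphinx extension, which downloads each configured target's objects.inv inventory at build time. As an illustrative sketch only (this is not Beam's actual conf.py, and the mapping keys are assumptions), a conf.py entry for such a target looks like this; when the docs build runs sphinx-build with -W, the warning from a dead inventory URL is promoted to a build failure like the one above:

```python
# conf.py fragment (illustrative, not Beam's real config). intersphinx
# fetches <base URL>/objects.inv for every entry here at build time, so an
# entry whose host has moved or removed the inventory yields the
# "intersphinx inventory ... not fetchable ... 404" warning.
intersphinx_mapping = {
    'python': ('https://docs.python.org/3/', None),
    # A stale entry like this one would reproduce the warning above; the fix
    # is to point at a URL that still serves objects.inv, or drop the entry.
    'google-cloud-python': (
        'https://googleapis.github.io/google-cloud-python/latest/', None),
}
```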
[jira] [Commented] (BEAM-8286) Python precommit (:sdks:python:test-suites:tox:py2:docs) failing
[ https://issues.apache.org/jira/browse/BEAM-8286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933732#comment-16933732 ] Chamikara Jayalath commented on BEAM-8286:
--

This seems to be the first PR for which the pre-commit failed: [https://github.com/apache/beam/pull/9611] But I'm not sure if it's related. cc: [~udim]
[jira] [Commented] (BEAM-8240) SDK Harness
[ https://issues.apache.org/jira/browse/BEAM-8240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930826#comment-16930826 ] Chamikara Jayalath commented on BEAM-8240:
--

Makes sense. BTW please fix the title of the JIRA.

> SDK Harness
> ---
>
> Key: BEAM-8240
> URL: https://issues.apache.org/jira/browse/BEAM-8240
> Project: Beam
> Issue Type: Bug
> Components: sdk-py-harness
> Reporter: Luke Cwik
> Assignee: Luke Cwik
> Priority: Minor
> Time Spent: 2h 50m
> Remaining Estimate: 0h
>
> SDK harness incorrectly identifies itself when using a custom SDK container within the environment field when building the pipeline proto.
>
> Passing in the experiment *worker_harness_container_image=YYY* doesn't override the pipeline proto environment field and it is still being populated with *gcr.io/cloud-dataflow/v1beta3/python-fnapi:beam-master-20190802*

-- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (BEAM-8222) Consider making insertId optional in BigQuery.insertAll
[ https://issues.apache.org/jira/browse/BEAM-8222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930697#comment-16930697 ] Chamikara Jayalath commented on BEAM-8222:
--

Based on some offline comments from [~reuvenlax], this might be undesirable and may cause user confusion. AFAIK Dataflow and other Beam runners that support BigQueryIO.Sink are tolerant to failures and may retry work items, so handling duplicates is required for the safety of inserted data. Without insertId, things might speed up in the short term for runs without failures, but this mode of execution is not safe in the long run.

> Consider making insertId optional in BigQuery.insertAll
> ---
>
> Key: BEAM-8222
> URL: https://issues.apache.org/jira/browse/BEAM-8222
> Project: Beam
> Issue Type: New Feature
> Components: io-java-gcp
> Reporter: Boyuan Zhang
> Priority: Major
>
> The current implementation of StreamingWriteFn (https://github.com/apache/beam/blob/c2f0d282337f3ae0196a7717712396a5a41fdde1/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/StreamingWriteFn.java#L102) sets insertId from the input element, which is given a uniqueId by https://github.com/apache/beam/blob/c2f0d282337f3ae0196a7717712396a5a41fdde1/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/TagWithUniqueIds.java#L53.
> Users report that leaving insertId empty speeds writing up considerably. Can we add a bqOption like nonInsertId and emit an empty id based on this option?
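To illustrate the safety argument above, here is a minimal Python sketch, not the BigQuery API: `insert_all` and the dict-backed table are hypothetical stand-ins modeling how the service deduplicates on insertId. A runner that retries a failed bundle re-sends the same rows with the same ids, so no duplicates land; with empty or missing ids the second send would double-count.

```python
# Toy model of insertId-based deduplication (not the real BigQuery API).
def insert_all(table, rows_with_ids):
    """Insert rows keyed by insertId; a row whose id is already known is dropped."""
    for insert_id, row in rows_with_ids:
        table.setdefault(insert_id, row)

table = {}
bundle = [("id-1", {"v": 1}), ("id-2", {"v": 2})]
insert_all(table, bundle)
insert_all(table, bundle)  # runner retries the same bundle after a failure
assert len(table) == 2     # stable insertIds prevent duplicate rows
```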
[jira] [Commented] (BEAM-8240) SDK Harness
[ https://issues.apache.org/jira/browse/BEAM-8240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930682#comment-16930682 ] Chamikara Jayalath commented on BEAM-8240:
--

In the cross-language path, there will be more than one SDK container, so I think the correct solution is to update Dataflow to set a pointer to the container image in the environment payload (of transforms) and determine the set of containers needed within the Dataflow service. For custom containers, we should be updating the environment payload instead of just updating the worker_harness_container_image flag. cc: [~robertwb]
[jira] [Closed] (BEAM-8231) State sampler is dropping stage-step names
[ https://issues.apache.org/jira/browse/BEAM-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath closed BEAM-8231.
--

Fix Version/s: 2.17.0
Resolution: Fixed

> State sampler is dropping stage-step names
> ---
>
> Key: BEAM-8231
> URL: https://issues.apache.org/jira/browse/BEAM-8231
> Project: Beam
> Issue Type: Bug
> Components: runner-dataflow
> Reporter: Chamikara Jayalath
> Assignee: Chamikara Jayalath
> Priority: Minor
> Fix For: 2.17.0
> Time Spent: 50m
> Remaining Estimate: 0h
>
> State sampler is currently dropping stage and step names from the reported current state.
>
> Only affects DataflowRunner.
[jira] [Created] (BEAM-8231) State sampler is dropping stage-step names
Chamikara Jayalath created BEAM-8231:

Summary: State sampler is dropping stage-step names
Key: BEAM-8231
URL: https://issues.apache.org/jira/browse/BEAM-8231
Project: Beam
Issue Type: Bug
Components: runner-dataflow
Reporter: Chamikara Jayalath
Assignee: Chamikara Jayalath

State sampler is currently dropping stage and step names from the reported current state. Only affects DataflowRunner.
[jira] [Commented] (BEAM-8215) Wordcount 1GB Python PKB benchmarks sometimes fail with uninformative error
[ https://issues.apache.org/jira/browse/BEAM-8215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928866#comment-16928866 ] Chamikara Jayalath commented on BEAM-8215:
--

Seems like this is making Python 3 post commits extremely flaky (but, strangely, not Python 2). Some recent examples:
[https://builds.apache.org/job/beam_PerformanceTests_WordCountIT_Py35/479/console]
[https://builds.apache.org/job/beam_PerformanceTests_WordCountIT_Py36/456/console]

> Wordcount 1GB Python PKB benchmarks sometimes fail with uninformative error
> ---
>
> Key: BEAM-8215
> URL: https://issues.apache.org/jira/browse/BEAM-8215
> Project: Beam
> Issue Type: Bug
> Components: testing
> Reporter: Valentyn Tymofieiev
> Assignee: Mark Liu
> Priority: Major
>
> Example:
> https://builds.apache.org/job/beam_PerformanceTests_WordCountIT_Py36/452/console
> {noformat}
> 12:09:27 2019-09-11 19:09:27,655 a47400ce MainThread beam_integration_benchmark(1/1) ERROR Error during benchmark beam_integration_benchmark
> 12:09:27 Traceback (most recent call last):
> 12:09:27   File "/home/jenkins/jenkins-slave/workspace/beam_PerformanceTests_WordCountIT_Py36/PerfKitBenchmarker/perfkitbenchmarker/pkb.py", line 841, in RunBenchmark
> 12:09:27     DoRunPhase(spec, collector, detailed_timer)
> 12:09:27   File "/home/jenkins/jenkins-slave/workspace/beam_PerformanceTests_WordCountIT_Py36/PerfKitBenchmarker/perfkitbenchmarker/pkb.py", line 687, in DoRunPhase
> 12:09:27     samples = spec.BenchmarkRun(spec)
> 12:09:27   File "/home/jenkins/jenkins-slave/workspace/beam_PerformanceTests_WordCountIT_Py36/PerfKitBenchmarker/perfkitbenchmarker/linux_benchmarks/beam_integration_benchmark.py", line 160, in Run
> 12:09:27     job_type=job_type)
> 12:09:27   File "/home/jenkins/jenkins-slave/workspace/beam_PerformanceTests_WordCountIT_Py36/PerfKitBenchmarker/perfkitbenchmarker/providers/gcp/gcp_dpb_dataflow.py", line 91, in SubmitJob
> 12:09:27     assert retcode == 0, "Integration Test Failed."
> 12:09:27 AssertionError: Integration Test Failed.
> {noformat}
> It seems like job submission failed, but there are no details. I talked with [~markflyhigh], and it sounds like we plan to stop using PKB in favor of another framework.
> Assigning to Mark for now to triage, follow up, or reassign as appropriate.
[jira] [Created] (BEAM-8219) crossLanguagePortableWordCount seems to be flaky for beam_PostCommit_Python2
Chamikara Jayalath created BEAM-8219:

Summary: crossLanguagePortableWordCount seems to be flaky for beam_PostCommit_Python2
Key: BEAM-8219
URL: https://issues.apache.org/jira/browse/BEAM-8219
Project: Beam
Issue Type: Bug
Components: test-failures
Reporter: Chamikara Jayalath
Assignee: Chamikara Jayalath

For example,
[https://builds.apache.org/job/beam_PostCommit_Python2/451/console]
[https://builds.apache.org/job/beam_PostCommit_Python2/454/console]

10:37:22 * What went wrong:
10:37:22 Execution failed for task ':sdks:python:test-suites:portable:py2:crossLanguagePortableWordCount'.
10:37:22 > Process 'command 'sh'' finished with non-zero exit value 1

cc: [~heejong] [~mxm]
[jira] [Updated] (BEAM-7611) Python BigTableIO IT is not running in any test suites
[ https://issues.apache.org/jira/browse/BEAM-7611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath updated BEAM-7611:
--

Priority: Critical (was: Blocker)

> Python BigTableIO IT is not running in any test suites
> ---
>
> Key: BEAM-7611
> URL: https://issues.apache.org/jira/browse/BEAM-7611
> Project: Beam
> Issue Type: Bug
> Components: io-py-gcp, testing
> Reporter: Chamikara Jayalath
> Assignee: Solomon Duskis
> Priority: Critical
> Fix For: 2.16.0
>
> We added an integration test here: [https://github.com/apache/beam/pull/7367]
>
> But this currently does not get picked up by any test suites (and gets skipped by some due to missing dependencies), hence the BigTable sink is largely untested.
>
> First attempt to enable it: [https://github.com/apache/beam/pull/8886]
>
> Solomon, assigning to you since I cannot find Juan's (PR author) Jira ID.
[jira] [Commented] (BEAM-7611) Python BigTableIO IT is not running in any test suites
[ https://issues.apache.org/jira/browse/BEAM-7611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928763#comment-16928763 ] Chamikara Jayalath commented on BEAM-7611:
--

Created cherry-pick [https://github.com/apache/beam/pull/9558]
Reducing the severity of this bug from blocker to critical.
[jira] [Updated] (BEAM-7611) Python BigTableIO IT is not running in any test suites
[ https://issues.apache.org/jira/browse/BEAM-7611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath updated BEAM-7611:
--

Fix Version/s: (was: 2.16.0)
[jira] [Commented] (BEAM-7611) Python BigTableIO IT is not running in any test suites
[ https://issues.apache.org/jira/browse/BEAM-7611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928044#comment-16928044 ] Chamikara Jayalath commented on BEAM-7611:
--

I believe [~udim] is trying to run a pipeline with the BigTable sink using the new dependencies in [https://github.com/apache/beam/pull/9491]. If that works, we can take this Jira out of Beam 2.16.0 blocking status, but fixing test coverage for the BigTable sink is still critical IMO.
[jira] [Commented] (BEAM-7611) Python BigTableIO IT is not running in any test suites
[ https://issues.apache.org/jira/browse/BEAM-7611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927627#comment-16927627 ] Chamikara Jayalath commented on BEAM-7611:
--

Also, this is blocking a critical Dataflow dependency upgrade that is needed for Beam 2.16.0.
[jira] [Commented] (BEAM-7611) Python BigTableIO IT is not running in any test suites
[ https://issues.apache.org/jira/browse/BEAM-7611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927628#comment-16927628 ] Chamikara Jayalath commented on BEAM-7611:
--

cc: [~altay] [~udim]
[jira] [Updated] (BEAM-7611) Python BigTableIO IT is not running in any test suites
[ https://issues.apache.org/jira/browse/BEAM-7611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath updated BEAM-7611:
--

Priority: Blocker (was: Critical)
[jira] [Updated] (BEAM-7611) Python BigTableIO IT is not running in any test suites
[ https://issues.apache.org/jira/browse/BEAM-7611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath updated BEAM-7611:
--

Fix Version/s: 2.16.0
[jira] [Commented] (BEAM-7611) Python BigTableIO IT is not running in any test suites
[ https://issues.apache.org/jira/browse/BEAM-7611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927161#comment-16927161 ] Chamikara Jayalath commented on BEAM-7611:
--

[~sduskis] can you please look into this or assign it to the original author? (I couldn't find the Jira account of the original author, Juan Rael.) Seems like we are lacking test coverage for the Python BigTable sink.
[jira] [Updated] (BEAM-7611) Python BigTableIO IT is not running in any test suites
[ https://issues.apache.org/jira/browse/BEAM-7611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath updated BEAM-7611:
--

Priority: Critical (was: Major)
[jira] [Commented] (BEAM-8155) Add PubsubIO.readMessagesWithCoderAndParseFn method
[ https://issues.apache.org/jira/browse/BEAM-8155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16926870#comment-16926870 ] Chamikara Jayalath commented on BEAM-8155:
--

Closing since the PR was merged. Thanks!

> Add PubsubIO.readMessagesWithCoderAndParseFn method
> ---
>
> Key: BEAM-8155
> URL: https://issues.apache.org/jira/browse/BEAM-8155
> Project: Beam
> Issue Type: Improvement
> Components: io-java-gcp
> Reporter: Claire McGinty
> Assignee: Claire McGinty
> Priority: Major
> Fix For: 2.16.0
> Time Spent: 20m
> Remaining Estimate: 0h
>
> Complements [#8879|https://github.com/apache/beam/pull/8879]; the constructor for a generic-typed PubsubIO.Read is private, so some kind of public constructor is needed to use custom parse functions with generic types.
[jira] [Resolved] (BEAM-8155) Add PubsubIO.readMessagesWithCoderAndParseFn method
[ https://issues.apache.org/jira/browse/BEAM-8155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath resolved BEAM-8155.
--

Resolution: Fixed
[jira] [Created] (BEAM-8138) Fix code snippets in FileIO Java docs
Chamikara Jayalath created BEAM-8138:

Summary: Fix code snippets in FileIO Java docs
Key: BEAM-8138
URL: https://issues.apache.org/jira/browse/BEAM-8138
Project: Beam
Issue Type: Improvement
Components: io-java-avro
Reporter: Chamikara Jayalath

For example, the first two snippets here seem to be incorrect.
[https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/AvroIO.java#L153]
[jira] [Commented] (BEAM-8046) Unable to read from bigquery and publish to pubsub using dataflow runner (python SDK)
[ https://issues.apache.org/jira/browse/BEAM-8046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916973#comment-16916973 ] Chamikara Jayalath commented on BEAM-8046:
--

I think the issue here is that the current PubSub implementation in the Python SDK is a native source/sink implemented in the Dataflow streaming backend.

> Unable to read from bigquery and publish to pubsub using dataflow runner (python SDK)
> ---
>
> Key: BEAM-8046
> URL: https://issues.apache.org/jira/browse/BEAM-8046
> Project: Beam
> Issue Type: Improvement
> Components: io-py-gcp, runner-dataflow
> Affects Versions: 2.13.0, 2.14.0
> Reporter: James Hutchison
> Priority: Major
>
> With the Python SDK:
> The Dataflow runner does not allow reading from BigQuery in streaming pipelines.
> Pub/Sub is not allowed in batch pipelines.
> Thus, there's no way to create a pipeline on the Dataflow runner that reads from BigQuery and publishes to Pub/Sub.
[jira] [Commented] (BEAM-8046) Unable to read from bigquery and publish to pubsub using dataflow runner (python SDK)
[ https://issues.apache.org/jira/browse/BEAM-8046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916899#comment-16916899 ] Chamikara Jayalath commented on BEAM-8046:
--

BigQuery is a bounded source, so we'll need to add an unbounded-read-from-bounded-source wrapper to support it in streaming pipelines. For example, we have [1] for Java. Also, we don't have an unbounded source framework for Python yet; Splittable DoFn is currently in the works to this end.

[1] [https://github.com/apache/beam/blob/master/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/UnboundedReadFromBoundedSource.java]
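The idea behind Java's UnboundedReadFromBoundedSource can be sketched conceptually like this. This is a toy Python sketch under stated assumptions, not Beam's UnboundedSource API: the "checkpoint" here is just a position counter that lets a streaming consumer resume a finite read after a failure.

```python
import itertools

# Toy bounded source: a finite collection of elements.
def bounded_read():
    return [1, 2, 3, 4]

def unbounded_from_bounded(source_fn, checkpoint=0):
    """Expose a bounded read as a resumable stream of (checkpoint, element) pairs."""
    for pos, elem in enumerate(source_fn()):
        if pos >= checkpoint:
            # Yield the next checkpoint alongside the element, so a consumer
            # can record progress and resume from here after a restart.
            yield pos + 1, elem

# Consume two elements, "fail", then resume from the recorded checkpoint.
first = list(itertools.islice(unbounded_from_bounded(bounded_read), 2))
resumed = list(unbounded_from_bounded(bounded_read, checkpoint=first[-1][0]))
assert [e for _, e in first] + [e for _, e in resumed] == [1, 2, 3, 4]
```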
[jira] [Commented] (BEAM-8089) Error while using customGcsTempLocation() with Dataflow
[ https://issues.apache.org/jira/browse/BEAM-8089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916117#comment-16916117 ] Chamikara Jayalath commented on BEAM-8089: -- Have you tried running Dataflow in the same region as where your bucket located using option [1] ? Networks charges should not apply in this case according to [2]. [1] [https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/options/DataflowPipelineOptions.java#L133] [2] [https://cloud.google.com/storage/pricing] > Error while using customGcsTempLocation() with Dataflow > --- > > Key: BEAM-8089 > URL: https://issues.apache.org/jira/browse/BEAM-8089 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.13.0 >Reporter: Harshit Dwivedi >Assignee: Chamikara Jayalath >Priority: Major > > I have the following code snippet which writes content to BigQuery via File > Loads. > Currently the files are being written to a GCS Bucket, but I want to write > them to the local file storage of Dataflow instead and want BigQuery to load > data from there. > > > > {code:java} > BigQueryIO > .writeTableRows() > .withNumFileShards(100) > .withTriggeringFrequency(Duration.standardSeconds(90)) > .withMethod(BigQueryIO.Write.Method.FILE_LOADS) > .withSchema(getSchema()) > .withoutValidation() > .withCustomGcsTempLocation(new ValueProvider() { > @Override > public String get(){ > return "/home/harshit/testFiles"; > } > @Override > public boolean isAccessible(){ > return true; > }}) > .withTimePartitioning(new TimePartitioning().setType("DAY")) > .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED) > .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND) > .to(tableName)); > {code} > > > On running this, I don't see any files being written to the provided path and > the BQ load jobs fail with an IOException. 
> > I looked at the docs, but I was unable to find any working example for this. -- This message was sent by Atlassian Jira (v8.3.2#803003)
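The region option referenced as [1] above is set like any other pipeline option. A minimal sketch of the intended invocation shape (project, bucket, and region names are hypothetical; in a real pipeline these args would be handed to PipelineOptionsFactory, so only the flag shape is shown here):

```java
public class RegionArgsSketch {
    // Hypothetical values; the real flag handling lives in
    // PipelineOptionsFactory / DataflowPipelineOptions.
    static String[] pipelineArgs() {
        return new String[] {
            "--runner=DataflowRunner",
            "--project=my-project",              // hypothetical project id
            "--region=us-central1",              // same region as the temp bucket
            "--tempLocation=gs://my-bucket/tmp"  // bucket created in us-central1
        };
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", pipelineArgs()));
    }
}
```

Keeping the job and the bucket co-located is what avoids the cross-region network charges discussed in [2].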
[jira] [Commented] (BEAM-8089) Error while using customGcsTempLocation() with Dataflow
[ https://issues.apache.org/jira/browse/BEAM-8089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916052#comment-16916052 ] Chamikara Jayalath commented on BEAM-8089: -- BTW, may I ask why you cannot use GCS in this case? Dataflow already needs GCS to run, and storage costs should be minimal. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (BEAM-8089) Error while using customGcsTempLocation() with Dataflow
[ https://issues.apache.org/jira/browse/BEAM-8089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916051#comment-16916051 ] Chamikara Jayalath commented on BEAM-8089: -- I don't think we can fork Beam code for a very specific scenario of the Dataflow runner (a single worker with autoscaling disabled). In general, Dataflow does not fuse the step that writes the files and the step that executes the BQ load job, so these two steps may not execute on the same worker. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (BEAM-8089) Error while using customGcsTempLocation() with Dataflow
[ https://issues.apache.org/jira/browse/BEAM-8089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916033#comment-16916033 ] Chamikara Jayalath commented on BEAM-8089: -- True, it seems this is supported only in a limited way (wildcards are not supported, for example). I think Beam will have a hard time supporting this, since most Beam runners are distributed and use multiple nodes to write data (to files) in parallel, so there is no "single" local disk. This is why we use a distributed storage location to which all workers can write individual files (a directory in GCS in this case) and execute a single BQ load job for all of those files from there. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (BEAM-8089) Error while using customGcsTempLocation() with Dataflow
[ https://issues.apache.org/jira/browse/BEAM-8089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915992#comment-16915992 ] Chamikara Jayalath commented on BEAM-8089: -- BQ cannot execute load jobs from local files; the files have to be in GCS. So I think this is working as intended. -- This message was sent by Atlassian Jira (v8.3.2#803003)
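Since BQ load jobs can only read files from GCS, a pipeline that accepts a custom temp location could reject local filesystem paths up front rather than failing at load time. This is a hypothetical helper sketch, not part of BigQueryIO:

```java
public class GcsPathCheck {
    // Hypothetical guard: BigQuery load jobs can only read from GCS, so a
    // custom temp location must be a gs:// URI, not a local filesystem path.
    static boolean isValidBqLoadTempLocation(String path) {
        return path != null && path.startsWith("gs://");
    }

    public static void main(String[] args) {
        System.out.println(isValidBqLoadTempLocation("gs://my-bucket/tmp"));    // true
        System.out.println(isValidBqLoadTempLocation("/home/harshit/testFiles")); // false
    }
}
```

Failing fast at pipeline construction would turn the opaque IOException from the load job into an immediate, explainable error.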
[jira] [Commented] (BEAM-8023) Allow specifying BigQuery Storage API readOptions at runtime
[ https://issues.apache.org/jira/browse/BEAM-8023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16913647#comment-16913647 ] Chamikara Jayalath commented on BEAM-8023: -- Wow, that was fast :) Thanks, Ken. I'll take a look. > Allow specifying BigQuery Storage API readOptions at runtime > > > Key: BEAM-8023 > URL: https://issues.apache.org/jira/browse/BEAM-8023 > Project: Beam > Issue Type: Improvement > Components: io-java-gcp >Reporter: Jeff Klukas >Assignee: Kenneth Jung >Priority: Minor > Time Spent: 20m > Remaining Estimate: 0h > > We have support in the Java SDK for using the BigQuery Storage API for reads, > but only the target query or table is supported as a ValueProvider to be > specified at runtime. AFAICT, there is no reason we can't delay specifying > readOptions until runtime as well. > The readOptions are accessed by BigQueryStorageTableSource in getTargetTable; > I believe that's occurring at runtime, but I'd love for someone with deeper > BoundedSource knowledge to confirm that. > I'd advocate for adding new methods > `TypedRead.withSelectedFields(ValueProvider<List<String>> value)` and > `TypedRead.withRowRestriction(ValueProvider<String> value)`. The existing > `withReadOptions` method would then populate the other two as > StaticValueProviders. Perhaps we'd want to deprecate `withReadOptions` in > favor of specifying individual read options as separate parameters. -- This message was sent by Atlassian Jira (v8.3.2#803003)
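To illustrate the deferred-evaluation idea behind the proposal, here is a minimal stand-in for Beam's ValueProvider/StaticValueProvider (the interface and method names below are simplified stand-ins, not the real Beam API): read options become values resolved at runtime rather than fixed at pipeline construction.

```java
import java.util.List;
import java.util.function.Supplier;

public class DeferredReadOptions {
    // Simplified stand-in for org.apache.beam.sdk.options.ValueProvider.
    interface ValueProvider<T> extends Supplier<T> {}

    // Analogous to StaticValueProvider.of(...): wraps a known value so it
    // fits the same deferred interface as runtime-provided values.
    static <T> ValueProvider<T> staticValue(T value) {
        return () -> value;
    }

    public static void main(String[] args) {
        ValueProvider<List<String>> selectedFields = staticValue(List.of("name", "age"));
        ValueProvider<String> rowRestriction = staticValue("age > 18");
        // get() is only called here, i.e. "at runtime" in the pipeline's terms.
        System.out.println(selectedFields.get());
        System.out.println(rowRestriction.get());
    }
}
```

In this shape, `withReadOptions` could keep working by wrapping its fields as static values, while new runtime-provided options plug into the same interface.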
[jira] [Assigned] (BEAM-8023) Allow specifying BigQuery Storage API readOptions at runtime
[ https://issues.apache.org/jira/browse/BEAM-8023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath reassigned BEAM-8023: Assignee: Kenneth Jung -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (BEAM-8023) Allow specifying BigQuery Storage API readOptions at runtime
[ https://issues.apache.org/jira/browse/BEAM-8023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16913488#comment-16913488 ] Chamikara Jayalath commented on BEAM-8023: -- Yeah, I believe this should be possible. [~jeff.klu...@gmail.com], will you be able to send a PR for this? If not, [~aryann] or [~kjung520] may be interested in adding this functionality. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (BEAM-8019) Support cross-language transforms for DataflowRunner
Chamikara Jayalath created BEAM-8019: Summary: Support cross-language transforms for DataflowRunner Key: BEAM-8019 URL: https://issues.apache.org/jira/browse/BEAM-8019 Project: Beam Issue Type: New Feature Components: sdk-py-core Reporter: Chamikara Jayalath Assignee: Chamikara Jayalath This is to capture the Beam changes needed for this task. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Assigned] (BEAM-7978) ArithmeticExceptions on getting backlog bytes
[ https://issues.apache.org/jira/browse/BEAM-7978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath reassigned BEAM-7978: Assignee: Alexey Romanenko > ArithmeticExceptions on getting backlog bytes > -- > > Key: BEAM-7978 > URL: https://issues.apache.org/jira/browse/BEAM-7978 > Project: Beam > Issue Type: Bug > Components: io-java-kinesis >Affects Versions: 2.14.0 >Reporter: Mateusz >Assignee: Alexey Romanenko >Priority: Major > > Hello, > Beam 2.14.0 > (and to be more precise > [commit|https://github.com/apache/beam/commit/9fe0a0387ad1f894ef02acf32edbc9623a3b13ad#diff-b4964a457006b1555c7042c739b405ec]) > introduced a change in watermark calculation in Kinesis IO causing below > error: > {code:java} > exception: "java.lang.RuntimeException: Unknown kinesis failure, when trying > to reach kinesis > at > org.apache.beam.sdk.io.kinesis.SimplifiedKinesisClient.wrapExceptions(SimplifiedKinesisClient.java:227) > at > org.apache.beam.sdk.io.kinesis.SimplifiedKinesisClient.getBacklogBytes(SimplifiedKinesisClient.java:167) > at > org.apache.beam.sdk.io.kinesis.SimplifiedKinesisClient.getBacklogBytes(SimplifiedKinesisClient.java:155) > at > org.apache.beam.sdk.io.kinesis.KinesisReader.getTotalBacklogBytes(KinesisReader.java:158) > at > org.apache.beam.runners.dataflow.worker.StreamingModeExecutionContext.flushState(StreamingModeExecutionContext.java:433) > at > org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1289) > at > org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.access$1000(StreamingDataflowWorker.java:149) > at > org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:1024) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: 
java.lang.ArithmeticException: Value cannot fit in an int: > 153748963401 > at org.joda.time.field.FieldUtils.safeToInt(FieldUtils.java:229) > at > org.joda.time.field.BaseDurationField.getDifference(BaseDurationField.java:141) > at > org.joda.time.base.BaseSingleFieldPeriod.between(BaseSingleFieldPeriod.java:72) > at org.joda.time.Minutes.minutesBetween(Minutes.java:101) > at > org.apache.beam.sdk.io.kinesis.SimplifiedKinesisClient.lambda$getBacklogBytes$3(SimplifiedKinesisClient.java:169) > at > org.apache.beam.sdk.io.kinesis.SimplifiedKinesisClient.wrapExceptions(SimplifiedKinesisClient.java:210) > ... 10 more > {code} > We spotted this issue on the Dataflow runner. It's problematic, as the inability to > get backlog bytes seems to result in constant recreation of the KinesisReader. > The issue happens if the backlog bytes are retrieved before the watermark value > is updated from its initial default value. An easy way to reproduce it is to create > a pipeline with a Kinesis source for a stream where no records are being put. > While debugging it locally, you can observe that the watermark is set to a > value in the past (like: "-290308-12-21T19:59:05.225Z"). After two minutes > (the default watermark idle duration threshold is set to 2 minutes), the > watermark is set to the value of > [watermarkIdleThreshold|https://github.com/apache/beam/blob/9fe0a0387ad1f894ef02acf32edbc9623a3b13ad/sdks/java/io/kinesis/src/main/java/org/apache/beam/sdk/io/kinesis/WatermarkPolicyFactory.java#L110]), > so the next backlog bytes retrieval should be correct. However, as described > before, running the pipeline on the Dataflow runner results in the KinesisReader > being closed just after creation, so the watermark won't be fixed. 
> The reason for the issue is the following: the introduced watermark policies > rely on > [WatermarkParameters|https://github.com/apache/beam/blob/9fe0a0387ad1f894ef02acf32edbc9623a3b13ad/sdks/java/io/kinesis/src/main/java/org/apache/beam/sdk/io/kinesis/WatermarkParameters.java] > which initialises currentWatermark and eventTime to > [BoundedWindow.TIMESTAMP_MIN_VALUE|https://github.com/apache/beam/commit/9fe0a0387ad1f894ef02acf32edbc9623a3b13ad#diff-b4964a457006b1555c7042c739b405ecR52]. > This results in the watermark being set to new Instant(-9223372036854775L) at > KinesisReader creation. The calculated [period between the watermark and the > current > timestamp|https://github.com/apache/beam/blob/9fe0a0387ad1f894ef02acf32edbc9623a3b13ad/sdks/java/io/kinesis/src/main/java/org/apache/beam/sdk/io/kinesis/SimplifiedKinesisClient.java#L169] > is bigger than expected, causing the ArithmeticException to be thrown. > The maximum retention on Kinesis
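The overflow described above can be reproduced without Beam or Joda-Time. This sketch mirrors the failing Minutes.minutesBetween(...) followed by safeToInt(...) pattern using java.time; the watermark sentinel matches the Instant millis quoted in the report, and the "now" timestamp is a fixed stand-in chosen for reproducibility:

```java
import java.time.Duration;
import java.time.Instant;

public class WatermarkOverflowDemo {
    // Minutes between a sentinel minimum watermark and "now"; analogous to the
    // Joda Minutes.minutesBetween(...) call in SimplifiedKinesisClient, but
    // reimplemented with java.time so it compiles without Beam or Joda.
    static long minutesSince(long watermarkMillis, long nowMillis) {
        return Duration.between(Instant.ofEpochMilli(watermarkMillis),
                                Instant.ofEpochMilli(nowMillis)).toMinutes();
    }

    public static void main(String[] args) {
        long watermark = -9223372036854775L;   // the sentinel from the report
        long now = 1_566_000_000_000L;         // a fixed 2019 timestamp (stand-in)
        long minutes = minutesSince(watermark, now);
        System.out.println(minutes > Integer.MAX_VALUE); // true: cannot fit in an int
        try {
            Math.toIntExact(minutes);          // same failure mode as Joda's safeToInt
        } catch (ArithmeticException e) {
            System.out.println("ArithmeticException caught");
        }
    }
}
```

Any fix that keeps the sentinel watermark therefore has to clamp the period (or use a long-based duration) before converting to an int-valued Minutes.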
[jira] [Commented] (BEAM-7978) ArithmeticExceptions on getting backlog bytes
[ https://issues.apache.org/jira/browse/BEAM-7978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911880#comment-16911880 ] Chamikara Jayalath commented on BEAM-7978: -- Thanks Alexey. Assigned the Jira to you.
[jira] [Commented] (BEAM-7883) PubsubIO (Java) write batch size can exceed request payload limit
[ https://issues.apache.org/jira/browse/BEAM-7883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910649#comment-16910649 ] Chamikara Jayalath commented on BEAM-7883: -- Thanks. The batch sizing done in PubSub IO is just an estimate, and there can be cases where the actual messages go over the 10 MB hard limit. Using 'withMaxBatchBytesSize' is the correct workaround. > PubsubIO (Java) write batch size can exceed request payload limit > - > > Key: BEAM-7883 > URL: https://issues.apache.org/jira/browse/BEAM-7883 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.13.0 >Reporter: Yurii Atamanchuk >Priority: Minor > > In some (probably rare) cases a PubsubIO write (in Batch mode) batch size can > exceed the request payload limit of 10mb. PubsubIO ensures that the batch size is > less than the limit (10mb by default). But then PubsubJsonClient is used, which > converts message payloads into URL-safe Base64 encoding, which can inflate > message size (in my case, for JSON strings, it was up to 25-30%). As a result we > get a 400 response (with a 'Request payload size exceeds the limit: 10485760 > bytes' message), even though the original batch had a correct size. > The obvious workaround is to reduce the batch size > (`PubsubIO.writeMessages().to(...).withMaxBatchBytesSize(... i.e. 5mb ...)`), > but it is a bit annoying. -- This message was sent by Atlassian Jira (v8.3.2#803003)
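The inflation is easy to quantify: Base64 encodes every 3 raw bytes as 4 output bytes, so a batch estimated just under the limit can exceed it once encoded. A self-contained sketch (the 9 MB figure is illustrative, not taken from the report):

```java
import java.util.Base64;

public class Base64InflationDemo {
    // Size of a raw payload after URL-safe Base64 encoding: 4 * ceil(n / 3) bytes.
    static int encodedLength(int rawBytes) {
        return Base64.getUrlEncoder().encode(new byte[rawBytes]).length;
    }

    public static void main(String[] args) {
        int raw = 9 * 1024 * 1024;            // a batch PubsubIO would estimate at ~9 MB
        int onTheWire = encodedLength(raw);   // roughly +33% after encoding
        System.out.println(onTheWire);        // 12582912
        System.out.println(onTheWire > 10 * 1024 * 1024); // true: exceeds the 10 MB limit
    }
}
```

This is why `withMaxBatchBytesSize` needs roughly a 25% headroom below the quota (e.g. ~7.5 MB raw) whenever the JSON transport is in play.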
[jira] [Commented] (BEAM-7978) ArithmeticExceptions on getting backlog bytes
[ https://issues.apache.org/jira/browse/BEAM-7978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907539#comment-16907539 ] Chamikara Jayalath commented on BEAM-7978: -- I guess this can occur for other runners as well? CCing some folks who recently updated Kinesis: [~aromanenko] [~iemejia] [~rtshadow]
[jira] [Commented] (BEAM-7850) Make Environment a top level attribute of PTransform
[ https://issues.apache.org/jira/browse/BEAM-7850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900399#comment-16900399 ] Chamikara Jayalath commented on BEAM-7850: -- Thanks Luke. That makes sense. I assume we should also just replace SdkFunctionSpec's view_fn and window_mapping_fn in SideInput with FunctionSpec? [https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L958] It seems SideInput lives within PTransform payloads (for example, ParDoPayload), so I don't think SideInput will need to have an Environment defined separately (in addition to the PTransform that holds it). > Make Environment a top level attribute of PTransform > > > Key: BEAM-7850 > URL: https://issues.apache.org/jira/browse/BEAM-7850 > Project: Beam > Issue Type: Sub-task > Components: beam-model >Reporter: Chamikara Jayalath >Priority: Major > > Currently, Environment is not a top-level attribute of PTransform (in the > runner API proto). > [https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L99] > Instead, it is hidden inside various payload objects. For example, for ParDo, > the environment is inside the SdkFunctionSpec of ParDoPayload. > [https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L99] > > This makes tracking the environment of different types of PTransforms harder, and > we have to fork code (on the type of PTransform) to extract the Environment > where the PTransform should be executed. It will probably be simpler to just > make Environment a top-level attribute of PTransform. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (BEAM-7866) Python MongoDB IO performance and correctness issues
[ https://issues.apache.org/jira/browse/BEAM-7866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath updated BEAM-7866: - Status: Open (was: Triage Needed) > Python MongoDB IO performance and correctness issues > > > Key: BEAM-7866 > URL: https://issues.apache.org/jira/browse/BEAM-7866 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Eugene Kirpichov >Assignee: Yichi Zhang >Priority: Blocker > Fix For: 2.14.0, 2.15.0 > > > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py > splits the query result by computing the number of results in the constructor, and > then in each reader re-executing the whole query and getting an index > sub-range of those results. > This is broken in several critical ways: > - The order of query results returned by find() is not necessarily > deterministic, so the idea of index ranges on it is meaningless: each shard > may basically get random, possibly overlapping subsets of the total results > - Even if you add order by `_id`, the database may be changing concurrently > to reading and splitting. E.g. if the database contained documents with ids > 10 20 30 40 50, and this was split into shards 0..2 and 3..5 (under the > assumption that these shards would contain respectively 10 20 30, and 40 50), > and then suppose shard 10 20 30 is read and then document 25 is inserted - > then the 3..5 shard will read 30 40 50, i.e. document 30 is duplicated and > document 25 is lost. > - Every shard re-executes the query and skips the first start_offset items, > which in total is quadratic complexity > - The query is first executed in the constructor in order to count results, > which 1) means the constructor can be super slow and 2) it won't work at all > if the database is unavailable at the time the pipeline is constructed (e.g. > if this is a template). 
> Unfortunately, none of these issues are caught by SourceTestUtils: this class > has extensive coverage with it, and the tests pass. This is because the tests > return the same results in the same order. I don't know how to catch this > automatically, and I don't know how to catch the performance issue > automatically, but these would all be important follow-up items after the > actual fix. > CC: [~chamikara] as reviewer. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (BEAM-7866) Python MongoDB IO performance and correctness issues
[ https://issues.apache.org/jira/browse/BEAM-7866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath updated BEAM-7866: - Fix Version/s: 2.15.0 > Python MongoDB IO performance and correctness issues > > > Key: BEAM-7866 > URL: https://issues.apache.org/jira/browse/BEAM-7866 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Eugene Kirpichov >Assignee: Yichi Zhang >Priority: Blocker > Fix For: 2.14.0, 2.15.0 > > > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py > splits the query result by computing number of results in constructor, and > then in each reader re-executing the whole query and getting an index > sub-range of those results. > This is broken in several critical ways: > - The order of query results returned by find() is not necessarily > deterministic, so the idea of index ranges on it is meaningless: each shard > may basically get random, possibly overlapping subsets of the total results > - Even if you add order by `_id`, the database may be changing concurrently > to reading and splitting. E.g. if the database contained documents with ids > 10 20 30 40 50, and this was split into shards 0..2 and 3..5 (under the > assumption that these shards would contain respectively 10 20 30, and 40 50), > and then suppose shard 10 20 30 is read and then document 25 is inserted - > then the 3..5 shard will read 30 40 50, i.e. document 30 is duplicated and > document 25 is lost. > - Every shard re-executes the query and skips the first start_offset items, > which in total is quadratic complexity > - The query is first executed in the constructor in order to count results, > which 1) means the constructor can be super slow and 2) it won't work at all > if the database is unavailable at the time the pipeline is constructed (e.g. > if this is a template). 
> Unfortunately, none of these issues are caught by SourceTestUtils: this class > has extensive coverage with it, and the tests pass. This is because the tests > return the same results in the same order. I don't know how to catch this > automatically, and I don't know how to catch the performance issue > automatically, but these would all be important follow-up items after the > actual fix. > CC: [~chamikara] as reviewer. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (BEAM-7866) Python MongoDB IO performance and correctness issues
[ https://issues.apache.org/jira/browse/BEAM-7866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897555#comment-16897555 ] Chamikara Jayalath commented on BEAM-7866: -- Thanks Eugene for pointing out these issues. Seems like, in the Java implementation, we build splits with specific IDs. Probably we can do something similar for Python ? [https://github.com/apache/beam/blob/master/sdks/java/io/mongodb/src/main/java/org/apache/beam/sdk/io/mongodb/MongoDbIO.java#L519] BTW is it correct to assume that the source will be correct if that MongoDB instance stays static (without any data insertion/deletion/update) for the duration of the execution of the read step ? (I understand that this might be improbable for many applications of the Database). Seems like elements will have the same order for different find() calls if the database is static since MongoDB defaults to natural sort order. [https://docs.mongodb.com/manual/reference/glossary/#term-natural-order] > Python MongoDB IO performance and correctness issues > > > Key: BEAM-7866 > URL: https://issues.apache.org/jira/browse/BEAM-7866 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Eugene Kirpichov >Assignee: Yichi Zhang >Priority: Blocker > Fix For: 2.14.0 > > > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py > splits the query result by computing number of results in constructor, and > then in each reader re-executing the whole query and getting an index > sub-range of those results. > This is broken in several critical ways: > - The order of query results returned by find() is not necessarily > deterministic, so the idea of index ranges on it is meaningless: each shard > may basically get random, possibly overlapping subsets of the total results > - Even if you add order by `_id`, the database may be changing concurrently > to reading and splitting. E.g. 
if the database contained documents with ids > 10 20 30 40 50, and this was split into shards 0..2 and 3..5 (under the > assumption that these shards would contain respectively 10 20 30, and 40 50), > and then suppose shard 10 20 30 is read and then document 25 is inserted - > then the 3..5 shard will read 30 40 50, i.e. document 30 is duplicated and > document 25 is lost. > - Every shard re-executes the query and skips the first start_offset items, > which in total is quadratic complexity > - The query is first executed in the constructor in order to count results, > which 1) means the constructor can be super slow and 2) it won't work at all > if the database is unavailable at the time the pipeline is constructed (e.g. > if this is a template). > Unfortunately, none of these issues are caught by SourceTestUtils: this class > has extensive coverage with it, and the tests pass. This is because the tests > return the same results in the same order. I don't know how to catch this > automatically, and I don't know how to catch the performance issue > automatically, but these would all be important follow-up items after the > actual fix. > CC: [~chamikara] as reviewer. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
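The Java approach referenced above splits on document IDs rather than result indexes, which avoids both the duplication and the loss described in the example. A rough Python sketch of that idea (illustrative only: `make_id_ranges` and the chosen split points are assumptions, not the actual MongoDbIO implementation, and a real source would pick split points by sampling `_id` values):

```python
def make_id_ranges(split_points):
    """Turn sorted _id split points into contiguous [start, stop) query filters.

    Each document matches exactly one range, so shards cannot overlap
    even if documents are inserted between splitting and reading.
    """
    ranges = []
    for start, stop in zip(split_points[:-1], split_points[1:]):
        ranges.append({'_id': {'$gte': start, '$lt': stop}})
    # Final open-ended range picks up everything >= the last split point.
    ranges.append({'_id': {'$gte': split_points[-1]}})
    return ranges

# Documents with ids 10 20 30 40 50; split points chosen at 10 and 40.
filters = make_id_ranges([10, 40])
assert filters == [
    {'_id': {'$gte': 10, '$lt': 40}},
    {'_id': {'$gte': 40}},
]
# A document with id 25 inserted after splitting falls only inside the
# first range, so it is never duplicated across shards; each shard also
# issues a filtered query instead of skipping start_offset results.
```

Each shard would then pass its filter to `collection.find(filter)`, so no shard re-executes the full query.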
[jira] [Commented] (BEAM-7850) Make Environment a top level attribute of PTransform
[ https://issues.apache.org/jira/browse/BEAM-7850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897400#comment-16897400 ] Chamikara Jayalath commented on BEAM-7850: -- Thanks Max. I understand that this will be a somewhat invasive change. But I think it will simplify implementations quite a bit, so the main motivation is implementation simplicity. Currently, to determine the environment where a given PTransform has to be executed, we have to fork for all possible payload types. Just to clarify, if we make Environment a top level attribute of PTransform, that will be enough for the WindowInto transform as well, right? Looking at the Beam runner API proto, it seems like the only two messages that use SdkFunctionSpec and are not payloads of a certain type of PTransform are WindowingStrategy and SideInput. SideInput already seems to be within transform payload messages (ParDoPayload and WriteFilesPayload). We can keep environment around in SdkFunctionSpec if there are non-transform messages that need it. If we consider the cross-language expansion scenario, what we get from a remote environment is an expanded PTransform (along with the coders, inputs, outputs, and dependencies needed to execute it). So I think it makes sense to associate a PTransform with an Environment directly. > Make Environment a top level attribute of PTransform > > > Key: BEAM-7850 > URL: https://issues.apache.org/jira/browse/BEAM-7850 > Project: Beam > Issue Type: Improvement > Components: beam-model >Reporter: Chamikara Jayalath >Priority: Major > > Currently Environment is not a top level attribute of the PTransform (of > runner API proto). > [https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L99] > Instead it is hidden inside various payload objects. For example, for ParDo, > environment will be inside SdkFunctionSpec of ParDoPayload. 
> [https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L99] > > This makes tracking environment of different types of PTransforms harder and > we have to fork code (on the type of PTransform) to extract the Environment > where the PTransform should be executed. It will probably be simpler to just > make Environment a top level attribute of PTransform. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (BEAM-7850) Make Environment a top level attribute of PTransform
[ https://issues.apache.org/jira/browse/BEAM-7850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896289#comment-16896289 ] Chamikara Jayalath commented on BEAM-7850: -- cc: [~robertwb] [~lcwik] [~mxm] [~angoenka] > Make Environment a top level attribute of PTransform > > > Key: BEAM-7850 > URL: https://issues.apache.org/jira/browse/BEAM-7850 > Project: Beam > Issue Type: Improvement > Components: beam-model >Reporter: Chamikara Jayalath >Priority: Major > > Currently Environment is not a top level attribute of the PTransform (of > runner API proto). > [https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L99] > Instead it is hidden inside various payload objects. For example, for ParDo, > environment will be inside SdkFunctionSpec of ParDoPayload. > [https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L99] > > This makes tracking environment of different types of PTransforms harder and > we have to fork code (on the type of PTransform) to extract the Environment > where the PTransform should be executed. It will probably be simpler to just > make Environment a top level attribute of PTransform. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (BEAM-7850) Make Environment a top level attribute of PTransform
Chamikara Jayalath created BEAM-7850: Summary: Make Environment a top level attribute of PTransform Key: BEAM-7850 URL: https://issues.apache.org/jira/browse/BEAM-7850 Project: Beam Issue Type: Improvement Components: beam-model Reporter: Chamikara Jayalath Currently Environment is not a top level attribute of the PTransform (of runner API proto). [https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L99] Instead it is hidden inside various payload objects. For example, for ParDo, environment will be inside SdkFunctionSpec of ParDoPayload. [https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L99] This makes tracking environment of different types of PTransforms harder and we have to fork code (on the type of PTransform) to extract the Environment where the PTransform should be executed. It will probably be simpler to just make Environment a top level attribute of PTransform. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (BEAM-2572) Implement an S3 filesystem for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16895698#comment-16895698 ] Chamikara Jayalath commented on BEAM-2572: -- [~cagatayk] and [~sacherus] contributions are absolutely welcome. What we need though is an implementation of the FileSystem interface for S3, as I mentioned in a previous comment. This will allow us to use S3 for file-based sources (e.g., text, avro). Seems like your implementation has some of the code that can be used for this but not exactly what we need. Can you implement the proper interface and submit it as a PR (with tests)? > Implement an S3 filesystem for Python SDK > - > > Key: BEAM-2572 > URL: https://issues.apache.org/jira/browse/BEAM-2572 > Project: Beam > Issue Type: Task > Components: sdk-py-core >Reporter: Dmitry Demeshchuk >Assignee: Pablo Estrada >Priority: Minor > Labels: GSoC2019, gsoc, gsoc2019, mentor > > There are two paths worth exploring, to my understanding: > 1. Sticking to the HDFS-based approach (like it's done in Java). > 2. Using boto/boto3 for accessing S3 through its common API endpoints. > I personally prefer the second approach, for a few reasons: > 1. In real life, HDFS and S3 have different consistency guarantees, therefore > their behaviors may contradict each other in some edge cases (say, we write > something to S3, but it's not immediately accessible for reading from another > end). > 2. There are other AWS-based sources and sinks we may want to create in the > future: DynamoDB, Kinesis, SQS, etc. > 3. boto3 already provides somewhat good logic for basic things like > reattempting. > Whatever path we choose, there's another problem related to this: we > currently cannot pass any global settings (say, pipeline options, or just an > arbitrary kwarg) to a filesystem. 
Because of that, we'd have to setup the > runner nodes to have AWS keys set up in the environment, which is not trivial > to achieve and doesn't look too clean either (I'd rather see one single place > for configuring the runner options). > Also, it's worth mentioning that I already have a janky S3 filesystem > implementation that only supports DirectRunner at the moment (because of the > previous paragraph). I'm perfectly fine finishing it myself, with some > guidance from the maintainers. > Where should I move on from here, and whose input should I be looking for? > Thanks! -- This message was sent by Atlassian JIRA (v7.6.14#76016)
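An implementation of Beam's FileSystem interface for S3 starts with path handling (scheme matching, bucket/key parsing) before any boto3 calls come into play. A minimal sketch of that parsing layer; the function name and error handling are illustrative assumptions, not Beam's actual API:

```python
S3_PREFIX = 's3://'

def split_s3_path(path):
    """Split 's3://bucket/key/parts' into a (bucket, key) pair.

    An S3-backed FileSystem needs this for nearly every operation
    (match, open, create, delete), since boto3 takes bucket and key
    as separate arguments.
    """
    if not path.startswith(S3_PREFIX):
        raise ValueError('Not an S3 path: %r' % path)
    bucket, _, key = path[len(S3_PREFIX):].partition('/')
    if not bucket:
        raise ValueError('Missing bucket in S3 path: %r' % path)
    return bucket, key

assert split_s3_path('s3://my-bucket/data/part-0.avro') == ('my-bucket', 'data/part-0.avro')
assert split_s3_path('s3://my-bucket') == ('my-bucket', '')
```

The remaining work is implementing the abstract FileSystem operations (match, create, open, rename, delete, ...) on top of a boto3 client, which is where the credential-passing problem mentioned above shows up.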
[jira] [Created] (BEAM-7824) Set a default environment for Python SDK jobs for Dataflow runner
Chamikara Jayalath created BEAM-7824: Summary: Set a default environment for Python SDK jobs for Dataflow runner Key: BEAM-7824 URL: https://issues.apache.org/jira/browse/BEAM-7824 Project: Beam Issue Type: Bug Components: runner-dataflow, sdk-py-core Reporter: Chamikara Jayalath Assignee: Chamikara Jayalath Currently the default environment is left empty. We should change the default environment to urn beam:env:docker:v1, with the payload set to a DockerPayload whose container_image is the container image used by Dataflow. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (BEAM-7680) synthetic_pipeline_test.py flaky
[ https://issues.apache.org/jira/browse/BEAM-7680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883424#comment-16883424 ] Chamikara Jayalath commented on BEAM-7680: -- Sorry, this is from code I wrote several years back and I have lost context. I think the comment was about the test being time-based in general (and hence there's a possibility of flakes). > synthetic_pipeline_test.py flaky > > > Key: BEAM-7680 > URL: https://issues.apache.org/jira/browse/BEAM-7680 > Project: Beam > Issue Type: Bug > Components: test-failures >Reporter: Udi Meiri >Assignee: Kasia Kucharczyk >Priority: Major > Time Spent: 50m > Remaining Estimate: 0h > > {code:java} > 11:51:43 FAIL: testSyntheticSDFStep > (apache_beam.testing.synthetic_pipeline_test.SyntheticPipelineTest) > 11:51:43 > -- > 11:51:43 Traceback (most recent call last): > 11:51:43 File > "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Cron/src/sdks/python/test-suites/tox/py35/build/srcs/sdks/python/apache_beam/testing/synthetic_pipeline_test.py", > line 82, in testSyntheticSDFStep > 11:51:43 self.assertTrue(0.5 <= elapsed <= 3, elapsed) > 11:51:43 AssertionError: False is not true : 3.659700632095337{code} > [https://builds.apache.org/job/beam_PreCommit_Python_Cron/1502/consoleFull] > > Two flaky TODOs: > [https://github.com/apache/beam/blob/b79f24ced1c8519c29443ea7109c59ad18be2ebe/sdks/python/apache_beam/testing/synthetic_pipeline_test.py#L69-L82] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (BEAM-7424) Retry HTTP 429 errors from GCS w/ exponential backoff when reading data
[ https://issues.apache.org/jira/browse/BEAM-7424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16876506#comment-16876506 ] Chamikara Jayalath commented on BEAM-7424: -- Please make sure to cherry-pick [https://github.com/apache/beam/pull/8933] to the 2.14 release branch. > Retry HTTP 429 errors from GCS w/ exponential backoff when reading data > --- > > Key: BEAM-7424 > URL: https://issues.apache.org/jira/browse/BEAM-7424 > Project: Beam > Issue Type: Bug > Components: io-java-gcp, io-python-gcp, sdk-py-core >Reporter: Chamikara Jayalath >Assignee: Heejong Lee >Priority: Blocker > Fix For: 2.14.0 > > Time Spent: 3.5h > Remaining Estimate: 0h > > This has to be done for both Java and Python SDKs. > Seems like Java SDK already retries 429 errors w/o backoff (please verify): > [https://github.com/apache/beam/blob/master/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/util/RetryHttpRequestInitializer.java#L185] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
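The requested retry-with-exponential-backoff behavior can be sketched as a small wrapper. This illustrates the pattern only; it is not Beam's RetryHttpRequestInitializer or its Python equivalent, and the `TooManyRequests` class and jitterless delays are simplifying assumptions (production retry logic typically adds randomized jitter and a delay cap):

```python
import time

class TooManyRequests(Exception):
    """Stand-in for an HTTP 429 response from GCS."""

def retry_429(fn, max_attempts=5, base_delay=0.01):
    """Call fn, retrying 429s with exponential backoff: d, 2d, 4d, ..."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TooManyRequests:
            if attempt == max_attempts - 1:
                raise  # budget exhausted; surface the throttling error
            time.sleep(base_delay * (2 ** attempt))

# Simulate a server that throttles the first two calls.
calls = {'n': 0}
def flaky_read():
    calls['n'] += 1
    if calls['n'] < 3:
        raise TooManyRequests()
    return 'data'

assert retry_429(flaky_read) == 'data'
assert calls['n'] == 3  # two throttled attempts, then success
```

The backoff (rather than immediate retry, as the Java link above suggests happens today) is what keeps a throttled pipeline from hammering GCS harder.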
[jira] [Commented] (BEAM-7631) Remove experimental annotation from stable transforms in Java BigTableIO connector
[ https://issues.apache.org/jira/browse/BEAM-7631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872586#comment-16872586 ] Chamikara Jayalath commented on BEAM-7631: -- cc: [~sduskis] [~igorbernstein] [~altay] [~dhalp...@google.com] [~iemejia] > Remove experimental annotation from stable transforms in Java BigTableIO > connector > -- > > Key: BEAM-7631 > URL: https://issues.apache.org/jira/browse/BEAM-7631 > Project: Beam > Issue Type: Improvement > Components: io-java-gcp >Reporter: Chamikara Jayalath >Assignee: Chamikara Jayalath >Priority: Major > > This connector has been around for some time and many Beam users use it. > We can consider removing experimental tag from following transforms. > BigTableIO.Read > BigTableIO.Write > Removing the experimental tag will guarantee our users that the existing API > will stay stable. > > Note that this does not prevent adding new transforms to the API or adding > new features to existing transforms. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (BEAM-7631) Remove experimental annotation from stable transforms in Java BigTableIO connector
[ https://issues.apache.org/jira/browse/BEAM-7631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath reassigned BEAM-7631: Assignee: Chamikara Jayalath > Remove experimental annotation from stable transforms in Java BigTableIO > connector > -- > > Key: BEAM-7631 > URL: https://issues.apache.org/jira/browse/BEAM-7631 > Project: Beam > Issue Type: Improvement > Components: io-java-gcp >Reporter: Chamikara Jayalath >Assignee: Chamikara Jayalath >Priority: Major > > This connector has been around for some time and many Beam users use it. > We can consider removing experimental tag from following transforms. > BigTableIO.Read > BigTableIO.Write > Removing the experimental tag will guarantee our users that the existing API > will stay stable. > > Note that this does not prevent adding new transforms to the API or adding > new features to existing transforms. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (BEAM-7631) Remove experimental annotation from stable transforms in Java BigTableIO connector
Chamikara Jayalath created BEAM-7631: Summary: Remove experimental annotation from stable transforms in Java BigTableIO connector Key: BEAM-7631 URL: https://issues.apache.org/jira/browse/BEAM-7631 Project: Beam Issue Type: Improvement Components: io-java-gcp Reporter: Chamikara Jayalath This connector has been around for some time and many Beam users use it. We can consider removing experimental tag from following transforms. BigTableIO.Read BigTableIO.Write Removing the experimental tag will guarantee our users that the existing API will stay stable. Note that this does not prevent adding new transforms to the API or adding new features to existing transforms. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (BEAM-7611) Python BigTableIO IT is not running in any test suites
Chamikara Jayalath created BEAM-7611: Summary: Python BigTableIO IT is not running in any test suites Key: BEAM-7611 URL: https://issues.apache.org/jira/browse/BEAM-7611 Project: Beam Issue Type: Bug Components: io-python-gcp, testing Reporter: Chamikara Jayalath Assignee: Solomon Duskis We added an integration test here: [https://github.com/apache/beam/pull/7367] But this currently does not get picked up by any test suites (and gets skipped by some due to missing dependencies), hence the BigTable sink is largely untested. First attempt to enable it: [https://github.com/apache/beam/pull/8886] Solomon, assigning this to you since I cannot find Juan's (the PR author's) Jira ID. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-6590) LTS backport: upgrade gcsio dependency to 1.9.13
[ https://issues.apache.org/jira/browse/BEAM-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16868079#comment-16868079 ] Chamikara Jayalath commented on BEAM-6590: -- Sorry about the delay. I'll look into 2.7.1 blockers assigned to me early next week if that works. > LTS backport: upgrade gcsio dependency to 1.9.13 > > > Key: BEAM-6590 > URL: https://issues.apache.org/jira/browse/BEAM-6590 > Project: Beam > Issue Type: Bug > Components: io-java-gcp, sdk-java-core >Reporter: Chamikara Jayalath >Assignee: Chamikara Jayalath >Priority: Blocker > Fix For: 2.7.1 > > > 1.9.12 has following critical bug so should be avoided. > > [https://github.com/GoogleCloudPlatform/bigdata-interop/commit/52f5055d37f20a04303b146e9063e7ccc876ec17] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-7501) concatenated compressed files bug with python sdk (2.7.1 merge)
[ https://issues.apache.org/jira/browse/BEAM-7501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16868078#comment-16868078 ] Chamikara Jayalath commented on BEAM-7501: -- Sorry about the delay. I'll look into this early next week if that works. > concatenated compressed files bug with python sdk (2.7.1 merge) > --- > > Key: BEAM-7501 > URL: https://issues.apache.org/jira/browse/BEAM-7501 > Project: Beam > Issue Type: Bug > Components: io-python-files >Reporter: Chamikara Jayalath >Assignee: Chamikara Jayalath >Priority: Blocker > Fix For: 2.7.1 > > > Same as https://issues.apache.org/jira/browse/BEAM-6952. > > For tracking merging the fix to 2.7.1 (LTS) branch. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (BEAM-7569) filesystem_test is failing for Windows
[ https://issues.apache.org/jira/browse/BEAM-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath resolved BEAM-7569. -- Resolution: Fixed Fix Version/s: Not applicable > filesystem_test is failing for Windows > -- > > Key: BEAM-7569 > URL: https://issues.apache.org/jira/browse/BEAM-7569 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Chamikara Jayalath >Assignee: Chamikara Jayalath >Priority: Major > Fix For: Not applicable > > Time Spent: 1h 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-7424) Retry HTTP 429 errors from GCS w/ exponential backoff when reading data
[ https://issues.apache.org/jira/browse/BEAM-7424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16867126#comment-16867126 ] Chamikara Jayalath commented on BEAM-7424: -- I believe Python fix is still in development. > Retry HTTP 429 errors from GCS w/ exponential backoff when reading data > --- > > Key: BEAM-7424 > URL: https://issues.apache.org/jira/browse/BEAM-7424 > Project: Beam > Issue Type: Bug > Components: io-java-gcp, io-python-gcp, sdk-py-core >Reporter: Chamikara Jayalath >Assignee: Heejong Lee >Priority: Blocker > Fix For: 2.14.0 > > > This has to be done for both Java and Python SDKs. > Seems like Java SDK already retries 429 errors w/o backoff (please verify): > [https://github.com/apache/beam/blob/master/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/util/RetryHttpRequestInitializer.java#L185] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (BEAM-7588) Python BigQuery sink should be able to handle 15TB load job quota
Chamikara Jayalath created BEAM-7588: Summary: Python BigQuery sink should be able to handle 15TB load job quota Key: BEAM-7588 URL: https://issues.apache.org/jira/browse/BEAM-7588 Project: Beam Issue Type: Improvement Components: io-python-gcp Reporter: Chamikara Jayalath Assignee: Pablo Estrada This can be done by using multiple load jobs under 15TB when the amount of data to be loaded is > 15TB. This is already handled by Java SDK. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
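Staying under the quota reduces to partitioning the staged files by cumulative size and issuing one load job per group. A greedy sketch of that partitioning step; the function, the `(name, size)` tuples, and treating 15 TB as a single-job byte limit are illustrative assumptions, not the actual Java or Python implementation:

```python
MAX_BYTES_PER_JOB = 15 * (1 << 40)  # 15 TB load-job quota (illustrative constant)

def partition_files(files, max_bytes=MAX_BYTES_PER_JOB):
    """Greedily group (name, size) files so each group stays under max_bytes.

    Each group becomes one BigQuery load job; the per-job results are
    then loaded into the final destination table.
    """
    groups, current, current_size = [], [], 0
    for name, size in files:
        if current and current_size + size > max_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)
    return groups

# Three 4-unit files against a 10-unit limit -> two load jobs.
assert partition_files([('a', 4), ('b', 4), ('c', 4)], max_bytes=10) == [['a', 'b'], ['c']]
```

A single file larger than `max_bytes` would still get its own group here and fail the quota; a real implementation has to reject or re-shard such files.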
[jira] [Created] (BEAM-7569) filesystem_test is failing for Windows
Chamikara Jayalath created BEAM-7569: Summary: filesystem_test is failing for Windows Key: BEAM-7569 URL: https://issues.apache.org/jira/browse/BEAM-7569 Project: Beam Issue Type: Bug Components: sdk-py-core Reporter: Chamikara Jayalath Assignee: Chamikara Jayalath -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (BEAM-7549) Provide a better error message for 410 Gone errors from GCS
Chamikara Jayalath created BEAM-7549: Summary: Provide a better error message for 410 Gone errors from GCS Key: BEAM-7549 URL: https://issues.apache.org/jira/browse/BEAM-7549 Project: Beam Issue Type: Bug Components: io-java-gcp, io-python-gcp Reporter: Chamikara Jayalath For example 410 errors might get raised here. com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.waitForCompletionAndThrowIfUploadFailed(AbstractGoogleAsyncWriteChannel.java:431) at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.close(AbstractGoogleAsyncWriteChannel.java:289) at com.google.cloud.dataflow.sdk.io.FileBasedSink$FileBasedWriter.close(FileBasedSink.java:571) at com.google.cloud.dataflow.sdk.io.FileBasedSink$FileBasedWriter.close(FileBasedSink.java:474) at com.google.cloud.dataflow.sdk.io.Write$Bound$WriteBundles.finishBundle(Write.java:202) Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 410 Gone Example client side error message that gives a better description to the end-user: "GCS upload of processed data transiently failed and will need to be retried. Some produced data might need to be recomputed" -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (BEAM-6952) concatenated compressed files bug with python sdk
[ https://issues.apache.org/jira/browse/BEAM-6952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath updated BEAM-6952: - Affects Version/s: (was: 2.13.0) Not applicable > concatenated compressed files bug with python sdk > - > > Key: BEAM-6952 > URL: https://issues.apache.org/jira/browse/BEAM-6952 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Affects Versions: Not applicable >Reporter: Daniel Lescohier >Priority: Major > Fix For: 2.14.0 > > Time Spent: 3h 20m > Remaining Estimate: 0h > > The Python apache_beam.io.filesystem module has a bug handling concatenated > compressed files. > The PR I will create has two commits: > # a new unit test that shows the problem > # a fix to the problem. > The unit test is added to the apache_beam.io.filesystem_test module. It was > added to this module because the test: > apache_beam.io.textio_test.test_read_gzip_concat does not encounter the > problem in the Beam 2.11 and earlier code base because the test data is too > small: the data is smaller than read_size, so it goes through logic in the > code that avoids the problem in the code. So, this test sets read_size > smaller and test data bigger, in order to encounter the problem. It would be > difficult to test in the textio_test module, because you'd need very large > test data because default read_size is 16MiB, and the ReadFromText interface > does not allow you to modify the read_size. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
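The underlying issue with concatenated compressed files is that a gzip stream may contain several back-to-back members, and a decompressor must keep consuming the bytes left over after each member ends instead of stopping at the first one. A minimal sketch of that loop using zlib's `unused_data`; this is independent of Beam's filesystem module and much simpler than it (no read_size buffering):

```python
import gzip
import zlib

def decompress_concatenated_gzip(data):
    """Decompress every member of a concatenated gzip byte string."""
    out = []
    while data:
        # 32 + MAX_WBITS: auto-detect the gzip header.
        d = zlib.decompressobj(zlib.MAX_WBITS | 32)
        out.append(d.decompress(data))
        if not d.eof:  # truncated final member; stop rather than loop forever
            break
        data = d.unused_data  # bytes belonging to the next member, if any
    return b''.join(out)

blob = gzip.compress(b'hello ') + gzip.compress(b'world')
assert decompress_concatenated_gzip(blob) == b'hello world'
```

A naive single-pass decompressor would return only `b'hello '` here, which is the shape of bug the new filesystem_test unit test exercises with data larger than read_size.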
[jira] [Updated] (BEAM-6952) concatenated compressed files bug with python sdk
[ https://issues.apache.org/jira/browse/BEAM-6952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath updated BEAM-6952: - Affects Version/s: (was: 2.11.0) 2.13.0 > concatenated compressed files bug with python sdk > - > > Key: BEAM-6952 > URL: https://issues.apache.org/jira/browse/BEAM-6952 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Affects Versions: 2.13.0 >Reporter: Daniel Lescohier >Priority: Major > Fix For: 2.14.0 > > Time Spent: 3h 20m > Remaining Estimate: 0h > > The Python apache_beam.io.filesystem module has a bug handling concatenated > compressed files. > The PR I will create has two commits: > # a new unit test that shows the problem > # a fix to the problem. > The unit test is added to the apache_beam.io.filesystem_test module. It was > added to this module because the test: > apache_beam.io.textio_test.test_read_gzip_concat does not encounter the > problem in the Beam 2.11 and earlier code base because the test data is too > small: the data is smaller than read_size, so it goes through logic in the > code that avoids the problem in the code. So, this test sets read_size > smaller and test data bigger, in order to encounter the problem. It would be > difficult to test in the textio_test module, because you'd need very large > test data because default read_size is 16MiB, and the ReadFromText interface > does not allow you to modify the read_size. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (BEAM-6952) concatenated compressed files bug with python sdk
[ https://issues.apache.org/jira/browse/BEAM-6952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath resolved BEAM-6952. -- Resolution: Fixed > concatenated compressed files bug with python sdk > - > > Key: BEAM-6952 > URL: https://issues.apache.org/jira/browse/BEAM-6952 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Affects Versions: 2.11.0 >Reporter: Daniel Lescohier >Priority: Major > Fix For: 2.14.0 > > Time Spent: 3h 20m > Remaining Estimate: 0h > > The Python apache_beam.io.filesystem module has a bug handling concatenated > compressed files. > The PR I will create has two commits: > # a new unit test that shows the problem > # a fix to the problem. > The unit test is added to the apache_beam.io.filesystem_test module. It was > added to this module because the test: > apache_beam.io.textio_test.test_read_gzip_concat does not encounter the > problem in the Beam 2.11 and earlier code base because the test data is too > small: the data is smaller than read_size, so it goes through logic in the > code that avoids the problem in the code. So, this test sets read_size > smaller and test data bigger, in order to encounter the problem. It would be > difficult to test in the textio_test module, because you'd need very large > test data because default read_size is 16MiB, and the ReadFromText interface > does not allow you to modify the read_size. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-6952) concatenated compressed files bug with python sdk
[ https://issues.apache.org/jira/browse/BEAM-6952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860455#comment-16860455 ] Chamikara Jayalath commented on BEAM-6952: -- Yeah. closing. > concatenated compressed files bug with python sdk > - > > Key: BEAM-6952 > URL: https://issues.apache.org/jira/browse/BEAM-6952 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Affects Versions: 2.11.0 >Reporter: Daniel Lescohier >Priority: Major > Fix For: 2.14.0 > > Time Spent: 3h 20m > Remaining Estimate: 0h > > The Python apache_beam.io.filesystem module has a bug handling concatenated > compressed files. > The PR I will create has two commits: > # a new unit test that shows the problem > # a fix to the problem. > The unit test is added to the apache_beam.io.filesystem_test module. It was > added to this module because the test: > apache_beam.io.textio_test.test_read_gzip_concat does not encounter the > problem in the Beam 2.11 and earlier code base because the test data is too > small: the data is smaller than read_size, so it goes through logic in the > code that avoids the problem in the code. So, this test sets read_size > smaller and test data bigger, in order to encounter the problem. It would be > difficult to test in the textio_test module, because you'd need very large > test data because default read_size is 16MiB, and the ReadFromText interface > does not allow you to modify the read_size. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-7424) Retry HTTP 429 errors from GCS w/ exponential backoff when reading data
[ https://issues.apache.org/jira/browse/BEAM-7424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860421#comment-16860421 ] Chamikara Jayalath commented on BEAM-7424: -- There are a few things planned: (1) add retry logic for the Java and Python SDKs; (2) for the Dataflow runner, plumb throttled time to the Dataflow backend so it can be considered when making autoscaling decisions. This bug is for (1). It would be great to get (2) into 2.14 as well, but I'm not sure the timelines will match. > Retry HTTP 429 errors from GCS w/ exponential backoff when reading data > --- > > Key: BEAM-7424 > URL: https://issues.apache.org/jira/browse/BEAM-7424 > Project: Beam > Issue Type: Bug > Components: io-java-gcp, io-python-gcp, sdk-py-core >Reporter: Chamikara Jayalath >Assignee: Heejong Lee >Priority: Blocker > Fix For: 2.14.0 > > > This has to be done for both Java and Python SDKs. > Seems like Java SDK already retries 429 errors w/o backoff (please verify): > [https://github.com/apache/beam/blob/master/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/util/RetryHttpRequestInitializer.java#L185] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
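The retry logic requested for (1) amounts to capped exponential backoff with jitter. The sketch below is an illustration only, under stated assumptions: `Throttled429Error` is a hypothetical stand-in for the client library's HTTP-error type, and the actual change lives inside the SDKs' GCS client wrappers, not in user code:

```python
import random
import time


class Throttled429Error(Exception):
    """Hypothetical stand-in for an HTTP client error with status code 429."""


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=60.0):
    """Call fn(), retrying on 429 with capped exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Throttled429Error:
            if attempt == max_attempts - 1:
                raise  # out of retries; surface the throttling error
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Randomized jitter avoids synchronized retry storms from
            # many workers hitting the same throttled bucket.
            time.sleep(delay * random.uniform(0.5, 1.5))
```

Retrying without backoff (what the Java SDK reportedly did) keeps hammering a throttled service; the exponential delay gives GCS time to shed load.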
[jira] [Assigned] (BEAM-3788) Implement a Kafka IO for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath reassigned BEAM-3788: Assignee: (was: Chamikara Jayalath) > Implement a Kafka IO for Python SDK > --- > > Key: BEAM-3788 > URL: https://issues.apache.org/jira/browse/BEAM-3788 > Project: Beam > Issue Type: New Feature > Components: sdk-py-core >Reporter: Chamikara Jayalath >Priority: Major > > This will be implemented using the Splittable DoFn framework. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-3788) Implement a Kafka IO for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860419#comment-16860419 ] Chamikara Jayalath commented on BEAM-3788: -- This is blocked until we have a streaming source framework in the Python SDK for portable runners. To this end, Splittable DoFn is currently under development. Note that, for folks using the Flink runner, Flink's native Kafka source is currently available to Python SDK users through the cross-language transform API. > Implement a Kafka IO for Python SDK > --- > > Key: BEAM-3788 > URL: https://issues.apache.org/jira/browse/BEAM-3788 > Project: Beam > Issue Type: New Feature > Components: sdk-py-core >Reporter: Chamikara Jayalath >Assignee: Chamikara Jayalath >Priority: Major > > This will be implemented using the Splittable DoFn framework. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (BEAM-7501) concatenated compressed files bug with python sdk (2.7.1 merge)
Chamikara Jayalath created BEAM-7501: Summary: concatenated compressed files bug with python sdk (2.7.1 merge) Key: BEAM-7501 URL: https://issues.apache.org/jira/browse/BEAM-7501 Project: Beam Issue Type: Bug Components: io-python-files Reporter: Chamikara Jayalath Assignee: Chamikara Jayalath Fix For: 2.7.1 Same as https://issues.apache.org/jira/browse/BEAM-6952. For tracking merging the fix to 2.7.1 (LTS) branch. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-6952) concatenated compressed files bug with python sdk
[ https://issues.apache.org/jira/browse/BEAM-6952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16858203#comment-16858203 ] Chamikara Jayalath commented on BEAM-6952: -- Created https://issues.apache.org/jira/browse/BEAM-7501 to track merging the fix to 2.7.1 (LTS) branch. > concatenated compressed files bug with python sdk > - > > Key: BEAM-6952 > URL: https://issues.apache.org/jira/browse/BEAM-6952 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Affects Versions: 2.11.0 >Reporter: Daniel Lescohier >Priority: Major > Fix For: 2.14.0 > > Time Spent: 2.5h > Remaining Estimate: 0h > > The Python apache_beam.io.filesystem module has a bug handling concatenated > compressed files. > The PR I will create has two commits: > # a new unit test that shows the problem > # a fix to the problem. > The unit test is added to the apache_beam.io.filesystem_test module. It was > added to this module because the test: > apache_beam.io.textio_test.test_read_gzip_concat does not encounter the > problem in the Beam 2.11 and earlier code base because the test data is too > small: the data is smaller than read_size, so it goes through logic in the > code that avoids the problem in the code. So, this test sets read_size > smaller and test data bigger, in order to encounter the problem. It would be > difficult to test in the textio_test module, because you'd need very large > test data because default read_size is 16MiB, and the ReadFromText interface > does not allow you to modify the read_size. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (BEAM-6952) concatenated compressed files bug with python sdk
[ https://issues.apache.org/jira/browse/BEAM-6952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath updated BEAM-6952: - Fix Version/s: 2.14.0 > concatenated compressed files bug with python sdk > - > > Key: BEAM-6952 > URL: https://issues.apache.org/jira/browse/BEAM-6952 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Affects Versions: 2.11.0 >Reporter: Daniel Lescohier >Priority: Major > Fix For: 2.14.0 > > Time Spent: 2.5h > Remaining Estimate: 0h > > The Python apache_beam.io.filesystem module has a bug handling concatenated > compressed files. > The PR I will create has two commits: > # a new unit test that shows the problem > # a fix to the problem. > The unit test is added to the apache_beam.io.filesystem_test module. It was > added to this module because the test: > apache_beam.io.textio_test.test_read_gzip_concat does not encounter the > problem in the Beam 2.11 and earlier code base because the test data is too > small: the data is smaller than read_size, so it goes through logic in the > code that avoids the problem in the code. So, this test sets read_size > smaller and test data bigger, in order to encounter the problem. It would be > difficult to test in the textio_test module, because you'd need very large > test data because default read_size is 16MiB, and the ReadFromText interface > does not allow you to modify the read_size. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (BEAM-7424) Retry HTTP 429 errors from GCS w/ exponential backoff when reading data
[ https://issues.apache.org/jira/browse/BEAM-7424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath updated BEAM-7424: - Fix Version/s: (was: 2.7.1) > Retry HTTP 429 errors from GCS w/ exponential backoff when reading data > --- > > Key: BEAM-7424 > URL: https://issues.apache.org/jira/browse/BEAM-7424 > Project: Beam > Issue Type: Bug > Components: io-java-gcp, io-python-gcp, sdk-py-core >Reporter: Chamikara Jayalath >Assignee: Heejong Lee >Priority: Blocker > Fix For: 2.14.0 > > > This has to be done for both Java and Python SDKs. > Seems like Java SDK already retries 429 errors w/o backoff (please verify): > [https://github.com/apache/beam/blob/master/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/util/RetryHttpRequestInitializer.java#L185] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (BEAM-7500) Retry HTTP 429 errors from GCS w/ exponential backoff when reading data (2.7.1 merge)
Chamikara Jayalath created BEAM-7500: Summary: Retry HTTP 429 errors from GCS w/ exponential backoff when reading data (2.7.1 merge) Key: BEAM-7500 URL: https://issues.apache.org/jira/browse/BEAM-7500 Project: Beam Issue Type: Bug Components: io-java-gcp, io-python-gcp Reporter: Chamikara Jayalath Assignee: Heejong Lee Same as https://issues.apache.org/jira/browse/BEAM-7424 For tracking merging the fix to 2.7.1 (LTS branch). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-7424) Retry HTTP 429 errors from GCS w/ exponential backoff when reading data
[ https://issues.apache.org/jira/browse/BEAM-7424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16858200#comment-16858200 ] Chamikara Jayalath commented on BEAM-7424: -- Created https://issues.apache.org/jira/browse/BEAM-7500 to track merging the fix to 2.7.1 branch. > Retry HTTP 429 errors from GCS w/ exponential backoff when reading data > --- > > Key: BEAM-7424 > URL: https://issues.apache.org/jira/browse/BEAM-7424 > Project: Beam > Issue Type: Bug > Components: io-java-gcp, io-python-gcp, sdk-py-core >Reporter: Chamikara Jayalath >Assignee: Heejong Lee >Priority: Blocker > Fix For: 2.7.1, 2.14.0 > > > This has to be done for both Java and Python SDKs. > Seems like Java SDK already retries 429 errors w/o backoff (please verify): > [https://github.com/apache/beam/blob/master/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/util/RetryHttpRequestInitializer.java#L185] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (BEAM-7500) Retry HTTP 429 errors from GCS w/ exponential backoff when reading data (2.7.1 merge)
[ https://issues.apache.org/jira/browse/BEAM-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath updated BEAM-7500: - Fix Version/s: 2.7.1 > Retry HTTP 429 errors from GCS w/ exponential backoff when reading data > (2.7.1 merge) > - > > Key: BEAM-7500 > URL: https://issues.apache.org/jira/browse/BEAM-7500 > Project: Beam > Issue Type: Bug > Components: io-java-gcp, io-python-gcp >Reporter: Chamikara Jayalath >Assignee: Heejong Lee >Priority: Blocker > Fix For: 2.7.1 > > > Same as https://issues.apache.org/jira/browse/BEAM-7424 > > For tracking merging the fix to 2.7.1 (LTS branch). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (BEAM-7424) Retry HTTP 429 errors from GCS w/ exponential backoff when reading data
[ https://issues.apache.org/jira/browse/BEAM-7424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath updated BEAM-7424: - Fix Version/s: 2.14.0 2.7.1 > Retry HTTP 429 errors from GCS w/ exponential backoff when reading data > --- > > Key: BEAM-7424 > URL: https://issues.apache.org/jira/browse/BEAM-7424 > Project: Beam > Issue Type: Bug > Components: io-java-gcp, io-python-gcp, sdk-py-core >Reporter: Chamikara Jayalath >Assignee: Heejong Lee >Priority: Blocker > Fix For: 2.7.1, 2.14.0 > > > This has to be done for both Java and Python SDKs. > Seems like Java SDK already retries 429 errors w/o backoff (please verify): > [https://github.com/apache/beam/blob/master/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/util/RetryHttpRequestInitializer.java#L185] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (BEAM-7424) Retry HTTP 429 errors from GCS w/ exponential backoff when reading data
[ https://issues.apache.org/jira/browse/BEAM-7424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath updated BEAM-7424: - Priority: Blocker (was: Major) > Retry HTTP 429 errors from GCS w/ exponential backoff when reading data > --- > > Key: BEAM-7424 > URL: https://issues.apache.org/jira/browse/BEAM-7424 > Project: Beam > Issue Type: Bug > Components: io-java-gcp, io-python-gcp, sdk-py-core >Reporter: Chamikara Jayalath >Assignee: Heejong Lee >Priority: Blocker > > This has to be done for both Java and Python SDKs. > Seems like Java SDK already retries 429 errors w/o backoff (please verify): > [https://github.com/apache/beam/blob/master/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/util/RetryHttpRequestInitializer.java#L185] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (BEAM-7483) Audit mis-uses of GCS client library API
Chamikara Jayalath created BEAM-7483: Summary: Audit mis-uses of GCS client library API Key: BEAM-7483 URL: https://issues.apache.org/jira/browse/BEAM-7483 Project: Beam Issue Type: Improvement Components: io-java-gcp Reporter: Chamikara Jayalath For example, the GoogleCloudStorageReadChannel constructor may take a storage object that is null, but this will not work at run-time. Carefully auditing and fixing similar mis-uses of the API in main code or tests could reduce the chance of future bugs. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-7455) Improve Avro IO integration test coverage on Python 3.
[ https://issues.apache.org/jira/browse/BEAM-7455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853485#comment-16853485 ] Chamikara Jayalath commented on BEAM-7455: -- I think it's good if we can add an integration test as well. For example, Java has an AvroIO performance test here which we could replicate for Python: [https://builds.apache.org/view/A-D/view/Beam/view/PerformanceTests/job/beam_PerformanceTests_AvroIOIT/] > Improve Avro IO integration test coverage on Python 3. > -- > > Key: BEAM-7455 > URL: https://issues.apache.org/jira/browse/BEAM-7455 > Project: Beam > Issue Type: Sub-task > Components: io-python-avro >Reporter: Valentyn Tymofieiev >Assignee: Frederik Bode >Priority: Major > > It seems that we don't have an integration test for Avro IO on Python 3: > fastavro_it_test [1] depends on both avro and fastavro, however avro package > currently does not work with Beam on Python 3, so we don't have an > integration test that exercises Avro IO on Python 3. > We should add an integration test for Avro IO that does not need both > libraries at the same time, and instead can run using either library. > [~frederik] is this something you could help with? > cc: [~chamikara] [~Juta] > [1] > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/fastavro_it_test.py -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-7439) Bigquery Write with schema None: TypeError: 'NoneType' object has no attribute '__getitem__'
[ https://issues.apache.org/jira/browse/BEAM-7439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16850115#comment-16850115 ] Chamikara Jayalath commented on BEAM-7439: -- Also, Ahmet pointed out that we seem to be missing a critical test here, where we write to an existing BQ table without providing a schema. > Bigquery Write with schema None: TypeError: 'NoneType' object has no > attribute '__getitem__' > > > Key: BEAM-7439 > URL: https://issues.apache.org/jira/browse/BEAM-7439 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Juta Staes >Assignee: Pablo Estrada >Priority: Blocker > Fix For: 2.13.0 > > > When running a simple write to bigquery on apache-beam==2.12.0 > {code:java} > input_data = [ >{'str': 'test'} > ] > (pipeline | 'create' >> beam.Create(input_data) >| 'write' >> beam.io.WriteToBigQuery( >':beam_test.test')) > {code} > > I get the following error: > {code:java} > WARNING:root:Start running in the cloud > Traceback (most recent call last): > File "test_pipeline.py", line 193, in > main() > File "test_pipeline.py", line 183, in main > ':beam_test.test')) > File > "/mnt/c/Users/Juta/Documents/02-projects/apache/beam/sdks/venv2/local/lib/python2.7/site-packages/apache_beam/pvalue.py", > line 112, in __or__ > return self.pipeline.apply(ptransform, self) > File > "/mnt/c/Users/Juta/Documents/02-projects/apache/beam/sdks/venv2/local/lib/python2.7/site-packages/apache_beam/pipeline.py", > line 470, in apply > label or transform.label) > File > "/mnt/c/Users/Juta/Documents/02-projects/apache/beam/sdks/venv2/local/lib/python2.7/site-packages/apache_beam/pipeline.py", > line 480, in apply > return self.apply(transform, pvalueish) > File > "/mnt/c/Users/Juta/Documents/02-projects/apache/beam/sdks/venv2/local/lib/python2.7/site-packages/apache_beam/pipeline.py", > line 516, in apply > pvalueish_result = self.runner.apply(transform, pvalueish, self._options) > File > 
"/mnt/c/Users/Juta/Documents/02-projects/apache/beam/sdks/venv2/local/lib/python2.7/site-packages/apache_beam/runners/runner.py", > line 193, in apply > return m(transform, input, options) > File > "/mnt/c/Users/Juta/Documents/02-projects/apache/beam/sdks/venv2/local/lib/python2.7/site-packages/apache_beam/runners/dataflow/dataflow_runner.py", > line 617, in apply_WriteToBigQuery > parse_table_schema_from_json(json.dumps(transform.schema)), > File > "/mnt/c/Users/Juta/Documents/02-projects/apache/beam/sdks/venv2/local/lib/python2.7/site-packages/apache_beam/io/gcp/bigquery_tools.py", > line 130, in parse_table_schema_from_json > fields = [_parse_schema_field(f) for f in json_schema['fields']] > TypeError: 'NoneType' object has no attribute '__getitem__'{code} > I already proposed a fix for this as part of a larger pr: > https://github.com/apache/beam/pull/8621/commits/41cdfbda5a4e2a56b6d10046ba265ad68c78675d > I was wondering if this also needs to be patched for version 2.12.0? > cc: [~tvalentyn] [~pabloem] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
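The root cause in the traceback above is easy to reproduce in isolation: `json.dumps(None)` yields `"null"`, which parses back to `None`, and indexing `None` raises the reported `TypeError`. The sketch below uses a simplified stand-in for `parse_table_schema_from_json` (an assumption for illustration, not the real implementation) to show the failure and the kind of guard the fix applies:

```python
import json


def parse_table_schema_from_json(schema_string):
    # Simplified stand-in for the helper in apache_beam.io.gcp.bigquery_tools:
    # like the real code, it indexes json_schema['fields'], which is what fails.
    json_schema = json.loads(schema_string)
    return [f['name'] for f in json_schema['fields']]


# json.dumps(None) == 'null', which loads back as None, so the 'fields'
# lookup raises TypeError ("'NoneType' object has no attribute '__getitem__'"
# on Python 2; "not subscriptable" on Python 3).
try:
    parse_table_schema_from_json(json.dumps(None))
    raise AssertionError('expected a TypeError')
except TypeError:
    pass

# The guard: skip schema parsing entirely when no schema was supplied,
# letting BigQuery fall back to the existing table's schema.
schema = None
parsed = (parse_table_schema_from_json(json.dumps(schema))
          if schema is not None else None)
assert parsed is None
```

This is also why the missing test Ahmet pointed out matters: writing to an existing table without a schema is exactly the path that exercises the `None` branch.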
[jira] [Assigned] (BEAM-7424) Retry HTTP 429 errors from GCS when reading/writing data and when staging files
[ https://issues.apache.org/jira/browse/BEAM-7424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath reassigned BEAM-7424: Assignee: Yueyang Qiu > Retry HTTP 429 errors from GCS when reading/writing data and when staging > files > > > Key: BEAM-7424 > URL: https://issues.apache.org/jira/browse/BEAM-7424 > Project: Beam > Issue Type: Bug > Components: io-java-gcp, io-python-gcp, sdk-py-core >Reporter: Chamikara Jayalath >Assignee: Yueyang Qiu >Priority: Major > > This has to be done for both Java and Python SDKs. > > Seems like Java SDK already retries 429 errors (so prob. just have to > verify): > [https://github.com/apache/beam/blob/master/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/util/RetryHttpRequestInitializer.java#L185] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (BEAM-7424) Retry HTTP 429 errors from GCS when reading/writing data and when staging files
Chamikara Jayalath created BEAM-7424: Summary: Retry HTTP 429 errors from GCS when reading/writing data and when staging files Key: BEAM-7424 URL: https://issues.apache.org/jira/browse/BEAM-7424 Project: Beam Issue Type: Bug Components: io-java-gcp, io-python-gcp, sdk-py-core Reporter: Chamikara Jayalath This has to be done for both Java and Python SDKs. Seems like Java SDK already retries 429 errors (so prob. just have to verify): [https://github.com/apache/beam/blob/master/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/util/RetryHttpRequestInitializer.java#L185] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-7246) Create a Spanner IO for Python
[ https://issues.apache.org/jira/browse/BEAM-7246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16846256#comment-16846256 ] Chamikara Jayalath commented on BEAM-7246: -- Shehzaad, is this something you hope to work on in the near future? If so, I'd appreciate it if you could provide a rough design. The existing Java connector should serve as a good example. You can probably use the client library here: [https://pypi.org/project/google-cloud-spanner/] > Create a Spanner IO for Python > -- > > Key: BEAM-7246 > URL: https://issues.apache.org/jira/browse/BEAM-7246 > Project: Beam > Issue Type: Bug > Components: io-python-gcp >Reporter: Reuven Lax >Assignee: Shehzaad Nakhoda >Priority: Major > > Add I/O support for Google Cloud Spanner for the Python SDK (Batch Only). > Testing in this work item will be in the form of DirectRunner tests and > manual testing. > Integration and performance tests are a separate work item (not included > here). > See https://beam.apache.org/documentation/io/built-in/. The goal is to add > Google Cloud Spanner to the Database column for the Python/Batch row. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-6951) Beam Dependency Update Request: com.github.spotbugs:spotbugs-annotations
[ https://issues.apache.org/jira/browse/BEAM-6951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844202#comment-16844202 ] Chamikara Jayalath commented on BEAM-6951: -- cc: [~kenn] > Beam Dependency Update Request: com.github.spotbugs:spotbugs-annotations > > > Key: BEAM-6951 > URL: https://issues.apache.org/jira/browse/BEAM-6951 > Project: Beam > Issue Type: Sub-task > Components: dependencies >Reporter: Beam JIRA Bot >Priority: Major > > - 2019-04-01 12:15:05.460427 > - > Please consider upgrading the dependency > com.github.spotbugs:spotbugs-annotations. > The current version is 3.1.11. The latest version is 4.0.0-beta1 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2019-04-08 12:15:37.305259 > - > Please consider upgrading the dependency > com.github.spotbugs:spotbugs-annotations. > The current version is 3.1.11. The latest version is 4.0.0-beta1 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2019-04-15 12:35:52.817108 > - > Please consider upgrading the dependency > com.github.spotbugs:spotbugs-annotations. > The current version is 3.1.11. The latest version is 4.0.0-beta1 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2019-04-22 12:13:25.261372 > - > Please consider upgrading the dependency > com.github.spotbugs:spotbugs-annotations. > The current version is 3.1.11. The latest version is 4.0.0-beta1 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2019-05-20 16:39:18.034675 > - > Please consider upgrading the dependency > com.github.spotbugs:spotbugs-annotations. 
> The current version is 3.1.11. The latest version is 4.0.0-beta1 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2019-05-20 16:54:09.180503 > - > Please consider upgrading the dependency > com.github.spotbugs:spotbugs-annotations. > The current version is 3.1.11. The latest version is 4.0.0-beta1 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2019-05-20 17:37:40.326607 > - > Please consider upgrading the dependency > com.github.spotbugs:spotbugs-annotations. > The current version is 3.1.11. The latest version is 4.0.0-beta1 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-5968) Beam Dependency Update Request: future
[ https://issues.apache.org/jira/browse/BEAM-5968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844197#comment-16844197 ] Chamikara Jayalath commented on BEAM-5968: -- cc: [~tvalentyn] > Beam Dependency Update Request: future > -- > > Key: BEAM-5968 > URL: https://issues.apache.org/jira/browse/BEAM-5968 > Project: Beam > Issue Type: Bug > Components: dependencies >Reporter: Beam JIRA Bot >Priority: Major > > - 2018-11-05 12:10:39.020851 > - > Please consider upgrading the dependency future. > The current version is 0.16.0. The latest version is 0.17.1 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2018-11-12 12:10:22.255179 > - > Please consider upgrading the dependency future. > The current version is 0.16.0. The latest version is 0.17.1 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2018-11-19 12:10:39.701768 > - > Please consider upgrading the dependency future. > The current version is 0.16.0. The latest version is 0.17.1 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2018-11-26 12:09:05.586578 > - > Please consider upgrading the dependency future. > The current version is 0.16.0. The latest version is 0.17.1 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2018-12-03 12:09:53.018813 > - > Please consider upgrading the dependency future. > The current version is 0.16.0. The latest version is 0.17.1 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. 
> - 2018-12-10 12:09:14.146087 > - > Please consider upgrading the dependency future. > The current version is 0.16.0. The latest version is 0.17.1 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2018-12-17 12:11:47.510213 > - > Please consider upgrading the dependency future. > The current version is 0.16.0. The latest version is 0.17.1 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2018-12-31 15:15:47.043329 > - > Please consider upgrading the dependency future. > The current version is 0.16.0. The latest version is 0.17.1 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2019-01-07 12:09:52.562726 > - > Please consider upgrading the dependency future. > The current version is 0.16.0. The latest version is 0.17.1 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2019-01-14 12:10:27.645112 > - > Please consider upgrading the dependency future. > The current version is 0.16.0. The latest version is 0.17.1 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2019-01-21 12:18:02.030313 > - > Please consider upgrading the dependency future. > The current version is 0.16.0. The latest version is 0.17.1 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2019-01-28 12:10:09.289616 > - > Please
[jira] [Commented] (BEAM-7365) apache_beam.io.avroio_test.TestAvro.test_dynamic_work_rebalancing_exhaustive is very slow
[ https://issues.apache.org/jira/browse/BEAM-7365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844145#comment-16844145 ] Chamikara Jayalath commented on BEAM-7365: -- Taking a look. Seems like the test works fine on Py2, so it probably has to be redesigned to fit the threading model of Py3. > apache_beam.io.avroio_test.TestAvro.test_dynamic_work_rebalancing_exhaustive > is very slow > - > > Key: BEAM-7365 > URL: https://issues.apache.org/jira/browse/BEAM-7365 > Project: Beam > Issue Type: Sub-task > Components: io-python-avro >Reporter: Robert Bradshaw >Priority: Major > > {noformat} > $ python setup.py test -s > apache_beam.io.avroio_test.TestAvro.test_dynamic_work_rebalancing_exhaustive > test_dynamic_work_rebalancing_exhaustive > (apache_beam.io.avroio_test.TestFastAvro) ... WARNING:root:After 101 > concurrent splitting trials at item #2, observed only failure, giving up on > this item > WARNING:root:After 101 concurrent splitting trials at item #21, observed only > failure, giving up on this item > WARNING:root:After 101 concurrent splitting trials at item #22, observed only > failure, giving up on this item > WARNING:root:After 1014 total concurrent splitting trials, considered only 25 > items, giving up. > ok > -- > Ran 1 test in 172.223s > > {noformat} > Compare this with > {noformat} > $ python setup.py test -s > apache_beam.io.avroio_test.TestAvro.test_dynamic_work_rebalancing_exhaustive > test_dynamic_work_rebalancing_exhaustive > (apache_beam.io.avroio_test.TestAvro) ... ok > -- > Ran 1 test in 0.623s > OK > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (BEAM-7063) Stage Dataflow runner harness in Dataflow FnApi Python test suites that run using an unreleased SDK.
[ https://issues.apache.org/jira/browse/BEAM-7063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath updated BEAM-7063: - Priority: Major (was: Critical) > Stage Dataflow runner harness in Dataflow FnApi Python test suites that run > using an unreleased SDK. > > > Key: BEAM-7063 > URL: https://issues.apache.org/jira/browse/BEAM-7063 > Project: Beam > Issue Type: Bug > Components: sdk-py-core, test-failures >Reporter: Chamikara Jayalath >Assignee: Valentyn Tymofieiev >Priority: Major > Fix For: 2.13.0 > > Time Spent: 5h > Remaining Estimate: 0h > > Seems like this was the first post-commit failure: > [https://builds.apache.org/view/A-D/view/Beam/view/PostCommit/job/beam_PostCommit_Python_Verify/7855/] > (the one before this also failed but that looks like a flake) > > > Looking at failed tests I noticed that some of the were due to failed > Dataflow jobs. I noticed three issues. > (1) Error in Dataflow jobs "This handler is only capable of dealing with > urn:beam:sideinput:materialization:multimap:0.1 materializations but was > asked to handle beam:side_input:multimap:v1 for PCollectionView with tag > side0-FilterOutSpammers." > [https://pantheon.corp.google.com/dataflow/jobsDetail/locations/us-central1/jobs/2019-04-08_17_57_01-16397932959358575026?project=apache-beam-testing] > [https://pantheon.corp.google.com/dataflow/jobsDetail/locations/us-central1/jobs/2019-04-08_18_07_11-1342017143600540529?project=apache-beam-testing] > > (2) BigQuery job failures (possibly this is transient) > [https://pantheon.corp.google.com/dataflow/jobsDetail/locations/us-central1/jobs/2019-04-08_18_25_22-16437560120793921616?project=apache-beam-testing] > > (3) Some batch Dataflow jobs timed out and were cancelled (possibly also > transient). 
> > (1) is probably due to > [https://github.com/apache/beam/commit/684f8130284a7c7979773300d04e5473ca0ac8f3] > > > cc: [~altay] [~lostluck] [~alanmyrvold] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-3055) Retry downloading required test artifacts with a backoff when download fails.
[ https://issues.apache.org/jira/browse/BEAM-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834215#comment-16834215 ] Chamikara Jayalath commented on BEAM-3055: -- Few more flakes due to this. [https://builds.apache.org/job/beam_PostRelease_NightlySnapshot/607/] [https://builds.apache.org/job/beam_PostRelease_NightlySnapshot/605/] cc: [~alanmyrvold] [~yifanzou] > Retry downloading required test artifacts with a backoff when download fails. > - > > Key: BEAM-3055 > URL: https://issues.apache.org/jira/browse/BEAM-3055 > Project: Beam > Issue Type: Improvement > Components: test-failures, testing >Reporter: Valentyn Tymofieiev >Assignee: Jason Kuster >Priority: Major > Fix For: Not applicable > > > When Maven fails to download a required artifact for a test, the test fails. > Is it possible to configure Maven to retry the download with a backoff up to > N number of attempts? > Example test failure: > https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/15004/console > 2017-10-11T19:01:21.382 [INFO] > > 2017-10-11T19:01:21.382 [INFO] BUILD FAILURE > 2017-10-11T19:01:21.382 [INFO] > > 2017-10-11T19:01:21.383 [INFO] Total time: 55:20 min > 2017-10-11T19:01:21.383 [INFO] Finished at: 2017-10-11T19:01:21+00:00 > 2017-10-11T19:01:23.807 [INFO] Final Memory: 261M/2068M > 2017-10-11T19:01:23.807 [INFO] > > 2017-10-11T19:01:23.836 [ERROR] Failed to execute goal on project > beam-sdks-java-io-hcatalog: Could not resolve dependencies for project > org.apache.beam:beam-sdks-java-io-hcatalog:jar:2.2.0-SNAPSHOT: The following > artifacts could not be resolved: org.apache.hive:hive-metastore:jar:2.1.0, > javolution:javolution:jar:5.5.1: Could not transfer artifact > org.apache.hive:hive-metastore:jar:2.1.0 from/to central > (https://repo.maven.apache.org/maven2): GET request of: > org/apache/hive/hive-metastore/2.1.0/hive-metastore-2.1.0.jar from central > failed: Connection reset -> [Help 1]. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
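The retry-with-backoff behavior requested in BEAM-3055 can be sketched as a generic helper. This is an illustrative Python sketch, not Maven configuration; the names `retry_with_backoff`, `max_attempts`, and `base_delay` are hypothetical, and the injectable `sleep` parameter exists only to make the sketch testable.

```python
import time


def retry_with_backoff(fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on any exception with exponential backoff.

    After the k-th failed attempt, waits base_delay * 2**(k - 1) seconds
    before retrying; the last attempt's exception is re-raised.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

A flaky artifact download (e.g. the "Connection reset" seen above) would then succeed as long as one of the attempts goes through, instead of failing the whole build on the first transient error.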
[jira] [Resolved] (BEAM-6894) ExternalTransform.expand() does not create the proper AppliedPTransform sub-graph
[ https://issues.apache.org/jira/browse/BEAM-6894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath resolved BEAM-6894. -- Resolution: Fixed Fix Version/s: 2.13.0 > ExternalTransform.expand() does not create the proper AppliedPTransform > sub-graph > - > > Key: BEAM-6894 > URL: https://issues.apache.org/jira/browse/BEAM-6894 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Chamikara Jayalath >Assignee: Chamikara Jayalath >Priority: Major > Fix For: 2.13.0 > > Time Spent: 10h 20m > Remaining Estimate: 0h > > 'ExternalTransform.expand()' can be used to expand a remote transform and > build the correct runner-api subgraph for that transform. However, currently > we do not modify the AppliedPTransform sub-graph correctly during this > process. Relevant code location: > [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/external.py#L135] > > Without this, the DataflowRunner, which relies on this object graph (not just > the runner API proto) to build the job submission request to the Dataflow > service, cannot construct that request properly. > > cc: [~robertwb] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (BEAM-7172) Support using SDFs as cross-language transforms using ExternalTransform
Chamikara Jayalath created BEAM-7172: Summary: Support using SDFs as cross-language transforms using ExternalTransform Key: BEAM-7172 URL: https://issues.apache.org/jira/browse/BEAM-7172 Project: Beam Issue Type: Bug Components: runner-dataflow, sdk-py-core, sdk-py-harness Reporter: Chamikara Jayalath For example, we need to determine the restriction coder before job submission for the Dataflow runner. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-6590) LTS backport: upgrade gcsio dependency to 1.9.13
[ https://issues.apache.org/jira/browse/BEAM-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16827207#comment-16827207 ] Chamikara Jayalath commented on BEAM-6590: -- Will look into this. Thanks. Is there an ETA for 2.7.1? > LTS backport: upgrade gcsio dependency to 1.9.13 > > > Key: BEAM-6590 > URL: https://issues.apache.org/jira/browse/BEAM-6590 > Project: Beam > Issue Type: Bug > Components: io-java-gcp, sdk-java-core >Reporter: Chamikara Jayalath >Assignee: Chamikara Jayalath >Priority: Blocker > Labels: triaged > Fix For: 2.7.1 > > > 1.9.12 has the following critical bug, so it should be avoided. > > [https://github.com/GoogleCloudPlatform/bigdata-interop/commit/52f5055d37f20a04303b146e9063e7ccc876ec17] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-7137) TypeError caused by using str variable as header argument in apache_beam.io.textio.WriteToText
[ https://issues.apache.org/jira/browse/BEAM-7137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16827174#comment-16827174 ] Chamikara Jayalath commented on BEAM-7137: -- [~pabloem] [~ŁukaszG] in case we already have plans to add such tests. > TypeError caused by using str variable as header argument in > apache_beam.io.textio.WriteToText > -- > > Key: BEAM-7137 > URL: https://issues.apache.org/jira/browse/BEAM-7137 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-core >Affects Versions: 2.11.0 > Environment: Python 3.5.6 > macOS Mojave 10.14.4 >Reporter: yoshiki obata >Assignee: yoshiki obata >Priority: Major > Fix For: 2.13.0 > > > Passing a str header to apache_beam.io.textio.WriteToText causes a > TypeError with Python 3.5.6 (or possibly higher), even though the docstring > says header is a str. > This error occurs because the header is written to the file without being > encoded to bytes in apache_beam.io.textio._TextSink.open. > > {code:java} > Traceback (most recent call last): > File "apache_beam/runners/common.py", line 727, in > apache_beam.runners.common.DoFnRunner.process > File "apache_beam/runners/common.py", line 555, in > apache_beam.runners.common.PerWindowInvoker.invoke_process > File "apache_beam/runners/common.py", line 625, in > apache_beam.runners.common.PerWindowInvoker._invoke_per_window > File > "/Users/yob/.local/share/virtualenvs/test/lib/python3.5/site-packages/apache_beam/io/iobase.py", > line 1033, in process > self.writer = self.sink.open_writer(init_result, str(uuid.uuid4())) > File > "/Users/yob/.local/share/virtualenvs/test/lib/python3.5/site-packages/apache_beam/options/value_provider.py", > line 137, in _f > return fnc(self, *args, **kwargs) > File > "/Users/yob/.local/share/virtualenvs/test/lib/python3.5/site-packages/apache_beam/io/filebasedsink.py", > line 185, in open_writer > return FileBasedSinkWriter(self, os.path.join(init_result, uid) + suffix) > File > 
"/Users/yob/.local/share/virtualenvs/test/lib/python3.5/site-packages/apache_beam/io/filebasedsink.py", > line 389, in __init__ > self.temp_handle = self.sink.open(temp_shard_path) > File > "/Users/yob/.local/share/virtualenvs/test/lib/python3.5/site-packages/apache_beam/io/textio.py", > line 393, in open > file_handle.write(self._header) > TypeError: a bytes-like object is required, not 'str' > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-7137) TypeError caused by using str variable as header argument in apache_beam.io.textio.WriteToText
[ https://issues.apache.org/jira/browse/BEAM-7137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16827173#comment-16827173 ] Chamikara Jayalath commented on BEAM-7137: -- Yes. I think we need more integration test coverage in Beam for Python file-based IO (this is currently indirectly covered by other tests) > TypeError caused by using str variable as header argument in > apache_beam.io.textio.WriteToText > -- > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-7137) TypeError caused by using str variable as header argument in apache_beam.io.textio.WriteToText
[ https://issues.apache.org/jira/browse/BEAM-7137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826287#comment-16826287 ] Chamikara Jayalath commented on BEAM-7137: -- +1 for encoding header to bytes (using UTF-8). We have clearly documented that header should be a string here: [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/textio.py#L599] > TypeError caused by using str variable as header argument in > apache_beam.io.textio.WriteToText > -- > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
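The fix proposed in the comment above (encoding the header using UTF-8 before it reaches the binary file handle) can be sketched as follows. `coerce_header_to_bytes` is a hypothetical helper name for illustration, not the actual change made to `_TextSink.open`.

```python
import io


def coerce_header_to_bytes(header):
    """Encode a str header as UTF-8 bytes so it can be written to a
    binary file handle; bytes (and None) pass through unchanged."""
    if isinstance(header, str):
        return header.encode("utf-8")
    return header


# Writing the coerced header to a binary handle no longer raises
# "TypeError: a bytes-like object is required, not 'str'".
handle = io.BytesIO()
handle.write(coerce_header_to_bytes("col_a,col_b\n"))
```

This keeps the documented str-typed `header` argument working on Python 3 while still accepting callers that already pass bytes.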
[jira] [Resolved] (BEAM-6962) Add wordcount_xlang to Python post-commit
[ https://issues.apache.org/jira/browse/BEAM-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath resolved BEAM-6962. -- Resolution: Fixed Fix Version/s: 2.13.0 > Add wordcount_xlang to Python post-commit > - > > Key: BEAM-6962 > URL: https://issues.apache.org/jira/browse/BEAM-6962 > Project: Beam > Issue Type: Improvement > Components: runner-flink, sdk-py-core, sdk-py-harness >Reporter: Chamikara Jayalath >Assignee: Heejong Lee >Priority: Major > Fix For: 2.13.0 > > Time Spent: 1h 40m > Remaining Estimate: 0h > > This example works great but we should add it to Python post-commit to > prevent bit-rot. > [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/wordcount_xlang.py] > Also, please consider adding a step to the pipeline that performs a checksum > on the output to make sure that the output is valid. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
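The checksum step suggested for wordcount_xlang's output could be as simple as hashing the sorted output lines and comparing against a known value. The function name `output_checksum` and the choice of SHA-1 here are illustrative assumptions, not what Beam's test utilities actually use.

```python
import hashlib


def output_checksum(lines):
    """Order-insensitive checksum of pipeline output: hash the sorted
    lines so that shard ordering does not affect the result."""
    digest = hashlib.sha1()
    for line in sorted(lines):
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()
```

A post-commit test could then assert that the checksum of the collected output files equals the expected value for the test input, catching silent output corruption rather than only pipeline failures.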
[jira] [Commented] (BEAM-7063) Stage Dataflow runner harness in Dataflow FnApi Python test suites that run using an unreleased SDK.
[ https://issues.apache.org/jira/browse/BEAM-7063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823559#comment-16823559 ] Chamikara Jayalath commented on BEAM-7063: -- Ah OK. Thanks. This bug was opened for a post-commit failure, but I see now that the title has changed. > Stage Dataflow runner harness in Dataflow FnApi Python test suites that run > using an unreleased SDK. > > > Key: BEAM-7063 > URL: https://issues.apache.org/jira/browse/BEAM-7063 > Project: Beam > Issue Type: Bug > Components: sdk-py-core, test-failures >Reporter: Chamikara Jayalath >Assignee: Luke Cwik >Priority: Critical > Fix For: 2.13.0 > > Time Spent: 5h > Remaining Estimate: 0h > > Seems like this was the first post-commit failure: > [https://builds.apache.org/view/A-D/view/Beam/view/PostCommit/job/beam_PostCommit_Python_Verify/7855/] > (the one before this also failed but that looks like a flake) > > > Looking at failed tests I noticed that some of them were due to failed > Dataflow jobs. I noticed three issues. > (1) Error in Dataflow jobs "This handler is only capable of dealing with > urn:beam:sideinput:materialization:multimap:0.1 materializations but was > asked to handle beam:side_input:multimap:v1 for PCollectionView with tag > side0-FilterOutSpammers." > [https://pantheon.corp.google.com/dataflow/jobsDetail/locations/us-central1/jobs/2019-04-08_17_57_01-16397932959358575026?project=apache-beam-testing] > [https://pantheon.corp.google.com/dataflow/jobsDetail/locations/us-central1/jobs/2019-04-08_18_07_11-1342017143600540529?project=apache-beam-testing] > > (2) BigQuery job failures (possibly this is transient) > [https://pantheon.corp.google.com/dataflow/jobsDetail/locations/us-central1/jobs/2019-04-08_18_25_22-16437560120793921616?project=apache-beam-testing] > > (3) Some batch Dataflow jobs timed out and were cancelled (possibly also > transient). 
> > (1) is probably due to > [https://github.com/apache/beam/commit/684f8130284a7c7979773300d04e5473ca0ac8f3] > > > cc: [~altay] [~lostluck] [~alanmyrvold] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (BEAM-7063) beam_PostCommit_Python_Verify is consistently failing due to a sideinput materialization related issue
[ https://issues.apache.org/jira/browse/BEAM-7063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath resolved BEAM-7063. -- Resolution: Fixed Seems like the test suite is passing now. Resolving. Thanks everyone. > beam_PostCommit_Python_Verify is consistently failing due to a sideinput > materialization related issue > -- > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-7063) beam_PostCommit_Python_Verify is consistently failing due to a sideinput materialization related issue
[ https://issues.apache.org/jira/browse/BEAM-7063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819448#comment-16819448 ] Chamikara Jayalath commented on BEAM-7063: -- [~tvalentyn], since the failing test seems to be a Python 3 test. > beam_PostCommit_Python_Verify is consistently failing due to a sideinput > materialization related issue > -- > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Reopened] (BEAM-7063) beam_PostCommit_Python_Verify is consistently failing due to a sideinput materialization related issue
[ https://issues.apache.org/jira/browse/BEAM-7063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath reopened BEAM-7063: -- > beam_PostCommit_Python_Verify is consistently failing due to a sideinput > materialization related issue > -- > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-7063) beam_PostCommit_Python_Verify is consistently failing due to a sideinput materialization related issue
[ https://issues.apache.org/jira/browse/BEAM-7063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819309#comment-16819309 ] Chamikara Jayalath commented on BEAM-7063: -- Looks like one postCommit test is still failing. [https://builds.apache.org/view/A-D/view/Beam/view/PostCommit/job/beam_PostCommit_Python3_Verify/562/] [https://scans.gradle.com/s/2jqz3qlygliv4/console-log?task=:beam-sdks-python-test-suites-dataflow-py3:postCommitIT#L29] test_wordcount_fnapi_it (apache_beam.examples.wordcount_it_test.WordCountIT) ... ERROR apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error: java.lang.IllegalArgumentException: This handler is only capable of dealing with urn:beam:sideinput:materialization:multimap:0.1 materializations but was asked to handle beam:side_input:multimap:v1 for PCollectionView with tag side0-write/Write/WriteImpl/WriteBundles. at org.apache.beam.vendor.guava.v20_0.com.google.common.base.Preconditions.checkArgument(Preconditions.java:399) at org.apache.beam.runners.dataflow.worker.graph.RegisterNodeFunction.transformSideInputForRunner(RegisterNodeFunction.java:506) at org.apache.beam.runners.dataflow.worker.graph.RegisterNodeFunction.apply(RegisterNodeFunction.java:327) at org.apache.beam.runners.dataflow.worker.graph.RegisterNodeFunction.apply(RegisterNodeFunction.java:97) at java.util.function.Function.lambda$andThen$1(Function.java:88) at org.apache.beam.runners.dataflow.worker.graph.CreateRegisterFnOperationFunction.apply(CreateRegisterFnOperationFunction.java:208) at org.apache.beam.runners.dataflow.worker.graph.CreateRegisterFnOperationFunction.apply(CreateRegisterFnOperationFunction.java:75) at java.util.function.Function.lambda$andThen$1(Function.java:88) at java.util.function.Function.lambda$andThen$1(Function.java:88) at org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.doWork(BatchDataflowWorker.java:347) at 
org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.getAndPerformWork(BatchDataflowWorker.java:306) at org.apache.beam.runners.dataflow.worker.DataflowRunnerHarness.start(DataflowRunnerHarness.java:195) at org.apache.beam.runners.dataflow.worker.DataflowRunnerHarness.main(DataflowRunnerHarness.java:123) > beam_PostCommit_Python_Verify is consistently failing due to a sideinput > materialization related issue > -- > -- This message was sent by Atlassian JIRA (v7.6.3#76005)