[jira] [Commented] (BEAM-8286) Python precommit (:sdks:python:test-suites:tox:py2:docs) failing
[ https://issues.apache.org/jira/browse/BEAM-8286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933891#comment-16933891 ] Chamikara Jayalath commented on BEAM-8286:
--

Seems like Kyle's PR fixed the failure. Closing.

> Python precommit (:sdks:python:test-suites:tox:py2:docs) failing
> ---
>
> Key: BEAM-8286
> URL: https://issues.apache.org/jira/browse/BEAM-8286
> Project: Beam
> Issue Type: Bug
> Components: test-failures
> Reporter: Kyle Weaver
> Priority: Major
> Time Spent: 40m
> Remaining Estimate: 0h
>
> Example failure:
> [https://builds.apache.org/job/beam_PreCommit_Python_Commit/8638/console]
> 17:29:13 * What went wrong:
> 17:29:13 Execution failed for task ':sdks:python:test-suites:tox:py2:docs'.
> 17:29:13 > Process 'command 'sh'' finished with non-zero exit value 1
> Fails on my local machine (on head) as well. Can't determine exact cause.
> ERROR: InvocationError for command /usr/bin/time scripts/generate_pydoc.sh (exited with code 1)

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (BEAM-8286) Python precommit (:sdks:python:test-suites:tox:py2:docs) failing
[ https://issues.apache.org/jira/browse/BEAM-8286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath resolved BEAM-8286.
--

Fix Version/s: Not applicable
Resolution: Fixed
[jira] [Commented] (BEAM-8286) Python precommit (:sdks:python:test-suites:tox:py2:docs) failing
[ https://issues.apache.org/jira/browse/BEAM-8286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933748#comment-16933748 ] Chamikara Jayalath commented on BEAM-8286:
--

Hmm, looking at another Gradle build I see only the following warnings.

WARNING: intersphinx inventory 'https://googleapis.github.io/google-cloud-python/latest/objects.inv' not fetchable due to : 404 Client Error: Not Found for url: https://googleapis.github.io/google-cloud-python/latest/objects.inv
/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Phrase/src/sdks/python/test-suites/tox/py2/build/srcs/sdks/python/apache_beam/io/gcp/datastore/v1new/helper.py:docstring of apache_beam.io.gcp.datastore.v1new.helper.write_mutations:: WARNING: py:class reference target not found: google.cloud.datastore.batch.Batch
/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Phrase/src/sdks/python/test-suites/tox/py2/build/srcs/sdks/python/apache_beam/io/gcp/datastore/v1new/types.py:docstring of apache_beam.io.gcp.datastore.v1new.types.Key:: WARNING: py:class reference target not found: google.cloud.datastore.key.Key
/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Phrase/src/sdks/python/test-suites/tox/py2/build/srcs/sdks/python/apache_beam/io/gcp/datastore/v1new/types.py:docstring of apache_beam.io.gcp.datastore.v1new.types.Key.to_client_key:1: WARNING: py:class reference target not found: google.cloud.datastore.key.Key
/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Phrase/src/sdks/python/test-suites/tox/py2/build/srcs/sdks/python/apache_beam/io/gcp/datastore/v1new/types.py:docstring of apache_beam.io.gcp.datastore.v1new.types.Entity.set_properties:: WARNING: py:class reference target not found: google.cloud.datastore.entity.Entity
/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Phrase/src/sdks/python/test-suites/tox/py2/build/srcs/sdks/python/apache_beam/io/gcp/datastore/v1new/types.py:docstring of apache_beam.io.gcp.datastore.v1new.types.Entity.to_client_entity:1: WARNING: py:class reference target not found: google.cloud.datastore.entity.Entity

All these warnings are related to Datastore. The first one is interesting, since that link does not actually seem to be available. Looking at a passing ':sdks:python:test-suites:tox:py2:docs' run, I don't see any warnings.
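The "not fetchable ... 404" warning comes from Sphinx's intersphinx extension, which downloads each configured target's objects.inv inventory at build time. As an illustrative sketch only (this is not Beam's actual conf.py, and the mapping keys are assumptions), a conf.py entry for such a target looks like this; when the docs build runs sphinx-build with -W, the warning from a dead inventory URL is promoted to a build failure like the one above:

```python
# conf.py fragment (illustrative, not Beam's real config). intersphinx
# fetches <base URL>/objects.inv for every entry here at build time, so an
# entry whose host has moved or removed the inventory yields the
# "intersphinx inventory ... not fetchable ... 404" warning.
intersphinx_mapping = {
    'python': ('https://docs.python.org/3/', None),
    # A stale entry like this one would reproduce the warning above; the fix
    # is to point at a URL that still serves objects.inv, or drop the entry.
    'google-cloud-python': (
        'https://googleapis.github.io/google-cloud-python/latest/', None),
}
```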
[jira] [Commented] (BEAM-8286) Python precommit (:sdks:python:test-suites:tox:py2:docs) failing
[ https://issues.apache.org/jira/browse/BEAM-8286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933732#comment-16933732 ] Chamikara Jayalath commented on BEAM-8286:
--

This seems to be the first PR for which the pre-commit failed: [https://github.com/apache/beam/pull/9611] But I'm not sure if it's related. cc: [~udim]
[jira] [Commented] (BEAM-8240) SDK Harness
[ https://issues.apache.org/jira/browse/BEAM-8240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930826#comment-16930826 ] Chamikara Jayalath commented on BEAM-8240:
--

Makes sense. BTW please fix the title of the JIRA.

> SDK Harness
> ---
>
> Key: BEAM-8240
> URL: https://issues.apache.org/jira/browse/BEAM-8240
> Project: Beam
> Issue Type: Bug
> Components: sdk-py-harness
> Reporter: Luke Cwik
> Assignee: Luke Cwik
> Priority: Minor
> Time Spent: 2h 50m
> Remaining Estimate: 0h
>
> SDK harness incorrectly identifies itself when using a custom SDK container within the environment field when building the pipeline proto.
>
> Passing in the experiment *worker_harness_container_image=YYY* doesn't override the pipeline proto environment field and it is still being populated with *gcr.io/cloud-dataflow/v1beta3/python-fnapi:beam-master-20190802*

-- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (BEAM-8222) Consider making insertId optional in BigQuery.insertAll
[ https://issues.apache.org/jira/browse/BEAM-8222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930697#comment-16930697 ] Chamikara Jayalath commented on BEAM-8222:
--

Based on some offline comments from [~reuvenlax], this might be undesirable and may cause user confusion. AFAIK Dataflow and other Beam runners that support BigQueryIO.Sink are tolerant to failures and may retry work items, so handling duplicates is required for the safety of inserted data. Without insertId, things might speed up in the short term for runs without failures, but this mode of execution is not safe in the long run.

> Consider making insertId optional in BigQuery.insertAll
> ---
>
> Key: BEAM-8222
> URL: https://issues.apache.org/jira/browse/BEAM-8222
> Project: Beam
> Issue Type: New Feature
> Components: io-java-gcp
> Reporter: Boyuan Zhang
> Priority: Major
>
> The current implementation of StreamingWriteFn (https://github.com/apache/beam/blob/c2f0d282337f3ae0196a7717712396a5a41fdde1/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/StreamingWriteFn.java#L102) sets insertId from the input element, which is given a uniqueId by https://github.com/apache/beam/blob/c2f0d282337f3ae0196a7717712396a5a41fdde1/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/TagWithUniqueIds.java#L53.
> Users report that leaving insertId empty speeds writing up considerably. Can we add a bqOption like nonInsertId and emit an empty id based on this option?
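To illustrate the safety argument above, here is a minimal Python sketch, not the BigQuery API: `insert_all` and the dict-backed table are hypothetical stand-ins modeling how the service deduplicates on insertId. A runner that retries a failed bundle re-sends the same rows with the same ids, so no duplicates land; with empty or missing ids the second send would double-count.

```python
# Toy model of insertId-based deduplication (not the real BigQuery API).
def insert_all(table, rows_with_ids):
    """Insert rows keyed by insertId; a row whose id is already known is dropped."""
    for insert_id, row in rows_with_ids:
        table.setdefault(insert_id, row)

table = {}
bundle = [("id-1", {"v": 1}), ("id-2", {"v": 2})]
insert_all(table, bundle)
insert_all(table, bundle)  # runner retries the same bundle after a failure
assert len(table) == 2     # stable insertIds prevent duplicate rows
```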
[jira] [Commented] (BEAM-8240) SDK Harness
[ https://issues.apache.org/jira/browse/BEAM-8240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930682#comment-16930682 ] Chamikara Jayalath commented on BEAM-8240:
--

In the cross-language path, there will be more than one SDK container, so I think the correct solution is to update Dataflow to set a pointer to the container image in the environment payload (of transforms) and determine the set of containers needed within the Dataflow service. For custom containers, we should be updating the environment payload instead of just updating the worker_harness_container_image flag. cc: [~robertwb]
[jira] [Closed] (BEAM-8231) State sampler is dropping stage-step names
[ https://issues.apache.org/jira/browse/BEAM-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath closed BEAM-8231.
--

Fix Version/s: 2.17.0
Resolution: Fixed

> State sampler is dropping stage-step names
> ---
>
> Key: BEAM-8231
> URL: https://issues.apache.org/jira/browse/BEAM-8231
> Project: Beam
> Issue Type: Bug
> Components: runner-dataflow
> Reporter: Chamikara Jayalath
> Assignee: Chamikara Jayalath
> Priority: Minor
> Fix For: 2.17.0
> Time Spent: 50m
> Remaining Estimate: 0h
>
> State sampler is currently dropping stage and step names from the reported current state.
>
> Only affects DataflowRunner.
[jira] [Created] (BEAM-8231) State sampler is dropping stage-step names
Chamikara Jayalath created BEAM-8231:

Summary: State sampler is dropping stage-step names
Key: BEAM-8231
URL: https://issues.apache.org/jira/browse/BEAM-8231
Project: Beam
Issue Type: Bug
Components: runner-dataflow
Reporter: Chamikara Jayalath
Assignee: Chamikara Jayalath

State sampler is currently dropping stage and step names from the reported current state. Only affects DataflowRunner.
[jira] [Commented] (BEAM-8215) Wordcount 1GB Python PKB benchmarks sometimes fail with uninformative error
[ https://issues.apache.org/jira/browse/BEAM-8215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928866#comment-16928866 ] Chamikara Jayalath commented on BEAM-8215:
--

Seems like this is making Python 3 post commits extremely flaky (but, strangely, not Python 2). Some recent examples:
[https://builds.apache.org/job/beam_PerformanceTests_WordCountIT_Py35/479/console]
[https://builds.apache.org/job/beam_PerformanceTests_WordCountIT_Py36/456/console]

> Wordcount 1GB Python PKB benchmarks sometimes fail with uninformative error
> ---
>
> Key: BEAM-8215
> URL: https://issues.apache.org/jira/browse/BEAM-8215
> Project: Beam
> Issue Type: Bug
> Components: testing
> Reporter: Valentyn Tymofieiev
> Assignee: Mark Liu
> Priority: Major
>
> Example:
> https://builds.apache.org/job/beam_PerformanceTests_WordCountIT_Py36/452/console
> {noformat}
> 12:09:27 2019-09-11 19:09:27,655 a47400ce MainThread beam_integration_benchmark(1/1) ERROR Error during benchmark beam_integration_benchmark
> 12:09:27 Traceback (most recent call last):
> 12:09:27   File "/home/jenkins/jenkins-slave/workspace/beam_PerformanceTests_WordCountIT_Py36/PerfKitBenchmarker/perfkitbenchmarker/pkb.py", line 841, in RunBenchmark
> 12:09:27     DoRunPhase(spec, collector, detailed_timer)
> 12:09:27   File "/home/jenkins/jenkins-slave/workspace/beam_PerformanceTests_WordCountIT_Py36/PerfKitBenchmarker/perfkitbenchmarker/pkb.py", line 687, in DoRunPhase
> 12:09:27     samples = spec.BenchmarkRun(spec)
> 12:09:27   File "/home/jenkins/jenkins-slave/workspace/beam_PerformanceTests_WordCountIT_Py36/PerfKitBenchmarker/perfkitbenchmarker/linux_benchmarks/beam_integration_benchmark.py", line 160, in Run
> 12:09:27     job_type=job_type)
> 12:09:27   File "/home/jenkins/jenkins-slave/workspace/beam_PerformanceTests_WordCountIT_Py36/PerfKitBenchmarker/perfkitbenchmarker/providers/gcp/gcp_dpb_dataflow.py", line 91, in SubmitJob
> 12:09:27     assert retcode == 0, "Integration Test Failed."
> 12:09:27 AssertionError: Integration Test Failed.
> {noformat}
> It seems like job submission failed, but there are no details. I talked with [~markflyhigh], and it sounds like we plan to stop using PKB in favor of another framework.
> Assigning to Mark for now to triage, follow up, or reassign as appropriate.
[jira] [Created] (BEAM-8219) crossLanguagePortableWordCount seems to be flaky for beam_PostCommit_Python2
Chamikara Jayalath created BEAM-8219:

Summary: crossLanguagePortableWordCount seems to be flaky for beam_PostCommit_Python2
Key: BEAM-8219
URL: https://issues.apache.org/jira/browse/BEAM-8219
Project: Beam
Issue Type: Bug
Components: test-failures
Reporter: Chamikara Jayalath
Assignee: Chamikara Jayalath

For example,
[https://builds.apache.org/job/beam_PostCommit_Python2/451/console]
[https://builds.apache.org/job/beam_PostCommit_Python2/454/console]

10:37:22 * What went wrong:
10:37:22 Execution failed for task ':sdks:python:test-suites:portable:py2:crossLanguagePortableWordCount'.
10:37:22 > Process 'command 'sh'' finished with non-zero exit value 1

cc: [~heejong] [~mxm]
[jira] [Updated] (BEAM-7611) Python BigTableIO IT is not running in any test suites
[ https://issues.apache.org/jira/browse/BEAM-7611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath updated BEAM-7611:
--

Priority: Critical (was: Blocker)

> Python BigTableIO IT is not running in any test suites
> ---
>
> Key: BEAM-7611
> URL: https://issues.apache.org/jira/browse/BEAM-7611
> Project: Beam
> Issue Type: Bug
> Components: io-py-gcp, testing
> Reporter: Chamikara Jayalath
> Assignee: Solomon Duskis
> Priority: Critical
> Fix For: 2.16.0
>
> We added an integration test here: [https://github.com/apache/beam/pull/7367]
>
> But this currently does not get picked up by any test suites (and gets skipped by some due to missing dependencies), hence the BigTable sink is largely untested.
>
> First attempt to enable it: [https://github.com/apache/beam/pull/8886]
>
> Solomon, assigning to you since I cannot find Juan's (PR author) Jira ID.
[jira] [Commented] (BEAM-7611) Python BigTableIO IT is not running in any test suites
[ https://issues.apache.org/jira/browse/BEAM-7611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928763#comment-16928763 ] Chamikara Jayalath commented on BEAM-7611:
--

Created cherry-pick [https://github.com/apache/beam/pull/9558]
Reducing the severity of this bug from blocker to critical.
[jira] [Updated] (BEAM-7611) Python BigTableIO IT is not running in any test suites
[ https://issues.apache.org/jira/browse/BEAM-7611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath updated BEAM-7611:
--

Fix Version/s: (was: 2.16.0)
[jira] [Commented] (BEAM-7611) Python BigTableIO IT is not running in any test suites
[ https://issues.apache.org/jira/browse/BEAM-7611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928044#comment-16928044 ] Chamikara Jayalath commented on BEAM-7611:
--

I believe [~udim] is trying to run a pipeline with the BigTable sink using the new dependencies in [https://github.com/apache/beam/pull/9491]. If that works, we can take this Jira out of Beam 2.16.0 blocking status, but fixing test coverage for the BigTable sink is still critical IMO.
[jira] [Commented] (BEAM-7611) Python BigTableIO IT is not running in any test suites
[ https://issues.apache.org/jira/browse/BEAM-7611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927627#comment-16927627 ] Chamikara Jayalath commented on BEAM-7611:
--

Also, this is blocking a critical Dataflow dependency upgrade that is needed for Beam 2.16.0.
[jira] [Commented] (BEAM-7611) Python BigTableIO IT is not running in any test suites
[ https://issues.apache.org/jira/browse/BEAM-7611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927628#comment-16927628 ] Chamikara Jayalath commented on BEAM-7611:
--

cc: [~altay] [~udim]
[jira] [Updated] (BEAM-7611) Python BigTableIO IT is not running in any test suites
[ https://issues.apache.org/jira/browse/BEAM-7611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath updated BEAM-7611:
--

Priority: Blocker (was: Critical)
[jira] [Updated] (BEAM-7611) Python BigTableIO IT is not running in any test suites
[ https://issues.apache.org/jira/browse/BEAM-7611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath updated BEAM-7611:
--

Fix Version/s: 2.16.0
[jira] [Commented] (BEAM-7611) Python BigTableIO IT is not running in any test suites
[ https://issues.apache.org/jira/browse/BEAM-7611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16927161#comment-16927161 ] Chamikara Jayalath commented on BEAM-7611:
--

[~sduskis] can you please look into this or assign it to the original author? (I couldn't find the Jira account of the original author, Juan Rael.) Seems like we are lacking test coverage for the Python BigTable sink.
[jira] [Updated] (BEAM-7611) Python BigTableIO IT is not running in any test suites
[ https://issues.apache.org/jira/browse/BEAM-7611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath updated BEAM-7611:
--

Priority: Critical (was: Major)
[jira] [Commented] (BEAM-8155) Add PubsubIO.readMessagesWithCoderAndParseFn method
[ https://issues.apache.org/jira/browse/BEAM-8155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16926870#comment-16926870 ] Chamikara Jayalath commented on BEAM-8155:
--

Closing since the PR was merged. Thanks!

> Add PubsubIO.readMessagesWithCoderAndParseFn method
> ---
>
> Key: BEAM-8155
> URL: https://issues.apache.org/jira/browse/BEAM-8155
> Project: Beam
> Issue Type: Improvement
> Components: io-java-gcp
> Reporter: Claire McGinty
> Assignee: Claire McGinty
> Priority: Major
> Fix For: 2.16.0
> Time Spent: 20m
> Remaining Estimate: 0h
>
> Complements [#8879|https://github.com/apache/beam/pull/8879]; the constructor for a generic-typed PubsubIO.Read is private, so some kind of public constructor is needed to use custom parse functions with generic types.
[jira] [Resolved] (BEAM-8155) Add PubsubIO.readMessagesWithCoderAndParseFn method
[ https://issues.apache.org/jira/browse/BEAM-8155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath resolved BEAM-8155.
--

Resolution: Fixed
[jira] [Created] (BEAM-8138) Fix code snippets in FileIO Java docs
Chamikara Jayalath created BEAM-8138:

Summary: Fix code snippets in FileIO Java docs
Key: BEAM-8138
URL: https://issues.apache.org/jira/browse/BEAM-8138
Project: Beam
Issue Type: Improvement
Components: io-java-avro
Reporter: Chamikara Jayalath

For example, the first two snippets here seem to be incorrect.
[https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/AvroIO.java#L153]
[jira] [Commented] (BEAM-8046) Unable to read from bigquery and publish to pubsub using dataflow runner (python SDK)
[ https://issues.apache.org/jira/browse/BEAM-8046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916973#comment-16916973 ] Chamikara Jayalath commented on BEAM-8046:
--

I think the issue here is that the current PubSub implementation in the Python SDK is a native source/sink implemented in the Dataflow streaming backend.

> Unable to read from bigquery and publish to pubsub using dataflow runner (python SDK)
> ---
>
> Key: BEAM-8046
> URL: https://issues.apache.org/jira/browse/BEAM-8046
> Project: Beam
> Issue Type: Improvement
> Components: io-py-gcp, runner-dataflow
> Affects Versions: 2.13.0, 2.14.0
> Reporter: James Hutchison
> Priority: Major
>
> With the Python SDK:
> The Dataflow runner does not allow reading from BigQuery in streaming pipelines.
> Pub/Sub is not allowed in batch pipelines.
> Thus, there's no way to create a pipeline on the Dataflow runner that reads from BigQuery and publishes to Pub/Sub.
[jira] [Commented] (BEAM-8046) Unable to read from bigquery and publish to pubsub using dataflow runner (python SDK)
[ https://issues.apache.org/jira/browse/BEAM-8046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916899#comment-16916899 ] Chamikara Jayalath commented on BEAM-8046:
--

BigQuery is a bounded source, so we'll need to add an unbounded-read-from-bounded-source wrapper to support it in streaming pipelines. For example, we have [1] for Java. Also, we don't have an unbounded source framework for Python yet; Splittable DoFn is currently in the works to this end.

[1] [https://github.com/apache/beam/blob/master/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/UnboundedReadFromBoundedSource.java]
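The idea behind Java's UnboundedReadFromBoundedSource can be sketched conceptually like this. This is a toy Python sketch under stated assumptions, not Beam's UnboundedSource API: the "checkpoint" here is just a position counter that lets a streaming consumer resume a finite read after a failure.

```python
import itertools

# Toy bounded source: a finite collection of elements.
def bounded_read():
    return [1, 2, 3, 4]

def unbounded_from_bounded(source_fn, checkpoint=0):
    """Expose a bounded read as a resumable stream of (checkpoint, element) pairs."""
    for pos, elem in enumerate(source_fn()):
        if pos >= checkpoint:
            # Yield the next checkpoint alongside the element, so a consumer
            # can record progress and resume from here after a restart.
            yield pos + 1, elem

# Consume two elements, "fail", then resume from the recorded checkpoint.
first = list(itertools.islice(unbounded_from_bounded(bounded_read), 2))
resumed = list(unbounded_from_bounded(bounded_read, checkpoint=first[-1][0]))
assert [e for _, e in first] + [e for _, e in resumed] == [1, 2, 3, 4]
```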
[jira] [Commented] (BEAM-8089) Error while using customGcsTempLocation() with Dataflow
[ https://issues.apache.org/jira/browse/BEAM-8089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916117#comment-16916117 ] Chamikara Jayalath commented on BEAM-8089: -- Have you tried running Dataflow in the same region as where your bucket located using option [1] ? Networks charges should not apply in this case according to [2]. [1] [https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/options/DataflowPipelineOptions.java#L133] [2] [https://cloud.google.com/storage/pricing] > Error while using customGcsTempLocation() with Dataflow > --- > > Key: BEAM-8089 > URL: https://issues.apache.org/jira/browse/BEAM-8089 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.13.0 >Reporter: Harshit Dwivedi >Assignee: Chamikara Jayalath >Priority: Major > > I have the following code snippet which writes content to BigQuery via File > Loads. > Currently the files are being written to a GCS Bucket, but I want to write > them to the local file storage of Dataflow instead and want BigQuery to load > data from there. > > > > {code:java} > BigQueryIO > .writeTableRows() > .withNumFileShards(100) > .withTriggeringFrequency(Duration.standardSeconds(90)) > .withMethod(BigQueryIO.Write.Method.FILE_LOADS) > .withSchema(getSchema()) > .withoutValidation() > .withCustomGcsTempLocation(new ValueProvider() { > @Override > public String get(){ > return "/home/harshit/testFiles"; > } > @Override > public boolean isAccessible(){ > return true; > }}) > .withTimePartitioning(new TimePartitioning().setType("DAY")) > .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED) > .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND) > .to(tableName)); > {code} > > > On running this, I don't see any files being written to the provided path and > the BQ load jobs fail with an IOException. 
> > I looked at the docs, but I was unable to find any working example for this. -- This message was sent by Atlassian Jira (v8.3.2#803003)
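The region option referenced as [1] above is set like any other pipeline option. A minimal sketch of the intended invocation shape (project, bucket, and region names are hypothetical; in a real pipeline these args would be handed to PipelineOptionsFactory, so only the flag shape is shown here):

```java
public class RegionArgsSketch {
    // Hypothetical values; the real flag handling lives in
    // PipelineOptionsFactory / DataflowPipelineOptions.
    static String[] pipelineArgs() {
        return new String[] {
            "--runner=DataflowRunner",
            "--project=my-project",              // hypothetical project id
            "--region=us-central1",              // same region as the temp bucket
            "--tempLocation=gs://my-bucket/tmp"  // bucket created in us-central1
        };
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", pipelineArgs()));
    }
}
```

Keeping the job and the bucket co-located is what avoids the cross-region network charges discussed in [2].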
[jira] [Commented] (BEAM-8089) Error while using customGcsTempLocation() with Dataflow
[ https://issues.apache.org/jira/browse/BEAM-8089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916052#comment-16916052 ] Chamikara Jayalath commented on BEAM-8089: -- BTW, may I ask why you cannot use GCS in this case? Dataflow already needs GCS to run, and storage costs should be minimal. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (BEAM-8089) Error while using customGcsTempLocation() with Dataflow
[ https://issues.apache.org/jira/browse/BEAM-8089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916051#comment-16916051 ] Chamikara Jayalath commented on BEAM-8089: -- I don't think we can fork Beam code for a very specific scenario of the Dataflow runner (a single worker with autoscaling disabled). In general, Dataflow does not fuse the step that writes the files and the step that executes the BQ load job, so these two steps may not execute on the same worker. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (BEAM-8089) Error while using customGcsTempLocation() with Dataflow
[ https://issues.apache.org/jira/browse/BEAM-8089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16916033#comment-16916033 ] Chamikara Jayalath commented on BEAM-8089: -- True, it seems this is supported only in a limited way (wildcards are not supported, for example). I think Beam will have a hard time supporting this, since most Beam runners are distributed and use multiple nodes to write data (to files) in parallel, so there is no "single" local disk. This is why we use a distributed storage location to which all workers can write individual files (a directory in GCS in this case) and execute a single BQ load job for all of those files from there. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (BEAM-8089) Error while using customGcsTempLocation() with Dataflow
[ https://issues.apache.org/jira/browse/BEAM-8089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915992#comment-16915992 ] Chamikara Jayalath commented on BEAM-8089: -- BQ cannot execute load jobs from local files; the files have to be in GCS. So I think this is working as intended. -- This message was sent by Atlassian Jira (v8.3.2#803003)
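Since BQ load jobs can only read files from GCS, a pipeline that accepts a custom temp location could reject local filesystem paths up front rather than failing at load time. This is a hypothetical helper sketch, not part of BigQueryIO:

```java
public class GcsPathCheck {
    // Hypothetical guard: BigQuery load jobs can only read from GCS, so a
    // custom temp location must be a gs:// URI, not a local filesystem path.
    static boolean isValidBqLoadTempLocation(String path) {
        return path != null && path.startsWith("gs://");
    }

    public static void main(String[] args) {
        System.out.println(isValidBqLoadTempLocation("gs://my-bucket/tmp"));    // true
        System.out.println(isValidBqLoadTempLocation("/home/harshit/testFiles")); // false
    }
}
```

Failing fast at pipeline construction would turn the opaque IOException from the load job into an immediate, explainable error.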
[jira] [Commented] (BEAM-8023) Allow specifying BigQuery Storage API readOptions at runtime
[ https://issues.apache.org/jira/browse/BEAM-8023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16913647#comment-16913647 ] Chamikara Jayalath commented on BEAM-8023: -- Wow, that was fast :) Thanks, Ken. I'll take a look. > Allow specifying BigQuery Storage API readOptions at runtime > > > Key: BEAM-8023 > URL: https://issues.apache.org/jira/browse/BEAM-8023 > Project: Beam > Issue Type: Improvement > Components: io-java-gcp >Reporter: Jeff Klukas >Assignee: Kenneth Jung >Priority: Minor > Time Spent: 20m > Remaining Estimate: 0h > > We have support in the Java SDK for using the BigQuery Storage API for reads, > but only the target query or table is supported as a ValueProvider to be > specified at runtime. AFAICT, there is no reason we can't delay specifying > readOptions until runtime as well. > The readOptions are accessed by BigQueryStorageTableSource in getTargetTable; > I believe that's occurring at runtime, but I'd love for someone with deeper > BoundedSource knowledge to confirm that. > I'd advocate for adding new methods > `TypedRead.withSelectedFields(ValueProvider<List<String>> value)` and > `TypedRead.withRowRestriction(ValueProvider<String> value)`. The existing > `withReadOptions` method would then populate the other two as > StaticValueProviders. Perhaps we'd want to deprecate `withReadOptions` in > favor of specifying individual read options as separate parameters. -- This message was sent by Atlassian Jira (v8.3.2#803003)
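To illustrate the deferred-evaluation idea behind the proposal, here is a minimal stand-in for Beam's ValueProvider/StaticValueProvider (the interface and method names below are simplified stand-ins, not the real Beam API): read options become values resolved at runtime rather than fixed at pipeline construction.

```java
import java.util.List;
import java.util.function.Supplier;

public class DeferredReadOptions {
    // Simplified stand-in for org.apache.beam.sdk.options.ValueProvider.
    interface ValueProvider<T> extends Supplier<T> {}

    // Analogous to StaticValueProvider.of(...): wraps a known value so it
    // fits the same deferred interface as runtime-provided values.
    static <T> ValueProvider<T> staticValue(T value) {
        return () -> value;
    }

    public static void main(String[] args) {
        ValueProvider<List<String>> selectedFields = staticValue(List.of("name", "age"));
        ValueProvider<String> rowRestriction = staticValue("age > 18");
        // get() is only called here, i.e. "at runtime" in the pipeline's terms.
        System.out.println(selectedFields.get());
        System.out.println(rowRestriction.get());
    }
}
```

In this shape, `withReadOptions` could keep working by wrapping its fields as static values, while new runtime-provided options plug into the same interface.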
[jira] [Assigned] (BEAM-8023) Allow specifying BigQuery Storage API readOptions at runtime
[ https://issues.apache.org/jira/browse/BEAM-8023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath reassigned BEAM-8023: Assignee: Kenneth Jung -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (BEAM-8023) Allow specifying BigQuery Storage API readOptions at runtime
[ https://issues.apache.org/jira/browse/BEAM-8023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16913488#comment-16913488 ] Chamikara Jayalath commented on BEAM-8023: -- Yeah, I believe this should be possible. [~jeff.klu...@gmail.com], will you be able to send a PR for this? If not, [~aryann] or [~kjung520] may be interested in adding this functionality. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (BEAM-8019) Support cross-language transforms for DataflowRunner
Chamikara Jayalath created BEAM-8019: Summary: Support cross-language transforms for DataflowRunner Key: BEAM-8019 URL: https://issues.apache.org/jira/browse/BEAM-8019 Project: Beam Issue Type: New Feature Components: sdk-py-core Reporter: Chamikara Jayalath Assignee: Chamikara Jayalath This is to capture the Beam changes needed for this task. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Assigned] (BEAM-7978) ArithmeticExceptions on getting backlog bytes
[ https://issues.apache.org/jira/browse/BEAM-7978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath reassigned BEAM-7978: Assignee: Alexey Romanenko > ArithmeticExceptions on getting backlog bytes > -- > > Key: BEAM-7978 > URL: https://issues.apache.org/jira/browse/BEAM-7978 > Project: Beam > Issue Type: Bug > Components: io-java-kinesis >Affects Versions: 2.14.0 >Reporter: Mateusz >Assignee: Alexey Romanenko >Priority: Major > > Hello, > Beam 2.14.0 > (and to be more precise > [commit|https://github.com/apache/beam/commit/9fe0a0387ad1f894ef02acf32edbc9623a3b13ad#diff-b4964a457006b1555c7042c739b405ec]) > introduced a change in watermark calculation in Kinesis IO causing below > error: > {code:java} > exception: "java.lang.RuntimeException: Unknown kinesis failure, when trying > to reach kinesis > at > org.apache.beam.sdk.io.kinesis.SimplifiedKinesisClient.wrapExceptions(SimplifiedKinesisClient.java:227) > at > org.apache.beam.sdk.io.kinesis.SimplifiedKinesisClient.getBacklogBytes(SimplifiedKinesisClient.java:167) > at > org.apache.beam.sdk.io.kinesis.SimplifiedKinesisClient.getBacklogBytes(SimplifiedKinesisClient.java:155) > at > org.apache.beam.sdk.io.kinesis.KinesisReader.getTotalBacklogBytes(KinesisReader.java:158) > at > org.apache.beam.runners.dataflow.worker.StreamingModeExecutionContext.flushState(StreamingModeExecutionContext.java:433) > at > org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1289) > at > org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.access$1000(StreamingDataflowWorker.java:149) > at > org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:1024) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: 
java.lang.ArithmeticException: Value cannot fit in an int: > 153748963401 > at org.joda.time.field.FieldUtils.safeToInt(FieldUtils.java:229) > at > org.joda.time.field.BaseDurationField.getDifference(BaseDurationField.java:141) > at > org.joda.time.base.BaseSingleFieldPeriod.between(BaseSingleFieldPeriod.java:72) > at org.joda.time.Minutes.minutesBetween(Minutes.java:101) > at > org.apache.beam.sdk.io.kinesis.SimplifiedKinesisClient.lambda$getBacklogBytes$3(SimplifiedKinesisClient.java:169) > at > org.apache.beam.sdk.io.kinesis.SimplifiedKinesisClient.wrapExceptions(SimplifiedKinesisClient.java:210) > ... 10 more > {code} > We spotted this issue on the Dataflow runner. It's problematic, as the inability to > get backlog bytes seems to result in constant recreation of the KinesisReader. > The issue happens if the backlog bytes are retrieved before the watermark value > is updated from its initial default value. An easy way to reproduce it is to create > a pipeline with a Kinesis source for a stream where no records are being put. > While debugging it locally, you can observe that the watermark is set to a > value in the past (like: "-290308-12-21T19:59:05.225Z"). After two minutes > (the default watermark idle duration threshold is set to 2 minutes), the > watermark is set to the value of > [watermarkIdleThreshold|https://github.com/apache/beam/blob/9fe0a0387ad1f894ef02acf32edbc9623a3b13ad/sdks/java/io/kinesis/src/main/java/org/apache/beam/sdk/io/kinesis/WatermarkPolicyFactory.java#L110]), > so the next backlog bytes retrieval should be correct. However, as described > before, running the pipeline on the Dataflow runner results in the KinesisReader > being closed just after creation, so the watermark won't be fixed. 
> The reason for the issue is the following: the introduced watermark policies > rely on > [WatermarkParameters|https://github.com/apache/beam/blob/9fe0a0387ad1f894ef02acf32edbc9623a3b13ad/sdks/java/io/kinesis/src/main/java/org/apache/beam/sdk/io/kinesis/WatermarkParameters.java] > which initialises currentWatermark and eventTime to > [BoundedWindow.TIMESTAMP_MIN_VALUE|https://github.com/apache/beam/commit/9fe0a0387ad1f894ef02acf32edbc9623a3b13ad#diff-b4964a457006b1555c7042c739b405ecR52]. > This results in the watermark being set to new Instant(-9223372036854775L) at > KinesisReader creation. The calculated [period between the watermark and the > current > timestamp|https://github.com/apache/beam/blob/9fe0a0387ad1f894ef02acf32edbc9623a3b13ad/sdks/java/io/kinesis/src/main/java/org/apache/beam/sdk/io/kinesis/SimplifiedKinesisClient.java#L169] > is bigger than expected, causing the ArithmeticException to be thrown. > The maximum retention on Kinesis
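The overflow described above can be reproduced without Beam or Joda-Time. This sketch mirrors the failing Minutes.minutesBetween(...) followed by safeToInt(...) pattern using java.time; the watermark sentinel matches the Instant millis quoted in the report, and the "now" timestamp is a fixed stand-in chosen for reproducibility:

```java
import java.time.Duration;
import java.time.Instant;

public class WatermarkOverflowDemo {
    // Minutes between a sentinel minimum watermark and "now"; analogous to the
    // Joda Minutes.minutesBetween(...) call in SimplifiedKinesisClient, but
    // reimplemented with java.time so it compiles without Beam or Joda.
    static long minutesSince(long watermarkMillis, long nowMillis) {
        return Duration.between(Instant.ofEpochMilli(watermarkMillis),
                                Instant.ofEpochMilli(nowMillis)).toMinutes();
    }

    public static void main(String[] args) {
        long watermark = -9223372036854775L;   // the sentinel from the report
        long now = 1_566_000_000_000L;         // a fixed 2019 timestamp (stand-in)
        long minutes = minutesSince(watermark, now);
        System.out.println(minutes > Integer.MAX_VALUE); // true: cannot fit in an int
        try {
            Math.toIntExact(minutes);          // same failure mode as Joda's safeToInt
        } catch (ArithmeticException e) {
            System.out.println("ArithmeticException caught");
        }
    }
}
```

Any fix that keeps the sentinel watermark therefore has to clamp the period (or use a long-based duration) before converting to an int-valued Minutes.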
[jira] [Commented] (BEAM-7978) ArithmeticExceptions on getting backlog bytes
[ https://issues.apache.org/jira/browse/BEAM-7978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911880#comment-16911880 ] Chamikara Jayalath commented on BEAM-7978: -- Thanks Alexey. Assigned the Jira to you.
[jira] [Commented] (BEAM-7883) PubsubIO (Java) write batch size can exceed request payload limit
[ https://issues.apache.org/jira/browse/BEAM-7883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16910649#comment-16910649 ] Chamikara Jayalath commented on BEAM-7883: -- Thanks. The batch sizing done in PubSub IO is just an estimate, and there can be cases where the actual messages go over the 10 MB hard limit. Using 'withMaxBatchBytesSize' is the correct workaround. > PubsubIO (Java) write batch size can exceed request payload limit > - > > Key: BEAM-7883 > URL: https://issues.apache.org/jira/browse/BEAM-7883 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.13.0 >Reporter: Yurii Atamanchuk >Priority: Minor > > In some (probably rare) cases a PubsubIO write (in Batch mode) batch size can > exceed the request payload limit of 10mb. PubsubIO ensures that the batch size is > less than the limit (10mb by default). But then PubsubJsonClient is used, which > converts message payloads into URL-safe Base64 encoding, which can inflate > message size (in my case, for JSON strings, it was up to 25-30%). As a result we > get a 400 response (with a 'Request payload size exceeds the limit: 10485760 > bytes' message), even though the original batch had a correct size. > The obvious workaround is to reduce the batch size > (`PubsubIO.writeMessages().to(...).withMaxBatchBytesSize(... i.e. 5mb ...)`), > but it is a bit annoying. -- This message was sent by Atlassian Jira (v8.3.2#803003)
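The inflation is easy to quantify: Base64 encodes every 3 raw bytes as 4 output bytes, so a batch estimated just under the limit can exceed it once encoded. A self-contained sketch (the 9 MB figure is illustrative, not taken from the report):

```java
import java.util.Base64;

public class Base64InflationDemo {
    // Size of a raw payload after URL-safe Base64 encoding: 4 * ceil(n / 3) bytes.
    static int encodedLength(int rawBytes) {
        return Base64.getUrlEncoder().encode(new byte[rawBytes]).length;
    }

    public static void main(String[] args) {
        int raw = 9 * 1024 * 1024;            // a batch PubsubIO would estimate at ~9 MB
        int onTheWire = encodedLength(raw);   // roughly +33% after encoding
        System.out.println(onTheWire);        // 12582912
        System.out.println(onTheWire > 10 * 1024 * 1024); // true: exceeds the 10 MB limit
    }
}
```

This is why `withMaxBatchBytesSize` needs roughly a 25% headroom below the quota (e.g. ~7.5 MB raw) whenever the JSON transport is in play.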
[jira] [Commented] (BEAM-7978) ArithmeticExceptions on getting backlog bytes
[ https://issues.apache.org/jira/browse/BEAM-7978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907539#comment-16907539 ] Chamikara Jayalath commented on BEAM-7978: -- I guess this can occur for other runners as well? CCing some folks who recently updated Kinesis: [~aromanenko] [~iemejia] [~rtshadow]
[jira] [Commented] (BEAM-7850) Make Environment a top level attribute of PTransform
[ https://issues.apache.org/jira/browse/BEAM-7850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900399#comment-16900399 ] Chamikara Jayalath commented on BEAM-7850: -- Thanks Luke. That makes sense. I assume we should also just replace SdkFunctionSpec's view_fn and window_mapping_fn in SideInput with FunctionSpec? [https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L958] It seems SideInput lives within PTransform payloads (for example, ParDoPayload), so I don't think SideInput will need to have an Environment defined separately (in addition to the PTransform that holds it). > Make Environment a top level attribute of PTransform > > > Key: BEAM-7850 > URL: https://issues.apache.org/jira/browse/BEAM-7850 > Project: Beam > Issue Type: Sub-task > Components: beam-model >Reporter: Chamikara Jayalath >Priority: Major > > Currently, Environment is not a top-level attribute of PTransform (in the > runner API proto). > [https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L99] > Instead, it is hidden inside various payload objects. For example, for ParDo, > the environment is inside the SdkFunctionSpec of ParDoPayload. > [https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L99] > > This makes tracking the environment of different types of PTransforms harder, and > we have to fork code (on the type of PTransform) to extract the Environment > where the PTransform should be executed. It will probably be simpler to just > make Environment a top-level attribute of PTransform. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (BEAM-7866) Python MongoDB IO performance and correctness issues
[ https://issues.apache.org/jira/browse/BEAM-7866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath updated BEAM-7866: - Status: Open (was: Triage Needed) > Python MongoDB IO performance and correctness issues > > > Key: BEAM-7866 > URL: https://issues.apache.org/jira/browse/BEAM-7866 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Eugene Kirpichov >Assignee: Yichi Zhang >Priority: Blocker > Fix For: 2.14.0, 2.15.0 > > > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py > splits the query result by computing the number of results in the constructor, and > then in each reader re-executing the whole query and getting an index > sub-range of those results. > This is broken in several critical ways: > - The order of query results returned by find() is not necessarily > deterministic, so the idea of index ranges on it is meaningless: each shard > may basically get random, possibly overlapping subsets of the total results > - Even if you add order by `_id`, the database may be changing concurrently > to reading and splitting. E.g. if the database contained documents with ids > 10 20 30 40 50, and this was split into shards 0..2 and 3..5 (under the > assumption that these shards would contain respectively 10 20 30, and 40 50), > and then suppose shard 10 20 30 is read and then document 25 is inserted - > then the 3..5 shard will read 30 40 50, i.e. document 30 is duplicated and > document 25 is lost. > - Every shard re-executes the query and skips the first start_offset items, > which in total is quadratic complexity > - The query is first executed in the constructor in order to count results, > which 1) means the constructor can be super slow and 2) it won't work at all > if the database is unavailable at the time the pipeline is constructed (e.g. > if this is a template). 
> Unfortunately, none of these issues are caught by SourceTestUtils: this class > has extensive coverage with it, and the tests pass. This is because the tests > return the same results in the same order. I don't know how to catch this > automatically, and I don't know how to catch the performance issue > automatically, but these would all be important follow-up items after the > actual fix. > CC: [~chamikara] as reviewer. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Updated] (BEAM-7866) Python MongoDB IO performance and correctness issues
[ https://issues.apache.org/jira/browse/BEAM-7866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath updated BEAM-7866: - Fix Version/s: 2.15.0 > Python MongoDB IO performance and correctness issues > > > Key: BEAM-7866 > URL: https://issues.apache.org/jira/browse/BEAM-7866 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Eugene Kirpichov >Assignee: Yichi Zhang >Priority: Blocker > Fix For: 2.14.0, 2.15.0 > > > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py > splits the query result by computing number of results in constructor, and > then in each reader re-executing the whole query and getting an index > sub-range of those results. > This is broken in several critical ways: > - The order of query results returned by find() is not necessarily > deterministic, so the idea of index ranges on it is meaningless: each shard > may basically get random, possibly overlapping subsets of the total results > - Even if you add order by `_id`, the database may be changing concurrently > to reading and splitting. E.g. if the database contained documents with ids > 10 20 30 40 50, and this was split into shards 0..2 and 3..5 (under the > assumption that these shards would contain respectively 10 20 30, and 40 50), > and then suppose shard 10 20 30 is read and then document 25 is inserted - > then the 3..5 shard will read 30 40 50, i.e. document 30 is duplicated and > document 25 is lost. > - Every shard re-executes the query and skips the first start_offset items, > which in total is quadratic complexity > - The query is first executed in the constructor in order to count results, > which 1) means the constructor can be super slow and 2) it won't work at all > if the database is unavailable at the time the pipeline is constructed (e.g. > if this is a template). 
> Unfortunately, none of these issues are caught by SourceTestUtils: this class > has extensive coverage with it, and the tests pass. This is because the tests > return the same results in the same order. I don't know how to catch this > automatically, and I don't know how to catch the performance issue > automatically, but these would all be important follow-up items after the > actual fix. > CC: [~chamikara] as reviewer. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (BEAM-7866) Python MongoDB IO performance and correctness issues
[ https://issues.apache.org/jira/browse/BEAM-7866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897555#comment-16897555 ] Chamikara Jayalath commented on BEAM-7866: -- Thanks Eugene for pointing out these issues. Seems like, in the Java implementation, we build splits with specific IDs. Probably we can do something similar for Python ? [https://github.com/apache/beam/blob/master/sdks/java/io/mongodb/src/main/java/org/apache/beam/sdk/io/mongodb/MongoDbIO.java#L519] BTW is it correct to assume that the source will be correct if that MongoDB instance stays static (without any data insertion/deletion/update) for the duration of the execution of the read step ? (I understand that this might be improbable for many applications of the Database). Seems like elements will have the same order for different find() calls if the database is static since MongoDB defaults to natural sort order. [https://docs.mongodb.com/manual/reference/glossary/#term-natural-order] > Python MongoDB IO performance and correctness issues > > > Key: BEAM-7866 > URL: https://issues.apache.org/jira/browse/BEAM-7866 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Eugene Kirpichov >Assignee: Yichi Zhang >Priority: Blocker > Fix For: 2.14.0 > > > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py > splits the query result by computing number of results in constructor, and > then in each reader re-executing the whole query and getting an index > sub-range of those results. > This is broken in several critical ways: > - The order of query results returned by find() is not necessarily > deterministic, so the idea of index ranges on it is meaningless: each shard > may basically get random, possibly overlapping subsets of the total results > - Even if you add order by `_id`, the database may be changing concurrently > to reading and splitting. E.g. 
if the database contained documents with ids > 10 20 30 40 50, and this was split into shards 0..2 and 3..5 (under the > assumption that these shards would contain respectively 10 20 30, and 40 50), > and then suppose shard 10 20 30 is read and then document 25 is inserted - > then the 3..5 shard will read 30 40 50, i.e. document 30 is duplicated and > document 25 is lost. > - Every shard re-executes the query and skips the first start_offset items, > which in total is quadratic complexity > - The query is first executed in the constructor in order to count results, > which 1) means the constructor can be super slow and 2) it won't work at all > if the database is unavailable at the time the pipeline is constructed (e.g. > if this is a template). > Unfortunately, none of these issues are caught by SourceTestUtils: this class > has extensive coverage with it, and the tests pass. This is because the tests > return the same results in the same order. I don't know how to catch this > automatically, and I don't know how to catch the performance issue > automatically, but these would all be important follow-up items after the > actual fix. > CC: [~chamikara] as reviewer. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
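The Java approach referenced above splits on document IDs rather than result indexes, which avoids both the duplication and the loss described in the example. A rough Python sketch of that idea (illustrative only: `make_id_ranges` and the chosen split points are assumptions, not the actual MongoDbIO implementation, and a real source would pick split points by sampling `_id` values):

```python
def make_id_ranges(split_points):
    """Turn sorted _id split points into contiguous [start, stop) query filters.

    Each document matches exactly one range, so shards cannot overlap
    even if documents are inserted between splitting and reading.
    """
    ranges = []
    for start, stop in zip(split_points[:-1], split_points[1:]):
        ranges.append({'_id': {'$gte': start, '$lt': stop}})
    # Final open-ended range picks up everything >= the last split point.
    ranges.append({'_id': {'$gte': split_points[-1]}})
    return ranges

# Documents with ids 10 20 30 40 50; split points chosen at 10 and 40.
filters = make_id_ranges([10, 40])
assert filters == [
    {'_id': {'$gte': 10, '$lt': 40}},
    {'_id': {'$gte': 40}},
]
# A document with id 25 inserted after splitting falls only inside the
# first range, so it is never duplicated across shards; each shard also
# issues a filtered query instead of skipping start_offset results.
```

Each shard would then pass its filter to `collection.find(filter)`, so no shard re-executes the full query.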
[jira] [Commented] (BEAM-7850) Make Environment a top level attribute of PTransform
[ https://issues.apache.org/jira/browse/BEAM-7850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897400#comment-16897400 ] Chamikara Jayalath commented on BEAM-7850: -- Thanks Max. I understand that this will be a somewhat invasive change. But I think it will simplify implementations quite a bit, so the main motivation is implementation simplicity. Currently, to determine the environment where a given PTransform has to be executed, we have to fork for all possible payload types. Just to clarify, if we make Environment a top level attribute of PTransform, that will be enough for the WindowInto transform as well, right? Looking at the Beam runner API proto, it seems like the only two messages that use SdkFunctionSpec and are not payloads of a certain type of PTransform are WindowingStrategy and SideInput. SideInput already seems to be within transform payload messages (ParDoPayload and WriteFilesPayload). We can keep environment around in SdkFunctionSpec if there are non-transform messages that need it. If we consider the cross-language expansion scenario, what we get from a remote environment is an expanded PTransform (along with the coders, inputs, outputs, and dependencies needed to execute it). So I think it makes sense to associate a PTransform with an Environment directly. > Make Environment a top level attribute of PTransform > > > Key: BEAM-7850 > URL: https://issues.apache.org/jira/browse/BEAM-7850 > Project: Beam > Issue Type: Improvement > Components: beam-model >Reporter: Chamikara Jayalath >Priority: Major > > Currently Environment is not a top level attribute of the PTransform (of > runner API proto). > [https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L99] > Instead it is hidden inside various payload objects. For example, for ParDo, > environment will be inside SdkFunctionSpec of ParDoPayload. 
> [https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L99] > > This makes tracking environment of different types of PTransforms harder and > we have to fork code (on the type of PTransform) to extract the Environment > where the PTransform should be executed. It will probably be simpler to just > make Environment a top level attribute of PTransform. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (BEAM-7850) Make Environment a top level attribute of PTransform
[ https://issues.apache.org/jira/browse/BEAM-7850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896289#comment-16896289 ] Chamikara Jayalath commented on BEAM-7850: -- cc: [~robertwb] [~lcwik] [~mxm] [~angoenka] > Make Environment a top level attribute of PTransform > > > Key: BEAM-7850 > URL: https://issues.apache.org/jira/browse/BEAM-7850 > Project: Beam > Issue Type: Improvement > Components: beam-model >Reporter: Chamikara Jayalath >Priority: Major > > Currently Environment is not a top level attribute of the PTransform (of > runner API proto). > [https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L99] > Instead it is hidden inside various payload objects. For example, for ParDo, > environment will be inside SdkFunctionSpec of ParDoPayload. > [https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L99] > > This makes tracking environment of different types of PTransforms harder and > we have to fork code (on the type of PTransform) to extract the Environment > where the PTransform should be executed. It will probably be simpler to just > make Environment a top level attribute of PTransform. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (BEAM-7850) Make Environment a top level attribute of PTransform
Chamikara Jayalath created BEAM-7850: Summary: Make Environment a top level attribute of PTransform Key: BEAM-7850 URL: https://issues.apache.org/jira/browse/BEAM-7850 Project: Beam Issue Type: Improvement Components: beam-model Reporter: Chamikara Jayalath Currently Environment is not a top level attribute of the PTransform (of runner API proto). [https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L99] Instead it is hidden inside various payload objects. For example, for ParDo, environment will be inside SdkFunctionSpec of ParDoPayload. [https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L99] This makes tracking environment of different types of PTransforms harder and we have to fork code (on the type of PTransform) to extract the Environment where the PTransform should be executed. It will probably be simpler to just make Environment a top level attribute of PTransform. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (BEAM-2572) Implement an S3 filesystem for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16895698#comment-16895698 ] Chamikara Jayalath commented on BEAM-2572: -- [~cagatayk] and [~sacherus] contributions are absolutely welcome. What we need though is an implementation of the FileSystem interface for S3, as I mentioned in a previous comment. This will allow us to use S3 for file-based sources (e.g., text, avro). Seems like your implementation has some of the code that can be used for this but not exactly what we need. Can you implement the proper interface and submit it as a PR (with tests)? > Implement an S3 filesystem for Python SDK > - > > Key: BEAM-2572 > URL: https://issues.apache.org/jira/browse/BEAM-2572 > Project: Beam > Issue Type: Task > Components: sdk-py-core >Reporter: Dmitry Demeshchuk >Assignee: Pablo Estrada >Priority: Minor > Labels: GSoC2019, gsoc, gsoc2019, mentor > > There are two paths worth exploring, to my understanding: > 1. Sticking to the HDFS-based approach (like it's done in Java). > 2. Using boto/boto3 for accessing S3 through its common API endpoints. > I personally prefer the second approach, for a few reasons: > 1. In real life, HDFS and S3 have different consistency guarantees, therefore > their behaviors may contradict each other in some edge cases (say, we write > something to S3, but it's not immediately accessible for reading from another > end). > 2. There are other AWS-based sources and sinks we may want to create in the > future: DynamoDB, Kinesis, SQS, etc. > 3. boto3 already provides somewhat good logic for basic things like > reattempting. > Whatever path we choose, there's another problem related to this: we > currently cannot pass any global settings (say, pipeline options, or just an > arbitrary kwarg) to a filesystem. 
Because of that, we'd have to setup the > runner nodes to have AWS keys set up in the environment, which is not trivial > to achieve and doesn't look too clean either (I'd rather see one single place > for configuring the runner options). > Also, it's worth mentioning that I already have a janky S3 filesystem > implementation that only supports DirectRunner at the moment (because of the > previous paragraph). I'm perfectly fine finishing it myself, with some > guidance from the maintainers. > Where should I move on from here, and whose input should I be looking for? > Thanks! -- This message was sent by Atlassian JIRA (v7.6.14#76016)
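An implementation of Beam's FileSystem interface for S3 starts with path handling (scheme matching, bucket/key parsing) before any boto3 calls come into play. A minimal sketch of that parsing layer; the function name and error handling are illustrative assumptions, not Beam's actual API:

```python
S3_PREFIX = 's3://'

def split_s3_path(path):
    """Split 's3://bucket/key/parts' into a (bucket, key) pair.

    An S3-backed FileSystem needs this for nearly every operation
    (match, open, create, delete), since boto3 takes bucket and key
    as separate arguments.
    """
    if not path.startswith(S3_PREFIX):
        raise ValueError('Not an S3 path: %r' % path)
    bucket, _, key = path[len(S3_PREFIX):].partition('/')
    if not bucket:
        raise ValueError('Missing bucket in S3 path: %r' % path)
    return bucket, key

assert split_s3_path('s3://my-bucket/data/part-0.avro') == ('my-bucket', 'data/part-0.avro')
assert split_s3_path('s3://my-bucket') == ('my-bucket', '')
```

The remaining work is implementing the abstract FileSystem operations (match, create, open, rename, delete, ...) on top of a boto3 client, which is where the credential-passing problem mentioned above shows up.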
[jira] [Created] (BEAM-7824) Set a default environment for Python SDK jobs for Dataflow runner
Chamikara Jayalath created BEAM-7824: Summary: Set a default environment for Python SDK jobs for Dataflow runner Key: BEAM-7824 URL: https://issues.apache.org/jira/browse/BEAM-7824 Project: Beam Issue Type: Bug Components: runner-dataflow, sdk-py-core Reporter: Chamikara Jayalath Assignee: Chamikara Jayalath Currently the default environment is left empty. We should change the default environment to urn beam:env:docker:v1, with the payload set to a DockerPayload whose container_image is the container image used by Dataflow. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (BEAM-7680) synthetic_pipeline_test.py flaky
[ https://issues.apache.org/jira/browse/BEAM-7680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883424#comment-16883424 ] Chamikara Jayalath commented on BEAM-7680: -- Sorry, this is from code I wrote several years back and I have lost context. I think the comment was about the test being time-based in general (and hence there's a possibility of flakes). > synthetic_pipeline_test.py flaky > > > Key: BEAM-7680 > URL: https://issues.apache.org/jira/browse/BEAM-7680 > Project: Beam > Issue Type: Bug > Components: test-failures >Reporter: Udi Meiri >Assignee: Kasia Kucharczyk >Priority: Major > Time Spent: 50m > Remaining Estimate: 0h > > {code:java} > 11:51:43 FAIL: testSyntheticSDFStep > (apache_beam.testing.synthetic_pipeline_test.SyntheticPipelineTest) > 11:51:43 > -- > 11:51:43 Traceback (most recent call last): > 11:51:43 File > "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Cron/src/sdks/python/test-suites/tox/py35/build/srcs/sdks/python/apache_beam/testing/synthetic_pipeline_test.py", > line 82, in testSyntheticSDFStep > 11:51:43 self.assertTrue(0.5 <= elapsed <= 3, elapsed) > 11:51:43 AssertionError: False is not true : 3.659700632095337{code} > [https://builds.apache.org/job/beam_PreCommit_Python_Cron/1502/consoleFull] > > Two flaky TODOs: > [https://github.com/apache/beam/blob/b79f24ced1c8519c29443ea7109c59ad18be2ebe/sdks/python/apache_beam/testing/synthetic_pipeline_test.py#L69-L82] -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (BEAM-7424) Retry HTTP 429 errors from GCS w/ exponential backoff when reading data
[ https://issues.apache.org/jira/browse/BEAM-7424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16876506#comment-16876506 ] Chamikara Jayalath commented on BEAM-7424: -- Please make sure to cherry-pick [https://github.com/apache/beam/pull/8933] to the 2.14 release branch. > Retry HTTP 429 errors from GCS w/ exponential backoff when reading data > --- > > Key: BEAM-7424 > URL: https://issues.apache.org/jira/browse/BEAM-7424 > Project: Beam > Issue Type: Bug > Components: io-java-gcp, io-python-gcp, sdk-py-core >Reporter: Chamikara Jayalath >Assignee: Heejong Lee >Priority: Blocker > Fix For: 2.14.0 > > Time Spent: 3.5h > Remaining Estimate: 0h > > This has to be done for both Java and Python SDKs. > Seems like Java SDK already retries 429 errors w/o backoff (please verify): > [https://github.com/apache/beam/blob/master/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/util/RetryHttpRequestInitializer.java#L185] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
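The requested retry-with-exponential-backoff behavior can be sketched as a small wrapper. This illustrates the pattern only; it is not Beam's RetryHttpRequestInitializer or its Python equivalent, and the `TooManyRequests` class and jitterless delays are simplifying assumptions (production retry logic typically adds randomized jitter and a delay cap):

```python
import time

class TooManyRequests(Exception):
    """Stand-in for an HTTP 429 response from GCS."""

def retry_429(fn, max_attempts=5, base_delay=0.01):
    """Call fn, retrying 429s with exponential backoff: d, 2d, 4d, ..."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TooManyRequests:
            if attempt == max_attempts - 1:
                raise  # budget exhausted; surface the throttling error
            time.sleep(base_delay * (2 ** attempt))

# Simulate a server that throttles the first two calls.
calls = {'n': 0}
def flaky_read():
    calls['n'] += 1
    if calls['n'] < 3:
        raise TooManyRequests()
    return 'data'

assert retry_429(flaky_read) == 'data'
assert calls['n'] == 3  # two throttled attempts, then success
```

The backoff (rather than immediate retry, as the Java link above suggests happens today) is what keeps a throttled pipeline from hammering GCS harder.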
[jira] [Commented] (BEAM-7631) Remove experimental annotation from stable transforms in Java BigTableIO connector
[ https://issues.apache.org/jira/browse/BEAM-7631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872586#comment-16872586 ] Chamikara Jayalath commented on BEAM-7631: -- cc: [~sduskis] [~igorbernstein] [~altay] [~dhalp...@google.com] [~iemejia] > Remove experimental annotation from stable transforms in Java BigTableIO > connector > -- > > Key: BEAM-7631 > URL: https://issues.apache.org/jira/browse/BEAM-7631 > Project: Beam > Issue Type: Improvement > Components: io-java-gcp >Reporter: Chamikara Jayalath >Assignee: Chamikara Jayalath >Priority: Major > > This connector has been around for some time and many Beam users use it. > We can consider removing experimental tag from following transforms. > BigTableIO.Read > BigTableIO.Write > Removing the experimental tag will guarantee our users that the existing API > will stay stable. > > Note that this does not prevent adding new transforms to the API or adding > new features to existing transforms. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (BEAM-7631) Remove experimental annotation from stable transforms in Java BigTableIO connector
[ https://issues.apache.org/jira/browse/BEAM-7631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath reassigned BEAM-7631: Assignee: Chamikara Jayalath > Remove experimental annotation from stable transforms in Java BigTableIO > connector > -- > > Key: BEAM-7631 > URL: https://issues.apache.org/jira/browse/BEAM-7631 > Project: Beam > Issue Type: Improvement > Components: io-java-gcp >Reporter: Chamikara Jayalath >Assignee: Chamikara Jayalath >Priority: Major > > This connector has been around for some time and many Beam users use it. > We can consider removing experimental tag from following transforms. > BigTableIO.Read > BigTableIO.Write > Removing the experimental tag will guarantee our users that the existing API > will stay stable. > > Note that this does not prevent adding new transforms to the API or adding > new features to existing transforms. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (BEAM-7631) Remove experimental annotation from stable transforms in Java BigTableIO connector
Chamikara Jayalath created BEAM-7631: Summary: Remove experimental annotation from stable transforms in Java BigTableIO connector Key: BEAM-7631 URL: https://issues.apache.org/jira/browse/BEAM-7631 Project: Beam Issue Type: Improvement Components: io-java-gcp Reporter: Chamikara Jayalath This connector has been around for some time and many Beam users use it. We can consider removing experimental tag from following transforms. BigTableIO.Read BigTableIO.Write Removing the experimental tag will guarantee our users that the existing API will stay stable. Note that this does not prevent adding new transforms to the API or adding new features to existing transforms. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (BEAM-7611) Python BigTableIO IT is not running in any test suites
Chamikara Jayalath created BEAM-7611: Summary: Python BigTableIO IT is not running in any test suites Key: BEAM-7611 URL: https://issues.apache.org/jira/browse/BEAM-7611 Project: Beam Issue Type: Bug Components: io-python-gcp, testing Reporter: Chamikara Jayalath Assignee: Solomon Duskis We added an integration test here: [https://github.com/apache/beam/pull/7367] But this currently does not get picked up by any test suites (and gets skipped by some due to missing dependencies), hence the BigTable sink is largely untested. First attempt to enable it: [https://github.com/apache/beam/pull/8886] Solomon, assigning this to you since I cannot find Juan's (the PR author's) Jira ID. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-6590) LTS backport: upgrade gcsio dependency to 1.9.13
[ https://issues.apache.org/jira/browse/BEAM-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16868079#comment-16868079 ] Chamikara Jayalath commented on BEAM-6590: -- Sorry about the delay. I'll look into 2.7.1 blockers assigned to me early next week if that works. > LTS backport: upgrade gcsio dependency to 1.9.13 > > > Key: BEAM-6590 > URL: https://issues.apache.org/jira/browse/BEAM-6590 > Project: Beam > Issue Type: Bug > Components: io-java-gcp, sdk-java-core >Reporter: Chamikara Jayalath >Assignee: Chamikara Jayalath >Priority: Blocker > Fix For: 2.7.1 > > > 1.9.12 has following critical bug so should be avoided. > > [https://github.com/GoogleCloudPlatform/bigdata-interop/commit/52f5055d37f20a04303b146e9063e7ccc876ec17] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-7501) concatenated compressed files bug with python sdk (2.7.1 merge)
[ https://issues.apache.org/jira/browse/BEAM-7501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16868078#comment-16868078 ] Chamikara Jayalath commented on BEAM-7501: -- Sorry about the delay. I'll look into this early next week if that works. > concatenated compressed files bug with python sdk (2.7.1 merge) > --- > > Key: BEAM-7501 > URL: https://issues.apache.org/jira/browse/BEAM-7501 > Project: Beam > Issue Type: Bug > Components: io-python-files >Reporter: Chamikara Jayalath >Assignee: Chamikara Jayalath >Priority: Blocker > Fix For: 2.7.1 > > > Same as https://issues.apache.org/jira/browse/BEAM-6952. > > For tracking merging the fix to 2.7.1 (LTS) branch. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (BEAM-7569) filesystem_test is failing for Windows
[ https://issues.apache.org/jira/browse/BEAM-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath resolved BEAM-7569. -- Resolution: Fixed Fix Version/s: Not applicable > filesystem_test is failing for Windows > -- > > Key: BEAM-7569 > URL: https://issues.apache.org/jira/browse/BEAM-7569 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Chamikara Jayalath >Assignee: Chamikara Jayalath >Priority: Major > Fix For: Not applicable > > Time Spent: 1h 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-7424) Retry HTTP 429 errors from GCS w/ exponential backoff when reading data
[ https://issues.apache.org/jira/browse/BEAM-7424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16867126#comment-16867126 ] Chamikara Jayalath commented on BEAM-7424: -- I believe Python fix is still in development. > Retry HTTP 429 errors from GCS w/ exponential backoff when reading data > --- > > Key: BEAM-7424 > URL: https://issues.apache.org/jira/browse/BEAM-7424 > Project: Beam > Issue Type: Bug > Components: io-java-gcp, io-python-gcp, sdk-py-core >Reporter: Chamikara Jayalath >Assignee: Heejong Lee >Priority: Blocker > Fix For: 2.14.0 > > > This has to be done for both Java and Python SDKs. > Seems like Java SDK already retries 429 errors w/o backoff (please verify): > [https://github.com/apache/beam/blob/master/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/util/RetryHttpRequestInitializer.java#L185] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (BEAM-7588) Python BigQuery sink should be able to handle 15TB load job quota
Chamikara Jayalath created BEAM-7588: Summary: Python BigQuery sink should be able to handle 15TB load job quota Key: BEAM-7588 URL: https://issues.apache.org/jira/browse/BEAM-7588 Project: Beam Issue Type: Improvement Components: io-python-gcp Reporter: Chamikara Jayalath Assignee: Pablo Estrada This can be done by using multiple load jobs under 15TB when the amount of data to be loaded is > 15TB. This is already handled by Java SDK. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
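Staying under the quota reduces to partitioning the staged files by cumulative size and issuing one load job per group. A greedy sketch of that partitioning step; the function, the `(name, size)` tuples, and treating 15 TB as a single-job byte limit are illustrative assumptions, not the actual Java or Python implementation:

```python
MAX_BYTES_PER_JOB = 15 * (1 << 40)  # 15 TB load-job quota (illustrative constant)

def partition_files(files, max_bytes=MAX_BYTES_PER_JOB):
    """Greedily group (name, size) files so each group stays under max_bytes.

    Each group becomes one BigQuery load job; the per-job results are
    then loaded into the final destination table.
    """
    groups, current, current_size = [], [], 0
    for name, size in files:
        if current and current_size + size > max_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)
    return groups

# Three 4-unit files against a 10-unit limit -> two load jobs.
assert partition_files([('a', 4), ('b', 4), ('c', 4)], max_bytes=10) == [['a', 'b'], ['c']]
```

A single file larger than `max_bytes` would still get its own group here and fail the quota; a real implementation has to reject or re-shard such files.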
[jira] [Created] (BEAM-7569) filesystem_test is failing for Windows
Chamikara Jayalath created BEAM-7569: Summary: filesystem_test is failing for Windows Key: BEAM-7569 URL: https://issues.apache.org/jira/browse/BEAM-7569 Project: Beam Issue Type: Bug Components: sdk-py-core Reporter: Chamikara Jayalath Assignee: Chamikara Jayalath -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (BEAM-7549) Provide a better error message for 410 Gone errors from GCS
Chamikara Jayalath created BEAM-7549: Summary: Provide a better error message for 410 Gone errors from GCS Key: BEAM-7549 URL: https://issues.apache.org/jira/browse/BEAM-7549 Project: Beam Issue Type: Bug Components: io-java-gcp, io-python-gcp Reporter: Chamikara Jayalath For example 410 errors might get raised here. com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.waitForCompletionAndThrowIfUploadFailed(AbstractGoogleAsyncWriteChannel.java:431) at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.close(AbstractGoogleAsyncWriteChannel.java:289) at com.google.cloud.dataflow.sdk.io.FileBasedSink$FileBasedWriter.close(FileBasedSink.java:571) at com.google.cloud.dataflow.sdk.io.FileBasedSink$FileBasedWriter.close(FileBasedSink.java:474) at com.google.cloud.dataflow.sdk.io.Write$Bound$WriteBundles.finishBundle(Write.java:202) Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 410 Gone Example client side error message that gives a better description to the end-user: "GCS upload of processed data transiently failed and will need to be retried. Some produced data might need to be recomputed" -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (BEAM-6952) concatenated compressed files bug with python sdk
[ https://issues.apache.org/jira/browse/BEAM-6952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath updated BEAM-6952: - Affects Version/s: (was: 2.13.0) Not applicable > concatenated compressed files bug with python sdk > - > > Key: BEAM-6952 > URL: https://issues.apache.org/jira/browse/BEAM-6952 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Affects Versions: Not applicable >Reporter: Daniel Lescohier >Priority: Major > Fix For: 2.14.0 > > Time Spent: 3h 20m > Remaining Estimate: 0h > > The Python apache_beam.io.filesystem module has a bug handling concatenated > compressed files. > The PR I will create has two commits: > # a new unit test that shows the problem > # a fix to the problem. > The unit test is added to the apache_beam.io.filesystem_test module. It was > added to this module because the test: > apache_beam.io.textio_test.test_read_gzip_concat does not encounter the > problem in the Beam 2.11 and earlier code base because the test data is too > small: the data is smaller than read_size, so it goes through logic in the > code that avoids the problem in the code. So, this test sets read_size > smaller and test data bigger, in order to encounter the problem. It would be > difficult to test in the textio_test module, because you'd need very large > test data because default read_size is 16MiB, and the ReadFromText interface > does not allow you to modify the read_size. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
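The underlying issue with concatenated compressed files is that a gzip stream may contain several back-to-back members, and a decompressor must keep consuming the bytes left over after each member ends instead of stopping at the first one. A minimal sketch of that loop using zlib's `unused_data`; this is independent of Beam's filesystem module and much simpler than it (no read_size buffering):

```python
import gzip
import zlib

def decompress_concatenated_gzip(data):
    """Decompress every member of a concatenated gzip byte string."""
    out = []
    while data:
        # 32 + MAX_WBITS: auto-detect the gzip header.
        d = zlib.decompressobj(zlib.MAX_WBITS | 32)
        out.append(d.decompress(data))
        if not d.eof:  # truncated final member; stop rather than loop forever
            break
        data = d.unused_data  # bytes belonging to the next member, if any
    return b''.join(out)

blob = gzip.compress(b'hello ') + gzip.compress(b'world')
assert decompress_concatenated_gzip(blob) == b'hello world'
```

A naive single-pass decompressor would return only `b'hello '` here, which is the shape of bug the new filesystem_test unit test exercises with data larger than read_size.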
[jira] [Updated] (BEAM-6952) concatenated compressed files bug with python sdk
[ https://issues.apache.org/jira/browse/BEAM-6952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath updated BEAM-6952: - Affects Version/s: (was: 2.11.0) 2.13.0 > concatenated compressed files bug with python sdk > - > > Key: BEAM-6952 > URL: https://issues.apache.org/jira/browse/BEAM-6952 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Affects Versions: 2.13.0 >Reporter: Daniel Lescohier >Priority: Major > Fix For: 2.14.0 > > Time Spent: 3h 20m > Remaining Estimate: 0h > > The Python apache_beam.io.filesystem module has a bug handling concatenated > compressed files. > The PR I will create has two commits: > # a new unit test that shows the problem > # a fix to the problem. > The unit test is added to the apache_beam.io.filesystem_test module. It was > added to this module because the test: > apache_beam.io.textio_test.test_read_gzip_concat does not encounter the > problem in the Beam 2.11 and earlier code base because the test data is too > small: the data is smaller than read_size, so it goes through logic in the > code that avoids the problem in the code. So, this test sets read_size > smaller and test data bigger, in order to encounter the problem. It would be > difficult to test in the textio_test module, because you'd need very large > test data because default read_size is 16MiB, and the ReadFromText interface > does not allow you to modify the read_size. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (BEAM-6952) concatenated compressed files bug with python sdk
[ https://issues.apache.org/jira/browse/BEAM-6952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath resolved BEAM-6952. -- Resolution: Fixed > concatenated compressed files bug with python sdk > - > > Key: BEAM-6952 > URL: https://issues.apache.org/jira/browse/BEAM-6952 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Affects Versions: 2.11.0 >Reporter: Daniel Lescohier >Priority: Major > Fix For: 2.14.0 > > Time Spent: 3h 20m > Remaining Estimate: 0h > > The Python apache_beam.io.filesystem module has a bug handling concatenated > compressed files. > The PR I will create has two commits: > # a new unit test that shows the problem > # a fix to the problem. > The unit test is added to the apache_beam.io.filesystem_test module. It was > added to this module because the test: > apache_beam.io.textio_test.test_read_gzip_concat does not encounter the > problem in the Beam 2.11 and earlier code base because the test data is too > small: the data is smaller than read_size, so it goes through logic in the > code that avoids the problem in the code. So, this test sets read_size > smaller and test data bigger, in order to encounter the problem. It would be > difficult to test in the textio_test module, because you'd need very large > test data because default read_size is 16MiB, and the ReadFromText interface > does not allow you to modify the read_size. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-6952) concatenated compressed files bug with python sdk
[ https://issues.apache.org/jira/browse/BEAM-6952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860455#comment-16860455 ] Chamikara Jayalath commented on BEAM-6952: -- Yeah. closing. > concatenated compressed files bug with python sdk > - > > Key: BEAM-6952 > URL: https://issues.apache.org/jira/browse/BEAM-6952 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Affects Versions: 2.11.0 >Reporter: Daniel Lescohier >Priority: Major > Fix For: 2.14.0 > > Time Spent: 3h 20m > Remaining Estimate: 0h > > The Python apache_beam.io.filesystem module has a bug handling concatenated > compressed files. > The PR I will create has two commits: > # a new unit test that shows the problem > # a fix to the problem. > The unit test is added to the apache_beam.io.filesystem_test module. It was > added to this module because the test: > apache_beam.io.textio_test.test_read_gzip_concat does not encounter the > problem in the Beam 2.11 and earlier code base because the test data is too > small: the data is smaller than read_size, so it goes through logic in the > code that avoids the problem in the code. So, this test sets read_size > smaller and test data bigger, in order to encounter the problem. It would be > difficult to test in the textio_test module, because you'd need very large > test data because default read_size is 16MiB, and the ReadFromText interface > does not allow you to modify the read_size. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-7424) Retry HTTP 429 errors from GCS w/ exponential backoff when reading data
[ https://issues.apache.org/jira/browse/BEAM-7424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860421#comment-16860421 ] Chamikara Jayalath commented on BEAM-7424: -- There are a few things planned: (1) add retry logic for the Java and Python SDKs; (2) for the Dataflow runner, plumb throttled time to the Dataflow backend so it can be considered when making autoscaling decisions. This bug is for (1). It would be great to get (2) into 2.14 as well, but I'm not sure the timelines will match. > Retry HTTP 429 errors from GCS w/ exponential backoff when reading data > --- > > Key: BEAM-7424 > URL: https://issues.apache.org/jira/browse/BEAM-7424 > Project: Beam > Issue Type: Bug > Components: io-java-gcp, io-python-gcp, sdk-py-core >Reporter: Chamikara Jayalath >Assignee: Heejong Lee >Priority: Blocker > Fix For: 2.14.0 > > > This has to be done for both Java and Python SDKs. > Seems like Java SDK already retries 429 errors w/o backoff (please verify): > [https://github.com/apache/beam/blob/master/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/util/RetryHttpRequestInitializer.java#L185] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
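The retry logic requested for (1) amounts to capped exponential backoff with jitter. The sketch below is an illustration only, under stated assumptions: `Throttled429Error` is a hypothetical stand-in for the client library's HTTP-error type, and the actual change lives inside the SDKs' GCS client wrappers, not in user code:

```python
import random
import time


class Throttled429Error(Exception):
    """Hypothetical stand-in for an HTTP client error with status code 429."""


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=60.0):
    """Call fn(), retrying on 429 with capped exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Throttled429Error:
            if attempt == max_attempts - 1:
                raise  # out of retries; surface the throttling error
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Randomized jitter avoids synchronized retry storms from
            # many workers hitting the same throttled bucket.
            time.sleep(delay * random.uniform(0.5, 1.5))
```

Retrying without backoff (what the Java SDK reportedly did) keeps hammering a throttled service; the exponential delay gives GCS time to shed load.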
[jira] [Assigned] (BEAM-3788) Implement a Kafka IO for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath reassigned BEAM-3788: Assignee: (was: Chamikara Jayalath) > Implement a Kafka IO for Python SDK > --- > > Key: BEAM-3788 > URL: https://issues.apache.org/jira/browse/BEAM-3788 > Project: Beam > Issue Type: New Feature > Components: sdk-py-core >Reporter: Chamikara Jayalath >Priority: Major > > This will be implemented using the Splittable DoFn framework. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-3788) Implement a Kafka IO for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16860419#comment-16860419 ] Chamikara Jayalath commented on BEAM-3788: -- This is blocked until we have a streaming source framework in the Python SDK for portable runners. To this end, Splittable DoFn is currently under development. Note that, for folks using the Flink runner, Flink's native Kafka source is currently available to Python SDK users through the cross-language transform API. > Implement a Kafka IO for Python SDK > --- > > Key: BEAM-3788 > URL: https://issues.apache.org/jira/browse/BEAM-3788 > Project: Beam > Issue Type: New Feature > Components: sdk-py-core >Reporter: Chamikara Jayalath >Assignee: Chamikara Jayalath >Priority: Major > > This will be implemented using the Splittable DoFn framework. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (BEAM-7501) concatenated compressed files bug with python sdk (2.7.1 merge)
Chamikara Jayalath created BEAM-7501: Summary: concatenated compressed files bug with python sdk (2.7.1 merge) Key: BEAM-7501 URL: https://issues.apache.org/jira/browse/BEAM-7501 Project: Beam Issue Type: Bug Components: io-python-files Reporter: Chamikara Jayalath Assignee: Chamikara Jayalath Fix For: 2.7.1 Same as https://issues.apache.org/jira/browse/BEAM-6952. For tracking merging the fix to 2.7.1 (LTS) branch. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-6952) concatenated compressed files bug with python sdk
[ https://issues.apache.org/jira/browse/BEAM-6952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16858203#comment-16858203 ] Chamikara Jayalath commented on BEAM-6952: -- Created https://issues.apache.org/jira/browse/BEAM-7501 to track merging the fix to 2.7.1 (LTS) branch. > concatenated compressed files bug with python sdk > - > > Key: BEAM-6952 > URL: https://issues.apache.org/jira/browse/BEAM-6952 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Affects Versions: 2.11.0 >Reporter: Daniel Lescohier >Priority: Major > Fix For: 2.14.0 > > Time Spent: 2.5h > Remaining Estimate: 0h > > The Python apache_beam.io.filesystem module has a bug handling concatenated > compressed files. > The PR I will create has two commits: > # a new unit test that shows the problem > # a fix to the problem. > The unit test is added to the apache_beam.io.filesystem_test module. It was > added to this module because the test: > apache_beam.io.textio_test.test_read_gzip_concat does not encounter the > problem in the Beam 2.11 and earlier code base because the test data is too > small: the data is smaller than read_size, so it goes through logic in the > code that avoids the problem in the code. So, this test sets read_size > smaller and test data bigger, in order to encounter the problem. It would be > difficult to test in the textio_test module, because you'd need very large > test data because default read_size is 16MiB, and the ReadFromText interface > does not allow you to modify the read_size. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (BEAM-6952) concatenated compressed files bug with python sdk
[ https://issues.apache.org/jira/browse/BEAM-6952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath updated BEAM-6952: - Fix Version/s: 2.14.0 > concatenated compressed files bug with python sdk > - > > Key: BEAM-6952 > URL: https://issues.apache.org/jira/browse/BEAM-6952 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Affects Versions: 2.11.0 >Reporter: Daniel Lescohier >Priority: Major > Fix For: 2.14.0 > > Time Spent: 2.5h > Remaining Estimate: 0h > > The Python apache_beam.io.filesystem module has a bug handling concatenated > compressed files. > The PR I will create has two commits: > # a new unit test that shows the problem > # a fix to the problem. > The unit test is added to the apache_beam.io.filesystem_test module. It was > added to this module because the test: > apache_beam.io.textio_test.test_read_gzip_concat does not encounter the > problem in the Beam 2.11 and earlier code base because the test data is too > small: the data is smaller than read_size, so it goes through logic in the > code that avoids the problem in the code. So, this test sets read_size > smaller and test data bigger, in order to encounter the problem. It would be > difficult to test in the textio_test module, because you'd need very large > test data because default read_size is 16MiB, and the ReadFromText interface > does not allow you to modify the read_size. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (BEAM-7424) Retry HTTP 429 errors from GCS w/ exponential backoff when reading data
[ https://issues.apache.org/jira/browse/BEAM-7424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath updated BEAM-7424: - Fix Version/s: (was: 2.7.1) > Retry HTTP 429 errors from GCS w/ exponential backoff when reading data > --- > > Key: BEAM-7424 > URL: https://issues.apache.org/jira/browse/BEAM-7424 > Project: Beam > Issue Type: Bug > Components: io-java-gcp, io-python-gcp, sdk-py-core >Reporter: Chamikara Jayalath >Assignee: Heejong Lee >Priority: Blocker > Fix For: 2.14.0 > > > This has to be done for both Java and Python SDKs. > Seems like Java SDK already retries 429 errors w/o backoff (please verify): > [https://github.com/apache/beam/blob/master/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/util/RetryHttpRequestInitializer.java#L185] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (BEAM-7500) Retry HTTP 429 errors from GCS w/ exponential backoff when reading data (2.7.1 merge)
Chamikara Jayalath created BEAM-7500: Summary: Retry HTTP 429 errors from GCS w/ exponential backoff when reading data (2.7.1 merge) Key: BEAM-7500 URL: https://issues.apache.org/jira/browse/BEAM-7500 Project: Beam Issue Type: Bug Components: io-java-gcp, io-python-gcp Reporter: Chamikara Jayalath Assignee: Heejong Lee Same as https://issues.apache.org/jira/browse/BEAM-7424 For tracking merging the fix to 2.7.1 (LTS branch). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-7424) Retry HTTP 429 errors from GCS w/ exponential backoff when reading data
[ https://issues.apache.org/jira/browse/BEAM-7424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16858200#comment-16858200 ] Chamikara Jayalath commented on BEAM-7424: -- Created https://issues.apache.org/jira/browse/BEAM-7500 to track merging the fix to 2.7.1 branch. > Retry HTTP 429 errors from GCS w/ exponential backoff when reading data > --- > > Key: BEAM-7424 > URL: https://issues.apache.org/jira/browse/BEAM-7424 > Project: Beam > Issue Type: Bug > Components: io-java-gcp, io-python-gcp, sdk-py-core >Reporter: Chamikara Jayalath >Assignee: Heejong Lee >Priority: Blocker > Fix For: 2.7.1, 2.14.0 > > > This has to be done for both Java and Python SDKs. > Seems like Java SDK already retries 429 errors w/o backoff (please verify): > [https://github.com/apache/beam/blob/master/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/util/RetryHttpRequestInitializer.java#L185] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (BEAM-7500) Retry HTTP 429 errors from GCS w/ exponential backoff when reading data (2.7.1 merge)
[ https://issues.apache.org/jira/browse/BEAM-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath updated BEAM-7500: - Fix Version/s: 2.7.1 > Retry HTTP 429 errors from GCS w/ exponential backoff when reading data > (2.7.1 merge) > - > > Key: BEAM-7500 > URL: https://issues.apache.org/jira/browse/BEAM-7500 > Project: Beam > Issue Type: Bug > Components: io-java-gcp, io-python-gcp >Reporter: Chamikara Jayalath >Assignee: Heejong Lee >Priority: Blocker > Fix For: 2.7.1 > > > Same as https://issues.apache.org/jira/browse/BEAM-7424 > > For tracking merging the fix to 2.7.1 (LTS branch). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (BEAM-7424) Retry HTTP 429 errors from GCS w/ exponential backoff when reading data
[ https://issues.apache.org/jira/browse/BEAM-7424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath updated BEAM-7424: - Fix Version/s: 2.14.0 2.7.1 > Retry HTTP 429 errors from GCS w/ exponential backoff when reading data > --- > > Key: BEAM-7424 > URL: https://issues.apache.org/jira/browse/BEAM-7424 > Project: Beam > Issue Type: Bug > Components: io-java-gcp, io-python-gcp, sdk-py-core >Reporter: Chamikara Jayalath >Assignee: Heejong Lee >Priority: Blocker > Fix For: 2.7.1, 2.14.0 > > > This has to be done for both Java and Python SDKs. > Seems like Java SDK already retries 429 errors w/o backoff (please verify): > [https://github.com/apache/beam/blob/master/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/util/RetryHttpRequestInitializer.java#L185] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (BEAM-7424) Retry HTTP 429 errors from GCS w/ exponential backoff when reading data
[ https://issues.apache.org/jira/browse/BEAM-7424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath updated BEAM-7424: - Priority: Blocker (was: Major) > Retry HTTP 429 errors from GCS w/ exponential backoff when reading data > --- > > Key: BEAM-7424 > URL: https://issues.apache.org/jira/browse/BEAM-7424 > Project: Beam > Issue Type: Bug > Components: io-java-gcp, io-python-gcp, sdk-py-core >Reporter: Chamikara Jayalath >Assignee: Heejong Lee >Priority: Blocker > > This has to be done for both Java and Python SDKs. > Seems like Java SDK already retries 429 errors w/o backoff (please verify): > [https://github.com/apache/beam/blob/master/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/util/RetryHttpRequestInitializer.java#L185] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (BEAM-7483) Audit mis-uses of GCS client library API
Chamikara Jayalath created BEAM-7483: Summary: Audit mis-uses of GCS client library API Key: BEAM-7483 URL: https://issues.apache.org/jira/browse/BEAM-7483 Project: Beam Issue Type: Improvement Components: io-java-gcp Reporter: Chamikara Jayalath For example, the GoogleCloudStorageReadChannel constructor may take a storage object that is null, but this will not work at run-time. Carefully auditing and fixing similar mis-uses of the API in main code or tests could reduce the chance of future bugs. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-7455) Improve Avro IO integration test coverage on Python 3.
[ https://issues.apache.org/jira/browse/BEAM-7455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853485#comment-16853485 ] Chamikara Jayalath commented on BEAM-7455: -- I think it's good if we can add an integration test as well. For example, Java has an AvroIO performance test here which we could replicate for Python: [https://builds.apache.org/view/A-D/view/Beam/view/PerformanceTests/job/beam_PerformanceTests_AvroIOIT/] > Improve Avro IO integration test coverage on Python 3. > -- > > Key: BEAM-7455 > URL: https://issues.apache.org/jira/browse/BEAM-7455 > Project: Beam > Issue Type: Sub-task > Components: io-python-avro >Reporter: Valentyn Tymofieiev >Assignee: Frederik Bode >Priority: Major > > It seems that we don't have an integration test for Avro IO on Python 3: > fastavro_it_test [1] depends on both avro and fastavro, however avro package > currently does not work with Beam on Python 3, so we don't have an > integration test that exercises Avro IO on Python 3. > We should add an integration test for Avro IO that does not need both > libraries at the same time, and instead can run using either library. > [~frederik] is this something you could help with? > cc: [~chamikara] [~Juta] > [1] > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/fastavro_it_test.py -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-7439) Bigquery Write with schema None: TypeError: 'NoneType' object has no attribute '__getitem__'
[ https://issues.apache.org/jira/browse/BEAM-7439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16850115#comment-16850115 ] Chamikara Jayalath commented on BEAM-7439: -- Also, Ahmet pointed out that we seem to be missing a critical test here, where we write to an existing BQ table without providing a schema. > Bigquery Write with schema None: TypeError: 'NoneType' object has no > attribute '__getitem__' > > > Key: BEAM-7439 > URL: https://issues.apache.org/jira/browse/BEAM-7439 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Juta Staes >Assignee: Pablo Estrada >Priority: Blocker > Fix For: 2.13.0 > > > When running a simple write to bigquery on apache-beam==2.12.0 > {code:java} > input_data = [ >{'str': 'test'} > ] > (pipeline | 'create' >> beam.Create(input_data) >| 'write' >> beam.io.WriteToBigQuery( >':beam_test.test')) > {code} > > I get the following error: > {code:java} > WARNING:root:Start running in the cloud > Traceback (most recent call last): > File "test_pipeline.py", line 193, in > main() > File "test_pipeline.py", line 183, in main > ':beam_test.test')) > File > "/mnt/c/Users/Juta/Documents/02-projects/apache/beam/sdks/venv2/local/lib/python2.7/site-packages/apache_beam/pvalue.py", > line 112, in __or__ > return self.pipeline.apply(ptransform, self) > File > "/mnt/c/Users/Juta/Documents/02-projects/apache/beam/sdks/venv2/local/lib/python2.7/site-packages/apache_beam/pipeline.py", > line 470, in apply > label or transform.label) > File > "/mnt/c/Users/Juta/Documents/02-projects/apache/beam/sdks/venv2/local/lib/python2.7/site-packages/apache_beam/pipeline.py", > line 480, in apply > return self.apply(transform, pvalueish) > File > "/mnt/c/Users/Juta/Documents/02-projects/apache/beam/sdks/venv2/local/lib/python2.7/site-packages/apache_beam/pipeline.py", > line 516, in apply > pvalueish_result = self.runner.apply(transform, pvalueish, self._options) > File > 
"/mnt/c/Users/Juta/Documents/02-projects/apache/beam/sdks/venv2/local/lib/python2.7/site-packages/apache_beam/runners/runner.py", > line 193, in apply > return m(transform, input, options) > File > "/mnt/c/Users/Juta/Documents/02-projects/apache/beam/sdks/venv2/local/lib/python2.7/site-packages/apache_beam/runners/dataflow/dataflow_runner.py", > line 617, in apply_WriteToBigQuery > parse_table_schema_from_json(json.dumps(transform.schema)), > File > "/mnt/c/Users/Juta/Documents/02-projects/apache/beam/sdks/venv2/local/lib/python2.7/site-packages/apache_beam/io/gcp/bigquery_tools.py", > line 130, in parse_table_schema_from_json > fields = [_parse_schema_field(f) for f in json_schema['fields']] > TypeError: 'NoneType' object has no attribute '__getitem__'{code} > I already proposed a fix for this as part of a larger pr: > https://github.com/apache/beam/pull/8621/commits/41cdfbda5a4e2a56b6d10046ba265ad68c78675d > I was wondering if this also needs to be patched for version 2.12.0? > cc: [~tvalentyn] [~pabloem] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
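The root cause in the traceback above is easy to reproduce in isolation: `json.dumps(None)` yields `"null"`, which parses back to `None`, and indexing `None` raises the reported `TypeError`. The sketch below uses a simplified stand-in for `parse_table_schema_from_json` (an assumption for illustration, not the real implementation) to show the failure and the kind of guard the fix applies:

```python
import json


def parse_table_schema_from_json(schema_string):
    # Simplified stand-in for the helper in apache_beam.io.gcp.bigquery_tools:
    # like the real code, it indexes json_schema['fields'], which is what fails.
    json_schema = json.loads(schema_string)
    return [f['name'] for f in json_schema['fields']]


# json.dumps(None) == 'null', which loads back as None, so the 'fields'
# lookup raises TypeError ("'NoneType' object has no attribute '__getitem__'"
# on Python 2; "not subscriptable" on Python 3).
try:
    parse_table_schema_from_json(json.dumps(None))
    raise AssertionError('expected a TypeError')
except TypeError:
    pass

# The guard: skip schema parsing entirely when no schema was supplied,
# letting BigQuery fall back to the existing table's schema.
schema = None
parsed = (parse_table_schema_from_json(json.dumps(schema))
          if schema is not None else None)
assert parsed is None
```

This is also why the missing test Ahmet pointed out matters: writing to an existing table without a schema is exactly the path that exercises the `None` branch.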
[jira] [Assigned] (BEAM-7424) Retry HTTP 429 errors from GCS when reading/writing data and when staging files
[ https://issues.apache.org/jira/browse/BEAM-7424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath reassigned BEAM-7424: Assignee: Yueyang Qiu > Retry HTTP 429 errors from GCS when reading/writing data and when staging > files > > > Key: BEAM-7424 > URL: https://issues.apache.org/jira/browse/BEAM-7424 > Project: Beam > Issue Type: Bug > Components: io-java-gcp, io-python-gcp, sdk-py-core >Reporter: Chamikara Jayalath >Assignee: Yueyang Qiu >Priority: Major > > This has to be done for both Java and Python SDKs. > > Seems like Java SDK already retries 429 errors (so prob. just have to > verify): > [https://github.com/apache/beam/blob/master/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/util/RetryHttpRequestInitializer.java#L185] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (BEAM-7424) Retry HTTP 429 errors from GCS when reading/writing data and when staging files
Chamikara Jayalath created BEAM-7424: Summary: Retry HTTP 429 errors from GCS when reading/writing data and when staging files Key: BEAM-7424 URL: https://issues.apache.org/jira/browse/BEAM-7424 Project: Beam Issue Type: Bug Components: io-java-gcp, io-python-gcp, sdk-py-core Reporter: Chamikara Jayalath This has to be done for both Java and Python SDKs. Seems like Java SDK already retries 429 errors (so prob. just have to verify): [https://github.com/apache/beam/blob/master/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/util/RetryHttpRequestInitializer.java#L185] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-7246) Create a Spanner IO for Python
[ https://issues.apache.org/jira/browse/BEAM-7246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16846256#comment-16846256 ] Chamikara Jayalath commented on BEAM-7246: -- Shehzaad, is this something you hope to work on in the near future? If so, I'd appreciate it if you could provide a rough design. The existing Java connector should serve as a good example. You can probably use the client library here: [https://pypi.org/project/google-cloud-spanner/] > Create a Spanner IO for Python > -- > > Key: BEAM-7246 > URL: https://issues.apache.org/jira/browse/BEAM-7246 > Project: Beam > Issue Type: Bug > Components: io-python-gcp >Reporter: Reuven Lax >Assignee: Shehzaad Nakhoda >Priority: Major > > Add I/O support for Google Cloud Spanner for the Python SDK (Batch Only). > Testing in this work item will be in the form of DirectRunner tests and > manual testing. > Integration and performance tests are a separate work item (not included > here). > See https://beam.apache.org/documentation/io/built-in/. The goal is to add > Google Cloud Spanner to the Database column for the Python/Batch row. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-6951) Beam Dependency Update Request: com.github.spotbugs:spotbugs-annotations
[ https://issues.apache.org/jira/browse/BEAM-6951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844202#comment-16844202 ] Chamikara Jayalath commented on BEAM-6951: -- cc: [~kenn] > Beam Dependency Update Request: com.github.spotbugs:spotbugs-annotations > > > Key: BEAM-6951 > URL: https://issues.apache.org/jira/browse/BEAM-6951 > Project: Beam > Issue Type: Sub-task > Components: dependencies >Reporter: Beam JIRA Bot >Priority: Major > > - 2019-04-01 12:15:05.460427 > - > Please consider upgrading the dependency > com.github.spotbugs:spotbugs-annotations. > The current version is 3.1.11. The latest version is 4.0.0-beta1 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2019-04-08 12:15:37.305259 > - > Please consider upgrading the dependency > com.github.spotbugs:spotbugs-annotations. > The current version is 3.1.11. The latest version is 4.0.0-beta1 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2019-04-15 12:35:52.817108 > - > Please consider upgrading the dependency > com.github.spotbugs:spotbugs-annotations. > The current version is 3.1.11. The latest version is 4.0.0-beta1 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2019-04-22 12:13:25.261372 > - > Please consider upgrading the dependency > com.github.spotbugs:spotbugs-annotations. > The current version is 3.1.11. The latest version is 4.0.0-beta1 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2019-05-20 16:39:18.034675 > - > Please consider upgrading the dependency > com.github.spotbugs:spotbugs-annotations. 
> The current version is 3.1.11. The latest version is 4.0.0-beta1 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2019-05-20 16:54:09.180503 > - > Please consider upgrading the dependency > com.github.spotbugs:spotbugs-annotations. > The current version is 3.1.11. The latest version is 4.0.0-beta1 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2019-05-20 17:37:40.326607 > - > Please consider upgrading the dependency > com.github.spotbugs:spotbugs-annotations. > The current version is 3.1.11. The latest version is 4.0.0-beta1 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-5968) Beam Dependency Update Request: future
[ https://issues.apache.org/jira/browse/BEAM-5968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844197#comment-16844197 ] Chamikara Jayalath commented on BEAM-5968: -- cc: [~tvalentyn] > Beam Dependency Update Request: future > -- > > Key: BEAM-5968 > URL: https://issues.apache.org/jira/browse/BEAM-5968 > Project: Beam > Issue Type: Bug > Components: dependencies >Reporter: Beam JIRA Bot >Priority: Major > > - 2018-11-05 12:10:39.020851 > - > Please consider upgrading the dependency future. > The current version is 0.16.0. The latest version is 0.17.1 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2018-11-12 12:10:22.255179 > - > Please consider upgrading the dependency future. > The current version is 0.16.0. The latest version is 0.17.1 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2018-11-19 12:10:39.701768 > - > Please consider upgrading the dependency future. > The current version is 0.16.0. The latest version is 0.17.1 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2018-11-26 12:09:05.586578 > - > Please consider upgrading the dependency future. > The current version is 0.16.0. The latest version is 0.17.1 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2018-12-03 12:09:53.018813 > - > Please consider upgrading the dependency future. > The current version is 0.16.0. The latest version is 0.17.1 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. 
> - 2018-12-10 12:09:14.146087 > - > Please consider upgrading the dependency future. > The current version is 0.16.0. The latest version is 0.17.1 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2018-12-17 12:11:47.510213 > - > Please consider upgrading the dependency future. > The current version is 0.16.0. The latest version is 0.17.1 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2018-12-31 15:15:47.043329 > - > Please consider upgrading the dependency future. > The current version is 0.16.0. The latest version is 0.17.1 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2019-01-07 12:09:52.562726 > - > Please consider upgrading the dependency future. > The current version is 0.16.0. The latest version is 0.17.1 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2019-01-14 12:10:27.645112 > - > Please consider upgrading the dependency future. > The current version is 0.16.0. The latest version is 0.17.1 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2019-01-21 12:18:02.030313 > - > Please consider upgrading the dependency future. > The current version is 0.16.0. The latest version is 0.17.1 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2019-01-28 12:10:09.289616 > - > Please
[jira] [Commented] (BEAM-7365) apache_beam.io.avroio_test.TestAvro.test_dynamic_work_rebalancing_exhaustive is very slow
[ https://issues.apache.org/jira/browse/BEAM-7365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844145#comment-16844145 ] Chamikara Jayalath commented on BEAM-7365: -- Taking a look. Seems like the test works fine on Py2, so it probably has to be redesigned to fit the threading model of Py3. > apache_beam.io.avroio_test.TestAvro.test_dynamic_work_rebalancing_exhaustive > is very slow > - > > Key: BEAM-7365 > URL: https://issues.apache.org/jira/browse/BEAM-7365 > Project: Beam > Issue Type: Sub-task > Components: io-python-avro >Reporter: Robert Bradshaw >Priority: Major > > {noformat} > $ python setup.py test -s > apache_beam.io.avroio_test.TestAvro.test_dynamic_work_rebalancing_exhaustive > test_dynamic_work_rebalancing_exhaustive > (apache_beam.io.avroio_test.TestFastAvro) ... WARNING:root:After 101 > concurrent splitting trials at item #2, observed only failure, giving up on > this item > WARNING:root:After 101 concurrent splitting trials at item #21, observed only > failure, giving up on this item > WARNING:root:After 101 concurrent splitting trials at item #22, observed only > failure, giving up on this item > WARNING:root:After 1014 total concurrent splitting trials, considered only 25 > items, giving up. > ok > -- > Ran 1 test in 172.223s > > {noformat} > Compare this with > {noformat} > $ python setup.py test -s > apache_beam.io.avroio_test.TestAvro.test_dynamic_work_rebalancing_exhaustive > test_dynamic_work_rebalancing_exhaustive > (apache_beam.io.avroio_test.TestAvro) ... ok > -- > Ran 1 test in 0.623s > OK > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (BEAM-7063) Stage Dataflow runner harness in Dataflow FnApi Python test suites that run using an unreleased SDK.
[ https://issues.apache.org/jira/browse/BEAM-7063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath updated BEAM-7063: - Priority: Major (was: Critical) > Stage Dataflow runner harness in Dataflow FnApi Python test suites that run > using an unreleased SDK. > > > Key: BEAM-7063 > URL: https://issues.apache.org/jira/browse/BEAM-7063 > Project: Beam > Issue Type: Bug > Components: sdk-py-core, test-failures >Reporter: Chamikara Jayalath >Assignee: Valentyn Tymofieiev >Priority: Major > Fix For: 2.13.0 > > Time Spent: 5h > Remaining Estimate: 0h > > Seems like this was the first post-commit failure: > [https://builds.apache.org/view/A-D/view/Beam/view/PostCommit/job/beam_PostCommit_Python_Verify/7855/] > (the one before this also failed but that looks like a flake) > > > Looking at failed tests I noticed that some of the were due to failed > Dataflow jobs. I noticed three issues. > (1) Error in Dataflow jobs "This handler is only capable of dealing with > urn:beam:sideinput:materialization:multimap:0.1 materializations but was > asked to handle beam:side_input:multimap:v1 for PCollectionView with tag > side0-FilterOutSpammers." > [https://pantheon.corp.google.com/dataflow/jobsDetail/locations/us-central1/jobs/2019-04-08_17_57_01-16397932959358575026?project=apache-beam-testing] > [https://pantheon.corp.google.com/dataflow/jobsDetail/locations/us-central1/jobs/2019-04-08_18_07_11-1342017143600540529?project=apache-beam-testing] > > (2) BigQuery job failures (possibly this is transient) > [https://pantheon.corp.google.com/dataflow/jobsDetail/locations/us-central1/jobs/2019-04-08_18_25_22-16437560120793921616?project=apache-beam-testing] > > (3) Some batch Dataflow jobs timed out and were cancelled (possibly also > transient). 
> > (1) is probably due to > [https://github.com/apache/beam/commit/684f8130284a7c7979773300d04e5473ca0ac8f3] > > > cc: [~altay] [~lostluck] [~alanmyrvold] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-3055) Retry downloading required test artifacts with a backoff when download fails.
[ https://issues.apache.org/jira/browse/BEAM-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834215#comment-16834215 ] Chamikara Jayalath commented on BEAM-3055: -- Few more flakes due to this. [https://builds.apache.org/job/beam_PostRelease_NightlySnapshot/607/] [https://builds.apache.org/job/beam_PostRelease_NightlySnapshot/605/] cc: [~alanmyrvold] [~yifanzou] > Retry downloading required test artifacts with a backoff when download fails. > - > > Key: BEAM-3055 > URL: https://issues.apache.org/jira/browse/BEAM-3055 > Project: Beam > Issue Type: Improvement > Components: test-failures, testing >Reporter: Valentyn Tymofieiev >Assignee: Jason Kuster >Priority: Major > Fix For: Not applicable > > > When Maven fails to download a required artifact for a test, the test fails. > Is it possible to configure Maven to retry the download with a backoff up to > N number of attempts? > Example test failure: > https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/15004/console > 2017-10-11T19:01:21.382 [INFO] > > 2017-10-11T19:01:21.382 [INFO] BUILD FAILURE > 2017-10-11T19:01:21.382 [INFO] > > 2017-10-11T19:01:21.383 [INFO] Total time: 55:20 min > 2017-10-11T19:01:21.383 [INFO] Finished at: 2017-10-11T19:01:21+00:00 > 2017-10-11T19:01:23.807 [INFO] Final Memory: 261M/2068M > 2017-10-11T19:01:23.807 [INFO] > > 2017-10-11T19:01:23.836 [ERROR] Failed to execute goal on project > beam-sdks-java-io-hcatalog: Could not resolve dependencies for project > org.apache.beam:beam-sdks-java-io-hcatalog:jar:2.2.0-SNAPSHOT: The following > artifacts could not be resolved: org.apache.hive:hive-metastore:jar:2.1.0, > javolution:javolution:jar:5.5.1: Could not transfer artifact > org.apache.hive:hive-metastore:jar:2.1.0 from/to central > (https://repo.maven.apache.org/maven2): GET request of: > org/apache/hive/hive-metastore/2.1.0/hive-metastore-2.1.0.jar from central > failed: Connection reset -> [Help 1]. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
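The retry-with-backoff behavior requested in BEAM-3055 can be sketched as a generic helper. This is an illustrative Python sketch, not Maven configuration; the names `retry_with_backoff`, `max_attempts`, and `base_delay` are hypothetical, and the injectable `sleep` parameter exists only to make the sketch testable.

```python
import time


def retry_with_backoff(fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on any exception with exponential backoff.

    After the k-th failed attempt, waits base_delay * 2**(k - 1) seconds
    before retrying; the last attempt's exception is re-raised.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

A flaky artifact download (e.g. the "Connection reset" seen above) would then succeed as long as one of the attempts goes through, instead of failing the whole build on the first transient error.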
[jira] [Resolved] (BEAM-6894) ExternalTransform.expand() does not create the proper AppliedPTransform sub-graph
[ https://issues.apache.org/jira/browse/BEAM-6894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath resolved BEAM-6894. -- Resolution: Fixed Fix Version/s: 2.13.0 > ExternalTransform.expand() does not create the proper AppliedPTransform > sub-graph > - > > Key: BEAM-6894 > URL: https://issues.apache.org/jira/browse/BEAM-6894 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Chamikara Jayalath >Assignee: Chamikara Jayalath >Priority: Major > Fix For: 2.13.0 > > Time Spent: 10h 20m > Remaining Estimate: 0h > > 'ExternalTransform.expand()' can be used to expand a remote transform and > build the correct runner-api subgraph for that transform. However, currently > we do not modify the AppliedPTransform sub-graph correctly during this > process. Relevant code location: > [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/external.py#L135] > > Without this, the DataflowRunner, which relies on this object graph (not just > the runner API proto) to build the job submission request to the Dataflow > service, cannot construct that request properly. > > cc: [~robertwb] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (BEAM-7172) Support using SDFs as cross-language transforms using ExternalTransform
Chamikara Jayalath created BEAM-7172: Summary: Support using SDFs as cross-language transforms using ExternalTransform Key: BEAM-7172 URL: https://issues.apache.org/jira/browse/BEAM-7172 Project: Beam Issue Type: Bug Components: runner-dataflow, sdk-py-core, sdk-py-harness Reporter: Chamikara Jayalath For example, we need to determine the restriction coder before job submission for the Dataflow runner. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-6590) LTS backport: upgrade gcsio dependency to 1.9.13
[ https://issues.apache.org/jira/browse/BEAM-6590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16827207#comment-16827207 ] Chamikara Jayalath commented on BEAM-6590: -- Will look into this. Thanks. Is there an ETA for 2.7.1? > LTS backport: upgrade gcsio dependency to 1.9.13 > > > Key: BEAM-6590 > URL: https://issues.apache.org/jira/browse/BEAM-6590 > Project: Beam > Issue Type: Bug > Components: io-java-gcp, sdk-java-core >Reporter: Chamikara Jayalath >Assignee: Chamikara Jayalath >Priority: Blocker > Labels: triaged > Fix For: 2.7.1 > > > 1.9.12 has the following critical bug, so it should be avoided. > > [https://github.com/GoogleCloudPlatform/bigdata-interop/commit/52f5055d37f20a04303b146e9063e7ccc876ec17] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-7137) TypeError caused by using str variable as header argument in apache_beam.io.textio.WriteToText
[ https://issues.apache.org/jira/browse/BEAM-7137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16827174#comment-16827174 ] Chamikara Jayalath commented on BEAM-7137: -- [~pabloem] [~ŁukaszG] in case we already have plans to add such tests. > TypeError caused by using str variable as header argument in > apache_beam.io.textio.WriteToText > -- > > Key: BEAM-7137 > URL: https://issues.apache.org/jira/browse/BEAM-7137 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-core >Affects Versions: 2.11.0 > Environment: Python 3.5.6 > macOS Mojave 10.14.4 >Reporter: yoshiki obata >Assignee: yoshiki obata >Priority: Major > Fix For: 2.13.0 > > > Passing a str header to apache_beam.io.textio.WriteToText causes a > TypeError with Python 3.5.6 (or possibly higher), even though the docstring > says header is a str. > This error occurs because the header is written to the file without being > encoded to bytes in apache_beam.io.textio._TextSink.open. > > {code:java} > Traceback (most recent call last): > File "apache_beam/runners/common.py", line 727, in > apache_beam.runners.common.DoFnRunner.process > File "apache_beam/runners/common.py", line 555, in > apache_beam.runners.common.PerWindowInvoker.invoke_process > File "apache_beam/runners/common.py", line 625, in > apache_beam.runners.common.PerWindowInvoker._invoke_per_window > File > "/Users/yob/.local/share/virtualenvs/test/lib/python3.5/site-packages/apache_beam/io/iobase.py", > line 1033, in process > self.writer = self.sink.open_writer(init_result, str(uuid.uuid4())) > File > "/Users/yob/.local/share/virtualenvs/test/lib/python3.5/site-packages/apache_beam/options/value_provider.py", > line 137, in _f > return fnc(self, *args, **kwargs) > File > "/Users/yob/.local/share/virtualenvs/test/lib/python3.5/site-packages/apache_beam/io/filebasedsink.py", > line 185, in open_writer > return FileBasedSinkWriter(self, os.path.join(init_result, uid) + suffix) > File > 
"/Users/yob/.local/share/virtualenvs/test/lib/python3.5/site-packages/apache_beam/io/filebasedsink.py", > line 389, in __init__ > self.temp_handle = self.sink.open(temp_shard_path) > File > "/Users/yob/.local/share/virtualenvs/test/lib/python3.5/site-packages/apache_beam/io/textio.py", > line 393, in open > file_handle.write(self._header) > TypeError: a bytes-like object is required, not 'str' > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-7137) TypeError caused by using str variable as header argument in apache_beam.io.textio.WriteToText
[ https://issues.apache.org/jira/browse/BEAM-7137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16827173#comment-16827173 ] Chamikara Jayalath commented on BEAM-7137: -- Yes. I think we need more integration test coverage in Beam for Python file-based IO (this is currently indirectly covered by other tests) > TypeError caused by using str variable as header argument in > apache_beam.io.textio.WriteToText > -- > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-7137) TypeError caused by using str variable as header argument in apache_beam.io.textio.WriteToText
[ https://issues.apache.org/jira/browse/BEAM-7137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826287#comment-16826287 ] Chamikara Jayalath commented on BEAM-7137: -- +1 for encoding header to bytes (using UTF-8). We have clearly documented that header should be a string here: [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/textio.py#L599] > TypeError caused by using str variable as header argument in > apache_beam.io.textio.WriteToText > -- > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
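The fix proposed in the comment above (encoding the header using UTF-8 before it reaches the binary file handle) can be sketched as follows. `coerce_header_to_bytes` is a hypothetical helper name for illustration, not the actual change made to `_TextSink.open`.

```python
import io


def coerce_header_to_bytes(header):
    """Encode a str header as UTF-8 bytes so it can be written to a
    binary file handle; bytes (and None) pass through unchanged."""
    if isinstance(header, str):
        return header.encode("utf-8")
    return header


# Writing the coerced header to a binary handle no longer raises
# "TypeError: a bytes-like object is required, not 'str'".
handle = io.BytesIO()
handle.write(coerce_header_to_bytes("col_a,col_b\n"))
```

This keeps the documented str-typed `header` argument working on Python 3 while still accepting callers that already pass bytes.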
[jira] [Resolved] (BEAM-6962) Add wordcount_xlang to Python post-commit
[ https://issues.apache.org/jira/browse/BEAM-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath resolved BEAM-6962. -- Resolution: Fixed Fix Version/s: 2.13.0 > Add wordcount_xlang to Python post-commit > - > > Key: BEAM-6962 > URL: https://issues.apache.org/jira/browse/BEAM-6962 > Project: Beam > Issue Type: Improvement > Components: runner-flink, sdk-py-core, sdk-py-harness >Reporter: Chamikara Jayalath >Assignee: Heejong Lee >Priority: Major > Fix For: 2.13.0 > > Time Spent: 1h 40m > Remaining Estimate: 0h > > This example works great but we should add it to Python post-commit to > prevent bit-rot. > [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/wordcount_xlang.py] > Also, please consider adding a step to the pipeline that performs a checksum > on the output to make sure that the output is valid. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
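The checksum step suggested for wordcount_xlang's output could be as simple as hashing the sorted output lines and comparing against a known value. The function name `output_checksum` and the choice of SHA-1 here are illustrative assumptions, not what Beam's test utilities actually use.

```python
import hashlib


def output_checksum(lines):
    """Order-insensitive checksum of pipeline output: hash the sorted
    lines so that shard ordering does not affect the result."""
    digest = hashlib.sha1()
    for line in sorted(lines):
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()
```

A post-commit test could then assert that the checksum of the collected output files equals the expected value for the test input, catching silent output corruption rather than only pipeline failures.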
[jira] [Commented] (BEAM-7063) Stage Dataflow runner harness in Dataflow FnApi Python test suites that run using an unreleased SDK.
[ https://issues.apache.org/jira/browse/BEAM-7063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823559#comment-16823559 ] Chamikara Jayalath commented on BEAM-7063: -- Ah OK. Thanks. This bug was opened for a post-commit failure, but I see now that the title has changed. > Stage Dataflow runner harness in Dataflow FnApi Python test suites that run > using an unreleased SDK. > > > Key: BEAM-7063 > URL: https://issues.apache.org/jira/browse/BEAM-7063 > Project: Beam > Issue Type: Bug > Components: sdk-py-core, test-failures >Reporter: Chamikara Jayalath >Assignee: Luke Cwik >Priority: Critical > Fix For: 2.13.0 > > Time Spent: 5h > Remaining Estimate: 0h > > Seems like this was the first post-commit failure: > [https://builds.apache.org/view/A-D/view/Beam/view/PostCommit/job/beam_PostCommit_Python_Verify/7855/] > (the one before this also failed but that looks like a flake) > > > Looking at failed tests I noticed that some of them were due to failed > Dataflow jobs. I noticed three issues. > (1) Error in Dataflow jobs "This handler is only capable of dealing with > urn:beam:sideinput:materialization:multimap:0.1 materializations but was > asked to handle beam:side_input:multimap:v1 for PCollectionView with tag > side0-FilterOutSpammers." > [https://pantheon.corp.google.com/dataflow/jobsDetail/locations/us-central1/jobs/2019-04-08_17_57_01-16397932959358575026?project=apache-beam-testing] > [https://pantheon.corp.google.com/dataflow/jobsDetail/locations/us-central1/jobs/2019-04-08_18_07_11-1342017143600540529?project=apache-beam-testing] > > (2) BigQuery job failures (possibly this is transient) > [https://pantheon.corp.google.com/dataflow/jobsDetail/locations/us-central1/jobs/2019-04-08_18_25_22-16437560120793921616?project=apache-beam-testing] > > (3) Some batch Dataflow jobs timed out and were cancelled (possibly also > transient). 
> > (1) is probably due to > [https://github.com/apache/beam/commit/684f8130284a7c7979773300d04e5473ca0ac8f3] > > > cc: [~altay] [~lostluck] [~alanmyrvold] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (BEAM-7063) beam_PostCommit_Python_Verify is consistently failing due to a sideinput materialization related issue
[ https://issues.apache.org/jira/browse/BEAM-7063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath resolved BEAM-7063. -- Resolution: Fixed Seems like the test suite is passing now. Resolving. Thanks everyone. > beam_PostCommit_Python_Verify is consistently failing due to a sideinput > materialization related issue > -- > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-7063) beam_PostCommit_Python_Verify is consistently failing due to a sideinput materialization related issue
[ https://issues.apache.org/jira/browse/BEAM-7063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819448#comment-16819448 ] Chamikara Jayalath commented on BEAM-7063: -- [~tvalentyn], since the failing test seems to be a Python 3 test. > beam_PostCommit_Python_Verify is consistently failing due to a sideinput > materialization related issue > -- > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Reopened] (BEAM-7063) beam_PostCommit_Python_Verify is consistently failing due to a sideinput materialization related issue
[ https://issues.apache.org/jira/browse/BEAM-7063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chamikara Jayalath reopened BEAM-7063: -- > beam_PostCommit_Python_Verify is consistently failing due to a sideinput > materialization related issue > -- > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (BEAM-7063) beam_PostCommit_Python_Verify is consistently failing due to a sideinput materialization related issue
[ https://issues.apache.org/jira/browse/BEAM-7063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819309#comment-16819309 ] Chamikara Jayalath commented on BEAM-7063: -- Looks like one postCommit test is still failing. [https://builds.apache.org/view/A-D/view/Beam/view/PostCommit/job/beam_PostCommit_Python3_Verify/562/] [https://scans.gradle.com/s/2jqz3qlygliv4/console-log?task=:beam-sdks-python-test-suites-dataflow-py3:postCommitIT#L29] test_wordcount_fnapi_it (apache_beam.examples.wordcount_it_test.WordCountIT) ... ERROR apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error: java.lang.IllegalArgumentException: This handler is only capable of dealing with urn:beam:sideinput:materialization:multimap:0.1 materializations but was asked to handle beam:side_input:multimap:v1 for PCollectionView with tag side0-write/Write/WriteImpl/WriteBundles. at org.apache.beam.vendor.guava.v20_0.com.google.common.base.Preconditions.checkArgument(Preconditions.java:399) at org.apache.beam.runners.dataflow.worker.graph.RegisterNodeFunction.transformSideInputForRunner(RegisterNodeFunction.java:506) at org.apache.beam.runners.dataflow.worker.graph.RegisterNodeFunction.apply(RegisterNodeFunction.java:327) at org.apache.beam.runners.dataflow.worker.graph.RegisterNodeFunction.apply(RegisterNodeFunction.java:97) at java.util.function.Function.lambda$andThen$1(Function.java:88) at org.apache.beam.runners.dataflow.worker.graph.CreateRegisterFnOperationFunction.apply(CreateRegisterFnOperationFunction.java:208) at org.apache.beam.runners.dataflow.worker.graph.CreateRegisterFnOperationFunction.apply(CreateRegisterFnOperationFunction.java:75) at java.util.function.Function.lambda$andThen$1(Function.java:88) at java.util.function.Function.lambda$andThen$1(Function.java:88) at org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.doWork(BatchDataflowWorker.java:347) at 
org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.getAndPerformWork(BatchDataflowWorker.java:306) at org.apache.beam.runners.dataflow.worker.DataflowRunnerHarness.start(DataflowRunnerHarness.java:195) at org.apache.beam.runners.dataflow.worker.DataflowRunnerHarness.main(DataflowRunnerHarness.java:123) > beam_PostCommit_Python_Verify is consistently failing due to a sideinput > materialization related issue > -- > -- This message was sent by Atlassian JIRA (v7.6.3#76005)