[jira] [Work logged] (BEAM-13647) Go SDK FnApi environment version 9 should be compatible with Runner v2 artifact service
[ https://issues.apache.org/jira/browse/BEAM-13647?focusedWorklogId=778469&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778469 ] ASF GitHub Bot logged work on BEAM-13647: - Author: ASF GitHub Bot Created on: 05/Jun/22 22:51 Start Date: 05/Jun/22 22:51 Worklog Time Spent: 10m Work Description: lostluck commented on PR #18533: URL: https://github.com/apache/beam/pull/18533#issuecomment-1146897475 Run Go Postcommit Issue Time Tracking --- Worklog Id: (was: 778469) Time Spent: 7h 10m (was: 7h) > Go SDK FnApi environment version 9 should be compatible with Runner v2 > artifact service > --- > > Key: BEAM-13647 > URL: https://issues.apache.org/jira/browse/BEAM-13647 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow, sdk-go >Reporter: Heejong Lee >Assignee: Danny McCormick >Priority: P2 > Fix For: 2.40.0 > > Time Spent: 7h 10m > Remaining Estimate: 0h > > Go SDK FnApi environment version 9 should be compatible with Runner v2 > artifact service. We need to test and verify whether Go SDK works well with > Runner v2 artifact service before increasing the version number. > For 2.37 and 2.38 we'll keep both current and old versions as transitional > code, but we should be able to comfortably pull out the old stuff in 2.39. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-10785) Support for coder argument in WriteToBigQuery
[ https://issues.apache.org/jira/browse/BEAM-10785?focusedWorklogId=778427&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778427 ] ASF GitHub Bot logged work on BEAM-10785: - Author: ASF GitHub Bot Created on: 05/Jun/22 06:58 Start Date: 05/Jun/22 06:58 Worklog Time Spent: 10m Work Description: harrydrippin commented on PR #17518: URL: https://github.com/apache/beam/pull/17518#issuecomment-1146753160 @pabloem Just for sure, do you mean it is good if I apply the change to the original `RowAsDictJsonCoder` implementation just like my custom one? (adding `ensure_ascii=False` on the `json.dumps`) I will submit another PR for this if you confirm. Thanks! Issue Time Tracking --- Worklog Id: (was: 778427) Time Spent: 2h 20m (was: 2h 10m) > Support for coder argument in WriteToBigQuery > - > > Key: BEAM-10785 > URL: https://issues.apache.org/jira/browse/BEAM-10785 > Project: Beam > Issue Type: Bug > Components: io-py-gcp >Reporter: Nakamura Yu >Assignee: Seunghwan Hong >Priority: P1 > Time Spent: 2h 20m > Remaining Estimate: 0h > > When using WriteToBigQuery to transfer data to BigQuery, non-ascii characters > are replaced with replacement characters. > This was due to the RowAsDictJsonCoder being set as the coder for the > BigQueryBatchFileLoads called inside WriteToBigQuery. > I want to add coder to the argument of WriteToBigQuery so that I can set a > coder other than RowAsDictJsonCoder. > If no problem, I will create a Pull Request next weekend. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-7439) Bigquery Write with schema None: TypeError: 'NoneType' object has no attribute '__getitem__'
[ https://issues.apache.org/jira/browse/BEAM-7439?focusedWorklogId=778368&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778368 ] ASF GitHub Bot logged work on BEAM-7439: Author: ASF GitHub Bot Created on: 04/Jun/22 01:49 Start Date: 04/Jun/22 01:49 Worklog Time Spent: 10m Work Description: kennknowles opened a new issue, #19463: URL: https://github.com/apache/beam/issues/19463 This is a followup item from https://issues.apache.org/jira/browse/BEAM-7439 Update the documentation for comment https://issues.apache.org/jira/browse/BEAM-7439?focusedCommentId=16851420&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16851420 cc: [~tvalentyn] [~chamikara] Imported from Jira [BEAM-7482](https://issues.apache.org/jira/browse/BEAM-7482). Original Jira may contain additional context. Reported by: angoenka. Issue Time Tracking --- Worklog Id: (was: 778368) Time Spent: 1h (was: 50m) > Bigquery Write with schema None: TypeError: 'NoneType' object has no > attribute '__getitem__' > > > Key: BEAM-7439 > URL: https://issues.apache.org/jira/browse/BEAM-7439 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Juta Staes >Assignee: Pablo Estrada >Priority: P0 > Fix For: 2.13.0 > > Time Spent: 1h > Remaining Estimate: 0h > > When running a simple write to bigquery on apache-beam==2.12.0 > {code:java} > input_data = [ >{'str': 'test'} > ] > (pipeline | 'create' >> beam.Create(input_data) >| 'write' >> beam.io.WriteToBigQuery( >':beam_test.test')) > {code} > > I get the following error: > {code:java} > WARNING:root:Start running in the cloud > Traceback (most recent call last): > File "test_pipeline.py", line 193, in > main() > File "test_pipeline.py", line 183, in main > ':beam_test.test')) > File > "/mnt/c/Users/Juta/Documents/02-projects/apache/beam/sdks/venv2/local/lib/python2.7/site-packages/apache_beam/pvalue.py", > line 112, in __or__ > return self.pipeline.apply(ptransform, self) > File > "/mnt/c/Users/Juta/Documents/02-projects/apache/beam/sdks/venv2/local/lib/python2.7/site-packages/apache_beam/pipeline.py", > line 470, in apply > label or transform.label) > File > "/mnt/c/Users/Juta/Documents/02-projects/apache/beam/sdks/venv2/local/lib/python2.7/site-packages/apache_beam/pipeline.py", > line 480, in apply > return self.apply(transform, pvalueish) > File > "/mnt/c/Users/Juta/Documents/02-projects/apache/beam/sdks/venv2/local/lib/python2.7/site-packages/apache_beam/pipeline.py", > line 516, in apply > pvalueish_result = self.runner.apply(transform, pvalueish, self._options) > File > "/mnt/c/Users/Juta/Documents/02-projects/apache/beam/sdks/venv2/local/lib/python2.7/site-packages/apache_beam/runners/runner.py", > line 193, in apply > return m(transform, input, options) > File > "/mnt/c/Users/Juta/Documents/02-projects/apache/beam/sdks/venv2/local/lib/python2.7/site-packages/apache_beam/runners/dataflow/dataflow_runner.py", > line 617, in apply_WriteToBigQuery > parse_table_schema_from_json(json.dumps(transform.schema)), > File > "/mnt/c/Users/Juta/Documents/02-projects/apache/beam/sdks/venv2/local/lib/python2.7/site-packages/apache_beam/io/gcp/bigquery_tools.py", > line 130, in parse_table_schema_from_json > fields = [_parse_schema_field(f) for f in json_schema['fields']] > TypeError: 'NoneType' object has no attribute '__getitem__'{code} > I already proposed a fix for this as part of a larger pr: > https://github.com/apache/beam/pull/8621/commits/41cdfbda5a4e2a56b6d10046ba265ad68c78675d > I was wondering if this also needs to be patched for version 2.12.0? > cc: [~tvalentyn] [~pabloem] -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-14556) Honor custom formatters being installed on the root logging handler
[ https://issues.apache.org/jira/browse/BEAM-14556?focusedWorklogId=778365&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778365 ] ASF GitHub Bot logged work on BEAM-14556: - Author: ASF GitHub Bot Created on: 04/Jun/22 00:25 Start Date: 04/Jun/22 00:25 Worklog Time Spent: 10m Work Description: lukecwik merged PR #17820: URL: https://github.com/apache/beam/pull/17820 Issue Time Tracking --- Worklog Id: (was: 778365) Time Spent: 1h 20m (was: 1h 10m) > Honor custom formatters being installed on the root logging handler > --- > > Key: BEAM-14556 > URL: https://issues.apache.org/jira/browse/BEAM-14556 > Project: Beam > Issue Type: New Feature > Components: runner-dataflow, sdk-java-harness >Reporter: Luke Cwik >Assignee: Luke Cwik >Priority: P2 > Time Spent: 1h 20m > Remaining Estimate: 0h > > This will allow users to create custom formatters integrating with an MDC if > they so choose by using a JvmInitializer to install their custom formatter on > the root logging handler. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-14556) Honor custom formatters being installed on the root logging handler
[ https://issues.apache.org/jira/browse/BEAM-14556?focusedWorklogId=778361&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778361 ] ASF GitHub Bot logged work on BEAM-14556: - Author: ASF GitHub Bot Created on: 04/Jun/22 00:03 Start Date: 04/Jun/22 00:03 Worklog Time Spent: 10m Work Description: lukecwik commented on PR #17820: URL: https://github.com/apache/beam/pull/17820#issuecomment-1146467756 Run Java PreCommit Issue Time Tracking --- Worklog Id: (was: 778361) Time Spent: 1h 10m (was: 1h) > Honor custom formatters being installed on the root logging handler > --- > > Key: BEAM-14556 > URL: https://issues.apache.org/jira/browse/BEAM-14556 > Project: Beam > Issue Type: New Feature > Components: runner-dataflow, sdk-java-harness >Reporter: Luke Cwik >Assignee: Luke Cwik >Priority: P2 > Time Spent: 1h 10m > Remaining Estimate: 0h > > This will allow users to create custom formatters integrating with an MDC if > they so choose by using a JvmInitializer to install their custom formatter on > the root logging handler. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-14293) Add @DoFn.yields_batches and @DoFn.yields_elements decorators to override defaults
[ https://issues.apache.org/jira/browse/BEAM-14293?focusedWorklogId=778360&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778360 ] ASF GitHub Bot logged work on BEAM-14293: - Author: ASF GitHub Bot Created on: 03/Jun/22 23:51 Start Date: 03/Jun/22 23:51 Worklog Time Spent: 10m Work Description: codecov[bot] commented on PR #19268: URL: https://github.com/apache/beam/pull/19268#issuecomment-1146462274 # [Codecov](https://codecov.io/gh/apache/beam/pull/19268?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report > Merging [#19268](https://codecov.io/gh/apache/beam/pull/19268?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (0f9a41e) into [master](https://codecov.io/gh/apache/beam/commit/c77971053d2eac2e2bcd1c36c3d312157fc4e450?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (c779710) will **increase** coverage by `0.00%`. > The diff coverage is `85.81%`. ```diff @@ Coverage Diff @@ ## master #19268 +/- ## === Coverage 74.09% 74.09% === Files 697 697 Lines 9198692036 +50 === + Hits6815468197 +43 - Misses 2258322590+7 Partials 1249 1249 ``` | Flag | Coverage Δ | | |---|---|---| | python | `83.75% <85.81%> (+<0.01%)` | :arrow_up: | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more. | [Impacted Files](https://codecov.io/gh/apache/beam/pull/19268?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | | |---|---|---| | [...on/apache\_beam/runners/direct/sdf\_direct\_runner.py](https://codecov.io/gh/apache/beam/pull/19268/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0vcnVubmVycy9kaXJlY3Qvc2RmX2RpcmVjdF9ydW5uZXIucHk=) | `35.53% <75.00%> (ø)` | | | [sdks/python/apache\_beam/runners/common.py](https://codecov.io/gh/apache/beam/pull/19268/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0vcnVubmVycy9jb21tb24ucHk=) | `88.68% <83.33%> (+0.74%)` | :arrow_up: | | [sdks/python/apache\_beam/transforms/core.py](https://codecov.io/gh/apache/beam/pull/19268/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0vdHJhbnNmb3Jtcy9jb3JlLnB5) | `92.38% <90.47%> (+0.08%)` | :arrow_up: | | [sdks/python/apache\_beam/utils/windowed\_value.py](https://codecov.io/gh/apache/beam/pull/19268/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0vdXRpbHMvd2luZG93ZWRfdmFsdWUucHk=) | `92.52% <100.00%> (+0.17%)` | :arrow_up: | | [.../python/apache\_beam/testing/test\_stream\_service.py](https://codecov.io/gh/apache/beam/pull/19268/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0vdGVzdGluZy90ZXN0X3N0cmVhbV9zZXJ2aWNlLnB5) | `88.09% <0.00%> (-4.77%)` | :arrow_down: | | [.../apache\_beam/runners/interactive/dataproc/types.py](https://codecov.io/gh/apache/beam/pull/19268/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0vcnVubmVycy9pbnRlcmFjdGl2ZS9kYXRhcHJvYy90eXBlcy5weQ==) | `93.10% <0.00%> (-3.45%)` | :arrow_down: | | [sdks/python/apache\_beam/internal/dill\_pickler.py](https://codecov.io/gh/apache/beam/pull/19268/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9weXRob24vYXBhY2hlX2JlYW0vaW50ZXJuYWwvZGlsbF9waWNrbGVyLnB5) | `84.89% <0.00%> (-1.44%)` | :arrow_down: | | [.
[jira] [Work logged] (BEAM-14293) Add @DoFn.yields_batches and @DoFn.yields_elements decorators to override defaults
[ https://issues.apache.org/jira/browse/BEAM-14293?focusedWorklogId=778359&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778359 ] ASF GitHub Bot logged work on BEAM-14293: - Author: ASF GitHub Bot Created on: 03/Jun/22 23:34 Start Date: 03/Jun/22 23:34 Worklog Time Spent: 10m Work Description: TheNeuralBit commented on PR #19268: URL: https://github.com/apache/beam/pull/19268#issuecomment-1146452287 Despite the inlining this PR seems to have a minor effect (0.1-0.2 microseconds/element) on map_fn_microbenchmark`, presumably due to the new branch. Benchmarks on my desktop with Intel Xeon W-2135 CPU. c77971053d2e: ``` Fixed cost 1.6063439090633391 Per-element 7.317467689514159e-07 R^2 0.9972847477152673 ``` This PR: ``` Fixed cost 1.630842854329745 Per-element 7.487253376931855e-07 R^2 0.998137270424781 ``` Issue Time Tracking --- Worklog Id: (was: 778359) Time Spent: 20m (was: 10m) > Add @DoFn.yields_batches and @DoFn.yields_elements decorators to override > defaults > -- > > Key: BEAM-14293 > URL: https://issues.apache.org/jira/browse/BEAM-14293 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-core >Reporter: Brian Hulette >Assignee: Brian Hulette >Priority: P2 > Time Spent: 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-14293) Add @DoFn.yields_batches and @DoFn.yields_elements decorators to override defaults
[ https://issues.apache.org/jira/browse/BEAM-14293?focusedWorklogId=778358&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778358 ] ASF GitHub Bot logged work on BEAM-14293: - Author: ASF GitHub Bot Created on: 03/Jun/22 23:30 Start Date: 03/Jun/22 23:30 Worklog Time Spent: 10m Work Description: TheNeuralBit opened a new pull request, #19268: URL: https://github.com/apache/beam/pull/19268 This PR adds two new decorators, `@yields_batches` and `@yields_elements`, which override the default interpretation for the output of `DoFn.process` and `DoFn.process_batch`. These decorators are handled via changes to `OutputProcessor` (now renamed to `OutputHandler` for clarity), which branches depending on whether batches or elements are expected in the `results` iterable. Where possible, common logic in `OutputHandler` was extracted into private helper methods that can be re-used, the .pxd definition for these methods indicates they should be inlined in cython-generated code so they don't affect performance. GitHub Actions Tests Status (on master branch) [![Build python source distribution and wheels](https://github.com/apache/beam/workflows/Build%20python%20source%20distribution%20and%20wheels/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Build+python+source+distribution+and+wheels%22+branch%3Amaster+event%3Aschedule) [![Python tests](https://github.com/apache/beam/workflows/Python%20tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Python+Tests%22+branch%3Amaster+event%3Aschedule) [![Java tests](https://github.com/apache/beam/workflows/Java%20Tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Java+Tests%22+branch%3Amaster+event%3Aschedule) See [CI.md](https://github.com/apache/beam/blob/master/CI.md) for more information about GitHub Actions CI. Issue Time Tracking --- Worklog Id: (was: 778358) Remaining Estimate: 0h Time Spent: 10m > Add @DoFn.yields_batches and @DoFn.yields_elements decorators to override > defaults > -- > > Key: BEAM-14293 > URL: https://issues.apache.org/jira/browse/BEAM-14293 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-core >Reporter: Brian Hulette >Assignee: Brian Hulette >Priority: P2 > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-14556) Honor custom formatters being installed on the root logging handler
[ https://issues.apache.org/jira/browse/BEAM-14556?focusedWorklogId=778355&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778355 ] ASF GitHub Bot logged work on BEAM-14556: - Author: ASF GitHub Bot Created on: 03/Jun/22 23:24 Start Date: 03/Jun/22 23:24 Worklog Time Spent: 10m Work Description: lukecwik commented on PR #17820: URL: https://github.com/apache/beam/pull/17820#issuecomment-1146448098 Run Java PreCommit Issue Time Tracking --- Worklog Id: (was: 778355) Time Spent: 1h (was: 50m) > Honor custom formatters being installed on the root logging handler > --- > > Key: BEAM-14556 > URL: https://issues.apache.org/jira/browse/BEAM-14556 > Project: Beam > Issue Type: New Feature > Components: runner-dataflow, sdk-java-harness >Reporter: Luke Cwik >Assignee: Luke Cwik >Priority: P2 > Time Spent: 1h > Remaining Estimate: 0h > > This will allow users to create custom formatters integrating with an MDC if > they so choose by using a JvmInitializer to install their custom formatter on > the root logging handler. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-14529) Beam BQIO not accepting integer values for BQ table fields of type Float64
[ https://issues.apache.org/jira/browse/BEAM-14529?focusedWorklogId=778349&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778349 ] ASF GitHub Bot logged work on BEAM-14529: - Author: ASF GitHub Bot Created on: 03/Jun/22 22:58 Start Date: 03/Jun/22 22:58 Worklog Time Spent: 10m Work Description: reuvenlax merged PR #17779: URL: https://github.com/apache/beam/pull/17779 Issue Time Tracking --- Worklog Id: (was: 778349) Time Spent: 2h 20m (was: 2h 10m) > Beam BQIO not accepting integer values for BQ table fields of type Float64 > -- > > Key: BEAM-14529 > URL: https://issues.apache.org/jira/browse/BEAM-14529 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Reporter: Yiru Tang >Assignee: Yiru Tang >Priority: P2 > Fix For: 2.40.0 > > Time Spent: 2h 20m > Remaining Estimate: 0h > > Customer (F5) is working on migrating to the new Storage API support in Beam > SDK and are facing issues in the latest Snapshot build. Today, in production, > they are using the legacy streaming API and their pipeline works fine when > they are mapping an incoming integer field to a BQ table field of type > Float64. However, with the storage API, their pipeline is failing with the > following error "Unexpected value :0, type: class java.lang.Integer. Table > field name: a, type: FLOAT64" Customer considers this as a regression and is > preventing them from going live with Storage API. Customer is planning on > going live with Storage API end of July and would like to have this fixed in > the 2.39.0 GA release. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-14529) Beam BQIO not accepting integer values for BQ table fields of type Float64
[ https://issues.apache.org/jira/browse/BEAM-14529?focusedWorklogId=778346&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778346 ] ASF GitHub Bot logged work on BEAM-14529: - Author: ASF GitHub Bot Created on: 03/Jun/22 22:36 Start Date: 03/Jun/22 22:36 Worklog Time Spent: 10m Work Description: reuvenlax commented on PR #17779: URL: https://github.com/apache/beam/pull/17779#issuecomment-1146421195 Run Java PreCommit Issue Time Tracking --- Worklog Id: (was: 778346) Time Spent: 2h 10m (was: 2h) > Beam BQIO not accepting integer values for BQ table fields of type Float64 > -- > > Key: BEAM-14529 > URL: https://issues.apache.org/jira/browse/BEAM-14529 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Reporter: Yiru Tang >Assignee: Yiru Tang >Priority: P2 > Fix For: 2.40.0 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > Customer (F5) is working on migrating to the new Storage API support in Beam > SDK and are facing issues in the latest Snapshot build. Today, in production, > they are using the legacy streaming API and their pipeline works fine when > they are mapping an incoming integer field to a BQ table field of type > Float64. However, with the storage API, their pipeline is failing with the > following error "Unexpected value :0, type: class java.lang.Integer. Table > field name: a, type: FLOAT64" Customer considers this as a regression and is > preventing them from going live with Storage API. Customer is planning on > going live with Storage API end of July and would like to have this fixed in > the 2.39.0 GA release. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-13806) [Cross-Language] Jenkins integration test for Go SDK BigQuery IO.
[ https://issues.apache.org/jira/browse/BEAM-13806?focusedWorklogId=778345&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778345 ] ASF GitHub Bot logged work on BEAM-13806: - Author: ASF GitHub Bot Created on: 03/Jun/22 22:25 Start Date: 03/Jun/22 22:25 Worklog Time Spent: 10m Work Description: youngoli commented on PR #16818: URL: https://github.com/apache/beam/pull/16818#issuecomment-114648 Run XVR_GoUsingJava_Dataflow PostCommit Issue Time Tracking --- Worklog Id: (was: 778345) Time Spent: 9h (was: 8h 50m) > [Cross-Language] Jenkins integration test for Go SDK BigQuery IO. > - > > Key: BEAM-13806 > URL: https://issues.apache.org/jira/browse/BEAM-13806 > Project: Beam > Issue Type: New Feature > Components: cross-language, io-go-gcp, sdk-go >Reporter: Daniel Oliveira >Assignee: Daniel Oliveira >Priority: P2 > Time Spent: 9h > Remaining Estimate: 0h > > Title says it all. Add an integration test for cross-language BigQuery IO > that runs on Jenkins. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-14529) Beam BQIO not accepting integer values for BQ table fields of type Float64
[ https://issues.apache.org/jira/browse/BEAM-14529?focusedWorklogId=778341&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778341 ] ASF GitHub Bot logged work on BEAM-14529: - Author: ASF GitHub Bot Created on: 03/Jun/22 22:07 Start Date: 03/Jun/22 22:07 Worklog Time Spent: 10m Work Description: reuvenlax commented on PR #17779: URL: https://github.com/apache/beam/pull/17779#issuecomment-1146401113 Run Java PreCommit Issue Time Tracking --- Worklog Id: (was: 778341) Time Spent: 2h (was: 1h 50m) > Beam BQIO not accepting integer values for BQ table fields of type Float64 > -- > > Key: BEAM-14529 > URL: https://issues.apache.org/jira/browse/BEAM-14529 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Reporter: Yiru Tang >Assignee: Yiru Tang >Priority: P2 > Fix For: 2.40.0 > > Time Spent: 2h > Remaining Estimate: 0h > > Customer (F5) is working on migrating to the new Storage API support in Beam > SDK and are facing issues in the latest Snapshot build. Today, in production, > they are using the legacy streaming API and their pipeline works fine when > they are mapping an incoming integer field to a BQ table field of type > Float64. However, with the storage API, their pipeline is failing with the > following error "Unexpected value :0, type: class java.lang.Integer. Table > field name: a, type: FLOAT64" Customer considers this as a regression and is > preventing them from going live with Storage API. Customer is planning on > going live with Storage API end of July and would like to have this fixed in > the 2.39.0 GA release. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-14529) Beam BQIO not accepting integer values for BQ table fields of type Float64
[ https://issues.apache.org/jira/browse/BEAM-14529?focusedWorklogId=778325&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778325 ] ASF GitHub Bot logged work on BEAM-14529: - Author: ASF GitHub Bot Created on: 03/Jun/22 21:26 Start Date: 03/Jun/22 21:26 Worklog Time Spent: 10m Work Description: reuvenlax commented on PR #17779: URL: https://github.com/apache/beam/pull/17779#issuecomment-1146374757 Run Java PreCommit Issue Time Tracking --- Worklog Id: (was: 778325) Time Spent: 1h 50m (was: 1h 40m) > Beam BQIO not accepting integer values for BQ table fields of type Float64 > -- > > Key: BEAM-14529 > URL: https://issues.apache.org/jira/browse/BEAM-14529 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Reporter: Yiru Tang >Assignee: Yiru Tang >Priority: P2 > Fix For: 2.40.0 > > Time Spent: 1h 50m > Remaining Estimate: 0h > > Customer (F5) is working on migrating to the new Storage API support in Beam > SDK and are facing issues in the latest Snapshot build. Today, in production, > they are using the legacy streaming API and their pipeline works fine when > they are mapping an incoming integer field to a BQ table field of type > Float64. However, with the storage API, their pipeline is failing with the > following error "Unexpected value :0, type: class java.lang.Integer. Table > field name: a, type: FLOAT64" Customer considers this as a regression and is > preventing them from going live with Storage API. Customer is planning on > going live with Storage API end of July and would like to have this fixed in > the 2.39.0 GA release. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-13806) [Cross-Language] Jenkins integration test for Go SDK BigQuery IO.
[ https://issues.apache.org/jira/browse/BEAM-13806?focusedWorklogId=778318&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778318 ] ASF GitHub Bot logged work on BEAM-13806: - Author: ASF GitHub Bot Created on: 03/Jun/22 21:07 Start Date: 03/Jun/22 21:07 Worklog Time Spent: 10m Work Description: youngoli commented on PR #16818: URL: https://github.com/apache/beam/pull/16818#issuecomment-1146363217 I added a test that reads from a Query, and while testing in my own environment I'm getting weird results. The elements output from a query read have all fields as pointers because SQL queries don't match the table's schema. That's pretty normal. But when I made an equivalent struct in Go, I get errors when decoding into that struct, even though all the fields seem to be matching exactly: ``` panic: reflect: Call using struct { Counter *int64 "beam:\"counter\""; Rand_data *struct { Flip *bool "beam:\"flip\""; Num *int64 "beam:\"num\""; Word *string "beam:\"word\"" } "beam:\"rand_data\"" } as type bigquery.TestRowPtrs Full error: while executing Process for Plan[s08-67]: 2: DataSink[S[ptransform-65@localhost:12371]] Coder:W;coder-80>!GWC 3: PCollection[pcollection-72] Out:[2] 4: ParDo[bigquery.ReadFromQueryPipeline.func1] Out:[2] 1: DataSource[S[ptransform-64@localhost:12371], 0] Coder:W;coder-76>!GWC Out:4 caused by: panic: reflect: Call using struct { Counter *int64 "beam:\"counter\""; Rand_data *struct { Flip *bool "beam:\"flip\""; Num *int64 "beam:\"num\""; Word *string "beam:\"word\"" } "beam:\"rand_data\"" } as type bigquery.TestRowPtrs goroutine 60 [running]: runtime/debug.Stack() ... ``` Issue Time Tracking --- Worklog Id: (was: 778318) Time Spent: 8h 50m (was: 8h 40m) > [Cross-Language] Jenkins integration test for Go SDK BigQuery IO. > - > > Key: BEAM-13806 > URL: https://issues.apache.org/jira/browse/BEAM-13806 > Project: Beam > Issue Type: New Feature > Components: cross-language, io-go-gcp, sdk-go >Reporter: Daniel Oliveira >Assignee: Daniel Oliveira >Priority: P2 > Time Spent: 8h 50m > Remaining Estimate: 0h > > Title says it all. Add an integration test for cross-language BigQuery IO > that runs on Jenkins. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-13806) [Cross-Language] Jenkins integration test for Go SDK BigQuery IO.
[ https://issues.apache.org/jira/browse/BEAM-13806?focusedWorklogId=778317&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778317 ] ASF GitHub Bot logged work on BEAM-13806: - Author: ASF GitHub Bot Created on: 03/Jun/22 21:05 Start Date: 03/Jun/22 21:05 Worklog Time Spent: 10m Work Description: codecov[bot] commented on PR #16818: URL: https://github.com/apache/beam/pull/16818#issuecomment-1146361667 # [Codecov](https://codecov.io/gh/apache/beam/pull/16818?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report > Merging [#16818](https://codecov.io/gh/apache/beam/pull/16818?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (c9ce532) into [master](https://codecov.io/gh/apache/beam/commit/23aeca4e373ff5a4de176770d52f15a37653a32e?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (23aeca4) will **increase** coverage by `0.02%`. > The diff coverage is `n/a`. ```diff @@Coverage Diff @@ ## master #16818 +/- ## == + Coverage 74.07% 74.10% +0.02% == Files 697 697 Lines 9192792040 +113 == + Hits6809168202 +111 + Misses 2259122587 -4 - Partials 1245 1251 +6 ``` | Flag | Coverage Δ | | |---|---|---| | go | `50.95% <ø> (+0.19%)` | :arrow_up: | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more. | [Impacted Files](https://codecov.io/gh/apache/beam/pull/16818?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | | |---|---|---| | [sdks/go/pkg/beam/runners/dataflow/dataflow.go](https://codecov.io/gh/apache/beam/pull/16818/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9nby9wa2cvYmVhbS9ydW5uZXJzL2RhdGFmbG93L2RhdGFmbG93Lmdv) | `58.42% <0.00%> (-0.20%)` | :arrow_down: | | [sdks/go/pkg/beam/testing/passert/sum.go](https://codecov.io/gh/apache/beam/pull/16818/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9nby9wa2cvYmVhbS90ZXN0aW5nL3Bhc3NlcnQvc3VtLmdv) | `100.00% <0.00%> (ø)` | | | [sdks/go/pkg/beam/core/runtime/exec/fn.go](https://codecov.io/gh/apache/beam/pull/16818/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9nby9wa2cvYmVhbS9jb3JlL3J1bnRpbWUvZXhlYy9mbi5nbw==) | `69.55% <0.00%> (ø)` | | | [sdks/go/pkg/beam/core/runtime/exec/sdf.go](https://codecov.io/gh/apache/beam/pull/16818/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9nby9wa2cvYmVhbS9jb3JlL3J1bnRpbWUvZXhlYy9zZGYuZ28=) | `70.84% <0.00%> (+0.23%)` | :arrow_up: | | [sdks/go/pkg/beam/core/runtime/exec/pardo.go](https://codecov.io/gh/apache/beam/pull/16818/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9nby9wa2cvYmVhbS9jb3JlL3J1bnRpbWUvZXhlYy9wYXJkby5nbw==) | `59.43% <0.00%> (+0.32%)` | :arrow_up: | | [...o/pkg/beam/io/rtrackers/offsetrange/offsetrange.go](https://codecov.io/gh/apache/beam/pull/16818/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9nby9wa2cvYmVhbS9pby9ydHJhY2tlcnMvb2Zmc2V0cmFuZ2Uvb2Zmc2V0cmFuZ2UuZ28=) | `80.62% <0.00%> (+2.12%)` | :arrow_up: | | [sdks/go/pkg/beam/testing/passert/passert.go](https://codecov.io/gh/apache/beam/pull/16818/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-c2Rrcy9nby9wa2cvYmVhbS90ZXN0aW5nL3Bhc3NlcnQvcGFzc2VydC5nbw==) | `81.11% <0.00%> (+2.36%)` | :arrow_up: | | [sdks/go/pkg/beam/testing/passert/count.go](https://codecov.io/gh/apache/beam/pull/16818/diff?src=pr&el=tree
[jira] [Work logged] (BEAM-14556) Honor custom formatters being installed on the root logging handler
[ https://issues.apache.org/jira/browse/BEAM-14556?focusedWorklogId=778315&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778315 ] ASF GitHub Bot logged work on BEAM-14556: - Author: ASF GitHub Bot Created on: 03/Jun/22 21:02 Start Date: 03/Jun/22 21:02 Worklog Time Spent: 10m Work Description: lukecwik commented on PR #17820: URL: https://github.com/apache/beam/pull/17820#issuecomment-1146360017 Run Java PreCommit Issue Time Tracking --- Worklog Id: (was: 778315) Time Spent: 50m (was: 40m) > Honor custom formatters being installed on the root logging handler > --- > > Key: BEAM-14556 > URL: https://issues.apache.org/jira/browse/BEAM-14556 > Project: Beam > Issue Type: New Feature > Components: runner-dataflow, sdk-java-harness >Reporter: Luke Cwik >Assignee: Luke Cwik >Priority: P2 > Time Spent: 50m > Remaining Estimate: 0h > > This will allow users to create custom formatters integrating with an MDC if > they so choose by using a JvmInitializer to install their custom formatter on > the root logging handler. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-9482) beam_PerformanceTests_Kafka_IO failing due to " provided port is already allocated"
[ https://issues.apache.org/jira/browse/BEAM-9482?focusedWorklogId=778313&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778313 ] ASF GitHub Bot logged work on BEAM-9482: Author: ASF GitHub Bot Created on: 03/Jun/22 20:58 Start Date: 03/Jun/22 20:58 Worklog Time Spent: 10m Work Description: benWize commented on PR #17727: URL: https://github.com/apache/beam/pull/17727#issuecomment-1146356790 Run Java KafkaIO Performance Test Issue Time Tracking --- Worklog Id: (was: 778313) Time Spent: 6h 20m (was: 6h 10m) > beam_PerformanceTests_Kafka_IO failing due to " provided port is already > allocated" > --- > > Key: BEAM-9482 > URL: https://issues.apache.org/jira/browse/BEAM-9482 > Project: Beam > Issue Type: Bug > Components: test-failures >Reporter: Chamikara Madhusanka Jayalath >Assignee: Benjamin Gonzalez >Priority: P1 > Labels: sickbay > Time Spent: 6h 20m > Remaining Estimate: 0h > > For example, > [https://builds.apache.org/view/A-D/view/Beam/view/PerformanceTests/job/beam_PerformanceTests_Kafka_IO/514/console] > > 18:55:33 Error from server (Invalid): error when creating > "/home/jenkins/jenkins-slave/workspace/beam_PerformanceTests_Kafka_IO/src/.test-infra/kubernetes/kafka-cluster/04-outside-services/outside-0.yml": > Service "outside-0" is invalid: spec.ports[0].nodePort: Invalid value: > 32400: provided port is already allocated > 18:55:33 Error from server (Invalid): error when creating > "/home/jenkins/jenkins-slave/workspace/beam_PerformanceTests_Kafka_IO/src/.test-infra/kubernetes/kafka-cluster/04-outside-services/outside-1.yml": > Service "outside-1" is invalid: spec.ports[0].nodePort: Invalid value: > 32401: provided port is already allocated > 18:55:33 Error from server (Invalid): error when creating > "/home/jenkins/jenkins-slave/workspace/beam_PerformanceTests_Kafka_IO/src/.test-infra/kubernetes/kafka-cluster/04-outside-services/outside-2.yml": > Service "outside-2" is invalid: spec.ports[0].nodePort: Invalid value: > 32402: provided port is already allocated > 1 > > Seems like we tried three ports but they were being used. Probably we should > update code to find an unused port dynamically. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-14534) Add an interface to allow users to compress values being written to shuffle
[ https://issues.apache.org/jira/browse/BEAM-14534?focusedWorklogId=778310&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778310 ] ASF GitHub Bot logged work on BEAM-14534: - Author: ASF GitHub Bot Created on: 03/Jun/22 20:52 Start Date: 03/Jun/22 20:52 Worklog Time Spent: 10m Work Description: lukecwik commented on PR #17783: URL: https://github.com/apache/beam/pull/17783#issuecomment-1146352894 > > The other part of the change makes sense to reduce byte[] copies by using ByteString. > > CC: @tudorm > > Maybe I'll pull the ByteString refactoring stuff out into another review just to make this easier? Do you have any particular issues with it using ByteString there? > > The downsides with using Output/Input stream are really too big to ignore here, the performance differences are orders of magnitude in our tests. The main problem is that most "stream" compressor implementations are designed to compress a large amount of data, but in this case we're usually only compressing a few 100-1KB. It makes the overhead from creating/destroying the compressor streams very high (comparatively at least). We ran into this problem both with deflate and zstd, and its one of the reasons we ended up with an interface like this. If its really a non-starter putting this on OSS with a similar interface that's fine though, we can continue maintaining this in our own fork for the time being. > > The PipelineVisitor idea is interesting, although I'm skeptical how well it'd work in practice. For example with a Combine the coder for the data being shuffled is the accumulator coder, not the value coder of the KV. I bet you'd need a bunch of special cases to pick the "right" coder to wrap for various transforms. I'm happy with using ByteStrings as it reduces copies and will review and merge that and then we can revisit this. I was unaware of the additional overhead for using a stream compressor/decompressor vs one that worked on fixed length byte strings. I'm curious to learn more about it though as I might see something that wasn't considered before and also with the migration to Dataflow runner v2 you'll lose the ability to have this option have an effect unless you go the coder route which will have additional benefits since you'll reduce the amount of data being sent over gRPC between the SDK harness and the runner and also how much the runner materializes in shuffle in addition to those other use cases such as side inputs/user state/... Using the PipelineVisitor should work quite well as since you really only need to handle GBK, CoGBK and Combine and that would give you everything that is optimized across runners today to my knowledge. Issue Time Tracking --- Worklog Id: (was: 778310) Time Spent: 1h 40m (was: 1.5h) > Add an interface to allow users to compress values being written to shuffle > --- > > Key: BEAM-14534 > URL: https://issues.apache.org/jira/browse/BEAM-14534 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow >Reporter: Steve Niemitz >Assignee: Steve Niemitz >Priority: P2 > Time Spent: 1h 40m > Remaining Estimate: 0h > > Frequently values being shuffled are large and compressible, while users can > compress them on their own by using a coder that compresses the data, it > would be nice to be able to do so globally for all values. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-10496) Eliminate nullability errors from :sdks:java:core
[ https://issues.apache.org/jira/browse/BEAM-10496?focusedWorklogId=778308&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778308 ] ASF GitHub Bot logged work on BEAM-10496: - Author: ASF GitHub Bot Created on: 03/Jun/22 20:52 Start Date: 03/Jun/22 20:52 Worklog Time Spent: 10m Work Description: kennknowles commented on PR #17819: URL: https://github.com/apache/beam/pull/17819#issuecomment-1146352708 Run Java PreCommit Issue Time Tracking --- Worklog Id: (was: 778308) Time Spent: 1.5h (was: 1h 20m) > Eliminate nullability errors from :sdks:java:core > - > > Key: BEAM-10496 > URL: https://issues.apache.org/jira/browse/BEAM-10496 > Project: Beam > Issue Type: Sub-task > Components: sdk-java-core >Reporter: Kenneth Knowles >Priority: P3 > Labels: Clarified, starter > Time Spent: 1.5h > Remaining Estimate: 0h > > Just edit {{build.gradle}} and set {{enableChecker: true}} and fix some > errors! > As of 2020-07-20 setting {{-Xmaxerrs 1}} there are 1451 errors in the > core Java SDK. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-13756) [Playground] Merge Log and Output tabs into one and add there filtering
[ https://issues.apache.org/jira/browse/BEAM-13756?focusedWorklogId=778307&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778307 ] ASF GitHub Bot logged work on BEAM-13756: - Author: ASF GitHub Bot Created on: 03/Jun/22 20:51 Start Date: 03/Jun/22 20:51 Worklog Time Spent: 10m Work Description: pabloem merged PR #17792: URL: https://github.com/apache/beam/pull/17792 Issue Time Tracking --- Worklog Id: (was: 778307) Time Spent: 1h (was: 50m) > [Playground] Merge Log and Output tabs into one and add there filtering > --- > > Key: BEAM-13756 > URL: https://issues.apache.org/jira/browse/BEAM-13756 > Project: Beam > Issue Type: Improvement > Components: beam-playground >Reporter: Artur Khanin >Assignee: Alexander Zhuravlev >Priority: P3 > Labels: beam-playground-frontend > Time Spent: 1h > Remaining Estimate: 0h > > As a Beam Playground {*}User{*}, I want to have one tab with Logs and Outputs > so I can filter them to see just Logs, Outputs, or both. > *Acceptance criteria:* > * Logs and Outputs are at the same tab > * Logs and Outputs appear as they come > * Logs and Outputs can be filtered > * Logs and Outputs are showed both as default -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-13756) [Playground] Merge Log and Output tabs into one and add there filtering
[ https://issues.apache.org/jira/browse/BEAM-13756?focusedWorklogId=778306&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778306 ] ASF GitHub Bot logged work on BEAM-13756: - Author: ASF GitHub Bot Created on: 03/Jun/22 20:51 Start Date: 03/Jun/22 20:51 Worklog Time Spent: 10m Work Description: pabloem commented on PR #17792: URL: https://github.com/apache/beam/pull/17792#issuecomment-1146352247 LGTM thanks all! Issue Time Tracking --- Worklog Id: (was: 778306) Time Spent: 50m (was: 40m) > [Playground] Merge Log and Output tabs into one and add there filtering > --- > > Key: BEAM-13756 > URL: https://issues.apache.org/jira/browse/BEAM-13756 > Project: Beam > Issue Type: Improvement > Components: beam-playground >Reporter: Artur Khanin >Assignee: Alexander Zhuravlev >Priority: P3 > Labels: beam-playground-frontend > Time Spent: 50m > Remaining Estimate: 0h > > As a Beam Playground {*}User{*}, I want to have one tab with Logs and Outputs > so I can filter them to see just Logs, Outputs, or both. > *Acceptance criteria:* > * Logs and Outputs are at the same tab > * Logs and Outputs appear as they come > * Logs and Outputs can be filtered > * Logs and Outputs are showed both as default -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-13945) Update BQ connector to support new JSON type
[ https://issues.apache.org/jira/browse/BEAM-13945?focusedWorklogId=778305&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778305 ] ASF GitHub Bot logged work on BEAM-13945: - Author: ASF GitHub Bot Created on: 03/Jun/22 20:50 Start Date: 03/Jun/22 20:50 Worklog Time Spent: 10m Work Description: pabloem merged PR #18374: URL: https://github.com/apache/beam/pull/18374 Issue Time Tracking --- Worklog Id: (was: 778305) Time Spent: 15.5h (was: 15h 20m) > Update BQ connector to support new JSON type > > > Key: BEAM-13945 > URL: https://issues.apache.org/jira/browse/BEAM-13945 > Project: Beam > Issue Type: New Feature > Components: io-java-gcp >Reporter: Chamikara Madhusanka Jayalath >Assignee: Ahmed Abualsaud >Priority: P2 > Time Spent: 15.5h > Remaining Estimate: 0h > > BQ has a new JSON type that is defined here: > https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#json_type > We should update Beam BQ Java and Python connectors to support that for > various read methods (export jobs, storage API) and write methods (load jobs, > streaming inserts, storage API). > We should also add integration tests that exercise reading from /writing to > BQ tables with columns that has JSON type. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-9482) beam_PerformanceTests_Kafka_IO failing due to " provided port is already allocated"
[ https://issues.apache.org/jira/browse/BEAM-9482?focusedWorklogId=778304&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778304 ] ASF GitHub Bot logged work on BEAM-9482: Author: ASF GitHub Bot Created on: 03/Jun/22 20:50 Start Date: 03/Jun/22 20:50 Worklog Time Spent: 10m Work Description: benWize commented on PR #17727: URL: https://github.com/apache/beam/pull/17727#issuecomment-1146351526 > sorry - I ran SEed Job again - can you try it once more after this job? https://ci-beam.apache.org/job/beam_SeedJob/9774/ Ok, thanks Issue Time Tracking --- Worklog Id: (was: 778304) Time Spent: 6h 10m (was: 6h) > beam_PerformanceTests_Kafka_IO failing due to " provided port is already > allocated" > --- > > Key: BEAM-9482 > URL: https://issues.apache.org/jira/browse/BEAM-9482 > Project: Beam > Issue Type: Bug > Components: test-failures >Reporter: Chamikara Madhusanka Jayalath >Assignee: Benjamin Gonzalez >Priority: P1 > Labels: sickbay > Time Spent: 6h 10m > Remaining Estimate: 0h > > For example, > [https://builds.apache.org/view/A-D/view/Beam/view/PerformanceTests/job/beam_PerformanceTests_Kafka_IO/514/console] > > 18:55:33 Error from server (Invalid): error when creating > "/home/jenkins/jenkins-slave/workspace/beam_PerformanceTests_Kafka_IO/src/.test-infra/kubernetes/kafka-cluster/04-outside-services/outside-0.yml": > Service "outside-0" is invalid: spec.ports[0].nodePort: Invalid value: > 32400: provided port is already allocated > 18:55:33 Error from server (Invalid): error when creating > "/home/jenkins/jenkins-slave/workspace/beam_PerformanceTests_Kafka_IO/src/.test-infra/kubernetes/kafka-cluster/04-outside-services/outside-1.yml": > Service "outside-1" is invalid: spec.ports[0].nodePort: Invalid value: > 32401: provided port is already allocated > 18:55:33 Error from server (Invalid): error when creating > "/home/jenkins/jenkins-slave/workspace/beam_PerformanceTests_Kafka_IO/src/.test-infra/kubernetes/kafka-cluster/04-outside-services/outside-2.yml": > Service "outside-2" is invalid: spec.ports[0].nodePort: Invalid value: > 32402: provided port is already allocated > 1 > > Seems like we tried three ports but they were being used. Probably we should > update code to find an unused port dynamically. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-9482) beam_PerformanceTests_Kafka_IO failing due to " provided port is already allocated"
[ https://issues.apache.org/jira/browse/BEAM-9482?focusedWorklogId=778302&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778302 ] ASF GitHub Bot logged work on BEAM-9482: Author: ASF GitHub Bot Created on: 03/Jun/22 20:48 Start Date: 03/Jun/22 20:48 Worklog Time Spent: 10m Work Description: pabloem commented on PR #17727: URL: https://github.com/apache/beam/pull/17727#issuecomment-1146350371 sorry - I ran SEed Job again - can you try it once more after this job? https://ci-beam.apache.org/job/beam_SeedJob/9774/ Issue Time Tracking --- Worklog Id: (was: 778302) Time Spent: 6h (was: 5h 50m) > beam_PerformanceTests_Kafka_IO failing due to " provided port is already > allocated" > --- > > Key: BEAM-9482 > URL: https://issues.apache.org/jira/browse/BEAM-9482 > Project: Beam > Issue Type: Bug > Components: test-failures >Reporter: Chamikara Madhusanka Jayalath >Assignee: Benjamin Gonzalez >Priority: P1 > Labels: sickbay > Time Spent: 6h > Remaining Estimate: 0h > > For example, > [https://builds.apache.org/view/A-D/view/Beam/view/PerformanceTests/job/beam_PerformanceTests_Kafka_IO/514/console] > > 18:55:33 Error from server (Invalid): error when creating > "/home/jenkins/jenkins-slave/workspace/beam_PerformanceTests_Kafka_IO/src/.test-infra/kubernetes/kafka-cluster/04-outside-services/outside-0.yml": > Service "outside-0" is invalid: spec.ports[0].nodePort: Invalid value: > 32400: provided port is already allocated > 18:55:33 Error from server (Invalid): error when creating > "/home/jenkins/jenkins-slave/workspace/beam_PerformanceTests_Kafka_IO/src/.test-infra/kubernetes/kafka-cluster/04-outside-services/outside-1.yml": > Service "outside-1" is invalid: spec.ports[0].nodePort: Invalid value: > 32401: provided port is already allocated > 18:55:33 Error from server (Invalid): error when creating > "/home/jenkins/jenkins-slave/workspace/beam_PerformanceTests_Kafka_IO/src/.test-infra/kubernetes/kafka-cluster/04-outside-services/outside-2.yml": > Service "outside-2" is invalid: spec.ports[0].nodePort: Invalid value: > 32402: provided port is already allocated > 1 > > Seems like we tried three ports but they were being used. Probably we should > update code to find an unused port dynamically. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-9482) beam_PerformanceTests_Kafka_IO failing due to " provided port is already allocated"
[ https://issues.apache.org/jira/browse/BEAM-9482?focusedWorklogId=778300&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778300 ] ASF GitHub Bot logged work on BEAM-9482: Author: ASF GitHub Bot Created on: 03/Jun/22 20:45 Start Date: 03/Jun/22 20:45 Worklog Time Spent: 10m Work Description: benWize commented on PR #17727: URL: https://github.com/apache/beam/pull/17727#issuecomment-1146348170 Run Java KafkaIO Performance Test Issue Time Tracking --- Worklog Id: (was: 778300) Time Spent: 5h 50m (was: 5h 40m) > beam_PerformanceTests_Kafka_IO failing due to " provided port is already > allocated" > --- > > Key: BEAM-9482 > URL: https://issues.apache.org/jira/browse/BEAM-9482 > Project: Beam > Issue Type: Bug > Components: test-failures >Reporter: Chamikara Madhusanka Jayalath >Assignee: Benjamin Gonzalez >Priority: P1 > Labels: sickbay > Time Spent: 5h 50m > Remaining Estimate: 0h > > For example, > [https://builds.apache.org/view/A-D/view/Beam/view/PerformanceTests/job/beam_PerformanceTests_Kafka_IO/514/console] > > 18:55:33 Error from server (Invalid): error when creating > "/home/jenkins/jenkins-slave/workspace/beam_PerformanceTests_Kafka_IO/src/.test-infra/kubernetes/kafka-cluster/04-outside-services/outside-0.yml": > Service "outside-0" is invalid: spec.ports[0].nodePort: Invalid value: > 32400: provided port is already allocated > 18:55:33 Error from server (Invalid): error when creating > "/home/jenkins/jenkins-slave/workspace/beam_PerformanceTests_Kafka_IO/src/.test-infra/kubernetes/kafka-cluster/04-outside-services/outside-1.yml": > Service "outside-1" is invalid: spec.ports[0].nodePort: Invalid value: > 32401: provided port is already allocated > 18:55:33 Error from server (Invalid): error when creating > "/home/jenkins/jenkins-slave/workspace/beam_PerformanceTests_Kafka_IO/src/.test-infra/kubernetes/kafka-cluster/04-outside-services/outside-2.yml": > Service "outside-2" is invalid: spec.ports[0].nodePort: Invalid value: > 32402: provided port is already allocated > 1 > > Seems like we tried three ports but they were being used. Probably we should > update code to find an unused port dynamically. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-9482) beam_PerformanceTests_Kafka_IO failing due to " provided port is already allocated"
[ https://issues.apache.org/jira/browse/BEAM-9482?focusedWorklogId=778295&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778295 ] ASF GitHub Bot logged work on BEAM-9482: Author: ASF GitHub Bot Created on: 03/Jun/22 20:35 Start Date: 03/Jun/22 20:35 Worklog Time Spent: 10m Work Description: pabloem commented on PR #17727: URL: https://github.com/apache/beam/pull/17727#issuecomment-1146341544 https://ci-beam.apache.org/job/beam_SeedJob/9772/console I'll record a quick tutorial later Issue Time Tracking --- Worklog Id: (was: 778295) Time Spent: 5h 40m (was: 5.5h) > beam_PerformanceTests_Kafka_IO failing due to " provided port is already > allocated" > --- > > Key: BEAM-9482 > URL: https://issues.apache.org/jira/browse/BEAM-9482 > Project: Beam > Issue Type: Bug > Components: test-failures >Reporter: Chamikara Madhusanka Jayalath >Assignee: Benjamin Gonzalez >Priority: P1 > Labels: sickbay > Time Spent: 5h 40m > Remaining Estimate: 0h > > For example, > [https://builds.apache.org/view/A-D/view/Beam/view/PerformanceTests/job/beam_PerformanceTests_Kafka_IO/514/console] > > 18:55:33 Error from server (Invalid): error when creating > "/home/jenkins/jenkins-slave/workspace/beam_PerformanceTests_Kafka_IO/src/.test-infra/kubernetes/kafka-cluster/04-outside-services/outside-0.yml": > Service "outside-0" is invalid: spec.ports[0].nodePort: Invalid value: > 32400: provided port is already allocated > 18:55:33 Error from server (Invalid): error when creating > "/home/jenkins/jenkins-slave/workspace/beam_PerformanceTests_Kafka_IO/src/.test-infra/kubernetes/kafka-cluster/04-outside-services/outside-1.yml": > Service "outside-1" is invalid: spec.ports[0].nodePort: Invalid value: > 32401: provided port is already allocated > 18:55:33 Error from server (Invalid): error when creating > "/home/jenkins/jenkins-slave/workspace/beam_PerformanceTests_Kafka_IO/src/.test-infra/kubernetes/kafka-cluster/04-outside-services/outside-2.yml": > Service "outside-2" is invalid: spec.ports[0].nodePort: Invalid value: > 32402: provided port is already allocated > 1 > > Seems like we tried three ports but they were being used. Probably we should > update code to find an unused port dynamically. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-14068) RunInference Benchmarking tests
[ https://issues.apache.org/jira/browse/BEAM-14068?focusedWorklogId=778291&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778291 ] ASF GitHub Bot logged work on BEAM-14068: - Author: ASF GitHub Bot Created on: 03/Jun/22 20:25 Start Date: 03/Jun/22 20:25 Worklog Time Spent: 10m Work Description: AnandInguva commented on PR #17462: URL: https://github.com/apache/beam/pull/17462#issuecomment-1146334152 Tested postcommit -> https://ci-beam.apache.org/job/beam_PostCommit_Python39_PR/24/testReport/ Issue Time Tracking --- Worklog Id: (was: 778291) Time Spent: 10h 20m (was: 10h 10m) > RunInference Benchmarking tests > --- > > Key: BEAM-14068 > URL: https://issues.apache.org/jira/browse/BEAM-14068 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-core >Reporter: Anand Inguva >Assignee: Anand Inguva >Priority: P2 > Time Spent: 10h 20m > Remaining Estimate: 0h > > RunInference benchmarks will evaluate performance of Pipelines, which > represent common use cases of Beam + Dataflow in Pytorch, sklearn and > possibly TFX. These benchmarks would be the integration tests that exercise > several software components using Beam, PyTorch, Scikit learn and TensorFlow > extended. > we would use the datasets that's available publicly (Eg; Kaggle). > Size: small / 10 GB / 1 TB etc > The default execution runner would be Dataflow unless specified otherwise. > These tests would be run very less frequently(every release cycle). -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-14553) Dataflow portable job submission translate FileResultCoder only with window coder
[ https://issues.apache.org/jira/browse/BEAM-14553?focusedWorklogId=778290&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778290 ] ASF GitHub Bot logged work on BEAM-14553: - Author: ASF GitHub Bot Created on: 03/Jun/22 20:23 Start Date: 03/Jun/22 20:23 Worklog Time Spent: 10m Work Description: y1chi commented on PR #17818: URL: https://github.com/apache/beam/pull/17818#issuecomment-1146332642 Run Java PreCommit Issue Time Tracking --- Worklog Id: (was: 778290) Time Spent: 2h 40m (was: 2.5h) > Dataflow portable job submission translate FileResultCoder only with window > coder > - > > Key: BEAM-14553 > URL: https://issues.apache.org/jira/browse/BEAM-14553 > Project: Beam > Issue Type: Bug > Components: runner-dataflow >Reporter: Yichi Zhang >Priority: P2 > Time Spent: 2h 40m > Remaining Estimate: 0h > > The destination coder is neglected, if there are multiple FileResultCoders > with different destination coder, only first registration is successful. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-14068) RunInference Benchmarking tests
[ https://issues.apache.org/jira/browse/BEAM-14068?focusedWorklogId=778289&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778289 ] ASF GitHub Bot logged work on BEAM-14068: - Author: ASF GitHub Bot Created on: 03/Jun/22 20:16 Start Date: 03/Jun/22 20:16 Worklog Time Spent: 10m Work Description: AnandInguva commented on PR #17462: URL: https://github.com/apache/beam/pull/17462#issuecomment-1146327462 Run Python 3.9 PostCommit Issue Time Tracking --- Worklog Id: (was: 778289) Time Spent: 10h 10m (was: 10h) > RunInference Benchmarking tests > --- > > Key: BEAM-14068 > URL: https://issues.apache.org/jira/browse/BEAM-14068 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-core >Reporter: Anand Inguva >Assignee: Anand Inguva >Priority: P2 > Time Spent: 10h 10m > Remaining Estimate: 0h > > RunInference benchmarks will evaluate performance of Pipelines, which > represent common use cases of Beam + Dataflow in Pytorch, sklearn and > possibly TFX. These benchmarks would be the integration tests that exercise > several software components using Beam, PyTorch, Scikit learn and TensorFlow > extended. > we would use the datasets that's available publicly (Eg; Kaggle). > Size: small / 10 GB / 1 TB etc > The default execution runner would be Dataflow unless specified otherwise. > These tests would be run very less frequently(every release cycle). -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-14068) RunInference Benchmarking tests
[ https://issues.apache.org/jira/browse/BEAM-14068?focusedWorklogId=778288&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778288 ] ASF GitHub Bot logged work on BEAM-14068: - Author: ASF GitHub Bot Created on: 03/Jun/22 20:14 Start Date: 03/Jun/22 20:14 Worklog Time Spent: 10m Work Description: AnandInguva commented on PR #17462: URL: https://github.com/apache/beam/pull/17462#issuecomment-1146326503 @tvalentyn Changed the last few things. I kept run() and run_pipeline() as separate because there are plans to return an pipeline object from `run_pipeline()` for benchmark tests(still working on that). So I thought may be its better to have a separate method. Right now, I have a different PR working on top of this code by returning a pipeline object from `run_pipeline()` in a different file. I will try to refactor the code later if thats okay. Issue Time Tracking --- Worklog Id: (was: 778288) Time Spent: 10h (was: 9h 50m) > RunInference Benchmarking tests > --- > > Key: BEAM-14068 > URL: https://issues.apache.org/jira/browse/BEAM-14068 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-core >Reporter: Anand Inguva >Assignee: Anand Inguva >Priority: P2 > Time Spent: 10h > Remaining Estimate: 0h > > RunInference benchmarks will evaluate performance of Pipelines, which > represent common use cases of Beam + Dataflow in Pytorch, sklearn and > possibly TFX. These benchmarks would be the integration tests that exercise > several software components using Beam, PyTorch, Scikit learn and TensorFlow > extended. > we would use the datasets that's available publicly (Eg; Kaggle). > Size: small / 10 GB / 1 TB etc > The default execution runner would be Dataflow unless specified otherwise. > These tests would be run very less frequently(every release cycle). -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-14337) Support **kwargs for PyTorch models.
[ https://issues.apache.org/jira/browse/BEAM-14337?focusedWorklogId=778281&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778281 ] ASF GitHub Bot logged work on BEAM-14337: - Author: ASF GitHub Bot Created on: 03/Jun/22 20:02 Start Date: 03/Jun/22 20:02 Worklog Time Spent: 10m Work Description: yeandy commented on PR #17470: URL: https://github.com/apache/beam/pull/17470#issuecomment-1146317553 retest this please Issue Time Tracking --- Worklog Id: (was: 778281) Time Spent: 8h 20m (was: 8h 10m) > Support **kwargs for PyTorch models. > > > Key: BEAM-14337 > URL: https://issues.apache.org/jira/browse/BEAM-14337 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-core >Reporter: Anand Inguva >Assignee: Andy Ye >Priority: P2 > Time Spent: 8h 20m > Remaining Estimate: 0h > > Some models in Pytorch instantiating from torch.nn.Module, has extra > parameters in the forward function call. These extra parameters can be passed > as Dict or as positional arguments. > Example of PyTorch models supported by Hugging Face -> > [https://huggingface.co/bert-base-uncased] > [Some torch models on Hugging > face|https://github.com/huggingface/transformers/blob/main/src/transformers/models/bert/modeling_bert.py] > Eg: > [https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertModel] > {code:java} > inputs = { > input_ids: Tensor1, > attention_mask: Tensor2, > token_type_ids: Tensor3, > } > model = BertModel.from_pretrained("bert-base-uncased") # which is a > # subclass of torch.nn.Module > outputs = model(**inputs) # model forward method should be expecting the keys > in the inputs as the positional arguments.{code} > > [Transformers|https://pytorch.org/hub/huggingface_pytorch-transformers/] > integrated in Pytorch is supported by Hugging Face as well. > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-14068) RunInference Benchmarking tests
[ https://issues.apache.org/jira/browse/BEAM-14068?focusedWorklogId=778279&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778279 ] ASF GitHub Bot logged work on BEAM-14068: - Author: ASF GitHub Bot Created on: 03/Jun/22 19:55 Start Date: 03/Jun/22 19:55 Worklog Time Spent: 10m Work Description: tvalentyn commented on code in PR #17462: URL: https://github.com/apache/beam/pull/17462#discussion_r889238242 ## sdks/python/scripts/run_integration_test.sh: ## @@ -213,9 +218,13 @@ if [[ -z $PIPELINE_OPTS ]]; then # Install test dependencies for ValidatesRunner tests. # pyhamcrest==1.10.0 doesn't work on Py2. # See: https://github.com/hamcrest/PyHamcrest/issues/131. - echo "pyhamcrest!=1.10.0,<2.0.0" > postcommit_requirements.txt Review Comment: there is .txt extension is missing here, but present in .gitignore ## sdks/python/apache_beam/examples/inference/pytorch_image_classification.py: ## @@ -0,0 +1,160 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +A pipeline that uses RunInference API to perform image classification.""" + +import argparse +import io +import os +from functools import partial +from typing import Dict +from typing import Iterable +from typing import Optional +from typing import Tuple + +import apache_beam as beam +import torch +from apache_beam.io.filesystems import FileSystems +from apache_beam.ml.inference.api import PredictionResult +from apache_beam.ml.inference.api import RunInference +from apache_beam.ml.inference.pytorch_inference import PytorchModelLoader +from apache_beam.options.pipeline_options import PipelineOptions +from apache_beam.options.pipeline_options import SetupOptions +from PIL import Image +from torchvision import transforms +from torchvision.models.mobilenetv2 import MobileNetV2 + + +def read_image(image_file_name: str, + path_to_dir: Optional[str] = None) -> Tuple[str, Image.Image]: + if path_to_dir is not None: +image_file_name = os.path.join(path_to_dir, image_file_name) + with FileSystems().open(image_file_name, 'r') as file: +data = Image.open(io.BytesIO(file.read())).convert('RGB') +return image_file_name, data + + +def preprocess_image(data: Image.Image) -> torch.Tensor: + image_size = (224, 224) + # Pre-trained PyTorch models expect input images normalized the Review Comment: ```suggestion # Pre-trained PyTorch models expect input images normalized with the ``` ## sdks/python/apache_beam/examples/inference/pytorch_image_classification.py: ## @@ -0,0 +1,160 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +A pipeline that uses RunInference API to perform image classification.""" + +import argparse +import io +import os +from functools import partial +from typing import Dict +from typing import Iterable +from typing import Optional +from typing import Tuple + +import apache_beam as beam +import torch +from apache_beam.io.filesystems import FileSystems +from apache_beam.ml.inference.api import PredictionResult +from apache_beam.ml.inference.api import RunInference +from apache_beam.ml.inference.pytorch_inference import PytorchModelLoader +from apache_beam.options.pipeline_options import PipelineOptions +from apache_beam.options.pipeline_options import SetupOptions +from PIL import Image +from torchvision import transforms +from torch
[jira] [Work logged] (BEAM-12572) All beam examples should get continuously exercised
[ https://issues.apache.org/jira/browse/BEAM-12572?focusedWorklogId=778278&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778278 ] ASF GitHub Bot logged work on BEAM-12572: - Author: ASF GitHub Bot Created on: 03/Jun/22 19:55 Start Date: 03/Jun/22 19:55 Worklog Time Spent: 10m Work Description: kennknowles commented on PR #17015: URL: https://github.com/apache/beam/pull/17015#issuecomment-1146312464 I'll review Issue Time Tracking --- Worklog Id: (was: 778278) Time Spent: 71h (was: 70h 50m) > All beam examples should get continuously exercised > --- > > Key: BEAM-12572 > URL: https://issues.apache.org/jira/browse/BEAM-12572 > Project: Beam > Issue Type: New Feature > Components: testing >Reporter: Valentyn Tymofieiev >Priority: P3 > Time Spent: 71h > Remaining Estimate: 0h > > Sometimes our examples become broken without us noticing. For example, see: > https://lists.apache.org/thread.html/r45340bbee91a6caf798fe62d24388f645f8792cc7506351fd66adec3%40%3Cdev.beam.apache.org%3E > We should test our examples better. > Test examples on Dataflow, Spark, Flink and Direct runners -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-12572) All beam examples should get continuously exercised
[ https://issues.apache.org/jira/browse/BEAM-12572?focusedWorklogId=778277&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778277 ] ASF GitHub Bot logged work on BEAM-12572: - Author: ASF GitHub Bot Created on: 03/Jun/22 19:32 Start Date: 03/Jun/22 19:32 Worklog Time Spent: 10m Work Description: aaltay commented on PR #17015: URL: https://github.com/apache/beam/pull/17015#issuecomment-1146295253 I think @kileys or @kennknowles would be the right people to review this. I will ping them. Issue Time Tracking --- Worklog Id: (was: 778277) Time Spent: 70h 50m (was: 70h 40m) > All beam examples should get continuously exercised > --- > > Key: BEAM-12572 > URL: https://issues.apache.org/jira/browse/BEAM-12572 > Project: Beam > Issue Type: New Feature > Components: testing >Reporter: Valentyn Tymofieiev >Priority: P3 > Time Spent: 70h 50m > Remaining Estimate: 0h > > Sometimes our examples become broken without us noticing. For example, see: > https://lists.apache.org/thread.html/r45340bbee91a6caf798fe62d24388f645f8792cc7506351fd66adec3%40%3Cdev.beam.apache.org%3E > We should test our examples better. > Test examples on Dataflow, Spark, Flink and Direct runners -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-9482) beam_PerformanceTests_Kafka_IO failing due to " provided port is already allocated"
[ https://issues.apache.org/jira/browse/BEAM-9482?focusedWorklogId=778276&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778276 ] ASF GitHub Bot logged work on BEAM-9482: Author: ASF GitHub Bot Created on: 03/Jun/22 19:31 Start Date: 03/Jun/22 19:31 Worklog Time Spent: 10m Work Description: aaltay commented on PR #17727: URL: https://github.com/apache/beam/pull/17727#issuecomment-1146294860 @pabloem - could you please run the seed job for this PR? (I tried using your instructions from your email today but failed to have a successful run.) Issue Time Tracking --- Worklog Id: (was: 778276) Time Spent: 5.5h (was: 5h 20m) > beam_PerformanceTests_Kafka_IO failing due to " provided port is already > allocated" > --- > > Key: BEAM-9482 > URL: https://issues.apache.org/jira/browse/BEAM-9482 > Project: Beam > Issue Type: Bug > Components: test-failures >Reporter: Chamikara Madhusanka Jayalath >Assignee: Benjamin Gonzalez >Priority: P1 > Labels: sickbay > Time Spent: 5.5h > Remaining Estimate: 0h > > For example, > [https://builds.apache.org/view/A-D/view/Beam/view/PerformanceTests/job/beam_PerformanceTests_Kafka_IO/514/console] > > 18:55:33 Error from server (Invalid): error when creating > "/home/jenkins/jenkins-slave/workspace/beam_PerformanceTests_Kafka_IO/src/.test-infra/kubernetes/kafka-cluster/04-outside-services/outside-0.yml": > Service "outside-0" is invalid: spec.ports[0].nodePort: Invalid value: > 32400: provided port is already allocated > 18:55:33 Error from server (Invalid): error when creating > "/home/jenkins/jenkins-slave/workspace/beam_PerformanceTests_Kafka_IO/src/.test-infra/kubernetes/kafka-cluster/04-outside-services/outside-1.yml": > Service "outside-1" is invalid: spec.ports[0].nodePort: Invalid value: > 32401: provided port is already allocated > 18:55:33 Error from server (Invalid): error when creating > "/home/jenkins/jenkins-slave/workspace/beam_PerformanceTests_Kafka_IO/src/.test-infra/kubernetes/kafka-cluster/04-outside-services/outside-2.yml": > Service "outside-2" is invalid: spec.ports[0].nodePort: Invalid value: > 32402: provided port is already allocated > 1 > > Seems like we tried three ports but they were being used. Probably we should > update code to find an unused port dynamically. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-9482) beam_PerformanceTests_Kafka_IO failing due to " provided port is already allocated"
[ https://issues.apache.org/jira/browse/BEAM-9482?focusedWorklogId=778274&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778274 ] ASF GitHub Bot logged work on BEAM-9482: Author: ASF GitHub Bot Created on: 03/Jun/22 19:28 Start Date: 03/Jun/22 19:28 Worklog Time Spent: 10m Work Description: aaltay commented on PR #17727: URL: https://github.com/apache/beam/pull/17727#issuecomment-1146293085 Run Seed Job Issue Time Tracking --- Worklog Id: (was: 778274) Time Spent: 5h 20m (was: 5h 10m) > beam_PerformanceTests_Kafka_IO failing due to " provided port is already > allocated" > --- > > Key: BEAM-9482 > URL: https://issues.apache.org/jira/browse/BEAM-9482 > Project: Beam > Issue Type: Bug > Components: test-failures >Reporter: Chamikara Madhusanka Jayalath >Assignee: Benjamin Gonzalez >Priority: P1 > Labels: sickbay > Time Spent: 5h 20m > Remaining Estimate: 0h > > For example, > [https://builds.apache.org/view/A-D/view/Beam/view/PerformanceTests/job/beam_PerformanceTests_Kafka_IO/514/console] > > 18:55:33 Error from server (Invalid): error when creating > "/home/jenkins/jenkins-slave/workspace/beam_PerformanceTests_Kafka_IO/src/.test-infra/kubernetes/kafka-cluster/04-outside-services/outside-0.yml": > Service "outside-0" is invalid: spec.ports[0].nodePort: Invalid value: > 32400: provided port is already allocated > 18:55:33 Error from server (Invalid): error when creating > "/home/jenkins/jenkins-slave/workspace/beam_PerformanceTests_Kafka_IO/src/.test-infra/kubernetes/kafka-cluster/04-outside-services/outside-1.yml": > Service "outside-1" is invalid: spec.ports[0].nodePort: Invalid value: > 32401: provided port is already allocated > 18:55:33 Error from server (Invalid): error when creating > "/home/jenkins/jenkins-slave/workspace/beam_PerformanceTests_Kafka_IO/src/.test-infra/kubernetes/kafka-cluster/04-outside-services/outside-2.yml": > Service "outside-2" is invalid: spec.ports[0].nodePort: Invalid value: > 32402: provided port is already allocated > 1 > > Seems like we tried three ports but they were being used. Probably we should > update code to find an unused port dynamically. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-11104) [Go SDK] DoFn Self Checkpointing
[ https://issues.apache.org/jira/browse/BEAM-11104?focusedWorklogId=778272&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778272 ] ASF GitHub Bot logged work on BEAM-11104: - Author: ASF GitHub Bot Created on: 03/Jun/22 19:20 Start Date: 03/Jun/22 19:20 Worklog Time Spent: 10m Work Description: github-actions[bot] commented on PR #17956: URL: https://github.com/apache/beam/pull/17956#issuecomment-1146286647 Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control Issue Time Tracking --- Worklog Id: (was: 778272) Time Spent: 28h 40m (was: 28.5h) > [Go SDK] DoFn Self Checkpointing > > > Key: BEAM-11104 > URL: https://issues.apache.org/jira/browse/BEAM-11104 > Project: Beam > Issue Type: Sub-task > Components: sdk-go >Reporter: Robert Burke >Assignee: Jack McCluskey >Priority: P3 > Fix For: 2.40.0 > > Time Spent: 28h 40m > Remaining Estimate: 0h > > Allow SplittableDoFns to self checkpoint. > Design doc: > [https://docs.google.com/document/d/1_JbzjY9JR07ZK5v7PcZevUfzHPsqwzfV7W6AouNpMPk/edit?usp=sharing] > > Feature is written E2E and users will be able to return ProcessContinuations > from SDFs as of 2.39.0 but the full behavior has not been fully validated. An > integration test that validates self-checkpointing is working as-intended > will need to be written and passing before the feature is no longer > considered experimental and this ticket is marked as resolved. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-11104) [Go SDK] DoFn Self Checkpointing
[ https://issues.apache.org/jira/browse/BEAM-11104?focusedWorklogId=778271&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778271 ] ASF GitHub Bot logged work on BEAM-11104: - Author: ASF GitHub Bot Created on: 03/Jun/22 19:19 Start Date: 03/Jun/22 19:19 Worklog Time Spent: 10m Work Description: jrmccluskey commented on PR #17956: URL: https://github.com/apache/beam/pull/17956#issuecomment-1146285870 R: @lostluck Issue Time Tracking --- Worklog Id: (was: 778271) Time Spent: 28.5h (was: 28h 20m) > [Go SDK] DoFn Self Checkpointing > > > Key: BEAM-11104 > URL: https://issues.apache.org/jira/browse/BEAM-11104 > Project: Beam > Issue Type: Sub-task > Components: sdk-go >Reporter: Robert Burke >Assignee: Jack McCluskey >Priority: P3 > Fix For: 2.40.0 > > Time Spent: 28.5h > Remaining Estimate: 0h > > Allow SplittableDoFns to self checkpoint. > Design doc: > [https://docs.google.com/document/d/1_JbzjY9JR07ZK5v7PcZevUfzHPsqwzfV7W6AouNpMPk/edit?usp=sharing] > > Feature is written E2E and users will be able to return ProcessContinuations > from SDFs as of 2.39.0 but the full behavior has not been fully validated. An > integration test that validates self-checkpointing is working as-intended > will need to be written and passing before the feature is no longer > considered experimental and this ticket is marked as resolved. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-13647) Go SDK FnApi environment version 9 should be compatible with Runner v2 artifact service
[ https://issues.apache.org/jira/browse/BEAM-13647?focusedWorklogId=778270&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778270 ] ASF GitHub Bot logged work on BEAM-13647: - Author: ASF GitHub Bot Created on: 03/Jun/22 19:18 Start Date: 03/Jun/22 19:18 Worklog Time Spent: 10m Work Description: damccorm commented on PR #18533: URL: https://github.com/apache/beam/pull/18533#issuecomment-1146284943 Run Go Samza ValidatesRunner Issue Time Tracking --- Worklog Id: (was: 778270) Time Spent: 7h (was: 6h 50m) > Go SDK FnApi environment version 9 should be compatible with Runner v2 > artifact service > --- > > Key: BEAM-13647 > URL: https://issues.apache.org/jira/browse/BEAM-13647 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow, sdk-go >Reporter: Heejong Lee >Assignee: Danny McCormick >Priority: P2 > Fix For: 2.40.0 > > Time Spent: 7h > Remaining Estimate: 0h > > Go SDK FnApi environment version 9 should be compatible with Runner v2 > artifact service. We need to test and verify whether Go SDK works well with > Runner v2 artifact service before increasing the version number. > For 2.37 and 2.38 we'll keep both current and old versions as transitional > code, but we should be able to comfortably pull out the old stuff in 2.39. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-13647) Go SDK FnApi environment version 9 should be compatible with Runner v2 artifact service
[ https://issues.apache.org/jira/browse/BEAM-13647?focusedWorklogId=778268&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778268 ] ASF GitHub Bot logged work on BEAM-13647: - Author: ASF GitHub Bot Created on: 03/Jun/22 19:18 Start Date: 03/Jun/22 19:18 Worklog Time Spent: 10m Work Description: damccorm commented on PR #18533: URL: https://github.com/apache/beam/pull/18533#issuecomment-1146284732 Run Go Flink ValidatesRunner Issue Time Tracking --- Worklog Id: (was: 778268) Time Spent: 6h 40m (was: 6.5h) > Go SDK FnApi environment version 9 should be compatible with Runner v2 > artifact service > --- > > Key: BEAM-13647 > URL: https://issues.apache.org/jira/browse/BEAM-13647 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow, sdk-go >Reporter: Heejong Lee >Assignee: Danny McCormick >Priority: P2 > Fix For: 2.40.0 > > Time Spent: 6h 40m > Remaining Estimate: 0h > > Go SDK FnApi environment version 9 should be compatible with Runner v2 > artifact service. We need to test and verify whether Go SDK works well with > Runner v2 artifact service before increasing the version number. > For 2.37 and 2.38 we'll keep both current and old versions as transitional > code, but we should be able to comfortably pull out the old stuff in 2.39. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-13647) Go SDK FnApi environment version 9 should be compatible with Runner v2 artifact service
[ https://issues.apache.org/jira/browse/BEAM-13647?focusedWorklogId=778269&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778269 ] ASF GitHub Bot logged work on BEAM-13647: - Author: ASF GitHub Bot Created on: 03/Jun/22 19:18 Start Date: 03/Jun/22 19:18 Worklog Time Spent: 10m Work Description: damccorm commented on PR #18533: URL: https://github.com/apache/beam/pull/18533#issuecomment-1146284860 Run Go Spark ValidatesRunner Issue Time Tracking --- Worklog Id: (was: 778269) Time Spent: 6h 50m (was: 6h 40m) > Go SDK FnApi environment version 9 should be compatible with Runner v2 > artifact service > --- > > Key: BEAM-13647 > URL: https://issues.apache.org/jira/browse/BEAM-13647 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow, sdk-go >Reporter: Heejong Lee >Assignee: Danny McCormick >Priority: P2 > Fix For: 2.40.0 > > Time Spent: 6h 50m > Remaining Estimate: 0h > > Go SDK FnApi environment version 9 should be compatible with Runner v2 > artifact service. We need to test and verify whether Go SDK works well with > Runner v2 artifact service before increasing the version number. > For 2.37 and 2.38 we'll keep both current and old versions as transitional > code, but we should be able to comfortably pull out the old stuff in 2.39. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-13647) Go SDK FnApi environment version 9 should be compatible with Runner v2 artifact service
[ https://issues.apache.org/jira/browse/BEAM-13647?focusedWorklogId=778267&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778267 ] ASF GitHub Bot logged work on BEAM-13647: - Author: ASF GitHub Bot Created on: 03/Jun/22 19:14 Start Date: 03/Jun/22 19:14 Worklog Time Spent: 10m Work Description: github-actions[bot] commented on PR #18533: URL: https://github.com/apache/beam/pull/18533#issuecomment-1146282207 Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control Issue Time Tracking --- Worklog Id: (was: 778267) Time Spent: 6.5h (was: 6h 20m) > Go SDK FnApi environment version 9 should be compatible with Runner v2 > artifact service > --- > > Key: BEAM-13647 > URL: https://issues.apache.org/jira/browse/BEAM-13647 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow, sdk-go >Reporter: Heejong Lee >Assignee: Danny McCormick >Priority: P2 > Fix For: 2.40.0 > > Time Spent: 6.5h > Remaining Estimate: 0h > > Go SDK FnApi environment version 9 should be compatible with Runner v2 > artifact service. We need to test and verify whether Go SDK works well with > Runner v2 artifact service before increasing the version number. > For 2.37 and 2.38 we'll keep both current and old versions as transitional > code, but we should be able to comfortably pull out the old stuff in 2.39. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-13647) Go SDK FnApi environment version 9 should be compatible with Runner v2 artifact service
[ https://issues.apache.org/jira/browse/BEAM-13647?focusedWorklogId=778264&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778264 ] ASF GitHub Bot logged work on BEAM-13647: - Author: ASF GitHub Bot Created on: 03/Jun/22 19:13 Start Date: 03/Jun/22 19:13 Worklog Time Spent: 10m Work Description: asf-ci commented on PR #18533: URL: https://github.com/apache/beam/pull/18533#issuecomment-1146281115 Can one of the admins verify this patch? Issue Time Tracking --- Worklog Id: (was: 778264) Time Spent: 6h (was: 5h 50m) > Go SDK FnApi environment version 9 should be compatible with Runner v2 > artifact service > --- > > Key: BEAM-13647 > URL: https://issues.apache.org/jira/browse/BEAM-13647 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow, sdk-go >Reporter: Heejong Lee >Assignee: Danny McCormick >Priority: P2 > Fix For: 2.40.0 > > Time Spent: 6h > Remaining Estimate: 0h > > Go SDK FnApi environment version 9 should be compatible with Runner v2 > artifact service. We need to test and verify whether Go SDK works well with > Runner v2 artifact service before increasing the version number. > For 2.37 and 2.38 we'll keep both current and old versions as transitional > code, but we should be able to comfortably pull out the old stuff in 2.39. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-13647) Go SDK FnApi environment version 9 should be compatible with Runner v2 artifact service
[ https://issues.apache.org/jira/browse/BEAM-13647?focusedWorklogId=778266&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778266 ] ASF GitHub Bot logged work on BEAM-13647: - Author: ASF GitHub Bot Created on: 03/Jun/22 19:13 Start Date: 03/Jun/22 19:13 Worklog Time Spent: 10m Work Description: damccorm commented on PR #18533: URL: https://github.com/apache/beam/pull/18533#issuecomment-1146281396 R: @lostluck since you have the most context - anything I'm missing here that makes this unsafe? Issue Time Tracking --- Worklog Id: (was: 778266) Time Spent: 6h 20m (was: 6h 10m) > Go SDK FnApi environment version 9 should be compatible with Runner v2 > artifact service > --- > > Key: BEAM-13647 > URL: https://issues.apache.org/jira/browse/BEAM-13647 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow, sdk-go >Reporter: Heejong Lee >Assignee: Danny McCormick >Priority: P2 > Fix For: 2.40.0 > > Time Spent: 6h 20m > Remaining Estimate: 0h > > Go SDK FnApi environment version 9 should be compatible with Runner v2 > artifact service. We need to test and verify whether Go SDK works well with > Runner v2 artifact service before increasing the version number. > For 2.37 and 2.38 we'll keep both current and old versions as transitional > code, but we should be able to comfortably pull out the old stuff in 2.39. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-13647) Go SDK FnApi environment version 9 should be compatible with Runner v2 artifact service
[ https://issues.apache.org/jira/browse/BEAM-13647?focusedWorklogId=778263&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778263 ] ASF GitHub Bot logged work on BEAM-13647: - Author: ASF GitHub Bot Created on: 03/Jun/22 19:13 Start Date: 03/Jun/22 19:13 Worklog Time Spent: 10m Work Description: damccorm opened a new pull request, #18533: URL: https://github.com/apache/beam/pull/18533 In #16729, we moved to using the RoleUrn for the Go Worker Binary, with the intent of removing the legacy check after a few sprints. That should be safe now. Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily: - [ ] [**Choose reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and mention them in a comment (`R: @username`). - [ ] Add a link to the appropriate issue in your description, if applicable. This will automatically link the pull request to the issue. - [ ] Update `CHANGES.md` with noteworthy changes. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf). See the [Contributor Guide](https://beam.apache.org/contribute) for more tips on [how to make review process smoother](https://beam.apache.org/contribute/#make-reviewers-job-easier). To check the build health, please visit [https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md](https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md) GitHub Actions Tests Status (on master branch) [![Build python source distribution and wheels](https://github.com/apache/beam/workflows/Build%20python%20source%20distribution%20and%20wheels/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Build+python+source+distribution+and+wheels%22+branch%3Amaster+event%3Aschedule) [![Python tests](https://github.com/apache/beam/workflows/Python%20tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Python+Tests%22+branch%3Amaster+event%3Aschedule) [![Java tests](https://github.com/apache/beam/workflows/Java%20Tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Java+Tests%22+branch%3Amaster+event%3Aschedule) See [CI.md](https://github.com/apache/beam/blob/master/CI.md) for more information about GitHub Actions CI. Issue Time Tracking --- Worklog Id: (was: 778263) Time Spent: 5h 50m (was: 5h 40m) > Go SDK FnApi environment version 9 should be compatible with Runner v2 > artifact service > --- > > Key: BEAM-13647 > URL: https://issues.apache.org/jira/browse/BEAM-13647 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow, sdk-go >Reporter: Heejong Lee >Assignee: Danny McCormick >Priority: P2 > Fix For: 2.40.0 > > Time Spent: 5h 50m > Remaining Estimate: 0h > > Go SDK FnApi environment version 9 should be compatible with Runner v2 > artifact service. We need to test and verify whether Go SDK works well with > Runner v2 artifact service before increasing the version number. > For 2.37 and 2.38 we'll keep both current and old versions as transitional > code, but we should be able to comfortably pull out the old stuff in 2.39. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-13647) Go SDK FnApi environment version 9 should be compatible with Runner v2 artifact service
[ https://issues.apache.org/jira/browse/BEAM-13647?focusedWorklogId=778265&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778265 ] ASF GitHub Bot logged work on BEAM-13647: - Author: ASF GitHub Bot Created on: 03/Jun/22 19:13 Start Date: 03/Jun/22 19:13 Worklog Time Spent: 10m Work Description: asf-ci commented on PR #18533: URL: https://github.com/apache/beam/pull/18533#issuecomment-1146281116 Can one of the admins verify this patch? Issue Time Tracking --- Worklog Id: (was: 778265) Time Spent: 6h 10m (was: 6h) > Go SDK FnApi environment version 9 should be compatible with Runner v2 > artifact service > --- > > Key: BEAM-13647 > URL: https://issues.apache.org/jira/browse/BEAM-13647 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow, sdk-go >Reporter: Heejong Lee >Assignee: Danny McCormick >Priority: P2 > Fix For: 2.40.0 > > Time Spent: 6h 10m > Remaining Estimate: 0h > > Go SDK FnApi environment version 9 should be compatible with Runner v2 > artifact service. We need to test and verify whether Go SDK works well with > Runner v2 artifact service before increasing the version number. > For 2.37 and 2.38 we'll keep both current and old versions as transitional > code, but we should be able to comfortably pull out the old stuff in 2.39. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-14504) Add support for including addittional parameters to executebundle method in fhirio.
[ https://issues.apache.org/jira/browse/BEAM-14504?focusedWorklogId=778261&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778261 ] ASF GitHub Bot logged work on BEAM-14504: - Author: ASF GitHub Bot Created on: 03/Jun/22 19:01 Start Date: 03/Jun/22 19:01 Worklog Time Spent: 10m Work Description: pabloem commented on PR #17741: URL: https://github.com/apache/beam/pull/17741#issuecomment-1146273331 issue unrelated. merging. Issue Time Tracking --- Worklog Id: (was: 778261) Time Spent: 8h (was: 7h 50m) > Add support for including addittional parameters to executebundle method in > fhirio. > --- > > Key: BEAM-14504 > URL: https://issues.apache.org/jira/browse/BEAM-14504 > Project: Beam > Issue Type: Improvement > Components: io-java-healthcare >Reporter: Fathima Mohammed >Assignee: Fathima Mohammed >Priority: P2 > Time Spent: 8h > Remaining Estimate: 0h > > Add FhirBundleWithMetadata in executebundles method so that we can pass > additional information like message id. > FhirBundleWithMetadata represents a FHIR bundle, with it's metadata (eg. Hl7 > messageId) to be executed on the intermediate FHIR store. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-14504) Add support for including addittional parameters to executebundle method in fhirio.
[ https://issues.apache.org/jira/browse/BEAM-14504?focusedWorklogId=778262&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778262 ] ASF GitHub Bot logged work on BEAM-14504: - Author: ASF GitHub Bot Created on: 03/Jun/22 19:01 Start Date: 03/Jun/22 19:01 Worklog Time Spent: 10m Work Description: pabloem merged PR #17741: URL: https://github.com/apache/beam/pull/17741 Issue Time Tracking --- Worklog Id: (was: 778262) Time Spent: 8h 10m (was: 8h) > Add support for including addittional parameters to executebundle method in > fhirio. > --- > > Key: BEAM-14504 > URL: https://issues.apache.org/jira/browse/BEAM-14504 > Project: Beam > Issue Type: Improvement > Components: io-java-healthcare >Reporter: Fathima Mohammed >Assignee: Fathima Mohammed >Priority: P2 > Time Spent: 8h 10m > Remaining Estimate: 0h > > Add FhirBundleWithMetadata in executebundles method so that we can pass > additional information like message id. > FhirBundleWithMetadata represents a FHIR bundle, with it's metadata (eg. Hl7 > messageId) to be executed on the intermediate FHIR store. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-13945) Update BQ connector to support new JSON type
[ https://issues.apache.org/jira/browse/BEAM-13945?focusedWorklogId=778258&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778258 ] ASF GitHub Bot logged work on BEAM-13945: - Author: ASF GitHub Bot Created on: 03/Jun/22 18:51 Start Date: 03/Jun/22 18:51 Worklog Time Spent: 10m Work Description: pabloem commented on PR #18374: URL: https://github.com/apache/beam/pull/18374#issuecomment-1146266043 Run Java PreCommit Issue Time Tracking --- Worklog Id: (was: 778258) Time Spent: 15h 20m (was: 15h 10m) > Update BQ connector to support new JSON type > > > Key: BEAM-13945 > URL: https://issues.apache.org/jira/browse/BEAM-13945 > Project: Beam > Issue Type: New Feature > Components: io-java-gcp >Reporter: Chamikara Madhusanka Jayalath >Assignee: Ahmed Abualsaud >Priority: P2 > Time Spent: 15h 20m > Remaining Estimate: 0h > > BQ has a new JSON type that is defined here: > https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#json_type > We should update Beam BQ Java and Python connectors to support that for > various read methods (export jobs, storage API) and write methods (load jobs, > streaming inserts, storage API). > We should also add integration tests that exercise reading from /writing to > BQ tables with columns that has JSON type. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-14068) RunInference Benchmarking tests
[ https://issues.apache.org/jira/browse/BEAM-14068?focusedWorklogId=778252&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778252 ] ASF GitHub Bot logged work on BEAM-14068: - Author: ASF GitHub Bot Created on: 03/Jun/22 18:40 Start Date: 03/Jun/22 18:40 Worklog Time Spent: 10m Work Description: tvalentyn commented on code in PR #17462: URL: https://github.com/apache/beam/pull/17462#discussion_r889236325 ## sdks/python/apache_beam/examples/inference/pytorch_image_classification.py: ## @@ -0,0 +1,160 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +A pipeline that uses RunInference API to perform image classification.""" + +import argparse +import io +import os +from functools import partial +from typing import Dict +from typing import Iterable +from typing import Optional +from typing import Tuple + +import apache_beam as beam +import torch +from apache_beam.io.filesystems import FileSystems +from apache_beam.ml.inference.api import PredictionResult +from apache_beam.ml.inference.api import RunInference +from apache_beam.ml.inference.pytorch_inference import PytorchModelLoader +from apache_beam.options.pipeline_options import PipelineOptions +from apache_beam.options.pipeline_options import SetupOptions +from PIL import Image +from torchvision import transforms +from torchvision.models.mobilenetv2 import MobileNetV2 + + +def read_image(image_file_name: str, + path_to_dir: Optional[str] = None) -> Tuple[str, Image.Image]: + if path_to_dir is not None: +image_file_name = os.path.join(path_to_dir, image_file_name) + with FileSystems().open(image_file_name, 'r') as file: +data = Image.open(io.BytesIO(file.read())).convert('RGB') +return image_file_name, data + + +def preprocess_image(data: Image.Image) -> torch.Tensor: + image_size = (224, 224) + # Pre-trained PyTorch models expect input images normalized the + # below values ref: https://pytorch.org/vision/stable/models.html# + normalize = transforms.Normalize( + mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]) + transform = transforms.Compose([ + transforms.Resize(image_size), + transforms.ToTensor(), + normalize, + ]) + return transform(data) + + +class PostProcessor(beam.DoFn): + def process(self, element: Tuple[str, PredictionResult]) -> Iterable[str]: +filename, prediction_result = element +prediction = torch.argmax(prediction_result.inference, dim=0) +yield filename + ',' + str(prediction.item()) + + +def run_pipeline( +options: PipelineOptions, +model_class: Optional[torch.nn.Module], +model_params: Optional[Dict], +args=None): + """ + Args: +options: options used to set up the pipeline. +model_class: Reference to the class definition of the model. +If None, MobilenetV2 will be used as default . +model_params: Parameters passed to the constructor of the model_class. + These will be used to instantiate the model object in the + RunInference API. +args: Command line arguments defined for this example. + """ + if not model_class: +model_class = MobileNetV2 +model_params = {'num_classes': 1000} + + model_loader = PytorchModelLoader( + state_dict_path=args.model_state_dict_path, + model_class=model_class, + model_params=model_params) + + with beam.Pipeline(options=options) as p: +filename_value_pair = ( +p +| 'ReadImageNames' >> beam.io.ReadFromText( +args.input, skip_header_lines=1) +| 'ReadImageData' >> beam.Map( +partial(read_image, path_to_dir=args.images_dir)) +| 'PreprocessImages' >> beam.MapTuple( +lambda file_name, data: (file_name, preprocess_image(data +predictions = ( +filename_value_pair +| 'PyTorchRunInference' >> RunInference(model_loader).with_output_types( +Tuple[str, PredictionResult]) +| 'ProcessOutput' >> beam.ParDo(PostProcessor())) + +if args.output: + predictions | "WriteOutputToGCS" >> beam.io.WriteToText( # pylint: disable=expression-
[jira] [Work logged] (BEAM-11104) [Go SDK] DoFn Self Checkpointing
[ https://issues.apache.org/jira/browse/BEAM-11104?focusedWorklogId=778251&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778251 ] ASF GitHub Bot logged work on BEAM-11104: - Author: ASF GitHub Bot Created on: 03/Jun/22 18:38 Start Date: 03/Jun/22 18:38 Worklog Time Spent: 10m Work Description: github-actions[bot] commented on PR #17956: URL: https://github.com/apache/beam/pull/17956#issuecomment-1146254738 Assigning reviewers. If you would like to opt out of this review, comment `assign to next reviewer`: R: @riteshghorse for label go. Available commands: - `stop reviewer notifications` - opt out of the automated review tooling - `remind me after tests pass` - tag the comment author after tests pass - `waiting on author` - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers) The PR bot will only process comments in the main thread (not review comments). Issue Time Tracking --- Worklog Id: (was: 778251) Time Spent: 28h 20m (was: 28h 10m) > [Go SDK] DoFn Self Checkpointing > > > Key: BEAM-11104 > URL: https://issues.apache.org/jira/browse/BEAM-11104 > Project: Beam > Issue Type: Sub-task > Components: sdk-go >Reporter: Robert Burke >Assignee: Jack McCluskey >Priority: P3 > Fix For: 2.40.0 > > Time Spent: 28h 20m > Remaining Estimate: 0h > > Allow SplittableDoFns to self checkpoint. > Design doc: > [https://docs.google.com/document/d/1_JbzjY9JR07ZK5v7PcZevUfzHPsqwzfV7W6AouNpMPk/edit?usp=sharing] > > Feature is written E2E and users will be able to return ProcessContinuations > from SDFs as of 2.39.0 but the full behavior has not been fully validated. An > integration test that validates self-checkpointing is working as-intended > will need to be written and passing before the feature is no longer > considered experimental and this ticket is marked as resolved. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-14504) Add support for including addittional parameters to executebundle method in fhirio.
[ https://issues.apache.org/jira/browse/BEAM-14504?focusedWorklogId=778250&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778250 ] ASF GitHub Bot logged work on BEAM-14504: - Author: ASF GitHub Bot Created on: 03/Jun/22 18:30 Start Date: 03/Jun/22 18:30 Worklog Time Spent: 10m Work Description: pabloem commented on PR #17741: URL: https://github.com/apache/beam/pull/17741#issuecomment-1146248768 Run Java PreCommit Issue Time Tracking --- Worklog Id: (was: 778250) Time Spent: 7h 50m (was: 7h 40m) > Add support for including addittional parameters to executebundle method in > fhirio. > --- > > Key: BEAM-14504 > URL: https://issues.apache.org/jira/browse/BEAM-14504 > Project: Beam > Issue Type: Improvement > Components: io-java-healthcare >Reporter: Fathima Mohammed >Assignee: Fathima Mohammed >Priority: P2 > Time Spent: 7h 50m > Remaining Estimate: 0h > > Add FhirBundleWithMetadata in executebundles method so that we can pass > additional information like message id. > FhirBundleWithMetadata represents a FHIR bundle, with it's metadata (eg. Hl7 > messageId) to be executed on the intermediate FHIR store. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-14068) RunInference Benchmarking tests
[ https://issues.apache.org/jira/browse/BEAM-14068?focusedWorklogId=778249&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778249 ] ASF GitHub Bot logged work on BEAM-14068: - Author: ASF GitHub Bot Created on: 03/Jun/22 18:29 Start Date: 03/Jun/22 18:29 Worklog Time Spent: 10m Work Description: AnandInguva commented on PR #17462: URL: https://github.com/apache/beam/pull/17462#issuecomment-1146248454 PTAL @tvalentyn Issue Time Tracking --- Worklog Id: (was: 778249) Time Spent: 9.5h (was: 9h 20m) > RunInference Benchmarking tests > --- > > Key: BEAM-14068 > URL: https://issues.apache.org/jira/browse/BEAM-14068 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-core >Reporter: Anand Inguva >Assignee: Anand Inguva >Priority: P2 > Time Spent: 9.5h > Remaining Estimate: 0h > > RunInference benchmarks will evaluate performance of Pipelines, which > represent common use cases of Beam + Dataflow in Pytorch, sklearn and > possibly TFX. These benchmarks would be the integration tests that exercise > several software components using Beam, PyTorch, Scikit learn and TensorFlow > extended. > we would use the datasets that's available publicly (Eg; Kaggle). > Size: small / 10 GB / 1 TB etc > The default execution runner would be Dataflow unless specified otherwise. > These tests would be run very less frequently(every release cycle). -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-13945) Update BQ connector to support new JSON type
[ https://issues.apache.org/jira/browse/BEAM-13945?focusedWorklogId=778244&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778244 ] ASF GitHub Bot logged work on BEAM-13945: - Author: ASF GitHub Bot Created on: 03/Jun/22 18:12 Start Date: 03/Jun/22 18:12 Worklog Time Spent: 10m Work Description: johnjcasey commented on PR #18374: URL: https://github.com/apache/beam/pull/18374#issuecomment-1146234428 Run Java Precommit Issue Time Tracking --- Worklog Id: (was: 778244) Time Spent: 15h 10m (was: 15h) > Update BQ connector to support new JSON type > > > Key: BEAM-13945 > URL: https://issues.apache.org/jira/browse/BEAM-13945 > Project: Beam > Issue Type: New Feature > Components: io-java-gcp >Reporter: Chamikara Madhusanka Jayalath >Assignee: Ahmed Abualsaud >Priority: P2 > Time Spent: 15h 10m > Remaining Estimate: 0h > > BQ has a new JSON type that is defined here: > https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#json_type > We should update Beam BQ Java and Python connectors to support that for > various read methods (export jobs, storage API) and write methods (load jobs, > streaming inserts, storage API). > We should also add integration tests that exercise reading from /writing to > BQ tables with columns that has JSON type. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-14068) RunInference Benchmarking tests
[ https://issues.apache.org/jira/browse/BEAM-14068?focusedWorklogId=778242&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778242 ] ASF GitHub Bot logged work on BEAM-14068: - Author: ASF GitHub Bot Created on: 03/Jun/22 18:06 Start Date: 03/Jun/22 18:06 Worklog Time Spent: 10m Work Description: AnandInguva commented on code in PR #17462: URL: https://github.com/apache/beam/pull/17462#discussion_r889212195 ## sdks/python/apache_beam/examples/inference/pytorch_image_classification.py: ## @@ -0,0 +1,146 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +"""Pipeline that uses RunInference API to perform classification task on imagenet dataset""" # pylint: disable=line-too-long + +import argparse +import io +import os +from functools import partial +from typing import Any +from typing import Iterable +from typing import Tuple +from typing import Union + +import apache_beam as beam +import torch +from apache_beam.io.filesystems import FileSystems +from apache_beam.ml.inference.api import PredictionResult +from apache_beam.ml.inference.api import RunInference +from apache_beam.ml.inference.pytorch_inference import PytorchModelLoader +from apache_beam.options.pipeline_options import PipelineOptions +from apache_beam.options.pipeline_options import SetupOptions +from PIL import Image +from torchvision import transforms +from torchvision.models.mobilenetv2 import MobileNetV2 + + +def read_image(image_file_name: str, + path_to_dir: str = None) -> Tuple[str, Image.Image]: + if path_to_dir is not None: +image_file_name = os.path.join(path_to_dir, image_file_name) + with FileSystems().open(image_file_name, 'r') as file: +data = Image.open(io.BytesIO(file.read())).convert('RGB') +return image_file_name, data + + +def preprocess_image(data: Image) -> torch.Tensor: + image_size = (224, 224) + # to use models in torch with imagenet weights, + # normalize the images using the below values. + # ref: https://pytorch.org/vision/stable/models.html# + normalize = transforms.Normalize( + mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]) + transform = transforms.Compose([ + transforms.Resize(image_size), + transforms.ToTensor(), + normalize, + ]) + return transform(data) + + +class PostProcessor(beam.DoFn): + def process( + self, element: Union[PredictionResult, Tuple[Any, PredictionResult]] + ) -> Iterable[str]: +filename, prediction_result = element +prediction = torch.argmax(prediction_result.inference, dim=0) +yield filename + ',' + str(prediction.item()) + + +def run_pipeline(options: PipelineOptions, args=None): + """Sets up PyTorch RunInference pipeline""" + # reference to the class definition of the model. + model_class = MobileNetV2 + # params for model class constructor. These values will be used in + # RunInference API to instantiate the model object. + model_params = {'num_classes': 1000} # imagenet has 1000 classes. + # for this example, the pretrained weights are downloaded from + # "https://download.pytorch.org/models/mobilenet_v2-b0353104.pth"; + # and saved on GCS bucket gs://apache-beam-ml/models/imagenet_classification_mobilenet_v2.pt, + # which will be used to load the model state_dict in the RunInference API. + model_loader = PytorchModelLoader( + state_dict_path=args.model_state_dict_path, + model_class=model_class, + model_params=model_params) + with beam.Pipeline(options=options) as p: +filename_value_pair = ( +p +| 'Read from csv file' >> beam.io.ReadFromText( +args.input, skip_header_lines=1) +| 'Parse and read files from the input_file' >> beam.Map( +partial(read_image, path_to_dir=args.images_dir)) +| 'Preprocess images' >> beam.MapTuple( +lambda file_name, data: (file_name, preprocess_image(data +predictions = ( +filename_value_pair +| 'PyTorch RunInference' >> RunInference(model_loader) +| 'Process output' >> beam.ParDo(PostProcessor())) + +if args.output: + predictions | "Write output to GCS"
[jira] [Work logged] (BEAM-14535) Add support for Pandas Dataframes to sklearn RunInference Implementation
[ https://issues.apache.org/jira/browse/BEAM-14535?focusedWorklogId=778239&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778239 ] ASF GitHub Bot logged work on BEAM-14535: - Author: ASF GitHub Bot Created on: 03/Jun/22 18:00 Start Date: 03/Jun/22 18:00 Worklog Time Spent: 10m Work Description: ryanthompson591 commented on PR #17800: URL: https://github.com/apache/beam/pull/17800#issuecomment-1146225373 R: @TheNeuralBit Issue Time Tracking --- Worklog Id: (was: 778239) Time Spent: 1.5h (was: 1h 20m) > Add support for Pandas Dataframes to sklearn RunInference Implementation > > > Key: BEAM-14535 > URL: https://issues.apache.org/jira/browse/BEAM-14535 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-core >Reporter: Ryan Thompson >Assignee: Ryan Thompson >Priority: P2 > Time Spent: 1.5h > Remaining Estimate: 0h > > Sklearn pipelines are often set up to take pandas dataframes. > > Our current implementation only supports numpy arrays. > > This FR allows the sklearn implementation to autodetect pandas dataframes or > numpy arrays and then combine them (via concat). > > In the case of a pandas dataframe that value will be passed through to the > pipeline. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-14068) RunInference Benchmarking tests
[ https://issues.apache.org/jira/browse/BEAM-14068?focusedWorklogId=778241&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778241 ] ASF GitHub Bot logged work on BEAM-14068: - Author: ASF GitHub Bot Created on: 03/Jun/22 18:03 Start Date: 03/Jun/22 18:03 Worklog Time Spent: 10m Work Description: AnandInguva commented on code in PR #17462: URL: https://github.com/apache/beam/pull/17462#discussion_r889210162 ## sdks/python/apache_beam/examples/inference/pytorch_image_classification.py: ## @@ -0,0 +1,146 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +"""Pipeline that uses RunInference API to perform classification task on imagenet dataset""" # pylint: disable=line-too-long + +import argparse +import io +import os +from functools import partial +from typing import Any +from typing import Iterable +from typing import Tuple +from typing import Union + +import apache_beam as beam +import torch +from apache_beam.io.filesystems import FileSystems +from apache_beam.ml.inference.api import PredictionResult +from apache_beam.ml.inference.api import RunInference +from apache_beam.ml.inference.pytorch_inference import PytorchModelLoader +from apache_beam.options.pipeline_options import PipelineOptions +from apache_beam.options.pipeline_options import SetupOptions +from PIL import Image +from torchvision import transforms +from torchvision.models.mobilenetv2 import MobileNetV2 + + +def read_image(image_file_name: str, + path_to_dir: str = None) -> Tuple[str, Image.Image]: + if path_to_dir is not None: +image_file_name = os.path.join(path_to_dir, image_file_name) + with FileSystems().open(image_file_name, 'r') as file: +data = Image.open(io.BytesIO(file.read())).convert('RGB') +return image_file_name, data + + +def preprocess_image(data: Image) -> torch.Tensor: + image_size = (224, 224) + # to use models in torch with imagenet weights, + # normalize the images using the below values. + # ref: https://pytorch.org/vision/stable/models.html# + normalize = transforms.Normalize( + mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]) + transform = transforms.Compose([ + transforms.Resize(image_size), + transforms.ToTensor(), + normalize, + ]) + return transform(data) + + +class PostProcessor(beam.DoFn): + def process( + self, element: Union[PredictionResult, Tuple[Any, PredictionResult]] Review Comment: 1 seems to be simple for this example and it works. Thanks. I will commit that change Issue Time Tracking --- Worklog Id: (was: 778241) Time Spent: 9h 10m (was: 9h) > RunInference Benchmarking tests > --- > > Key: BEAM-14068 > URL: https://issues.apache.org/jira/browse/BEAM-14068 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-core >Reporter: Anand Inguva >Assignee: Anand Inguva >Priority: P2 > Time Spent: 9h 10m > Remaining Estimate: 0h > > RunInference benchmarks will evaluate performance of Pipelines, which > represent common use cases of Beam + Dataflow in Pytorch, sklearn and > possibly TFX. These benchmarks would be the integration tests that exercise > several software components using Beam, PyTorch, Scikit learn and TensorFlow > extended. > we would use the datasets that's available publicly (Eg; Kaggle). > Size: small / 10 GB / 1 TB etc > The default execution runner would be Dataflow unless specified otherwise. > These tests would be run very less frequently(every release cycle). -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-10785) Support for coder argument in WriteToBigQuery
[ https://issues.apache.org/jira/browse/BEAM-10785?focusedWorklogId=778237&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778237 ] ASF GitHub Bot logged work on BEAM-10785: - Author: ASF GitHub Bot Created on: 03/Jun/22 17:52 Start Date: 03/Jun/22 17:52 Worklog Time Spent: 10m Work Description: pabloem commented on PR #17518: URL: https://github.com/apache/beam/pull/17518#issuecomment-1146218977 @harrydrippin thanks for the analysis, and great catch! - I think your fix to the coder itself would be a much better fix. Would you be willing to perform that fix instead? Issue Time Tracking --- Worklog Id: (was: 778237) Time Spent: 2h 10m (was: 2h) > Support for coder argument in WriteToBigQuery > - > > Key: BEAM-10785 > URL: https://issues.apache.org/jira/browse/BEAM-10785 > Project: Beam > Issue Type: Bug > Components: io-py-gcp >Reporter: Nakamura Yu >Assignee: Seunghwan Hong >Priority: P1 > Time Spent: 2h 10m > Remaining Estimate: 0h > > When using WriteToBigQuery to transfer data to BigQuery, non-ascii characters > are replaced with replacement characters. > This was due to the RowAsDictJsonCoder being set as the coder for the > BigQueryBatchFileLoads called inside WriteToBigQuery. > I want to add coder to the argument of WriteToBigQuery so that I can set a > coder other than RowAsDictJsonCoder. > If no problem, I will create a Pull Request next weekend. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-14504) Add support for including addittional parameters to executebundle method in fhirio.
[ https://issues.apache.org/jira/browse/BEAM-14504?focusedWorklogId=778235&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778235 ] ASF GitHub Bot logged work on BEAM-14504: - Author: ASF GitHub Bot Created on: 03/Jun/22 17:50 Start Date: 03/Jun/22 17:50 Worklog Time Spent: 10m Work Description: pabloem commented on PR #17741: URL: https://github.com/apache/beam/pull/17741#issuecomment-1146217741 Run Java PreCommit Issue Time Tracking --- Worklog Id: (was: 778235) Time Spent: 7h 40m (was: 7.5h) > Add support for including addittional parameters to executebundle method in > fhirio. > --- > > Key: BEAM-14504 > URL: https://issues.apache.org/jira/browse/BEAM-14504 > Project: Beam > Issue Type: Improvement > Components: io-java-healthcare >Reporter: Fathima Mohammed >Assignee: Fathima Mohammed >Priority: P2 > Time Spent: 7h 40m > Remaining Estimate: 0h > > Add FhirBundleWithMetadata in executebundles method so that we can pass > additional information like message id. > FhirBundleWithMetadata represents a FHIR bundle, with it's metadata (eg. Hl7 > messageId) to be executed on the intermediate FHIR store. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-14504) Add support for including addittional parameters to executebundle method in fhirio.
[ https://issues.apache.org/jira/browse/BEAM-14504?focusedWorklogId=778232&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778232 ] ASF GitHub Bot logged work on BEAM-14504: - Author: ASF GitHub Bot Created on: 03/Jun/22 17:47 Start Date: 03/Jun/22 17:47 Worklog Time Spent: 10m Work Description: pabloem commented on PR #17741: URL: https://github.com/apache/beam/pull/17741#issuecomment-1146215215 Run Java PreCommit Issue Time Tracking --- Worklog Id: (was: 778232) Time Spent: 7.5h (was: 7h 20m) > Add support for including addittional parameters to executebundle method in > fhirio. > --- > > Key: BEAM-14504 > URL: https://issues.apache.org/jira/browse/BEAM-14504 > Project: Beam > Issue Type: Improvement > Components: io-java-healthcare >Reporter: Fathima Mohammed >Assignee: Fathima Mohammed >Priority: P2 > Time Spent: 7.5h > Remaining Estimate: 0h > > Add FhirBundleWithMetadata in executebundles method so that we can pass > additional information like message id. > FhirBundleWithMetadata represents a FHIR bundle, with it's metadata (eg. Hl7 > messageId) to be executed on the intermediate FHIR store. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-14068) RunInference Benchmarking tests
[ https://issues.apache.org/jira/browse/BEAM-14068?focusedWorklogId=778233&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778233 ] ASF GitHub Bot logged work on BEAM-14068: - Author: ASF GitHub Bot Created on: 03/Jun/22 17:47 Start Date: 03/Jun/22 17:47 Worklog Time Spent: 10m Work Description: tvalentyn commented on code in PR #17462: URL: https://github.com/apache/beam/pull/17462#discussion_r889198878 ## sdks/python/apache_beam/examples/inference/pytorch_image_classification.py: ## @@ -0,0 +1,146 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +"""Pipeline that uses RunInference API to perform classification task on imagenet dataset""" # pylint: disable=line-too-long + +import argparse +import io +import os +from functools import partial +from typing import Any +from typing import Iterable +from typing import Tuple +from typing import Union + +import apache_beam as beam +import torch +from apache_beam.io.filesystems import FileSystems +from apache_beam.ml.inference.api import PredictionResult +from apache_beam.ml.inference.api import RunInference +from apache_beam.ml.inference.pytorch_inference import PytorchModelLoader +from apache_beam.options.pipeline_options import PipelineOptions +from apache_beam.options.pipeline_options import SetupOptions +from PIL import Image +from torchvision import transforms +from torchvision.models.mobilenetv2 import MobileNetV2 + + +def read_image(image_file_name: str, + path_to_dir: str = None) -> Tuple[str, Image.Image]: + if path_to_dir is not None: +image_file_name = os.path.join(path_to_dir, image_file_name) + with FileSystems().open(image_file_name, 'r') as file: +data = Image.open(io.BytesIO(file.read())).convert('RGB') +return image_file_name, data + + +def preprocess_image(data: Image) -> torch.Tensor: + image_size = (224, 224) + # to use models in torch with imagenet weights, + # normalize the images using the below values. + # ref: https://pytorch.org/vision/stable/models.html# + normalize = transforms.Normalize( + mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]) + transform = transforms.Compose([ + transforms.Resize(image_size), + transforms.ToTensor(), + normalize, + ]) + return transform(data) + + +class PostProcessor(beam.DoFn): + def process( + self, element: Union[PredictionResult, Tuple[Any, PredictionResult]] Review Comment: I think this is because RunInference API can produce predictions with and without original examples. I think there are two ways to resolve this: 1) Clarify in RunInference transform that we will have only examples and predictions in the output ``` | 'PyTorch RunInference' >> RunInference(model_loader).with_output_types(Tuple[str, PredictionResult]) ``` 2) We generalize the postprocessor to handle both cases: when there are examples + predictions and when there are no examples, only predictions. In that case, the current type hint would make sense. If we go with 2) we can also consider adding an additional pipeline option whether or not to include examples or just predictions in the final result. Issue Time Tracking --- Worklog Id: (was: 778233) Time Spent: 9h (was: 8h 50m) > RunInference Benchmarking tests > --- > > Key: BEAM-14068 > URL: https://issues.apache.org/jira/browse/BEAM-14068 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-core >Reporter: Anand Inguva >Assignee: Anand Inguva >Priority: P2 > Time Spent: 9h > Remaining Estimate: 0h > > RunInference benchmarks will evaluate performance of Pipelines, which > represent common use cases of Beam + Dataflow in Pytorch, sklearn and > possibly TFX. These benchmarks would be the integration tests that exercise > several software components using Beam, PyTorch, Scikit learn and TensorFlow > extended. > we would use the datasets that's available publicly (Eg; Kaggle). > Size: small / 10 GB / 1 TB etc > The
[jira] [Work logged] (BEAM-11104) [Go SDK] DoFn Self Checkpointing
[ https://issues.apache.org/jira/browse/BEAM-11104?focusedWorklogId=778231&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778231 ] ASF GitHub Bot logged work on BEAM-11104: - Author: ASF GitHub Bot Created on: 03/Jun/22 17:43 Start Date: 03/Jun/22 17:43 Worklog Time Spent: 10m Work Description: damccorm commented on code in PR #17956: URL: https://github.com/apache/beam/pull/17956#discussion_r889196045 ## website/www/site/content/en/documentation/programming-guide.md: ## @@ -6422,7 +6422,26 @@ resource utilization. {{< /highlight >}} {{< highlight go >}} -This is not supported yet, see BEAM-11104. +func (fn *splittableDoFn) ProcessElement(rt *sdf.LockRTracker, emit func(Record)) sdf.ProcessContinuation { Review Comment: Yeah - I don't care if we make up empty functions for that, we already do that for a number of the existing snippets. The process continuation stuff should compile though, and that's the important bit anyways. Issue Time Tracking --- Worklog Id: (was: 778231) Time Spent: 28h 10m (was: 28h) > [Go SDK] DoFn Self Checkpointing > > > Key: BEAM-11104 > URL: https://issues.apache.org/jira/browse/BEAM-11104 > Project: Beam > Issue Type: Sub-task > Components: sdk-go >Reporter: Robert Burke >Assignee: Jack McCluskey >Priority: P3 > Fix For: 2.40.0 > > Time Spent: 28h 10m > Remaining Estimate: 0h > > Allow SplittableDoFns to self checkpoint. > Design doc: > [https://docs.google.com/document/d/1_JbzjY9JR07ZK5v7PcZevUfzHPsqwzfV7W6AouNpMPk/edit?usp=sharing] > > Feature is written E2E and users will be able to return ProcessContinuations > from SDFs as of 2.39.0 but the full behavior has not been fully validated. An > integration test that validates self-checkpointing is working as-intended > will need to be written and passing before the feature is no longer > considered experimental and this ticket is marked as resolved. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-11104) [Go SDK] DoFn Self Checkpointing
[ https://issues.apache.org/jira/browse/BEAM-11104?focusedWorklogId=778230&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778230 ] ASF GitHub Bot logged work on BEAM-11104: - Author: ASF GitHub Bot Created on: 03/Jun/22 17:43 Start Date: 03/Jun/22 17:43 Worklog Time Spent: 10m Work Description: damccorm commented on code in PR #17956: URL: https://github.com/apache/beam/pull/17956#discussion_r889196045 ## website/www/site/content/en/documentation/programming-guide.md: ## @@ -6422,7 +6422,26 @@ resource utilization. {{< /highlight >}} {{< highlight go >}} -This is not supported yet, see BEAM-11104. +func (fn *splittableDoFn) ProcessElement(rt *sdf.LockRTracker, emit func(Record)) sdf.ProcessContinuation { Review Comment: Yeah - I don't care if we make up empty functions for that, we already do that for a number of the existing snippets. The process continuation stuff should compile though. Issue Time Tracking --- Worklog Id: (was: 778230) Time Spent: 28h (was: 27h 50m) > [Go SDK] DoFn Self Checkpointing > > > Key: BEAM-11104 > URL: https://issues.apache.org/jira/browse/BEAM-11104 > Project: Beam > Issue Type: Sub-task > Components: sdk-go >Reporter: Robert Burke >Assignee: Jack McCluskey >Priority: P3 > Fix For: 2.40.0 > > Time Spent: 28h > Remaining Estimate: 0h > > Allow SplittableDoFns to self checkpoint. > Design doc: > [https://docs.google.com/document/d/1_JbzjY9JR07ZK5v7PcZevUfzHPsqwzfV7W6AouNpMPk/edit?usp=sharing] > > Feature is written E2E and users will be able to return ProcessContinuations > from SDFs as of 2.39.0 but the full behavior has not been fully validated. An > integration test that validates self-checkpointing is working as-intended > will need to be written and passing before the feature is no longer > considered experimental and this ticket is marked as resolved. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-11104) [Go SDK] DoFn Self Checkpointing
[ https://issues.apache.org/jira/browse/BEAM-11104?focusedWorklogId=778229&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778229 ] ASF GitHub Bot logged work on BEAM-11104: - Author: ASF GitHub Bot Created on: 03/Jun/22 17:42 Start Date: 03/Jun/22 17:42 Worklog Time Spent: 10m Work Description: jrmccluskey commented on code in PR #17956: URL: https://github.com/apache/beam/pull/17956#discussion_r889195081 ## website/www/site/content/en/documentation/programming-guide.md: ## @@ -6422,7 +6422,26 @@ resource utilization. {{< /highlight >}} {{< highlight go >}} -This is not supported yet, see BEAM-11104. +func (fn *splittableDoFn) ProcessElement(rt *sdf.LockRTracker, emit func(Record)) sdf.ProcessContinuation { Review Comment: Adding an error parameter is reasonable, so that's added. It did lead to some extra error checking overhead since we don't have the try-catch mechanism the other SDKs leverage but that's not a huge problem. Issue Time Tracking --- Worklog Id: (was: 778229) Time Spent: 27h 50m (was: 27h 40m) > [Go SDK] DoFn Self Checkpointing > > > Key: BEAM-11104 > URL: https://issues.apache.org/jira/browse/BEAM-11104 > Project: Beam > Issue Type: Sub-task > Components: sdk-go >Reporter: Robert Burke >Assignee: Jack McCluskey >Priority: P3 > Fix For: 2.40.0 > > Time Spent: 27h 50m > Remaining Estimate: 0h > > Allow SplittableDoFns to self checkpoint. > Design doc: > [https://docs.google.com/document/d/1_JbzjY9JR07ZK5v7PcZevUfzHPsqwzfV7W6AouNpMPk/edit?usp=sharing] > > Feature is written E2E and users will be able to return ProcessContinuations > from SDFs as of 2.39.0 but the full behavior has not been fully validated. An > integration test that validates self-checkpointing is working as-intended > will need to be written and passing before the feature is no longer > considered experimental and this ticket is marked as resolved. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-11104) [Go SDK] DoFn Self Checkpointing
[ https://issues.apache.org/jira/browse/BEAM-11104?focusedWorklogId=778227&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778227 ] ASF GitHub Bot logged work on BEAM-11104: - Author: ASF GitHub Bot Created on: 03/Jun/22 17:41 Start Date: 03/Jun/22 17:41 Worklog Time Spent: 10m Work Description: jrmccluskey commented on code in PR #17956: URL: https://github.com/apache/beam/pull/17956#discussion_r889194510 ## website/www/site/content/en/documentation/programming-guide.md: ## @@ -6422,7 +6422,26 @@ resource utilization. {{< /highlight >}} {{< highlight go >}} -This is not supported yet, see BEAM-11104. +func (fn *splittableDoFn) ProcessElement(rt *sdf.LockRTracker, emit func(Record)) sdf.ProcessContinuation { + position := rt.GetRestriction().(offsetrange.Restriction).Start + for { +records, err := fn.ExternalService.readNextRecords(position) +if err == fn.ExternalService.ThrottlingErr { + return sdf.ResumeProcessingIn(60 * time.Seconds) +} +if len(records) == 0 { + return sdf.ResumeProcessingIn(10 * time.Seconds) Review Comment: That's a fair note. Adding clarifying comments is always good for a documentation snippet Issue Time Tracking --- Worklog Id: (was: 778227) Time Spent: 27h 40m (was: 27.5h) > [Go SDK] DoFn Self Checkpointing > > > Key: BEAM-11104 > URL: https://issues.apache.org/jira/browse/BEAM-11104 > Project: Beam > Issue Type: Sub-task > Components: sdk-go >Reporter: Robert Burke >Assignee: Jack McCluskey >Priority: P3 > Fix For: 2.40.0 > > Time Spent: 27h 40m > Remaining Estimate: 0h > > Allow SplittableDoFns to self checkpoint. > Design doc: > [https://docs.google.com/document/d/1_JbzjY9JR07ZK5v7PcZevUfzHPsqwzfV7W6AouNpMPk/edit?usp=sharing] > > Feature is written E2E and users will be able to return ProcessContinuations > from SDFs as of 2.39.0 but the full behavior has not been fully validated. An > integration test that validates self-checkpointing is working as-intended > will need to be written and passing before the feature is no longer > considered experimental and this ticket is marked as resolved. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-14068) RunInference Benchmarking tests
[ https://issues.apache.org/jira/browse/BEAM-14068?focusedWorklogId=778225&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778225 ] ASF GitHub Bot logged work on BEAM-14068: - Author: ASF GitHub Bot Created on: 03/Jun/22 17:39 Start Date: 03/Jun/22 17:39 Worklog Time Spent: 10m Work Description: AnandInguva commented on code in PR #17462: URL: https://github.com/apache/beam/pull/17462#discussion_r889185574 ## sdks/python/apache_beam/examples/inference/pytorch_image_classification.py: ## @@ -0,0 +1,146 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +"""Pipeline that uses RunInference API to perform classification task on imagenet dataset""" # pylint: disable=line-too-long + +import argparse +import io +import os +from functools import partial +from typing import Any +from typing import Iterable +from typing import Tuple +from typing import Union + +import apache_beam as beam +import torch +from apache_beam.io.filesystems import FileSystems +from apache_beam.ml.inference.api import PredictionResult +from apache_beam.ml.inference.api import RunInference +from apache_beam.ml.inference.pytorch_inference import PytorchModelLoader +from apache_beam.options.pipeline_options import PipelineOptions +from apache_beam.options.pipeline_options import SetupOptions +from PIL import Image +from torchvision import transforms +from torchvision.models.mobilenetv2 import MobileNetV2 + + +def read_image(image_file_name: str, + path_to_dir: str = None) -> Tuple[str, Image.Image]: + if path_to_dir is not None: +image_file_name = os.path.join(path_to_dir, image_file_name) + with FileSystems().open(image_file_name, 'r') as file: +data = Image.open(io.BytesIO(file.read())).convert('RGB') +return image_file_name, data + + +def preprocess_image(data: Image) -> torch.Tensor: Review Comment: Thanks for catching ## sdks/python/apache_beam/examples/inference/pytorch_image_classification.py: ## @@ -0,0 +1,146 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +"""Pipeline that uses RunInference API to perform classification task on imagenet dataset""" # pylint: disable=line-too-long + +import argparse +import io +import os +from functools import partial +from typing import Any +from typing import Iterable +from typing import Tuple +from typing import Union + +import apache_beam as beam +import torch +from apache_beam.io.filesystems import FileSystems +from apache_beam.ml.inference.api import PredictionResult +from apache_beam.ml.inference.api import RunInference +from apache_beam.ml.inference.pytorch_inference import PytorchModelLoader +from apache_beam.options.pipeline_options import PipelineOptions +from apache_beam.options.pipeline_options import SetupOptions +from PIL import Image +from torchvision import transforms +from torchvision.models.mobilenetv2 import MobileNetV2 + + +def read_image(image_file_name: str, + path_to_dir: str = None) -> Tuple[str, Image.Image]: + if path_to_dir is not None: +image_file_name = os.path.join(path_to_dir, image_file_name) + with FileSystems().open(image_file_name, 'r') as file: +data = Image.open(io.BytesIO(file.read())).convert('RGB') +return image_file_name, data + + +def preprocess_image(data: Image) -> torch.Tensor: + image_size = (224, 224) + # to use models in torch with imagenet w
[jira] [Work logged] (BEAM-11104) [Go SDK] DoFn Self Checkpointing
[ https://issues.apache.org/jira/browse/BEAM-11104?focusedWorklogId=778220&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778220 ] ASF GitHub Bot logged work on BEAM-11104: - Author: ASF GitHub Bot Created on: 03/Jun/22 17:36 Start Date: 03/Jun/22 17:36 Worklog Time Spent: 10m Work Description: jrmccluskey commented on code in PR #17956: URL: https://github.com/apache/beam/pull/17956#discussion_r889191309 ## website/www/site/content/en/documentation/programming-guide.md: ## @@ -6422,7 +6422,26 @@ resource utilization. {{< /highlight >}} {{< highlight go >}} -This is not supported yet, see BEAM-11104. +func (fn *splittableDoFn) ProcessElement(rt *sdf.LockRTracker, emit func(Record)) sdf.ProcessContinuation { Review Comment: IIRC if an SDF signature has a ProcessContinuation return we *always* expect either a Resume() or Stop() continuation and never a nil. Issue Time Tracking --- Worklog Id: (was: 778220) Time Spent: 27.5h (was: 27h 20m) > [Go SDK] DoFn Self Checkpointing > > > Key: BEAM-11104 > URL: https://issues.apache.org/jira/browse/BEAM-11104 > Project: Beam > Issue Type: Sub-task > Components: sdk-go >Reporter: Robert Burke >Assignee: Jack McCluskey >Priority: P3 > Fix For: 2.40.0 > > Time Spent: 27.5h > Remaining Estimate: 0h > > Allow SplittableDoFns to self checkpoint. > Design doc: > [https://docs.google.com/document/d/1_JbzjY9JR07ZK5v7PcZevUfzHPsqwzfV7W6AouNpMPk/edit?usp=sharing] > > Feature is written E2E and users will be able to return ProcessContinuations > from SDFs as of 2.39.0 but the full behavior has not been fully validated. An > integration test that validates self-checkpointing is working as-intended > will need to be written and passing before the feature is no longer > considered experimental and this ticket is marked as resolved. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-11104) [Go SDK] DoFn Self Checkpointing
[ https://issues.apache.org/jira/browse/BEAM-11104?focusedWorklogId=778218&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778218 ] ASF GitHub Bot logged work on BEAM-11104: - Author: ASF GitHub Bot Created on: 03/Jun/22 17:35 Start Date: 03/Jun/22 17:35 Worklog Time Spent: 10m Work Description: jrmccluskey commented on code in PR #17956: URL: https://github.com/apache/beam/pull/17956#discussion_r889190683 ## website/www/site/content/en/documentation/programming-guide.md: ## @@ -6422,7 +6422,26 @@ resource utilization. {{< /highlight >}} {{< highlight go >}} -This is not supported yet, see BEAM-11104. +func (fn *splittableDoFn) ProcessElement(rt *sdf.LockRTracker, emit func(Record)) sdf.ProcessContinuation { Review Comment: This is a completely fictional IO since we don't actually have a robust native streaming IO, so there's nothing to compile. It's just modeled after the Python and Java versions. Issue Time Tracking --- Worklog Id: (was: 778218) Time Spent: 27h 20m (was: 27h 10m) > [Go SDK] DoFn Self Checkpointing > > > Key: BEAM-11104 > URL: https://issues.apache.org/jira/browse/BEAM-11104 > Project: Beam > Issue Type: Sub-task > Components: sdk-go >Reporter: Robert Burke >Assignee: Jack McCluskey >Priority: P3 > Fix For: 2.40.0 > > Time Spent: 27h 20m > Remaining Estimate: 0h > > Allow SplittableDoFns to self checkpoint. > Design doc: > [https://docs.google.com/document/d/1_JbzjY9JR07ZK5v7PcZevUfzHPsqwzfV7W6AouNpMPk/edit?usp=sharing] > > Feature is written E2E and users will be able to return ProcessContinuations > from SDFs as of 2.39.0 but the full behavior has not been fully validated. An > integration test that validates self-checkpointing is working as-intended > will need to be written and passing before the feature is no longer > considered experimental and this ticket is marked as resolved. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-14068) RunInference Benchmarking tests
[ https://issues.apache.org/jira/browse/BEAM-14068?focusedWorklogId=778215&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778215 ] ASF GitHub Bot logged work on BEAM-14068: - Author: ASF GitHub Bot Created on: 03/Jun/22 17:27 Start Date: 03/Jun/22 17:27 Worklog Time Spent: 10m Work Description: AnandInguva commented on code in PR #17462: URL: https://github.com/apache/beam/pull/17462#discussion_r889185085 ## sdks/python/apache_beam/examples/inference/pytorch_image_classification.py: ## @@ -0,0 +1,146 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +"""Pipeline that uses RunInference API to perform classification task on imagenet dataset""" # pylint: disable=line-too-long + +import argparse +import io +import os +from functools import partial +from typing import Any +from typing import Iterable +from typing import Tuple +from typing import Union + +import apache_beam as beam +import torch +from apache_beam.io.filesystems import FileSystems +from apache_beam.ml.inference.api import PredictionResult +from apache_beam.ml.inference.api import RunInference +from apache_beam.ml.inference.pytorch_inference import PytorchModelLoader +from apache_beam.options.pipeline_options import PipelineOptions +from apache_beam.options.pipeline_options import SetupOptions +from PIL import Image +from torchvision import transforms +from torchvision.models.mobilenetv2 import MobileNetV2 + + +def read_image(image_file_name: str, + path_to_dir: str = None) -> Tuple[str, Image.Image]: + if path_to_dir is not None: +image_file_name = os.path.join(path_to_dir, image_file_name) + with FileSystems().open(image_file_name, 'r') as file: +data = Image.open(io.BytesIO(file.read())).convert('RGB') +return image_file_name, data + + +def preprocess_image(data: Image) -> torch.Tensor: + image_size = (224, 224) + # to use models in torch with imagenet weights, + # normalize the images using the below values. + # ref: https://pytorch.org/vision/stable/models.html# + normalize = transforms.Normalize( + mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]) + transform = transforms.Compose([ + transforms.Resize(image_size), + transforms.ToTensor(), + normalize, + ]) + return transform(data) + + +class PostProcessor(beam.DoFn): + def process( + self, element: Union[PredictionResult, Tuple[Any, PredictionResult]] Review Comment: I had what you suggested here but I got this errror. `apache_beam.typehints.decorators.TypeCheckError: Type hint violation for 'ProcessOutput': requires Tuple[Any, PredictionResult] but got Union[PredictionResult, Tuple[Any, PredictionResult]] for element`. Output defined here: https://github.com/apache/beam/blob/6d6a54d21b483c20005debcfd0b6d9ff6739ec80/sdks/python/apache_beam/ml/inference/api.py#L39 I am not sure of the behavior though. It should be able to accept `element: Tuple[str, PredictionResult]]` but to avoid this error, I changed the code Issue Time Tracking --- Worklog Id: (was: 778215) Time Spent: 8h 40m (was: 8.5h) > RunInference Benchmarking tests > --- > > Key: BEAM-14068 > URL: https://issues.apache.org/jira/browse/BEAM-14068 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-core >Reporter: Anand Inguva >Assignee: Anand Inguva >Priority: P2 > Time Spent: 8h 40m > Remaining Estimate: 0h > > RunInference benchmarks will evaluate performance of Pipelines, which > represent common use cases of Beam + Dataflow in Pytorch, sklearn and > possibly TFX. These benchmarks would be the integration tests that exercise > several software components using Beam, PyTorch, Scikit learn and TensorFlow > extended. > we would use the datasets that's available publicly (Eg; Kaggle). > Size: small / 10 GB / 1 TB etc > The default execution runner would be Dataflow unless specified otherwise. > These tests would be run very less frequently(every release cycle). -
[jira] [Work logged] (BEAM-11104) [Go SDK] DoFn Self Checkpointing
[ https://issues.apache.org/jira/browse/BEAM-11104?focusedWorklogId=778161&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778161 ] ASF GitHub Bot logged work on BEAM-11104: - Author: ASF GitHub Bot Created on: 03/Jun/22 15:59 Start Date: 03/Jun/22 15:59 Worklog Time Spent: 10m Work Description: asf-ci commented on PR #17956: URL: https://github.com/apache/beam/pull/17956#issuecomment-1146103311 Can one of the admins verify this patch? Issue Time Tracking --- Worklog Id: (was: 778161) Time Spent: 26h 40m (was: 26.5h) > [Go SDK] DoFn Self Checkpointing > > > Key: BEAM-11104 > URL: https://issues.apache.org/jira/browse/BEAM-11104 > Project: Beam > Issue Type: Sub-task > Components: sdk-go >Reporter: Robert Burke >Assignee: Jack McCluskey >Priority: P3 > Fix For: 2.40.0 > > Time Spent: 26h 40m > Remaining Estimate: 0h > > Allow SplittableDoFns to self checkpoint. > Design doc: > [https://docs.google.com/document/d/1_JbzjY9JR07ZK5v7PcZevUfzHPsqwzfV7W6AouNpMPk/edit?usp=sharing] > > Feature is written E2E and users will be able to return ProcessContinuations > from SDFs as of 2.39.0 but the full behavior has not been fully validated. An > integration test that validates self-checkpointing is working as-intended > will need to be written and passing before the feature is no longer > considered experimental and this ticket is marked as resolved. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-11104) [Go SDK] DoFn Self Checkpointing
[ https://issues.apache.org/jira/browse/BEAM-11104?focusedWorklogId=778204&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778204 ] ASF GitHub Bot logged work on BEAM-11104: - Author: ASF GitHub Bot Created on: 03/Jun/22 16:59 Start Date: 03/Jun/22 16:59 Worklog Time Spent: 10m Work Description: damccorm commented on code in PR #17956: URL: https://github.com/apache/beam/pull/17956#discussion_r889155710 ## website/www/site/content/en/documentation/programming-guide.md: ## @@ -6422,7 +6422,26 @@ resource utilization. {{< /highlight >}} {{< highlight go >}} -This is not supported yet, see BEAM-11104. +func (fn *splittableDoFn) ProcessElement(rt *sdf.LockRTracker, emit func(Record)) sdf.ProcessContinuation { Review Comment: Can we put this in the snippets folder (example below in Watermark estimation section)? I know we haven't been clean on that before, but it: (a) makes sure that the code actually compiles (b) makes it easier to reuse (e.g. I know Dataflow has docs that use snippets from Beam) ## website/www/site/content/en/documentation/programming-guide.md: ## @@ -6422,7 +6422,26 @@ resource utilization. {{< /highlight >}} {{< highlight go >}} -This is not supported yet, see BEAM-11104. +func (fn *splittableDoFn) ProcessElement(rt *sdf.LockRTracker, emit func(Record)) sdf.ProcessContinuation { + position := rt.GetRestriction().(offsetrange.Restriction).Start + for { +records, err := fn.ExternalService.readNextRecords(position) +if err == fn.ExternalService.ThrottlingErr { + return sdf.ResumeProcessingIn(60 * time.Seconds) +} +if len(records) == 0 { + return sdf.ResumeProcessingIn(10 * time.Seconds) Review Comment: Maybe add a comment along the lines of `// Wait for data to be available`? Might be nice to have a similar comment for the throttling case and the finish execution case as well. ## website/www/site/content/en/documentation/programming-guide.md: ## @@ -6422,7 +6422,26 @@ resource utilization. {{< /highlight >}} {{< highlight go >}} -This is not supported yet, see BEAM-11104. +func (fn *splittableDoFn) ProcessElement(rt *sdf.LockRTracker, emit func(Record)) sdf.ProcessContinuation { Review Comment: Could you return an `err` parameter as well (it can just return nil)? Something I realized w/ Bundle Finalization is that its much more helpful if we provide the parameters that surround the one we are demonstrating because it allows users to see the ordering we require. Side note unrelated to this PR: We probably need better ordering error messages, they are pretty confusing right now. ## website/www/site/content/en/documentation/programming-guide.md: ## @@ -6422,7 +6422,26 @@ resource utilization. {{< /highlight >}} {{< highlight go >}} -This is not supported yet, see BEAM-11104. +func (fn *splittableDoFn) ProcessElement(rt *sdf.LockRTracker, emit func(Record)) sdf.ProcessContinuation { Review Comment: I'm actually also curious about what process continuation we should return when we return an err response actually - is it nil? Might be worth including that as an option if for example, `records, err := fn.ExternalService.readNextRecords(position)` returns a non-nil, non-throttling error respone Issue Time Tracking --- Worklog Id: (was: 778204) Time Spent: 27h 10m (was: 27h) > [Go SDK] DoFn Self Checkpointing > > > Key: BEAM-11104 > URL: https://issues.apache.org/jira/browse/BEAM-11104 > Project: Beam > Issue Type: Sub-task > Components: sdk-go >Reporter: Robert Burke >Assignee: Jack McCluskey >Priority: P3 > Fix For: 2.40.0 > > Time Spent: 27h 10m > Remaining Estimate: 0h > > Allow SplittableDoFns to self checkpoint. > Design doc: > [https://docs.google.com/document/d/1_JbzjY9JR07ZK5v7PcZevUfzHPsqwzfV7W6AouNpMPk/edit?usp=sharing] > > Feature is written E2E and users will be able to return ProcessContinuations > from SDFs as of 2.39.0 but the full behavior has not been fully validated. An > integration test that validates self-checkpointing is working as-intended > will need to be written and passing before the feature is no longer > considered experimental and this ticket is marked as resolved. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-14337) Support **kwargs for PyTorch models.
[ https://issues.apache.org/jira/browse/BEAM-14337?focusedWorklogId=778203&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778203 ] ASF GitHub Bot logged work on BEAM-14337: - Author: ASF GitHub Bot Created on: 03/Jun/22 16:57 Start Date: 03/Jun/22 16:57 Worklog Time Spent: 10m Work Description: yeandy commented on PR #17470: URL: https://github.com/apache/beam/pull/17470#issuecomment-1146178363 R: @tvalentyn Issue Time Tracking --- Worklog Id: (was: 778203) Time Spent: 8h 10m (was: 8h) > Support **kwargs for PyTorch models. > > > Key: BEAM-14337 > URL: https://issues.apache.org/jira/browse/BEAM-14337 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-core >Reporter: Anand Inguva >Assignee: Andy Ye >Priority: P2 > Time Spent: 8h 10m > Remaining Estimate: 0h > > Some models in Pytorch instantiating from torch.nn.Module, has extra > parameters in the forward function call. These extra parameters can be passed > as Dict or as positional arguments. > Example of PyTorch models supported by Hugging Face -> > [https://huggingface.co/bert-base-uncased] > [Some torch models on Hugging > face|https://github.com/huggingface/transformers/blob/main/src/transformers/models/bert/modeling_bert.py] > Eg: > [https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertModel] > {code:java} > inputs = { > input_ids: Tensor1, > attention_mask: Tensor2, > token_type_ids: Tensor3, > } > model = BertModel.from_pretrained("bert-base-uncased") # which is a > # subclass of torch.nn.Module > outputs = model(**inputs) # model forward method should be expecting the keys > in the inputs as the positional arguments.{code} > > [Transformers|https://pytorch.org/hub/huggingface_pytorch-transformers/] > integrated in Pytorch is supported by Hugging Face as well. > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-14553) Dataflow portable job submission translate FileResultCoder only with window coder
[ https://issues.apache.org/jira/browse/BEAM-14553?focusedWorklogId=778202&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778202 ] ASF GitHub Bot logged work on BEAM-14553: - Author: ASF GitHub Bot Created on: 03/Jun/22 16:56 Start Date: 03/Jun/22 16:56 Worklog Time Spent: 10m Work Description: y1chi commented on code in PR #17818: URL: https://github.com/apache/beam/pull/17818#discussion_r889158954 ## sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileBasedSink.java: ## @@ -1196,7 +1196,7 @@ public static FileResultCoder of( @Override public List> getCoderArguments() { - return Arrays.asList(windowCoder); + return Arrays.asList(windowCoder, destinationCoder); Review Comment: So IIUC, the right thing to do is to have FileResult so that coder inference can work for both window type and destination type. But since all FileResult PCollections use explicit setCoders that's not a requirement. I still think window coder needs to be in the components so that FileResultCoder with different window coders won't collide with each other. Issue Time Tracking --- Worklog Id: (was: 778202) Time Spent: 2.5h (was: 2h 20m) > Dataflow portable job submission translate FileResultCoder only with window > coder > - > > Key: BEAM-14553 > URL: https://issues.apache.org/jira/browse/BEAM-14553 > Project: Beam > Issue Type: Bug > Components: runner-dataflow >Reporter: Yichi Zhang >Priority: P2 > Time Spent: 2.5h > Remaining Estimate: 0h > > The destination coder is neglected, if there are multiple FileResultCoders > with different destination coder, only first registration is successful. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-14504) Add support for including addittional parameters to executebundle method in fhirio.
[ https://issues.apache.org/jira/browse/BEAM-14504?focusedWorklogId=778200&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778200 ] ASF GitHub Bot logged work on BEAM-14504: - Author: ASF GitHub Bot Created on: 03/Jun/22 16:55 Start Date: 03/Jun/22 16:55 Worklog Time Spent: 10m Work Description: asf-ci commented on PR #17741: URL: https://github.com/apache/beam/pull/17741#issuecomment-1146177133 Can one of the admins verify this patch? Issue Time Tracking --- Worklog Id: (was: 778200) Time Spent: 7h 10m (was: 7h) > Add support for including addittional parameters to executebundle method in > fhirio. > --- > > Key: BEAM-14504 > URL: https://issues.apache.org/jira/browse/BEAM-14504 > Project: Beam > Issue Type: Improvement > Components: io-java-healthcare >Reporter: Fathima Mohammed >Assignee: Fathima Mohammed >Priority: P2 > Time Spent: 7h 10m > Remaining Estimate: 0h > > Add FhirBundleWithMetadata in executebundles method so that we can pass > additional information like message id. > FhirBundleWithMetadata represents a FHIR bundle, with it's metadata (eg. Hl7 > messageId) to be executed on the intermediate FHIR store. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-14504) Add support for including addittional parameters to executebundle method in fhirio.
[ https://issues.apache.org/jira/browse/BEAM-14504?focusedWorklogId=778201&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778201 ] ASF GitHub Bot logged work on BEAM-14504: - Author: ASF GitHub Bot Created on: 03/Jun/22 16:55 Start Date: 03/Jun/22 16:55 Worklog Time Spent: 10m Work Description: asf-ci commented on PR #17741: URL: https://github.com/apache/beam/pull/17741#issuecomment-1146177154 Can one of the admins verify this patch? Issue Time Tracking --- Worklog Id: (was: 778201) Time Spent: 7h 20m (was: 7h 10m) > Add support for including addittional parameters to executebundle method in > fhirio. > --- > > Key: BEAM-14504 > URL: https://issues.apache.org/jira/browse/BEAM-14504 > Project: Beam > Issue Type: Improvement > Components: io-java-healthcare >Reporter: Fathima Mohammed >Assignee: Fathima Mohammed >Priority: P2 > Time Spent: 7h 20m > Remaining Estimate: 0h > > Add FhirBundleWithMetadata in executebundles method so that we can pass > additional information like message id. > FhirBundleWithMetadata represents a FHIR bundle, with it's metadata (eg. Hl7 > messageId) to be executed on the intermediate FHIR store. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-14553) Dataflow portable job submission translate FileResultCoder only with window coder
[ https://issues.apache.org/jira/browse/BEAM-14553?focusedWorklogId=778199&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778199 ] ASF GitHub Bot logged work on BEAM-14553: - Author: ASF GitHub Bot Created on: 03/Jun/22 16:53 Start Date: 03/Jun/22 16:53 Worklog Time Spent: 10m Work Description: y1chi commented on PR #17818: URL: https://github.com/apache/beam/pull/17818#issuecomment-1146175229 Run Java PreCommit Issue Time Tracking --- Worklog Id: (was: 778199) Time Spent: 2h 20m (was: 2h 10m) > Dataflow portable job submission translate FileResultCoder only with window > coder > - > > Key: BEAM-14553 > URL: https://issues.apache.org/jira/browse/BEAM-14553 > Project: Beam > Issue Type: Bug > Components: runner-dataflow >Reporter: Yichi Zhang >Priority: P2 > Time Spent: 2h 20m > Remaining Estimate: 0h > > The destination coder is neglected, if there are multiple FileResultCoders > with different destination coder, only first registration is successful. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-14504) Add support for including addittional parameters to executebundle method in fhirio.
[ https://issues.apache.org/jira/browse/BEAM-14504?focusedWorklogId=778198&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778198 ] ASF GitHub Bot logged work on BEAM-14504: - Author: ASF GitHub Bot Created on: 03/Jun/22 16:49 Start Date: 03/Jun/22 16:49 Worklog Time Spent: 10m Work Description: pabloem commented on PR #17741: URL: https://github.com/apache/beam/pull/17741#issuecomment-1146170628 Run Java PreCommit Issue Time Tracking --- Worklog Id: (was: 778198) Time Spent: 7h (was: 6h 50m) > Add support for including addittional parameters to executebundle method in > fhirio. > --- > > Key: BEAM-14504 > URL: https://issues.apache.org/jira/browse/BEAM-14504 > Project: Beam > Issue Type: Improvement > Components: io-java-healthcare >Reporter: Fathima Mohammed >Assignee: Fathima Mohammed >Priority: P2 > Time Spent: 7h > Remaining Estimate: 0h > > Add FhirBundleWithMetadata in executebundles method so that we can pass > additional information like message id. > FhirBundleWithMetadata represents a FHIR bundle, with it's metadata (eg. Hl7 > messageId) to be executed on the intermediate FHIR store. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-14504) Add support for including addittional parameters to executebundle method in fhirio.
[ https://issues.apache.org/jira/browse/BEAM-14504?focusedWorklogId=778195&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778195 ] ASF GitHub Bot logged work on BEAM-14504: - Author: ASF GitHub Bot Created on: 03/Jun/22 16:46 Start Date: 03/Jun/22 16:46 Worklog Time Spent: 10m Work Description: pabloem closed pull request #17741: [BEAM-14504] Add support for including addittional parameters to executebundle method in fhirio. URL: https://github.com/apache/beam/pull/17741 Issue Time Tracking --- Worklog Id: (was: 778195) Time Spent: 6h 40m (was: 6.5h) > Add support for including addittional parameters to executebundle method in > fhirio. > --- > > Key: BEAM-14504 > URL: https://issues.apache.org/jira/browse/BEAM-14504 > Project: Beam > Issue Type: Improvement > Components: io-java-healthcare >Reporter: Fathima Mohammed >Assignee: Fathima Mohammed >Priority: P2 > Time Spent: 6h 40m > Remaining Estimate: 0h > > Add FhirBundleWithMetadata in executebundles method so that we can pass > additional information like message id. > FhirBundleWithMetadata represents a FHIR bundle, with it's metadata (eg. Hl7 > messageId) to be executed on the intermediate FHIR store. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-14504) Add support for including addittional parameters to executebundle method in fhirio.
[ https://issues.apache.org/jira/browse/BEAM-14504?focusedWorklogId=778196&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778196 ] ASF GitHub Bot logged work on BEAM-14504: - Author: ASF GitHub Bot Created on: 03/Jun/22 16:46 Start Date: 03/Jun/22 16:46 Worklog Time Spent: 10m Work Description: fbeevikm opened a new pull request, #17741: URL: https://github.com/apache/beam/pull/17741 **Please** add a meaningful description for your change here Add FhirBundleWithMetadata in executebundles method so that we can pass additional information like message id. FhirBundleWithMetadata represents a FHIR bundle, with it's metadata (eg. Hl7 messageId) to be executed on the intermediate FHIR store. Make FhirResourcePagesIterator constructor public. Whistle plugins calls this constructor. Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily: - [ ] [**Choose reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and mention them in a comment (`R: @username`). - [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue. - [ ] Update `CHANGES.md` with noteworthy changes. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf). See the [Contributor Guide](https://beam.apache.org/contribute) for more tips on [how to make review process smoother](https://beam.apache.org/contribute/#make-reviewers-job-easier). To check the build health, please visit [https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md](https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md) GitHub Actions Tests Status (on master branch) [![Build python source distribution and wheels](https://github.com/apache/beam/workflows/Build%20python%20source%20distribution%20and%20wheels/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Build+python+source+distribution+and+wheels%22+branch%3Amaster+event%3Aschedule) [![Python tests](https://github.com/apache/beam/workflows/Python%20tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Python+Tests%22+branch%3Amaster+event%3Aschedule) [![Java tests](https://github.com/apache/beam/workflows/Java%20Tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Java+Tests%22+branch%3Amaster+event%3Aschedule) See [CI.md](https://github.com/apache/beam/blob/master/CI.md) for more information about GitHub Actions CI. Issue Time Tracking --- Worklog Id: (was: 778196) Time Spent: 6h 50m (was: 6h 40m) > Add support for including addittional parameters to executebundle method in > fhirio. > --- > > Key: BEAM-14504 > URL: https://issues.apache.org/jira/browse/BEAM-14504 > Project: Beam > Issue Type: Improvement > Components: io-java-healthcare >Reporter: Fathima Mohammed >Assignee: Fathima Mohammed >Priority: P2 > Time Spent: 6h 50m > Remaining Estimate: 0h > > Add FhirBundleWithMetadata in executebundles method so that we can pass > additional information like message id. > FhirBundleWithMetadata represents a FHIR bundle, with it's metadata (eg. Hl7 > messageId) to be executed on the intermediate FHIR store. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-13945) Update BQ connector to support new JSON type
[ https://issues.apache.org/jira/browse/BEAM-13945?focusedWorklogId=778194&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778194 ] ASF GitHub Bot logged work on BEAM-13945: - Author: ASF GitHub Bot Created on: 03/Jun/22 16:44 Start Date: 03/Jun/22 16:44 Worklog Time Spent: 10m Work Description: pabloem merged PR #18038: URL: https://github.com/apache/beam/pull/18038 Issue Time Tracking --- Worklog Id: (was: 778194) Time Spent: 15h (was: 14h 50m) > Update BQ connector to support new JSON type > > > Key: BEAM-13945 > URL: https://issues.apache.org/jira/browse/BEAM-13945 > Project: Beam > Issue Type: New Feature > Components: io-java-gcp >Reporter: Chamikara Madhusanka Jayalath >Assignee: Ahmed Abualsaud >Priority: P2 > Time Spent: 15h > Remaining Estimate: 0h > > BQ has a new JSON type that is defined here: > https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#json_type > We should update Beam BQ Java and Python connectors to support that for > various read methods (export jobs, storage API) and write methods (load jobs, > streaming inserts, storage API). > We should also add integration tests that exercise reading from /writing to > BQ tables with columns that has JSON type. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-14546) [Go SDK] passert.Count succeeds for empty PCollections.
[ https://issues.apache.org/jira/browse/BEAM-14546?focusedWorklogId=778189&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778189 ] ASF GitHub Bot logged work on BEAM-14546: - Author: ASF GitHub Bot Created on: 03/Jun/22 16:30 Start Date: 03/Jun/22 16:30 Worklog Time Spent: 10m Work Description: lostluck merged PR #17813: URL: https://github.com/apache/beam/pull/17813 Issue Time Tracking --- Worklog Id: (was: 778189) Time Spent: 3.5h (was: 3h 20m) > [Go SDK] passert.Count succeeds for empty PCollections. > --- > > Key: BEAM-14546 > URL: https://issues.apache.org/jira/browse/BEAM-14546 > Project: Beam > Issue Type: Improvement > Components: sdk-go >Reporter: Robert Burke >Assignee: Jack McCluskey >Priority: P3 > Time Spent: 3.5h > Remaining Estimate: 0h > > https://github.com/apache/beam/blob/sdks/v2.39.0/sdks/go/pkg/beam/testing/passert/count.go#L28 > Since it's using a Combine to do the count, it never executes for empty > Pcollections, and is unable to fail. > The fix is: when count > 0, plumb the pcollection through as a side input to > a DoFn that requires the side input to be non-empty. This would catch the > empty PCollection case. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-974) Adds attributes support to PubsubIO
[ https://issues.apache.org/jira/browse/BEAM-974?focusedWorklogId=778186&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778186 ] ASF GitHub Bot logged work on BEAM-974: --- Author: ASF GitHub Bot Created on: 03/Jun/22 16:29 Start Date: 03/Jun/22 16:29 Worklog Time Spent: 10m Work Description: kennknowles opened a new issue, #18180: URL: https://github.com/apache/beam/issues/18180 Hello, I'm a Google Cloud Support Specialist who recently owned a case from a platinum customer who was reporting "validation of workflow failed" internal error when attempting to use new beta SDK. Eventually, it was determined that the problem was that they weren't using .withCoder, which became required after BEAM-974[1]. They would like to request that a more informative error be thrown, as the current one is far too vague to be able to derive any useful information from. Thank you. Regards, Garrett Anderson Cloud Support Specialist Google Cloud Platform Support [1] https://issues.apache.org/jira/browse/BEAM-974 Imported from Jira [BEAM-1662](https://issues.apache.org/jira/browse/BEAM-1662). Original Jira may contain additional context. Reported by: gfunk. Issue Time Tracking --- Worklog Id: (was: 778186) Remaining Estimate: 0h Time Spent: 10m > Adds attributes support to PubsubIO > --- > > Key: BEAM-974 > URL: https://issues.apache.org/jira/browse/BEAM-974 > Project: Beam > Issue Type: Improvement > Components: sdk-java-core >Reporter: Reuven Lax >Assignee: Reuven Lax >Priority: P2 > Labels: Done > Fix For: 0.5.0 > > Time Spent: 10m > Remaining Estimate: 0h > > Adds support for Pubsub attributes by introducing a new PubsubMessage class > that allows manipulation of payload and attributes. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-13945) Update BQ connector to support new JSON type
[ https://issues.apache.org/jira/browse/BEAM-13945?focusedWorklogId=778176&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778176 ] ASF GitHub Bot logged work on BEAM-13945: - Author: ASF GitHub Bot Created on: 03/Jun/22 16:15 Start Date: 03/Jun/22 16:15 Worklog Time Spent: 10m Work Description: pabloem commented on PR #17492: URL: https://github.com/apache/beam/pull/17492#issuecomment-1146144655 sorry about that. I've opened a rollback and a fix forward: https://github.com/apache/beam/pull/18038 https://github.com/apache/beam/pull/18060 Issue Time Tracking --- Worklog Id: (was: 778176) Time Spent: 14h 50m (was: 14h 40m) > Update BQ connector to support new JSON type > > > Key: BEAM-13945 > URL: https://issues.apache.org/jira/browse/BEAM-13945 > Project: Beam > Issue Type: New Feature > Components: io-java-gcp >Reporter: Chamikara Madhusanka Jayalath >Assignee: Ahmed Abualsaud >Priority: P2 > Time Spent: 14h 50m > Remaining Estimate: 0h > > BQ has a new JSON type that is defined here: > https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#json_type > We should update Beam BQ Java and Python connectors to support that for > various read methods (export jobs, storage API) and write methods (load jobs, > streaming inserts, storage API). > We should also add integration tests that exercise reading from /writing to > BQ tables with columns that has JSON type. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-13945) Update BQ connector to support new JSON type
[ https://issues.apache.org/jira/browse/BEAM-13945?focusedWorklogId=778175&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778175 ] ASF GitHub Bot logged work on BEAM-13945: - Author: ASF GitHub Bot Created on: 03/Jun/22 16:15 Start Date: 03/Jun/22 16:15 Worklog Time Spent: 10m Work Description: pabloem commented on PR #18038: URL: https://github.com/apache/beam/pull/18038#issuecomment-1146144032 This fixes breakage introduced in #17492 Issue Time Tracking --- Worklog Id: (was: 778175) Time Spent: 14h 40m (was: 14.5h) > Update BQ connector to support new JSON type > > > Key: BEAM-13945 > URL: https://issues.apache.org/jira/browse/BEAM-13945 > Project: Beam > Issue Type: New Feature > Components: io-java-gcp >Reporter: Chamikara Madhusanka Jayalath >Assignee: Ahmed Abualsaud >Priority: P2 > Time Spent: 14h 40m > Remaining Estimate: 0h > > BQ has a new JSON type that is defined here: > https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#json_type > We should update Beam BQ Java and Python connectors to support that for > various read methods (export jobs, storage API) and write methods (load jobs, > streaming inserts, storage API). > We should also add integration tests that exercise reading from /writing to > BQ tables with columns that has JSON type. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-9482) beam_PerformanceTests_Kafka_IO failing due to " provided port is already allocated"
[ https://issues.apache.org/jira/browse/BEAM-9482?focusedWorklogId=778170&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778170 ] ASF GitHub Bot logged work on BEAM-9482: Author: ASF GitHub Bot Created on: 03/Jun/22 16:12 Start Date: 03/Jun/22 16:12 Worklog Time Spent: 10m Work Description: benWize commented on PR #17727: URL: https://github.com/apache/beam/pull/17727#issuecomment-1146141837 Hi @aaltay, could you help me to run a seed job for this PR to validate my changes? Issue Time Tracking --- Worklog Id: (was: 778170) Time Spent: 5h 10m (was: 5h) > beam_PerformanceTests_Kafka_IO failing due to " provided port is already > allocated" > --- > > Key: BEAM-9482 > URL: https://issues.apache.org/jira/browse/BEAM-9482 > Project: Beam > Issue Type: Bug > Components: test-failures >Reporter: Chamikara Madhusanka Jayalath >Assignee: Benjamin Gonzalez >Priority: P1 > Labels: sickbay > Time Spent: 5h 10m > Remaining Estimate: 0h > > For example, > [https://builds.apache.org/view/A-D/view/Beam/view/PerformanceTests/job/beam_PerformanceTests_Kafka_IO/514/console] > > 18:55:33 Error from server (Invalid): error when creating > "/home/jenkins/jenkins-slave/workspace/beam_PerformanceTests_Kafka_IO/src/.test-infra/kubernetes/kafka-cluster/04-outside-services/outside-0.yml": > Service "outside-0" is invalid: spec.ports[0].nodePort: Invalid value: > 32400: provided port is already allocated > 18:55:33 Error from server (Invalid): error when creating > "/home/jenkins/jenkins-slave/workspace/beam_PerformanceTests_Kafka_IO/src/.test-infra/kubernetes/kafka-cluster/04-outside-services/outside-1.yml": > Service "outside-1" is invalid: spec.ports[0].nodePort: Invalid value: > 32401: provided port is already allocated > 18:55:33 Error from server (Invalid): error when creating > "/home/jenkins/jenkins-slave/workspace/beam_PerformanceTests_Kafka_IO/src/.test-infra/kubernetes/kafka-cluster/04-outside-services/outside-2.yml": > Service "outside-2" is invalid: spec.ports[0].nodePort: Invalid value: > 32402: provided port is already allocated > 1 > > Seems like we tried three ports but they were being used. Probably we should > update code to find an unused port dynamically. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-13945) Update BQ connector to support new JSON type
[ https://issues.apache.org/jira/browse/BEAM-13945?focusedWorklogId=778168&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778168 ] ASF GitHub Bot logged work on BEAM-13945: - Author: ASF GitHub Bot Created on: 03/Jun/22 16:11 Start Date: 03/Jun/22 16:11 Worklog Time Spent: 10m Work Description: pabloem opened a new pull request, #18038: URL: https://github.com/apache/beam/pull/18038 … BQ connector to support new JSON type " This reverts commit 12be69dcd14df43e314a4c3abb086a7066c2a6a0. **Please** add a meaningful description for your change here Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily: - [ ] [**Choose reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and mention them in a comment (`R: @username`). - [ ] Add a link to the appropriate issue in your description, if applicable. This will automatically link the pull request to the issue. - [ ] Update `CHANGES.md` with noteworthy changes. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf). See the [Contributor Guide](https://beam.apache.org/contribute) for more tips on [how to make review process smoother](https://beam.apache.org/contribute/#make-reviewers-job-easier). To check the build health, please visit [https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md](https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md) GitHub Actions Tests Status (on master branch) [![Build python source distribution and wheels](https://github.com/apache/beam/workflows/Build%20python%20source%20distribution%20and%20wheels/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Build+python+source+distribution+and+wheels%22+branch%3Amaster+event%3Aschedule) [![Python tests](https://github.com/apache/beam/workflows/Python%20tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Python+Tests%22+branch%3Amaster+event%3Aschedule) [![Java tests](https://github.com/apache/beam/workflows/Java%20Tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Java+Tests%22+branch%3Amaster+event%3Aschedule) See [CI.md](https://github.com/apache/beam/blob/master/CI.md) for more information about GitHub Actions CI. Issue Time Tracking --- Worklog Id: (was: 778168) Time Spent: 14.5h (was: 14h 20m) > Update BQ connector to support new JSON type > > > Key: BEAM-13945 > URL: https://issues.apache.org/jira/browse/BEAM-13945 > Project: Beam > Issue Type: New Feature > Components: io-java-gcp >Reporter: Chamikara Madhusanka Jayalath >Assignee: Ahmed Abualsaud >Priority: P2 > Time Spent: 14.5h > Remaining Estimate: 0h > > BQ has a new JSON type that is defined here: > https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#json_type > We should update Beam BQ Java and Python connectors to support that for > various read methods (export jobs, storage API) and write methods (load jobs, > streaming inserts, storage API). > We should also add integration tests that exercise reading from /writing to > BQ tables with columns that has JSON type. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-11104) [Go SDK] DoFn Self Checkpointing
[ https://issues.apache.org/jira/browse/BEAM-11104?focusedWorklogId=778167&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778167 ] ASF GitHub Bot logged work on BEAM-11104: - Author: ASF GitHub Bot Created on: 03/Jun/22 16:09 Start Date: 03/Jun/22 16:09 Worklog Time Spent: 10m Work Description: jrmccluskey commented on PR #17956: URL: https://github.com/apache/beam/pull/17956#issuecomment-1146139843 R: @riteshghorse @damccorm Issue Time Tracking --- Worklog Id: (was: 778167) Time Spent: 27h (was: 26h 50m) > [Go SDK] DoFn Self Checkpointing > > > Key: BEAM-11104 > URL: https://issues.apache.org/jira/browse/BEAM-11104 > Project: Beam > Issue Type: Sub-task > Components: sdk-go >Reporter: Robert Burke >Assignee: Jack McCluskey >Priority: P3 > Fix For: 2.40.0 > > Time Spent: 27h > Remaining Estimate: 0h > > Allow SplittableDoFns to self checkpoint. > Design doc: > [https://docs.google.com/document/d/1_JbzjY9JR07ZK5v7PcZevUfzHPsqwzfV7W6AouNpMPk/edit?usp=sharing] > > Feature is written E2E and users will be able to return ProcessContinuations > from SDFs as of 2.39.0 but the full behavior has not been fully validated. An > integration test that validates self-checkpointing is working as-intended > will need to be written and passing before the feature is no longer > considered experimental and this ticket is marked as resolved. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-13171) Support for stopReadTime on KafkaIO SDF
[ https://issues.apache.org/jira/browse/BEAM-13171?focusedWorklogId=778164&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778164 ] ASF GitHub Bot logged work on BEAM-13171: - Author: ASF GitHub Bot Created on: 03/Jun/22 16:06 Start Date: 03/Jun/22 16:06 Worklog Time Spent: 10m Work Description: kennknowles commented on PR #15951: URL: https://github.com/apache/beam/pull/15951#issuecomment-1146131230 I may be getting mixed up. I had thought that https://github.com/apache/beam/pull/15951#issuecomment-990238052 was addressed. Specifically that we would have a bounded-per-input-element SDF, hence bounded PCollections, when stop read time was specified. Is this false? Are PCollections still unbounded? Issue Time Tracking --- Worklog Id: (was: 778164) Time Spent: 6h 50m (was: 6h 40m) > Support for stopReadTime on KafkaIO SDF > > > Key: BEAM-13171 > URL: https://issues.apache.org/jira/browse/BEAM-13171 > Project: Beam > Issue Type: Improvement > Components: io-java-kafka >Reporter: Mostafa Aghajani >Assignee: Mostafa Aghajani >Priority: P2 > Fix For: 2.36.0 > > Time Spent: 6h 50m > Remaining Estimate: 0h > > There is already the support for startReadTime using SDF when the Kafka > version is supported. > I want to add the support for stopReadTIme so we can extract messages from > Kafka only up to a point in time and then the task will be finished. > One use case: when you want to only re-process (re-read) a period of time for > a Kafka topic in your pipeline. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-11104) [Go SDK] DoFn Self Checkpointing
[ https://issues.apache.org/jira/browse/BEAM-11104?focusedWorklogId=778160&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778160 ] ASF GitHub Bot logged work on BEAM-11104: - Author: ASF GitHub Bot Created on: 03/Jun/22 15:59 Start Date: 03/Jun/22 15:59 Worklog Time Spent: 10m Work Description: jrmccluskey opened a new pull request, #17956: URL: https://github.com/apache/beam/pull/17956 Adds small code snippet example to the Beam Programming Guide that demonstrates self-checkpointing behavior in Beam Go. Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily: - [ ] [**Choose reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and mention them in a comment (`R: @username`). - [ ] Add a link to the appropriate issue in your description, if applicable. This will automatically link the pull request to the issue. - [ ] Update `CHANGES.md` with noteworthy changes. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf). See the [Contributor Guide](https://beam.apache.org/contribute) for more tips on [how to make review process smoother](https://beam.apache.org/contribute/#make-reviewers-job-easier). To check the build health, please visit [https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md](https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md) GitHub Actions Tests Status (on master branch) [![Build python source distribution and wheels](https://github.com/apache/beam/workflows/Build%20python%20source%20distribution%20and%20wheels/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Build+python+source+distribution+and+wheels%22+branch%3Amaster+event%3Aschedule) [![Python tests](https://github.com/apache/beam/workflows/Python%20tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Python+Tests%22+branch%3Amaster+event%3Aschedule) [![Java tests](https://github.com/apache/beam/workflows/Java%20Tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Java+Tests%22+branch%3Amaster+event%3Aschedule) See [CI.md](https://github.com/apache/beam/blob/master/CI.md) for more information about GitHub Actions CI. Issue Time Tracking --- Worklog Id: (was: 778160) Time Spent: 26.5h (was: 26h 20m) > [Go SDK] DoFn Self Checkpointing > > > Key: BEAM-11104 > URL: https://issues.apache.org/jira/browse/BEAM-11104 > Project: Beam > Issue Type: Sub-task > Components: sdk-go >Reporter: Robert Burke >Assignee: Jack McCluskey >Priority: P3 > Fix For: 2.40.0 > > Time Spent: 26.5h > Remaining Estimate: 0h > > Allow SplittableDoFns to self checkpoint. > Design doc: > [https://docs.google.com/document/d/1_JbzjY9JR07ZK5v7PcZevUfzHPsqwzfV7W6AouNpMPk/edit?usp=sharing] > > Feature is written E2E and users will be able to return ProcessContinuations > from SDFs as of 2.39.0 but the full behavior has not been fully validated. An > integration test that validates self-checkpointing is working as-intended > will need to be written and passing before the feature is no longer > considered experimental and this ticket is marked as resolved. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-11104) [Go SDK] DoFn Self Checkpointing
[ https://issues.apache.org/jira/browse/BEAM-11104?focusedWorklogId=778162&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778162 ] ASF GitHub Bot logged work on BEAM-11104: - Author: ASF GitHub Bot Created on: 03/Jun/22 15:59 Start Date: 03/Jun/22 15:59 Worklog Time Spent: 10m Work Description: asf-ci commented on PR #17956: URL: https://github.com/apache/beam/pull/17956#issuecomment-1146103317 Can one of the admins verify this patch? Issue Time Tracking --- Worklog Id: (was: 778162) Time Spent: 26h 50m (was: 26h 40m) > [Go SDK] DoFn Self Checkpointing > > > Key: BEAM-11104 > URL: https://issues.apache.org/jira/browse/BEAM-11104 > Project: Beam > Issue Type: Sub-task > Components: sdk-go >Reporter: Robert Burke >Assignee: Jack McCluskey >Priority: P3 > Fix For: 2.40.0 > > Time Spent: 26h 50m > Remaining Estimate: 0h > > Allow SplittableDoFns to self checkpoint. > Design doc: > [https://docs.google.com/document/d/1_JbzjY9JR07ZK5v7PcZevUfzHPsqwzfV7W6AouNpMPk/edit?usp=sharing] > > Feature is written E2E and users will be able to return ProcessContinuations > from SDFs as of 2.39.0 but the full behavior has not been fully validated. An > integration test that validates self-checkpointing is working as-intended > will need to be written and passing before the feature is no longer > considered experimental and this ticket is marked as resolved. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-12572) All beam examples should get continuously exercised
[ https://issues.apache.org/jira/browse/BEAM-12572?focusedWorklogId=778159&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778159 ] ASF GitHub Bot logged work on BEAM-12572: - Author: ASF GitHub Bot Created on: 03/Jun/22 15:55 Start Date: 03/Jun/22 15:55 Worklog Time Spent: 10m Work Description: fernando-wizeline commented on PR #17015: URL: https://github.com/apache/beam/pull/17015#issuecomment-1146100411 Hi @aaltay! Do you know someone who can help us to review this? Thanks! Issue Time Tracking --- Worklog Id: (was: 778159) Time Spent: 70h 40m (was: 70.5h) > All beam examples should get continuously exercised > --- > > Key: BEAM-12572 > URL: https://issues.apache.org/jira/browse/BEAM-12572 > Project: Beam > Issue Type: New Feature > Components: testing >Reporter: Valentyn Tymofieiev >Assignee: Fernando Morales >Priority: P3 > Labels: stale-assigned > Time Spent: 70h 40m > Remaining Estimate: 0h > > Sometimes our examples become broken without us noticing. For example, see: > https://lists.apache.org/thread.html/r45340bbee91a6caf798fe62d24388f645f8792cc7506351fd66adec3%40%3Cdev.beam.apache.org%3E > We should test our examples better. > Test examples on Dataflow, Spark, Flink and Direct runners -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-12918) TPC-DS: Add Jenkins jobs
[ https://issues.apache.org/jira/browse/BEAM-12918?focusedWorklogId=778155&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778155 ] ASF GitHub Bot logged work on BEAM-12918: - Author: ASF GitHub Bot Created on: 03/Jun/22 15:38 Start Date: 03/Jun/22 15:38 Worklog Time Spent: 10m Work Description: aromanenko-dev commented on PR #17680: URL: https://github.com/apache/beam/pull/17680#issuecomment-1146086878 Run Dataflow Runner Nexmark Tests Issue Time Tracking --- Worklog Id: (was: 778155) Time Spent: 10.5h (was: 10h 20m) > TPC-DS: Add Jenkins jobs > > > Key: BEAM-12918 > URL: https://issues.apache.org/jira/browse/BEAM-12918 > Project: Beam > Issue Type: Test > Components: testing-tpcds >Reporter: Alexey Romanenko >Assignee: Alexey Romanenko >Priority: P2 > Time Spent: 10.5h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-14014) Support impersonation credentials in Dataflow runner.
[ https://issues.apache.org/jira/browse/BEAM-14014?focusedWorklogId=778139&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778139 ] ASF GitHub Bot logged work on BEAM-14014: - Author: ASF GitHub Bot Created on: 03/Jun/22 15:20 Start Date: 03/Jun/22 15:20 Worklog Time Spent: 10m Work Description: kennknowles commented on code in PR #17394: URL: https://github.com/apache/beam/pull/17394#discussion_r889045192 ## sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/options/GcpOptions.java: ## @@ -169,6 +169,25 @@ public interface GcpOptions extends GoogleApiDebugOptions, PipelineOptions { void setGcpCredential(Credentials value); + /** + * All API requests will be made as the given service account or target service account in an + * impersonation delegation chain instead of the currently selected account. You can specify + * either a single service account as the impersonator, or a comma-separated list of service + * accounts to create an impersonation delegation chain. + */ + @Description( + "All API requests will be made as the given service account or" + + " target service account in an impersonation delegation chain" + + " instead of the currently selected account. You can specify" + + " either a single service account as the impersonator, or a" + + " comma-separated list of service accounts to create an" + + " impersonation delegation chain.") + @JsonIgnore + @Nullable + String getImpersonateServiceAccount(); + + void setImpersonateServiceAccount(String impersonateServiceAccount); Review Comment: Ah, I did not realize this. That could be a useful follow-up. We do want the format of the parameter to be the same across SDKs and across Beam and gcloud, etc. But it seems that comma-separated strings will be the same across all of them. Issue Time Tracking --- Worklog Id: (was: 778139) Time Spent: 9h (was: 8h 50m) > Support impersonation credentials in Dataflow runner. > - > > Key: BEAM-14014 > URL: https://issues.apache.org/jira/browse/BEAM-14014 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow >Reporter: Valentyn Tymofieiev >Assignee: Ryan Thompson >Priority: P2 > Fix For: 2.40.0 > > Time Spent: 9h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-13945) Update BQ connector to support new JSON type
[ https://issues.apache.org/jira/browse/BEAM-13945?focusedWorklogId=778129&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778129 ] ASF GitHub Bot logged work on BEAM-13945: - Author: ASF GitHub Bot Created on: 03/Jun/22 15:08 Start Date: 03/Jun/22 15:08 Worklog Time Spent: 10m Work Description: lukecwik commented on PR #17492: URL: https://github.com/apache/beam/pull/17492#issuecomment-1146058065 I believe this is the cause for the permared Java PreCommit which is failing due to: ``` 09:55:24 * What went wrong: 09:55:24 Execution failed for task ':sdks:java:io:google-cloud-platform:analyzeClassesDependencies'. 09:55:24 > Dependency analysis found issues. 09:55:24 usedUndeclaredArtifacts 09:55:24- org.json:json:20200518@jar ``` Issue Time Tracking --- Worklog Id: (was: 778129) Time Spent: 14h 20m (was: 14h 10m) > Update BQ connector to support new JSON type > > > Key: BEAM-13945 > URL: https://issues.apache.org/jira/browse/BEAM-13945 > Project: Beam > Issue Type: New Feature > Components: io-java-gcp >Reporter: Chamikara Madhusanka Jayalath >Assignee: Ahmed Abualsaud >Priority: P2 > Time Spent: 14h 20m > Remaining Estimate: 0h > > BQ has a new JSON type that is defined here: > https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#json_type > We should update Beam BQ Java and Python connectors to support that for > various read methods (export jobs, storage API) and write methods (load jobs, > streaming inserts, storage API). > We should also add integration tests that exercise reading from /writing to > BQ tables with columns that has JSON type. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-14441) Migrate from Jira to GitHub Issues
[ https://issues.apache.org/jira/browse/BEAM-14441?focusedWorklogId=778126&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778126 ] ASF GitHub Bot logged work on BEAM-14441: - Author: ASF GitHub Bot Created on: 03/Jun/22 15:03 Start Date: 03/Jun/22 15:03 Worklog Time Spent: 10m Work Description: tvalentyn merged PR #17812: URL: https://github.com/apache/beam/pull/17812 Issue Time Tracking --- Worklog Id: (was: 778126) Time Spent: 3h 10m (was: 3h) > Migrate from Jira to GitHub Issues > -- > > Key: BEAM-14441 > URL: https://issues.apache.org/jira/browse/BEAM-14441 > Project: Beam > Issue Type: New Feature > Components: beam-community >Reporter: Danny McCormick >Assignee: Danny McCormick >Priority: P2 > Time Spent: 3h 10m > Remaining Estimate: 0h > > Discussion & Planning Thread > https://lists.apache.org/thread/q5nbwxqvfkzlz664c4kchzkbj26c3r89 > Migration thread: > https://lists.apache.org/thread/p623g3k8hx5dc9jzmso2n9mk2cdk15v9 -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-14441) Migrate from Jira to GitHub Issues
[ https://issues.apache.org/jira/browse/BEAM-14441?focusedWorklogId=778125&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778125 ] ASF GitHub Bot logged work on BEAM-14441: - Author: ASF GitHub Bot Created on: 03/Jun/22 15:00 Start Date: 03/Jun/22 15:00 Worklog Time Spent: 10m Work Description: damccorm commented on PR #17812: URL: https://github.com/apache/beam/pull/17812#issuecomment-1146050749 R: @kennknowles Issue Time Tracking --- Worklog Id: (was: 778125) Time Spent: 3h (was: 2h 50m) > Migrate from Jira to GitHub Issues > -- > > Key: BEAM-14441 > URL: https://issues.apache.org/jira/browse/BEAM-14441 > Project: Beam > Issue Type: New Feature > Components: beam-community >Reporter: Danny McCormick >Assignee: Danny McCormick >Priority: P2 > Time Spent: 3h > Remaining Estimate: 0h > > Discussion & Planning Thread > https://lists.apache.org/thread/q5nbwxqvfkzlz664c4kchzkbj26c3r89 > Migration thread: > https://lists.apache.org/thread/p623g3k8hx5dc9jzmso2n9mk2cdk15v9 -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-14557) Migrate exclusively to the v1 progress/metrics API
[ https://issues.apache.org/jira/browse/BEAM-14557?focusedWorklogId=778108&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778108 ] ASF GitHub Bot logged work on BEAM-14557: - Author: ASF GitHub Bot Created on: 03/Jun/22 14:29 Start Date: 03/Jun/22 14:29 Worklog Time Spent: 10m Work Description: riteshghorse commented on PR #17821: URL: https://github.com/apache/beam/pull/17821#issuecomment-1146022317 Okay, it looks like the universal runner extracts metrics from MonitoringInfos for producing the pipeline result. Need to change that as well. Issue Time Tracking --- Worklog Id: (was: 778108) Remaining Estimate: 0h Time Spent: 10m > Migrate exclusively to the v1 progress/metrics API > -- > > Key: BEAM-14557 > URL: https://issues.apache.org/jira/browse/BEAM-14557 > Project: Beam > Issue Type: Improvement > Components: sdk-go >Affects Versions: 2.40.0 >Reporter: Robert Burke >Assignee: Ritesh Ghorse >Priority: P2 > Fix For: 2.40.0 > > Time Spent: 10m > Remaining Estimate: 0h > > The Go SDK has continued to send the legacy metrics to runners, > simultaneously with the new metrics. We should remove the legacy approach and > only use the modern version which is more efficient. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-14406) Integration testing of Pipeline drain in Go SDK
[ https://issues.apache.org/jira/browse/BEAM-14406?focusedWorklogId=778101&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778101 ] ASF GitHub Bot logged work on BEAM-14406: - Author: ASF GitHub Bot Created on: 03/Jun/22 14:04 Start Date: 03/Jun/22 14:04 Worklog Time Spent: 10m Work Description: github-actions[bot] commented on PR #17814: URL: https://github.com/apache/beam/pull/17814#issuecomment-1145998797 R: @youngoli for final approval Issue Time Tracking --- Worklog Id: (was: 778101) Time Spent: 0.5h (was: 20m) > Integration testing of Pipeline drain in Go SDK > --- > > Key: BEAM-14406 > URL: https://issues.apache.org/jira/browse/BEAM-14406 > Project: Beam > Issue Type: Test > Components: sdk-go >Reporter: Ritesh Ghorse >Assignee: Ritesh Ghorse >Priority: P2 > Time Spent: 0.5h > Remaining Estimate: 0h > > Add an integration test for pipeline drain feature > [BEAM-11106|https://issues.apache.org/jira/browse/BEAM-11106] added in > https://github.com/apache/beam/pull/17432. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work logged] (BEAM-14546) [Go SDK] passert.Count succeeds for empty PCollections.
[ https://issues.apache.org/jira/browse/BEAM-14546?focusedWorklogId=778098&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778098 ] ASF GitHub Bot logged work on BEAM-14546: - Author: ASF GitHub Bot Created on: 03/Jun/22 13:51 Start Date: 03/Jun/22 13:51 Worklog Time Spent: 10m Work Description: jrmccluskey commented on PR #17813: URL: https://github.com/apache/beam/pull/17813#issuecomment-1145987212 Run GoPortable PreCommit Issue Time Tracking --- Worklog Id: (was: 778098) Time Spent: 3h 20m (was: 3h 10m) > [Go SDK] passert.Count succeeds for empty PCollections. > --- > > Key: BEAM-14546 > URL: https://issues.apache.org/jira/browse/BEAM-14546 > Project: Beam > Issue Type: Improvement > Components: sdk-go >Reporter: Robert Burke >Assignee: Jack McCluskey >Priority: P3 > Time Spent: 3h 20m > Remaining Estimate: 0h > > https://github.com/apache/beam/blob/sdks/v2.39.0/sdks/go/pkg/beam/testing/passert/count.go#L28 > Since it's using a Combine to do the count, it never executes for empty > Pcollections, and is unable to fail. > The fix is: when count > 0, plumb the pcollection through as a side input to > a DoFn that requires the side input to be non-empty. This would catch the > empty PCollection case. -- This message was sent by Atlassian Jira (v8.20.7#820007)