[jira] [Updated] (BEAM-9204) HBase SDF @SplitRestriction does not take the range input into account to restrict splits
[ https://issues.apache.org/jira/browse/BEAM-9204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismaël Mejía updated BEAM-9204: --- Description: This is an issue because the split is called multiple times and in this case it will produce repeated work. (was: This is an issue because it is common if split is called multiple times work this will produce repeated work.) > HBase SDF @SplitRestriction does not take the range input into account to > restrict splits > - > > Key: BEAM-9204 > URL: https://issues.apache.org/jira/browse/BEAM-9204 > Project: Beam > Issue Type: Bug > Components: io-java-hbase >Reporter: Ismaël Mejía >Assignee: Ismaël Mejía >Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > This is an issue because the split is called multiple times and in this > case it will produce repeated work. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9162) Upgrade Jackson to version 2.10.2
[ https://issues.apache.org/jira/browse/BEAM-9162?focusedWorklogId=378097&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378097 ] ASF GitHub Bot logged work on BEAM-9162: Author: ASF GitHub Bot Created on: 28/Jan/20 08:25 Start Date: 28/Jan/20 08:25 Worklog Time Spent: 10m Work Description: iemejia commented on issue #10643: [BEAM-9162] Upgrade Jackson to version 2.10.2 URL: https://github.com/apache/beam/pull/10643#issuecomment-579133146 @suztomo maybe you know a way I can run the linkagechecker analysis in the full set of Beam modules? I think it is more scalable to have a task for that which we invoke during PRs to validate that no regressions are introduced, as suggested by Luke. (I can do that in Maven but my gradle-fu is still not good enough). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378097) Time Spent: 1h 40m (was: 1.5h) > Upgrade Jackson to version 2.10.2 > - > > Key: BEAM-9162 > URL: https://issues.apache.org/jira/browse/BEAM-9162 > Project: Beam > Issue Type: Improvement > Components: build-system, sdk-java-core >Reporter: Ismaël Mejía >Assignee: Ismaël Mejía >Priority: Minor > Time Spent: 1h 40m > Remaining Estimate: 0h > > Jackson has a new way to deal with [deserialization security > issues|https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.10] in > 2.10.x so worth the upgrade. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-9204) HBase SDF @SplitRestriction does not take the range input into account to restrict splits
Ismaël Mejía created BEAM-9204: -- Summary: HBase SDF @SplitRestriction does not take the range input into account to restrict splits Key: BEAM-9204 URL: https://issues.apache.org/jira/browse/BEAM-9204 Project: Beam Issue Type: Bug Components: io-java-hbase Reporter: Ismaël Mejía Assignee: Ismaël Mejía -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-7810) Allow ValueProvider arguments to ReadFromDatastore
[ https://issues.apache.org/jira/browse/BEAM-7810?focusedWorklogId=378091&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378091 ] ASF GitHub Bot logged work on BEAM-7810: Author: ASF GitHub Bot Created on: 28/Jan/20 08:08 Start Date: 28/Jan/20 08:08 Worklog Time Spent: 10m Work Description: EDjur commented on issue #10683: [BEAM-7810] Added ValueProvider support for Datastore query namespaces URL: https://github.com/apache/beam/pull/10683#issuecomment-579127724 Cheers for opening and fixing that. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378091) Time Spent: 1h 10m (was: 1h) > Allow ValueProvider arguments to ReadFromDatastore > -- > > Key: BEAM-7810 > URL: https://issues.apache.org/jira/browse/BEAM-7810 > Project: Beam > Issue Type: New Feature > Components: io-py-gcp >Reporter: Udi Meiri >Assignee: Elias Djurfeldt >Priority: Minor > Fix For: 2.20.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > From: > https://stackoverflow.com/questions/56748893/trying-to-achieve-runtime-value-of-namespace-of-datastore-in-dataflow-template -- This message was sent by Atlassian Jira (v8.3.4#803005)
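[Editor's note] The feature above rests on Beam's ValueProvider pattern: a transform parameter is wrapped so its value is read when the pipeline executes rather than when the graph is built, which is what lets a templated Dataflow job take the Datastore namespace at runtime. A minimal self-contained sketch of that pattern, in plain Python; these are not the actual apache_beam.options.value_provider classes, and DeferredValueProvider is a hypothetical stand-in for Beam's RuntimeValueProvider:

```python
class StaticValueProvider:
    """Value known when the pipeline graph is constructed."""

    def __init__(self, value):
        self._value = value

    def is_accessible(self):
        return True

    def get(self):
        return self._value


class DeferredValueProvider:
    """Hypothetical stand-in for a runtime-provided value: .get() only
    works once the runner has supplied the template argument."""

    def __init__(self):
        self._resolved = False
        self._value = None

    def set(self, value):
        # Called by the "runner" once the template argument arrives.
        self._value = value
        self._resolved = True

    def is_accessible(self):
        return self._resolved

    def get(self):
        if not self._resolved:
            raise RuntimeError("value is only available at runtime")
        return self._value


def query_namespace(namespace_provider):
    """Like ReadFromDatastore after BEAM-7810: the namespace is resolved
    via .get() at execution time, not at pipeline-construction time."""
    return namespace_provider.get()
```

The key design point is that the transform stores the provider, not the value, and defers the `.get()` call until the read actually runs.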
[jira] [Updated] (BEAM-9204) HBase SDF @SplitRestriction does not take the range input into account to restrict splits
[ https://issues.apache.org/jira/browse/BEAM-9204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismaël Mejía updated BEAM-9204: --- Status: Open (was: Triage Needed) > HBase SDF @SplitRestriction does not take the range input into account to > restrict splits > - > > Key: BEAM-9204 > URL: https://issues.apache.org/jira/browse/BEAM-9204 > Project: Beam > Issue Type: Bug > Components: io-java-hbase >Reporter: Ismaël Mejía >Assignee: Ismaël Mejía >Priority: Minor > > This is an issue because it is common for split to be called multiple times, > and this will produce repeated work. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-9204) HBase SDF @SplitRestriction does not take the range input into account to restrict splits
[ https://issues.apache.org/jira/browse/BEAM-9204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismaël Mejía updated BEAM-9204: --- Description: This is an issue because it is common for split to be called multiple times, and this will produce repeated work. > HBase SDF @SplitRestriction does not take the range input into account to > restrict splits > - > > Key: BEAM-9204 > URL: https://issues.apache.org/jira/browse/BEAM-9204 > Project: Beam > Issue Type: Bug > Components: io-java-hbase >Reporter: Ismaël Mejía >Assignee: Ismaël Mejía >Priority: Minor > > This is an issue because it is common for split to be called multiple times, > and this will produce repeated work. -- This message was sent by Atlassian Jira (v8.3.4#803005)
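[Editor's note] The repeated work in BEAM-9204 comes from @SplitRestriction emitting splits for the whole table regardless of the restriction range it was handed. The intended behavior is to clip each candidate split to the incoming range, so re-splitting a sub-range never re-emits work outside it. A language-neutral sketch in Python, with half-open key ranges as plain tuples; this is illustrative only, not the HBaseIO Java code touched by PR #10700:

```python
def intersect_range(range_a, range_b):
    """Return the overlap of two half-open key ranges, or None if disjoint."""
    start = max(range_a[0], range_b[0])
    end = min(range_a[1], range_b[1])
    return (start, end) if start < end else None


def split_restriction(restriction, region_ranges):
    """Yield one split per region, clipped to the incoming restriction.

    The fix is conceptually this clipping step: without it, every call
    re-emits the full set of region ranges, so calling split on a
    sub-range produces work that was already claimed elsewhere.
    """
    for region in region_ranges:
        clipped = intersect_range(restriction, region)
        if clipped is not None:
            yield clipped
```

Splitting the full range still yields every region, but splitting a narrower restriction yields only the overlapping pieces.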
[jira] [Commented] (BEAM-4735) Make HBaseIO.read() based on SDF
[ https://issues.apache.org/jira/browse/BEAM-4735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17024990#comment-17024990 ] Ismaël Mejía commented on BEAM-4735: Oh interesting finding! Just filed BEAM-9204 to tackle it. > Make HBaseIO.read() based on SDF > > > Key: BEAM-4735 > URL: https://issues.apache.org/jira/browse/BEAM-4735 > Project: Beam > Issue Type: Improvement > Components: io-java-hbase >Reporter: Ismaël Mejía >Priority: Minor > > BEAM-4020 introduces HBaseIO reads based on SDF. So far the read() method > still uses the Source based API for two reasons: > 1. Most distributed runners don't support Bounded SDF today. > 2. SDF does not support Dynamic Work Rebalancing but the Source API of HBase > already supports it, so changing it means losing some functionality. > Once there are improvements in both (1) and (2) we should consider moving the > main read() function to use the SDF API and remove the Source based > implementation. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9204) HBase SDF @SplitRestriction does not take the range input into account to restrict splits
[ https://issues.apache.org/jira/browse/BEAM-9204?focusedWorklogId=378121&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378121 ] ASF GitHub Bot logged work on BEAM-9204: Author: ASF GitHub Bot Created on: 28/Jan/20 09:45 Start Date: 28/Jan/20 09:45 Worklog Time Spent: 10m Work Description: iemejia commented on pull request #10700: [BEAM-9204] Fix HBase SplitRestriction to be based on provided Range URL: https://github.com/apache/beam/pull/10700 R: @lukecwik This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378121) Remaining Estimate: 0h Time Spent: 10m > HBase SDF @SplitRestriction does not take the range input into account to > restrict splits > - > > Key: BEAM-9204 > URL: https://issues.apache.org/jira/browse/BEAM-9204 > Project: Beam > Issue Type: Bug > Components: io-java-hbase >Reporter: Ismaël Mejía >Assignee: Ismaël Mejía >Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > This is an issue because it is common for split to be called multiple times, > and this will produce repeated work. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9162) Upgrade Jackson to version 2.10.2
[ https://issues.apache.org/jira/browse/BEAM-9162?focusedWorklogId=378095&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378095 ] ASF GitHub Bot logged work on BEAM-9162: Author: ASF GitHub Bot Created on: 28/Jan/20 08:23 Start Date: 28/Jan/20 08:23 Worklog Time Spent: 10m Work Description: iemejia commented on issue #10643: [BEAM-9162] Upgrade Jackson to version 2.10.2 URL: https://github.com/apache/beam/pull/10643#issuecomment-579132447 Run JavaPortabilityApi PreCommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378095) Time Spent: 1h 20m (was: 1h 10m) > Upgrade Jackson to version 2.10.2 > - > > Key: BEAM-9162 > URL: https://issues.apache.org/jira/browse/BEAM-9162 > Project: Beam > Issue Type: Improvement > Components: build-system, sdk-java-core >Reporter: Ismaël Mejía >Assignee: Ismaël Mejía >Priority: Minor > Time Spent: 1h 20m > Remaining Estimate: 0h > > Jackson has a new way to deal with [deserialization security > issues|https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.10] in > 2.10.x so worth the upgrade. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9162) Upgrade Jackson to version 2.10.2
[ https://issues.apache.org/jira/browse/BEAM-9162?focusedWorklogId=378096&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378096 ] ASF GitHub Bot logged work on BEAM-9162: Author: ASF GitHub Bot Created on: 28/Jan/20 08:23 Start Date: 28/Jan/20 08:23 Worklog Time Spent: 10m Work Description: iemejia commented on issue #10643: [BEAM-9162] Upgrade Jackson to version 2.10.2 URL: https://github.com/apache/beam/pull/10643#issuecomment-579132447 Run JavaPortabilityApi PreCommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378096) Time Spent: 1.5h (was: 1h 20m) > Upgrade Jackson to version 2.10.2 > - > > Key: BEAM-9162 > URL: https://issues.apache.org/jira/browse/BEAM-9162 > Project: Beam > Issue Type: Improvement > Components: build-system, sdk-java-core >Reporter: Ismaël Mejía >Assignee: Ismaël Mejía >Priority: Minor > Time Spent: 1.5h > Remaining Estimate: 0h > > Jackson has a new way to deal with [deserialization security > issues|https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.10] in > 2.10.x so worth the upgrade. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9188) Improving speed of splitting for Custom Sources
[ https://issues.apache.org/jira/browse/BEAM-9188?focusedWorklogId=378468&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378468 ] ASF GitHub Bot logged work on BEAM-9188: Author: ASF GitHub Bot Created on: 28/Jan/20 21:06 Start Date: 28/Jan/20 21:06 Worklog Time Spent: 10m Work Description: boyuanzz commented on issue #10701: [BEAM-9188] CassandraIO split performance improvement - cache size of the table URL: https://github.com/apache/beam/pull/10701#issuecomment-579455635 Retest it please This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378468) Time Spent: 40m (was: 0.5h) > Improving speed of splitting for Custom Sources > --- > > Key: BEAM-9188 > URL: https://issues.apache.org/jira/browse/BEAM-9188 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow >Reporter: Radosław Stankiewicz >Assignee: Radosław Stankiewicz >Priority: Minor > Time Spent: 40m > Remaining Estimate: 0h > > At this moment a Custom Source is being split and serialized in sequence. If > there are many splits, it takes time to process all splits. > > Example: it takes 2s to calculate size and serialize CassandraSource due to > connection setup and teardown. With 100+ splits, it's a lot of time spent in > 1 worker. -- This message was sent by Atlassian Jira (v8.3.4#803005)
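[Editor's note] The PR referenced above (#10701) takes the caching route: the expensive table-size lookup is performed once and reused across all splits, instead of paying the ~2s connection setup and teardown per split. A rough sketch of that idea, with a fake size lookup standing in for the real Cassandra call (function names are illustrative, not CassandraIO's):

```python
import functools


@functools.lru_cache(maxsize=None)
def estimated_table_size(table_name):
    """Pretend this opens a connection and queries size estimates; in the
    reported case each call cost ~2s of connect/teardown.  With the cache,
    the lookup runs once per table and later calls are free."""
    estimated_table_size.calls += 1  # track misses for demonstration
    return 1024 * 1024  # hypothetical size in bytes


estimated_table_size.calls = 0


def split_source(table_name, desired_bundle_size):
    """Derive N splits; the size lookup is shared across all of them."""
    total = estimated_table_size(table_name)
    n = max(1, total // desired_bundle_size)
    return [(table_name, i) for i in range(n)]
```

Splitting the same table repeatedly now triggers exactly one size query, which is the essence of the speedup the PR is after.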
[jira] [Work logged] (BEAM-8756) Beam Dependency Update Request: com.google.cloud:google-cloud-core
[ https://issues.apache.org/jira/browse/BEAM-8756?focusedWorklogId=378473=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378473 ] ASF GitHub Bot logged work on BEAM-8756: Author: ASF GitHub Bot Created on: 28/Jan/20 21:13 Start Date: 28/Jan/20 21:13 Worklog Time Spent: 10m Work Description: lukecwik commented on pull request #10674: [BEAM-8756] Google cloud client libraries to use 2019 versions URL: https://github.com/apache/beam/pull/10674 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378473) Time Spent: 3h 10m (was: 3h) > Beam Dependency Update Request: com.google.cloud:google-cloud-core > -- > > Key: BEAM-8756 > URL: https://issues.apache.org/jira/browse/BEAM-8756 > Project: Beam > Issue Type: Sub-task > Components: dependencies >Reporter: Beam JIRA Bot >Assignee: Tomo Suzuki >Priority: Major > Time Spent: 3h 10m > Remaining Estimate: 0h > > - 2019-11-19 21:05:22.063293 > - > Please consider upgrading the dependency > com.google.cloud:google-cloud-core. > The current version is 1.61.0. The latest version is 1.91.3 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2019-12-02 12:10:54.007619 > - > Please consider upgrading the dependency > com.google.cloud:google-cloud-core. > The current version is 1.61.0. The latest version is 1.91.3 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2019-12-09 12:10:01.481594 > - > Please consider upgrading the dependency > com.google.cloud:google-cloud-core. > The current version is 1.61.0. 
The latest version is 1.91.3 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2019-12-23 12:10:02.830621 > - > Please consider upgrading the dependency > com.google.cloud:google-cloud-core. > The current version is 1.61.0. The latest version is 1.92.0 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2019-12-30 14:05:33.624755 > - > Please consider upgrading the dependency > com.google.cloud:google-cloud-core. > The current version is 1.61.0. The latest version is 1.92.0 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2020-01-06 12:09:08.158708 > - > Please consider upgrading the dependency > com.google.cloud:google-cloud-core. > The current version is 1.61.0. The latest version is 1.92.1 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2020-01-13 12:08:46.788844 > - > Please consider upgrading the dependency > com.google.cloud:google-cloud-core. > The current version is 1.61.0. The latest version is 1.92.2 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2020-01-20 12:08:23.305323 > - > Please consider upgrading the dependency > com.google.cloud:google-cloud-core. > The current version is 1.61.0. The latest version is 1.92.2 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2020-01-27 12:09:28.216096 > - > Please consider upgrading the
[jira] [Work logged] (BEAM-9125) Update BigQuery Storage API documentation
[ https://issues.apache.org/jira/browse/BEAM-9125?focusedWorklogId=378477&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378477 ] ASF GitHub Bot logged work on BEAM-9125: Author: ASF GitHub Bot Created on: 28/Jan/20 21:21 Start Date: 28/Jan/20 21:21 Worklog Time Spent: 10m Work Description: aaltay commented on issue #10606: [BEAM-9125] Update BQ Storage API documentation URL: https://github.com/apache/beam/pull/10606#issuecomment-579461992 R: @rosetn @soyrice This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378477) Time Spent: 20m (was: 10m) > Update BigQuery Storage API documentation > - > > Key: BEAM-9125 > URL: https://issues.apache.org/jira/browse/BEAM-9125 > Project: Beam > Issue Type: Bug > Components: website >Reporter: Kenneth Jung >Assignee: Kenneth Jung >Priority: Minor > Time Spent: 20m > Remaining Estimate: 0h > > Some of the [documented limitations|#storage-api] for using the BigQuery > Storage API in Beam are no longer accurate. In particular: > * The BigQuery Storage API no longer needs to be enabled manually > * Dynamic work rebalancing has been supported in Beam since 2.15.0 > * The sample code should use .withSelectedFields(...) rather than > .withReadOptions(...). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8590) Python typehints: native types: consider bare container types as containing Any
[ https://issues.apache.org/jira/browse/BEAM-8590?focusedWorklogId=378482&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378482 ] ASF GitHub Bot logged work on BEAM-8590: Author: ASF GitHub Bot Created on: 28/Jan/20 21:28 Start Date: 28/Jan/20 21:28 Worklog Time Spent: 10m Work Description: udim commented on issue #10042: [BEAM-8590] Support unsubscripted native types URL: https://github.com/apache/beam/pull/10042#issuecomment-579464714 R: @robertwb - LMK if you have bandwidth for this This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378482) Time Spent: 50m (was: 40m) > Python typehints: native types: consider bare container types as containing > Any > --- > > Key: BEAM-8590 > URL: https://issues.apache.org/jira/browse/BEAM-8590 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Udi Meiri >Assignee: Udi Meiri >Priority: Major > Time Spent: 50m > Remaining Estimate: 0h > > This is for convert_to_beam_type: > For example, process(element: List) is the same as process(element: > List[Any]). -- This message was sent by Atlassian Jira (v8.3.4#803005)
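[Editor's note] The convention proposed in BEAM-8590 can be sketched directly: an unsubscripted container hint is normalized to its Any-parameterized form before conversion, so `List` behaves like `List[Any]`. A minimal illustration with a hypothetical helper, not Beam's actual convert_to_beam_type:

```python
import typing


def normalize_bare_container(hint):
    """Treat an unsubscripted container hint as containing Any, e.g.
    List -> List[Any], Dict -> Dict[Any, Any].  Subscripted hints pass
    through unchanged."""
    if hint is typing.List:
        return typing.List[typing.Any]
    if hint is typing.Dict:
        return typing.Dict[typing.Any, typing.Any]
    if hint is typing.Tuple:
        return typing.Tuple[typing.Any, ...]
    return hint
```

With this normalization, `process(element: List)` and `process(element: List[Any])` produce the same converted type hint, which is exactly the equivalence the issue asks for.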
[jira] [Closed] (BEAM-8627) Migrate bigtable-client-core to 1.12.1
[ https://issues.apache.org/jira/browse/BEAM-8627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomo Suzuki closed BEAM-8627. - > Migrate bigtable-client-core to 1.12.1 > -- > > Key: BEAM-8627 > URL: https://issues.apache.org/jira/browse/BEAM-8627 > Project: Beam > Issue Type: Improvement > Components: io-java-gcp, runner-dataflow >Reporter: Tomo Suzuki >Assignee: Tomo Suzuki >Priority: Major > Fix For: Not applicable > > Attachments: rsP0SotuxeR.png > > > Beam currently uses bigtable-client-core 1.8.0. The latest is > 1.12.1 > [https://search.maven.org/artifact/com.google.cloud.bigtable/bigtable-client-core/1.12.1/jar] > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-8691) Beam Dependency Update Request: com.google.cloud.bigtable:bigtable-client-core
[ https://issues.apache.org/jira/browse/BEAM-8691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17025456#comment-17025456 ] Tomo Suzuki commented on BEAM-8691: --- Trying 1.13.0. Failed {noformat} > Task :checkJavaLinkage SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". SLF4J: Defaulting to no-operation (NOP) logger implementation SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. Exception in thread "main" org.eclipse.aether.collection.DependencyCollectionException: Failed to collect dependencies for org.apache.beam:beam-sdks-java-io-google-cloud-platform:jar:2.20.0-SNAPSHOT (compile) at org.eclipse.aether.internal.impl.collect.DefaultDependencyCollector.collectDependencies(DefaultDependencyCollector.java:295) at org.eclipse.aether.internal.impl.DefaultRepositorySystem.collectDependencies(DefaultRepositorySystem.java:284) at com.google.cloud.tools.opensource.dependencies.DependencyGraphBuilder.resolveCompileTimeDependencies(DependencyGraphBuilder.java:188) at com.google.cloud.tools.opensource.dependencies.DependencyGraphBuilder.getStaticLinkageCheckDependencyGraph(DependencyGraphBuilder.java:229) at com.google.cloud.tools.opensource.classpath.ClassPathBuilder.resolve(ClassPathBuilder.java:71) at com.google.cloud.tools.opensource.classpath.LinkageCheckerMain.main(LinkageCheckerMain.java:69) Caused by: org.eclipse.aether.collection.UnsolvableVersionConflictException: Could not resolve version conflict among [org.apache.beam:beam-sdks-java-io-google-cloud-platform:jar:2.20.0-SNAPSHOT -> com.google.api:gax-grpc:jar:1.52.0 -> io.grpc:grpc-netty-shaded:jar:1.25.0 -> io.grpc:grpc-core:jar:[1.25.0,1.25.0], org.apache.beam:beam-sdks-java-io-google-cloud-platform:jar:2.20.0-SNAPSHOT -> com.google.api:gax-grpc:jar:1.52.0 -> io.grpc:grpc-alts:jar:1.25.0 -> io.grpc:grpc-core:jar:[1.25.0,1.25.0], org.apache.beam:beam-sdks-java-io-google-cloud-platform:jar:2.20.0-SNAPSHOT -> 
com.google.cloud:google-cloud-bigquerystorage:jar:0.120.1-beta -> io.grpc:grpc-core:jar:1.25.0, org.apache.beam:beam-sdks-java-io-google-cloud-platform:jar:2.20.0-SNAPSHOT -> com.google.cloud.bigtable:bigtable-client-core:jar:1.13.0 -> com.google.cloud:google-cloud-bigtable:jar:1.9.1 -> io.grpc:grpc-core:jar:1.26.0, org.apache.beam:beam-sdks-java-io-google-cloud-platform:jar:2.20.0-SNAPSHOT -> com.google.cloud.bigtable:bigtable-client-core:jar:1.13.0 -> io.grpc:grpc-core:jar:1.26.0, org.apache.beam:beam-sdks-java-io-google-cloud-platform:jar:2.20.0-SNAPSHOT -> com.google.cloud.bigtable:bigtable-client-core:jar:1.13.0 -> io.grpc:grpc-grpclb:jar:1.26.0 -> io.grpc:grpc-core:jar:[1.26.0,1.26.0], org.apache.beam:beam-sdks-java-io-google-cloud-platform:jar:2.20.0-SNAPSHOT -> com.google.cloud.bigtable:bigtable-client-core:jar:1.13.0 -> io.opencensus:opencensus-contrib-grpc-util:jar:0.24.0 -> io.grpc:grpc-core:jar:1.22.1, org.apache.beam:beam-sdks-java-io-google-cloud-platform:jar:2.20.0-SNAPSHOT -> com.google.cloud:google-cloud-core-grpc:jar:1.92.2 -> io.grpc:grpc-core:jar:1.26.0, org.apache.beam:beam-sdks-java-io-google-cloud-platform:jar:2.20.0-SNAPSHOT -> io.grpc:grpc-all:jar:1.25.0 -> io.grpc:grpc-core:jar:[1.25.0,1.25.0], org.apache.beam:beam-sdks-java-io-google-cloud-platform:jar:2.20.0-SNAPSHOT -> io.grpc:grpc-all:jar:1.25.0 -> io.grpc:grpc-okhttp:jar:1.25.0 -> io.grpc:grpc-core:jar:[1.25.0,1.25.0], org.apache.beam:beam-sdks-java-io-google-cloud-platform:jar:2.20.0-SNAPSHOT -> io.grpc:grpc-all:jar:1.25.0 -> io.grpc:grpc-testing:jar:1.25.0 -> io.grpc:grpc-core:jar:[1.25.0,1.25.0], org.apache.beam:beam-sdks-java-io-google-cloud-platform:jar:2.20.0-SNAPSHOT -> io.grpc:grpc-core:jar:1.25.0, org.apache.beam:beam-sdks-java-io-google-cloud-platform:jar:2.20.0-SNAPSHOT -> io.grpc:grpc-netty:jar:1.25.0 -> io.grpc:grpc-core:jar:[1.25.0,1.25.0]] at org.eclipse.aether.util.graph.transformer.NearestVersionSelector.newFailure(NearestVersionSelector.java:159) at 
org.eclipse.aether.util.graph.transformer.NearestVersionSelector.backtrack(NearestVersionSelector.java:120) at org.eclipse.aether.util.graph.transformer.NearestVersionSelector.selectVersion(NearestVersionSelector.java:93) at org.eclipse.aether.util.graph.transformer.ConflictResolver.transformGraph(ConflictResolver.java:180) at org.eclipse.aether.util.graph.transformer.ChainedDependencyGraphTransformer.transformGraph(ChainedDependencyGraphTransformer.java:80) at org.eclipse.aether.internal.impl.collect.DefaultDependencyCollector.collectDependencies(DefaultDependencyCollector.java:273) ... 5 more {noformat} > Beam Dependency Update Request: com.google.cloud.bigtable:bigtable-client-core > -- > > Key: BEAM-8691 > URL:
[jira] [Work logged] (BEAM-9178) Support ZetaSQL TIMESTAMP functions in BeamSQL
[ https://issues.apache.org/jira/browse/BEAM-9178?focusedWorklogId=378496&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378496 ] ASF GitHub Bot logged work on BEAM-9178: Author: ASF GitHub Bot Created on: 28/Jan/20 21:52 Start Date: 28/Jan/20 21:52 Worklog Time Spent: 10m Work Description: robinyqiu commented on issue #10634: [BEAM-9178] Support all ZetaSQL TIMESTAMP functions URL: https://github.com/apache/beam/pull/10634#issuecomment-579474737 Factored out the controversial parts that you mentioned. Now we don't support extracting or truncating "week with weekdays" part (I think this is fine). PR is ready for review now. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378496) Time Spent: 20m (was: 10m) > Support ZetaSQL TIMESTAMP functions in BeamSQL > -- > > Key: BEAM-9178 > URL: https://issues.apache.org/jira/browse/BEAM-9178 > Project: Beam > Issue Type: New Feature > Components: dsl-sql-zetasql >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Time Spent: 20m > Remaining Estimate: 0h > > Support *all* TIMESTAMP functions defined in ZetaSQL (BigQuery Standard SQL). > See the full list of functions below: > [https://cloud.google.com/bigquery/docs/reference/standard-sql/timestamp_functions] -- This message was sent by Atlassian Jira (v8.3.4#803005)
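[Editor's note] For readers following the PR, the semantics being implemented can be modeled in a few lines. ZetaSQL's TIMESTAMP_TRUNC cuts a timestamp down to the start of the given date part, with WEEK starting on Sunday. A rough civil-time model in Python that ignores the time-zone argument the real function takes and covers only a few parts:

```python
from datetime import datetime, timedelta


def timestamp_trunc(ts, part):
    """Rough model of ZetaSQL TIMESTAMP_TRUNC for a few date parts."""
    if part == "HOUR":
        return ts.replace(minute=0, second=0, microsecond=0)
    if part == "DAY":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if part == "WEEK":
        # ZetaSQL weeks start on Sunday; Python's weekday() is Monday=0,
        # so step back (weekday + 1) mod 7 days from midnight.
        day = ts.replace(hour=0, minute=0, second=0, microsecond=0)
        return day - timedelta(days=(ts.weekday() + 1) % 7)
    raise ValueError("unsupported part: %s" % part)
```

The "week with weekday" variants (e.g. WEEK(TUESDAY)) that the PR deliberately leaves out would generalize the offset in the WEEK branch.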
[jira] [Work logged] (BEAM-9188) Improving speed of splitting for Custom Sources
[ https://issues.apache.org/jira/browse/BEAM-9188?focusedWorklogId=378497&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378497 ] ASF GitHub Bot logged work on BEAM-9188: Author: ASF GitHub Bot Created on: 28/Jan/20 21:53 Start Date: 28/Jan/20 21:53 Worklog Time Spent: 10m Work Description: boyuanzz commented on issue #10685: [BEAM-9188] Dataflow's WorkerCustomSources improvement - parallelize creation of Derived Sources (splitting) URL: https://github.com/apache/beam/pull/10685#issuecomment-579475332 Would you mind closing this PR given that we decided to go with https://github.com/apache/beam/pull/10701? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378497) Time Spent: 50m (was: 40m) > Improving speed of splitting for Custom Sources > --- > > Key: BEAM-9188 > URL: https://issues.apache.org/jira/browse/BEAM-9188 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow >Reporter: Radosław Stankiewicz >Assignee: Radosław Stankiewicz >Priority: Minor > Time Spent: 50m > Remaining Estimate: 0h > > At this moment a Custom Source is being split and serialized in sequence. If > there are many splits, it takes time to process all splits. > > Example: it takes 2s to calculate size and serialize CassandraSource due to > connection setup and teardown. With 100+ splits, it's a lot of time spent in > 1 worker. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8298) Implement state caching for side inputs
[ https://issues.apache.org/jira/browse/BEAM-8298?focusedWorklogId=378512=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378512 ] ASF GitHub Bot logged work on BEAM-8298: Author: ASF GitHub Bot Created on: 28/Jan/20 22:08 Start Date: 28/Jan/20 22:08 Worklog Time Spent: 10m Work Description: lukecwik commented on pull request #10705: [BEAM-8298] Implement side input caching. URL: https://github.com/apache/beam/pull/10705#discussion_r372080828 ## File path: sdks/python/apache_beam/runners/worker/sdk_worker.py ## @@ -825,43 +833,92 @@ def extend(self, def clear(self, state_key, is_cached=False): # type: (beam_fn_api_pb2.StateKey, bool) -> _Future -if self._should_be_cached(is_cached): +cache_token = self._get_cache_token(state_key, is_cached) +if cache_token: cache_key = self._convert_to_cache_key(state_key) - self._state_cache.clear(cache_key, self._context.cache_token) + self._state_cache.clear(cache_key, cache_token) return self._underlying.clear(state_key) def done(self): # type: () -> None self._underlying.done() - def _materialize_iter(self, -state_key, # type: beam_fn_api_pb2.StateKey -coder # type: coder_impl.CoderImpl - ): + def _lazy_iterator( + self, + state_key, # type: beam_fn_api_pb2.StateKey + coder, # type: coder_impl.CoderImpl + continuation_token=None # type: Optional[bytes] +): # type: (...) -> Iterator[Any] """Materializes the state lazily, one element at a time. :return A generator which returns the next element if advanced. 
""" -continuation_token = None while True: - data, continuation_token = \ - self._underlying.get_raw(state_key, continuation_token) + data, continuation_token = ( + self._underlying.get_raw(state_key, continuation_token)) input_stream = coder_impl.create_InputStream(data) while input_stream.size() > 0: yield coder.decode_from_stream(input_stream, True) if not continuation_token: break - def _should_be_cached(self, request_is_cached): -return (self._state_cache.is_cache_enabled() and -request_is_cached and -self._context.cache_token) + def _get_cache_token(self, state_key, request_is_cached): +if not self._state_cache.is_cache_enabled(): + return None +elif state_key.HasField('bag_user_state'): + if request_is_cached and self._context.user_state_cache_token: +return self._context.user_state_cache_token + else: +return self._context.bundle_cache_token +elif state_key.WhichOneof('type').endswith('_side_input'): + side_input = getattr(state_key, state_key.WhichOneof('type')) + return self._context.side_input_cache_tokens.get( +(side_input.transform_id, side_input.side_input_id), +self._context.bundle_cache_token) + + def _partially_cached_iterable( + self, + state_key, # type: beam_fn_api_pb2.StateKey + coder # type: coder_impl.CoderImpl +): +# type: (...) -> Iterable[Any] +"""Materialized the first page of data, concatinated with a lazy iterable +of the rest, if any. 
+""" +data, continuation_token = ( +self._underlying.get_raw(state_key, None)) +head = [] +input_stream = coder_impl.create_InputStream(data) +while input_stream.size() > 0: + head.append(coder.decode_from_stream(input_stream, True)) + +if continuation_token is None: + return head +else: + def iter_func(): +for item in head: + yield item +for item in self._lazy_iterator(state_key, coder, continuation_token): + yield item + return _IterableFromIterator(iter_func) @staticmethod def _convert_to_cache_key(state_key): return state_key.SerializeToString() +class _IterableFromIterator(object): + """Wraps an iterator as an iterable.""" + def __init__(self, iter_func): +self._iter_func = iter_func + def __iter__(self): +return iter_func() Review comment: ```suggestion return iter_func ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378512) Time Spent: 1h 50m (was: 1h 40m) > Implement state caching for side inputs > --- > > Key: BEAM-8298 > URL: https://issues.apache.org/jira/browse/BEAM-8298 > Project: Beam >
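The review above flags a bug in the `_IterableFromIterator` wrapper: `__iter__` returns `iter_func()` instead of going through `self`. A minimal standalone sketch (not the Beam class itself; names are illustrative) shows why the factory must be stored and re-invoked via `self`, and why wrapping a generator function this way makes the result re-iterable:

```python
class IterableFromIterator:
    """Wraps a zero-argument generator function as a re-iterable object.

    Each call to __iter__ invokes the factory again, so the result can be
    iterated more than once -- unlike a bare generator, which is exhausted
    after one pass.
    """

    def __init__(self, iter_func):
        self._iter_func = iter_func

    def __iter__(self):
        # Must call the stored factory via self; a bare `iter_func()` here
        # would raise NameError, which is the bug flagged in the review.
        return self._iter_func()


head = [1, 2]  # stands in for the eagerly fetched first page

def iter_func():
    yield from head
    yield from (10, 20)  # stands in for lazily fetched continuation pages

it = IterableFromIterator(iter_func)
print(list(it))  # [1, 2, 10, 20]
print(list(it))  # re-iterable: [1, 2, 10, 20] again
```

Returning `iter_func` (the reviewer's one-line suggestion) makes `iter(...)` call the factory, which works for a generator function but only because calling it yields a fresh iterator each time.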
[jira] [Work logged] (BEAM-8298) Implement state caching for side inputs
[ https://issues.apache.org/jira/browse/BEAM-8298?focusedWorklogId=378511=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378511 ] ASF GitHub Bot logged work on BEAM-8298: Author: ASF GitHub Bot Created on: 28/Jan/20 22:08 Start Date: 28/Jan/20 22:08 Worklog Time Spent: 10m Work Description: lukecwik commented on pull request #10705: [BEAM-8298] Implement side input caching. URL: https://github.com/apache/beam/pull/10705#discussion_r372074511 ## File path: sdks/python/apache_beam/runners/worker/sdk_worker.py ## @@ -825,43 +833,92 @@ def extend(self, def clear(self, state_key, is_cached=False): # type: (beam_fn_api_pb2.StateKey, bool) -> _Future -if self._should_be_cached(is_cached): +cache_token = self._get_cache_token(state_key, is_cached) +if cache_token: cache_key = self._convert_to_cache_key(state_key) - self._state_cache.clear(cache_key, self._context.cache_token) + self._state_cache.clear(cache_key, cache_token) return self._underlying.clear(state_key) def done(self): # type: () -> None self._underlying.done() - def _materialize_iter(self, -state_key, # type: beam_fn_api_pb2.StateKey -coder # type: coder_impl.CoderImpl - ): + def _lazy_iterator( + self, + state_key, # type: beam_fn_api_pb2.StateKey + coder, # type: coder_impl.CoderImpl + continuation_token=None # type: Optional[bytes] +): # type: (...) -> Iterator[Any] """Materializes the state lazily, one element at a time. :return A generator which returns the next element if advanced. 
""" -continuation_token = None while True: - data, continuation_token = \ - self._underlying.get_raw(state_key, continuation_token) + data, continuation_token = ( + self._underlying.get_raw(state_key, continuation_token)) input_stream = coder_impl.create_InputStream(data) while input_stream.size() > 0: yield coder.decode_from_stream(input_stream, True) if not continuation_token: break - def _should_be_cached(self, request_is_cached): -return (self._state_cache.is_cache_enabled() and -request_is_cached and -self._context.cache_token) + def _get_cache_token(self, state_key, request_is_cached): +if not self._state_cache.is_cache_enabled(): + return None +elif state_key.HasField('bag_user_state'): + if request_is_cached and self._context.user_state_cache_token: +return self._context.user_state_cache_token + else: +return self._context.bundle_cache_token +elif state_key.WhichOneof('type').endswith('_side_input'): + side_input = getattr(state_key, state_key.WhichOneof('type')) + return self._context.side_input_cache_tokens.get( +(side_input.transform_id, side_input.side_input_id), +self._context.bundle_cache_token) + + def _partially_cached_iterable( + self, + state_key, # type: beam_fn_api_pb2.StateKey + coder # type: coder_impl.CoderImpl +): +# type: (...) -> Iterable[Any] +"""Materialized the first page of data, concatinated with a lazy iterable Review comment: ```suggestion """Materialized the first page of data, concatenated with a lazy iterable ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378511) Time Spent: 1h 50m (was: 1h 40m) > Implement state caching for side inputs > --- > > Key: BEAM-8298 > URL: https://issues.apache.org/jira/browse/BEAM-8298 > Project: Beam > Issue Type: Improvement > Components: runner-core, sdk-py-harness >Reporter: Maximilian Michels >Assignee: Jing Chen >Priority: Major > Time Spent: 1h 50m > Remaining Estimate: 0h > > Caching is currently only implemented for user state. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9188) Improving speed of splitting for Custom Sources
[ https://issues.apache.org/jira/browse/BEAM-9188?focusedWorklogId=378518=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378518 ] ASF GitHub Bot logged work on BEAM-9188: Author: ASF GitHub Bot Created on: 28/Jan/20 22:13 Start Date: 28/Jan/20 22:13 Worklog Time Spent: 10m Work Description: boyuanzz commented on issue #10701: [BEAM-9188] CassandraIO split performance improvement - cache size of the table URL: https://github.com/apache/beam/pull/10701#issuecomment-579484771 Retest it please This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378518) Time Spent: 1h 10m (was: 1h) > Improving speed of splitting for Custom Sources > --- > > Key: BEAM-9188 > URL: https://issues.apache.org/jira/browse/BEAM-9188 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow >Reporter: Radosław Stankiewicz >Assignee: Radosław Stankiewicz >Priority: Minor > Time Spent: 1h 10m > Remaining Estimate: 0h > > At this moment a Custom Source is being split and serialized in sequence. If > there are many splits, it takes time to process all splits. > > Example: it takes 2s to calculate size and serialize CassandraSource due to > connection setup and teardown. With 100+ splits, it's a lot of time spent in > 1 worker. -- This message was sent by Atlassian Jira (v8.3.4#803005)
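The issue notes that each split costs around 2s because of connection setup and teardown, multiplied across 100+ splits on one worker. One hedged sketch of the obvious remedy, estimating split sizes concurrently instead of in sequence (all names here are illustrative, not Beam or Cassandra APIs):

```python
import concurrent.futures
import time

def estimate_size(split_id):
    # Stand-in for the expensive per-split work: connection setup,
    # size query, teardown.
    time.sleep(0.01)
    return split_id * 100

split_ids = range(8)

# Sequential baseline: total wall time is the sum of per-split costs.
sequential = [estimate_size(s) for s in split_ids]

# Concurrent variant: splits are estimated in parallel threads, so wall
# time approaches the cost of a single split. pool.map preserves order.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    concurrent_sizes = list(pool.map(estimate_size, split_ids))

assert concurrent_sizes == sequential
```

The PR under discussion takes a complementary route: caching the table size so repeated estimation calls skip the connection entirely.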
[jira] [Work logged] (BEAM-8933) BigQuery IO should support read/write in Arrow format
[ https://issues.apache.org/jira/browse/BEAM-8933?focusedWorklogId=378547=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378547 ] ASF GitHub Bot logged work on BEAM-8933: Author: ASF GitHub Bot Created on: 28/Jan/20 22:57 Start Date: 28/Jan/20 22:57 Worklog Time Spent: 10m Work Description: TheNeuralBit commented on pull request #10384: [BEAM-8933] Utilities for converting Arrow schemas and reading Arrow batches as Rows URL: https://github.com/apache/beam/pull/10384#discussion_r372105866 ## File path: sdks/java/extensions/arrow/src/main/java/org/apache/beam/sdk/extensions/arrow/ArrowConversion.java ## @@ -0,0 +1,448 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.beam.sdk.extensions.arrow; + +import static org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Preconditions.checkArgument; + +import java.util.Iterator; +import java.util.List; +import java.util.function.Function; +import java.util.stream.Collectors; +import javax.annotation.Nullable; +import org.apache.arrow.vector.FieldVector; +import org.apache.arrow.vector.VectorSchemaRoot; +import org.apache.arrow.vector.types.TimeUnit; +import org.apache.arrow.vector.types.pojo.ArrowType; +import org.apache.arrow.vector.util.Text; +import org.apache.beam.sdk.annotations.Experimental; +import org.apache.beam.sdk.schemas.CachingFactory; +import org.apache.beam.sdk.schemas.Factory; +import org.apache.beam.sdk.schemas.FieldValueGetter; +import org.apache.beam.sdk.schemas.Schema; +import org.apache.beam.sdk.schemas.Schema.Field; +import org.apache.beam.sdk.schemas.Schema.FieldType; +import org.apache.beam.sdk.schemas.logicaltypes.FixedBytes; +import org.apache.beam.sdk.values.Row; +import org.joda.time.DateTime; +import org.joda.time.DateTimeZone; + +/** + * Utilities to create {@link Iterable}s of Beam {@link Row} instances backed by Arrow record + * batches. + */ +@Experimental(Experimental.Kind.SCHEMAS) +public class ArrowConversion { + /** Converts Arrow schema to Beam row schema. */ + public static Schema toBeamSchema(org.apache.arrow.vector.types.pojo.Schema schema) { +return toBeamSchema(schema.getFields()); + } + + public static Schema toBeamSchema(List fields) { +Schema.Builder builder = Schema.builder(); +for (org.apache.arrow.vector.types.pojo.Field field : fields) { + Field beamField = toBeamField(field); + builder.addField(beamField); +} +return builder.build(); + } + + /** Get Beam Field from Arrow Field. 
*/ + private static Field toBeamField(org.apache.arrow.vector.types.pojo.Field field) { +FieldType beamFieldType = toFieldType(field.getFieldType(), field.getChildren()); +return Field.of(field.getName(), beamFieldType); + } + + /** Converts Arrow FieldType to Beam FieldType. */ + private static FieldType toFieldType( + org.apache.arrow.vector.types.pojo.FieldType arrowFieldType, + List childrenFields) { +FieldType fieldType = +arrowFieldType +.getType() +.accept( +new ArrowType.ArrowTypeVisitor() { + @Override + public FieldType visit(ArrowType.Null type) { +throw new IllegalArgumentException( +"Type \'" + type.toString() + "\' not supported."); + } + + @Override + public FieldType visit(ArrowType.Struct type) { +return FieldType.row(toBeamSchema(childrenFields)); + } + + @Override + public FieldType visit(ArrowType.List type) { +checkArgument( +childrenFields.size() == 1, +"Encountered " ++ childrenFields.size() ++ " child fields for list type, expected 1"); +return FieldType.array(toBeamField(childrenFields.get(0)).getType()); + } + + @Override + public FieldType visit(ArrowType.FixedSizeList
[jira] [Work logged] (BEAM-8933) BigQuery IO should support read/write in Arrow format
[ https://issues.apache.org/jira/browse/BEAM-8933?focusedWorklogId=378548=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378548 ] ASF GitHub Bot logged work on BEAM-8933: Author: ASF GitHub Bot Created on: 28/Jan/20 22:58 Start Date: 28/Jan/20 22:58 Worklog Time Spent: 10m Work Description: emkornfield commented on pull request #10384: [BEAM-8933] Utilities for converting Arrow schemas and reading Arrow batches as Rows URL: https://github.com/apache/beam/pull/10384#discussion_r372106201 ## File path: sdks/java/extensions/arrow/src/test/java/org/apache/beam/sdk/extensions/arrow/ArrowConversionTest.java ## @@ -0,0 +1,179 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.beam.sdk.extensions.arrow; + +import static java.util.Arrays.asList; +import static org.hamcrest.MatcherAssert.assertThat; +import static org.hamcrest.Matchers.equalTo; + +import java.util.ArrayList; +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.memory.RootAllocator; +import org.apache.arrow.vector.BitVector; +import org.apache.arrow.vector.FieldVector; +import org.apache.arrow.vector.FixedSizeBinaryVector; +import org.apache.arrow.vector.Float8Vector; +import org.apache.arrow.vector.IntVector; +import org.apache.arrow.vector.TimeStampMicroTZVector; +import org.apache.arrow.vector.TimeStampMilliTZVector; +import org.apache.arrow.vector.VarCharVector; +import org.apache.arrow.vector.VectorSchemaRoot; +import org.apache.arrow.vector.complex.ListVector; +import org.apache.arrow.vector.types.FloatingPointPrecision; +import org.apache.arrow.vector.types.TimeUnit; +import org.apache.arrow.vector.types.pojo.ArrowType; +import org.apache.arrow.vector.util.Text; +import org.apache.beam.sdk.schemas.Schema; +import org.apache.beam.sdk.schemas.Schema.Field; +import org.apache.beam.sdk.schemas.Schema.FieldType; +import org.apache.beam.sdk.values.Row; +import org.apache.beam.vendor.guava.v26_0_jre.com.google.common.collect.ImmutableList; +import org.hamcrest.collection.IsIterableContainingInOrder; +import org.joda.time.DateTime; +import org.joda.time.DateTimeZone; +import org.junit.After; +import org.junit.Before; +import org.junit.Test; +import org.junit.runner.RunWith; +import org.junit.runners.JUnit4; + +@RunWith(JUnit4.class) +public class ArrowConversionTest { + + private BufferAllocator allocator; + + @Before + public void init() { +allocator = new RootAllocator(Long.MAX_VALUE); + } + + @After + public void teardown() { +allocator.close(); + } + + @Test + public void toBeamSchema_convertsSimpleArrowSchema() { +Schema expected = +Schema.of(Field.of("int8", FieldType.BYTE), Field.of("int16", FieldType.INT16)); + 
+org.apache.arrow.vector.types.pojo.Schema arrowSchema = +new org.apache.arrow.vector.types.pojo.Schema( +ImmutableList.of( +field("int8", new ArrowType.Int(8, true)), +field("int16", new ArrowType.Int(16, true)))); + +assertThat(ArrowConversion.toBeamSchema(arrowSchema), equalTo(expected)); + } + + @Test + public void rowIterator() { +org.apache.arrow.vector.types.pojo.Schema schema = +new org.apache.arrow.vector.types.pojo.Schema( +asList( +field("int32", new ArrowType.Int(32, true)), +field("float64", new ArrowType.FloatingPoint(FloatingPointPrecision.DOUBLE)), +field("string", new ArrowType.Utf8()), +field("timestampMicroUTC", new ArrowType.Timestamp(TimeUnit.MICROSECOND, "UTC")), +field("timestampMilliUTC", new ArrowType.Timestamp(TimeUnit.MILLISECOND, "UTC")), +field( +"int32_list", +new ArrowType.List(), +field("int32s", new ArrowType.Int(32, true))), +field("boolean", new ArrowType.Bool()), +field("fixed_size_binary", new ArrowType.FixedSizeBinary(3)))); + +Schema beamSchema = ArrowConversion.toBeamSchema(schema); + +VectorSchemaRoot expectedSchemaRoot =
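The `ArrowConversion` code above walks an Arrow schema with a type visitor, mapping each field to a Beam `FieldType` and recursing into struct and list children. A toy Python sketch of that recursive mapping, with Arrow fields modeled as `(name, type, children)` tuples (this mirrors the visitor's shape only; it is not the Beam or Arrow API):

```python
def to_beam_type(arrow_type, children):
    """Map a simplified Arrow type to a Beam-style field type.

    Mirrors the ArrowTypeVisitor in the diff: primitives map directly,
    Struct becomes a row of converted child fields, List becomes an array
    of its single child's type, anything else is rejected.
    """
    primitives = {'int8': 'BYTE', 'int16': 'INT16', 'int32': 'INT32',
                  'utf8': 'STRING', 'bool': 'BOOLEAN'}
    if arrow_type in primitives:
        return primitives[arrow_type]
    if arrow_type == 'struct':
        return {'ROW': [(n, to_beam_type(t, c)) for n, t, c in children]}
    if arrow_type == 'list':
        # Same invariant as the checkArgument in the Java code.
        assert len(children) == 1, 'list type expects exactly 1 child field'
        (_, t, c), = children
        return {'ARRAY': to_beam_type(t, c)}
    raise ValueError(f"Type '{arrow_type}' not supported.")


print(to_beam_type('list', [('int32s', 'int32', [])]))  # {'ARRAY': 'INT32'}
```

The real converter handles many more cases (timestamps, fixed-size binary, nested lists of structs), but each one lands in exactly one visitor branch like the ones sketched here.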
[jira] [Resolved] (BEAM-8627) Migrate bigtable-client-core to 1.12.1
[ https://issues.apache.org/jira/browse/BEAM-8627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomo Suzuki resolved BEAM-8627. --- Fix Version/s: Not applicable Resolution: Duplicate > Migrate bigtable-client-core to 1.12.1 > -- > > Key: BEAM-8627 > URL: https://issues.apache.org/jira/browse/BEAM-8627 > Project: Beam > Issue Type: Improvement > Components: io-java-gcp, runner-dataflow >Reporter: Tomo Suzuki >Assignee: Tomo Suzuki >Priority: Major > Fix For: Not applicable > > Attachments: rsP0SotuxeR.png > > > Beam currently uses bigtable-client-core 1.8.0. The latest is > 1.12.1 > [https://search.maven.org/artifact/com.google.cloud.bigtable/bigtable-client-core/1.12.1/jar] > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9200) Portable job jar postcommits failing
[ https://issues.apache.org/jira/browse/BEAM-9200?focusedWorklogId=378520=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378520 ] ASF GitHub Bot logged work on BEAM-9200: Author: ASF GitHub Bot Created on: 28/Jan/20 22:14 Start Date: 28/Jan/20 22:14 Worklog Time Spent: 10m Work Description: ibzib commented on issue #10703: [BEAM-9200] fix portable jar test version property URL: https://github.com/apache/beam/pull/10703#issuecomment-579485140 Run PortableJar_Flink PostCommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378520) Time Spent: 20m (was: 10m) > Portable job jar postcommits failing > > > Key: BEAM-9200 > URL: https://issues.apache.org/jira/browse/BEAM-9200 > Project: Beam > Issue Type: Improvement > Components: runner-flink, runner-spark >Reporter: Kyle Weaver >Assignee: Kyle Weaver >Priority: Major > Labels: portability-flink, portability-spark > Time Spent: 20m > Remaining Estimate: 0h > > 15:25:58 Execution failed for task > ':runners:spark:job-server:testJavaJarCreatorPy37'. > 15:25:58 > Could not get unknown property 'python_sdk_version' for project > ':runners:spark:job-server' of type org.gradle.api.Project. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9200) Portable job jar postcommits failing
[ https://issues.apache.org/jira/browse/BEAM-9200?focusedWorklogId=378521=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378521 ] ASF GitHub Bot logged work on BEAM-9200: Author: ASF GitHub Bot Created on: 28/Jan/20 22:14 Start Date: 28/Jan/20 22:14 Worklog Time Spent: 10m Work Description: ibzib commented on issue #10703: [BEAM-9200] fix portable jar test version property URL: https://github.com/apache/beam/pull/10703#issuecomment-579485247 Run PortableJar_Spark PostCommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378521) Time Spent: 0.5h (was: 20m) > Portable job jar postcommits failing > > > Key: BEAM-9200 > URL: https://issues.apache.org/jira/browse/BEAM-9200 > Project: Beam > Issue Type: Improvement > Components: runner-flink, runner-spark >Reporter: Kyle Weaver >Assignee: Kyle Weaver >Priority: Major > Labels: portability-flink, portability-spark > Time Spent: 0.5h > Remaining Estimate: 0h > > 15:25:58 Execution failed for task > ':runners:spark:job-server:testJavaJarCreatorPy37'. > 15:25:58 > Could not get unknown property 'python_sdk_version' for project > ':runners:spark:job-server' of type org.gradle.api.Project. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9200) Portable job jar postcommits failing
[ https://issues.apache.org/jira/browse/BEAM-9200?focusedWorklogId=378522=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378522 ] ASF GitHub Bot logged work on BEAM-9200: Author: ASF GitHub Bot Created on: 28/Jan/20 22:16 Start Date: 28/Jan/20 22:16 Worklog Time Spent: 10m Work Description: Hannah-Jiang commented on issue #10703: [BEAM-9200] fix portable jar test version property URL: https://github.com/apache/beam/pull/10703#issuecomment-579486529 LGTM, thanks for fixing it. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378522) Time Spent: 40m (was: 0.5h) > Portable job jar postcommits failing > > > Key: BEAM-9200 > URL: https://issues.apache.org/jira/browse/BEAM-9200 > Project: Beam > Issue Type: Improvement > Components: runner-flink, runner-spark >Reporter: Kyle Weaver >Assignee: Kyle Weaver >Priority: Major > Labels: portability-flink, portability-spark > Time Spent: 40m > Remaining Estimate: 0h > > 15:25:58 Execution failed for task > ':runners:spark:job-server:testJavaJarCreatorPy37'. > 15:25:58 > Could not get unknown property 'python_sdk_version' for project > ':runners:spark:job-server' of type org.gradle.api.Project. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (BEAM-9204) HBase SDF @SplitRestriction does not take the range input into account to restrict splits
[ https://issues.apache.org/jira/browse/BEAM-9204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luke Cwik closed BEAM-9204. --- Fix Version/s: 2.20.0 Resolution: Fixed > HBase SDF @SplitRestriction does not take the range input into account to > restrict splits > - > > Key: BEAM-9204 > URL: https://issues.apache.org/jira/browse/BEAM-9204 > Project: Beam > Issue Type: Bug > Components: io-java-hbase >Reporter: Ismaël Mejía >Assignee: Ismaël Mejía >Priority: Minor > Fix For: 2.20.0 > > Time Spent: 20m > Remaining Estimate: 0h > > This is an issue because the split is called multiple times and in this > case it will produce repeated work. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9188) Improving speed of splitting for Custom Sources
[ https://issues.apache.org/jira/browse/BEAM-9188?focusedWorklogId=378529=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378529 ] ASF GitHub Bot logged work on BEAM-9188: Author: ASF GitHub Bot Created on: 28/Jan/20 22:26 Start Date: 28/Jan/20 22:26 Worklog Time Spent: 10m Work Description: boyuanzz commented on pull request #10701: [BEAM-9188] CassandraIO split performance improvement - cache size of the table URL: https://github.com/apache/beam/pull/10701#discussion_r372093592 ## File path: sdks/java/io/cassandra/src/main/java/org/apache/beam/sdk/io/cassandra/CassandraIO.java ## @@ -375,9 +375,15 @@ private CassandraIO() {} static class CassandraSource extends BoundedSource { final Read spec; final List splitQueries; +final Long estimatedSize; Review comment: Could you please add some comments about this field? Like, why and when the size is cached. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378529) Time Spent: 1h 20m (was: 1h 10m) > Improving speed of splitting for Custom Sources > --- > > Key: BEAM-9188 > URL: https://issues.apache.org/jira/browse/BEAM-9188 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow >Reporter: Radosław Stankiewicz >Assignee: Radosław Stankiewicz >Priority: Minor > Time Spent: 1h 20m > Remaining Estimate: 0h > > At this moment a Custom Source is being split and serialized in sequence. If > there are many splits, it takes time to process all splits. > > Example: it takes 2s to calculate size and serialize CassandraSource due to > connection setup and teardown. With 100+ splits, it's a lot of time spent in > 1 worker. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9188) Improving speed of splitting for Custom Sources
[ https://issues.apache.org/jira/browse/BEAM-9188?focusedWorklogId=378530=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378530 ] ASF GitHub Bot logged work on BEAM-9188: Author: ASF GitHub Bot Created on: 28/Jan/20 22:26 Start Date: 28/Jan/20 22:26 Worklog Time Spent: 10m Work Description: boyuanzz commented on pull request #10701: [BEAM-9188] CassandraIO split performance improvement - cache size of the table URL: https://github.com/apache/beam/pull/10701#discussion_r372092162 ## File path: sdks/java/io/cassandra/src/main/java/org/apache/beam/sdk/io/cassandra/CassandraIO.java ## @@ -508,26 +518,30 @@ private static long getNumSplits( @Override public long getEstimatedSizeBytes(PipelineOptions pipelineOptions) { - try (Cluster cluster = - getCluster( - spec.hosts(), - spec.port(), - spec.username(), - spec.password(), - spec.localDc(), - spec.consistencyLevel())) { -if (isMurmur3Partitioner(cluster)) { - try { -List tokenRanges = -getTokenRanges(cluster, spec.keyspace().get(), spec.table().get()); -return getEstimatedSizeBytesFromTokenRanges(tokenRanges); - } catch (Exception e) { -LOG.warn("Can't estimate the size", e); + if (estimatedSize != null) { +return estimatedSize; + } else { +try (Cluster cluster = +getCluster( +spec.hosts(), +spec.port(), +spec.username(), +spec.password(), +spec.localDc(), +spec.consistencyLevel())) { + if (isMurmur3Partitioner(cluster)) { +try { + List tokenRanges = + getTokenRanges(cluster, spec.keyspace().get(), spec.table().get()); + return getEstimatedSizeBytesFromTokenRanges(tokenRanges); Review comment: Can we also cache the size here in case `getEstimateSizeBytes` is called more than once? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378530) Time Spent: 1.5h (was: 1h 20m) > Improving speed of splitting for Custom Sources > --- > > Key: BEAM-9188 > URL: https://issues.apache.org/jira/browse/BEAM-9188 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow >Reporter: Radosław Stankiewicz >Assignee: Radosław Stankiewicz >Priority: Minor > Time Spent: 1.5h > Remaining Estimate: 0h > > At this moment a Custom Source is being split and serialized in sequence. If > there are many splits, it takes time to process all splits. > > Example: it takes 2s to calculate size and serialize CassandraSource due to > connection setup and teardown. With 100+ splits, it's a lot of time spent in > 1 worker. -- This message was sent by Atlassian Jira (v8.3.4#803005)
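The review above asks that `getEstimatedSizeBytes` also cache its result when it has to compute the size itself. A language-neutral sketch of that memoization pattern (Python here for brevity; the names are illustrative, not the CassandraIO API):

```python
class CachedSizeSource:
    """Bounded-source sketch that memoizes its estimated size.

    `compute_size` stands in for the expensive token-range query that
    needs a live cluster connection. After the first call the cached
    value is reused, so repeated size estimations -- and per-split
    sources constructed with the size already known -- skip the
    connection setup/teardown entirely.
    """

    def __init__(self, compute_size, estimated_size=None):
        self._compute_size = compute_size
        self._estimated_size = estimated_size  # may be pre-seeded at split time

    def get_estimated_size_bytes(self):
        if self._estimated_size is None:
            self._estimated_size = self._compute_size()
        return self._estimated_size


calls = []

def expensive_query():
    calls.append(1)  # track how often the "cluster" is actually contacted
    return 4096

source = CachedSizeSource(expensive_query)
print(source.get_estimated_size_bytes())  # 4096
print(source.get_estimated_size_bytes())  # 4096, served from the cache
print(len(calls))  # 1
```

Pre-seeding `estimated_size` when constructing split sources is what removes the per-split connection cost the issue complains about.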
[jira] [Work logged] (BEAM-8933) BigQuery IO should support read/write in Arrow format
[ https://issues.apache.org/jira/browse/BEAM-8933?focusedWorklogId=378537=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378537 ] ASF GitHub Bot logged work on BEAM-8933: Author: ASF GitHub Bot Created on: 28/Jan/20 22:33 Start Date: 28/Jan/20 22:33 Worklog Time Spent: 10m Work Description: TheNeuralBit commented on issue #10384: [BEAM-8933] Utilities for converting Arrow schemas and reading Arrow batches as Rows URL: https://github.com/apache/beam/pull/10384#issuecomment-579495519 Run Java PreCommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378537) Time Spent: 8h 50m (was: 8h 40m) > BigQuery IO should support read/write in Arrow format > - > > Key: BEAM-8933 > URL: https://issues.apache.org/jira/browse/BEAM-8933 > Project: Beam > Issue Type: Improvement > Components: io-java-gcp >Reporter: Kirill Kozlov >Assignee: Kirill Kozlov >Priority: Major > Time Spent: 8h 50m > Remaining Estimate: 0h > > As of right now BigQuery uses Avro format for reading and writing. > We should add a config to BigQueryIO to specify which format to use: Arrow or > Avro (with Avro as default). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8933) BigQuery IO should support read/write in Arrow format
[ https://issues.apache.org/jira/browse/BEAM-8933?focusedWorklogId=378536=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378536 ] ASF GitHub Bot logged work on BEAM-8933: Author: ASF GitHub Bot Created on: 28/Jan/20 22:32 Start Date: 28/Jan/20 22:32 Worklog Time Spent: 10m Work Description: TheNeuralBit commented on issue #10384: [BEAM-8933] Utilities for converting Arrow schemas and reading Arrow batches as Rows URL: https://github.com/apache/beam/pull/10384#issuecomment-579495439 @alexvanboxel do you have time to take a look at this? cc: @emkornfield in case you want to take a look This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378536) Time Spent: 8h 40m (was: 8.5h) > BigQuery IO should support read/write in Arrow format > - > > Key: BEAM-8933 > URL: https://issues.apache.org/jira/browse/BEAM-8933 > Project: Beam > Issue Type: Improvement > Components: io-java-gcp >Reporter: Kirill Kozlov >Assignee: Kirill Kozlov >Priority: Major > Time Spent: 8h 40m > Remaining Estimate: 0h > > As of right now BigQuery uses Avro format for reading and writing. > We should add a config to BigQueryIO to specify which format to use: Arrow or > Avro (with Avro as default). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9037) Instant and duration as logical type
[ https://issues.apache.org/jira/browse/BEAM-9037?focusedWorklogId=378538=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378538 ] ASF GitHub Bot logged work on BEAM-9037: Author: ASF GitHub Bot Created on: 28/Jan/20 22:37 Start Date: 28/Jan/20 22:37 Worklog Time Spent: 10m Work Description: TheNeuralBit commented on pull request #10486: [BEAM-9037] Instant and duration as logical type URL: https://github.com/apache/beam/pull/10486#discussion_r366529282 ## File path: sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/logicaltypes/NanosDuration.java ## @@ -0,0 +1,48 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.beam.sdk.schemas.logicaltypes; + +import java.time.Duration; +import org.apache.beam.sdk.values.Row; + +/** A duration represented in nanoseconds. */ +public class NanosDuration extends NanosType { + public static final String IDENTIFIER = "beam:logical_type:nanos_duration"; Review comment: I think we should version these, can you add a v1 to this and nanos_instant? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378538) Time Spent: 1h 40m (was: 1.5h) > Instant and duration as logical type > - > > Key: BEAM-9037 > URL: https://issues.apache.org/jira/browse/BEAM-9037 > Project: Beam > Issue Type: Task > Components: sdk-java-core >Reporter: Alex Van Boxel >Assignee: Alex Van Boxel >Priority: Major > Fix For: 2.19.0 > > Time Spent: 1h 40m > Remaining Estimate: 0h > > The proto schema includes Timestamp and Duration with nano precision. The > logical types should be promoted to the core logical types, so they can be > handled in various IOs as standard mandatory conversions. > This means that the logical type should not use the proto-specific Timestamp and > Duration but the Java 8 Instant and Duration. > See discussion in the design document: > [https://docs.google.com/document/d/1uu9pJktzT_O3DxGd1-Q2op4nRk4HekIZbzi-0oTAips/edit#heading=h.9uhml95iygqr] -- This message was sent by Atlassian Jira (v8.3.4#803005)
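The review comment on PR #10486 asks for versioned logical type identifiers ("can you add a v1 to this and nanos_instant"). A sketch of what versioned constants could look like, assuming a ":v1" suffix; the exact scheme is the reviewer's proposal, not confirmed merged Beam code:

```java
// Hedged sketch: versioned identifiers for the nanos logical types.
// The ":v1" suffix format is an assumption based on the review comment,
// not the actual constants merged into Beam.
public class LogicalTypeIds {
    public static final String NANOS_DURATION = "beam:logical_type:nanos_duration:v1";
    public static final String NANOS_INSTANT = "beam:logical_type:nanos_instant:v1";
}
```

Versioning the identifier lets a future incompatible encoding ship as ":v2" without breaking readers that only understand ":v1".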
[jira] [Work logged] (BEAM-8933) BigQuery IO should support read/write in Arrow format
[ https://issues.apache.org/jira/browse/BEAM-8933?focusedWorklogId=378540=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378540 ] ASF GitHub Bot logged work on BEAM-8933: Author: ASF GitHub Bot Created on: 28/Jan/20 22:51 Start Date: 28/Jan/20 22:51 Worklog Time Spent: 10m Work Description: emkornfield commented on pull request #10384: [BEAM-8933] Utilities for converting Arrow schemas and reading Arrow batches as Rows URL: https://github.com/apache/beam/pull/10384#discussion_r372103595 ## File path: sdks/java/extensions/arrow/src/main/java/org/apache/beam/sdk/extensions/arrow/ArrowConversion.java ## @@ -0,0 +1,448 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.beam.sdk.extensions.arrow; + +import static org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Preconditions.checkArgument; + +import java.util.Iterator; +import java.util.List; +import java.util.function.Function; +import java.util.stream.Collectors; +import javax.annotation.Nullable; +import org.apache.arrow.vector.FieldVector; +import org.apache.arrow.vector.VectorSchemaRoot; +import org.apache.arrow.vector.types.TimeUnit; +import org.apache.arrow.vector.types.pojo.ArrowType; +import org.apache.arrow.vector.util.Text; +import org.apache.beam.sdk.annotations.Experimental; +import org.apache.beam.sdk.schemas.CachingFactory; +import org.apache.beam.sdk.schemas.Factory; +import org.apache.beam.sdk.schemas.FieldValueGetter; +import org.apache.beam.sdk.schemas.Schema; +import org.apache.beam.sdk.schemas.Schema.Field; +import org.apache.beam.sdk.schemas.Schema.FieldType; +import org.apache.beam.sdk.schemas.logicaltypes.FixedBytes; +import org.apache.beam.sdk.values.Row; +import org.joda.time.DateTime; +import org.joda.time.DateTimeZone; + +/** + * Utilities to create {@link Iterable}s of Beam {@link Row} instances backed by Arrow record + * batches. + */ +@Experimental(Experimental.Kind.SCHEMAS) +public class ArrowConversion { + /** Converts Arrow schema to Beam row schema. */ + public static Schema toBeamSchema(org.apache.arrow.vector.types.pojo.Schema schema) { +return toBeamSchema(schema.getFields()); + } + + public static Schema toBeamSchema(List fields) { +Schema.Builder builder = Schema.builder(); +for (org.apache.arrow.vector.types.pojo.Field field : fields) { + Field beamField = toBeamField(field); + builder.addField(beamField); +} +return builder.build(); + } + + /** Get Beam Field from Arrow Field. 
*/ + private static Field toBeamField(org.apache.arrow.vector.types.pojo.Field field) { +FieldType beamFieldType = toFieldType(field.getFieldType(), field.getChildren()); +return Field.of(field.getName(), beamFieldType); + } + + /** Converts Arrow FieldType to Beam FieldType. */ + private static FieldType toFieldType( + org.apache.arrow.vector.types.pojo.FieldType arrowFieldType, + List childrenFields) { +FieldType fieldType = +arrowFieldType +.getType() +.accept( +new ArrowType.ArrowTypeVisitor() { + @Override + public FieldType visit(ArrowType.Null type) { +throw new IllegalArgumentException( +"Type \'" + type.toString() + "\' not supported."); + } + + @Override + public FieldType visit(ArrowType.Struct type) { +return FieldType.row(toBeamSchema(childrenFields)); + } + + @Override + public FieldType visit(ArrowType.List type) { +checkArgument( +childrenFields.size() == 1, +"Encountered " ++ childrenFields.size() ++ " child fields for list type, expected 1"); +return FieldType.array(toBeamField(childrenFields.get(0)).getType()); + } + + @Override + public FieldType visit(ArrowType.FixedSizeList
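The quoted visitor maps each Arrow type to a Beam FieldType: Null is rejected, Struct recurses into its child fields, and List must carry exactly one child field, which becomes the array's element type. A self-contained sketch of those dispatch rules, using plain strings in place of Beam's FieldType and Arrow's type classes (a hypothetical simplification, not either library's API):

```java
import java.util.List;

// Toy re-statement of the type-mapping rules in the visitor above,
// with strings standing in for real Arrow/Beam type objects.
public class ArrowTypeMappingSketch {
    static String toBeamType(String arrowType, List<String> children) {
        switch (arrowType) {
            case "Null":
                // Mirrors the visitor: Null has no Beam counterpart.
                throw new IllegalArgumentException("Type 'Null' not supported.");
            case "Struct":
                // Struct recurses into its children to build a row schema.
                return "ROW<" + String.join(", ", children) + ">";
            case "List":
                // A list type must have exactly one child: the element type.
                if (children.size() != 1) {
                    throw new IllegalArgumentException(
                        "Encountered " + children.size()
                            + " child fields for list type, expected 1");
                }
                return "ARRAY<" + children.get(0) + ">";
            default:
                // Primitive types map one-to-one in this sketch.
                return arrowType;
        }
    }
}
```

For example, `toBeamType("List", List.of("INT64"))` yields `"ARRAY<INT64>"` in this toy encoding.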
[jira] [Work logged] (BEAM-8933) BigQuery IO should support read/write in Arrow format
[ https://issues.apache.org/jira/browse/BEAM-8933?focusedWorklogId=378541=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378541 ] ASF GitHub Bot logged work on BEAM-8933: Author: ASF GitHub Bot Created on: 28/Jan/20 22:52 Start Date: 28/Jan/20 22:52 Worklog Time Spent: 10m Work Description: emkornfield commented on pull request #10384: [BEAM-8933] Utilities for converting Arrow schemas and reading Arrow batches as Rows URL: https://github.com/apache/beam/pull/10384#discussion_r372104055 ## File path: sdks/java/extensions/arrow/src/main/java/org/apache/beam/sdk/extensions/arrow/ArrowConversion.java
[jira] [Work logged] (BEAM-8933) BigQuery IO should support read/write in Arrow format
[ https://issues.apache.org/jira/browse/BEAM-8933?focusedWorklogId=378551=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378551 ] ASF GitHub Bot logged work on BEAM-8933: Author: ASF GitHub Bot Created on: 28/Jan/20 23:01 Start Date: 28/Jan/20 23:01 Worklog Time Spent: 10m Work Description: emkornfield commented on pull request #10384: [BEAM-8933] Utilities for converting Arrow schemas and reading Arrow batches as Rows URL: https://github.com/apache/beam/pull/10384#discussion_r372107178 ## File path: sdks/java/extensions/arrow/src/main/java/org/apache/beam/sdk/extensions/arrow/ArrowConversion.java
[jira] [Work logged] (BEAM-8933) BigQuery IO should support read/write in Arrow format
[ https://issues.apache.org/jira/browse/BEAM-8933?focusedWorklogId=378555=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378555 ] ASF GitHub Bot logged work on BEAM-8933: Author: ASF GitHub Bot Created on: 28/Jan/20 23:08 Start Date: 28/Jan/20 23:08 Worklog Time Spent: 10m Work Description: emkornfield commented on pull request #10384: [BEAM-8933] Utilities for converting Arrow schemas and reading Arrow batches as Rows URL: https://github.com/apache/beam/pull/10384#discussion_r372109648 ## File path: sdks/java/extensions/arrow/src/main/java/org/apache/beam/sdk/extensions/arrow/ArrowConversion.java
[jira] [Work logged] (BEAM-8933) BigQuery IO should support read/write in Arrow format
[ https://issues.apache.org/jira/browse/BEAM-8933?focusedWorklogId=378556=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378556 ] ASF GitHub Bot logged work on BEAM-8933: Author: ASF GitHub Bot Created on: 28/Jan/20 23:09 Start Date: 28/Jan/20 23:09 Worklog Time Spent: 10m Work Description: emkornfield commented on issue #10384: [BEAM-8933] Utilities for converting Arrow schemas and reading Arrow batches as Rows URL: https://github.com/apache/beam/pull/10384#issuecomment-579515450 @TheNeuralBit @iemejia Regarding Arrow dependencies there was a discussion on potentially moving to a [different allocator recently](https://mail-archives.apache.org/mod_mbox/arrow-dev/202001.mbox/%3CCAK7Z5T_KEnfFmVvg_EUzTH5OFuAu3OFK5NMhVeQ0UTHE%2BQ5smw%40mail.gmail.com%3E) on the Arrow Dev mailing list. If you have thoughts please chime in. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378556) Time Spent: 10.5h (was: 10h 20m) > BigQuery IO should support read/write in Arrow format > - > > Key: BEAM-8933 > URL: https://issues.apache.org/jira/browse/BEAM-8933 > Project: Beam > Issue Type: Improvement > Components: io-java-gcp >Reporter: Kirill Kozlov >Assignee: Kirill Kozlov >Priority: Major > Time Spent: 10.5h > Remaining Estimate: 0h > > As of right now BigQuery uses Avro format for reading and writing. > We should add a config to BigQueryIO to specify which format to use: Arrow or > Avro (with Avro as default). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-7246) Create a Spanner IO for Python
[ https://issues.apache.org/jira/browse/BEAM-7246?focusedWorklogId=378494=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378494 ] ASF GitHub Bot logged work on BEAM-7246: Author: ASF GitHub Bot Created on: 28/Jan/20 21:49 Start Date: 28/Jan/20 21:49 Worklog Time Spent: 10m Work Description: chamikaramj commented on issue #10706: [BEAM-7246] Fix Spanner auth endpoints URL: https://github.com/apache/beam/pull/10706#issuecomment-579473549 cc: @mszb This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378494) Time Spent: 13h (was: 12h 50m) > Create a Spanner IO for Python > -- > > Key: BEAM-7246 > URL: https://issues.apache.org/jira/browse/BEAM-7246 > Project: Beam > Issue Type: Bug > Components: io-py-gcp >Reporter: Reuven Lax >Assignee: Shehzaad Nakhoda >Priority: Major > Time Spent: 13h > Remaining Estimate: 0h > > Add I/O support for Google Cloud Spanner for the Python SDK (Batch Only). > Testing in this work item will be in the form of DirectRunner tests and > manual testing. > Integration and performance tests are a separate work item (not included > here). > See https://beam.apache.org/documentation/io/built-in/. The goal is to add > Google Cloud Spanner to the Database column for the Python/Batch row. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9188) Improving speed of splitting for Custom Sources
[ https://issues.apache.org/jira/browse/BEAM-9188?focusedWorklogId=378504=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378504 ] ASF GitHub Bot logged work on BEAM-9188: Author: ASF GitHub Bot Created on: 28/Jan/20 22:00 Start Date: 28/Jan/20 22:00 Worklog Time Spent: 10m Work Description: stankiewicz commented on pull request #10685: [BEAM-9188] Dataflow's WorkerCustomSources improvement - parallelize creation of Derived Sources (splitting) URL: https://github.com/apache/beam/pull/10685 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378504) Time Spent: 1h (was: 50m) > Improving speed of splitting for Custom Sources > --- > > Key: BEAM-9188 > URL: https://issues.apache.org/jira/browse/BEAM-9188 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow >Reporter: Radosław Stankiewicz >Assignee: Radosław Stankiewicz >Priority: Minor > Time Spent: 1h > Remaining Estimate: 0h > > At this moment a Custom Source is being split and serialized in sequence. If > there are many splits, it takes time to process all of them. > > Example: it takes 2s to calculate size and serialize CassandraSource due to > connection setup and teardown. With 100+ splits, it's a lot of time spent in > 1 worker. -- This message was sent by Atlassian Jira (v8.3.4#803005)
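The sequential bottleneck described in BEAM-9188, where each split is sized and serialized one after another, can be overlapped with a thread pool. A minimal sketch of that idea, assuming the per-split work is independent and thread-safe; this is not the actual WorkerCustomSources change from PR #10685:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Illustrative sketch: run per-split work (e.g. sizing and serializing each
// derived source) on a fixed thread pool instead of a sequential loop, so
// slow per-split setup/teardown overlaps across splits.
public class ParallelSplitSketch {
    static <S, R> List<R> processSplits(List<S> splits, Function<S, R> work, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (S split : splits) {
                futures.add(pool.submit(() -> work.apply(split)));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> f : futures) {
                results.add(f.get()); // blocks; preserves input order
            }
            return results;
        } catch (Exception e) {
            throw new RuntimeException("split processing failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

With the 2-second-per-split Cassandra example above, 100 splits drop from roughly 200s of sequential work to roughly 200s divided by the pool size, minus coordination overhead.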
[jira] [Work logged] (BEAM-7246) Create a Spanner IO for Python
[ https://issues.apache.org/jira/browse/BEAM-7246?focusedWorklogId=378523=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378523 ] ASF GitHub Bot logged work on BEAM-7246: Author: ASF GitHub Bot Created on: 28/Jan/20 22:18 Start Date: 28/Jan/20 22:18 Worklog Time Spent: 10m Work Description: mszb commented on issue #10706: [BEAM-7246] Fix Spanner auth endpoints URL: https://github.com/apache/beam/pull/10706#issuecomment-579487332 LGTM This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378523) Time Spent: 13h 10m (was: 13h) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-7246) Create a Spanner IO for Python
[ https://issues.apache.org/jira/browse/BEAM-7246?focusedWorklogId=378524=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378524 ] ASF GitHub Bot logged work on BEAM-7246: Author: ASF GitHub Bot Created on: 28/Jan/20 22:19 Start Date: 28/Jan/20 22:19 Worklog Time Spent: 10m Work Description: chamikaramj commented on issue #10706: [BEAM-7246] Fix Spanner auth endpoints URL: https://github.com/apache/beam/pull/10706#issuecomment-579487896 Run Python PreCommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378524) Time Spent: 13h 20m (was: 13h 10m) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-7246) Create a Spanner IO for Python
[ https://issues.apache.org/jira/browse/BEAM-7246?focusedWorklogId=378525=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378525 ] ASF GitHub Bot logged work on BEAM-7246: Author: ASF GitHub Bot Created on: 28/Jan/20 22:19 Start Date: 28/Jan/20 22:19 Worklog Time Spent: 10m Work Description: chamikaramj commented on issue #10706: [BEAM-7246] Fix Spanner auth endpoints URL: https://github.com/apache/beam/pull/10706#issuecomment-579487946 Thanks! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378525) Time Spent: 13.5h (was: 13h 20m) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9204) HBase SDF @SplitRestriction does not take the range input into account to restrict splits
[ https://issues.apache.org/jira/browse/BEAM-9204?focusedWorklogId=378527=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378527 ] ASF GitHub Bot logged work on BEAM-9204: Author: ASF GitHub Bot Created on: 28/Jan/20 22:23 Start Date: 28/Jan/20 22:23 Worklog Time Spent: 10m Work Description: lukecwik commented on pull request #10700: [BEAM-9204] Fix HBase SplitRestriction to be based on provided Range URL: https://github.com/apache/beam/pull/10700 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378527) Time Spent: 20m (was: 10m) > HBase SDF @SplitRestriction does not take the range input into account to > restrict splits > - > > Key: BEAM-9204 > URL: https://issues.apache.org/jira/browse/BEAM-9204 > Project: Beam > Issue Type: Bug > Components: io-java-hbase >Reporter: Ismaël Mejía >Assignee: Ismaël Mejía >Priority: Minor > Time Spent: 20m > Remaining Estimate: 0h > > This is an issue because split is called multiple times and in this > case it will produce repeated work. -- This message was sent by Atlassian Jira (v8.3.4#803005)
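The BEAM-9204 fix makes @SplitRestriction honor the range it is given rather than always splitting over the whole table. A minimal sketch of that intersection step, using long endpoints in place of HBase's ByteKeyRange (an assumption for brevity; the real code compares row keys, not longs):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: keep only the portion of each candidate region that
// overlaps the restriction range the runner passed in, so repeated calls to
// split never produce work outside the caller's range.
public class RangeRestrictedSplitSketch {
    /** Intersects each candidate [start, end) region with [rangeStart, rangeEnd). */
    static List<long[]> restrictToRange(List<long[]> regions, long rangeStart, long rangeEnd) {
        List<long[]> out = new ArrayList<>();
        for (long[] region : regions) {
            long start = Math.max(region[0], rangeStart);
            long end = Math.min(region[1], rangeEnd);
            if (start < end) { // drop regions entirely outside the range
                out.add(new long[] {start, end});
            }
        }
        return out;
    }
}
```

For example, restricting regions [0,10), [10,20), [20,30) to the range [5,25) yields [5,10), [10,20), [20,25): every emitted split stays inside the provided range, so re-splitting cannot duplicate work outside it.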
[jira] [Work logged] (BEAM-8933) BigQuery IO should support read/write in Arrow format
[ https://issues.apache.org/jira/browse/BEAM-8933?focusedWorklogId=378543=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378543 ] ASF GitHub Bot logged work on BEAM-8933: Author: ASF GitHub Bot Created on: 28/Jan/20 22:54 Start Date: 28/Jan/20 22:54 Worklog Time Spent: 10m Work Description: emkornfield commented on pull request #10384: [BEAM-8933] Utilities for converting Arrow schemas and reading Arrow batches as Rows URL: https://github.com/apache/beam/pull/10384#discussion_r372104685 ## File path: sdks/java/extensions/arrow/src/main/java/org/apache/beam/sdk/extensions/arrow/ArrowConversion.java
[jira] [Work logged] (BEAM-8933) BigQuery IO should support read/write in Arrow format
[ https://issues.apache.org/jira/browse/BEAM-8933?focusedWorklogId=378546&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378546 ] ASF GitHub Bot logged work on BEAM-8933: Author: ASF GitHub Bot Created on: 28/Jan/20 22:55 Start Date: 28/Jan/20 22:55 Worklog Time Spent: 10m Work Description: emkornfield commented on pull request #10384: [BEAM-8933] Utilities for converting Arrow schemas and reading Arrow batches as Rows URL: https://github.com/apache/beam/pull/10384#discussion_r372105055
## File path: sdks/java/extensions/arrow/src/main/java/org/apache/beam/sdk/extensions/arrow/ArrowConversion.java
[jira] [Work logged] (BEAM-8933) BigQuery IO should support read/write in Arrow format
[ https://issues.apache.org/jira/browse/BEAM-8933?focusedWorklogId=378554&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378554 ] ASF GitHub Bot logged work on BEAM-8933: Author: ASF GitHub Bot Created on: 28/Jan/20 23:05 Start Date: 28/Jan/20 23:05 Worklog Time Spent: 10m Work Description: emkornfield commented on pull request #10384: [BEAM-8933] Utilities for converting Arrow schemas and reading Arrow batches as Rows URL: https://github.com/apache/beam/pull/10384#discussion_r372108633
## File path: sdks/java/extensions/arrow/src/main/java/org/apache/beam/sdk/extensions/arrow/ArrowConversion.java
[jira] [Work logged] (BEAM-7427) JmsCheckpointMark Avro Serialization issue with UnboundedSource
[ https://issues.apache.org/jira/browse/BEAM-7427?focusedWorklogId=378280&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378280 ] ASF GitHub Bot logged work on BEAM-7427: Author: ASF GitHub Bot Created on: 28/Jan/20 14:48 Start Date: 28/Jan/20 14:48 Worklog Time Spent: 10m Work Description: aromanenko-dev commented on issue #10644: [BEAM-7427] Refactore JmsCheckpointMark to be usage via Coder URL: https://github.com/apache/beam/pull/10644#issuecomment-579280837 Run Java PreCommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378280) Time Spent: 8h 50m (was: 8h 40m) > JmsCheckpointMark Avro Serialization issue with UnboundedSource > --- > > Key: BEAM-7427 > URL: https://issues.apache.org/jira/browse/BEAM-7427 > Project: Beam > Issue Type: Bug > Components: io-java-jms >Affects Versions: 2.12.0, 2.13.0, 2.14.0, 2.15.0, 2.16.0, 2.17.0, 2.18.0, > 2.19.0 > Environment: Message Broker : solace > JMS Client (Over AMQP) : "org.apache.qpid:qpid-jms-client:0.42.0 >Reporter: Mourad >Assignee: Jean-Baptiste Onofré >Priority: Major > Fix For: 2.20.0 > > Time Spent: 8h 50m > Remaining Estimate: 0h > > I get the following exception when reading from unbounded JMS Source: > > {code:java} > Caused by: org.apache.avro.SchemaParseException: Illegal character in: this$0 > at org.apache.avro.Schema.validateName(Schema.java:1151) > at org.apache.avro.Schema.access$200(Schema.java:81) > at org.apache.avro.Schema$Field.<init>(Schema.java:403) > at org.apache.avro.Schema$Field.<init>(Schema.java:396) > at org.apache.avro.reflect.ReflectData.createSchema(ReflectData.java:622) > at org.apache.avro.reflect.ReflectData.createFieldSchema(ReflectData.java:740) > at org.apache.avro.reflect.ReflectData.createSchema(ReflectData.java:604) > at org.apache.avro.specific.SpecificData$2.load(SpecificData.java:218) > at org.apache.avro.specific.SpecificData$2.load(SpecificData.java:215) > at > avro.shaded.com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3568) > at > avro.shaded.com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2350) > at > avro.shaded.com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2313) > at > avro.shaded.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2228) > {code} > > The exception is thrown by Avro when introspecting {{JmsCheckpointMark}} to > generate schema. > JmsIO config : > > {code:java} > PCollection<DFAMessage> messages = pipeline.apply("read messages from the > events broker", JmsIO.readMessage() > .withConnectionFactory(jmsConnectionFactory) .withTopic(options.getTopic()) > .withMessageMapper(new DFAMessageMapper()) > .withCoder(AvroCoder.of(DFAMessage.class))); > {code} > > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
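The `Illegal character in: this$0` failure above is worth unpacking: `this$0` is the synthetic field the Java compiler adds to every non-static inner class to reference its enclosing instance, and Avro's reflective schema generation rejects the `$` in that name. A minimal, Avro-free sketch of where the field comes from (class names here are illustrative, not taken from JmsIO):

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

public class SyntheticFieldDemo {
    // Static nested class: carries no hidden reference to the outer instance.
    static class StaticNested {}

    // Non-static inner class: javac adds a synthetic "this$0" field pointing
    // at the enclosing instance. Avro's reflection walks declared fields, and
    // "$" is illegal in Avro names, hence the SchemaParseException.
    class Inner {}

    /** Collects the declared field names of a class, synthetic ones included. */
    public static List<String> declaredFieldNames(Class<?> clazz) {
        List<String> names = new ArrayList<>();
        for (Field f : clazz.getDeclaredFields()) {
            names.add(f.getName());
        }
        return names;
    }
}
```

Making the checkpoint class static (or serializing it through a dedicated Coder, as the linked PR pursues) removes the synthetic field and with it the failure.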
[jira] [Updated] (BEAM-7691) Improve checkpoints documentation
[ https://issues.apache.org/jira/browse/BEAM-7691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismaël Mejía updated BEAM-7691: --- Summary: Improve checkpoints documentation (was: UnboundedSource.CheckpointMark should mention that implementations should be Serializable or have an associated Coder) > Improve checkpoints documentation > - > > Key: BEAM-7691 > URL: https://issues.apache.org/jira/browse/BEAM-7691 > Project: Beam > Issue Type: Improvement > Components: sdk-java-core >Reporter: Ismaël Mejía >Priority: Minor > Labels: documentation, javadoc, newbie, starter > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-7691) Improve checkpoints documentation
[ https://issues.apache.org/jira/browse/BEAM-7691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismaël Mejía updated BEAM-7691: --- Component/s: website > Improve checkpoints documentation > - > > Key: BEAM-7691 > URL: https://issues.apache.org/jira/browse/BEAM-7691 > Project: Beam > Issue Type: Improvement > Components: sdk-java-core, website >Reporter: Ismaël Mejía >Priority: Minor > Labels: documentation, javadoc, newbie, starter > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9175) Introduce an autoformatting tool to Python SDK
[ https://issues.apache.org/jira/browse/BEAM-9175?focusedWorklogId=378323=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378323 ] ASF GitHub Bot logged work on BEAM-9175: Author: ASF GitHub Bot Created on: 28/Jan/20 15:30 Start Date: 28/Jan/20 15:30 Worklog Time Spent: 10m Work Description: kamilwu commented on issue #10684: [BEAM-9175] Introduce an autoformatting tool to Python SDK URL: https://github.com/apache/beam/pull/10684#issuecomment-579301469 @robertwb I managed to add all knobs you asked for. I think the code looks better now (much less weird lines, that's for sure). I also added a pre-commit job that runs yapf with --diff option. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378323) Time Spent: 3h 10m (was: 3h) > Introduce an autoformatting tool to Python SDK > -- > > Key: BEAM-9175 > URL: https://issues.apache.org/jira/browse/BEAM-9175 > Project: Beam > Issue Type: Improvement > Components: sdk-py-core, sdk-py-harness >Reporter: Michał Walenia >Assignee: Kamil Wasilewski >Priority: Major > Time Spent: 3h 10m > Remaining Estimate: 0h > > It seems there are three main options: > * black - very simple, but not configurable at all (except for line length), > would drastically change code style > * yapf - more options to tweak, can omit parts of code > * autopep8 - more similar to spotless - only touches code that breaks > formatting guidelines, can use pycodestyle and flake8 as configuration > The rigidity of Black makes it unusable for Beam. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9162) Upgrade Jackson to version 2.10.2
[ https://issues.apache.org/jira/browse/BEAM-9162?focusedWorklogId=378332=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378332 ] ASF GitHub Bot logged work on BEAM-9162: Author: ASF GitHub Bot Created on: 28/Jan/20 15:47 Start Date: 28/Jan/20 15:47 Worklog Time Spent: 10m Work Description: suztomo commented on issue #10643: [BEAM-9162] Upgrade Jackson to version 2.10.2 URL: https://github.com/apache/beam/pull/10643#issuecomment-579313211 Let me think about that this week. https://issues.apache.org/jira/browse/BEAM-9206 For this PR, I would only check the modules that use jackson: `find . -name 'build.gradle' | xargs grep library.java.jackson_`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378332) Time Spent: 2h 10m (was: 2h) > Upgrade Jackson to version 2.10.2 > - > > Key: BEAM-9162 > URL: https://issues.apache.org/jira/browse/BEAM-9162 > Project: Beam > Issue Type: Improvement > Components: build-system, sdk-java-core >Reporter: Ismaël Mejía >Assignee: Ismaël Mejía >Priority: Minor > Time Spent: 2h 10m > Remaining Estimate: 0h > > Jackson has a new way to deal with [deserialization security > issues|https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.10] in > 2.10.x so worth the upgrade. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9008) Add readAll() method to CassandraIO
[ https://issues.apache.org/jira/browse/BEAM-9008?focusedWorklogId=378338&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378338 ] ASF GitHub Bot logged work on BEAM-9008: Author: ASF GitHub Bot Created on: 28/Jan/20 15:55 Start Date: 28/Jan/20 15:55 Worklog Time Spent: 10m Work Description: iemejia commented on issue #10546: [BEAM-9008] Add CassandraIO readAll method URL: https://github.com/apache/beam/pull/10546#issuecomment-579317145 > @iemejia I think that could work, thanks for your patience as I try to understand what you're thinking. Some questions: No, thank you; you have been the patient one during this discussion. > 1. If our `ReadFn extends DoFn, A>` and the only way we have connection information is from the `Read` passed in to the processElement, that means we need to re-establish a DB connection for each batch of queries we run? As in, the connection would be established in the `processElement` method and could not be in `setup` method? Yes, exactly. This will make the method simpler, and the cost of starting a connection gets amortized by the processElement producing multiple outputs from a single connection. > 2. How would that work for the end user of a `PTransform, PCollection>`? Here is what I did in the test and would want to document how end users could generate 'queries', > https://github.com/apache/beam/pull/10546/files#diff-8ba4ea3b09d563a67e29ff8584269d35R499 > Would we instead want to return a `PCollection>` by using something like `return CassandraIO.read().withRingRange(new RingRange(start, finish))`? If we do that however, we'd need to do the `withHosts` and all the other connection information, no? The other option is establishing one `ReadAll` PTransform that maps over the `Read` input and enriches the db connection information? You have a point here! We need `class ReadAll extends PTransform, PCollection>` and there we read as intended with `ReadFn`.
You would, however, have to modify the `expand` of `Read` to do `input.apply(Create.of(this)).apply(CassandraIO.readAll())`, where `ReadAll` should expand into `input.apply(ParDo.of(splitFn)).apply(Reshuffle).apply(Read)`. Users should deal with building the PCollection of `Reads` before passing that collection to `ReadAll`. > 3. Originally I had wanted to have the ReadFn operate on a _collection_ of 'query' objects to ensure a way to enforce linearizability with our queries (mainly so we don't oversaturate a single node/shard). Currently the groupBy function a user passes in operates on the `RingRange` object, would we keep it that way and just, under the hood, allow for a single `Read` to hold a collection of RingRanges? If I understand this correctly, this is covered by following the Create -> Split -> Reshuffle -> Read pattern mentioned above (in the mentioned IOs). So Split is the one that will generate a collection of `Read`s for each given `RingRange`, then we use Reshuffle to guarantee that reads are redistributed, and finally each read request is read by one worker. Hope this helps; don't hesitate to ask me more questions if anything is still unclear. I will try to answer quickly this time. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378338) Time Spent: 4h (was: 3h 50m) > Add readAll() method to CassandraIO > --- > > Key: BEAM-9008 > URL: https://issues.apache.org/jira/browse/BEAM-9008 > Project: Beam > Issue Type: New Feature > Components: io-java-cassandra >Affects Versions: 2.16.0 >Reporter: vincent marquez >Assignee: vincent marquez >Priority: Minor > Time Spent: 4h > Remaining Estimate: 0h > > When querying a large Cassandra database, it's often *much* more useful to > programmatically generate the queries needed to be run rather than reading > all partitions and attempting some filtering. > As an example: > {code:java} > public class Event { >@PartitionKey(0) public UUID accountId; >@PartitionKey(1) public String yearMonthDay; >@ClusteringKey public UUID eventId; >//other data... > }{code} > If there are ten years' worth of data, you may want to only query one year's > worth. Here each token range would represent one 'token' but all events for > the day. > {code:java} > Set accounts = getRelevantAccounts(); > Set dateRange =
[jira] [Work logged] (BEAM-9008) Add readAll() method to CassandraIO
[ https://issues.apache.org/jira/browse/BEAM-9008?focusedWorklogId=378339&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378339 ] ASF GitHub Bot logged work on BEAM-9008: Author: ASF GitHub Bot Created on: 28/Jan/20 15:58 Start Date: 28/Jan/20 15:58 Worklog Time Spent: 10m Work Description: iemejia commented on issue #10546: [BEAM-9008] Add CassandraIO readAll method URL: https://github.com/apache/beam/pull/10546#issuecomment-579318866 Forgot to mention in the above comment that in the Split function you have to split in every case except when the user provides a specific RingRange to read from. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378339) Time Spent: 4h 10m (was: 4h) > Add readAll() method to CassandraIO > --- > > Key: BEAM-9008 > URL: https://issues.apache.org/jira/browse/BEAM-9008 > Project: Beam > Issue Type: New Feature > Components: io-java-cassandra >Affects Versions: 2.16.0 >Reporter: vincent marquez >Assignee: vincent marquez >Priority: Minor > Time Spent: 4h 10m > Remaining Estimate: 0h > > When querying a large Cassandra database, it's often *much* more useful to > programmatically generate the queries needed to be run rather than reading > all partitions and attempting some filtering. > As an example: > {code:java} > public class Event { >@PartitionKey(0) public UUID accountId; >@PartitionKey(1) public String yearMonthDay; >@ClusteringKey public UUID eventId; >//other data... > }{code} > If there are ten years' worth of data, you may want to only query one year's > worth. Here each token range would represent one 'token' but all events for > the day. 
> {code:java} > Set accounts = getRelevantAccounts(); > Set dateRange = generateDateRange("2018-01-01", "2019-01-01"); > PCollection tokens = generateTokens(accounts, dateRange); > {code} > > I propose an additional _readAll()_ PTransform that can take a PCollection > of token ranges and can return a PCollection of what the query would > return. > *Question: How much code should be in common between both methods?* > Currently the read connector already groups all partitions into a List of > Token Ranges, so it would be simple to refactor the current read() based > method to a 'ParDo' based one and have them both share the same function. > Reasons against sharing code between read and readAll > * Not having the read based method return a BoundedSource connector would > mean losing the ability to know the size of the data returned > * Currently the CassandraReader executes all the grouped TokenRange queries > *asynchronously* which is (maybe?) fine when all that's happening is > splitting up all the partition ranges but terrible for executing potentially > millions of queries. > Reasons _for_ sharing code would be a simplified code base and that both of > the above issues would most likely have a negligible performance impact. > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
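The Create -> Split -> Reshuffle -> Read pattern discussed in the review thread above hinges on a Split step that turns one ring range into many smaller contiguous ones before Reshuffle redistributes them across workers. A minimal, Beam-free sketch of that splitting step over a simple numeric token interval (the `RingRange` below is a plain stand-in for illustration, not the CassandraIO class):

```java
import java.util.ArrayList;
import java.util.List;

public class RingRangeSplit {
    /** Plain stand-in for a token range [start, end); not the CassandraIO type. */
    public static final class RingRange {
        public final long start; // inclusive
        public final long end;   // exclusive
        public RingRange(long start, long end) {
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Splits a range into roughly {@code desiredSplits} contiguous sub-ranges.
     * A real Split step would emit one Read per sub-range, skipping the split
     * when the user supplied an explicit RingRange, as noted in the thread.
     */
    public static List<RingRange> split(RingRange range, int desiredSplits) {
        List<RingRange> out = new ArrayList<>();
        long span = range.end - range.start;
        if (span <= 0 || desiredSplits <= 0) {
            return out;
        }
        long step = Math.max(1, span / desiredSplits);
        for (long s = range.start; s < range.end; s += step) {
            // Clamp the last sub-range so the pieces exactly cover [start, end).
            out.add(new RingRange(s, Math.min(s + step, range.end)));
        }
        return out;
    }
}
```

After a split like this, reshuffling the sub-ranges is what prevents one worker from executing the potentially millions of queries mentioned above.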
[jira] [Work logged] (BEAM-8972) Add a Jenkins job running Combine load test on Java with Flink in Portability mode
[ https://issues.apache.org/jira/browse/BEAM-8972?focusedWorklogId=378151=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378151 ] ASF GitHub Bot logged work on BEAM-8972: Author: ASF GitHub Bot Created on: 28/Jan/20 10:59 Start Date: 28/Jan/20 10:59 Worklog Time Spent: 10m Work Description: mwalenia commented on issue #10386: [BEAM-8972] Add Jenkins job with Combine test for portable Java URL: https://github.com/apache/beam/pull/10386#issuecomment-579191281 Run seed job This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378151) Time Spent: 4.5h (was: 4h 20m) > Add a Jenkins job running Combine load test on Java with Flink in Portability > mode > -- > > Key: BEAM-8972 > URL: https://issues.apache.org/jira/browse/BEAM-8972 > Project: Beam > Issue Type: Improvement > Components: testing >Reporter: Michał Walenia >Assignee: Michał Walenia >Priority: Minor > Time Spent: 4.5h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9063) Migrate docker images to apache namespace.
[ https://issues.apache.org/jira/browse/BEAM-9063?focusedWorklogId=378168=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378168 ] ASF GitHub Bot logged work on BEAM-9063: Author: ASF GitHub Bot Created on: 28/Jan/20 11:44 Start Date: 28/Jan/20 11:44 Worklog Time Spent: 10m Work Description: mxm commented on pull request #10612: [DO NOT MERGE][BEAM-9063] migrate docker images to apache URL: https://github.com/apache/beam/pull/10612#discussion_r371752615 ## File path: release/src/main/scripts/publish_docker_images.sh ## @@ -24,6 +24,9 @@ set -e +DOCKER_IMAGE_DEFAULT_REPO_ROOT=apache +DOCKER_IMAGE_DEFAULT_REPO_PREFIX=beam_ Review comment: Would it be worth sourcing these from a script? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378168) Time Spent: 5h 40m (was: 5.5h) > Migrate docker images to apache namespace. > -- > > Key: BEAM-9063 > URL: https://issues.apache.org/jira/browse/BEAM-9063 > Project: Beam > Issue Type: Task > Components: beam-community >Reporter: Hannah Jiang >Assignee: Hannah Jiang >Priority: Major > Fix For: Not applicable > > Time Spent: 5h 40m > Remaining Estimate: 0h > > https://hub.docker.com/u/apache -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9063) Migrate docker images to apache namespace.
[ https://issues.apache.org/jira/browse/BEAM-9063?focusedWorklogId=378167=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378167 ] ASF GitHub Bot logged work on BEAM-9063: Author: ASF GitHub Bot Created on: 28/Jan/20 11:44 Start Date: 28/Jan/20 11:44 Worklog Time Spent: 10m Work Description: mxm commented on pull request #10612: [DO NOT MERGE][BEAM-9063] migrate docker images to apache URL: https://github.com/apache/beam/pull/10612#discussion_r371752564 ## File path: release/src/main/scripts/build_release_candidate.sh ## @@ -45,6 +45,9 @@ PYTHON_ARTIFACTS_DIR=python BEAM_ROOT_DIR=beam WEBSITE_ROOT_DIR=beam-site +DOCKER_IMAGE_DEFAULT_REPO_ROOT=apache +DOCKER_IMAGE_DEFAULT_REPO_PREFIX=beam_ Review comment: Would it be worth sourcing these from a script? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378167) Time Spent: 5h 40m (was: 5.5h) > Migrate docker images to apache namespace. > -- > > Key: BEAM-9063 > URL: https://issues.apache.org/jira/browse/BEAM-9063 > Project: Beam > Issue Type: Task > Components: beam-community >Reporter: Hannah Jiang >Assignee: Hannah Jiang >Priority: Major > Fix For: Not applicable > > Time Spent: 5h 40m > Remaining Estimate: 0h > > https://hub.docker.com/u/apache -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8972) Add a Jenkins job running Combine load test on Java with Flink in Portability mode
[ https://issues.apache.org/jira/browse/BEAM-8972?focusedWorklogId=378184=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378184 ] ASF GitHub Bot logged work on BEAM-8972: Author: ASF GitHub Bot Created on: 28/Jan/20 12:30 Start Date: 28/Jan/20 12:30 Worklog Time Spent: 10m Work Description: mwalenia commented on issue #10386: [BEAM-8972] Add Jenkins job with Combine test for portable Java URL: https://github.com/apache/beam/pull/10386#issuecomment-579223188 run seed job This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378184) Time Spent: 5h 10m (was: 5h) > Add a Jenkins job running Combine load test on Java with Flink in Portability > mode > -- > > Key: BEAM-8972 > URL: https://issues.apache.org/jira/browse/BEAM-8972 > Project: Beam > Issue Type: Improvement > Components: testing >Reporter: Michał Walenia >Assignee: Michał Walenia >Priority: Minor > Time Spent: 5h 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8972) Add a Jenkins job running Combine load test on Java with Flink in Portability mode
[ https://issues.apache.org/jira/browse/BEAM-8972?focusedWorklogId=378212=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378212 ] ASF GitHub Bot logged work on BEAM-8972: Author: ASF GitHub Bot Created on: 28/Jan/20 13:06 Start Date: 28/Jan/20 13:06 Worklog Time Spent: 10m Work Description: mwalenia commented on issue #10386: [BEAM-8972] Add Jenkins job with Combine test for portable Java URL: https://github.com/apache/beam/pull/10386#issuecomment-579236550 Run Load Tests Java Combine Portable Flink Batch This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378212) Time Spent: 5h 20m (was: 5h 10m) > Add a Jenkins job running Combine load test on Java with Flink in Portability > mode > -- > > Key: BEAM-8972 > URL: https://issues.apache.org/jira/browse/BEAM-8972 > Project: Beam > Issue Type: Improvement > Components: testing >Reporter: Michał Walenia >Assignee: Michał Walenia >Priority: Minor > Time Spent: 5h 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (BEAM-7427) JmsCheckpointMark Avro Serialization issue with UnboundedSource
[ https://issues.apache.org/jira/browse/BEAM-7427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismaël Mejía reassigned BEAM-7427: -- Assignee: Jean-Baptiste Onofré (was: Mourad) > JmsCheckpointMark Avro Serialization issue with UnboundedSource > --- > > Key: BEAM-7427 > URL: https://issues.apache.org/jira/browse/BEAM-7427 > Project: Beam > Issue Type: Bug > Components: io-java-jms >Affects Versions: 2.12.0 > Environment: Message Broker : solace > JMS Client (Over AMQP) : "org.apache.qpid:qpid-jms-client:0.42.0 >Reporter: Mourad >Assignee: Jean-Baptiste Onofré >Priority: Major > Time Spent: 8.5h > Remaining Estimate: 0h > > I get the following exception when reading from unbounded JMS Source: > > {code:java} > Caused by: org.apache.avro.SchemaParseException: Illegal character in: this$0 > at org.apache.avro.Schema.validateName(Schema.java:1151) > at org.apache.avro.Schema.access$200(Schema.java:81) > at org.apache.avro.Schema$Field.(Schema.java:403) > at org.apache.avro.Schema$Field.(Schema.java:396) > at org.apache.avro.reflect.ReflectData.createSchema(ReflectData.java:622) > at org.apache.avro.reflect.ReflectData.createFieldSchema(ReflectData.java:740) > at org.apache.avro.reflect.ReflectData.createSchema(ReflectData.java:604) > at org.apache.avro.specific.SpecificData$2.load(SpecificData.java:218) > at org.apache.avro.specific.SpecificData$2.load(SpecificData.java:215) > at > avro.shaded.com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3568) > at > avro.shaded.com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2350) > at > avro.shaded.com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2313) > at > avro.shaded.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2228) > {code} > > The exception is thrown by Avro when introspecting {{JmsCheckpointMark}} to > generate schema. 
> JmsIO config : > > {code:java} > PCollection<DFAMessage> messages = pipeline.apply("read messages from the > events broker", JmsIO.readMessage() > .withConnectionFactory(jmsConnectionFactory) > .withTopic(options.getTopic()) > .withMessageMapper(new DFAMessageMapper()) > .withCoder(AvroCoder.of(DFAMessage.class))); > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
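The `Illegal character in: this$0` failure above happens when Avro's `ReflectData` meets a compiler-generated field: a non-static inner class carries a hidden `this$0` reference to its enclosing instance, and `$` is not a legal character in an Avro schema name. A minimal stdlib-only sketch (class names made up for illustration, not taken from JmsIO) showing where such a field comes from:

```java
import java.lang.reflect.Field;

public class SyntheticFieldDemo {
    private final String broker = "solace";

    // Hypothetical non-static inner class; any such class in the object
    // graph that Avro introspects exposes the same synthetic field.
    class Checkpoint {
        long offset;
        String source() { return broker; } // forces the outer-instance reference
    }

    public static void main(String[] args) {
        for (Field f : Checkpoint.class.getDeclaredFields()) {
            // The declared fields include the compiler-generated "this$0",
            // whose '$' Avro's Schema.validateName rejects.
            System.out.println(f.getName());
        }
    }
}
```

Running this lists `offset` alongside `this$0`; reflection-based schema generation trips over the latter exactly as in the stack trace above.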
[jira] [Updated] (BEAM-7427) JmsCheckpointMark Avro Serialization issue with UnboundedSource
[ https://issues.apache.org/jira/browse/BEAM-7427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismaël Mejía updated BEAM-7427: --- Fix Version/s: 2.20.0 > JmsCheckpointMark Avro Serialization issue with UnboundedSource > --- > > Key: BEAM-7427 > URL: https://issues.apache.org/jira/browse/BEAM-7427 > Project: Beam > Issue Type: Bug > Components: io-java-jms >Affects Versions: 2.12.0, 2.13.0, 2.14.0, 2.15.0, 2.16.0, 2.17.0, 2.18.0, > 2.19.0 > Environment: Message Broker : solace > JMS Client (Over AMQP) : "org.apache.qpid:qpid-jms-client:0.42.0 >Reporter: Mourad >Assignee: Jean-Baptiste Onofré >Priority: Major > Fix For: 2.20.0 > > Time Spent: 8.5h > Remaining Estimate: 0h > > I get the following exception when reading from unbounded JMS Source: > > {code:java} > Caused by: org.apache.avro.SchemaParseException: Illegal character in: this$0 > at org.apache.avro.Schema.validateName(Schema.java:1151) > at org.apache.avro.Schema.access$200(Schema.java:81) > at org.apache.avro.Schema$Field.(Schema.java:403) > at org.apache.avro.Schema$Field.(Schema.java:396) > at org.apache.avro.reflect.ReflectData.createSchema(ReflectData.java:622) > at org.apache.avro.reflect.ReflectData.createFieldSchema(ReflectData.java:740) > at org.apache.avro.reflect.ReflectData.createSchema(ReflectData.java:604) > at org.apache.avro.specific.SpecificData$2.load(SpecificData.java:218) > at org.apache.avro.specific.SpecificData$2.load(SpecificData.java:215) > at > avro.shaded.com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3568) > at > avro.shaded.com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2350) > at > avro.shaded.com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2313) > at > avro.shaded.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2228) > {code} > > The exception is thrown by Avro when introspecting {{JmsCheckpointMark}} to > generate schema. 
> JmsIO config : > > {code:java} > PCollection<DFAMessage> messages = pipeline.apply("read messages from the > events broker", JmsIO.readMessage() > .withConnectionFactory(jmsConnectionFactory) > .withTopic(options.getTopic()) > .withMessageMapper(new DFAMessageMapper()) > .withCoder(AvroCoder.of(DFAMessage.class))); > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-7427) JmsCheckpointMark Avro Serialization issue with UnboundedSource
[ https://issues.apache.org/jira/browse/BEAM-7427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismaël Mejía updated BEAM-7427: --- Affects Version/s: 2.19.0 2.13.0 2.14.0 2.15.0 2.16.0 2.17.0 2.18.0 > JmsCheckpointMark Avro Serialization issue with UnboundedSource > --- > > Key: BEAM-7427 > URL: https://issues.apache.org/jira/browse/BEAM-7427 > Project: Beam > Issue Type: Bug > Components: io-java-jms >Affects Versions: 2.12.0, 2.13.0, 2.14.0, 2.15.0, 2.16.0, 2.17.0, 2.18.0, > 2.19.0 > Environment: Message Broker : solace > JMS Client (Over AMQP) : "org.apache.qpid:qpid-jms-client:0.42.0 >Reporter: Mourad >Assignee: Jean-Baptiste Onofré >Priority: Major > Time Spent: 8.5h > Remaining Estimate: 0h > > I get the following exception when reading from unbounded JMS Source: > > {code:java} > Caused by: org.apache.avro.SchemaParseException: Illegal character in: this$0 > at org.apache.avro.Schema.validateName(Schema.java:1151) > at org.apache.avro.Schema.access$200(Schema.java:81) > at org.apache.avro.Schema$Field.(Schema.java:403) > at org.apache.avro.Schema$Field.(Schema.java:396) > at org.apache.avro.reflect.ReflectData.createSchema(ReflectData.java:622) > at org.apache.avro.reflect.ReflectData.createFieldSchema(ReflectData.java:740) > at org.apache.avro.reflect.ReflectData.createSchema(ReflectData.java:604) > at org.apache.avro.specific.SpecificData$2.load(SpecificData.java:218) > at org.apache.avro.specific.SpecificData$2.load(SpecificData.java:215) > at > avro.shaded.com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3568) > at > avro.shaded.com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2350) > at > avro.shaded.com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2313) > at > avro.shaded.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2228) > {code} > > The exception is thrown by Avro when introspecting {{JmsCheckpointMark}} to > generate schema. 
> JmsIO config : > > {code:java} > PCollection<DFAMessage> messages = pipeline.apply("read messages from the > events broker", JmsIO.readMessage() > .withConnectionFactory(jmsConnectionFactory) > .withTopic(options.getTopic()) > .withMessageMapper(new DFAMessageMapper()) > .withCoder(AvroCoder.of(DFAMessage.class))); > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9162) Upgrade Jackson to version 2.10.2
[ https://issues.apache.org/jira/browse/BEAM-9162?focusedWorklogId=378269=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378269 ] ASF GitHub Bot logged work on BEAM-9162: Author: ASF GitHub Bot Created on: 28/Jan/20 14:26 Start Date: 28/Jan/20 14:26 Worklog Time Spent: 10m Work Description: suztomo commented on issue #10643: [BEAM-9162] Upgrade Jackson to version 2.10.2 URL: https://github.com/apache/beam/pull/10643#issuecomment-579270595 I want that job too! The challenge is that because of the many existing linkage errors, I'd have to compare - linkage errors in a PR, and - linkage errors in origin/master Like a code coverage report. As I don't know how to do that, I'm still doing it `diff` command in my local environment. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378269) Time Spent: 1h 50m (was: 1h 40m) > Upgrade Jackson to version 2.10.2 > - > > Key: BEAM-9162 > URL: https://issues.apache.org/jira/browse/BEAM-9162 > Project: Beam > Issue Type: Improvement > Components: build-system, sdk-java-core >Reporter: Ismaël Mejía >Assignee: Ismaël Mejía >Priority: Minor > Time Spent: 1h 50m > Remaining Estimate: 0h > > Jackson has a new way to deal with [deserialization security > issues|https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.10] in > 2.10.x so worth the upgrade. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9162) Upgrade Jackson to version 2.10.2
[ https://issues.apache.org/jira/browse/BEAM-9162?focusedWorklogId=378278=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378278 ] ASF GitHub Bot logged work on BEAM-9162: Author: ASF GitHub Bot Created on: 28/Jan/20 14:38 Start Date: 28/Jan/20 14:38 Worklog Time Spent: 10m Work Description: iemejia commented on issue #10643: [BEAM-9162] Upgrade Jackson to version 2.10.2 URL: https://github.com/apache/beam/pull/10643#issuecomment-579276082 Well the manual comparison is not ideal but we can cope with that for the moment, what I don't want is to type the command for the 31 modules of this PR and then have to change it for other dependency upgrade. I just want some sort of `./gradlew :checkJavaLinkage` that works for the whole set of modules of the project. Is this 'feasible' with gradlew + Beam? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378278) Time Spent: 2h (was: 1h 50m) > Upgrade Jackson to version 2.10.2 > - > > Key: BEAM-9162 > URL: https://issues.apache.org/jira/browse/BEAM-9162 > Project: Beam > Issue Type: Improvement > Components: build-system, sdk-java-core >Reporter: Ismaël Mejía >Assignee: Ismaël Mejía >Priority: Minor > Time Spent: 2h > Remaining Estimate: 0h > > Jackson has a new way to deal with [deserialization security > issues|https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.10] in > 2.10.x so worth the upgrade. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-7427) JmsCheckpointMark Avro Serialization issue with UnboundedSource
[ https://issues.apache.org/jira/browse/BEAM-7427?focusedWorklogId=378281=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378281 ] ASF GitHub Bot logged work on BEAM-7427: Author: ASF GitHub Bot Created on: 28/Jan/20 14:50 Start Date: 28/Jan/20 14:50 Worklog Time Spent: 10m Work Description: iemejia commented on pull request #10644: [BEAM-7427] Refactor JmsCheckpointMark to use SerializableCoder URL: https://github.com/apache/beam/pull/10644 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378281) Time Spent: 9h (was: 8h 50m) > JmsCheckpointMark Avro Serialization issue with UnboundedSource > --- > > Key: BEAM-7427 > URL: https://issues.apache.org/jira/browse/BEAM-7427 > Project: Beam > Issue Type: Bug > Components: io-java-jms >Affects Versions: 2.12.0, 2.13.0, 2.14.0, 2.15.0, 2.16.0, 2.17.0, 2.18.0, > 2.19.0 > Environment: Message Broker : solace > JMS Client (Over AMQP) : "org.apache.qpid:qpid-jms-client:0.42.0 >Reporter: Mourad >Assignee: Jean-Baptiste Onofré >Priority: Major > Fix For: 2.20.0 > > Time Spent: 9h > Remaining Estimate: 0h > > I get the following exception when reading from unbounded JMS Source: > > {code:java} > Caused by: org.apache.avro.SchemaParseException: Illegal character in: this$0 > at org.apache.avro.Schema.validateName(Schema.java:1151) > at org.apache.avro.Schema.access$200(Schema.java:81) > at org.apache.avro.Schema$Field.(Schema.java:403) > at org.apache.avro.Schema$Field.(Schema.java:396) > at org.apache.avro.reflect.ReflectData.createSchema(ReflectData.java:622) > at org.apache.avro.reflect.ReflectData.createFieldSchema(ReflectData.java:740) > at org.apache.avro.reflect.ReflectData.createSchema(ReflectData.java:604) > at 
org.apache.avro.specific.SpecificData$2.load(SpecificData.java:218) > at org.apache.avro.specific.SpecificData$2.load(SpecificData.java:215) > at > avro.shaded.com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3568) > at > avro.shaded.com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2350) > at > avro.shaded.com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2313) > at > avro.shaded.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2228) > {code} > > The exception is thrown by Avro when introspecting {{JmsCheckpointMark}} to > generate schema. > JmsIO config : > > {code:java} > PCollection messages = pipeline.apply("read messages from the > events broker", JmsIO.readMessage() > .withConnectionFactory(jmsConnectionFactory) .withTopic(options.getTopic()) > .withMessageMapper(new DFAMessageMapper()) > .withCoder(AvroCoder.of(DFAMessage.class))); > {code} > > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (BEAM-7427) JmsCheckpointMark Avro Serialization issue with UnboundedSource
[ https://issues.apache.org/jira/browse/BEAM-7427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismaël Mejía resolved BEAM-7427. Resolution: Fixed > JmsCheckpointMark Avro Serialization issue with UnboundedSource > --- > > Key: BEAM-7427 > URL: https://issues.apache.org/jira/browse/BEAM-7427 > Project: Beam > Issue Type: Bug > Components: io-java-jms >Affects Versions: 2.12.0, 2.13.0, 2.14.0, 2.15.0, 2.16.0, 2.17.0, 2.18.0, > 2.19.0 > Environment: Message Broker : solace > JMS Client (Over AMQP) : "org.apache.qpid:qpid-jms-client:0.42.0 >Reporter: Mourad >Assignee: Jean-Baptiste Onofré >Priority: Major > Fix For: 2.20.0 > > Time Spent: 9h 10m > Remaining Estimate: 0h > > I get the following exception when reading from unbounded JMS Source: > > {code:java} > Caused by: org.apache.avro.SchemaParseException: Illegal character in: this$0 > at org.apache.avro.Schema.validateName(Schema.java:1151) > at org.apache.avro.Schema.access$200(Schema.java:81) > at org.apache.avro.Schema$Field.(Schema.java:403) > at org.apache.avro.Schema$Field.(Schema.java:396) > at org.apache.avro.reflect.ReflectData.createSchema(ReflectData.java:622) > at org.apache.avro.reflect.ReflectData.createFieldSchema(ReflectData.java:740) > at org.apache.avro.reflect.ReflectData.createSchema(ReflectData.java:604) > at org.apache.avro.specific.SpecificData$2.load(SpecificData.java:218) > at org.apache.avro.specific.SpecificData$2.load(SpecificData.java:215) > at > avro.shaded.com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3568) > at > avro.shaded.com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2350) > at > avro.shaded.com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2313) > at > avro.shaded.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2228) > {code} > > The exception is thrown by Avro when introspecting {{JmsCheckpointMark}} to > generate schema. 
> JmsIO config : > > {code:java} > PCollection<DFAMessage> messages = pipeline.apply("read messages from the > events broker", JmsIO.readMessage() > .withConnectionFactory(jmsConnectionFactory) > .withTopic(options.getTopic()) > .withMessageMapper(new DFAMessageMapper()) > .withCoder(AvroCoder.of(DFAMessage.class))); > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-7427) JmsCheckpointMark Avro Serialization issue with UnboundedSource
[ https://issues.apache.org/jira/browse/BEAM-7427?focusedWorklogId=378282=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378282 ] ASF GitHub Bot logged work on BEAM-7427: Author: ASF GitHub Bot Created on: 28/Jan/20 14:50 Start Date: 28/Jan/20 14:50 Worklog Time Spent: 10m Work Description: iemejia commented on issue #10644: [BEAM-7427] Refactor JmsCheckpointMark to use SerializableCoder URL: https://github.com/apache/beam/pull/10644#issuecomment-579281715 Merged manually, thanks again JB! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378282) Time Spent: 9h 10m (was: 9h) > JmsCheckpointMark Avro Serialization issue with UnboundedSource > --- > > Key: BEAM-7427 > URL: https://issues.apache.org/jira/browse/BEAM-7427 > Project: Beam > Issue Type: Bug > Components: io-java-jms >Affects Versions: 2.12.0, 2.13.0, 2.14.0, 2.15.0, 2.16.0, 2.17.0, 2.18.0, > 2.19.0 > Environment: Message Broker : solace > JMS Client (Over AMQP) : "org.apache.qpid:qpid-jms-client:0.42.0 >Reporter: Mourad >Assignee: Jean-Baptiste Onofré >Priority: Major > Fix For: 2.20.0 > > Time Spent: 9h 10m > Remaining Estimate: 0h > > I get the following exception when reading from unbounded JMS Source: > > {code:java} > Caused by: org.apache.avro.SchemaParseException: Illegal character in: this$0 > at org.apache.avro.Schema.validateName(Schema.java:1151) > at org.apache.avro.Schema.access$200(Schema.java:81) > at org.apache.avro.Schema$Field.(Schema.java:403) > at org.apache.avro.Schema$Field.(Schema.java:396) > at org.apache.avro.reflect.ReflectData.createSchema(ReflectData.java:622) > at org.apache.avro.reflect.ReflectData.createFieldSchema(ReflectData.java:740) > at 
org.apache.avro.reflect.ReflectData.createSchema(ReflectData.java:604) > at org.apache.avro.specific.SpecificData$2.load(SpecificData.java:218) > at org.apache.avro.specific.SpecificData$2.load(SpecificData.java:215) > at > avro.shaded.com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3568) > at > avro.shaded.com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2350) > at > avro.shaded.com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2313) > at > avro.shaded.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2228) > {code} > > The exception is thrown by Avro when introspecting {{JmsCheckpointMark}} to > generate schema. > JmsIO config : > > {code:java} > PCollection messages = pipeline.apply("read messages from the > events broker", JmsIO.readMessage() > .withConnectionFactory(jmsConnectionFactory) .withTopic(options.getTopic()) > .withMessageMapper(new DFAMessageMapper()) > .withCoder(AvroCoder.of(DFAMessage.class))); > {code} > > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
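The merged change replaces Avro reflection with `SerializableCoder`, which encodes the checkpoint through plain Java serialization, where field names such as `this$0` are irrelevant. A rough stdlib-only sketch of that encode/decode round trip — the `CheckpointMark` class below is a made-up stand-in, not the real `JmsCheckpointMark`:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializableRoundTrip {
    // Made-up stand-in for a checkpoint mark; what matters for
    // SerializableCoder is only that the class implements Serializable.
    static class CheckpointMark implements Serializable {
        private static final long serialVersionUID = 1L;
        final long lastOffset;
        CheckpointMark(long lastOffset) { this.lastOffset = lastOffset; }
    }

    // Encode then decode via standard Java serialization, which is
    // essentially what SerializableCoder does under the hood.
    static CheckpointMark roundTrip(CheckpointMark mark) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(mark);
        }
        try (ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (CheckpointMark) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip(new CheckpointMark(42L)).lastOffset); // 42
    }
}
```

The trade-off versus a schema-based coder is that Java serialization offers no schema evolution, which is usually acceptable for short-lived checkpoint state.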
[jira] [Updated] (BEAM-7691) Improve checkpoints documentation
[ https://issues.apache.org/jira/browse/BEAM-7691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismaël Mejía updated BEAM-7691: --- Description: UnboundedSource.CheckpointMark should mention that implementations should be encodable (have an associated Coder). It may also be a good idea to explain the checkpointing semantics of Beam in more detail. > Improve checkpoints documentation > - > > Key: BEAM-7691 > URL: https://issues.apache.org/jira/browse/BEAM-7691 > Project: Beam > Issue Type: Improvement > Components: sdk-java-core, website >Reporter: Ismaël Mejía >Priority: Minor > Labels: documentation, javadoc, newbie, starter > > UnboundedSource.CheckpointMark should mention that implementations should > be encodable (have an associated Coder). It may also be a good idea to > explain the checkpointing semantics of Beam in more detail. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9132) State request handler is removed prematurely when closing ActiveBundle
[ https://issues.apache.org/jira/browse/BEAM-9132?focusedWorklogId=378292=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378292 ] ASF GitHub Bot logged work on BEAM-9132: Author: ASF GitHub Bot Created on: 28/Jan/20 15:19 Start Date: 28/Jan/20 15:19 Worklog Time Spent: 10m Work Description: mxm commented on issue #10694: [BEAM-9132] Use a unique bundle id across all SDK workers bound to the same environment URL: https://github.com/apache/beam/pull/10694#issuecomment-579296135 This does not fix the issue because the id generation scheme is effectively the same. I'm still seeing the same errors, but I have a new trace. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378292) Time Spent: 1h 20m (was: 1h 10m) > State request handler is removed prematurely when closing ActiveBundle > -- > > Key: BEAM-9132 > URL: https://issues.apache.org/jira/browse/BEAM-9132 > Project: Beam > Issue Type: Bug > Components: java-fn-execution >Reporter: Maximilian Michels >Assignee: Maximilian Michels >Priority: Major > Time Spent: 1h 20m > Remaining Estimate: 0h > > We have observed these errors in a state-intense application: > {noformat} > Error processing instruction 107. 
Original traceback is > Traceback (most recent call last): > File "apache_beam/runners/common.py", line 780, in > apache_beam.runners.common.DoFnRunner.process > File "apache_beam/runners/common.py", line 587, in > apache_beam.runners.common.PerWindowInvoker.invoke_process > File "apache_beam/runners/common.py", line 659, in > apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window > File "apache_beam/runners/common.py", line 880, in > apache_beam.runners.common._OutputProcessor.process_outputs > File "apache_beam/runners/common.py", line 895, in > apache_beam.runners.common._OutputProcessor.process_outputs > File "redacted.py", line 56, in process > recent_events_map = load_recent_events_map(recent_events_state) > File "redacted.py", line 128, in _load_recent_events_map > items_in_recent_events_bag = list(recent_events_state.read()) > File "apache_beam/runners/worker/bundle_processor.py", line 335, in __iter__ > for elem in self.first: > File "apache_beam/runners/worker/bundle_processor.py", line 214, in __iter__ > self._state_key, self._coder_impl, is_cached=self._is_cached) > File "apache_beam/runners/worker/sdk_worker.py", line 692, in blocking_get > self._materialize_iter(state_key, coder)) > File "apache_beam/runners/worker/sdk_worker.py", line 723, in > _materialize_iter > self._underlying.get_raw(state_key, continuation_token) > File "apache_beam/runners/worker/sdk_worker.py", line 603, in get_raw > continuation_token=continuation_token))) > File "apache_beam/runners/worker/sdk_worker.py", line 637, in > _blocking_request > raise RuntimeError(response.error) > RuntimeError: Unknown process bundle instruction id '107' > {noformat} > Notice that the error is thrown on the Runner side. It seems to originate > from the {{ActiveBundle}} de-registering the state request handler too early > when the processing may still be going on in the SDK Harness. -- This message was sent by Atlassian Jira (v8.3.4#803005)
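The PR under discussion aims to derive bundle ids from a counter shared by all SDK workers bound to one environment, so that no two concurrent bundles reuse an instruction id. A generic stdlib sketch of that scheme — illustrative only, not Beam's actual code, and per the comment above the real fix required more than a shared counter:

```java
import java.util.concurrent.atomic.AtomicLong;

public class BundleIdGenerator {
    // One process-wide counter shared by every generator instance, so ids
    // never collide across workers bound to the same environment.
    private static final AtomicLong COUNTER = new AtomicLong();

    private final String environmentId;

    BundleIdGenerator(String environmentId) {
        this.environmentId = environmentId;
    }

    String nextBundleId() {
        // Prefix with the environment id so ids are also readable in logs.
        return environmentId + "-" + COUNTER.incrementAndGet();
    }

    public static void main(String[] args) {
        BundleIdGenerator workerA = new BundleIdGenerator("env1");
        BundleIdGenerator workerB = new BundleIdGenerator("env1");
        // Distinct ids even though the two handles share one environment.
        System.out.println(workerA.nextBundleId());
        System.out.println(workerB.nextBundleId());
    }
}
```

As the error trace shows, uniqueness alone does not help if the runner deregisters the state handler for an id while the SDK harness is still issuing state requests against it; the lifetime of the handler has to cover the whole bundle, not just id assignment.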
[jira] [Updated] (BEAM-7427) JmsCheckpointMark can not be correctly encoded
[ https://issues.apache.org/jira/browse/BEAM-7427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismaël Mejía updated BEAM-7427: --- Summary: JmsCheckpointMark can not be correctly encoded (was: JmsCheckpointMark Avro Serialization issue with UnboundedSource) > JmsCheckpointMark can not be correctly encoded > -- > > Key: BEAM-7427 > URL: https://issues.apache.org/jira/browse/BEAM-7427 > Project: Beam > Issue Type: Bug > Components: io-java-jms >Affects Versions: 2.12.0, 2.13.0, 2.14.0, 2.15.0, 2.16.0, 2.17.0, 2.18.0, > 2.19.0 > Environment: Message Broker : solace > JMS Client (Over AMQP) : "org.apache.qpid:qpid-jms-client:0.42.0 >Reporter: Mourad >Assignee: Jean-Baptiste Onofré >Priority: Major > Fix For: 2.20.0 > > Time Spent: 9h 20m > Remaining Estimate: 0h > > I get the following exception when reading from unbounded JMS Source: > > {code:java} > Caused by: org.apache.avro.SchemaParseException: Illegal character in: this$0 > at org.apache.avro.Schema.validateName(Schema.java:1151) > at org.apache.avro.Schema.access$200(Schema.java:81) > at org.apache.avro.Schema$Field.(Schema.java:403) > at org.apache.avro.Schema$Field.(Schema.java:396) > at org.apache.avro.reflect.ReflectData.createSchema(ReflectData.java:622) > at org.apache.avro.reflect.ReflectData.createFieldSchema(ReflectData.java:740) > at org.apache.avro.reflect.ReflectData.createSchema(ReflectData.java:604) > at org.apache.avro.specific.SpecificData$2.load(SpecificData.java:218) > at org.apache.avro.specific.SpecificData$2.load(SpecificData.java:215) > at > avro.shaded.com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3568) > at > avro.shaded.com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2350) > at > avro.shaded.com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2313) > at > avro.shaded.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2228) > {code} > > The exception is thrown by Avro when 
introspecting {{JmsCheckpointMark}} to > generate schema. > JmsIO config : > > {code:java} > PCollection<DFAMessage> messages = pipeline.apply("read messages from the > events broker", JmsIO.readMessage() > .withConnectionFactory(jmsConnectionFactory) > .withTopic(options.getTopic()) > .withMessageMapper(new DFAMessageMapper()) > .withCoder(AvroCoder.of(DFAMessage.class))); > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9175) Introduce an autoformatting tool to Python SDK
[ https://issues.apache.org/jira/browse/BEAM-9175?focusedWorklogId=378293=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378293 ] ASF GitHub Bot logged work on BEAM-9175: Author: ASF GitHub Bot Created on: 28/Jan/20 15:20 Start Date: 28/Jan/20 15:20 Worklog Time Spent: 10m Work Description: kamilwu commented on pull request #10684: [BEAM-9175] Introduce an autoformatting tool to Python SDK URL: https://github.com/apache/beam/pull/10684#discussion_r371867218 ## File path: sdks/python/apache_beam/typehints/trivial_inference.py ## @@ -68,24 +68,23 @@ def instance_to_type(o): return typehints.Tuple[[instance_to_type(item) for item in o]] elif t == list: if len(o) > 0: - return typehints.List[ - typehints.Union[[instance_to_type(item) for item in o]] - ] + return typehints.List[typehints.Union[[ + instance_to_type(item) for item in o + ]]] else: return typehints.List[typehints.Any] elif t == set: if len(o) > 0: - return typehints.Set[ - typehints.Union[[instance_to_type(item) for item in o]] - ] + return typehints.Set[typehints.Union[[ + instance_to_type(item) for item in o + ]]] else: return typehints.Set[typehints.Any] elif t == dict: if len(o) > 0: return typehints.Dict[ typehints.Union[[instance_to_type(k) for k, v in o.items()]], - typehints.Union[[instance_to_type(v) for k, v in o.items()]], - ] + typehints.Union[[instance_to_type(v) for k, v in o.items()]], ] Review comment: It's OK now. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378293) Time Spent: 2h 50m (was: 2h 40m) > Introduce an autoformatting tool to Python SDK > -- > > Key: BEAM-9175 > URL: https://issues.apache.org/jira/browse/BEAM-9175 > Project: Beam > Issue Type: Improvement > Components: sdk-py-core, sdk-py-harness >Reporter: Michał Walenia >Assignee: Kamil Wasilewski >Priority: Major > Time Spent: 2h 50m > Remaining Estimate: 0h > > It seems there are three main options: > * black - very simple, but not configurable at all (except for line length), > would drastically change code style > * yapf - more options to tweak, can omit parts of code > * autopep8 - more similar to spotless - only touches code that breaks > formatting guidelines, can use pycodestyle and flake8 as configuration > The rigidity of Black makes it unusable for Beam. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9175) Introduce an autoformatting tool to Python SDK
[ https://issues.apache.org/jira/browse/BEAM-9175?focusedWorklogId=378294=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378294 ] ASF GitHub Bot logged work on BEAM-9175: Author: ASF GitHub Bot Created on: 28/Jan/20 15:20 Start Date: 28/Jan/20 15:20 Worklog Time Spent: 10m Work Description: kamilwu commented on pull request #10684: [BEAM-9175] Introduce an autoformatting tool to Python SDK URL: https://github.com/apache/beam/pull/10684#discussion_r371867339 ## File path: sdks/python/apache_beam/typehints/trivial_inference.py ## @@ -303,10 +302,8 @@ def infer_return_type(c, input_types, debug=False, depth=5): elif inspect.isclass(c): if c in typehints.DISALLOWED_PRIMITIVE_TYPES: return { -list: typehints.List[Any], -set: typehints.Set[Any], -tuple: typehints.Tuple[Any, ...], -dict: typehints.Dict[Any, Any] +list: typehints.List[Any], set: typehints.Set[Any], tuple: Review comment: Done. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378294) Time Spent: 3h (was: 2h 50m) > Introduce an autoformatting tool to Python SDK > -- > > Key: BEAM-9175 > URL: https://issues.apache.org/jira/browse/BEAM-9175 > Project: Beam > Issue Type: Improvement > Components: sdk-py-core, sdk-py-harness >Reporter: Michał Walenia >Assignee: Kamil Wasilewski >Priority: Major > Time Spent: 3h > Remaining Estimate: 0h > > It seems there are three main options: > * black - very simple, but not configurable at all (except for line length), > would drastically change code style > * yapf - more options to tweak, can omit parts of code > * autopep8 - more similar to spotless - only touches code that breaks > formatting guidelines, can use pycodestyle and flake8 as configuration > The rigidity of Black makes it unusable for Beam. 
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9175) Introduce an autoformatting tool to Python SDK
[ https://issues.apache.org/jira/browse/BEAM-9175?focusedWorklogId=378326=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378326 ] ASF GitHub Bot logged work on BEAM-9175: Author: ASF GitHub Bot Created on: 28/Jan/20 15:33 Start Date: 28/Jan/20 15:33 Worklog Time Spent: 10m Work Description: kamilwu commented on issue #10684: [BEAM-9175] Introduce an autoformatting tool to Python SDK URL: https://github.com/apache/beam/pull/10684#issuecomment-579303250 There is still a number of pylint issues `Wrong continued indentation`. Most of them appear because of how lambda is formatted. For example: https://github.com/apache/beam/blob/7db61fbf2dd6eac4ffb542e684260edf0d892fea/sdks/python/apache_beam/io/gcp/bigquery_test.py#L908-L910 I don't know yet how to deal with it. Unless there is a knob for this, I'll just put a # yapf: disable comments in these places. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378326) Time Spent: 3h 20m (was: 3h 10m) > Introduce an autoformatting tool to Python SDK > -- > > Key: BEAM-9175 > URL: https://issues.apache.org/jira/browse/BEAM-9175 > Project: Beam > Issue Type: Improvement > Components: sdk-py-core, sdk-py-harness >Reporter: Michał Walenia >Assignee: Kamil Wasilewski >Priority: Major > Time Spent: 3h 20m > Remaining Estimate: 0h > > It seems there are three main options: > * black - very simple, but not configurable at all (except for line length), > would drastically change code style > * yapf - more options to tweak, can omit parts of code > * autopep8 - more similar to spotless - only touches code that breaks > formatting guidelines, can use pycodestyle and flake8 as configuration > The rigidity of Black makes it unusable for Beam. 
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9162) Upgrade Jackson to version 2.10.2
[ https://issues.apache.org/jira/browse/BEAM-9162?focusedWorklogId=378340=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378340 ] ASF GitHub Bot logged work on BEAM-9162: Author: ASF GitHub Bot Created on: 28/Jan/20 16:00 Start Date: 28/Jan/20 16:00 Worklog Time Spent: 10m Work Description: iemejia commented on issue #10643: [BEAM-9162] Upgrade Jackson to version 2.10.2 URL: https://github.com/apache/beam/pull/10643#issuecomment-579320038 Hehe so the 31 that I mentioned above, mmm not an easy to sell proposition. On the other hand I can help with the jenkins part if you get to do an incantation that works locally for all modules. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378340) Time Spent: 2h 20m (was: 2h 10m) > Upgrade Jackson to version 2.10.2 > - > > Key: BEAM-9162 > URL: https://issues.apache.org/jira/browse/BEAM-9162 > Project: Beam > Issue Type: Improvement > Components: build-system, sdk-java-core >Reporter: Ismaël Mejía >Assignee: Ismaël Mejía >Priority: Minor > Time Spent: 2h 20m > Remaining Estimate: 0h > > Jackson has a new way to deal with [deserialization security > issues|https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.10] in > 2.10.x so worth the upgrade. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-9206) Easy way to run checkJavaLinkage?
[ https://issues.apache.org/jira/browse/BEAM-9206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismaël Mejía updated BEAM-9206: --- Status: Open (was: Triage Needed) > Easy way to run checkJavaLinkage? > - > > Key: BEAM-9206 > URL: https://issues.apache.org/jira/browse/BEAM-9206 > Project: Beam > Issue Type: Bug > Components: build-system >Reporter: Tomo Suzuki >Assignee: Tomo Suzuki >Priority: Major > > Follow up of iemejia's comment: > https://github.com/apache/beam/pull/10643#issuecomment-579276082 > bq. I just want some sort of ./gradlew :checkJavaLinkage that works for the > whole set of modules of the project. Is this 'feasible' with gradlew + Beam? > h1. Considerations > * Something that can run on Jenkins > * Comparison with the result of origin/master > * Simple way to run checkJavaLinkage for all modules -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8298) Implement state caching for side inputs
[ https://issues.apache.org/jira/browse/BEAM-8298?focusedWorklogId=378345=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378345 ] ASF GitHub Bot logged work on BEAM-8298: Author: ASF GitHub Bot Created on: 28/Jan/20 16:10 Start Date: 28/Jan/20 16:10 Worklog Time Spent: 10m Work Description: lukecwik commented on issue #10679: [BEAM-8298] Fully specify the necessary details to support side input caching. URL: https://github.com/apache/beam/pull/10679#issuecomment-579326907 Run Java PreCommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378345) Time Spent: 1h 10m (was: 1h) > Implement state caching for side inputs > --- > > Key: BEAM-8298 > URL: https://issues.apache.org/jira/browse/BEAM-8298 > Project: Beam > Issue Type: Improvement > Components: runner-core, sdk-py-harness >Reporter: Maximilian Michels >Assignee: Jing Chen >Priority: Major > Time Spent: 1h 10m > Remaining Estimate: 0h > > Caching is currently only implemented for user state. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-7427) JmsCheckpointMark Avro Serialization issue with UnboundedSource
[ https://issues.apache.org/jira/browse/BEAM-7427?focusedWorklogId=378258=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378258 ] ASF GitHub Bot logged work on BEAM-7427: Author: ASF GitHub Bot Created on: 28/Jan/20 14:15 Start Date: 28/Jan/20 14:15 Worklog Time Spent: 10m Work Description: jbonofre commented on issue #10644: [BEAM-7427] Refactore JmsCheckpointMark to be usage via Coder URL: https://github.com/apache/beam/pull/10644#issuecomment-579265247 @iemejia thanks ! I will switch to other IOs improvements ;) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378258) Time Spent: 8h 40m (was: 8.5h) > JmsCheckpointMark Avro Serialization issue with UnboundedSource > --- > > Key: BEAM-7427 > URL: https://issues.apache.org/jira/browse/BEAM-7427 > Project: Beam > Issue Type: Bug > Components: io-java-jms >Affects Versions: 2.12.0, 2.13.0, 2.14.0, 2.15.0, 2.16.0, 2.17.0, 2.18.0, > 2.19.0 > Environment: Message Broker : solace > JMS Client (Over AMQP) : "org.apache.qpid:qpid-jms-client:0.42.0 >Reporter: Mourad >Assignee: Jean-Baptiste Onofré >Priority: Major > Fix For: 2.20.0 > > Time Spent: 8h 40m > Remaining Estimate: 0h > > I get the following exception when reading from unbounded JMS Source: > > {code:java} > Caused by: org.apache.avro.SchemaParseException: Illegal character in: this$0 > at org.apache.avro.Schema.validateName(Schema.java:1151) > at org.apache.avro.Schema.access$200(Schema.java:81) > at org.apache.avro.Schema$Field.<init>(Schema.java:403) > at org.apache.avro.Schema$Field.<init>(Schema.java:396) > at org.apache.avro.reflect.ReflectData.createSchema(ReflectData.java:622) > at org.apache.avro.reflect.ReflectData.createFieldSchema(ReflectData.java:740) > at 
org.apache.avro.reflect.ReflectData.createSchema(ReflectData.java:604) > at org.apache.avro.specific.SpecificData$2.load(SpecificData.java:218) > at org.apache.avro.specific.SpecificData$2.load(SpecificData.java:215) > at > avro.shaded.com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3568) > at > avro.shaded.com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2350) > at > avro.shaded.com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2313) > at > avro.shaded.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2228) > {code} > > The exception is thrown by Avro when introspecting {{JmsCheckpointMark}} to > generate schema. > JmsIO config : > > {code:java} > PCollection<DFAMessage> messages = pipeline.apply("read messages from the > events broker", JmsIO.<DFAMessage>readMessage() > .withConnectionFactory(jmsConnectionFactory) .withTopic(options.getTopic()) > .withMessageMapper(new DFAMessageMapper()) > .withCoder(AvroCoder.of(DFAMessage.class))); > {code} > > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
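The `Illegal character in: this$0` failure quoted above comes from Avro's `ReflectData` walking a non-static inner class: the Java compiler adds a synthetic `this$0` field pointing at the enclosing instance, and `$` is not a legal character in an Avro record field name. A minimal stdlib-only sketch (not Beam code; class names here are illustrative) showing where that field comes from:

```java
// Demonstrates the synthetic this$0 field that javac adds to a non-static
// inner class; Avro's ReflectData trips over it during schema generation.
class InnerFieldDemo {
    class Inner {}          // non-static: carries a hidden this$0 reference

    static class Nested {}  // static: no enclosing-instance field at all

    public static void main(String[] args) {
        for (java.lang.reflect.Field f : Inner.class.getDeclaredFields()) {
            System.out.println(f.getName());  // the synthetic field shows up here
        }
        System.out.println(Nested.class.getDeclaredFields().length);
    }
}
```

This is why making such state static (or, as the PR above does, sidestepping Avro reflection with a dedicated `Coder`) removes the error.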
[jira] [Work logged] (BEAM-7427) JmsCheckpointMark Avro Serialization issue with UnboundedSource
[ https://issues.apache.org/jira/browse/BEAM-7427?focusedWorklogId=378284=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378284 ] ASF GitHub Bot logged work on BEAM-7427: Author: ASF GitHub Bot Created on: 28/Jan/20 14:52 Start Date: 28/Jan/20 14:52 Worklog Time Spent: 10m Work Description: iemejia commented on issue #8757: [BEAM-7427] Fix JmsCheckpointMark Avro Encoding URL: https://github.com/apache/beam/pull/8757#issuecomment-579282755 For info #10644 was merged today. The fix will be part of Beam 2.20.0 since the vote for 2.19.0 has already started. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378284) Time Spent: 9h 20m (was: 9h 10m) > JmsCheckpointMark Avro Serialization issue with UnboundedSource > --- > > Key: BEAM-7427 > URL: https://issues.apache.org/jira/browse/BEAM-7427 > Project: Beam > Issue Type: Bug > Components: io-java-jms >Affects Versions: 2.12.0, 2.13.0, 2.14.0, 2.15.0, 2.16.0, 2.17.0, 2.18.0, > 2.19.0 > Environment: Message Broker : solace > JMS Client (Over AMQP) : "org.apache.qpid:qpid-jms-client:0.42.0 >Reporter: Mourad >Assignee: Jean-Baptiste Onofré >Priority: Major > Fix For: 2.20.0 > > Time Spent: 9h 20m > Remaining Estimate: 0h > > I get the following exception when reading from unbounded JMS Source: > > {code:java} > Caused by: org.apache.avro.SchemaParseException: Illegal character in: this$0 > at org.apache.avro.Schema.validateName(Schema.java:1151) > at org.apache.avro.Schema.access$200(Schema.java:81) > at org.apache.avro.Schema$Field.<init>(Schema.java:403) > at org.apache.avro.Schema$Field.<init>(Schema.java:396) > at org.apache.avro.reflect.ReflectData.createSchema(ReflectData.java:622) > at org.apache.avro.reflect.ReflectData.createFieldSchema(ReflectData.java:740) > at 
org.apache.avro.reflect.ReflectData.createSchema(ReflectData.java:604) > at org.apache.avro.specific.SpecificData$2.load(SpecificData.java:218) > at org.apache.avro.specific.SpecificData$2.load(SpecificData.java:215) > at > avro.shaded.com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3568) > at > avro.shaded.com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2350) > at > avro.shaded.com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2313) > at > avro.shaded.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2228) > {code} > > The exception is thrown by Avro when introspecting {{JmsCheckpointMark}} to > generate schema. > JmsIO config : > > {code:java} > PCollection<DFAMessage> messages = pipeline.apply("read messages from the > events broker", JmsIO.<DFAMessage>readMessage() > .withConnectionFactory(jmsConnectionFactory) .withTopic(options.getTopic()) > .withMessageMapper(new DFAMessageMapper()) > .withCoder(AvroCoder.of(DFAMessage.class))); > {code} > > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9188) Improving speed of splitting for Custom Sources
[ https://issues.apache.org/jira/browse/BEAM-9188?focusedWorklogId=378322=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378322 ] ASF GitHub Bot logged work on BEAM-9188: Author: ASF GitHub Bot Created on: 28/Jan/20 15:24 Start Date: 28/Jan/20 15:24 Worklog Time Spent: 10m Work Description: stankiewicz commented on pull request #10701: [BEAM-9188] CassandraIO split performance improvement - cache size of the table URL: https://github.com/apache/beam/pull/10701 Splitting a CassandraIO source into multiple sources is fast because it uses one connection pool to the Cassandra cluster, but after that dataflow.worker.WorkerCustomSources calls CassandraSource.getEstimatedSizeBytes for each source, which sets up and tears down a connection to the Cassandra cluster to calculate the same table size each time. This optimization caches the size internally to avoid those additional queries. Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily: - [ ] [**Choose reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and mention them in a comment (`R: @username`). - [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf). See the [Contributor Guide](https://beam.apache.org/contribute) for more tips on [how to make review process smoother](https://beam.apache.org/contribute/#make-reviewers-job-easier). 
Post-Commit Tests Status (on master branch): [table of per-runner build-status badges (SDK, Apex, Dataflow, Flink, Gearpump, Samza, Spark) for the Go and Java SDKs omitted; truncated in the archive]
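The caching described in the PR above boils down to memoizing an expensive estimate. A hypothetical sketch of the idea (names are illustrative, not the actual CassandraIO code): the first `getEstimatedSizeBytes` call pays for the query, and later calls reuse the cached value instead of reopening a cluster connection.

```java
import java.util.function.LongSupplier;

// Illustrative memoization of an expensive size estimate. In CassandraIO's
// case the supplier would open a connection and read a system table; here it
// is injected so the caching behavior is visible on its own.
class CachedSizeEstimate {
    private Long cachedSizeBytes;  // null until first computed
    private final LongSupplier expensiveEstimate;

    CachedSizeEstimate(LongSupplier expensiveEstimate) {
        this.expensiveEstimate = expensiveEstimate;
    }

    long getEstimatedSizeBytes() {
        if (cachedSizeBytes == null) {
            cachedSizeBytes = expensiveEstimate.getAsLong();  // runs only once
        }
        return cachedSizeBytes;
    }
}
```

Calling `getEstimatedSizeBytes()` repeatedly invokes the supplier exactly once, which is the whole point of the optimization when the split produces many sources over the same table.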
[jira] [Resolved] (BEAM-4409) NoSuchMethodException reading from JmsIO
[ https://issues.apache.org/jira/browse/BEAM-4409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismaël Mejía resolved BEAM-4409. Fix Version/s: 2.20.0 Resolution: Fixed > NoSuchMethodException reading from JmsIO > > > Key: BEAM-4409 > URL: https://issues.apache.org/jira/browse/BEAM-4409 > Project: Beam > Issue Type: Bug > Components: io-java-jms >Affects Versions: 2.4.0 > Environment: Linux, Java 1.8, Beam 2.4, Direct Runner, ActiveMQ >Reporter: Edward Pricer >Priority: Major > Fix For: 2.20.0 > > > Running with the DirectRunner, and reading from a queue with JmsIO as an > unbounded source will produce a NoSuchMethodException. This occurs as the > UnboundedReadEvaluatorFactory.UnboundedReadEvaluator attempts to clone the > JmsCheckpointMark with the default (Avro) coder. > The following trivial code on the reader side reproduces the error > (DirectRunner must be in path). The messages on the queue for this test case > were simple TextMessages. I found this exception is triggered more readily > when messages are published rapidly (~200/second) > {code:java} > Pipeline p = > Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create()); > // read from the queue > ConnectionFactory factory = new > ActiveMQConnectionFactory("tcp://localhost:61616"); > PCollection<String> inputStrings = p.apply("Read from queue", > JmsIO.<String>readMessage() .withConnectionFactory(factory) > .withQueue("somequeue") .withCoder(StringUtf8Coder.of()) > .withMessageMapper((JmsIO.MessageMapper<String>) message -> > ((TextMessage) message).getText())); > // decode > PCollection<String> asStrings = inputStrings.apply("Decode Message", > ParDo.of(new DoFn<String, String>() { @ProcessElement public > void processElement(ProcessContext context) { > System.out.println(context.element()); > context.output(context.element()); } })); > p.run(); > {code} > Stack trace: > {code:java} > Exception in thread "main" java.lang.RuntimeException: > java.lang.NoSuchMethodException: javax.jms.Message.<init>() at > 
org.apache.avro.specific.SpecificData.newInstance(SpecificData.java:353) at > org.apache.avro.specific.SpecificData.newRecord(SpecificData.java:369) at > org.apache.avro.reflect.ReflectData.newRecord(ReflectData.java:901) at > org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:212) > at > org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175) > at > org.apache.avro.reflect.ReflectDatumReader.readCollection(ReflectDatumReader.java:219) > at > org.apache.avro.reflect.ReflectDatumReader.readArray(ReflectDatumReader.java:137) > at > org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:177) > at > org.apache.avro.reflect.ReflectDatumReader.readField(ReflectDatumReader.java:302) > at > org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:222) > at > org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175) > at > org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153) > at > org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:145) > at org.apache.beam.sdk.coders.AvroCoder.decode(AvroCoder.java:318) at > org.apache.beam.sdk.coders.Coder.decode(Coder.java:170) at > org.apache.beam.sdk.util.CoderUtils.decodeFromSafeStream(CoderUtils.java:122) > at > org.apache.beam.sdk.util.CoderUtils.decodeFromByteArray(CoderUtils.java:105) > at > org.apache.beam.sdk.util.CoderUtils.decodeFromByteArray(CoderUtils.java:99) > at org.apache.beam.sdk.util.CoderUtils.clone(CoderUtils.java:148) at > org.apache.beam.runners.direct.UnboundedReadEvaluatorFactory$UnboundedReadEvaluator.getReader(UnboundedReadEvaluatorFactory.java:194) > at > org.apache.beam.runners.direct.UnboundedReadEvaluatorFactory$UnboundedReadEvaluator.processElement(UnboundedReadEvaluatorFactory.java:124) > at > org.apache.beam.runners.direct.DirectTransformExecutor.processElements(DirectTransformExecutor.java:161) > at > 
org.apache.beam.runners.direct.DirectTransformExecutor.run(DirectTransformExecutor.java:125) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) Caused by: > java.lang.NoSuchMethodException: javax.jms.Message.<init>() at > java.lang.Class.getConstructor0(Class.java:3082) at > java.lang.Class.getDeclaredConstructor(Class.java:2178) at > org.apache.avro.specific.SpecificData.newInstance(SpecificData.java:347) > {code} > > And a more contrived
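The `Caused by` frame above is a reflective no-arg constructor lookup: Avro's `SpecificData.newInstance` calls `Class.getDeclaredConstructor()`, and an interface like `javax.jms.Message` cannot have any constructor. A stdlib-only sketch of that lookup, using a stand-in interface rather than the real javax.jms type:

```java
// Stand-in for javax.jms.Message: interfaces declare no constructors, so the
// reflective no-arg constructor lookup that Avro performs must fail.
interface FakeMessage {}

class NoArgCtorDemo {
    // Mirrors the lookup done inside SpecificData.newInstance.
    static String lookup(Class<?> clazz) {
        try {
            clazz.getDeclaredConstructor();  // throws for interfaces
            return "found";
        } catch (NoSuchMethodException e) {
            return "NoSuchMethodException";
        }
    }

    public static void main(String[] args) {
        System.out.println(lookup(FakeMessage.class));
    }
}
```

This is why any checkpoint mark that holds live `Message` references cannot be round-tripped through the default Avro coder, and why the eventual fix encodes `JmsCheckpointMark` with a dedicated `Coder` instead.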
[jira] [Created] (BEAM-9206) Easy way to run checkJavaLinkage?
Tomo Suzuki created BEAM-9206: - Summary: Easy way to run checkJavaLinkage? Key: BEAM-9206 URL: https://issues.apache.org/jira/browse/BEAM-9206 Project: Beam Issue Type: Bug Components: build-system Reporter: Tomo Suzuki Assignee: Tomo Suzuki Follow up of iemejia's comment: https://github.com/apache/beam/pull/10643#issuecomment-579276082 bq. I just want some sort of ./gradlew :checkJavaLinkage that works for the whole set of modules of the project. Is this 'feasible' with gradlew + Beam? h1. Considerations * Something that can run on Jenkins * Comparison with the result of origin/master * Simple way to run checkJavaLinkage for all modules -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (BEAM-9205) Regression in validates runner tests configuration in spark module
[ https://issues.apache.org/jira/browse/BEAM-9205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Etienne Chauchot closed BEAM-9205. -- Fix Version/s: 2.20.0 Resolution: Fixed > Regression in validates runner tests configuration in spark module > -- > > Key: BEAM-9205 > URL: https://issues.apache.org/jira/browse/BEAM-9205 > Project: Beam > Issue Type: Sub-task > Components: runner-spark >Reporter: Etienne Chauchot >Assignee: Etienne Chauchot >Priority: Major > Fix For: 2.20.0 > > > Not all the metrics tests are run: at least MetricsPusher is no longer run -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-9205) Regression in validates runner tests configuration in spark module
[ https://issues.apache.org/jira/browse/BEAM-9205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Etienne Chauchot updated BEAM-9205: --- Parent: BEAM-3310 Issue Type: Sub-task (was: Test) > Regression in validates runner tests configuration in spark module > -- > > Key: BEAM-9205 > URL: https://issues.apache.org/jira/browse/BEAM-9205 > Project: Beam > Issue Type: Sub-task > Components: runner-spark >Reporter: Etienne Chauchot >Assignee: Etienne Chauchot >Priority: Major > > Not all the metrics tests are run: at least MetricsPusher is no longer run -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-9205) Regression in validates runner tests configuration in spark module
Etienne Chauchot created BEAM-9205: -- Summary: Regression in validates runner tests configuration in spark module Key: BEAM-9205 URL: https://issues.apache.org/jira/browse/BEAM-9205 Project: Beam Issue Type: Test Components: runner-spark Reporter: Etienne Chauchot Assignee: Etienne Chauchot Not all the metrics tests are run: at least MetricsPusher is no longer run -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (BEAM-8925) Beam Dependency Update Request: org.apache.tika:tika-core
[ https://issues.apache.org/jira/browse/BEAM-8925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colm O hEigeartaigh reassigned BEAM-8925: - Assignee: Colm O hEigeartaigh > Beam Dependency Update Request: org.apache.tika:tika-core > - > > Key: BEAM-8925 > URL: https://issues.apache.org/jira/browse/BEAM-8925 > Project: Beam > Issue Type: Sub-task > Components: dependencies >Reporter: Beam JIRA Bot >Assignee: Colm O hEigeartaigh >Priority: Major > > - 2019-12-09 12:20:22.212496 > - > Please consider upgrading the dependency org.apache.tika:tika-core. > The current version is 1.20. The latest version is 1.23 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2019-12-23 12:20:53.356760 > - > Please consider upgrading the dependency org.apache.tika:tika-core. > The current version is 1.20. The latest version is 1.23 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2019-12-30 14:15:58.081400 > - > Please consider upgrading the dependency org.apache.tika:tika-core. > The current version is 1.20. The latest version is 1.23 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2020-01-06 12:19:33.456649 > - > Please consider upgrading the dependency org.apache.tika:tika-core. > The current version is 1.20. The latest version is 1.23 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2020-01-13 12:18:38.940974 > - > Please consider upgrading the dependency org.apache.tika:tika-core. > The current version is 1.20. 
The latest version is 1.23 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2020-01-20 12:16:03.428169 > - > Please consider upgrading the dependency org.apache.tika:tika-core. > The current version is 1.20. The latest version is 1.23 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. > - 2020-01-27 12:17:01.302466 > - > Please consider upgrading the dependency org.apache.tika:tika-core. > The current version is 1.20. The latest version is 1.23 > cc: > Please refer to [Beam Dependency Guide > |https://beam.apache.org/contribute/dependencies/]for more information. > Do Not Modify The Description Above. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-7427) JmsCheckpointMark Avro Serialization issue with UnboundedSource
[ https://issues.apache.org/jira/browse/BEAM-7427?focusedWorklogId=378223=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378223 ] ASF GitHub Bot logged work on BEAM-7427: Author: ASF GitHub Bot Created on: 28/Jan/20 13:25 Start Date: 28/Jan/20 13:25 Worklog Time Spent: 10m Work Description: jbonofre commented on issue #10644: [BEAM-7427] Refactore JmsCheckpointMark to be usage via Coder URL: https://github.com/apache/beam/pull/10644#issuecomment-579243908 As discussed, I've kept `JmsCheckpointMark` in a dedicated class. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378223) Time Spent: 8h 10m (was: 8h) > JmsCheckpointMark Avro Serialization issue with UnboundedSource > --- > > Key: BEAM-7427 > URL: https://issues.apache.org/jira/browse/BEAM-7427 > Project: Beam > Issue Type: Bug > Components: io-java-jms >Affects Versions: 2.12.0 > Environment: Message Broker : solace > JMS Client (Over AMQP) : "org.apache.qpid:qpid-jms-client:0.42.0 >Reporter: Mourad >Assignee: Mourad >Priority: Major > Time Spent: 8h 10m > Remaining Estimate: 0h > > I get the following exception when reading from unbounded JMS Source: > > {code:java} > Caused by: org.apache.avro.SchemaParseException: Illegal character in: this$0 > at org.apache.avro.Schema.validateName(Schema.java:1151) > at org.apache.avro.Schema.access$200(Schema.java:81) > at org.apache.avro.Schema$Field.<init>(Schema.java:403) > at org.apache.avro.Schema$Field.<init>(Schema.java:396) > at org.apache.avro.reflect.ReflectData.createSchema(ReflectData.java:622) > at org.apache.avro.reflect.ReflectData.createFieldSchema(ReflectData.java:740) > at org.apache.avro.reflect.ReflectData.createSchema(ReflectData.java:604) > at 
org.apache.avro.specific.SpecificData$2.load(SpecificData.java:218) > at org.apache.avro.specific.SpecificData$2.load(SpecificData.java:215) > at > avro.shaded.com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3568) > at > avro.shaded.com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2350) > at > avro.shaded.com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2313) > at > avro.shaded.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2228) > {code} > > The exception is thrown by Avro when introspecting {{JmsCheckpointMark}} to > generate schema. > JmsIO config : > > {code:java} > PCollection<DFAMessage> messages = pipeline.apply("read messages from the > events broker", JmsIO.<DFAMessage>readMessage() > .withConnectionFactory(jmsConnectionFactory) .withTopic(options.getTopic()) > .withMessageMapper(new DFAMessageMapper()) > .withCoder(AvroCoder.of(DFAMessage.class))); > {code} > > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-2535) Allow explicit output time independent of firing specification for all timers
[ https://issues.apache.org/jira/browse/BEAM-2535?focusedWorklogId=378225=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378225 ] ASF GitHub Bot logged work on BEAM-2535: Author: ASF GitHub Bot Created on: 28/Jan/20 13:26 Start Date: 28/Jan/20 13:26 Worklog Time Spent: 10m Work Description: iemejia commented on issue #10627: [BEAM-2535] Support outputTimestamp and watermark holds in processing timers. URL: https://github.com/apache/beam/pull/10627#issuecomment-579244488 retest this please This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378225) Time Spent: 13.5h (was: 13h 20m) > Allow explicit output time independent of firing specification for all timers > - > > Key: BEAM-2535 > URL: https://issues.apache.org/jira/browse/BEAM-2535 > Project: Beam > Issue Type: New Feature > Components: beam-model, sdk-java-core >Reporter: Kenneth Knowles >Assignee: Shehzaad Nakhoda >Priority: Major > Time Spent: 13.5h > Remaining Estimate: 0h > > Today, we have insufficient control over the event time timestamp of elements > output from a timer callback. > 1. For an event time timer, it is the timestamp of the timer itself. > 2. For a processing time timer, it is the current input watermark at the > time of processing. > But for both of these, we may want to reserve the right to output a > particular time, aka set a "watermark hold". > A naive implementation of a {{TimerWithWatermarkHold}} would work for making > sure output is not droppable, but does not fully explain window expiration > and late data/timer dropping. 
> In the natural interpretation of a timer as a feedback loop on a transform, > timers should be viewed as another channel of input, with a watermark, and > items on that channel _all need event time timestamps even if they are > delivered according to a different time domain_. > I propose that the specification for when a timer should fire should be > separated (with nice defaults) from the specification of the event time of > resulting outputs. These timestamps will determine a side channel with a new > "timer watermark" that constrains the output watermark. > - We still need to fire event time timers according to the input watermark, > so that event time timers fire. > - Late data dropping and window expiration will be in terms of the minimum > of the input watermark and the timer watermark. In this way, whenever a timer > is set, the window is not going to be garbage collected. > - We will need to make sure we have a way to "wake up" a window once it is > expired; this may be as simple as exhausting the timer channel as soon as the > input watermark indicates expiration of a window > This is mostly aimed at end-user timers in a stateful+timely {{DoFn}}. It > seems reasonable to use timers as an implementation detail (e.g. in > runners-core utilities) without wanting any of this additional machinery. For > example, if there is no possibility of output from the timer callback. -- This message was sent by Atlassian Jira (v8.3.4#803005)
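One consequence of the proposal above can be stated concretely: late-data dropping and window expiration are decided against the minimum of the input watermark and the timer ("hold") watermark, so a pending timer keeps its window alive. An illustrative helper, with millis-since-epoch longs standing in for Beam's Instant type (this is a sketch of the semantics, not SDK code):

```java
// Illustrative only: the effective watermark used for dropping/expiry is the
// minimum of the input watermark and the hold set by pending timers.
class WatermarkHolds {
    static long effectiveWatermark(long inputWatermarkMillis, long timerHoldMillis) {
        return Math.min(inputWatermarkMillis, timerHoldMillis);
    }

    // A window is expired only once the effective watermark passes its end,
    // so a timer hold before the window end prevents garbage collection.
    static boolean windowExpired(long windowEndMillis, long inputWm, long timerHold) {
        return effectiveWatermark(inputWm, timerHold) > windowEndMillis;
    }
}
```

With an input watermark of 100 and a timer hold at 50, a window ending at 60 is not expired; once the hold is released (advances past 60), expiry follows the input watermark again.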
[jira] [Work logged] (BEAM-2535) Allow explicit output time independent of firing specification for all timers
[ https://issues.apache.org/jira/browse/BEAM-2535?focusedWorklogId=378227=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378227 ] ASF GitHub Bot logged work on BEAM-2535: Author: ASF GitHub Bot Created on: 28/Jan/20 13:26 Start Date: 28/Jan/20 13:26 Worklog Time Spent: 10m Work Description: mwalenia commented on issue #10627: [BEAM-2535] Support outputTimestamp and watermark holds in processing timers. URL: https://github.com/apache/beam/pull/10627#issuecomment-579232967 retest this please This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378227) Time Spent: 13h 40m (was: 13.5h) > Allow explicit output time independent of firing specification for all timers > - > > Key: BEAM-2535 > URL: https://issues.apache.org/jira/browse/BEAM-2535 > Project: Beam > Issue Type: New Feature > Components: beam-model, sdk-java-core >Reporter: Kenneth Knowles >Assignee: Shehzaad Nakhoda >Priority: Major > Time Spent: 13h 40m > Remaining Estimate: 0h > > Today, we have insufficient control over the event time timestamp of elements > output from a timer callback. > 1. For an event time timer, it is the timestamp of the timer itself. > 2. For a processing time timer, it is the current input watermark at the > time of processing. > But for both of these, we may want to reserve the right to output a > particular time, aka set a "watermark hold". > A naive implementation of a {{TimerWithWatermarkHold}} would work for making > sure output is not droppable, but does not fully explain window expiration > and late data/timer dropping. 
[jira] [Work logged] (BEAM-7427) JmsCheckpointMark Avro Serialization issue with UnboundedSource
[ https://issues.apache.org/jira/browse/BEAM-7427?focusedWorklogId=378246=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378246 ] ASF GitHub Bot logged work on BEAM-7427: Author: ASF GitHub Bot Created on: 28/Jan/20 13:50 Start Date: 28/Jan/20 13:50 Worklog Time Spent: 10m Work Description: iemejia commented on issue #10644: [BEAM-7427] Refactore JmsCheckpointMark to be usage via Coder URL: https://github.com/apache/beam/pull/10644#issuecomment-579254727 Sorry for the delay; the extra commit with the fixes looks good. Since the stored messages are not needed to restore the progress of the reads in `UnboundedJmsReader`, maybe the simplest fix is just to leave them transient as you proposed. As for the State changes, let's do those in a subsequent PR so we can get this fix out more quickly. WDYT? If you agree, just leave the class as it was before and I will merge. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378246) Time Spent: 8h 20m (was: 8h 10m) > JmsCheckpointMark Avro Serialization issue with UnboundedSource > --- > > Key: BEAM-7427 > URL: https://issues.apache.org/jira/browse/BEAM-7427 > Project: Beam > Issue Type: Bug > Components: io-java-jms >Affects Versions: 2.12.0 > Environment: Message Broker : solace > JMS Client (Over AMQP) : "org.apache.qpid:qpid-jms-client:0.42.0" >Reporter: Mourad >Assignee: Mourad >Priority: Major > Time Spent: 8h 20m > Remaining Estimate: 0h > > I get the following exception when reading from unbounded JMS Source: > > {code:java} > Caused by: org.apache.avro.SchemaParseException: Illegal character in: this$0 > at org.apache.avro.Schema.validateName(Schema.java:1151) > at org.apache.avro.Schema.access$200(Schema.java:81) > at org.apache.avro.Schema$Field.<init>(Schema.java:403) > at org.apache.avro.Schema$Field.<init>(Schema.java:396) > at org.apache.avro.reflect.ReflectData.createSchema(ReflectData.java:622) > at org.apache.avro.reflect.ReflectData.createFieldSchema(ReflectData.java:740) > at org.apache.avro.reflect.ReflectData.createSchema(ReflectData.java:604) > at org.apache.avro.specific.SpecificData$2.load(SpecificData.java:218) > at org.apache.avro.specific.SpecificData$2.load(SpecificData.java:215) > at > avro.shaded.com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3568) > at > avro.shaded.com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2350) > at > avro.shaded.com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2313) > at > avro.shaded.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2228) > {code} > > The exception is thrown by Avro when introspecting {{JmsCheckpointMark}} to > generate schema. 
> JmsIO config : > > {code:java} > PCollection<DFAMessage> messages = pipeline.apply("read messages from the events broker", > JmsIO.readMessage() > .withConnectionFactory(jmsConnectionFactory) > .withTopic(options.getTopic()) > .withMessageMapper(new DFAMessageMapper()) > .withCoder(AvroCoder.of(DFAMessage.class))); > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8618) Tear down unused DoFns periodically in Python SDK harness
[ https://issues.apache.org/jira/browse/BEAM-8618?focusedWorklogId=378254=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378254 ] ASF GitHub Bot logged work on BEAM-8618: Author: ASF GitHub Bot Created on: 28/Jan/20 14:06 Start Date: 28/Jan/20 14:06 Worklog Time Spent: 10m Work Description: mxm commented on pull request #10655: [BEAM-8618] Tear down unused DoFns periodically in Python SDK harness. URL: https://github.com/apache/beam/pull/10655#discussion_r371820931 ## File path: sdks/python/apache_beam/runners/worker/sdk_worker.py ## @@ -315,18 +319,47 @@ def release(self, instruction_id): """ descriptor_id, processor = self.active_bundle_processors.pop(instruction_id) processor.reset() +self.last_access_time[descriptor_id] = time.time() self.cached_bundle_processors[descriptor_id].append(processor) def shutdown(self): """ Shutdown all ``BundleProcessor``s in the cache. """ +if self.periodic_shutdown: + self.periodic_shutdown.cancel() + self.periodic_shutdown.join() + self.periodic_shutdown = None + for instruction_id in self.active_bundle_processors: self.active_bundle_processors[instruction_id][1].shutdown() del self.active_bundle_processors[instruction_id] for cached_bundle_processors in self.cached_bundle_processors.values(): - while len(cached_bundle_processors) > 0: -cached_bundle_processors.pop().shutdown() + BundleProcessorCache._shutdown_cached_bundle_processors( + cached_bundle_processors) + + def _schedule_periodic_shutdown(self): +def shutdown_inactive_bundle_processors(): + for descriptor_id, last_access_time in self.last_access_time.items(): +if time.time() - last_access_time > 60: + BundleProcessorCache._shutdown_cached_bundle_processors( + self.cached_bundle_processors[descriptor_id]) Review comment: Don't we have to remove the bundle processor list from the dictionary? Otherwise we may access a cached shutdown bundle processor. This is an automated message from the Apache Git Service. 
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378254) Time Spent: 50m (was: 40m) > Tear down unused DoFns periodically in Python SDK harness > - > > Key: BEAM-8618 > URL: https://issues.apache.org/jira/browse/BEAM-8618 > Project: Beam > Issue Type: Improvement > Components: sdk-py-harness >Reporter: sunjincheng >Assignee: sunjincheng >Priority: Major > Fix For: 2.20.0 > > Time Spent: 50m > Remaining Estimate: 0h > > Per the discussion in the ML, detail can be found [1], the teardown of DoFns > should be supported in the portability framework. It happens at two places: > 1) Upon the control service termination > 2) Tear down the unused DoFns periodically > The aim of this JIRA is to add support for tear down the unused DoFns > periodically in Python SDK harness. > [1] > https://lists.apache.org/thread.html/0c4a4cf83cf2e35c3dfeb9d906e26cd82d3820968ba6f862f91739e4@%3Cdev.beam.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8618) Tear down unused DoFns periodically in Python SDK harness
[ https://issues.apache.org/jira/browse/BEAM-8618?focusedWorklogId=378256=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378256 ] ASF GitHub Bot logged work on BEAM-8618: Author: ASF GitHub Bot Created on: 28/Jan/20 14:06 Start Date: 28/Jan/20 14:06 Worklog Time Spent: 10m Work Description: mxm commented on pull request #10655: [BEAM-8618] Tear down unused DoFns periodically in Python SDK harness. URL: https://github.com/apache/beam/pull/10655#discussion_r371818162 ## File path: sdks/python/apache_beam/runners/worker/sdk_worker.py ## @@ -315,18 +319,47 @@ def release(self, instruction_id): """ descriptor_id, processor = self.active_bundle_processors.pop(instruction_id) processor.reset() +self.last_access_time[descriptor_id] = time.time() self.cached_bundle_processors[descriptor_id].append(processor) def shutdown(self): """ Shutdown all ``BundleProcessor``s in the cache. """ +if self.periodic_shutdown: + self.periodic_shutdown.cancel() + self.periodic_shutdown.join() + self.periodic_shutdown = None + for instruction_id in self.active_bundle_processors: self.active_bundle_processors[instruction_id][1].shutdown() del self.active_bundle_processors[instruction_id] for cached_bundle_processors in self.cached_bundle_processors.values(): - while len(cached_bundle_processors) > 0: -cached_bundle_processors.pop().shutdown() + BundleProcessorCache._shutdown_cached_bundle_processors( + cached_bundle_processors) + + def _schedule_periodic_shutdown(self): +def shutdown_inactive_bundle_processors(): + for descriptor_id, last_access_time in self.last_access_time.items(): +if time.time() - last_access_time > 60: + BundleProcessorCache._shutdown_cached_bundle_processors( + self.cached_bundle_processors[descriptor_id]) + +from apache_beam.runners.worker.data_plane import PeriodicThread +self.periodic_shutdown = PeriodicThread( +60, shutdown_inactive_bundle_processors) Review comment: Same here. Should be configurable or at least be extracted to a variable. 
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378256) Time Spent: 1h (was: 50m) > Tear down unused DoFns periodically in Python SDK harness > - > > Key: BEAM-8618 > URL: https://issues.apache.org/jira/browse/BEAM-8618 > Project: Beam > Issue Type: Improvement > Components: sdk-py-harness >Reporter: sunjincheng >Assignee: sunjincheng >Priority: Major > Fix For: 2.20.0 > > Time Spent: 1h > Remaining Estimate: 0h -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8618) Tear down unused DoFns periodically in Python SDK harness
[ https://issues.apache.org/jira/browse/BEAM-8618?focusedWorklogId=378255=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378255 ] ASF GitHub Bot logged work on BEAM-8618: Author: ASF GitHub Bot Created on: 28/Jan/20 14:06 Start Date: 28/Jan/20 14:06 Worklog Time Spent: 10m Work Description: mxm commented on pull request #10655: [BEAM-8618] Tear down unused DoFns periodically in Python SDK harness. URL: https://github.com/apache/beam/pull/10655#discussion_r371817978 ## File path: sdks/python/apache_beam/runners/worker/sdk_worker.py ## @@ -315,18 +319,47 @@ def release(self, instruction_id): """ descriptor_id, processor = self.active_bundle_processors.pop(instruction_id) processor.reset() +self.last_access_time[descriptor_id] = time.time() self.cached_bundle_processors[descriptor_id].append(processor) def shutdown(self): """ Shutdown all ``BundleProcessor``s in the cache. """ +if self.periodic_shutdown: + self.periodic_shutdown.cancel() + self.periodic_shutdown.join() + self.periodic_shutdown = None + for instruction_id in self.active_bundle_processors: self.active_bundle_processors[instruction_id][1].shutdown() del self.active_bundle_processors[instruction_id] for cached_bundle_processors in self.cached_bundle_processors.values(): - while len(cached_bundle_processors) > 0: -cached_bundle_processors.pop().shutdown() + BundleProcessorCache._shutdown_cached_bundle_processors( + cached_bundle_processors) + + def _schedule_periodic_shutdown(self): +def shutdown_inactive_bundle_processors(): + for descriptor_id, last_access_time in self.last_access_time.items(): +if time.time() - last_access_time > 60: Review comment: We may want to make this configurable. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378255) Time Spent: 1h (was: 50m) > Tear down unused DoFns periodically in Python SDK harness > - > > Key: BEAM-8618 > URL: https://issues.apache.org/jira/browse/BEAM-8618 > Project: Beam > Issue Type: Improvement > Components: sdk-py-harness >Reporter: sunjincheng >Assignee: sunjincheng >Priority: Major > Fix For: 2.20.0 > > Time Spent: 1h > Remaining Estimate: 0h -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8618) Tear down unused DoFns periodically in Python SDK harness
[ https://issues.apache.org/jira/browse/BEAM-8618?focusedWorklogId=378257=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378257 ] ASF GitHub Bot logged work on BEAM-8618: Author: ASF GitHub Bot Created on: 28/Jan/20 14:06 Start Date: 28/Jan/20 14:06 Worklog Time Spent: 10m Work Description: mxm commented on pull request #10655: [BEAM-8618] Tear down unused DoFns periodically in Python SDK harness. URL: https://github.com/apache/beam/pull/10655#discussion_r371819018 ## File path: sdks/python/apache_beam/runners/worker/sdk_worker.py ## @@ -315,18 +319,47 @@ def release(self, instruction_id): """ descriptor_id, processor = self.active_bundle_processors.pop(instruction_id) processor.reset() +self.last_access_time[descriptor_id] = time.time() self.cached_bundle_processors[descriptor_id].append(processor) def shutdown(self): """ Shutdown all ``BundleProcessor``s in the cache. """ +if self.periodic_shutdown: + self.periodic_shutdown.cancel() + self.periodic_shutdown.join() + self.periodic_shutdown = None + for instruction_id in self.active_bundle_processors: self.active_bundle_processors[instruction_id][1].shutdown() del self.active_bundle_processors[instruction_id] for cached_bundle_processors in self.cached_bundle_processors.values(): - while len(cached_bundle_processors) > 0: -cached_bundle_processors.pop().shutdown() + BundleProcessorCache._shutdown_cached_bundle_processors( + cached_bundle_processors) + + def _schedule_periodic_shutdown(self): +def shutdown_inactive_bundle_processors(): + for descriptor_id, last_access_time in self.last_access_time.items(): +if time.time() - last_access_time > 60: + BundleProcessorCache._shutdown_cached_bundle_processors( + self.cached_bundle_processors[descriptor_id]) + +from apache_beam.runners.worker.data_plane import PeriodicThread Review comment: I think we should move this to the import section. This is an automated message from the Apache Git Service. 
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378257) Time Spent: 1h 10m (was: 1h) > Tear down unused DoFns periodically in Python SDK harness > - > > Key: BEAM-8618 > URL: https://issues.apache.org/jira/browse/BEAM-8618 > Project: Beam > Issue Type: Improvement > Components: sdk-py-harness >Reporter: sunjincheng >Assignee: sunjincheng >Priority: Major > Fix For: 2.20.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8618) Tear down unused DoFns periodically in Python SDK harness
[ https://issues.apache.org/jira/browse/BEAM-8618?focusedWorklogId=378253=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378253 ] ASF GitHub Bot logged work on BEAM-8618: Author: ASF GitHub Bot Created on: 28/Jan/20 14:06 Start Date: 28/Jan/20 14:06 Worklog Time Spent: 10m Work Description: mxm commented on pull request #10655: [BEAM-8618] Tear down unused DoFns periodically in Python SDK harness. URL: https://github.com/apache/beam/pull/10655#discussion_r371817547 ## File path: sdks/python/apache_beam/runners/worker/sdk_worker.py ## @@ -280,6 +283,7 @@ def get(self, instruction_id, bundle_descriptor_id): try: # pop() is threadsafe processor = self.cached_bundle_processors[bundle_descriptor_id].pop() + self.last_access_time[bundle_descriptor_id] = time.time() except IndexError: Review comment: This won't update the access time when we first create the processor in the except block. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378253) Time Spent: 40m (was: 0.5h) > Tear down unused DoFns periodically in Python SDK harness > - > > Key: BEAM-8618 > URL: https://issues.apache.org/jira/browse/BEAM-8618 > Project: Beam > Issue Type: Improvement > Components: sdk-py-harness >Reporter: sunjincheng >Assignee: sunjincheng >Priority: Major > Fix For: 2.20.0 > > Time Spent: 40m > Remaining Estimate: 0h -- This message was sent by Atlassian Jira (v8.3.4#803005)
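Taken together, the review comments on this PR (record the access time when a processor is first created, not only when it is reused; remove the cached list from the dictionary on shutdown so a shut-down processor is never handed out again; make the 60-second threshold configurable) suggest a cache shaped roughly like the following. This is an illustrative sketch under those assumptions, not the actual `sdk_worker.py` code:

```python
import threading
import time


class ProcessorCache:
    """Sketch of a cache that tears down entries unused for `timeout` seconds.

    Names and structure are illustrative; a periodic thread would call
    shutdown_inactive() on the configured interval.
    """

    def __init__(self, factory, timeout=60.0):
        self._factory = factory
        self._timeout = timeout  # review: make the 60s threshold configurable
        self._cached = {}        # descriptor_id -> list of idle processors
        self._last_access = {}
        self._lock = threading.Lock()

    def get(self, descriptor_id):
        with self._lock:
            # review: record the access time on creation too, not only reuse
            self._last_access[descriptor_id] = time.time()
            idle = self._cached.get(descriptor_id, [])
            if idle:
                return idle.pop()
        return self._factory(descriptor_id)

    def release(self, descriptor_id, processor):
        with self._lock:
            self._last_access[descriptor_id] = time.time()
            self._cached.setdefault(descriptor_id, []).append(processor)

    def shutdown_inactive(self, now=None):
        now = time.time() if now is None else now
        with self._lock:
            for descriptor_id in list(self._cached):
                if now - self._last_access.get(descriptor_id, 0) > self._timeout:
                    # review: pop the list from the dict, otherwise get()
                    # could serve a cached-but-shut-down processor.
                    for proc in self._cached.pop(descriptor_id):
                        proc.shutdown()
                    del self._last_access[descriptor_id]
```

After a descriptor's idle list is evicted, the next `get()` for that descriptor falls back to the factory, so callers never observe a shut-down processor.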
[jira] [Work logged] (BEAM-8972) Add a Jenkins job running Combine load test on Java with Flink in Portability mode
[ https://issues.apache.org/jira/browse/BEAM-8972?focusedWorklogId=378154=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378154 ] ASF GitHub Bot logged work on BEAM-8972: Author: ASF GitHub Bot Created on: 28/Jan/20 11:02 Start Date: 28/Jan/20 11:02 Worklog Time Spent: 10m Work Description: mwalenia commented on issue #10386: [BEAM-8972] Add Jenkins job with Combine test for portable Java URL: https://github.com/apache/beam/pull/10386#issuecomment-57919 Run Load Tests Java Combine Portable Flink Batch This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378154) Time Spent: 5h (was: 4h 50m) > Add a Jenkins job running Combine load test on Java with Flink in Portability > mode > -- > > Key: BEAM-8972 > URL: https://issues.apache.org/jira/browse/BEAM-8972 > Project: Beam > Issue Type: Improvement > Components: testing >Reporter: Michał Walenia >Assignee: Michał Walenia >Priority: Minor > Time Spent: 5h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8972) Add a Jenkins job running Combine load test on Java with Flink in Portability mode
[ https://issues.apache.org/jira/browse/BEAM-8972?focusedWorklogId=378153=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378153 ] ASF GitHub Bot logged work on BEAM-8972: Author: ASF GitHub Bot Created on: 28/Jan/20 11:02 Start Date: 28/Jan/20 11:02 Worklog Time Spent: 10m Work Description: mwalenia commented on issue #10386: [BEAM-8972] Add Jenkins job with Combine test for portable Java URL: https://github.com/apache/beam/pull/10386#issuecomment-579192080 Run Load Tests Java Combine Portable Flink Batch This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378153) Time Spent: 4h 50m (was: 4h 40m) > Add a Jenkins job running Combine load test on Java with Flink in Portability > mode > -- > > Key: BEAM-8972 > URL: https://issues.apache.org/jira/browse/BEAM-8972 > Project: Beam > Issue Type: Improvement > Components: testing >Reporter: Michał Walenia >Assignee: Michał Walenia >Priority: Minor > Time Spent: 4h 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8972) Add a Jenkins job running Combine load test on Java with Flink in Portability mode
[ https://issues.apache.org/jira/browse/BEAM-8972?focusedWorklogId=378152=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378152 ] ASF GitHub Bot logged work on BEAM-8972: Author: ASF GitHub Bot Created on: 28/Jan/20 11:01 Start Date: 28/Jan/20 11:01 Worklog Time Spent: 10m Work Description: mwalenia commented on issue #10386: [BEAM-8972] Add Jenkins job with Combine test for portable Java URL: https://github.com/apache/beam/pull/10386#issuecomment-579192080 Run Load Tests Java Combine Portable Flink Batch This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378152) Time Spent: 4h 40m (was: 4.5h) > Add a Jenkins job running Combine load test on Java with Flink in Portability > mode > -- > > Key: BEAM-8972 > URL: https://issues.apache.org/jira/browse/BEAM-8972 > Project: Beam > Issue Type: Improvement > Components: testing >Reporter: Michał Walenia >Assignee: Michał Walenia >Priority: Minor > Time Spent: 4h 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8550) @RequiresTimeSortedInput DoFn annotation
[ https://issues.apache.org/jira/browse/BEAM-8550?focusedWorklogId=378199=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378199 ] ASF GitHub Bot logged work on BEAM-8550: Author: ASF GitHub Bot Created on: 28/Jan/20 12:51 Start Date: 28/Jan/20 12:51 Worklog Time Spent: 10m Work Description: je-ik commented on issue #8774: [BEAM-8550] Requires time sorted input URL: https://github.com/apache/beam/pull/8774#issuecomment-579230688 retest this please This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378199) Time Spent: 9h 20m (was: 9h 10m) > @RequiresTimeSortedInput DoFn annotation > > > Key: BEAM-8550 > URL: https://issues.apache.org/jira/browse/BEAM-8550 > Project: Beam > Issue Type: New Feature > Components: beam-model, sdk-java-core >Reporter: Jan Lukavský >Assignee: Jan Lukavský >Priority: Major > Time Spent: 9h 20m > Remaining Estimate: 0h > > Implement new annotation {{@RequiresTimeSortedInput}} for stateful DoFn as > described in [design > document|https://docs.google.com/document/d/1ObLVUFsf1NcG8ZuIZE4aVy2RYKx2FfyMhkZYWPnI9-c/edit?usp=sharing]. > First implementation will assume that: > - time is defined by timestamp in associated WindowedValue > - allowed lateness is explicitly zero and all late elements are dropped > (due to being out of order) > The above properties are considered temporary and will be resolved by > subsequent extensions (backwards compatible). -- This message was sent by Atlassian Jira (v8.3.4#803005)
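The contract the BEAM-8550 ticket describes (elements delivered to the stateful DoFn sorted by timestamp, with explicitly zero allowed lateness so out-of-order elements behind the watermark are dropped) can be sketched as a small buffering simulation. The names below are illustrative, not the Beam implementation:

```python
import heapq


class TimeSortedBuffer:
    """Sketch of @RequiresTimeSortedInput semantics: buffer incoming elements
    and release them in timestamp order as the watermark advances; with zero
    allowed lateness, elements behind the watermark are dropped."""

    def __init__(self):
        self._heap = []  # (timestamp, element), ordered by timestamp
        self._watermark = float("-inf")
        self.dropped = []

    def add(self, timestamp, element):
        if timestamp < self._watermark:
            # Late relative to the watermark: out of order, so dropped.
            self.dropped.append(element)
        else:
            heapq.heappush(self._heap, (timestamp, element))

    def advance_watermark(self, watermark):
        """Emit, in timestamp order, everything at or before the new watermark."""
        self._watermark = watermark
        out = []
        while self._heap and self._heap[0][0] <= watermark:
            out.append(heapq.heappop(self._heap)[1])
        return out


buf = TimeSortedBuffer()
buf.add(5, "b")
buf.add(3, "a")
print(buf.advance_watermark(4))   # ['a']
buf.add(2, "late")                # behind watermark 4: dropped
print(buf.advance_watermark(10))  # ['b']
print(buf.dropped)                # ['late']
```

The ticket's planned extensions (non-zero allowed lateness, timestamps other than the WindowedValue's) would relax the `timestamp < self._watermark` check rather than change the sorting itself.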
[jira] [Work logged] (BEAM-2535) Allow explicit output time independent of firing specification for all timers
[ https://issues.apache.org/jira/browse/BEAM-2535?focusedWorklogId=378228=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378228 ] ASF GitHub Bot logged work on BEAM-2535: Author: ASF GitHub Bot Created on: 28/Jan/20 13:28 Start Date: 28/Jan/20 13:28 Worklog Time Spent: 10m Work Description: iemejia commented on issue #10627: [BEAM-2535] Support outputTimestamp and watermark holds in processing timers. URL: https://github.com/apache/beam/pull/10627#issuecomment-579245093 Run Direct ValidatesRunner This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378228) Time Spent: 13h 50m (was: 13h 40m) > Allow explicit output time independent of firing specification for all timers > - > > Key: BEAM-2535 > URL: https://issues.apache.org/jira/browse/BEAM-2535 > Project: Beam > Issue Type: New Feature > Components: beam-model, sdk-java-core >Reporter: Kenneth Knowles >Assignee: Shehzaad Nakhoda >Priority: Major > Time Spent: 13h 50m > Remaining Estimate: 0h > > Today, we have insufficient control over the event time timestamp of elements > output from a timer callback. > 1. For an event time timer, it is the timestamp of the timer itself. > 2. For a processing time timer, it is the current input watermark at the > time of processing. > But for both of these, we may want to reserve the right to output a > particular time, aka set a "watermark hold". > A naive implementation of a {{TimerWithWatermarkHold}} would work for making > sure output is not droppable, but does not fully explain window expiration > and late data/timer dropping. 
[jira] [Work logged] (BEAM-7427) JmsCheckpointMark Avro Serialization issue with UnboundedSource
[ https://issues.apache.org/jira/browse/BEAM-7427?focusedWorklogId=378249=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-378249 ] ASF GitHub Bot logged work on BEAM-7427: Author: ASF GitHub Bot Created on: 28/Jan/20 13:52 Start Date: 28/Jan/20 13:52 Worklog Time Spent: 10m Work Description: iemejia commented on issue #10644: [BEAM-7427] Refactore JmsCheckpointMark to be usage via Coder URL: https://github.com/apache/beam/pull/10644#issuecomment-579255395 Oh you already get rid of state hehe, my bad ok looking again. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 378249) Time Spent: 8.5h (was: 8h 20m) > JmsCheckpointMark Avro Serialization issue with UnboundedSource > --- > > Key: BEAM-7427 > URL: https://issues.apache.org/jira/browse/BEAM-7427 > Project: Beam > Issue Type: Bug > Components: io-java-jms >Affects Versions: 2.12.0 > Environment: Message Broker : solace > JMS Client (Over AMQP) : "org.apache.qpid:qpid-jms-client:0.42.0 >Reporter: Mourad >Assignee: Mourad >Priority: Major > Time Spent: 8.5h > Remaining Estimate: 0h -- This message was sent by Atlassian Jira (v8.3.4#803005)