[jira] [Updated] (BEAM-14405) Java Spanner IO NPE when ProjectID not specified in template executions
[ https://issues.apache.org/jira/browse/BEAM-14405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick updated BEAM-14405: - Status: Resolved (was: Resolved) fixed by [PR#17540|https://github.com/apache/beam/pull/17540] > Java Spanner IO NPE when ProjectID not specified in template executions > --- > > Key: BEAM-14405 > URL: https://issues.apache.org/jira/browse/BEAM-14405 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.38.0 >Reporter: Niel Markwick >Assignee: Niel Markwick >Priority: P1 > Fix For: 2.39.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > BEAM-13665 fixed the issue of NPE's when ProjectIDs were not specified in > command line executions by checking for a null ValueProvider, but in > _template_ executions, the ValueProvider is non-null but has a null > {_}value{_}. > [PR #17094|https://github.com/apache/beam/pull/17094] from BEAM-14116 added > nullness checks for keys in MonitoringInfoMetricName, which then triggered > the NPE in 2.38.0 when a template was executed with an unspecified ProjectID. -- This message was sent by Atlassian Jira (v8.20.7#820007)
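The distinction described above (a null ValueProvider in command-line executions versus a non-null ValueProvider holding a null value in template executions) can be sketched as follows. This is only an illustration under stated assumptions, not the code from PR#17540: SimpleValueProvider is a hypothetical local stand-in for Beam's ValueProvider so that the snippet compiles without the Beam SDK, and ProjectIdResolver and its fallback argument are likewise invented for the example.

```java
// Hypothetical stand-in for Beam's ValueProvider<T>, declared locally so the
// sketch compiles without the Beam SDK on the classpath.
interface SimpleValueProvider<T> {
  T get();
}

// Hypothetical helper, not part of Beam.
final class ProjectIdResolver {
  private ProjectIdResolver() {}

  // Returns the configured project ID, or the fallback when either the
  // provider itself is null or the value it holds is null.
  static String resolve(SimpleValueProvider<String> provider, String fallback) {
    if (provider == null) {
      // Command-line case: no provider was set at all.
      return fallback;
    }
    // Template case: the provider exists but may hold a null value.
    String value = provider.get();
    return value != null ? value : fallback;
  }
}
```

Checking only `provider == null` would let the second case through, which is the situation the issue describes for template executions.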
[jira] [Work stopped] (BEAM-14405) Java Spanner IO NPE when ProjectID not specified in template executions
[ https://issues.apache.org/jira/browse/BEAM-14405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on BEAM-14405 stopped by Niel Markwick. > Java Spanner IO NPE when ProjectID not specified in template executions > --- > > Key: BEAM-14405 > URL: https://issues.apache.org/jira/browse/BEAM-14405 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.38.0 >Reporter: Niel Markwick >Assignee: Niel Markwick >Priority: P1 > Fix For: 2.39.0 > > Time Spent: 40m > Remaining Estimate: 0h > > BEAM-13665 fixed the issue of NPE's when ProjectIDs were not specified in > command line executions by checking for a null ValueProvider, but in > _template_ executions, the ValueProvider is non-null but has a null > {_}value{_}. > [PR #17094|https://github.com/apache/beam/pull/17094] from BEAM-14116 added > nullness checks for keys in MonitoringInfoMetricName, which then triggered > the NPE in 2.38.0 when a template was executed with an unspecified ProjectID. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Work started] (BEAM-14405) Java Spanner IO NPE when ProjectID not specified in template executions
[ https://issues.apache.org/jira/browse/BEAM-14405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on BEAM-14405 started by Niel Markwick. > Java Spanner IO NPE when ProjectID not specified in template executions > --- > > Key: BEAM-14405 > URL: https://issues.apache.org/jira/browse/BEAM-14405 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.38.0 >Reporter: Niel Markwick >Assignee: Niel Markwick >Priority: P1 > Fix For: 2.39.0 > > > BEAM-13665 fixed the issue of NPE's when ProjectIDs were not specified in > command line executions by checking for a null ValueProvider, but in > _template_ executions, the ValueProvider is non-null but has a null > {_}value{_}. > [PR #17094|https://github.com/apache/beam/pull/17094] from BEAM-14116 added > nullness checks for keys in MonitoringInfoMetricName, which then triggered > the NPE in 2.38.0 when a template was executed with an unspecified ProjectID. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (BEAM-14405) Java Spanner IO NPE when ProjectID not specified in template executions
[ https://issues.apache.org/jira/browse/BEAM-14405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick updated BEAM-14405: - Fix Version/s: 2.39.0 (was: 2.36.0) > Java Spanner IO NPE when ProjectID not specified in template executions > --- > > Key: BEAM-14405 > URL: https://issues.apache.org/jira/browse/BEAM-14405 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.38.0 >Reporter: Niel Markwick >Assignee: Niel Markwick >Priority: P1 > Fix For: 2.39.0 > > > BEAM-13665 fixed the issue of NPE's when ProjectIDs were not specified in > command line executions by checking for a null ValueProvider, but in > _template_ executions, the ValueProvider is non-null but has a null > {_}value{_}. > [PR #17094|https://github.com/apache/beam/pull/17094] from BEAM-14116 added > nullness checks for keys in MonitoringInfoMetricName, which then triggered > the NPE in 2.38.0 when a template was executed with an unspecified ProjectID. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (BEAM-14405) Java Spanner IO NPE when ProjectID not specified in template executions
[ https://issues.apache.org/jira/browse/BEAM-14405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick updated BEAM-14405: - Description: BEAM-13665 fixed the issue of NPE's when ProjectIDs were not specified in command line executions by checking for a null ValueProvider, but in _template_ executions, the ValueProvider is non-null but has a null {_}value{_}. PR #17094 from BEAM-14116 added nullness checks for keys in MonitoringInfoMetricName, which then triggered the NPE in 2.38.0 when a template was executed with an unspecified ProjectID. was:BEAM-13665 fixed the issue of NPE's when ProjectIDs were not specified in command line executions by checking for a null ValueProvider, but in _template_ executions, the ValueProvider is non-null but has a null {_}value{_}. > Java Spanner IO NPE when ProjectID not specified in template executions > --- > > Key: BEAM-14405 > URL: https://issues.apache.org/jira/browse/BEAM-14405 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Reporter: Niel Markwick >Assignee: Niel Markwick >Priority: P1 > Fix For: 2.36.0 > > > BEAM-13665 fixed the issue of NPE's when ProjectIDs were not specified in > command line executions by checking for a null ValueProvider, but in > _template_ executions, the ValueProvider is non-null but has a null > {_}value{_}. > PR #17094 from BEAM-14116 added nullness checks for keys in > MonitoringInfoMetricName, which then triggered the NPE in 2.38.0 when a > template was executed with an unspecified ProjectID. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (BEAM-14405) Java Spanner IO NPE when ProjectID not specified in template executions
[ https://issues.apache.org/jira/browse/BEAM-14405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick updated BEAM-14405: - Description: BEAM-13665 fixed the issue of NPE's when ProjectIDs were not specified in command line executions by checking for a null ValueProvider, but in _template_ executions, the ValueProvider is non-null but has a null {_}value{_}. [PR #17094|https://github.com/apache/beam/pull/17094] from BEAM-14116 added nullness checks for keys in MonitoringInfoMetricName, which then triggered the NPE in 2.38.0 when a template was executed with an unspecified ProjectID. was: BEAM-13665 fixed the issue of NPE's when ProjectIDs were not specified in command line executions by checking for a null ValueProvider, but in _template_ executions, the ValueProvider is non-null but has a null {_}value{_}. PR #17094 from BEAM-14116 added nullness checks for keys in MonitoringInfoMetricName, which then triggered the NPE in 2.38.0 when a template was executed with an unspecified ProjectID. > Java Spanner IO NPE when ProjectID not specified in template executions > --- > > Key: BEAM-14405 > URL: https://issues.apache.org/jira/browse/BEAM-14405 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Reporter: Niel Markwick >Assignee: Niel Markwick >Priority: P1 > Fix For: 2.36.0 > > > BEAM-13665 fixed the issue of NPE's when ProjectIDs were not specified in > command line executions by checking for a null ValueProvider, but in > _template_ executions, the ValueProvider is non-null but has a null > {_}value{_}. > [PR #17094|https://github.com/apache/beam/pull/17094] from BEAM-14116 added > nullness checks for keys in MonitoringInfoMetricName, which then triggered > the NPE in 2.38.0 when a template was executed with an unspecified ProjectID. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (BEAM-14405) Java Spanner IO NPE when ProjectID not specified in template executions
[ https://issues.apache.org/jira/browse/BEAM-14405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick updated BEAM-14405: - Affects Version/s: 2.38.0 > Java Spanner IO NPE when ProjectID not specified in template executions > --- > > Key: BEAM-14405 > URL: https://issues.apache.org/jira/browse/BEAM-14405 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.38.0 >Reporter: Niel Markwick >Assignee: Niel Markwick >Priority: P1 > Fix For: 2.36.0 > > > BEAM-13665 fixed the issue of NPE's when ProjectIDs were not specified in > command line executions by checking for a null ValueProvider, but in > _template_ executions, the ValueProvider is non-null but has a null > {_}value{_}. > [PR #17094|https://github.com/apache/beam/pull/17094] from BEAM-14116 added > nullness checks for keys in MonitoringInfoMetricName, which then triggered > the NPE in 2.38.0 when a template was executed with an unspecified ProjectID. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (BEAM-14405) Java Spanner IO NPE when ProjectID not specified in template executions
[ https://issues.apache.org/jira/browse/BEAM-14405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick updated BEAM-14405: - Description: BEAM-13665 fixed the issue of NPE's when ProjectIDs were not specified in command line executions by checking for a null ValueProvider, but in _template_ executions, the ValueProvider is non-null but has a null {_}value{_}. (was: According to comments on BEAM-11982, the change to resolve it broke backwards compatibility. So that affects 2.34.0 and 2.35.0. It is still worthwhile to restore 2.36.0) > Java Spanner IO NPE when ProjectID not specified in template executions > --- > > Key: BEAM-14405 > URL: https://issues.apache.org/jira/browse/BEAM-14405 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Reporter: Niel Markwick >Assignee: Niel Markwick >Priority: P1 > Fix For: 2.36.0 > > > BEAM-13665 fixed the issue of NPE's when ProjectIDs were not specified in > command line executions by checking for a null ValueProvider, but in > _template_ executions, the ValueProvider is non-null but has a null > {_}value{_}. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (BEAM-14405) Java Spanner IO NPE when ProjectID not specified in template executions
Niel Markwick created BEAM-14405: Summary: Java Spanner IO NPE when ProjectID not specified in template executions Key: BEAM-14405 URL: https://issues.apache.org/jira/browse/BEAM-14405 Project: Beam Issue Type: Bug Components: io-java-gcp Reporter: Niel Markwick Assignee: Niel Markwick Fix For: 2.36.0 According to comments on BEAM-11982, the change to resolve it broke backwards compatibility. So that affects 2.34.0 and 2.35.0. It is still worthwhile to restore 2.36.0 -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (BEAM-14171) CoGroupByKey loses values with large groups on Dataflow v1
Niel Markwick created BEAM-14171: Summary: CoGroupByKey loses values with large groups on Dataflow v1 Key: BEAM-14171 URL: https://issues.apache.org/jira/browse/BEAM-14171 Project: Beam Issue Type: Bug Components: runner-dataflow, sdk-java-core Affects Versions: 2.37.0, 2.36.0 Reporter: Niel Markwick Assignee: Robert Bradshaw Fix For: 2.38.0 CoGroupByKey can lose elements - replacing them with null values when a group is large (>10,000 elements). This only occurs in dataflow v1, not dataflow-v2 runner Possibly related to BEAM-13541. https://lists.apache.org/thread/5y56kbgm3q0m1byzf7186rrkomrcfldm -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (BEAM-14121) Incorrect Spanner IO Request Count metrics
[ https://issues.apache.org/jira/browse/BEAM-14121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick updated BEAM-14121: - Remaining Estimate: 120h (was: 168h) Original Estimate: 120h (was: 168h) > Incorrect Spanner IO Request Count metrics > -- > > Key: BEAM-14121 > URL: https://issues.apache.org/jira/browse/BEAM-14121 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.34.0 >Reporter: Niel Markwick >Assignee: Niel Markwick >Priority: P2 > Labels: google-cloud-spanner > Original Estimate: 120h > Remaining Estimate: 120h > > IO request count metrics are calculated incorrectly for GCP Spanner. > > The resource ID is formulated incorrectly. > *Spanner Table:* > {{//spanner.googleapis.com/projects/\{projectId}/{*}topics{*}/\{databaseId}/tables/\{tableId}}} > should be > {{//spanner.googleapis.com/projects/\{projectId}/instances/\{instanceId}/databases/\{databaseId}/tables/\{tableId}}} > and is populated incorrectly – the instance ID is used in place of the table ID. > *Spanner SQL Query:* > {{//spanner.googleapis.com/projects/\{projectId}/queries/\{queryName}}} > should be > {{//spanner.googleapis.com/projects/\{projectId}/instances/\{instanceId}/queries/\{queryName}}} > and queryName is nullable, which causes issues downstream. > This is not actually populated at all - queries are logged as reads on an > instance. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (BEAM-14121) Incorrect Spanner IO Request Count metrics
[ https://issues.apache.org/jira/browse/BEAM-14121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick updated BEAM-14121: - Remaining Estimate: 40h (was: 120h) Original Estimate: 40h (was: 120h) > Incorrect Spanner IO Request Count metrics > -- > > Key: BEAM-14121 > URL: https://issues.apache.org/jira/browse/BEAM-14121 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.34.0 >Reporter: Niel Markwick >Assignee: Niel Markwick >Priority: P2 > Labels: google-cloud-spanner > Original Estimate: 40h > Remaining Estimate: 40h > > IO request count metrics are calculated incorrectly for GCP Spanner. > > The resource ID is formulated incorrectly. > *Spanner Table:* > {{//spanner.googleapis.com/projects/\{projectId}/{*}topics{*}/\{databaseId}/tables/\{tableId}}} > should be > {{//spanner.googleapis.com/projects/\{projectId}/instances/\{instanceId}/databases/\{databaseId}/tables/\{tableId}}} > and is populated incorrectly – the instance ID is used in place of the table ID. > *Spanner SQL Query:* > {{//spanner.googleapis.com/projects/\{projectId}/queries/\{queryName}}} > should be > {{//spanner.googleapis.com/projects/\{projectId}/instances/\{instanceId}/queries/\{queryName}}} > and queryName is nullable, which causes issues downstream. > This is not actually populated at all - queries are logged as reads on an > instance. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (BEAM-14121) Incorrect Spanner IO Request Count metrics
Niel Markwick created BEAM-14121: Summary: Incorrect Spanner IO Request Count metrics Key: BEAM-14121 URL: https://issues.apache.org/jira/browse/BEAM-14121 Project: Beam Issue Type: Bug Components: io-java-gcp Affects Versions: 2.34.0 Reporter: Niel Markwick Assignee: Niel Markwick IO request count metrics are calculated incorrectly for GCP Spanner. The resource ID is formulated incorrectly. *Spanner Table:* {{//spanner.googleapis.com/projects/\{projectId}/{*}topics{*}/\{databaseId}/tables/\{tableId}}} should be {{//spanner.googleapis.com/projects/\{projectId}/instances/\{instanceId}/databases/\{databaseId}/tables/\{tableId}}} and is populated incorrectly – the instance ID is used in place of the table ID. *Spanner SQL Query:* {{//spanner.googleapis.com/projects/\{projectId}/queries/\{queryName}}} should be {{//spanner.googleapis.com/projects/\{projectId}/instances/\{instanceId}/queries/\{queryName}}} and queryName is nullable, which causes issues downstream. This is not actually populated at all - queries are logged as reads on an instance. -- This message was sent by Atlassian Jira (v8.20.1#820001)
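The corrected resource ID layouts quoted in this report can be illustrated with a small formatter. This is a minimal sketch, not Beam's implementation: the class SpannerResourceIds and its method names are hypothetical, and substituting the placeholder "unnamed" for a null queryName is an assumption made only so the example has defined behaviour.

```java
// Hypothetical helper, not part of Beam; it only illustrates the corrected
// resource ID layouts quoted in the issue.
final class SpannerResourceIds {
  private static final String PREFIX = "//spanner.googleapis.com/projects/";

  private SpannerResourceIds() {}

  // projects/{projectId}/instances/{instanceId}/databases/{databaseId}/tables/{tableId}
  static String table(String projectId, String instanceId, String databaseId, String tableId) {
    return PREFIX + projectId
        + "/instances/" + instanceId
        + "/databases/" + databaseId
        + "/tables/" + tableId;
  }

  // projects/{projectId}/instances/{instanceId}/queries/{queryName}
  static String query(String projectId, String instanceId, String queryName) {
    // Assumption: a null query name is replaced by a placeholder instead of
    // being passed downstream as null.
    String name = (queryName == null) ? "unnamed" : queryName;
    return PREFIX + projectId + "/instances/" + instanceId + "/queries/" + name;
  }
}
```

Keeping the instance and table segments as separate named parameters makes the reported mix-up (instance ID used in place of the table ID) easier to spot at the call site.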
[jira] [Updated] (BEAM-14005) Exceptions when reading from Spanner ignored
[ https://issues.apache.org/jira/browse/BEAM-14005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick updated BEAM-14005: - Description: Regression from BEAM-11982: [BatchSpannerRead.java#L223|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/BatchSpannerRead.java#L223] ignores exceptions raised when reading from Spanner. It logs a metric and continues. This can lead to data loss, as read errors are unreported and ignored. A pipeline can exit with success when data has been only partially read or the read has failed. was: Regression from BEAM-11982: [BatchSpannerRead.java#L223|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/BatchSpannerRead.java#L223] ignores exceptions raised when reading from Spanner. It logs a metric and continues. > Exceptions when reading from Spanner ignored > > > Key: BEAM-14005 > URL: https://issues.apache.org/jira/browse/BEAM-14005 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.34.0, 2.35.0, 2.36.0, 2.37.0 >Reporter: Niel Markwick >Assignee: Niel Markwick >Priority: P1 > Labels: cloud-spanner, regression > Fix For: 2.37.0 > > Time Spent: 3h 20m > Remaining Estimate: 0h > > Regression from BEAM-11982: > [BatchSpannerRead.java#L223|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/BatchSpannerRead.java#L223] > ignores exceptions raised when reading from Spanner. It logs a > metric and continues. > > This can lead to data loss, as read errors are unreported and ignored. A > pipeline can exit with success when data has been only partially read or the > read has failed. -- This message was sent by Atlassian Jira (v8.20.1#820001)
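The failure mode described in this report (a read error is counted in a metric but then swallowed) and the general remedy of counting and rethrowing can be sketched as below. This is an illustration only, not the code in BatchSpannerRead.java: CountingReader and its counters are hypothetical, and a plain java.util.function.Supplier stands in for the Spanner read.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

// Hypothetical wrapper, not part of Beam: records the outcome of each read
// and propagates failures instead of swallowing them.
final class CountingReader {
  final AtomicLong okCalls = new AtomicLong();
  final AtomicLong failedCalls = new AtomicLong();

  <T> T read(Supplier<T> source) {
    try {
      T result = source.get();
      okCalls.incrementAndGet();
      return result;
    } catch (RuntimeException e) {
      // Record the failure for the request-count metric...
      failedCalls.incrementAndGet();
      // ...and rethrow, so the caller (and ultimately the pipeline) fails
      // rather than continuing with partially read data.
      throw e;
    }
  }
}
```

Without the rethrow in the catch block, the caller would see a normal return and the job could finish successfully despite the failed read.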
[jira] [Commented] (BEAM-11982) Java Spanner - Implement IO Request Count metrics
[ https://issues.apache.org/jira/browse/BEAM-11982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498426#comment-17498426 ] Niel Markwick commented on BEAM-11982: -- Found another issue with this PR, which causes a potential regression. Raised BEAM-14005 > Java Spanner - Implement IO Request Count metrics > - > > Key: BEAM-11982 > URL: https://issues.apache.org/jira/browse/BEAM-11982 > Project: Beam > Issue Type: Improvement > Components: io-java-gcp >Reporter: Alex Amato >Assignee: Benjamin Gonzalez >Priority: P2 > Fix For: 2.34.0 > > Time Spent: 4h 20m > Remaining Estimate: 0h > > Reference PRs (See BigQuery IO example) and detailed explanation of what's > needed to instrument this IO with Request Count metrics is found in this > handoff doc: > [https://docs.google.com/document/d/1lrz2wE5Dl4zlUfPAenjXIQyleZvqevqoxhyE85aj4sc/edit'|https://docs.google.com/document/d/1lrz2wE5Dl4zlUfPAenjXIQyleZvqevqoxhyE85aj4sc/edit'?authuser=0] -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (BEAM-14005) Exceptions when reading from Spanner ignored
Niel Markwick created BEAM-14005: Summary: Exceptions when reading from Spanner ignored Key: BEAM-14005 URL: https://issues.apache.org/jira/browse/BEAM-14005 Project: Beam Issue Type: Bug Components: io-java-gcp Affects Versions: 2.36.0, 2.35.0, 2.34.0, 2.37.0 Reporter: Niel Markwick Fix For: 2.38.0 Regression from BEAM-11982: [BatchSpannerRead.java#L223|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/BatchSpannerRead.java#L223] ignores exceptions raised when reading from Spanner. It logs a metric and continues. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (BEAM-14005) Exceptions when reading from Spanner ignored
[ https://issues.apache.org/jira/browse/BEAM-14005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick reassigned BEAM-14005: Assignee: Niel Markwick > Exceptions when reading from Spanner ignored > > > Key: BEAM-14005 > URL: https://issues.apache.org/jira/browse/BEAM-14005 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.34.0, 2.35.0, 2.36.0, 2.37.0 >Reporter: Niel Markwick >Assignee: Niel Markwick >Priority: P1 > Labels: cloud-spanner, regression > Fix For: 2.38.0 > > > Regression from BEAM-11982: > [BatchSpannerRead.java#L223|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/BatchSpannerRead.java#L223] > ignores exceptions raised when reading from Spanner. It logs a > metric and continues. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (BEAM-13668) Java Spanner IO Request Count metrics broke backwards compatibility
[ https://issues.apache.org/jira/browse/BEAM-13668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick updated BEAM-13668: - Fix Version/s: 2.36.0 (was: 2.38.0) Resolution: Duplicate Status: Resolved (was: Open) Marking as duplicate > Java Spanner IO Request Count metrics broke backwards compatibility > --- > > Key: BEAM-13668 > URL: https://issues.apache.org/jira/browse/BEAM-13668 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Reporter: Niel Markwick >Assignee: Niel Markwick >Priority: P1 > Fix For: 2.36.0 > > > According to comments on BEAM-11982, the change to resolve it broke backwards > compatibility. So that affects 2.34.0 and 2.35.0. It is still worthwhile to > restore 2.36.0 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (BEAM-13665) Spanner IO request metrics requires projectId within the config when it didn't in the past
[ https://issues.apache.org/jira/browse/BEAM-13665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478559#comment-17478559 ] Niel Markwick commented on BEAM-13665: -- Resolved by #16547 Created BEAM-13687 as a followup for a potential performance regression due to creating metrics for every element. > Spanner IO request metrics requires projectId within the config when it > didn't in the past > -- > > Key: BEAM-13665 > URL: https://issues.apache.org/jira/browse/BEAM-13665 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.34.0, 2.35.0 >Reporter: Luke Cwik >Assignee: Niel Markwick >Priority: P1 > Fix For: 2.36.0 > > Time Spent: 1h 40m > Remaining Estimate: 0h > > [https://github.com/apache/beam/pull/15493] makes the GCP projectID a > required parameter - which it was not before, as it could be inferred from > the environment - and thus breaks backward compatibility. > Specifically: BatchSpannerRead.java:175 – the toString() is not required on > the valueProvider, it should just be a get(), and createServiceCallMetric() > in line 195 should handle the situation where projectID could be null > Also SpannerIO.java:1683 has a similar issue with the Write transform. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (BEAM-13687) Java Spanner - Improve IO Request Count metrics
[ https://issues.apache.org/jira/browse/BEAM-13687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478558#comment-17478558 ] Niel Markwick commented on BEAM-13687: -- [~ajam...@google.com] [~lcwik] > Java Spanner - Improve IO Request Count metrics > --- > > Key: BEAM-13687 > URL: https://issues.apache.org/jira/browse/BEAM-13687 > Project: Beam > Issue Type: Improvement > Components: io-java-gcp >Reporter: Niel Markwick >Assignee: Benjamin Gonzalez >Priority: P2 > Fix For: 2.34.0 > > > Request count metrics are created in the ProcessElement function for > SpannerIO Reads and Writes: > [BatchSpannerRead.java#L173|https://github.com/nielm/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/BatchSpannerRead.java#L173] > [SpannerIO.java#L1681|https://github.com/nielm/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/SpannerIO.java#L1681] > > This could cause a performance regression > (ref https://issues.apache.org/jira/browse/BEAM-13665 > [https://github.com/apache/beam/pull/16547] ) -- This message was sent by Atlassian Jira (v8.20.1#820001)
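The concern raised here is that building the request-count metric inside the per-element function repeats the construction cost for every element. One possible direction, shown purely as a sketch and not as what was eventually implemented in Beam, is to build the metric once per function instance and reuse it. RequestMetric and ReadFn are hypothetical classes, and setup() and processElement() only mimic the shape of a DoFn lifecycle.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical metric object; CREATED counts how many instances were built.
final class RequestMetric {
  static final AtomicInteger CREATED = new AtomicInteger();
  private long calls;

  RequestMetric() {
    CREATED.incrementAndGet();
  }

  void call() {
    calls++;
  }

  long calls() {
    return calls;
  }
}

// Hypothetical per-element function: the metric is built once in setup()
// and reused for every element, instead of being rebuilt per element.
final class ReadFn {
  private RequestMetric metric;

  void setup() {
    metric = new RequestMetric();
  }

  void processElement(String element) {
    metric.call();
  }

  long recordedCalls() {
    return metric.calls();
  }
}
```

With this shape, processing N elements records N calls but constructs only one metric object per function instance.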
[jira] [Updated] (BEAM-13687) Java Spanner - Improve IO Request Count metrics
[ https://issues.apache.org/jira/browse/BEAM-13687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick updated BEAM-13687: - Description: Request count metrics are created in the ProcessElement function for SpannerIO Reads and Writes: [BatchSpannerRead.java#L173|https://github.com/nielm/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/BatchSpannerRead.java#L173] [SpannerIO.java#L1681|https://github.com/nielm/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/SpannerIO.java#L1681] This could cause a performance regression (ref https://issues.apache.org/jira/browse/BEAM-13665 [https://github.com/apache/beam/pull/16547] ) was: Request count metrics are created in the ProcessElement function for SpannerIO Reads and Writes: [BatchSpannerRead.java#L173|https://github.com/nielm/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/BatchSpannerRead.java#L173] [SpannerIO.java#L1681|https://github.com/nielm/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/SpannerIO.java#L1681 > Java Spanner - Improve IO Request Count metrics > --- > > Key: BEAM-13687 > URL: https://issues.apache.org/jira/browse/BEAM-13687 > Project: Beam > Issue Type: Improvement > Components: io-java-gcp >Reporter: Niel Markwick >Assignee: Benjamin Gonzalez >Priority: P2 > Fix For: 2.34.0 > > > Request count metrics are created in the ProcessElement function for > SpannerIO Reads and Writes: > [BatchSpannerRead.java#L173|https://github.com/nielm/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/BatchSpannerRead.java#L173] > [SpannerIO.java#L1681|https://github.com/nielm/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/SpannerIO.java#L1681] > > This could cause a performance regression > (ref https://issues.apache.org/jira/browse/BEAM-13665 > [https://github.com/apache/beam/pull/16547] ) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (BEAM-13687) Java Spanner - Improve IO Request Count metrics
[ https://issues.apache.org/jira/browse/BEAM-13687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick updated BEAM-13687: - Description: Request count metrics are created in the ProcessElement function for SpannerIO Reads and Writes: [BatchSpannerRead.java#L173|https://github.com/nielm/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/BatchSpannerRead.java#L173] [SpannerIO.java#L1681|https://github.com/nielm/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/SpannerIO.java#L1681 was: Reference PRs (See BigQuery IO example) and detailed explanation of what's needed to instrument this IO with Request Count metrics is found in this handoff doc: [https://docs.google.com/document/d/1lrz2wE5Dl4zlUfPAenjXIQyleZvqevqoxhyE85aj4sc/edit'|https://docs.google.com/document/d/1lrz2wE5Dl4zlUfPAenjXIQyleZvqevqoxhyE85aj4sc/edit'?authuser=0] > Java Spanner - Improve IO Request Count metrics > --- > > Key: BEAM-13687 > URL: https://issues.apache.org/jira/browse/BEAM-13687 > Project: Beam > Issue Type: Improvement > Components: io-java-gcp >Reporter: Niel Markwick >Assignee: Benjamin Gonzalez >Priority: P2 > Fix For: 2.34.0 > > > Request count metrics are created in the ProcessElement function for > SpannerIO Reads and Writes: > [BatchSpannerRead.java#L173|https://github.com/nielm/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/BatchSpannerRead.java#L173] > [SpannerIO.java#L1681|https://github.com/nielm/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/SpannerIO.java#L1681 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (BEAM-13687) Java Spanner - Improve IO Request Count metrics
Niel Markwick created BEAM-13687: Summary: Java Spanner - Improve IO Request Count metrics Key: BEAM-13687 URL: https://issues.apache.org/jira/browse/BEAM-13687 Project: Beam Issue Type: Improvement Components: io-java-gcp Reporter: Niel Markwick Assignee: Benjamin Gonzalez Fix For: 2.34.0 Reference PRs (See BigQuery IO example) and detailed explanation of what's needed to instrument this IO with Request Count metrics is found in this handoff doc: [https://docs.google.com/document/d/1lrz2wE5Dl4zlUfPAenjXIQyleZvqevqoxhyE85aj4sc/edit'|https://docs.google.com/document/d/1lrz2wE5Dl4zlUfPAenjXIQyleZvqevqoxhyE85aj4sc/edit'?authuser=0] -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (BEAM-13665) Spanner IO request metrics requires projectId within the config when it didn't in the past
[ https://issues.apache.org/jira/browse/BEAM-13665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick updated BEAM-13665: - Description: [https://github.com/apache/beam/pull/15493] makes the GCP projectID a required parameter - which it was not before, as it could be inferred from the environment - and thus breaks backward compatibility. Specifically: BatchSpannerRead.java:175 – the toString() is not required on the valueProvider, it should just be a get(), and createServiceCallMetric() in line 195 should handle the situation where projectID could be null Also SpannerIO.java:1683 has a similar issue with the Write transform. was: https://github.com/apache/beam/pull/15493 makes the GCP projectID a required parameter - which it was not before, as it could be inferred from the environment - and thus breaks backward compatibility. Specifically: BatchSpannerRead.java:175 -- the toString() is not required on the valueProvider, it should just be a get(), and createServiceCallMetric() in line 195 should handle the situation where projectID could be null > Spanner IO request metrics requires projectId within the config when it > didn't in the past > -- > > Key: BEAM-13665 > URL: https://issues.apache.org/jira/browse/BEAM-13665 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.34.0, 2.35.0 >Reporter: Luke Cwik >Assignee: Niel Markwick >Priority: P1 > Fix For: 2.36.0 > > Time Spent: 20m > Remaining Estimate: 0h > > [https://github.com/apache/beam/pull/15493] makes the GCP projectID a > required parameter - which it was not before, as it could be inferred from > the environment - and thus breaks backward compatibility. > Specifically: BatchSpannerRead.java:175 – the toString() is not required on > the valueProvider, it should just be a get(), and createServiceCallMetric() > in line 195 should handle the situation where projectID could be null > Also SpannerIO.java:1683 has a similar issue with the Write transform. 
[jira] [Commented] (BEAM-13665) Spanner IO request metrics requires projectId within the config when it didn't in the past
[ https://issues.apache.org/jira/browse/BEAM-13665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478012#comment-17478012 ] Niel Markwick commented on BEAM-13665: -- Created [https://github.com/apache/beam/pull/16547] > Spanner IO request metrics requires projectId within the config when it > didn't in the past > -- > > Key: BEAM-13665 > URL: https://issues.apache.org/jira/browse/BEAM-13665 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.34.0, 2.35.0 >Reporter: Luke Cwik >Assignee: Niel Markwick >Priority: P1 > Fix For: 2.36.0 > > Time Spent: 10m > Remaining Estimate: 0h > > https://github.com/apache/beam/pull/15493 makes the GCP projectID a required > parameter - which it was not before, as it could be inferred from the > environment - and thus breaks backward compatibility. > Specifically: BatchSpannerRead.java:175 -- the toString() is not required on > the valueProvider, it should just be a get(), and createServiceCallMetric() > in line 195 should handle the situation where projectID could be null -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (BEAM-13665) Spanner IO request metrics requires projectId within the config when it didn't in the past
[ https://issues.apache.org/jira/browse/BEAM-13665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick reassigned BEAM-13665: Assignee: Niel Markwick > Spanner IO request metrics requires projectId within the config when it > didn't in the past > -- > > Key: BEAM-13665 > URL: https://issues.apache.org/jira/browse/BEAM-13665 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.34.0, 2.35.0 >Reporter: Luke Cwik >Assignee: Niel Markwick >Priority: P1 > Fix For: 2.36.0 > > > https://github.com/apache/beam/pull/15493 makes the GCP projectID a required > parameter - which it was not before, as it could be inferred from the > environment - and thus breaks backward compatibility. > Specifically: BatchSpannerRead.java:175 -- the toString() is not required on > the valueProvider, it should just be a get(), and createServiceCallMetric() > in line 195 should handle the situation where projectID could be null -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (BEAM-11982) Java Spanner - Implement IO Request Count metrics
[ https://issues.apache.org/jira/browse/BEAM-11982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick updated BEAM-11982: - Priority: P2 (was: P3) > Java Spanner - Implement IO Request Count metrics > - > > Key: BEAM-11982 > URL: https://issues.apache.org/jira/browse/BEAM-11982 > Project: Beam > Issue Type: Test > Components: io-java-gcp >Reporter: Alex Amato >Assignee: Benjamin Gonzalez >Priority: P2 > Fix For: Missing > > Time Spent: 4h > Remaining Estimate: 0h > > Reference PRs (See BigQuery IO example) and detailed explanation of what's > needed to instrument this IO with Request Count metrics is found in this > handoff doc: > [https://docs.google.com/document/d/1lrz2wE5Dl4zlUfPAenjXIQyleZvqevqoxhyE85aj4sc/edit'|https://docs.google.com/document/d/1lrz2wE5Dl4zlUfPAenjXIQyleZvqevqoxhyE85aj4sc/edit'?authuser=0] -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Reopened] (BEAM-11982) Java Spanner - Implement IO Request Count metrics
[ https://issues.apache.org/jira/browse/BEAM-11982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick reopened BEAM-11982: -- Reopening as this breaks backward compatibility. [BatchSpannerRead.java:175|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/BatchSpannerRead.java#L175] calls toString() on the valueproviders, which can throw a null pointer exception if they are not set (it should call `get()`) The projectId value is optional as it can be inferred from the environment, and if it is not set, this line will throw a runtime NPE. Please fix, and also update createServiceCallMetric to handle the possibility of a NULL projectID > Java Spanner - Implement IO Request Count metrics > - > > Key: BEAM-11982 > URL: https://issues.apache.org/jira/browse/BEAM-11982 > Project: Beam > Issue Type: Test > Components: io-java-gcp >Reporter: Alex Amato >Assignee: Benjamin Gonzalez >Priority: P3 > Fix For: Missing > > Time Spent: 4h > Remaining Estimate: 0h > > Reference PRs (See BigQuery IO example) and detailed explanation of what's > needed to instrument this IO with Request Count metrics is found in this > handoff doc: > [https://docs.google.com/document/d/1lrz2wE5Dl4zlUfPAenjXIQyleZvqevqoxhyE85aj4sc/edit'|https://docs.google.com/document/d/1lrz2wE5Dl4zlUfPAenjXIQyleZvqevqoxhyE85aj4sc/edit'?authuser=0] -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (BEAM-11722) SpannerIO.read() parallelism limited by partition count
[ https://issues.apache.org/jira/browse/BEAM-11722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380010#comment-17380010 ] Niel Markwick commented on BEAM-11722: -- I did some experiments on this, comparing:
# Reading a table using a single thread
# Reading using the pre-supplied partitions from the Spanner Partitioned Read API (in my test, this supplied 38 partitions, but only 3 had data)
# Splitting the key into 1000 key-ranges, shuffling them, and then reading these 1000 in parallel using multiple workers.
1) was obviously the slowest. 2) and 3) were about equally fast to read the actual data, implying that, despite the relatively low number of populated partitions, reading the API-supplied partitions was the most efficient way to read the data from the database.
Background information: Spanner stores tables by creating blocks of key-ranges and assigning ownership of these key-ranges (splits) to specific Spanner nodes. The Partitioned Read API returns partitions that correspond, roughly, to these splits, so when reading a partition, a single Spanner node is tasked with reading the data.
Of course, having only a few partitions with actual rows may cause issues for the downstream pipeline if the work done on these rows takes a long time and would benefit from additional parallelization.
I don't believe putting a reshuffle on the output of SpannerIO.Read by default is a good idea. For very large reads (in the TBs), this will significantly delay the time until the first record is emitted, and may add cost to shuffle the data. If the end user needs better parallelization of the SpannerIO.Read output, they can add a Reshuffle themselves.
> SpannerIO.read() parallelism limited by partition count > --- > > Key: BEAM-11722 > URL: https://issues.apache.org/jira/browse/BEAM-11722 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Reporter: Udi Meiri >Priority: P3 > Labels: google-cloud-spanner > > Setting the partition size / count is not possible: > {code} Note: This hint is currently ignored by sessions.partitionQuery and > sessions.partitionRead requests.{code} > https://cloud.google.com/spanner/docs/reference/rest/v1/PartitionOptions > Adding a Reshuffle might be the only choice, but it adds time and resource > usage. > cc: [~chamikara][~nielm] -- This message was sent by Atlassian Jira (v8.3.4#803005)
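The third experiment in the comment above (split the key space into 1000 key-ranges, then shuffle them before reading) can be sketched as follows. This assumes a hypothetical numeric primary key and does not touch Spanner; the class and method names are made up for illustration and it is not the code used in the experiment.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class KeyRangeSplitter {

  /** Half-open range [start, end) over a hypothetical numeric primary key. */
  static final class Range {
    final long start;
    final long end;

    Range(long start, long end) {
      this.start = start;
      this.end = end;
    }
  }

  /** Splits [minKey, maxKey) into contiguous ranges of near-equal width. */
  static List<Range> split(long minKey, long maxKey, int parts) {
    if (parts < 1 || maxKey < minKey) {
      throw new IllegalArgumentException("invalid range or part count");
    }
    long span = maxKey - minKey; // assumes span * parts fits in a long
    List<Range> ranges = new ArrayList<>(parts);
    for (int i = 0; i < parts; i++) {
      ranges.add(new Range(minKey + span * i / parts, minKey + span * (i + 1) / parts));
    }
    return ranges;
  }

  /** Sums the widths of all ranges, to check that nothing was lost. */
  static long totalKeys(List<Range> ranges) {
    long total = 0;
    for (Range r : ranges) {
      total += r.end - r.start;
    }
    return total;
  }

  public static void main(String[] args) {
    List<Range> ranges = split(0, 1_000_000, 1000);
    // Shuffling spreads adjacent ranges across workers instead of
    // handing neighbouring ranges to the same worker.
    Collections.shuffle(ranges, new Random(42));
    System.out.println(ranges.size() + " ranges covering " + totalKeys(ranges) + " keys");
  }
}
```

As the comment notes, this extra splitting did not read faster than the API-supplied partitions in that test, so it is mainly useful for spreading downstream work.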
[jira] [Commented] (BEAM-11722) SpannerIO.read() parallelism limited by partition count
[ https://issues.apache.org/jira/browse/BEAM-11722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17285853#comment-17285853 ] Niel Markwick commented on BEAM-11722: -- Spanner eng confirms that these options are ignored at present. Adding a reshuffle to the output of Read/ReadAll would be a workaround but it would block the pipeline until all data has been read - which may take a long time and use a large amount of resources. This can be done in customer pipelines if they feel that the performance improvement of downstream parallelization is worth the additional latency. > SpannerIO.read() parallelism limited by partition count > --- > > Key: BEAM-11722 > URL: https://issues.apache.org/jira/browse/BEAM-11722 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Reporter: Udi Meiri >Assignee: Niel Markwick >Priority: P2 > Labels: google-cloud-spanner > > Setting the partition size / count is not possible: > {code} Note: This hint is currently ignored by sessions.partitionQuery and > sessions.partitionRead requests.{code} > https://cloud.google.com/spanner/docs/reference/rest/v1/PartitionOptions > Adding a Reshuffle might be the only choice, but it adds time and resource > usage. > cc: [~chamikara][~nielm] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (BEAM-11722) SpannerIO.read() parallelism limited by partition count
[ https://issues.apache.org/jira/browse/BEAM-11722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278781#comment-17278781 ] Niel Markwick edited comment on BEAM-11722 at 2/4/21, 11:31 AM: I will follow up with Spanner engineering to determine whether these options are actually used (and incorrectly documented), and log a feature request if not. Thanks for raising. was (Author: nielm): I will follow up with Spanner engineer to determine whether these options are actually used (and incorrectly documented), and log a feature request if not. Thanks for raising. > SpannerIO.read() parallelism limited by partition count > --- > > Key: BEAM-11722 > URL: https://issues.apache.org/jira/browse/BEAM-11722 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Reporter: Udi Meiri >Assignee: Niel Markwick >Priority: P2 > Labels: google-cloud-spanner > > Setting the partition size / count is not possible: > {code} Note: This hint is currently ignored by sessions.partitionQuery and > sessions.partitionRead requests.{code} > https://cloud.google.com/spanner/docs/reference/rest/v1/PartitionOptions > Adding a Reshuffle might be the only choice, but it adds time and resource > usage. > cc: [~chamikara][~nielm] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-11722) SpannerIO.read() parallelism limited by partition count
[ https://issues.apache.org/jira/browse/BEAM-11722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278781#comment-17278781 ] Niel Markwick commented on BEAM-11722: -- I will follow up with Spanner engineer to determine whether these options are actually used (and incorrectly documented), and log a feature request if not. Thanks for raising. > SpannerIO.read() parallelism limited by partition count > --- > > Key: BEAM-11722 > URL: https://issues.apache.org/jira/browse/BEAM-11722 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Reporter: Udi Meiri >Assignee: Niel Markwick >Priority: P2 > Labels: google-cloud-spanner > > Setting the partition size / count is not possible: > {code} Note: This hint is currently ignored by sessions.partitionQuery and > sessions.partitionRead requests.{code} > https://cloud.google.com/spanner/docs/reference/rest/v1/PartitionOptions > Adding a Reshuffle might be the only choice, but it adds time and resource > usage. > cc: [~chamikara][~nielm] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-11722) SpannerIO.read() parallelism limited by partition count
[ https://issues.apache.org/jira/browse/BEAM-11722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick updated BEAM-11722: - Component/s: (was: runner-dataflow) Labels: google-cloud-spanner (was: ) > SpannerIO.read() parallelism limited by partition count > --- > > Key: BEAM-11722 > URL: https://issues.apache.org/jira/browse/BEAM-11722 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Reporter: Udi Meiri >Assignee: Niel Markwick >Priority: P2 > Labels: google-cloud-spanner > > Setting the partition size / count is not possible: > {code} Note: This hint is currently ignored by sessions.partitionQuery and > sessions.partitionRead requests.{code} > https://cloud.google.com/spanner/docs/reference/rest/v1/PartitionOptions > Adding a Reshuffle might be the only choice, but it adds time and resource > usage. > cc: [~chamikara][~nielm] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (BEAM-11722) SpannerIO.read() parallelism limited by partition count
[ https://issues.apache.org/jira/browse/BEAM-11722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick reassigned BEAM-11722: Assignee: Niel Markwick > SpannerIO.read() parallelism limited by partition count > --- > > Key: BEAM-11722 > URL: https://issues.apache.org/jira/browse/BEAM-11722 > Project: Beam > Issue Type: Bug > Components: io-java-gcp, runner-dataflow >Reporter: Udi Meiri >Assignee: Niel Markwick >Priority: P2 > > Setting the partition size / count is not possible: > {code} Note: This hint is currently ignored by sessions.partitionQuery and > sessions.partitionRead requests.{code} > https://cloud.google.com/spanner/docs/reference/rest/v1/PartitionOptions > Adding a Reshuffle might be the only choice, but it adds time and resource > usage. > cc: [~chamikara][~nielm] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-11722) SpannerIO.read() parallelism limited by partition count
[ https://issues.apache.org/jira/browse/BEAM-11722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick updated BEAM-11722: - Status: Open (was: Triage Needed) > SpannerIO.read() parallelism limited by partition count > --- > > Key: BEAM-11722 > URL: https://issues.apache.org/jira/browse/BEAM-11722 > Project: Beam > Issue Type: Bug > Components: io-java-gcp, runner-dataflow >Reporter: Udi Meiri >Assignee: Niel Markwick >Priority: P2 > > Setting the partition size / count is not possible: > {code} Note: This hint is currently ignored by sessions.partitionQuery and > sessions.partitionRead requests.{code} > https://cloud.google.com/spanner/docs/reference/rest/v1/PartitionOptions > Adding a Reshuffle might be the only choice, but it adds time and resource > usage. > cc: [~chamikara][~nielm] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-11643) SpannerIO does not support using BigDecimal for Numeric fields
[ https://issues.apache.org/jira/browse/BEAM-11643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17275678#comment-17275678 ] Niel Markwick commented on BEAM-11643: -- Thanks! > SpannerIO does not support using BigDecimal for Numeric fields > -- > > Key: BEAM-11643 > URL: https://issues.apache.org/jira/browse/BEAM-11643 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.27.0 >Reporter: Niel Markwick >Assignee: Niel Markwick >Priority: P2 > Fix For: 2.28.0 > > Time Spent: 2h 20m > Remaining Estimate: 0h > > Since NUMERIC types were added to Spanner, it has been possible to assign a > Java BigDecimal to a column in a Spanner mutation. > The feature to support NUMERIC in BEAM was added in > [PR#12818|https://github.com/apache/beam/pull/12818] (BEAM-10875), but > missed adding BigDecimal support to the part of the code where the size of > mutated values are calculated. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-11643) SpannerIO does not support using BigDecimal for Numeric fields
[ https://issues.apache.org/jira/browse/BEAM-11643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265948#comment-17265948 ] Niel Markwick commented on BEAM-11643: -- TODO: Add BigDecimal support to org.apache.beam.sdk.io.gcp.spanner.MutationSizeEstimator > SpannerIO does not support using BigDecimal for Numeric fields > -- > > Key: BEAM-11643 > URL: https://issues.apache.org/jira/browse/BEAM-11643 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.27.0 >Reporter: Niel Markwick >Assignee: Niel Markwick >Priority: P2 > Fix For: 2.28.0 > > > Since NUMERIC types were added to Spanner, it has been possible to assign a > Java BigDecimal to a column in a Spanner mutation. > The feature to support NUMERIC in BEAM was added in > [PR#12818|https://github.com/apache/beam/pull/12818] (BEAM-10875), but > missed adding BigDecimal support to the part of the code where the size of > mutated values are calculated. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-11643) SpannerIO does not support using BigDecimal for Numeric fields
Niel Markwick created BEAM-11643: Summary: SpannerIO does not support using BigDecimal for Numeric fields Key: BEAM-11643 URL: https://issues.apache.org/jira/browse/BEAM-11643 Project: Beam Issue Type: Bug Components: io-java-gcp Affects Versions: 2.27.0 Reporter: Niel Markwick Assignee: Niel Markwick Fix For: 2.28.0 Since NUMERIC types were added to Spanner, it has been possible to assign a Java BigDecimal to a column in a Spanner mutation. The feature to support NUMERIC in BEAM was added in [PR#12818|https://github.com/apache/beam/pull/12818] (BEAM-10875), but missed adding BigDecimal support to the part of the code where the size of mutated values are calculated. -- This message was sent by Atlassian Jira (v8.3.4#803005)
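A minimal sketch of the kind of size calculation the issue says was missing for BigDecimal values. It assumes the size is approximated by the length of the decimal string form, which may differ from what the actual fix in MutationSizeEstimator does; the class and method names here are hypothetical.

```java
import java.math.BigDecimal;
import java.util.List;

final class NumericSizeSketch {

  /** Approximate byte size of one NUMERIC value; a null value counts as 0. */
  static long estimateBytes(BigDecimal value) {
    // Assumption: the value travels as its decimal string, one byte per character.
    return value == null ? 0 : value.toString().length();
  }

  /** Approximate byte size of a NUMERIC array, summing the element sizes. */
  static long estimateArrayBytes(List<BigDecimal> values) {
    long total = 0;
    for (BigDecimal v : values) {
      total += estimateBytes(v);
    }
    return total;
  }

  public static void main(String[] args) {
    System.out.println(estimateBytes(new BigDecimal("123.45")));
  }
}
```

Without a branch like this for the NUMERIC type, a size estimator that switches on column type would have no case to handle a BigDecimal value.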
[jira] [Closed] (BEAM-6764) OutOfMemory Exception in Dataflow Spanner Write Mutations
[ https://issues.apache.org/jira/browse/BEAM-6764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick closed BEAM-6764. --- Fix Version/s: 2.23.0 Resolution: Fixed > OutOfMemory Exception in Dataflow Spanner Write Mutations > - > > Key: BEAM-6764 > URL: https://issues.apache.org/jira/browse/BEAM-6764 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.9.0, 2.10.0, 2.11.0 >Reporter: Joshua >Priority: P3 > Fix For: 2.23.0 > > > Since I upgraded my Apache Beam SDK version to >= 2.9.0, I have been noticing > OOM exceptions while using the Dataflow runner to write mutations to Spanner. > I have been using n1-standard-4 since version 2.9.0. On that version, it > works. But on higher versions, I get the exception. > > The Stackdriver log is provided below: > {code:java} > java.lang.OutOfMemoryError: Java heap space > java.util.ArrayList.<init>(ArrayList.java:152) > org.apache.beam.sdk.io.gcp.spanner.SpannerIO$GatherBundleAndSortFn.initSorter(SpannerIO.java:1056) > org.apache.beam.sdk.io.gcp.spanner.SpannerIO$GatherBundleAndSortFn.startBundle(SpannerIO.java:1049) > {code} > I have a very basic PTransform for writing to Spanner: > {code:java} > SpannerIO.write().withInstanceId(options.getSpannerInstanceId()) > .withDatabaseId(options.getSpannerDatabaseId()) > .withProjectId(options.getProject()) > .withFailureMode(SpannerIO.FailureMode.REPORT_FAILURES); > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (BEAM-6913) Reading data from Spanner never ends
[ https://issues.apache.org/jira/browse/BEAM-6913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick closed BEAM-6913. --- Fix Version/s: Not applicable Resolution: Cannot Reproduce > Reading data from Spanner never ends > > > Key: BEAM-6913 > URL: https://issues.apache.org/jira/browse/BEAM-6913 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.11.0 > Environment: macOS Mojave (10.14.3) >Reporter: Mousa HAMAD >Priority: P3 > Fix For: Not applicable > > > Whenever my pipeline reads from Spanner, the code runs infinitely. If I > update the spanner dependency (_com.google.cloud:google-cloud-spanner_) to > e.g., _1.11.0,_ then everything works as expected. > Consider the following simple pipeline, which never ends: > {code:java} > public class Prototype_Spanner { > private static String INSTANCE_ID = "XYZ"; > private static String DATABASE_ID = "test_beam"; > private static String TABLE_NAME = "item"; > private static void runExample() { > PipelineOptions options = PipelineOptionsFactory.create(); > options.setRunner(DirectRunner.class); > Pipeline pipeline = Pipeline.create(options); > pipeline > .apply("Read", SpannerIO.read() > .withInstanceId(INSTANCE_ID) > .withDatabaseId(DATABASE_ID) > .withTable(TABLE_NAME) > .withColumns("price")) > .apply("Extract Price", MapElements > .into(TypeDescriptors.longs()) > .via((Struct struct) -> struct.getLong("price"))) > .apply("Calculate Mean", Mean.globally()) > .apply("Map to string", MapElements > .into(TypeDescriptor.of(String.class)) > .via(Object::toString)) > .apply("Write", TextIO.write().to("/tmp/output")); > pipeline.run().waitUntilFinish(); > } > public static void main(String[] args) { > runExample(); > } > } > {code} > Following is my full list of dependencies: > {code:java} > repositories { > mavenCentral() > } > ext { > beamVersion = '2.11.0' > sparkVersion = '2.3.3' > } > dependencies { > compile "org.apache.beam:beam-sdks-java-core:$beamVersion" > 
compile > "org.apache.beam:beam-sdks-java-extensions-join-library:$beamVersion" > compile > "org.apache.beam:beam-sdks-java-extensions-google-cloud-platform-core:$beamVersion" > compile > "org.apache.beam:beam-sdks-java-io-google-cloud-platform:$beamVersion" > compile "org.apache.beam:beam-runners-core-java:$beamVersion" > compile "org.apache.beam:beam-runners-direct-java:$beamVersion" > compile "org.apache.beam:beam-runners-spark:$beamVersion" > compile "org.apache.spark:spark-core_2.11:$sparkVersion" > compile "org.apache.spark:spark-streaming_2.11:$sparkVersion" > // This line fixed the issue for me > // compile "com.google.cloud:google-cloud-spanner:1.11.0" > testCompile "junit:junit:4.12" > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (BEAM-7732) Allow access to SpannerOptions in Beam
[ https://issues.apache.org/jira/browse/BEAM-7732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick resolved BEAM-7732. - Fix Version/s: Not applicable Resolution: Won't Fix > Allow access to SpannerOptions in Beam > -- > > Key: BEAM-7732 > URL: https://issues.apache.org/jira/browse/BEAM-7732 > Project: Beam > Issue Type: Improvement > Components: io-java-gcp >Affects Versions: 2.12.0, 2.13.0 >Reporter: Niel Markwick >Assignee: Niel Markwick >Priority: P3 > Fix For: Not applicable > > Time Spent: 4h 10m > Remaining Estimate: 0h > > Beam hides the > [SpannerOptions|https://github.com/googleapis/google-cloud-java/blob/master/google-cloud-clients/google-cloud-spanner/src/main/java/com/google/cloud/spanner/SpannerOptions.java] > object behind a > [SpannerConfig|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/SpannerConfig.java] > object because the SpannerOptions object is not serializable. > This means that the only options that can be set are those that can be > specified in SpannerConfig - limited to host, project, instance, database. > Suggestion: add the possibility to set a SpannerOptionsFactory in > SpannerConfig: > {code:java} > public interface SpannerOptionsFactory extends Serializable { > public SpannerOptions create(); > } > {code} > This would allow the user use this factory class to specify custom > SpannerOptions before they are passed onto the connectToSpanner() method; > connectToSpanner() would then become: > {code:java} > public SpannerAccessor connectToSpanner() { > > SpannerOptions.Builder builder = spannerOptionsFactory.create().toBuilder(); > // rest of connectToSpanner follows, setting project, host, etc. > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (BEAM-10047) SpannerIO: Combine sorting and batching
[ https://issues.apache.org/jira/browse/BEAM-10047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick resolved BEAM-10047. -- Resolution: Fixed > SpannerIO: Combine sorting and batching > --- > > Key: BEAM-10047 > URL: https://issues.apache.org/jira/browse/BEAM-10047 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Affects Versions: 2.16.0, 2.17.0, 2.18.0, 2.19.0, 2.20.0, 2.21.0, 2.22.0 >Reporter: Brian Hulette >Assignee: Niel Markwick >Priority: P2 > Fix For: 2.23.0 > > Time Spent: 1h 50m > Remaining Estimate: 0h > > Having SpannerIO's grouping, sorting and batching as separate stages makes > no sense, as these stages will be fused in any runner that supports fusion. > In addition, during these stages, even with fusion, at least 3 full copies of > each mutation are made (serialized then deserialized objects), which leads to > extremely high memory usage. > > Combining these stages would significantly reduce memory usage and improve > performance. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (BEAM-10259) Spanner Session leak/overload in Streaming Dataflow
[ https://issues.apache.org/jira/browse/BEAM-10259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick resolved BEAM-10259. -- Resolution: Fixed > Spanner Session leak/overload in Streaming Dataflow > --- > > Key: BEAM-10259 > URL: https://issues.apache.org/jira/browse/BEAM-10259 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.18.0, 2.19.0, 2.21.0, 2.22.0 >Reporter: Niel Markwick >Assignee: Niel Markwick >Priority: P2 > Labels: dataflow, gcp, io > Fix For: 2.23.0 > > Time Spent: 2h 40m > Remaining Estimate: 0h > > SpannerIO.WriteToSpannerFn connects to Spanner every time @Setup is called, > and closes the connection every time @Teardown is called. > This actually creates a separate Spanner connection and session pool for each > WriteToSpannerFn, which generally speaking is one per thread. > In single-threaded runners (e.g. batch Dataflow on a single vCPU machine) this > is not an issue, as there is normally only one WriteToSpannerFn per > node/process. > In multi-threaded runners (e.g. streaming Dataflow, or batch on multiple CPU > machines), this can cause a problem with many session pools created (1 per > thread), which can cause a resource leak and is in general wasteful. > Spanner connections (and session pools) should be shared among all threads of > a single process, so that the connection is only opened and closed once. > [~alxavier] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-7732) Allow access to SpannerOptions in Beam
[ https://issues.apache.org/jira/browse/BEAM-7732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147255#comment-17147255 ] Niel Markwick commented on BEAM-7732: - Current philosophy of Beam is not to expose client library classes, so this will not be implemented. The commit deadline has been exposed using a withCommitDeadline() setter on SpannerIO.Write. > Allow access to SpannerOptions in Beam > -- > > Key: BEAM-7732 > URL: https://issues.apache.org/jira/browse/BEAM-7732 > Project: Beam > Issue Type: Improvement > Components: io-java-gcp >Affects Versions: 2.12.0, 2.13.0 >Reporter: Niel Markwick >Assignee: Niel Markwick >Priority: P3 > Time Spent: 4h 10m > Remaining Estimate: 0h > > Beam hides the > [SpannerOptions|https://github.com/googleapis/google-cloud-java/blob/master/google-cloud-clients/google-cloud-spanner/src/main/java/com/google/cloud/spanner/SpannerOptions.java] > object behind a > [SpannerConfig|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/SpannerConfig.java] > object because the SpannerOptions object is not serializable. > This means that the only options that can be set are those that can be > specified in SpannerConfig - limited to host, project, instance, database. > Suggestion: add the possibility to set a SpannerOptionsFactory in > SpannerConfig: > {code:java} > public interface SpannerOptionsFactory extends Serializable { > public SpannerOptions create(); > } > {code} > This would allow the user use this factory class to specify custom > SpannerOptions before they are passed onto the connectToSpanner() method; > connectToSpanner() would then become: > {code:java} > public SpannerAccessor connectToSpanner() { > > SpannerOptions.Builder builder = spannerOptionsFactory.create().toBuilder(); > // rest of connectToSpanner follows, setting project, host, etc. > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (BEAM-9269) Set shorter Commit Deadline and handle with backoff/retry
[ https://issues.apache.org/jira/browse/BEAM-9269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick closed BEAM-9269. --- Fix Version/s: (was: 2.20.0) 2.23.0 Resolution: Fixed > Set shorter Commit Deadline and handle with backoff/retry > - > > Key: BEAM-9269 > URL: https://issues.apache.org/jira/browse/BEAM-9269 > Project: Beam > Issue Type: Improvement > Components: io-java-gcp >Affects Versions: 2.16.0, 2.17.0, 2.18.0, 2.19.0 >Reporter: Niel Markwick >Assignee: Niel Markwick >Priority: P2 > Labels: google-cloud-spanner > Fix For: 2.23.0 > > Time Spent: 5h 20m > Remaining Estimate: 0h > > Default commit deadline in Spanner is 1hr, which can lead to a variety of > issues including database overload and session expiry. > Shorter deadline should be set with backoff/retry when deadline expires, so > that the Spanner database does not become overloaded. -- This message was sent by Atlassian Jira (v8.3.4#803005)
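The "shorter deadline with backoff/retry" idea in BEAM-9269 can be sketched with plain JDK types. Here java.util.concurrent.TimeoutException stands in for a deadline-exceeded error from the Spanner client, and the class name, method names and parameters are hypothetical; this is not the SpannerIO implementation.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeoutException;

final class CommitRetrySketch {

  /** Runs the commit, retrying with exponential backoff when its deadline expires. */
  static <T> T callWithBackoff(Callable<T> commit, int maxAttempts, long initialBackoffMillis)
      throws Exception {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be at least 1");
    }
    long backoff = initialBackoffMillis;
    for (int attempt = 1; ; attempt++) {
      try {
        return commit.call();
      } catch (TimeoutException e) {
        if (attempt >= maxAttempts) {
          throw e; // give up after the last allowed attempt
        }
        Thread.sleep(backoff); // wait before retrying so the database can recover
        backoff *= 2;
      }
    }
  }

  /** Demo: a commit that times out twice and then succeeds; returns the call count. */
  static int demoAttempts() {
    int[] calls = {0};
    try {
      callWithBackoff(
          () -> {
            if (++calls[0] < 3) {
              throw new TimeoutException("deadline exceeded");
            }
            return "committed";
          },
          5,
          1L);
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
    return calls[0];
  }

  public static void main(String[] args) {
    System.out.println("attempts: " + demoAttempts());
  }
}
```

Only the timeout is retried in this sketch; any other exception propagates immediately, which keeps genuine write failures visible.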
[jira] [Reopened] (BEAM-9269) Set shorter Commit Deadline and handle with backoff/retry
[ https://issues.apache.org/jira/browse/BEAM-9269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick reopened BEAM-9269: - Re-opening to convert RPC Interceptor code to use GRPC deadline configurations. > Set shorter Commit Deadline and handle with backoff/retry > - > > Key: BEAM-9269 > URL: https://issues.apache.org/jira/browse/BEAM-9269 > Project: Beam > Issue Type: Improvement > Components: io-java-gcp >Affects Versions: 2.16.0, 2.17.0, 2.18.0, 2.19.0 >Reporter: Niel Markwick >Assignee: Niel Markwick >Priority: P2 > Labels: google-cloud-spanner > Fix For: 2.20.0 > > Time Spent: 4h > Remaining Estimate: 0h > > Default commit deadline in Spanner is 1hr, which can lead to a variety of > issues including database overload and session expiry. > Shorter deadline should be set with backoff/retry when deadline expires, so > that the Spanner database does not become overloaded. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-10259) Spanner Session leak/overload in Streaming Dataflow
Niel Markwick created BEAM-10259: Summary: Spanner Session leak/overload in Streaming Dataflow Key: BEAM-10259 URL: https://issues.apache.org/jira/browse/BEAM-10259 Project: Beam Issue Type: Bug Components: io-java-gcp Affects Versions: 2.22.0, 2.21.0, 2.19.0, 2.18.0 Reporter: Niel Markwick Assignee: Niel Markwick Fix For: 2.23.0 SpannerIO.WriteToSpannerFn connects to Spanner every time @Setup is called, and closes the connection every time @Teardown is called. This actually creates a separate Spanner connection and session pool for each WriteToSpannerFn, which generally speaking is one per thread. In single-threaded runners (e.g. batch Dataflow on a single vCPU machine) this is not an issue, as there is normally only one WriteToSpannerFn per node/process. In multi-threaded runners (e.g. streaming Dataflow, or batch on multiple CPU machines), this can cause a problem with many session pools created (1 per thread), which can cause a resource leak and is in general wasteful. Spanner connections (and session pools) should be shared among all threads of a single process, so that the connection is only opened and closed once. [~alxavier] -- This message was sent by Atlassian Jira (v8.3.4#803005)
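One way to share a single connection among all threads of a process, as the issue proposes, is a reference-counted holder: the first acquire opens the connection and the last release closes it. The sketch below uses a hypothetical Connection class in place of the real Spanner client and is not the code that resolved this issue.

```java
final class SharedConnectionSketch {

  /** Hypothetical stand-in for an expensive client that owns a session pool. */
  static final class Connection {
    boolean open = true;

    void close() {
      open = false;
    }
  }

  private static Connection shared;
  private static int refCount;
  static int connectionsOpened; // counts how many real connections were created

  /** Called from each DoFn's @Setup: opens the connection only for the first caller. */
  static synchronized Connection acquire() {
    if (refCount == 0) {
      shared = new Connection();
      connectionsOpened++;
    }
    refCount++;
    return shared;
  }

  /** Called from each DoFn's @Teardown: closes the connection only for the last caller. */
  static synchronized void release() {
    if (refCount == 0) {
      throw new IllegalStateException("release() called without a matching acquire()");
    }
    if (--refCount == 0) {
      shared.close();
      shared = null;
    }
  }

  public static void main(String[] args) {
    Connection a = acquire();
    Connection b = acquire();
    System.out.println("same instance: " + (a == b) + ", opened: " + connectionsOpened);
    release();
    release();
    System.out.println("still open: " + a.open);
  }
}
```

With this shape, the number of session pools per process stays at one regardless of how many worker threads run the write function.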
[jira] [Resolved] (BEAM-6887) Streaming Spanner Writer transform
[ https://issues.apache.org/jira/browse/BEAM-6887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick resolved BEAM-6887. - Fix Version/s: 2.22.0 Resolution: Abandoned > Streaming Spanner Writer transform > -- > > Key: BEAM-6887 > URL: https://issues.apache.org/jira/browse/BEAM-6887 > Project: Beam > Issue Type: New Feature > Components: io-java-gcp >Reporter: Niel Markwick >Assignee: Niel Markwick >Priority: P3 > Fix For: 2.22.0 > > > At present, the SpannerIO.Write/WriteGrouped transforms work by collecting an > entire bundle of elements, sorting them by table/key, splitting the sorted list > into batches (by size and number of cells modified), and then writing each > batch to Spanner in a single transaction. > It returns a SpannerWriteResult.java containing: > # a PCollection (the main output) - which will have no elements but > will be closed to signal when all the input elements have been written (which > is never in streaming because input is unbounded) > # a PCollection of elements that failed to write. > This transform is useful as a bulk sink for data because it efficiently > writes large amounts of data. > It is not at all useful as an intermediate step in a streaming pipeline, > because it has no useful output in streaming mode. > I propose that we have a separate Spanner Write transform which simply writes > each input Mutation to the database, and then pushes successful Mutations > onto its output. > This would allow use in the middle of a streaming pipeline, where the flow > would be > * Some data streamed in > * Converted to Spanner Mutations > * Written to Spanner Database > * Further processing where the values written to the Spanner Database are > used. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-6887) Streaming Spanner Writer transform
[ https://issues.apache.org/jira/browse/BEAM-6887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135800#comment-17135800 ] Niel Markwick commented on BEAM-6887: - Meaningful output is covered in BEAM-6921 > Streaming Spanner Writer transform > -- > > Key: BEAM-6887 > URL: https://issues.apache.org/jira/browse/BEAM-6887 > Project: Beam > Issue Type: New Feature > Components: io-java-gcp >Reporter: Niel Markwick >Assignee: Niel Markwick >Priority: P3 > > At present, the SpannerIO.Write/WriteGrouped transforms work by collecting an > entire bundle of elements, sorts them by table/key, splitting the sorted list > into batches (by size and number of cells modified) and then writes each > batch to Spanner in a single transaction. > It returns a SpannerWriteResult.java containing : > # a PCollection (the main output) - which will have no elements but > will be closed to signal when all the input elements have been written (which > is never in streaming because input is unbounded) > # a PCollection of elements that failed to write. > This transform is useful as a bulk sink for data because it efficiently > writes large amounts of data. > It is not at all useful as an intermediate step in a streaming pipeline - > because it has no useful output in streaming mode. > I propose that we have a separate Spanner Write transform which simply writes > each input Mutation to the database, and then pushes successful Mutations > onto its output. > This would allow use in the middle of a streaming pipeline, where the flow > would be > * Some data streamed in > * Converted to Spanner Mutations > * Written to Spanner Database > * Further processing where the values written to the Spanner Database are > used. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-6671) Beam 2.9.0 java.lang.NoSuchFieldError: internal_static_google_rpc_LocalizedMessage_fieldAccessorTable
[ https://issues.apache.org/jira/browse/BEAM-6671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135799#comment-17135799 ] Niel Markwick commented on BEAM-6671: - 2.19.0 has updated the Spanner client library version to 1.49.0, and there have been no more recent reports of this. > Beam 2.9.0 java.lang.NoSuchFieldError: > internal_static_google_rpc_LocalizedMessage_fieldAccessorTable > - > > Key: BEAM-6671 > URL: https://issues.apache.org/jira/browse/BEAM-6671 > Project: Beam > Issue Type: New Feature > Components: build-system >Reporter: Alex Amato >Priority: P2 > Labels: stale-P2 > > I received a report from a Dataflow user encountering this in Beam 2.9.0 when > creating a spanner instance. I wanted to post this here as this is known to > be related to dependency conflicts in the past > ([https://stackoverflow.com/questions/46684071/error-using-spannerio-in-apache-beam]). > > java.lang.NoSuchFieldError: > internal_static_google_rpc_LocalizedMessage_fieldAccessorTable > at > com.google.rpc.LocalizedMessage.internalGetFieldAccessorTable(LocalizedMessage.java:90) > at > com.google.protobuf.GeneratedMessageV3.getDescriptorForType(GeneratedMessageV3.java:121) > at io.grpc.protobuf.ProtoUtils.keyForProto(ProtoUtils.java:67) > at > com.google.cloud.spanner.spi.v1.SpannerErrorInterceptor.(SpannerErrorInterceptor.java:47) > at > com.google.cloud.spanner.spi.v1.GrpcSpannerRpc.(GrpcSpannerRpc.java:136) > at > com.google.cloud.spanner.SpannerOptions$DefaultSpannerRpcFactory.create(SpannerOptions.java:73) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (BEAM-6671) Beam 2.9.0 java.lang.NoSuchFieldError: internal_static_google_rpc_LocalizedMessage_fieldAccessorTable
[ https://issues.apache.org/jira/browse/BEAM-6671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick closed BEAM-6671. --- Fix Version/s: 2.19.0 Resolution: Cannot Reproduce > Beam 2.9.0 java.lang.NoSuchFieldError: > internal_static_google_rpc_LocalizedMessage_fieldAccessorTable > - > > Key: BEAM-6671 > URL: https://issues.apache.org/jira/browse/BEAM-6671 > Project: Beam > Issue Type: New Feature > Components: build-system >Reporter: Alex Amato >Priority: P2 > Labels: stale-P2 > Fix For: 2.19.0 > > > I received a report from a Dataflow user encountering this in Beam 2.9.0 when > creating a spanner instance. I wanted to post this here as this is known to > be related to dependency conflicts in the past > ([https://stackoverflow.com/questions/46684071/error-using-spannerio-in-apache-beam]). > > java.lang.NoSuchFieldError: > internal_static_google_rpc_LocalizedMessage_fieldAccessorTable > at > com.google.rpc.LocalizedMessage.internalGetFieldAccessorTable(LocalizedMessage.java:90) > at > com.google.protobuf.GeneratedMessageV3.getDescriptorForType(GeneratedMessageV3.java:121) > at io.grpc.protobuf.ProtoUtils.keyForProto(ProtoUtils.java:67) > at > com.google.cloud.spanner.spi.v1.SpannerErrorInterceptor.(SpannerErrorInterceptor.java:47) > at > com.google.cloud.spanner.spi.v1.GrpcSpannerRpc.(GrpcSpannerRpc.java:136) > at > com.google.cloud.spanner.SpannerOptions$DefaultSpannerRpcFactory.create(SpannerOptions.java:73) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-6320) SpannerReadIT.testQuery flaky
[ https://issues.apache.org/jira/browse/BEAM-6320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135795#comment-17135795 ] Niel Markwick commented on BEAM-6320: - Duplicate of BEAM-2538 > SpannerReadIT.testQuery flaky > - > > Key: BEAM-6320 > URL: https://issues.apache.org/jira/browse/BEAM-6320 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Reporter: Andrew Pilloud >Priority: P2 > Labels: stale-P2 > Fix For: Not applicable > > > https://builds.apache.org/job/beam_PostCommit_Java/2218/ > {code} > WARNING: No terminal state was returned. State value UNKNOWN > Dec 27, 2018 9:08:13 PM org.apache.beam.runners.dataflow.TestDataflowRunner > checkForPAssertSuccess > WARNING: Metrics not present for Dataflow job > 2018-12-27_13_02_39-18037927821693074732. > Dec 27, 2018 9:08:13 PM org.apache.beam.runners.dataflow.TestDataflowRunner > run > WARNING: Dataflow job 2018-12-27_13_02_39-18037927821693074732 did not output > a success or failure metric. > {code} > https://pantheon.corp.google.com/dataflow/jobsDetail/locations/us-central1/jobs/2018-12-27_13_02_39-18037927821693074732?project=apache-beam-testing > {code} > com.google.cloud.spanner.SpannerException: NOT_FOUND: > io.grpc.StatusRuntimeException: NOT_FOUND: Database not found: > projects/apache-beam-testing/instances/beam-test/databases/beam-testdb-vkf2iqc72tevvroaop > resource_type: "type.googleapis.com/google.spanner.admin.database.v1.Database" > resource_name: > "projects/apache-beam-testing/instances/beam-test/databases/beam-testdb-vkf2iqc72tevvroaop" > description: "Database does not exist." 
> at > com.google.cloud.spanner.SpannerExceptionFactory.newSpannerExceptionPreformatted(SpannerExceptionFactory.java:119) > at > com.google.cloud.spanner.SpannerExceptionFactory.newSpannerException(SpannerExceptionFactory.java:43) > at > com.google.cloud.spanner.SpannerExceptionFactory.newSpannerException(SpannerExceptionFactory.java:80) > at > com.google.cloud.spanner.spi.v1.GrpcSpannerRpc.get(GrpcSpannerRpc.java:456) > at > com.google.cloud.spanner.spi.v1.GrpcSpannerRpc.createSession(GrpcSpannerRpc.java:350) > at com.google.cloud.spanner.SpannerImpl$2.call(SpannerImpl.java:258) > at com.google.cloud.spanner.SpannerImpl$2.call(SpannerImpl.java:255) > at > com.google.cloud.spanner.SpannerImpl.runWithRetries(SpannerImpl.java:227) > at > com.google.cloud.spanner.SpannerImpl.createSession(SpannerImpl.java:254) > at > com.google.cloud.spanner.BatchClientImpl.batchReadOnlyTransaction(BatchClientImpl.java:51) > at > org.apache.beam.sdk.io.gcp.spanner.CreateTransactionFn.processElement(CreateTransactionFn.java:47) > Caused by: java.util.concurrent.ExecutionException: > io.grpc.StatusRuntimeException: NOT_FOUND: Database not found: > projects/apache-beam-testing/instances/beam-test/databases/beam-testdb-vkf2iqc72tevvroaop > resource_type: "type.googleapis.com/google.spanner.admin.database.v1.Database" > resource_name: > "projects/apache-beam-testing/instances/beam-test/databases/beam-testdb-vkf2iqc72tevvroaop" > description: "Database does not exist." 
> at > com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:500) > at > com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:479) > at > com.google.cloud.spanner.spi.v1.GrpcSpannerRpc.get(GrpcSpannerRpc.java:450) > at > com.google.cloud.spanner.spi.v1.GrpcSpannerRpc.createSession(GrpcSpannerRpc.java:350) > at com.google.cloud.spanner.SpannerImpl$2.call(SpannerImpl.java:258) > at com.google.cloud.spanner.SpannerImpl$2.call(SpannerImpl.java:255) > at > com.google.cloud.spanner.SpannerImpl.runWithRetries(SpannerImpl.java:227) > at > com.google.cloud.spanner.SpannerImpl.createSession(SpannerImpl.java:254) > at > com.google.cloud.spanner.BatchClientImpl.batchReadOnlyTransaction(BatchClientImpl.java:51) > at > org.apache.beam.sdk.io.gcp.spanner.CreateTransactionFn.processElement(CreateTransactionFn.java:47) > at > org.apache.beam.sdk.io.gcp.spanner.CreateTransactionFn$DoFnInvoker.invokeProcessElement(Unknown > Source) > at > org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:275) > at > org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:240) > at > org.apache.beam.runners.dataflow.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:325) > at > org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44) > at > org.apache.beam.runners.dataflo
[jira] [Closed] (BEAM-6320) SpannerReadIT.testQuery flaky
[ https://issues.apache.org/jira/browse/BEAM-6320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick closed BEAM-6320. --- Fix Version/s: Not applicable Resolution: Duplicate > SpannerReadIT.testQuery flaky > - > > Key: BEAM-6320 > URL: https://issues.apache.org/jira/browse/BEAM-6320 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Reporter: Andrew Pilloud >Priority: P2 > Labels: stale-P2 > Fix For: Not applicable > > > https://builds.apache.org/job/beam_PostCommit_Java/2218/ > {code} > WARNING: No terminal state was returned. State value UNKNOWN > Dec 27, 2018 9:08:13 PM org.apache.beam.runners.dataflow.TestDataflowRunner > checkForPAssertSuccess > WARNING: Metrics not present for Dataflow job > 2018-12-27_13_02_39-18037927821693074732. > Dec 27, 2018 9:08:13 PM org.apache.beam.runners.dataflow.TestDataflowRunner > run > WARNING: Dataflow job 2018-12-27_13_02_39-18037927821693074732 did not output > a success or failure metric. > {code} > https://pantheon.corp.google.com/dataflow/jobsDetail/locations/us-central1/jobs/2018-12-27_13_02_39-18037927821693074732?project=apache-beam-testing > {code} > com.google.cloud.spanner.SpannerException: NOT_FOUND: > io.grpc.StatusRuntimeException: NOT_FOUND: Database not found: > projects/apache-beam-testing/instances/beam-test/databases/beam-testdb-vkf2iqc72tevvroaop > resource_type: "type.googleapis.com/google.spanner.admin.database.v1.Database" > resource_name: > "projects/apache-beam-testing/instances/beam-test/databases/beam-testdb-vkf2iqc72tevvroaop" > description: "Database does not exist." 
> at > com.google.cloud.spanner.SpannerExceptionFactory.newSpannerExceptionPreformatted(SpannerExceptionFactory.java:119) > at > com.google.cloud.spanner.SpannerExceptionFactory.newSpannerException(SpannerExceptionFactory.java:43) > at > com.google.cloud.spanner.SpannerExceptionFactory.newSpannerException(SpannerExceptionFactory.java:80) > at > com.google.cloud.spanner.spi.v1.GrpcSpannerRpc.get(GrpcSpannerRpc.java:456) > at > com.google.cloud.spanner.spi.v1.GrpcSpannerRpc.createSession(GrpcSpannerRpc.java:350) > at com.google.cloud.spanner.SpannerImpl$2.call(SpannerImpl.java:258) > at com.google.cloud.spanner.SpannerImpl$2.call(SpannerImpl.java:255) > at > com.google.cloud.spanner.SpannerImpl.runWithRetries(SpannerImpl.java:227) > at > com.google.cloud.spanner.SpannerImpl.createSession(SpannerImpl.java:254) > at > com.google.cloud.spanner.BatchClientImpl.batchReadOnlyTransaction(BatchClientImpl.java:51) > at > org.apache.beam.sdk.io.gcp.spanner.CreateTransactionFn.processElement(CreateTransactionFn.java:47) > Caused by: java.util.concurrent.ExecutionException: > io.grpc.StatusRuntimeException: NOT_FOUND: Database not found: > projects/apache-beam-testing/instances/beam-test/databases/beam-testdb-vkf2iqc72tevvroaop > resource_type: "type.googleapis.com/google.spanner.admin.database.v1.Database" > resource_name: > "projects/apache-beam-testing/instances/beam-test/databases/beam-testdb-vkf2iqc72tevvroaop" > description: "Database does not exist." 
> at > com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:500) > at > com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:479) > at > com.google.cloud.spanner.spi.v1.GrpcSpannerRpc.get(GrpcSpannerRpc.java:450) > at > com.google.cloud.spanner.spi.v1.GrpcSpannerRpc.createSession(GrpcSpannerRpc.java:350) > at com.google.cloud.spanner.SpannerImpl$2.call(SpannerImpl.java:258) > at com.google.cloud.spanner.SpannerImpl$2.call(SpannerImpl.java:255) > at > com.google.cloud.spanner.SpannerImpl.runWithRetries(SpannerImpl.java:227) > at > com.google.cloud.spanner.SpannerImpl.createSession(SpannerImpl.java:254) > at > com.google.cloud.spanner.BatchClientImpl.batchReadOnlyTransaction(BatchClientImpl.java:51) > at > org.apache.beam.sdk.io.gcp.spanner.CreateTransactionFn.processElement(CreateTransactionFn.java:47) > at > org.apache.beam.sdk.io.gcp.spanner.CreateTransactionFn$DoFnInvoker.invokeProcessElement(Unknown > Source) > at > org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:275) > at > org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:240) > at > org.apache.beam.runners.dataflow.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:325) > at > org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44) > at > org.apache.beam.runners.dataflow.worker.util.common
[jira] [Commented] (BEAM-7732) Allow access to SpannerOptions in Beam
[ https://issues.apache.org/jira/browse/BEAM-7732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135793#comment-17135793 ] Niel Markwick commented on BEAM-7732: - There is still no way to specify custom SpannerOptions in Beam. There have been a couple of related recent issues, and I am going to revisit the issue shortly. > Allow access to SpannerOptions in Beam > -- > > Key: BEAM-7732 > URL: https://issues.apache.org/jira/browse/BEAM-7732 > Project: Beam > Issue Type: Improvement > Components: io-java-gcp >Affects Versions: 2.12.0, 2.13.0 >Reporter: Niel Markwick >Assignee: Niel Markwick >Priority: P3 > Time Spent: 4h 10m > Remaining Estimate: 0h > > Beam hides the > [SpannerOptions|https://github.com/googleapis/google-cloud-java/blob/master/google-cloud-clients/google-cloud-spanner/src/main/java/com/google/cloud/spanner/SpannerOptions.java] > object behind a > [SpannerConfig|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/SpannerConfig.java] > object because the SpannerOptions object is not serializable. > This means that the only options that can be set are those that can be > specified in SpannerConfig - limited to host, project, instance, database. > Suggestion: add the possibility to set a SpannerOptionsFactory in > SpannerConfig: > {code:java} > public interface SpannerOptionsFactory extends Serializable { > public SpannerOptions create(); > } > {code} > This would allow the user use this factory class to specify custom > SpannerOptions before they are passed onto the connectToSpanner() method; > connectToSpanner() would then become: > {code:java} > public SpannerAccessor connectToSpanner() { > > SpannerOptions.Builder builder = spannerOptionsFactory.create().toBuilder(); > // rest of connectToSpanner follows, setting project, host, etc. > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (BEAM-7732) Allow access to SpannerOptions in Beam
[ https://issues.apache.org/jira/browse/BEAM-7732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick reassigned BEAM-7732: --- Assignee: Niel Markwick > Allow access to SpannerOptions in Beam > -- > > Key: BEAM-7732 > URL: https://issues.apache.org/jira/browse/BEAM-7732 > Project: Beam > Issue Type: Improvement > Components: io-java-gcp >Affects Versions: 2.12.0, 2.13.0 >Reporter: Niel Markwick >Assignee: Niel Markwick >Priority: P3 > Time Spent: 4h 10m > Remaining Estimate: 0h > > Beam hides the > [SpannerOptions|https://github.com/googleapis/google-cloud-java/blob/master/google-cloud-clients/google-cloud-spanner/src/main/java/com/google/cloud/spanner/SpannerOptions.java] > object behind a > [SpannerConfig|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/SpannerConfig.java] > object because the SpannerOptions object is not serializable. > This means that the only options that can be set are those that can be > specified in SpannerConfig - limited to host, project, instance, database. > Suggestion: add the possibility to set a SpannerOptionsFactory in > SpannerConfig: > {code:java} > public interface SpannerOptionsFactory extends Serializable { > public SpannerOptions create(); > } > {code} > This would allow the user use this factory class to specify custom > SpannerOptions before they are passed onto the connectToSpanner() method; > connectToSpanner() would then become: > {code:java} > public SpannerAccessor connectToSpanner() { > > SpannerOptions.Builder builder = spannerOptionsFactory.create().toBuilder(); > // rest of connectToSpanner follows, setting project, host, etc. > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
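The serializable-factory idea suggested in BEAM-7732 can be sketched without the Spanner client library. Everything below is a hypothetical stand-in: ClientOptions plays the role of the non-serializable SpannerOptions, ClientOptionsFactory mirrors the proposed SpannerOptionsFactory, and roundTrip only imitates what a runner does when it ships a transform to a worker. It is an illustration of the pattern, not Beam's or Spanner's actual API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class OptionsFactoryDemo {
  /** Stand-in for an options object that is NOT serializable (hypothetical). */
  static final class ClientOptions {
    final String host;
    final int maxSessions;

    ClientOptions(String host, int maxSessions) {
      this.host = host;
      this.maxSessions = maxSessions;
    }
  }

  /** Mirrors the interface proposed in the ticket: serializable, builds options on demand. */
  interface ClientOptionsFactory extends Serializable {
    ClientOptions create();
  }

  /** Carries only plain serializable fields and builds the options object on the worker. */
  static final class FixedFactory implements ClientOptionsFactory {
    private static final long serialVersionUID = 1L;
    private final String host;
    private final int maxSessions;

    FixedFactory(String host, int maxSessions) {
      this.host = host;
      this.maxSessions = maxSessions;
    }

    @Override
    public ClientOptions create() {
      return new ClientOptions(host, maxSessions);
    }
  }

  /** Serializes and deserializes a value, as a runner would when distributing it. */
  static <T extends Serializable> T roundTrip(T value) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
        out.writeObject(value);
      }
      try (ObjectInputStream in =
          new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
        @SuppressWarnings("unchecked")
        T copy = (T) in.readObject();
        return copy;
      }
    } catch (IOException | ClassNotFoundException e) {
      throw new IllegalStateException("Serialization round trip failed", e);
    }
  }
}
```

The design point is that only the factory crosses the serialization boundary; the options object itself is created after deserialization, which is why it does not need to be serializable.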
[jira] [Commented] (BEAM-6913) Reading data from Spanner never ends
[ https://issues.apache.org/jira/browse/BEAM-6913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135791#comment-17135791 ] Niel Markwick commented on BEAM-6913: - The Spanner client library has been updated to 1.49 in Beam, so this issue should now be resolved. > Reading data from Spanner never ends > > > Key: BEAM-6913 > URL: https://issues.apache.org/jira/browse/BEAM-6913 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.11.0 > Environment: macOS Mojave (10.14.3) >Reporter: Mousa HAMAD >Priority: P2 > Labels: stale-P2 > > Whenever my pipeline reads from Spanner, the code runs infinitely. If I > update the spanner dependency (_com.google.cloud:google-cloud-spanner_) to > e.g., _1.11.0,_ then everything works as expected. > Consider the following simple pipeline, which never ends: > {code:java} > public class Prototype_Spanner { > private static String INSTANCE_ID = "XYZ"; > private static String DATABASE_ID = "test_beam"; > private static String TABLE_NAME = "item"; > private static void runExample() { > PipelineOptions options = PipelineOptionsFactory.create(); > options.setRunner(DirectRunner.class); > Pipeline pipeline = Pipeline.create(options); > pipeline > .apply("Read", SpannerIO.read() > .withInstanceId(INSTANCE_ID) > .withDatabaseId(DATABASE_ID) > .withTable(TABLE_NAME) > .withColumns("price")) > .apply("Extract Price", MapElements > .into(TypeDescriptors.longs()) > .via((Struct struct) -> struct.getLong("price"))) > .apply("Calculate Mean", Mean.globally()) > .apply("Map to string", MapElements > .into(TypeDescriptor.of(String.class)) > .via(Object::toString)) > .apply("Write", TextIO.write().to("/tmp/output")); > pipeline.run().waitUntilFinish(); > } > public static void main(String[] args) { > runExample(); > } > } > {code} > Following is my full list of dependencies: > {code:java} > repositories { > mavenCentral() > } > ext { > beamVersion = '2.11.0' > sparkVersion = '2.3.3' > } > 
dependencies { > compile "org.apache.beam:beam-sdks-java-core:$beamVersion" > compile > "org.apache.beam:beam-sdks-java-extensions-join-library:$beamVersion" > compile > "org.apache.beam:beam-sdks-java-extensions-google-cloud-platform-core:$beamVersion" > compile > "org.apache.beam:beam-sdks-java-io-google-cloud-platform:$beamVersion" > compile "org.apache.beam:beam-runners-core-java:$beamVersion" > compile "org.apache.beam:beam-runners-direct-java:$beamVersion" > compile "org.apache.beam:beam-runners-spark:$beamVersion" > compile "org.apache.spark:spark-core_2.11:$sparkVersion" > compile "org.apache.spark:spark-streaming_2.11:$sparkVersion" > // This line fixed the issue for me > // compile "com.google.cloud:google-cloud-spanner:1.11.0" > testCompile "junit:junit:4.12" > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-6764) OutOfMemory Exception in Dataflow Spanner Write Mutations
[ https://issues.apache.org/jira/browse/BEAM-6764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135789#comment-17135789 ] Niel Markwick commented on BEAM-6764: - Since 2.9.0, streaming support has been improved dramatically to use fewer resources and have lower latency. I believe this issue is fixed in 2.22. > OutOfMemory Exception in Dataflow Spanner Write Mutations > - > > Key: BEAM-6764 > URL: https://issues.apache.org/jira/browse/BEAM-6764 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.9.0, 2.10.0, 2.11.0 >Reporter: Joshua >Priority: P2 > Labels: stale-P2 > > Since I upgraded my apache beam sdk version to >= 2.9.0, I have been noticing > OOM exceptions while using the dataflow runner to write mutations to spanner. > I have been using n1-standard-4 since version 2.9.0. On that version, it > works. But on higher versions, I get the exception. > > The stackdriver logs is provided below > {code:java} > java.lang.OutOfMemoryError: Java heap space > java.util.ArrayList.(ArrayList.java:152) > org.apache.beam.sdk.io.gcp.spanner.SpannerIO$GatherBundleAndSortFn.initSorter(SpannerIO.java:1056) > org.apache.beam.sdk.io.gcp.spanner.SpannerIO$GatherBundleAndSortFn.startBundle(SpannerIO.java:1049) > {code} > I have a very basic PTransform for writing to Spanner: > {code:java} > SpannerIO.write().withInstanceId(options.getSpannerInstanceId()) > .withDatabaseId(options.getSpannerDatabaseId()) > .withProjectId(options.getProject()) > .withFailureMode(SpannerIO.FailureMode.REPORT_FAILURES); > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-6887) Streaming Spanner Writer transform
[ https://issues.apache.org/jira/browse/BEAM-6887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135787#comment-17135787 ] Niel Markwick commented on BEAM-6887: - https://github.com/apache/beam/pull/11532 changes the defaults for unbounded sources so that the high-latency grouping operation is disabled by default. In addition, https://github.com/apache/beam/pull/11529 simplifies the pipeline when batching is disabled. Therefore there is no longer a need for a separate SimpleWrite/WriteBulk. This FR still applies, though, for having meaningful output for successful mutations. > Streaming Spanner Writer transform > -- > > Key: BEAM-6887 > URL: https://issues.apache.org/jira/browse/BEAM-6887 > Project: Beam > Issue Type: New Feature > Components: io-java-gcp >Reporter: Niel Markwick >Priority: P3 > > At present, the SpannerIO.Write/WriteGrouped transforms work by collecting an > entire bundle of elements, sorts them by table/key, splitting the sorted list > into batches (by size and number of cells modified) and then writes each > batch to Spanner in a single transaction. > It returns a SpannerWriteResult.java containing : > # a PCollection (the main output) - which will have no elements but > will be closed to signal when all the input elements have been written (which > is never in streaming because input is unbounded) > # a PCollection of elements that failed to write. > This transform is useful as a bulk sink for data because it efficiently > writes large amounts of data. > It is not at all useful as an intermediate step in a streaming pipeline - > because it has no useful output in streaming mode. > I propose that we have a separate Spanner Write transform which simply writes > each input Mutation to the database, and then pushes successful Mutations > onto its output. 
> This would allow use in the middle of a streaming pipeline, where the flow > would be > * Some data streamed in > * Converted to Spanner Mutations > * Written to Spanner Database > * Further processing where the values written to the Spanner Database are > used. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (BEAM-6887) Streaming Spanner Writer transform
[ https://issues.apache.org/jira/browse/BEAM-6887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick reassigned BEAM-6887: --- Assignee: Niel Markwick > Streaming Spanner Writer transform > -- > > Key: BEAM-6887 > URL: https://issues.apache.org/jira/browse/BEAM-6887 > Project: Beam > Issue Type: New Feature > Components: io-java-gcp >Reporter: Niel Markwick >Assignee: Niel Markwick >Priority: P3 > > At present, the SpannerIO.Write/WriteGrouped transforms work by collecting an > entire bundle of elements, sorts them by table/key, splitting the sorted list > into batches (by size and number of cells modified) and then writes each > batch to Spanner in a single transaction. > It returns a SpannerWriteResult.java containing : > # a PCollection (the main output) - which will have no elements but > will be closed to signal when all the input elements have been written (which > is never in streaming because input is unbounded) > # a PCollection of elements that failed to write. > This transform is useful as a bulk sink for data because it efficiently > writes large amounts of data. > It is not at all useful as an intermediate step in a streaming pipeline - > because it has no useful output in streaming mode. > I propose that we have a separate Spanner Write transform which simply writes > each input Mutation to the database, and then pushes successful Mutations > onto its output. > This would allow use in the middle of a streaming pipeline, where the flow > would be > * Some data streamed in > * Converted to Spanner Mutations > * Written to Spanner Database > * Further processing where the values written to the Spanner Database are > used. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (BEAM-2837) Writing To Spanner From Google Cloud DataFlow - Failure
[ https://issues.apache.org/jira/browse/BEAM-2837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick closed BEAM-2837. --- Fix Version/s: 2.2.0 Assignee: (was: Mairbek Khadikov) Resolution: Fixed This issue was fixed by the inclusion of SpannerIO in Beam a long while ago! > Writing To Spanner From Google Cloud DataFlow - Failure > --- > > Key: BEAM-2837 > URL: https://issues.apache.org/jira/browse/BEAM-2837 > Project: Beam > Issue Type: Bug > Components: runner-dataflow >Affects Versions: 2.1.0, 2.4.0 > Environment: Google Cloud DataFlow >Reporter: Al Yaros >Priority: P2 > Labels: stale-assigned > Fix For: 2.2.0 > > > Simple Java Program That reads from Pub\Sub and Writes to Spanner Fails with > cryptic error message. > Simple Program to Demonstrate the Error: > [https://github.com/alyaros/ExamplePubSubToSpannerViaDataFlow] > {code:java} > *Caused by: org.apache.beam.sdk.util.UserCodeException: > java.lang.NoClassDefFoundError: Could not initialize class > com.google.cloud.spanner.spi.v1.SpannerErrorInterceptor > > org.apache.beam.sdk.util.UserCodeException.wrap(UserCodeException.java:36) > org.apache.beam.sdk.io. 
> gcp.spanner.SpannerWriteGroupFn$DoFnInvoker.invokeSetup(Unknown Source) > > com.google.cloud.dataflow.worker.DoFnInstanceManagers$ConcurrentQueueInstanceManager.deserializeCopy(DoFnInstanceManagers.java:66) > > com.google.cloud.dataflow.worker.DoFnInstanceManagers$ConcurrentQueueInstanceManager.peek(DoFnInstanceManagers.java:48) > > com.google.cloud.dataflow.worker.UserParDoFnFactory.create(UserParDoFnFactory.java:104) > > com.google.cloud.dataflow.worker.DefaultParDoFnFactory.create(DefaultParDoFnFactory.java:66) > > com.google.cloud.dataflow.worker.MapTaskExecutorFactory.createParDoOperation(MapTaskExecutorFactory.java:360) > > com.google.cloud.dataflow.worker.MapTaskExecutorFactory$3.typedApply(MapTaskExecutorFactory.java:271) > > com.google.cloud.dataflow.worker.MapTaskExecutorFactory$3.typedApply(MapTaskExecutorFactory.java:253) > > com.google.cloud.dataflow.worker.graph.Networks$TypeSafeNodeFunction.apply(Networks.java:55) > > com.google.cloud.dataflow.worker.graph.Networks$TypeSafeNodeFunction.apply(Networks.java:43) > > com.google.cloud.dataflow.worker.graph.Networks.replaceDirectedNetworkNodes(Networks.java:78) > > com.google.cloud.dataflow.worker.MapTaskExecutorFactory.create(MapTaskExecutorFactory.java:142) > > com.google.cloud.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:925) > > com.google.cloud.dataflow.worker.StreamingDataflowWorker.access$800(StreamingDataflowWorker.java:133) > > com.google.cloud.dataflow.worker.StreamingDataflowWorker$7.run(StreamingDataflowWorker.java:771) > > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > java.lang.Thread.run(Thread.java:745)* > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-10047) SpannerIO: Combine sorting and batching
[ https://issues.apache.org/jira/browse/BEAM-10047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick updated BEAM-10047: - Fix Version/s: 2.23.0 Affects Version/s: 2.16.0 2.17.0 2.18.0 2.19.0 2.20.0 2.21.0 2.22.0 Description: having Spanner IO's grouping, sorting and batching as separate stages makes no sense as these stages will be fused in any runner that supports fusion. In addition during these stages, even with fusion, at least 3 full copies of each mutation are made (serialized then deserialized objects), which leads to an extremely large use of memory. Combining these stages would significantly reduce memory usage, and improve performance. > SpannerIO: Combine sorting and batching > --- > > Key: BEAM-10047 > URL: https://issues.apache.org/jira/browse/BEAM-10047 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Affects Versions: 2.16.0, 2.17.0, 2.18.0, 2.19.0, 2.20.0, 2.21.0, 2.22.0 >Reporter: Brian Hulette >Assignee: Niel Markwick >Priority: P2 > Fix For: 2.23.0 > > > having Spanner IO's grouping, sorting and batching as separate stages makes > no sense as these stages will be fused in any runner that supports fusion. > In addition during these stages, even with fusion, at least 3 full copies of > each mutation are made (serialized then deserialized objects), which leads to > an extremely large use of memory. > > Combining these stages would significantly reduce memory usage, and improve > performance. -- This message was sent by Atlassian Jira (v8.3.4#803005)
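The effect of combining the stages described in BEAM-10047 can be sketched in plain Java: a single method sorts the items and cuts batches in the same pass, with no intermediate serialized copies between stages. The Item class, its bytes/cells fields and the two limits are hypothetical stand-ins for Spanner mutations and SpannerIO's batch limits; this is an illustration of the idea and not the code from the actual change.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class SortAndBatch {
  /** Hypothetical stand-in for a mutation: a sort key plus its size and cell count. */
  static final class Item {
    final String key;
    final long bytes;
    final long cells;

    Item(String key, long bytes, long cells) {
      this.key = key;
      this.bytes = bytes;
      this.cells = cells;
    }
  }

  /**
   * Sorts items by key, then splits the sorted list into batches that stay within
   * both limits. An item larger than a limit still gets a batch of its own.
   */
  static List<List<Item>> sortAndBatch(List<Item> input, long maxBytes, long maxCells) {
    List<Item> sorted = new ArrayList<>(input);
    sorted.sort(Comparator.comparing((Item item) -> item.key));

    List<List<Item>> batches = new ArrayList<>();
    List<Item> current = new ArrayList<>();
    long bytes = 0;
    long cells = 0;
    for (Item item : sorted) {
      boolean overflow = bytes + item.bytes > maxBytes || cells + item.cells > maxCells;
      if (overflow && !current.isEmpty()) {
        // Close the current batch before it would exceed a limit.
        batches.add(current);
        current = new ArrayList<>();
        bytes = 0;
        cells = 0;
      }
      current.add(item);
      bytes += item.bytes;
      cells += item.cells;
    }
    if (!current.isEmpty()) {
      batches.add(current);
    }
    return batches;
  }
}
```

Because the items are only referenced, never re-encoded, between the sort and the batching loop, memory use in this sketch is bounded by one copy of the list plus the batch lists themselves.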
[jira] [Resolved] (BEAM-6407) regression: FileIO.writeDynamic() with side inputs fails in DirectRunner
[ https://issues.apache.org/jira/browse/BEAM-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick resolved BEAM-6407. - Resolution: Cannot Reproduce I cannot reproduce the issue in Beam versions 2.10 -> 2.21. I assume it was fixed in 2.10. > regression: FileIO.writeDynamic() with side inputs fails in DirectRunner > > > Key: BEAM-6407 > URL: https://issues.apache.org/jira/browse/BEAM-6407 > Project: Beam > Issue Type: Bug > Components: runner-direct >Affects Versions: 2.9.0 >Reporter: Niel Markwick >Assignee: Niel Markwick >Priority: P2 > Labels: regression, stale-assigned > Fix For: 2.10.0 > > Attachments: beam-filewriter-demo.tgz > > Time Spent: 2.5h > Remaining Estimate: 0h > > When FileIO.writeDynamic is used with automatic sharding and a Contextful.Fn > that uses side inputs for the file naming, DirectRunner (and TestPipeline) > fail with: > {{java.lang.IllegalStateException: All PCollectionViews that are consumed > must be written by some WriteView PTransform: Missing [ > [RunnerPCollectionView]]}} > > Example code: > {code:java} > PCollectionView<String> outputFileName = > pipeline.apply( > "outputDir", > Create.of("/tmp/testout")).apply(View.asSingleton()); > Contextful.Fn<String, FileIO.Write.FileNaming> manifestNaming = > (element, c) -> > (window, pane, numShards, shardIndex, compression) -> > c.sideInput(outputFileName)+shardIndex; > pipeline.apply(FileIO.<String, String>writeDynamic() > .by(SerializableFunctions.constant("")) > .withDestinationCoder(StringUtf8Coder.of()) > .via(TextIO.sink()) > .withTempDirectory("/tmp") > .withNaming(Contextful.of( > manifestNaming, > Requirements.requiresSideInputs(outputFileName)))); > {code} > > This does not occur in Dataflow-runner > It does not occur if the Contextful.Fn is not given side inputs. > It does not occur if withNumShards(1) is set. 
> It did not occur in 2.8.0, but does in 2.9.0 and 2.10.0-SNAPSHOT (as of today) > > The cause appears to be the DirectRunner using TransformOverrides > to rewrite FileIO sinks to use runner-determined-sharding > ( see [DirectRunner.java line > 226|https://github.com/apache/beam/blob/master/runners/direct-java/src/main/java/org/apache/beam/runners/direct/DirectRunner.java#L226] > ) > but I do not know why this started occurring in 2.9.0... -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-9822) SpannerIO: Reduce memory usage - especially when streaming
[ https://issues.apache.org/jira/browse/BEAM-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17112441#comment-17112441 ] Niel Markwick commented on BEAM-9822: - #11570 does make a big reduction to the memory use of the transform in general, so would help a lot for both bounded and unbounded... But the other 2 resolve the immediate issue for streaming > SpannerIO: Reduce memory usage - especially when streaming > -- > > Key: BEAM-9822 > URL: https://issues.apache.org/jira/browse/BEAM-9822 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.20.0, 2.21.0 >Reporter: Niel Markwick >Assignee: Niel Markwick >Priority: P2 > Labels: google-cloud-spanner > Fix For: 2.22.0 > > Time Spent: 7h > Remaining Estimate: 0h > > SpannerIO uses a lot of memory. > In Streaming Dataflow, it uses many times as much (because dataflow creates > many worker threads) > Lower the memory use, and change default parameters during streaming to use > smaller batches and disable grouping. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-9821) SpannerIO does not include all batching parameters in DisplayData.
[ https://issues.apache.org/jira/browse/BEAM-9821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick updated BEAM-9821: Affects Version/s: 2.21.0 > SpannerIO does not include all batching parameters in DisplayData. > -- > > Key: BEAM-9821 > URL: https://issues.apache.org/jira/browse/BEAM-9821 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.20.0, 2.21.0 >Reporter: Niel Markwick >Assignee: Niel Markwick >Priority: Minor > Labels: google-cloud-spanner > Time Spent: 0.5h > Remaining Estimate: 0h > > SpannerIO Write and WriteGrouped do not populate all of the batching/grouping > parameters in their DisplayData – they only show "batchSizeBytes" -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-9821) SpannerIO does not include all batching parameters in DisplayData.
[ https://issues.apache.org/jira/browse/BEAM-9821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick updated BEAM-9821: Fix Version/s: 2.22.0 > SpannerIO does not include all batching parameters in DisplayData. > -- > > Key: BEAM-9821 > URL: https://issues.apache.org/jira/browse/BEAM-9821 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.20.0, 2.21.0 >Reporter: Niel Markwick >Assignee: Niel Markwick >Priority: Minor > Labels: google-cloud-spanner > Fix For: 2.22.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > SpannerIO Write and WriteGrouped do not populate all of the batching/grouping > parameters in their DisplayData – they only show "batchSizeBytes" -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work stopped] (BEAM-9822) SpannerIO: Reduce memory usage - especially when streaming
[ https://issues.apache.org/jira/browse/BEAM-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on BEAM-9822 stopped by Niel Markwick. --- > SpannerIO: Reduce memory usage - especially when streaming > -- > > Key: BEAM-9822 > URL: https://issues.apache.org/jira/browse/BEAM-9822 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.20.0, 2.21.0 >Reporter: Niel Markwick >Assignee: Niel Markwick >Priority: Major > Labels: google-cloud-spanner > Fix For: 2.22.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > SpannerIO uses a lot of memory. > In Streaming Dataflow, it uses many times as much (because dataflow creates > many worker threads) > Lower the memory use, and change default parameters during streaming to use > smaller batches and disable grouping. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-9822) SpannerIO: Reduce memory usage - especially when streaming
[ https://issues.apache.org/jira/browse/BEAM-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick updated BEAM-9822: Fix Version/s: 2.22.0 > SpannerIO: Reduce memory usage - especially when streaming > -- > > Key: BEAM-9822 > URL: https://issues.apache.org/jira/browse/BEAM-9822 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.20.0, 2.21.0 >Reporter: Niel Markwick >Assignee: Niel Markwick >Priority: Major > Labels: google-cloud-spanner > Fix For: 2.22.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > SpannerIO uses a lot of memory. > In Streaming Dataflow, it uses many times as much (because dataflow creates > many worker threads) > Lower the memory use, and change default parameters during streaming to use > smaller batches and disable grouping. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-9822) SpannerIO: Reduce memory usage - especially when streaming
[ https://issues.apache.org/jira/browse/BEAM-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick updated BEAM-9822: Status: Open (was: Triage Needed) > SpannerIO: Reduce memory usage - especially when streaming > -- > > Key: BEAM-9822 > URL: https://issues.apache.org/jira/browse/BEAM-9822 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.20.0, 2.21.0 >Reporter: Niel Markwick >Assignee: Niel Markwick >Priority: Major > Labels: google-cloud-spanner > Fix For: 2.22.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > SpannerIO uses a lot of memory. > In Streaming Dataflow, it uses many times as much (because dataflow creates > many worker threads) > Lower the memory use, and change default parameters during streaming to use > smaller batches and disable grouping. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-9822) SpannerIO: Reduce memory usage - especially when streaming
[ https://issues.apache.org/jira/browse/BEAM-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick updated BEAM-9822: Affects Version/s: 2.21.0 > SpannerIO: Reduce memory usage - especially when streaming > -- > > Key: BEAM-9822 > URL: https://issues.apache.org/jira/browse/BEAM-9822 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.20.0, 2.21.0 >Reporter: Niel Markwick >Assignee: Niel Markwick >Priority: Major > Labels: google-cloud-spanner > Time Spent: 1.5h > Remaining Estimate: 0h > > SpannerIO uses a lot of memory. > In Streaming Dataflow, it uses many times as much (because dataflow creates > many worker threads) > Lower the memory use, and change default parameters during streaming to use > smaller batches and disable grouping. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-9821) SpannerIO does not include all batching parameters in DisplayData.
[ https://issues.apache.org/jira/browse/BEAM-9821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick updated BEAM-9821: Status: Open (was: Triage Needed) > SpannerIO does not include all batching parameters in DisplayData. > -- > > Key: BEAM-9821 > URL: https://issues.apache.org/jira/browse/BEAM-9821 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.20.0 >Reporter: Niel Markwick >Assignee: Niel Markwick >Priority: Minor > Labels: google-cloud-spanner > Time Spent: 0.5h > Remaining Estimate: 0h > > SpannerIO Write and WriteGrouped do not populate all of the batching/grouping > parameters in their DisplayData – they only show "batchSizeBytes" -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work started] (BEAM-9822) SpannerIO: Reduce memory usage - especially when streaming
[ https://issues.apache.org/jira/browse/BEAM-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on BEAM-9822 started by Niel Markwick. --- > SpannerIO: Reduce memory usage - especially when streaming > -- > > Key: BEAM-9822 > URL: https://issues.apache.org/jira/browse/BEAM-9822 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.20.0, 2.21.0 >Reporter: Niel Markwick >Assignee: Niel Markwick >Priority: Major > Labels: google-cloud-spanner > Fix For: 2.22.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > SpannerIO uses a lot of memory. > In Streaming Dataflow, it uses many times as much (because dataflow creates > many worker threads) > Lower the memory use, and change default parameters during streaming to use > smaller batches and disable grouping. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (BEAM-9822) SpannerIO: Reduce memory usage - especially when streaming
[ https://issues.apache.org/jira/browse/BEAM-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick reassigned BEAM-9822: --- Assignee: Niel Markwick > SpannerIO: Reduce memory usage - especially when streaming > -- > > Key: BEAM-9822 > URL: https://issues.apache.org/jira/browse/BEAM-9822 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.20.0 >Reporter: Niel Markwick >Assignee: Niel Markwick >Priority: Major > Labels: google-cloud-spanner > Time Spent: 1.5h > Remaining Estimate: 0h > > SpannerIO uses a lot of memory. > In Streaming Dataflow, it uses many times as much (because dataflow creates > many worker threads) > Lower the memory use, and change default parameters during streaming to use > smaller batches and disable grouping. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-9505) SpannerIO spurious error message with empty bundles
[ https://issues.apache.org/jira/browse/BEAM-9505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick updated BEAM-9505: Affects Version/s: 2.21.0 > SpannerIO spurious error message with empty bundles > --- > > Key: BEAM-9505 > URL: https://issues.apache.org/jira/browse/BEAM-9505 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.18.0, 2.19.0, 2.20.0, 2.21.0 >Reporter: Niel Markwick >Assignee: Niel Markwick >Priority: Minor > Fix For: 2.22.0 > > Attachments: Worker error log count.png > > Time Spent: 1h 10m > Remaining Estimate: 0h > > -When using DataflowRunner in streaming mode. DoFn.StartBundle is called > multiple times for the same bundle.- > -This does not occur with DirectRunner.- > -This breaks DoFn's which require per-bundle setup and teardown procedures.- > When a bundle is empty (such as in streaming if a window is empty), SpannerIO > will report a spurious error message: > {{IllegalStateException: Sorter should be null here}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (BEAM-9821) SpannerIO does not include all batching parameters in DisplayData.
[ https://issues.apache.org/jira/browse/BEAM-9821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick reassigned BEAM-9821: --- Assignee: Niel Markwick > SpannerIO does not include all batching parameters in DisplayData. > -- > > Key: BEAM-9821 > URL: https://issues.apache.org/jira/browse/BEAM-9821 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.20.0 >Reporter: Niel Markwick >Assignee: Niel Markwick >Priority: Minor > Labels: google-cloud-spanner > Time Spent: 0.5h > Remaining Estimate: 0h > > SpannerIO Write and WriteGrouped do not populate all of the batching/grouping > parameters in their DisplayData – they only show "batchSizeBytes" -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (BEAM-9269) Set shorter Commit Deadline and handle with backoff/retry
[ https://issues.apache.org/jira/browse/BEAM-9269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick resolved BEAM-9269. - Fix Version/s: 2.20.0 Resolution: Fixed > Set shorter Commit Deadline and handle with backoff/retry > - > > Key: BEAM-9269 > URL: https://issues.apache.org/jira/browse/BEAM-9269 > Project: Beam > Issue Type: Improvement > Components: io-java-gcp >Affects Versions: 2.16.0, 2.17.0, 2.18.0, 2.19.0 >Reporter: Niel Markwick >Assignee: Niel Markwick >Priority: Major > Labels: google-cloud-spanner > Fix For: 2.20.0 > > Time Spent: 4h > Remaining Estimate: 0h > > Default commit deadline in Spanner is 1hr, which can lead to a variety of > issues including database overload and session expiry. > Shorter deadline should be set with backoff/retry when deadline expires, so > that the Spanner database does not become overloaded. -- This message was sent by Atlassian Jira (v8.3.4#803005)
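The approach described in BEAM-9269 (a shorter commit deadline, handled with backoff/retry) can be sketched roughly as follows. This is only an illustration under stated assumptions: `CommitRetrySketch`, `DeadlineExceededException` and `commitWithBackoff` are hypothetical names, the commit is modelled as a plain `Supplier`, and the doubling backoff is one common choice rather than what SpannerIO necessarily does.

```java
import java.util.function.Supplier;

/** Illustrative only: not the SpannerIO or Spanner client implementation. */
final class CommitRetrySketch {

  /** Hypothetical stand-in for the error raised when a commit exceeds its deadline. */
  static final class DeadlineExceededException extends RuntimeException {
    DeadlineExceededException(String message) {
      super(message);
    }
  }

  /**
   * Runs the commit, retrying with exponentially growing pauses when the deadline is exceeded.
   * Rethrows the last deadline error once maxAttempts attempts have been made.
   */
  static <T> T commitWithBackoff(Supplier<T> commit, int maxAttempts, long initialBackoffMillis) {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be at least 1");
    }
    long backoffMillis = initialBackoffMillis;
    for (int attempt = 1; ; attempt++) {
      try {
        return commit.get();
      } catch (DeadlineExceededException e) {
        if (attempt >= maxAttempts) {
          throw e;
        }
        try {
          // Pause before retrying so a struggling database is not hit again immediately.
          Thread.sleep(backoffMillis);
        } catch (InterruptedException interrupted) {
          Thread.currentThread().interrupt();
          throw new IllegalStateException("Interrupted while backing off", interrupted);
        }
        backoffMillis *= 2;
      }
    }
  }
}
```

Backing off between attempts is what keeps the retries from adding to the overload the issue describes; retrying immediately after a short deadline would have the opposite effect.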
[jira] [Commented] (BEAM-9505) SpannerIO spurious error message with empty bundles
[ https://issues.apache.org/jira/browse/BEAM-9505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17103272#comment-17103272 ] Niel Markwick commented on BEAM-9505: - The only workarounds are to either build your own version of 2.17 patched with my fix from the [GitHub pull request|https://github.com/apache/beam/pull/11438], or somehow filter out empty windows. 150,000 errors per hour is ~40/sec. I know that's a lot, but is it really slowing down your pipeline? If you are talking about end-to-end latency, you can reduce it by reducing the grouping factor in SpannerIO.write() (at the expense of potentially more Spanner CPU usage). 2.22 will default to a groupingFactor of 1 for streaming pipelines. > SpannerIO spurious error message with empty bundles > --- > > Key: BEAM-9505 > URL: https://issues.apache.org/jira/browse/BEAM-9505 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.18.0, 2.19.0, 2.20.0 >Reporter: Niel Markwick >Assignee: Niel Markwick >Priority: Minor > Fix For: 2.22.0 > > Attachments: Worker error log count.png > > Time Spent: 1h 10m > Remaining Estimate: 0h > > -When using DataflowRunner in streaming mode. DoFn.StartBundle is called > multiple times for the same bundle.- > -This does not occur with DirectRunner.- > -This breaks DoFn's which require per-bundle setup and teardown procedures.- > When a bundle is empty (such as in streaming if a window is empty), SpannerIO > will report a spurious error message: > {{IllegalStateException: Sorter should be null here}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-9505) SpannerIO spurious error message with empty bundles
[ https://issues.apache.org/jira/browse/BEAM-9505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick updated BEAM-9505: Fix Version/s: 2.22.0 > SpannerIO spurious error message with empty bundles > --- > > Key: BEAM-9505 > URL: https://issues.apache.org/jira/browse/BEAM-9505 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.18.0, 2.19.0, 2.20.0 >Reporter: Niel Markwick >Assignee: Niel Markwick >Priority: Minor > Fix For: 2.22.0 > > Attachments: Worker error log count.png > > Time Spent: 1h 10m > Remaining Estimate: 0h > > -When using DataflowRunner in streaming mode. DoFn.StartBundle is called > multiple times for the same bundle.- > -This does not occur with DirectRunner.- > -This breaks DoFn's which require per-bundle setup and teardown procedures.- > When a bundle is empty (such as in streaming if a window is empty), SpannerIO > will report a spurious error message: > {{IllegalStateException: Sorter should be null here}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-9822) SpannerIO: Reduce memory usage - especially when streaming
[ https://issues.apache.org/jira/browse/BEAM-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick updated BEAM-9822: Summary: SpannerIO: Reduce memory usage - especially when streaming (was: Reduce memory usage - especially when streaming) > SpannerIO: Reduce memory usage - especially when streaming > -- > > Key: BEAM-9822 > URL: https://issues.apache.org/jira/browse/BEAM-9822 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.20.0 >Reporter: Niel Markwick >Priority: Major > Labels: google-cloud-spanner > Fix For: 2.21.0 > > > SpannerIO uses a lot of memory. > In Streaming Dataflow, it uses many times as much (because dataflow creates > many worker threads) > Lower the memory use, and change default parameters during streaming to use > smaller batches and disable grouping. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-9822) Reduce memory usage - especially when streaming
Niel Markwick created BEAM-9822: --- Summary: Reduce memory usage - especially when streaming Key: BEAM-9822 URL: https://issues.apache.org/jira/browse/BEAM-9822 Project: Beam Issue Type: Bug Components: io-java-gcp Affects Versions: 2.20.0 Reporter: Niel Markwick Fix For: 2.21.0 SpannerIO uses a lot of memory. In Streaming Dataflow, it uses many times as much (because dataflow creates many worker threads) Lower the memory use, and change default parameters during streaming to use smaller batches and disable grouping. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-9821) SpannerIO does not include all batching parameters in DisplayData.
Niel Markwick created BEAM-9821: --- Summary: SpannerIO does not include all batching parameters in DisplayData. Key: BEAM-9821 URL: https://issues.apache.org/jira/browse/BEAM-9821 Project: Beam Issue Type: Bug Components: io-java-gcp Affects Versions: 2.20.0 Reporter: Niel Markwick Fix For: 2.21.0 SpannerIO Write and WriteGrouped do not populate all of the batching/grouping parameters in their DisplayData – they only show "batchSizeBytes" -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-9505) SpannerIO spurious error message with empty bundles
[ https://issues.apache.org/jira/browse/BEAM-9505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick updated BEAM-9505: Component/s: (was: runner-dataflow) io-java-gcp Fix Version/s: 2.21.0 Affects Version/s: 2.20.0 Description: -When using DataflowRunner in streaming mode. DoFn.StartBundle is called multiple times for the same bundle.- -This does not occur with DirectRunner.- -This breaks DoFn's which require per-bundle setup and teardown procedures.- When a bundle is empty (such as in streaming if a window is empty), SpannerIO will report a spurious error message: {{IllegalStateException: Sorter should be null here}} was: When using DataflowRunner in streaming mode. DoFn.StartBundle is called multiple times for the same bundle. This does not occur with DirectRunner. This breaks DoFn's which require per-bundle setup and teardown procedures. Priority: Minor (was: Major) Summary: SpannerIO spurious error message with empty bundles (was: DoFn.StartBundle called multiple times when streaming) > SpannerIO spurious error message with empty bundles > --- > > Key: BEAM-9505 > URL: https://issues.apache.org/jira/browse/BEAM-9505 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.18.0, 2.19.0, 2.20.0 >Reporter: Niel Markwick >Assignee: Niel Markwick >Priority: Minor > Fix For: 2.21.0 > > > -When using DataflowRunner in streaming mode. DoFn.StartBundle is called > multiple times for the same bundle.- > -This does not occur with DirectRunner.- > -This breaks DoFn's which require per-bundle setup and teardown procedures.- > When a bundle is empty (such as in streaming if a window is empty), SpannerIO > will report a spurious error message: > {{IllegalStateException: Sorter should be null here}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-9505) DoFn.StartBundle called multiple times when streaming
[ https://issues.apache.org/jira/browse/BEAM-9505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17084779#comment-17084779 ] Niel Markwick commented on BEAM-9505: - The shame 😢. This is actually a bug in my code where a value was not cleared in FinishBundle, leading to an unexpected state in StartBundle and a spurious error message. This only occurs in certain situations (such as an empty bundle, which can occur in streaming if the window is also empty). Updating issue to correct. > DoFn.StartBundle called multiple times when streaming > - > > Key: BEAM-9505 > URL: https://issues.apache.org/jira/browse/BEAM-9505 > Project: Beam > Issue Type: Bug > Components: runner-dataflow >Affects Versions: 2.18.0, 2.19.0 >Reporter: Niel Markwick >Assignee: Boyuan Zhang >Priority: Major > > When using DataflowRunner in streaming mode. DoFn.StartBundle is called > multiple times for the same bundle. > > This does not occur with DirectRunner. > This breaks DoFn's which require per-bundle setup and teardown procedures. -- This message was sent by Atlassian Jira (v8.3.4#803005)
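The kind of bug described in the comment above can be illustrated with a small sketch that mimics a DoFn's bundle lifecycle without depending on Beam. The actual SpannerIO code is not shown in this thread, so everything here is an assumption: the class `BundleStateSketch`, the `sorter` field as a plain list, and the idea that StartBundle creates the buffer are illustrative only. The point is that FinishBundle clears the per-bundle state on every path, including for an empty bundle, so the sanity check in StartBundle never fires spuriously.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Illustrative only: mimics a DoFn's bundle lifecycle without depending on Beam. */
final class BundleStateSketch {

  // Per-bundle buffer; expected to be null whenever no bundle is in progress.
  private List<String> sorter;

  /** Sanity-checks that the previous bundle cleaned up, then starts a fresh buffer. */
  void startBundle() {
    if (sorter != null) {
      throw new IllegalStateException("Sorter should be null here");
    }
    sorter = new ArrayList<>();
  }

  void processElement(String element) {
    sorter.add(element);
  }

  /** Returns the sorted elements of the bundle and always clears the per-bundle state. */
  List<String> finishBundle() {
    List<String> result = sorter;
    // Clear unconditionally, so that an empty bundle leaves no state behind either.
    sorter = null;
    Collections.sort(result);
    return result;
  }
}
```

If the field were only reset on the path that handles a non-empty buffer, an empty bundle would leave it set and the next `startBundle()` call would throw, which matches the symptom reported in the issue.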
[jira] [Assigned] (BEAM-9505) DoFn.StartBundle called multiple times when streaming
[ https://issues.apache.org/jira/browse/BEAM-9505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick reassigned BEAM-9505: --- Assignee: Niel Markwick (was: Boyuan Zhang) > DoFn.StartBundle called multiple times when streaming > - > > Key: BEAM-9505 > URL: https://issues.apache.org/jira/browse/BEAM-9505 > Project: Beam > Issue Type: Bug > Components: runner-dataflow >Affects Versions: 2.18.0, 2.19.0 >Reporter: Niel Markwick >Assignee: Niel Markwick >Priority: Major > > When using DataflowRunner in streaming mode. DoFn.StartBundle is called > multiple times for the same bundle. > > This does not occur with DirectRunner. > This breaks DoFn's which require per-bundle setup and teardown procedures. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-9505) DoFn.StartBundle called multiple times when streaming
[ https://issues.apache.org/jira/browse/BEAM-9505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17071109#comment-17071109 ] Niel Markwick commented on BEAM-9505: - This issue was reported on StackOverflow. To reproduce it you would have to run a streaming pipeline in Dataflow. The SO question was about writing to Spanner (in SpannerIO, I have some code in StartBundle to sanity-check the state, which reports the error). [https://stackoverflow.com/questions/60658135/spannerio-java-lang-illegalstateexception-sorter-should-be-null-here] I would have thought that a trivial DoFn in a pipeline taking Pub/Sub input would be able to reproduce it (but I have not created that pipeline myself yet). > DoFn.StartBundle called multiple times when streaming > - > > Key: BEAM-9505 > URL: https://issues.apache.org/jira/browse/BEAM-9505 > Project: Beam > Issue Type: Bug > Components: runner-dataflow >Affects Versions: 2.18.0, 2.19.0 >Reporter: Niel Markwick >Assignee: Boyuan Zhang >Priority: Major > > When using DataflowRunner in streaming mode. DoFn.StartBundle is called > multiple times for the same bundle. > > This does not occur with DirectRunner. > This breaks DoFn's which require per-bundle setup and teardown procedures. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (BEAM-9505) DoFn.StartBundle called multiple times when streaming
[ https://issues.apache.org/jira/browse/BEAM-9505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick reassigned BEAM-9505: --- Assignee: Boyuan Zhang (was: Niel Markwick) > DoFn.StartBundle called multiple times when streaming > - > > Key: BEAM-9505 > URL: https://issues.apache.org/jira/browse/BEAM-9505 > Project: Beam > Issue Type: Bug > Components: runner-dataflow >Affects Versions: 2.18.0, 2.19.0 >Reporter: Niel Markwick >Assignee: Boyuan Zhang >Priority: Major > > When using DataflowRunner in streaming mode. DoFn.StartBundle is called > multiple times for the same bundle. > > This does not occur with DirectRunner. > This breaks DoFn's which require per-bundle setup and teardown procedures. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (BEAM-9505) DoFn.StartBundle called multiple times when streaming
[ https://issues.apache.org/jira/browse/BEAM-9505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick reassigned BEAM-9505: --- Assignee: Niel Markwick (was: Boyuan Zhang) > DoFn.StartBundle called multiple times when streaming > - > > Key: BEAM-9505 > URL: https://issues.apache.org/jira/browse/BEAM-9505 > Project: Beam > Issue Type: Bug > Components: runner-dataflow >Affects Versions: 2.18.0, 2.19.0 >Reporter: Niel Markwick >Assignee: Niel Markwick >Priority: Major > > When using DataflowRunner in streaming mode. DoFn.StartBundle is called > multiple times for the same bundle. > > This does not occur with DirectRunner. > This breaks DoFn's which require per-bundle setup and teardown procedures. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-9505) DoFn.StartBundle called multiple times when streaming
Niel Markwick created BEAM-9505: --- Summary: DoFn.StartBundle called multiple times when streaming Key: BEAM-9505 URL: https://issues.apache.org/jira/browse/BEAM-9505 Project: Beam Issue Type: Bug Components: runner-dataflow Affects Versions: 2.19.0, 2.18.0 Reporter: Niel Markwick When using DataflowRunner in streaming mode. DoFn.StartBundle is called multiple times for the same bundle. This does not occur with DirectRunner. This breaks DoFn's which require per-bundle setup and teardown procedures. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (BEAM-9269) Set shorter Commit Deadline and handle with backoff/retry
[ https://issues.apache.org/jira/browse/BEAM-9269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick reassigned BEAM-9269: --- Assignee: Niel Markwick > Set shorter Commit Deadline and handle with backoff/retry > - > > Key: BEAM-9269 > URL: https://issues.apache.org/jira/browse/BEAM-9269 > Project: Beam > Issue Type: Improvement > Components: io-java-gcp >Affects Versions: 2.16.0, 2.17.0, 2.18.0, 2.19.0 >Reporter: Niel Markwick >Assignee: Niel Markwick >Priority: Major > Labels: google-cloud-spanner > Time Spent: 1h 10m > Remaining Estimate: 0h > > Default commit deadline in Spanner is 1hr, which can lead to a variety of > issues including database overload and session expiry. > Shorter deadline should be set with backoff/retry when deadline expires, so > that the Spanner database does not become overloaded. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (BEAM-9268) SpannerIO: Better documentation and warning about creating tables in the pipeline
[ https://issues.apache.org/jira/browse/BEAM-9268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick resolved BEAM-9268. - Resolution: Fixed > SpannerIO: Better documentation and warning about creating tables in the > pipeline > - > > Key: BEAM-9268 > URL: https://issues.apache.org/jira/browse/BEAM-9268 > Project: Beam > Issue Type: Improvement > Components: io-go-gcp >Affects Versions: 2.16.0, 2.17.0, 2.18.0, 2.19.0 >Reporter: Niel Markwick >Assignee: Niel Markwick >Priority: Major > Labels: google-cloud-spanner, perfomance > Fix For: 2.20.0 > > Time Spent: 1h 50m > Remaining Estimate: 0h > > The javadoc for SpannerIO.Write mentions in passing that the transform needs > to know the DB schema for optimal performance. If the schema is created > within the pipeline, then there is a race between the schema being created > and SpannerIO reading it, leading to a potential performance penalty if > SpannerIO does not know about the existence of some tables. > > Javadoc needs to make this clearer and more explicit, and point the user at > the Write.withSchemaReadySignal(). > > Pipeline needs to raise (rate limited) warnings if it sees writes being made > to tables it does not know about (warnings can refer back to javadocs) > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-9268) SpannerIO: Better documentation and warning about creating tables in the pipeline
[ https://issues.apache.org/jira/browse/BEAM-9268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick updated BEAM-9268: Fix Version/s: 2.20.0 > SpannerIO: Better documentation and warning about creating tables in the > pipeline > - > > Key: BEAM-9268 > URL: https://issues.apache.org/jira/browse/BEAM-9268 > Project: Beam > Issue Type: Improvement > Components: io-go-gcp >Affects Versions: 2.16.0, 2.17.0, 2.18.0, 2.19.0 >Reporter: Niel Markwick >Assignee: Niel Markwick >Priority: Major > Labels: google-cloud-spanner, perfomance > Fix For: 2.20.0 > > Time Spent: 1h 50m > Remaining Estimate: 0h > > The javadoc for SpannerIO.Write mentions in passing that the transform needs > to know the DB schema for optimal performance. If the schema is created > within the pipeline, then there is a race between the schema being created > and SpannerIO reading it, leading to a potential performance penalty if > SpannerIO does not know about the existence of some tables. > > Javadoc needs to make this clearer and more explicit, and point the user at > the Write.withSchemaReadySignal(). > > Pipeline needs to raise (rate limited) warnings if it sees writes being made > to tables it does not know about (warnings can refer back to javadocs) > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-9269) Set shorter Commit Deadline and handle with backoff/retry
Niel Markwick created BEAM-9269: --- Summary: Set shorter Commit Deadline and handle with backoff/retry Key: BEAM-9269 URL: https://issues.apache.org/jira/browse/BEAM-9269 Project: Beam Issue Type: Improvement Components: io-java-gcp Affects Versions: 2.19.0, 2.18.0, 2.17.0, 2.16.0 Reporter: Niel Markwick Default commit deadline in Spanner is 1hr, which can lead to a variety of issues including database overload and session expiry. Shorter deadline should be set with backoff/retry when deadline expires, so that the Spanner database does not become overloaded. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-9268) SpannerIO: Better documentation and warning about creating tables in the pipeline
Niel Markwick created BEAM-9268: --- Summary: SpannerIO: Better documentation and warning about creating tables in the pipeline Key: BEAM-9268 URL: https://issues.apache.org/jira/browse/BEAM-9268 Project: Beam Issue Type: Improvement Components: io-go-gcp Affects Versions: 2.19.0, 2.18.0, 2.17.0, 2.16.0 Reporter: Niel Markwick Assignee: Niel Markwick The javadoc for SpannerIO.Write mentions in passing that the transform needs to know the DB schema for optimal performance. If the schema is created within the pipeline, then there is a race between the schema being created and SpannerIO reading it, leading to a potential performance penalty if SpannerIO does not know about the existence of some tables. Javadoc needs to make this clearer and more explicit, and point the user at the Write.withSchemaReadySignal(). Pipeline needs to raise (rate limited) warnings if it sees writes being made to tables it does not know about (warnings can refer back to javadocs) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-8825) OOM when writing large numbers of 'narrow' rows
[ https://issues.apache.org/jira/browse/BEAM-8825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995677#comment-16995677 ] Niel Markwick commented on BEAM-8825: - done: [https://github.com/apache/beam/pull/10380] > OOM when writing large numbers of 'narrow' rows > --- > > Key: BEAM-8825 > URL: https://issues.apache.org/jira/browse/BEAM-8825 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.9.0, 2.10.0, 2.11.0, 2.12.0, 2.13.0, 2.14.0, 2.15.0, > 2.16.0, 2.17.0 >Reporter: Niel Markwick >Assignee: Niel Markwick >Priority: Major > Fix For: 2.18.0 > > Time Spent: 1h > Remaining Estimate: 0h > > SpannerIO can OOM when writing large numbers of 'narrow' rows. > > SpannerIO puts input mutation elements into batches for efficient writing. > These batches are limited by number of cells mutated, and size of data > written (5000 cells, 1MB data). SpannerIO groups enough mutations to build > 1000 of these groups (5M cells, 1GB data), then sorts and batches them. > When the number of cells and size of data is very small (<5 cells, <100 > bytes), the memory overhead of storing millions of mutations for batching is > significant, and can lead to OOMs. -- This message was sent by Atlassian Jira (v8.3.4#803005)
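To make the arithmetic in the BEAM-8825 description concrete, here is a rough estimate of how many mutations are held in memory before sorting. It is only a sketch: the class and method names are hypothetical, 1MB is taken as 10^6 bytes, and it assumes the buffer is full as soon as either the cell limit or the byte limit for 1000 batches is reached.

```java
final class BatchBufferEstimate {
  private BatchBufferEstimate() {}

  // Limits quoted in the issue description.
  static final long MAX_CELLS_PER_BATCH = 5_000;
  static final long MAX_BYTES_PER_BATCH = 1_000_000; // "1MB", assumed to mean 10^6 bytes
  static final long BATCHES_PER_GROUP = 1_000;

  /**
   * Approximate number of mutations buffered before sorting, assuming every
   * mutation has the same cell count and size, and that buffering stops when
   * either total limit (5M cells or 1GB) is reached.
   */
  static long mutationsBuffered(long cellsPerMutation, long bytesPerMutation) {
    long byCells = MAX_CELLS_PER_BATCH * BATCHES_PER_GROUP / cellsPerMutation;
    long byBytes = MAX_BYTES_PER_BATCH * BATCHES_PER_GROUP / bytesPerMutation;
    return Math.min(byCells, byBytes);
  }
}
```

Under these assumptions, `mutationsBuffered(5, 100)` is 1,000,000 while `mutationsBuffered(100, 10_000)` is only 50,000: narrow rows mean far more mutation objects in memory at once, so per-object overhead dominates.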
[jira] [Assigned] (BEAM-8825) OOM when writing large numbers of 'narrow' rows
[ https://issues.apache.org/jira/browse/BEAM-8825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick reassigned BEAM-8825: --- Assignee: Niel Markwick > OOM when writing large numbers of 'narrow' rows > --- > > Key: BEAM-8825 > URL: https://issues.apache.org/jira/browse/BEAM-8825 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.9.0, 2.10.0, 2.11.0, 2.12.0, 2.13.0, 2.14.0, 2.15.0, > 2.16.0, 2.17.0 >Reporter: Niel Markwick >Assignee: Niel Markwick >Priority: Major > Fix For: 2.18.0 > > > SpannerIO can OOM when writing large numbers of 'narrow' rows. > > SpannerIO puts input mutation elements into batches for efficient writing. > These batches are limited by number of cells mutated, and size of data > written (5000 cells, 1MB data). SpannerIO groups enough mutations to build > 1000 of these groups (5M cells, 1GB data), then sorts and batches them. > When the number of cells and size of data is very small (<5 cells, <100 > bytes), the memory overhead of storing millions of mutations for batching is > significant, and can lead to OOMs. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-8825) OOM when writing large numbers of 'narrow' rows
[ https://issues.apache.org/jira/browse/BEAM-8825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niel Markwick updated BEAM-8825: Fix Version/s: (was: 2.17.0) 2.18.0 Affects Version/s: 2.17.0 > OOM when writing large numbers of 'narrow' rows > --- > > Key: BEAM-8825 > URL: https://issues.apache.org/jira/browse/BEAM-8825 > Project: Beam > Issue Type: Bug > Components: io-java-gcp >Affects Versions: 2.9.0, 2.10.0, 2.11.0, 2.12.0, 2.13.0, 2.14.0, 2.15.0, > 2.16.0, 2.17.0 >Reporter: Niel Markwick >Priority: Major > Fix For: 2.18.0 > > > SpannerIO can OOM when writing large numbers of 'narrow' rows. > > SpannerIO puts input mutation elements into batches for efficient writing. > These batches are limited by number of cells mutated, and size of data > written (5000 cells, 1MB data). SpannerIO groups enough mutations to build > 1000 of these groups (5M cells, 1GB data), then sorts and batches them. > When the number of cells and size of data is very small (<5 cells, <100 > bytes), the memory overhead of storing millions of mutations for batching is > significant, and can lead to OOMs. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-8825) OOM when writing large numbers of 'narrow' rows
Niel Markwick created BEAM-8825: --- Summary: OOM when writing large numbers of 'narrow' rows Key: BEAM-8825 URL: https://issues.apache.org/jira/browse/BEAM-8825 Project: Beam Issue Type: Bug Components: io-java-gcp Affects Versions: 2.16.0, 2.15.0, 2.14.0, 2.13.0, 2.12.0, 2.11.0, 2.10.0, 2.9.0 Reporter: Niel Markwick Fix For: 2.17.0 SpannerIO can OOM when writing large numbers of 'narrow' rows. SpannerIO puts input mutation elements into batches for efficient writing. These batches are limited by number of cells mutated, and size of data written (5000 cells, 1MB data). SpannerIO groups enough mutations to build 1000 of these groups (5M cells, 1GB data), then sorts and batches them. When the number of cells and size of data is very small (<5 cells, <100 bytes), the memory overhead of storing millions of mutations for batching is significant, and can lead to OOMs. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-7732) Allow access to SpannerOptions in Beam
[ https://issues.apache.org/jira/browse/BEAM-7732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885225#comment-16885225 ]

Niel Markwick commented on BEAM-7732:
-

Created [https://github.com/apache/beam/pull/9048] as a proof-of-concept change.

> Allow access to SpannerOptions in Beam
> --
>
> Key: BEAM-7732
> URL: https://issues.apache.org/jira/browse/BEAM-7732
> Project: Beam
> Issue Type: Improvement
> Components: io-java-gcp
> Affects Versions: 2.12.0, 2.13.0
> Reporter: Niel Markwick
> Priority: Minor
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Beam hides the [SpannerOptions|https://github.com/googleapis/google-cloud-java/blob/master/google-cloud-clients/google-cloud-spanner/src/main/java/com/google/cloud/spanner/SpannerOptions.java] object behind a [SpannerConfig|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/SpannerConfig.java] object because the SpannerOptions object is not serializable.
> This means that the only options that can be set are those that can be specified in SpannerConfig - limited to host, project, instance, database.
> Suggestion: add the possibility to set a SpannerOptionsFactory in SpannerConfig:
> {code:java}
> public interface SpannerOptionsFactory extends Serializable {
>   public SpannerOptions create();
> }
> {code}
> This would allow the user to use this factory class to specify custom SpannerOptions before they are passed on to the connectToSpanner() method; connectToSpanner() would then become:
> {code:java}
> public SpannerAccessor connectToSpanner() {
>   SpannerOptions.Builder builder = spannerOptionsFactory.create().toBuilder();
>   // rest of connectToSpanner follows, setting project, host, etc.
> }
> {code}
>

-- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (BEAM-7732) Allow access to SpannerOptions in Beam
Niel Markwick created BEAM-7732:
---

Summary: Allow access to SpannerOptions in Beam
Key: BEAM-7732
URL: https://issues.apache.org/jira/browse/BEAM-7732
Project: Beam
Issue Type: Improvement
Components: io-java-gcp
Affects Versions: 2.13.0, 2.12.0
Reporter: Niel Markwick

Beam hides the [SpannerOptions|https://github.com/googleapis/google-cloud-java/blob/master/google-cloud-clients/google-cloud-spanner/src/main/java/com/google/cloud/spanner/SpannerOptions.java] object behind a [SpannerConfig|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/SpannerConfig.java] object because the SpannerOptions object is not serializable.

This means that the only options that can be set are those that can be specified in SpannerConfig - limited to host, project, instance, database.

Suggestion: add the possibility to set a SpannerOptionsFactory in SpannerConfig:
{code:java}
public interface SpannerOptionsFactory extends Serializable {
  public SpannerOptions create();
}
{code}
This would allow the user to use this factory class to specify custom SpannerOptions before they are passed on to the connectToSpanner() method; connectToSpanner() would then become:
{code:java}
public SpannerAccessor connectToSpanner() {
  SpannerOptions.Builder builder = spannerOptionsFactory.create().toBuilder();
  // rest of connectToSpanner follows, setting project, host, etc.
}
{code}

-- This message was sent by Atlassian JIRA (v7.6.14#76016)
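Why a serializable factory gets around the non-serializable options object can be shown with a small stand-alone sketch. It deliberately avoids the real Spanner and Beam classes so that it runs without dependencies: `ClientOptions`, `ClientOptionsFactory`, `FixedChannelsFactory`, and `FactoryRoundTrip` are hypothetical stand-ins, with `ClientOptions` playing the role of SpannerOptions and the round trip imitating what happens when a pipeline ships configuration to workers.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Stand-in for a non-serializable options object (hypothetical; not the real SpannerOptions). */
final class ClientOptions {
  final int numChannels;

  ClientOptions(int numChannels) {
    this.numChannels = numChannels;
  }
}

/** Same shape as the SpannerOptionsFactory proposed in the issue, but for the stand-in type. */
interface ClientOptionsFactory extends Serializable {
  ClientOptions create();
}

/** Holds only serializable settings; builds the options object on demand. */
final class FixedChannelsFactory implements ClientOptionsFactory {
  private static final long serialVersionUID = 1L;
  private final int numChannels;

  FixedChannelsFactory(int numChannels) {
    this.numChannels = numChannels;
  }

  @Override
  public ClientOptions create() {
    return new ClientOptions(numChannels);
  }
}

final class FactoryRoundTrip {
  private FactoryRoundTrip() {}

  /** Serializes and deserializes the factory, as would happen when it is sent to a worker. */
  static ClientOptionsFactory roundTrip(ClientOptionsFactory factory) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
        out.writeObject(factory);
      }
      try (ObjectInputStream in =
          new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
        return (ClientOptionsFactory) in.readObject();
      }
    } catch (IOException | ClassNotFoundException e) {
      throw new IllegalStateException("factory could not be round-tripped", e);
    }
  }
}
```

The options object itself never crosses the serialization boundary; only the factory does, and `create()` is called on the receiving side.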
[jira] [Created] (BEAM-6921) Improve SpannerIO output
Niel Markwick created BEAM-6921:
---

Summary: Improve SpannerIO output
Key: BEAM-6921
URL: https://issues.apache.org/jira/browse/BEAM-6921
Project: Beam
Issue Type: New Feature
Components: io-java-gcp
Affects Versions: 2.11.0
Reporter: Niel Markwick

From a discussion in [https://github.com/apache/beam/pull/8097]

SpannerIO produces 2 output PCollections:
* getOutput() -> PCollection
** never has any values
** in GlobalWindow
** Closed when the input PCollection is closed (i.e. never in streaming) to indicate when all input has been written
** Used in batch pipelines to have 'dependent' bulk imports, where one dataset is not written to Spanner until another has completed writing (necessary for handling parent/child (1-many) referential integrity)
* getFailedMutations() -> PCollection
** only contains values when Mutation[Group]s fail to be written
** in GlobalWindow
** Not very useful, as the reason for the failure is not given.

Suggestion:
* Deprecate these existing outputs.
* Make the primary output a PCollection<{ MutationGroup, CommitTimestamp }> so that the successfully written Mutation[Group]s can be processed further if necessary ({a,b} signifies a container class for these values).
* Add an additional output of failed mutations: PCollection<{ MutationGroup, FailureMessage }>
* The existing outputs can be derived from these new outputs.

This allows useful error reporting/handling from the failure message, and the ability to continue processing the successful mutations.

(see also BEAM-6887)

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
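The two container classes suggested in BEAM-6921 might look roughly like the following. This is a sketch only: `WriteSuccess`, `WriteFailure`, and `LegacyOutputs` are hypothetical names, the mutation type is left generic so that the example does not depend on Beam or Spanner, and plain lists stand in for PCollections.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical container for { MutationGroup, CommitTimestamp }. */
final class WriteSuccess<M> {
  final M mutationGroup;
  final Instant commitTimestamp;

  WriteSuccess(M mutationGroup, Instant commitTimestamp) {
    this.mutationGroup = mutationGroup;
    this.commitTimestamp = commitTimestamp;
  }
}

/** Hypothetical container for { MutationGroup, FailureMessage }. */
final class WriteFailure<M> {
  final M mutationGroup;
  final String failureMessage;

  WriteFailure(M mutationGroup, String failureMessage) {
    this.mutationGroup = mutationGroup;
    this.failureMessage = failureMessage;
  }
}

final class LegacyOutputs {
  private LegacyOutputs() {}

  /** Derives the existing "failed mutations" output by dropping the failure message. */
  static <M> List<M> failedMutations(List<WriteFailure<M>> failures) {
    List<M> out = new ArrayList<>();
    for (WriteFailure<M> failure : failures) {
      out.add(failure.mutationGroup);
    }
    return out;
  }
}
```

Because each failure carries its message, the existing failed-mutations output is a simple projection of the new one, which is what makes deprecating the old outputs feasible.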
[jira] [Updated] (BEAM-6887) Streaming Spanner Writer transform
[ https://issues.apache.org/jira/browse/BEAM-6887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Niel Markwick updated BEAM-6887:

Description:
At present, the SpannerIO.Write/WriteGrouped transforms work by collecting an entire bundle of elements, sorting them by table/key, splitting the sorted list into batches (by size and number of cells modified), and then writing each batch to Spanner in a single transaction.

They return a SpannerWriteResult containing:
# a PCollection (the main output), which will have no elements but will be closed to signal when all the input elements have been written (which is never in streaming because the input is unbounded)
# a PCollection of elements that failed to write.

This transform is useful as a bulk sink for data because it efficiently writes large amounts of data.

It is not at all useful as an intermediate step in a streaming pipeline, because it has no useful output in streaming mode.

I propose that we have a separate Spanner Write transform which simply writes each input Mutation to the database, and then pushes successful Mutations onto its output.

This would allow use in the middle of a streaming pipeline, where the flow would be:
* Some data streamed in
* Converted to Spanner Mutations
* Written to Spanner Database
* Further processing where the values written to the Spanner Database are used.

was:
At present, the[ SpannerIO.Write(Grouped|http://go/gh/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/SpannerIO.java#L892]) transform works by collecting an entire bundle of elements, sorts them by table/key, splitting the sorted list into batches (by size and number of cells modified) and then writes each batch to Spanner in a single transaction.

It returns an[ object|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/SpannerWriteResult.java] containing :
# a PCollection (the main output) - which will have no elements but will be closed to signal when all the input elements have been written (which is never in streaming because input is unbounded)
# a PCollection of elements that failed to write.

This transform is useful as a bulk sink for data because it efficiently writes large amounts of data.

It is not at all useful as an intermediate step in a streaming pipeline - because it has no useful output in streaming mode.

I propose that we have a separate Spanner Write transform which simply writes each input Mutation to the database, and then pushes successful Mutations onto its output.

This would allow use in the middle of a streaming pipeline, where the flow would be
* Some data streamed in
* Converted to Spanner Mutations
* Written to Spanner Database
* Further processing where the values written to the Spanner Database are used.

> Streaming Spanner Writer transform
> --
>
> Key: BEAM-6887
> URL: https://issues.apache.org/jira/browse/BEAM-6887
> Project: Beam
> Issue Type: New Feature
> Components: io-java-gcp
> Reporter: Niel Markwick
> Assignee: Niel Markwick
> Priority: Minor
>
> At present, the SpannerIO.Write/WriteGrouped transforms work by collecting an entire bundle of elements, sorting them by table/key, splitting the sorted list into batches (by size and number of cells modified), and then writing each batch to Spanner in a single transaction.
> They return a SpannerWriteResult containing:
> # a PCollection (the main output), which will have no elements but will be closed to signal when all the input elements have been written (which is never in streaming because the input is unbounded)
> # a PCollection of elements that failed to write.
> This transform is useful as a bulk sink for data because it efficiently writes large amounts of data.
> It is not at all useful as an intermediate step in a streaming pipeline, because it has no useful output in streaming mode.
> I propose that we have a separate Spanner Write transform which simply writes each input Mutation to the database, and then pushes successful Mutations onto its output.
> This would allow use in the middle of a streaming pipeline, where the flow would be:
> * Some data streamed in
> * Converted to Spanner Mutations
> * Written to Spanner Database
> * Further processing where the values written to the Spanner Database are used.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
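The sort-then-batch behaviour described for the existing bulk transform can be sketched in a few lines. This is an illustration under simplifying assumptions, not the SpannerIO implementation: `KeyedMutation` and `SortAndBatch` are hypothetical names, and only the cell-count limit is modelled (the description also mentions a size limit).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical stand-in for a mutation: target table, primary key, and number of cells it touches. */
final class KeyedMutation {
  final String table;
  final String key;
  final int cells;

  KeyedMutation(String table, String key, int cells) {
    this.table = table;
    this.key = key;
    this.cells = cells;
  }
}

final class SortAndBatch {
  private SortAndBatch() {}

  /** Sorts a bundle by table then key, and splits it into batches of at most maxCellsPerBatch cells. */
  static List<List<KeyedMutation>> batches(List<KeyedMutation> bundle, int maxCellsPerBatch) {
    List<KeyedMutation> sorted = new ArrayList<>(bundle);
    sorted.sort(
        Comparator.comparing((KeyedMutation m) -> m.table).thenComparing(m -> m.key));

    List<List<KeyedMutation>> result = new ArrayList<>();
    List<KeyedMutation> current = new ArrayList<>();
    int cellsInCurrent = 0;
    for (KeyedMutation m : sorted) {
      // Close the current batch if adding this mutation would exceed the limit.
      if (!current.isEmpty() && cellsInCurrent + m.cells > maxCellsPerBatch) {
        result.add(current);
        current = new ArrayList<>();
        cellsInCurrent = 0;
      }
      current.add(m);
      cellsInCurrent += m.cells;
    }
    if (!current.isEmpty()) {
      result.add(current);
    }
    return result;
  }
}
```

The whole bundle must be collected before sorting can start, which is why this shape suits bulk loads but gives a streaming pipeline nothing to consume downstream.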