[jira] [Updated] (BEAM-14405) Java Spanner IO NPE when ProjectID not specified in template executions

2022-05-04 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-14405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick updated BEAM-14405:
-
Status: Resolved  (was: Resolved)

fixed by [PR#17540|https://github.com/apache/beam/pull/17540]

> Java Spanner IO NPE when ProjectID not specified in template executions
> ---
>
> Key: BEAM-14405
> URL: https://issues.apache.org/jira/browse/BEAM-14405
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Affects Versions: 2.38.0
>Reporter: Niel Markwick
>Assignee: Niel Markwick
>Priority: P1
> Fix For: 2.39.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> BEAM-13665 fixed the issue of NPE's when ProjectIDs were not specified in 
> command line executions by checking for a null ValueProvider, but in 
> _template_ executions, the ValueProvider is non-null but has a null 
> {_}value{_}. 
> [PR #17094|https://github.com/apache/beam/pull/17094] from BEAM-14116 added 
> nullness checks for keys in MonitoringInfoMetricName, which then triggered 
> the NPE in 2.38.0 when a template was executed with an unspecified ProjectID.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work stopped] (BEAM-14405) Java Spanner IO NPE when ProjectID not specified in template executions

2022-05-04 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-14405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on BEAM-14405 stopped by Niel Markwick.

> Java Spanner IO NPE when ProjectID not specified in template executions
> ---
>
> Key: BEAM-14405
> URL: https://issues.apache.org/jira/browse/BEAM-14405
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Affects Versions: 2.38.0
>Reporter: Niel Markwick
>Assignee: Niel Markwick
>Priority: P1
> Fix For: 2.39.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> BEAM-13665 fixed the issue of NPE's when ProjectIDs were not specified in 
> command line executions by checking for a null ValueProvider, but in 
> _template_ executions, the ValueProvider is non-null but has a null 
> {_}value{_}. 
> [PR #17094|https://github.com/apache/beam/pull/17094] from BEAM-14116 added 
> nullness checks for keys in MonitoringInfoMetricName, which then triggered 
> the NPE in 2.38.0 when a template was executed with an unspecified ProjectID.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Work started] (BEAM-14405) Java Spanner IO NPE when ProjectID not specified in template executions

2022-05-04 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-14405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on BEAM-14405 started by Niel Markwick.

> Java Spanner IO NPE when ProjectID not specified in template executions
> ---
>
> Key: BEAM-14405
> URL: https://issues.apache.org/jira/browse/BEAM-14405
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Affects Versions: 2.38.0
>Reporter: Niel Markwick
>Assignee: Niel Markwick
>Priority: P1
> Fix For: 2.39.0
>
>
> BEAM-13665 fixed the issue of NPE's when ProjectIDs were not specified in 
> command line executions by checking for a null ValueProvider, but in 
> _template_ executions, the ValueProvider is non-null but has a null 
> {_}value{_}. 
> [PR #17094|https://github.com/apache/beam/pull/17094] from BEAM-14116 added 
> nullness checks for keys in MonitoringInfoMetricName, which then triggered 
> the NPE in 2.38.0 when a template was executed with an unspecified ProjectID.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (BEAM-14405) Java Spanner IO NPE when ProjectID not specified in template executions

2022-05-04 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-14405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick updated BEAM-14405:
-
Fix Version/s: 2.39.0
   (was: 2.36.0)

> Java Spanner IO NPE when ProjectID not specified in template executions
> ---
>
> Key: BEAM-14405
> URL: https://issues.apache.org/jira/browse/BEAM-14405
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Affects Versions: 2.38.0
>Reporter: Niel Markwick
>Assignee: Niel Markwick
>Priority: P1
> Fix For: 2.39.0
>
>
> BEAM-13665 fixed the issue of NPE's when ProjectIDs were not specified in 
> command line executions by checking for a null ValueProvider, but in 
> _template_ executions, the ValueProvider is non-null but has a null 
> {_}value{_}. 
> [PR #17094|https://github.com/apache/beam/pull/17094] from BEAM-14116 added 
> nullness checks for keys in MonitoringInfoMetricName, which then triggered 
> the NPE in 2.38.0 when a template was executed with an unspecified ProjectID.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (BEAM-14405) Java Spanner IO NPE when ProjectID not specified in template executions

2022-05-04 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-14405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick updated BEAM-14405:
-
Description: 
BEAM-13665 fixed the issue of NPE's when ProjectIDs were not specified in 
command line executions by checking for a null ValueProvider, but in _template_ 
executions, the ValueProvider is non-null but has a null {_}value{_}. 



PR #17094 from BEAM-14116 added nullness checks for keys in 
MonitoringInfoMetricName, which then triggered the NPE in 2.38.0 when a 
template was executed with an unspecified ProjectID.

  was:BEAM-13665 fixed the issue of NPE's when ProjectIDs were not specified in 
command line executions by checking for a null ValueProvider, but in _template_ 
executions, the ValueProvider is non-null but has a null {_}value{_}. 


> Java Spanner IO NPE when ProjectID not specified in template executions
> ---
>
> Key: BEAM-14405
> URL: https://issues.apache.org/jira/browse/BEAM-14405
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Reporter: Niel Markwick
>Assignee: Niel Markwick
>Priority: P1
> Fix For: 2.36.0
>
>
> BEAM-13665 fixed the issue of NPE's when ProjectIDs were not specified in 
> command line executions by checking for a null ValueProvider, but in 
> _template_ executions, the ValueProvider is non-null but has a null 
> {_}value{_}. 
> PR #17094 from BEAM-14116 added nullness checks for keys in 
> MonitoringInfoMetricName, which then triggered the NPE in 2.38.0 when a 
> template was executed with an unspecified ProjectID.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (BEAM-14405) Java Spanner IO NPE when ProjectID not specified in template executions

2022-05-04 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-14405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick updated BEAM-14405:
-
Description: 
BEAM-13665 fixed the issue of NPE's when ProjectIDs were not specified in 
command line executions by checking for a null ValueProvider, but in _template_ 
executions, the ValueProvider is non-null but has a null {_}value{_}. 

[PR #17094|https://github.com/apache/beam/pull/17094] from BEAM-14116 added 
nullness checks for keys in MonitoringInfoMetricName, which then triggered the 
NPE in 2.38.0 when a template was executed with an unspecified ProjectID.

  was:
BEAM-13665 fixed the issue of NPE's when ProjectIDs were not specified in 
command line executions by checking for a null ValueProvider, but in _template_ 
executions, the ValueProvider is non-null but has a null {_}value{_}. 



PR #17094 from BEAM-14116 added nullness checks for keys in 
MonitoringInfoMetricName, which then triggered the NPE in 2.38.0 when a 
template was executed with an unspecified ProjectID.


> Java Spanner IO NPE when ProjectID not specified in template executions
> ---
>
> Key: BEAM-14405
> URL: https://issues.apache.org/jira/browse/BEAM-14405
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Reporter: Niel Markwick
>Assignee: Niel Markwick
>Priority: P1
> Fix For: 2.36.0
>
>
> BEAM-13665 fixed the issue of NPE's when ProjectIDs were not specified in 
> command line executions by checking for a null ValueProvider, but in 
> _template_ executions, the ValueProvider is non-null but has a null 
> {_}value{_}. 
> [PR #17094|https://github.com/apache/beam/pull/17094] from BEAM-14116 added 
> nullness checks for keys in MonitoringInfoMetricName, which then triggered 
> the NPE in 2.38.0 when a template was executed with an unspecified ProjectID.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (BEAM-14405) Java Spanner IO NPE when ProjectID not specified in template executions

2022-05-04 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-14405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick updated BEAM-14405:
-
Affects Version/s: 2.38.0

> Java Spanner IO NPE when ProjectID not specified in template executions
> ---
>
> Key: BEAM-14405
> URL: https://issues.apache.org/jira/browse/BEAM-14405
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Affects Versions: 2.38.0
>Reporter: Niel Markwick
>Assignee: Niel Markwick
>Priority: P1
> Fix For: 2.36.0
>
>
> BEAM-13665 fixed the issue of NPE's when ProjectIDs were not specified in 
> command line executions by checking for a null ValueProvider, but in 
> _template_ executions, the ValueProvider is non-null but has a null 
> {_}value{_}. 
> [PR #17094|https://github.com/apache/beam/pull/17094] from BEAM-14116 added 
> nullness checks for keys in MonitoringInfoMetricName, which then triggered 
> the NPE in 2.38.0 when a template was executed with an unspecified ProjectID.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (BEAM-14405) Java Spanner IO NPE when ProjectID not specified in template executions

2022-05-04 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-14405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick updated BEAM-14405:
-
Description: BEAM-13665 fixed the issue of NPE's when ProjectIDs were not 
specified in command line executions by checking for a null ValueProvider, but 
in _template_ executions, the ValueProvider is non-null but has a null 
{_}value{_}.   (was: According to comments on BEAM-11982, the change to resolve 
it broke backwards compatibility. So that affects 2.34.0 and 2.35.0. It is 
still worthwhile to restore 2.36.0)

> Java Spanner IO NPE when ProjectID not specified in template executions
> ---
>
> Key: BEAM-14405
> URL: https://issues.apache.org/jira/browse/BEAM-14405
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Reporter: Niel Markwick
>Assignee: Niel Markwick
>Priority: P1
> Fix For: 2.36.0
>
>
> BEAM-13665 fixed the issue of NPE's when ProjectIDs were not specified in 
> command line executions by checking for a null ValueProvider, but in 
> _template_ executions, the ValueProvider is non-null but has a null 
> {_}value{_}. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (BEAM-14405) Java Spanner IO NPE when ProjectID not specified in template executions

2022-05-04 Thread Niel Markwick (Jira)
Niel Markwick created BEAM-14405:


 Summary: Java Spanner IO NPE when ProjectID not specified in 
template executions
 Key: BEAM-14405
 URL: https://issues.apache.org/jira/browse/BEAM-14405
 Project: Beam
  Issue Type: Bug
  Components: io-java-gcp
Reporter: Niel Markwick
Assignee: Niel Markwick
 Fix For: 2.36.0


According to comments on BEAM-11982, the change to resolve it broke backwards 
compatibility. So that affects 2.34.0 and 2.35.0. It is still worthwhile to 
restore 2.36.0



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (BEAM-14171) CoGroupByKey loses values with large groups on Dataflow v1

2022-03-24 Thread Niel Markwick (Jira)
Niel Markwick created BEAM-14171:


 Summary: CoGroupByKey loses values with large groups on Dataflow v1
 Key: BEAM-14171
 URL: https://issues.apache.org/jira/browse/BEAM-14171
 Project: Beam
  Issue Type: Bug
  Components: runner-dataflow, sdk-java-core
Affects Versions: 2.37.0, 2.36.0
Reporter: Niel Markwick
Assignee: Robert Bradshaw
 Fix For: 2.38.0


CoGroupByKey can lose elements - replacing them with null values when a group 
is large (>10,000 elements).

 

This only occurs in dataflow v1, not dataflow-v2 runner

Possibly related to BEAM-13541.

 

https://lists.apache.org/thread/5y56kbgm3q0m1byzf7186rrkomrcfldm

 

 

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (BEAM-14121) Incorrect Spanner IO Request Count metrics

2022-03-17 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-14121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick updated BEAM-14121:
-
Remaining Estimate: 120h  (was: 168h)
 Original Estimate: 120h  (was: 168h)

> Incorrect Spanner IO Request Count metrics
> --
>
> Key: BEAM-14121
> URL: https://issues.apache.org/jira/browse/BEAM-14121
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Affects Versions: 2.34.0
>Reporter: Niel Markwick
>Assignee: Niel Markwick
>Priority: P2
>  Labels: google-cloud-spanner
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> IO request count metrics calculated incorrectly for GCP Spanner
>  
> Resource ID is formulated incorrectly
> *Spanner Table:*
> {{//spanner.googleapis.com/projects/\{projectId}/{*}topics{*}/\{databaseId}/tables/\{tableId}}}
> should be
> {{//spanner.googleapis.com/projects/\{projectId}/instances/\{instanceId}/databases/\{databaseId}/tables/\{tableId}}}
> and is populated incorrectly – instance ID is used in place of tableID
> Spanner SQL Query:
> {{//spanner.googleapis.com/projects/\{projectId}/queries/\{queryName} }}
> {{should be}}
> {{{}//spanner.googleapis.com/projects/\{projectId}/{}}}{{{}instances/\{instanceId}/queries{}}}{{{}/\{queryName}
>  {}}}
> and queryName is nullable which cause issued downstream
> this is not actually populated at all - queries are logged as reads on an 
> instance. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (BEAM-14121) Incorrect Spanner IO Request Count metrics

2022-03-17 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-14121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick updated BEAM-14121:
-
Remaining Estimate: 40h  (was: 120h)
 Original Estimate: 40h  (was: 120h)

> Incorrect Spanner IO Request Count metrics
> --
>
> Key: BEAM-14121
> URL: https://issues.apache.org/jira/browse/BEAM-14121
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Affects Versions: 2.34.0
>Reporter: Niel Markwick
>Assignee: Niel Markwick
>Priority: P2
>  Labels: google-cloud-spanner
>   Original Estimate: 40h
>  Remaining Estimate: 40h
>
> IO request count metrics calculated incorrectly for GCP Spanner
>  
> Resource ID is formulated incorrectly
> *Spanner Table:*
> {{//spanner.googleapis.com/projects/\{projectId}/{*}topics{*}/\{databaseId}/tables/\{tableId}}}
> should be
> {{//spanner.googleapis.com/projects/\{projectId}/instances/\{instanceId}/databases/\{databaseId}/tables/\{tableId}}}
> and is populated incorrectly – instance ID is used in place of tableID
> Spanner SQL Query:
> {{//spanner.googleapis.com/projects/\{projectId}/queries/\{queryName} }}
> {{should be}}
> {{{}//spanner.googleapis.com/projects/\{projectId}/{}}}{{{}instances/\{instanceId}/queries{}}}{{{}/\{queryName}
>  {}}}
> and queryName is nullable which cause issued downstream
> this is not actually populated at all - queries are logged as reads on an 
> instance. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (BEAM-14121) Incorrect Spanner IO Request Count metrics

2022-03-17 Thread Niel Markwick (Jira)
Niel Markwick created BEAM-14121:


 Summary: Incorrect Spanner IO Request Count metrics
 Key: BEAM-14121
 URL: https://issues.apache.org/jira/browse/BEAM-14121
 Project: Beam
  Issue Type: Bug
  Components: io-java-gcp
Affects Versions: 2.34.0
Reporter: Niel Markwick
Assignee: Niel Markwick


IO request count metrics calculated incorrectly for GCP Spanner

 

Resource ID is formulated incorrectly

*Spanner Table:*

{{//spanner.googleapis.com/projects/\{projectId}/{*}topics{*}/\{databaseId}/tables/\{tableId}}}

should be

{{//spanner.googleapis.com/projects/\{projectId}/instances/\{instanceId}/databases/\{databaseId}/tables/\{tableId}}}

and is populated incorrectly – instance ID is used in place of tableID



Spanner SQL Query:

{{//spanner.googleapis.com/projects/\{projectId}/queries/\{queryName} }}
{{should be}}
{{{}//spanner.googleapis.com/projects/\{projectId}/{}}}{{{}instances/\{instanceId}/queries{}}}{{{}/\{queryName}
 {}}}

and queryName is nullable which cause issued downstream
this is not actually populated at all - queries are logged as reads on an 
instance. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (BEAM-14005) Exceptions when reading from Spanner ignored

2022-03-01 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-14005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick updated BEAM-14005:
-
Description: 
Regression from BEAM-11982:
[BatchSpannerRead.java#L223|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/BatchSpannerRead.java#L223]
  will ignore exceptions raised when reading from spanner. It will log an 
metric and continue.

 

This can lead to data loss as read errors are unreported and ignored. A 
pipeline can exit with success when data has been partially read or failed to 
read.

  was:
Regression from BEAM-11982:
[BatchSpannerRead.java#L223|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/BatchSpannerRead.java#L223]
  will ignore exceptions raised when reading from spanner. It will log an 
metric and continue.


> Exceptions when reading from Spanner ignored
> 
>
> Key: BEAM-14005
> URL: https://issues.apache.org/jira/browse/BEAM-14005
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Affects Versions: 2.34.0, 2.35.0, 2.36.0, 2.37.0
>Reporter: Niel Markwick
>Assignee: Niel Markwick
>Priority: P1
>  Labels: cloud-spanner, regression
> Fix For: 2.37.0
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> Regression from BEAM-11982:
> [BatchSpannerRead.java#L223|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/BatchSpannerRead.java#L223]
>   will ignore exceptions raised when reading from spanner. It will log an 
> metric and continue.
>  
> This can lead to data loss as read errors are unreported and ignored. A 
> pipeline can exit with success when data has been partially read or failed to 
> read.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (BEAM-11982) Java Spanner - Implement IO Request Count metrics

2022-02-26 Thread Niel Markwick (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-11982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498426#comment-17498426
 ] 

Niel Markwick commented on BEAM-11982:
--

Found another issue with this PR, which causes a potential regression.
Raised BEAM-14005

> Java Spanner - Implement IO Request Count metrics
> -
>
> Key: BEAM-11982
> URL: https://issues.apache.org/jira/browse/BEAM-11982
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-gcp
>Reporter: Alex Amato
>Assignee: Benjamin Gonzalez
>Priority: P2
> Fix For: 2.34.0
>
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> Reference PRs (See BigQuery IO example) and detailed explanation of what's 
> needed to instrument this IO with Request Count metrics is found in this 
> handoff doc:
> [https://docs.google.com/document/d/1lrz2wE5Dl4zlUfPAenjXIQyleZvqevqoxhyE85aj4sc/edit'|https://docs.google.com/document/d/1lrz2wE5Dl4zlUfPAenjXIQyleZvqevqoxhyE85aj4sc/edit'?authuser=0]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (BEAM-14005) Exceptions when reading from Spanner ignored

2022-02-26 Thread Niel Markwick (Jira)
Niel Markwick created BEAM-14005:


 Summary: Exceptions when reading from Spanner ignored
 Key: BEAM-14005
 URL: https://issues.apache.org/jira/browse/BEAM-14005
 Project: Beam
  Issue Type: Bug
  Components: io-java-gcp
Affects Versions: 2.36.0, 2.35.0, 2.34.0, 2.37.0
Reporter: Niel Markwick
 Fix For: 2.38.0


Regression from BEAM-11982:
[BatchSpannerRead.java#L223|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/BatchSpannerRead.java#L223]
  will ignore exceptions raised when reading from spanner. It will log an 
metric and continue.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (BEAM-14005) Exceptions when reading from Spanner ignored

2022-02-26 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-14005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick reassigned BEAM-14005:


Assignee: Niel Markwick

> Exceptions when reading from Spanner ignored
> 
>
> Key: BEAM-14005
> URL: https://issues.apache.org/jira/browse/BEAM-14005
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Affects Versions: 2.34.0, 2.35.0, 2.36.0, 2.37.0
>Reporter: Niel Markwick
>Assignee: Niel Markwick
>Priority: P1
>  Labels: cloud-spanner, regression
> Fix For: 2.38.0
>
>
> Regression from BEAM-11982:
> [BatchSpannerRead.java#L223|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/BatchSpannerRead.java#L223]
>   will ignore exceptions raised when reading from spanner. It will log an 
> metric and continue.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (BEAM-13668) Java Spanner IO Request Count metrics broke backwards compatibility

2022-02-16 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-13668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick updated BEAM-13668:
-
Fix Version/s: 2.36.0
   (was: 2.38.0)
   Resolution: Duplicate
   Status: Resolved  (was: Open)

Marking as duplicate

> Java Spanner IO Request Count metrics broke backwards compatibility
> ---
>
> Key: BEAM-13668
> URL: https://issues.apache.org/jira/browse/BEAM-13668
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Reporter: Niel Markwick
>Assignee: Niel Markwick
>Priority: P1
> Fix For: 2.36.0
>
>
> According to comments on BEAM-11982, the change to resolve it broke backwards 
> compatibility. So that affects 2.34.0 and 2.35.0. It is still worthwhile to 
> restore 2.36.0



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (BEAM-13665) Spanner IO request metrics requires projectId within the config when it didn't in the past

2022-01-19 Thread Niel Markwick (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-13665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478559#comment-17478559
 ] 

Niel Markwick commented on BEAM-13665:
--

Resolved by #16547



Created BEAM-13687 as a followup for a potential performance regression due to 
creating metrics for every element.

> Spanner IO request metrics requires projectId within the config when it 
> didn't in the past
> --
>
> Key: BEAM-13665
> URL: https://issues.apache.org/jira/browse/BEAM-13665
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Affects Versions: 2.34.0, 2.35.0
>Reporter: Luke Cwik
>Assignee: Niel Markwick
>Priority: P1
> Fix For: 2.36.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> [https://github.com/apache/beam/pull/15493] makes the GCP projectID a 
> required parameter - which it was not before, as it could be inferred from 
> the environment - and thus breaks backward compatibility.
> Specifically: BatchSpannerRead.java:175 – the toString() is not required on 
> the valueProvider, it should just be a get(), and createServiceCallMetric() 
> in line 195 should handle the situation where projectID could be null
> Also SpannerIO.java:1683 has a similar issue with the Write transform.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (BEAM-13687) Java Spanner - Improve IO Request Count metrics

2022-01-19 Thread Niel Markwick (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-13687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478558#comment-17478558
 ] 

Niel Markwick commented on BEAM-13687:
--

[~ajam...@google.com] [~lcwik] 

> Java Spanner - Improve IO Request Count metrics
> ---
>
> Key: BEAM-13687
> URL: https://issues.apache.org/jira/browse/BEAM-13687
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-gcp
>Reporter: Niel Markwick
>Assignee: Benjamin Gonzalez
>Priority: P2
> Fix For: 2.34.0
>
>
> Request count metrics are created in the ProcessElement function for 
> SpannerIO Reads and Writes: 
> [BatchSpannerRead.java#L173|https://github.com/nielm/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/BatchSpannerRead.java#L173]
> [SpannerIO.java#L1681|https://github.com/nielm/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/SpannerIO.java#L1681]
>  
> This could cause a performance regression
> (ref https://issues.apache.org/jira/browse/BEAM-13665
> [https://github.com/apache/beam/pull/16547] )



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (BEAM-13687) Java Spanner - Improve IO Request Count metrics

2022-01-19 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-13687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick updated BEAM-13687:
-
Description: 
Request count metrics are created in the ProcessElement function for SpannerIO 
Reads and Writes: 

[BatchSpannerRead.java#L173|https://github.com/nielm/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/BatchSpannerRead.java#L173]

[SpannerIO.java#L1681|https://github.com/nielm/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/SpannerIO.java#L1681]
 

This could cause a performance regression

(ref https://issues.apache.org/jira/browse/BEAM-13665

[https://github.com/apache/beam/pull/16547] )

  was:
Request count metrics are created in the ProcessElement function for SpannerIO 
Reads and Writes: 

[BatchSpannerRead.java#L173|https://github.com/nielm/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/BatchSpannerRead.java#L173]

[SpannerIO.java#L1681|https://github.com/nielm/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/SpannerIO.java#L1681


> Java Spanner - Improve IO Request Count metrics
> ---
>
> Key: BEAM-13687
> URL: https://issues.apache.org/jira/browse/BEAM-13687
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-gcp
>Reporter: Niel Markwick
>Assignee: Benjamin Gonzalez
>Priority: P2
> Fix For: 2.34.0
>
>
> Request count metrics are created in the ProcessElement function for 
> SpannerIO Reads and Writes: 
> [BatchSpannerRead.java#L173|https://github.com/nielm/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/BatchSpannerRead.java#L173]
> [SpannerIO.java#L1681|https://github.com/nielm/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/SpannerIO.java#L1681]
>  
> This could cause a performance regression
> (ref https://issues.apache.org/jira/browse/BEAM-13665
> [https://github.com/apache/beam/pull/16547] )



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (BEAM-13687) Java Spanner - Improve IO Request Count metrics

2022-01-19 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-13687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick updated BEAM-13687:
-
Description: 
Request count metrics are created in the ProcessElement function for SpannerIO 
Reads and Writes: 

[BatchSpannerRead.java#L173|https://github.com/nielm/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/BatchSpannerRead.java#L173]

[SpannerIO.java#L1681|https://github.com/nielm/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/SpannerIO.java#L1681

  was:
Reference PRs (See BigQuery IO example) and detailed explanation of what's 
needed to instrument this IO with Request Count metrics is found in this 
handoff doc:

[https://docs.google.com/document/d/1lrz2wE5Dl4zlUfPAenjXIQyleZvqevqoxhyE85aj4sc/edit'|https://docs.google.com/document/d/1lrz2wE5Dl4zlUfPAenjXIQyleZvqevqoxhyE85aj4sc/edit'?authuser=0]


> Java Spanner - Improve IO Request Count metrics
> ---
>
> Key: BEAM-13687
> URL: https://issues.apache.org/jira/browse/BEAM-13687
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-gcp
>Reporter: Niel Markwick
>Assignee: Benjamin Gonzalez
>Priority: P2
> Fix For: 2.34.0
>
>
> Request count metrics are created in the ProcessElement function for 
> SpannerIO Reads and Writes: 
> [BatchSpannerRead.java#L173|https://github.com/nielm/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/BatchSpannerRead.java#L173]
> [SpannerIO.java#L1681|https://github.com/nielm/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/SpannerIO.java#L1681



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (BEAM-13687) Java Spanner - Improve IO Request Count metrics

2022-01-19 Thread Niel Markwick (Jira)
Niel Markwick created BEAM-13687:


 Summary: Java Spanner - Improve IO Request Count metrics
 Key: BEAM-13687
 URL: https://issues.apache.org/jira/browse/BEAM-13687
 Project: Beam
  Issue Type: Improvement
  Components: io-java-gcp
Reporter: Niel Markwick
Assignee: Benjamin Gonzalez
 Fix For: 2.34.0


Reference PRs (See BigQuery IO example) and detailed explanation of what's 
needed to instrument this IO with Request Count metrics is found in this 
handoff doc:

[https://docs.google.com/document/d/1lrz2wE5Dl4zlUfPAenjXIQyleZvqevqoxhyE85aj4sc/edit'|https://docs.google.com/document/d/1lrz2wE5Dl4zlUfPAenjXIQyleZvqevqoxhyE85aj4sc/edit'?authuser=0]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (BEAM-13665) Spanner IO request metrics requires projectId within the config when it didn't in the past

2022-01-18 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-13665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick updated BEAM-13665:
-
Description: 
[https://github.com/apache/beam/pull/15493] makes the GCP projectID a required 
parameter - which it was not before, as it could be inferred from the 
environment - and thus breaks backward compatibility.

Specifically: BatchSpannerRead.java:175 – the toString() is not required on the 
valueProvider, it should just be a get(), and createServiceCallMetric() in line 
195 should handle the situation where projectID could be null

Also SpannerIO.java:1683 has a similar issue with the Write transform.

  was:
https://github.com/apache/beam/pull/15493 makes the GCP projectID a required 
parameter - which it was not before, as it could be inferred from the 
environment - and thus breaks backward compatibility.

Specifically: BatchSpannerRead.java:175 -- the toString() is not required on 
the valueProvider, it should just be a get(), and createServiceCallMetric() in 
line 195 should handle the situation where projectID could be null


> Spanner IO request metrics requires projectId within the config when it 
> didn't in the past
> --
>
> Key: BEAM-13665
> URL: https://issues.apache.org/jira/browse/BEAM-13665
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Affects Versions: 2.34.0, 2.35.0
>Reporter: Luke Cwik
>Assignee: Niel Markwick
>Priority: P1
> Fix For: 2.36.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> [https://github.com/apache/beam/pull/15493] makes the GCP projectID a 
> required parameter - which it was not before, as it could be inferred from 
> the environment - and thus breaks backward compatibility.
> Specifically: BatchSpannerRead.java:175 – the toString() is not required on 
> the valueProvider, it should just be a get(), and createServiceCallMetric() 
> in line 195 should handle the situation where projectID could be null
> Also SpannerIO.java:1683 has a similar issue with the Write transform.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (BEAM-13665) Spanner IO request metrics requires projectId within the config when it didn't in the past

2022-01-18 Thread Niel Markwick (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-13665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478012#comment-17478012
 ] 

Niel Markwick commented on BEAM-13665:
--

Created [https://github.com/apache/beam/pull/16547] 

> Spanner IO request metrics requires projectId within the config when it 
> didn't in the past
> --
>
> Key: BEAM-13665
> URL: https://issues.apache.org/jira/browse/BEAM-13665
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Affects Versions: 2.34.0, 2.35.0
>Reporter: Luke Cwik
>Assignee: Niel Markwick
>Priority: P1
> Fix For: 2.36.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> https://github.com/apache/beam/pull/15493 makes the GCP projectID a required 
> parameter - which it was not before, as it could be inferred from the 
> environment - and thus breaks backward compatibility.
> Specifically: BatchSpannerRead.java:175 -- the toString() is not required on 
> the valueProvider, it should just be a get(), and createServiceCallMetric() 
> in line 195 should handle the situation where projectID could be null



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (BEAM-13665) Spanner IO request metrics requires projectId within the config when it didn't in the past

2022-01-18 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-13665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick reassigned BEAM-13665:


Assignee: Niel Markwick

> Spanner IO request metrics requires projectId within the config when it 
> didn't in the past
> --
>
> Key: BEAM-13665
> URL: https://issues.apache.org/jira/browse/BEAM-13665
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Affects Versions: 2.34.0, 2.35.0
>Reporter: Luke Cwik
>Assignee: Niel Markwick
>Priority: P1
> Fix For: 2.36.0
>
>
> https://github.com/apache/beam/pull/15493 makes the GCP projectID a required 
> parameter - which it was not before, as it could be inferred from the 
> environment - and thus breaks backward compatibility.
> Specifically: BatchSpannerRead.java:175 -- the toString() is not required on 
> the valueProvider, it should just be a get(), and createServiceCallMetric() 
> in line 195 should handle the situation where projectID could be null



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (BEAM-11982) Java Spanner - Implement IO Request Count metrics

2022-01-14 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-11982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick updated BEAM-11982:
-
Priority: P2  (was: P3)

> Java Spanner - Implement IO Request Count metrics
> -
>
> Key: BEAM-11982
> URL: https://issues.apache.org/jira/browse/BEAM-11982
> Project: Beam
>  Issue Type: Test
>  Components: io-java-gcp
>Reporter: Alex Amato
>Assignee: Benjamin Gonzalez
>Priority: P2
> Fix For: Missing
>
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
> Reference PRs (See BigQuery IO example) and detailed explanation of what's 
> needed to instrument this IO with Request Count metrics is found in this 
> handoff doc:
> [https://docs.google.com/document/d/1lrz2wE5Dl4zlUfPAenjXIQyleZvqevqoxhyE85aj4sc/edit'|https://docs.google.com/document/d/1lrz2wE5Dl4zlUfPAenjXIQyleZvqevqoxhyE85aj4sc/edit'?authuser=0]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Reopened] (BEAM-11982) Java Spanner - Implement IO Request Count metrics

2022-01-14 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-11982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick reopened BEAM-11982:
--

Reopening as this breaks backward compatibility. 

[BatchSpannerRead.java:175|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/BatchSpannerRead.java#L175]
 calls toString() on the valueproviders, which can throw a null pointer 
exception if they are not set (it should call `get()`) 

The projectId value is optional as it can be inferred from the environment, and 
if it is not set, this line will throw a runtime NPE. 

Please fix, and also update createServiceCallMetric to handle the possibility 
of a NULL projectID

> Java Spanner - Implement IO Request Count metrics
> -
>
> Key: BEAM-11982
> URL: https://issues.apache.org/jira/browse/BEAM-11982
> Project: Beam
>  Issue Type: Test
>  Components: io-java-gcp
>Reporter: Alex Amato
>Assignee: Benjamin Gonzalez
>Priority: P3
> Fix For: Missing
>
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
> Reference PRs (See BigQuery IO example) and detailed explanation of what's 
> needed to instrument this IO with Request Count metrics is found in this 
> handoff doc:
> [https://docs.google.com/document/d/1lrz2wE5Dl4zlUfPAenjXIQyleZvqevqoxhyE85aj4sc/edit'|https://docs.google.com/document/d/1lrz2wE5Dl4zlUfPAenjXIQyleZvqevqoxhyE85aj4sc/edit'?authuser=0]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (BEAM-11722) SpannerIO.read() parallelism limited by partition count

2021-07-13 Thread Niel Markwick (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-11722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380010#comment-17380010
 ] 

Niel Markwick commented on BEAM-11722:
--

I did some experiments on this, comparing: 

 
 # Reading a table using a single thread
 # Reading using the pre-supplied partitions from the Spanner Partitioned Read 
API
(in my test, this supplied 38 partitions, but only 3 had data)
 # Splitting the key into 1000 key-ranges, shuffling them, and then reading 
these 1000 in parallel using multiple workers. 

1) was obviously the slowest. 

2) and 3) were about equally fast to read the actual data - implying that 
despite the relatively low number of populated partitions, this was the most 
efficient way to read the data from the database.

 

Background information: Spanner stores the tables by creating blocks of 
key-ranges and assigning ownership of these key-ranges (splits) to specific 
Spanner nodes. 

The Partitioned Read API returns partitions that correspond, roughly, to these 
splits, so when reading a partition, a single Spanner node is tasked with 
reading the data. 

 

Of course only having very few partition elements with actual rows may cause  
issues for the downstream pipeline if the work done on these rows takes a long 
time, and would benefit from additional parallelization.

 

I don't believe putting a reshuffle on the output of SpannerIO.Read by default 
is a good idea.
For very large reads (in the TB's), this will delay the time until the first 
record is emitted significantly, and may cause additional costs to shuffle the 
data. 

If the end user needs better parallelization of the SpannerIO.Read output, then 
they can add a Reshuffle themselves.

 

 

> SpannerIO.read() parallelism limited by partition count
> ---
>
> Key: BEAM-11722
> URL: https://issues.apache.org/jira/browse/BEAM-11722
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Reporter: Udi Meiri
>Priority: P3
>  Labels: google-cloud-spanner
>
> Setting the partition size / count is not possible: 
> {code} Note: This hint is currently ignored by sessions.partitionQuery and 
> sessions.partitionRead requests.{code}
> https://cloud.google.com/spanner/docs/reference/rest/v1/PartitionOptions
> Adding a Reshuffle might be the only choice, but it adds time and resource 
> usage.
> cc: [~chamikara][~nielm]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-11722) SpannerIO.read() parallelism limited by partition count

2021-02-17 Thread Niel Markwick (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-11722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17285853#comment-17285853
 ] 

Niel Markwick commented on BEAM-11722:
--

Spanner eng confirms that these options are ignored at present. 

Adding a reshuffle to the output of Read/ReadAll would be a workaround but it 
would block the pipeline until all data has been read -  which may take a long 
time and use a large amount of resources. 

This can be done in customer pipelines if they feel that the performance 
improvement of downstream parallelization is worth the additional latency. 

> SpannerIO.read() parallelism limited by partition count
> ---
>
> Key: BEAM-11722
> URL: https://issues.apache.org/jira/browse/BEAM-11722
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Reporter: Udi Meiri
>Assignee: Niel Markwick
>Priority: P2
>  Labels: google-cloud-spanner
>
> Setting the partition size / count is not possible: 
> {code} Note: This hint is currently ignored by sessions.partitionQuery and 
> sessions.partitionRead requests.{code}
> https://cloud.google.com/spanner/docs/reference/rest/v1/PartitionOptions
> Adding a Reshuffle might be the only choice, but it adds time and resource 
> usage.
> cc: [~chamikara][~nielm]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (BEAM-11722) SpannerIO.read() parallelism limited by partition count

2021-02-04 Thread Niel Markwick (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-11722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278781#comment-17278781
 ] 

Niel Markwick edited comment on BEAM-11722 at 2/4/21, 11:31 AM:


I will follow up with Spanner engineering to determine whether these options 
are actually used (and incorrectly documented), and log a feature request if 
not. 

 

Thanks for raising. 


was (Author: nielm):
I will follow up with Spanner engineer to determine whether these options are 
actually used (and incorrectly documented), and log a feature request if not. 

 

Thanks for raising. 

> SpannerIO.read() parallelism limited by partition count
> ---
>
> Key: BEAM-11722
> URL: https://issues.apache.org/jira/browse/BEAM-11722
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Reporter: Udi Meiri
>Assignee: Niel Markwick
>Priority: P2
>  Labels: google-cloud-spanner
>
> Setting the partition size / count is not possible: 
> {code} Note: This hint is currently ignored by sessions.partitionQuery and 
> sessions.partitionRead requests.{code}
> https://cloud.google.com/spanner/docs/reference/rest/v1/PartitionOptions
> Adding a Reshuffle might be the only choice, but it adds time and resource 
> usage.
> cc: [~chamikara][~nielm]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-11722) SpannerIO.read() parallelism limited by partition count

2021-02-04 Thread Niel Markwick (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-11722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278781#comment-17278781
 ] 

Niel Markwick commented on BEAM-11722:
--

I will follow up with Spanner engineer to determine whether these options are 
actually used (and incorrectly documented), and log a feature request if not. 

 

Thanks for raising. 

> SpannerIO.read() parallelism limited by partition count
> ---
>
> Key: BEAM-11722
> URL: https://issues.apache.org/jira/browse/BEAM-11722
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Reporter: Udi Meiri
>Assignee: Niel Markwick
>Priority: P2
>  Labels: google-cloud-spanner
>
> Setting the partition size / count is not possible: 
> {code} Note: This hint is currently ignored by sessions.partitionQuery and 
> sessions.partitionRead requests.{code}
> https://cloud.google.com/spanner/docs/reference/rest/v1/PartitionOptions
> Adding a Reshuffle might be the only choice, but it adds time and resource 
> usage.
> cc: [~chamikara][~nielm]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-11722) SpannerIO.read() parallelism limited by partition count

2021-02-04 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-11722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick updated BEAM-11722:
-
Component/s: (was: runner-dataflow)
 Labels: google-cloud-spanner  (was: )

> SpannerIO.read() parallelism limited by partition count
> ---
>
> Key: BEAM-11722
> URL: https://issues.apache.org/jira/browse/BEAM-11722
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Reporter: Udi Meiri
>Assignee: Niel Markwick
>Priority: P2
>  Labels: google-cloud-spanner
>
> Setting the partition size / count is not possible: 
> {code} Note: This hint is currently ignored by sessions.partitionQuery and 
> sessions.partitionRead requests.{code}
> https://cloud.google.com/spanner/docs/reference/rest/v1/PartitionOptions
> Adding a Reshuffle might be the only choice, but it adds time and resource 
> usage.
> cc: [~chamikara][~nielm]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (BEAM-11722) SpannerIO.read() parallelism limited by partition count

2021-02-04 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-11722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick reassigned BEAM-11722:


Assignee: Niel Markwick

> SpannerIO.read() parallelism limited by partition count
> ---
>
> Key: BEAM-11722
> URL: https://issues.apache.org/jira/browse/BEAM-11722
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp, runner-dataflow
>Reporter: Udi Meiri
>Assignee: Niel Markwick
>Priority: P2
>
> Setting the partition size / count is not possible: 
> {code} Note: This hint is currently ignored by sessions.partitionQuery and 
> sessions.partitionRead requests.{code}
> https://cloud.google.com/spanner/docs/reference/rest/v1/PartitionOptions
> Adding a Reshuffle might be the only choice, but it adds time and resource 
> usage.
> cc: [~chamikara][~nielm]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-11722) SpannerIO.read() parallelism limited by partition count

2021-02-04 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-11722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick updated BEAM-11722:
-
Status: Open  (was: Triage Needed)

> SpannerIO.read() parallelism limited by partition count
> ---
>
> Key: BEAM-11722
> URL: https://issues.apache.org/jira/browse/BEAM-11722
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp, runner-dataflow
>Reporter: Udi Meiri
>Assignee: Niel Markwick
>Priority: P2
>
> Setting the partition size / count is not possible: 
> {code} Note: This hint is currently ignored by sessions.partitionQuery and 
> sessions.partitionRead requests.{code}
> https://cloud.google.com/spanner/docs/reference/rest/v1/PartitionOptions
> Adding a Reshuffle might be the only choice, but it adds time and resource 
> usage.
> cc: [~chamikara][~nielm]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-11643) SpannerIO does not support using BigDecimal for Numeric fields

2021-01-30 Thread Niel Markwick (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-11643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17275678#comment-17275678
 ] 

Niel Markwick commented on BEAM-11643:
--

Thanks!

> SpannerIO does not support using BigDecimal for Numeric fields
> --
>
> Key: BEAM-11643
> URL: https://issues.apache.org/jira/browse/BEAM-11643
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Affects Versions: 2.27.0
>Reporter: Niel Markwick
>Assignee: Niel Markwick
>Priority: P2
> Fix For: 2.28.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> Since NUMERIC types were added to Spanner, it has been possible to assign a 
> Java BigDecimal to a column in a Spanner mutation.
> The feature to support NUMERIC in BEAM was added in 
> [PR#12818|https://github.com/apache/beam/pull/12818]  (BEAM-10875), but 
> missed adding BigDecimal support to the part of the code where the size of 
> mutated values are calculated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-11643) SpannerIO does not support using BigDecimal for Numeric fields

2021-01-15 Thread Niel Markwick (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-11643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265948#comment-17265948
 ] 

Niel Markwick commented on BEAM-11643:
--

TODO: Add BigDecimal support to 
org.apache.beam.sdk.io.gcp.spanner.MutationSizeEstimator

> SpannerIO does not support using BigDecimal for Numeric fields
> --
>
> Key: BEAM-11643
> URL: https://issues.apache.org/jira/browse/BEAM-11643
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Affects Versions: 2.27.0
>Reporter: Niel Markwick
>Assignee: Niel Markwick
>Priority: P2
> Fix For: 2.28.0
>
>
> Since NUMERIC types were added to Spanner, it has been possible to assign a 
> Java BigDecimal to a column in a Spanner mutation.
> The feature to support NUMERIC in BEAM was added in 
> [PR#12818|https://github.com/apache/beam/pull/12818]  (BEAM-10875), but 
> missed adding BigDecimal support to the part of the code where the size of 
> mutated values are calculated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-11643) SpannerIO does not support using BigDecimal for Numeric fields

2021-01-15 Thread Niel Markwick (Jira)
Niel Markwick created BEAM-11643:


 Summary: SpannerIO does not support using BigDecimal for Numeric 
fields
 Key: BEAM-11643
 URL: https://issues.apache.org/jira/browse/BEAM-11643
 Project: Beam
  Issue Type: Bug
  Components: io-java-gcp
Affects Versions: 2.27.0
Reporter: Niel Markwick
Assignee: Niel Markwick
 Fix For: 2.28.0


Since NUMERIC types were added to Spanner, it has been possible to assign a 
Java BigDecimal to a column in a Spanner mutation.

The feature to support NUMERIC in BEAM was added in 
[PR#12818|https://github.com/apache/beam/pull/12818]  (BEAM-10875), but missed 
adding BigDecimal support to the part of the code where the size of mutated 
values are calculated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (BEAM-6764) OutOfMemory Exception in Dataflow Spanner Write Mutations

2020-06-29 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-6764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick closed BEAM-6764.
---
Fix Version/s: 2.23.0
   Resolution: Fixed

> OutOfMemory Exception in Dataflow Spanner Write Mutations
> -
>
> Key: BEAM-6764
> URL: https://issues.apache.org/jira/browse/BEAM-6764
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Affects Versions: 2.9.0, 2.10.0, 2.11.0
>Reporter: Joshua
>Priority: P3
> Fix For: 2.23.0
>
>
> Since I upgraded my apache beam sdk version to >= 2.9.0, I have been noticing 
> OOM exceptions while using the dataflow runner to write mutations to spanner. 
> I have been using n1-standard-4 since version 2.9.0. On that version, it 
> works. But on higher versions, I get the exception.
>  
> The stackdriver logs is provided below
> {code:java}
> java.lang.OutOfMemoryError: Java heap space
> java.util.ArrayList.(ArrayList.java:152)
> org.apache.beam.sdk.io.gcp.spanner.SpannerIO$GatherBundleAndSortFn.initSorter(SpannerIO.java:1056)
> org.apache.beam.sdk.io.gcp.spanner.SpannerIO$GatherBundleAndSortFn.startBundle(SpannerIO.java:1049)
> {code}
> I have a very basic PTransform for writing to Spanner:
> {code:java}
> SpannerIO.write().withInstanceId(options.getSpannerInstanceId()) 
> .withDatabaseId(options.getSpannerDatabaseId()) 
> .withProjectId(options.getProject()) 
> .withFailureMode(SpannerIO.FailureMode.REPORT_FAILURES);
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (BEAM-6913) Reading data from Spanner never ends

2020-06-29 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-6913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick closed BEAM-6913.
---
Fix Version/s: Not applicable
   Resolution: Cannot Reproduce

> Reading data from Spanner never ends
> 
>
> Key: BEAM-6913
> URL: https://issues.apache.org/jira/browse/BEAM-6913
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Affects Versions: 2.11.0
> Environment: macOS Mojave (10.14.3)
>Reporter: Mousa HAMAD
>Priority: P3
> Fix For: Not applicable
>
>
> Whenever my pipeline reads from Spanner, the code runs infinitely. If I 
> update the spanner dependency (_com.google.cloud:google-cloud-spanner_) to 
> e.g., _1.11.0,_ then everything works as expected.
> Consider the following simple pipeline, which never ends:
> {code:java}
> public class Prototype_Spanner {
> private static String INSTANCE_ID = "XYZ";
> private static String DATABASE_ID = "test_beam";
> private static String TABLE_NAME = "item";
> private static void runExample() {
> PipelineOptions options = PipelineOptionsFactory.create();
> options.setRunner(DirectRunner.class);
> Pipeline pipeline = Pipeline.create(options);
> pipeline
> .apply("Read", SpannerIO.read()
> .withInstanceId(INSTANCE_ID)
> .withDatabaseId(DATABASE_ID)
> .withTable(TABLE_NAME)
> .withColumns("price"))
> .apply("Extract Price", MapElements
> .into(TypeDescriptors.longs())
> .via((Struct struct) -> struct.getLong("price")))
> .apply("Calculate Mean", Mean.globally())
> .apply("Map to string", MapElements
> .into(TypeDescriptor.of(String.class))
> .via(Object::toString))
> .apply("Write", TextIO.write().to("/tmp/output"));
> pipeline.run().waitUntilFinish();
> }
> public static void main(String[] args) {
> runExample();
> }
> }
> {code}
> Following is my full list of dependencies:
> {code:java}
> repositories {
> mavenCentral()
> }
> ext {
> beamVersion = '2.11.0'
> sparkVersion = '2.3.3'
> }
> dependencies {
> compile "org.apache.beam:beam-sdks-java-core:$beamVersion"
> compile 
> "org.apache.beam:beam-sdks-java-extensions-join-library:$beamVersion"
> compile 
> "org.apache.beam:beam-sdks-java-extensions-google-cloud-platform-core:$beamVersion"
> compile 
> "org.apache.beam:beam-sdks-java-io-google-cloud-platform:$beamVersion"
> compile "org.apache.beam:beam-runners-core-java:$beamVersion"
> compile "org.apache.beam:beam-runners-direct-java:$beamVersion"
> compile "org.apache.beam:beam-runners-spark:$beamVersion"
> compile "org.apache.spark:spark-core_2.11:$sparkVersion"
> compile "org.apache.spark:spark-streaming_2.11:$sparkVersion"
> // This line fixed the issue for me
> // compile "com.google.cloud:google-cloud-spanner:1.11.0"
> testCompile "junit:junit:4.12"
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (BEAM-7732) Allow access to SpannerOptions in Beam

2020-06-28 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick resolved BEAM-7732.
-
Fix Version/s: Not applicable
   Resolution: Won't Fix

> Allow access to SpannerOptions in Beam
> --
>
> Key: BEAM-7732
> URL: https://issues.apache.org/jira/browse/BEAM-7732
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-gcp
>Affects Versions: 2.12.0, 2.13.0
>Reporter: Niel Markwick
>Assignee: Niel Markwick
>Priority: P3
> Fix For: Not applicable
>
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> Beam hides the 
> [SpannerOptions|https://github.com/googleapis/google-cloud-java/blob/master/google-cloud-clients/google-cloud-spanner/src/main/java/com/google/cloud/spanner/SpannerOptions.java]
>  object behind a 
> [SpannerConfig|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/SpannerConfig.java]
>  object because the SpannerOptions object is not serializable. 
> This means that the only options that can be set are those that can be 
> specified in SpannerConfig - limited to host, project, instance, database.
> Suggestion: add the possibility to set a SpannerOptionsFactory in 
> SpannerConfig:
> {code:java}
> public interface SpannerOptionsFactory extends Serializable {
>    public SpannerOptions create();
> }
> {code}
> This would allow the user use this factory class to specify custom 
> SpannerOptions before they are passed onto the connectToSpanner() method; 
> connectToSpanner() would then become: 
> {code:java}
> public SpannerAccessor connectToSpanner() {
>   
>   SpannerOptions.Builder builder = spannerOptionsFactory.create().toBuilder();
>   // rest of connectToSpanner follows, setting project, host, etc.
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (BEAM-10047) SpannerIO: Combine sorting and batching

2020-06-28 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick resolved BEAM-10047.
--
Resolution: Fixed

> SpannerIO: Combine sorting and batching
> ---
>
> Key: BEAM-10047
> URL: https://issues.apache.org/jira/browse/BEAM-10047
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Affects Versions: 2.16.0, 2.17.0, 2.18.0, 2.19.0, 2.20.0, 2.21.0, 2.22.0
>Reporter: Brian Hulette
>Assignee: Niel Markwick
>Priority: P2
> Fix For: 2.23.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> having Spanner IO's grouping, sorting and batching as separate stages makes 
> no sense as these stages will be fused in any runner that supports fusion. 
> In addition during these stages, even with fusion, at least 3 full copies of 
> each mutation are made (serialized then deserialized objects), which leads to 
> an extremely large use of memory. 
>  
> Combining these stages would significantly reduce memory usage, and improve 
> performance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (BEAM-10259) Spanner Session leak/overload in Streaming Dataflow

2020-06-28 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick resolved BEAM-10259.
--
Resolution: Fixed

> Spanner Session leak/overload in Streaming Dataflow
> ---
>
> Key: BEAM-10259
> URL: https://issues.apache.org/jira/browse/BEAM-10259
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Affects Versions: 2.18.0, 2.19.0, 2.21.0, 2.22.0
>Reporter: Niel Markwick
>Assignee: Niel Markwick
>Priority: P2
>  Labels: dataflow, gcp, io
> Fix For: 2.23.0
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> SpannerIO.WriteToSpannerFn connects to Spanner every time @Setup is called, 
> and closes the connection every time @Teardown is called. 
> This actually creates a separate Spanner connection and session pool for each 
> WriteToSpannerFn, which generally speaking is one per thread
> In single-threaded runners (eg batch dataflow on a single vCPU machine) this 
> is not an issue, as there is normally only one WriteToSpannerFn per 
> node/process.
> In multi-threaded runners (eg streaming dataflow, or batch on multiple CPU 
> machines), this can cause a problem with many session pools created (1 per 
> thread) which can cause a respource leak, and is in general wasteful.
> Spanner connections (and session pools) should be shared among all threads of 
> a single process. so that the connection is only opened and closed once.
> [~alxavier]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-7732) Allow access to SpannerOptions in Beam

2020-06-28 Thread Niel Markwick (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-7732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147255#comment-17147255
 ] 

Niel Markwick commented on BEAM-7732:
-

Current philosophy of Beam is not to expose client library classes, so this 
will not be implemented.
The commit deadline has been exposed using a withCommitDeadline() setter on 
SpannerIO.Write.

> Allow access to SpannerOptions in Beam
> --
>
> Key: BEAM-7732
> URL: https://issues.apache.org/jira/browse/BEAM-7732
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-gcp
>Affects Versions: 2.12.0, 2.13.0
>Reporter: Niel Markwick
>Assignee: Niel Markwick
>Priority: P3
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> Beam hides the 
> [SpannerOptions|https://github.com/googleapis/google-cloud-java/blob/master/google-cloud-clients/google-cloud-spanner/src/main/java/com/google/cloud/spanner/SpannerOptions.java]
>  object behind a 
> [SpannerConfig|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/SpannerConfig.java]
>  object because the SpannerOptions object is not serializable. 
> This means that the only options that can be set are those that can be 
> specified in SpannerConfig - limited to host, project, instance, database.
> Suggestion: add the possibility to set a SpannerOptionsFactory in 
> SpannerConfig:
> {code:java}
> public interface SpannerOptionsFactory extends Serializable {
>    public SpannerOptions create();
> }
> {code}
> This would allow the user use this factory class to specify custom 
> SpannerOptions before they are passed onto the connectToSpanner() method; 
> connectToSpanner() would then become: 
> {code:java}
> public SpannerAccessor connectToSpanner() {
>   
>   SpannerOptions.Builder builder = spannerOptionsFactory.create().toBuilder();
>   // rest of connectToSpanner follows, setting project, host, etc.
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (BEAM-9269) Set shorter Commit Deadline and handle with backoff/retry

2020-06-28 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick closed BEAM-9269.
---
Fix Version/s: (was: 2.20.0)
   2.23.0
   Resolution: Fixed

> Set shorter Commit Deadline and handle with backoff/retry
> -
>
> Key: BEAM-9269
> URL: https://issues.apache.org/jira/browse/BEAM-9269
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-gcp
>Affects Versions: 2.16.0, 2.17.0, 2.18.0, 2.19.0
>Reporter: Niel Markwick
>Assignee: Niel Markwick
>Priority: P2
>  Labels: google-cloud-spanner
> Fix For: 2.23.0
>
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> Default commit deadline in Spanner is 1hr, which can lead to a variety of 
> issues including database overload and session expiry.
> Shorter deadline should be set with backoff/retry when deadline expires, so 
> that the Spanner database does not become overloaded.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (BEAM-9269) Set shorter Commit Deadline and handle with backoff/retry

2020-06-15 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick reopened BEAM-9269:
-

Re-opening to convert RPC Interceptor code to use GRPC deadline configurations. 

> Set shorter Commit Deadline and handle with backoff/retry
> -
>
> Key: BEAM-9269
> URL: https://issues.apache.org/jira/browse/BEAM-9269
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-gcp
>Affects Versions: 2.16.0, 2.17.0, 2.18.0, 2.19.0
>Reporter: Niel Markwick
>Assignee: Niel Markwick
>Priority: P2
>  Labels: google-cloud-spanner
> Fix For: 2.20.0
>
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
> Default commit deadline in Spanner is 1hr, which can lead to a variety of 
> issues including database overload and session expiry.
> Shorter deadline should be set with backoff/retry when deadline expires, so 
> that the Spanner database does not become overloaded.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-10259) Spanner Session leak/overload in Streaming Dataflow

2020-06-15 Thread Niel Markwick (Jira)
Niel Markwick created BEAM-10259:


 Summary: Spanner Session leak/overload in Streaming Dataflow
 Key: BEAM-10259
 URL: https://issues.apache.org/jira/browse/BEAM-10259
 Project: Beam
  Issue Type: Bug
  Components: io-java-gcp
Affects Versions: 2.22.0, 2.21.0, 2.19.0, 2.18.0
Reporter: Niel Markwick
Assignee: Niel Markwick
 Fix For: 2.23.0


SpannerIO.WriteToSpannerFn connects to Spanner every time @Setup is called, and 
closes the connection every time @Teardown is called. 

This actually creates a separate Spanner connection and session pool for each 
WriteToSpannerFn, which generally speaking is one per thread

In single-threaded runners (eg batch dataflow on a single vCPU machine) this is 
not an issue, as there is normally only one WriteToSpannerFn per node/process.

In multi-threaded runners (eg streaming dataflow, or batch on multiple CPU 
machines), this can cause a problem with many session pools created (1 per 
thread) which can cause a respource leak, and is in general wasteful.

Spanner connections (and session pools) should be shared among all threads of a 
single process. so that the connection is only opened and closed once.

[~alxavier]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (BEAM-6887) Streaming Spanner Writer transform

2020-06-15 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-6887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick resolved BEAM-6887.
-
Fix Version/s: 2.22.0
   Resolution: Abandoned

> Streaming Spanner Writer transform
> --
>
> Key: BEAM-6887
> URL: https://issues.apache.org/jira/browse/BEAM-6887
> Project: Beam
>  Issue Type: New Feature
>  Components: io-java-gcp
>Reporter: Niel Markwick
>Assignee: Niel Markwick
>Priority: P3
> Fix For: 2.22.0
>
>
> At present, the SpannerIO.Write/WriteGrouped transforms work by collecting an 
> entire bundle of elements, sorts them by table/key, splitting the sorted list 
> into batches (by size and number of cells modified) and then writes each 
> batch to Spanner in a single transaction.
> It returns a SpannerWriteResult.java containing :
>  # a PCollection (the main output) - which will have no elements but 
> will be closed to signal when all the input elements have been written (which 
> is never in streaming because input is unbounded)
>  # a PCollection of elements that failed to write.
> This transform is useful as a bulk sink for data because it efficiently 
> writes large amounts of data. 
> It is not at all useful as an intermediate step in a streaming pipeline - 
> because it has no useful output in streaming mode. 
> I propose that we have a separate Spanner Write transform which simply writes 
> each input Mutation to the database, and then pushes successful Mutations 
> onto its output. 
> This would allow use in the middle of a streaming pipeline, where the flow 
> would be
>  * Some data streamed in
>  * Converted to Spanner Mutations
>  * Written to Spanner Database
>  * Further processing where the values written to the Spanner Database are 
> used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-6887) Streaming Spanner Writer transform

2020-06-15 Thread Niel Markwick (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-6887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135800#comment-17135800
 ] 

Niel Markwick commented on BEAM-6887:
-

Meaningful output is covered in BEAM-6921

> Streaming Spanner Writer transform
> --
>
> Key: BEAM-6887
> URL: https://issues.apache.org/jira/browse/BEAM-6887
> Project: Beam
>  Issue Type: New Feature
>  Components: io-java-gcp
>Reporter: Niel Markwick
>Assignee: Niel Markwick
>Priority: P3
>
> At present, the SpannerIO.Write/WriteGrouped transforms work by collecting an 
> entire bundle of elements, sorts them by table/key, splitting the sorted list 
> into batches (by size and number of cells modified) and then writes each 
> batch to Spanner in a single transaction.
> It returns a SpannerWriteResult.java containing :
>  # a PCollection (the main output) - which will have no elements but 
> will be closed to signal when all the input elements have been written (which 
> is never in streaming because input is unbounded)
>  # a PCollection of elements that failed to write.
> This transform is useful as a bulk sink for data because it efficiently 
> writes large amounts of data. 
> It is not at all useful as an intermediate step in a streaming pipeline - 
> because it has no useful output in streaming mode. 
> I propose that we have a separate Spanner Write transform which simply writes 
> each input Mutation to the database, and then pushes successful Mutations 
> onto its output. 
> This would allow use in the middle of a streaming pipeline, where the flow 
> would be
>  * Some data streamed in
>  * Converted to Spanner Mutations
>  * Written to Spanner Database
>  * Further processing where the values written to the Spanner Database are 
> used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-6671) Beam 2.9.0 java.lang.NoSuchFieldError: internal_static_google_rpc_LocalizedMessage_fieldAccessorTable

2020-06-15 Thread Niel Markwick (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-6671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135799#comment-17135799
 ] 

Niel Markwick commented on BEAM-6671:
-

2.19.0 has update the Spanner client library version to 1.49.0, and there have 
been no more recent reports of this. 

> Beam 2.9.0 java.lang.NoSuchFieldError: 
> internal_static_google_rpc_LocalizedMessage_fieldAccessorTable
> -
>
> Key: BEAM-6671
> URL: https://issues.apache.org/jira/browse/BEAM-6671
> Project: Beam
>  Issue Type: New Feature
>  Components: build-system
>Reporter: Alex Amato
>Priority: P2
>  Labels: stale-P2
>
> I received a report from a Dataflow user encountering this in Beam 2.9.0 when 
> creating a spanner instance. I wanted to post this here as this is known to 
> be related to dependency conflicts in the past 
> ([https://stackoverflow.com/questions/46684071/error-using-spannerio-in-apache-beam]).
>  
> java.lang.NoSuchFieldError: 
> internal_static_google_rpc_LocalizedMessage_fieldAccessorTable
> at 
> com.google.rpc.LocalizedMessage.internalGetFieldAccessorTable(LocalizedMessage.java:90)
> at 
> com.google.protobuf.GeneratedMessageV3.getDescriptorForType(GeneratedMessageV3.java:121)
> at io.grpc.protobuf.ProtoUtils.keyForProto(ProtoUtils.java:67)
> at 
> com.google.cloud.spanner.spi.v1.SpannerErrorInterceptor.(SpannerErrorInterceptor.java:47)
> at 
> com.google.cloud.spanner.spi.v1.GrpcSpannerRpc.(GrpcSpannerRpc.java:136)
> at 
> com.google.cloud.spanner.SpannerOptions$DefaultSpannerRpcFactory.create(SpannerOptions.java:73)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (BEAM-6671) Beam 2.9.0 java.lang.NoSuchFieldError: internal_static_google_rpc_LocalizedMessage_fieldAccessorTable

2020-06-15 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-6671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick closed BEAM-6671.
---
Fix Version/s: 2.19.0
   Resolution: Cannot Reproduce

> Beam 2.9.0 java.lang.NoSuchFieldError: 
> internal_static_google_rpc_LocalizedMessage_fieldAccessorTable
> -
>
> Key: BEAM-6671
> URL: https://issues.apache.org/jira/browse/BEAM-6671
> Project: Beam
>  Issue Type: New Feature
>  Components: build-system
>Reporter: Alex Amato
>Priority: P2
>  Labels: stale-P2
> Fix For: 2.19.0
>
>
> I received a report from a Dataflow user encountering this in Beam 2.9.0 when 
> creating a spanner instance. I wanted to post this here as this is known to 
> be related to dependency conflicts in the past 
> ([https://stackoverflow.com/questions/46684071/error-using-spannerio-in-apache-beam]).
>  
> java.lang.NoSuchFieldError: 
> internal_static_google_rpc_LocalizedMessage_fieldAccessorTable
> at 
> com.google.rpc.LocalizedMessage.internalGetFieldAccessorTable(LocalizedMessage.java:90)
> at 
> com.google.protobuf.GeneratedMessageV3.getDescriptorForType(GeneratedMessageV3.java:121)
> at io.grpc.protobuf.ProtoUtils.keyForProto(ProtoUtils.java:67)
> at 
> com.google.cloud.spanner.spi.v1.SpannerErrorInterceptor.(SpannerErrorInterceptor.java:47)
> at 
> com.google.cloud.spanner.spi.v1.GrpcSpannerRpc.(GrpcSpannerRpc.java:136)
> at 
> com.google.cloud.spanner.SpannerOptions$DefaultSpannerRpcFactory.create(SpannerOptions.java:73)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-6320) SpannerReadIT.testQuery flaky

2020-06-15 Thread Niel Markwick (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-6320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135795#comment-17135795
 ] 

Niel Markwick commented on BEAM-6320:
-

Duplicate of BEAM-2538

> SpannerReadIT.testQuery flaky
> -
>
> Key: BEAM-6320
> URL: https://issues.apache.org/jira/browse/BEAM-6320
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Reporter: Andrew Pilloud
>Priority: P2
>  Labels: stale-P2
> Fix For: Not applicable
>
>
> https://builds.apache.org/job/beam_PostCommit_Java/2218/
> {code}
> WARNING: No terminal state was returned. State value UNKNOWN
> Dec 27, 2018 9:08:13 PM org.apache.beam.runners.dataflow.TestDataflowRunner 
> checkForPAssertSuccess
> WARNING: Metrics not present for Dataflow job 
> 2018-12-27_13_02_39-18037927821693074732.
> Dec 27, 2018 9:08:13 PM org.apache.beam.runners.dataflow.TestDataflowRunner 
> run
> WARNING: Dataflow job 2018-12-27_13_02_39-18037927821693074732 did not output 
> a success or failure metric.
> {code}
> https://pantheon.corp.google.com/dataflow/jobsDetail/locations/us-central1/jobs/2018-12-27_13_02_39-18037927821693074732?project=apache-beam-testing
> {code}
> com.google.cloud.spanner.SpannerException: NOT_FOUND: 
> io.grpc.StatusRuntimeException: NOT_FOUND: Database not found: 
> projects/apache-beam-testing/instances/beam-test/databases/beam-testdb-vkf2iqc72tevvroaop
> resource_type: "type.googleapis.com/google.spanner.admin.database.v1.Database"
> resource_name: 
> "projects/apache-beam-testing/instances/beam-test/databases/beam-testdb-vkf2iqc72tevvroaop"
> description: "Database does not exist."
>   at 
> com.google.cloud.spanner.SpannerExceptionFactory.newSpannerExceptionPreformatted(SpannerExceptionFactory.java:119)
>   at 
> com.google.cloud.spanner.SpannerExceptionFactory.newSpannerException(SpannerExceptionFactory.java:43)
>   at 
> com.google.cloud.spanner.SpannerExceptionFactory.newSpannerException(SpannerExceptionFactory.java:80)
>   at 
> com.google.cloud.spanner.spi.v1.GrpcSpannerRpc.get(GrpcSpannerRpc.java:456)
>   at 
> com.google.cloud.spanner.spi.v1.GrpcSpannerRpc.createSession(GrpcSpannerRpc.java:350)
>   at com.google.cloud.spanner.SpannerImpl$2.call(SpannerImpl.java:258)
>   at com.google.cloud.spanner.SpannerImpl$2.call(SpannerImpl.java:255)
>   at 
> com.google.cloud.spanner.SpannerImpl.runWithRetries(SpannerImpl.java:227)
>   at 
> com.google.cloud.spanner.SpannerImpl.createSession(SpannerImpl.java:254)
>   at 
> com.google.cloud.spanner.BatchClientImpl.batchReadOnlyTransaction(BatchClientImpl.java:51)
>   at 
> org.apache.beam.sdk.io.gcp.spanner.CreateTransactionFn.processElement(CreateTransactionFn.java:47)
> Caused by: java.util.concurrent.ExecutionException: 
> io.grpc.StatusRuntimeException: NOT_FOUND: Database not found: 
> projects/apache-beam-testing/instances/beam-test/databases/beam-testdb-vkf2iqc72tevvroaop
> resource_type: "type.googleapis.com/google.spanner.admin.database.v1.Database"
> resource_name: 
> "projects/apache-beam-testing/instances/beam-test/databases/beam-testdb-vkf2iqc72tevvroaop"
> description: "Database does not exist."
>   at 
> com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:500)
>   at 
> com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:479)
>   at 
> com.google.cloud.spanner.spi.v1.GrpcSpannerRpc.get(GrpcSpannerRpc.java:450)
>   at 
> com.google.cloud.spanner.spi.v1.GrpcSpannerRpc.createSession(GrpcSpannerRpc.java:350)
>   at com.google.cloud.spanner.SpannerImpl$2.call(SpannerImpl.java:258)
>   at com.google.cloud.spanner.SpannerImpl$2.call(SpannerImpl.java:255)
>   at 
> com.google.cloud.spanner.SpannerImpl.runWithRetries(SpannerImpl.java:227)
>   at 
> com.google.cloud.spanner.SpannerImpl.createSession(SpannerImpl.java:254)
>   at 
> com.google.cloud.spanner.BatchClientImpl.batchReadOnlyTransaction(BatchClientImpl.java:51)
>   at 
> org.apache.beam.sdk.io.gcp.spanner.CreateTransactionFn.processElement(CreateTransactionFn.java:47)
>   at 
> org.apache.beam.sdk.io.gcp.spanner.CreateTransactionFn$DoFnInvoker.invokeProcessElement(Unknown
>  Source)
>   at 
> org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:275)
>   at 
> org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:240)
>   at 
> org.apache.beam.runners.dataflow.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:325)
>   at 
> org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44)
>   at 
> org.apache.beam.runners.dataflo

[jira] [Closed] (BEAM-6320) SpannerReadIT.testQuery flaky

2020-06-15 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-6320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick closed BEAM-6320.
---
Fix Version/s: Not applicable
   Resolution: Duplicate

> SpannerReadIT.testQuery flaky
> -
>
> Key: BEAM-6320
> URL: https://issues.apache.org/jira/browse/BEAM-6320
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Reporter: Andrew Pilloud
>Priority: P2
>  Labels: stale-P2
> Fix For: Not applicable
>
>
> https://builds.apache.org/job/beam_PostCommit_Java/2218/
> {code}
> WARNING: No terminal state was returned. State value UNKNOWN
> Dec 27, 2018 9:08:13 PM org.apache.beam.runners.dataflow.TestDataflowRunner 
> checkForPAssertSuccess
> WARNING: Metrics not present for Dataflow job 
> 2018-12-27_13_02_39-18037927821693074732.
> Dec 27, 2018 9:08:13 PM org.apache.beam.runners.dataflow.TestDataflowRunner 
> run
> WARNING: Dataflow job 2018-12-27_13_02_39-18037927821693074732 did not output 
> a success or failure metric.
> {code}
> https://pantheon.corp.google.com/dataflow/jobsDetail/locations/us-central1/jobs/2018-12-27_13_02_39-18037927821693074732?project=apache-beam-testing
> {code}
> com.google.cloud.spanner.SpannerException: NOT_FOUND: 
> io.grpc.StatusRuntimeException: NOT_FOUND: Database not found: 
> projects/apache-beam-testing/instances/beam-test/databases/beam-testdb-vkf2iqc72tevvroaop
> resource_type: "type.googleapis.com/google.spanner.admin.database.v1.Database"
> resource_name: 
> "projects/apache-beam-testing/instances/beam-test/databases/beam-testdb-vkf2iqc72tevvroaop"
> description: "Database does not exist."
>   at 
> com.google.cloud.spanner.SpannerExceptionFactory.newSpannerExceptionPreformatted(SpannerExceptionFactory.java:119)
>   at 
> com.google.cloud.spanner.SpannerExceptionFactory.newSpannerException(SpannerExceptionFactory.java:43)
>   at 
> com.google.cloud.spanner.SpannerExceptionFactory.newSpannerException(SpannerExceptionFactory.java:80)
>   at 
> com.google.cloud.spanner.spi.v1.GrpcSpannerRpc.get(GrpcSpannerRpc.java:456)
>   at 
> com.google.cloud.spanner.spi.v1.GrpcSpannerRpc.createSession(GrpcSpannerRpc.java:350)
>   at com.google.cloud.spanner.SpannerImpl$2.call(SpannerImpl.java:258)
>   at com.google.cloud.spanner.SpannerImpl$2.call(SpannerImpl.java:255)
>   at 
> com.google.cloud.spanner.SpannerImpl.runWithRetries(SpannerImpl.java:227)
>   at 
> com.google.cloud.spanner.SpannerImpl.createSession(SpannerImpl.java:254)
>   at 
> com.google.cloud.spanner.BatchClientImpl.batchReadOnlyTransaction(BatchClientImpl.java:51)
>   at 
> org.apache.beam.sdk.io.gcp.spanner.CreateTransactionFn.processElement(CreateTransactionFn.java:47)
> Caused by: java.util.concurrent.ExecutionException: 
> io.grpc.StatusRuntimeException: NOT_FOUND: Database not found: 
> projects/apache-beam-testing/instances/beam-test/databases/beam-testdb-vkf2iqc72tevvroaop
> resource_type: "type.googleapis.com/google.spanner.admin.database.v1.Database"
> resource_name: 
> "projects/apache-beam-testing/instances/beam-test/databases/beam-testdb-vkf2iqc72tevvroaop"
> description: "Database does not exist."
>   at 
> com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:500)
>   at 
> com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:479)
>   at 
> com.google.cloud.spanner.spi.v1.GrpcSpannerRpc.get(GrpcSpannerRpc.java:450)
>   at 
> com.google.cloud.spanner.spi.v1.GrpcSpannerRpc.createSession(GrpcSpannerRpc.java:350)
>   at com.google.cloud.spanner.SpannerImpl$2.call(SpannerImpl.java:258)
>   at com.google.cloud.spanner.SpannerImpl$2.call(SpannerImpl.java:255)
>   at 
> com.google.cloud.spanner.SpannerImpl.runWithRetries(SpannerImpl.java:227)
>   at 
> com.google.cloud.spanner.SpannerImpl.createSession(SpannerImpl.java:254)
>   at 
> com.google.cloud.spanner.BatchClientImpl.batchReadOnlyTransaction(BatchClientImpl.java:51)
>   at 
> org.apache.beam.sdk.io.gcp.spanner.CreateTransactionFn.processElement(CreateTransactionFn.java:47)
>   at 
> org.apache.beam.sdk.io.gcp.spanner.CreateTransactionFn$DoFnInvoker.invokeProcessElement(Unknown
>  Source)
>   at 
> org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:275)
>   at 
> org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:240)
>   at 
> org.apache.beam.runners.dataflow.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:325)
>   at 
> org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44)
>   at 
> org.apache.beam.runners.dataflow.worker.util.common

[jira] [Commented] (BEAM-7732) Allow access to SpannerOptions in Beam

2020-06-15 Thread Niel Markwick (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-7732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135793#comment-17135793
 ] 

Niel Markwick commented on BEAM-7732:
-

There is still no way to specify custom SpannerOptions in Beam. There have been 
a couple of related recent issues, and I am going to revisit the issue shortly.

> Allow access to SpannerOptions in Beam
> --
>
> Key: BEAM-7732
> URL: https://issues.apache.org/jira/browse/BEAM-7732
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-gcp
>Affects Versions: 2.12.0, 2.13.0
>Reporter: Niel Markwick
>Assignee: Niel Markwick
>Priority: P3
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> Beam hides the 
> [SpannerOptions|https://github.com/googleapis/google-cloud-java/blob/master/google-cloud-clients/google-cloud-spanner/src/main/java/com/google/cloud/spanner/SpannerOptions.java]
>  object behind a 
> [SpannerConfig|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/SpannerConfig.java]
>  object because the SpannerOptions object is not serializable. 
> This means that the only options that can be set are those that can be 
> specified in SpannerConfig - limited to host, project, instance, database.
> Suggestion: add the possibility to set a SpannerOptionsFactory in 
> SpannerConfig:
> {code:java}
> public interface SpannerOptionsFactory extends Serializable {
>    public SpannerOptions create();
> }
> {code}
> This would allow the user use this factory class to specify custom 
> SpannerOptions before they are passed onto the connectToSpanner() method; 
> connectToSpanner() would then become: 
> {code:java}
> public SpannerAccessor connectToSpanner() {
>   
>   SpannerOptions.Builder builder = spannerOptionsFactory.create().toBuilder();
>   // rest of connectToSpanner follows, setting project, host, etc.
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (BEAM-7732) Allow access to SpannerOptions in Beam

2020-06-15 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick reassigned BEAM-7732:
---

Assignee: Niel Markwick

> Allow access to SpannerOptions in Beam
> --
>
> Key: BEAM-7732
> URL: https://issues.apache.org/jira/browse/BEAM-7732
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-gcp
>Affects Versions: 2.12.0, 2.13.0
>Reporter: Niel Markwick
>Assignee: Niel Markwick
>Priority: P3
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> Beam hides the 
> [SpannerOptions|https://github.com/googleapis/google-cloud-java/blob/master/google-cloud-clients/google-cloud-spanner/src/main/java/com/google/cloud/spanner/SpannerOptions.java]
>  object behind a 
> [SpannerConfig|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/SpannerConfig.java]
>  object because the SpannerOptions object is not serializable. 
> This means that the only options that can be set are those that can be 
> specified in SpannerConfig - limited to host, project, instance, database.
> Suggestion: add the possibility to set a SpannerOptionsFactory in 
> SpannerConfig:
> {code:java}
> public interface SpannerOptionsFactory extends Serializable {
>    public SpannerOptions create();
> }
> {code}
> This would allow the user use this factory class to specify custom 
> SpannerOptions before they are passed onto the connectToSpanner() method; 
> connectToSpanner() would then become: 
> {code:java}
> public SpannerAccessor connectToSpanner() {
>   
>   SpannerOptions.Builder builder = spannerOptionsFactory.create().toBuilder();
>   // rest of connectToSpanner follows, setting project, host, etc.
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-6913) Reading data from Spanner never ends

2020-06-15 Thread Niel Markwick (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-6913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135791#comment-17135791
 ] 

Niel Markwick commented on BEAM-6913:
-

Spanner client library has been updated to 1.49, in beam, so this issue should 
now be resolved.

> Reading data from Spanner never ends
> 
>
> Key: BEAM-6913
> URL: https://issues.apache.org/jira/browse/BEAM-6913
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Affects Versions: 2.11.0
> Environment: macOS Mojave (10.14.3)
>Reporter: Mousa HAMAD
>Priority: P2
>  Labels: stale-P2
>
> Whenever my pipeline reads from Spanner, the code runs infinitely. If I 
> update the spanner dependency (_com.google.cloud:google-cloud-spanner_) to 
> e.g., _1.11.0,_ then everything works as expected.
> Consider the following simple pipeline, which never ends:
> {code:java}
> public class Prototype_Spanner {
> private static String INSTANCE_ID = "XYZ";
> private static String DATABASE_ID = "test_beam";
> private static String TABLE_NAME = "item";
> private static void runExample() {
> PipelineOptions options = PipelineOptionsFactory.create();
> options.setRunner(DirectRunner.class);
> Pipeline pipeline = Pipeline.create(options);
> pipeline
> .apply("Read", SpannerIO.read()
> .withInstanceId(INSTANCE_ID)
> .withDatabaseId(DATABASE_ID)
> .withTable(TABLE_NAME)
> .withColumns("price"))
> .apply("Extract Price", MapElements
> .into(TypeDescriptors.longs())
> .via((Struct struct) -> struct.getLong("price")))
> .apply("Calculate Mean", Mean.globally())
> .apply("Map to string", MapElements
> .into(TypeDescriptor.of(String.class))
> .via(Object::toString))
> .apply("Write", TextIO.write().to("/tmp/output"));
> pipeline.run().waitUntilFinish();
> }
> public static void main(String[] args) {
> runExample();
> }
> }
> {code}
> Following is my full list of dependencies:
> {code:java}
> repositories {
> mavenCentral()
> }
> ext {
> beamVersion = '2.11.0'
> sparkVersion = '2.3.3'
> }
> dependencies {
> compile "org.apache.beam:beam-sdks-java-core:$beamVersion"
> compile 
> "org.apache.beam:beam-sdks-java-extensions-join-library:$beamVersion"
> compile 
> "org.apache.beam:beam-sdks-java-extensions-google-cloud-platform-core:$beamVersion"
> compile 
> "org.apache.beam:beam-sdks-java-io-google-cloud-platform:$beamVersion"
> compile "org.apache.beam:beam-runners-core-java:$beamVersion"
> compile "org.apache.beam:beam-runners-direct-java:$beamVersion"
> compile "org.apache.beam:beam-runners-spark:$beamVersion"
> compile "org.apache.spark:spark-core_2.11:$sparkVersion"
> compile "org.apache.spark:spark-streaming_2.11:$sparkVersion"
> // This line fixed the issue for me
> // compile "com.google.cloud:google-cloud-spanner:1.11.0"
> testCompile "junit:junit:4.12"
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-6764) OutOfMemory Exception in Dataflow Spanner Write Mutations

2020-06-15 Thread Niel Markwick (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-6764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135789#comment-17135789
 ] 

Niel Markwick commented on BEAM-6764:
-

Since 2.9.0, streaming support has been improved dramatically to use less 
resources, and have a lower latency. 

I believe this issue is fixed in 2.22

> OutOfMemory Exception in Dataflow Spanner Write Mutations
> -
>
> Key: BEAM-6764
> URL: https://issues.apache.org/jira/browse/BEAM-6764
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Affects Versions: 2.9.0, 2.10.0, 2.11.0
>Reporter: Joshua
>Priority: P2
>  Labels: stale-P2
>
> Since I upgraded my apache beam sdk version to >= 2.9.0, I have been noticing 
> OOM exceptions while using the dataflow runner to write mutations to spanner. 
> I have been using n1-standard-4 since version 2.9.0. On that version, it 
> works. But on higher versions, I get the exception.
>  
> The stackdriver logs is provided below
> {code:java}
> java.lang.OutOfMemoryError: Java heap space
> java.util.ArrayList.(ArrayList.java:152)
> org.apache.beam.sdk.io.gcp.spanner.SpannerIO$GatherBundleAndSortFn.initSorter(SpannerIO.java:1056)
> org.apache.beam.sdk.io.gcp.spanner.SpannerIO$GatherBundleAndSortFn.startBundle(SpannerIO.java:1049)
> {code}
> I have a very basic PTransform for writing to Spanner:
> {code:java}
> SpannerIO.write().withInstanceId(options.getSpannerInstanceId()) 
> .withDatabaseId(options.getSpannerDatabaseId()) 
> .withProjectId(options.getProject()) 
> .withFailureMode(SpannerIO.FailureMode.REPORT_FAILURES);
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-6887) Streaming Spanner Writer transform

2020-06-15 Thread Niel Markwick (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-6887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135787#comment-17135787
 ] 

Niel Markwick commented on BEAM-6887:
-

https://github.com/apache/beam/pull/11532 changes the defaults for unbounded 
sources so that the high-latency grouping operation is disabled by default. 
In addition https://github.com/apache/beam/pull/11529 simplifies the pipeline 
when there  batching is disabled.

Therefore there is no longer a need for a separate SimpleWrite/WriteBulk

This FR sill applies though for having meaningful output for successful 
mutations. 


> Streaming Spanner Writer transform
> --
>
> Key: BEAM-6887
> URL: https://issues.apache.org/jira/browse/BEAM-6887
> Project: Beam
>  Issue Type: New Feature
>  Components: io-java-gcp
>Reporter: Niel Markwick
>Priority: P3
>
> At present, the SpannerIO.Write/WriteGrouped transforms work by collecting an 
> entire bundle of elements, sorts them by table/key, splitting the sorted list 
> into batches (by size and number of cells modified) and then writes each 
> batch to Spanner in a single transaction.
> It returns a SpannerWriteResult.java containing :
>  # a PCollection (the main output) - which will have no elements but 
> will be closed to signal when all the input elements have been written (which 
> is never in streaming because input is unbounded)
>  # a PCollection of elements that failed to write.
> This transform is useful as a bulk sink for data because it efficiently 
> writes large amounts of data. 
> It is not at all useful as an intermediate step in a streaming pipeline - 
> because it has no useful output in streaming mode. 
> I propose that we have a separate Spanner Write transform which simply writes 
> each input Mutation to the database, and then pushes successful Mutations 
> onto its output. 
> This would allow use in the middle of a streaming pipeline, where the flow 
> would be
>  * Some data streamed in
>  * Converted to Spanner Mutations
>  * Written to Spanner Database
>  * Further processing where the values written to the Spanner Database are 
> used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (BEAM-6887) Streaming Spanner Writer transform

2020-06-15 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-6887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick reassigned BEAM-6887:
---

Assignee: Niel Markwick

> Streaming Spanner Writer transform
> --
>
> Key: BEAM-6887
> URL: https://issues.apache.org/jira/browse/BEAM-6887
> Project: Beam
>  Issue Type: New Feature
>  Components: io-java-gcp
>Reporter: Niel Markwick
>Assignee: Niel Markwick
>Priority: P3
>
> At present, the SpannerIO.Write/WriteGrouped transforms work by collecting an 
> entire bundle of elements, sorts them by table/key, splitting the sorted list 
> into batches (by size and number of cells modified) and then writes each 
> batch to Spanner in a single transaction.
> It returns a SpannerWriteResult.java containing :
>  # a PCollection (the main output) - which will have no elements but 
> will be closed to signal when all the input elements have been written (which 
> is never in streaming because input is unbounded)
>  # a PCollection of elements that failed to write.
> This transform is useful as a bulk sink for data because it efficiently 
> writes large amounts of data. 
> It is not at all useful as an intermediate step in a streaming pipeline - 
> because it has no useful output in streaming mode. 
> I propose that we have a separate Spanner Write transform which simply writes 
> each input Mutation to the database, and then pushes successful Mutations 
> onto its output. 
> This would allow use in the middle of a streaming pipeline, where the flow 
> would be
>  * Some data streamed in
>  * Converted to Spanner Mutations
>  * Written to Spanner Database
>  * Further processing where the values written to the Spanner Database are 
> used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (BEAM-2837) Writing To Spanner From Google Cloud DataFlow - Failure

2020-06-10 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-2837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick closed BEAM-2837.
---
Fix Version/s: 2.2.0
 Assignee: (was: Mairbek Khadikov)
   Resolution: Fixed

This issue was fixed by the inclusion of SpannerIO in beam a looking while ago!

> Writing To Spanner From Google Cloud DataFlow - Failure
> ---
>
> Key: BEAM-2837
> URL: https://issues.apache.org/jira/browse/BEAM-2837
> Project: Beam
>  Issue Type: Bug
>  Components: runner-dataflow
>Affects Versions: 2.1.0, 2.4.0
> Environment: Google Cloud DataFlow
>Reporter: Al Yaros
>Priority: P2
>  Labels: stale-assigned
> Fix For: 2.2.0
>
>
> Simple Java Program That reads from Pub\Sub and Writes to Spanner Fails with 
> cryptic error message.
> Simple Program to Demonstrate the Error:
> [https://github.com/alyaros/ExamplePubSubToSpannerViaDataFlow]
> {code:java}
> *Caused by: org.apache.beam.sdk.util.UserCodeException: 
> java.lang.NoClassDefFoundError: Could not initialize class 
> com.google.cloud.spanner.spi.v1.SpannerErrorInterceptor
> 
> org.apache.beam.sdk.util.UserCodeException.wrap(UserCodeException.java:36)
> org.apache.beam.sdk.io.
> gcp.spanner.SpannerWriteGroupFn$DoFnInvoker.invokeSetup(Unknown Source)
> 
> com.google.cloud.dataflow.worker.DoFnInstanceManagers$ConcurrentQueueInstanceManager.deserializeCopy(DoFnInstanceManagers.java:66)
> 
> com.google.cloud.dataflow.worker.DoFnInstanceManagers$ConcurrentQueueInstanceManager.peek(DoFnInstanceManagers.java:48)
> 
> com.google.cloud.dataflow.worker.UserParDoFnFactory.create(UserParDoFnFactory.java:104)
> 
> com.google.cloud.dataflow.worker.DefaultParDoFnFactory.create(DefaultParDoFnFactory.java:66)
> 
> com.google.cloud.dataflow.worker.MapTaskExecutorFactory.createParDoOperation(MapTaskExecutorFactory.java:360)
> 
> com.google.cloud.dataflow.worker.MapTaskExecutorFactory$3.typedApply(MapTaskExecutorFactory.java:271)
> 
> com.google.cloud.dataflow.worker.MapTaskExecutorFactory$3.typedApply(MapTaskExecutorFactory.java:253)
> 
> com.google.cloud.dataflow.worker.graph.Networks$TypeSafeNodeFunction.apply(Networks.java:55)
> 
> com.google.cloud.dataflow.worker.graph.Networks$TypeSafeNodeFunction.apply(Networks.java:43)
> 
> com.google.cloud.dataflow.worker.graph.Networks.replaceDirectedNetworkNodes(Networks.java:78)
> 
> com.google.cloud.dataflow.worker.MapTaskExecutorFactory.create(MapTaskExecutorFactory.java:142)
> 
> com.google.cloud.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:925)
> 
> com.google.cloud.dataflow.worker.StreamingDataflowWorker.access$800(StreamingDataflowWorker.java:133)
> 
> com.google.cloud.dataflow.worker.StreamingDataflowWorker$7.run(StreamingDataflowWorker.java:771)
> 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> java.lang.Thread.run(Thread.java:745)*
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-10047) SpannerIO: Combine sorting and batching

2020-06-10 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick updated BEAM-10047:
-
Fix Version/s: 2.23.0
Affects Version/s: 2.16.0
   2.17.0
   2.18.0
   2.19.0
   2.20.0
   2.21.0
   2.22.0
  Description: 
having Spanner IO's grouping, sorting and batching as separate stages makes no 
sense as these stages will be fused in any runner that supports fusion. 
In addition during these stages, even with fusion, at least 3 full copies of 
each mutation are made (serialized then deserialized objects), which leads to 
an extremely large use of memory. 

 

Combining these stages would significantly reduce memory usage, and improve 
performance.

> SpannerIO: Combine sorting and batching
> ---
>
> Key: BEAM-10047
> URL: https://issues.apache.org/jira/browse/BEAM-10047
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Affects Versions: 2.16.0, 2.17.0, 2.18.0, 2.19.0, 2.20.0, 2.21.0, 2.22.0
>Reporter: Brian Hulette
>Assignee: Niel Markwick
>Priority: P2
> Fix For: 2.23.0
>
>
> having Spanner IO's grouping, sorting and batching as separate stages makes 
> no sense as these stages will be fused in any runner that supports fusion. 
> In addition during these stages, even with fusion, at least 3 full copies of 
> each mutation are made (serialized then deserialized objects), which leads to 
> an extremely large use of memory. 
>  
> Combining these stages would significantly reduce memory usage, and improve 
> performance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (BEAM-6407) regression: FileIO.writeDynamic() with side inputs fails in DirectRunner

2020-06-02 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick resolved BEAM-6407.
-
Resolution: Cannot Reproduce

I cannot reproduce the issue in beam versions 2.10 -> 2.21.



I assume it was fixed by  in 2.10

> regression: FileIO.writeDynamic() with side inputs fails in DirectRunner
> 
>
> Key: BEAM-6407
> URL: https://issues.apache.org/jira/browse/BEAM-6407
> Project: Beam
>  Issue Type: Bug
>  Components: runner-direct
>Affects Versions: 2.9.0
>Reporter: Niel Markwick
>Assignee: Niel Markwick
>Priority: P2
>  Labels: regression, stale-assigned
> Fix For: 2.10.0
>
> Attachments: beam-filewriter-demo.tgz
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> When FileIO.writeDynamic is used with automatic sharding and  a Contextful.Fn 
> that uses side inputs for the file naming, DirectRunner (and TestPipeline) 
> fail with: 
> {{java.lang.IllegalStateException: All PCollectionViews that are consumed 
> must be written by some WriteView PTransform: Missing [ 
> [RunnerPCollectionView]]}}
>  
> Example code:  
> {code:java}
> PCollectionView outputFileName =
>    pipeline.apply(
>       "outputDir",
>        Create.of("/tmp/testout")).apply(View.asSingleton());
> Contextful.Fn manifestNaming =
>    (element, c) ->
>       (window, pane, numShards, shardIndex, compression) -> 
>          c.sideInput(outputFileName)+shardIndex;
> pipeline.apply(FileIO.writeDynamic()
>    .by(SerializableFunctions.constant(""))
>    .withDestinationCoder(StringUtf8Coder.of())
>    .via(TextIO.sink())
>    .withTempDirectory("/tmp")
>    .withNaming(Contextful.of(
>       manifestNaming,
>       Requirements.requiresSideInputs(outputFileName;
> {code}
>  
> This does not occur in Dataflow-runner
> It does not occur if the ContextFul.Fn is not given side inputs.
> It does not occur if withNumShards(1) is set.
> It did not occur in 2.8.0, and does in 2.9.0 and 2.10.0-SNAPSHOT (as of today)
>  
> The cause appears to be due to the DirectRunner using TransformOverrides 
> re-writing FileIO sinks to use runner-determined-sharding
> ( see [DirectRunner.java line 
> 226|https://github.com/apache/beam/blob/master/runners/direct-java/src/main/java/org/apache/beam/runners/direct/DirectRunner.java#L226]
>  )
>  but I do not know why this started occuring in 2.9.0...



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-9822) SpannerIO: Reduce memory usage - especially when streaming

2020-05-20 Thread Niel Markwick (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17112441#comment-17112441
 ] 

Niel Markwick commented on BEAM-9822:
-

#11570 does make a big reduction to the memory use of the transform in general, 
so would help a lot for both bounded and unbounded...

But the other 2 resolve the immediate issue for streaming

> SpannerIO: Reduce memory usage - especially when streaming
> --
>
> Key: BEAM-9822
> URL: https://issues.apache.org/jira/browse/BEAM-9822
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Affects Versions: 2.20.0, 2.21.0
>Reporter: Niel Markwick
>Assignee: Niel Markwick
>Priority: P2
>  Labels: google-cloud-spanner
> Fix For: 2.22.0
>
>  Time Spent: 7h
>  Remaining Estimate: 0h
>
> SpannerIO uses a lot of memory. 
> In Streaming Dataflow, it uses many times as much (because dataflow creates 
> many worker threads)
> Lower the memory use, and change default parameters during streaming to use 
> smaller batches and disable grouping.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-9821) SpannerIO does not include all batching parameters in DisplayData.

2020-05-11 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick updated BEAM-9821:

Affects Version/s: 2.21.0

> SpannerIO does not include all batching parameters in DisplayData.
> --
>
> Key: BEAM-9821
> URL: https://issues.apache.org/jira/browse/BEAM-9821
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Affects Versions: 2.20.0, 2.21.0
>Reporter: Niel Markwick
>Assignee: Niel Markwick
>Priority: Minor
>  Labels: google-cloud-spanner
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> SpannerIO Write and WriteGrouped do not populate all of the batching/grouping 
> parameters in their DisplayData – they only show "batchSizeBytes"



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-9821) SpannerIO does not include all batching parameters in DisplayData.

2020-05-11 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick updated BEAM-9821:

Fix Version/s: 2.22.0

> SpannerIO does not include all batching parameters in DisplayData.
> --
>
> Key: BEAM-9821
> URL: https://issues.apache.org/jira/browse/BEAM-9821
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Affects Versions: 2.20.0, 2.21.0
>Reporter: Niel Markwick
>Assignee: Niel Markwick
>Priority: Minor
>  Labels: google-cloud-spanner
> Fix For: 2.22.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> SpannerIO Write and WriteGrouped do not populate all of the batching/grouping 
> parameters in their DisplayData – they only show "batchSizeBytes"



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work stopped] (BEAM-9822) SpannerIO: Reduce memory usage - especially when streaming

2020-05-11 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on BEAM-9822 stopped by Niel Markwick.
---
> SpannerIO: Reduce memory usage - especially when streaming
> --
>
> Key: BEAM-9822
> URL: https://issues.apache.org/jira/browse/BEAM-9822
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Affects Versions: 2.20.0, 2.21.0
>Reporter: Niel Markwick
>Assignee: Niel Markwick
>Priority: Major
>  Labels: google-cloud-spanner
> Fix For: 2.22.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> SpannerIO uses a lot of memory. 
> In Streaming Dataflow, it uses many times as much (because dataflow creates 
> many worker threads)
> Lower the memory use, and change default parameters during streaming to use 
> smaller batches and disable grouping.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-9822) SpannerIO: Reduce memory usage - especially when streaming

2020-05-11 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick updated BEAM-9822:

Fix Version/s: 2.22.0

> SpannerIO: Reduce memory usage - especially when streaming
> --
>
> Key: BEAM-9822
> URL: https://issues.apache.org/jira/browse/BEAM-9822
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Affects Versions: 2.20.0, 2.21.0
>Reporter: Niel Markwick
>Assignee: Niel Markwick
>Priority: Major
>  Labels: google-cloud-spanner
> Fix For: 2.22.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> SpannerIO uses a lot of memory. 
> In Streaming Dataflow, it uses many times as much (because dataflow creates 
> many worker threads)
> Lower the memory use, and change default parameters during streaming to use 
> smaller batches and disable grouping.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-9822) SpannerIO: Reduce memory usage - especially when streaming

2020-05-11 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick updated BEAM-9822:

Status: Open  (was: Triage Needed)

> SpannerIO: Reduce memory usage - especially when streaming
> --
>
> Key: BEAM-9822
> URL: https://issues.apache.org/jira/browse/BEAM-9822
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Affects Versions: 2.20.0, 2.21.0
>Reporter: Niel Markwick
>Assignee: Niel Markwick
>Priority: Major
>  Labels: google-cloud-spanner
> Fix For: 2.22.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> SpannerIO uses a lot of memory. 
> In Streaming Dataflow, it uses many times as much (because dataflow creates 
> many worker threads)
> Lower the memory use, and change default parameters during streaming to use 
> smaller batches and disable grouping.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-9822) SpannerIO: Reduce memory usage - especially when streaming

2020-05-11 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick updated BEAM-9822:

Affects Version/s: 2.21.0

> SpannerIO: Reduce memory usage - especially when streaming
> --
>
> Key: BEAM-9822
> URL: https://issues.apache.org/jira/browse/BEAM-9822
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Affects Versions: 2.20.0, 2.21.0
>Reporter: Niel Markwick
>Assignee: Niel Markwick
>Priority: Major
>  Labels: google-cloud-spanner
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> SpannerIO uses a lot of memory. 
> In Streaming Dataflow, it uses many times as much (because dataflow creates 
> many worker threads)
> Lower the memory use, and change default parameters during streaming to use 
> smaller batches and disable grouping.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-9821) SpannerIO does not include all batching parameters in DisplayData.

2020-05-11 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick updated BEAM-9821:

Status: Open  (was: Triage Needed)

> SpannerIO does not include all batching parameters in DisplayData.
> --
>
> Key: BEAM-9821
> URL: https://issues.apache.org/jira/browse/BEAM-9821
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Affects Versions: 2.20.0
>Reporter: Niel Markwick
>Assignee: Niel Markwick
>Priority: Minor
>  Labels: google-cloud-spanner
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> SpannerIO Write and WriteGrouped do not populate all of the batching/grouping 
> parameters in their DisplayData – they only show "batchSizeBytes"



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work started] (BEAM-9822) SpannerIO: Reduce memory usage - especially when streaming

2020-05-11 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on BEAM-9822 started by Niel Markwick.
---
> SpannerIO: Reduce memory usage - especially when streaming
> --
>
> Key: BEAM-9822
> URL: https://issues.apache.org/jira/browse/BEAM-9822
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Affects Versions: 2.20.0, 2.21.0
>Reporter: Niel Markwick
>Assignee: Niel Markwick
>Priority: Major
>  Labels: google-cloud-spanner
> Fix For: 2.22.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> SpannerIO uses a lot of memory. 
> In Streaming Dataflow, it uses many times as much (because dataflow creates 
> many worker threads)
> Lower the memory use, and change default parameters during streaming to use 
> smaller batches and disable grouping.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (BEAM-9822) SpannerIO: Reduce memory usage - especially when streaming

2020-05-11 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick reassigned BEAM-9822:
---

Assignee: Niel Markwick

> SpannerIO: Reduce memory usage - especially when streaming
> --
>
> Key: BEAM-9822
> URL: https://issues.apache.org/jira/browse/BEAM-9822
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Affects Versions: 2.20.0
>Reporter: Niel Markwick
>Assignee: Niel Markwick
>Priority: Major
>  Labels: google-cloud-spanner
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> SpannerIO uses a lot of memory. 
> In Streaming Dataflow, it uses many times as much (because dataflow creates 
> many worker threads)
> Lower the memory use, and change default parameters during streaming to use 
> smaller batches and disable grouping.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-9505) SpannerIO spurious error message with empty bundles

2020-05-11 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick updated BEAM-9505:

Affects Version/s: 2.21.0

> SpannerIO spurious error message with empty bundles
> ---
>
> Key: BEAM-9505
> URL: https://issues.apache.org/jira/browse/BEAM-9505
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Affects Versions: 2.18.0, 2.19.0, 2.20.0, 2.21.0
>Reporter: Niel Markwick
>Assignee: Niel Markwick
>Priority: Minor
> Fix For: 2.22.0
>
> Attachments: Worker error log count.png
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> -When using DataflowRunner in streaming mode. DoFn.StartBundle is called 
> multiple times for the same bundle.-
> -This does not occur with DirectRunner.- 
> -This breaks DoFn's which require per-bundle setup and teardown  procedures.-
> When a bundle is empty (such as in streaming if a window is empty), SpannerIO 
> will report a spurious error message:
> {{IllegalStateException: Sorter should be null here}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (BEAM-9821) SpannerIO does not include all batching parameters in DisplayData.

2020-05-11 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick reassigned BEAM-9821:
---

Assignee: Niel Markwick

> SpannerIO does not include all batching parameters in DisplayData.
> --
>
> Key: BEAM-9821
> URL: https://issues.apache.org/jira/browse/BEAM-9821
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Affects Versions: 2.20.0
>Reporter: Niel Markwick
>Assignee: Niel Markwick
>Priority: Minor
>  Labels: google-cloud-spanner
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> SpannerIO Write and WriteGrouped do not populate all of the batching/grouping 
> parameters in their DisplayData – they only show "batchSizeBytes"



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (BEAM-9269) Set shorter Commit Deadline and handle with backoff/retry

2020-05-11 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick resolved BEAM-9269.
-
Fix Version/s: 2.20.0
   Resolution: Fixed

> Set shorter Commit Deadline and handle with backoff/retry
> -
>
> Key: BEAM-9269
> URL: https://issues.apache.org/jira/browse/BEAM-9269
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-gcp
>Affects Versions: 2.16.0, 2.17.0, 2.18.0, 2.19.0
>Reporter: Niel Markwick
>Assignee: Niel Markwick
>Priority: Major
>  Labels: google-cloud-spanner
> Fix For: 2.20.0
>
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
> Default commit deadline in Spanner is 1hr, which can lead to a variety of 
> issues including database overload and session expiry.
> Shorter deadline should be set with backoff/retry when deadline expires, so 
> that the Spanner database does not become overloaded.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-9505) SpannerIO spurious error message with empty bundles

2020-05-09 Thread Niel Markwick (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17103272#comment-17103272
 ] 

Niel Markwick commented on BEAM-9505:
-

The only workaround is to either build your own version of 2.17 patched with my 
fix from the [GitHub pull request|https://github.com/apache/beam/pull/11438]

Or somehow filter out empty windows.

150,000 errors per hour is ~40/sec, I know that's a lot, but is it really 
slowing down your pipeline?

If you are talking about end-end latency, you can reduce latency by reducing 
the grouping factor in spannerIO.write() (at the expense of potentially more 
spanner CPU usage)

2.22 will default to a groupingFactor of 1 for streaming pipelines.

> SpannerIO spurious error message with empty bundles
> ---
>
> Key: BEAM-9505
> URL: https://issues.apache.org/jira/browse/BEAM-9505
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Affects Versions: 2.18.0, 2.19.0, 2.20.0
>Reporter: Niel Markwick
>Assignee: Niel Markwick
>Priority: Minor
> Fix For: 2.22.0
>
> Attachments: Worker error log count.png
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> -When using DataflowRunner in streaming mode. DoFn.StartBundle is called 
> multiple times for the same bundle.-
> -This does not occur with DirectRunner.- 
> -This breaks DoFn's which require per-bundle setup and teardown  procedures.-
> When a bundle is empty (such as in streaming if a window is empty), SpannerIO 
> will report a spurious error message:
> {{IllegalStateException: Sorter should be null here}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-9505) SpannerIO spurious error message with empty bundles

2020-05-09 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick updated BEAM-9505:

Fix Version/s: 2.22.0

> SpannerIO spurious error message with empty bundles
> ---
>
> Key: BEAM-9505
> URL: https://issues.apache.org/jira/browse/BEAM-9505
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Affects Versions: 2.18.0, 2.19.0, 2.20.0
>Reporter: Niel Markwick
>Assignee: Niel Markwick
>Priority: Minor
> Fix For: 2.22.0
>
> Attachments: Worker error log count.png
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> -When using DataflowRunner in streaming mode. DoFn.StartBundle is called 
> multiple times for the same bundle.-
> -This does not occur with DirectRunner.- 
> -This breaks DoFn's which require per-bundle setup and teardown  procedures.-
> When a bundle is empty (such as in streaming if a window is empty), SpannerIO 
> will report a spurious error message:
> {{IllegalStateException: Sorter should be null here}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-9822) SpannerIO: Reduce memory usage - especially when streaming

2020-04-26 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick updated BEAM-9822:

Summary: SpannerIO: Reduce memory usage - especially when streaming  (was: 
Reduce memory usage - especially when streaming)

> SpannerIO: Reduce memory usage - especially when streaming
> --
>
> Key: BEAM-9822
> URL: https://issues.apache.org/jira/browse/BEAM-9822
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Affects Versions: 2.20.0
>Reporter: Niel Markwick
>Priority: Major
>  Labels: google-cloud-spanner
> Fix For: 2.21.0
>
>
> SpannerIO uses a lot of memory. 
> In Streaming Dataflow, it uses many times as much (because dataflow creates 
> many worker threads)
> Lower the memory use, and change default parameters during streaming to use 
> smaller batches and disable grouping.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-9822) Reduce memory usage - especially when streaming

2020-04-26 Thread Niel Markwick (Jira)
Niel Markwick created BEAM-9822:
---

 Summary: Reduce memory usage - especially when streaming
 Key: BEAM-9822
 URL: https://issues.apache.org/jira/browse/BEAM-9822
 Project: Beam
  Issue Type: Bug
  Components: io-java-gcp
Affects Versions: 2.20.0
Reporter: Niel Markwick
 Fix For: 2.21.0


SpannerIO uses a lot of memory. 
In Streaming Dataflow, it uses many times as much (because dataflow creates 
many worker threads)

Lower the memory use, and change default parameters during streaming to use 
smaller batches and disable grouping.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-9821) SpannerIO does not include all batching parameters in DisplayData.

2020-04-26 Thread Niel Markwick (Jira)
Niel Markwick created BEAM-9821:
---

 Summary: SpannerIO does not include all batching parameters in 
DisplayData.
 Key: BEAM-9821
 URL: https://issues.apache.org/jira/browse/BEAM-9821
 Project: Beam
  Issue Type: Bug
  Components: io-java-gcp
Affects Versions: 2.20.0
Reporter: Niel Markwick
 Fix For: 2.21.0


SpannerIO Write and WriteGrouped do not populate all of the batching/grouping 
parameters in their DisplayData – they only show "batchSizeBytes"



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-9505) SpannerIO spurious error message with empty bundles

2020-04-16 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick updated BEAM-9505:

  Component/s: (was: runner-dataflow)
   io-java-gcp
Fix Version/s: 2.21.0
Affects Version/s: 2.20.0
  Description: 
-When using DataflowRunner in streaming mode. DoFn.StartBundle is called 
multiple times for the same bundle.-

-This does not occur with DirectRunner.- 

-This breaks DoFn's which require per-bundle setup and teardown  procedures.-

When a bundle is empty (such as in streaming if a window is empty), SpannerIO 
will report a spurious error message:

{{IllegalStateException: Sorter should be null here}}


  was:
When using DataflowRunner in streaming mode. DoFn.StartBundle is called 
multiple times for the same bundle.

 

This does not occur with DirectRunner. 

This breaks DoFn's which require per-bundle setup and teardown  procedures.

 Priority: Minor  (was: Major)
  Summary: SpannerIO spurious error message with empty bundles  
(was: DoFn.StartBundle called multiple times when streaming)

> SpannerIO spurious error message with empty bundles
> ---
>
> Key: BEAM-9505
> URL: https://issues.apache.org/jira/browse/BEAM-9505
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Affects Versions: 2.18.0, 2.19.0, 2.20.0
>Reporter: Niel Markwick
>Assignee: Niel Markwick
>Priority: Minor
> Fix For: 2.21.0
>
>
> -When using DataflowRunner in streaming mode. DoFn.StartBundle is called 
> multiple times for the same bundle.-
> -This does not occur with DirectRunner.- 
> -This breaks DoFn's which require per-bundle setup and teardown  procedures.-
> When a bundle is empty (such as in streaming if a window is empty), SpannerIO 
> will report a spurious error message:
> {{IllegalStateException: Sorter should be null here}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-9505) DoFn.StartBundle called multiple times when streaming

2020-04-16 Thread Niel Markwick (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17084779#comment-17084779
 ] 

Niel Markwick commented on BEAM-9505:
-

The shame 😢. This is actually a bug in my code where a value was not cleared in 
FinishBundle, leading to an unexpected state in StartBundle and a spurious 
error message.

This only occurs in certain situations (such as an empty bundle, which can 
occur in streaming if the window is also empty).

Updating issue to correct.

> DoFn.StartBundle called multiple times when streaming
> -
>
> Key: BEAM-9505
> URL: https://issues.apache.org/jira/browse/BEAM-9505
> Project: Beam
>  Issue Type: Bug
>  Components: runner-dataflow
>Affects Versions: 2.18.0, 2.19.0
>Reporter: Niel Markwick
>Assignee: Boyuan Zhang
>Priority: Major
>
> When using DataflowRunner in streaming mode. DoFn.StartBundle is called 
> multiple times for the same bundle.
>  
> This does not occur with DirectRunner. 
> This breaks DoFn's which require per-bundle setup and teardown  procedures.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (BEAM-9505) DoFn.StartBundle called multiple times when streaming

2020-04-16 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick reassigned BEAM-9505:
---

Assignee: Niel Markwick  (was: Boyuan Zhang)

> DoFn.StartBundle called multiple times when streaming
> -
>
> Key: BEAM-9505
> URL: https://issues.apache.org/jira/browse/BEAM-9505
> Project: Beam
>  Issue Type: Bug
>  Components: runner-dataflow
>Affects Versions: 2.18.0, 2.19.0
>Reporter: Niel Markwick
>Assignee: Niel Markwick
>Priority: Major
>
> When using DataflowRunner in streaming mode. DoFn.StartBundle is called 
> multiple times for the same bundle.
>  
> This does not occur with DirectRunner. 
> This breaks DoFn's which require per-bundle setup and teardown  procedures.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-9505) DoFn.StartBundle called multiple times when streaming

2020-03-30 Thread Niel Markwick (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17071109#comment-17071109
 ] 

Niel Markwick commented on BEAM-9505:
-

This issue came was reported from StackOverflow.  To reproduce it you would 
have to run a streaming pipeline in dataflow. The SO bug was attempting to 
write to Spanner (and in the SpannerIO, I have some code in StartBundle to 
sanity-check the state, which reports the error.). 

[https://stackoverflow.com/questions/60658135/spannerio-java-lang-illegalstateexception-sorter-should-be-null-here]

 

I would have thought that a trivial DoFn in a pipeline taking pub/sub input 
would be able to reproduce it (but have not created that pipeline myself yet)

 

> DoFn.StartBundle called multiple times when streaming
> -
>
> Key: BEAM-9505
> URL: https://issues.apache.org/jira/browse/BEAM-9505
> Project: Beam
>  Issue Type: Bug
>  Components: runner-dataflow
>Affects Versions: 2.18.0, 2.19.0
>Reporter: Niel Markwick
>Assignee: Boyuan Zhang
>Priority: Major
>
> When using DataflowRunner in streaming mode. DoFn.StartBundle is called 
> multiple times for the same bundle.
>  
> This does not occur with DirectRunner. 
> This breaks DoFn's which require per-bundle setup and teardown  procedures.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (BEAM-9505) DoFn.StartBundle called multiple times when streaming

2020-03-30 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick reassigned BEAM-9505:
---

Assignee: Boyuan Zhang  (was: Niel Markwick)

> DoFn.StartBundle called multiple times when streaming
> -
>
> Key: BEAM-9505
> URL: https://issues.apache.org/jira/browse/BEAM-9505
> Project: Beam
>  Issue Type: Bug
>  Components: runner-dataflow
>Affects Versions: 2.18.0, 2.19.0
>Reporter: Niel Markwick
>Assignee: Boyuan Zhang
>Priority: Major
>
> When using DataflowRunner in streaming mode. DoFn.StartBundle is called 
> multiple times for the same bundle.
>  
> This does not occur with DirectRunner. 
> This breaks DoFn's which require per-bundle setup and teardown  procedures.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (BEAM-9505) DoFn.StartBundle called multiple times when streaming

2020-03-30 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick reassigned BEAM-9505:
---

Assignee: Niel Markwick  (was: Boyuan Zhang)

> DoFn.StartBundle called multiple times when streaming
> -
>
> Key: BEAM-9505
> URL: https://issues.apache.org/jira/browse/BEAM-9505
> Project: Beam
>  Issue Type: Bug
>  Components: runner-dataflow
>Affects Versions: 2.18.0, 2.19.0
>Reporter: Niel Markwick
>Assignee: Niel Markwick
>Priority: Major
>
> When using DataflowRunner in streaming mode. DoFn.StartBundle is called 
> multiple times for the same bundle.
>  
> This does not occur with DirectRunner. 
> This breaks DoFn's which require per-bundle setup and teardown  procedures.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-9505) DoFn.StartBundle called multiple times when streaming

2020-03-16 Thread Niel Markwick (Jira)
Niel Markwick created BEAM-9505:
---

 Summary: DoFn.StartBundle called multiple times when streaming
 Key: BEAM-9505
 URL: https://issues.apache.org/jira/browse/BEAM-9505
 Project: Beam
  Issue Type: Bug
  Components: runner-dataflow
Affects Versions: 2.19.0, 2.18.0
Reporter: Niel Markwick


When using DataflowRunner in streaming mode. DoFn.StartBundle is called 
multiple times for the same bundle.

 

This does not occur with DirectRunner. 

This breaks DoFn's which require per-bundle setup and teardown  procedures.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (BEAM-9269) Set shorter Commit Deadline and handle with backoff/retry

2020-02-10 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick reassigned BEAM-9269:
---

Assignee: Niel Markwick

> Set shorter Commit Deadline and handle with backoff/retry
> -
>
> Key: BEAM-9269
> URL: https://issues.apache.org/jira/browse/BEAM-9269
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-gcp
>Affects Versions: 2.16.0, 2.17.0, 2.18.0, 2.19.0
>Reporter: Niel Markwick
>Assignee: Niel Markwick
>Priority: Major
>  Labels: google-cloud-spanner
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Default commit deadline in Spanner is 1hr, which can lead to a variety of 
> issues including database overload and session expiry.
> Shorter deadline should be set with backoff/retry when deadline expires, so 
> that the Spanner database does not become overloaded.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (BEAM-9268) SpannerIO: Better documentation and warning about creating tables in the pipeline

2020-02-10 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick resolved BEAM-9268.
-
Resolution: Fixed

> SpannerIO: Better documentation and warning about creating tables in the 
> pipeline
> -
>
> Key: BEAM-9268
> URL: https://issues.apache.org/jira/browse/BEAM-9268
> Project: Beam
>  Issue Type: Improvement
>  Components: io-go-gcp
>Affects Versions: 2.16.0, 2.17.0, 2.18.0, 2.19.0
>Reporter: Niel Markwick
>Assignee: Niel Markwick
>Priority: Major
>  Labels: google-cloud-spanner, perfomance
> Fix For: 2.20.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> The javadoc for SpannerIO.Write mentions in passing that the transform needs 
> to know the DB schema for optimal performance. If the schema is created 
> within the pipeline, then there is a race between the schema being created 
> and SpannerIO reading it, leading to a potential performance penalty if 
> SpannerIO does not know about the existence of some tables. 
>  
> Javadoc needs to make this clearer and more explicit, and point the user at 
> the Write.withSchemaReadySignal().
>  
> Pipeline needs to raise (rate limited) warnings if it sees writes being made 
> to tables it does not know about (warnings can refer back to javadocs)
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-9268) SpannerIO: Better documentation and warning about creating tables in the pipeline

2020-02-10 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick updated BEAM-9268:

Fix Version/s: 2.20.0

> SpannerIO: Better documentation and warning about creating tables in the 
> pipeline
> -
>
> Key: BEAM-9268
> URL: https://issues.apache.org/jira/browse/BEAM-9268
> Project: Beam
>  Issue Type: Improvement
>  Components: io-go-gcp
>Affects Versions: 2.16.0, 2.17.0, 2.18.0, 2.19.0
>Reporter: Niel Markwick
>Assignee: Niel Markwick
>Priority: Major
>  Labels: google-cloud-spanner, perfomance
> Fix For: 2.20.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> The javadoc for SpannerIO.Write mentions in passing that the transform needs 
> to know the DB schema for optimal performance. If the schema is created 
> within the pipeline, then there is a race between the schema being created 
> and SpannerIO reading it, leading to a potential performance penalty if 
> SpannerIO does not know about the existence of some tables. 
>  
> Javadoc needs to make this clearer and more explicit, and point the user at 
> the Write.withSchemaReadySignal().
>  
> Pipeline needs to raise (rate limited) warnings if it sees writes being made 
> to tables it does not know about (warnings can refer back to javadocs)
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-9269) Set shorter Commit Deadline and handle with backoff/retry

2020-02-07 Thread Niel Markwick (Jira)
Niel Markwick created BEAM-9269:
---

 Summary: Set shorter Commit Deadline and handle with backoff/retry
 Key: BEAM-9269
 URL: https://issues.apache.org/jira/browse/BEAM-9269
 Project: Beam
  Issue Type: Improvement
  Components: io-java-gcp
Affects Versions: 2.19.0, 2.18.0, 2.17.0, 2.16.0
Reporter: Niel Markwick


Default commit deadline in Spanner is 1hr, which can lead to a variety of 
issues including database overload and session expiry.

Shorter deadline should be set with backoff/retry when deadline expires, so 
that the Spanner database does not become overloaded.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-9268) SpannerIO: Better documentation and warning about creating tables in the pipeline

2020-02-07 Thread Niel Markwick (Jira)
Niel Markwick created BEAM-9268:
---

 Summary: SpannerIO: Better documentation and warning about 
creating tables in the pipeline
 Key: BEAM-9268
 URL: https://issues.apache.org/jira/browse/BEAM-9268
 Project: Beam
  Issue Type: Improvement
  Components: io-go-gcp
Affects Versions: 2.19.0, 2.18.0, 2.17.0, 2.16.0
Reporter: Niel Markwick
Assignee: Niel Markwick


The javadoc for SpannerIO.Write mentions in passing that the transform needs to 
know the DB schema for optimal performance. If the schema is created within the 
pipeline, then there is a race between the schema being created and SpannerIO 
reading it, leading to a potential performance penalty if SpannerIO does not 
know about the existence of some tables. 

 

Javadoc needs to make this clearer and more explicit, and point the user at the 
Write.withSchemaReadySignal().

 

Pipeline needs to raise (rate limited) warnings if it sees writes being made to 
tables it does not know about (warnings can refer back to javadocs)

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-8825) OOM when writing large numbers of 'narrow' rows

2019-12-13 Thread Niel Markwick (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995677#comment-16995677
 ] 

Niel Markwick commented on BEAM-8825:
-

done: [https://github.com/apache/beam/pull/10380]

> OOM when writing large numbers of 'narrow' rows
> ---
>
> Key: BEAM-8825
> URL: https://issues.apache.org/jira/browse/BEAM-8825
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Affects Versions: 2.9.0, 2.10.0, 2.11.0, 2.12.0, 2.13.0, 2.14.0, 2.15.0, 
> 2.16.0, 2.17.0
>Reporter: Niel Markwick
>Assignee: Niel Markwick
>Priority: Major
> Fix For: 2.18.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> SpannerIO can OOM when writing large numbers of 'narrow' rows. 
>  
> SpannerIO puts  input mutation elements into batches for efficient writing.
> These batches are limited by number of cells mutated, and size of data 
> written (5000 cells, 1MB data). SpannerIO groups enough mutations to build 
> 1000 of these groups (5M cells, 1GB data), then sorts and batches them.
> When the number of cells and size of data is very small (<5 cells, <100 
> bytes), the memory overhead of storing millions of mutations for batching is 
> significant, and can lead to OOMs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (BEAM-8825) OOM when writing large numbers of 'narrow' rows

2019-11-27 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick reassigned BEAM-8825:
---

Assignee: Niel Markwick

> OOM when writing large numbers of 'narrow' rows
> ---
>
> Key: BEAM-8825
> URL: https://issues.apache.org/jira/browse/BEAM-8825
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Affects Versions: 2.9.0, 2.10.0, 2.11.0, 2.12.0, 2.13.0, 2.14.0, 2.15.0, 
> 2.16.0, 2.17.0
>Reporter: Niel Markwick
>Assignee: Niel Markwick
>Priority: Major
> Fix For: 2.18.0
>
>
> SpannerIO can OOM when writing large numbers of 'narrow' rows. 
>  
> SpannerIO puts  input mutation elements into batches for efficient writing.
> These batches are limited by number of cells mutated, and size of data 
> written (5000 cells, 1MB data). SpannerIO groups enough mutations to build 
> 1000 of these groups (5M cells, 1GB data), then sorts and batches them.
> When the number of cells and size of data is very small (<5 cells, <100 
> bytes), the memory overhead of storing millions of mutations for batching is 
> significant, and can lead to OOMs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-8825) OOM when writing large numbers of 'narrow' rows

2019-11-26 Thread Niel Markwick (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick updated BEAM-8825:

Fix Version/s: (was: 2.17.0)
   2.18.0
Affects Version/s: 2.17.0

> OOM when writing large numbers of 'narrow' rows
> ---
>
> Key: BEAM-8825
> URL: https://issues.apache.org/jira/browse/BEAM-8825
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Affects Versions: 2.9.0, 2.10.0, 2.11.0, 2.12.0, 2.13.0, 2.14.0, 2.15.0, 
> 2.16.0, 2.17.0
>Reporter: Niel Markwick
>Priority: Major
> Fix For: 2.18.0
>
>
> SpannerIO can OOM when writing large numbers of 'narrow' rows. 
>  
> SpannerIO puts  input mutation elements into batches for efficient writing.
> These batches are limited by number of cells mutated, and size of data 
> written (5000 cells, 1MB data). SpannerIO groups enough mutations to build 
> 1000 of these groups (5M cells, 1GB data), then sorts and batches them.
> When the number of cells and size of data is very small (<5 cells, <100 
> bytes), the memory overhead of storing millions of mutations for batching is 
> significant, and can lead to OOMs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-8825) OOM when writing large numbers of 'narrow' rows

2019-11-26 Thread Niel Markwick (Jira)
Niel Markwick created BEAM-8825:
---

 Summary: OOM when writing large numbers of 'narrow' rows
 Key: BEAM-8825
 URL: https://issues.apache.org/jira/browse/BEAM-8825
 Project: Beam
  Issue Type: Bug
  Components: io-java-gcp
Affects Versions: 2.16.0, 2.15.0, 2.14.0, 2.13.0, 2.12.0, 2.11.0, 2.10.0, 
2.9.0
Reporter: Niel Markwick
 Fix For: 2.17.0


SpannerIO can OOM when writing large numbers of 'narrow' rows. 

 

SpannerIO puts  input mutation elements into batches for efficient writing.
These batches are limited by number of cells mutated, and size of data written 
(5000 cells, 1MB data). SpannerIO groups enough mutations to build 1000 of 
these groups (5M cells, 1GB data), then sorts and batches them.

When the number of cells and size of data is very small (<5 cells, <100 bytes), 
the memory overhead of storing millions of mutations for batching is 
significant, and can lead to OOMs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-7732) Allow access to SpannerOptions in Beam

2019-07-15 Thread Niel Markwick (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-7732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885225#comment-16885225
 ] 

Niel Markwick commented on BEAM-7732:
-

Created [https://github.com/apache/beam/pull/9048] as a proof-of-concept change.

> Allow access to SpannerOptions in Beam
> --
>
> Key: BEAM-7732
> URL: https://issues.apache.org/jira/browse/BEAM-7732
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-gcp
>Affects Versions: 2.12.0, 2.13.0
>Reporter: Niel Markwick
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Beam hides the 
> [SpannerOptions|https://github.com/googleapis/google-cloud-java/blob/master/google-cloud-clients/google-cloud-spanner/src/main/java/com/google/cloud/spanner/SpannerOptions.java]
>  object behind a 
> [SpannerConfig|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/SpannerConfig.java]
>  object because the SpannerOptions object is not serializable. 
> This means that the only options that can be set are those that can be 
> specified in SpannerConfig - limited to host, project, instance, database.
> Suggestion: add the possibility to set a SpannerOptionsFactory in 
> SpannerConfig:
> {code:java}
> public interface SpannerOptionsFactory extends Serializable {
>    public SpannerOptions create();
> }
> {code}
> This would allow the user use this factory class to specify custom 
> SpannerOptions before they are passed onto the connectToSpanner() method; 
> connectToSpanner() would then become: 
> {code:java}
> public SpannerAccessor connectToSpanner() {
>   
>   SpannerOptions.Builder builder = spannerOptionsFactory.create().toBuilder();
>   // rest of connectToSpanner follows, setting project, host, etc.
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (BEAM-7732) Allow access to SpannerOptions in Beam

2019-07-12 Thread Niel Markwick (JIRA)
Niel Markwick created BEAM-7732:
---

 Summary: Allow access to SpannerOptions in Beam
 Key: BEAM-7732
 URL: https://issues.apache.org/jira/browse/BEAM-7732
 Project: Beam
  Issue Type: Improvement
  Components: io-java-gcp
Affects Versions: 2.13.0, 2.12.0
Reporter: Niel Markwick



Beam hides the 
[SpannerOptions|https://github.com/googleapis/google-cloud-java/blob/master/google-cloud-clients/google-cloud-spanner/src/main/java/com/google/cloud/spanner/SpannerOptions.java]
 object behind a 
[SpannerConfig|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/SpannerConfig.java]
 object because the SpannerOptions object is not serializable. 

This means that the only options that can be set are those that can be 
specified in SpannerConfig - limited to host, project, instance, database.

Suggestion: add the possibility to set a SpannerOptionsFactory in SpannerConfig:
{code:java}
public interface SpannerOptionsFactory extends Serializable {
   public SpannerOptions create();
}
{code}
This would allow the user use this factory class to specify custom 
SpannerOptions before they are passed onto the connectToSpanner() method; 

connectToSpanner() would then become: 


{code:java}
public SpannerAccessor connectToSpanner() {
  
  SpannerOptions.Builder builder = spannerOptionsFactory.create().toBuilder();
  // rest of connectToSpanner follows, setting project, host, etc.
{code}





 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (BEAM-6921) Improve SpannerIO output

2019-03-27 Thread Niel Markwick (JIRA)
Niel Markwick created BEAM-6921:
---

 Summary: Improve SpannerIO output
 Key: BEAM-6921
 URL: https://issues.apache.org/jira/browse/BEAM-6921
 Project: Beam
  Issue Type: New Feature
  Components: io-java-gcp
Affects Versions: 2.11.0
Reporter: Niel Markwick


from a discussion in [https://github.com/apache/beam/pull/8097]
 SpannerIO produces 2 output PCollections:
 * getOutput() -> PCollection
 ** never has any values
 ** in GlobalWindow
 ** Closed when the input PCollection is closed (ie never in streaming) to 
indicate when all input has been written
 ** Used in batch pipelines to have 'dependant' bulk imports - where one 
dataset is not written to Spanner until another has completed writing. 
(necessary for handling parent/child (1-many) referential integrity)
 * getFailedMutations() -> PCollection
 ** only contains values when Mutation[Group]s fail to be written
 ** in GlobalWindow
 ** Not very useful, as the reason for the failure is not given. 

Suggestion: 
 * Deprecate these existing outputs.
 * Make primary output be a PCollection<\{ MutationGroup, CommitTimestamp }> so 
that the successfully written Mutation[Groups] can be processed further if 
necessary.
(\{a,b} signifies a container class for these values)
 * Add an additional output of failed mutations PCollection<\{ MutationGroup, 
FailureMessage}>
 * The existing outputs can be derived from these new outputs

This allows useful error reporting/handling from the failure message, and the 
ability to continue processing the successful mutations. 

 

(see also BEAM-6887)

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-6887) Streaming Spanner Writer transform

2019-03-22 Thread Niel Markwick (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niel Markwick updated BEAM-6887:

Description: 
At present, the SpannerIO.Write/WriteGrouped transforms work by collecting an 
entire bundle of elements, sorts them by table/key, splitting the sorted list 
into batches (by size and number of cells modified) and then writes each batch 
to Spanner in a single transaction.

It returns a SpannerWriteResult.java containing :
 # a PCollection (the main output) - which will have no elements but will 
be closed to signal when all the input elements have been written (which is 
never in streaming because input is unbounded)
 # a PCollection of elements that failed to write.

This transform is useful as a bulk sink for data because it efficiently writes 
large amounts of data. 

It is not at all useful as an intermediate step in a streaming pipeline - 
because it has no useful output in streaming mode. 

I propose that we have a separate Spanner Write transform which simply writes 
each input Mutation to the database, and then pushes successful Mutations onto 
its output. 

This would allow use in the middle of a streaming pipeline, where the flow 
would be
 * Some data streamed in
 * Converted to Spanner Mutations
 * Written to Spanner Database
 * Further processing where the values written to the Spanner Database are used.

  was:
At present, the[ 
SpannerIO.Write(Grouped|http://go/gh/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/SpannerIO.java#L892])
 transform works by collecting an entire bundle of elements, sorts them by 
table/key, splitting the sorted list into batches (by size and number of cells 
modified) and then writes each batch to Spanner in a single transaction. 

It returns an[ 
object|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/SpannerWriteResult.java]
 containing :
 # a PCollection (the main output) - which will have no elements but will 
be closed to signal when all the input elements have been written (which is 
never in streaming because input is unbounded)
 # a PCollection of elements that failed to write.
 

This transform is useful as a bulk sink for data because it efficiently writes 
large amounts of data. 


It is not at all useful as an intermediate step in a streaming pipeline - 
because it has no useful output in streaming mode. 


I propose that we have a separate Spanner Write transform which simply writes 
each input Mutation to the database, and then pushes successful Mutations onto 
its output. 

This would allow use in the middle of a streaming pipeline, where the flow 
would be

 * Some data streamed in
 * Converted to Spanner Mutations
 * Written to Spanner Database
 * Further processing where the values written to the Spanner Database are used.


> Streaming Spanner Writer transform
> --
>
> Key: BEAM-6887
> URL: https://issues.apache.org/jira/browse/BEAM-6887
> Project: Beam
>  Issue Type: New Feature
>  Components: io-java-gcp
>Reporter: Niel Markwick
>Assignee: Niel Markwick
>Priority: Minor
>
> At present, the SpannerIO.Write/WriteGrouped transforms work by collecting an 
> entire bundle of elements, sorts them by table/key, splitting the sorted list 
> into batches (by size and number of cells modified) and then writes each 
> batch to Spanner in a single transaction.
> It returns a SpannerWriteResult.java containing :
>  # a PCollection (the main output) - which will have no elements but 
> will be closed to signal when all the input elements have been written (which 
> is never in streaming because input is unbounded)
>  # a PCollection of elements that failed to write.
> This transform is useful as a bulk sink for data because it efficiently 
> writes large amounts of data. 
> It is not at all useful as an intermediate step in a streaming pipeline - 
> because it has no useful output in streaming mode. 
> I propose that we have a separate Spanner Write transform which simply writes 
> each input Mutation to the database, and then pushes successful Mutations 
> onto its output. 
> This would allow use in the middle of a streaming pipeline, where the flow 
> would be
>  * Some data streamed in
>  * Converted to Spanner Mutations
>  * Written to Spanner Database
>  * Further processing where the values written to the Spanner Database are 
> used.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


  1   2   >