[jira] [Updated] (BEAM-7693) FILE_LOADS option for inserting rows in BigQuery creates a stuck process in Dataflow that saturates all the resources of the Job

2020-06-01 Thread Beam JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Beam JIRA Bot updated BEAM-7693:

Labels: stale-P2  (was: )

> FILE_LOADS option for inserting rows in BigQuery creates a stuck process in 
> Dataflow that saturates all the resources of the Job
> 
>
> Key: BEAM-7693
> URL: https://issues.apache.org/jira/browse/BEAM-7693
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp, runner-dataflow
>Affects Versions: 2.13.0
> Environment: Dataflow
>Reporter: Juan Urrego
>Priority: P2
>  Labels: stale-P2
> Attachments: Screenshot 2019-07-06 at 15.05.04.png, 
> image-2019-07-06-15-04-17-593.png
>
>
> During a streaming job, when you insert records into BigQuery in batch using 
> the FILE_LOADS option and one of the load jobs fails, the thread that failed 
> gets stuck and eventually saturates the job's resources, making the 
> autoscaling option useless (the job uses the maximum number of workers and 
> the system latency keeps climbing). In some cases it becomes ridiculously 
> slow at processing the incoming events.
> Here is an example:
> {code:java}
> BigQueryIO.writeTableRows()
>     .to(destinationTableSerializableFunction)  // dynamic table destination per row
>     .withMethod(Method.FILE_LOADS)             // batch load jobs instead of streaming inserts
>     .withJsonSchema(tableSchema)
>     .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
>     .withWriteDisposition(WriteDisposition.WRITE_APPEND)
>     .withTriggeringFrequency(Duration.standardMinutes(5)) // start a load every 5 minutes
>     .withNumFileShards(25);                    // required when a triggering frequency is set
> {code}
> The pipeline works like a charm, but the moment I send a bad tableRow (for 
> instance, a required field set to null) the pipeline starts emitting these 
> messages:
> {code:java}
> Processing stuck in step FILE_LOADS:  in 
> BigQuery/BatchLoads/SinglePartitionWriteTables/ParMultiDo(WriteTables) for at 
> least 10m00s without outputting or completing in state finish at 
> java.lang.Thread.sleep(Native Method) at 
> com.google.api.client.util.Sleeper$1.sleep(Sleeper.java:42) at 
> com.google.api.client.util.BackOffUtils.next(BackOffUtils.java:48) at 
> org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers$PendingJobManager.nextBackOff(BigQueryHelpers.java:159)
>  at 
> org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers$PendingJobManager.waitForDone(BigQueryHelpers.java:145)
>  at 
> org.apache.beam.sdk.io.gcp.bigquery.WriteTables$WriteTablesDoFn.finishBundle(WriteTables.java:255)
>  at 
> org.apache.beam.sdk.io.gcp.bigquery.WriteTables$WriteTablesDoFn$DoFnInvoker.invokeFinishBundle(Unknown
>  Source)
> {code}
> It's clear that the step keeps running even though it failed. The BigQuery 
> job reports that the load failed, but Dataflow keeps waiting for a response, 
> even though the job is never executed again.
> !image-2019-07-06-15-04-17-593.png|width=497,height=134!
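> The stack trace suggests a polling loop along the following lines. This is an 
> illustrative sketch using the same google-http-client utilities that appear 
> in the trace, not Beam's actual code; {{jobIsDone}} is a hypothetical 
> stand-in for the BigQuery job-status check:
> {code:java}
> import com.google.api.client.util.BackOff;
> import com.google.api.client.util.BackOffUtils;
> import com.google.api.client.util.ExponentialBackOff;
> import com.google.api.client.util.Sleeper;
> import java.io.IOException;
> import java.util.function.BooleanSupplier;
> 
> class PollUntilDone {
>   // Polls until the job reports completion, sleeping between attempts.
>   // If the job already failed terminally and is never rescheduled, a loop
>   // like this never observes "done" and the bundle never finishes.
>   static void waitForDone(BooleanSupplier jobIsDone)
>       throws IOException, InterruptedException {
>     Sleeper sleeper = Sleeper.DEFAULT;
>     BackOff backOff = new ExponentialBackOff(); // grows the sleep between polls
>     while (!jobIsDone.getAsBoolean()) {
>       if (!BackOffUtils.next(sleeper, backOff)) {
>         break; // backoff exhausted; give up waiting
>       }
>     }
>   }
> }
> {code}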
> At the same time, no message is sent to the DropInputs step; even though I 
> created my own dead-letter step, the process thinks that it hasn't failed 
> yet.
> !Screenshot 2019-07-06 at 15.05.04.png|width=490,height=306!
> The only option I have found so far is to pre-validate all the fields 
> beforehand (see the sketch below), but I was expecting the database to do 
> that for me, especially in some edge cases (like decimal numbers or 
> constraint limitations). Please help fix this issue; otherwise the batch 
> option in streaming jobs is almost useless, because I can't trust the 
> library itself to manage dead letters properly.
>  
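> A minimal sketch of that pre-validation workaround, assuming a hypothetical 
> REQUIRED_FIELDS list; the field names, tags, and the {{rows}}, 
> {{bigQueryWrite}} and {{deadLetterSink}} identifiers are illustrative, not 
> part of the reported pipeline:
> {code:java}
> import com.google.api.services.bigquery.model.TableRow;
> import java.util.Arrays;
> import java.util.List;
> import org.apache.beam.sdk.transforms.DoFn;
> import org.apache.beam.sdk.transforms.ParDo;
> import org.apache.beam.sdk.values.PCollectionTuple;
> import org.apache.beam.sdk.values.TupleTag;
> import org.apache.beam.sdk.values.TupleTagList;
> 
> // Tags for the two outputs: rows safe to load, and rows for the dead letter.
> final TupleTag<TableRow> VALID = new TupleTag<TableRow>() {};
> final TupleTag<TableRow> DEAD_LETTER = new TupleTag<TableRow>() {};
> // Hypothetical list of fields the target schema marks as REQUIRED.
> final List<String> REQUIRED_FIELDS = Arrays.asList("id", "event_timestamp");
> 
> PCollectionTuple validated = rows.apply("PreValidate",
>     ParDo.of(new DoFn<TableRow, TableRow>() {
>       @ProcessElement
>       public void processElement(ProcessContext c) {
>         TableRow row = c.element();
>         boolean ok = REQUIRED_FIELDS.stream().allMatch(f -> row.get(f) != null);
>         if (ok) {
>           c.output(row);              // continues to the FILE_LOADS write
>         } else {
>           c.output(DEAD_LETTER, row); // routed to your own dead-letter sink
>         }
>       }
>     }).withOutputTags(VALID, TupleTagList.of(DEAD_LETTER)));
> 
> validated.get(VALID).apply(bigQueryWrite);        // the BigQueryIO transform above
> validated.get(DEAD_LETTER).apply(deadLetterSink); // e.g. a Pub/Sub or GCS writer
> {code}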



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-7693) FILE_LOADS option for inserting rows in BigQuery creates a stuck process in Dataflow that saturates all the resources of the Job

2019-07-15 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/BEAM-7693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismaël Mejía updated BEAM-7693:
---
Component/s: (was: io-java-files)
 runner-dataflow
 io-java-gcp




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (BEAM-7693) FILE_LOADS option for inserting rows in BigQuery creates a stuck process in Dataflow that saturates all the resources of the Job

2019-07-15 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/BEAM-7693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismaël Mejía updated BEAM-7693:
---
Status: Open  (was: Triage Needed)




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (BEAM-7693) FILE_LOADS option for inserting rows in BigQuery creates a stuck process in Dataflow that saturates all the resources of the Job

2019-07-06 Thread Juan Urrego (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Juan Urrego updated BEAM-7693:
--
Description: 
During a streaming job, when you insert records into BigQuery in batch using 
the FILE_LOADS option and one of the load jobs fails, the thread that failed 
gets stuck and eventually saturates the job's resources, making the autoscaling 
option useless (the job uses the maximum number of workers and the system 
latency keeps climbing). In some cases it becomes ridiculously slow at 
processing the incoming events.

Here is an example:
{code:java}
BigQueryIO.writeTableRows()
    .to(destinationTableSerializableFunction)  // dynamic table destination per row
    .withMethod(Method.FILE_LOADS)             // batch load jobs instead of streaming inserts
    .withJsonSchema(tableSchema)
    .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
    .withWriteDisposition(WriteDisposition.WRITE_APPEND)
    .withTriggeringFrequency(Duration.standardMinutes(5)) // start a load every 5 minutes
    .withNumFileShards(25);                    // required when a triggering frequency is set
{code}
The pipeline works like a charm, but the moment I send a bad tableRow (for 
instance, a required field set to null) the pipeline starts emitting these 
messages:
{code:java}
Processing stuck in step FILE_LOADS:  in 
BigQuery/BatchLoads/SinglePartitionWriteTables/ParMultiDo(WriteTables) for at 
least 10m00s without outputting or completing in state finish at 
java.lang.Thread.sleep(Native Method) at 
com.google.api.client.util.Sleeper$1.sleep(Sleeper.java:42) at 
com.google.api.client.util.BackOffUtils.next(BackOffUtils.java:48) at 
org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers$PendingJobManager.nextBackOff(BigQueryHelpers.java:159)
 at 
org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers$PendingJobManager.waitForDone(BigQueryHelpers.java:145)
 at 
org.apache.beam.sdk.io.gcp.bigquery.WriteTables$WriteTablesDoFn.finishBundle(WriteTables.java:255)
 at 
org.apache.beam.sdk.io.gcp.bigquery.WriteTables$WriteTablesDoFn$DoFnInvoker.invokeFinishBundle(Unknown
 Source)
{code}
It's clear that the step keeps running even though it failed. The BigQuery job 
reports that the load failed, but Dataflow keeps waiting for a response, even 
though the job is never executed again.

!image-2019-07-06-15-04-17-593.png|width=497,height=134!

At the same time, no message is sent to the DropInputs step; even though I 
created my own dead-letter step, the process thinks that it hasn't failed yet.

!Screenshot 2019-07-06 at 15.05.04.png|width=490,height=306!

The only option I have found so far is to pre-validate all the fields 
beforehand, but I was expecting the database to do that for me, especially in 
some edge cases (like decimal numbers or constraint limitations). Please help 
fix this issue; otherwise the batch option in streaming jobs is almost useless, 
because I can't trust the library itself to manage dead letters properly.

 




[jira] [Updated] (BEAM-7693) FILE_LOADS option for inserting rows in BigQuery creates a stuck process in Dataflow that saturates all the resources of the Job

2019-07-06 Thread Juan Urrego (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Juan Urrego updated BEAM-7693:
--
Attachment: (was: Screenshot 2019-07-06 at 14.51.36.png)




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-7693) FILE_LOADS option for inserting rows in BigQuery creates a stuck process in Dataflow that saturates all the resources of the Job

2019-07-06 Thread Juan Urrego (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Juan Urrego updated BEAM-7693:
--
Attachment: Screenshot 2019-07-06 at 15.05.04.png




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-7693) FILE_LOADS option for inserting rows in BigQuery creates a stuck process in Dataflow that saturates all the resources of the Job

2019-07-06 Thread Juan Urrego (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Juan Urrego updated BEAM-7693:
--
Attachment: image-2019-07-06-15-04-17-593.png




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-7693) FILE_LOADS option for inserting rows in BigQuery creates a stuck process in Dataflow that saturates all the resources of the Job

2019-07-06 Thread Juan Urrego (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Juan Urrego updated BEAM-7693:
--
Attachment: Screenshot 2019-07-06 at 14.51.36.png




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)