Ning Kang created BEAM-7812:
-------------------------------

             Summary: Work around Stackdriver error reporting double counting 
worker errors
                 Key: BEAM-7812
                 URL: https://issues.apache.org/jira/browse/BEAM-7812
             Project: Beam
          Issue Type: Bug
          Components: runner-dataflow
            Reporter: Ning Kang


h1. *Objective*

Work around Stackdriver Error Reporting so that worker errors are counted only 
once even though they are logged twice.

{color:#d04437}*Only applicable to Dataflow runner workers in the SDK*{color}.
h1. *Background*

Stackdriver Error Reporting double counts worker errors logged to 
Stackdriver because:
 # workers log errors to Stackdriver;
 # workers report the same errors to dfe, and dfe logs them to Stackdriver 
again.

The double counting is blocking us from sending job message logs from dfe to 
Stackdriver, because we don't want to change the behavior of any existing log 
or feature.

There happens to be an inconsistency in Java batch 
([DataflowWorkerLoggingHandler|[https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/logging/DataflowWorkerLoggingHandler.java#L82]])
 and streaming 
([StreamingDataflowWorker|[https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/StreamingDataflowWorker.java#L1747]])
 error reporting to dfe. Because of this inconsistency, errors reported by the 
streaming Java worker are eventually ignored by Stackdriver Error Reporting.
h1. *Details*

Inspired by this inconsistency, we will apply the streaming Java worker 
error reporting logic to batch, which both fixes the inconsistency and works 
around the double-counting issue in Stackdriver Error Reporting.

The change affects how workers report errors to dfe:
 * For Java, construct the stack trace from the StackTrace object instead of 
using printStackTrace;
 * For Python, report the complete error message details, exactly as they are 
sent to worker logging, instead of reporting only the traceback obtained 
through the traceback module.
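
As a rough illustration of the Python item above, the sketch below contrasts a traceback-only report with a report that carries the complete error details. It is not the actual worker code: the function names traceback_only, full_error_details and failing_step are hypothetical, and the exact text the worker sends to dfe is an assumption. Only the standard traceback module is used.

```python
import traceback


def traceback_only(exc):
    """Hypothetical old-style report: stack frames only, no exception line."""
    return ''.join(traceback.format_tb(exc.__traceback__))


def full_error_details(exc):
    """Hypothetical new-style report: header, frames and 'Type: message'.

    Assumption: this mirrors the text that worker logging would emit for
    the same exception.
    """
    return ''.join(
        traceback.format_exception(type(exc), exc, exc.__traceback__))


def failing_step():
    # Stand-in for user code that fails on a worker.
    raise ValueError('bad element')


try:
    failing_step()
except ValueError as e:
    frames_only = traceback_only(e)
    complete = full_error_details(e)
```

Here frames_only lists the call frames but omits the final exception line, while complete includes the "Traceback (most recent call last):" header and the closing "ValueError: bad element" line, so the text reported to dfe can match what was logged.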

Users will not experience any change, since job message logging to Stackdriver 
hasn’t been launched yet.
h1. *Test Plan*

We'll add unit tests for the public methods changed in the process.

Google has internal integration tests where we can push worker harness images 
and set the worker harness container image to test in a sandbox.

When releasing, we also run integration tests at the different release stages.

The workaround needs to be released completely before we can enable job message 
logging.

We can verify the format of stack traces in the sandbox and release stages by 
executing example pipelines in our projects and directly browsing the prod 
Stackdriver logging and error reporting consoles. This should be done before 
and after enabling job message logging.

Run any other existing and required tests before sending the PR.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
