Ning Kang created BEAM-7812:
-------------------------------

Summary: Work around Stackdriver error reporting double counting worker errors
Key: BEAM-7812
URL: https://issues.apache.org/jira/browse/BEAM-7812
Project: Beam
Issue Type: Bug
Components: runner-dataflow
Reporter: Ning Kang

h1. *Objective*

Work around Stackdriver Error Reporting so that each worker error is counted only once, even though it is logged twice.

{color:#d04437}*Only applicable to Dataflow runner workers in the SDK*{color}.

h1. *Background*

Stackdriver Error Reporting double counts worker errors logged to Stackdriver, because:
# workers log errors to Stackdriver;
# workers report the same errors to dfe, and dfe logs them again to Stackdriver.

The double counting is blocking us from sending job message logs from dfe to Stackdriver, because we don't want to change the behavior of any existing log or feature.

There happens to be an inconsistency between how Java batch ([DataflowWorkerLoggingHandler|https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/logging/DataflowWorkerLoggingHandler.java#L82]) and streaming ([StreamingDataflowWorker|https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/StreamingDataflowWorker.java#L1747]) workers report errors to dfe. As a result, errors reported by the streaming Java worker end up being ignored by Stackdriver Error Reporting.

h1. *Details*

Inspired by this inconsistency, we decided to apply the streaming Java worker's error reporting logic to batch as well. This both fixes the inconsistency and works around the double counting in Stackdriver Error Reporting.

The change applies when workers report errors to dfe:
* For Java, construct the stack trace from the StackTrace object instead of using printStackTrace;
* For Python, report the complete error message details, exactly as they are sent to worker logging, instead of reporting only the traceback obtained through the traceback module.

Users will not experience any change, since job message logging to Stackdriver hasn't been launched yet.

h1. *Test Plan*

We'll add unit tests for the public methods changed in the process.
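
To illustrate the Python bullet under Details and the kind of property a unit test could check, here is a minimal sketch using only the standard {{traceback}} module. It is not the Beam worker code: the helper names {{traceback_only}} and {{full_error_details}} and the function {{failing_step}} are hypothetical, and the exact message format the worker sends to dfe is an assumption.

```python
import traceback


def traceback_only(exc):
    """Frames only: omits the final 'ExceptionType: message' line."""
    return "".join(traceback.format_tb(exc.__traceback__))


def full_error_details(exc):
    """Header, frames, and the final 'ExceptionType: message' line."""
    return "".join(
        traceback.format_exception(type(exc), exc, exc.__traceback__)
    )


def failing_step():
    # Hypothetical stand-in for user code that fails inside a worker.
    raise ValueError("bad element")


try:
    failing_step()
except ValueError as e:
    partial = traceback_only(e)       # what a traceback-only report carries
    complete = full_error_details(e)  # the complete error details
```

A test along these lines would assert that the reported string contains the exception type and message as well as the frames, so that the report and the worker log entry carry the same text.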
Google has internal integration tests where we can push worker harness images and set the worker harness container image to test in a sandbox. When releasing, we also have integration tests at different release stages. The workaround needs to be released completely before we can enable job message logging.

We can verify the format of stack traces in the sandbox and release stages by executing example pipelines in our projects and directly browsing the prod Stackdriver logging and error reporting consoles. This should be done both before and after enabling job message logging.

Run any other existing and required tests before sending the PR.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)