Re: [PR] [SPARK-47574][CORE] Introduce Structured Logging Framework [spark]

2024-04-05 Thread via GitHub


steveloughran commented on code in PR #45729:
URL: https://github.com/apache/spark/pull/45729#discussion_r1553727210


##
common/utils/src/test/scala/org/apache/spark/util/StructuredLoggingSuite.scala:
##
@@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.util
+
+import java.io.File
+import java.nio.file.Files
+
+import org.scalatest.funsuite.AnyFunSuite // scalastyle:ignore funsuite
+
+import org.apache.spark.internal.{Logging, MDC}
+import org.apache.spark.internal.LogKey.EXECUTOR_ID
+
+abstract class LoggingSuiteBase extends AnyFunSuite // scalastyle:ignore 
funsuite
+  with Logging {
+
+  protected def logFilePath: String
+
+  protected lazy val logFile: File = {
+val pwd = new File(".").getCanonicalPath
+new File(pwd + "/" + logFilePath)
+  }
+
+  // Returns the first line in the log file that contains the given substring.
+  protected def captureLogOutput(f: () => Unit): String = {
+val content = if (logFile.exists()) {
+  Files.readString(logFile.toPath)
+} else {
+  ""
+}
+f()
+val newContent = Files.readString(logFile.toPath)
+newContent.substring(content.length)
+  }
+}
+
+class StructuredLoggingSuite extends LoggingSuiteBase {
+  private val className = this.getClass.getName.stripSuffix("$")
+  override def logFilePath: String = "target/structured.log"
+
+  test("Structured logging") {
+val msg = "This is a log message"
+val logOutput = captureLogOutput(() => logError(msg))
+
+// scalastyle:off line.size.limit
+val pattern = s"""\\{"ts":"[^"]+","level":"ERROR","msg":"This is a log 
message","logger":"$className"}\n""".r
+// scalastyle:on
+assert(pattern.matches(logOutput))
+  }
+
+  test("Structured logging with MDC") {
+val logOutput = captureLogOutput(() => logError(log"Lost executor 
${MDC(EXECUTOR_ID, "1")}."))
+assert(logOutput.nonEmpty)
+// scalastyle:off line.size.limit
+val pattern1 = s"""\\{"ts":"[^"]+","level":"ERROR","msg":"Lost executor 
1.","context":\\{"executor_id":"1"},"logger":"$className"}\n""".r
+// scalastyle:on
+assert(pattern1.matches(logOutput))
+  }
+
+  test("Structured exception logging with MDC") {
+val exception = new RuntimeException("OOM")
+val logOutput = captureLogOutput(() =>
+  logError(log"Error in executor ${MDC(EXECUTOR_ID, "1")}.", exception))
+assert(logOutput.nonEmpty)
+// scalastyle:off line.size.limit
+val pattern = s"""\\{"ts":"[^"]+","level":"ERROR","msg":"Error in executor 
1.","context":\\{"executor_id":"1"},"exception":\\{"class":"java.lang.RuntimeException","msg":"OOM","stacktrace":.*},"logger":"$className"}\n""".r

Review Comment:
   1. you should be deserializing this rather than just doing pattern matching, 
then validate the contents
   2. and verify that on nested exceptions, you get both messages and stack 
traces



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47574][CORE] Introduce Structured Logging Framework [spark]

2024-04-04 Thread via GitHub


pan3793 commented on PR #45729:
URL: https://github.com/apache/spark/pull/45729#issuecomment-2038263897

   @gengliangwang thanks for the clarification, that makes sense.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47574][CORE] Introduce Structured Logging Framework [spark]

2024-04-04 Thread via GitHub


gengliangwang commented on PR #45729:
URL: https://github.com/apache/spark/pull/45729#issuecomment-2038127812

   > Seems it does not get implemented in this PR, in the migration PRs, we 
still inject the Spark identifiers like APP_ID manually in each message. 
   
   This will be done. 
   The current migration is necessary, for example, there are logs about 
executor/worker on driver. 
   
   
   > Another question is, as we use the enum LogKey to track all known MDC 
keys, is it possible to inject custom keys?
   
   It is possible, developers can either use ThreadContext from Log4j, or 
create a customized MessageWithContext entry
   ```
   case class MessageWithContext(message: String, context: 
java.util.HashMap[String, String])
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47574][CORE] Introduce Structured Logging Framework [spark]

2024-04-04 Thread via GitHub


pan3793 commented on PR #45729:
URL: https://github.com/apache/spark/pull/45729#issuecomment-2038083361

   @gengliangwang I see the SPIP docs say
   
   > Spark identifiers (e.g., query ID, executor ID, task ID) will be [tagged 
using 
ThreadContext](https://logging.apache.org/log4j/2.x/manual/thread-context.html#fish-tagging),
 e.g., ThreadContext.set(EXECUTOR_ID, executorId).
   
   Seems it does not get implemented in this PR, in the migration PRs, we still 
inject the Spark identifiers like APP_ID manually in each message. Another 
question is, as we use the `enum LogKey` to track all known MDC keys, is it 
possible to inject custom keys? For example, users may have custom labels on 
the Spark nodes, and they also want to aggregate logs by those custom labels.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org