[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/14659

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user Sherry302 commented on a diff in the pull request:
https://github.com/apache/spark/pull/14659#discussion_r80579059

--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2421,6 +2421,69 @@ private[spark] object Utils extends Logging {
   }
 
 /**
+ * An utility class used to set up Spark caller contexts to HDFS and Yarn. The `context` will be
+ * constructed by parameters passed in.
+ * When Spark applications run on Yarn and HDFS, its caller contexts will be written into Yarn RM
+ * audit log and hdfs-audit.log. That can help users to better diagnose and understand how
+ * specific applications impacting parts of the Hadoop system and potential problems they may be
+ * creating (e.g. overloading NN). As HDFS mentioned in HDFS-9184, for a given HDFS operation, it's
+ * very helpful to track which upper level job issues it.
+ *
+ * @param from who sets up the caller context (TASK, CLIENT, APPMASTER)
+ *
+ * The parameters below are optional:
+ * @param appId id of the app this task belongs to
+ * @param appAttemptId attempt id of the app this task belongs to
+ * @param jobId id of the job this task belongs to
+ * @param stageId id of the stage this task belongs to
+ * @param stageAttemptId attempt id of the stage this task belongs to
+ * @param taskId task id
+ * @param taskAttemptNumber task attempt id
+ * @since 2.0.1
+ */
+private[spark] class CallerContext(
+    from: String,
+    appId: Option[String] = None,
+    appAttemptId: Option[String] = None,
+    jobId: Option[Int] = None,
+    stageId: Option[Int] = None,
+    stageAttemptId: Option[Int] = None,
+    taskId: Option[Long] = None,
+    taskAttemptNumber: Option[Int] = None) extends Logging {
+
+  val AppId = if (appId.isDefined) s"_${appId.get}" else ""
--- End diff --

Done.
Github user Sherry302 commented on a diff in the pull request:
https://github.com/apache/spark/pull/14659#discussion_r80579032

--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
(same CallerContext scaladoc hunk as quoted above; the comment targets:)
+ * @since 2.0.1
--- End diff --

Done.
Github user tgravescs commented on a diff in the pull request:
https://github.com/apache/spark/pull/14659#discussion_r80482936

--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
(same CallerContext scaladoc hunk as quoted above; the comment targets:)
+  val AppId = if (appId.isDefined) s"_${appId.get}" else ""
--- End diff --

Make all these local vals start with lower case; add Str to them if you need to differentiate, e.g. appIdStr.
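The lower-cased, `Str`-suffixed vals asked for above can also drop the `isDefined`/`get` pattern in favor of `Option.map`/`getOrElse`. A minimal sketch (the `suffix` helper is illustrative, not from the PR):

```scala
// Illustrative sketch only (not the PR's code): lower-cased *Str vals,
// built with Option.map/getOrElse instead of the isDefined/get pattern.
def suffix[A](value: Option[A]): String =
  value.map(v => s"_$v").getOrElse("")

val appId: Option[String] = Some("application_1474394339641_0006")
val jobId: Option[Int] = None

val appIdStr = suffix(appId)  // "_application_1474394339641_0006"
val jobIdStr = suffix(jobId)  // "" when the field is absent
```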
Github user tgravescs commented on a diff in the pull request:
https://github.com/apache/spark/pull/14659#discussion_r80482539

--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
(same CallerContext scaladoc hunk as quoted above; the comment targets:)
+ * @since 2.0.1
--- End diff --

Sorry I missed this before. This is a new feature and will only go into 2.1; let's just remove the @since since this isn't public API.
Github user Sherry302 commented on a diff in the pull request:
https://github.com/apache/spark/pull/14659#discussion_r80323261

--- Diff: core/src/main/scala/org/apache/spark/scheduler/Task.scala ---
@@ -54,7 +54,10 @@ private[spark] abstract class Task[T](
     val partitionId: Int,
     // The default value is only used in tests.
     val metrics: TaskMetrics = TaskMetrics.registered,
-    @transient var localProperties: Properties = new Properties) extends Serializable {
+    @transient var localProperties: Properties = new Properties,
+    val jobId: Option[Int] = None,
--- End diff --

OK. Thanks, @tgravescs.
Github user tgravescs commented on a diff in the pull request:
https://github.com/apache/spark/pull/14659#discussion_r80239469

--- Diff: core/src/main/scala/org/apache/spark/scheduler/Task.scala ---
(same Task.scala hunk as quoted above; the comment targets:)
+    val jobId: Option[Int] = None,
--- End diff --

This is fine; let's leave it as is.
Github user Sherry302 commented on a diff in the pull request:
https://github.com/apache/spark/pull/14659#discussion_r80161383

--- Diff: core/src/main/scala/org/apache/spark/scheduler/Task.scala ---
(same Task.scala hunk as quoted above; the comment targets:)
+    val jobId: Option[Int] = None,
--- End diff --

Hi, @tgravescs. I want to confirm with you whether I can just change and fix up everywhere that calls/extends Task. I can do this, but it may change many test classes/cases.
Github user tgravescs commented on a diff in the pull request:
https://github.com/apache/spark/pull/14659#discussion_r79849949

--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2420,29 +2420,54 @@ private[spark] object Utils extends Logging {
   }
 }
 
+/**
+ * An utility class used to set up Spark caller contexts to HDFS and Yarn. The `context` will be
+ * constructed by parameters passed in.
+ * When Spark applications run on Yarn and HDFS, its caller contexts will be written into Yarn RM
+ * audit log and hdfs-audit.log. That can help users to better diagnose and understand how
+ * specific applications impacting parts of the Hadoop system and potential problems they may be
+ * creating (e.g. overloading NN). As HDFS mentioned in HDFS-9184, for a given HDFS operation, it's
+ * very helpful to track which upper level job issues it.
+ *
+ * @param from who sets up the caller context (TASK, CLIENT, APPLICATION_MASTER)
+ *
+ * The parameters below are optional:
+ * @param appID id of the app this task belongs to
+ * @param appAttemptID attempt id of the app this task belongs to
+ * @param jobID id of the job this task belongs to
+ * @param stageID id of the stage this task belongs to
+ * @param stageAttemptId attempt id of the stage this task belongs to
+ * @param taskId task id
+ * @param taskAttemptNumber task attempt id
+ * @since 2.0.1
+ */
 private[spark] class CallerContext(
-    appName: Option[String] = None,
-    appID: Option[String] = None,
-    appAttemptID: Option[String] = None,
-    jobID: Option[Int] = None,
-    stageID: Option[Int] = None,
+    from: String,
+    appId: Option[String] = None,
+    appAttemptId: Option[String] = None,
+    jobId: Option[Int] = None,
+    stageId: Option[Int] = None,
     stageAttemptId: Option[Int] = None,
     taskId: Option[Long] = None,
     taskAttemptNumber: Option[Int] = None) extends Logging {
 
-  val AppName = if (appName.isDefined) s"_AppName_${appName.get}" else ""
-  val AppID = if (appID.isDefined) s"_AppID_${appID.get}" else ""
-  val AppAttemptID = if (appAttemptID.isDefined) s"_${appAttemptID.get}" else ""
-  val JobID = if (jobID.isDefined) s"_JobID_${jobID.get}" else ""
-  val StageID = if (stageID.isDefined) s"_StageID_${stageID.get}" else ""
-  val StageAttemptId = if (stageAttemptId.isDefined) s"_${stageAttemptId.get}" else ""
+  val AppId = if (appId.isDefined) s"_AppId_${appId.get}" else ""
+  val AppAttemptId = if (appAttemptId.isDefined) s"_AttemptId_${appAttemptId.get}" else ""
--- End diff --

I don't agree with adding the AttemptId_ string back. I understand it helps readability, but it is a lot of characters compared to the actual attempt id. I want this thing to be as small as possible so as not to add extra overhead to the RPC calls. The strings from your example run are already 112 characters long (SPARK_TASK_AppId_application_1474394339641_0006_AttemptId_1_JobId_0_StageId_0_AttemptId_0_TaskId_14_AttemptNum_0); once you get into many tasks and stages, that could reach the default 128 characters printed to the audit log and get truncated.

How about SPARK_TASK_application_1474394339641_0006_1_JId_0_SId_0_0_TId_14_0, which is only 66 characters? Yes, the user may have to look up the format, but I think that is OK. If parsing this really turns out to be a big issue we can change it later, but I would rather have it smaller, with better performance, and with all the information (rather than it possibly getting truncated). Really, right now all you need is the task id and attempt, because the numbers just increase across jobs and stages; but having the job id and stage id would help find the task id quickly, and covers the case if that ever changes and task ids become unique per job or stage.
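The compact format proposed above can be sketched as a small builder. This is an illustration of the proposal, not the PR's final code: absent `Option` fields contribute nothing, and the short `JId`/`SId`/`TId` markers keep the string well under the 128-character default at which the audit log truncates.

```scala
// Sketch of the compact caller-context format; illustrative only.
object CompactCallerContext {
  def build(
      from: String,
      appId: Option[String] = None,
      appAttemptId: Option[String] = None,
      jobId: Option[Int] = None,
      stageId: Option[Int] = None,
      stageAttemptId: Option[Int] = None,
      taskId: Option[Long] = None,
      taskAttemptNumber: Option[Int] = None): String =
    Seq(
      Some(s"SPARK_$from"),               // origin marker is always present
      appId,                              // application_... is self-describing
      appAttemptId,
      jobId.map(id => s"JId_$id"),        // short markers instead of JobId_/StageId_
      stageId.map(id => s"SId_$id"),
      stageAttemptId.map(_.toString),
      taskId.map(id => s"TId_$id"),
      taskAttemptNumber.map(_.toString)
    ).flatten.mkString("_")               // None fields simply drop out
}

val ctx = CompactCallerContext.build("TASK",
  appId = Some("application_1474394339641_0006"), appAttemptId = Some("1"),
  jobId = Some(0), stageId = Some(0), stageAttemptId = Some(0),
  taskId = Some(14L), taskAttemptNumber = Some(0))
// ctx == "SPARK_TASK_application_1474394339641_0006_1_JId_0_SId_0_0_TId_14_0" (66 chars)
```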
Github user tgravescs commented on a diff in the pull request:
https://github.com/apache/spark/pull/14659#discussion_r79847913

--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala ---
@@ -202,8 +202,8 @@ private[spark] class ApplicationMaster(
       attemptID = Option(appAttemptId.getAttemptId.toString)
     }
 
-    new CallerContext(Option(System.getProperty("spark.app.name")),
-      Option(appAttemptId.getApplicationId.toString), attemptID).set()
+    new CallerContext("APPLICATION_MASTER",
--- End diff --

Truncate to APPMASTER.
Github user tgravescs commented on a diff in the pull request:
https://github.com/apache/spark/pull/14659#discussion_r79844130

--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
(same parameter-renaming hunk as quoted above; the comment targets:)
+  val AppId = if (appId.isDefined) s"_AppId_${appId.get}" else ""
--- End diff --

As mentioned, please remove AppId_ too. The application id is pretty obvious in the logs since it starts with application_, so there is no need to print the extra characters.
Github user tgravescs commented on a diff in the pull request:
https://github.com/apache/spark/pull/14659#discussion_r79843188

--- Diff: core/src/main/scala/org/apache/spark/scheduler/Task.scala ---
(same Task.scala hunk as quoted above; the comment targets:)
+    val jobId: Option[Int] = None,
--- End diff --

Task is a private Spark API, so there are no issues breaking the API. You don't need to deprecate and add a new overload; you can just change it and fix up everywhere that calls or extends it. There are a handful of test classes that extend it that would need to be fixed up. Let me think about it a bit more.
Github user tgravescs commented on a diff in the pull request:
https://github.com/apache/spark/pull/14659#discussion_r79841783

--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2420,6 +2420,44 @@ private[spark] object Utils extends Logging {
   }
 }
 
+private[spark] class CallerContext(
+    appName: Option[String] = None,
+    appID: Option[String] = None,
+    appAttemptID: Option[String] = None,
+    jobID: Option[Int] = None,
+    stageID: Option[Int] = None,
+    stageAttemptId: Option[Int] = None,
+    taskId: Option[Long] = None,
+    taskAttemptNumber: Option[Int] = None) extends Logging {
+
+  val AppName = if (appName.isDefined) s"_AppName_${appName.get}" else ""
--- End diff --

I'm not sure what you mean here by "there is no information about the application". It's pretty clear that the application id in your example is application_1473908768790_0007; we don't need the extra 6 characters AppID_ to tell us that.
Github user Sherry302 commented on a diff in the pull request:
https://github.com/apache/spark/pull/14659#discussion_r79695565

--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
(same original CallerContext hunk as quoted above; the comment targets:)
+  val AppName = if (appName.isDefined) s"_AppName_${appName.get}" else ""
--- End diff --

Yes. Done.
Github user Sherry302 commented on a diff in the pull request:
https://github.com/apache/spark/pull/14659#discussion_r79695449

--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
(same original CallerContext hunk as quoted above; the comment targets:)
+  val AppName = if (appName.isDefined) s"_AppName_${appName.get}" else ""
--- End diff --

I have updated the PR to remove appName, and replaced it with something to differentiate the context from the ApplicationMaster vs the Yarn client vs a task. But for AppID, I think it is better to keep it, since hdfs-audit.log otherwise carries no info about the application. For example, the record below was produced when a task did a read/write operation to HDFS; apart from `callerContext`, there is no other info about the application:
```
2016-09-14 22:29:06,526 INFO FSNamesystem.audit: allowed=true ugi=wyang (auth:SIMPLE) ip=/127.0.0.1 cmd=open src=/lr_big.txt dst=null perm=null proto=rpc callerContext=SPARK_AppID_application_1473908768790_0007_JobID_0_StageID_0_0_TaskId_2_0
```
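Since the application id is the only application-level clue in such an audit record, a log-analysis script can recover it from the callerContext field with a simple pattern. This regex helper is hypothetical, not part of the PR:

```scala
// Hypothetical log-analysis helper (not part of the PR): pull the YARN
// application id out of an hdfs-audit.log callerContext value.
val AppIdPattern = """application_\d+_\d+""".r

def appIdOf(callerContext: String): Option[String] =
  AppIdPattern.findFirstIn(callerContext)

val found = appIdOf(
  "SPARK_AppID_application_1473908768790_0007_JobID_0_StageID_0_0_TaskId_2_0")
// found == Some("application_1473908768790_0007")
```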
Github user Sherry302 commented on a diff in the pull request:
https://github.com/apache/spark/pull/14659#discussion_r79693044

--- Diff: core/src/main/scala/org/apache/spark/scheduler/ResultTask.scala ---
@@ -42,7 +42,10 @@ import org.apache.spark.rdd.RDD
  * input RDD's partitions).
  * @param localProperties copy of thread-local properties set by the user on the driver side.
  * @param metrics a [[TaskMetrics]] that is created at driver side and sent to executor side.
- */
+ * @param jobId id of the job this task belongs to
--- End diff --

Done.
Github user Sherry302 commented on a diff in the pull request:
https://github.com/apache/spark/pull/14659#discussion_r79692960

--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala ---
@@ -196,8 +198,13 @@ private[spark] class ApplicationMaster(
       // Set this internal configuration if it is running on cluster mode, this
       // configuration will be checked in SparkContext to avoid misuse of yarn cluster mode.
       System.setProperty("spark.yarn.app.id", appAttemptId.getApplicationId().toString())
+
+      attemptID = Option(appAttemptId.getAttemptId.toString)
     }
 
+    new CallerContext(Option(System.getProperty("spark.app.name")),
--- End diff --

Done.
Github user Sherry302 commented on a diff in the pull request:
https://github.com/apache/spark/pull/14659#discussion_r79693000

--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
(same original CallerContext hunk as quoted above, continuing through:)
+  val TaskId = if (taskId.isDefined) s"_TaskId_${taskId.get}" else ""
+  val TaskAttemptNumber = if (taskAttemptNumber.isDefined) s"_${taskAttemptNumber.get}" else ""
+
+  val context = "SPARK" + AppName + AppID + AppAttemptID +
+    JobID + StageID + StageAttemptId + TaskId + TaskAttemptNumber
+
+  def set(): Boolean = {
--- End diff --

Done.
Github user Sherry302 commented on a diff in the pull request:
https://github.com/apache/spark/pull/14659#discussion_r79692475

--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
(same original CallerContext hunk as quoted above, continuing through:)
+  def set(): Boolean = {
+    var succeed = false
+    try {
+      val callerContext = Utils.classForName("org.apache.hadoop.ipc.CallerContext")
--- End diff --

Yes. I have updated the PR.
Github user Sherry302 commented on a diff in the pull request: https://github.com/apache/spark/pull/14659#discussion_r79692046

--- Diff: core/src/main/scala/org/apache/spark/scheduler/Task.scala ---
@@ -54,7 +54,10 @@ private[spark] abstract class Task[T](
     val partitionId: Int,
     // The default value is only used in tests.
     val metrics: TaskMetrics = TaskMetrics.registered,
-    @transient var localProperties: Properties = new Properties) extends Serializable {
+    @transient var localProperties: Properties = new Properties,
+    val jobId: Option[Int] = None,
--- End diff --

These params are all optional so as not to break existing code that uses this API. An alternative would be to mark the current API as deprecated and add a new overloaded function with the new parameters. I am going to go that way. Any suggestions?
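The alternative described here (deprecate the current signature, add an overload that carries the new ids) can be sketched as follows. This is a hypothetical, heavily simplified `Task`, not the PR's actual code; only `jobId` and the id fields follow the diff, everything else is invented for illustration:

```scala
// Hypothetical illustration of the deprecate-and-overload alternative:
// keep the old constructor shape but mark it deprecated, and make the new
// primary constructor carry the extra ids.
abstract class Task[T](
    val stageId: Int,
    val partitionId: Int,
    val jobId: Option[Int],
    val appId: Option[String]) extends Serializable {

  @deprecated("use the constructor that also takes jobId and appId", "2.0.1")
  def this(stageId: Int, partitionId: Int) =
    this(stageId, partitionId, None, None)
}
```

Callers compiled against the old two-argument constructor keep working but see a deprecation warning, which is the compatibility trade-off being weighed in this comment.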
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/14659#discussion_r79457356

--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala ---
@@ -196,8 +198,13 @@ private[spark] class ApplicationMaster(
       // Set this internal configuration if it is running on cluster mode, this
       // configuration will be checked in SparkContext to avoid misuse of yarn cluster mode.
       System.setProperty("spark.yarn.app.id", appAttemptId.getApplicationId().toString())
+
+      attemptID = Option(appAttemptId.getAttemptId.toString)
     }
+
+    new CallerContext(Option(System.getProperty("spark.app.name")),
--- End diff --

Would it be useful to add something here to differentiate contexts set from the master vs. the client? Just a static string, like "master" for the application master and "client" for the client.
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/14659#discussion_r79457084

--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2420,6 +2420,44 @@ private[spark] object Utils extends Logging {
     }
   }
+private[spark] class CallerContext(
+    appName: Option[String] = None,
+    appID: Option[String] = None,
+    appAttemptID: Option[String] = None,
+    jobID: Option[Int] = None,
+    stageID: Option[Int] = None,
+    stageAttemptId: Option[Int] = None,
+    taskId: Option[Long] = None,
+    taskAttemptNumber: Option[Int] = None) extends Logging {
+
+  val AppName = if (appName.isDefined) s"_AppName_${appName.get}" else ""
--- End diff --

Thinking about this a little more, I think we should just remove the app name altogether. For tracking purposes the app id should be enough. The app name adds a bunch of unknowns (its length, weird characters, or just spaces) which could make entries in the audit log less useful.
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/14659#discussion_r79421792

--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2420,6 +2420,44 @@ private[spark] object Utils extends Logging {
     }
   }
+private[spark] class CallerContext(
+    appName: Option[String] = None,
+    appID: Option[String] = None,
+    appAttemptID: Option[String] = None,
+    jobID: Option[Int] = None,
+    stageID: Option[Int] = None,
+    stageAttemptId: Option[Int] = None,
+    taskId: Option[Long] = None,
+    taskAttemptNumber: Option[Int] = None) extends Logging {
+
+  val AppName = if (appName.isDefined) s"_AppName_${appName.get}" else ""
+  val AppID = if (appID.isDefined) s"_AppID_${appID.get}" else ""
+  val AppAttemptID = if (appAttemptID.isDefined) s"_${appAttemptID.get}" else ""
+  val JobID = if (jobID.isDefined) s"_JobID_${jobID.get}" else ""
+  val StageID = if (stageID.isDefined) s"_StageID_${stageID.get}" else ""
+  val StageAttemptId = if (stageAttemptId.isDefined) s"_${stageAttemptId.get}" else ""
+  val TaskId = if (taskId.isDefined) s"_TaskId_${taskId.get}" else ""
+  val TaskAttemptNumber = if (taskAttemptNumber.isDefined) s"_${taskAttemptNumber.get}" else ""
+
+  val context = "SPARK" + AppName + AppID + AppAttemptID +
+    JobID + StageID + StageAttemptId + TaskId + TaskAttemptNumber
+
+  def set(): Boolean = {
--- End diff --

Let's rename this to setCurrentContext().
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/14659#discussion_r79419642

--- Diff: core/src/main/scala/org/apache/spark/scheduler/Task.scala ---
@@ -54,7 +54,10 @@ private[spark] abstract class Task[T](
     val partitionId: Int,
     // The default value is only used in tests.
     val metrics: TaskMetrics = TaskMetrics.registered,
-    @transient var localProperties: Properties = new Properties) extends Serializable {
+    @transient var localProperties: Properties = new Properties,
+    val jobId: Option[Int] = None,
--- End diff --

Are these params all optional just to make it easier for different task types? The jobId and appId I think are mandatory now; the app attempt id is still genuinely optional. I'm leaning towards making these not be Option, so that if someone adds a new task type we make sure these are set up properly and thus the context is set properly.
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/14659#discussion_r79419042

--- Diff: core/src/main/scala/org/apache/spark/scheduler/ResultTask.scala ---
@@ -42,7 +42,10 @@ import org.apache.spark.rdd.RDD
 * input RDD's partitions).
 * @param localProperties copy of thread-local properties set by the user on the driver side.
 * @param metrics a [[TaskMetrics]] that is created at driver side and sent to executor side.
- */
+ * @param jobId id of the job this task belongs to
--- End diff --

The description should mention that these are optional.
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/14659#discussion_r79415724

--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2420,6 +2420,44 @@ private[spark] object Utils extends Logging {
     }
   }
+private[spark] class CallerContext(
+    appName: Option[String] = None,
+    appID: Option[String] = None,
+    appAttemptID: Option[String] = None,
+    jobID: Option[Int] = None,
+    stageID: Option[Int] = None,
+    stageAttemptId: Option[Int] = None,
+    taskId: Option[Long] = None,
+    taskAttemptNumber: Option[Int] = None) extends Logging {
+
+  val AppName = if (appName.isDefined) s"_AppName_${appName.get}" else ""
--- End diff --

For the appName and the appID, it feels like these should be obvious from the string itself, so I'm not sure we really need the extra _AppName_ and _AppID_ tags there. Are there cases where you think that isn't true?
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/14659#discussion_r79408637

--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2420,6 +2420,44 @@ private[spark] object Utils extends Logging {
     }
   }
+private[spark] class CallerContext(
+    appName: Option[String] = None,
+    appID: Option[String] = None,
+    appAttemptID: Option[String] = None,
+    jobID: Option[Int] = None,
+    stageID: Option[Int] = None,
+    stageAttemptId: Option[Int] = None,
+    taskId: Option[Long] = None,
+    taskAttemptNumber: Option[Int] = None) extends Logging {
+
+  val AppName = if (appName.isDefined) s"_AppName_${appName.get}" else ""
+  val AppID = if (appID.isDefined) s"_AppID_${appID.get}" else ""
+  val AppAttemptID = if (appAttemptID.isDefined) s"_${appAttemptID.get}" else ""
+  val JobID = if (jobID.isDefined) s"_JobID_${jobID.get}" else ""
+  val StageID = if (stageID.isDefined) s"_StageID_${stageID.get}" else ""
+  val StageAttemptId = if (stageAttemptId.isDefined) s"_${stageAttemptId.get}" else ""
+  val TaskId = if (taskId.isDefined) s"_TaskId_${taskId.get}" else ""
+  val TaskAttemptNumber = if (taskAttemptNumber.isDefined) s"_${taskAttemptNumber.get}" else ""
+
+  val context = "SPARK" + AppName + AppID + AppAttemptID +
+    JobID + StageID + StageAttemptId + TaskId + TaskAttemptNumber
+
+  def set(): Boolean = {
+    var succeed = false
+    try {
+      val callerContext = Utils.classForName("org.apache.hadoop.ipc.CallerContext")
--- End diff --

Actually, perhaps add a description to this entire class that explains what this is, what it applies to, and when it was added.
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/14659#discussion_r79403168

--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2420,6 +2420,44 @@ private[spark] object Utils extends Logging {
     }
   }
+private[spark] class CallerContext(
+    appName: Option[String] = None,
+    appID: Option[String] = None,
+    appAttemptID: Option[String] = None,
+    jobID: Option[Int] = None,
+    stageID: Option[Int] = None,
+    stageAttemptId: Option[Int] = None,
+    taskId: Option[Long] = None,
+    taskAttemptNumber: Option[Int] = None) extends Logging {
+
+  val AppName = if (appName.isDefined) s"_AppName_${appName.get}" else ""
+  val AppID = if (appID.isDefined) s"_AppID_${appID.get}" else ""
+  val AppAttemptID = if (appAttemptID.isDefined) s"_${appAttemptID.get}" else ""
+  val JobID = if (jobID.isDefined) s"_JobID_${jobID.get}" else ""
+  val StageID = if (stageID.isDefined) s"_StageID_${stageID.get}" else ""
+  val StageAttemptId = if (stageAttemptId.isDefined) s"_${stageAttemptId.get}" else ""
+  val TaskId = if (taskId.isDefined) s"_TaskId_${taskId.get}" else ""
+  val TaskAttemptNumber = if (taskAttemptNumber.isDefined) s"_${taskAttemptNumber.get}" else ""
+
+  val context = "SPARK" + AppName + AppID + AppAttemptID +
+    JobID + StageID + StageAttemptId + TaskId + TaskAttemptNumber
+
+  def set(): Boolean = {
+    var succeed = false
+    try {
+      val callerContext = Utils.classForName("org.apache.hadoop.ipc.CallerContext")
--- End diff --

Can you add a comment that this API was added in Hadoop 2.8?
Github user Sherry302 commented on a diff in the pull request: https://github.com/apache/spark/pull/14659#discussion_r78898595

--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2418,6 +2418,21 @@ private[spark] object Utils extends Logging {
       sparkJars.map(_.split(",")).map(_.filter(_.nonEmpty)).toSeq.flatten
     }
   }
+
+  def setCallerContext(context: String): Boolean = {
+    var succeed = false
+    try {
+      val Builder = Utils.classForName("org.apache.hadoop.ipc.CallerContext$Builder")
+      val builderInst = Builder.getConstructor(classOf[String]).newInstance(context)
+      val ret = Builder.getMethod("build").invoke(builderInst)
--- End diff --

Yes. hdfsContext is more readable.
Github user Sherry302 commented on a diff in the pull request: https://github.com/apache/spark/pull/14659#discussion_r78898257

--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2418,6 +2418,21 @@ private[spark] object Utils extends Logging {
       sparkJars.map(_.split(",")).map(_.filter(_.nonEmpty)).toSeq.flatten
     }
   }
+
+  def setCallerContext(context: String): Boolean = {
+    var succeed = false
+    try {
+      val Builder = Utils.classForName("org.apache.hadoop.ipc.CallerContext$Builder")
+      val builderInst = Builder.getConstructor(classOf[String]).newInstance(context)
+      val ret = Builder.getMethod("build").invoke(builderInst)
+      val callerContext = Utils.classForName("org.apache.hadoop.ipc.CallerContext")
--- End diff --

If `val callerContext = Utils.classForName("org.apache.hadoop.ipc.CallerContext")` is moved out of the `try` block, Spark will throw an exception when it runs on Hadoop before 2.8.0. Also, moving that line to the start of the `try` block does not make any difference, since `Utils.classForName("org.apache.hadoop.ipc.CallerContext$Builder")` also requires `org.apache.hadoop.ipc.CallerContext` to exist. I am not sure I got your point; could you please give more info about it? Thanks a lot.
Github user Sherry302 commented on a diff in the pull request: https://github.com/apache/spark/pull/14659#discussion_r78897068

--- Diff: core/src/main/scala/org/apache/spark/scheduler/Task.scala ---
@@ -79,6 +82,13 @@ private[spark] abstract class Task[T](
       metrics)
     TaskContext.setTaskContext(context)
     taskThread = Thread.currentThread()
+
+    val callerContext =
+      s"Spark_AppId_${appId.getOrElse("")}_AppAttemptId_${appAttemptId.getOrElse("None")}" +
+      s"_JobId_${jobId.getOrElse("0")}_StageID_${stageId}_stageAttemptId_${stageAttemptId}" +
+      s"_taskID_${taskAttemptId}_attemptNumber_${attemptNumber}"
+    Utils.setCallerContext(callerContext)
--- End diff --

Yes. Good catch!
Github user Sherry302 commented on a diff in the pull request: https://github.com/apache/spark/pull/14659#discussion_r78896928

--- Diff: core/src/main/scala/org/apache/spark/scheduler/Task.scala ---
@@ -79,6 +82,13 @@ private[spark] abstract class Task[T](
       metrics)
     TaskContext.setTaskContext(context)
     taskThread = Thread.currentThread()
+
+    val callerContext =
+      s"Spark_AppId_${appId.getOrElse("")}_AppAttemptId_${appAttemptId.getOrElse("None")}" +
+      s"_JobId_${jobId.getOrElse("0")}_StageID_${stageId}_stageAttemptId_${stageAttemptId}" +
+      s"_taskID_${taskAttemptId}_attemptNumber_${attemptNumber}"
--- End diff --

I have updated the PR to make the string shorter.
Github user Sherry302 commented on a diff in the pull request: https://github.com/apache/spark/pull/14659#discussion_r78896972

--- Diff: core/src/main/scala/org/apache/spark/scheduler/Task.scala ---
@@ -79,6 +82,13 @@ private[spark] abstract class Task[T](
       metrics)
     TaskContext.setTaskContext(context)
     taskThread = Thread.currentThread()
+
+    val callerContext =
+      s"Spark_AppId_${appId.getOrElse("")}_AppAttemptId_${appAttemptId.getOrElse("None")}" +
--- End diff --

Yes.
Github user Sherry302 commented on a diff in the pull request: https://github.com/apache/spark/pull/14659#discussion_r78896863

--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala ---
@@ -184,6 +184,9 @@ private[spark] class ApplicationMaster(
     try {
       val appAttemptId = client.getAttemptId()
+      var context = s"Spark_AppName_${System.getProperty("spark.app.name")}" +
--- End diff --

A CallerContext class has been added.
Github user Sherry302 commented on a diff in the pull request: https://github.com/apache/spark/pull/14659#discussion_r78896816

--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2418,6 +2418,21 @@ private[spark] object Utils extends Logging {
       sparkJars.map(_.split(",")).map(_.filter(_.nonEmpty)).toSeq.flatten
     }
   }
+
+  def setCallerContext(context: String): Boolean = {
+    var succeed = false
+    try {
+      val Builder = Utils.classForName("org.apache.hadoop.ipc.CallerContext$Builder")
+      val builderInst = Builder.getConstructor(classOf[String]).newInstance(context)
+      val ret = Builder.getMethod("build").invoke(builderInst)
+      val callerContext = Utils.classForName("org.apache.hadoop.ipc.CallerContext")
+      callerContext.getMethod("setCurrent", callerContext).invoke(null, ret)
+      succeed = true
+    } catch {
+      case NonFatal(e) => logDebug(s"$e", e)
--- End diff --

I have updated this to `case NonFatal(e) => logInfo("Fail to set Spark caller context", e)`.
Github user Sherry302 commented on a diff in the pull request: https://github.com/apache/spark/pull/14659#discussion_r78896707

--- Diff: core/src/main/scala/org/apache/spark/scheduler/Task.scala ---
@@ -54,7 +54,10 @@ private[spark] abstract class Task[T](
     val partitionId: Int,
     // The default value is only used in tests.
     val metrics: TaskMetrics = TaskMetrics.registered,
-    @transient var localProperties: Properties = new Properties) extends Serializable {
+    @transient var localProperties: Properties = new Properties,
+    val jobId: Option[Int] = None,
--- End diff --

Done.
Github user Sherry302 commented on a diff in the pull request: https://github.com/apache/spark/pull/14659#discussion_r78896692

--- Diff: core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala ---
@@ -51,8 +51,12 @@ private[spark] class ShuffleMapTask(
     partition: Partition,
     @transient private var locs: Seq[TaskLocation],
     metrics: TaskMetrics,
-    localProperties: Properties)
-  extends Task[MapStatus](stageId, stageAttemptId, partition.index, metrics, localProperties)
+    localProperties: Properties,
--- End diff --

Done.
Github user Sherry302 commented on a diff in the pull request: https://github.com/apache/spark/pull/14659#discussion_r78896681

--- Diff: core/src/main/scala/org/apache/spark/scheduler/ResultTask.scala ---
@@ -51,8 +51,12 @@ private[spark] class ResultTask[T, U](
     locs: Seq[TaskLocation],
     val outputId: Int,
     localProperties: Properties,
-    metrics: TaskMetrics)
-  extends Task[U](stageId, stageAttemptId, partition.index, metrics, localProperties)
+    metrics: TaskMetrics,
+    jobId: Option[Int] = None,
--- End diff --

Done.
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/14659#discussion_r78783274

--- Diff: core/src/main/scala/org/apache/spark/scheduler/Task.scala ---
@@ -79,6 +82,13 @@ private[spark] abstract class Task[T](
       metrics)
     TaskContext.setTaskContext(context)
     taskThread = Thread.currentThread()
+
+    val callerContext =
+      s"Spark_AppId_${appId.getOrElse("")}_AppAttemptId_${appAttemptId.getOrElse("None")}" +
+      s"_JobId_${jobId.getOrElse("0")}_StageID_${stageId}_stageAttemptId_${stageAttemptId}" +
+      s"_taskID_${taskAttemptId}_attemptNumber_${attemptNumber}"
--- End diff --

One concern I have about this is the length of the string. This string is going to be sent on every Hadoop RPC call and, I believe, then ends up in the HDFS audit log. The audit log does have a config to truncate it (default 128 characters), but it still gets sent over RPC in full. So I would like to keep this string as small as possible while still being useful. The one above seems very long to me, since the static text alone is about 90 characters. I would like to see some of those static strings removed or abbreviated.
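The shortening the reviewer asks for amounts to dropping the static tags and joining bare values with underscores, which is roughly the direction the PR took. A sketch with invented example values (the exact final format is defined by the PR's CallerContext class, not by this snippet):

```scala
// Compact context format: static tags dropped, bare values joined with
// underscores. Optional fields contribute nothing when absent, so the
// string never carries empty placeholders.
val appId = Some("application_1474394339641_0006")
val appAttemptId = Some("1")
val jobId = Some(2)

val context = "SPARK" +
  appId.map(v => s"_$v").getOrElse("") +
  appAttemptId.map(v => s"_$v").getOrElse("") +
  jobId.map(v => s"_JId_$v").getOrElse("")
// context == "SPARK_application_1474394339641_0006_1_JId_2" (47 chars),
// comfortably under the audit log's 128-character truncation default.
```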
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/14659#discussion_r78773807

--- Diff: core/src/main/scala/org/apache/spark/scheduler/Task.scala ---
@@ -79,6 +82,13 @@ private[spark] abstract class Task[T](
       metrics)
     TaskContext.setTaskContext(context)
     taskThread = Thread.currentThread()
+
+    val callerContext =
+      s"Spark_AppId_${appId.getOrElse("")}_AppAttemptId_${appAttemptId.getOrElse("None")}" +
--- End diff --

Can you change "Spark" to all caps, "SPARK"?
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/14659#discussion_r78769775

--- Diff: core/src/main/scala/org/apache/spark/scheduler/Task.scala ---
@@ -79,6 +82,13 @@ private[spark] abstract class Task[T](
       metrics)
     TaskContext.setTaskContext(context)
     taskThread = Thread.currentThread()
+
+    val callerContext =
+      s"Spark_AppId_${appId.getOrElse("")}_AppAttemptId_${appAttemptId.getOrElse("None")}" +
+      s"_JobId_${jobId.getOrElse("0")}_StageID_${stageId}_stageAttemptId_${stageAttemptId}" +
+      s"_taskID_${taskAttemptId}_attemptNumber_${attemptNumber}"
+    Utils.setCallerContext(callerContext)
--- End diff --

We should move this below the if (_killed) check, as there is no reason to log this if the task doesn't run.
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/14659#discussion_r78768851

--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala ---
@@ -184,6 +184,9 @@ private[spark] class ApplicationMaster(
     try {
       val appAttemptId = client.getAttemptId()
+      var context = s"Spark_AppName_${System.getProperty("spark.app.name")}" +
--- End diff --

It would be nice to have a class, or at least a util function, to construct the context string, so that we don't have the same string in multiple places.
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/14659#discussion_r78767924

--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2418,6 +2418,21 @@ private[spark] object Utils extends Logging {
       sparkJars.map(_.split(",")).map(_.filter(_.nonEmpty)).toSeq.flatten
     }
   }
+
+  def setCallerContext(context: String): Boolean = {
+    var succeed = false
+    try {
+      val Builder = Utils.classForName("org.apache.hadoop.ipc.CallerContext$Builder")
+      val builderInst = Builder.getConstructor(classOf[String]).newInstance(context)
+      val ret = Builder.getMethod("build").invoke(builderInst)
+      val callerContext = Utils.classForName("org.apache.hadoop.ipc.CallerContext")
+      callerContext.getMethod("setCurrent", callerContext).invoke(null, ret)
+      succeed = true
+    } catch {
+      case NonFatal(e) => logDebug(s"$e", e)
--- End diff --

Why are you printing the exception twice here? Make the actual log message something like "failed to set CallerContext" and let the exception be printed via the trailing `, e`.
[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/14659#discussion_r78767856

--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2418,6 +2418,21 @@ private[spark] object Utils extends Logging {
       sparkJars.map(_.split(",")).map(_.filter(_.nonEmpty)).toSeq.flatten
     }
   }
+
+  def setCallerContext(context: String): Boolean = {
+    var succeed = false
+    try {
+      val Builder = Utils.classForName("org.apache.hadoop.ipc.CallerContext$Builder")
+      val builderInst = Builder.getConstructor(classOf[String]).newInstance(context)
+      val ret = Builder.getMethod("build").invoke(builderInst)
--- End diff --

Perhaps name it something other than `ret` — `hdfsContext`, for example.
[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/14659#discussion_r78767418

--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2418,6 +2418,21 @@ private[spark] object Utils extends Logging {
       sparkJars.map(_.split(",")).map(_.filter(_.nonEmpty)).toSeq.flatten
     }
   }
+
+  def setCallerContext(context: String): Boolean = {
+    var succeed = false
+    try {
+      val Builder = Utils.classForName("org.apache.hadoop.ipc.CallerContext$Builder")
+      val builderInst = Builder.getConstructor(classOf[String]).newInstance(context)
+      val ret = Builder.getMethod("build").invoke(builderInst)
+      val callerContext = Utils.classForName("org.apache.hadoop.ipc.CallerContext")
--- End diff --

It might be nice to do this first, outside of this try block: if the class isn't found, we just skip the rest of this and assume it's before Hadoop 2.8.0.
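The "probe the class first, then reflect" idea can be sketched like this. It is demonstrated against JDK class names, since Hadoop's `CallerContext` only exists from 2.8.0 on and may be absent entirely; the helper name `ClassProbe` is made up for illustration:

```java
// Illustrative sketch: check once whether the reflective target exists,
// so callers can skip the whole reflective setup when it doesn't
// (e.g. running against Hadoop older than 2.8.0).
class ClassProbe {
    static boolean isAvailable(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false; // older runtime: quietly skip caller-context setup
        }
    }
}
```

A caller would then guard the reflection with something like `if (!ClassProbe.isAvailable("org.apache.hadoop.ipc.CallerContext")) return false;` before constructing the builder, keeping the ClassNotFound case out of the main try block.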
[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/14659#discussion_r78758857

--- Diff: core/src/main/scala/org/apache/spark/scheduler/Task.scala ---
@@ -54,7 +54,10 @@ private[spark] abstract class Task[T](
     val partitionId: Int,
     // The default value is only used in tests.
     val metrics: TaskMetrics = TaskMetrics.registered,
-    @transient var localProperties: Properties = new Properties) extends Serializable {
+    @transient var localProperties: Properties = new Properties,
+    val jobId: Option[Int] = None,
--- End diff --

Update the params and descriptions.
[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/14659#discussion_r78758789

--- Diff: core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala ---
@@ -51,8 +51,12 @@ private[spark] class ShuffleMapTask(
     partition: Partition,
     @transient private var locs: Seq[TaskLocation],
     metrics: TaskMetrics,
-    localProperties: Properties)
-  extends Task[MapStatus](stageId, stageAttemptId, partition.index, metrics, localProperties)
+    localProperties: Properties,
--- End diff --

Update the descriptions and params in the javadoc above.
[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/14659#discussion_r78758711

--- Diff: core/src/main/scala/org/apache/spark/scheduler/ResultTask.scala ---
@@ -51,8 +51,12 @@ private[spark] class ResultTask[T, U](
     locs: Seq[TaskLocation],
     val outputId: Int,
     localProperties: Properties,
-    metrics: TaskMetrics)
-  extends Task[U](stageId, stageAttemptId, partition.index, metrics, localProperties)
+    metrics: TaskMetrics,
+    jobId: Option[Int] = None,
--- End diff --

You need to update the description of this class with the new params and their descriptions.
[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS
Github user steveloughran commented on a diff in the pull request: https://github.com/apache/spark/pull/14659#discussion_r76952473

--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2418,6 +2418,18 @@ private[spark] object Utils extends Logging {
       sparkJars.map(_.split(",")).map(_.filter(_.nonEmpty)).toSeq.flatten
     }
   }
+
+  def setCallerContext(context: String): Unit = {
--- End diff --

If this is set to return a Boolean on success, you could have a unit test which asserts that, when called on a version of Hadoop with the resource `/org/apache/hadoop/ipc/CallerContext.class`, it returns true. That way you could assert that "on a version of Hadoop with a caller context, this code sets it". That will help catch any regressions in the Hadoop code breaking this reflection stuff.
[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS
Github user steveloughran commented on a diff in the pull request: https://github.com/apache/spark/pull/14659#discussion_r76951863

--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2418,6 +2418,18 @@ private[spark] object Utils extends Logging {
       sparkJars.map(_.split(",")).map(_.filter(_.nonEmpty)).toSeq.flatten
     }
   }
+
+  def setCallerContext(context: String): Unit = {
+    try {
+      val Builder = Utils.classForName("org.apache.hadoop.ipc.CallerContext$Builder")
+      val builderInst = Builder.getConstructor(classOf[String]).newInstance(context)
+      val ret = Builder.getMethod("build").invoke(builderInst)
+      val callerContext = Utils.classForName("org.apache.hadoop.ipc.CallerContext")
+      callerContext.getMethod("setCurrent", callerContext).invoke(null, ret)
+    } catch {
+      case NonFatal(e) => logDebug(s"${e.getMessage}")
--- End diff --

Better to use `logDebug(s"$e", e)`. Some exceptions (e.g. `NullPointerException`) return null from `getMessage()`; you may also want the stack trace.
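The `getMessage()` pitfall here is easy to demonstrate with plain JDK code, nothing Spark-specific assumed (`ExceptionText` is just an illustrative name):

```java
// Why logging the exception's string form beats logging getMessage():
// a message-less exception yields null from getMessage(), while its
// toString() still names the exception class.
class ExceptionText {
    static String describe(Throwable t) {
        return String.valueOf(t); // never null, unlike t.getMessage()
    }
}
```

This is the same reason the review suggests `logDebug(s"$e", e)`: the interpolation uses the exception's toString, and passing `e` as the second argument preserves the stack trace.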
[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS
GitHub user Sherry302 opened a pull request:

    https://github.com/apache/spark/pull/14659

    [SPARK-16757] Set up Spark caller context to HDFS

## What changes were proposed in this pull request?

1. Pass `jobId` to Task.
2. Invoke Hadoop APIs. A new function `setCallerContext` is added in `Utils`. `setCallerContext` invokes the APIs of `org.apache.hadoop.ipc.CallerContext` to set up Spark caller contexts, which will be written into `hdfs-audit.log`.

For applications in Yarn client mode, `org.apache.hadoop.ipc.CallerContext` is called in `Task` and the Yarn `Client`. For applications in Yarn cluster mode, `org.apache.hadoop.ipc.CallerContext` is called in `Task` and the `ApplicationMaster`.

The Spark caller contexts written into `hdfs-audit.log` are the application's name `{spark.app.name}` and `JobId_stageId_stageAttemptId_taskId_attemptNumber`.

## How was this patch tested?

Manual tests against some Spark applications in Yarn client mode and Yarn cluster mode, checking that Spark caller contexts are written into the HDFS `hdfs-audit.log` successfully. For example, run SparkKMeans in Yarn client mode:

`./bin/spark-submit --master yarn --deploy-mode client --class org.apache.spark.examples.SparkKMeans examples/target/original-spark-examples_2.11-2.1.0-SNAPSHOT.jar hdfs://localhost:9000/lr_big.txt 2 5`

Before: there is no Spark caller context in the records of `hdfs-audit.log`.

After: Spark caller contexts appear in the records of `hdfs-audit.log`.
(_Note: the Spark caller context below was set because the Hadoop caller context API was invoked in the Yarn `Client`_)

`2016-07-21 13:52:30,802 INFO FSNamesystem.audit: allowed=true ugi=wyang (auth:SIMPLE) ip=/127.0.0.1 cmd=getfileinfo src=/lr_big.txt dst=null perm=null proto=rpc callerContext=SparkKMeans running on Spark`

(_Note: the Spark caller context below was set because the Hadoop caller context API was invoked in a `Task`_)

`2016-07-21 13:52:35,584 INFO FSNamesystem.audit: allowed=true ugi=wyang (auth:SIMPLE) ip=/127.0.0.1 cmd=open src=/lr_big.txt dst=null perm=null proto=rpc callerContext=JobId_0_StageID_0_stageAttemptId_0_taskID_0_attemptNumber_0`

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Sherry302/spark callercontextSubmit

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14659.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #14659

commit ec6833d32ef14950b2d81790bc908992f6288815
Author: Weiqing Yang
Date: 2016-08-16T04:11:41Z

    [SPARK-16757] Set up Spark caller context to HDFS