[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS...

2016-09-27 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/14659



[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS...

2016-09-26 Thread Sherry302
Github user Sherry302 commented on a diff in the pull request:

https://github.com/apache/spark/pull/14659#discussion_r80579059
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2421,6 +2421,69 @@ private[spark] object Utils extends Logging {
 }
 
 /**
+ * A utility class used to set up Spark caller contexts to HDFS and Yarn. The `context` will be
+ * constructed from the parameters passed in.
+ * When Spark applications run on Yarn and HDFS, their caller contexts will be written into the
+ * Yarn RM audit log and hdfs-audit.log. That can help users better diagnose and understand how
+ * specific applications impact parts of the Hadoop system and what potential problems they may
+ * be creating (e.g. overloading the NN). As mentioned in HDFS-9184, for a given HDFS operation
+ * it is very helpful to track which upper-level job issued it.
+ *
+ * @param from who sets up the caller context (TASK, CLIENT, APPMASTER)
+ *
+ * The parameters below are optional:
+ * @param appId id of the app this task belongs to
+ * @param appAttemptId attempt id of the app this task belongs to
+ * @param jobId id of the job this task belongs to
+ * @param stageId id of the stage this task belongs to
+ * @param stageAttemptId attempt id of the stage this task belongs to
+ * @param taskId task id
+ * @param taskAttemptNumber task attempt id
+ * @since 2.0.1
+ */
+private[spark] class CallerContext(
+   from: String,
+   appId: Option[String] = None,
+   appAttemptId: Option[String] = None,
+   jobId: Option[Int] = None,
+   stageId: Option[Int] = None,
+   stageAttemptId: Option[Int] = None,
+   taskId: Option[Long] = None,
+   taskAttemptNumber: Option[Int] = None) extends Logging {
+
+   val AppId = if (appId.isDefined) s"_${appId.get}" else ""
--- End diff --

Done.



[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS...

2016-09-26 Thread Sherry302
Github user Sherry302 commented on a diff in the pull request:

https://github.com/apache/spark/pull/14659#discussion_r80579032
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2421,6 +2421,69 @@ private[spark] object Utils extends Logging {
 }
 
 /**
+ * A utility class used to set up Spark caller contexts to HDFS and Yarn. The `context` will be
+ * constructed from the parameters passed in.
+ * When Spark applications run on Yarn and HDFS, their caller contexts will be written into the
+ * Yarn RM audit log and hdfs-audit.log. That can help users better diagnose and understand how
+ * specific applications impact parts of the Hadoop system and what potential problems they may
+ * be creating (e.g. overloading the NN). As mentioned in HDFS-9184, for a given HDFS operation
+ * it is very helpful to track which upper-level job issued it.
+ *
+ * @param from who sets up the caller context (TASK, CLIENT, APPMASTER)
+ *
+ * The parameters below are optional:
+ * @param appId id of the app this task belongs to
+ * @param appAttemptId attempt id of the app this task belongs to
+ * @param jobId id of the job this task belongs to
+ * @param stageId id of the stage this task belongs to
+ * @param stageAttemptId attempt id of the stage this task belongs to
+ * @param taskId task id
+ * @param taskAttemptNumber task attempt id
+ * @since 2.0.1
--- End diff --

Done.



[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS...

2016-09-26 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/14659#discussion_r80482936
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2421,6 +2421,69 @@ private[spark] object Utils extends Logging {
 }
 
 /**
+ * A utility class used to set up Spark caller contexts to HDFS and Yarn. The `context` will be
+ * constructed from the parameters passed in.
+ * When Spark applications run on Yarn and HDFS, their caller contexts will be written into the
+ * Yarn RM audit log and hdfs-audit.log. That can help users better diagnose and understand how
+ * specific applications impact parts of the Hadoop system and what potential problems they may
+ * be creating (e.g. overloading the NN). As mentioned in HDFS-9184, for a given HDFS operation
+ * it is very helpful to track which upper-level job issued it.
+ *
+ * @param from who sets up the caller context (TASK, CLIENT, APPMASTER)
+ *
+ * The parameters below are optional:
+ * @param appId id of the app this task belongs to
+ * @param appAttemptId attempt id of the app this task belongs to
+ * @param jobId id of the job this task belongs to
+ * @param stageId id of the stage this task belongs to
+ * @param stageAttemptId attempt id of the stage this task belongs to
+ * @param taskId task id
+ * @param taskAttemptNumber task attempt id
+ * @since 2.0.1
+ */
+private[spark] class CallerContext(
+   from: String,
+   appId: Option[String] = None,
+   appAttemptId: Option[String] = None,
+   jobId: Option[Int] = None,
+   stageId: Option[Int] = None,
+   stageAttemptId: Option[Int] = None,
+   taskId: Option[Long] = None,
+   taskAttemptNumber: Option[Int] = None) extends Logging {
+
+   val AppId = if (appId.isDefined) s"_${appId.get}" else ""
--- End diff --

Make all these local vals start with a lower-case letter, and add a Str suffix to them if you need to differentiate, e.g. appIdStr.
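
For illustration, a sketch of the suggested renaming (names are only illustrative, not the final code):

```scala
// Sketch only: local vals renamed to start lower-case, with a Str suffix
// where a val would otherwise shadow the constructor parameter of the same name.
val appIdStr = if (appId.isDefined) s"_${appId.get}" else ""
val appAttemptIdStr = if (appAttemptId.isDefined) s"_${appAttemptId.get}" else ""
val jobIdStr = if (jobId.isDefined) s"_${jobId.get}" else ""
```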



[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS...

2016-09-26 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/14659#discussion_r80482539
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2421,6 +2421,69 @@ private[spark] object Utils extends Logging {
 }
 
 /**
+ * A utility class used to set up Spark caller contexts to HDFS and Yarn. The `context` will be
+ * constructed from the parameters passed in.
+ * When Spark applications run on Yarn and HDFS, their caller contexts will be written into the
+ * Yarn RM audit log and hdfs-audit.log. That can help users better diagnose and understand how
+ * specific applications impact parts of the Hadoop system and what potential problems they may
+ * be creating (e.g. overloading the NN). As mentioned in HDFS-9184, for a given HDFS operation
+ * it is very helpful to track which upper-level job issued it.
+ *
+ * @param from who sets up the caller context (TASK, CLIENT, APPMASTER)
+ *
+ * The parameters below are optional:
+ * @param appId id of the app this task belongs to
+ * @param appAttemptId attempt id of the app this task belongs to
+ * @param jobId id of the job this task belongs to
+ * @param stageId id of the stage this task belongs to
+ * @param stageAttemptId attempt id of the stage this task belongs to
+ * @param taskId task id
+ * @param taskAttemptNumber task attempt id
+ * @since 2.0.1
--- End diff --

Sorry I missed this before. This is a new feature and will only go into 2.1; let's just remove the @since tag since this isn't a public API.



[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS...

2016-09-23 Thread Sherry302
Github user Sherry302 commented on a diff in the pull request:

https://github.com/apache/spark/pull/14659#discussion_r80323261
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/Task.scala ---
@@ -54,7 +54,10 @@ private[spark] abstract class Task[T](
 val partitionId: Int,
 // The default value is only used in tests.
 val metrics: TaskMetrics = TaskMetrics.registered,
-@transient var localProperties: Properties = new Properties) extends 
Serializable {
+@transient var localProperties: Properties = new Properties,
+val jobId: Option[Int] = None,
--- End diff --

OK. Thanks, @tgravescs .



[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS...

2016-09-23 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/14659#discussion_r80239469
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/Task.scala ---
@@ -54,7 +54,10 @@ private[spark] abstract class Task[T](
 val partitionId: Int,
 // The default value is only used in tests.
 val metrics: TaskMetrics = TaskMetrics.registered,
-@transient var localProperties: Properties = new Properties) extends 
Serializable {
+@transient var localProperties: Properties = new Properties,
+val jobId: Option[Int] = None,
--- End diff --

This is fine; let's leave it as is.



[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS...

2016-09-22 Thread Sherry302
Github user Sherry302 commented on a diff in the pull request:

https://github.com/apache/spark/pull/14659#discussion_r80161383
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/Task.scala ---
@@ -54,7 +54,10 @@ private[spark] abstract class Task[T](
 val partitionId: Int,
 // The default value is only used in tests.
 val metrics: TaskMetrics = TaskMetrics.registered,
-@transient var localProperties: Properties = new Properties) extends 
Serializable {
+@transient var localProperties: Properties = new Properties,
+val jobId: Option[Int] = None,
--- End diff --

Hi, @tgravescs. I want to confirm with you whether I can just change and fix up everywhere that calls or extends Task. I can do this, but it may change many test classes/cases.



[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS...

2016-09-21 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/14659#discussion_r79849949
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2420,29 +2420,54 @@ private[spark] object Utils extends Logging {
   }
 }
 
+/**
+ * A utility class used to set up Spark caller contexts to HDFS and Yarn. The `context` will be
+ * constructed from the parameters passed in.
+ * When Spark applications run on Yarn and HDFS, their caller contexts will be written into the
+ * Yarn RM audit log and hdfs-audit.log. That can help users better diagnose and understand how
+ * specific applications impact parts of the Hadoop system and what potential problems they may
+ * be creating (e.g. overloading the NN). As mentioned in HDFS-9184, for a given HDFS operation
+ * it is very helpful to track which upper-level job issued it.
+ *
+ * @param from who sets up the caller context (TASK, CLIENT, APPLICATION_MASTER)
+ *
+ * The parameters below are optional:
+ * @param appID id of the app this task belongs to
+ * @param appAttemptID attempt id of the app this task belongs to
+ * @param jobID id of the job this task belongs to
+ * @param stageID id of the stage this task belongs to
+ * @param stageAttemptId attempt id of the stage this task belongs to
+ * @param taskId task id
+ * @param taskAttemptNumber task attempt id
+ * @since 2.0.1
+ */
 private[spark] class CallerContext(
-   appName: Option[String] = None,
-   appID: Option[String] = None,
-   appAttemptID: Option[String] = None,
-   jobID: Option[Int] = None,
-   stageID: Option[Int] = None,
+   from: String,
+   appId: Option[String] = None,
+   appAttemptId: Option[String] = None,
+   jobId: Option[Int] = None,
+   stageId: Option[Int] = None,
stageAttemptId: Option[Int] = None,
taskId: Option[Long] = None,
taskAttemptNumber: Option[Int] = None) extends Logging {
 
-   val AppName = if (appName.isDefined) s"_AppName_${appName.get}" else ""
-   val AppID = if (appID.isDefined) s"_AppID_${appID.get}" else ""
-   val AppAttemptID = if (appAttemptID.isDefined) s"_${appAttemptID.get}" else ""
-   val JobID = if (jobID.isDefined) s"_JobID_${jobID.get}" else ""
-   val StageID = if (stageID.isDefined) s"_StageID_${stageID.get}" else ""
-   val StageAttemptId = if (stageAttemptId.isDefined) s"_${stageAttemptId.get}" else ""
+   val AppId = if (appId.isDefined) s"_AppId_${appId.get}" else ""
+   val AppAttemptId = if (appAttemptId.isDefined) s"_AttemptId_${appAttemptId.get}" else ""
--- End diff --

I don't agree with adding the AttemptId_ string back. I understand it helps readability, but it's a lot of characters compared to the actual attempt id. I want this thing to be as small as possible so as not to add extra overhead to the RPC calls. The strings from your example run are already 112 characters long (SPARK_TASK_AppId_application_1474394339641_0006_AttemptId_1_JobId_0_StageId_0_AttemptId_0_TaskId_14_AttemptNum_0); once you get into many tasks and stages, that could exceed the default 128 characters printed to the audit log and get truncated.

How about SPARK_TASK_application_1474394339641_0006_1_JId_0_SId_0_0_TId_14_0? That is only 66 characters. Yes, the user may have to look up the format, but I think that is ok. If parsing this really turns out to be that big of an issue we can change it later, but I would rather have it smaller, get better performance, and keep all the information (rather than it possibly getting truncated).

Really, right now I think all you need is the task id and attempt number, because those ids just increase across jobs and stages; but having the job id and stage id would be helpful to find the task id quickly, and it handles the case if that ever changes and task ids become unique per job or stage.
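
For illustration, a sketch of how such a compact context string could be assembled (the JId/SId/TId prefixes and the helper name are only illustrative, not the final implementation):

```scala
// Sketch only: builds a compact string such as
// SPARK_TASK_application_1474394339641_0006_1_JId_0_SId_0_0_TId_14_0
def compactContext(
    from: String,                        // e.g. "TASK", "CLIENT", "APPMASTER"
    appId: Option[String] = None,
    appAttemptId: Option[String] = None,
    jobId: Option[Int] = None,
    stageId: Option[Int] = None,
    stageAttemptId: Option[Int] = None,
    taskId: Option[Long] = None,
    taskAttemptNumber: Option[Int] = None): String = {
  // Only the fields whose meaning is not obvious from the value get a short prefix.
  val parts = Seq(
    appId.map(id => s"_$id"),
    appAttemptId.map(id => s"_$id"),
    jobId.map(id => s"_JId_$id"),
    stageId.map(id => s"_SId_$id"),
    stageAttemptId.map(id => s"_$id"),
    taskId.map(id => s"_TId_$id"),
    taskAttemptNumber.map(id => s"_$id"))
  s"SPARK_$from" + parts.flatten.mkString
}
```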



[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS...

2016-09-21 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/14659#discussion_r79847913
  
--- Diff: 
yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala ---
@@ -202,8 +202,8 @@ private[spark] class ApplicationMaster(
 attemptID = Option(appAttemptId.getAttemptId.toString)
   }
 
-  new CallerContext(Option(System.getProperty("spark.app.name")),
-Option(appAttemptId.getApplicationId.toString), attemptID).set()
+  new CallerContext("APPLICATION_MASTER",
--- End diff --

truncate to APPMASTER



[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS...

2016-09-21 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/14659#discussion_r79844130
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2420,29 +2420,54 @@ private[spark] object Utils extends Logging {
   }
 }
 
+/**
+ * A utility class used to set up Spark caller contexts to HDFS and Yarn. The `context` will be
+ * constructed from the parameters passed in.
+ * When Spark applications run on Yarn and HDFS, their caller contexts will be written into the
+ * Yarn RM audit log and hdfs-audit.log. That can help users better diagnose and understand how
+ * specific applications impact parts of the Hadoop system and what potential problems they may
+ * be creating (e.g. overloading the NN). As mentioned in HDFS-9184, for a given HDFS operation
+ * it is very helpful to track which upper-level job issued it.
+ *
+ * @param from who sets up the caller context (TASK, CLIENT, APPLICATION_MASTER)
+ *
+ * The parameters below are optional:
+ * @param appID id of the app this task belongs to
+ * @param appAttemptID attempt id of the app this task belongs to
+ * @param jobID id of the job this task belongs to
+ * @param stageID id of the stage this task belongs to
+ * @param stageAttemptId attempt id of the stage this task belongs to
+ * @param taskId task id
+ * @param taskAttemptNumber task attempt id
+ * @since 2.0.1
+ */
 private[spark] class CallerContext(
-   appName: Option[String] = None,
-   appID: Option[String] = None,
-   appAttemptID: Option[String] = None,
-   jobID: Option[Int] = None,
-   stageID: Option[Int] = None,
+   from: String,
+   appId: Option[String] = None,
+   appAttemptId: Option[String] = None,
+   jobId: Option[Int] = None,
+   stageId: Option[Int] = None,
stageAttemptId: Option[Int] = None,
taskId: Option[Long] = None,
taskAttemptNumber: Option[Int] = None) extends Logging {
 
-   val AppName = if (appName.isDefined) s"_AppName_${appName.get}" else ""
-   val AppID = if (appID.isDefined) s"_AppID_${appID.get}" else ""
-   val AppAttemptID = if (appAttemptID.isDefined) s"_${appAttemptID.get}" else ""
-   val JobID = if (jobID.isDefined) s"_JobID_${jobID.get}" else ""
-   val StageID = if (stageID.isDefined) s"_StageID_${stageID.get}" else ""
-   val StageAttemptId = if (stageAttemptId.isDefined) s"_${stageAttemptId.get}" else ""
+   val AppId = if (appId.isDefined) s"_AppId_${appId.get}" else ""
--- End diff --

As mentioned, please remove AppId_... the application id is pretty obvious in the logs since it starts with application_, so there is no need to print extra characters.



[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS...

2016-09-21 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/14659#discussion_r79843188
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/Task.scala ---
@@ -54,7 +54,10 @@ private[spark] abstract class Task[T](
 val partitionId: Int,
 // The default value is only used in tests.
 val metrics: TaskMetrics = TaskMetrics.registered,
-@transient var localProperties: Properties = new Properties) extends 
Serializable {
+@transient var localProperties: Properties = new Properties,
+val jobId: Option[Int] = None,
--- End diff --

Task is a private Spark API, so there are no issues with breaking the API. You don't need to deprecate and add a new one; you can just change it and fix up everywhere that calls it or extends it.
There are a handful of test classes that extend it that would need to be fixed up. Let me think about it a bit more.



[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS...

2016-09-21 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/14659#discussion_r79841783
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2420,6 +2420,44 @@ private[spark] object Utils extends Logging {
   }
 }
 
+private[spark] class CallerContext(
+   appName: Option[String] = None,
+   appID: Option[String] = None,
+   appAttemptID: Option[String] = None,
+   jobID: Option[Int] = None,
+   stageID: Option[Int] = None,
+   stageAttemptId: Option[Int] = None,
+   taskId: Option[Long] = None,
+   taskAttemptNumber: Option[Int] = None) extends Logging {
+
+   val AppName = if (appName.isDefined) s"_AppName_${appName.get}" else ""
--- End diff --

I'm not sure what you mean here about there being no information about the application. It's pretty clear that the application id in your example is application_1473908768790_0007. We don't need the extra 6 characters AppID_ to tell us that.



[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS...

2016-09-20 Thread Sherry302
Github user Sherry302 commented on a diff in the pull request:

https://github.com/apache/spark/pull/14659#discussion_r79695565
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2420,6 +2420,44 @@ private[spark] object Utils extends Logging {
   }
 }
 
+private[spark] class CallerContext(
+   appName: Option[String] = None,
+   appID: Option[String] = None,
+   appAttemptID: Option[String] = None,
+   jobID: Option[Int] = None,
+   stageID: Option[Int] = None,
+   stageAttemptId: Option[Int] = None,
+   taskId: Option[Long] = None,
+   taskAttemptNumber: Option[Int] = None) extends Logging {
+
+   val AppName = if (appName.isDefined) s"_AppName_${appName.get}" else ""
--- End diff --

Yes. Done. 



[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS...

2016-09-20 Thread Sherry302
Github user Sherry302 commented on a diff in the pull request:

https://github.com/apache/spark/pull/14659#discussion_r79695449
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2420,6 +2420,44 @@ private[spark] object Utils extends Logging {
   }
 }
 
+private[spark] class CallerContext(
+   appName: Option[String] = None,
+   appID: Option[String] = None,
+   appAttemptID: Option[String] = None,
+   jobID: Option[Int] = None,
+   stageID: Option[Int] = None,
+   stageAttemptId: Option[Int] = None,
+   taskId: Option[Long] = None,
+   taskAttemptNumber: Option[Int] = None) extends Logging {
+
+   val AppName = if (appName.isDefined) s"_AppName_${appName.get}" else ""
--- End diff --

I have updated the PR to remove appName and replace it with something to differentiate the context set by the ApplicationMaster vs. the Yarn Client vs. a Task. But for AppID, I think it is better to keep it, since in hdfs-audit.log there is no other info about the application. For example, the record below was produced when a Task did a read/write operation against HDFS; apart from `callerContext`, there is no other info about the application:
```
2016-09-14 22:29:06,526 INFO FSNamesystem.audit: allowed=true   ugi=wyang (auth:SIMPLE)   ip=/127.0.0.1   cmd=open   src=/lr_big.txt   dst=null   perm=null   proto=rpc   callerContext=SPARK_AppID_application_1473908768790_0007_JobID_0_StageID_0_0_TaskId_2_0
```




[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS...

2016-09-20 Thread Sherry302
Github user Sherry302 commented on a diff in the pull request:

https://github.com/apache/spark/pull/14659#discussion_r79693044
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/ResultTask.scala 
---
@@ -42,7 +42,10 @@ import org.apache.spark.rdd.RDD
  * input RDD's partitions).
  * @param localProperties copy of thread-local properties set by the user 
on the driver side.
  * @param metrics a [[TaskMetrics]] that is created at driver side and 
sent to executor side.
- */
+ * @param jobId id of the job this task belongs to
--- End diff --

Done.



[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS...

2016-09-20 Thread Sherry302
Github user Sherry302 commented on a diff in the pull request:

https://github.com/apache/spark/pull/14659#discussion_r79692960
  
--- Diff: 
yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala ---
@@ -196,8 +198,13 @@ private[spark] class ApplicationMaster(
  // Set this internal configuration if it is running on cluster mode, this
  // configuration will be checked in SparkContext to avoid misuse of yarn cluster mode.
  System.setProperty("spark.yarn.app.id", appAttemptId.getApplicationId().toString())
+
+attemptID = Option(appAttemptId.getAttemptId.toString)
   }
 
+  new CallerContext(Option(System.getProperty("spark.app.name")),
--- End diff --

Done.



[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS...

2016-09-20 Thread Sherry302
Github user Sherry302 commented on a diff in the pull request:

https://github.com/apache/spark/pull/14659#discussion_r79693000
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2420,6 +2420,44 @@ private[spark] object Utils extends Logging {
   }
 }
 
+private[spark] class CallerContext(
+   appName: Option[String] = None,
+   appID: Option[String] = None,
+   appAttemptID: Option[String] = None,
+   jobID: Option[Int] = None,
+   stageID: Option[Int] = None,
+   stageAttemptId: Option[Int] = None,
+   taskId: Option[Long] = None,
+   taskAttemptNumber: Option[Int] = None) extends Logging {
+
+   val AppName = if (appName.isDefined) s"_AppName_${appName.get}" else ""
+   val AppID = if (appID.isDefined) s"_AppID_${appID.get}" else ""
+   val AppAttemptID = if (appAttemptID.isDefined) s"_${appAttemptID.get}" else ""
+   val JobID = if (jobID.isDefined) s"_JobID_${jobID.get}" else ""
+   val StageID = if (stageID.isDefined) s"_StageID_${stageID.get}" else ""
+   val StageAttemptId = if (stageAttemptId.isDefined) s"_${stageAttemptId.get}" else ""
+   val TaskId = if (taskId.isDefined) s"_TaskId_${taskId.get}" else ""
+   val TaskAttemptNumber = if (taskAttemptNumber.isDefined) s"_${taskAttemptNumber.get}" else ""
+
+   val context = "SPARK" + AppName + AppID + AppAttemptID +
+ JobID + StageID + StageAttemptId + TaskId + TaskAttemptNumber
+
+  def set(): Boolean = {
--- End diff --

Done.



[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS...

2016-09-20 Thread Sherry302
Github user Sherry302 commented on a diff in the pull request:

https://github.com/apache/spark/pull/14659#discussion_r79692475
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2420,6 +2420,44 @@ private[spark] object Utils extends Logging {
   }
 }
 
+private[spark] class CallerContext(
+   appName: Option[String] = None,
+   appID: Option[String] = None,
+   appAttemptID: Option[String] = None,
+   jobID: Option[Int] = None,
+   stageID: Option[Int] = None,
+   stageAttemptId: Option[Int] = None,
+   taskId: Option[Long] = None,
+   taskAttemptNumber: Option[Int] = None) extends Logging {
+
+   val AppName = if (appName.isDefined) s"_AppName_${appName.get}" else ""
+   val AppID = if (appID.isDefined) s"_AppID_${appID.get}" else ""
+   val AppAttemptID = if (appAttemptID.isDefined) s"_${appAttemptID.get}" else ""
+   val JobID = if (jobID.isDefined) s"_JobID_${jobID.get}" else ""
+   val StageID = if (stageID.isDefined) s"_StageID_${stageID.get}" else ""
+   val StageAttemptId = if (stageAttemptId.isDefined) s"_${stageAttemptId.get}" else ""
+   val TaskId = if (taskId.isDefined) s"_TaskId_${taskId.get}" else ""
+   val TaskAttemptNumber = if (taskAttemptNumber.isDefined) s"_${taskAttemptNumber.get}" else ""
+
+   val context = "SPARK" + AppName + AppID + AppAttemptID +
+ JobID + StageID + StageAttemptId + TaskId + TaskAttemptNumber
+
+  def set(): Boolean = {
+var succeed = false
+try {
+  val callerContext = Utils.classForName("org.apache.hadoop.ipc.CallerContext")
--- End diff --

Yes. I have updated the PR.



[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS...

2016-09-20 Thread Sherry302
Github user Sherry302 commented on a diff in the pull request:

https://github.com/apache/spark/pull/14659#discussion_r79692046
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/Task.scala ---
@@ -54,7 +54,10 @@ private[spark] abstract class Task[T](
 val partitionId: Int,
 // The default value is only used in tests.
 val metrics: TaskMetrics = TaskMetrics.registered,
-@transient var localProperties: Properties = new Properties) extends 
Serializable {
+@transient var localProperties: Properties = new Properties,
+val jobId: Option[Int] = None,
--- End diff --

Making these params all optional is so as not to break the current code that uses this API. An alternative way is to mark the current API as deprecated and add a new overloaded function with the new parameters. I am going to go this way. Any suggestions?



[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS

2016-09-19 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/14659#discussion_r79457356
  
--- Diff: 
yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala ---
@@ -196,8 +198,13 @@ private[spark] class ApplicationMaster(
  // Set this internal configuration if it is running on cluster mode, this
  // configuration will be checked in SparkContext to avoid misuse of yarn cluster mode.
  System.setProperty("spark.yarn.app.id", appAttemptId.getApplicationId().toString())
+
+attemptID = Option(appAttemptId.getAttemptId.toString)
   }
 
+  new CallerContext(Option(System.getProperty("spark.app.name")),
--- End diff --

Would it be useful to add something special here to differentiate the context set from the master vs. the client? Just a static string "master" here, and "client" for the client.



[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS

2016-09-19 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/14659#discussion_r79457084
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2420,6 +2420,44 @@ private[spark] object Utils extends Logging {
   }
 }
 
+private[spark] class CallerContext(
+   appName: Option[String] = None,
+   appID: Option[String] = None,
+   appAttemptID: Option[String] = None,
+   jobID: Option[Int] = None,
+   stageID: Option[Int] = None,
+   stageAttemptId: Option[Int] = None,
+   taskId: Option[Long] = None,
+   taskAttemptNumber: Option[Int] = None) extends Logging {
+
+   val AppName = if (appName.isDefined) s"_AppName_${appName.get}" else ""
--- End diff --

Thinking about this a little more, I think we should just remove the app name altogether. For tracking purposes the app id should be enough. The app name adds a bunch of unknowns (its length, weird characters, or just spaces) which could make things in the audit log less useful.



[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS

2016-09-19 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/14659#discussion_r79421792
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2420,6 +2420,44 @@ private[spark] object Utils extends Logging {
   }
 }
 
+private[spark] class CallerContext(
+   appName: Option[String] = None,
+   appID: Option[String] = None,
+   appAttemptID: Option[String] = None,
+   jobID: Option[Int] = None,
+   stageID: Option[Int] = None,
+   stageAttemptId: Option[Int] = None,
+   taskId: Option[Long] = None,
+   taskAttemptNumber: Option[Int] = None) extends Logging {
+
+   val AppName = if (appName.isDefined) s"_AppName_${appName.get}" else ""
+   val AppID = if (appID.isDefined) s"_AppID_${appID.get}" else ""
+   val AppAttemptID = if (appAttemptID.isDefined) s"_${appAttemptID.get}" else ""
+   val JobID = if (jobID.isDefined) s"_JobID_${jobID.get}" else ""
+   val StageID = if (stageID.isDefined) s"_StageID_${stageID.get}" else ""
+   val StageAttemptId = if (stageAttemptId.isDefined) s"_${stageAttemptId.get}" else ""
+   val TaskId = if (taskId.isDefined) s"_TaskId_${taskId.get}" else ""
+   val TaskAttemptNumber = if (taskAttemptNumber.isDefined) s"_${taskAttemptNumber.get}" else ""
+
+   val context = "SPARK" + AppName + AppID + AppAttemptID +
+ JobID + StageID + StageAttemptId + TaskId + TaskAttemptNumber
+
+  def set(): Boolean = {
--- End diff --

Let's rename this to setCurrentContext().



[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS

2016-09-19 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/14659#discussion_r79419642
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/Task.scala ---
@@ -54,7 +54,10 @@ private[spark] abstract class Task[T](
 val partitionId: Int,
 // The default value is only used in tests.
 val metrics: TaskMetrics = TaskMetrics.registered,
-@transient var localProperties: Properties = new Properties) extends 
Serializable {
+@transient var localProperties: Properties = new Properties,
+val jobId: Option[Int] = None,
--- End diff --

Are these params all optional just to make it easier for different task types? The jobId and appId are, I think, mandatory now; the app attempt id is still really optional. I'm leaning towards making these not be Option, so that if someone adds a new Task type we make sure these are set up properly and thus the context is set properly.



[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS

2016-09-19 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/14659#discussion_r79419042
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/ResultTask.scala 
---
@@ -42,7 +42,10 @@ import org.apache.spark.rdd.RDD
  * input RDD's partitions).
  * @param localProperties copy of thread-local properties set by the user 
on the driver side.
  * @param metrics a [[TaskMetrics]] that is created at driver side and 
sent to executor side.
- */
+ * @param jobId id of the job this task belongs to
--- End diff --

The description should mention these are optional.




[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS

2016-09-19 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/14659#discussion_r79415724
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2420,6 +2420,44 @@ private[spark] object Utils extends Logging {
   }
 }
 
+private[spark] class CallerContext(
+   appName: Option[String] = None,
+   appID: Option[String] = None,
+   appAttemptID: Option[String] = None,
+   jobID: Option[Int] = None,
+   stageID: Option[Int] = None,
+   stageAttemptId: Option[Int] = None,
+   taskId: Option[Long] = None,
+   taskAttemptNumber: Option[Int] = None) extends Logging {
+
+   val AppName = if (appName.isDefined) s"_AppName_${appName.get}" else ""
--- End diff --

For the appName and the AppId, it feels like these should be obvious from the string, so I'm not sure we really need the extra _AppName_ and _AppID_ tags there. Are there cases where you think that is not true?



[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS

2016-09-19 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/14659#discussion_r79408637
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2420,6 +2420,44 @@ private[spark] object Utils extends Logging {
   }
 }
 
+private[spark] class CallerContext(
+   appName: Option[String] = None,
+   appID: Option[String] = None,
+   appAttemptID: Option[String] = None,
+   jobID: Option[Int] = None,
+   stageID: Option[Int] = None,
+   stageAttemptId: Option[Int] = None,
+   taskId: Option[Long] = None,
+   taskAttemptNumber: Option[Int] = None) extends Logging {
+
+   val AppName = if (appName.isDefined) s"_AppName_${appName.get}" else ""
+   val AppID = if (appID.isDefined) s"_AppID_${appID.get}" else ""
+   val AppAttemptID = if (appAttemptID.isDefined) s"_${appAttemptID.get}" else ""
+   val JobID = if (jobID.isDefined) s"_JobID_${jobID.get}" else ""
+   val StageID = if (stageID.isDefined) s"_StageID_${stageID.get}" else ""
+   val StageAttemptId = if (stageAttemptId.isDefined) s"_${stageAttemptId.get}" else ""
+   val TaskId = if (taskId.isDefined) s"_TaskId_${taskId.get}" else ""
+   val TaskAttemptNumber = if (taskAttemptNumber.isDefined) s"_${taskAttemptNumber.get}" else ""
+
+   val context = "SPARK" + AppName + AppID + AppAttemptID +
+ JobID + StageID + StageAttemptId + TaskId + TaskAttemptNumber
+
+  def set(): Boolean = {
+var succeed = false
+try {
+  val callerContext = Utils.classForName("org.apache.hadoop.ipc.CallerContext")
--- End diff --

Actually, perhaps add a description to this entire class that explains what this is, what it applies to, and when it was added.



[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS

2016-09-19 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/14659#discussion_r79403168
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2420,6 +2420,44 @@ private[spark] object Utils extends Logging {
   }
 }
 
+private[spark] class CallerContext(
+   appName: Option[String] = None,
+   appID: Option[String] = None,
+   appAttemptID: Option[String] = None,
+   jobID: Option[Int] = None,
+   stageID: Option[Int] = None,
+   stageAttemptId: Option[Int] = None,
+   taskId: Option[Long] = None,
+   taskAttemptNumber: Option[Int] = None) extends Logging {
+
+   val AppName = if (appName.isDefined) s"_AppName_${appName.get}" else ""
+   val AppID = if (appID.isDefined) s"_AppID_${appID.get}" else ""
+   val AppAttemptID = if (appAttemptID.isDefined) s"_${appAttemptID.get}" else ""
+   val JobID = if (jobID.isDefined) s"_JobID_${jobID.get}" else ""
+   val StageID = if (stageID.isDefined) s"_StageID_${stageID.get}" else ""
+   val StageAttemptId = if (stageAttemptId.isDefined) s"_${stageAttemptId.get}" else ""
+   val TaskId = if (taskId.isDefined) s"_TaskId_${taskId.get}" else ""
+   val TaskAttemptNumber = if (taskAttemptNumber.isDefined) s"_${taskAttemptNumber.get}" else ""
+
+   val context = "SPARK" + AppName + AppID + AppAttemptID +
+ JobID + StageID + StageAttemptId + TaskId + TaskAttemptNumber
+
+  def set(): Boolean = {
+var succeed = false
+try {
+  val callerContext = Utils.classForName("org.apache.hadoop.ipc.CallerContext")
--- End diff --

Can you add a comment that this API was added in Hadoop 2.8?



[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS

2016-09-15 Thread Sherry302
Github user Sherry302 commented on a diff in the pull request:

https://github.com/apache/spark/pull/14659#discussion_r78898595
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2418,6 +2418,21 @@ private[spark] object Utils extends Logging {
   sparkJars.map(_.split(",")).map(_.filter(_.nonEmpty)).toSeq.flatten
 }
   }
+
+  def setCallerContext(context: String): Boolean = {
+var succeed = false
+try {
+  val Builder = Utils.classForName("org.apache.hadoop.ipc.CallerContext$Builder")
+  val builderInst = Builder.getConstructor(classOf[String]).newInstance(context)
+  val ret = Builder.getMethod("build").invoke(builderInst)
--- End diff --

Yes. hdfsContext is more readable.



[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS

2016-09-15 Thread Sherry302
Github user Sherry302 commented on a diff in the pull request:

https://github.com/apache/spark/pull/14659#discussion_r78898257
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2418,6 +2418,21 @@ private[spark] object Utils extends Logging {
   sparkJars.map(_.split(",")).map(_.filter(_.nonEmpty)).toSeq.flatten
 }
   }
+
+  def setCallerContext(context: String): Boolean = {
+var succeed = false
+try {
+  val Builder = Utils.classForName("org.apache.hadoop.ipc.CallerContext$Builder")
+  val builderInst = Builder.getConstructor(classOf[String]).newInstance(context)
+  val ret = Builder.getMethod("build").invoke(builderInst)
+  val callerContext = Utils.classForName("org.apache.hadoop.ipc.CallerContext")
--- End diff --

If `val callerContext = Utils.classForName("org.apache.hadoop.ipc.CallerContext")` is moved out of the `try` block, Spark will throw an exception when it runs on Hadoop before 2.8.0. Also, moving that line to the beginning of the `try` block does not make any difference, since `Utils.classForName("org.apache.hadoop.ipc.CallerContext$Builder")` also needs to check whether `org.apache.hadoop.ipc.CallerContext` exists. I am not sure I got your point; could you please give more info about it? Thanks a lot.
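
For reference, a minimal sketch of the reflection-based approach under discussion. It assumes it lives next to the Utils object quoted above (so `Utils.classForName` and `logInfo` are available); `org.apache.hadoop.ipc.CallerContext` only exists in Hadoop 2.8.0+, which is why every class lookup stays inside the `try`:

```scala
import scala.util.control.NonFatal

// Sketch only; mirrors the reflection pattern in the diff quoted above.
def setCurrentContext(context: String): Boolean = {
  var succeed = false
  try {
    // Both lookups throw ClassNotFoundException on Hadoop < 2.8.0.
    val callerContext = Utils.classForName("org.apache.hadoop.ipc.CallerContext")
    val builder = Utils.classForName("org.apache.hadoop.ipc.CallerContext$Builder")
    val builderInst = builder.getConstructor(classOf[String]).newInstance(context)
    val hdfsContext = builder.getMethod("build").invoke(builderInst)
    // CallerContext.setCurrent(CallerContext) is static, hence invoke(null, ...).
    callerContext.getMethod("setCurrent", callerContext).invoke(null, hdfsContext)
    succeed = true
  } catch {
    case NonFatal(e) => logInfo("Fail to set Spark caller context", e)
  }
  succeed
}
```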



[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS

2016-09-15 Thread Sherry302
Github user Sherry302 commented on a diff in the pull request:

https://github.com/apache/spark/pull/14659#discussion_r78897068
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/Task.scala ---
@@ -79,6 +82,13 @@ private[spark] abstract class Task[T](
   metrics)
 TaskContext.setTaskContext(context)
 taskThread = Thread.currentThread()
+
+val callerContext =
+  s"Spark_AppId_${appId.getOrElse("")}_AppAttemptId_${appAttemptId.getOrElse("None")}" +
+  s"_JobId_${jobId.getOrElse("0")}_StageID_${stageId}_stageAttemptId_${stageAttemptId}" +
+  s"_taskID_${taskAttemptId}_attemptNumber_${attemptNumber}"
+Utils.setCallerContext(callerContext)
--- End diff --

Yes. Good catch!



[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS

2016-09-15 Thread Sherry302
Github user Sherry302 commented on a diff in the pull request:

https://github.com/apache/spark/pull/14659#discussion_r78896928
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/Task.scala ---
@@ -79,6 +82,13 @@ private[spark] abstract class Task[T](
   metrics)
 TaskContext.setTaskContext(context)
 taskThread = Thread.currentThread()
+
+val callerContext =
+  s"Spark_AppId_${appId.getOrElse("")}_AppAttemptId_${appAttemptId.getOrElse("None")}" +
+  s"_JobId_${jobId.getOrElse("0")}_StageID_${stageId}_stageAttemptId_${stageAttemptId}" +
+  s"_taskID_${taskAttemptId}_attemptNumber_${attemptNumber}"
--- End diff --

I have updated the PR to make the string shorter.



[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS

2016-09-15 Thread Sherry302
Github user Sherry302 commented on a diff in the pull request:

https://github.com/apache/spark/pull/14659#discussion_r78896972
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/Task.scala ---
@@ -79,6 +82,13 @@ private[spark] abstract class Task[T](
   metrics)
 TaskContext.setTaskContext(context)
 taskThread = Thread.currentThread()
+
+val callerContext =
+  s"Spark_AppId_${appId.getOrElse("")}_AppAttemptId_${appAttemptId.getOrElse("None")}" +
--- End diff --

Yes. 



[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS

2016-09-15 Thread Sherry302
Github user Sherry302 commented on a diff in the pull request:

https://github.com/apache/spark/pull/14659#discussion_r78896863
  
--- Diff: 
yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala ---
@@ -184,6 +184,9 @@ private[spark] class ApplicationMaster(
 try {
   val appAttemptId = client.getAttemptId()
 
+  var context = s"Spark_AppName_${System.getProperty("spark.app.name")}" +
--- End diff --

A CallerContext class has been added.



[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS

2016-09-15 Thread Sherry302
Github user Sherry302 commented on a diff in the pull request:

https://github.com/apache/spark/pull/14659#discussion_r78896816
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2418,6 +2418,21 @@ private[spark] object Utils extends Logging {
   sparkJars.map(_.split(",")).map(_.filter(_.nonEmpty)).toSeq.flatten
 }
   }
+
+  def setCallerContext(context: String): Boolean = {
+var succeed = false
+try {
+  val Builder = Utils.classForName("org.apache.hadoop.ipc.CallerContext$Builder")
+  val builderInst = Builder.getConstructor(classOf[String]).newInstance(context)
+  val ret = Builder.getMethod("build").invoke(builderInst)
+  val callerContext = Utils.classForName("org.apache.hadoop.ipc.CallerContext")
+  callerContext.getMethod("setCurrent", callerContext).invoke(null, ret)
+  succeed = true
+} catch {
+  case NonFatal(e) => logDebug(s"$e", e)
--- End diff --

I have updated this to "case NonFatal(e) => logInfo("Fail to set Spark 
caller context", e)"





[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS

2016-09-15 Thread Sherry302
Github user Sherry302 commented on a diff in the pull request:

https://github.com/apache/spark/pull/14659#discussion_r78896707
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/Task.scala ---
@@ -54,7 +54,10 @@ private[spark] abstract class Task[T](
 val partitionId: Int,
 // The default value is only used in tests.
 val metrics: TaskMetrics = TaskMetrics.registered,
-@transient var localProperties: Properties = new Properties) extends 
Serializable {
+@transient var localProperties: Properties = new Properties,
+val jobId: Option[Int] = None,
--- End diff --

Done.





[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS

2016-09-15 Thread Sherry302
Github user Sherry302 commented on a diff in the pull request:

https://github.com/apache/spark/pull/14659#discussion_r78896692
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala ---
@@ -51,8 +51,12 @@ private[spark] class ShuffleMapTask(
 partition: Partition,
 @transient private var locs: Seq[TaskLocation],
 metrics: TaskMetrics,
-localProperties: Properties)
-  extends Task[MapStatus](stageId, stageAttemptId, partition.index, 
metrics, localProperties)
+localProperties: Properties,
--- End diff --

Done.





[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS

2016-09-15 Thread Sherry302
Github user Sherry302 commented on a diff in the pull request:

https://github.com/apache/spark/pull/14659#discussion_r78896681
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/ResultTask.scala ---
@@ -51,8 +51,12 @@ private[spark] class ResultTask[T, U](
 locs: Seq[TaskLocation],
 val outputId: Int,
 localProperties: Properties,
-metrics: TaskMetrics)
-  extends Task[U](stageId, stageAttemptId, partition.index, metrics, 
localProperties)
+metrics: TaskMetrics,
+jobId: Option[Int] = None,
--- End diff --

Done.





[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS

2016-09-14 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/14659#discussion_r78783274
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/Task.scala ---
@@ -79,6 +82,13 @@ private[spark] abstract class Task[T](
   metrics)
 TaskContext.setTaskContext(context)
 taskThread = Thread.currentThread()
+
+val callerContext =
+  s"Spark_AppId_${appId.getOrElse("")}_AppAttemptId_${appAttemptId.getOrElse("None")}" +
+  s"_JobId_${jobId.getOrElse("0")}_StageID_${stageId}_stageAttemptId_${stageAttemptId}" +
+  s"_taskID_${taskAttemptId}_attemptNumber_${attemptNumber}"
--- End diff --

One concern I have about this is the length of the string. This string is going to be sent on every Hadoop RPC call and then, I believe, ends up in the HDFS audit log. The audit log does have a config to truncate it (default 128 characters), but the full string still gets sent over RPC, so I would like to keep it as small as possible while still being useful. The one above seems very long to me, since the static text alone is roughly 90 characters, so I would like to see some of those static strings removed or abbreviated.
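
For illustration only, a trimmed format along these lines keeps every field while shedding most of the static labels (the abbreviations are hypothetical, not the ones the PR finally adopted):

    val callerContext =
      s"SPARK_${appId.getOrElse("")}_${appAttemptId.getOrElse("")}" +
      s"_JId_${jobId.getOrElse(0)}_SId_${stageId}_${stageAttemptId}" +
      s"_TId_${taskAttemptId}_${attemptNumber}"

With the labels abbreviated, the static portion drops to roughly a dozen characters, comfortably under the default 128-character truncation limit.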





[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS

2016-09-14 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/14659#discussion_r78773807
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/Task.scala ---
@@ -79,6 +82,13 @@ private[spark] abstract class Task[T](
   metrics)
 TaskContext.setTaskContext(context)
 taskThread = Thread.currentThread()
+
+val callerContext =
+  s"Spark_AppId_${appId.getOrElse("")}_AppAttemptId_${appAttemptId.getOrElse("None")}" +
--- End diff --

Can you change "Spark" to all caps, "SPARK"?





[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS

2016-09-14 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/14659#discussion_r78769775
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/Task.scala ---
@@ -79,6 +82,13 @@ private[spark] abstract class Task[T](
   metrics)
 TaskContext.setTaskContext(context)
 taskThread = Thread.currentThread()
+
+val callerContext =
+  s"Spark_AppId_${appId.getOrElse("")}_AppAttemptId_${appAttemptId.getOrElse("None")}" +
+  s"_JobId_${jobId.getOrElse("0")}_StageID_${stageId}_stageAttemptId_${stageAttemptId}" +
+  s"_taskID_${taskAttemptId}_attemptNumber_${attemptNumber}"
+Utils.setCallerContext(callerContext)
--- End diff --

We should move this below the if (_killed) check, as there is no reason to log this if the task doesn't run.





[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS

2016-09-14 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/14659#discussion_r78768851
  
--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala ---
@@ -184,6 +184,9 @@ private[spark] class ApplicationMaster(
 try {
   val appAttemptId = client.getAttemptId()
 
+  var context = s"Spark_AppName_${System.getProperty("spark.app.name")}" +
--- End diff --

It would be nice to have a class, or at least a util function, to construct the context string so that we don't have the same string in multiple places.
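
As a minimal sketch of that idea (the names here are hypothetical; the PR ultimately added a dedicated CallerContext class instead), such a helper could keep the format in one place and append only the fields that are present:

    object CallerContextUtil {
      // Joins a fixed prefix with whichever optional fields are defined, in order.
      def build(prefix: String, parts: Option[Any]*): String =
        prefix + parts.flatten.map(p => s"_$p").mkString
    }

    // e.g. CallerContextUtil.build("SPARK_TASK", appId, appAttemptId, jobId, Some(stageId))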





[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS

2016-09-14 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/14659#discussion_r78767924
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2418,6 +2418,21 @@ private[spark] object Utils extends Logging {
   sparkJars.map(_.split(",")).map(_.filter(_.nonEmpty)).toSeq.flatten
 }
   }
+
+  def setCallerContext(context: String): Boolean = {
+var succeed = false
+try {
+  val Builder = Utils.classForName("org.apache.hadoop.ipc.CallerContext$Builder")
+  val builderInst = Builder.getConstructor(classOf[String]).newInstance(context)
+  val ret = Builder.getMethod("build").invoke(builderInst)
+  val callerContext = Utils.classForName("org.apache.hadoop.ipc.CallerContext")
+  callerContext.getMethod("setCurrent", callerContext).invoke(null, ret)
+  succeed = true
+} catch {
+  case NonFatal(e) => logDebug(s"$e", e)
--- End diff --

Why are you printing the exception twice here? Make the actual log message something like "failed to set CallerContext" and leave the exception to be printed by the trailing ", e".





[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS

2016-09-14 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/14659#discussion_r78767856
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2418,6 +2418,21 @@ private[spark] object Utils extends Logging {
   sparkJars.map(_.split(",")).map(_.filter(_.nonEmpty)).toSeq.flatten
 }
   }
+
+  def setCallerContext(context: String): Boolean = {
+var succeed = false
+try {
+  val Builder = Utils.classForName("org.apache.hadoop.ipc.CallerContext$Builder")
+  val builderInst = Builder.getConstructor(classOf[String]).newInstance(context)
+  val ret = Builder.getMethod("build").invoke(builderInst)
--- End diff --

Perhaps name it something other than ret, e.g. hdfsContext.





[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS

2016-09-14 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/14659#discussion_r78767418
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2418,6 +2418,21 @@ private[spark] object Utils extends Logging {
   sparkJars.map(_.split(",")).map(_.filter(_.nonEmpty)).toSeq.flatten
 }
   }
+
+  def setCallerContext(context: String): Boolean = {
+var succeed = false
+try {
+  val Builder = Utils.classForName("org.apache.hadoop.ipc.CallerContext$Builder")
+  val builderInst = Builder.getConstructor(classOf[String]).newInstance(context)
+  val ret = Builder.getMethod("build").invoke(builderInst)
+  val callerContext = Utils.classForName("org.apache.hadoop.ipc.CallerContext")
--- End diff --

It might be nice to do this lookup first, outside of this try block; if the class isn't found we just skip the rest of this and assume it's a Hadoop version before 2.8.0.
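
A sketch of that suggested structure (not the code the PR ended up with): probe for the class once, and only attempt the reflective setup when it is present, i.e. on Hadoop 2.8.0 or later.

    // Returns true only when org.apache.hadoop.ipc.CallerContext is on the classpath.
    def callerContextSupported(): Boolean =
      try {
        Utils.classForName("org.apache.hadoop.ipc.CallerContext")
        true
      } catch {
        case _: ClassNotFoundException => false
      }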





[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS

2016-09-14 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/14659#discussion_r78758857
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/Task.scala ---
@@ -54,7 +54,10 @@ private[spark] abstract class Task[T](
 val partitionId: Int,
 // The default value is only used in tests.
 val metrics: TaskMetrics = TaskMetrics.registered,
-@transient var localProperties: Properties = new Properties) extends 
Serializable {
+@transient var localProperties: Properties = new Properties,
+val jobId: Option[Int] = None,
--- End diff --

update params and descriptions





[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS

2016-09-14 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/14659#discussion_r78758789
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala ---
@@ -51,8 +51,12 @@ private[spark] class ShuffleMapTask(
 partition: Partition,
 @transient private var locs: Seq[TaskLocation],
 metrics: TaskMetrics,
-localProperties: Properties)
-  extends Task[MapStatus](stageId, stageAttemptId, partition.index, 
metrics, localProperties)
+localProperties: Properties,
--- End diff --

Update the descriptions and params in the javadoc above.





[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS

2016-09-14 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/14659#discussion_r78758711
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/ResultTask.scala ---
@@ -51,8 +51,12 @@ private[spark] class ResultTask[T, U](
 locs: Seq[TaskLocation],
 val outputId: Int,
 localProperties: Properties,
-metrics: TaskMetrics)
-  extends Task[U](stageId, stageAttemptId, partition.index, metrics, 
localProperties)
+metrics: TaskMetrics,
+jobId: Option[Int] = None,
--- End diff --

You need to update the description of this class to include the new params and their descriptions.





[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS

2016-08-31 Thread steveloughran
Github user steveloughran commented on a diff in the pull request:

https://github.com/apache/spark/pull/14659#discussion_r76952473
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2418,6 +2418,18 @@ private[spark] object Utils extends Logging {
   sparkJars.map(_.split(",")).map(_.filter(_.nonEmpty)).toSeq.flatten
 }
   }
+
+  def setCallerContext(context: String): Unit = {
--- End diff --

If this is set to return a Boolean on success, you could have a unit test which asserts that, when called on a version of Hadoop with the resource 
`/org/apache/hadoop/ipc/CallerContext.class`, it returns true. That way you could assert that "on a version of Hadoop with a caller context, this code sets it". That will help catch any regressions in the Hadoop code breaking this reflection stuff.
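
A rough sketch of such a test, inside a SparkFunSuite (the name and context string are illustrative, not the test the PR added):

    test("Utils.setCallerContext returns true when Hadoop provides CallerContext") {
      // Only meaningful when the Hadoop on the test classpath is 2.8.0+.
      assume(getClass.getResource("/org/apache/hadoop/ipc/CallerContext.class") != null)
      assert(Utils.setCallerContext("TEST_SPARK_CALLER_CONTEXT"))
    }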





[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS

2016-08-31 Thread steveloughran
Github user steveloughran commented on a diff in the pull request:

https://github.com/apache/spark/pull/14659#discussion_r76951863
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2418,6 +2418,18 @@ private[spark] object Utils extends Logging {
   sparkJars.map(_.split(",")).map(_.filter(_.nonEmpty)).toSeq.flatten
 }
   }
+
+  def setCallerContext(context: String): Unit = {
+try {
+  val Builder = Utils.classForName("org.apache.hadoop.ipc.CallerContext$Builder")
+  val builderInst = Builder.getConstructor(classOf[String]).newInstance(context)
+  val ret = Builder.getMethod("build").invoke(builderInst)
+  val callerContext = Utils.classForName("org.apache.hadoop.ipc.CallerContext")
+  callerContext.getMethod("setCurrent", callerContext).invoke(null, ret)
+} catch {
+  case NonFatal(e) => logDebug(s"${e.getMessage}")
--- End diff --

Better to use `logDebug(s"$e", e)`. Some exceptions (e.g. `NullPointerException`) return null from `getMessage()`; you may also want the stack trace.
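
A tiny illustration of the difference, assuming the enclosing class mixes in Spark's Logging trait:

    val e = new NullPointerException()
    e.getMessage          // null, so logDebug(s"${e.getMessage}") logs nothing useful
    s"$e"                 // "java.lang.NullPointerException", still identifies the failure
    logDebug(s"$e", e)    // class name plus the full stack trace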





[GitHub] spark pull request #14659: [SPARK-16757] Set up Spark caller context to HDFS

2016-08-15 Thread Sherry302
GitHub user Sherry302 opened a pull request:

https://github.com/apache/spark/pull/14659

[SPARK-16757] Set up Spark caller context to HDFS

## What changes were proposed in this pull request?

1. Pass `jobId` to Task.
2. Invoke Hadoop APIs. 

A new function `setCallerContext` is added in `Utils`. The `setCallerContext` function invokes APIs of `org.apache.hadoop.ipc.CallerContext` to set up Spark caller contexts, which will be written into `hdfs-audit.log`.
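
For reference, the reflective calls correspond roughly to this direct usage of the Hadoop 2.8.0+ API (a sketch; the PR goes through reflection, presumably so Spark still compiles against older Hadoop versions):

    import org.apache.hadoop.ipc.CallerContext

    // What setCallerContext does reflectively, written against the real API.
    CallerContext.setCurrent(new CallerContext.Builder("SparkKMeans running on Spark").build())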

For applications in Yarn client mode, `org.apache.hadoop.ipc.CallerContext` is called in `Task` and the Yarn `Client`. For applications in Yarn cluster mode, 
`org.apache.hadoop.ipc.CallerContext` is called in `Task` and the `ApplicationMaster`.

The Spark caller contexts written into `hdfs-audit.log` are the application's name (`{spark.app.name}`) and `JobID_stageID_stageAttemptId_taskID_attemptNumber`.

## How was this patch tested?
Manual tests against some Spark applications in Yarn client mode and Yarn cluster mode; the check is whether the Spark caller contexts are written into the HDFS `hdfs-audit.log` successfully.

For example, run SparkKmeans in Yarn client mode: 
`./bin/spark-submit  --master yarn --deploy-mode client --class 
org.apache.spark.examples.SparkKMeans 
examples/target/original-spark-examples_2.11-2.1.0-SNAPSHOT.jar 
hdfs://localhost:9000/lr_big.txt 2 5`

Before:
There will be no Spark caller context in records of `hdfs-audit.log`.

After:
Spark caller contexts will be in records of `hdfs-audit.log`.
(_Note: the Spark caller context below appears because the Hadoop caller context API was invoked in the Yarn `Client`_)
`2016-07-21 13:52:30,802 INFO FSNamesystem.audit: allowed=true ugi=wyang (auth:SIMPLE) ip=/127.0.0.1 cmd=getfileinfo src=/lr_big.txt dst=null perm=null proto=rpc callerContext=SparkKMeans running on Spark`
(_Note: the Spark caller context below appears because the Hadoop caller context API was invoked in `Task`_)
`2016-07-21 13:52:35,584 INFO FSNamesystem.audit: allowed=true ugi=wyang (auth:SIMPLE) ip=/127.0.0.1 cmd=open src=/lr_big.txt dst=null perm=null proto=rpc callerContext=JobId_0_StageID_0_stageAttemptId_0_taskID_0_attemptNumber_0`

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/Sherry302/spark callercontextSubmit

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14659.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14659


commit ec6833d32ef14950b2d81790bc908992f6288815
Author: Weiqing Yang 
Date:   2016-08-16T04:11:41Z

[SPARK-16757] Set up Spark caller context to HDFS



