[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-27 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/2087#discussion_r19419188
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -224,18 +223,18 @@ class HadoopRDD[K, V](
   val key: K = reader.createKey()
   val value: V = reader.createValue()
 
-  // Set the task input metrics.
   val inputMetrics = new InputMetrics(DataReadMethod.Hadoop)
-  try {
-/* bytesRead may not exactly equal the bytes read by a task: split 
boundaries aren't
- * always at record boundaries, so tasks may need to read into 
other splits to complete
- * a record. */
-inputMetrics.bytesRead = split.inputSplit.value.getLength()
-  } catch {
-case e: java.io.IOException =
-  logWarning(Unable to get input size to set InputMetrics for 
task, e)
+  // Find a function that will return the FileSystem bytes read by 
this thread.
+  val bytesReadCallback = if 
(split.inputSplit.value.isInstanceOf[FileSplit]) {
+SparkHadoopUtil.get.getFSBytesReadOnThreadCallback(
+  split.inputSplit.value.asInstanceOf[FileSplit].getPath, jobConf)
+  } else {
+None
+  }
+  if (bytesReadCallback.isDefined) {
+context.taskMetrics.inputMetrics = Some(inputMetrics)
   }
-  context.taskMetrics.inputMetrics = Some(inputMetrics)
--- End diff --

Gotcha - maybe we can add a comment about this later.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-27 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/2087


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-26 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/2087#discussion_r19382961
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -224,18 +223,18 @@ class HadoopRDD[K, V](
   val key: K = reader.createKey()
   val value: V = reader.createValue()
 
-  // Set the task input metrics.
   val inputMetrics = new InputMetrics(DataReadMethod.Hadoop)
-  try {
-/* bytesRead may not exactly equal the bytes read by a task: split 
boundaries aren't
- * always at record boundaries, so tasks may need to read into 
other splits to complete
- * a record. */
-inputMetrics.bytesRead = split.inputSplit.value.getLength()
-  } catch {
-case e: java.io.IOException =
-  logWarning(Unable to get input size to set InputMetrics for 
task, e)
+  // Find a function that will return the FileSystem bytes read by 
this thread.
+  val bytesReadCallback = if 
(split.inputSplit.value.isInstanceOf[FileSplit]) {
+SparkHadoopUtil.get.getFSBytesReadOnThreadCallback(
+  split.inputSplit.value.asInstanceOf[FileSplit].getPath, jobConf)
+  } else {
+None
+  }
+  if (bytesReadCallback.isDefined) {
+context.taskMetrics.inputMetrics = Some(inputMetrics)
   }
-  context.taskMetrics.inputMetrics = Some(inputMetrics)
--- End diff --

Is there a reason why we can't just set the input metrics (this line) here 
in both cases? In the case where we can't get the read callback, we just do 
this exact thing further down. Is there a reason why you can't just make the 
assignment here and just update the `bytesRead` below?

In fact, would it be possible to just do this at the beginning of the 
function:
```
context.taskMetrics.inputMetrics = new InputMetrics(DataReadMethod.Hadoop)
```
and not have a `val inputMetrics` here? Then we could just always access 
`context.taskMetrics` in the code and we'd just have one code path for 
accessing this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-26 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-60528666
  
Hey @sryza this looks good - I tested it locally and it worked. I stumbled 
a bit with the test because I was using coalesce() and these metrics don't work 
well with coalesced RDD's right now (which I forgot).

I just have one small question about slightly simplifying access to the 
metrics variable. Let me know what you think... if it's possible to clean it 
quickly it would be nice to do it and then merge this. If that can't be 
simplified for some reason then let me know and we can just merge this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-26 Thread sryza
Github user sryza commented on a diff in the pull request:

https://github.com/apache/spark/pull/2087#discussion_r19387877
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -224,18 +223,18 @@ class HadoopRDD[K, V](
   val key: K = reader.createKey()
   val value: V = reader.createValue()
 
-  // Set the task input metrics.
   val inputMetrics = new InputMetrics(DataReadMethod.Hadoop)
-  try {
-/* bytesRead may not exactly equal the bytes read by a task: split 
boundaries aren't
- * always at record boundaries, so tasks may need to read into 
other splits to complete
- * a record. */
-inputMetrics.bytesRead = split.inputSplit.value.getLength()
-  } catch {
-case e: java.io.IOException =
-  logWarning(Unable to get input size to set InputMetrics for 
task, e)
+  // Find a function that will return the FileSystem bytes read by 
this thread.
+  val bytesReadCallback = if 
(split.inputSplit.value.isInstanceOf[FileSplit]) {
+SparkHadoopUtil.get.getFSBytesReadOnThreadCallback(
+  split.inputSplit.value.asInstanceOf[FileSplit].getPath, jobConf)
+  } else {
+None
+  }
+  if (bytesReadCallback.isDefined) {
+context.taskMetrics.inputMetrics = Some(inputMetrics)
   }
-  context.taskMetrics.inputMetrics = Some(inputMetrics)
--- End diff --

My thinking was that, if we aren't able to get the in-progress metrics 
callback, we don't want to set the InputMetrics until the end of the task.  If 
we were to set the input metrics at the beginning of the task, in-progress 
metrics would always show its bytes read as 0, instead of not at all, which is 
confusing.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-24 Thread sryza
Github user sryza commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-60454724
  
Anything else needed here?  Sorry to keep pestering - I have an output 
metrics patch that depends on this that I'm eager to post.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-24 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-60458950
  
Ah sorry about that - I'm out until tomorrow morning but I can look then. I 
just wanted to test this locally with a few hadoop versions to check it, this 
looks good. In the mean time feel free to send up the other patch so folks can 
review... and you can rebase that once this goes in.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-60198186
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22061/consoleFull)
 for   PR 2087 at commit 
[`23010b8`](https://github.com/apache/spark/commit/23010b850b28fccd9b33b0352c4bc2cb5f5dd45c).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-60198189
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22061/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-22 Thread sryza
Github user sryza commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-60193215
  
@pwendell any further comments on this?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-22 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/2087#discussion_r19259854
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -244,12 +243,35 @@ class HadoopRDD[K, V](
   case eof: EOFException =
 finished = true
 }
+
+// Update bytes read metric every few records
+if (recordsSinceMetricsUpdate == 
HadoopRDD.RECORDS_BETWEEN_BYTES_READ_METRIC_UPDATES
+ bytesReadCallback.isDefined) {
+  recordsSinceMetricsUpdate = 0
+  val bytesReadFn = bytesReadCallback.get
+  inputMetrics.bytesRead = bytesReadFn()
+} else {
+  recordsSinceMetricsUpdate += 1
+}
 (key, value)
   }
 
   override def close() {
 try {
   reader.close()
+  if (bytesReadCallback.isDefined) {
+inputMetrics.bytesRead = bytesReadCallback.get()
--- End diff --

can you change this invocation to use the correct style as well?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-22 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/2087#discussion_r19259865
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala ---
@@ -147,12 +150,37 @@ class NewHadoopRDD[K, V](
   throw new java.util.NoSuchElementException(End of stream)
 }
 havePair = false
+
+// Update bytes read metric every few records
+if (recordsSinceMetricsUpdate == 
HadoopRDD.RECORDS_BETWEEN_BYTES_READ_METRIC_UPDATES
+ bytesReadCallback.isDefined) {
+  recordsSinceMetricsUpdate = 0
+  inputMetrics.bytesRead = bytesReadCallback.get()
+} else {
+  recordsSinceMetricsUpdate += 1
+}
+
 (reader.getCurrentKey, reader.getCurrentValue)
   }
 
   private def close() {
 try {
   reader.close()
+
+  // Update metrics with final amount
+  if (bytesReadCallback.isDefined) {
+inputMetrics.bytesRead = bytesReadCallback.get()
--- End diff --

here too


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-22 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/2087#discussion_r19259861
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala ---
@@ -147,12 +150,37 @@ class NewHadoopRDD[K, V](
   throw new java.util.NoSuchElementException(End of stream)
 }
 havePair = false
+
+// Update bytes read metric every few records
+if (recordsSinceMetricsUpdate == 
HadoopRDD.RECORDS_BETWEEN_BYTES_READ_METRIC_UPDATES
+ bytesReadCallback.isDefined) {
+  recordsSinceMetricsUpdate = 0
+  inputMetrics.bytesRead = bytesReadCallback.get()
--- End diff --

here too


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-22 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-60193547
  
Just had some minor style comments - there were four cases which used the 
confusing invocation style but you only changed one of them.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-22 Thread sryza
Github user sryza commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-60193683
  
Oops, sorry about that.  Posted a new patch.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-60193928
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22061/consoleFull)
 for   PR 2087 at commit 
[`23010b8`](https://github.com/apache/spark/commit/23010b850b28fccd9b33b0352c4bc2cb5f5dd45c).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-21 Thread sryza
Github user sryza commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-59885695
  
Cool, updated patch addresses comments.  It look like the failure is caused 
by a failure to fetch from git.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-21 Thread sryza
Github user sryza commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-59885707
  
Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-59886278
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21976/consoleFull)
 for   PR 2087 at commit 
[`1ab662d`](https://github.com/apache/spark/commit/1ab662d8ae674407bfe0f8bbc14aedf1da60c030).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-59891135
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21971/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-59891129
  
**[Tests timed 
out](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21971/consoleFull)**
 for PR 2087 at commit 
[`1ab662d`](https://github.com/apache/spark/commit/1ab662d8ae674407bfe0f8bbc14aedf1da60c030)
 after a configured wait of `120m`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-21 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-59891774
  
Jenkins, retest this pleas.e


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-59892045
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21976/consoleFull)
 for   PR 2087 at commit 
[`1ab662d`](https://github.com/apache/spark/commit/1ab662d8ae674407bfe0f8bbc14aedf1da60c030).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-59892050
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21976/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-21 Thread sryza
Github user sryza commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-59959228
  
Small change to make a method I added private


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-59959931
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21995/consoleFull)
 for   PR 2087 at commit 
[`74fc9bb`](https://github.com/apache/spark/commit/74fc9bb31081453f701f15d090ec6f0f988a9f2f).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-59970626
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21995/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-59970613
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21995/consoleFull)
 for   PR 2087 at commit 
[`74fc9bb`](https://github.com/apache/spark/commit/74fc9bb31081453f701f15d090ec6f0f988a9f2f).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-20 Thread sryza
Github user sryza commented on a diff in the pull request:

https://github.com/apache/spark/pull/2087#discussion_r19113109
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala 
---
@@ -121,6 +125,31 @@ class SparkHadoopUtil extends Logging {
 UserGroupInformation.loginUserFromKeytab(principalName, keytabFilename)
   }
 
+  /**
+   * Returns a function that can be called to find the number of Hadoop 
FileSystem bytes read by
+   * this thread so far. Reflection is required because thread-level 
FileSystem statistics are only
+   * available as of Hadoop 2.5 (see HADOOP-10688). Returns None if the 
required method can't be
+   * found.
+   */
+  def getInputBytesReadCallback(path: Path, conf: Configuration): 
Option[() = Long] = {
+val qualifiedPath = path.getFileSystem(conf).makeQualified(path)
+val scheme = qualifiedPath.toUri().getScheme()
+val stats = 
FileSystem.getAllStatistics().filter(_.getScheme().equals(scheme))
--- End diff --

Right.  I was assuming that your It looks like this is based on a static 
cache within the FileSystem class. was because you noticed this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-20 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-59882086
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21971/consoleFull)
 for   PR 2087 at commit 
[`1ab662d`](https://github.com/apache/spark/commit/1ab662d8ae674407bfe0f8bbc14aedf1da60c030).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-59882564
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21969/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-16 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/2087#discussion_r18939560
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala 
---
@@ -121,6 +125,31 @@ class SparkHadoopUtil extends Logging {
 UserGroupInformation.loginUserFromKeytab(principalName, keytabFilename)
   }
 
+  /**
+   * Returns a function that can be called to find the number of Hadoop 
FileSystem bytes read by
+   * this thread so far. Reflection is required because thread-level 
FileSystem statistics are only
+   * available as of Hadoop 2.5 (see HADOOP-10688). Returns None if the 
required method can't be
+   * found.
+   */
+  def getInputBytesReadCallback(path: Path, conf: Configuration): 
Option[() = Long] = {
--- End diff --

I mean that the reflective lookups are themselves expensive, not calling 
Method.invoke. However, now I see that this is just returning a new function 
that calls Method.invoke so it should be fine performance wise.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-16 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/2087#discussion_r18939658
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala 
---
@@ -121,6 +125,31 @@ class SparkHadoopUtil extends Logging {
 UserGroupInformation.loginUserFromKeytab(principalName, keytabFilename)
   }
 
+  /**
+   * Returns a function that can be called to find the number of Hadoop 
FileSystem bytes read by
+   * this thread so far. Reflection is required because thread-level 
FileSystem statistics are only
+   * available as of Hadoop 2.5 (see HADOOP-10688). Returns None if the 
required method can't be
+   * found.
+   */
+  def getInputBytesReadCallback(path: Path, conf: Configuration): 
Option[() = Long] = {
+val qualifiedPath = path.getFileSystem(conf).makeQualified(path)
+val scheme = qualifiedPath.toUri().getScheme()
+val stats = 
FileSystem.getAllStatistics().filter(_.getScheme().equals(scheme))
+try {
+  val threadStats = stats.map(Utils.invoke(classOf[Statistics], _, 
getThreadStatistics))
+  val statisticsDataClass =
+
Class.forName(org.apache.hadoop.fs.FileSystem$Statistics$StatisticsData)
+  val getBytesReadMethod = 
statisticsDataClass.getDeclaredMethod(getBytesRead)
+  val f = () = 
threadStats.map(getBytesReadMethod.invoke(_).asInstanceOf[Long]).sum
+  val start = f()
+  Some(() = f() - start)
--- End diff --

ah I see now - so should we call this function `getThreadLocalBytesRead` or 
something? It seems like it only semantically makes sense if the returned 
function is called from the same thread as the function was generated in. That 
might also be worth documenting in the javadoc. The use of the phrase so far 
there also threw me off a bit - maybe there is a better phrase for that.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-16 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/2087#discussion_r18939679
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -222,12 +221,33 @@ class HadoopRDD[K, V](
   case eof: EOFException =
 finished = true
 }
+
+// Update bytes read metric every 32 records
+if (recordsSinceMetricsUpdate == 32  
bytesReadCallback.isDefined) {
+  recordsSinceMetricsUpdate = 0
+  inputMetrics.bytesRead = bytesReadCallback.get()
--- End diff --

One thought is to make it two statements:
```
val bytesReadFn = bytesReadCallback.get()
inputBetrics.bytesRead = bytesReadFn()
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-16 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-59318880
  
Yeah you are totally right - the performance bit was not correct from my 
end. I added some more comments on this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-16 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/2087#discussion_r18939723
  
--- Diff: 
core/src/test/scala/org/apache/spark/metrics/InputMetricsSuite.scala ---
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.metrics
+
+import org.scalatest.FunSuite
+
+import org.apache.spark.SharedSparkContext
+import org.apache.spark.scheduler.{SparkListenerTaskEnd, SparkListener}
+
+import scala.collection.mutable.ArrayBuffer
+
+import java.io.{FileWriter, PrintWriter, File}
+
+class InputMetricsSuite extends FunSuite with SharedSparkContext {
+  test(input metrics when reading text file) {
+val file = new File(getClass.getSimpleName + .txt)
+val pw = new PrintWriter(new FileWriter(file))
+pw.println(some stuff)
+pw.println(some other stuff)
+pw.println(yet more stuff)
+pw.println(too much stuff)
+pw.close()
+file.deleteOnExit()
+
+val taskBytesRead = new ArrayBuffer[Long]()
+sc.addSparkListener(new SparkListener() {
+  override def onTaskEnd(taskEnd: SparkListenerTaskEnd) {
+taskBytesRead += taskEnd.taskMetrics.inputMetrics.get.bytesRead
+  }
+})
+sc.textFile(file:// + file.getAbsolutePath, 2).count()
+
+// Wait for task end events to come in
+Thread.sleep(100)
--- End diff --

Can you use the utility that waits into the listener bus is empty here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-14 Thread sryza
Github user sryza commented on a diff in the pull request:

https://github.com/apache/spark/pull/2087#discussion_r18831556
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala 
---
@@ -121,6 +125,31 @@ class SparkHadoopUtil extends Logging {
 UserGroupInformation.loginUserFromKeytab(principalName, keytabFilename)
   }
 
+  /**
+   * Returns a function that can be called to find the number of Hadoop 
FileSystem bytes read by
+   * this thread so far. Reflection is required because thread-level 
FileSystem statistics are only
+   * available as of Hadoop 2.5 (see HADOOP-10688). Returns None if the 
required method can't be
+   * found.
+   */
+  def getInputBytesReadCallback(path: Path, conf: Configuration): 
Option[() = Long] = {
+val qualifiedPath = path.getFileSystem(conf).makeQualified(path)
+val scheme = qualifiedPath.toUri().getScheme()
+val stats = 
FileSystem.getAllStatistics().filter(_.getScheme().equals(scheme))
+try {
+  val threadStats = stats.map(Utils.invoke(classOf[Statistics], _, 
getThreadStatistics))
--- End diff --

getThreadData is basically the same as getThreadStatistics, but when we 
made the method public, we changed its name to make it a little bit more 
descriptive.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-14 Thread sryza
Github user sryza commented on a diff in the pull request:

https://github.com/apache/spark/pull/2087#discussion_r18832502
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala 
---
@@ -121,6 +125,31 @@ class SparkHadoopUtil extends Logging {
 UserGroupInformation.loginUserFromKeytab(principalName, keytabFilename)
   }
 
+  /**
+   * Returns a function that can be called to find the number of Hadoop 
FileSystem bytes read by
+   * this thread so far. Reflection is required because thread-level 
FileSystem statistics are only
+   * available as of Hadoop 2.5 (see HADOOP-10688). Returns None if the 
required method can't be
+   * found.
+   */
+  def getInputBytesReadCallback(path: Path, conf: Configuration): 
Option[() = Long] = {
--- End diff --

getInputBytesReadCallback only gets called once per task - to find the 
function and return it.  Are you worried about that per-task overhead?  We 
still do have a single reflective call to actually invoke it when we want to 
populate the metric.  The internet has a few different opinions on the overhead 
of this.  Most likely is that it's only about twice the overhead of a direct 
function call, but I've also seen threads that say it's much more.  Either way, 
this was my root of my earlier wariness about having this call on the read path 
as opposed to doing it asynchonously in a separate thread.

On the catch-all exception, you're right - will rework that part.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-14 Thread sryza
Github user sryza commented on a diff in the pull request:

https://github.com/apache/spark/pull/2087#discussion_r18832748
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -222,12 +221,33 @@ class HadoopRDD[K, V](
   case eof: EOFException =
 finished = true
 }
+
+// Update bytes read metric every 32 records
+if (recordsSinceMetricsUpdate == 32  
bytesReadCallback.isDefined) {
+  recordsSinceMetricsUpdate = 0
+  inputMetrics.bytesRead = bytesReadCallback.get()
--- End diff --

This compiles for me and apparently gets interpreted as 
`(bytesReadCallback.get)()`.  Agreed that this looks weird.  Is there a 
phrasing you think would be best?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-14 Thread sryza
Github user sryza commented on a diff in the pull request:

https://github.com/apache/spark/pull/2087#discussion_r18832984
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala 
---
@@ -121,6 +125,31 @@ class SparkHadoopUtil extends Logging {
 UserGroupInformation.loginUserFromKeytab(principalName, keytabFilename)
   }
 
+  /**
+   * Returns a function that can be called to find the number of Hadoop 
FileSystem bytes read by
+   * this thread so far. Reflection is required because thread-level 
FileSystem statistics are only
+   * available as of Hadoop 2.5 (see HADOOP-10688). Returns None if the 
required method can't be
+   * found.
+   */
+  def getInputBytesReadCallback(path: Path, conf: Configuration): 
Option[() = Long] = {
+val qualifiedPath = path.getFileSystem(conf).makeQualified(path)
+val scheme = qualifiedPath.toUri().getScheme()
+val stats = 
FileSystem.getAllStatistics().filter(_.getScheme().equals(scheme))
--- End diff --

The code inside FileSystem is a little confusing about this, but 
`CACHE.get(...)` will actually create a new FileSystem object if none currently 
exists.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-14 Thread sryza
Github user sryza commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-59057738
  
 I.e. the Hadoop RDD should look up the entire function for the computing 
thread at the beginning, then it can invoke that function within the hot loop 
only.

Commented inline above, but am I missing something about my implementation? 
 This is what I (thought) my code is doing.  Unless you mean that looking up 
the function per-task is too expensive?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-14 Thread sryza
Github user sryza commented on a diff in the pull request:

https://github.com/apache/spark/pull/2087#discussion_r18833578
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala 
---
@@ -121,6 +125,31 @@ class SparkHadoopUtil extends Logging {
 UserGroupInformation.loginUserFromKeytab(principalName, keytabFilename)
   }
 
+  /**
+   * Returns a function that can be called to find the number of Hadoop 
FileSystem bytes read by
+   * this thread so far. Reflection is required because thread-level 
FileSystem statistics are only
+   * available as of Hadoop 2.5 (see HADOOP-10688). Returns None if the 
required method can't be
+   * found.
+   */
+  def getInputBytesReadCallback(path: Path, conf: Configuration): 
Option[() = Long] = {
+val qualifiedPath = path.getFileSystem(conf).makeQualified(path)
+val scheme = qualifiedPath.toUri().getScheme()
+val stats = 
FileSystem.getAllStatistics().filter(_.getScheme().equals(scheme))
+try {
+  val threadStats = stats.map(Utils.invoke(classOf[Statistics], _, 
getThreadStatistics))
+  val statisticsDataClass =
+
Class.forName(org.apache.hadoop.fs.FileSystem$Statistics$StatisticsData)
+  val getBytesReadMethod = 
statisticsDataClass.getDeclaredMethod(getBytesRead)
+  val f = () = 
threadStats.map(getBytesReadMethod.invoke(_).asInstanceOf[Long]).sum
+  val start = f()
+  Some(() = f() - start)
--- End diff --

`getBytesRead` will give you the bytes since the thread was created.  As 
executors can put multiple tasks on the same thread, calling it without a delta 
could include bytes read from a previous task.  I'll comment / restructure this 
to make it a little more clear.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-13 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/2087#discussion_r18795643
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala ---
@@ -147,12 +150,36 @@ class NewHadoopRDD[K, V](
   throw new java.util.NoSuchElementException(End of stream)
 }
 havePair = false
+
+// Update bytes read metric every 32 records
--- End diff --

It would be better to pull this number into a constant and also probably 
make it much higher... 32 will have a performance implication since there is 
e.g. thread locals and other stuff going on here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-13 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/2087#discussion_r18795734
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala 
---
@@ -121,6 +125,31 @@ class SparkHadoopUtil extends Logging {
 UserGroupInformation.loginUserFromKeytab(principalName, keytabFilename)
   }
 
+  /**
+   * Returns a function that can be called to find the number of Hadoop 
FileSystem bytes read by
+   * this thread so far. Reflection is required because thread-level 
FileSystem statistics are only
+   * available as of Hadoop 2.5 (see HADOOP-10688). Returns None if the 
required method can't be
+   * found.
+   */
+  def getInputBytesReadCallback(path: Path, conf: Configuration): 
Option[() = Long] = {
--- End diff --

Hey so there are a couple issues with the current approach:

This a bunch of reflective calls + exception handling every time it is 
called. That will have huge performance overhead. Also, this catch-all 
exception is sort of scary... what if there is a legitimate exception invoking 
this function in versions that support it? The user will never be able to find 
it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-13 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/2087#discussion_r18795775
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala 
---
@@ -121,6 +125,31 @@ class SparkHadoopUtil extends Logging {
 UserGroupInformation.loginUserFromKeytab(principalName, keytabFilename)
   }
 
+  /**
+   * Returns a function that can be called to find the number of Hadoop 
FileSystem bytes read by
+   * this thread so far. Reflection is required because thread-level 
FileSystem statistics are only
+   * available as of Hadoop 2.5 (see HADOOP-10688). Returns None if the 
required method can't be
+   * found.
+   */
+  def getInputBytesReadCallback(path: Path, conf: Configuration): 
Option[() = Long] = {
+val qualifiedPath = path.getFileSystem(conf).makeQualified(path)
+val scheme = qualifiedPath.toUri().getScheme()
+val stats = 
FileSystem.getAllStatistics().filter(_.getScheme().equals(scheme))
+try {
+  val threadStats = stats.map(Utils.invoke(classOf[Statistics], _, 
getThreadStatistics))
--- End diff --

some versions of Hadoop have a method called `getThreadData` that seems to 
return something very similar, is that the same as this modulo the name? Or is 
it different?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-13 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/2087#discussion_r18796228
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala 
---
@@ -121,6 +125,31 @@ class SparkHadoopUtil extends Logging {
 UserGroupInformation.loginUserFromKeytab(principalName, keytabFilename)
   }
 
+  /**
+   * Returns a function that can be called to find the number of Hadoop 
FileSystem bytes read by
+   * this thread so far. Reflection is required because thread-level 
FileSystem statistics are only
+   * available as of Hadoop 2.5 (see HADOOP-10688). Returns None if the 
required method can't be
+   * found.
+   */
+  def getInputBytesReadCallback(path: Path, conf: Configuration): 
Option[() = Long] = {
--- End diff --

Actually - this will only be called once per partition... let me continue 
looking.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-13 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/2087#discussion_r18796436
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -222,12 +221,33 @@ class HadoopRDD[K, V](
   case eof: EOFException =
 finished = true
 }
+
+// Update bytes read metric every 32 records
+if (recordsSinceMetricsUpdate == 32  
bytesReadCallback.isDefined) {
+  recordsSinceMetricsUpdate = 0
+  inputMetrics.bytesRead = bytesReadCallback.get()
--- End diff --

is this a legal assignment? `inputMetrics.bytesRead` is of type Long and 
`bytesReadCallback` is of type `Option[() = Long]`...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-13 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/2087#discussion_r18796643
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala 
---
@@ -121,6 +125,31 @@ class SparkHadoopUtil extends Logging {
 UserGroupInformation.loginUserFromKeytab(principalName, keytabFilename)
   }
 
+  /**
+   * Returns a function that can be called to find the number of Hadoop 
FileSystem bytes read by
+   * this thread so far. Reflection is required because thread-level 
FileSystem statistics are only
+   * available as of Hadoop 2.5 (see HADOOP-10688). Returns None if the 
required method can't be
+   * found.
+   */
+  def getInputBytesReadCallback(path: Path, conf: Configuration): 
Option[() = Long] = {
+val qualifiedPath = path.getFileSystem(conf).makeQualified(path)
+val scheme = qualifiedPath.toUri().getScheme()
+val stats = 
FileSystem.getAllStatistics().filter(_.getScheme().equals(scheme))
--- End diff --

When will a specific FileSystem be present here? Is it definitely the case 
that there will exist a fileysystem for the supplied scheme? It looks like this 
is based on a static cache within the FileSystem class. Does that get populated 
before this is invoked?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-13 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/2087#discussion_r18796711
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -222,12 +221,33 @@ class HadoopRDD[K, V](
   case eof: EOFException =
 finished = true
 }
+
+// Update bytes read metric every 32 records
+if (recordsSinceMetricsUpdate == 32  
bytesReadCallback.isDefined) {
--- End diff --

Can you make this into a class constant?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-13 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-58960161
  
Hey Sandy, had a couple questions about behavior and assumptions from 
Hadoop. A couple of things here. The current approach does a lot of reflection 
every time we invoke this statistics function which is very expensive. Also 
there is some catch all exception handling that could bite us. Two things 
that would help this are:

1. Determine a single time up-front whether we will try to compute these 
advanced statistics:
```
  private val statsClass = 
org.apache.hadoop.fs.FileSystem$Statistics$StatisticsData
  private val statsFunction = getThreadStatistics

  /** Whether to attempt accessing per-thread statistics from Hadoop */
  private[spark] val hasAdvancedStatistics =

Try(Class.forName(statsClass).getDeclaredMethod(statsFunction)).map(true).getOrElse(false)
```

And then remove the exception blocks elsewhere. I.e. if we detect advanced 
statistics are available and then there is a failure getting them, we should 
throw an exception.

2. Perform as much reflection as possible off the critical path. I.e. the 
Hadoop RDD should look up the entire function for the computing thread at the 
beginning, then it can invoke that function within the hot loop only.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-13 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/2087#discussion_r18797454
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala 
---
@@ -121,6 +125,31 @@ class SparkHadoopUtil extends Logging {
 UserGroupInformation.loginUserFromKeytab(principalName, keytabFilename)
   }
 
+  /**
+   * Returns a function that can be called to find the number of Hadoop 
FileSystem bytes read by
+   * this thread so far. Reflection is required because thread-level 
FileSystem statistics are only
+   * available as of Hadoop 2.5 (see HADOOP-10688). Returns None if the 
required method can't be
+   * found.
+   */
+  def getInputBytesReadCallback(path: Path, conf: Configuration): 
Option[() = Long] = {
+val qualifiedPath = path.getFileSystem(conf).makeQualified(path)
+val scheme = qualifiedPath.toUri().getScheme()
+val stats = 
FileSystem.getAllStatistics().filter(_.getScheme().equals(scheme))
+try {
+  val threadStats = stats.map(Utils.invoke(classOf[Statistics], _, 
getThreadStatistics))
+  val statisticsDataClass =
+
Class.forName(org.apache.hadoop.fs.FileSystem$Statistics$StatisticsData)
+  val getBytesReadMethod = 
statisticsDataClass.getDeclaredMethod(getBytesRead)
+  val f = () = 
threadStats.map(getBytesReadMethod.invoke(_).asInstanceOf[Long]).sum
+  val start = f()
+  Some(() = f() - start)
--- End diff --

What is going with the deltas here? The javadoc says it's the bytes read 
thus far, but if I look at the function here it seems like it will return to 
you a function that gives you the bytes read since the function was created.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-58975673
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/367/consoleFull)
 for   PR 2087 at commit 
[`305ad9f`](https://github.com/apache/spark/commit/305ad9f50b4dae992e09fdc962ebaf71e191a191).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-58979294
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/367/consoleFull)
 for   PR 2087 at commit 
[`305ad9f`](https://github.com/apache/spark/commit/305ad9f50b4dae992e09fdc962ebaf71e191a191).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-58984415
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/376/consoleFull)
 for   PR 2087 at commit 
[`305ad9f`](https://github.com/apache/spark/commit/305ad9f50b4dae992e09fdc962ebaf71e191a191).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-58988193
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/376/consoleFull)
 for   PR 2087 at commit 
[`305ad9f`](https://github.com/apache/spark/commit/305ad9f50b4dae992e09fdc962ebaf71e191a191).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-57425652
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21091/consoleFull)
 for   PR 2087 at commit 
[`305ad9f`](https://github.com/apache/spark/commit/305ad9f50b4dae992e09fdc962ebaf71e191a191).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-10-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-57425658
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21091/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-09-30 Thread sryza
Github user sryza commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-57421448
  
Updated patch switches from the pull to push model as requested by 
@pwendell and adds a test.  I verified that the test succeeds against both 
Hadoop 2.2 and Hadoop 2.5 (which contains the new API).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-09-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-57421546
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21091/consoleFull)
 for   PR 2087 at commit 
[`305ad9f`](https://github.com/apache/spark/commit/305ad9f50b4dae992e09fdc962ebaf71e191a191).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-09-29 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-57229133
  
Hey @sryza so it seems like there are two things going on here. One is 
adding incremental update and the other is changing the way we deal with 
tracking read bytes for Hadoop RDD's. For the incremental updates, could we 
just make bytes read an atomic long and update it directly inside of the 
`compute` functions - this seems simpler than using callbacks? For instance, 
what if we just update the bytes read every N records by reading from the 
thread local information.

The current approach couples the updating of this metric with the 
heartbeats in a way that seems strange. In fact, is `updatebytesRead` ever 
called here if the heartbeats are disabled or are very long? And don't we need 
to `updateBytesRead` once the task finishes... for instance, more bytes could 
have been read after the most recent heartbeat was sent, right? If we did an 
approach that updated it every N records and then when the entire partition was 
computed, it is easier to reason about the order of updates.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-09-29 Thread sryza
Github user sryza commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-57236118
  
 The current approach couples the updating of this metric with the 
heartbeats in a way that seems strange.

The heartbeats (and task completion, which, my bad, I need to add in) are 
the only times when we use the value of this metric.  Is there an advantage to 
adding complexity to keep it more up to date than that?  We'd also be adding an 
extra branch on the read path, which I suppose might not be much compared with 
the crazy stuff Hadoop record readers do, but could still be a small perf hit.  
Last, in the (rare) case where we're reading a single huge record, we wouldn't 
get incremental measurements within it.

We use a similar approach for shuffleReadMetrics, aggregating it across 
readers right before sending it to the driver.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-09-29 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-57244182
  
Yeah so I just prefer keeping the TaskMetrics/InputMetrics as simple as 
possible rather than having callback registration and other state in them. The 
simplest possible interface is that they are just structs and people update 
their values. This keeps all of the logic around this thread-based Hadoop 
instrumentation local to the HadoopRDD itself, so the interface is much 
narrower between the components.

Overall, I'm gauging the complexity based on how complicated the interfaces 
are, not on the complexity of the internal implementations.

If we have a single large record this might be an issue. But we already 
make other assumptions that record sizes are fairly small, for instance they 
must fit easily in memory so they can't be large.

By keeping the interactions between the components simpler, this will be 
easier to test also. Right now there are no unit tests for this and because the 
interfaces are complex it might be difficult to test as-is.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-09-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-57260156
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21013/consoleFull)
 for   PR 2087 at commit 
[`a5486af`](https://github.com/apache/spark/commit/a5486af4c6750980eba8065344676ba64f2a8ad5).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-09-22 Thread sryza
Github user sryza commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-56461580
  
MapReduce doesn't use getPos, but it does look like it might be helpful in 
some situations.  One caveat is that pos only means # bytes for file input 
formats.  For example, for DBInputFormat, it means the number of records. 

If we choose to use getPos for pre-2.5 Hadoop, my preference would be to 
make that change in a separate patch.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-09-21 Thread kayousterhout
Github user kayousterhout commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-56307027
  
@aarondav @sryza Did you consider using reader.getPos() to get the correct 
metrics for older versions of Hadoop (as in here: 
https://github.com/kayousterhout/spark-1/blob/0028f0de92d79fd4df01d69b9fefdc51d3489c54/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L170)?
  That only fixes half the problem because I think it's only available in the 
old Hadoop API (not the new one), but I think this is the way that MapReduce 
correctly sets the input bytes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-09-11 Thread sryza
Github user sryza commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-55355339
  
Updated patch includes fallback to the split size


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-09-11 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-55355492
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20199/consoleFull)
 for   PR 2087 at commit 
[`8bfaa24`](https://github.com/apache/spark/commit/8bfaa24c03262db846f61d50ed174200da527f82).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-09-11 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-55358462
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20199/consoleFull)
 for   PR 2087 at commit 
[`8bfaa24`](https://github.com/apache/spark/commit/8bfaa24c03262db846f61d50ed174200da527f82).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-09-08 Thread sryza
Github user sryza commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-54780033
  
It looks like all the core tests are passing, but there are some failures 
in streaming and SQL tests.  Have those been showing up elsewhere?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-09-08 Thread aarondav
Github user aarondav commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-54807921
  
FWIW, I think mostly-accurate metrics are much better than no metrics in 
this case. The read/write bytes are very useful from Hadoop FSes, and Hadoop 
2.5 is still very much widespread (Spark's default).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-09-08 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-54851789
  
retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-09-08 Thread sryza
Github user sryza commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-54852759
  
Just to make sure it's clear, the issue isn't only that we can be a few 
bytes off when we're reading outside of split boundaries, but that it'll look 
like we read the full split even if it's only a take or query with a limit.

If you're comfortable with that, I don't mind adding it back in.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-09-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-54859158
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19983/consoleFull)
 for   PR 2087 at commit 
[`0034292`](https://github.com/apache/spark/commit/00342924b0b3eba12861bf1cd524f0bfca2fbc4e).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-09-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-54867619
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19983/consoleFull)
 for   PR 2087 at commit 
[`0034292`](https://github.com/apache/spark/commit/00342924b0b3eba12861bf1cd524f0bfca2fbc4e).
 * This patch **fails** unit tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `logDebug(isMulticlass =  + metadata.isMulticlass)`
  * `logDebug(isMulticlass =  + metadata.isMulticlass)`



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-09-08 Thread aarondav
Github user aarondav commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-54915960
  
I think we need some indication of the bytes being read from Hadoop. If 
this is our only current mechanism, then I think removing the code is not worth 
the behavioral regression.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-09-05 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-54695093
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19865/consoleFull)
 for   PR 2087 at commit 
[`0034292`](https://github.com/apache/spark/commit/00342924b0b3eba12861bf1cd524f0bfca2fbc4e).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-09-05 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-54698426
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19865/consoleFull)
 for   PR 2087 at commit 
[`0034292`](https://github.com/apache/spark/commit/00342924b0b3eba12861bf1cd524f0bfca2fbc4e).
 * This patch **fails** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-09-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-54519238
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19790/consoleFull)
 for   PR 2087 at commit 
[`0a743c0`](https://github.com/apache/spark/commit/0a743c00307c721b74343eceb715925101fc5c66).
 * This patch **does not** merge cleanly!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-09-04 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-54563140
  
Hm, test this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-09-04 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/2087#issuecomment-54563520
  
Jenkins, test this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2621. Update task InputMetrics increment...

2014-08-21 Thread sryza
GitHub user sryza opened a pull request:

https://github.com/apache/spark/pull/2087

SPARK-2621. Update task InputMetrics incrementally

The patch takes advantage an API provided in Hadoop 2.5 that allows getting 
accurate data on Hadoop FileSystem bytes read.  It eliminates the old method, 
which naively accepts the split size as the input bytes.  An impact of this 
change will be that input metrics go away when using against Hadoop versions 
earlier thatn 2.5.  I can add this back in, but my opinion is that no metrics 
are better than inaccurate metrics.

This is difficult to write a test for because we don't usually build 
against a version of Hadoop that contains the function we need.  I've tested it 
manually on a pseudo-distributed cluster.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sryza/spark sandy-spark-2621

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2087.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2087


commit b5f4c6c5d0be646798bc8188f610b00fb4be83fa
Author: Sandy Ryza sa...@cloudera.com
Date:   2014-07-22T20:42:28Z

SPARK-2621. Update task InputMetrics incrementally




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org