[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2016-04-27 Thread aguyyala
Github user aguyyala commented on the pull request:

https://github.com/apache/spark/pull/4067#issuecomment-214980191
  
@ksakellis How do I collect these metrics on a console right after the task is done?
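
One common way to get at these per-task metrics programmatically (a sketch, not something from this thread) is to register a `SparkListener` and read the task's metrics in `onTaskEnd`. The class name `RecordCountListener` below is made up, and the accessors shown match the Spark 1.3-era API this PR targeted; they differ in later releases. This block depends on the Spark library and is not standalone-runnable.

```scala
// Hypothetical sketch: print record-count metrics as each task completes.
// Assumes the Spark 1.3-era TaskMetrics API (Option-wrapped sub-metrics).
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

class RecordCountListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val metrics = taskEnd.taskMetrics
    if (metrics != null) {  // can be null for failed tasks
      metrics.inputMetrics.foreach { in =>
        println(s"task ${taskEnd.taskInfo.taskId}: records read = ${in.recordsRead}")
      }
      metrics.shuffleWriteMetrics.foreach { sw =>
        println(s"task ${taskEnd.taskInfo.taskId}: shuffle records written = ${sw.shuffleRecordsWritten}")
      }
    }
  }
}

// Register the listener on the SparkContext before running the job:
// sc.addSparkListener(new RecordCountListener)
```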


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-08 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/4067#issuecomment-73325105
  
Merging this, thanks Kos.





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4067#issuecomment-73314017
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26936/





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4067#issuecomment-73314007
  
[Test build #26936 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26936/consoleFull) for PR 4067 at commit [`bd919be`](https://github.com/apache/spark/commit/bd919be5817e29dad476213a0b3b407d28ee0f24).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-08 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/4067#issuecomment-73393979
  
It looks like the InputOutputMetricsSuite tests "input metrics with mixed read
methods" and "input metrics with interleaved reads" may have started failing
in the hadoop-2.2 build since this patch:


https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.3-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=centos/29/testReport/





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-08 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/4067





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-08 Thread ksakellis
Github user ksakellis commented on the pull request:

https://github.com/apache/spark/pull/4067#issuecomment-73429206
  
Yikes, @JoshRosen, I'm looking into this.





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-06 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/4067#issuecomment-73303194
  
Jenkins, test this please. This LGTM pending tests.





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-06 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4067#issuecomment-73303658
  
[Test build #26936 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26936/consoleFull) for PR 4067 at commit [`bd919be`](https://github.com/apache/spark/commit/bd919be5817e29dad476213a0b3b407d28ee0f24).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4067#issuecomment-73203717
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26906/





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-06 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4067#issuecomment-73203708
  
[Test build #26906 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26906/consoleFull) for PR 4067 at commit [`bd919be`](https://github.com/apache/spark/commit/bd919be5817e29dad476213a0b3b407d28ee0f24).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-05 Thread ksakellis
Github user ksakellis commented on the pull request:

https://github.com/apache/spark/pull/4067#issuecomment-73182058
  
Jenkins, retest this please.





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-05 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/4067#discussion_r24223212
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockObjectWriter.scala ---
@@ -193,12 +194,11 @@ private[spark] class DiskBlockObjectWriter(
 }
 
 objOut.writeObject(value)
+numRecordsWritten += 1
--- End diff --

What about just adding a class-level comment that it can't be used after it
is closed?





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-05 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/4067#discussion_r24223198
  
--- Diff: core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala ---
@@ -358,5 +374,12 @@ class ShuffleWriteMetrics extends Serializable {
   private[spark] def incShuffleWriteTime(value: Long) = _shuffleWriteTime += value
   private[spark] def decShuffleWriteTime(value: Long) = _shuffleWriteTime -= value

-
+  /**
+   * Total number of records written to the shuffle by this task
+   */
+  @volatile private var _recordsWritten: Long = _
--- End diff --

@ksakellis any thoughts on this one?





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-05 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/4067#discussion_r24225336
  
--- Diff: core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala ---
@@ -334,6 +342,14 @@ class ShuffleReadMetrics extends Serializable {
    * Number of blocks fetched in this shuffle by this task (remote or local)
    */
   def totalBlocksFetched = _remoteBlocksFetched + _localBlocksFetched
+
+  /**
+   * Total number of records read from the shuffle by this task
+   */
+  private var _recordsRead: Long = _
--- End diff --

@ksakellis mind fixing this one?





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-05 Thread ksakellis
Github user ksakellis commented on a diff in the pull request:

https://github.com/apache/spark/pull/4067#discussion_r24223586
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockObjectWriter.scala ---
@@ -193,12 +194,11 @@ private[spark] class DiskBlockObjectWriter(
 }
 
 objOut.writeObject(value)
+numRecordsWritten += 1
--- End diff --

Yes, I guess that is the least we can do, though I think an explicit check
would be better. If we are okay with it, I can add a boolean that tracks
whether the block writer has been opened before and, if so, disallow
reopening it. Thoughts?





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-05 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4067#issuecomment-73196678
  
[Test build #26906 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26906/consoleFull) for PR 4067 at commit [`bd919be`](https://github.com/apache/spark/commit/bd919be5817e29dad476213a0b3b407d28ee0f24).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-05 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4067#issuecomment-73186608
  
[Test build #26896 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26896/consoleFull) for PR 4067 at commit [`e156560`](https://github.com/apache/spark/commit/e1565607622a118cf7da2f00379749141e927a73).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4067#issuecomment-73186614
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26896/





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-05 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/4067#discussion_r24224638
  
--- Diff: core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala ---
@@ -358,5 +374,12 @@ class ShuffleWriteMetrics extends Serializable {
   private[spark] def incShuffleWriteTime(value: Long) = _shuffleWriteTime += value
   private[spark] def decShuffleWriteTime(value: Long) = _shuffleWriteTime -= value

-
+  /**
+   * Total number of records written to the shuffle by this task
+   */
+  @volatile private var _recordsWritten: Long = _
--- End diff --

It is a bit redundant, but many other fields are already named this way, so I
figured consistency was best.





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-05 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/4067#discussion_r24224673
  
--- Diff: core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala ---
@@ -472,12 +512,12 @@ private[ui] class StagePage(parent: StagesTab) extends WebUIPage("stage") {
 }}
 {if (hasInput) {
   <td sorttable_customkey={inputSortable}>
-{inputReadable}
+{s"$inputReadable / $inputRecords"}
--- End diff --

Yeah, that was my thought.





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-05 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/4067#discussion_r24224655
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockObjectWriter.scala ---
@@ -193,12 +194,11 @@ private[spark] class DiskBlockObjectWriter(
 }
 
 objOut.writeObject(value)
+numRecordsWritten += 1
--- End diff --

Sure, you can enforce that it is never re-opened if you want.





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-05 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/4067#discussion_r24225268
  
--- Diff: core/src/main/scala/org/apache/spark/shuffle/hash/BlockStoreShuffleFetcher.scala ---
@@ -25,7 +25,7 @@ import org.apache.spark._
 import org.apache.spark.serializer.Serializer
 import org.apache.spark.shuffle.FetchFailedException
 import org.apache.spark.storage.{BlockId, BlockManagerId, 
ShuffleBlockFetcherIterator, ShuffleBlockId}
-import org.apache.spark.util.CompletionIterator
+import org.apache.spark.util.{CompletionIterator}
--- End diff --

you don't need braces here if it is a single import.





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-05 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4067#issuecomment-73182187
  
[Test build #26896 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26896/consoleFull) for PR 4067 at commit [`e156560`](https://github.com/apache/spark/commit/e1565607622a118cf7da2f00379749141e927a73).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-05 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4067#issuecomment-73189391
  
[Test build #26904 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26904/consoleFull) for PR 4067 at commit [`dad4d57`](https://github.com/apache/spark/commit/dad4d5782bf6cde7f655d44bd29ee31041d3af4a).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4067#issuecomment-73198540
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26904/





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-05 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4067#issuecomment-73198535
  
[Test build #26904 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26904/consoleFull) for PR 4067 at commit [`dad4d57`](https://github.com/apache/spark/commit/dad4d5782bf6cde7f655d44bd29ee31041d3af4a).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-05 Thread ksakellis
Github user ksakellis commented on a diff in the pull request:

https://github.com/apache/spark/pull/4067#discussion_r24205171
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
@@ -1089,17 +1091,15 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
   private def initHadoopOutputMetrics(context: TaskContext): (OutputMetrics, Option[() => Long]) = {
 val bytesWrittenCallback = 
SparkHadoopUtil.get.getFSBytesWrittenOnThreadCallback()
 val outputMetrics = new OutputMetrics(DataWriteMethod.Hadoop)
-if (bytesWrittenCallback.isDefined) {
-  context.taskMetrics.outputMetrics = Some(outputMetrics)
--- End diff --

Yeah, good point. It isn't really straightforward to add output metrics for
older Hadoop. I'll revert this change so that outputMetrics are only created
for Hadoop >= 2.5. We can file a JIRA for adding output metrics for older
Hadoop versions.





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-05 Thread ksakellis
Github user ksakellis commented on a diff in the pull request:

https://github.com/apache/spark/pull/4067#discussion_r24206958
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockObjectWriter.scala ---
@@ -193,12 +194,11 @@ private[spark] class DiskBlockObjectWriter(
 }
 
 objOut.writeObject(value)
+numRecordsWritten += 1
--- End diff --

If the BlockObjectWriter can be reopened, wouldn't we need to reset some other
variables too, like finalPosition, initialPosition, and reportedPosition? They
are set during construction, and I don't see any enforcement around reopening
the BlockObjectWriter anywhere. We could either reset all the variables I
mentioned in open(), or add a check that once the BlockObjectWriter has been
opened it can't be reopened. I'm leaning towards the latter. What do you
think?
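
The "never reopen" check being proposed can be sketched as follows. This is an illustrative stand-in, not the actual DiskBlockObjectWriter; the names `GuardedWriter` and `hasBeenClosed` are hypothetical:

```scala
// Illustrative sketch of an explicit reopen guard (hypothetical class):
// once close() is called, open() fails, so counters such as
// numRecordsWritten cannot be silently corrupted by a reopened writer.
class GuardedWriter {
  private var hasBeenClosed = false
  private var initialized = false
  private var numRecordsWritten = 0L

  def open(): this.type = {
    // The explicit check: a closed writer can never be opened again.
    require(!hasBeenClosed, "Writer already closed. Cannot be reopened.")
    initialized = true
    this
  }

  def write(record: AnyRef): Unit = {
    require(initialized && !hasBeenClosed, "Writer must be open to write")
    numRecordsWritten += 1
  }

  def close(): Unit = {
    initialized = false
    hasBeenClosed = true
  }

  def recordsWritten: Long = numRecordsWritten
}
```

Compared with resetting finalPosition/initialPosition/reportedPosition in open(), this keeps the writer single-use and makes any accidental reopen fail fast instead of producing wrong metrics.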





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-05 Thread ksakellis
Github user ksakellis commented on a diff in the pull request:

https://github.com/apache/spark/pull/4067#discussion_r24207220
  
--- Diff: core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala ---
@@ -472,12 +512,12 @@ private[ui] class StagePage(parent: StagesTab) extends WebUIPage("stage") {
 }}
 {if (hasInput) {
   <td sorttable_customkey={inputSortable}>
-{inputReadable}
+{s"$inputReadable / $inputRecords"}
--- End diff --

I worry it would be confusing for a user to see 0, because that is a valid
number of records read/written, but maybe that is better than seeing -45.





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-05 Thread ksakellis
Github user ksakellis commented on the pull request:

https://github.com/apache/spark/pull/4067#issuecomment-73164133
  
New screenshots with the irrelevant columns hidden:

![screen shot 2015-02-05 at 5 07 23 
pm](https://cloud.githubusercontent.com/assets/6590087/6072770/4a42c7b8-ad5a-11e4-89c8-fbffdbcb820e.png)
Shows a stage that has input metrics (reading from a file) and writes data
for the next stage.

![screen shot 2015-02-05 at 5 07 43 
pm](https://cloud.githubusercontent.com/assets/6590087/6072775/5b0a4bb6-ad5a-11e4-940a-f6e686ec95b7.png)
Shows a stage that has both shuffle reading and writing - no input or 
output metrics.

![screen shot 2015-02-05 at 5 07 59 
pm](https://cloud.githubusercontent.com/assets/6590087/6072776/5d1e233c-ad5a-11e4-8612-7c7b704bde17.png)
Shows a stage outputting to an HDFS file.






[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-05 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4067#issuecomment-73164339
  
[Test build #26876 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26876/consoleFull) for PR 4067 at commit [`e156560`](https://github.com/apache/spark/commit/e1565607622a118cf7da2f00379749141e927a73).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-05 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4067#issuecomment-73170617
  
  [Test build #26876 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26876/consoleFull)
 for   PR 4067 at commit 
[`e156560`](https://github.com/apache/spark/commit/e1565607622a118cf7da2f00379749141e927a73).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4067#issuecomment-73170627
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26876/
Test FAILed.





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-04 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/4067#discussion_r24133825
  
--- Diff: core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala 
---
@@ -358,5 +374,12 @@ class ShuffleWriteMetrics extends Serializable {
   private[spark] def incShuffleWriteTime(value: Long) = _shuffleWriteTime += value
   private[spark] def decShuffleWriteTime(value: Long) = _shuffleWriteTime -= value
   
-
+  /**
+   * Total number of records written to the shuffle by this task
+   */
+  @volatile private var _recordsWritten: Long = _
--- End diff --

to be more consistent with the presentation in the UI and the other fields 
here, should this be `shuffleRecordsWritten`?





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-04 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/4067#discussion_r24133859
  
--- Diff: core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala 
---
@@ -334,6 +342,14 @@ class ShuffleReadMetrics extends Serializable {
    * Number of blocks fetched in this shuffle by this task (remote or local)
    */
   def totalBlocksFetched = _remoteBlocksFetched + _localBlocksFetched
+
+  /**
+   * Total number of records read from the shuffle by this task
+   */
+  private var _recordsRead: Long = _
--- End diff --

to be more consistent with the presentation in the UI and the other fields 
here, should this be shuffleRecordsRead?





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-04 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/4067#discussion_r24133326
  
--- Diff: 
core/src/main/scala/org/apache/spark/storage/BlockObjectWriter.scala ---
@@ -117,7 +117,7 @@ private[spark] class DiskBlockObjectWriter(
 
   /** Calling channel.position() to update the write metrics can be a 
little bit expensive, so we
 * only call it every N writes */
-  private var writesSinceMetricsUpdate = 0
+  private var numRecordsWritten = 0
--- End diff --

It would be good to define this precisely with a comment. I think it's the 
number of records ever written, including reversions (or instead, you could 
update this counter when the writes are reverted).
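The two counting semantics could be sketched roughly as follows (a minimal illustration only; `CountingWriter`, `commit`, and `revertPartialWrites` are hypothetical names, not Spark's actual `DiskBlockObjectWriter` API):

```scala
// Illustrative sketch of the two semantics discussed above; names are
// hypothetical, not Spark's actual DiskBlockObjectWriter API.
class CountingWriter {
  private var numRecordsWritten: Long = 0L // every record written so far
  private var committedRecords: Long = 0L  // records confirmed by the last commit

  def write(record: Any): Unit = {
    numRecordsWritten += 1
  }

  def commit(): Unit = {
    committedRecords = numRecordsWritten
  }

  // The alternative raised above: roll the counter back on revert, so it
  // reflects only records that were not reverted.
  def revertPartialWrites(): Unit = {
    numRecordsWritten = committedRecords
  }

  def recordsWritten: Long = numRecordsWritten
}

val w = new CountingWriter
w.write("a"); w.write("b"); w.commit()
w.write("c")            // this write is then reverted
w.revertPartialWrites() // counter rolls back to the committed count of 2
```

With the rollback variant, the counter matches what actually ends up in the file; without it, a comment stating "records ever written, including reverted ones" would be needed.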





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-04 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/4067#discussion_r24134862
  
--- Diff: core/src/main/scala/org/apache/spark/ui/jobs/ExecutorTable.scala 
---
@@ -77,13 +79,13 @@ private[ui] class ExecutorTable(stageId: Int, 
stageAttemptId: Int, parent: Stage
  <td>{v.failedTasks}</td>
  <td>{v.succeededTasks}</td>
  <td sorttable_customkey={v.inputBytes.toString}>
-   {Utils.bytesToString(v.inputBytes)}</td>
+   {s"${Utils.bytesToString(v.inputBytes)} / ${v.inputRecords}"}</td>
--- End diff --

How hard is it to avoid these columns if they are not present in the job?





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-04 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/4067#discussion_r24133768
  
--- Diff: core/src/main/scala/org/apache/spark/util/JsonProtocol.scala ---
@@ -669,6 +673,7 @@ private[spark] object JsonProtocol {
    metrics.incLocalBlocksFetched((json \ "Local Blocks Fetched").extract[Int])
    metrics.incFetchWaitTime((json \ "Fetch Wait Time").extract[Long])
    metrics.incRemoteBytesRead((json \ "Remote Bytes Read").extract[Long])
+   metrics.incRecordsRead((json \ "Records Read").extractOpt[Long].getOrElse(-1))
--- End diff --

Same for all below. I think it would be fine to just have them be 0.





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-04 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/4067#discussion_r24133742
  
--- Diff: core/src/main/scala/org/apache/spark/util/JsonProtocol.scala ---
@@ -669,6 +673,7 @@ private[spark] object JsonProtocol {
    metrics.incLocalBlocksFetched((json \ "Local Blocks Fetched").extract[Int])
    metrics.incFetchWaitTime((json \ "Fetch Wait Time").extract[Long])
    metrics.incRemoteBytesRead((json \ "Remote Bytes Read").extract[Long])
+   metrics.incRecordsRead((json \ "Records Read").extractOpt[Long].getOrElse(-1))
--- End diff --

Should this just be 0 instead of -1? We have -1 in a few places, but those are 
for metrics which are not aggregated like this. The weird thing here is that 
we'll actually keep aggregating more and more negatives.
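The aggregation concern can be shown with a few lines of plain Scala (an illustration only, no json4s involved): summing per-task values where missing metrics default to -1 drifts increasingly negative, while defaulting to 0 leaves the total equal to the sum of the known values.

```scala
// Each element models one task's "Records Read" metric as parsed from JSON:
// Some(n) when the field is present, None for event logs from older versions.
val perTask: Seq[Option[Long]] = Seq(Some(10L), None, Some(5L), None, None)

// Defaulting missing values to -1 keeps aggregating negatives as more
// and more tasks report None: 10 - 1 + 5 - 1 - 1 = 12 instead of 15.
val totalWithMinusOne = perTask.map(_.getOrElse(-1L)).sum

// Defaulting to 0 leaves the aggregate equal to the sum of the known values.
val totalWithZero = perTask.map(_.getOrElse(0L)).sum
</imports>
```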





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-04 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/4067#discussion_r24134903
  
--- Diff: core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala ---
@@ -472,12 +512,12 @@ private[ui] class StagePage(parent: StagesTab) 
extends WebUIPage("stage") {
 }}
 {if (hasInput) {
   <td sorttable_customkey={inputSortable}>
-    {inputReadable}
+    {s"$inputReadable / $inputRecords"}
--- End diff --

I suggested below related to a separate concern, but what about just 
setting this to 0 instead of accumulating more and more negative values?





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-04 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/4067#discussion_r24132189
  
--- Diff: core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala 
---
@@ -241,28 +242,29 @@ object DataWriteMethod extends Enumeration with 
Serializable {
  */
 @DeveloperApi
 case class InputMetrics(readMethod: DataReadMethod.Value) {
-
-  private val _bytesRead: AtomicLong = new AtomicLong()
+ 
+  @volatile @transient var bytesReadCallback: Option[() => Long] = None
--- End diff --

Mind adding a comment:

```
// This is volatile to make sure it's visible to updater thread
```





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-04 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/4067#discussion_r24132929
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala 
---
@@ -1089,17 +1091,15 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
   private def initHadoopOutputMetrics(context: TaskContext): 
(OutputMetrics, Option[() => Long]) = {
 val bytesWrittenCallback = 
SparkHadoopUtil.get.getFSBytesWrittenOnThreadCallback()
 val outputMetrics = new OutputMetrics(DataWriteMethod.Hadoop)
-if (bytesWrittenCallback.isDefined) {
-  context.taskMetrics.outputMetrics = Some(outputMetrics)
--- End diff --

I think there might be an issue with including OutputMetrics for older 
Hadoop versions. Based on the way output metrics were implemented in #2968 (/cc 
@sryza), this won't correctly display the output file size (it will show up as 
zero) for older versions of Hadoop. If you look, for input metrics, we look at 
the file size with older hadoop versions, but unfortunately I don't think the 
same thing was added for output metrics. It might be good to add that, in the 
case where the `FileOutputFormat` is in use.





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-04 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/4067#discussion_r24133075
  
--- Diff: 
core/src/main/scala/org/apache/spark/storage/BlockObjectWriter.scala ---
@@ -117,7 +117,7 @@ private[spark] class DiskBlockObjectWriter(
 
   /** Calling channel.position() to update the write metrics can be a 
little bit expensive, so we
--- End diff --

This comment is a bit out of place now. Maybe put this down where the 
actual % 32 check occurs.





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-04 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/4067#discussion_r24133393
  
--- Diff: 
core/src/main/scala/org/apache/spark/storage/BlockObjectWriter.scala ---
@@ -193,12 +194,11 @@ private[spark] class DiskBlockObjectWriter(
 }
 
 objOut.writeObject(value)
+numRecordsWritten += 1
--- End diff --

The current use of `numRecordsWritten` assumes that a `BlockObjectWriter` 
will never be closed and re-opened, correct? I'm not sure whether that is 
enforced anywhere. Should we zero this var out in `open()`?
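The suggested guard could look roughly like this (a hypothetical sketch with illustrative names, not the real `DiskBlockObjectWriter`): zero the per-open record counter in `open()` so a closed-and-reopened writer does not carry counts over from a previous open.

```scala
// Minimal sketch (hypothetical names) of zeroing the counter in open().
class ReopenableWriter {
  private var numRecordsWritten: Long = 0L
  private var opened = false

  def open(): this.type = {
    numRecordsWritten = 0L // fresh count for each open
    opened = true
    this
  }

  def write(record: Any): Unit = {
    require(opened, "writer must be opened before writing")
    numRecordsWritten += 1
  }

  def close(): Unit = { opened = false }

  def recordsWritten: Long = numRecordsWritten
}

val rw = new ReopenableWriter
rw.open(); rw.write("a"); rw.write("b"); rw.close()
rw.open(); rw.write("c") // reopened: the count restarts at zero
```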





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-04 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/4067#issuecomment-72975805
  
Hey @ksakellis - I did a pretty thorough review here, any chance you could 
take a look? Most of the things were small. The only major thing is that I 
think if we are going to always show the output metrics for Hadoop (even for 
older versions) we need to make sure that we at least do a best effort 
presentation of the output size for older versions. I think it would be good if 
we could sneak that in, since otherwise that output metrics thing is not so 
useful for many people running with older versions. However, I'd also be okay 
just not showing either bytes or count in that case (which I think was the 
behavior before).

A second issue I realized (let's open a new JIRA for this) is that we never 
track output metrics (size or bytes) when data is written to cache, which is 
confusing because we track input metrics when data is read from cache.





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4067#issuecomment-72745792
  
  [Test build #26678 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26678/consoleFull)
 for   PR 4067 at commit 
[`9ecf912`](https://github.com/apache/spark/commit/9ecf912adddb9a4fb49c8fa9907eba6d3a947d00).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4067#issuecomment-72756645
  
  [Test build #26678 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26678/consoleFull)
 for   PR 4067 at commit 
[`9ecf912`](https://github.com/apache/spark/commit/9ecf912adddb9a4fb49c8fa9907eba6d3a947d00).
 * This patch **fails MiMa tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class SimpleFunctionRegistry(val caseSensitive: Boolean) extends 
FunctionRegistry `
  * `class StringKeyHashMap[T](normalizer: (String) => String) `
  * `case class MultiAlias(child: Expression, names: Seq[String])`






[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4067#issuecomment-72756658
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26678/
Test FAILed.





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-02 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/4067#discussion_r23958696
  
--- Diff: core/src/main/scala/org/apache/spark/CacheManager.scala ---
@@ -47,9 +49,13 @@ private[spark] class CacheManager(blockManager: 
BlockManager) extends Logging {
 val inputMetrics = blockResult.inputMetrics
 val existingMetrics = context.taskMetrics
   .getInputMetricsForReadMethod(inputMetrics.readMethod)
-existingMetrics.addBytesRead(inputMetrics.bytesRead)
+existingMetrics.incBytesRead(inputMetrics.bytesRead)
 
-new InterruptibleIterator(context, 
blockResult.data.asInstanceOf[Iterator[T]])
+val iter = blockResult.data.asInstanceOf[Iterator[T]]
+new InterruptibleIterator(context, AfterNextInterceptingIterator(iter, (next: T) => {
+  existingMetrics.incRecordsRead(1)
--- End diff --

That's the general assumption. We should document that better.

In general, each task is only processed by a single thread. Otherwise a lot 
of things will break.





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-02 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/4067#discussion_r23951392
  
--- Diff: core/src/main/scala/org/apache/spark/CacheManager.scala ---
@@ -47,9 +49,13 @@ private[spark] class CacheManager(blockManager: 
BlockManager) extends Logging {
 val inputMetrics = blockResult.inputMetrics
 val existingMetrics = context.taskMetrics
   .getInputMetricsForReadMethod(inputMetrics.readMethod)
-existingMetrics.addBytesRead(inputMetrics.bytesRead)
+existingMetrics.incBytesRead(inputMetrics.bytesRead)
 
-new InterruptibleIterator(context, 
blockResult.data.asInstanceOf[Iterator[T]])
+val iter = blockResult.data.asInstanceOf[Iterator[T]]
+new InterruptibleIterator(context, AfterNextInterceptingIterator(iter, (next: T) => {
+  existingMetrics.incRecordsRead(1)
--- End diff --

The question is - how expensive is the thing we are doing inside of the 
override method? In those other cases I think we're just checking the value of 
a single variable that doesn't change often (i.e. checking for interrupted). In 
the past we've seen performance regressions from anything more expensive than 
this: See 
https://github.com/apache/spark/commit/f708dda81ed5004325591fcc31cd79a8afa580db.

The cost of CAS is hardware-dependent, but it can be expensive on machines 
with large numbers of cores because in many cases there is a single shared bus 
lock. Volatile is similar: if you think about it, maybe you have 16 
cores and they each need to constantly invalidate each other's local copy of 
the variable.
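The trade-off can be sketched in plain Scala (a toy loop, not a benchmark): both variants compute the same count, but the `AtomicLong` issues a CAS per increment (and a `@volatile` field would force a cross-core visibility write per increment), while the plain `var` is a cheap thread-local increment that is only safe under the single-thread-per-task assumption discussed in this thread.

```scala
import java.util.concurrent.atomic.AtomicLong

val n = 100000

// CAS per increment: correct under concurrency, hardware-dependent cost.
val atomicCount = new AtomicLong()
var i = 0
while (i < n) { atomicCount.incrementAndGet(); i += 1 }

// Plain var: cheapest option; relies on each task being processed
// by a single thread.
var plainCount = 0L
i = 0
while (i < n) { plainCount += 1; i += 1 }
```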





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-02 Thread sryza
Github user sryza commented on a diff in the pull request:

https://github.com/apache/spark/pull/4067#discussion_r23951634
  
--- Diff: core/src/main/scala/org/apache/spark/CacheManager.scala ---
@@ -47,9 +49,13 @@ private[spark] class CacheManager(blockManager: 
BlockManager) extends Logging {
 val inputMetrics = blockResult.inputMetrics
 val existingMetrics = context.taskMetrics
   .getInputMetricsForReadMethod(inputMetrics.readMethod)
-existingMetrics.addBytesRead(inputMetrics.bytesRead)
+existingMetrics.incBytesRead(inputMetrics.bytesRead)
 
-new InterruptibleIterator(context, 
blockResult.data.asInstanceOf[Iterator[T]])
+val iter = blockResult.data.asInstanceOf[Iterator[T]]
+new InterruptibleIterator(context, AfterNextInterceptingIterator(iter, (next: T) => {
+  existingMetrics.incRecordsRead(1)
--- End diff --

Thinking about it more, we don't really even need the variable to be 
volatile - the worst that can happen is that the incremental metric we report 
is slightly out of date.





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-02 Thread ksakellis
Github user ksakellis commented on a diff in the pull request:

https://github.com/apache/spark/pull/4067#discussion_r23953114
  
--- Diff: core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala ---
@@ -472,12 +512,12 @@ private[ui] class StagePage(parent: StagesTab) 
extends WebUIPage("stage") {
 }}
 {if (hasInput) {
   <td sorttable_customkey={inputSortable}>
-    {inputReadable}
+    {s"$inputReadable / $inputRecords"}
--- End diff --

good point. I could add checks in the UI code for inputRecords > 0 and not 
show it otherwise. Do you think that's worth doing, or should we leave it as -1?





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4067#issuecomment-72533681
  
  [Test build #26530 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26530/consoleFull)
 for   PR 4067 at commit 
[`1aaa980`](https://github.com/apache/spark/commit/1aaa980f24c1dfcb9ef879db938231287ef60fcb).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-02 Thread ksakellis
Github user ksakellis commented on a diff in the pull request:

https://github.com/apache/spark/pull/4067#discussion_r23952709
  
--- Diff: core/src/main/scala/org/apache/spark/CacheManager.scala ---
@@ -47,9 +49,13 @@ private[spark] class CacheManager(blockManager: 
BlockManager) extends Logging {
 val inputMetrics = blockResult.inputMetrics
 val existingMetrics = context.taskMetrics
   .getInputMetricsForReadMethod(inputMetrics.readMethod)
-existingMetrics.addBytesRead(inputMetrics.bytesRead)
+existingMetrics.incBytesRead(inputMetrics.bytesRead)
 
-new InterruptibleIterator(context, 
blockResult.data.asInstanceOf[Iterator[T]])
+val iter = blockResult.data.asInstanceOf[Iterator[T]]
+new InterruptibleIterator(context, AfterNextInterceptingIterator(iter, (next: T) => {
+  existingMetrics.incRecordsRead(1)
--- End diff --

I'd like to get back to my first question about removing AtomicLong. Are we 
absolutely sure that more than one thread in the executor will never touch 
InputMetrics, now and in the future? I've already made the change to add the 
counting into InterruptibleIterator, but I'd like clarity on AtomicLong vs. 
Long.





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4067#issuecomment-72545450
  
  [Test build #26530 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26530/consoleFull)
 for   PR 4067 at commit 
[`1aaa980`](https://github.com/apache/spark/commit/1aaa980f24c1dfcb9ef879db938231287ef60fcb).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class ChiSqSelectorModel (val selectedFeatures: Array[Int]) extends 
VectorTransformer `
  * `class ChiSqSelector (val numTopFeatures: Int) `






[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4067#issuecomment-72545460
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26530/
Test PASSed.





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-02 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/4067#issuecomment-72428547
  
@rxin yeah that seems good.

@ksakellis one other thing I realized, which is a little confusing: right 
now we report the bytes read from a cached RDD as soon as they are 
fetched/present on the executor, even if they have not been consumed by the 
task. Tracking consumption incrementally (in bytes) would be really hard, so 
maybe this is the best answer for now. It's a bit weird though - I think this 
logic was written before we sent incremental updates back. In terms of getting 
the total _records_ read, we might need to make the assumption that the 
iterator is consumed in its entirety.





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-02 Thread sryza
Github user sryza commented on the pull request:

https://github.com/apache/spark/pull/4067#issuecomment-72507773
  
If it would save on perf, I would be in favor of adding the counting to 
InterruptibleIterator.  In core loops like these, I think efficiency is more 
important than elegance.





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-02 Thread ksakellis
Github user ksakellis commented on a diff in the pull request:

https://github.com/apache/spark/pull/4067#discussion_r23940117
  
--- Diff: core/src/main/scala/org/apache/spark/CacheManager.scala ---
@@ -47,9 +49,13 @@ private[spark] class CacheManager(blockManager: 
BlockManager) extends Logging {
 val inputMetrics = blockResult.inputMetrics
 val existingMetrics = context.taskMetrics
   .getInputMetricsForReadMethod(inputMetrics.readMethod)
-existingMetrics.addBytesRead(inputMetrics.bytesRead)
+existingMetrics.incBytesRead(inputMetrics.bytesRead)
 
-new InterruptibleIterator(context, 
blockResult.data.asInstanceOf[Iterator[T]])
+val iter = blockResult.data.asInstanceOf[Iterator[T]]
+new InterruptibleIterator(context, 
AfterNextInterceptingIterator(iter, (next: T) => {
+  existingMetrics.incRecordsRead(1)
--- End diff --

This is a valid concern, but I followed the pattern we have with 
CompletionIterator and InterruptibleIterator. The CompletionIterator adds a 
function call after every hasNext() call - granted, this is at the block level 
(not record). The InterruptibleIterator adds a function call at each hasNext() 
and next(), at the record level too. Did we evaluate the performance impact of 
these iterators too? 

@pwendell can you explain to me where the locking occurs? Calling 
incRecordsRead calls AtomicLong.addAndGet, which uses a compare-and-swap. I 
don't see any locking. Since this is going to be called by one thread (the task 
thread), the compare should never fail, so there shouldn't be any retrying. 

@rxin I think adding the counting in the InterruptibleIterator is not a 
good separation of concerns. Also, the InterruptibleIterator is used in a few 
places, so we'd have to add some switching logic to know whether we should 
update the records-read metrics.
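For readers of the archive, a minimal sketch of the CompletionIterator pattern referenced above: a wrapper that fires a callback the first time the underlying iterator is exhausted. The names here are illustrative, not Spark's actual private API.

```scala
// Sketch: run a completion callback once hasNext() first returns false.
class SketchCompletionIterator[A](sub: Iterator[A], completion: () => Unit)
    extends Iterator[A] {
  private var completed = false
  def hasNext: Boolean = {
    val r = sub.hasNext
    if (!r && !completed) { completed = true; completion() }
    r
  }
  def next(): A = sub.next()
}

object CompletionDemo extends App {
  var done = false
  val it = new SketchCompletionIterator(Iterator(1, 2, 3), () => done = true)
  val total = it.sum          // consuming the iterator triggers the callback
  assert(total == 6 && done)
  println(s"sum=$total done=$done")
}
```

Note the callback cost here is paid once per iterator, not once per record, which is the distinction being drawn in the thread.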






[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-02 Thread squito
Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/4067#discussion_r23942254
  
--- Diff: core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala ---
@@ -203,8 +206,11 @@ private[ui] class StagePage(parent: StagesTab) extends 
WebUIPage("stage") {
   None
 }
 else {
+  def getDistributionQuantities(data: Seq[Double]): 
IndexedSeq[Double] =
--- End diff --

typo: `getDistributionQuantiles`, (with an `l`)





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-02 Thread ksakellis
Github user ksakellis commented on the pull request:

https://github.com/apache/spark/pull/4067#issuecomment-72498987
  
@pwendell I'm not sure how we can do what you propose without an 
O(n) loop through all the records before passing the InterruptibleIterator. We 
could do something fancy like counting incrementally and, when we finish the 
task, if there are more records left, looping to count the rest of the 
unread records. I don't think the complication is worth it. Also, I think 
reporting the accurate records read is better. 

Alternatively, we can fix the bytesRead to be more accurate. Right now they 
are computed in ShuffleBlockFetcherIterator and calculated based on the blocks 
fetched. Since we do the flatMap on that iterator in BlockStoreShuffleFetcher 
we report that we read all the bytes even if we didn't. We can move the 
bytesRead collection out of ShuffleBlockFetcherIterator and move it into the 
same iterator that computes the # records read. So they line up and are more 
accurate. 





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-02 Thread sryza
Github user sryza commented on a diff in the pull request:

https://github.com/apache/spark/pull/4067#discussion_r23942054
  
--- Diff: core/src/main/scala/org/apache/spark/CacheManager.scala ---
@@ -47,9 +49,13 @@ private[spark] class CacheManager(blockManager: 
BlockManager) extends Logging {
 val inputMetrics = blockResult.inputMetrics
 val existingMetrics = context.taskMetrics
   .getInputMetricsForReadMethod(inputMetrics.readMethod)
-existingMetrics.addBytesRead(inputMetrics.bytesRead)
+existingMetrics.incBytesRead(inputMetrics.bytesRead)
 
-new InterruptibleIterator(context, 
blockResult.data.asInstanceOf[Iterator[T]])
+val iter = blockResult.data.asInstanceOf[Iterator[T]]
+new InterruptibleIterator(context, 
AfterNextInterceptingIterator(iter, (next: T) => {
+  existingMetrics.incRecordsRead(1)
--- End diff --

@ksakellis I think Patrick means locking in the sense that `AtomicLong` and 
its ilk use special compare-and-swap hardware instructions that lock the bus.  
They're more performant than acquiring a software lock, but still have high 
overhead compared to updating a variable that's in CPU cache.

http://en.wikipedia.org/wiki/Compare-and-swap
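A small sketch of the two counter styles being compared: an `AtomicLong`, whose increment is a hardware compare-and-swap, versus a `@volatile` var, which is cheaper per update but is only safe with a single writer. The object name is illustrative.

```scala
import java.util.concurrent.atomic.AtomicLong

object CounterStyles extends App {
  val atomic = new AtomicLong()   // CAS per increment; safe under concurrency
  @volatile var single: Long = 0L // cheaper, but read-modify-write is not
                                  // atomic: safe only for one writer thread

  (1 to 1000).foreach { _ =>
    atomic.incrementAndGet()
    single += 1
  }
  assert(atomic.get == 1000L && single == 1000L)
  println(s"atomic=${atomic.get} single=$single")
}
```

With a single writer the CAS never retries, so the remaining cost is the instruction itself plus any cacheline traffic, which is the point debated below.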





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-02 Thread squito
Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/4067#discussion_r23948987
  
--- Diff: core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala ---
@@ -472,12 +512,12 @@ private[ui] class StagePage(parent: StagesTab) 
extends WebUIPage("stage") {
 }}
 {if (hasInput) {
  <td sorttable_customkey={inputSortable}>
-{inputReadable}
+{s"$inputReadable / $inputRecords"}
--- End diff --

from the way the json serialization is done, I think you'll get something 
weird here when loading data from old runs that didn't have this field.  Each 
task is assigned a `-1`, so you end up with some negative number at the stage 
summary. E.g., I just ran a little test and I see "Input Size / Records" as 
"160.0 B / -8".





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-02 Thread ksakellis
Github user ksakellis commented on a diff in the pull request:

https://github.com/apache/spark/pull/4067#discussion_r23950220
  
--- Diff: core/src/main/scala/org/apache/spark/CacheManager.scala ---
@@ -47,9 +49,13 @@ private[spark] class CacheManager(blockManager: 
BlockManager) extends Logging {
 val inputMetrics = blockResult.inputMetrics
 val existingMetrics = context.taskMetrics
   .getInputMetricsForReadMethod(inputMetrics.readMethod)
-existingMetrics.addBytesRead(inputMetrics.bytesRead)
+existingMetrics.incBytesRead(inputMetrics.bytesRead)
 
-new InterruptibleIterator(context, 
blockResult.data.asInstanceOf[Iterator[T]])
+val iter = blockResult.data.asInstanceOf[Iterator[T]]
+new InterruptibleIterator(context, 
AfterNextInterceptingIterator(iter, (next: T) => {
+  existingMetrics.incRecordsRead(1)
--- End diff --

@sryza right. So how do you propose we increment the bytes and records read 
in a thread-safe way? If we use a @volatile Long, we can't safely do an increment 
unless we guarantee that only one thread is accessing the InputMetrics at any 
one time. I guess this is an okay assumption now, but doesn't that open us 
up to race conditions down the line when we add more multithreading? 

Looking at 
http://stackoverflow.com/questions/2538070/atomic-operation-cost it doesn't 
seem like the cost of CAS is that high: there are at most 2 cacheline misses for 
this integer, and only 1 if other CPUs are not reading and writing from it. Am 
I misinterpreting this?






[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-02 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/4067#discussion_r23950541
  
--- Diff: core/src/main/scala/org/apache/spark/CacheManager.scala ---
@@ -47,9 +49,13 @@ private[spark] class CacheManager(blockManager: 
BlockManager) extends Logging {
 val inputMetrics = blockResult.inputMetrics
 val existingMetrics = context.taskMetrics
   .getInputMetricsForReadMethod(inputMetrics.readMethod)
-existingMetrics.addBytesRead(inputMetrics.bytesRead)
+existingMetrics.incBytesRead(inputMetrics.bytesRead)
 
-new InterruptibleIterator(context, 
blockResult.data.asInstanceOf[Iterator[T]])
+val iter = blockResult.data.asInstanceOf[Iterator[T]]
+new InterruptibleIterator(context, 
AfterNextInterceptingIterator(iter, (next: T) => {
+  existingMetrics.incRecordsRead(1)
--- End diff --

Do we ever consume the iterator on separate threads? 





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-02 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/4067#discussion_r23950681
  
--- Diff: core/src/main/scala/org/apache/spark/CacheManager.scala ---
@@ -47,9 +49,13 @@ private[spark] class CacheManager(blockManager: 
BlockManager) extends Logging {
 val inputMetrics = blockResult.inputMetrics
 val existingMetrics = context.taskMetrics
   .getInputMetricsForReadMethod(inputMetrics.readMethod)
-existingMetrics.addBytesRead(inputMetrics.bytesRead)
+existingMetrics.incBytesRead(inputMetrics.bytesRead)
 
-new InterruptibleIterator(context, 
blockResult.data.asInstanceOf[Iterator[T]])
+val iter = blockResult.data.asInstanceOf[Iterator[T]]
+new InterruptibleIterator(context, 
AfterNextInterceptingIterator(iter, (next: T) => {
+  existingMetrics.incRecordsRead(1)
--- End diff --

And yes, every time we add some sort of iterator like this, we do a lot of 
performance benchmarks. InterruptibleIterator was added long before the project 
was an official ASF project, and even back then we tested performance.  





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-02 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/4067#discussion_r23913204
  
--- Diff: core/src/main/scala/org/apache/spark/CacheManager.scala ---
@@ -47,9 +49,13 @@ private[spark] class CacheManager(blockManager: 
BlockManager) extends Logging {
 val inputMetrics = blockResult.inputMetrics
 val existingMetrics = context.taskMetrics
   .getInputMetricsForReadMethod(inputMetrics.readMethod)
-existingMetrics.addBytesRead(inputMetrics.bytesRead)
+existingMetrics.incBytesRead(inputMetrics.bytesRead)
 
-new InterruptibleIterator(context, 
blockResult.data.asInstanceOf[Iterator[T]])
+val iter = blockResult.data.asInstanceOf[Iterator[T]]
+new InterruptibleIterator(context, 
AfterNextInterceptingIterator(iter, (next: T) => {
+  existingMetrics.incRecordsRead(1)
--- End diff --

Hm, this seems prohibitively expensive. This will add multiple function 
invocations and force hardware locking for every `next` call (or at least some 
internal checks if it's changed to `volatile`). Have you looked at the 
performance implications of this? For instance, try reading a good amount 
of off-heap data from many threads and see if it regresses performance.
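A minimal, self-contained sketch of the per-record interception being debated here: `afterNext()` runs on every `next()` call, which is the per-record cost the comment flags. The class and object names are hypothetical, not Spark's actual private API.

```scala
// Sketch: intercept every record as it passes through the iterator.
abstract class SketchAfterNextIterator[A](sub: Iterator[A]) extends Iterator[A] {
  def hasNext: Boolean = sub.hasNext
  def next(): A = afterNext(sub.next())   // one extra call per record
  def afterNext(next: A): A
}

object PerRecordDemo extends App {
  var records = 0L
  val it = new SketchAfterNextIterator(Iterator.fill(5)("row")) {
    override def afterNext(next: String): String = { records += 1; next }
  }
  it.foreach(_ => ())
  assert(records == 5L)
  println(s"records=$records")
}
```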





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-02 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/4067#issuecomment-72425732
  
Having a count does seem like a good idea, but I think the current approach 
of adding function calls for each iteration is too expensive. What about using 
a counting iterator and just asking it at the end?
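A sketch of the alternative suggested here: keep the count in a local field of the wrapper and read the total once, at the end, instead of updating shared metrics on every record. Names are illustrative, not Spark's actual API.

```scala
// Sketch: a counting iterator queried once after consumption.
class SketchCountingIterator[A](sub: Iterator[A]) extends Iterator[A] {
  private var _count = 0L                 // plain field, single-threaded access
  def count: Long = _count
  def hasNext: Boolean = sub.hasNext
  def next(): A = { _count += 1; sub.next() }
}

object CountAtEndDemo extends App {
  val it = new SketchCountingIterator(Iterator.range(0, 10))
  it.foreach(_ => ())
  assert(it.count == 10L)   // one read at the end, no per-record CAS
  println(s"count=${it.count}")
}
```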





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-02-02 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/4067#issuecomment-72426117
  
BTW we can also just add counting directly to InterruptibleIterator





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-01-30 Thread ksakellis
Github user ksakellis commented on the pull request:

https://github.com/apache/spark/pull/4067#issuecomment-72292701
  
New Screenshot that correspond to CR feedback.

![screen shot 2015-01-30 at 7 07 48 
pm](https://cloud.githubusercontent.com/assets/6590087/5985688/871415e4-a8b3-11e4-81e0-5c35cafb1ca1.png)
Shows a stage that has Input Metrics (reading from a file) and writes data 
for the next stage.

![screen shot 2015-01-30 at 7 08 01 
pm](https://cloud.githubusercontent.com/assets/6590087/5985693/a596674c-a8b3-11e4-9785-c49e64eda853.png)
Shows a stage that has both shuffle reading and writing - no input or 
output metrics.

![screen shot 2015-01-30 at 7 08 15 
pm](https://cloud.githubusercontent.com/assets/6590087/5985698/b2fefb92-a8b3-11e4-9f6b-237d4ad1d934.png)
Shows a stage that outputs to a file.





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-01-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4067#issuecomment-72297442
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26441/
Test PASSed.





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-01-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4067#issuecomment-72292852
  
  [Test build #26441 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26441/consoleFull)
 for   PR 4067 at commit 
[`fe3f715`](https://github.com/apache/spark/commit/fe3f715819e19eba122e5973d5172fa190a80210).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-01-30 Thread sryza
Github user sryza commented on a diff in the pull request:

https://github.com/apache/spark/pull/4067#discussion_r23884482
  
--- Diff: core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala 
---
@@ -241,21 +242,22 @@ object DataWriteMethod extends Enumeration with 
Serializable {
  */
 @DeveloperApi
 case class InputMetrics(readMethod: DataReadMethod.Value) {
-
-  private val _bytesRead: AtomicLong = new AtomicLong()
+ 
@volatile @transient var bytesReadCallback: Option[() => Long] = None
 
   /**
* Total bytes read.
*/
+  private val _bytesRead: AtomicLong = new AtomicLong()
   def bytesRead: Long = _bytesRead.get()
-  @volatile @transient var bytesReadCallback: Option[() => Long] = None
+  def incBytesRead(bytes: Long) = _bytesRead.addAndGet(bytes)
 
   /**
-   * Adds additional bytes read for this read method.
+   * Total records read.
*/
-  def addBytesRead(bytes: Long) = {
-_bytesRead.addAndGet(bytes)
-  }
+  def recordsRead: Long = _recordsRead.get() 
+  private val _recordsRead: AtomicLong = new AtomicLong()
--- End diff --

This will only ever be modified by a single thread, right? So better to 
make it volatile than atomic long?





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-01-30 Thread sryza
Github user sryza commented on a diff in the pull request:

https://github.com/apache/spark/pull/4067#discussion_r23884519
  
--- Diff: core/src/main/scala/org/apache/spark/ui/ToolTips.scala ---
@@ -29,14 +29,15 @@ private[spark] object ToolTips {
  val SHUFFLE_READ_BLOCKED_TIME =
"Time that the task spent blocked waiting for shuffle data to be read 
from remote machines."
 
-  val INPUT = "Bytes read from Hadoop or from Spark storage."
+  val INPUT = "Bytes and # records read from Hadoop or from Spark storage."
--- End diff --

Technically it's # bytes as well, right? I think this would be clearer as 
just "Bytes and records read".





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-01-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4067#issuecomment-72297436
  
  [Test build #26441 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26441/consoleFull)
 for   PR 4067 at commit 
[`fe3f715`](https://github.com/apache/spark/commit/fe3f715819e19eba122e5973d5172fa190a80210).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class AfterNextInterceptingIterator[A](sub: Iterator[A]) extends 
Iterator[A] `






[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-01-30 Thread ksakellis
Github user ksakellis commented on the pull request:

https://github.com/apache/spark/pull/4067#issuecomment-72292815
  
@pwendell can you please re-review this? I'd like to get it into 1.3. Some 
of our customers have been asking for metrics to help them determine data skew. 
Thx.





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-01-21 Thread sryza
Github user sryza commented on a diff in the pull request:

https://github.com/apache/spark/pull/4067#discussion_r23284264
  
--- Diff: 
core/src/main/scala/org/apache/spark/shuffle/hash/BlockStoreShuffleFetcher.scala
 ---
@@ -82,7 +82,15 @@ private[hash] object BlockStoreShuffleFetcher extends 
Logging {
   SparkEnv.get.conf.getLong("spark.reducer.maxMbInFlight", 48) * 1024 
* 1024)
 val itr = blockFetcherItr.flatMap(unpackBlock)
 
-val completionIter = CompletionIterator[T, Iterator[T]](itr, {
+val itr2 = new AfterNextInterceptingIterator[T](itr) {
+  val readMetrics = 
context.taskMetrics().createShuffleReadMetricsForDependency()
+  override def afterNext(next: T) : T = {
--- End diff --

remove space between close paren and colon





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-01-21 Thread sryza
Github user sryza commented on a diff in the pull request:

https://github.com/apache/spark/pull/4067#discussion_r23284115
  
--- Diff: core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala 
---
@@ -238,6 +245,10 @@ case class InputMetrics(readMethod: 
DataReadMethod.Value) {
 _bytesRead.addAndGet(bytes)
   }
 
+  def addRecordsRead(records: Long) = {
--- End diff --

This should be `incRecordsRead` in keeping with SPARK-3288.





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-01-21 Thread sryza
Github user sryza commented on a diff in the pull request:

https://github.com/apache/spark/pull/4067#discussion_r23284251
  
--- Diff: 
core/src/main/scala/org/apache/spark/shuffle/hash/BlockStoreShuffleFetcher.scala
 ---
@@ -82,7 +82,15 @@ private[hash] object BlockStoreShuffleFetcher extends 
Logging {
   SparkEnv.get.conf.getLong("spark.reducer.maxMbInFlight", 48) * 1024 
* 1024)
 val itr = blockFetcherItr.flatMap(unpackBlock)
 
-val completionIter = CompletionIterator[T, Iterator[T]](itr, {
+val itr2 = new AfterNextInterceptingIterator[T](itr) {
+  val readMetrics = 
context.taskMetrics().createShuffleReadMetricsForDependency()
--- End diff --

Nit: take out parens after `taskMetrics`





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-01-21 Thread sryza
Github user sryza commented on a diff in the pull request:

https://github.com/apache/spark/pull/4067#discussion_r23284154
  
--- Diff: core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala 
---
@@ -317,4 +338,9 @@ class ShuffleWriteMetrics extends Serializable {
* Time the task spent blocking on writes to disk or buffer cache, in 
nanoseconds
*/
   @volatile var shuffleWriteTime: Long = _
+
+  /**
+   * Total number of records written from the shuffle by this task
--- End diff --

Nit: written *to* the shuffle





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-01-21 Thread sryza
Github user sryza commented on a diff in the pull request:

https://github.com/apache/spark/pull/4067#discussion_r23284162
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -41,7 +41,7 @@ import org.apache.spark._
 import org.apache.spark.annotation.DeveloperApi
 import org.apache.spark.broadcast.Broadcast
 import org.apache.spark.deploy.SparkHadoopUtil
-import org.apache.spark.executor.{DataReadMethod, InputMetrics}
+import org.apache.spark.executor.{DataReadMethod}
--- End diff --

Take out curly braces





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-01-16 Thread ksakellis
Github user ksakellis commented on the pull request:

https://github.com/apache/spark/pull/4067#issuecomment-70220867
  
![screen shot 2015-01-16 at 12 04 04 
am](https://cloud.githubusercontent.com/assets/6590087/5773199/bcd70ffc-9d13-11e4-86c1-7140a73f9e92.png)
Shows a stage that has input metrics (reading from a file) and writes data for the next stage.

![screen shot 2015-01-16 at 12 04 15 
am](https://cloud.githubusercontent.com/assets/6590087/5773207/e252d90a-9d13-11e4-86d6-d1e77500785c.png)
Shows a stage with both shuffle reads and writes, and no input or output metrics.

![screen shot 2015-01-16 at 12 04 25 
am](https://cloud.githubusercontent.com/assets/6590087/5773215/048fdef0-9d14-11e4-877f-229922bfe967.png)
Shows a stage that outputs to a file.





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-01-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4067#issuecomment-70224780
  
**[Test build #25638 timed 
out](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25638/consoleFull)**
 for PR 4067 at commit 
[`1572054`](https://github.com/apache/spark/commit/157205446fbf474877195d7a30ada86e31e836c2)
 after a configured wait of `120m`.





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-01-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4067#issuecomment-70224784
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25638/
Test FAILed.





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-01-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4067#issuecomment-70225110
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25641/
Test PASSed.





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-01-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4067#issuecomment-70225105
  
  [Test build #25641 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25641/consoleFull)
 for   PR 4067 at commit 
[`3c2d021`](https://github.com/apache/spark/commit/3c2d021ae4879f844ce3a5f1fc761b015ab4b5a9).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class AfterNextInterceptingIterator[A](sub: Iterator[A]) extends 
Iterator[A] `






[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-01-16 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/4067#issuecomment-70229214
  
What about combining the input size and records in the same column? Overall this will help limit the expansion in the number of columns. The title could be "Input Size / Records".





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-01-16 Thread ksakellis
Github user ksakellis commented on the pull request:

https://github.com/apache/spark/pull/4067#issuecomment-70230969
  
If we do that, we wouldn't be able to sort on record count and bytes independently.





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-01-16 Thread ksakellis
Github user ksakellis commented on the pull request:

https://github.com/apache/spark/pull/4067#issuecomment-70301522
  
A big motivation for adding recordsRead/Written was to detect data skew. In those cases bytes and records might not track very closely.

Thinking more about this, I suspect that an Avg. Record Size column (bytesRead / recordsRead) would be what you'd want to sort on. We could add this metric to the UI, make it sortable, and then combine the bytesRead and recordsRead metrics into one column. Thoughts?
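As a rough illustration of the derived metric being proposed (a hypothetical helper, not code from this PR):

```scala
// Hypothetical helper (not code from this PR): derive average record size
// from the two raw metrics, guarding against stages that read no records.
def avgRecordSize(bytesRead: Long, recordsRead: Long): Double =
  if (recordsRead == 0L) 0.0 else bytesRead.toDouble / recordsRead

// Equal byte counts with very different record counts sort far apart on
// this column, which is exactly the skew signal discussed above.
assert(avgRecordSize(4096L, 1024L) == 4.0)
assert(avgRecordSize(4096L, 4L) == 1024.0)
assert(avgRecordSize(0L, 0L) == 0.0)
```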





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-01-15 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/4067#discussion_r23065199
  
--- Diff: 
core/src/main/scala/org/apache/spark/util/interceptingIterator.scala ---
@@ -0,0 +1,82 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util
+
+/**
+ * A Wrapper for iterators where a caller can get hooks before/after
+ * iterator methods.
+ * @param sub The iterator being wrapped.
+ * @tparam A the iterable type
+ */
+private[spark]
+class InterceptingIterator[A](sub: Iterator[A]) extends Iterator[A] {
--- End diff --

Can we avoid this? Seems fairly expensive by adding a lot more method calls 
...





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-01-15 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/4067#issuecomment-70214097
  
Can you also paste some screenshots on what the UI changes look like? 
Thanks.





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-01-15 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/4067#discussion_r23065223
  
--- Diff: 
core/src/main/scala/org/apache/spark/shuffle/hash/BlockStoreShuffleFetcher.scala
 ---
@@ -82,7 +82,16 @@ private[hash] object BlockStoreShuffleFetcher extends 
Logging {
   SparkEnv.get.conf.getLong("spark.reducer.maxMbInFlight", 48) * 1024 
* 1024)
 val itr = blockFetcherItr.flatMap(unpackBlock)
 
-val completionIter = CompletionIterator[T, Iterator[T]](itr, {
+val itr2 = new InterceptingIterator[T](itr) {
+  val readMetrics = 
context.taskMetrics().createShuffleReadMetricsForDependency()
+  override def afterNext(next: T) : T = {
+readMetrics.recordsRead += 1
+logError("Read record " + next)
--- End diff --

this is not intended to be here, is it?





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-01-15 Thread ksakellis
Github user ksakellis commented on a diff in the pull request:

https://github.com/apache/spark/pull/4067#discussion_r23065373
  
--- Diff: 
core/src/main/scala/org/apache/spark/shuffle/hash/BlockStoreShuffleFetcher.scala
 ---
@@ -82,7 +82,16 @@ private[hash] object BlockStoreShuffleFetcher extends 
Logging {
   SparkEnv.get.conf.getLong("spark.reducer.maxMbInFlight", 48) * 1024 
* 1024)
 val itr = blockFetcherItr.flatMap(unpackBlock)
 
-val completionIter = CompletionIterator[T, Iterator[T]](itr, {
+val itr2 = new InterceptingIterator[T](itr) {
+  val readMetrics = 
context.taskMetrics().createShuffleReadMetricsForDependency()
+  override def afterNext(next: T) : T = {
+readMetrics.recordsRead += 1
+logError("Read record " + next)
--- End diff --

whoops.. nope.





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-01-15 Thread ksakellis
Github user ksakellis commented on a diff in the pull request:

https://github.com/apache/spark/pull/4067#discussion_r23065399
  
--- Diff: 
core/src/main/scala/org/apache/spark/util/interceptingIterator.scala ---
@@ -0,0 +1,82 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util
+
+/**
+ * A Wrapper for iterators where a caller can get hooks before/after
+ * iterator methods.
+ * @param sub The iterator being wrapped.
+ * @tparam A the iterable type
+ */
+private[spark]
+class InterceptingIterator[A](sub: Iterator[A]) extends Iterator[A] {
--- End diff --

So this is supposed to be a generic way of intercepting iterators. Without it, I'd have to do something custom like CompletionIterator; I was trying to make something reusable.
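For readers following along, the reusable wrapper being discussed can be sketched roughly like this (a minimal illustration with assumed hook names, not the PR's actual code):

```scala
// Minimal sketch of a hook-based iterator wrapper. Subclasses override
// afterNext to observe each element as it streams through.
class InterceptingIterator[A](sub: Iterator[A]) extends Iterator[A] {
  // Hook invoked with every element produced by next(); default is a no-op.
  def afterNext(elem: A): A = elem
  override def hasNext: Boolean = sub.hasNext
  override def next(): A = afterNext(sub.next())
}

// Usage: count records without materializing the underlying iterator.
var recordsRead = 0L
val counted = new InterceptingIterator[Int](Iterator(1, 2, 3)) {
  override def afterNext(elem: Int): Int = { recordsRead += 1; elem }
}
counted.foreach(_ => ())
assert(recordsRead == 3L)
```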





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-01-15 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/4067#discussion_r23065459
  
--- Diff: 
core/src/main/scala/org/apache/spark/util/interceptingIterator.scala ---
@@ -0,0 +1,82 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util
+
+/**
+ * A Wrapper for iterators where a caller can get hooks before/after
+ * iterator methods.
+ * @param sub The iterator being wrapped.
+ * @tparam A the iterable type
+ */
+private[spark]
+class InterceptingIterator[A](sub: Iterator[A]) extends Iterator[A] {
--- End diff --

Yeah, the thing is there are only a limited number of places where you'd need to increment the counters. I'm not sure this super-generic design is worth it, unless you want to do a lot of performance studies of the differences...





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-01-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4067#issuecomment-70215215
  
  [Test build #25638 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25638/consoleFull)
 for   PR 4067 at commit 
[`1572054`](https://github.com/apache/spark/commit/157205446fbf474877195d7a30ada86e31e836c2).
 * This patch **does not merge cleanly**.





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-01-15 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/4067#discussion_r23065524
  
--- Diff: core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala 
---
@@ -179,10 +223,48 @@ object DataWriteMethod extends Enumeration with 
Serializable {
  */
 @DeveloperApi
 case class InputMetrics(readMethod: DataReadMethod.Value) {
+
+  private val _bytesRead: AtomicLong = new AtomicLong()
+  private val _recordsRead: AtomicLong = new AtomicLong()
+
   /**
* Total bytes read.
*/
-  var bytesRead: Long = 0L
+  def bytesRead: Long = _bytesRead.get()
+
+  /**
+   * Total records read.
+   */
+  def recordsRead: Long = _recordsRead.get()
  @volatile @transient var bytesReadCallback: Option[() => Long] = None
--- End diff --

can you explain what this does?





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-01-15 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/4067#discussion_r23065562
  
--- Diff: core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala 
---
@@ -134,6 +149,30 @@ class TaskMetrics extends Serializable {
   }
 
   /**
+   * Returns the input metrics object that the task should use. Currently, 
if
+   * there exists an input metric with the same readMethod, we return that 
one
+   * so the caller can accumulate bytes read. If the readMethod is 
different
+   * than previously seen by this task, we return a new InputMetric but 
don't
+   * record it.
+   *
+   * Once https://issues.apache.org/jira/browse/SPARK-5225 is addressed,
+   * we can store all the different inputMetrics (one per readMethod).
+   */
+  private[spark] def getInputMetricsForReadMethod(readMethod: 
DataReadMethod):
--- End diff --

```scala
  private[spark] def getInputMetricsForReadMethod(readMethod: 
DataReadMethod): InputMetrics =
synchronized {
  
  }
```
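Filling in that skeleton per the doc comment quoted earlier in the thread, the behavior might look roughly like this self-contained sketch (types stubbed for illustration; the real classes live in org.apache.spark.executor):

```scala
// Stubbed stand-ins for the real Spark classes, to keep the sketch runnable.
object DataReadMethod extends Enumeration { val Memory, Disk, Hadoop = Value }
case class InputMetrics(readMethod: DataReadMethod.Value)

class TaskMetricsSketch {
  private var _inputMetrics: Option[InputMetrics] = None

  def getInputMetricsForReadMethod(readMethod: DataReadMethod.Value): InputMetrics =
    synchronized {
      _inputMetrics match {
        case None =>
          val m = InputMetrics(readMethod)
          _inputMetrics = Some(m)   // first read method seen: record it
          m
        case Some(m) if m.readMethod == readMethod =>
          m                         // same method: caller accumulates into it
        case Some(_) =>
          InputMetrics(readMethod)  // different method: returned, not recorded
      }
    }
}

// Repeated calls with the same read method return the recorded instance;
// a different method yields a fresh, unrecorded one.
val tm = new TaskMetricsSketch
val a = tm.getInputMetricsForReadMethod(DataReadMethod.Hadoop)
val b = tm.getInputMetricsForReadMethod(DataReadMethod.Hadoop)
val c = tm.getInputMetricsForReadMethod(DataReadMethod.Memory)
assert(a eq b)
assert(!(a eq c))
```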





[GitHub] spark pull request: [SPARK-4874] [CORE] Collect record count metri...

2015-01-15 Thread ksakellis
Github user ksakellis commented on the pull request:

https://github.com/apache/spark/pull/4067#issuecomment-70215626
  
This change depended on https://github.com/apache/spark/pull/3120, which just got merged, and now there are some merge conflicts. I need to fix those first and will update the PR.




