[GitHub] spark pull request: SPARK-2526: Simplify options in make-distribut...

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1445#issuecomment-49202151
  
QA tests have started for PR 1445. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16738/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2517] Removed some compiler type erasur...

2014-07-16 Thread rxin
Github user rxin closed the pull request at:

https://github.com/apache/spark/pull/1431


---


[GitHub] spark pull request: [SPARK-2517] Removed some compiler type erasur...

2014-07-16 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1431#issuecomment-49202256
  
SGTM.


---


[GitHub] spark pull request: [SPARK-2525][SQL] Remove as many compilation w...

2014-07-16 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1444


---


[GitHub] spark pull request: [SPARK-2525][SQL] Remove as many compilation w...

2014-07-16 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1444#issuecomment-49202444
  
I've merged this in master & branch-1.0.


---


[GitHub] spark pull request: [SQL] Cleaned up ConstantFolding slightly.

2014-07-16 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1430#issuecomment-49202491
  
Merging in master & branch-1.0.


---


[GitHub] spark pull request: [SPARK-2154] Schedule next Driver when one com...

2014-07-16 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1405#issuecomment-49202648
  
Looks good to me -- we can put it in both 1.0.2 and 0.9.2 as well.


---


[GitHub] spark pull request: [SQL] Cleaned up ConstantFolding slightly.

2014-07-16 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1430


---


[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...

2014-07-16 Thread kayousterhout
Github user kayousterhout commented on the pull request:

https://github.com/apache/spark/pull/1313#issuecomment-49202752
  
Also cc @mateiz


---


[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...

2014-07-16 Thread kayousterhout
Github user kayousterhout commented on a diff in the pull request:

https://github.com/apache/spark/pull/1313#discussion_r15016112
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -353,6 +350,14 @@ private[spark] class TaskSetManager(
       for (index <- findTaskFromList(execId, getPendingTasksForHost(host))) {
         return Some((index, TaskLocality.NODE_LOCAL, false))
       }
+      // Look for noPref tasks after NODE_LOCAL for minimize cross-rack traffic
+      for (index <- findTaskFromList(execId, pendingTasksWithNoPrefs)) {
+        return Some((index, TaskLocality.PROCESS_LOCAL, false))
+      }
+      // find a speculative task if all noPref tasks have been scheduled
+      val specTask = findSpeculativeTask(execId, host, locality).map {
+        case (taskIndex, allowedLocality) => (taskIndex, allowedLocality, true)}
+      if (specTask != None) return specTask
--- End diff --

What happens if the maximum current allowed locality level (based on delay 
scheduling) is PROCESS_LOCAL?  It seems like that will mean tasks in 
pendingTasksWithNoPrefs can't be scheduled (they won't be scheduled until the 
first timeout expires) -- which seems bad?


---


[GitHub] spark pull request: SPARK-2465. Use long as user / item ID for ALS

2014-07-16 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/1393#issuecomment-49203419
  
@mateiz I think it would mean mostly cloning `ALS.scala`, as the `Rating` 
object is woven throughout. Probably some large chunks could be refactored and 
shared. Is that what you mean? Even then, I'm not sure two APIs are worth the 
trouble. When I get to that point and see what it takes to make a 64-bit key 
implementation, I can propose what it would look like.


---


[GitHub] spark pull request: [SQL] Cleaned up ConstantFolding slightly.

2014-07-16 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1430#issuecomment-49203889
  
Actually only master. There is a conflict in branch-1.0, so I will leave 
it intact for now.


---


[GitHub] spark pull request: SPARK-2519. Eliminate pattern-matching on Tupl...

2014-07-16 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1435#issuecomment-49203999
  
Merging in master. Thanks.


---


[GitHub] spark pull request: SPARK-1097: Do not introduce deadlock while fi...

2014-07-16 Thread aarondav
Github user aarondav commented on the pull request:

https://github.com/apache/spark/pull/1409#issuecomment-49204042
  
Jenkins, test this please.


---


[GitHub] spark pull request: SPARK-2519. Eliminate pattern-matching on Tupl...

2014-07-16 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1435


---


[GitHub] spark pull request: SPARK-1097: Do not introduce deadlock while fi...

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1409#issuecomment-49204801
  
QA tests have started for PR 1409. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16740/consoleFull


---


[GitHub] spark pull request: [SPARK-2518][SQL] Fix foldability of Substring...

2014-07-16 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1432#issuecomment-49204855
  
Merging in master and branch-1.0. Thanks. 


---


[GitHub] spark pull request: [SPARK-2518][SQL] Fix foldability of Substring...

2014-07-16 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1432


---


[GitHub] spark pull request: [SPARK-2509][SQL] Add optimization for Substri...

2014-07-16 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1428#issuecomment-49205341
  
It doesn't bring much benefit right now, but what we are doing here is 
creating patterns in NullPropagation to specify the semantics of each 
individual expression ... which is not very scalable for maintaining this code base.


---


[GitHub] spark pull request: [SPARK-2033] Automatically cleanup checkpoint

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/855#issuecomment-49205391
  
QA results for PR 855:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16731/consoleFull


---


[GitHub] spark pull request: [SPARK-1477]: Add the lifecycle interface

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/991#issuecomment-49205438
  
QA tests have started for PR 991. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16741/consoleFull


---


[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...

2014-07-16 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1313#issuecomment-49205621
  
I'm curious about one thing here: doesn't this change mean that we might 
wait longer before launching a no-prefs or speculative task, due to delay 
scheduling? This is because in TaskSetManager we take the preferredLocality and 
make it no bigger than our allowed locality level based on delays. The unit 
tests seem to add extra sleeps for this reason.

I don't really like this behavior, since with both kinds of tasks you'd like 
to launch them as soon as possible. No-prefs tasks can just run, and for 
speculative tasks you want to launch them ASAP in situations like Spark 
Streaming, where a 3-second delay could mess up your stream latency.

It might be better to add a special preferredLocality value for no 
preference, and always call the TaskSetManager with that value after calling it 
with the others, then have a code path in there that deals with that specially 
(not taking the min of that with allowedLocality).
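
A rough sketch of that idea (hypothetical names, not the PR's or Spark's actual code) could look like:

~~~
object LocalityDemo {
  object TaskLocality extends Enumeration {
    val PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY = Value
  }

  // Today the requested level is clamped to the delay-scheduling level; a
  // dedicated NO_PREF value would bypass that clamp and launch right away.
  def effectiveLocality(requested: TaskLocality.Value,
                        allowed: TaskLocality.Value): TaskLocality.Value = {
    if (requested == TaskLocality.NO_PREF) requested
    else TaskLocality(math.min(requested.id, allowed.id))
  }
}
~~~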


---


[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...

2014-07-16 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1313#issuecomment-49205693
  
Looks like Kay pointed out the same issue while I was typing this.


---


[GitHub] spark pull request: [SQL] Add HiveDecimal & HiveVarchar support in...

2014-07-16 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1436#issuecomment-49205734
  
Unit test actually failed.


---


[GitHub] spark pull request: Improve ALS algorithm resource usage

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/929#issuecomment-49206560
  
QA results for PR 929:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16733/consoleFull


---


[GitHub] spark pull request: [SPARK-2525][SQL] Remove as many compilation w...

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1444#issuecomment-49206483
  
QA results for PR 1444:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16732/consoleFull


---


[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...

2014-07-16 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1313#issuecomment-49206624
  
By the way, as far as I can tell, findSpeculativeTask also takes a locality 
level as an argument, and doesn't return tasks that are farther away than that. 
Do we need to change it? It seems that it shouldn't interfere with other tasks. 
It's true that we might launch a node-local speculative task before a 
rack-local non-speculative task, but I think it's unlikely that we'll be 
wanting to speculate stuff at that point, so I'm not sure this is a problem.


---


[GitHub] spark pull request: [SPARK-2522] set default broadcast factory to ...

2014-07-16 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1437#issuecomment-49206767
  
Merging in master!


---


[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations

2014-07-16 Thread dorx
Github user dorx commented on a diff in the pull request:

https://github.com/apache/spark/pull/1367#discussion_r15017783
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmansCorrelation.scala
 ---
@@ -0,0 +1,102 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.correlation
+
+import org.apache.spark.Partitioner
+import org.apache.spark.SparkContext._
+import org.apache.spark.mllib.linalg.{DenseVector, Matrix, Vector}
+import org.apache.spark.rdd.{CoGroupedRDD, RDD}
+
+/**
+ * Compute Spearman's correlation for two RDDs of the type RDD[Double] or 
the correlation matrix
+ * for an RDD of the type RDD[Vector].
+ *
+ * Definition of Spearman's correlation can be found at
+ * http://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient
+ */
+object SpearmansCorrelation extends Correlation {
+
+  /**
+   * Compute Spearman's correlation for two datasets.
+   */
+  override def computeCorrelation(x: RDD[Double], y: RDD[Double]): Double 
= {
+computeCorrelationWithMatrixImpl(x, y)
+  }
+
+  /**
+   * Compute Spearman's correlation matrix S, for the input matrix, where 
S(i, j) is the
+   * correlation between column i and j.
+   */
+  override def computeCorrelationMatrix(X: RDD[Vector]): Matrix = {
+val indexed = X.zipWithIndex()
+// Attempt to checkpoint the RDD before splitting it into numCols 
RDD[Double]s to avoid
+// computing the lineage prefix multiple times.
+// If checkpoint directory not set, cache the RDD instead.
+try {
+  indexed.checkpoint()
+} catch {
+  case e: Exception => indexed.cache()
--- End diff --

FWIW I tried it with zipWithUniqueId, and, as expected, the results were 
wrong when more than 1 partition is used.


---


[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...

2014-07-16 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1313#issuecomment-49206975
  
BTW even in that case we could solve it by having more levels, basically 
pass in a different level to request speculative tasks after we requested 
non-speculative ones, or pass in a boolean called speculativeAllowed. It seems 
the main change really is to treat no-prefs as less local than node- and 
process-local, whereas in the past it was treated as equivalent to 
process-local.


---


[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...

2014-07-16 Thread mridulm
Github user mridulm commented on the pull request:

https://github.com/apache/spark/pull/1313#issuecomment-49207097
  
My rationale for suggesting noPref tasks after NODE_LOCAL was to ensure 
that noPref tasks do not preempt locality for NODE_LOCAL tasks (they can't for 
PROCESS_LOCAL).

I think noPrefs sort of breaks the current model where we wait for 
PROCESS_LOCAL, then NODE_LOCAL and so on, though I don't seem to have a good 
solution.

For example: if we have 20 executors, with 30 no pref tasks and 100 
NODE_LOCAL tasks.
Depending on the time taken for the no pref tasks, we can end up with 
most/none of the NODE_LOCAL tasks actually getting scheduled as NODE_LOCAL 
(since the corresponding executors might be running no pref tasks). This is 
slightly contrived :-) But it does have implications in general scheduling too.

I did not properly account for idle executors - which leads to suboptimal 
utilization in my suggestion above.


---


[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations

2014-07-16 Thread dorx
Github user dorx commented on a diff in the pull request:

https://github.com/apache/spark/pull/1367#discussion_r15017975
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmansCorrelation.scala
 ---
@@ -0,0 +1,102 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.correlation
+
+import org.apache.spark.Partitioner
+import org.apache.spark.SparkContext._
+import org.apache.spark.mllib.linalg.{DenseVector, Matrix, Vector}
+import org.apache.spark.rdd.{CoGroupedRDD, RDD}
+
+/**
+ * Compute Spearman's correlation for two RDDs of the type RDD[Double] or 
the correlation matrix
+ * for an RDD of the type RDD[Vector].
+ *
+ * Definition of Spearman's correlation can be found at
+ * http://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient
+ */
+object SpearmansCorrelation extends Correlation {
+
+  /**
+   * Compute Spearman's correlation for two datasets.
+   */
+  override def computeCorrelation(x: RDD[Double], y: RDD[Double]): Double 
= {
+computeCorrelationWithMatrixImpl(x, y)
+  }
+
+  /**
+   * Compute Spearman's correlation matrix S, for the input matrix, where 
S(i, j) is the
+   * correlation between column i and j.
+   */
+  override def computeCorrelationMatrix(X: RDD[Vector]): Matrix = {
+val indexed = X.zipWithIndex()
--- End diff --

FWIW I tried it with zipWithUniqueId, and, as expected, the results were 
wrong when more than 1 partition is used.


---


[GitHub] spark pull request: [SPARK-695] In DAGScheduler's getPreferredLocs...

2014-07-16 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/1362#discussion_r15018046
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -1107,7 +1106,6 @@ class DAGScheduler(
case shufDep: ShuffleDependency[_, _, _] =>
   val mapStage = getShuffleMapStage(shufDep, stage.jobId)
   if (!mapStage.isAvailable) {
-visitedStages += mapStage
--- End diff --

I'm curious, why did you remove this? It seems unrelated to the preferred 
locs fix, and it was necessary to prevent exponential explosion here.


---


[GitHub] spark pull request: [SPARK-2190][SQL] Specialized ColumnType for T...

2014-07-16 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1440#discussion_r15018072
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnStats.scala ---
@@ -344,21 +344,52 @@ private[sql] class StringColumnStats extends 
BasicColumnStats(STRING) {
   }
 
   override def contains(row: Row, ordinal: Int) = {
-!(upperBound eq null) && {
+(upperBound ne null) && {
--- End diff --

I think I read somewhere ne has better performance in certain cases...
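
As a small illustration of the `eq`/`ne` point (not tied to this PR): on AnyRef, `ne` is defined as `!(this eq that)`, so the two spellings below are equivalent reference checks, and neither can end up dispatching to `equals()` the way `==`/`!=` can.

~~~
val stats: AnyRef = "column stats"
assert(stats ne null)       // reference inequality
assert(!(stats eq null))    // the same check written with eq
~~~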


---


[GitHub] spark pull request: fix compile error of streaming project

2014-07-16 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/153#issuecomment-49207773
  
Merging this in master. Thanks.


---


[GitHub] spark pull request: fix compile error of streaming project

2014-07-16 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/153#issuecomment-49207755
  
I think I've seen this happening once in a while, but can't exactly 
reproduce it after a clean build. Anyway, it's better to explicitly define the 
return type for public methods, even if they are simple & implicit.
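
As an illustration of that point (made-up names, not from this PR):

~~~
import scala.language.implicitConversions

object StreamingImplicitsSketch {
  // The explicit ": Map[String, Int]" makes the conversion's result type part
  // of the visible API instead of whatever the compiler infers from the body.
  implicit def pairsToMap(pairs: Seq[(String, Int)]): Map[String, Int] = pairs.toMap
}
~~~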


---


[GitHub] spark pull request: [SPARK-695] In DAGScheduler's getPreferredLocs...

2014-07-16 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1362#issuecomment-49208638
  
This looks good to me modulo a few comments! One other thing though, you 
should add a unit test for this functionality. Create a test that would result 
in a very long running time with the old code (e.g. something where you zip 
RDDs together 20-30 times and the original one has some preferred locations), 
and add something around it to time that it runs within, say, 5 seconds.
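
A rough sketch of such a test (hypothetical, assuming a ScalaTest suite with a running `sc` and access to the internal `dagScheduler`; not the test that was actually added):

~~~
import org.apache.spark.rdd.RDD

test("getPreferredLocs stays fast on deeply zipped RDDs") {
  // each element gets a preferred location via the makeRDD(Seq[(T, Seq[String])]) overload
  val base = sc.makeRDD((1 to 20).map(i => (i, Seq(s"host$i"))))
  var rdd: RDD[_] = base
  for (_ <- 1 to 25) {
    rdd = rdd.zip(base)
  }
  val start = System.nanoTime()
  sc.dagScheduler.getPreferredLocs(rdd, 0)
  val seconds = (System.nanoTime() - start) / 1e9
  assert(seconds < 5, s"getPreferredLocs took $seconds seconds")
}
~~~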


---


[GitHub] spark pull request: [SPARK-2317] Improve task logging.

2014-07-16 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/1259#issuecomment-49208726
  
LGTM


---


[GitHub] spark pull request: SPARK-2298: Show stage attempt in UI

2014-07-16 Thread tsudukim
Github user tsudukim commented on the pull request:

https://github.com/apache/spark/pull/1384#issuecomment-49207731
  
@pwendell I agree that there is a lot of room for improvement in the handling 
of stageId and attemptId. It might be better to break this problem into some 
sub-tasks; I think this patch should be one of them. Or did you mean we should 
fix all of these problems in one patch?


---


[GitHub] spark pull request: fix compile error of streaming project

2014-07-16 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/153


---


[GitHub] spark pull request: SPARK-2298: Show stage attempt in UI

2014-07-16 Thread tsudukim
Github user tsudukim commented on a diff in the pull request:

https://github.com/apache/spark/pull/1384#discussion_r15019018
  
--- Diff: core/src/main/scala/org/apache/spark/util/JsonProtocol.scala ---
@@ -478,6 +479,7 @@ private[spark] object JsonProtocol {
 
   def stageInfoFromJson(json: JValue): StageInfo = {
 val stageId = (json \ "Stage ID").extract[Int]
+val attemptId = (json \ "Attempt ID").extract[Int]
--- End diff --

Sure.


---


[GitHub] spark pull request: SPARK-2298: Show stage attempt in UI

2014-07-16 Thread tsudukim
Github user tsudukim commented on the pull request:

https://github.com/apache/spark/pull/1384#issuecomment-49209319
  
@rxin OK. After that, I think I can make this patch better.


---


[GitHub] spark pull request: SPARK-2298: Show stage attempt in UI

2014-07-16 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1384#issuecomment-49208410
  
Let's hold off merging this one until we merge #1262. Then it will be 
easier to index the information based on stage + attempt. 


---


[GitHub] spark pull request: [SPARK-2317] Improve task logging.

2014-07-16 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1259#issuecomment-49209789
  
Thanks. Merging this. We can fix the serialization error logging in a 
separate PR.


---


[GitHub] spark pull request: SPARK-2298: Show stage attempt in UI

2014-07-16 Thread tsudukim
Github user tsudukim commented on the pull request:

https://github.com/apache/spark/pull/1384#issuecomment-49209857
  
@rxin in #1262, can I expect the key of the stage data in 
JobProgressListener to become stageId + attemptId instead of stageId only?


---


[GitHub] spark pull request: [SPARK-2317] Improve task logging.

2014-07-16 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1259


---


[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...

2014-07-16 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1238#discussion_r15019589
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
 ---
@@ -26,6 +26,28 @@ import org.apache.spark.sql.catalyst.trees
 abstract class LogicalPlan extends QueryPlan[LogicalPlan] {
  self: Product =>
 
+  // TODO: handle overflow?
+  /**
+   * Estimates of various statistics.  The default estimation logic simply 
sums up the corresponding
+   * statistic produced by the children.  To override this behavior, 
override `statistics` and
+   * assign it an overridden version of `Statistics`.
+   */
+  case class Statistics(
+/**
+ * Number of output tuples. For leaf operators this defaults to 1, 
otherwise it is set to the
+ * product of children's `numTuples`.
+ */
+numTuples: Long = childrenStats.map(_.numTuples).product,
+
+/**
+ * Physical size in bytes. For leaf operators this defaults to 1, 
otherwise it is set to the
+ * product of children's `sizeInBytes`.
+ */
+sizeInBytes: Long = childrenStats.map(_.sizeInBytes).product
+  )
+  lazy val statistics: Statistics = new Statistics
+  lazy val childrenStats = children.map(_.statistics)
--- End diff --

I think we need a lazy marker for this, otherwise during the initialization 
of `childrenStats`, `statistics` in all children will get evaluated too.
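
A toy illustration of the difference (not the Catalyst code):

~~~
// Without `lazy`, childrenStats would force `statistics` on the whole subtree
// as soon as a node is constructed; with `lazy`, nothing runs until asked for.
class PlanNode(children: Seq[PlanNode]) {
  lazy val statistics: Long = childrenStats.product          // 1 for a leaf
  lazy val childrenStats: Seq[Long] = children.map(_.statistics)
}
~~~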


---


[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations

2014-07-16 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1367#discussion_r15019606
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/Correlation.scala 
---
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.correlation
+
+import org.apache.spark.mllib.linalg.{DenseVector, Matrix, Vector}
+import org.apache.spark.rdd.RDD
+
+/**
+ * New correlation algorithms should implement this trait
+ */
+trait Correlation {
+
+  /**
+   * Compute correlation for two datasets.
+   */
+  def computeCorrelation(x: RDD[Double], y: RDD[Double]): Double
+
+  /**
+   * Compute the correlation matrix S, for the input matrix, where S(i, j) 
is the correlation
+   * between column i and j.
+   */
+  def computeCorrelationMatrix(X: RDD[Vector]): Matrix
+
+  /**
+   * Combine the two input RDD[Double]s into an RDD[Vector] and compute 
the correlation using the
+   * correlation implementation for RDD[Vector]
+   */
+  def computeCorrelationWithMatrixImpl(x: RDD[Double], y: RDD[Double]): 
Double = {
--- End diff --

It is okay to keep it as long as the `Correlation` class is private.


---


[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations

2014-07-16 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1367#discussion_r15019630
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/PearsonCorrelation.scala
 ---
@@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.correlation
+
+import breeze.linalg.{DenseMatrix => BDM}
+
+import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector}
+import org.apache.spark.mllib.linalg.distributed.RowMatrix
+import org.apache.spark.rdd.RDD
+
+/**
+ * Compute Pearson correlation for two RDDs of the type RDD[Double] or the 
correlation matrix
+ * for an RDD of the type RDD[Vector].
+ *
+ * Definition of Pearson correlation can be found at
+ * 
http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
+ */
+object PearsonCorrelation extends Correlation {
+
+  /**
+   * Compute the Pearson correlation for two datasets.
+   */
+  override def computeCorrelation(x: RDD[Double], y: RDD[Double]): Double 
= {
+computeCorrelationWithMatrixImpl(x, y)
+  }
+
+  /**
+   * Compute the Pearson correlation matrix S, for the input matrix, where 
S(i, j) is the
+   * correlation between column i and j.
+   */
+  override def computeCorrelationMatrix(X: RDD[Vector]): Matrix = {
+val rowMatrix = new RowMatrix(X)
+val cov = rowMatrix.computeCovariance()
+computeCorrelationMatrixFromCovariance(cov)
+  }
+
+  /**
+   * Compute the pearson correlation matrix from the covariance matrix
+   */
+  def computeCorrelationMatrixFromCovariance(covarianceMatrix: Matrix): 
Matrix = {
+val cov = covarianceMatrix.toBreeze.asInstanceOf[BDM[Double]]
+val n = cov.cols
+
+// Compute the standard deviation on the diagonals first
+var i = 0
+while (i < n) {
+  cov(i, i) = math.sqrt(cov(i, i))
+  i += 1
+}
+// or we could put the stddev in its own array to trade space for one 
less pass over the matrix
+
+// TODO: use blas.dspr instead to compute the correlation matrix
+// if the covariance matrix comes in the upper triangular form for free
+
+// Loop through columns since cov is column major
+var j = 0
+var sigma = 0.0
+while (j < n) {
+  sigma = cov(j, j)
+  i = 0
+  while (i < j) {
+val covariance = cov(i, j) / (sigma * cov(i, i))
--- End diff --

Agree with @srowen : set NaN and generate a warning message. Btw, R returns 
a warning message instead of an error.
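
A hedged sketch of that suggestion, reusing the names from the diff above and assuming the enclosing class mixes in Spark's Logging trait (this is not the PR's final code):

~~~
// A zero standard deviation makes the correlation undefined: report NaN and
// warn instead of throwing, mirroring R's behaviour.
val corr =
  if (sigma == 0.0 || cov(i, i) == 0.0) {
    logWarning("A column has zero standard deviation; its correlations are NaN.")
    Double.NaN
  } else {
    cov(i, j) / (sigma * cov(i, i))
  }
cov(i, j) = corr
~~~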


---


[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations

2014-07-16 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1367#discussion_r15019612
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/Correlation.scala 
---
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.correlation
+
+import org.apache.spark.mllib.linalg.{DenseVector, Matrix, Vector}
+import org.apache.spark.rdd.RDD
+
+/**
+ * New correlation algorithms should implement this trait
+ */
+trait Correlation {
+
+  /**
+   * Compute correlation for two datasets.
+   */
+  def computeCorrelation(x: RDD[Double], y: RDD[Double]): Double
+
+  /**
+   * Compute the correlation matrix S, for the input matrix, where S(i, j) 
is the correlation
+   * between column i and j.
+   */
+  def computeCorrelationMatrix(X: RDD[Vector]): Matrix
+
+  /**
+   * Combine the two input RDD[Double]s into an RDD[Vector] and compute 
the correlation using the
+   * correlation implementation for RDD[Vector]
+   */
+  def computeCorrelationWithMatrixImpl(x: RDD[Double], y: RDD[Double]): 
Double = {
+val mat: RDD[Vector] = x.zip(y).mapPartitions({ iter =>
--- End diff --

Make sure you use `preservesPartitioning` correctly. My first understanding 
was wrong and I may have used it wrongly somewhere. The argument should really 
be called `preservesPartitioner`: it passes the partitioner to the derived RDD 
without checking, which may lead to wrong results. For example, the join in the 
following code block returns nothing, though `c` and `d` have the same content.

~~~
val a = sc.makeRDD(Seq((0, 1), (2, 3), (4, 5), (6, 7)), 4)
val b = a.groupByKey()
val c = b.mapPartitions({ iter =>
  iter.flatMap { case (k, vv) =>
    vv.map { v => (v, k) }
  }}, true)
c.collect()
val d = sc.makeRDD(Seq((1, 0), (3, 2), (5, 4), (7, 6)), 4)
c.join(d).collect()
~~~
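
(The join comes back empty because `c` keeps `b`'s hash partitioner even though the map swapped keys and values, so the join treats `c` as already correctly partitioned and only shuffles `d`; matching keys therefore land in different partitions.)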


---


[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations

2014-07-16 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1367#discussion_r15019642
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmansCorrelation.scala
 ---
@@ -0,0 +1,102 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.correlation
+
+import org.apache.spark.Partitioner
+import org.apache.spark.SparkContext._
+import org.apache.spark.mllib.linalg.{DenseVector, Matrix, Vector}
+import org.apache.spark.rdd.{CoGroupedRDD, RDD}
+
+/**
+ * Compute Spearman's correlation for two RDDs of the type RDD[Double] or 
the correlation matrix
+ * for an RDD of the type RDD[Vector].
+ *
+ * Definition of Spearman's correlation can be found at
+ * http://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient
+ */
+object SpearmansCorrelation extends Correlation {
+
+  /**
+   * Compute Spearman's correlation for two datasets.
+   */
+  override def computeCorrelation(x: RDD[Double], y: RDD[Double]): Double 
= {
+computeCorrelationWithMatrixImpl(x, y)
+  }
+
+  /**
+   * Compute Spearman's correlation matrix S, for the input matrix, where 
S(i, j) is the
+   * correlation between column i and j.
+   */
+  override def computeCorrelationMatrix(X: RDD[Vector]): Matrix = {
+val indexed = X.zipWithIndex()
--- End diff --

The input `X` is not ordered. I think the purpose here is just to assign 
each row an id for joining back the elements after we obtain the rank for each 
column.
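
A quick illustration (not from the PR) of how the two id schemes diverge once there is more than one partition:

~~~
val rdd = sc.parallelize(Seq("a", "b", "c", "d"), 2)
rdd.zipWithIndex().collect()     // Array((a,0), (b,1), (c,2), (d,3))
rdd.zipWithUniqueId().collect()  // Array((a,0), (b,2), (c,1), (d,3))
~~~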


---


[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations

2014-07-16 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1367#discussion_r15019620
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/Correlation.scala 
---
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.correlation
+
+import org.apache.spark.mllib.linalg.{DenseVector, Matrix, Vector}
+import org.apache.spark.rdd.RDD
+
+/**
+ * New correlation algorithms should implement this trait
+ */
+trait Correlation {
+
+  /**
+   * Compute correlation for two datasets.
+   */
+  def computeCorrelation(x: RDD[Double], y: RDD[Double]): Double
+
+  /**
+   * Compute the correlation matrix S, for the input matrix, where S(i, j) 
is the correlation
+   * between column i and j.
+   */
+  def computeCorrelationMatrix(X: RDD[Vector]): Matrix
+
+  /**
+   * Combine the two input RDD[Double]s into an RDD[Vector] and compute 
the correlation using the
+   * correlation implementation for RDD[Vector]
+   */
+  def computeCorrelationWithMatrixImpl(x: RDD[Double], y: RDD[Double]): 
Double = {
+val mat: RDD[Vector] = x.zip(y).mapPartitions({ iter =>
+  iter.map { case (xi, yi) => new DenseVector(Array(xi, yi)) }
+}, preservesPartitioning = true)
+computeCorrelationMatrix(mat)(0, 1)
+  }
+
+}
+
+/**
+ * Delegates computation to the specific correlation object based on the 
input method name
+ *
+ * Currently supported correlations: pearson, spearman.
+ * After new correlation algorithms are added, please update the 
documentation here and in
+ * Statistics.scala for the correlation APIs.
+ *
+ * Cases are ignored when doing method matching. We also allow initials, 
e.g. P for pearson, as
+ * long as initials are unique in the supported set of correlation 
algorithms. In addition, a
+ * supported method name has to be a substring of the input method name 
for it to be matched (e.g.
+ * spearmansrho will be matched to spearman)
+ *
+ * Maintains the default correlation type, pearson
+ */
+object Correlations {
--- End diff --

Sounds good.


---


[GitHub] spark pull request: [SPARK-2509][SQL] Add optimization for Substri...

2014-07-16 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1428#issuecomment-49210382
  
I'm not sure I agree with that.  This is a pretty niche optimization, not 
something fundamental about the expressions that is required for correct 
evaluation (and the first version of this code had a bunch of mistakes).  
Having all of these rules in one place made it easier for me to realize that 
when doing the review.


---


[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...

2014-07-16 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1238#issuecomment-49210466
  
Jenkins, test this please.

I think I have addressed the latest round of review comments; the 
biggest changes are:

- Remove statistics estimates from ParquetRelation for now due to 
performance considerations.
- When estimating sizes for MetastoreRelation, instead of making 
potentially expensive Hadoop FileSystem calls, we peek into the Hive Metastore 
for populated information.


---


[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...

2014-07-16 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1313#issuecomment-49210568
  
Ah, I see, so you're saying it's worth waiting 3 seconds for node-local 
ones to be able to go instead of launching no-prefs tasks. That does make 
sense. We just have to make sure that if a task set contains *only* 
process-local and no-prefs tasks, for instance, this wait doesn't happen. (I'm 
pretty sure that right now we don't wait for node-local if there were no 
process-local tasks, right? We only wait when we move up a level.)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations

2014-07-16 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1367#discussion_r15019889
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmansCorrelation.scala
 ---
@@ -0,0 +1,102 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.correlation
+
+import org.apache.spark.Partitioner
+import org.apache.spark.SparkContext._
+import org.apache.spark.mllib.linalg.{DenseVector, Matrix, Vector}
+import org.apache.spark.rdd.{CoGroupedRDD, RDD}
+
+/**
+ * Compute Spearman's correlation for two RDDs of the type RDD[Double] or 
the correlation matrix
+ * for an RDD of the type RDD[Vector].
+ *
+ * Definition of Spearman's correlation can be found at
+ * http://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient
+ */
+object SpearmansCorrelation extends Correlation {
+
+  /**
+   * Compute Spearman's correlation for two datasets.
+   */
+  override def computeCorrelation(x: RDD[Double], y: RDD[Double]): Double 
= {
+computeCorrelationWithMatrixImpl(x, y)
+  }
+
+  /**
+   * Compute Spearman's correlation matrix S, for the input matrix, where 
S(i, j) is the
+   * correlation between column i and j.
+   */
+  override def computeCorrelationMatrix(X: RDD[Vector]): Matrix = {
+val indexed = X.zipWithIndex()
+// Attempt to checkpoint the RDD before splitting it into numCols 
RDD[Double]s to avoid
+// computing the lineage prefix multiple times.
+// If checkpoint directory not set, cache the RDD instead.
+try {
+  indexed.checkpoint()
+} catch {
+  case e: Exception => indexed.cache()
--- End diff --

See my previous comment about mentioning the cost in the doc. The user should 
be responsible for caching the RDD. If the user has already cached the input 
RDD, we are duplicating the data here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2465. Use long as user / item ID for ALS

2014-07-16 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1393#issuecomment-49210917
  
BTW this could also be a place to use the dreaded Scala @specialized 
annotation to template the code for Ints vs Longs, though as far as I know 
that's being deprecated by the Scala developers.
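
A tiny illustration of what `@specialized` buys here (the class is illustrative, not MLlib's actual `Rating`):

```scala
// The annotation asks the compiler to emit unboxed variants of the class for
// Int and Long IDs, so one generic source can serve both without boxing.
class IdRating[@specialized(Int, Long) ID](val user: ID, val product: ID, val rating: Double)

val intRating  = new IdRating[Int](1, 2, 4.5)      // uses the Int-specialized variant
val longRating = new IdRating[Long](1L, 2L, 4.5)   // uses the Long-specialized variant
```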


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2465. Use long as user / item ID for ALS

2014-07-16 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1393#issuecomment-49210831
  
Yeah, that's what I meant, we can clone it at first but we might be able to 
share code later (at least the math code we run on each block, or stuff like 
that). But let's do it only if you find out it's worth it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2098: All Spark processes should support...

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1256#issuecomment-49211180
  
QA results for PR 1256:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds the following public classes (experimental):
  class SparkConf(loadDefaults: Boolean, fileName: Option[String])

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16734/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...

2014-07-16 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/1238#discussion_r15020066
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
 ---
@@ -26,6 +26,28 @@ import org.apache.spark.sql.catalyst.trees
 abstract class LogicalPlan extends QueryPlan[LogicalPlan] {
  self: Product =>
 
+  // TODO: handle overflow?
+  /**
+   * Estimates of various statistics.  The default estimation logic simply 
sums up the corresponding
+   * statistic produced by the children.  To override this behavior, 
override `statistics` and
+   * assign it an overridden version of `Statistics`.
+   */
+  case class Statistics(
+/**
+ * Number of output tuples. For leaf operators this defaults to 1, 
otherwise it is set to the
+ * product of children's `numTuples`.
+ */
+numTuples: Long = childrenStats.map(_.numTuples).product,
+
+/**
+ * Physical size in bytes. For leaf operators this defaults to 1, 
otherwise it is set to the
+ * product of children's `sizeInBytes`.
+ */
+sizeInBytes: Long = childrenStats.map(_.sizeInBytes).product
+  )
+  lazy val statistics: Statistics = new Statistics
+  lazy val childrenStats = children.map(_.statistics)
--- End diff --

I was suggesting we turn it into a def, remove it entirely, or stop 
memoizing statistics.  Double memoization seems unnecessarily expensive 
memory-wise.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1719: spark.*.extraLibraryPath isn't app...

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1022#issuecomment-49211292
  
QA results for PR 1022:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16735/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations

2014-07-16 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1367#discussion_r15020198
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmansCorrelation.scala
 ---
@@ -0,0 +1,102 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.correlation
+
+import org.apache.spark.Partitioner
+import org.apache.spark.SparkContext._
+import org.apache.spark.mllib.linalg.{DenseVector, Matrix, Vector}
+import org.apache.spark.rdd.{CoGroupedRDD, RDD}
+
+/**
+ * Compute Spearman's correlation for two RDDs of the type RDD[Double] or 
the correlation matrix
+ * for an RDD of the type RDD[Vector].
+ *
+ * Definition of Spearman's correlation can be found at
+ * http://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient
+ */
+object SpearmansCorrelation extends Correlation {
+
+  /**
+   * Compute Spearman's correlation for two datasets.
+   */
+  override def computeCorrelation(x: RDD[Double], y: RDD[Double]): Double 
= {
+computeCorrelationWithMatrixImpl(x, y)
+  }
+
+  /**
+   * Compute Spearman's correlation matrix S, for the input matrix, where 
S(i, j) is the
+   * correlation between column i and j.
+   */
+  override def computeCorrelationMatrix(X: RDD[Vector]): Matrix = {
+val indexed = X.zipWithIndex()
+// Attempt to checkpoint the RDD before splitting it into numCols 
RDD[Double]s to avoid
+// computing the lineage prefix multiple times.
+// If checkpoint directory not set, cache the RDD instead.
+try {
+  indexed.checkpoint()
+} catch {
+  case e: Exception => indexed.cache()
+}
+
+val numCols = X.first.size
+val ranks = new Array[RDD[(Long, Double)]](numCols)
+
+// Note: we use a for loop here instead of a while loop with a single 
index variable
+// to avoid race condition caused by closure serialization
+for (k <- 0 until numCols) {
+  val column = indexed.map {case(vector, index) => {
+(vector(k), index)}
+  }
+  ranks(k) = getRanks(column)
+}
+
+val ranksMat: RDD[Vector] = makeRankMatrix(ranks)
+PearsonCorrelation.computeCorrelationMatrix(ranksMat)
+  }
+
+  /**
+   * Compute the ranks for elements in the input RDD, using the average 
method for ties.
+   *
+   * With the average method, elements with the same value receive the 
same rank that's computed
+   * by taking the average of their positions in the sorted list.
+   * e.g. ranks([2, 1, 0, 2]) = [3.5, 2.0, 1.0, 3.5]
+   */
+  private def getRanks(indexed: RDD[(Double, Long)]): RDD[(Long, Double)] 
= {
+// Get elements' positions in the sorted list for computing average 
rank for duplicate values
+val sorted = indexed.sortByKey().zipWithIndex()
+val groupedByValue = sorted.groupBy(_._1._1)
--- End diff --

The `RangePartitioner` guarantees that equal keys land in the same partition. 
Then, inside each partition, you go through an iterator of (value: Double, 
rank: Long) pairs. Do not output a value and its rank directly; count records 
with the same value until a different value comes up, then flush out the value 
and the average rank.
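
A rough sketch of that per-partition pass, assuming (as above) that equal values never straddle a partition boundary; the helper name and the output buffering are illustrative, not the PR's code:

```scala
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.rdd.RDD

// Input matches the PR's `sorted`: ((value, originalIndex), globalPosition).
// Output matches getRanks: (originalIndex, averageRank), with 1-based ranks.
def averageRanks(sorted: RDD[((Double, Long), Long)]): RDD[(Long, Double)] = {
  sorted.mapPartitions { iter =>
    val out = ArrayBuffer.empty[(Long, Double)]     // buffered for brevity
    val pending = ArrayBuffer.empty[(Long, Long)]   // (originalIndex, position)
    var currentValue = Double.NaN
    def flush(): Unit = if (pending.nonEmpty) {
      val avgRank = pending.map(_._2 + 1).sum.toDouble / pending.size
      pending.foreach { case (idx, _) => out += ((idx, avgRank)) }
      pending.clear()
    }
    iter.foreach { case ((value, idx), pos) =>
      if (value != currentValue) { flush(); currentValue = value }
      pending += ((idx, pos))
    }
    flush()
    out.iterator
  }
}
```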


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2277: make TaskScheduler track hosts on ...

2014-07-16 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1212#issuecomment-49211502
  
Hey @mridulm I usually won't merge any code in the scheduler unless 
@markhamstra or @kayhousterhout has looked at it and signed off, since they are 
the most active maintainers of this code. That might be a good practice to 
follow in the future. We can try to come up with a list of maintainers to make 
it more clear who should be consulted for code in various parts of Spark.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2509][SQL] Add optimization for Substri...

2014-07-16 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1428#issuecomment-49211681
  
That is exactly the argument I made when the folding logic was added. :) I 
suggested that we add `deterministic` instead and then have a rule that folds 
things that are `deterministic` and have no `references`.  `deterministic` I 
would argue is something fundamental about the expression itself, unlike these 
optimizations we are making.
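
In rough Catalyst-flavored pseudocode, that proposal would look something like the sketch below; the `deterministic` flag is the hypothetical attribute being proposed here, and this is not the existing ConstantFolding rule:

```scala
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Fold any expression that is deterministic and reads no input attributes down
// to a literal, instead of enumerating niche rewrites one by one.
object FoldDeterministicExpressions extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case l: Literal => l                                  // already folded
    case e if e.deterministic && e.references.isEmpty =>  // hypothetical flag
      Literal(e.eval(null), e.dataType)
  }
}
```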


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2509][SQL] Add optimization for Substri...

2014-07-16 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1428#issuecomment-49211732
  
I would be supportive of changing it to match my original proposal...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations

2014-07-16 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1367#discussion_r15020402
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmansCorrelation.scala
 ---
@@ -0,0 +1,102 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.correlation
+
+import org.apache.spark.Partitioner
+import org.apache.spark.SparkContext._
+import org.apache.spark.mllib.linalg.{DenseVector, Matrix, Vector}
+import org.apache.spark.rdd.{CoGroupedRDD, RDD}
+
+/**
+ * Compute Spearman's correlation for two RDDs of the type RDD[Double] or 
the correlation matrix
+ * for an RDD of the type RDD[Vector].
+ *
+ * Definition of Spearman's correlation can be found at
+ * http://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient
+ */
+object SpearmansCorrelation extends Correlation {
+
+  /**
+   * Compute Spearman's correlation for two datasets.
+   */
+  override def computeCorrelation(x: RDD[Double], y: RDD[Double]): Double 
= {
+computeCorrelationWithMatrixImpl(x, y)
+  }
+
+  /**
+   * Compute Spearman's correlation matrix S, for the input matrix, where 
S(i, j) is the
+   * correlation between column i and j.
+   */
+  override def computeCorrelationMatrix(X: RDD[Vector]): Matrix = {
+val indexed = X.zipWithIndex()
+// Attempt to checkpoint the RDD before splitting it into numCols 
RDD[Double]s to avoid
+// computing the lineage prefix multiple times.
+// If checkpoint directory not set, cache the RDD instead.
+try {
+  indexed.checkpoint()
+} catch {
+  case e: Exception => indexed.cache()
+}
+
+val numCols = X.first.size
+val ranks = new Array[RDD[(Long, Double)]](numCols)
+
+// Note: we use a for loop here instead of a while loop with a single 
index variable
+// to avoid race condition caused by closure serialization
+for (k <- 0 until numCols) {
+  val column = indexed.map {case(vector, index) => {
+(vector(k), index)}
+  }
+  ranks(k) = getRanks(column)
+}
+
+val ranksMat: RDD[Vector] = makeRankMatrix(ranks)
+PearsonCorrelation.computeCorrelationMatrix(ranksMat)
+  }
+
+  /**
+   * Compute the ranks for elements in the input RDD, using the average 
method for ties.
+   *
+   * With the average method, elements with the same value receive the 
same rank that's computed
+   * by taking the average of their positions in the sorted list.
+   * e.g. ranks([2, 1, 0, 2]) = [3.5, 2.0, 1.0, 3.5]
+   */
+  private def getRanks(indexed: RDD[(Double, Long)]): RDD[(Long, Double)] 
= {
+// Get elements' positions in the sorted list for computing average 
rank for duplicate values
+val sorted = indexed.sortByKey().zipWithIndex()
+val groupedByValue = sorted.groupBy(_._1._1)
+val ranks = groupedByValue.flatMap[(Long, Double)] { item =>
+  val duplicates = item._2
+  if (duplicates.size > 1) {
+val averageRank = duplicates.foldLeft(0L) {_ + _._2 + 1} / duplicates.size.toDouble
+duplicates.map(entry => (entry._1._2, averageRank)).toSeq
+  } else {
+duplicates.map(entry => (entry._1._2, entry._2.toDouble + 1)).toSeq
+  }
+}
+ranks.sortByKey()
+  }
+
+  private def makeRankMatrix(ranks: Array[RDD[(Long, Double)]]): 
RDD[Vector] = {
+val partitioner = Partitioner.defaultPartitioner(ranks(0), ranks.tail: 
_*)
--- End diff --

The code is weird because the `ranks` RDDs don't share a common partitioner. 
It is cleaner to use a HashPartitioner with the same number of partitions as 
the input RDD, or simply `defaultParallelism`.
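
A minimal sketch of that simpler choice, reusing the `ranks` array from the diff above (illustrative, not the final PR code):

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.CoGroupedRDD

// The rank RDDs do not share a common partitioner, so hash-partition the
// cogroup by the first rank RDD's partition count (or defaultParallelism).
val partitioner = new HashPartitioner(ranks(0).partitions.size)
val cogrouped = new CoGroupedRDD[Long](ranks, partitioner)
```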


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations

2014-07-16 Thread dorx
Github user dorx commented on a diff in the pull request:

https://github.com/apache/spark/pull/1367#discussion_r15020400
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmansCorrelation.scala
 ---
@@ -0,0 +1,102 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.correlation
+
+import org.apache.spark.Partitioner
+import org.apache.spark.SparkContext._
+import org.apache.spark.mllib.linalg.{DenseVector, Matrix, Vector}
+import org.apache.spark.rdd.{CoGroupedRDD, RDD}
+
+/**
+ * Compute Spearman's correlation for two RDDs of the type RDD[Double] or 
the correlation matrix
+ * for an RDD of the type RDD[Vector].
+ *
+ * Definition of Spearman's correlation can be found at
+ * http://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient
+ */
+object SpearmansCorrelation extends Correlation {
+
+  /**
+   * Compute Spearman's correlation for two datasets.
+   */
+  override def computeCorrelation(x: RDD[Double], y: RDD[Double]): Double 
= {
+computeCorrelationWithMatrixImpl(x, y)
+  }
+
+  /**
+   * Compute Spearman's correlation matrix S, for the input matrix, where 
S(i, j) is the
+   * correlation between column i and j.
+   */
+  override def computeCorrelationMatrix(X: RDD[Vector]): Matrix = {
+val indexed = X.zipWithIndex()
--- End diff --

You're right. Sorry, I was thinking about the 2nd zipWithIndex call (on the 
sorted RDD). That one absolutely cannot be replaced with zipWithUniqueId or the 
results are all wrong.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations

2014-07-16 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/1367#discussion_r15020475
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmansCorrelation.scala
 ---
@@ -0,0 +1,102 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.correlation
+
+import org.apache.spark.Partitioner
+import org.apache.spark.SparkContext._
+import org.apache.spark.mllib.linalg.{DenseVector, Matrix, Vector}
+import org.apache.spark.rdd.{CoGroupedRDD, RDD}
+
+/**
+ * Compute Spearman's correlation for two RDDs of the type RDD[Double] or 
the correlation matrix
+ * for an RDD of the type RDD[Vector].
+ *
+ * Definition of Spearman's correlation can be found at
+ * http://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient
+ */
+object SpearmansCorrelation extends Correlation {
+
+  /**
+   * Compute Spearman's correlation for two datasets.
+   */
+  override def computeCorrelation(x: RDD[Double], y: RDD[Double]): Double 
= {
+computeCorrelationWithMatrixImpl(x, y)
+  }
+
+  /**
+   * Compute Spearman's correlation matrix S, for the input matrix, where 
S(i, j) is the
+   * correlation between column i and j.
+   */
+  override def computeCorrelationMatrix(X: RDD[Vector]): Matrix = {
+val indexed = X.zipWithIndex()
+// Attempt to checkpoint the RDD before splitting it into numCols 
RDD[Double]s to avoid
+// computing the lineage prefix multiple times.
+// If checkpoint directory not set, cache the RDD instead.
+try {
+  indexed.checkpoint()
+} catch {
+  case e: Exception => indexed.cache()
+}
+
+val numCols = X.first.size
+val ranks = new Array[RDD[(Long, Double)]](numCols)
+
+// Note: we use a for loop here instead of a while loop with a single 
index variable
+// to avoid race condition caused by closure serialization
+for (k <- 0 until numCols) {
+  val column = indexed.map {case(vector, index) => {
+(vector(k), index)}
+  }
+  ranks(k) = getRanks(column)
+}
+
+val ranksMat: RDD[Vector] = makeRankMatrix(ranks)
+PearsonCorrelation.computeCorrelationMatrix(ranksMat)
+  }
+
+  /**
+   * Compute the ranks for elements in the input RDD, using the average 
method for ties.
+   *
+   * With the average method, elements with the same value receive the 
same rank that's computed
+   * by taking the average of their positions in the sorted list.
+   * e.g. ranks([2, 1, 0, 2]) = [3.5, 2.0, 1.0, 3.5]
+   */
+  private def getRanks(indexed: RDD[(Double, Long)]): RDD[(Long, Double)] 
= {
+// Get elements' positions in the sorted list for computing average 
rank for duplicate values
+val sorted = indexed.sortByKey().zipWithIndex()
+val groupedByValue = sorted.groupBy(_._1._1)
+val ranks = groupedByValue.flatMap[(Long, Double)] { item =>
+  val duplicates = item._2
+  if (duplicates.size > 1) {
+val averageRank = duplicates.foldLeft(0L) {_ + _._2 + 1} / duplicates.size.toDouble
+duplicates.map(entry => (entry._1._2, averageRank)).toSeq
+  } else {
+duplicates.map(entry => (entry._1._2, entry._2.toDouble + 1)).toSeq
+  }
+}
+ranks.sortByKey()
+  }
+
+  private def makeRankMatrix(ranks: Array[RDD[(Long, Double)]]): 
RDD[Vector] = {
+val partitioner = Partitioner.defaultPartitioner(ranks(0), ranks.tail: 
_*)
+val cogrouped = new CoGroupedRDD[Long](ranks, partitioner)
+cogrouped.mapPartitions({ iter =>
--- End diff --

Please check my comment above about `preservePartitioning` before adding 
`preservePartitioning = true`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1719: spark.*.extraLibraryPath isn't app...

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1022#issuecomment-49213268
  
QA results for PR 1022:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16737/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2190][SQL] Specialized ColumnType for T...

2014-07-16 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/1440#discussion_r15020827
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnStats.scala ---
@@ -344,21 +344,52 @@ private[sql] class StringColumnStats extends 
BasicColumnStats(STRING) {
   }
 
   override def contains(row: Row, ordinal: Int) = {
-!(upperBound eq null) && {
+(upperBound ne null) && {
--- End diff --

Oh, I see... I forgot that `ne`/`eq` do reference equality... in the case 
of null I would imagine there is no difference, but this is probably fine then.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2277: make TaskScheduler track hosts on ...

2014-07-16 Thread mridulm
Github user mridulm commented on the pull request:

https://github.com/apache/spark/pull/1212#issuecomment-49214821
  
Thanks, but that is fine. I merged it in after I resolved my local hardware
issues today, so I did not need to impose on you to merge after all!
On 17-Jul-2014 12:33 am, Patrick Wendell notificati...@github.com wrote:

 Hey @mridulm https://github.com/mridulm I usually won't merge any code
 in the scheduler unless @markhamstra https://github.com/markhamstra or
 @kayhousterhout has looked at it and signed off, since they are the most
 active maintainers of this code. That might be a good practice to follow 
in
 the future. We can try to come up with a list of maintainers to make it
 more clear who should be consulted for code in various parts of Spark.

 —
 Reply to this email directly or view it on GitHub
 https://github.com/apache/spark/pull/1212#issuecomment-49211502.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...

2014-07-16 Thread mridulm
Github user mridulm commented on the pull request:

https://github.com/apache/spark/pull/1313#issuecomment-49215361
  
I am not sure if it is OK to wait ... this is something I never considered
when I added process_local ... Maybe it is OK!
If it is not, then we might need to come up with something.

Unlike earlier, the noPrefs list now truly contains tasks which have no
preference (earlier, task failures also ended up here) ... so maybe it is not
common anymore? And so OK to wait?
On 17-Jul-2014 12:26 am, Matei Zaharia notificati...@github.com wrote:

 Ah, I see, so you're saying it's worth to wait 3 seconds for node-local
 ones to be able to go instead of launching no-prefs tasks. That does make
 sense. We just have to make sure that if a task set contains *only*
 process-local and no-prefs tasks, for instance, this wait doesn't happen.
 (I'm pretty sure that right now we don't wait for node-local if there were
 no process-local tasks, right? We only wait when we move up a level.)

 —
 Reply to this email directly or view it on GitHub
 https://github.com/apache/spark/pull/1313#issuecomment-49210568.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations

2014-07-16 Thread dorx
Github user dorx commented on a diff in the pull request:

https://github.com/apache/spark/pull/1367#discussion_r15022756
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/Correlation.scala 
---
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.correlation
+
+import org.apache.spark.mllib.linalg.{DenseVector, Matrix, Vector}
+import org.apache.spark.rdd.RDD
+
+/**
+ * New correlation algorithms should implement this trait
+ */
+trait Correlation {
+
+  /**
+   * Compute correlation for two datasets.
+   */
+  def computeCorrelation(x: RDD[Double], y: RDD[Double]): Double
+
+  /**
+   * Compute the correlation matrix S, for the input matrix, where S(i, j) 
is the correlation
+   * between column i and j.
+   */
+  def computeCorrelationMatrix(X: RDD[Vector]): Matrix
+
+  /**
+   * Combine the two input RDD[Double]s into an RDD[Vector] and compute 
the correlation using the
+   * correlation implementation for RDD[Vector]
+   */
+  def computeCorrelationWithMatrixImpl(x: RDD[Double], y: RDD[Double]): 
Double = {
+val mat: RDD[Vector] = x.zip(y).mapPartitions({ iter =>
--- End diff --

Thanks for the example explaining what `preservePartitioning` does. It 
seems like in this case it's a good idea to copy the partitioner into the pair 
RDD since zip preserves the partitioner, and if we don't preserve the 
partitioner here, we're unnecessarily dropping the partitioner even though we 
didn't need to shuffle the data at all.
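
Concretely, the pattern under discussion looks roughly like this (a sketch; the vector-building line is an assumption, since the diff above cuts off inside `mapPartitions`):

```scala
// zip keeps the two inputs aligned partition by partition, and
// preservesPartitioning = true asks Spark to keep the upstream partitioner
// (if any) on the result instead of dropping it.
val mat: RDD[Vector] = x.zip(y).mapPartitions({ iter =>
  iter.map { case (xi, yi) => new DenseVector(Array(xi, yi)): Vector }
}, preservesPartitioning = true)
```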


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: replace println to log4j

2014-07-16 Thread tdas
Github user tdas commented on the pull request:

https://github.com/apache/spark/pull/1372#issuecomment-49217835
  
There has been further comments regarding this. It would be great if you 
address them as well. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1097: Do not introduce deadlock while fi...

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1409#issuecomment-49218047
  
QA results for PR 1409:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16740/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-1477]: Add the lifecycle interface

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/991#issuecomment-49218315
  
QA results for PR 991:
- This patch FAILED unit tests.
- This patch merges cleanly.
- This patch adds the following public classes (experimental):
  trait Lifecycle extends Service {
  trait Service extends java.io.Closeable {
  class SparkContext(config: SparkConf) extends Logging with Lifecycle {
  class JavaStreamingContext(val ssc: StreamingContext) extends Lifecycle {
  class JobGenerator(jobScheduler: JobScheduler) extends Logging with Lifecycle {
  class JobScheduler(val ssc: StreamingContext) extends Logging with Lifecycle {
  class ReceiverTracker(ssc: StreamingContext) extends Logging with Lifecycle {
  class ReceiverLauncher extends Lifecycle {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16741/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2277: make TaskScheduler track hosts on ...

2014-07-16 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1212#issuecomment-49218616
  
@mridulm what I meant was that it would be good in the future if you try to 
have Mark or Kay look at patches in the scheduler code before you merge them.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...

2014-07-16 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1238#issuecomment-49219047
  
Jenkins, test this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [MLlib] SPARK-1536: multiclass classification ...

2014-07-16 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/886#issuecomment-49219179
  
@manishamde  It looks like the MIMA issue may not be fixed right away, so 
let's use the exceptions in MimaExcludes.scala for now.  Could you please do 
that?  I believe it will then be good to go.  Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1238#issuecomment-49219540
  
QA tests have started for PR 1238. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16743/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...

2014-07-16 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1238#discussion_r15024063
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
 ---
@@ -26,6 +26,28 @@ import org.apache.spark.sql.catalyst.trees
 abstract class LogicalPlan extends QueryPlan[LogicalPlan] {
  self: Product =>
 
+  // TODO: handle overflow?
+  /**
+   * Estimates of various statistics.  The default estimation logic simply 
sums up the corresponding
+   * statistic produced by the children.  To override this behavior, 
override `statistics` and
+   * assign it an overridden version of `Statistics`.
+   */
+  case class Statistics(
+/**
+ * Number of output tuples. For leaf operators this defaults to 1, 
otherwise it is set to the
+ * product of children's `numTuples`.
+ */
+numTuples: Long = childrenStats.map(_.numTuples).product,
+
+/**
+ * Physical size in bytes. For leaf operators this defaults to 1, 
otherwise it is set to the
+ * product of children's `sizeInBytes`.
+ */
+sizeInBytes: Long = childrenStats.map(_.sizeInBytes).product
+  )
+  lazy val statistics: Statistics = new Statistics
+  lazy val childrenStats = children.map(_.statistics)
--- End diff --

Thanks for the note -- I wasn't aware of the memory/time overhead of 
lazy vals. Since we have only one statistic at the moment, `childrenStats` is 
used only once per logical node, so I went ahead and removed it. 

However, memoizing `statistics` itself probably would save some planning 
time, as it will get called multiple times during planning (e.g. different 
cases in the HashJoin pattern match). If we were to use a `def`, it would be 
re-evaluated a quadratic number of times.

Using a plain `val` would fire off the computation of the stats for all logical 
nodes right away. I think that is less desirable than evaluating it only for 
the subtrees under a Join when a Strategy needs it. Also, we'd have very few 
operators in a typical query anyway.
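
As a toy illustration of the trade-off (not Catalyst code): with a `def`, a deep plan recomputes child statistics on every strategy lookup, while a `lazy val` computes them at most once per node at the cost of one cached field per node.

```scala
abstract class ToyPlan {
  def children: Seq[ToyPlan]
  // Computed on first access and cached; a def here would recurse on every call.
  lazy val sizeInBytes: Long =
    if (children.isEmpty) 1L else children.map(_.sizeInBytes).product
}
```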


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1238#issuecomment-49219898
  
QA results for PR 1238:
- This patch FAILED unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16743/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2250: show stage RDDs in UI

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1188#issuecomment-49220170
  
QA tests have started for PR 1188. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16744/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2250: show stage RDDs in UI

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1188#issuecomment-49220283
  
QA results for PR 1188:
- This patch FAILED unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16744/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...

2014-07-16 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1313#issuecomment-49220362
  
Ah, that's true, it won't be as common now. Anyway I'd be okay with any 
solution as long as TaskSets with only no-prefs tasks, or only process-local + 
no-prefs, don't get an unnecessary delay.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2411] Add a history-not-found page to s...

2014-07-16 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/1336#discussion_r15024426
  
--- Diff: 
core/src/main/scala/org/apache/spark/deploy/master/ui/HistoryNotFoundPage.scala 
---
@@ -0,0 +1,40 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.deploy.master.ui
+
+import javax.servlet.http.HttpServletRequest
+
+import scala.xml.Node
+
+import org.apache.spark.ui.{UIUtils, WebUIPage}
+
+private[spark] class HistoryNotFoundPage(parent: MasterWebUI)
+  extends WebUIPage("history/not-found") {
+
+  def render(request: HttpServletRequest): Seq[Node] = {
+val content =
+  <div class="row-fluid">
+<div class="span12" style="font-size:14px;font-weight:bold">
+  No event logs were found for this application. To enable event 
logging, please set
--- End diff --

This is for the standalone Master, not the history server. Here the event 
log directory for each application is shipped to the master as part of the 
application description, so the Master knows where to find the logs 
automatically.

(For history server, however, the user should explicitly set the path in 
all cases as you describe)
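
For reference, a hedged example of the settings the truncated page text refers to (the directory value is illustrative): an application opts into event logging through these properties, either on its SparkConf or in spark-defaults.conf.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "hdfs://namenode:8021/spark-events")
```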


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2250: show stage RDDs in UI

2014-07-16 Thread nevillelyh
Github user nevillelyh commented on the pull request:

https://github.com/apache/spark/pull/1188#issuecomment-49220820
  
Sorry, missed one. Fixed the style warning.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2479][MLlib] Comparing floating-point n...

2014-07-16 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1425#issuecomment-49221370
  
@mengxr  ScalaTest 2.x has the tolerance feature, but it uses absolute error, 
not relative error. For large numbers, the absolute error may not be 
meaningful. With `===`, the comparison will return false even when the 
difference is only one unit of least precision (ULP), and this often happens 
when running the unit tests on a different machine architecture. For example, 
ARM and x86 may round differently, and we don't run tests on anything other 
than x86. Boost in C++ implements its numerical comparison with relative error 
for this reason.

I can probably add `~=` and `~==` methods for the `Double` and `Vector` types 
using an implicit class, where `~==` raises an exception carrying the message, 
like `===` does.
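
A minimal sketch of what those helpers could look like (names, tolerance, and error type are placeholders, not the final API):

```scala
object TestingUtils {

  implicit class AlmostEqualDouble(val x: Double) {
    // Relative-error comparison: a one-ULP difference on a large value passes.
    def ~=(y: Double, relTol: Double = 1e-8): Boolean =
      math.abs(x - y) <= relTol * math.max(math.abs(x), math.abs(y))

    // Same check, but fails loudly with the compared values, the way `===` does.
    def ~==(y: Double, relTol: Double = 1e-8): Boolean =
      if (this.~=(y, relTol)) true
      else throw new AssertionError(s"Expected $x to almost equal $y (relTol = $relTol)")
  }
}
```

With `import TestingUtils._` in scope, a test could then use `a ~= b` in boolean checks and `a ~== b` where a failure message is wanted.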


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...

2014-07-16 Thread mridulm
Github user mridulm commented on the pull request:

https://github.com/apache/spark/pull/1313#issuecomment-49221715
  
Same here, Kay any thoughts ?
On 17-Jul-2014 1:44 am, Matei Zaharia notificati...@github.com wrote:

 Ah, that's true, it won't be as common now. Anyway I'd be okay with any
 solution as long as TaskSets with only no-prefs tasks, or only
 process-local + no-prefs, don't get an unnecessary delay.

 —
 Reply to this email directly or view it on GitHub
 https://github.com/apache/spark/pull/1313#issuecomment-49220362.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2479][MLlib] Comparing floating-point n...

2014-07-16 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/1425#issuecomment-49221957
  
`almostEquals` reads better than `~===`. The feature we like is having the 
compared values in the error message, not the name :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...

2014-07-16 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1238#issuecomment-49222667
  
Jenkins, test this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2479][MLlib] Comparing floating-point n...

2014-07-16 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/1425#issuecomment-49222983
  
I learned `almostEquals` from the Boost library. Anyway, in this case, how do 
we distinguish the variant that throws with a message from the one that just 
returns true/false?

`almostEquals` and `almostEqualsWithMessage`? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...

2014-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1238#issuecomment-49223292
  
QA tests have started for PR 1238. This patch merges cleanly.
View progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16745/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2465. Use long as user / item ID for ALS

2014-07-16 Thread srowen
Github user srowen closed the pull request at:

https://github.com/apache/spark/pull/1393


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Added t2 instance types

2014-07-16 Thread 24601
GitHub user 24601 opened a pull request:

https://github.com/apache/spark/pull/1446

Added t2 instance types

New t2 instance types require HVM AMIs; the fallback assumption of PVM
causes failures when using t2 instance types.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/24601/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1446.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1446


commit 392a95ef97d9644b52d7526a1a068b6d9dbf01c0
Author: Basit Mustafa basitmustafa@computes-things-for-basit.local
Date:   2014-07-16T20:45:37Z

Added t2 instance types

New t2 instance types require HVM amis, bailout assumption of pvm
causes failures when using t2 instance types.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Added t2 instance types

2014-07-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1446#issuecomment-49225163
  
Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1097: Do not introduce deadlock while fi...

2014-07-16 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/1409#issuecomment-49227548
  
Okay I'm going to merge this into master and 1.0. We can cut a new patch 
release shortly for this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

2014-07-16 Thread sryza
GitHub user sryza opened a pull request:

https://github.com/apache/spark/pull/1447

SPARK-2519 part 2. Remove pattern matching on Tuple2 in critical section...

...s of CoGroupedRDD and PairRDDFunctions

This also removes an unnecessary tuple creation in cogroup.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sryza/spark sandy-spark-2519-2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1447.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1447


commit 906e6b362db11a4ac5a3f8096668c6f968dc48aa
Author: Sandy Ryza sa...@cloudera.com
Date:   2014-07-16T21:00:17Z

SPARK-2519 part 2. Remove pattern matching on Tuple2 in critical sections 
of CoGroupedRDD and PairRDDFunctions




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Added t2 instance types

2014-07-16 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/1446#discussion_r15027927
  
--- Diff: ec2/spark_ec2.py ---
@@ -240,7 +240,10 @@ def get_spark_ami(opts):
"r3.xlarge":   "hvm",
"r3.2xlarge":  "hvm",
"r3.4xlarge":  "hvm",
-"r3.8xlarge":  "hvm"
+"r3.8xlarge":  "hvm",
+"t2.micro": "hvm",
+"t2.small":"hvm",
+"t2.medium":"hvm"
--- End diff --

Can you format these consistently with the other ones?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...

2014-07-16 Thread CodingCat
Github user CodingCat commented on the pull request:

https://github.com/apache/spark/pull/1313#issuecomment-49227891
  
Hmm... that makes sense.

Actually, we can easily get at the locality situation in the taskSet through 
myLocalityLevels, which is calculated when a new executor is added.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

