[GitHub] spark pull request: SPARK-2526: Simplify options in make-distribut...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1445#issuecomment-49202151 QA tests have started for PR 1445. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16738/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2517] Removed some compiler type erasur...
Github user rxin closed the pull request at: https://github.com/apache/spark/pull/1431
[GitHub] spark pull request: [SPARK-2517] Removed some compiler type erasur...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1431#issuecomment-49202256 SGTM.
[GitHub] spark pull request: [SPARK-2525][SQL] Remove as many compilation w...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1444
[GitHub] spark pull request: [SPARK-2525][SQL] Remove as many compilation w...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1444#issuecomment-49202444 I've merged this into master and branch-1.0.
[GitHub] spark pull request: [SQL] Cleaned up ConstantFolding slightly.
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1430#issuecomment-49202491 Merging in master and branch-1.0.
[GitHub] spark pull request: [SPARK-2154] Schedule next Driver when one com...
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/1405#issuecomment-49202648 Looks good to me -- we can put it in both 1.0.2 and 0.9.2 as well.
[GitHub] spark pull request: [SQL] Cleaned up ConstantFolding slightly.
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1430
[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...
Github user kayousterhout commented on the pull request: https://github.com/apache/spark/pull/1313#issuecomment-49202752 Also cc @mateiz
[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...
Github user kayousterhout commented on a diff in the pull request: https://github.com/apache/spark/pull/1313#discussion_r15016112
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -353,6 +350,14 @@ private[spark] class TaskSetManager(
       for (index <- findTaskFromList(execId, getPendingTasksForHost(host))) {
         return Some((index, TaskLocality.NODE_LOCAL, false))
       }
+      // Look for noPref tasks after NODE_LOCAL for minimize cross-rack traffic
+      for (index <- findTaskFromList(execId, pendingTasksWithNoPrefs)) {
+        return Some((index, TaskLocality.PROCESS_LOCAL, false))
+      }
+      // find a speculative task if all noPref tasks have been scheduled
+      val specTask = findSpeculativeTask(execId, host, locality).map {
+        case (taskIndex, allowedLocality) => (taskIndex, allowedLocality, true)}
+      if (specTask != None) return specTask
--- End diff --
What happens if the maximum current allowed locality level (based on delay scheduling) is PROCESS_LOCAL? It seems like that will mean tasks in pendingTasksWithNoPrefs can't be scheduled (they won't be scheduled until the first timeout expires) -- which seems bad?
[GitHub] spark pull request: SPARK-2465. Use long as user / item ID for ALS
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/1393#issuecomment-49203419 @mateiz I think it would mean mostly cloning `ALS.scala`, as the `Rating` object is woven throughout. Probably some large chunks could be refactored and shared. Is that what you mean? Even I'm not sure whether two APIs are worth the trouble. When I get to this point and see what it takes to make a 64-bit key implementation, yes, I can propose what that looks like.
[GitHub] spark pull request: [SQL] Cleaned up ConstantFolding slightly.
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1430#issuecomment-49203889 Actually only master. There is a conflict in branch-1.0, so I will leave that branch as it is for now.
[GitHub] spark pull request: SPARK-2519. Eliminate pattern-matching on Tupl...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1435#issuecomment-49203999 Merging in master. Thanks.
[GitHub] spark pull request: SPARK-1097: Do not introduce deadlock while fi...
Github user aarondav commented on the pull request: https://github.com/apache/spark/pull/1409#issuecomment-49204042 Jenkins, test this please.
[GitHub] spark pull request: SPARK-2519. Eliminate pattern-matching on Tupl...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1435
[GitHub] spark pull request: SPARK-1097: Do not introduce deadlock while fi...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1409#issuecomment-49204801 QA tests have started for PR 1409. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16740/consoleFull
[GitHub] spark pull request: [SPARK-2518][SQL] Fix foldability of Substring...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1432#issuecomment-49204855 Merging in master and branch-1.0. Thanks.
[GitHub] spark pull request: [SPARK-2518][SQL] Fix foldability of Substring...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1432
[GitHub] spark pull request: [SPARK-2509][SQL] Add optimization for Substri...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1428#issuecomment-49205341 It doesn't bring much benefit right now, but what we are doing here is creating patterns in NullPropagation to specify the semantics of each individual expression ... not very scalable in maintaining this code base.
[GitHub] spark pull request: [SPARK-2033] Automatically cleanup checkpoint
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/855#issuecomment-49205391 QA results for PR 855:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16731/consoleFull
[GitHub] spark pull request: [SPARK-1477]: Add the lifecycle interface
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/991#issuecomment-49205438 QA tests have started for PR 991. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16741/consoleFull
[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/1313#issuecomment-49205621 I'm curious about one thing here: doesn't this change mean that we might wait longer before launching a no-prefs or speculative task, due to delay scheduling? This is because in TaskSetManager we take the preferredLocality and make it no bigger than our allowed locality level based on delays. The unit tests seem to add extra sleeps for this reason. I don't really like this behavior, since with both kinds of tasks you'd like to launch them as soon as possible. No-prefs tasks can just run, and for speculative tasks you want to launch them ASAP in situations like Spark Streaming, where a 3-second delay could mess up your stream latency. It might be better to add a special preferredLocality value for no preference, and always call the TaskSetManager with that value after calling it with the others, then have a code path in there that deals with that specially (not taking the min of that with allowedLocality).
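A minimal sketch of the suggestion in the comment above, assuming a simplified model of locality levels. The names `LocalitySketch`, `Level`, `NoPref` and `effectiveLevel` are hypothetical and are not Spark's actual `TaskLocality` or `TaskSetManager` API; the sketch only illustrates clamping the requested level to the level allowed by delay scheduling, except for a special no-preference value that is passed through unchanged.

```scala
// Hypothetical, simplified model; not Spark's actual TaskLocality or TaskSetManager.
object LocalitySketch {
  // Smaller rank means "more local".
  sealed abstract class Level(val rank: Int)
  case object ProcessLocal extends Level(0)
  case object NodeLocal extends Level(1)
  case object RackLocal extends Level(2)
  case object AnyLevel extends Level(3)
  // Assumed extra value meaning "no locality preference".
  case object NoPref extends Level(-1)

  /** Clamp the requested level to what delay scheduling currently allows,
    * except for NoPref, which bypasses the clamp so such tasks are not delayed. */
  def effectiveLevel(requested: Level, allowed: Level): Level =
    requested match {
      case NoPref => NoPref
      case _      => if (requested.rank <= allowed.rank) requested else allowed
    }
}
```

Under this model, a caller would ask for each ordinary level first and then ask once more with `NoPref`, so that no-prefs tasks never wait for a locality timeout.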
[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/1313#issuecomment-49205693 Looks like Kay pointed out the same issue while I was typing this.
[GitHub] spark pull request: [SQL] Add HiveDecimal HiveVarchar support in...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1436#issuecomment-49205734 Unit test actually failed.
[GitHub] spark pull request: Improve ALS algorithm resource usage
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/929#issuecomment-49206560 QA results for PR 929:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16733/consoleFull
[GitHub] spark pull request: [SPARK-2525][SQL] Remove as many compilation w...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1444#issuecomment-49206483 QA results for PR 1444:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16732/consoleFull
[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/1313#issuecomment-49206624 By the way, as far as I can tell, findSpeculativeTask also takes a locality level as an argument, and doesn't return tasks that are farther away than that. Do we need to change it? It seems that it shouldn't interfere with other tasks. It's true that we might launch a node-local speculative task before a rack-local non-speculative task, but I think it's unlikely that we'll be wanting to speculate stuff at that point, so I'm not sure this is a problem.
[GitHub] spark pull request: [SPARK-2522] set default broadcast factory to ...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1437#issuecomment-49206767 Merging in master!
[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1367#discussion_r15017783
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmansCorrelation.scala ---
@@ -0,0 +1,102 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.correlation
+
+import org.apache.spark.Partitioner
+import org.apache.spark.SparkContext._
+import org.apache.spark.mllib.linalg.{DenseVector, Matrix, Vector}
+import org.apache.spark.rdd.{CoGroupedRDD, RDD}
+
+/**
+ * Compute Spearman's correlation for two RDDs of the type RDD[Double] or the correlation matrix
+ * for an RDD of the type RDD[Vector].
+ *
+ * Definition of Spearman's correlation can be found at
+ * http://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient
+ */
+object SpearmansCorrelation extends Correlation {
+
+  /**
+   * Compute Spearman's correlation for two datasets.
+   */
+  override def computeCorrelation(x: RDD[Double], y: RDD[Double]): Double = {
+    computeCorrelationWithMatrixImpl(x, y)
+  }
+
+  /**
+   * Compute Spearman's correlation matrix S, for the input matrix, where S(i, j) is the
+   * correlation between column i and j.
+   */
+  override def computeCorrelationMatrix(X: RDD[Vector]): Matrix = {
+    val indexed = X.zipWithIndex()
+    // Attempt to checkpoint the RDD before splitting it into numCols RDD[Double]s to avoid
+    // computing the lineage prefix multiple times.
+    // If checkpoint directory not set, cache the RDD instead.
+    try {
+      indexed.checkpoint()
+    } catch {
+      case e: Exception => indexed.cache()
--- End diff --
FWIW I tried it with zipWithUniqueId, and, as expected, the results were wrong when more than 1 partition is used.
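The difference between the two id schemes can be illustrated without Spark. The sketch below models an RDD as a plain sequence of partitions and reproduces the documented id assignment of `zipWithIndex` (consecutive indices in partition order) and `zipWithUniqueId` (items in the k-th of n partitions get ids k, n+k, 2n+k, ...). The object and method names are hypothetical, and the link to the wrong results is an assumption: presumably the computation relies on ids matching row positions.

```scala
// Plain-Scala model of the two id schemes; no Spark dependency.
object ZipIdSketch {
  // zipWithIndex: consecutive indices, ordered by partition and then by position.
  def zipWithIndexIds(partitions: Seq[Seq[String]]): Seq[(String, Long)] =
    partitions.flatten.zipWithIndex.map { case (x, i) => (x, i.toLong) }

  // zipWithUniqueId: the i-th item of partition k (of n partitions) gets id i * n + k.
  def zipWithUniqueIdIds(partitions: Seq[Seq[String]]): Seq[(String, Long)] = {
    val n = partitions.length
    partitions.zipWithIndex.flatMap { case (part, k) =>
      part.zipWithIndex.map { case (x, i) => (x, i.toLong * n + k) }
    }
  }
}
```

With a single partition both schemes produce the same ids, which is consistent with the observation that the results only went wrong with more than one partition. With uneven partitions such as `Seq(Seq("a", "b", "c"), Seq("d"))`, the unique ids are neither consecutive nor in row order.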
[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/1313#issuecomment-49206975 BTW even in that case we could solve it by having more levels, basically pass in a different level to request speculative tasks after we requested non-speculative ones, or pass in a boolean called speculativeAllowed. It seems the main change really is to treat no-prefs as less local than node- and process-local, whereas in the past it was treated as equivalent to process-local.
[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...
Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/1313#issuecomment-49207097 My rationale for suggesting noPrefs tasks after NODE_LOCAL was to ensure that noPref tasks do not preempt locality for NODE_LOCAL tasks (they can't for PROCESS_LOCAL). I think noPrefs sort of breaks the current model where we wait for PROCESS_LOCAL, then NODE_LOCAL and so on, though I don't seem to have a good solution. For example, suppose we have 20 executors, with 30 no-pref tasks and 100 NODE_LOCAL tasks. Depending on the time taken for the no-pref tasks, we can end up with most/none of the NODE_LOCAL tasks actually getting scheduled as NODE_LOCAL (since the corresponding executors might be running no-pref tasks). This is slightly contrived :-) But it does have implications in general scheduling too. I did not properly account for idle executors, which leads to suboptimal utilization in my suggestion above.
[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1367#discussion_r15017975
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmansCorrelation.scala ---
@@ -0,0 +1,102 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.correlation
+
+import org.apache.spark.Partitioner
+import org.apache.spark.SparkContext._
+import org.apache.spark.mllib.linalg.{DenseVector, Matrix, Vector}
+import org.apache.spark.rdd.{CoGroupedRDD, RDD}
+
+/**
+ * Compute Spearman's correlation for two RDDs of the type RDD[Double] or the correlation matrix
+ * for an RDD of the type RDD[Vector].
+ *
+ * Definition of Spearman's correlation can be found at
+ * http://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient
+ */
+object SpearmansCorrelation extends Correlation {
+
+  /**
+   * Compute Spearman's correlation for two datasets.
+   */
+  override def computeCorrelation(x: RDD[Double], y: RDD[Double]): Double = {
+    computeCorrelationWithMatrixImpl(x, y)
+  }
+
+  /**
+   * Compute Spearman's correlation matrix S, for the input matrix, where S(i, j) is the
+   * correlation between column i and j.
+   */
+  override def computeCorrelationMatrix(X: RDD[Vector]): Matrix = {
+    val indexed = X.zipWithIndex()
--- End diff --
FWIW I tried it with zipWithUniqueId, and, as expected, the results were wrong when more than 1 partition is used.
[GitHub] spark pull request: [SPARK-695] In DAGScheduler's getPreferredLocs...
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/1362#discussion_r15018046
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
@@ -1107,7 +1106,6 @@ class DAGScheduler(
           case shufDep: ShuffleDependency[_, _, _] =>
             val mapStage = getShuffleMapStage(shufDep, stage.jobId)
             if (!mapStage.isAvailable) {
-              visitedStages += mapStage
--- End diff --
I'm curious, why did you remove this? It seems unrelated to the preferred locs fix, and it was necessary to prevent exponential explosion here.
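The "exponential explosion" mentioned above can be shown with a small, Spark-independent sketch. It assumes a hypothetical dependency graph in which each node depends twice on the next one (loosely like zipping an RDD with itself repeatedly), and counts how many node visits a recursive traversal makes with and without a visited set. `VisitedSketch`, `chain` and `countVisits` are illustrative names, not DAGScheduler code.

```scala
import scala.collection.mutable

// Illustrative only; not the DAGScheduler implementation.
object VisitedSketch {
  // Node i has two dependency edges to node i + 1; the last node has no parents.
  def chain(depth: Int): Map[Int, Seq[Int]] =
    (0 to depth).map { i =>
      i -> (if (i < depth) Seq(i + 1, i + 1) else Seq.empty[Int])
    }.toMap

  // Count node visits of a depth-first traversal starting at `root`.
  def countVisits(graph: Map[Int, Seq[Int]], root: Int, useVisited: Boolean): Long = {
    val visited = mutable.HashSet[Int]()
    var visits = 0L
    def visit(node: Int): Unit =
      // `add` returns false when the node was already seen, which prunes the revisit.
      if (!useVisited || visited.add(node)) {
        visits += 1
        graph(node).foreach(visit)
      }
    visit(root)
    visits
  }
}
```

For `chain(10)` the traversal without a visited set makes 2047 visits (2^11 - 1), while the version with a visited set makes 11, one per node. The gap doubles with every extra level.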
[GitHub] spark pull request: [SPARK-2190][SQL] Specialized ColumnType for T...
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/1440#discussion_r15018072
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnStats.scala ---
@@ -344,21 +344,52 @@ private[sql] class StringColumnStats extends BasicColumnStats(STRING) {
   }
   override def contains(row: Row, ordinal: Int) = {
-    !(upperBound eq null) && {
+    (upperBound ne null) && {
--- End diff --
I think I read somewhere ne has better performance in certain cases...
[GitHub] spark pull request: fix compile error of streaming project
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/153#issuecomment-49207773 Merging this in master. Thanks.
[GitHub] spark pull request: fix compile error of streaming project
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/153#issuecomment-49207755 I think I've seen this happening once in a while, but can't exactly reproduce after clean. Anyway it's better to explicitly define the return type for public methods, even if they are simple implicits.
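A minimal sketch of the convention mentioned above, using hypothetical names (`ImplicitSketch`, `Meters`, `doubleToMeters`, `describe`) that do not come from the streaming project. The implicit conversion declares its result type explicitly instead of leaving it to inference.

```scala
import scala.language.implicitConversions

// Hypothetical example; not code from the pull request.
object ImplicitSketch {
  final class Meters(val value: Double)

  // Explicit result type on a public implicit method, so its signature
  // does not depend on type inference.
  implicit def doubleToMeters(d: Double): Meters = new Meters(d)

  def describe(m: Meters): String = s"${m.value} m"
}
```

With `import ImplicitSketch._` in scope, `describe(2.5)` applies the conversion and returns `"2.5 m"`.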
[GitHub] spark pull request: [SPARK-695] In DAGScheduler's getPreferredLocs...
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/1362#issuecomment-49208638 This looks good to me modulo a few comments! One other thing though, you should add a unit test for this functionality. Create a test that would result in a very long running time with the old code (e.g. something where you zip RDDs together 20-30 times and the original one has some preferred locations), and add something around it to time that it runs within, say, 5 seconds.
[GitHub] spark pull request: [SPARK-2317] Improve task logging.
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/1259#issuecomment-49208726 LGTM
[GitHub] spark pull request: SPARK-2298: Show stage attempt in UI
Github user tsudukim commented on the pull request: https://github.com/apache/spark/pull/1384#issuecomment-49207731 @pwendell I agree that there is a lot of room for improvement in the handling of stageId and attemptId. It might be better to break this problem into sub-tasks, and I think this patch should be one of them. Or did you mean we should fix all of these problems in one patch?
[GitHub] spark pull request: fix compile error of streaming project
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/153
[GitHub] spark pull request: SPARK-2298: Show stage attempt in UI
Github user tsudukim commented on a diff in the pull request: https://github.com/apache/spark/pull/1384#discussion_r15019018
--- Diff: core/src/main/scala/org/apache/spark/util/JsonProtocol.scala ---
@@ -478,6 +479,7 @@ private[spark] object JsonProtocol {
   def stageInfoFromJson(json: JValue): StageInfo = {
     val stageId = (json \ "Stage ID").extract[Int]
+    val attemptId = (json \ "Attempt ID").extract[Int]
--- End diff --
Sure.
[GitHub] spark pull request: SPARK-2298: Show stage attempt in UI
Github user tsudukim commented on the pull request: https://github.com/apache/spark/pull/1384#issuecomment-49209319 @rxin OK. After that, I think I can make this patch better.
[GitHub] spark pull request: SPARK-2298: Show stage attempt in UI
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1384#issuecomment-49208410 Let's hold off merging this one until we merge #1262. Then it will be easier to index the information based on stage + attempt.
[GitHub] spark pull request: [SPARK-2317] Improve task logging.
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1259#issuecomment-49209789 Thanks. Merging this. We can fix the serialization error logging in a separate PR.
[GitHub] spark pull request: SPARK-2298: Show stage attempt in UI
Github user tsudukim commented on the pull request: https://github.com/apache/spark/pull/1384#issuecomment-49209857 @rxin in #1262, can I expect the key of the stage data in JobProgressListener to become stageId + attemptId instead of stageId only?
[GitHub] spark pull request: [SPARK-2317] Improve task logging.
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1259
[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/1238#discussion_r15019589
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala ---
@@ -26,6 +26,28 @@ import org.apache.spark.sql.catalyst.trees
 abstract class LogicalPlan extends QueryPlan[LogicalPlan] {
   self: Product =>
+  // TODO: handle overflow?
+  /**
+   * Estimates of various statistics. The default estimation logic simply sums up the corresponding
+   * statistic produced by the children. To override this behavior, override `statistics` and
+   * assign it a overriden version of `Statistics`.
+   */
+  case class Statistics(
+    /**
+     * Number of output tuples. For leaf operators this defaults to 1, otherwise it is set to the
+     * product of children's `numTuples`.
+     */
+    numTuples: Long = childrenStats.map(_.numTuples).product,
+
+    /**
+     * Physical size in bytes. For leaf operators this defaults to 1, otherwise it is set to the
+     * product of children's `sizeInBytes`.
+     */
+    sizeInBytes: Long = childrenStats.map(_.sizeInBytes).product
+  )
+  lazy val statistics: Statistics = new Statistics
+  lazy val childrenStats = children.map(_.statistics)
--- End diff --
I think we need a lazy marker for this, otherwise during the initialization of `childrenStats`, `statistics` in all children will get evaluated too.
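The initialization concern in this comment can be illustrated with a small plain-Scala sketch. `StrictNode` and `LazyNode` are hypothetical stand-ins, not the real `LogicalPlan` classes, and the `log` buffer exists only to record when a node's statistic is computed.

```scala
import scala.collection.mutable.ArrayBuffer

object LazyStats {
  trait Node { def statistics: Long }

  // Strict field: constructing a node immediately forces `statistics` on every child.
  class StrictNode(name: String, children: Seq[Node], log: ArrayBuffer[String]) extends Node {
    val childrenStats: Seq[Long] = children.map(_.statistics)
    lazy val statistics: Long = { log += name; childrenStats.product }
  }

  // Lazy field: children are visited only when `statistics` is first requested.
  class LazyNode(name: String, children: Seq[Node], log: ArrayBuffer[String]) extends Node {
    lazy val childrenStats: Seq[Long] = children.map(_.statistics)
    lazy val statistics: Long = { log += name; childrenStats.product }
  }

  def main(args: Array[String]): Unit = {
    val strictLog = ArrayBuffer.empty[String]
    new StrictNode("join", Seq(new StrictNode("left", Nil, strictLog)), strictLog)
    println(s"strict, after construction: $strictLog")

    val lazyLog = ArrayBuffer.empty[String]
    val plan = new LazyNode("join", Seq(new LazyNode("left", Nil, lazyLog)), lazyLog)
    println(s"lazy, after construction: $lazyLog")
    plan.statistics
    println(s"lazy, after first access: $lazyLog")
  }
}
```

In the strict variant, merely building the parent evaluates the child's statistic; in the lazy variant nothing is computed until the first access.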
[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1367#discussion_r15019606 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/Correlation.scala --- @@ -0,0 +1,121 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.stat.correlation + +import org.apache.spark.mllib.linalg.{DenseVector, Matrix, Vector} +import org.apache.spark.rdd.RDD + +/** + * New correlation algorithms should implement this trait + */ +trait Correlation { + + /** + * Compute correlation for two datasets. + */ + def computeCorrelation(x: RDD[Double], y: RDD[Double]): Double + + /** + * Compute the correlation matrix S, for the input matrix, where S(i, j) is the correlation + * between column i and j. + */ + def computeCorrelationMatrix(X: RDD[Vector]): Matrix + + /** + * Combine the two input RDD[Double]s into an RDD[Vector] and compute the correlation using the + * correlation implementation for RDD[Vector] + */ + def computeCorrelationWithMatrixImpl(x: RDD[Double], y: RDD[Double]): Double = { --- End diff -- It is okay to keep it as long as the `Correlation` class is private. 
[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1367#discussion_r15019630 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/PearsonCorrelation.scala --- @@ -0,0 +1,94 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.stat.correlation + +import breeze.linalg.{DenseMatrix = BDM} + +import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector} +import org.apache.spark.mllib.linalg.distributed.RowMatrix +import org.apache.spark.rdd.RDD + +/** + * Compute Pearson correlation for two RDDs of the type RDD[Double] or the correlation matrix + * for an RDD of the type RDD[Vector]. + * + * Definition of Pearson correlation can be found at + * http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient + */ +object PearsonCorrelation extends Correlation { + + /** + * Compute the Pearson correlation for two datasets. + */ + override def computeCorrelation(x: RDD[Double], y: RDD[Double]): Double = { +computeCorrelationWithMatrixImpl(x, y) + } + + /** + * Compute the Pearson correlation matrix S, for the input matrix, where S(i, j) is the + * correlation between column i and j. 
+ */ + override def computeCorrelationMatrix(X: RDD[Vector]): Matrix = { +val rowMatrix = new RowMatrix(X) +val cov = rowMatrix.computeCovariance() +computeCorrelationMatrixFromCovariance(cov) + } + + /** + * Compute the pearson correlation matrix from the covariance matrix + */ + def computeCorrelationMatrixFromCovariance(covarianceMatrix: Matrix): Matrix = { +val cov = covarianceMatrix.toBreeze.asInstanceOf[BDM[Double]] +val n = cov.cols + +// Compute the standard deviation on the diagonals first +var i = 0 +while (i n) { + cov(i, i) = math.sqrt(cov(i, i)) + i +=1 +} +// or we could put the stddev in its own array to trade space for one less pass over the matrix + +// TODO: use blas.dspr instead to compute the correlation matrix +// if the covariance matrix comes in the upper triangular form for free + +// Loop through columns since cov is column major +var j = 0 +var sigma = 0.0 +while (j n) { + sigma = cov(j, j) + i = 0 + while (i j) { +val covariance = cov(i, j) / (sigma * cov(i, i)) --- End diff -- Agree with @srowen : set NaN and generate a warning message. Btw, R returns a warning message instead of an error. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
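The "set NaN and generate a warning" behaviour agreed on here might look roughly like the sketch below. It works on plain nested arrays instead of Breeze matrices, keeps the standard deviations in their own array (the space-for-one-pass trade mentioned in the diff's comment), and returns the number of degenerate entries so the caller can decide how to warn. `CorrFromCov` and its method are hypothetical names, and leaving the diagonal at 1.0 for a constant column is a choice made for this sketch.

```scala
object CorrFromCov {
  /** Converts a symmetric covariance matrix into a correlation matrix.
    * Off-diagonal entries that involve a zero-variance column become NaN;
    * the second result is how many such entries were produced. */
  def correlationFromCovariance(cov: Array[Array[Double]]): (Array[Array[Double]], Int) = {
    val n = cov.length
    val sd = Array.tabulate(n)(i => math.sqrt(cov(i)(i)))
    var degenerate = 0
    val corr = Array.tabulate(n, n) { (i, j) =>
      if (i == j) 1.0
      else if (sd(i) == 0.0 || sd(j) == 0.0) { degenerate += 1; Double.NaN }
      else cov(i)(j) / (sd(i) * sd(j))
    }
    (corr, degenerate)
  }

  def main(args: Array[String]): Unit = {
    val (corr, bad) = correlationFromCovariance(Array(Array(1.0, 0.0), Array(0.0, 0.0)))
    if (bad > 0) Console.err.println(s"warning: $bad entries are NaN because a column has zero variance")
    println(corr.map(_.mkString(" ")).mkString("\n"))
  }
}
```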
[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1367#discussion_r15019612 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/Correlation.scala --- @@ -0,0 +1,121 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.stat.correlation + +import org.apache.spark.mllib.linalg.{DenseVector, Matrix, Vector} +import org.apache.spark.rdd.RDD + +/** + * New correlation algorithms should implement this trait + */ +trait Correlation { + + /** + * Compute correlation for two datasets. + */ + def computeCorrelation(x: RDD[Double], y: RDD[Double]): Double + + /** + * Compute the correlation matrix S, for the input matrix, where S(i, j) is the correlation + * between column i and j. + */ + def computeCorrelationMatrix(X: RDD[Vector]): Matrix + + /** + * Combine the two input RDD[Double]s into an RDD[Vector] and compute the correlation using the + * correlation implementation for RDD[Vector] + */ + def computeCorrelationWithMatrixImpl(x: RDD[Double], y: RDD[Double]): Double = { +val mat: RDD[Vector] = x.zip(y).mapPartitions({ iter = --- End diff -- Make sure you use `preservePartitioning` correctly. 
My first understanding was wrong and I may have used it wrongly somewhere. The argument should really be called `preservePartitioner`. It passes the partitioner to the derived RDD without checking. This may lead to wrong results. For example, the join in the following code block returns nothing, though `c` and `d` have the same content.

~~~
val a = sc.makeRDD(Seq((0, 1), (2, 3), (4, 5), (6, 7)), 4)
val b = a.groupByKey()
val c = b.mapPartitions({ iter =>
  iter.flatMap { case (k, vv) => vv.map { v => (v, k) } }
}, true)
c.collect()
val d = sc.makeRDD(Seq((1, 0), (3, 2), (5, 4), (7, 6)), 4)
c.join(d).collect()
~~~
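Why that join comes back empty can be reproduced without a cluster. The sketch below is a plain-Scala simulation under stated assumptions: partitions are modelled as a `Vector` of sequences, `partitionOf` mimics hash partitioning for non-negative `Int` keys, and `coPartitionedJoin` is a hypothetical helper that, like a join trusting a shared partitioner, only compares records in the same partition index.

```scala
object PartitionerPitfall {
  val numPartitions = 4

  // Same rule a hash partitioner applies to non-negative Int keys.
  def partitionOf(key: Int): Int = key % numPartitions

  /** Places each record in the partition that `partitionOf` assigns to its key. */
  def partitionByKey(records: Seq[(Int, Int)]): Vector[Seq[(Int, Int)]] =
    Vector.tabulate(numPartitions)(p => records.filter { case (k, _) => partitionOf(k) == p })

  /** Joins partition by partition, trusting that both sides are laid out by `partitionOf`. */
  def coPartitionedJoin(left: Vector[Seq[(Int, Int)]],
                        right: Vector[Seq[(Int, Int)]]): Seq[(Int, (Int, Int))] =
    left.zip(right).flatMap { case (l, r) =>
      l.flatMap { case (k, lv) => r.filter(_._1 == k).map { case (_, rv) => (k, (lv, rv)) } }
    }

  def main(args: Array[String]): Unit = {
    val b = partitionByKey(Seq((0, 1), (2, 3), (4, 5), (6, 7)))
    // Swap key and value inside each partition but keep the old layout,
    // which is what wrongly claiming "partitioning preserved" amounts to.
    val c = b.map(_.map { case (k, v) => (v, k) })
    val d = partitionByKey(Seq((1, 0), (3, 2), (5, 4), (7, 6)))
    println(coPartitionedJoin(c, d))                         // stale layout
    println(coPartitionedJoin(partitionByKey(c.flatten), d)) // layout rebuilt from the new keys
  }
}
```

Once the key changes inside `mapPartitions`, the records no longer sit where the inherited partitioner says they do, so the partition-wise comparison never sees matching keys side by side.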
[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1367#discussion_r15019642 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmansCorrelation.scala --- @@ -0,0 +1,102 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.stat.correlation + +import org.apache.spark.Partitioner +import org.apache.spark.SparkContext._ +import org.apache.spark.mllib.linalg.{DenseVector, Matrix, Vector} +import org.apache.spark.rdd.{CoGroupedRDD, RDD} + +/** + * Compute Spearman's correlation for two RDDs of the type RDD[Double] or the correlation matrix + * for an RDD of the type RDD[Vector]. + * + * Definition of Spearman's correlation can be found at + * http://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient + */ +object SpearmansCorrelation extends Correlation { + + /** + * Compute Spearman's correlation for two datasets. + */ + override def computeCorrelation(x: RDD[Double], y: RDD[Double]): Double = { +computeCorrelationWithMatrixImpl(x, y) + } + + /** + * Compute Spearman's correlation matrix S, for the input matrix, where S(i, j) is the + * correlation between column i and j. 
+ */ + override def computeCorrelationMatrix(X: RDD[Vector]): Matrix = { +val indexed = X.zipWithIndex() --- End diff -- The input `X` is not ordered. I think the purpose here is just to assign each row an id for joining back the elements after we obtain the rank for each column. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1367#discussion_r15019620 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/Correlation.scala --- @@ -0,0 +1,121 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.stat.correlation + +import org.apache.spark.mllib.linalg.{DenseVector, Matrix, Vector} +import org.apache.spark.rdd.RDD + +/** + * New correlation algorithms should implement this trait + */ +trait Correlation { + + /** + * Compute correlation for two datasets. + */ + def computeCorrelation(x: RDD[Double], y: RDD[Double]): Double + + /** + * Compute the correlation matrix S, for the input matrix, where S(i, j) is the correlation + * between column i and j. 
+ */ + def computeCorrelationMatrix(X: RDD[Vector]): Matrix + + /** + * Combine the two input RDD[Double]s into an RDD[Vector] and compute the correlation using the + * correlation implementation for RDD[Vector] + */ + def computeCorrelationWithMatrixImpl(x: RDD[Double], y: RDD[Double]): Double = { +val mat: RDD[Vector] = x.zip(y).mapPartitions({ iter = + iter.map {case(xi, yi) = new DenseVector(Array(xi, yi))} +}, preservesPartitioning = true) +computeCorrelationMatrix(mat)(0, 1) + } + +} + +/** + * Delegates computation to the specific correlation object based on the input method name + * + * Currently supported correlations: pearson, spearman. + * After new correlation algorithms are added, please update the documentation here and in + * Statistics.scala for the correlation APIs. + * + * Cases are ignored when doing method matching. We also allow initials, e.g. P for pearson, as + * long as initials are unique in the supported set of correlation algorithms. In addition, a + * supported method name has to be a substring of the input method name for it to be matched (e.g. + * spearmansrho will be matched to spearman) + * + * Maintains the default correlation type, pearson + */ +object Correlations { --- End diff -- Sounds good. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: [SPARK-2509][SQL] Add optimization for Substri...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/1428#issuecomment-49210382 I'm not sure I agree with that. This is a pretty niche optimization, not something fundamental about the expressions that is required for correct evaluation (and the first version of this code had a bunch of mistakes). Having all of these rules in one place made it easier for me to realize that when doing the review.
[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1238#issuecomment-49210466 Jenkins, test this please. I think I have addressed the latest round of review comments, with the biggest changes being:
- Remove statistics estimates from ParquetRelation for now due to performance considerations.
- When estimating sizes for MetastoreRelation, instead of making potentially expensive Hadoop FileSystem calls, we peek into the Hive Metastore for populated information.
[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/1313#issuecomment-49210568 Ah, I see, so you're saying it's worth waiting 3 seconds for node-local ones to be able to go instead of launching no-prefs tasks. That does make sense. We just have to make sure that if a task set contains *only* process-local and no-prefs tasks, for instance, this wait doesn't happen. (I'm pretty sure that right now we don't wait for node-local if there were no process-local tasks, right? We only wait when we move up a level.)
[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1367#discussion_r15019889 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmansCorrelation.scala --- @@ -0,0 +1,102 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.stat.correlation + +import org.apache.spark.Partitioner +import org.apache.spark.SparkContext._ +import org.apache.spark.mllib.linalg.{DenseVector, Matrix, Vector} +import org.apache.spark.rdd.{CoGroupedRDD, RDD} + +/** + * Compute Spearman's correlation for two RDDs of the type RDD[Double] or the correlation matrix + * for an RDD of the type RDD[Vector]. + * + * Definition of Spearman's correlation can be found at + * http://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient + */ +object SpearmansCorrelation extends Correlation { + + /** + * Compute Spearman's correlation for two datasets. + */ + override def computeCorrelation(x: RDD[Double], y: RDD[Double]): Double = { +computeCorrelationWithMatrixImpl(x, y) + } + + /** + * Compute Spearman's correlation matrix S, for the input matrix, where S(i, j) is the + * correlation between column i and j. 
+ */ + override def computeCorrelationMatrix(X: RDD[Vector]): Matrix = { +val indexed = X.zipWithIndex() +// Attempt to checkpoint the RDD before splitting it into numCols RDD[Double]s to avoid +// computing the lineage prefix multiple times. +// If checkpoint directory not set, cache the RDD instead. +try { + indexed.checkpoint() +} catch { + case e: Exception = indexed.cache() --- End diff -- See my previous comment about mentioning the cost in the doc. User should be responsible to cache the RDD. If user already cached the input RDD, we are duplicating the data here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2465. Use long as user / item ID for ALS
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/1393#issuecomment-49210917 BTW this could also be a place to use the dreaded Scala @specialized annotation to template the code for Ints vs Longs, though as far as I know that's being deprecated by the Scala developers.
[GitHub] spark pull request: SPARK-2465. Use long as user / item ID for ALS
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/1393#issuecomment-49210831 Yeah, that's what I meant, we can clone it at first but we might be able to share code later (at least the math code we run on each block, or stuff like that). But let's do it only if you find out it's worth it.
[GitHub] spark pull request: SPARK-2098: All Spark processes should support...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1256#issuecomment-49211180 QA results for PR 1256:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class SparkConf(loadDefaults: Boolean, fileName: Option[String])

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16734/consoleFull
[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/1238#discussion_r15020066
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala ---
@@ -26,6 +26,28 @@ import org.apache.spark.sql.catalyst.trees
 abstract class LogicalPlan extends QueryPlan[LogicalPlan] {
   self: Product =>
+  // TODO: handle overflow?
+  /**
+   * Estimates of various statistics. The default estimation logic simply sums up the corresponding
+   * statistic produced by the children. To override this behavior, override `statistics` and
+   * assign it a overriden version of `Statistics`.
+   */
+  case class Statistics(
+    /**
+     * Number of output tuples. For leaf operators this defaults to 1, otherwise it is set to the
+     * product of children's `numTuples`.
+     */
+    numTuples: Long = childrenStats.map(_.numTuples).product,
+
+    /**
+     * Physical size in bytes. For leaf operators this defaults to 1, otherwise it is set to the
+     * product of children's `sizeInBytes`.
+     */
+    sizeInBytes: Long = childrenStats.map(_.sizeInBytes).product
+  )
+  lazy val statistics: Statistics = new Statistics
+  lazy val childrenStats = children.map(_.statistics)
--- End diff --
I was suggesting we turn it into a def, remove it entirely, or stop memoizing statistics. Double memoization seems unnecessarily expensive memory wise.
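One of the options named in this comment, keeping `statistics` memoized but making `childrenStats` a `def`, can be sketched in plain Scala as below. `Node`, `leafSize`, and the `computations` counter are hypothetical and exist only to make the effect observable; this is not the code that was merged.

```scala
object Memoization {
  // Hypothetical plan node: `statistics` is cached once per node,
  // while `childrenStats` is a def and therefore holds no extra field.
  class Node(children: Seq[Node], leafSize: Long = 1L) {
    var computations = 0
    def childrenStats: Seq[Long] = children.map(_.statistics)
    lazy val statistics: Long = {
      computations += 1
      if (children.isEmpty) leafSize else childrenStats.product
    }
  }

  def main(args: Array[String]): Unit = {
    val left = new Node(Nil, 3L)
    val right = new Node(Nil, 4L)
    val join = new Node(Seq(left, right))
    println(join.statistics)
    println(join.statistics) // served from the cached value
    println(Seq(join, left, right).map(_.computations))
  }
}
```

Because each child's `statistics` is already cached, recomputing `childrenStats` on demand is only a cheap map over stored values, so a second cached sequence per node buys little.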
[GitHub] spark pull request: SPARK-1719: spark.*.extraLibraryPath isn't app...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1022#issuecomment-49211292 QA results for PR 1022:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16735/consoleFull
[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1367#discussion_r15020198

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmansCorrelation.scala ---
@@ -0,0 +1,102 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.correlation
+
+import org.apache.spark.Partitioner
+import org.apache.spark.SparkContext._
+import org.apache.spark.mllib.linalg.{DenseVector, Matrix, Vector}
+import org.apache.spark.rdd.{CoGroupedRDD, RDD}
+
+/**
+ * Compute Spearman's correlation for two RDDs of the type RDD[Double] or the correlation matrix
+ * for an RDD of the type RDD[Vector].
+ *
+ * Definition of Spearman's correlation can be found at
+ * http://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient
+ */
+object SpearmansCorrelation extends Correlation {
+
+  /**
+   * Compute Spearman's correlation for two datasets.
+   */
+  override def computeCorrelation(x: RDD[Double], y: RDD[Double]): Double = {
+    computeCorrelationWithMatrixImpl(x, y)
+  }
+
+  /**
+   * Compute Spearman's correlation matrix S, for the input matrix, where S(i, j) is the
+   * correlation between column i and j.
+   */
+  override def computeCorrelationMatrix(X: RDD[Vector]): Matrix = {
+    val indexed = X.zipWithIndex()
+    // Attempt to checkpoint the RDD before splitting it into numCols RDD[Double]s to avoid
+    // computing the lineage prefix multiple times.
+    // If checkpoint directory not set, cache the RDD instead.
+    try {
+      indexed.checkpoint()
+    } catch {
+      case e: Exception => indexed.cache()
+    }
+
+    val numCols = X.first.size
+    val ranks = new Array[RDD[(Long, Double)]](numCols)
+
+    // Note: we use a for loop here instead of a while loop with a single index variable
+    // to avoid race condition caused by closure serialization
+    for (k <- 0 until numCols) {
+      val column = indexed.map {case(vector, index) => {
+        (vector(k), index)}
+      }
+      ranks(k) = getRanks(column)
+    }
+
+    val ranksMat: RDD[Vector] = makeRankMatrix(ranks)
+    PearsonCorrelation.computeCorrelationMatrix(ranksMat)
+  }
+
+  /**
+   * Compute the ranks for elements in the input RDD, using the average method for ties.
+   *
+   * With the average method, elements with the same value receive the same rank that's computed
+   * by taking the average of their positions in the sorted list.
+   * e.g. ranks([2, 1, 0, 2]) = [3.5, 2.0, 1.0, 3.5]
+   */
+  private def getRanks(indexed: RDD[(Double, Long)]): RDD[(Long, Double)] = {
+    // Get elements' positions in the sorted list for computing average rank for duplicate values
+    val sorted = indexed.sortByKey().zipWithIndex()
+    val groupedByValue = sorted.groupBy(_._1._1)
--- End diff --

The `RangePartitioner` guarantees same keys appear in the same partition. Then inside each partition, you go through an iterator of (value: Double, rank: Long). Do not output a value and its rank directly. Count records with the same value until a different value comes up, then flush out the value and the average rank.
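The streaming approach suggested in this review comment can be sketched in plain Scala, without Spark. This is only an illustration under assumptions: the object name `AverageRankSketch`, the function `averageRanks`, and the input shape `((value, originalIndex), position)` are hypothetical and chosen to mirror what `sortByKey().zipWithIndex()` would produce. In Spark, a function like this would presumably be applied per partition, relying on the reviewer's point that equal keys land in the same partition.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical helper, not part of the pull request.
object AverageRankSketch {
  /**
   * Input: ((value, originalIndex), zeroBasedPosition), already sorted by value.
   * Output: (originalIndex, averageRank), with 1-based ranks and ties averaged.
   */
  def averageRanks(sorted: Iterator[((Double, Long), Long)]): Iterator[(Long, Double)] = {
    val in = sorted.buffered
    new Iterator[(Long, Double)] {
      // Results for the run of equal values that was flushed most recently.
      private var pending: Iterator[(Long, Double)] = Iterator.empty

      def hasNext: Boolean = pending.hasNext || in.hasNext

      def next(): (Long, Double) = {
        if (!pending.hasNext) {
          // Start a new run with the next record.
          val ((value, firstId), firstPos) = in.next()
          val ids = ArrayBuffer(firstId)
          var positionSum = firstPos + 1
          // Keep counting until a different value comes up.
          while (in.hasNext && in.head._1._1 == value) {
            val ((_, id), pos) = in.next()
            ids += id
            positionSum += pos + 1
          }
          // Flush the run: every member gets the average of the run's positions.
          val averageRank = positionSum.toDouble / ids.size
          pending = ids.iterator.map(id => (id, averageRank))
        }
        pending.next()
      }
    }
  }
}
```

Only one run of tied values is buffered at a time, which is the main difference from the `groupBy` in the quoted diff.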
[GitHub] spark pull request: SPARK-2277: make TaskScheduler track hosts on ...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1212#issuecomment-49211502 Hey @mridulm I usually won't merge any code in the scheduler unless @markhamstra or @kayhousterhout has looked at it and signed off, since they are the most active maintainers of this code. That might be a good practice to follow in the future. We can try to come up with a list of maintainers to make it more clear who should be consulted for code in various parts of Spark.
[GitHub] spark pull request: [SPARK-2509][SQL] Add optimization for Substri...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/1428#issuecomment-49211681 That is exactly the argument I made when the folding logic was added. :) I suggested that we add `deterministic` instead and then have a rule that folds things that are `deterministic` and have no `references`. `deterministic` I would argue is something fundamental about the expression itself, unlike these optimizations we are making.
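A rule that folds everything that is `deterministic` and has no `references` might look roughly like the following. This is a toy sketch under assumptions, not Catalyst code: the types `Expr`, `Literal`, `Column`, `Rand`, and `Add` and the functions `eval` and `fold` are all hypothetical stand-ins.

```scala
// Toy expression tree; every name here is hypothetical, not Spark's Catalyst API.
object FoldingSketch {
  sealed trait Expr {
    def children: Seq[Expr]
    // Columns this expression reads from its input row.
    def references: Set[String] = children.flatMap(_.references).toSet
    // True when the same inputs always produce the same result.
    def deterministic: Boolean = children.forall(_.deterministic)
  }
  case class Literal(value: Double) extends Expr {
    def children: Seq[Expr] = Nil
  }
  case class Column(name: String) extends Expr {
    def children: Seq[Expr] = Nil
    override def references: Set[String] = Set(name)
  }
  case object Rand extends Expr {
    def children: Seq[Expr] = Nil
    override def deterministic: Boolean = false
  }
  case class Add(left: Expr, right: Expr) extends Expr {
    def children: Seq[Expr] = Seq(left, right)
  }

  // Evaluation without an input row; only valid for reference-free expressions.
  def eval(e: Expr): Double = e match {
    case Literal(v)   => v
    case Add(l, r)    => eval(l) + eval(r)
    case Rand         => scala.util.Random.nextDouble()
    case Column(name) => sys.error(s"column $name needs an input row")
  }

  // Fold any subtree that is deterministic and reads no columns.
  def fold(e: Expr): Expr = e match {
    case lit: Literal                                 => lit
    case _ if e.deterministic && e.references.isEmpty => Literal(eval(e))
    case Add(l, r)                                    => Add(fold(l), fold(r))
    case other                                        => other
  }
}
```

With this shape the rule never has to know about individual expression types; a non-deterministic expression such as `Rand` opts out by overriding a single property.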
[GitHub] spark pull request: [SPARK-2509][SQL] Add optimization for Substri...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/1428#issuecomment-49211732 I would be supportive of changing it to match my original proposal...
[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1367#discussion_r15020402

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmansCorrelation.scala ---
@@ -0,0 +1,102 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.correlation
+
+import org.apache.spark.Partitioner
+import org.apache.spark.SparkContext._
+import org.apache.spark.mllib.linalg.{DenseVector, Matrix, Vector}
+import org.apache.spark.rdd.{CoGroupedRDD, RDD}
+
+/**
+ * Compute Spearman's correlation for two RDDs of the type RDD[Double] or the correlation matrix
+ * for an RDD of the type RDD[Vector].
+ *
+ * Definition of Spearman's correlation can be found at
+ * http://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient
+ */
+object SpearmansCorrelation extends Correlation {
+
+  /**
+   * Compute Spearman's correlation for two datasets.
+   */
+  override def computeCorrelation(x: RDD[Double], y: RDD[Double]): Double = {
+    computeCorrelationWithMatrixImpl(x, y)
+  }
+
+  /**
+   * Compute Spearman's correlation matrix S, for the input matrix, where S(i, j) is the
+   * correlation between column i and j.
+   */
+  override def computeCorrelationMatrix(X: RDD[Vector]): Matrix = {
+    val indexed = X.zipWithIndex()
+    // Attempt to checkpoint the RDD before splitting it into numCols RDD[Double]s to avoid
+    // computing the lineage prefix multiple times.
+    // If checkpoint directory not set, cache the RDD instead.
+    try {
+      indexed.checkpoint()
+    } catch {
+      case e: Exception => indexed.cache()
+    }
+
+    val numCols = X.first.size
+    val ranks = new Array[RDD[(Long, Double)]](numCols)
+
+    // Note: we use a for loop here instead of a while loop with a single index variable
+    // to avoid race condition caused by closure serialization
+    for (k <- 0 until numCols) {
+      val column = indexed.map {case(vector, index) => {
+        (vector(k), index)}
+      }
+      ranks(k) = getRanks(column)
+    }
+
+    val ranksMat: RDD[Vector] = makeRankMatrix(ranks)
+    PearsonCorrelation.computeCorrelationMatrix(ranksMat)
+  }
+
+  /**
+   * Compute the ranks for elements in the input RDD, using the average method for ties.
+   *
+   * With the average method, elements with the same value receive the same rank that's computed
+   * by taking the average of their positions in the sorted list.
+   * e.g. ranks([2, 1, 0, 2]) = [3.5, 2.0, 1.0, 3.5]
+   */
+  private def getRanks(indexed: RDD[(Double, Long)]): RDD[(Long, Double)] = {
+    // Get elements' positions in the sorted list for computing average rank for duplicate values
+    val sorted = indexed.sortByKey().zipWithIndex()
+    val groupedByValue = sorted.groupBy(_._1._1)
+    val ranks = groupedByValue.flatMap[(Long, Double)] { item =>
+      val duplicates = item._2
+      if (duplicates.size > 1) {
+        val averageRank = duplicates.foldLeft(0L) {_ + _._2 + 1} / duplicates.size.toDouble
+        duplicates.map(entry => (entry._1._2, averageRank)).toSeq
+      } else {
+        duplicates.map(entry => (entry._1._2, entry._2.toDouble + 1)).toSeq
+      }
+    }
+    ranks.sortByKey()
+  }
+
+  private def makeRankMatrix(ranks: Array[RDD[(Long, Double)]]): RDD[Vector] = {
+    val partitioner = Partitioner.defaultPartitioner(ranks(0), ranks.tail: _*)
--- End diff --

The code is weird because `ranks` do not share any partitioner. It is cleaner to use `HashPartitioner` with the same number of partitions as the input RDD or simply `defaultParallelism`.
[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1367#discussion_r15020400

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmansCorrelation.scala ---
@@ -0,0 +1,102 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.correlation
+
+import org.apache.spark.Partitioner
+import org.apache.spark.SparkContext._
+import org.apache.spark.mllib.linalg.{DenseVector, Matrix, Vector}
+import org.apache.spark.rdd.{CoGroupedRDD, RDD}
+
+/**
+ * Compute Spearman's correlation for two RDDs of the type RDD[Double] or the correlation matrix
+ * for an RDD of the type RDD[Vector].
+ *
+ * Definition of Spearman's correlation can be found at
+ * http://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient
+ */
+object SpearmansCorrelation extends Correlation {
+
+  /**
+   * Compute Spearman's correlation for two datasets.
+   */
+  override def computeCorrelation(x: RDD[Double], y: RDD[Double]): Double = {
+    computeCorrelationWithMatrixImpl(x, y)
+  }
+
+  /**
+   * Compute Spearman's correlation matrix S, for the input matrix, where S(i, j) is the
+   * correlation between column i and j.
+   */
+  override def computeCorrelationMatrix(X: RDD[Vector]): Matrix = {
+    val indexed = X.zipWithIndex()
--- End diff --

You're right. Sorry, I was thinking about the second `zipWithIndex` call (on the sorted RDD). That one absolutely cannot be replaced with `zipWithUniqueId`, or the results would all be wrong.
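Why the second call cannot be swapped can be seen from how the two methods assign numbers. The sketch below simulates both on plain Scala collections, with a `Seq[Seq[T]]` standing in for an RDD's partitions; the object and function names are hypothetical, and the id formula follows Spark's documented behaviour for `zipWithUniqueId` (items in partition k of n get ids k, n+k, 2n+k, and so on).

```scala
// Plain-collection simulation; the outer Seq plays the role of an RDD's partitions.
object IndexingSketch {
  // Consecutive indices that follow the global order of the elements.
  def zipWithIndex[T](parts: Seq[Seq[T]]): Seq[(T, Long)] =
    parts.flatten.zipWithIndex.map { case (x, i) => (x, i.toLong) }

  // Unique ids with no ordering guarantee across partitions:
  // the i-th item of partition k (out of n partitions) gets id i * n + k.
  def zipWithUniqueId[T](parts: Seq[Seq[T]]): Seq[(T, Long)] = {
    val n = parts.size
    parts.zipWithIndex.flatMap { case (part, k) =>
      part.zipWithIndex.map { case (x, i) => (x, i.toLong * n + k) }
    }
  }
}
```

For sorted data split as `Seq(Seq(0.0, 1.0, 2.0), Seq(3.0))`, the largest value 3.0 receives unique id 1, which is smaller than the id of 1.0, so unique ids cannot serve as positions in the sorted list.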
[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/1367#discussion_r15020475

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmansCorrelation.scala ---
@@ -0,0 +1,102 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.correlation
+
+import org.apache.spark.Partitioner
+import org.apache.spark.SparkContext._
+import org.apache.spark.mllib.linalg.{DenseVector, Matrix, Vector}
+import org.apache.spark.rdd.{CoGroupedRDD, RDD}
+
+/**
+ * Compute Spearman's correlation for two RDDs of the type RDD[Double] or the correlation matrix
+ * for an RDD of the type RDD[Vector].
+ *
+ * Definition of Spearman's correlation can be found at
+ * http://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient
+ */
+object SpearmansCorrelation extends Correlation {
+
+  /**
+   * Compute Spearman's correlation for two datasets.
+   */
+  override def computeCorrelation(x: RDD[Double], y: RDD[Double]): Double = {
+    computeCorrelationWithMatrixImpl(x, y)
+  }
+
+  /**
+   * Compute Spearman's correlation matrix S, for the input matrix, where S(i, j) is the
+   * correlation between column i and j.
+   */
+  override def computeCorrelationMatrix(X: RDD[Vector]): Matrix = {
+    val indexed = X.zipWithIndex()
+    // Attempt to checkpoint the RDD before splitting it into numCols RDD[Double]s to avoid
+    // computing the lineage prefix multiple times.
+    // If checkpoint directory not set, cache the RDD instead.
+    try {
+      indexed.checkpoint()
+    } catch {
+      case e: Exception => indexed.cache()
+    }
+
+    val numCols = X.first.size
+    val ranks = new Array[RDD[(Long, Double)]](numCols)
+
+    // Note: we use a for loop here instead of a while loop with a single index variable
+    // to avoid race condition caused by closure serialization
+    for (k <- 0 until numCols) {
+      val column = indexed.map {case(vector, index) => {
+        (vector(k), index)}
+      }
+      ranks(k) = getRanks(column)
+    }
+
+    val ranksMat: RDD[Vector] = makeRankMatrix(ranks)
+    PearsonCorrelation.computeCorrelationMatrix(ranksMat)
+  }
+
+  /**
+   * Compute the ranks for elements in the input RDD, using the average method for ties.
+   *
+   * With the average method, elements with the same value receive the same rank that's computed
+   * by taking the average of their positions in the sorted list.
+   * e.g. ranks([2, 1, 0, 2]) = [3.5, 2.0, 1.0, 3.5]
+   */
+  private def getRanks(indexed: RDD[(Double, Long)]): RDD[(Long, Double)] = {
+    // Get elements' positions in the sorted list for computing average rank for duplicate values
+    val sorted = indexed.sortByKey().zipWithIndex()
+    val groupedByValue = sorted.groupBy(_._1._1)
+    val ranks = groupedByValue.flatMap[(Long, Double)] { item =>
+      val duplicates = item._2
+      if (duplicates.size > 1) {
+        val averageRank = duplicates.foldLeft(0L) {_ + _._2 + 1} / duplicates.size.toDouble
+        duplicates.map(entry => (entry._1._2, averageRank)).toSeq
+      } else {
+        duplicates.map(entry => (entry._1._2, entry._2.toDouble + 1)).toSeq
+      }
+    }
+    ranks.sortByKey()
+  }
+
+  private def makeRankMatrix(ranks: Array[RDD[(Long, Double)]]): RDD[Vector] = {
+    val partitioner = Partitioner.defaultPartitioner(ranks(0), ranks.tail: _*)
+    val cogrouped = new CoGroupedRDD[Long](ranks, partitioner)
+    cogrouped.mapPartitions({ iter =>
--- End diff --

Please check my comment above about `preservePartitioning` before adding `preservePartitioning = true`.
[GitHub] spark pull request: SPARK-1719: spark.*.extraLibraryPath isn't app...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1022#issuecomment-49213268
QA results for PR 1022:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16737/consoleFull
[GitHub] spark pull request: [SPARK-2190][SQL] Specialized ColumnType for T...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/1440#discussion_r15020827

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnStats.scala ---
@@ -344,21 +344,52 @@ private[sql] class StringColumnStats extends BasicColumnStats(STRING) {
   }

   override def contains(row: Row, ordinal: Int) = {
-    !(upperBound eq null) && {
+    (upperBound ne null) && {
--- End diff --

Oh, I see... I forgot that `ne`/`eq` do reference equality... in the case of null I would imagine there is no difference, but this is probably fine then.
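The distinction mentioned here can be checked with a few lines of standard Scala. This is a small illustration only; the object name `ReferenceEqualitySketch` and its values are hypothetical.

```scala
// `==` compares by value (and is null-safe); `eq` / `ne` compare object identity.
object ReferenceEqualitySketch {
  val a = new String("spark")
  val b = new String("spark")

  val equalByValue: Boolean = a == b      // same characters
  val sameReference: Boolean = a eq b     // two distinct objects
  val differentReference: Boolean = a ne b

  // For a null check both styles agree, as the comment above suspects.
  val missing: String = null
  val nullByEq: Boolean = missing eq null
  val nullByEquals: Boolean = missing == null
  val notNullByNe: Boolean = a ne null
}
```

So `!(x eq null)` and `x ne null` express the same check; the second form just avoids the negation.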
[GitHub] spark pull request: SPARK-2277: make TaskScheduler track hosts on ...
Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/1212#issuecomment-49214821
Thanks, but that is fine; I merged it in after I resolved my local hardware issues today. So I did not need to impose on you to merge after all!

On 17-Jul-2014 12:33 am, Patrick Wendell notificati...@github.com wrote:
Hey @mridulm I usually won't merge any code in the scheduler unless @markhamstra or @kayhousterhout has looked at it and signed off, since they are the most active maintainers of this code. That might be a good practice to follow in the future. We can try to come up with a list of maintainers to make it more clear who should be consulted for code in various parts of Spark.
[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...
Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/1313#issuecomment-49215361
I am not sure if it is OK to wait... This is something I never considered from the beginning when I added process_local... Maybe it is OK! If it is not, then we might need to come up with something. Unlike earlier, the noPrefs list now truly contains only tasks which have no preference (earlier, failed tasks also ended up here)... So maybe it is not common anymore? And so OK to wait?

On 17-Jul-2014 12:26 am, Matei Zaharia notificati...@github.com wrote:
Ah, I see, so you're saying it's worth to wait 3 seconds for node-local ones to be able to go instead of launching no-prefs tasks. That does make sense. We just have to make sure that if a task set contains *only* process-local and no-prefs tasks, for instance, this wait doesn't happen. (I'm pretty sure that right now we don't wait for node-local if there were no process-local tasks, right? We only wait when we move up a level.)
[GitHub] spark pull request: [SPARK-2359][MLlib] Correlations
Github user dorx commented on a diff in the pull request: https://github.com/apache/spark/pull/1367#discussion_r15022756

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/Correlation.scala ---
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.stat.correlation
+
+import org.apache.spark.mllib.linalg.{DenseVector, Matrix, Vector}
+import org.apache.spark.rdd.RDD
+
+/**
+ * New correlation algorithms should implement this trait
+ */
+trait Correlation {
+
+  /**
+   * Compute correlation for two datasets.
+   */
+  def computeCorrelation(x: RDD[Double], y: RDD[Double]): Double
+
+  /**
+   * Compute the correlation matrix S, for the input matrix, where S(i, j) is the correlation
+   * between column i and j.
+   */
+  def computeCorrelationMatrix(X: RDD[Vector]): Matrix
+
+  /**
+   * Combine the two input RDD[Double]s into an RDD[Vector] and compute the correlation using the
+   * correlation implementation for RDD[Vector]
+   */
+  def computeCorrelationWithMatrixImpl(x: RDD[Double], y: RDD[Double]): Double = {
+    val mat: RDD[Vector] = x.zip(y).mapPartitions({ iter =>
--- End diff --

Thanks for the example explaining what `preservePartitioning` does. It seems like in this case it's a good idea to copy the partitioner into the pair RDD since zip preserves the partitioner, and if we don't preserve the partitioner here, we're unnecessarily dropping the partitioner even though we didn't need to shuffle the data at all.
[GitHub] spark pull request: replace println to log4j
Github user tdas commented on the pull request: https://github.com/apache/spark/pull/1372#issuecomment-49217835 There have been further comments regarding this. It would be great if you could address them as well.
[GitHub] spark pull request: SPARK-1097: Do not introduce deadlock while fi...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1409#issuecomment-49218047
QA results for PR 1409:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16740/consoleFull
[GitHub] spark pull request: [SPARK-1477]: Add the lifecycle interface
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/991#issuecomment-49218315
QA results for PR 991:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
trait Lifecycle extends Service {
trait Service extends java.io.Closeable {
class SparkContext(config: SparkConf) extends Logging with Lifecycle {
class JavaStreamingContext(val ssc: StreamingContext) extends Lifecycle {
class JobGenerator(jobScheduler: JobScheduler) extends Logging with Lifecycle {
class JobScheduler(val ssc: StreamingContext) extends Logging with Lifecycle {
class ReceiverTracker(ssc: StreamingContext) extends Logging with Lifecycle {
class ReceiverLauncher extends Lifecycle {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16741/consoleFull
[GitHub] spark pull request: SPARK-2277: make TaskScheduler track hosts on ...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1212#issuecomment-49218616 @mridulm what I meant was that it would be good in the future if you try to have Mark or Kay look at patches in the scheduler code before you merge them.
[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1238#issuecomment-49219047 Jenkins, test this please.
[GitHub] spark pull request: [MLlib] SPARK-1536: multiclass classification ...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/886#issuecomment-49219179 @manishamde It looks like the MIMA issue may not be fixed right away, so let's use the exceptions in MimaExcludes.scala for now. Could you please do that? I believe it will then be good to go. Thanks!
[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1238#issuecomment-49219540
QA tests have started for PR 1238. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16743/consoleFull
[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/1238#discussion_r15024063

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala ---
@@ -26,6 +26,28 @@ import org.apache.spark.sql.catalyst.trees
 abstract class LogicalPlan extends QueryPlan[LogicalPlan] {
   self: Product =>

+  // TODO: handle overflow?
+  /**
+   * Estimates of various statistics. The default estimation logic simply sums up the corresponding
+   * statistic produced by the children. To override this behavior, override `statistics` and
+   * assign it a overriden version of `Statistics`.
+   */
+  case class Statistics(
+    /**
+     * Number of output tuples. For leaf operators this defaults to 1, otherwise it is set to the
+     * product of children's `numTuples`.
+     */
+    numTuples: Long = childrenStats.map(_.numTuples).product,
+
+    /**
+     * Physical size in bytes. For leaf operators this defaults to 1, otherwise it is set to the
+     * product of children's `sizeInBytes`.
+     */
+    sizeInBytes: Long = childrenStats.map(_.sizeInBytes).product
+  )
+  lazy val statistics: Statistics = new Statistics
+  lazy val childrenStats = children.map(_.statistics)
--- End diff --

Thanks for the note. I wasn't well aware of the memory/time overhead of lazy vals. Since we have one statistic at the moment, `childrenStats` is used only once for a logical node, so I went ahead and removed it.

However, memoizing `statistics` itself would probably save some planning time, as it will get called multiple times during planning (e.g. in different cases of the HashJoin pattern match). If we were to use a `def`, it would be evaluated in quadratic order. Using a plain `val` would fire off the computation of the stats for all logical nodes right away. I think this is less desirable than evaluating it only for everything under a Join when needed by a Strategy. Also, we'd have very few operators in a typical query anyway.
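The trade-off between a memoized `lazy val` and a plain `def` described above can be made visible with a small counter experiment. This is a sketch under assumptions: `Node`, `sizeLazy`, `sizeDef`, and the two counters are hypothetical stand-ins and have nothing to do with the real `LogicalPlan` beyond multiplying child statistics.

```scala
object StatsMemoSketch {
  // Hypothetical counters, only here to make the number of evaluations visible.
  var lazyEvaluations = 0
  var defEvaluations = 0

  // Hypothetical stand-in for a logical plan node; not Spark's LogicalPlan.
  class Node(val children: Seq[Node], leafSize: Long = 1L) {
    // Memoized: computed on first access, then cached for this node.
    lazy val sizeLazy: Long = {
      lazyEvaluations += 1
      if (children.isEmpty) leafSize else children.map(_.sizeLazy).product
    }

    // Not memoized: every call walks the whole subtree again.
    def sizeDef: Long = {
      defEvaluations += 1
      if (children.isEmpty) leafSize else children.map(_.sizeDef).product
    }
  }

  // A join-like node with two leaf children.
  val plan = new Node(Seq(new Node(Nil, 10L), new Node(Nil, 20L)))
}
```

Asking `plan` for its size three times costs three evaluations in total with the `lazy val` (one per node) and nine with the `def` (three nodes on each call). A strict `val` would also evaluate once per node, but at construction time for every node, whether or not any strategy asks for it.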
[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1238#issuecomment-49219898
QA results for PR 1238:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16743/consoleFull
[GitHub] spark pull request: SPARK-2250: show stage RDDs in UI
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1188#issuecomment-49220170
QA tests have started for PR 1188. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16744/consoleFull
[GitHub] spark pull request: SPARK-2250: show stage RDDs in UI
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1188#issuecomment-49220283 QA results for PR 1188:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test output: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16744/consoleFull
[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/1313#issuecomment-49220362 Ah, that's true, it won't be as common now. Anyway I'd be okay with any solution as long as TaskSets with only no-prefs tasks, or only process-local + no-prefs, don't get an unnecessary delay.
[GitHub] spark pull request: [SPARK-2411] Add a history-not-found page to s...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/1336#discussion_r15024426
--- Diff: core/src/main/scala/org/apache/spark/deploy/master/ui/HistoryNotFoundPage.scala ---
@@ -0,0 +1,40 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.deploy.master.ui
+
+import javax.servlet.http.HttpServletRequest
+
+import scala.xml.Node
+
+import org.apache.spark.ui.{UIUtils, WebUIPage}
+
+private[spark] class HistoryNotFoundPage(parent: MasterWebUI)
+  extends WebUIPage("history/not-found") {
+
+  def render(request: HttpServletRequest): Seq[Node] = {
+    val content =
+      <div class="row-fluid">
+        <div class="span12" style="font-size:14px;font-weight:bold">
+          No event logs were found for this application. To enable event logging, please set
--- End diff --
This is for the standalone Master, not the history server. Here the event log directory for each application is shipped to the master as part of the application description, so the Master knows where to find the logs automatically.
(For the history server, however, the user must explicitly set the path in all cases, as you describe.)
[GitHub] spark pull request: SPARK-2250: show stage RDDs in UI
Github user nevillelyh commented on the pull request: https://github.com/apache/spark/pull/1188#issuecomment-49220820 Sorry, missed one. Fixed the style warning.
[GitHub] spark pull request: [SPARK-2479][MLlib] Comparing floating-point n...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/1425#issuecomment-49221370 @mengxr ScalaTest 2.x has a tolerance feature, but it checks absolute error, not relative error. For large numbers, an absolute error may not be meaningful. `===` returns false even when the difference is only one unit of least precision (ULP), which often happens when the unit tests run on a different machine architecture. For example, ARM and x86 may round differently, and we don't run any tests on anything other than x86. C++ Boost has its own numerical equality test based on relative error for this reason. I can probably add `~=` and `~==` methods for the `Double` and `Vector` types using an implicit class; `~==` would throw an exception that carries a message, like `===` does.
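Editor's note: a minimal sketch of why an absolute tolerance can be too strict for large values while a relative tolerance is not. The object and method names are hypothetical and chosen for illustration; they are not ScalaTest, Boost, or Spark APIs, and scaling by the larger magnitude is only one common definition of relative error.

```scala
// Illustrative helpers only; not an existing library API.
object RelativeErrorSketch {
  /** Difference between a and b, scaled by the larger of the two magnitudes. */
  def relativeError(a: Double, b: Double): Double = {
    val scale = math.max(math.abs(a), math.abs(b))
    if (scale == 0.0) 0.0 else math.abs(a - b) / scale
  }

  /** Absolute check: the raw difference must not exceed `tol`. */
  def withinAbsolute(a: Double, b: Double, tol: Double): Boolean =
    math.abs(a - b) <= tol

  /** Relative check: the scaled difference must not exceed `tol`. */
  def withinRelative(a: Double, b: Double, tol: Double): Boolean =
    relativeError(a, b) <= tol

  // Two neighbouring doubles: they differ by exactly one ULP of `big`.
  val big: Double = 1.0e15
  val bigPlusOneUlp: Double = big + java.lang.Math.ulp(big)
}
```

Here `big` and `bigPlusOneUlp` are adjacent representable values, so exact equality fails and an absolute tolerance of `1e-9` also fails (one ULP at this magnitude is 0.125), while a relative tolerance of `1e-9` accepts them.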
[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...
Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/1313#issuecomment-49221715 Same here. Kay, any thoughts?
On 17-Jul-2014 1:44 am, Matei Zaharia notificati...@github.com wrote:
> Ah, that's true, it won't be as common now. Anyway I'd be okay with any solution as long as TaskSets with only no-prefs tasks, or only process-local + no-prefs, don't get an unnecessary delay.
[GitHub] spark pull request: [SPARK-2479][MLlib] Comparing floating-point n...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/1425#issuecomment-49221957 `almostEquals` reads better than `~===`. What we like about `===` is that the error message contains the values being compared, not the name :)
[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1238#issuecomment-49222667 Jenkins, test this please
[GitHub] spark pull request: [SPARK-2479][MLlib] Comparing floating-point n...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/1425#issuecomment-49222983 I learned `almostEquals` from the Boost library. Anyway, in this case, how do we distinguish the variant that throws an exception with a message from the one that just returns true/false? `almostEquals` and `almostEqualsWithMessage`?
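Editor's note: one possible shape for the two variants asked about above, sketched under assumptions. The method names are taken from this thread; the implicit class name, the default tolerance of `1e-9`, and the choice of `AssertionError` are hypothetical and do not describe what was eventually merged.

```scala
// Sketch only: a boolean variant and a throwing variant of a relative comparison.
object AlmostEqualsSketch {
  implicit class AlmostEqualsDouble(x: Double) {
    /** True when x and y agree within the relative tolerance `relTol`. */
    def almostEquals(y: Double, relTol: Double = 1e-9): Boolean = {
      val scale = math.max(math.abs(x), math.abs(y))
      scale == 0.0 || math.abs(x - y) <= relTol * scale
    }

    /** Like almostEquals, but throws with both compared values in the message on failure. */
    def almostEqualsWithMessage(y: Double, relTol: Double = 1e-9): Boolean = {
      if (!almostEquals(y, relTol)) {
        throw new AssertionError(
          s"Expected $x and $y to be within relative tolerance $relTol.")
      }
      true
    }
  }
}
```

After `import AlmostEqualsSketch._`, a test could write `assert(a.almostEquals(b))` when a plain boolean is enough, or call `a.almostEqualsWithMessage(b)` when the failure should report both values.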
[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1238#issuecomment-49223292 QA tests have started for PR 1238. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16745/consoleFull
[GitHub] spark pull request: SPARK-2465. Use long as user / item ID for ALS
Github user srowen closed the pull request at: https://github.com/apache/spark/pull/1393
[GitHub] spark pull request: Added t2 instance types
GitHub user 24601 opened a pull request: https://github.com/apache/spark/pull/1446
Added t2 instance types
New t2 instance types require HVM AMIs; the bailout assumption of PVM causes failures when using t2 instance types.
You can merge this pull request into a Git repository by running:
    $ git pull https://github.com/24601/spark master
Alternatively you can review and apply these changes as the patch at:
    https://github.com/apache/spark/pull/1446.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
    This closes #1446
commit 392a95ef97d9644b52d7526a1a068b6d9dbf01c0
Author: Basit Mustafa basitmustafa@computes-things-for-basit.local
Date: 2014-07-16T20:45:37Z
    Added t2 instance types
    New t2 instance types require HVM AMIs; the bailout assumption of PVM causes failures when using t2 instance types.
[GitHub] spark pull request: Added t2 instance types
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1446#issuecomment-49225163 Can one of the admins verify this patch?
[GitHub] spark pull request: SPARK-1097: Do not introduce deadlock while fi...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1409#issuecomment-49227548 Okay I'm going to merge this into master and 1.0. We can cut a new patch release shortly for this.
[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...
GitHub user sryza opened a pull request: https://github.com/apache/spark/pull/1447
SPARK-2519 part 2. Remove pattern matching on Tuple2 in critical section...
...s of CoGroupedRDD and PairRDDFunctions
This also removes an unnecessary tuple creation in cogroup.
You can merge this pull request into a Git repository by running:
    $ git pull https://github.com/sryza/spark sandy-spark-2519-2
Alternatively you can review and apply these changes as the patch at:
    https://github.com/apache/spark/pull/1447.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
    This closes #1447
commit 906e6b362db11a4ac5a3f8096668c6f968dc48aa
Author: Sandy Ryza sa...@cloudera.com
Date: 2014-07-16T21:00:17Z
    SPARK-2519 part 2. Remove pattern matching on Tuple2 in critical sections of CoGroupedRDD and PairRDDFunctions
[GitHub] spark pull request: Added t2 instance types
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/1446#discussion_r15027927
--- Diff: ec2/spark_ec2.py ---
@@ -240,7 +240,10 @@ def get_spark_ami(opts):
         "r3.xlarge": "hvm",
         "r3.2xlarge": "hvm",
         "r3.4xlarge": "hvm",
-        "r3.8xlarge": "hvm"
+        "r3.8xlarge": "hvm",
+        "t2.micro": "hvm",
+        "t2.small":"hvm",
+        "t2.medium":"hvm"
--- End diff --
Can you format these consistently with the other ones?
[GitHub] spark pull request: SPARK-2294: fix locality inversion bug in Task...
Github user CodingCat commented on the pull request: https://github.com/apache/spark/pull/1313#issuecomment-49227891 Hmm... that makes sense. Actually, we can easily find out which locality levels are present in the TaskSet through `myLocalityLevels`, which is calculated when a new executor is added.