[GitHub] spark pull request: Support cross building for Scala 2.11
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3159#issuecomment-62295971 [Test build #23115 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23115/consoleFull) for PR 3159 at commit [`5dcd602`](https://github.com/apache/spark/commit/5dcd602ca04d90d80066d6405920a684749aeea4). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: Support cross building for Scala 2.11
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3159#issuecomment-62295972 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23115/ Test FAILed.
[GitHub] spark pull request: Support cross building for Scala 2.11
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3159#issuecomment-62296123 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23114/ Test FAILed.
[GitHub] spark pull request: Support cross building for Scala 2.11
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3159#issuecomment-62296120 [Test build #23114 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23114/consoleFull) for PR 3159 at commit [`5dcd602`](https://github.com/apache/spark/commit/5dcd602ca04d90d80066d6405920a684749aeea4). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4237][BUILD] Fix MANIFEST.MF in maven a...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3103#issuecomment-62296677 [Test build #23116 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23116/consoleFull) for PR 3103 at commit [`8332304`](https://github.com/apache/spark/commit/8332304f00130c4a7ff429d3892c55e02494a0c0). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4237][BUILD] Fix MANIFEST.MF in maven a...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3103#issuecomment-62296681 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23116/ Test PASSed.
[GitHub] spark pull request: [SPARK-3971][SQL] Backport #2843 to branch-1.1
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3113#issuecomment-62297437 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23117/consoleFull) for PR 3113 at commit [`d354161`](https://github.com/apache/spark/commit/d3541613da1c3e5b309645cb103d9a4a972b812b). * This patch merges cleanly.
[GitHub] spark pull request: Update RecoverableNetworkWordCount.scala
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/2735#discussion_r20057909 --- Diff: examples/src/main/scala/org/apache/spark/examples/streaming/RecoverableNetworkWordCount.scala --- @@ -114,7 +115,7 @@ object RecoverableNetworkWordCount { val Array(ip, IntParam(port), checkpointDirectory, outputPath) = args val ssc = StreamingContext.getOrCreate(checkpointDirectory, () => { -createContext(ip, port, outputPath) +createContext(ip, port, outputPath, checkpointDirectory) --- End diff -- @tdas Can I double-check that it's correct to call `StreamingContext.checkpoint` only within the create context function, as opposed to always calling it on the result of `StreamingContext.getOrCreate`? That is, if it reads checkpoint data, does it already configure itself to continue using that checkpoint directory?
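The contract being asked about can be modeled in miniature: the create function runs only when no checkpoint exists and is responsible for configuring checkpointing itself, while a context restored from a checkpoint already carries its checkpoint directory. A toy Python sketch of that get-or-create semantics (the names `FakeContext` and `get_or_create` are illustrative, not Spark's API):

```python
# Toy model of StreamingContext.getOrCreate semantics: the create
# function configures checkpointing; a context restored from an
# existing checkpoint already knows its checkpoint directory.
_checkpoint_store = {}  # stands in for the checkpoint filesystem


class FakeContext:
    def __init__(self):
        self.checkpoint_dir = None

    def checkpoint(self, directory):
        # Called inside the create function, mirroring ssc.checkpoint(dir).
        self.checkpoint_dir = directory
        _checkpoint_store[directory] = self


def get_or_create(checkpoint_dir, create_fn):
    # Restore if checkpoint data exists; otherwise build a fresh context.
    if checkpoint_dir in _checkpoint_store:
        return _checkpoint_store[checkpoint_dir]  # already configured
    return create_fn()


def create_context():
    ctx = FakeContext()
    ctx.checkpoint("/tmp/cp")  # configure checkpointing inside create_fn
    return ctx


first = get_or_create("/tmp/cp", create_context)   # fresh context
second = get_or_create("/tmp/cp", create_context)  # restored, same config
```

In this toy model the second call never invokes `create_context`, which is why the checkpoint configuration only needs to appear inside the create function.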
[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62299039 @shivaram IMHO it would be good to have the developer API updates as well and to test a couple more pipelines before we push this out. I'll try to get a branch based on this PR ready next week for feedback. Not sure if we want to do a mega-PR, though; hopefully it can be kept as a separate follow-up. > Also I am not sure I fully understand the difference between the User API and the Developer API These are loose terms; part of the Developer API will actually be public. E.g., Classifier will be public since it will be needed for the boosting API. But most users won't have to worry about these abstract classes, and the classes will include some private[ml] methods to make developers' lives easier.
[GitHub] spark pull request: [SPARK-3971][SQL] Backport #2843 to branch-1.1
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3113#issuecomment-62299144 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23117/consoleFull) for PR 3113 at commit [`d354161`](https://github.com/apache/spark/commit/d3541613da1c3e5b309645cb103d9a4a972b812b). * This patch **passes** unit tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-3971][SQL] Backport #2843 to branch-1.1
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3113#issuecomment-62299146 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23117/ Test PASSed.
[GitHub] spark pull request: [SPARK-3971][SQL] Backport #2843 to branch-1.1
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/3113#issuecomment-62299785 @marmbrus Backported #2164 to fix the Jenkins build failure (ParquetQuerySuite). Should be ready to go.
[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...
Github user yu-iskw commented on the pull request: https://github.com/apache/spark/pull/2906#issuecomment-62302445 There are a few conflicts with the master branch. I will rebase my PR branch and then force-push it.
[GitHub] spark pull request: [SPARK-4253]Ignore spark.driver.host in yarn-c...
Github user WangTaoTheTonic commented on the pull request: https://github.com/apache/spark/pull/3112#issuecomment-62302608 Additionally, I think spark.driver.host is useful in all client modes, including standalone, Mesos (I don't know it very well), and yarn-client mode. When the cluster cannot resolve the client's hostname, we must set this configuration to the client's IP address to avoid failing to connect to the driver. If I understood it wrong, please correct me.
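As a hedged illustration of the workaround described above (the IP address below is hypothetical), the setting can be supplied in `spark-defaults.conf` on the client machine or passed at submit time:

```properties
# spark-defaults.conf on the client machine (hypothetical address):
spark.driver.host  10.1.2.3

# or equivalently on the command line:
#   spark-submit --conf spark.driver.host=10.1.2.3 ...
```

Either way, executors will connect back to the driver at that address instead of a hostname the cluster cannot resolve.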
[GitHub] spark pull request: [SPARK-4033][Examples]Input of the SparkPi too...
Github user SaintBacchus closed the pull request at: https://github.com/apache/spark/pull/2874
[GitHub] spark pull request: [SPARK-3971][SQL] Backport #2843 to branch-1.1
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/3113#issuecomment-62304982 @marmbrus However, I didn't quite get why #2164 fixes those Parquet tests. In particular, why did you say the original test cases were order dependent?
[GitHub] spark pull request: Change the initial iteration num of ruleExecut...
GitHub user DoingDone9 opened a pull request: https://github.com/apache/spark/pull/3174 Change the initial iteration num of ruleExecutor from 1 to 0 Change the initial iteration num of ruleExecutor from 1 to 0. You can merge this pull request into a Git repository by running: $ git pull https://github.com/DoingDone9/spark catalyst_issue_01 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3174.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3174 commit fe0bc4ca5656ba8ec490428748fecc22948b7d95 Author: DoingDone9 799203...@qq.com Date: 2014-11-09T14:49:08Z Change the first iteration num of ruleExecutor from 1 to 0
[GitHub] spark pull request: Change the initial iteration num of ruleExecut...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3174#issuecomment-62305964 Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-4122][STREAMING] Add a library that can...
Github user helena commented on a diff in the pull request: https://github.com/apache/spark/pull/2994#discussion_r20059145 --- Diff: external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaOutputWriter.scala --- @@ -0,0 +1,126 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.streaming.kafka + +import java.util.Properties + +import scala.reflect.ClassTag + +import kafka.producer.{ProducerConfig, KeyedMessage, Producer} + +import org.apache.spark.Logging +import org.apache.spark.rdd.RDD +import org.apache.spark.streaming.dstream.DStream + +/** + * Import this object in this form: + * {{{ + * import org.apache.spark.streaming.kafka.KafkaWriter._ + * }}} + * + * Once imported, the `writeToKafka` method can be called on any [[DStream]] object in this form: + * {{{ + * dstream.writeToKafka(producerConfig, f) + * }}} + */ +object KafkaWriter { + import scala.language.implicitConversions + /** + * This implicit method allows the user to call dstream.writeToKafka(..)
 + * @param dstream - DStream to write to Kafka + * @tparam T - The type of the DStream + * @tparam K - The type of the key to serialize to + * @tparam V - The type of the value to serialize to + * @return + */ + implicit def createKafkaOutputWriter[T: ClassTag, K, V](dstream: DStream[T]): KafkaWriter[T] = { +new KafkaWriter[T](dstream) + } +} + +/** + * + * This class can be used to write data to Kafka from Spark Streaming. To write data to Kafka + * simply `import org.apache.spark.streaming.kafka.KafkaWriter._` in your application and call + * `dstream.writeToKafka(producerConf, func)` + * + * Here is an example: + * {{{ + * // Adding this line allows the user to call dstream.writeToKafka(..) + * import org.apache.spark.streaming.kafka.KafkaWriter._ + * + * class ExampleWriter { + * val instream = ssc.queueStream(toBe) + * val producerConf = new Properties() + * producerConf.put("serializer.class", "kafka.serializer.DefaultEncoder") + * producerConf.put("key.serializer.class", "kafka.serializer.StringEncoder") + * producerConf.put("metadata.broker.list", "kafka.example.com:5545") + * producerConf.put("request.required.acks", "1") + * instream.writeToKafka(producerConf, + *(x: String) => new KeyedMessage[String, String]("default", null, x)) + * ssc.start() + * } + * + * }}} + * @param dstream - The [[DStream]] to be written to Kafka + * + */ +class KafkaWriter[T: ClassTag](@transient dstream: DStream[T]) extends Serializable with Logging { + + /** + * To write data from a DStream to Kafka, call this function after creating the DStream. Once + * the DStream is passed into this function, all data coming from the DStream is written out to + * Kafka. The properties instance takes the configuration required to connect to the Kafka + * brokers in the standard Kafka format. The serializerFunc is a function that converts each + * element of the RDD to a Kafka [[KeyedMessage]]. This closure should be serializable - so it + * should use only instances of Serializables.
 + * @param producerConfig The configuration that can be used to connect to Kafka + * @param serializerFunc The function to convert the data from the stream into Kafka + * [[KeyedMessage]]s. + * @tparam K The type of the key + * @tparam V The type of the value + * + */ + def writeToKafka[K, V](producerConfig: Properties, +serializerFunc: T => KeyedMessage[K, V]): Unit = { + +// Broadcast the producer to avoid sending it every time. +val broadcastedConfig = dstream.ssc.sc.broadcast(producerConfig) + +def func = (rdd: RDD[T]) => { + rdd.foreachPartition(events => { +// The ForEachDStream runs the function locally on the driver. +// This code
[GitHub] spark pull request: Change the initial iteration num of ruleExecut...
Github user DoingDone9 commented on the pull request: https://github.com/apache/spark/pull/3174#issuecomment-62306218 I am new, but I think it should be 0, not 1.
[GitHub] spark pull request: [SPARK-4274] [SQL] Print informative message w...
Github user chenghao-intel commented on a diff in the pull request: https://github.com/apache/spark/pull/3139#discussion_r20059176 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveComparisonTest.scala --- @@ -341,12 +341,21 @@ abstract class HiveComparisonTest val query = new TestHive.HiveQLQueryExecution(queryString) try { (query, prepareAnswer(query, query.stringResult())) } catch { case e: Throwable => + val logicalQueryInString = try { +query.toString --- End diff -- Oh, I didn't know this. Thank you @marmbrus very much for the code snippet. After debugging, I think you are right: the exception is thrown by `${executedPlan.codegenEnabled}`, and `executedPlan` is null if something goes wrong in parsing or analysis, etc. I've updated the code again.
[GitHub] spark pull request: [SPARK-4274] [SQL] Print informative message w...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3139#issuecomment-62306341 [Test build #23118 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23118/consoleFull) for PR 3139 at commit [`f5d7146`](https://github.com/apache/spark/commit/f5d714662d4a2e487d42531c4df6dfcf0c49b296). * This patch merges cleanly.
[GitHub] spark pull request: Sets SQL operation state to ERROR when excepti...
GitHub user liancheng opened a pull request: https://github.com/apache/spark/pull/3175 Sets SQL operation state to ERROR when exception is thrown In `HiveThriftServer2`, when an exception is thrown during a SQL execution, the SQL operation state should be set to `ERROR`, but now it remains `RUNNING`. This affects the result of the `GetOperationStatus` Thrift API. You can merge this pull request into a Git repository by running: $ git pull https://github.com/liancheng/spark fix-op-state Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3175.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3175 commit 6d4c1fed5e701c79de1e1489342e0d167159ba12 Author: Cheng Lian l...@databricks.com Date: 2014-11-09T10:08:43Z Sets SQL operation state to ERROR when exception is thrown
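The fix described in this PR follows a common pattern: transition the operation's state inside the exception handler rather than leaving it at RUNNING. A minimal Python sketch of that pattern (the state names loosely mirror the Thrift API; this is not HiveThriftServer2's actual code):

```python
from enum import Enum


class OperationState(Enum):
    RUNNING = 1
    FINISHED = 2
    ERROR = 3


def run_statement(execute):
    """Run a statement, recording ERROR instead of staying RUNNING."""
    state = OperationState.RUNNING
    try:
        execute()
        state = OperationState.FINISHED
    except Exception:
        # The bug being fixed: without this branch the state stayed
        # RUNNING, so a status query reported a still-running statement.
        state = OperationState.ERROR
    return state
```

Without the `except` branch, a failed statement would report the same state as an in-flight one, which is exactly the `GetOperationStatus` symptom the PR describes.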
[GitHub] spark pull request: [SPARK-4308][SQL] Sets SQL operation state to ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3175#issuecomment-62306503 [Test build #23119 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23119/consoleFull) for PR 3175 at commit [`6d4c1fe`](https://github.com/apache/spark/commit/6d4c1fed5e701c79de1e1489342e0d167159ba12). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4308][SQL] Follow up of #3175 for branc...
GitHub user liancheng opened a pull request: https://github.com/apache/spark/pull/3176 [SPARK-4308][SQL] Follow up of #3175 for branch 1.1 The PR for the master branch can't be backported to branch 1.1 directly because of Hive 0.13.1 support. You can merge this pull request into a Git repository by running: $ git pull https://github.com/liancheng/spark fix-op-state-for-1.1 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3176.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3176 commit 8791d87661f91a72fbd605bdfc9dd56bfa621821 Author: Cheng Lian l...@databricks.com Date: 2014-11-09T15:16:51Z This is a follow up of #3175 for branch 1.1
[GitHub] spark pull request: [SPARK-2213][SQL] Sort Merge Join
Github user chenghao-intel commented on the pull request: https://github.com/apache/spark/pull/3173#issuecomment-62306671 It's really nice to have Sort-Merge-Join, as we have met join queries that couldn't run to completion in real cases. One high-level comment on this: can we also keep the `ShuffleHashJoin`? It can still be faster than Sort-Merge-Join in some cases; all we need is a configuration/strategy to map to different join operators. BTW: do you have any performance comparison results that can be shared with us?
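For readers unfamiliar with the trade-off being discussed: a sort-merge join advances two cursors over inputs sorted on the join key, so it never materializes a full hash table the way a shuffled hash join does. A minimal Python sketch of the inner-join case on already-sorted inputs (illustrative only; Spark's implementation operates on shuffled, partitioned rows):

```python
def sort_merge_join(left, right):
    """Inner join of two lists of (key, value) pairs sorted by key."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1          # left cursor is behind; advance it
        elif lk > rk:
            j += 1          # right cursor is behind; advance it
        else:
            # Matching keys: emit the cross product of the two runs
            # sharing this key, rewinding the right cursor per left row.
            j_start = j
            while i < len(left) and left[i][0] == lk:
                j = j_start
                while j < len(right) and right[j][0] == lk:
                    out.append((lk, left[i][1], right[j][1]))
                    j += 1
                i += 1
    return out
```

The memory footprint here is bounded by the size of one run of duplicate keys, which is why sort-merge handles large skewed joins that a hash join cannot fit in memory, at the cost of the sort.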
[GitHub] spark pull request: [SPARK-4308][SQL] Follow up of #3175 for branc...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3176#issuecomment-62306835 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23120/consoleFull) for PR 3176 at commit [`8791d87`](https://github.com/apache/spark/commit/8791d87661f91a72fbd605bdfc9dd56bfa621821). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2906#issuecomment-62307500 [Test build #23121 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23121/consoleFull) for PR 2906 at commit [`691c49a`](https://github.com/apache/spark/commit/691c49adf9751193f3b8928211e77d307ef44c37). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4295][External]Fix exception in SparkSi...
GitHub user maji2014 opened a pull request: https://github.com/apache/spark/pull/3177 [SPARK-4295][External]Fix exception in SparkSinkSuite Handle exception in SparkSinkSuite, please refer to [SPARK-4295] You can merge this pull request into a Git repository by running: $ git pull https://github.com/maji2014/spark spark-4295 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3177.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3177 commit c807bf66a8d945708af0f620576255cc133ffe46 Author: maji2014 ma...@asiainfo.com Date: 2014-11-09T15:58:50Z Fix exception in SparkSinkSuite
[GitHub] spark pull request: [SPARK-4295][External]Fix exception in SparkSi...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3177#issuecomment-62308120 Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-4274] [SQL] Print informative message w...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3139#issuecomment-62308752

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23118/
Test PASSed.
[GitHub] spark pull request: [SPARK-4274] [SQL] Print informative message w...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3139#issuecomment-62308750

[Test build #23118 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23118/consoleFull) for PR 3139 at commit [`f5d7146`](https://github.com/apache/spark/commit/f5d714662d4a2e487d42531c4df6dfcf0c49b296).

* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4308][SQL] Sets SQL operation state to ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3175#issuecomment-62309229

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23119/
Test PASSed.
[GitHub] spark pull request: [SPARK-4308][SQL] Sets SQL operation state to ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3175#issuecomment-62309225

[Test build #23119 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23119/consoleFull) for PR 3175 at commit [`6d4c1fe`](https://github.com/apache/spark/commit/6d4c1fed5e701c79de1e1489342e0d167159ba12).

* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: SPARK-2811 upgrade algebird to 0.8.1
Github user ScrapCodes commented on the pull request: https://github.com/apache/spark/pull/2947#issuecomment-62309276

Looks good then.

On Nov 9, 2014 8:24 AM, "Adam Pingel" <notificati...@github.com> wrote:

> Algebird 0.8.1 for Scala 2.11 is on the central repo:
> http://search.maven.org/#artifactdetails%7Ccom.twitter%7Calgebird_2.11%7C0.8.1%7Cjar
>
> — Reply to this email directly or view it on GitHub
> https://github.com/apache/spark/pull/2947#issuecomment-62288693.
[GitHub] spark pull request: [SPARK-4309][SQL] Date type support for Thrift...
GitHub user liancheng opened a pull request: https://github.com/apache/spark/pull/3178

[SPARK-4309][SQL] Date type support for Thrift server

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liancheng/spark date-for-thriftserver

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3178.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #3178

commit 70b1becb6d0e852cad6baa9457a1e741036347fd
Author: Cheng Lian <l...@databricks.com>
Date: 2014-11-09T16:20:46Z

    Adds Date support for HiveThriftServer2 (Hive 0.12.0)

commit 313248c8545b105b2ac83d0062ba0306fabd7859
Author: Cheng Lian <l...@databricks.com>
Date: 2014-11-09T16:39:59Z

    Updates HiveShim for 0.13.1
[GitHub] spark pull request: [SPARK-4309][SQL] Date type support for Thrift...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3178#issuecomment-62309727

[Test build #23122 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23122/consoleFull) for PR 3178 at commit [`313248c`](https://github.com/apache/spark/commit/313248c8545b105b2ac83d0062ba0306fabd7859).

* This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4000][Build] Uploads HiveCompatibilityS...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/2993#issuecomment-62309812

@pwendell ping :)
[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2906#issuecomment-62310159

[Test build #23121 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23121/consoleFull) for PR 2906 at commit [`691c49a`](https://github.com/apache/spark/commit/691c49adf9751193f3b8928211e77d307ef44c37).

* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `public class JavaHierarchicalClustering `
  * `trait HierarchicalClusteringConf extends Serializable `
  * `class HierarchicalClustering(`
  * `class HierarchicalClusteringModel(object):`
  * `class HierarchicalClustering(object):`
[GitHub] spark pull request: [SPARK-4205][SQL] Timestamp and Date with comp...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3158#issuecomment-62310177

[Test build #23123 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23123/consoleFull) for PR 3158 at commit [`c5fb299`](https://github.com/apache/spark/commit/c5fb299c3327a78fb9ab1988e46f64a2bdd83807).

* This patch merges cleanly.
[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2906#issuecomment-62310162

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23121/
Test FAILed.
[GitHub] spark pull request: [SPARK-4308][SQL] Follow up of #3175 for branc...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3176#issuecomment-62310223

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23120/
Test PASSed.
[GitHub] spark pull request: [SPARK-4308][SQL] Follow up of #3175 for branc...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3176#issuecomment-62310222

[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23120/consoleFull) for PR 3176 at commit [`8791d87`](https://github.com/apache/spark/commit/8791d87661f91a72fbd605bdfc9dd56bfa621821).

* This patch **passes** unit tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4205][SQL] Timestamp and Date with comp...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3158#issuecomment-62311505

[Test build #23123 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23123/consoleFull) for PR 3158 at commit [`c5fb299`](https://github.com/apache/spark/commit/c5fb299c3327a78fb9ab1988e46f64a2bdd83807).

* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class LhsLiteral(x: Any) `
  * `final class MutableDate extends MutableValue `
  * `final class MutableTimestamp extends MutableValue `
  * `class RichDate(milliseconds: Long) extends Date(milliseconds) `
  * `class RichTimestamp(milliseconds: Long) extends Timestamp(milliseconds) `
[GitHub] spark pull request: [SPARK-4205][SQL] Timestamp and Date with comp...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3158#issuecomment-62311507

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23123/
Test FAILed.
[GitHub] spark pull request: [SPARK-4309][SQL] Date type support for Thrift...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3178#issuecomment-62312335

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23122/
Test PASSed.
[GitHub] spark pull request: [SPARK-4309][SQL] Date type support for Thrift...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3178#issuecomment-62312332

[Test build #23122 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23122/consoleFull) for PR 3178 at commit [`313248c`](https://github.com/apache/spark/commit/313248c8545b105b2ac83d0062ba0306fabd7859).

* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: SPARK-3648: Provide a script for fetching remo...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/3165#issuecomment-62313063

Okay i'm gonna close this. If one of you guys could quickly add docs on our wiki, that would be great.
[GitHub] spark pull request: SPARK-3648: Provide a script for fetching remo...
Github user pwendell closed the pull request at: https://github.com/apache/spark/pull/3165
[GitHub] spark pull request: [SPARK-4047] - Generate runtime warnings for e...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/2894#issuecomment-62313396

@varadharajan Good suggestion about documenting algs for LR; I'll make a note to do that for the upcoming release. Thank you for the PR! LGTM
[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3022#issuecomment-62313507

[Test build #514 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/514/consoleFull) for PR 3022 at commit [`c15405c`](https://github.com/apache/spark/commit/c15405c78345e9a46549a398c6b59bed80274f9e).

* This patch **does not merge cleanly**.
[GitHub] spark pull request: [SPARK-4079] [CORE] Default to LZF if Snappy n...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/3119#issuecomment-62313527

Could this instead just throw an exception when Snappy is configured but not supported? We typically try not to silently mutate configs in the background in favor of giving users an actionable exception. I think this could be accomplished by just modifying `SnappyCompressionCodec` to guard the creation of an input stream or output stream with a check as to whether Snappy is enabled, and throw an exception if it is not enabled.

The current approach could lead to very confusing failure behavior. For instance, say a user has the Snappy native library installed on some machines but not others. What will happen is that there will be a stream corruption exception somewhere inside of Spark where one node writes data as Snappy and another reads it as LZF. And to figure out what caused this, a user will have to troll through executor logs for a somewhat innocuous-looking `WARN` statement.

@rxin designed this codec interface (I think), so maybe he has more comments also.
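A minimal sketch of the fail-fast guard pwendell is suggesting. Everything here is illustrative: the class name, the `snappyAvailable` probe, and the exception message are assumptions, not the actual `SnappyCompressionCodec` source; the only real API assumed is snappy-java's `Snappy`, `SnappyInputStream`, and `SnappyOutputStream`, which the Spark codec wraps.

```scala
import java.io.{InputStream, OutputStream}

// Illustrative fail-fast codec: refuse to create Snappy streams when the
// native library cannot be loaded, instead of silently falling back to LZF.
class GuardedSnappyCodec {
  // Assumed availability probe: asking for the native library version
  // throws if the native Snappy library is missing on this node.
  private lazy val snappyAvailable: Boolean =
    try { org.xerial.snappy.Snappy.getNativeLibraryVersion; true }
    catch { case _: Throwable => false }

  private def requireSnappy(): Unit =
    if (!snappyAvailable) {
      throw new IllegalArgumentException(
        "Snappy is configured as the compression codec, but its native " +
        "library could not be loaded on this node.")
    }

  def compressedOutputStream(s: OutputStream): OutputStream = {
    requireSnappy() // actionable error here, not stream corruption later
    new org.xerial.snappy.SnappyOutputStream(s)
  }

  def compressedInputStream(s: InputStream): InputStream = {
    requireSnappy()
    new org.xerial.snappy.SnappyInputStream(s)
  }
}
```

The point of the design is that the failure surfaces on the misconfigured node with a clear message, rather than as a cross-node stream corruption error that must be traced back through executor logs.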
[GitHub] spark pull request: SPARK-4230. Doc for spark.default.parallelism ...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/3107#discussion_r20060444

--- Diff: docs/configuration.md ---
@@ -563,8 +566,8 @@ Apart from these, the following properties are also available, and may be useful
     </ul>
   </td>
   <td>
-    Default number of tasks to use across the cluster for distributed shuffle operations
-    (<code>groupByKey</code>, <code>reduceByKey</code>, etc) when not set by user.
+    Default number of output partitions for operations like <code>join</code>,
--- End diff --

Should this say "number of shuffle partitions"? It's slightly weird to me to say "output" when this refers to something that is totally internal to Spark - it's output on the map side but input on the read side. In other cases I think "output" tends to mean things like saving as HDFS data, etc.
[GitHub] spark pull request: SPARK-4230. Doc for spark.default.parallelism ...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/3107#discussion_r20060453

--- Diff: docs/configuration.md ---
@@ -556,6 +556,9 @@ Apart from these, the following properties are also available, and may be useful
   <tr>
     <td><code>spark.default.parallelism</code></td>
     <td>
+      For distributed shuffle operations like <code>reduceByKey</code> and <code>join</code>, the
+      largest number of partitions in parent RDD. For operations like <code>parallelize</code> with
--- End diff --

Is this just the number of partitions in the parent RDD (why "largest"?) Doesn't the parent RDD have a fixed number of partitions? Or is this a maximum across all parents...?
[GitHub] spark pull request: SPARK-4230. Doc for spark.default.parallelism ...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/3107#issuecomment-62313723

Had some minor wording questions.
[GitHub] spark pull request: SPARK-4230. Doc for spark.default.parallelism ...
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3107#discussion_r20060629

--- Diff: docs/configuration.md ---
@@ -556,6 +556,9 @@ Apart from these, the following properties are also available, and may be useful
   <tr>
     <td><code>spark.default.parallelism</code></td>
     <td>
+      For distributed shuffle operations like <code>reduceByKey</code> and <code>join</code>, the
+      largest number of partitions in parent RDD. For operations like <code>parallelize</code> with
--- End diff --

I was worried "the number of partitions of the largest parent RDD" could be construed as the number of partitions in the parent RDD containing the most data. Do you think "the largest number of partitions in _a_ parent RDD" or "the largest number of partitions in one of the operation's input RDDs" would be more clear?
[GitHub] spark pull request: SPARK-4230. Doc for spark.default.parallelism ...
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3107#discussion_r20060674

--- Diff: docs/configuration.md ---
@@ -563,8 +566,8 @@ Apart from these, the following properties are also available, and may be useful
     </ul>
   </td>
   <td>
-    Default number of tasks to use across the cluster for distributed shuffle operations
-    (<code>groupByKey</code>, <code>reduceByKey</code>, etc) when not set by user.
+    Default number of output partitions for operations like <code>join</code>,
--- End diff --

My thinking was that Spark's APIs have no mention of the concept of a "shuffle partition" (e.g. the term is referenced nowhere on https://spark.apache.org/docs/latest/programming-guide.html), but even novice Spark users are meant to understand that every transformation has input and output RDDs and that every RDD has a number of partitions. Maybe "Default number of partitions for the RDDs produced by operations like ..."?
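The default being debated can be illustrated in a spark-shell session. This is a sketch of the documented behavior, not text from the PR; it assumes an already-created `SparkContext` named `sc` and that `spark.default.parallelism` is left unset.

```scala
// Two pair RDDs with different partition counts.
val a = sc.parallelize(Seq((1, "x"), (2, "y")), 4) // 4 partitions
val b = sc.parallelize(Seq((1, "p"), (2, "q")), 8) // 8 partitions

// With spark.default.parallelism unset, a shuffle operation like join
// defaults to the largest partition count among its parent RDDs,
// so the result here should have 8 partitions.
val joined = a.join(b)
println(joined.partitions.length)
```

This is exactly the "largest number of partitions in a parent RDD" case the reviewers are trying to phrase: neither parent's partitioning is reused; the bigger of the two counts wins.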
[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3022#issuecomment-62316919

[Test build #514 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/514/consoleFull) for PR 3022 at commit [`c15405c`](https://github.com/apache/spark/commit/c15405c78345e9a46549a398c6b59bed80274f9e).

* This patch **passes all tests**.
* This patch **does not merge cleanly**.
* This patch adds the following public classes _(experimental)_:
  * `class GaussianMixtureModel(val w: Array[Double], val mu: Array[Vector], val sigma: Array[Matrix]) `
[GitHub] spark pull request: SPARK-4276 fix for two working thread
Github user squito commented on the pull request: https://github.com/apache/spark/pull/3141#issuecomment-62320554

I agree w/ TD, I don't think this change is necessary. I think we should close this and, @svar29, maybe you can discuss the problem you are running into on the spark-user mailing list; hopefully we can help you out there.
[GitHub] spark pull request: [SPARK-4260] Httpbroadcast should set connecti...
Github user squito commented on the pull request: https://github.com/apache/spark/pull/3122#issuecomment-62320949

This looks good, but could you also explain what necessitates this change? Did you observe some error? If nothing else, just putting the error you observed in the JIRA would help somebody else find this patch if they run into the error as well.
[GitHub] spark pull request: Change the initial iteration num of ruleExecut...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/3174#issuecomment-62321314

-1 This breaks the logic of the loop. For example if maxIterations is 1, now it will execute twice.
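The off-by-one srowen describes is easy to see with a stripped-down model of the loop. This is illustrative only, not the actual `RuleExecutor` source: `run` counts how many times the rule batch would be applied for a given starting value of the iteration counter.

```scala
// Simplified model of a RuleExecutor-style loop: apply the batch, bump the
// counter, and stop once the counter exceeds maxIterations.
def run(maxIterations: Int, start: Int): Int = {
  var iteration = start
  var executions = 0
  var continue = true
  while (continue) {
    executions += 1 // stands in for "apply the batch of rules once"
    iteration += 1
    if (iteration > maxIterations) continue = false
  }
  executions
}

println(run(maxIterations = 1, start = 1)) // 1 execution, as intended
println(run(maxIterations = 1, start = 0)) // 2 executions: the off-by-one
```

With the counter starting at 1, `maxIterations = 1` yields exactly one pass; starting at 0 delays the exit check by one round, so the batch runs twice.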
[GitHub] spark pull request: [SPARK-4295][External]Fix exception in SparkSi...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/3177#issuecomment-62321381

Can you clarify how the tests pass if an exception is thrown? Does that also need a fix?
[GitHub] spark pull request: [SPARK-3936] Add aggregateMessages, which supe...
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/3100#discussion_r20062181

--- Diff: graphx/src/main/scala/org/apache/spark/graphx/TripletFields.scala ---
@@ -0,0 +1,59 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.graphx
+
+/**
+ * Represents a subset of the fields of an [[EdgeTriplet]] or [[EdgeContext]]. This allows the
+ * system to populate only those fields for efficiency.
+ */
+class TripletFields private (
+    val useSrc: Boolean,
+    val useDst: Boolean,
+    val useEdge: Boolean)
--- End diff --

Maybe I'm just missing it, but it seems like `useEdge` is never used.
[GitHub] spark pull request: SPARK-1344 [DOCS] Scala API docs for top metho...
Github user squito commented on the pull request: https://github.com/apache/spark/pull/3168#issuecomment-62323279 lgtm
[GitHub] spark pull request: SPARK-971 [DOCS] Link to Confluence wiki from ...
Github user squito commented on the pull request: https://github.com/apache/spark/pull/3169#issuecomment-62323284 lgtm
[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2906#issuecomment-62323346 [Test build #23124 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23124/consoleFull) for PR 2906 at commit [`cfdf842`](https://github.com/apache/spark/commit/cfdf8429bf4afb3e7a6a329dd285fe48429aec46). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/3079#discussion_r20062337 --- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala --- @@ -113,8 +117,12 @@ class RangePartitioner[K : Ordering : ClassTag, V]( private var ordering = implicitly[Ordering[K]] // An array of upper bounds for the first (partitions - 1) partitions - private var rangeBounds: Array[K] = { -if (partitions <= 1) { + @volatile private var valRB: Array[K] = null --- End diff -- `valRB` is a kinda confusing name. I think the convention would be to name it `_rangeBounds`. E.g. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/FutureAction.scala#L111
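The underscore-backing-field convention squito points to can be sketched outside Spark. The following is a hypothetical Python rendering (names are illustrative, not Spark's API) of deferring a data-driven computation behind a public accessor backed by a private field:

```python
import threading

class RangeBoundsHolder:
    """Sketch of the backing-field convention: a private, underscore-prefixed
    attribute computed lazily behind the public name."""

    def __init__(self, compute):
        self._lock = threading.Lock()
        self._range_bounds = None  # backing field, filled on first access
        self._compute = compute

    @property
    def range_bounds(self):
        # Double-checked locking so concurrent readers observe a single
        # computed value, mirroring the @volatile field in the diff.
        if self._range_bounds is None:
            with self._lock:
                if self._range_bounds is None:
                    self._range_bounds = self._compute()
        return self._range_bounds
```

The point of the convention is that callers only ever see `range_bounds`; the mutable `_range_bounds` stays an implementation detail.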
[GitHub] spark pull request: [SPARK-3936] Add aggregateMessages, which supe...
Github user ankurdave commented on a diff in the pull request: https://github.com/apache/spark/pull/3100#discussion_r20062658 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/TripletFields.scala --- @@ -0,0 +1,59 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.graphx + +/** + * Represents a subset of the fields of an [[EdgeTriplet]] or [[EdgeContext]]. This allows the + * system to populate only those fields for efficiency. + */ +class TripletFields private ( +val useSrc: Boolean, +val useDst: Boolean, +val useEdge: Boolean) --- End diff -- Yeah, we don't currently use it since it's cheap to access the edge attributes, but I think @jegonzal added it in case our internal representation changes and it becomes useful.
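For readers following the thread, the shape of the class under discussion can be sketched in plain Python (a simplified stand-in, not GraphX's actual implementation): immutable flags plus named presets, so callers declare up front which triplet fields they will read and the system can skip populating the rest.

```python
class TripletFields:
    """Simplified sketch of the TripletFields idea: three booleans saying
    which parts of an edge triplet the caller intends to read."""

    def __init__(self, use_src, use_dst, use_edge):
        self.use_src = use_src
        self.use_dst = use_dst
        self.use_edge = use_edge

# Named presets (hypothetical names) so call sites read declaratively.
TripletFields.NONE = TripletFields(False, False, False)
TripletFields.SRC = TripletFields(True, False, False)
TripletFields.DST = TripletFields(False, True, False)
TripletFields.ALL = TripletFields(True, True, True)
```

A caller that only aggregates over source attributes would pass `TripletFields.SRC`, letting the engine avoid shipping destination attributes at all.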
[GitHub] spark pull request: [SPARK-4000][Build] Uploads HiveCompatibilityS...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/2993#issuecomment-62325053 LGTM
[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2906#issuecomment-62325994 [Test build #23124 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23124/consoleFull) for PR 2906 at commit [`cfdf842`](https://github.com/apache/spark/commit/cfdf8429bf4afb3e7a6a329dd285fe48429aec46). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `public class JavaHierarchicalClustering ` * `trait HierarchicalClusteringConf extends Serializable ` * `class HierarchicalClustering(` * `class HierarchicalClusteringModel(object):` * `class HierarchicalClustering(object):`
[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2906#issuecomment-62325997 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23124/ Test FAILed.
[GitHub] spark pull request: SPARK-4230. Doc for spark.default.parallelism ...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/3107#discussion_r20063433 --- Diff: docs/configuration.md --- @@ -556,6 +556,9 @@ Apart from these, the following properties are also available, and may be useful <tr> <td><code>spark.default.parallelism</code></td> <td> +For distributed shuffle operations like <code>reduceByKey</code> and <code>join</code>, the +largest number of partitions in parent RDD. For operations like <code>parallelize</code> with --- End diff -- Yeah - if you just add "in a parent RDD" then that seems good!
[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2906#issuecomment-62328415 [Test build #23125 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23125/consoleFull) for PR 2906 at commit [`b0b061e`](https://github.com/apache/spark/commit/b0b061edc4c2ad42deda00bb664534e1334b50e5). * This patch merges cleanly.
[GitHub] spark pull request: SPARK-4230. Doc for spark.default.parallelism ...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/3107#discussion_r20063448 --- Diff: docs/configuration.md --- @@ -563,8 +566,8 @@ Apart from these, the following properties are also available, and may be useful </ul> </td> <td> -Default number of tasks to use across the cluster for distributed shuffle operations -(<code>groupByKey</code>, <code>reduceByKey</code>, etc) when not set by user. +Default number of output partitions for operations like <code>join</code>, --- End diff -- Ah I see - what about "Default number of partitions in RDD's returned by join, reduceByKey..."?
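The rule these two doc tweaks describe can be made concrete with a small, hypothetical helper (plain Python, no Spark dependency): when `spark.default.parallelism` is set it wins; otherwise shuffle operations like `join` and `reduceByKey` default to the largest partition count among the parent RDDs.

```python
def default_num_partitions(parent_partition_counts, spark_default_parallelism=None):
    """Sketch of the documented behavior for shuffle operations:
    use spark.default.parallelism when the user set it, otherwise
    the largest number of partitions in a parent RDD."""
    if spark_default_parallelism is not None:
        return spark_default_parallelism
    return max(parent_partition_counts)
```

So joining a 4-partition RDD with an 8-partition RDD yields 8 output partitions by default, but setting `spark.default.parallelism=20` would yield 20.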
[GitHub] spark pull request: SPARK-4156 [MLLIB] EM algorithm for GMMs
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/3022#discussion_r20063810 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/GMMExpectationMaximization.scala --- @@ -0,0 +1,246 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import breeze.linalg.{DenseVector => BreezeVector, DenseMatrix => BreezeMatrix} +import breeze.linalg.{Transpose, det, inv} +import org.apache.spark.rdd.RDD +import org.apache.spark.mllib.linalg.{Matrices, Vector, Vectors} +import org.apache.spark.{Accumulator, AccumulatorParam, SparkContext} +import org.apache.spark.SparkContext.DoubleAccumulatorParam + +/** + * Expectation-Maximization for multivariate Gaussian Mixture Models. 
+ * + */ +object GMMExpectationMaximization { + /** + * Trains a GMM using the given parameters + * + * @param data training points stored as RDD[Vector] + * @param k the number of Gaussians in the mixture + * @param maxIterations the maximum number of iterations to perform + * @param delta change in log-likelihood at which convergence is considered achieved + */ + def train(data: RDD[Vector], k: Int, maxIterations: Int, delta: Double): GaussianMixtureModel = { +new GMMExpectationMaximization().setK(k) + .setMaxIterations(maxIterations) + .setDelta(delta) + .run(data) + } + + /** + * Trains a GMM using the given parameters + * + * @param data training points stored as RDD[Vector] + * @param k the number of Gaussians in the mixture + * @param maxIterations the maximum number of iterations to perform + */ + def train(data: RDD[Vector], k: Int, maxIterations: Int): GaussianMixtureModel = { +new GMMExpectationMaximization().setK(k).setMaxIterations(maxIterations).run(data) + } + + /** + * Trains a GMM using the given parameters + * + * @param data training points stored as RDD[Vector] + * @param k the number of Gaussians in the mixture + */ + def train(data: RDD[Vector], k: Int): GaussianMixtureModel = { +new GMMExpectationMaximization().setK(k).run(data) + } +} + +/** + * This class performs multivariate Gaussian expectation maximization. It will + * maximize the log-likelihood for a mixture of k Gaussians, iterating until + * the log-likelihood changes by less than delta, or until it has reached + * the max number of iterations. 
+ */ +class GMMExpectationMaximization private ( +private var k: Int, +private var delta: Double, +private var maxIterations: Int) extends Serializable { + + // Type aliases for convenience + private type DenseDoubleVector = BreezeVector[Double] + private type DenseDoubleMatrix = BreezeMatrix[Double] + + // A default instance, 2 Gaussians, 100 iterations, 0.01 log-likelihood threshold + def this() = this(2, 0.01, 100) + + /** Set the number of Gaussians in the mixture model. Default: 2 */ + def setK(k: Int): this.type = { +this.k = k +this + } + + /** Set the maximum number of iterations to run. Default: 100 */ + def setMaxIterations(maxIterations: Int): this.type = { +this.maxIterations = maxIterations +this + } + + /** + * Set the largest change in log-likelihood at which convergence is + * considered to have occurred. + */ + def setDelta(delta: Double): this.type = { +this.delta = delta +this + } + + /** Machine precision value used to ensure matrix conditioning */ + private val eps = math.pow(2.0, -52) + + /** Perform expectation maximization */ + def run(data: RDD[Vector]): GaussianMixtureModel = { +val ctx = data.sparkContext + +// we will operate on the data as breeze data +val breezeData = data.map{ u =>
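The convergence rule stated in the class doc above (iterate until the log-likelihood changes by less than `delta`, or `maxIterations` is hit) can be sketched independently of the PR's code. In this hypothetical Python sketch, `step` stands in for one combined E+M pass and returns the new log-likelihood:

```python
def run_em(step, delta, max_iterations):
    """Run an EM-style loop: stop when the log-likelihood improves by less
    than delta, or after max_iterations passes. Returns the number of
    iterations actually performed."""
    ll = float("-inf")  # no likelihood yet, so the first pass never converges
    for it in range(1, max_iterations + 1):
        new_ll = step(ll)
        if abs(new_ll - ll) < delta:
            return it
        ll = new_ll
    return max_iterations
```

A `step` that returns the same value twice in a row stops the loop; a `step` that keeps improving by more than `delta` runs out the iteration budget.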
[GitHub] spark pull request: [SPARK-4017] show progress bar in console and ...
Github user squito commented on the pull request: https://github.com/apache/spark/pull/3029#issuecomment-62330615 this is awesome!
[GitHub] spark pull request: Support cross building for Scala 2.11
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3159#issuecomment-62330823 [Test build #23126 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23126/consoleFull) for PR 3159 at commit [`542adea`](https://github.com/apache/spark/commit/542adeaf216cbfd5fbe2a99887e66224cc0f988d). * This patch merges cleanly.
[GitHub] spark pull request: SPARK-1344 [DOCS] Scala API docs for top metho...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/3168#issuecomment-62330950 Thanks, I pulled this in.
[GitHub] spark pull request: [SPARK-4087] use broadcast for task only when ...
Github user squito commented on the pull request: https://github.com/apache/spark/pull/2933#issuecomment-62330965 I agree with @pwendell. It seems like the right thing to do is just fix Broadcast ... and if we can't, then wouldn't you also want to turn off Broadcast even for big closures?
[GitHub] spark pull request: [SPARK-1957] [WIP] Pluggable Diskstore for Blo...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/907#issuecomment-62331200 This is useful as a prototype, but I'd prefer to close this issue rather than keep it as an active review. We can use this as a starting point if we revisit the internal interfaces here.
[GitHub] spark pull request: Updates to shell globbing in run-example and s...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/449#issuecomment-62331162 This is stale so let's close this issue.
[GitHub] spark pull request: SPARK-1972: Added support for tracking custom ...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/918#issuecomment-62331367 I'd like to close this issue for now and keep the JIRA around. This is a completely reasonable way to accomplish adding custom metrics, but this overlaps a good amount with Accumulators and their display in the UI - which I think is our longer term API for doing things like this. Anyways let's keep this patch and the JIRA around and we can consider it in the future.
[GitHub] spark pull request: [SPARK-2165] spark on yarn: add support for se...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1279#issuecomment-62331405 Okay let's close this issue for now and he can reopen it if he has time.
[GitHub] spark pull request: SPARK-971 [DOCS] Link to Confluence wiki from ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/3169
[GitHub] spark pull request: SPARK-1344 [DOCS] Scala API docs for top metho...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/3168
[GitHub] spark pull request: [SPARK-3051] Support looking-up named accumula...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/2438#issuecomment-62332004 Hey @nfergu - I was looking for any older PRs that have fallen through the cracks and came across this. This is a very well written patch - kudos! When I suggested this registry concept initially, I was actually envisioning this happening in user space rather than in Spark itself. I think automatically broadcasting all named accumulators is not going to work because some applications create thousands of accumulators (e.g. streaming applications), and it could end up with an unexpected performance regression. For some applications this might be acceptable though. How hard would it be for a user-space library to implement this rather than having it be inside of Spark proper?
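The user-space registry pwendell describes could be as small as the following sketch (all names hypothetical; a real version would hold Spark accumulator handles rather than plain values): the application, not Spark, keeps a thread-safe name-to-accumulator map, so nothing has to be broadcast per accumulator.

```python
import threading

class NamedRegistry:
    """Hypothetical user-space registry: a thread-safe mapping from a
    name to an accumulator-like object, maintained by the application."""

    def __init__(self):
        self._lock = threading.Lock()
        self._entries = {}

    def register(self, name, acc):
        with self._lock:
            self._entries[name] = acc

    def lookup(self, name):
        # Returns None if no accumulator was registered under this name.
        with self._lock:
            return self._entries.get(name)
```

Because the registry lives in application code, it only tracks the accumulators the application chooses to register, sidestepping the thousands-of-accumulators case pwendell raises.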
[GitHub] spark pull request: [SPARK-3548] [WebUI] Display cache hit ratio o...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/2411#issuecomment-62332025 Let's close this issue for now.
[GitHub] spark pull request: [SPARK-3548] [WebUI] Display cache hit ratio o...
Github user sarutak closed the pull request at: https://github.com/apache/spark/pull/2411
[GitHub] spark pull request: [SPARK-2671] BlockObjectWriter should create p...
Github user sarutak closed the pull request at: https://github.com/apache/spark/pull/1580
[GitHub] spark pull request: [SPARK-3171] Don't print meaningless informati...
Github user sarutak closed the pull request at: https://github.com/apache/spark/pull/2078
[GitHub] spark pull request: [SPARK-3551] Remove redundant putting FetchRes...
Github user sarutak closed the pull request at: https://github.com/apache/spark/pull/2413
[GitHub] spark pull request: [SPARK-3106] Fix the race condition issue abou...
Github user sarutak closed the pull request at: https://github.com/apache/spark/pull/2019
[GitHub] spark pull request: [SPARK-1888] enhance MEMORY_AND_DISK mode by d...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/791#issuecomment-62332299 This has mostly gone stale so I'd suggest we close this issue and revisit this later. This is a decent idea, but it does complicate things a good amount, and this particular piece of code IMO is already quite complicated. As with any performance change, it would be useful to quantify the performance problems observed as a result of this issue. For instance, has it been observed as a bottleneck in real clusters? Putting information of this type on the JIRA would be useful.
[GitHub] spark pull request: SPARK-1380: Add sort-merge based cogroup/joins...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/283#issuecomment-62332367 I'd suggest we close this issue for now and go to the JIRA to discuss whether the feature is needed and how high of a priority it is.
[GitHub] spark pull request: SPARK-2083 Add support for spark.local.maxFail...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/1465#issuecomment-62332453 I'm going to close this issue as wontfix.
[GitHub] spark pull request: [SPARK-1888] enhance MEMORY_AND_DISK mode by d...
Github user liyezhang556520 commented on the pull request: https://github.com/apache/spark/pull/791#issuecomment-62332899 @pwendell, I updated a [design doc](https://issues.apache.org/jira/secure/attachment/12679822/Spark-3000%20Design%20Doc.pdf) for [SPARK-3000](https://issues.apache.org/jira/browse/SPARK-3000) several days ago, which is also mainly intended to resolve this issue. There might be some performance problems in some cases; you can have a look at [this](https://github.com/apache/spark/pull/2134).
[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2906#issuecomment-62332987 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23125/ Test PASSed.
[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2906#issuecomment-62332985 [Test build #23125 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23125/consoleFull) for PR 2906 at commit [`b0b061e`](https://github.com/apache/spark/commit/b0b061edc4c2ad42deda00bb664534e1334b50e5). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `public class JavaHierarchicalClustering ` * `trait HierarchicalClusteringConf extends Serializable ` * `class HierarchicalClustering(` * `class HierarchicalClusteringModel(object):` * `class HierarchicalClustering(object):`
[GitHub] spark pull request: [SPARK-2429] [MLlib] Hierarchical Implementati...
Github user yu-iskw commented on the pull request: https://github.com/apache/spark/pull/2906#issuecomment-6214 @srowen and @rnowling , Sorry for my complicated commits. I modified my source code. Could you review my PR? - I addressed what you pointed out. - I added a function to cut the cluster tree of a trained hierarchical clustering model at a given dendrogram height. - I rebased my PR on the latest master branch and then force-pushed my branch, because there were a few conflicts with it. Thanks,
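The "cut the cluster tree at a dendrogram height" operation mentioned above can be illustrated with a rough SciPy analogue; this is not the PR's Spark/MLlib API, just a minimal sketch of the same idea (every merge in the tree above the cut height is undone, yielding flat cluster labels):

```python
# Illustrative sketch only: SciPy stand-in for cutting a hierarchical
# clustering tree at a dendrogram height; the names and data here are
# assumptions, not the PR's actual API.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six 2-D points forming two well-separated groups.
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

# Build the cluster tree (Ward linkage), then cut it at height 1.0:
# merges whose distance exceeds the height are discarded, leaving
# one flat cluster label per point.
tree = linkage(points, method="ward")
labels = fcluster(tree, t=1.0, criterion="distance")

print(len(set(labels)))  # the cut separates the two groups
```

Raising the cut height toward the root merges everything into one cluster; lowering it toward zero gives one cluster per point.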
[GitHub] spark pull request: [SPARK-4310][WebUI] Sort 'Submitted' column in...
GitHub user zsxwing opened a pull request: https://github.com/apache/spark/pull/3179 [SPARK-4310][WebUI] Sort 'Submitted' column in Stage page by time You can merge this pull request into a Git repository by running: $ git pull https://github.com/zsxwing/spark SPARK-4310 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3179.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3179 commit fb03b354add078ae47db55b79282636fe74ea7dc Author: zsxwing zsxw...@gmail.com Date: 2014-11-10T02:30:06Z Sort 'Submitted' column in Stage page by time