[GitHub] spark pull request: [SPARK-5337][Mesos][Standalone] respect spark....
Github user CodingCat commented on a diff in the pull request: https://github.com/apache/spark/pull/4129#discussion_r23368844 --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/SparkDeploySchedulerBackend.scala --- @@ -40,6 +40,14 @@ private[spark] class SparkDeploySchedulerBackend( var registrationDone = false val maxCores = conf.getOption("spark.cores.max").map(_.toInt) + val coreNumPerTask = { + val corePerTask = conf.getInt("spark.task.cpus", 1) + if (corePerTask < 1) { + throw new IllegalArgumentException( + s"spark.task.cpus is set to an invalid value $corePerTask") + } + corePerTask + } --- End diff -- oh... I tried to embed the validation logic into the assignment, so it looks like this. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
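The validate-in-the-initializer pattern under discussion can be sketched standalone. This is an illustrative stand-in (a plain `Map` config instead of the real `SparkConf` API), not the PR's actual code:

```scala
// Validate a setting inside the val's initializer so a bad value fails
// at construction time rather than surfacing later during scheduling.
object CpusPerTask {
  def coreNumPerTask(conf: Map[String, String]): Int = {
    val corePerTask = conf.getOrElse("spark.task.cpus", "1").toInt
    if (corePerTask < 1) {
      throw new IllegalArgumentException(
        s"spark.task.cpus is set to an invalid value $corePerTask")
    }
    corePerTask
  }
}
```

The design point is that the `val` assignment and the sanity check live in one expression, so the field can never hold an unvalidated value.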
[GitHub] spark pull request: [SPARK-4934][CORE] Print remote address in Con...
Github user shenh062326 commented on a diff in the pull request: https://github.com/apache/spark/pull/4157#discussion_r23370062 --- Diff: core/src/main/scala/org/apache/spark/network/nio/ConnectionManager.scala --- @@ -375,16 +375,22 @@ private[nio] class ConnectionManager( } } } else { - logInfo("Key not valid ? " + key) + logInfo("Key not valid ? " + key + " remote address: " + + key.channel().asInstanceOf[SocketChannel].socket --- End diff -- Thanks, I will change it.
[GitHub] spark pull request: [SPARK-5315][Streaming] Fix reduceByWindow Jav...
Github user jerryshao commented on the pull request: https://github.com/apache/spark/pull/4104#issuecomment-71021795 Thanks TD, done with code rebase.
[GitHub] spark pull request: [SPARK-5315][Streaming] Fix reduceByWindow Jav...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4104#issuecomment-71022025 [Test build #25967 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25967/consoleFull) for PR 4104 at commit [`5bc8987`](https://github.com/apache/spark/commit/5bc8987fa1a27a6d43ce7f8d1d01da2cc81af04b). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5233][Streaming] Fix error replaying of...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4032#issuecomment-71004965 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25964/ Test PASSed.
[GitHub] spark pull request: [SPARK-5233][Streaming] Fix error replaying of...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4032#issuecomment-71004957 [Test build #25964 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25964/consoleFull) for PR 4032 at commit [`a237c75`](https://github.com/apache/spark/commit/a237c75f87518b26245fd688de9bbae4bc151ae2). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4813][Streaming] Fix the issue that Con...
Github user zsxwing commented on a diff in the pull request: https://github.com/apache/spark/pull/3661#discussion_r23372367 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/ContextWaiter.scala --- @@ -17,30 +17,63 @@ package org.apache.spark.streaming +import java.util.concurrent.TimeUnit +import java.util.concurrent.locks.ReentrantLock + private[streaming] class ContextWaiter { + + private val lock = new ReentrantLock() + private val condition = lock.newCondition() + + // Guarded by lock private var error: Throwable = null - private var stopped: Boolean = false - def notifyError(e: Throwable) = synchronized { - error = e - notifyAll() - } + // Guarded by lock + private var stopped: Boolean = false - def notifyStop() = synchronized { - stopped = true - notifyAll() + def notifyError(e: Throwable): Unit = { + lock.lock() + try { + error = e + condition.signalAll() + } finally { + lock.unlock() + } } - def waitForStopOrError(timeout: Long = -1) = synchronized { - // If already had error, then throw it - if (error != null) { - throw error + def notifyStop(): Unit = { + lock.lock() + try { + stopped = true + condition.signalAll() + } finally { + lock.unlock() } + } - // If not already stopped, then wait - if (!stopped) { - if (timeout < 0) wait() else wait(timeout) + /** + * Return `true` if it's stopped; or throw the reported error if `notifyError` has been called; or + * `false` if the waiting time detectably elapsed before return from the method. + */ + def waitForStopOrError(timeout: Long = -1): Boolean = { --- End diff -- In hindsight, instead of modeling awaitTermination against Akka ActorSystem's awaitTermination (which returns Unit), I should have modeled it like Java ExecutorService's awaitTermination, which returns a Boolean. Now it's not possible to change the API without breaking compatibility. :( @tdas, sorry that I forgot to reply to you. You said you designed it just like Akka `ActorSystem.awaitTermination`. 
But [ActorSystem.awaitTermination](https://github.com/akka/akka/blob/master/akka-actor/src/main/scala/akka/actor/ActorSystem.scala#L394) will throw a TimeoutException in case of timeout.

```scala
/**
 * Block current thread until the system has been shutdown, or the specified
 * timeout has elapsed. This will block until after all on termination
 * callbacks have been run.
 *
 * @throws TimeoutException in case of timeout
 */
def awaitTermination(timeout: Duration): Unit
```
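The ExecutorService-style contract being argued for above can be sketched with the same `ReentrantLock`/`Condition` machinery the diff introduces. This is a minimal stand-in (a `StopWaiter` with only the stop path, no error handling), not the real `ContextWaiter`:

```scala
import java.util.concurrent.TimeUnit
import java.util.concurrent.locks.ReentrantLock

// A waiter whose waitForStop returns true once notifyStop() has been
// called, or false if the timeout elapses first -- the Boolean-returning
// shape of ExecutorService.awaitTermination.
class StopWaiter {
  private val lock = new ReentrantLock()
  private val condition = lock.newCondition()
  private var stopped = false // guarded by lock

  def notifyStop(): Unit = {
    lock.lock()
    try { stopped = true; condition.signalAll() } finally { lock.unlock() }
  }

  def waitForStop(timeoutMs: Long): Boolean = {
    lock.lock()
    try {
      var nanos = TimeUnit.MILLISECONDS.toNanos(timeoutMs)
      // awaitNanos returns the remaining wait time; <= 0 means it timed out.
      // Looping guards against spurious wakeups.
      while (!stopped && nanos > 0) {
        nanos = condition.awaitNanos(nanos)
      }
      stopped
    } finally { lock.unlock() }
  }
}
```

The caller can then distinguish "stopped" from "timed out" without catching a `TimeoutException`, which is the compatibility point raised in the thread.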
[GitHub] spark pull request: SPARK-5308 [BUILD] MD5 / SHA1 hash format does...
GitHub user srowen opened a pull request: https://github.com/apache/spark/pull/4161 SPARK-5308 [BUILD] MD5 / SHA1 hash format doesn't match standard Maven output Here's one way to make the hashes match what Maven's plugins would create. It takes a little extra footwork since OS X doesn't have the same command line tools. An alternative is just to make Maven output these of course - would that be better? I ask in case there is a reason I'm missing, like, we need to hash files that Maven doesn't build. You can merge this pull request into a Git repository by running: $ git pull https://github.com/srowen/spark SPARK-5308 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4161.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4161 commit e25eff8cd20e436002c77426e9b5e04cc1a19f2e Author: Sean Owen so...@cloudera.com Date: 2015-01-22T14:22:22Z Generate MD5, SHA1 hashes in a format like Maven's plugin
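For context on the format mismatch: Maven's checksum output is just the lowercase hex digest, whereas `md5sum`/`shasum` append the filename. A sketch of producing the bare-hex form (illustrative helper, not the PR's actual release script, which is shell-based):

```scala
import java.security.MessageDigest

// Emit a digest in Maven-plugin style: lowercase hex only, no
// "  filename" suffix like md5sum/shasum print.
object MavenStyleHash {
  def sha1Hex(bytes: Array[Byte]): String =
    MessageDigest.getInstance("SHA-1")
      .digest(bytes)
      .map(b => f"${b & 0xff}%02x")
      .mkString
}
```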
[GitHub] spark pull request: [SPARK-4987] [SQL] parquet timestamp type supp...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3820#issuecomment-71004690 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25963/ Test PASSed.
[GitHub] spark pull request: [SPARK-4987] [SQL] parquet timestamp type supp...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3820#issuecomment-71004681 [Test build #25963 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25963/consoleFull) for PR 3820 at commit [`5d1eeed`](https://github.com/apache/spark/commit/5d1eeedd43d60d9ef9c5dcc0e97fff829ccea8ed). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: SPARK-4506 [DOCS] Addendum: Update more docs t...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4160#issuecomment-71008499 [Test build #25965 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25965/consoleFull) for PR 4160 at commit [`5f5f7df`](https://github.com/apache/spark/commit/5f5f7dfea53e25f4dfe7f880aa8479263b36cc74). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: SPARK-4506 [DOCS] Addendum: Update more docs t...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4160#issuecomment-71008502 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25965/ Test PASSed.
[GitHub] spark pull request: SPARK-5308 [BUILD] MD5 / SHA1 hash format does...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4161#issuecomment-71027549 [Test build #25968 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25968/consoleFull) for PR 4161 at commit [`e25eff8`](https://github.com/apache/spark/commit/e25eff8cd20e436002c77426e9b5e04cc1a19f2e). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5233][Streaming] Fix error replaying of...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/4032#discussion_r23373392 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceivedBlockTracker.scala --- @@ -107,8 +107,14 @@ private[streaming] class ReceivedBlockTracker( lastAllocatedBatchTime = batchTime allocatedBlocks } else { - throw new SparkException(s"Unexpected allocation of blocks, " + - s"last batch = $lastAllocatedBatchTime, batch time to allocate = $batchTime") + // This situation occurs when: + // 1. WAL is ended with BatchAllocationEvent, but without BatchCleanupEvent; + // possibly a processed or half-processed batch job needs to be processed again, + // so the batchTime will be equal to lastAllocatedBatchTime. + // 2. Slow checkpointing makes the recovered batch time older than the WAL-recovered + // lastAllocatedBatchTime. + // This situation only occurs during recovery. + logWarning(s"Possibly processed batch $batchTime needs to be processed again in WAL recovery") --- End diff -- At least I think we should inform the user that this batch was processed or half-processed before, so they can decide whether to clean or overwrite the previously processed result. I'm downgrading it to logInfo; what's your opinion?
[GitHub] spark pull request: [SPARK-5233][Streaming] Fix error replaying of...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4032#issuecomment-71020806 [Test build #25966 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25966/consoleFull) for PR 4032 at commit [`f0b0c0b`](https://github.com/apache/spark/commit/f0b0c0bcc8db551c57c1590c8a12b01f5f0ae2d0). * This patch merges cleanly.
[GitHub] spark pull request: [WIP][SPARK-4251][SPARK-2352][MLLIB]Add RBM, A...
Github user musicx commented on the pull request: https://github.com/apache/spark/pull/3222#issuecomment-70984348 Hi @witgo, where can I find your email? I'd like to get in touch.
[GitHub] spark pull request: [SPARK-5154] [PySpark] [Streaming] Kafka strea...
Github user davies commented on the pull request: https://github.com/apache/spark/pull/3715#issuecomment-70984720 @tdas I think this PR is almost ready, please follow the example to double check it, thanks!
[GitHub] spark pull request: [SPARK-5154] [PySpark] [Streaming] Kafka strea...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3715#issuecomment-70985022 [Test build #25958 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25958/consoleFull) for PR 3715 at commit [`97386b3`](https://github.com/apache/spark/commit/97386b3debd5f352b61dfed194ab9495fecbe834). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5233][Streaming] Fix error replaying of...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/4032#discussion_r23360371 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceivedBlockTracker.scala --- @@ -106,6 +106,12 @@ private[streaming] class ReceivedBlockTracker( timeToAllocatedBlocks(batchTime) = allocatedBlocks lastAllocatedBatchTime = batchTime allocatedBlocks +} else if (batchTime == lastAllocatedBatchTime) { --- End diff -- Could you improve the unit test to actually verify the behavior? The unit test line you removed can be modified to verify that calling it with `batchTime` == `lastAllocatedBatchTime` is a no-op and does no allocation.
[GitHub] spark pull request: SPARK-5357: Update commons-codec version to 1....
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/4153#issuecomment-70985727 Jenkins, this is ok to test.
[GitHub] spark pull request: [SPARK-5364] [SQL] HiveQL transform doesn't su...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4158#issuecomment-70986035 [Test build #25959 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25959/consoleFull) for PR 4158 at commit [`c8fe7fc`](https://github.com/apache/spark/commit/c8fe7fc37471c38b24e52a5d170fa0741b50c791). * This patch merges cleanly.
[GitHub] spark pull request: SPARK-5357: Update commons-codec version to 1....
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4153#issuecomment-70986011 [Test build #25960 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25960/consoleFull) for PR 4153 at commit [`b4a91f4`](https://github.com/apache/spark/commit/b4a91f478d496b48c03d28bc44a9a848b5e93c85). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5353] Log failures in REPL class loadin...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/4130#discussion_r23361106 --- Diff: repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala --- @@ -91,7 +91,14 @@ class ExecutorClassLoader(conf: SparkConf, classUri: String, parent: ClassLoader inputStream.close() Some(defineClass(name, bytes, 0, bytes.length)) } catch { - case e: Exception => None + case e: FileNotFoundException => + // We did not find the class + logDebug(s"Did not load class $name from REPL class server at $uri", e) --- End diff -- Yes, exactly. The URI is typically not remote, not from some server, just a local JAR.
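The exception split in the diff above can be sketched in isolation. This is an illustrative stand-in (a generic byte loader with `Console.err` in place of the real logging), not the actual `ExecutorClassLoader`:

```scala
import java.io.{ByteArrayInputStream, FileNotFoundException, InputStream}

// A FileNotFoundException just means the class isn't present at the
// source, so it is swallowed quietly (logDebug territory); any other
// exception is a real failure worth surfacing before returning None.
object ClassBytes {
  def load(open: () => InputStream, name: String): Option[Array[Byte]] =
    try {
      val in = open()
      try Some(Iterator.continually(in.read()).takeWhile(_ != -1).map(_.toByte).toArray)
      finally in.close()
    } catch {
      case _: FileNotFoundException =>
        // Expected case: the class simply isn't served from this location
        None
      case e: Exception =>
        Console.err.println(s"Failed to load class $name: $e") // stand-in for logError
        None
    }
}
```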
[GitHub] spark pull request: [SPARK-4934][CORE] Print remote address in Con...
GitHub user shenh062326 opened a pull request: https://github.com/apache/spark/pull/4157 [SPARK-4934][CORE] Print remote address in ConnectionManager The connection key is hard to read: "key already cancelled ? sun.nio.ch.SelectionKeyImpl@52b0e278". It's hard to diagnose problems from this log, so it's better to add the remote address. You can merge this pull request into a Git repository by running: $ git pull https://github.com/shenh062326/spark my_change2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4157.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4157 commit e5ac73e3fd18c3bf5c1a32ce531b15be9feac385 Author: Hong Shen hongs...@tencent.com Date: 2015-01-22T07:53:47Z Print remote address in ConnectionManager
[GitHub] spark pull request: [SQL] SPARK-5309: Use Dictionary for Binary-S...
Github user MickDavies commented on the pull request: https://github.com/apache/spark/pull/4139#issuecomment-70984197 I've looked through ParquetQuerySuite and ParquetQuerySuite2 and it's not obvious that there are tests that will exercise this change, i.e. where Parquet uses dictionary encoding for Strings. Most test String columns have incrementing values that result in unique strings, and I don't think these will be encoded in dictionaries. I think it would be good to add an explicit test to ParquetQuerySuite2, which I'll try to do this evening.
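The cardinality point above can be made concrete with a standalone sketch (illustrative test data, not an actual ParquetQuerySuite case): incrementing strings are all distinct, so a dictionary of them would be as large as the data itself, whereas a modulo pattern yields the low-cardinality column a dictionary-encoding test needs.

```scala
// Two candidate test columns: one where every value is unique (dictionary
// encoding gains nothing) and one where 1000 rows share only 10 values
// (the shape Parquet dictionary-encodes well).
val incrementing   = (1 to 1000).map(i => s"val_$i")
val lowCardinality = (1 to 1000).map(i => s"val_${i % 10}")
```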
[GitHub] spark pull request: [SPARK-5154] [PySpark] [Streaming] Kafka strea...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3715#issuecomment-70984126 [Test build #25956 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25956/consoleFull) for PR 3715 at commit [`2c567a5`](https://github.com/apache/spark/commit/2c567a5d55c465d706026c2395e9025fad9dbd68). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5147][Streaming] Delete the received da...
Github user jerryshao commented on the pull request: https://github.com/apache/spark/pull/4037#issuecomment-70984564 OK.
[GitHub] spark pull request: [SPARK-5147][Streaming] Delete the received da...
Github user jerryshao closed the pull request at: https://github.com/apache/spark/pull/4037
[GitHub] spark pull request: [WIP][SPARK-4251][SPARK-2352][MLLIB]Add RBM, A...
Github user witgo commented on the pull request: https://github.com/apache/spark/pull/3222#issuecomment-70984655 witgo#qq.com
[GitHub] spark pull request: [SPARK-4934][CORE] Print remote address in Con...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4157#issuecomment-70984613 [Test build #25957 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25957/consoleFull) for PR 4157 at commit [`e5ac73e`](https://github.com/apache/spark/commit/e5ac73e3fd18c3bf5c1a32ce531b15be9feac385). * This patch merges cleanly.
[GitHub] spark pull request: SPARK-5357: Update commons-codec version to 1....
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/4153#issuecomment-70987116 I think it is pretty safe to update. However, the right thing in general is to depend on Commons Codec in your app and set your classpath to take precedence.
[GitHub] spark pull request: [SPARK-5233][Streaming] Fix error replaying of...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/4032#discussion_r23361754 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceivedBlockTracker.scala --- @@ -106,6 +106,12 @@ private[streaming] class ReceivedBlockTracker( timeToAllocatedBlocks(batchTime) = allocatedBlocks lastAllocatedBatchTime = batchTime allocatedBlocks +} else if (batchTime == lastAllocatedBatchTime) { + // This situation occurs when WAL is ended with BatchAllocationEvent, + // but without BatchCleanupEvent; possibly a processed or half-processed batch + // job needs to be processed again, so the batchTime will be equal to lastAllocatedBatchTime. + // This situation only occurs during recovery. + logWarning(s"Possibly processed batch $batchTime needs to be processed again in WAL recovery") } else { --- End diff -- Actually, let's remove this exception completely. Instead, any attempt to allocate blocks to a batch such that batchTime <= lastAllocatedBatchTime is completely ignored.
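The ignore-instead-of-throw behavior proposed above can be sketched standalone. This is a simplified stand-in (plain `Long` batch times and `String` block ids), not the real `ReceivedBlockTracker`:

```scala
// Allocation for a batch time at or before the last allocated time is
// silently ignored rather than thrown, so replaying an already-allocated
// batch from the WAL during recovery becomes a harmless no-op.
class BlockAllocator {
  private var lastAllocatedBatchTime: Long = -1L
  private val timeToBlocks = scala.collection.mutable.Map[Long, Seq[String]]()

  // Returns true if the allocation was recorded, false if ignored.
  def allocateBlocksToBatch(batchTime: Long, blocks: Seq[String]): Boolean =
    if (lastAllocatedBatchTime < 0 || batchTime > lastAllocatedBatchTime) {
      timeToBlocks(batchTime) = blocks
      lastAllocatedBatchTime = batchTime
      true
    } else {
      // batchTime <= lastAllocatedBatchTime: seen during WAL recovery; no-op
      false
    }
}
```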
[GitHub] spark pull request: [SPARK-5135][SQL] Add support for describe [ex...
Github user OopsOutOfMemory commented on a diff in the pull request: https://github.com/apache/spark/pull/4127#discussion_r23359955 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/commands.scala --- @@ -178,3 +180,34 @@ case class DescribeCommand( child.output.map(field => Row(field.name, field.dataType.toString, null)) } } + +/** + * :: DeveloperApi :: + */ +@DeveloperApi +case class DDLDescribeCommand( +dbName: Option[String], +tableName: String, isExtended: Boolean) extends RunnableCommand { + + override def run(sqlContext: SQLContext) = { +val tblRelation = dbName match { + case Some(db) => UnresolvedRelation(Seq(db, tableName)) + case None => UnresolvedRelation(Seq(tableName)) +} +val logicalRelation = sqlContext.executePlan(tblRelation).analyzed +val rows = new ArrayBuffer[Row]() +rows ++= logicalRelation.schema.fields.map { field => + Row(field.name, field.dataType.toSimpleString, null) } + +/* + * TODO if future support partition table, add header below: + * # Partition Information + * # col_name data_type comment --- End diff -- After SPARK-5182 is finished, we can do it like this:
```
val logicalRelation = sqlContext.executePlan(tblRelation).analyzed
val rows = new ArrayBuffer[Row]()
rows ++= logicalRelation.schema.fields.map { field =>
  Row(field.name, field.dataType.toSimpleString, null) }
val partitionFields = logicalRelation.schema.getPartitionedCols()
if (partitionFields.nonEmpty) {
  val partColumnRows = partitionFields.map(field =>
    Row(field.getName, field.getType.toSimpleString, null))
  rows ++= Seq(Row("# Partition Information", "", ""),
    Row("col_name", "data_type", "comment")) ++ partColumnRows
}
```
For `extended`, we can do it later and discuss what detailed information to show.
```
if (isExtended) {
  rows ++= Seq(Row("Detailed Table Information", /* get some detail info from the table */ "", ""))
}
```
[GitHub] spark pull request: [SPARK-5196][SQL] Support `comment` in Create ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3999#issuecomment-70983891 [Test build #25950 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25950/consoleFull) for PR 3999 at commit [`d1cfb0f`](https://github.com/apache/spark/commit/d1cfb0fa7620c3cc6800087f32a2e4542884823f). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-5196][SQL] Support `comment` in Create ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3999#issuecomment-70983903 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25950/ Test PASSed.
[GitHub] spark pull request: [SPARK-5297][Streaming][backport] Backport SPA...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4154#issuecomment-70984233 [Test build #25954 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25954/consoleFull) for PR 4154 at commit [`a4b2bea`](https://github.com/apache/spark/commit/a4b2bea998e5722b11383d553914253249788c2c). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-5297][Streaming][backport] Backport SPA...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4154#issuecomment-70984236 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25954/ Test FAILed.
[GitHub] spark pull request: [SPARK-5213] [SQL] Pluggable SQL Parser Suppor...
Github user chenghao-intel commented on the pull request: https://github.com/apache/spark/pull/4015#issuecomment-70985149 cc @marmbrus @liancheng
[GitHub] spark pull request: [SPARK-5364] [SQL] HiveQL transform doesn't su...
GitHub user chenghao-intel opened a pull request: https://github.com/apache/spark/pull/4158 [SPARK-5364] [SQL] HiveQL transform doesn't support the non output clause This is a quick fix for a query (in HiveContext) like:
```
SELECT transform(key + 1, value) USING '/bin/cat' FROM src
```
Ideally, we need to refactor `ScriptTransformation`, which should support a custom SerDe for the reader/writer. Will do that in a follow-up. You can merge this pull request into a Git repository by running: $ git pull https://github.com/chenghao-intel/spark transform Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4158.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4158 commit c8fe7fc37471c38b24e52a5d170fa0741b50c791 Author: Cheng Hao hao.ch...@intel.com Date: 2015-01-22T08:09:00Z fix bug of transform in HiveQL
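For context on what `TRANSFORM ... USING '/bin/cat'` means, a simplified, self-contained illustration (plain Scala, not Spark's actual `ScriptTransformation` operator): rows are serialized to the child process's stdin and its stdout is read back as rows. This assumes `/bin/cat` exists on the machine; everything else here is illustrative.

```scala
import java.io.ByteArrayInputStream
import scala.sys.process._

// Two input "rows", tab-delimited, as Hive's script transform would feed them.
val rows = Seq(Seq("2", "val_1"), Seq("3", "val_2"))
val input = rows.map(_.mkString("\t")).mkString("\n") + "\n"

// Pipe the serialized rows through /bin/cat and capture its stdout.
val output = ("/bin/cat" #< new ByteArrayInputStream(input.getBytes("UTF-8"))).!!

// Deserialize the script's output back into rows.
val outRows = output.trim.split("\n").toSeq.map(_.split("\t").toSeq)
```

Since `cat` is the identity transform, `outRows` round-trips to the original rows; a real script could emit any number of transformed rows instead.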
[GitHub] spark pull request: [SPARK-5233][Streaming] Fix error replaying of...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/4032#discussion_r23360875 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceivedBlockTracker.scala --- @@ -106,6 +106,12 @@ private[streaming] class ReceivedBlockTracker( timeToAllocatedBlocks(batchTime) = allocatedBlocks lastAllocatedBatchTime = batchTime allocatedBlocks +} else if (batchTime == lastAllocatedBatchTime) { --- End diff -- OK, I will improve this.
[GitHub] spark pull request: [SPARK-5307] SerializationDebugger - take 2
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4098#issuecomment-70988349 This is really cool.
[GitHub] spark pull request: [SPARK-5154] [PySpark] [Streaming] Kafka strea...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3715#issuecomment-70988894 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25951/ Test FAILed.
[GitHub] spark pull request: [SPARK-5154] [PySpark] [Streaming] Kafka strea...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3715#issuecomment-7099 [Test build #25951 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25951/consoleFull) for PR 3715 at commit [`adeeb38`](https://github.com/apache/spark/commit/adeeb3863353f9a0ca3070a9cc914a2914d95fa9). * This patch **fails Spark unit tests**. * This patch **does not merge cleanly**. * This patch adds the following public classes _(experimental)_: * `class KafkaUtils(object):`
[GitHub] spark pull request: [SPARK-5233][Streaming] Fix error replaying of...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/4032#discussion_r23361948 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceivedBlockTracker.scala --- @@ -106,6 +106,12 @@ private[streaming] class ReceivedBlockTracker( timeToAllocatedBlocks(batchTime) = allocatedBlocks lastAllocatedBatchTime = batchTime allocatedBlocks +} else if (batchTime == lastAllocatedBatchTime) { --- End diff -- Update to this comment. See other comment about removing this check altogether. Please update the unit test to verify this new behavior.
[GitHub] spark pull request: [SPARK-4934][CORE] Print remote address in Con...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/4157#discussion_r23365112 --- Diff: core/src/main/scala/org/apache/spark/network/nio/ConnectionManager.scala --- @@ -375,16 +375,22 @@ private[nio] class ConnectionManager( } } } else { - logInfo("Key not valid ? " + key) + logInfo("Key not valid ? " + key + " remote address: " + + key.channel().asInstanceOf[SocketChannel].socket --- End diff -- My first reaction was that this risks an exception just for a log message. Since this work is repeated several times, how about creating a method that returns the remote address in the case of a `SocketChannel` and the default `toString()` otherwise? Although given the current code, it will always be a `SocketChannel`. Can you use string interpolation here? Finally, the second cast isn't needed, is it? It does not change the `toString()` that is called.
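A sketch of the kind of helper suggested above (the method name is illustrative, not from the PR): it returns the remote address when the channel is a `SocketChannel` and falls back to `toString()` otherwise, so the log-message code cannot throw on an unexpected channel type.

```scala
import java.nio.channels.{Channel, SocketChannel}

// Format a channel for logging: prefer the remote socket address,
// fall back to the default toString() for non-socket channels.
def remoteAddressOf(channel: Channel): String = channel match {
  case sc: SocketChannel => String.valueOf(sc.socket().getRemoteSocketAddress)
  case other             => String.valueOf(other)
}
```

With string interpolation the call site becomes something like `logInfo(s"Key not valid? $key, remote address: ${remoteAddressOf(key.channel())}")`.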
[GitHub] spark pull request: [SPARK-5315][Streaming] Fix reduceByWindow Jav...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/4104#discussion_r23365189 --- Diff: project/MimaExcludes.scala --- @@ -82,6 +82,10 @@ object MimaExcludes { // SPARK-5166 Spark SQL API stabilization ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.ml.Transformer.transform"), ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.ml.Estimator.fit") + ) ++ Seq( +// SPARK-5315 Spark Streaming Java API returns Scala DStream +ProblemFilters.exclude[MissingMethodProblem]( + "org.apache.spark.streaming.api.java.JavaDStreamLike.reduceByWindow") --- End diff -- This is actually not a false positive. Since JavaDStreamLike is a trait, it boils down to an abstract class. If someone has inherited their own class from DStreamLike and had a function reduceByKeyAndWindow with this signature (the new one), their code will break, as it will now have to override the method. So technically it is correct. But I don't envision anyone doing this, so this binary compatibility break is fine.
[GitHub] spark pull request: [SPARK-5315][Streaming] Fix reduceByWindow Jav...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/4104#discussion_r23365219 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaDStreamLike.scala --- @@ -211,7 +211,9 @@ trait JavaDStreamLike[T, This : JavaDStreamLike[T, This, R], R : JavaRDDLike[T * @param slideDuration sliding interval of the window (i.e., the interval after which * the new DStream will generate RDDs); must be a multiple of this * DStream's batching interval + * @deprecated As this API is not Java compatible. */ + @deprecated("Use Java compatible version", "1.3.0") --- End diff -- "Use Java-compatible version of reduceByWindow"
[GitHub] spark pull request: [SPARK-5233][Streaming] Fix error replaying of...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4032#issuecomment-70997210 [Test build #25964 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25964/consoleFull) for PR 4032 at commit [`a237c75`](https://github.com/apache/spark/commit/a237c75f87518b26245fd688de9bbae4bc151ae2). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5233][Streaming] Fix error replaying of...
Github user jerryshao commented on the pull request: https://github.com/apache/spark/pull/4032#issuecomment-70997541 Hey @tdas, I've updated the code and rebased the branch according to your comments. After several rounds of testing in my local setup, the previous exception I reported is gone :).
[GitHub] spark pull request: [SPARK-5365][MLlib] Refactor KMeans to reduce ...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/4159#issuecomment-70997638 So this returns `(p, (r1, r2, r3, ...))` instead of `(r1, p), (r2, p), (r3, p), ...` Makes sense to me, especially if you have reason to believe this is a bottleneck somewhere.
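A plain-collections illustration of the layout change described above (standing in for the actual KMeans RDD operations; names are illustrative): the old shape duplicates the point `p` once per run, while the new shape keeps `p` once alongside all of its per-run values.

```scala
// One point "p" associated with values from three runs.
val perPoint: Seq[(String, Seq[Int])] = Seq(("p", Seq(1, 2, 3)))

// Old layout: one (run, point) pair per run, so p is repeated per run.
val exploded: Seq[(Int, String)] =
  perPoint.flatMap { case (p, runs) => runs.map(r => (r, p)) }

// New layout: (p, (r1, r2, r3, ...)), so p is stored only once.
val compact: Seq[(String, Seq[Int])] = perPoint
```

When `p` is a high-dimensional vector, the exploded layout shuffles and collects a copy of it per run, which is the redundancy the PR avoids.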
[GitHub] spark pull request: [SPARK-5315][Streaming] Fix reduceByWindow Jav...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/4104#discussion_r23365454 --- Diff: project/MimaExcludes.scala --- @@ -82,6 +82,10 @@ object MimaExcludes { // SPARK-5166 Spark SQL API stabilization ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.ml.Transformer.transform"), ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.ml.Estimator.fit") + ) ++ Seq( +// SPARK-5315 Spark Streaming Java API returns Scala DStream +ProblemFilters.exclude[MissingMethodProblem]( + "org.apache.spark.streaming.api.java.JavaDStreamLike.reduceByWindow") --- End diff -- Ah right, my understanding of this issue never fails to fail. In a recent PR, I think the decision was to give up on supporting user code extending `JavaRDDLike`, so indeed this is OK.
[GitHub] spark pull request: [SPARK-5315][Streaming] Fix reduceByWindow Jav...
Github user tdas commented on the pull request: https://github.com/apache/spark/pull/4104#issuecomment-70997871 Can you update this patch to deal with the merge conflicts?
[GitHub] spark pull request: [SPARK-5365][MLlib] Refactor KMeans to reduce ...
Github user viirya commented on the pull request: https://github.com/apache/spark/pull/4159#issuecomment-70998144 Especially when there are many runs, and p is high-dimensional and selected in more than one run. Then collecting redundant copies of p would be wasteful and time-consuming.
[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4073#issuecomment-70998096 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25961/ Test PASSed.
[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4073#issuecomment-70998086 [Test build #25961 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25961/consoleFull) for PR 4073 at commit [`e0e0d9c`](https://github.com/apache/spark/commit/e0e0d9c65c8981e86cf806f39200bff76e4e786f). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-5365][MLlib] Refactor KMeans to reduce ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4159#issuecomment-70998595 [Test build #25962 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25962/consoleFull) for PR 4159 at commit [`25487e6`](https://github.com/apache/spark/commit/25487e65420cb3706744c0f8a560ac0f714fca31). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-5365][MLlib] Refactor KMeans to reduce ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4159#issuecomment-70998601 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25962/ Test PASSed.
[GitHub] spark pull request: [SPARK-5347][CORE] Change FileSplit to InputSp...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/4150#issuecomment-70999157 My only question was whether `getLength()` is indeed defined in the `InputSplit` interface in older Hadoop versions, but it looks like it is. This change compiles with default Hadoop versions in the build.
[GitHub] spark pull request: [SPARK-5233][Streaming] Fix error replaying of...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/4032#discussion_r23366311 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala --- @@ -206,9 +208,13 @@ class JobGenerator(jobScheduler: JobScheduler) extends Logging { val timesToReschedule = (pendingTimes ++ downTimes).distinct.sorted(Time.ordering) logInfo("Batches to reschedule (" + timesToReschedule.size + " batches): " + timesToReschedule.mkString(", ")) -timesToReschedule.foreach(time => +timesToReschedule.foreach { time => + // Allocate the related blocks when recovering from failure, because some added but not + // allocated block is dangled in the queue after recovering, we have to insert some block --- End diff -- Grammar fix: "...some blocks that were added but not allocated are dangling in the queue after recovering. We have to allocate those blocks to the next batch, which is the batch they were supposed to go to."
[GitHub] spark pull request: [SPARK-5233][Streaming] Fix error replaying of...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/4032#discussion_r23366414 --- Diff: streaming/src/test/scala/org/apache/spark/streaming/ReceivedBlockTrackerSuite.scala --- @@ -82,15 +82,13 @@ class ReceivedBlockTrackerSuite receivedBlockTracker.allocateBlocksToBatch(2) receivedBlockTracker.getBlocksOfBatchAndStream(2, streamId) shouldBe empty -// Verify that batch 2 cannot be allocated again -intercept[SparkException] { - receivedBlockTracker.allocateBlocksToBatch(2) -} +// Verify that allocation for older batches is a no-op, +// and returns the same blocks as previously allocated. +receivedBlockTracker.allocateBlocksToBatch(1) +receivedBlockTracker.getBlocksOfBatchAndStream(1, streamId) shouldEqual blockInfos -// Verify that older batches cannot be allocated again -intercept[SparkException] { - receivedBlockTracker.allocateBlocksToBatch(1) -} +receivedBlockTracker.allocateBlocksToBatch(2) +receivedBlockTracker.getBlocksOfBatchAndStream(2, streamId) shouldBe empty --- End diff -- Could you also test that even if there are new unallocated blocks, calling `receivedBlockTracker.allocateBlocksToBatch(2)` does not allocate those blocks? I don't think that is covered in this test.
[GitHub] spark pull request: SPARK-4506 [DOCS] Addendum: Update more docs t...
GitHub user srowen opened a pull request: https://github.com/apache/spark/pull/4160 SPARK-4506 [DOCS] Addendum: Update more docs to reflect that standalone works in cluster mode This is a trivial addendum to SPARK-4506, which was already resolved, as noted by Asim Jalis in SPARK-4506. You can merge this pull request into a Git repository by running: $ git pull https://github.com/srowen/spark SPARK-4506 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4160.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4160 commit 5f5f7dfea53e25f4dfe7f880aa8479263b36cc74 Author: Sean Owen so...@cloudera.com Date: 2015-01-22T10:26:50Z Update more docs to reflect that standalone works in cluster mode
[GitHub] spark pull request: [SPARK-5233][Streaming] Fix error replaying of...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/4032#discussion_r23366552 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceivedBlockTracker.scala --- @@ -107,8 +107,14 @@ private[streaming] class ReceivedBlockTracker( lastAllocatedBatchTime = batchTime allocatedBlocks } else { - throw new SparkException(s"Unexpected allocation of blocks, " + - s"last batch = $lastAllocatedBatchTime, batch time to allocate = $batchTime") + // This situation occurs when: + // 1. The WAL ends with a BatchAllocationEvent but without a BatchCleanupEvent: + // a processed or half-processed batch job may need to be processed again, + // so batchTime will equal lastAllocatedBatchTime. + // 2. Slow checkpointing makes the recovered batch time older than the WAL-recovered + // lastAllocatedBatchTime. + // This situation only occurs at recovery time. + logWarning(s"Possibly processed batch $batchTime needs to be processed again in WAL recovery") --- End diff -- I am not sure we should even log a warning. Warnings are for when something bad may have happened. In this case, if the code reaches here, nothing bad or unexpected has happened.
[GitHub] spark pull request: [SPARK-4654][CORE] Clean up DAGScheduler getMi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/4134#discussion_r23366593 --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala --- @@ -349,34 +349,7 @@ class DAGScheduler( } private def getMissingParentStages(stage: Stage): List[Stage] = { -val missing = new HashSet[Stage] -val visited = new HashSet[RDD[_]] -// We are manually maintaining a stack here to prevent StackOverflowError -// caused by recursively visiting -val waitingForVisit = new Stack[RDD[_]] -def visit(rdd: RDD[_]) { - if (!visited(rdd)) { -visited += rdd -if (getCacheLocs(rdd).contains(Nil)) { --- End diff -- Actually I don't fully understand why we need this. Does this mean the `rdd` has at least one block which hasn't been cached? And why do we need this to get missing parent stages? @lianhuiwang @rxin
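To make cloud-fan's question concrete: `getCacheLocs` returns one list of block-manager locations per partition, so `contains(Nil)` is true exactly when at least one partition has no cached copy anywhere, meaning the stage's lineage (and hence its parent stages) may need to be walked. A minimal sketch of that check, with plain strings standing in for Spark's `TaskLocation` (the helper name below is made up for illustration):

```scala
// One Seq[TaskLocation] per partition; an empty list means that partition
// is not cached on any executor, so the stage cannot be served from cache.
type TaskLocation = String

def hasUncachedPartition(cacheLocs: Seq[Seq[TaskLocation]]): Boolean =
  cacheLocs.contains(Nil)

val fullyCached  = Seq(Seq("host1"), Seq("host2"))
val partlyCached = Seq(Seq("host1"), Nil)

assert(!hasUncachedPartition(fullyCached))   // every partition cached somewhere
assert(hasUncachedPartition(partlyCached))   // second partition must be computed
```

This is why the check appears in `getMissingParentStages`: only RDDs with at least one uncached partition force a visit to their dependencies.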
[GitHub] spark pull request: [SPARK-5233][Streaming] Fix error replaying of...
Github user tdas commented on the pull request: https://github.com/apache/spark/pull/4032#issuecomment-71000469 Looks almost good. Some minor comments.
[GitHub] spark pull request: [SPARK-4939] move to next locality when no pen...
Github user tdas commented on the pull request: https://github.com/apache/spark/pull/3779#issuecomment-71000700 @kayousterhout @davies @pwendell can this be merged for 1.2.1?
[GitHub] spark pull request: [SPARK-5147][Streaming] Delete the received da...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4149#issuecomment-70989118 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25952/ Test PASSed.
[GitHub] spark pull request: [SPARK-5147][Streaming] Delete the received da...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4149#issuecomment-70989110 [Test build #25952 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25952/consoleFull) for PR 4149 at commit [`730798b`](https://github.com/apache/spark/commit/730798b8a21eea5ebc13acb9e4376b42271ff9e4). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-5233][Streaming] Fix error replaying of...
Github user tdas commented on the pull request: https://github.com/apache/spark/pull/4032#issuecomment-70989477 I understand this patch now. Please update it based on my comments, and test it in your harness to make sure that it addresses the exception problem.
[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4073#issuecomment-70989958 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25955/ Test PASSed.
[GitHub] spark pull request: Refactor KMeans to reduce redundant data
GitHub user viirya opened a pull request: https://github.com/apache/spark/pull/4159 Refactor KMeans to reduce redundant data If a point is selected as a new center for many runs, it collects a lot of redundant data. This PR refactors that. You can merge this pull request into a Git repository by running: $ git pull https://github.com/viirya/spark-1 small_refactor_kmeans Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4159.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4159 commit 25487e65420cb3706744c0f8a560ac0f714fca31 Author: Liang-Chi Hsieh vii...@gmail.com Date: 2015-01-22T09:00:33Z Refactor codes to reduce redundant data.
[GitHub] spark pull request: Refactor KMeans to reduce redundant data
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4159#issuecomment-70990653 [Test build #25962 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25962/consoleFull) for PR 4159 at commit [`25487e6`](https://github.com/apache/spark/commit/25487e65420cb3706744c0f8a560ac0f714fca31). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5233][Streaming] Fix error replaying of...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/4032#discussion_r23362757 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceivedBlockTracker.scala --- @@ -106,6 +106,12 @@ private[streaming] class ReceivedBlockTracker( timeToAllocatedBlocks(batchTime) = allocatedBlocks lastAllocatedBatchTime = batchTime allocatedBlocks +} else if (batchTime == lastAllocatedBatchTime) { + // This situation occurs when WAL is ended with BatchAllocationEvent, + // but without BatchCleanupEvent, possibly processed batch job or half-processed batch + // job need to be processed again, so the batchTime will be equal to lastAllocatedBatchTime. + // This situation will only occur at recovery time. + logWarning(s"Possibly processed batch $batchTime need to be processed again in WAL recovery") } else { --- End diff -- IIUC, this might not happen normally unless the checkpoint operation is very slow. Anyway, we should handle it to make the code more robust.
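The recovery behavior being debated can be sketched as follows. This is a simplified stand-in, not the real `ReceivedBlockTracker` (the class and method names below are invented for illustration): re-allocating an already-allocated batch during WAL replay becomes a no-op instead of throwing, since a `BatchAllocationEvent` without a matching `BatchCleanupEvent` can legitimately be replayed.

```scala
import scala.collection.mutable

// Simplified tracker: allocation is idempotent for batches at or before
// the last allocated batch time, which is exactly the WAL-recovery case.
class BlockTrackerSketch {
  private val timeToBlocks = mutable.Map[Long, Seq[String]]()
  private var lastAllocatedBatchTime: Long = -1L

  def allocateBlocksToBatch(batchTime: Long, blocks: Seq[String]): Unit = {
    if (lastAllocatedBatchTime < 0 || batchTime > lastAllocatedBatchTime) {
      timeToBlocks(batchTime) = blocks
      lastAllocatedBatchTime = batchTime
    } else {
      // batchTime <= lastAllocatedBatchTime: seen when replaying a WAL that
      // ended with a BatchAllocationEvent but no BatchCleanupEvent.
      println(s"Possibly processed batch $batchTime, ignoring re-allocation")
    }
  }

  def blocksOf(batchTime: Long): Seq[String] =
    timeToBlocks.getOrElse(batchTime, Nil)
}

val tracker = new BlockTrackerSketch
tracker.allocateBlocksToBatch(1L, Seq("b1", "b2"))
tracker.allocateBlocksToBatch(1L, Seq("other"))   // replayed event: no-op
assert(tracker.blocksOf(1L) == Seq("b1", "b2"))
```

Whether the no-op deserves a log line at all is the open question in this thread; the sketch only shows that nothing is overwritten.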
[GitHub] spark pull request: [SPARK-5154] [PySpark] [Streaming] Kafka strea...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3715#issuecomment-70992015 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25958/ Test PASSed.
[GitHub] spark pull request: [SPARK-5154] [PySpark] [Streaming] Kafka strea...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3715#issuecomment-70992003 [Test build #25958 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25958/consoleFull) for PR 3715 at commit [`97386b3`](https://github.com/apache/spark/commit/97386b3debd5f352b61dfed194ab9495fecbe834). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class KafkaUtils(object):`
[GitHub] spark pull request: [SPARK-4987] [SQL] parquet timestamp type supp...
Github user adrian-wang commented on the pull request: https://github.com/apache/spark/pull/3820#issuecomment-70992950 I have fixed the bug. This is quite embarrassing: I forgot to make those factors (NANOS_PER_SECOND, SECONDS_PER_MINUTE, MINUTES_PER_HOUR) Long when dividing, so it overflowed... I have tested it, and now it works fine.
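The overflow adrian-wang describes is easy to reproduce: in Scala (as in Java), arithmetic on `Int` constants wraps silently before any widening to `Long` happens, so the conversion factors must be `Long` up front. The constant below mirrors `NANOS_PER_SECOND`; the `* 60` is an illustrative stand-in for the factor chain he mentions:

```scala
val NANOS_PER_SECOND = 1000000000            // declared as Int

// Int * Int is evaluated as Int and wraps around; only the already-wrapped
// result is widened to Long by the declared type.
val overflowed: Long = NANOS_PER_SECOND * 60

// Widening one operand first forces Long arithmetic throughout.
val correct: Long = NANOS_PER_SECOND.toLong * 60

assert(overflowed == -129542144L)    // 60e9 wrapped modulo 2^32
assert(correct == 60000000000L)
```

The same trap applies on the division side: if the divisor is built by multiplying `Int` factors, the divisor itself is already wrong before the division runs.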
[GitHub] spark pull request: [SPARK-5364] [SQL] HiveQL transform doesn't su...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4158#issuecomment-70992965 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25959/ Test PASSed.
[GitHub] spark pull request: [SPARK-4987] [SQL] parquet timestamp type supp...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3820#issuecomment-70992996 [Test build #25963 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25963/consoleFull) for PR 3820 at commit [`5d1eeed`](https://github.com/apache/spark/commit/5d1eeedd43d60d9ef9c5dcc0e97fff829ccea8ed). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5364] [SQL] HiveQL transform doesn't su...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4158#issuecomment-70992962 [Test build #25959 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25959/consoleFull) for PR 4158 at commit [`c8fe7fc`](https://github.com/apache/spark/commit/c8fe7fc37471c38b24e52a5d170fa0741b50c791). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: SPARK-5357: Update commons-codec version to 1....
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4153#issuecomment-70993462 [Test build #25960 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25960/consoleFull) for PR 4153 at commit [`b4a91f4`](https://github.com/apache/spark/commit/b4a91f478d496b48c03d28bc44a9a848b5e93c85). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: SPARK-5357: Update commons-codec version to 1....
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4153#issuecomment-70993475 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25960/ Test PASSed.
[GitHub] spark pull request: [SPARK-5364] [SQL] HiveQL transform doesn't su...
Github user viirya commented on the pull request: https://github.com/apache/spark/pull/4158#issuecomment-70989376 Hi @chenghao-intel, I already did this, along with support for custom field delimiters and SerDe, in PR #4014.
[GitHub] spark pull request: [SPARK-5233][Streaming] Fix error replaying of...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/4032#discussion_r23361989 --- Diff: streaming/src/test/scala/org/apache/spark/streaming/ReceivedBlockTrackerSuite.scala --- @@ -82,11 +82,6 @@ class ReceivedBlockTrackerSuite receivedBlockTracker.allocateBlocksToBatch(2) receivedBlockTracker.getBlocksOfBatchAndStream(2, streamId) shouldBe empty -// Verify that batch 2 cannot be allocated again --- End diff -- Please add a test to verify that `allocateBlocksToBatch(x)` with x = 2 does nothing.
[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4073#issuecomment-70989944 [Test build #25955 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25955/consoleFull) for PR 4073 at commit [`d5d68e7`](https://github.com/apache/spark/commit/d5d68e79f7ccbefe4c45d53253df3e7066cb7f53). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4073#issuecomment-70990119 [Test build #25961 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25961/consoleFull) for PR 4073 at commit [`e0e0d9c`](https://github.com/apache/spark/commit/e0e0d9c65c8981e86cf806f39200bff76e4e786f). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5233][Streaming] Fix error replaying of...
Github user jerryshao commented on the pull request: https://github.com/apache/spark/pull/4032#issuecomment-70990301 OK, will do. Thanks a lot for your comments.
[GitHub] spark pull request: [SPARK-5154] [PySpark] [Streaming] Kafka strea...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3715#issuecomment-70990806 [Test build #25956 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25956/consoleFull) for PR 3715 at commit [`2c567a5`](https://github.com/apache/spark/commit/2c567a5d55c465d706026c2395e9025fad9dbd68). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class KafkaUtils(object):`
[GitHub] spark pull request: [SPARK-5154] [PySpark] [Streaming] Kafka strea...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3715#issuecomment-70990814 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25956/ Test PASSed.
[GitHub] spark pull request: [SPARK-4934][CORE] Print remote address in Con...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4157#issuecomment-70991309 [Test build #25957 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25957/consoleFull) for PR 4157 at commit [`e5ac73e`](https://github.com/apache/spark/commit/e5ac73e3fd18c3bf5c1a32ce531b15be9feac385). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4934][CORE] Print remote address in Con...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4157#issuecomment-70991317 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25957/ Test PASSed.
[GitHub] spark pull request: SPARK-4506 [DOCS] Addendum: Update more docs t...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4160#issuecomment-71000852 [Test build #25965 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25965/consoleFull) for PR 4160 at commit [`5f5f7df`](https://github.com/apache/spark/commit/5f5f7dfea53e25f4dfe7f880aa8479263b36cc74). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5259][CORE]Make sure mapStage.pendingta...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/4055#discussion_r23367855 --- Diff: core/src/main/scala/org/apache/spark/scheduler/Task.scala --- @@ -106,7 +106,22 @@ private[spark] abstract class Task[T](val stageId: Int, var partitionId: Int) ex if (interruptThread && taskThread != null) { taskThread.interrupt() } - } + } + + override def hashCode(): Int = { +val state = Seq(stageId, partitionId) +state.map(_.hashCode()).foldLeft(0)((a, b) => 31 * a + b) --- End diff -- This seems like an excessively complex way of writing `31 * stageId.hashCode + partitionId.hashCode`. I don't think FP is the way to do this.
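srowen's proposed simplification is behavior-preserving: folding `31 * a + b` over the two hash codes starting from 0 expands to `31 * (31 * 0 + stageId.hashCode) + partitionId.hashCode`, which is exactly the direct form. A quick check with arbitrary sample ids:

```scala
val stageId = 3
val partitionId = 7

// The PR's formulation: fold 31*a + b over the hash codes.
val viaFold = Seq(stageId, partitionId)
  .map(_.hashCode)
  .foldLeft(0)((a, b) => 31 * a + b)

// srowen's suggested direct form.
val direct = 31 * stageId.hashCode + partitionId.hashCode

assert(viaFold == direct)
assert(direct == 31 * 3 + 7)   // an Int's hashCode is the value itself
```

The fold only pays off when the number of fields is large or variable; for two fixed fields the direct expression is clearer and allocates no intermediate `Seq`.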
[GitHub] spark pull request: [SPARK-5259][CORE]Make sure mapStage.pendingta...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/4055#discussion_r23367928 --- Diff: core/src/main/scala/org/apache/spark/scheduler/Task.scala --- @@ -106,7 +106,22 @@ private[spark] abstract class Task[T](val stageId: Int, var partitionId: Int) ex if (interruptThread && taskThread != null) { taskThread.interrupt() } - } + } + + override def hashCode(): Int = { +val state = Seq(stageId, partitionId) +state.map(_.hashCode()).foldLeft(0)((a, b) => 31 * a + b) + } + + def canEqual(other: Any): Boolean = other.isInstanceOf[Task[T]] + + override def equals(other: Any): Boolean = other match { +case that: Task[_] => + (that canEqual this) --- End diff -- `that` is already a `Task` here; is the `canEqual` method adding anything? The check has no dependency on the generic type `T`.
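A quick illustration of srowen's point: after erasure, the `case that: Task[_]` pattern already performs the runtime class check, and the type parameter cannot be recovered at runtime, so `canEqual` adds nothing here. The `Task` below is a stripped-down stand-in, not Spark's class; tasks with different `T` but the same ids compare equal either way.

```scala
// Minimal sketch: equals keyed on (stageId, partitionId), no canEqual.
class Task[T](val stageId: Int, val partitionId: Int) {
  override def equals(other: Any): Boolean = other match {
    case that: Task[_] =>     // erasure: this matches any Task regardless of T
      stageId == that.stageId && partitionId == that.partitionId
    case _ => false
  }
  override def hashCode: Int = 31 * stageId + partitionId
}

val a = new Task[Int](1, 2)
val b = new Task[String](1, 2)   // different T, same ids

assert(a == b)                    // equal despite different type parameters
assert(a.hashCode == b.hashCode)  // consistent with equals
```

`canEqual` earns its keep only in subclassing hierarchies where a subclass adds state and must refuse equality with its superclass; for a flat key like this one, the pattern match suffices.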
[GitHub] spark pull request: [SPARK-5063] More helpful error messages for s...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/3884#discussion_r23397600 --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala --- @@ -955,6 +977,11 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli * The variable will be sent to each cluster only once. */ def broadcast[T: ClassTag](value: T): Broadcast[T] = { +assertNotStopped() +if (classOf[RDD[_]].isAssignableFrom(classTag[T].runtimeClass)) { --- End diff -- Actually, maybe this check should go somewhere else, since I think that it might technically have been safe to _create_ a broadcast variable with an RDD, even though doing anything with it would trigger errors.
[GitHub] spark pull request: [SPARK-5063] More helpful error messages for s...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/3884#discussion_r23397684 --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala --- @@ -76,10 +76,25 @@ import org.apache.spark.util.random.{BernoulliSampler, PoissonSampler, Bernoulli * on RDD internals. */ abstract class RDD[T: ClassTag]( -@transient private var sc: SparkContext, +@transient private var _sc: SparkContext, @transient private var deps: Seq[Dependency[_]] ) extends Serializable with Logging { + if (classOf[RDD[_]].isAssignableFrom(elementClassTag.runtimeClass)) { --- End diff -- Similarly, this should perhaps be a warning instead of an exception in order to avoid any possibility of breaking odd corner-case 1.2.1 apps. I'll change this to a warning and leave the `sc` getter as an exception.
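The check under discussion works purely on runtime classes via the element's `ClassTag`. A self-contained sketch of the mechanism, using `List` in place of `RDD` since Spark classes are not assumed on the classpath (the helper name is made up for illustration):

```scala
import scala.reflect.{ClassTag, classTag}

// True when the element type T is (a subclass of) the forbidden class.
// In Spark this guards against RDD[RDD[_]]; here List stands in for RDD.
def isNestedElementType[T: ClassTag]: Boolean =
  classOf[List[_]].isAssignableFrom(classTag[T].runtimeClass)

assert(isNestedElementType[List[Int]])    // nested "collection of collections"
assert(!isNestedElementType[String])      // plain element type: fine
```

Because the check costs only a class comparison at construction time, downgrading it from an exception to a warning (as JoshRosen decides here) preserves backward compatibility while still surfacing the misuse in the logs.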
[GitHub] spark pull request: [SPARK-5012][MLLib][PySpark]Python API for Gau...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4059#discussion_r23399719 --- Diff: examples/src/main/python/mllib/gaussian_mixture_model.py --- @@ -0,0 +1,65 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the License); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an AS IS BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + + +A Gaussian Mixture Model clustering program using MLlib. + +This example requires NumPy (http://www.numpy.org/). + + +import sys +import random +import argparse +import numpy as np +from pyspark import SparkConf, SparkContext +from pyspark.mllib.clustering import GaussianMixtureEM + + +def parseVector(line): +return np.array([float(x) for x in line.split(' ')]) + + +if __name__ == __main__: + +Parameters +-- +input_file : path of the file which contains data points +k : Number of mixture components +convergenceTol : convergence_threshold.Default to 1e-3 --- End diff -- space after . --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
[GitHub] spark pull request: [SPARK-5012][MLLib][PySpark]Python API for Gau...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4059#discussion_r23399756 --- Diff: examples/src/main/python/mllib/gaussian_mixture_model.py --- @@ -0,0 +1,65 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +""" +A Gaussian Mixture Model clustering program using MLlib. + +This example requires NumPy (http://www.numpy.org/). +""" + +import sys +import random +import argparse +import numpy as np +from pyspark import SparkConf, SparkContext +from pyspark.mllib.clustering import GaussianMixtureEM + + +def parseVector(line): +return np.array([float(x) for x in line.split(' ')]) + + +if __name__ == "__main__": + +""" +Parameters +-- +input_file : path of the file which contains data points +k : Number of mixture components +convergenceTol : convergence_threshold.Default to 1e-3 +seed : random seed +n_iter : Number of EM iterations to perform. Default to 100 +""" + +conf = SparkConf().setAppName("GMM") +sc = SparkContext(conf=conf) + +parser = argparse.ArgumentParser() +parser.add_argument('input_file', help='input file') +parser.add_argument('k', type=int, help='num_of_clusters') +parser.add_argument('--ct', default=1e-3, type=float, help='convergence_threshold') --- End diff -- `--ct` -> `--convergenceTol`?
In general, it is good to use the original parameter names in the API, so that after trying the example users become familiar with the API as well.
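mengxr's suggestion (mirror the MLlib parameter names in the example's flags) might look like the following hypothetical revision of the argparse setup; the exact flag set is an assumption, not the code the PR eventually merged:

```python
import argparse

parser = argparse.ArgumentParser(description="GMM clustering example")
parser.add_argument('input_file', help='input file')
parser.add_argument('k', type=int, help='number of mixture components')
# Use the API's own parameter names so trying the example teaches the API.
parser.add_argument('--convergenceTol', default=1e-3, type=float,
                    help='convergence tolerance (default 1e-3)')
parser.add_argument('--maxIterations', default=100, type=int,
                    help='number of EM iterations (default 100)')
parser.add_argument('--seed', default=None, type=int, help='random seed')

# Parse an explicit argv list instead of sys.argv, just for demonstration.
args = parser.parse_args(['points.txt', '3', '--convergenceTol', '1e-4'])
```

A user who runs this example with `--convergenceTol` and `--maxIterations` can then call `GaussianMixtureEM.train` with the same vocabulary.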
[GitHub] spark pull request: [SPARK-5012][MLLib][PySpark]Python API for Gau...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4059#discussion_r23399765 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala --- @@ -280,6 +280,48 @@ class PythonMLLibAPI extends Serializable { } /** + * Java stub for Python mllib GaussianMixtureEM.train() --- End diff -- Should we document the return value? It is not easy to tell from the return type `JList[Object]`.
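The opaque `JList[Object]` mengxr flags is, per the stub shown later in this thread, a three-element list of `[weights, means, covariances]`. A hedged Python-side sketch of unpacking such a list (the class name and layout are assumptions for illustration, not the actual PySpark wrapper):

```python
class GaussianMixtureModelSketch:
    """Illustrative wrapper for the 3-element list returned by the Java stub."""
    def __init__(self, java_result):
        # Assumed layout: [component weights, component means, component sigmas]
        self.weight, self.mu, self.sigma = java_result

# A fake stub result with two components in one dimension.
stub_result = [[0.4, 0.6], [[-1.0], [2.0]], [[[0.5]], [[0.8]]]]
model = GaussianMixtureModelSketch(stub_result)
```

Documenting that layout on the Scala stub would let Python-side callers unpack the list without reading the implementation.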
[GitHub] spark pull request: [SPARK-5012][MLLib][PySpark]Python API for Gau...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4059#discussion_r23399804 --- Diff: python/pyspark/mllib/clustering.py --- @@ -86,6 +86,68 @@ def train(cls, rdd, k, maxIterations=100, runs=1, initializationMode="k-means|| return KMeansModel([c.toArray() for c in centers]) +class GaussianMixtureModel(object): + +""" +A clustering model derived from the Gaussian Mixture Model method. + + >>> from numpy import array + >>> clusterdata_1 = sc.parallelize(array([-0.1,-0.05,-0.01,-0.1, +... 0.9,0.8,0.75,0.935, +... -0.83,-0.68,-0.91,-0.76 ]).reshape(6,2)) + >>> model = GaussianMixtureEM.train(clusterdata_1, 3, 0.0001, 3205, 10) + >>> labels = model.predictLabels(clusterdata_1).collect() + >>> labels[0]==labels[2] +True + >>> labels[3]==labels[4] +False + >>> labels[4]==labels[5] +True + >>> clusterdata_2 = sc.parallelize(array([-5.1971, -2.5359, -3.8220, +... -5.2211, -5.0602, 4.7118, +... 6.8989, 3.4592, 4.6322, +... 5.7048, 4.6567, 5.5026, +... 4.5605, 5.2043, 6.2734]).reshape(5,3)) + >>> model = GaussianMixtureEM.train(clusterdata_2, 2, 0.0001, 150, 10) + >>> labels = model.predictLabels(clusterdata_2).collect() + >>> labels[0]==labels[1]==labels[2] +True + >>> labels[3]==labels[4] +True +""" + +def __init__(self, weight, mu, sigma): --- End diff -- FYI, the names were changed to `weights` and `gaussians`. We might want to discuss how the Python API should change.
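One way the Python API could absorb the Scala rename (`weight` to `weights`, with `mu`/`sigma` folded into `gaussians`) is to expose the new names and keep the old ones as read-only aliases. This is a hypothetical sketch of that option, not the resolution the reviewers settled on:

```python
from collections import namedtuple

# Hypothetical stand-in for MultivariateGaussian(mu, sigma).
Gaussian = namedtuple('Gaussian', ['mu', 'sigma'])

class GMMCompat:
    """Model exposing the renamed attributes, with legacy aliases."""
    def __init__(self, weights, gaussians):
        self.weights = weights        # new name
        self.gaussians = gaussians    # new name

    @property
    def weight(self):                 # legacy alias for `weights`
        return self.weights

    @property
    def mu(self):                     # legacy alias, per-component means
        return [g.mu for g in self.gaussians]

    @property
    def sigma(self):                  # legacy alias, per-component covariances
        return [g.sigma for g in self.gaussians]

m = GMMCompat([0.5, 0.5],
              [Gaussian([0.0], [[1.0]]), Gaussian([3.0], [[2.0]])])
```

Aliasing keeps the doctest above working while the Python API converges on the Scala names.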
[GitHub] spark pull request: [SPARK-5012][MLLib][PySpark]Python API for Gau...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4059#discussion_r23399766 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala --- @@ -280,6 +280,48 @@ class PythonMLLibAPI extends Serializable { } /** + * Java stub for Python mllib GaussianMixtureEM.train() + */ + def trainGaussianMixtureEM( + data: JavaRDD[Vector], + k: Int, + convergenceTol: Double, + seed: Long, + maxIterations: Int): JList[Object] = { +val gmmAlg = new GaussianMixtureEM() + .setK(k) + .setConvergenceTol(convergenceTol) + .setSeed(seed) + .setMaxIterations(maxIterations) +try { + val model = gmmAlg.run(data.rdd.persist(StorageLevel.MEMORY_AND_DISK)) + List(model.weight, model.mu, model.sigma). + map(_.asInstanceOf[Object]).asJava +} finally { + data.rdd.unpersist(blocking = false) +} + } + + /** + * Java stub for Python mllib GaussianMixtureModel.predictSoft() + */ + def findPredict( --- End diff -- This is inside `PythonMLlibAPI`. It is necessary to mention GMM in the method name. Btw, I don't quite understand what `find` means here.
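The `persist`/`unpersist` pattern in the quoted stub (cache the input for the iterative EM passes, release it in `finally` even if training fails) can be sketched with a stand-in RDD; the names here are illustrative, not Spark's:

```python
class CachedRDD:
    """Stand-in RDD tracking only its persisted state."""
    def __init__(self):
        self.persisted = False

    def persist(self):
        self.persisted = True
        return self

    def unpersist(self):
        self.persisted = False

def train_with_caching(rdd):
    """Cache the input for repeated scans, then always release it."""
    try:
        rdd.persist()
        # ... iterative training would scan `rdd` once per EM iteration ...
        return "model"
    finally:
        rdd.unpersist()  # runs even if training raises

rdd = CachedRDD()
result = train_with_caching(rdd)
```

Putting `unpersist` in `finally` guarantees the cached blocks are freed whether training succeeds or throws, which is the point of the Scala `try/finally` in the stub.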
[GitHub] spark pull request: [SPARK-5012][MLLib][PySpark]Python API for Gau...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4059#discussion_r23399715 --- Diff: examples/src/main/python/mllib/gaussian_mixture_model.py --- @@ -0,0 +1,65 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +""" +A Gaussian Mixture Model clustering program using MLlib. + +This example requires NumPy (http://www.numpy.org/). +""" + +import sys
import random
import argparse
import numpy as np
from pyspark import SparkConf, SparkContext --- End diff -- separate spark imports from python imports
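The grouping mengxr asks for follows the usual Python convention: standard-library imports first, third-party next, Spark imports last. A small sketch that classifies import names into those groups (the heuristics here are assumptions for illustration, not a real linter):

```python
import importlib.util

def group(name):
    """Classify an import for ordering: 0=stdlib, 1=third-party, 2=spark."""
    if name.split('.')[0] == 'pyspark':
        return 2  # Spark imports go in their own final group
    spec = importlib.util.find_spec(name)
    if spec is None:
        return 1  # assume a third-party package that isn't installed here
    origin = spec.origin or 'built-in'
    if 'site-packages' in origin or 'dist-packages' in origin:
        return 1  # installed third-party package
    return 0      # standard library or built-in

imports = ['numpy', 'sys', 'pyspark.mllib.clustering', 'random', 'argparse']
ordered = sorted(imports, key=group)  # stable sort keeps in-group order
```

Applied to the example's imports, this yields the three blank-line-separated groups the review requests: `sys`/`random`/`argparse`, then `numpy`, then the `pyspark` imports.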