[GitHub] spark pull request: [SPARK-5337][Mesos][Standalone] respect spark....

2015-01-22 Thread CodingCat
Github user CodingCat commented on a diff in the pull request:

https://github.com/apache/spark/pull/4129#discussion_r23368844
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/cluster/SparkDeploySchedulerBackend.scala
 ---
@@ -40,6 +40,14 @@ private[spark] class SparkDeploySchedulerBackend(
   var registrationDone = false
 
   val maxCores = conf.getOption("spark.cores.max").map(_.toInt)
+  val coreNumPerTask = {
+    val corePerTask = conf.getInt("spark.task.cpus", 1)
+    if (corePerTask < 1) {
+      throw new IllegalArgumentException(
+        s"spark.task.cpus is set to an invalid value $corePerTask")
+    }
+    corePerTask
+  }
--- End diff --

Oh... I tried to embed the validation logic into the assignment, so it looks like this.

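(For illustration only, here is the same validate-in-the-initializer idea as a standalone sketch; the config key, default, and variable names below are made up:)

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
// Validate while assigning, so a bad setting fails fast when the value is first constructed.
val maxRetries: Int = {
  val n = conf.getInt("example.max.retries", 3) // hypothetical key and default
  if (n < 0) {
    throw new IllegalArgumentException(s"example.max.retries is set to an invalid value $n")
  }
  n
}
```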




[GitHub] spark pull request: [SPARK-4934][CORE] Print remote address in Con...

2015-01-22 Thread shenh062326
Github user shenh062326 commented on a diff in the pull request:

https://github.com/apache/spark/pull/4157#discussion_r23370062
  
--- Diff: 
core/src/main/scala/org/apache/spark/network/nio/ConnectionManager.scala ---
@@ -375,16 +375,22 @@ private[nio] class ConnectionManager(
 }
   }
 } else {
-          logInfo("Key not valid ? " + key)
+          logInfo("Key not valid ? " + key + " remote address: " +
+            key.channel().asInstanceOf[SocketChannel].socket
--- End diff --

Thanks, I will change it.





[GitHub] spark pull request: [SPARK-5315][Streaming] Fix reduceByWindow Jav...

2015-01-22 Thread jerryshao
Github user jerryshao commented on the pull request:

https://github.com/apache/spark/pull/4104#issuecomment-71021795
  
Thanks TD, done with code rebase.





[GitHub] spark pull request: [SPARK-5315][Streaming] Fix reduceByWindow Jav...

2015-01-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4104#issuecomment-71022025
  
  [Test build #25967 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25967/consoleFull)
 for   PR 4104 at commit 
[`5bc8987`](https://github.com/apache/spark/commit/5bc8987fa1a27a6d43ce7f8d1d01da2cc81af04b).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-5233][Streaming] Fix error replaying of...

2015-01-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4032#issuecomment-71004965
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25964/
Test PASSed.





[GitHub] spark pull request: [SPARK-5233][Streaming] Fix error replaying of...

2015-01-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4032#issuecomment-71004957
  
  [Test build #25964 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25964/consoleFull)
 for   PR 4032 at commit 
[`a237c75`](https://github.com/apache/spark/commit/a237c75f87518b26245fd688de9bbae4bc151ae2).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4813][Streaming] Fix the issue that Con...

2015-01-22 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/3661#discussion_r23372367
  
--- Diff: 
streaming/src/main/scala/org/apache/spark/streaming/ContextWaiter.scala ---
@@ -17,30 +17,63 @@
 
 package org.apache.spark.streaming
 
+import java.util.concurrent.TimeUnit
+import java.util.concurrent.locks.ReentrantLock
+
 private[streaming] class ContextWaiter {
+
+  private val lock = new ReentrantLock()
+  private val condition = lock.newCondition()
+
+  // Guarded by lock
   private var error: Throwable = null
-  private var stopped: Boolean = false
 
-  def notifyError(e: Throwable) = synchronized {
-    error = e
-    notifyAll()
-  }
+  // Guarded by lock
+  private var stopped: Boolean = false
 
-  def notifyStop() = synchronized {
-    stopped = true
-    notifyAll()
+  def notifyError(e: Throwable): Unit = {
+    lock.lock()
+    try {
+      error = e
+      condition.signalAll()
+    } finally {
+      lock.unlock()
+    }
   }
 
-  def waitForStopOrError(timeout: Long = -1) = synchronized {
-    // If already had error, then throw it
-    if (error != null) {
-      throw error
+  def notifyStop(): Unit = {
+    lock.lock()
+    try {
+      stopped = true
+      condition.signalAll()
+    } finally {
+      lock.unlock()
     }
+  }
 
-    // If not already stopped, then wait
-    if (!stopped) {
-      if (timeout < 0) wait() else wait(timeout)
+  /**
+   * Return `true` if it's stopped; or throw the reported error if `notifyError` has been called; or
+   * `false` if the waiting time detectably elapsed before return from the method.
+   */
+  def waitForStopOrError(timeout: Long = -1): Boolean = {
--- End diff --

> In hindsight, instead of modeling awaitTermination against Akka ActorSystem's awaitTermination (which returns Unit), I should have modeled it like Java ExecutorService's awaitTermination, which returns a Boolean. Now it's not possible to change the API without breaking compatibility. :(

@tdas, sorry that I forgot to reply to you. You said you designed it just like Akka `ActorSystem.awaitTermination`, but [ActorSystem.awaitTermination](https://github.com/akka/akka/blob/master/akka-actor/src/main/scala/akka/actor/ActorSystem.scala#L394) will throw a TimeoutException in case of timeout.

```Scala
  /**
   * Block current thread until the system has been shutdown, or the specified
   * timeout has elapsed. This will block until after all on termination
   * callbacks have been run.
   *
   * @throws TimeoutException in case of timeout
   */
  def awaitTermination(timeout: Duration): Unit
```

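For comparison, here is a minimal sketch (not the PR's exact code) of how a lock/condition-based `waitForStopOrError` can follow the `ExecutorService`-style contract instead, returning `false` on timeout rather than throwing:

```scala
import java.util.concurrent.TimeUnit
import java.util.concurrent.locks.ReentrantLock

class WaiterSketch {
  private val lock = new ReentrantLock()
  private val condition = lock.newCondition()
  private var error: Throwable = null
  private var stopped = false

  /** Returns true if stopped, false if the timeout elapsed; rethrows a reported error. */
  def waitForStopOrError(timeout: Long = -1): Boolean = {
    lock.lock()
    try {
      if (timeout < 0) {
        while (!stopped && error == null) condition.await()
      } else {
        var nanos = TimeUnit.MILLISECONDS.toNanos(timeout)
        while (!stopped && error == null && nanos > 0) {
          nanos = condition.awaitNanos(nanos)
        }
      }
      if (error != null) throw error
      stopped // false here means the wait detectably timed out
    } finally {
      lock.unlock()
    }
  }
}
```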





[GitHub] spark pull request: SPARK-5308 [BUILD] MD5 / SHA1 hash format does...

2015-01-22 Thread srowen
GitHub user srowen opened a pull request:

https://github.com/apache/spark/pull/4161

SPARK-5308 [BUILD] MD5 / SHA1 hash format doesn't match standard Maven 
output

Here's one way to make the hashes match what Maven's plugins would create. 
It takes a little extra footwork since OS X doesn't have the same command line 
tools. An alternative is just to make Maven output these of course - would that 
be better? I ask in case there is a reason I'm missing, like, we need to hash 
files that Maven doesn't build.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/srowen/spark SPARK-5308

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/4161.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4161


commit e25eff8cd20e436002c77426e9b5e04cc1a19f2e
Author: Sean Owen so...@cloudera.com
Date:   2015-01-22T14:22:22Z

Generate MD5, SHA1 hashes in a format like Maven's plugin







[GitHub] spark pull request: [SPARK-4987] [SQL] parquet timestamp type supp...

2015-01-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3820#issuecomment-71004690
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25963/
Test PASSed.





[GitHub] spark pull request: [SPARK-4987] [SQL] parquet timestamp type supp...

2015-01-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3820#issuecomment-71004681
  
  [Test build #25963 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25963/consoleFull)
 for   PR 3820 at commit 
[`5d1eeed`](https://github.com/apache/spark/commit/5d1eeedd43d60d9ef9c5dcc0e97fff829ccea8ed).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: SPARK-4506 [DOCS] Addendum: Update more docs t...

2015-01-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4160#issuecomment-71008499
  
  [Test build #25965 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25965/consoleFull)
 for   PR 4160 at commit 
[`5f5f7df`](https://github.com/apache/spark/commit/5f5f7dfea53e25f4dfe7f880aa8479263b36cc74).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: SPARK-4506 [DOCS] Addendum: Update more docs t...

2015-01-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4160#issuecomment-71008502
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25965/
Test PASSed.





[GitHub] spark pull request: SPARK-5308 [BUILD] MD5 / SHA1 hash format does...

2015-01-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4161#issuecomment-71027549
  
  [Test build #25968 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25968/consoleFull)
 for   PR 4161 at commit 
[`e25eff8`](https://github.com/apache/spark/commit/e25eff8cd20e436002c77426e9b5e04cc1a19f2e).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-5233][Streaming] Fix error replaying of...

2015-01-22 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/4032#discussion_r23373392
  
--- Diff: 
streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceivedBlockTracker.scala
 ---
@@ -107,8 +107,14 @@ private[streaming] class ReceivedBlockTracker(
       lastAllocatedBatchTime = batchTime
       allocatedBlocks
     } else {
-      throw new SparkException(s"Unexpected allocation of blocks, " +
-        s"last batch = $lastAllocatedBatchTime, batch time to allocate = $batchTime")
+      // This situation occurs when:
+      // 1. WAL is ended with BatchAllocationEvent, but without BatchCleanupEvent,
+      // possibly processed batch job or half-processed batch job need to be processed again,
+      // so the batchTime will be equal to lastAllocatedBatchTime.
+      // 2. Slow checkpointing makes recovered batch time older than WAL recovered
+      // lastAllocatedBatchTime.
+      // This situation will only occurs in recovery time.
+      logWarning(s"Possibly processed batch $batchTime need to be processed again in WAL recovery")
--- End diff --

At least I think we should inform the user that this batch was processed or half-processed before, so they can decide whether to clean up or overwrite the previously processed result. I'm downgrading it to logInfo; what's your opinion?





[GitHub] spark pull request: [SPARK-5233][Streaming] Fix error replaying of...

2015-01-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4032#issuecomment-71020806
  
  [Test build #25966 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25966/consoleFull)
 for   PR 4032 at commit 
[`f0b0c0b`](https://github.com/apache/spark/commit/f0b0c0bcc8db551c57c1590c8a12b01f5f0ae2d0).
 * This patch merges cleanly.





[GitHub] spark pull request: [WIP][SPARK-4251][SPARK-2352][MLLIB]Add RBM, A...

2015-01-22 Thread musicx
Github user musicx commented on the pull request:

https://github.com/apache/spark/pull/3222#issuecomment-70984348
  
Hi @witgo, where can I find your email? (I'd like to discuss this in Chinese.)





[GitHub] spark pull request: [SPARK-5154] [PySpark] [Streaming] Kafka strea...

2015-01-22 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/3715#issuecomment-70984720
  
@tdas I think this PR is almost ready, please follow the example to double 
check it, thanks!





[GitHub] spark pull request: [SPARK-5154] [PySpark] [Streaming] Kafka strea...

2015-01-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3715#issuecomment-70985022
  
  [Test build #25958 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25958/consoleFull)
 for   PR 3715 at commit 
[`97386b3`](https://github.com/apache/spark/commit/97386b3debd5f352b61dfed194ab9495fecbe834).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-5233][Streaming] Fix error replaying of...

2015-01-22 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/4032#discussion_r23360371
  
--- Diff: 
streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceivedBlockTracker.scala
 ---
@@ -106,6 +106,12 @@ private[streaming] class ReceivedBlockTracker(
   timeToAllocatedBlocks(batchTime) = allocatedBlocks
   lastAllocatedBatchTime = batchTime
   allocatedBlocks
+} else if (batchTime == lastAllocatedBatchTime) {
--- End diff --

Could you improve the unit test to actually verify the behavior? The unit test line you removed can be modified to verify that calling it with `batchTime` == `lastAllocatedBatchTime` is a no-op and does no allocation.
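A self-contained sketch of the kind of assertion being asked for, using a toy stand-in class rather than Spark's `ReceivedBlockTracker` (all names below are made up for illustration):

```scala
import scala.collection.mutable

// Toy tracker that ignores allocations for batch times that are not newer than the last one.
class ToyTracker {
  private val timeToBlocks = mutable.Map[Long, Seq[String]]()
  private var lastAllocatedBatchTime = -1L

  def allocateBlocksToBatch(batchTime: Long, pending: Seq[String]): Unit = {
    if (lastAllocatedBatchTime < 0 || batchTime > lastAllocatedBatchTime) {
      timeToBlocks(batchTime) = pending
      lastAllocatedBatchTime = batchTime
    }
    // otherwise: no-op, which is exactly what the test should verify
  }

  def getBlocksOfBatch(batchTime: Long): Seq[String] = timeToBlocks.getOrElse(batchTime, Seq.empty)
}

val tracker = new ToyTracker
tracker.allocateBlocksToBatch(1000L, Seq("block-1", "block-2"))
val first = tracker.getBlocksOfBatch(1000L)
tracker.allocateBlocksToBatch(1000L, Seq("block-3")) // same batch time again: must be a no-op
assert(tracker.getBlocksOfBatch(1000L) == first)     // no new allocation happened
```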





[GitHub] spark pull request: SPARK-5357: Update commons-codec version to 1....

2015-01-22 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/4153#issuecomment-70985727
  
Jenkins, this is ok to test.





[GitHub] spark pull request: [SPARK-5364] [SQL] HiveQL transform doesn't su...

2015-01-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4158#issuecomment-70986035
  
  [Test build #25959 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25959/consoleFull)
 for   PR 4158 at commit 
[`c8fe7fc`](https://github.com/apache/spark/commit/c8fe7fc37471c38b24e52a5d170fa0741b50c791).
 * This patch merges cleanly.





[GitHub] spark pull request: SPARK-5357: Update commons-codec version to 1....

2015-01-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4153#issuecomment-70986011
  
  [Test build #25960 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25960/consoleFull)
 for   PR 4153 at commit 
[`b4a91f4`](https://github.com/apache/spark/commit/b4a91f478d496b48c03d28bc44a9a848b5e93c85).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-5353] Log failures in REPL class loadin...

2015-01-22 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/4130#discussion_r23361106
  
--- Diff: 
repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala ---
@@ -91,7 +91,14 @@ class ExecutorClassLoader(conf: SparkConf, classUri: String, parent: ClassLoader
       inputStream.close()
       Some(defineClass(name, bytes, 0, bytes.length))
     } catch {
-      case e: Exception => None
+      case e: FileNotFoundException =>
+        // We did not find the class
+        logDebug(s"Did not load class $name from REPL class server at $uri", e)
--- End diff --

Yes exactly. The URI is typically not remote, not from some server, just a 
local JAR.





[GitHub] spark pull request: [SPARK-4934][CORE] Print remote address in Con...

2015-01-22 Thread shenh062326
GitHub user shenh062326 opened a pull request:

https://github.com/apache/spark/pull/4157

[SPARK-4934][CORE] Print remote address in ConnectionManager

The connection key is hard to read: "key already cancelled ? sun.nio.ch.SelectionKeyImpl@52b0e278". It's hard to diagnose the problem from this log alone, so it's better to also print the remote address.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/shenh062326/spark my_change2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/4157.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4157


commit e5ac73e3fd18c3bf5c1a32ce531b15be9feac385
Author: Hong Shen hongs...@tencent.com
Date:   2015-01-22T07:53:47Z

Print remote address in ConnectionManager







[GitHub] spark pull request: [SQL] SPARK-5309: Use Dictionary for Binary-S...

2015-01-22 Thread MickDavies
Github user MickDavies commented on the pull request:

https://github.com/apache/spark/pull/4139#issuecomment-70984197
  
I've looked through ParquetQuerySuite and ParquetQuerySuite2, and it's not obvious that there are tests that exercise this change, i.e. where Parquet uses dictionary encoding for Strings. Most test String columns hold incrementing values that result in unique strings, and I don't think those will be dictionary-encoded.

I think it would be good to add an explicit test to ParquetQuerySuite2, which I'll try to do this evening.
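A rough sketch of what the data for such a test could look like; the case class is made up, and the 1.2-era SchemaRDD/Parquet calls are only hinted at in comments since the exact suite setup may differ:

```scala
// Only a handful of distinct values, so Parquet's writer should pick dictionary
// encoding for `category`, unlike the usual incrementing, all-unique test strings.
case class Record(id: Int, category: String)
val data = (1 to 1000).map(i => Record(i, s"cat-${i % 5}"))

// e.g. (assumed API): write the data out with saveAsParquetFile, read it back,
// and select/filter on `category` so the dictionary-decoding path is exercised.
```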






[GitHub] spark pull request: [SPARK-5154] [PySpark] [Streaming] Kafka strea...

2015-01-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3715#issuecomment-70984126
  
  [Test build #25956 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25956/consoleFull)
 for   PR 3715 at commit 
[`2c567a5`](https://github.com/apache/spark/commit/2c567a5d55c465d706026c2395e9025fad9dbd68).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-5147][Streaming] Delete the received da...

2015-01-22 Thread jerryshao
Github user jerryshao commented on the pull request:

https://github.com/apache/spark/pull/4037#issuecomment-70984564
  
OK.





[GitHub] spark pull request: [SPARK-5147][Streaming] Delete the received da...

2015-01-22 Thread jerryshao
Github user jerryshao closed the pull request at:

https://github.com/apache/spark/pull/4037





[GitHub] spark pull request: [WIP][SPARK-4251][SPARK-2352][MLLIB]Add RBM, A...

2015-01-22 Thread witgo
Github user witgo commented on the pull request:

https://github.com/apache/spark/pull/3222#issuecomment-70984655
  
witgo#qq.com





[GitHub] spark pull request: [SPARK-4934][CORE] Print remote address in Con...

2015-01-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4157#issuecomment-70984613
  
  [Test build #25957 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25957/consoleFull)
 for   PR 4157 at commit 
[`e5ac73e`](https://github.com/apache/spark/commit/e5ac73e3fd18c3bf5c1a32ce531b15be9feac385).
 * This patch merges cleanly.





[GitHub] spark pull request: SPARK-5357: Update commons-codec version to 1....

2015-01-22 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/4153#issuecomment-70987116
  
I think it is pretty safe to update. However, the right thing in general is really to depend on Commons Codec in your app and set your classpath to take precedence.





[GitHub] spark pull request: [SPARK-5233][Streaming] Fix error replaying of...

2015-01-22 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/4032#discussion_r23361754
  
--- Diff: 
streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceivedBlockTracker.scala
 ---
@@ -106,6 +106,12 @@ private[streaming] class ReceivedBlockTracker(
       timeToAllocatedBlocks(batchTime) = allocatedBlocks
       lastAllocatedBatchTime = batchTime
       allocatedBlocks
+    } else if (batchTime == lastAllocatedBatchTime) {
+      // This situation occurs when WAL is ended with BatchAllocationEvent,
+      // but without BatchCleanupEvent, possibly processed batch job or half-processed batch
+      // job need to process again, so the batchTime will be equal to lastAllocatedBatchTime.
+      // This situation will only occurs in recovery time.
+      logWarning(s"Possibly processed batch $batchTime need to be processed again in WAL recovery")
     } else {
--- End diff --

Actually, let's remove this exception completely. Instead, any attempt to allocate blocks to a batch such that `batchTime <= lastAllocatedBatchTime` is completely ignored.





[GitHub] spark pull request: [SPARK-5135][SQL] Add support for describe [ex...

2015-01-22 Thread OopsOutOfMemory
Github user OopsOutOfMemory commented on a diff in the pull request:

https://github.com/apache/spark/pull/4127#discussion_r23359955
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/commands.scala ---
@@ -178,3 +180,34 @@ case class DescribeCommand(
     child.output.map(field => Row(field.name, field.dataType.toString, null))
   }
 }
+
+/**
+ * :: DeveloperApi ::
+ */
+@DeveloperApi
+case class DDLDescribeCommand(
+    dbName: Option[String],
+    tableName: String, isExtended: Boolean) extends RunnableCommand {
+
+  override def run(sqlContext: SQLContext) = {
+    val tblRelation = dbName match {
+      case Some(db) => UnresolvedRelation(Seq(db, tableName))
+      case None => UnresolvedRelation(Seq(tableName))
+    }
+    val logicalRelation = sqlContext.executePlan(tblRelation).analyzed
+    val rows = new ArrayBuffer[Row]()
+    rows ++= logicalRelation.schema.fields.map{field =>
+      Row(field.name, field.dataType.toSimpleString, null)}
+
+    /*
+     * TODO if future support partition table, add header below:
+     * # Partition Information
+     * # col_name data_type comment
--- End diff --

After SPARK-5182 is finished, we can do it like this:
```
val logicalRelation = sqlContext.executePlan(tblRelation).analyzed
val rows = new ArrayBuffer[Row]()
rows ++= logicalRelation.schema.fields.map{field =>
  Row(field.name, field.dataType.toSimpleString, null)}
val partitionFields = logicalRelation.schema.getPartitionedCols()
if (partitionFields.nonEmpty) {
  val partColumnRows =
    partitionFields.map(field => Row(field.getName, field.getType.toSimpleString, null))
  rows ++=
    Seq(Row("# Partition Information", "", "")) ++
    Seq(Row("col_name", "data_type", "comment")) ++
    partColumnRows
}
```
For `extended`, we can do it later and discuss what detail information to show.
```
if (isExtended) {
  rows ++= Seq(Row("Detailed Table Information", "<get some detail info from the table>", ""))
}
```





[GitHub] spark pull request: [SPARK-5196][SQL] Support `comment` in Create ...

2015-01-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3999#issuecomment-70983891
  
  [Test build #25950 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25950/consoleFull)
 for   PR 3999 at commit 
[`d1cfb0f`](https://github.com/apache/spark/commit/d1cfb0fa7620c3cc6800087f32a2e4542884823f).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-5196][SQL] Support `comment` in Create ...

2015-01-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3999#issuecomment-70983903
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25950/
Test PASSed.





[GitHub] spark pull request: [SPARK-5297][Streaming][backport] Backport SPA...

2015-01-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4154#issuecomment-70984233
  
  [Test build #25954 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25954/consoleFull)
 for   PR 4154 at commit 
[`a4b2bea`](https://github.com/apache/spark/commit/a4b2bea998e5722b11383d553914253249788c2c).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-5297][Streaming][backport] Backport SPA...

2015-01-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4154#issuecomment-70984236
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25954/
Test FAILed.





[GitHub] spark pull request: [SPARK-5213] [SQL] Pluggable SQL Parser Suppor...

2015-01-22 Thread chenghao-intel
Github user chenghao-intel commented on the pull request:

https://github.com/apache/spark/pull/4015#issuecomment-70985149
  
cc @marmbrus @liancheng 





[GitHub] spark pull request: [SPARK-5364] [SQL] HiveQL transform doesn't su...

2015-01-22 Thread chenghao-intel
GitHub user chenghao-intel opened a pull request:

https://github.com/apache/spark/pull/4158

[SPARK-5364] [SQL] HiveQL transform doesn't support the non output clause

This is a quick fix for queries (in HiveContext) like:
```
SELECT transform(key + 1, value) USING '/bin/cat' FROM src
```
Ideally, we need to refactor `ScriptTransformation`, which should support a custom SerDe for the reader and writer. Will do that in a follow-up.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/chenghao-intel/spark transform

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/4158.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4158


commit c8fe7fc37471c38b24e52a5d170fa0741b50c791
Author: Cheng Hao hao.ch...@intel.com
Date:   2015-01-22T08:09:00Z

fix bug of transform in HiveQL







[GitHub] spark pull request: [SPARK-5233][Streaming] Fix error replaying of...

2015-01-22 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/4032#discussion_r23360875
  
--- Diff: 
streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceivedBlockTracker.scala
 ---
@@ -106,6 +106,12 @@ private[streaming] class ReceivedBlockTracker(
   timeToAllocatedBlocks(batchTime) = allocatedBlocks
   lastAllocatedBatchTime = batchTime
   allocatedBlocks
+} else if (batchTime == lastAllocatedBatchTime) {
--- End diff --

OK, I will improve this.





[GitHub] spark pull request: [SPARK-5307] SerializationDebugger - take 2

2015-01-22 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/4098#issuecomment-70988349
  
This is really cool.





[GitHub] spark pull request: [SPARK-5154] [PySpark] [Streaming] Kafka strea...

2015-01-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3715#issuecomment-70988894
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25951/
Test FAILed.





[GitHub] spark pull request: [SPARK-5154] [PySpark] [Streaming] Kafka strea...

2015-01-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3715#issuecomment-7099
  
  [Test build #25951 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25951/consoleFull)
 for   PR 3715 at commit 
[`adeeb38`](https://github.com/apache/spark/commit/adeeb3863353f9a0ca3070a9cc914a2914d95fa9).
 * This patch **fails Spark unit tests**.
 * This patch **does not merge cleanly**.
 * This patch adds the following public classes _(experimental)_:
  * `class KafkaUtils(object):`






[GitHub] spark pull request: [SPARK-5233][Streaming] Fix error replaying of...

2015-01-22 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/4032#discussion_r23361948
  
--- Diff: 
streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceivedBlockTracker.scala
 ---
@@ -106,6 +106,12 @@ private[streaming] class ReceivedBlockTracker(
   timeToAllocatedBlocks(batchTime) = allocatedBlocks
   lastAllocatedBatchTime = batchTime
   allocatedBlocks
+} else if (batchTime == lastAllocatedBatchTime) {
--- End diff --

Update to this comment. See other comment about removing this check 
altogether. Please update unit test to verify this new behavior.





[GitHub] spark pull request: [SPARK-4934][CORE] Print remote address in Con...

2015-01-22 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/4157#discussion_r23365112
  
--- Diff: 
core/src/main/scala/org/apache/spark/network/nio/ConnectionManager.scala ---
@@ -375,16 +375,22 @@ private[nio] class ConnectionManager(
 }
   }
 } else {
-          logInfo("Key not valid ? " + key)
+          logInfo("Key not valid ? " + key + " remote address: " +
+            key.channel().asInstanceOf[SocketChannel].socket
--- End diff --

My first reaction was that this risks an exception just for a log message. Since this work is repeated several times, how about creating a method that returns the remote address in the case of a `SocketChannel` and the default `toString()` otherwise? Although, given the current code, it will always be a `SocketChannel`.

Can you use string interpolation here?

Finally, the second cast isn't needed, is it? It does not change the `toString()` that is called.
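A minimal sketch of the suggested helper (the name and placement are hypothetical, not from the PR): match on the channel type, fall back to the key's default `toString`, and use string interpolation at the call site:

```scala
import java.nio.channels.{SelectionKey, SocketChannel}

def keyDescription(key: SelectionKey): String = key.channel() match {
  case sc: SocketChannel => s"$key (remote address: ${sc.socket().getRemoteSocketAddress})"
  case _                 => key.toString
}

// At the log site: logInfo(s"Key not valid ? ${keyDescription(key)}")
```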





[GitHub] spark pull request: [SPARK-5315][Streaming] Fix reduceByWindow Jav...

2015-01-22 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/4104#discussion_r23365189
  
--- Diff: project/MimaExcludes.scala ---
@@ -82,6 +82,10 @@ object MimaExcludes {
             // SPARK-5166 Spark SQL API stabilization
             ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.ml.Transformer.transform"),
             ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.ml.Estimator.fit")
+          ) ++ Seq(
+            // SPARK-5315 Spark Streaming Java API returns Scala DStream
+            ProblemFilters.exclude[MissingMethodProblem](
+              "org.apache.spark.streaming.api.java.JavaDStreamLike.reduceByWindow")
--- End diff --

This is actually not a false positive. Since JavaDStreamLike is a trait, it boils down to an abstract class. If someone has inherited their own class from JavaDStreamLike and had a function reduceByWindow with this signature (the new one), their code will break, as it will now have to override the method. So technically it is correct. But I don't envision anyone doing this, so this binary compatibility break is fine.
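To make the scenario concrete, here is a self-contained illustration with a made-up trait (not Spark's `JavaDStreamLike`): adding another abstract overload forces every external subclass to implement it before it compiles again.

```scala
// Made-up trait standing in for JavaDStreamLike; not Spark code.
trait WindowedLike[T] {
  def reduceByWindow(reduceFunc: (T, T) => T): T
  // Adding a new abstract overload here, e.g.
  //   def reduceByWindow(reduceFunc: (T, T) => T, invReduceFunc: (T, T) => T): T
  // would break MyWindowed below until it overrides the new method.
}

class MyWindowed(data: Seq[Int]) extends WindowedLike[Int] {
  override def reduceByWindow(reduceFunc: (Int, Int) => Int): Int = data.reduce(reduceFunc)
}

object Demo extends App {
  println(new MyWindowed(Seq(1, 2, 3)).reduceByWindow(_ + _)) // 6
}
```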





[GitHub] spark pull request: [SPARK-5315][Streaming] Fix reduceByWindow Jav...

2015-01-22 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/4104#discussion_r23365219
  
--- Diff: 
streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaDStreamLike.scala
 ---
@@ -211,7 +211,9 @@ trait JavaDStreamLike[T, This <: JavaDStreamLike[T, This, R], R <: JavaRDDLike[T
    * @param slideDuration  sliding interval of the window (i.e., the interval after which
    *                       the new DStream will generate RDDs); must be a multiple of this
    *                       DStream's batching interval
+   * @deprecated As this API is not Java compatible.
    */
+  @deprecated("Use Java compatible version", "1.3.0")
--- End diff --

Use Java-compatible version of reduceByWindow





[GitHub] spark pull request: [SPARK-5233][Streaming] Fix error replaying of...

2015-01-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4032#issuecomment-70997210
  
  [Test build #25964 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25964/consoleFull)
 for   PR 4032 at commit 
[`a237c75`](https://github.com/apache/spark/commit/a237c75f87518b26245fd688de9bbae4bc151ae2).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-5233][Streaming] Fix error replaying of...

2015-01-22 Thread jerryshao
Github user jerryshao commented on the pull request:

https://github.com/apache/spark/pull/4032#issuecomment-70997541
  
Hey @tdas, I've updated the code and rebased the branch according to your comments. After several rounds of testing on my local setup, the previous exception I reported is gone :).





[GitHub] spark pull request: [SPARK-5365][MLlib] Refactor KMeans to reduce ...

2015-01-22 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/4159#issuecomment-70997638
  
So this returns `(p, (r1, r2, r3, ...))` instead of `(r1, p), (r2, p), (r3, p), ...`. Makes sense to me, especially if you have reason to believe this is a bottleneck somewhere.





[GitHub] spark pull request: [SPARK-5315][Streaming] Fix reduceByWindow Jav...

2015-01-22 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/4104#discussion_r23365454
  
--- Diff: project/MimaExcludes.scala ---
@@ -82,6 +82,10 @@ object MimaExcludes {
             // SPARK-5166 Spark SQL API stabilization
             ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.ml.Transformer.transform"),
             ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.ml.Estimator.fit")
+          ) ++ Seq(
+            // SPARK-5315 Spark Streaming Java API returns Scala DStream
+            ProblemFilters.exclude[MissingMethodProblem](
+              "org.apache.spark.streaming.api.java.JavaDStreamLike.reduceByWindow")
--- End diff --

Ah right, my understanding of this issue never fails to fail. In a recent 
PR, I think the decision was to give up on supporting user code extending 
`JavaRDDLike` so indeed this is OK.





[GitHub] spark pull request: [SPARK-5315][Streaming] Fix reduceByWindow Jav...

2015-01-22 Thread tdas
Github user tdas commented on the pull request:

https://github.com/apache/spark/pull/4104#issuecomment-70997871
  
Can you update this patch to deal with the merge conflicts.






[GitHub] spark pull request: [SPARK-5365][MLlib] Refactor KMeans to reduce ...

2015-01-22 Thread viirya
Github user viirya commented on the pull request:

https://github.com/apache/spark/pull/4159#issuecomment-70998144
  
Especially when there are many runs, and p is high-dimensional and selected in 
more than one run, collecting redundant copies of p would be wasteful and 
time-consuming.





[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4073#issuecomment-70998096
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25961/
Test PASSed.





[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4073#issuecomment-70998086
  
  [Test build #25961 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25961/consoleFull)
 for   PR 4073 at commit 
[`e0e0d9c`](https://github.com/apache/spark/commit/e0e0d9c65c8981e86cf806f39200bff76e4e786f).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-5365][MLlib] Refactor KMeans to reduce ...

2015-01-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4159#issuecomment-70998595
  
  [Test build #25962 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25962/consoleFull)
 for   PR 4159 at commit 
[`25487e6`](https://github.com/apache/spark/commit/25487e65420cb3706744c0f8a560ac0f714fca31).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-5365][MLlib] Refactor KMeans to reduce ...

2015-01-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4159#issuecomment-70998601
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25962/
Test PASSed.





[GitHub] spark pull request: [SPARK-5347][CORE] Change FileSplit to InputSp...

2015-01-22 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/4150#issuecomment-70999157
  
My only question was whether `getLength()` is indeed defined in the 
`InputSplit` interface in older Hadoop versions, but it looks like it is. This 
change compiles with default Hadoop versions in the build.
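
For reference, a minimal sketch of that assumption (plain Hadoop API, not code from this PR): `getLength()` is declared on the old-API `InputSplit` interface itself, so no downcast to `FileSplit` is needed.

```scala
import org.apache.hadoop.mapred.InputSplit

// Works for any InputSplit implementation, not just FileSplit.
def splitLength(split: InputSplit): Long = split.getLength
```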





[GitHub] spark pull request: [SPARK-5233][Streaming] Fix error replaying of...

2015-01-22 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/4032#discussion_r23366311
  
--- Diff: 
streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala
 ---
@@ -206,9 +208,13 @@ class JobGenerator(jobScheduler: JobScheduler) extends Logging {
 val timesToReschedule = (pendingTimes ++ downTimes).distinct.sorted(Time.ordering)
 logInfo("Batches to reschedule (" + timesToReschedule.size + " batches): " +
   timesToReschedule.mkString(", "))
-timesToReschedule.foreach(time =>
+timesToReschedule.foreach { time =>
+  // Allocate the related blocks when recovering from failure, because 
some added but not
+  // allocated block is dangled in the queue after recovering, we have 
to insert some block
--- End diff --

Grammar fix: "...some blocks that were added but not allocated are dangling in 
the queue after recovering. We have to allocate those blocks to the next batch, 
which is the batch they were supposed to go to."






[GitHub] spark pull request: [SPARK-5233][Streaming] Fix error replaying of...

2015-01-22 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/4032#discussion_r23366414
  
--- Diff: 
streaming/src/test/scala/org/apache/spark/streaming/ReceivedBlockTrackerSuite.scala
 ---
@@ -82,15 +82,13 @@ class ReceivedBlockTrackerSuite
 receivedBlockTracker.allocateBlocksToBatch(2)
 receivedBlockTracker.getBlocksOfBatchAndStream(2, streamId) shouldBe 
empty
 
-// Verify that batch 2 cannot be allocated again
-intercept[SparkException] {
-  receivedBlockTracker.allocateBlocksToBatch(2)
-}
+// Verify that older batches have no operation on batch allocation,
+// will return the same blocks as previously allocated.
+receivedBlockTracker.allocateBlocksToBatch(1)
+receivedBlockTracker.getBlocksOfBatchAndStream(1, streamId) 
shouldEqual blockInfos
 
-// Verify that older batches cannot be allocated again
-intercept[SparkException] {
-  receivedBlockTracker.allocateBlocksToBatch(1)
-}
+receivedBlockTracker.allocateBlocksToBatch(2)
+receivedBlockTracker.getBlocksOfBatchAndStream(2, streamId) shouldBe 
empty
--- End diff --

Could you also test that even if there are new unallocated blocks, calling 
`receivedBlockTracker.allocateBlocksToBatch(2)` does not allocate those blocks? 
I don't think that is covered in this test.
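
A rough sketch of that extra assertion (the block-creation helper is hypothetical and the exact `addBlock` usage is assumed from the surrounding suite, not taken from this PR):

```scala
// Add fresh, unallocated blocks after batch 2 has already been allocated...
val lateBlockInfos = generateBlockInfos()            // hypothetical helper
lateBlockInfos.foreach(receivedBlockTracker.addBlock)

// ...then re-allocating batch 2 must not pick them up.
receivedBlockTracker.allocateBlocksToBatch(2)
receivedBlockTracker.getBlocksOfBatchAndStream(2, streamId) shouldBe empty
```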





[GitHub] spark pull request: SPARK-4506 [DOCS] Addendum: Update more docs t...

2015-01-22 Thread srowen
GitHub user srowen opened a pull request:

https://github.com/apache/spark/pull/4160

SPARK-4506 [DOCS] Addendum: Update more docs to reflect that standalone 
works in cluster mode

This is a trivial addendum to SPARK-4506, which was already resolved, as noted 
by Asim Jalis in SPARK-4506.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/srowen/spark SPARK-4506

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/4160.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4160


commit 5f5f7dfea53e25f4dfe7f880aa8479263b36cc74
Author: Sean Owen so...@cloudera.com
Date:   2015-01-22T10:26:50Z

Update more docs to reflect that standalone works in cluster mode







[GitHub] spark pull request: [SPARK-5233][Streaming] Fix error replaying of...

2015-01-22 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/4032#discussion_r23366552
  
--- Diff: 
streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceivedBlockTracker.scala
 ---
@@ -107,8 +107,14 @@ private[streaming] class ReceivedBlockTracker(
   lastAllocatedBatchTime = batchTime
   allocatedBlocks
 } else {
-  throw new SparkException(s"Unexpected allocation of blocks, " +
-    s"last batch = $lastAllocatedBatchTime, batch time to allocate = $batchTime")
+  // This situation occurs when:
+  // 1. WAL is ended with BatchAllocationEvent, but without BatchCleanupEvent,
+  // possibly processed batch job or half-processed batch job need to be processed again,
+  // so the batchTime will be equal to lastAllocatedBatchTime.
+  // 2. Slow checkpointing makes recovered batch time older than WAL recovered
+  // lastAllocatedBatchTime.
+  // This situation will only occurs in recovery time.
+  logWarning(s"Possibly processed batch $batchTime need to be processed again in WAL recovery")
--- End diff --

I am not sure we should even log a warning. Warnings are for when something bad 
may have happened. In this case, if the code reaches here, nothing bad or 
unexpected has happened.
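
If any message is wanted at all, a debug-level line would probably be enough, e.g. (sketch only, wording assumed):

```scala
logDebug(s"Batch $batchTime was already allocated during WAL recovery; nothing to do")
```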





[GitHub] spark pull request: [SPARK-4654][CORE] Clean up DAGScheduler getMi...

2015-01-22 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/4134#discussion_r23366593
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -349,34 +349,7 @@ class DAGScheduler(
   }
 
   private def getMissingParentStages(stage: Stage): List[Stage] = {
-val missing = new HashSet[Stage]
-val visited = new HashSet[RDD[_]]
-// We are manually maintaining a stack here to prevent 
StackOverflowError
-// caused by recursively visiting
-val waitingForVisit = new Stack[RDD[_]]
-def visit(rdd: RDD[_]) {
-  if (!visited(rdd)) {
-visited += rdd
-if (getCacheLocs(rdd).contains(Nil)) {
--- End diff --

Actually I don't fully understand why we need this. Does this mean the 
`rdd` has at least one block which hasn't been cached? And why do we need this 
to get the missing parent stages?
@lianhuiwang @rxin 





[GitHub] spark pull request: [SPARK-5233][Streaming] Fix error replaying of...

2015-01-22 Thread tdas
Github user tdas commented on the pull request:

https://github.com/apache/spark/pull/4032#issuecomment-71000469
  
Looks almost good. Some minor comments.





[GitHub] spark pull request: [SPARK-4939] move to next locality when no pen...

2015-01-22 Thread tdas
Github user tdas commented on the pull request:

https://github.com/apache/spark/pull/3779#issuecomment-71000700
  
@kayousterhout @davies @pwendell can this be merged for 1.2.1?





[GitHub] spark pull request: [SPARK-5147][Streaming] Delete the received da...

2015-01-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4149#issuecomment-70989118
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25952/
Test PASSed.





[GitHub] spark pull request: [SPARK-5147][Streaming] Delete the received da...

2015-01-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4149#issuecomment-70989110
  
  [Test build #25952 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25952/consoleFull)
 for   PR 4149 at commit 
[`730798b`](https://github.com/apache/spark/commit/730798b8a21eea5ebc13acb9e4376b42271ff9e4).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-5233][Streaming] Fix error replaying of...

2015-01-22 Thread tdas
Github user tdas commented on the pull request:

https://github.com/apache/spark/pull/4032#issuecomment-70989477
  
I understand this patch now. Please update it based on my comments, and 
test it in your harness to make sure that it addresses the exception problem.





[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4073#issuecomment-70989958
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25955/
Test PASSed.





[GitHub] spark pull request: Refactor KMeans to reduce redundant data

2015-01-22 Thread viirya
GitHub user viirya opened a pull request:

https://github.com/apache/spark/pull/4159

Refactor KMeans to reduce redundant data

If a point is selected as a new center by many runs, it collects a lot of 
redundant data. This PR refactors that.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/viirya/spark-1 small_refactor_kmeans

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/4159.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4159


commit 25487e65420cb3706744c0f8a560ac0f714fca31
Author: Liang-Chi Hsieh vii...@gmail.com
Date:   2015-01-22T09:00:33Z

Refactor codes to reduce redundant data.







[GitHub] spark pull request: Refactor KMeans to reduce redundant data

2015-01-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4159#issuecomment-70990653
  
  [Test build #25962 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25962/consoleFull)
 for   PR 4159 at commit 
[`25487e6`](https://github.com/apache/spark/commit/25487e65420cb3706744c0f8a560ac0f714fca31).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-5233][Streaming] Fix error replaying of...

2015-01-22 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/4032#discussion_r23362757
  
--- Diff: 
streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceivedBlockTracker.scala
 ---
@@ -106,6 +106,12 @@ private[streaming] class ReceivedBlockTracker(
   timeToAllocatedBlocks(batchTime) = allocatedBlocks
   lastAllocatedBatchTime = batchTime
   allocatedBlocks
+} else if (batchTime == lastAllocatedBatchTime) {
+  // This situation occurs when WAL is ended with BatchAllocationEvent,
+  // but without BatchCleanupEvent, possibly processed batch job or half-processed batch
+  // job need to process again, so the batchTime will be equal to lastAllocatedBatchTime.
+  // This situation will only occurs in recovery time.
+  logWarning(s"Possibly processed batch $batchTime need to be processed again in WAL recovery")
 } else {
--- End diff --

IIUC, this should not happen normally unless the checkpoint operation is very 
slow. Anyway, we should handle it to make the code more robust.





[GitHub] spark pull request: [SPARK-5154] [PySpark] [Streaming] Kafka strea...

2015-01-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3715#issuecomment-70992015
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25958/
Test PASSed.





[GitHub] spark pull request: [SPARK-5154] [PySpark] [Streaming] Kafka strea...

2015-01-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3715#issuecomment-70992003
  
  [Test build #25958 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25958/consoleFull)
 for   PR 3715 at commit 
[`97386b3`](https://github.com/apache/spark/commit/97386b3debd5f352b61dfed194ab9495fecbe834).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class KafkaUtils(object):`






[GitHub] spark pull request: [SPARK-4987] [SQL] parquet timestamp type supp...

2015-01-22 Thread adrian-wang
Github user adrian-wang commented on the pull request:

https://github.com/apache/spark/pull/3820#issuecomment-70992950
  
I have fixed the bug. This is quite embarrassing: I forgot to cast those factors 
(NANOS_PER_SECOND, SECONDS_PER_MINUTE, MINUTES_PER_HOUR) to Long when dividing, 
so the value overflowed... I have tested it, and now it works fine.
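
A tiny sketch of that kind of overflow (constant names taken from the comment above; the exact expression in the PR may differ):

```scala
val NANOS_PER_SECOND = 1000000000   // Int literal
val SECONDS_PER_MINUTE = 60
val MINUTES_PER_HOUR = 60

// Int arithmetic wraps around: 10^9 * 60 * 60 does not fit in 32 bits.
val nanosPerHourInt = NANOS_PER_SECOND * SECONDS_PER_MINUTE * MINUTES_PER_HOUR          // wrong value
// Promoting one factor to Long keeps the whole computation in 64 bits.
val nanosPerHourLong = NANOS_PER_SECOND.toLong * SECONDS_PER_MINUTE * MINUTES_PER_HOUR  // 3600000000000L
```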





[GitHub] spark pull request: [SPARK-5364] [SQL] HiveQL transform doesn't su...

2015-01-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4158#issuecomment-70992965
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25959/
Test PASSed.





[GitHub] spark pull request: [SPARK-4987] [SQL] parquet timestamp type supp...

2015-01-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3820#issuecomment-70992996
  
  [Test build #25963 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25963/consoleFull)
 for   PR 3820 at commit 
[`5d1eeed`](https://github.com/apache/spark/commit/5d1eeedd43d60d9ef9c5dcc0e97fff829ccea8ed).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-5364] [SQL] HiveQL transform doesn't su...

2015-01-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4158#issuecomment-70992962
  
  [Test build #25959 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25959/consoleFull)
 for   PR 4158 at commit 
[`c8fe7fc`](https://github.com/apache/spark/commit/c8fe7fc37471c38b24e52a5d170fa0741b50c791).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: SPARK-5357: Update commons-codec version to 1....

2015-01-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4153#issuecomment-70993462
  
  [Test build #25960 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25960/consoleFull)
 for   PR 4153 at commit 
[`b4a91f4`](https://github.com/apache/spark/commit/b4a91f478d496b48c03d28bc44a9a848b5e93c85).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: SPARK-5357: Update commons-codec version to 1....

2015-01-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4153#issuecomment-70993475
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25960/
Test PASSed.





[GitHub] spark pull request: [SPARK-5364] [SQL] HiveQL transform doesn't su...

2015-01-22 Thread viirya
Github user viirya commented on the pull request:

https://github.com/apache/spark/pull/4158#issuecomment-70989376
  
Hi @chenghao-intel, I already did this, along with support for custom field 
delimiters and SerDe, in PR #4014.





[GitHub] spark pull request: [SPARK-5233][Streaming] Fix error replaying of...

2015-01-22 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/4032#discussion_r23361989
  
--- Diff: 
streaming/src/test/scala/org/apache/spark/streaming/ReceivedBlockTrackerSuite.scala
 ---
@@ -82,11 +82,6 @@ class ReceivedBlockTrackerSuite
 receivedBlockTracker.allocateBlocksToBatch(2)
 receivedBlockTracker.getBlocksOfBatchAndStream(2, streamId) shouldBe 
empty
 
-// Verify that batch 2 cannot be allocated again
--- End diff --

Please add a test to verify that `allocateBlocksToBatch(x)` where x <= 2 does 
nothing.





[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4073#issuecomment-70989944
  
  [Test build #25955 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25955/consoleFull)
 for   PR 4073 at commit 
[`d5d68e7`](https://github.com/apache/spark/commit/d5d68e79f7ccbefe4c45d53253df3e7066cb7f53).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3726] [MLlib] Allow sampling_rate not e...

2015-01-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4073#issuecomment-70990119
  
  [Test build #25961 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25961/consoleFull)
 for   PR 4073 at commit 
[`e0e0d9c`](https://github.com/apache/spark/commit/e0e0d9c65c8981e86cf806f39200bff76e4e786f).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-5233][Streaming] Fix error replaying of...

2015-01-22 Thread jerryshao
Github user jerryshao commented on the pull request:

https://github.com/apache/spark/pull/4032#issuecomment-70990301
  
OK, will do, thanks a lot for your comments.





[GitHub] spark pull request: [SPARK-5154] [PySpark] [Streaming] Kafka strea...

2015-01-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3715#issuecomment-70990806
  
  [Test build #25956 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25956/consoleFull)
 for   PR 3715 at commit 
[`2c567a5`](https://github.com/apache/spark/commit/2c567a5d55c465d706026c2395e9025fad9dbd68).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class KafkaUtils(object):`






[GitHub] spark pull request: [SPARK-5154] [PySpark] [Streaming] Kafka strea...

2015-01-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3715#issuecomment-70990814
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25956/
Test PASSed.





[GitHub] spark pull request: [SPARK-4934][CORE] Print remote address in Con...

2015-01-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4157#issuecomment-70991309
  
  [Test build #25957 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25957/consoleFull)
 for   PR 4157 at commit 
[`e5ac73e`](https://github.com/apache/spark/commit/e5ac73e3fd18c3bf5c1a32ce531b15be9feac385).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4934][CORE] Print remote address in Con...

2015-01-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4157#issuecomment-70991317
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25957/
Test PASSed.





[GitHub] spark pull request: SPARK-4506 [DOCS] Addendum: Update more docs t...

2015-01-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4160#issuecomment-71000852
  
  [Test build #25965 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25965/consoleFull)
 for   PR 4160 at commit 
[`5f5f7df`](https://github.com/apache/spark/commit/5f5f7dfea53e25f4dfe7f880aa8479263b36cc74).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-5259][CORE]Make sure mapStage.pendingta...

2015-01-22 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/4055#discussion_r23367855
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/Task.scala ---
@@ -106,7 +106,22 @@ private[spark] abstract class Task[T](val stageId: Int, var partitionId: Int) ex
 if (interruptThread && taskThread != null) {
   taskThread.interrupt()
 }
-  }  
+  }
+
+  override def hashCode(): Int = {
+val state = Seq(stageId, partitionId)
+state.map(_.hashCode()).foldLeft(0)((a, b) => 31 * a + b)
--- End diff --

This seems like an excessively complex way of writing `31 * 
stageId.hashCode + partitionId.hashCode`. I don't think FP is the way to do 
this.





[GitHub] spark pull request: [SPARK-5259][CORE]Make sure mapStage.pendingta...

2015-01-22 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/4055#discussion_r23367928
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/Task.scala ---
@@ -106,7 +106,22 @@ private[spark] abstract class Task[T](val stageId: Int, var partitionId: Int) ex
 if (interruptThread && taskThread != null) {
   taskThread.interrupt()
 }
-  }  
+  }
+
+  override def hashCode(): Int = {
+val state = Seq(stageId, partitionId)
+state.map(_.hashCode()).foldLeft(0)((a, b) => 31 * a + b)
+  }
+
+  def canEqual(other: Any): Boolean = other.isInstanceOf[Task[T]]
+
+  override def equals(other: Any): Boolean = other match {
+case that: Task[_] =>
+  (that canEqual this) &&
--- End diff --

`that` is already a `Task` here; is the `canEqual` method adding anything? 
The check has no dependency on the generic type `T`.
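
A sketch of what the pattern match alone already buys, without `canEqual` (assuming the intent is just to compare `stageId` and `partitionId`):

```scala
override def equals(other: Any): Boolean = other match {
  // The match already guarantees `that` is a Task; compare the two identifying fields.
  case that: Task[_] => this.stageId == that.stageId && this.partitionId == that.partitionId
  case _ => false
}
```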





[GitHub] spark pull request: [SPARK-5063] More helpful error messages for s...

2015-01-22 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/3884#discussion_r23397600
  
--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -955,6 +977,11 @@ class SparkContext(config: SparkConf) extends Logging 
with ExecutorAllocationCli
* The variable will be sent to each cluster only once.
*/
   def broadcast[T: ClassTag](value: T): Broadcast[T] = {
+assertNotStopped()
+if (classOf[RDD[_]].isAssignableFrom(classTag[T].runtimeClass)) {
--- End diff --

Actually, maybe this check should go somewhere else, since I think that it 
might technically have been safe to _create_ a broadcast variable with an RDD, 
even though doing anything with it would trigger errors.





[GitHub] spark pull request: [SPARK-5063] More helpful error messages for s...

2015-01-22 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/3884#discussion_r23397684
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -76,10 +76,25 @@ import org.apache.spark.util.random.{BernoulliSampler, 
PoissonSampler, Bernoulli
  * on RDD internals.
  */
 abstract class RDD[T: ClassTag](
-@transient private var sc: SparkContext,
+@transient private var _sc: SparkContext,
 @transient private var deps: Seq[Dependency[_]]
   ) extends Serializable with Logging {
 
+  if (classOf[RDD[_]].isAssignableFrom(elementClassTag.runtimeClass)) {
--- End diff --

Similarly, this should perhaps be a warning instead of an exception in 
order to avoid any possibility of breaking odd corner-case 1.2.1 apps.  I'll 
change this to a warning and leave the `sc` getter as an exception.
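
A sketch of the softened version described above, keyed off the same check as the diff (the message text is assumed, not the committed wording):

```scala
if (classOf[RDD[_]].isAssignableFrom(elementClassTag.runtimeClass)) {
  // Soft check: nested RDDs are unsupported (SPARK-5063), but don't fail construction.
  logWarning("Spark does not support nested RDDs (see SPARK-5063)")
}
```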





[GitHub] spark pull request: [SPARK-5012][MLLib][PySpark]Python API for Gau...

2015-01-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/4059#discussion_r23399719
  
--- Diff: examples/src/main/python/mllib/gaussian_mixture_model.py ---
@@ -0,0 +1,65 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the License); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an AS IS BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+
+A Gaussian Mixture Model clustering program using MLlib.
+
+This example requires NumPy (http://www.numpy.org/).
+
+
+import sys
+import random
+import argparse
+import numpy as np
+from pyspark import SparkConf, SparkContext
+from pyspark.mllib.clustering import GaussianMixtureEM
+
+
+def parseVector(line):
+return np.array([float(x) for x in line.split(' ')])
+
+
+if __name__ == "__main__":
+
+Parameters
+--
+input_file : path of the file which contains data points
+k : Number of mixture components
+convergenceTol : convergence_threshold.Default to 1e-3
--- End diff --

space after .





[GitHub] spark pull request: [SPARK-5012][MLLib][PySpark]Python API for Gau...

2015-01-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/4059#discussion_r23399756
  
--- Diff: examples/src/main/python/mllib/gaussian_mixture_model.py ---
@@ -0,0 +1,65 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the License); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an AS IS BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+
+A Gaussian Mixture Model clustering program using MLlib.
+
+This example requires NumPy (http://www.numpy.org/).
+
+
+import sys
+import random
+import argparse
+import numpy as np
+from pyspark import SparkConf, SparkContext
+from pyspark.mllib.clustering import GaussianMixtureEM
+
+
+def parseVector(line):
+return np.array([float(x) for x in line.split(' ')])
+
+
+if __name__ == "__main__":
+
+Parameters
+--
+input_file : path of the file which contains data points
+k : Number of mixture components
+convergenceTol : convergence_threshold.Default to 1e-3
+seed : random seed
+n_iter : Number of EM iterations to perform. Default to 100
+
+conf = SparkConf().setAppName("GMM")
+sc = SparkContext(conf=conf)
+
+parser = argparse.ArgumentParser()
+parser.add_argument('input_file', help='input file')
+parser.add_argument('k', type=int, help='num_of_clusters')
+parser.add_argument('--ct', default=1e-3, type=float, 
help='convergence_threshold')
--- End diff --

`--ct` -> `--convergenceTol`? In general, it is good to use the original 
parameter names in the API. So after trying the example, users become familiar 
with the API as well.





[GitHub] spark pull request: [SPARK-5012][MLLib][PySpark]Python API for Gau...

2015-01-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/4059#discussion_r23399765
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -280,6 +280,48 @@ class PythonMLLibAPI extends Serializable {
   }
 
   /**
+   * Java stub for Python mllib GaussianMixtureEM.train()
--- End diff --

Should we document the return value? It is not easy to tell from the return 
type `JList[Object]`.
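
For example, a sketch of the kind of doc being asked for, based on the `List(model.weight, model.mu, model.sigma)` built later in this PR (the element types are assumptions taken from the Scala `GaussianMixtureModel`, not verified against this patch):

```scala
/**
 * Java stub for Python mllib GaussianMixtureEM.train().
 *
 * @return a 3-element list: [0] the component weights (Array[Double]),
 *         [1] the component means (Array[Vector]),
 *         [2] the component covariance matrices (Array[Matrix]).
 */
```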





[GitHub] spark pull request: [SPARK-5012][MLLib][PySpark]Python API for Gau...

2015-01-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/4059#discussion_r23399804
  
--- Diff: python/pyspark/mllib/clustering.py ---
@@ -86,6 +86,68 @@ def train(cls, rdd, k, maxIterations=100, runs=1, 
initializationMode=k-means||
 return KMeansModel([c.toArray() for c in centers])
 
 
+class GaussianMixtureModel(object):
+    """
+    A clustering model derived from the Gaussian Mixture Model method.
+
+    >>> from numpy import array
+    >>> clusterdata_1 = sc.parallelize(array([-0.1,-0.05,-0.01,-0.1,
+    ...                                        0.9,0.8,0.75,0.935,
+    ...                                       -0.83,-0.68,-0.91,-0.76]).reshape(6,2))
+    >>> model = GaussianMixtureEM.train(clusterdata_1, 3, 0.0001, 3205, 10)
+    >>> labels = model.predictLabels(clusterdata_1).collect()
+    >>> labels[0]==labels[2]
+    True
+    >>> labels[3]==labels[4]
+    False
+    >>> labels[4]==labels[5]
+    True
+    >>> clusterdata_2 = sc.parallelize(array([-5.1971, -2.5359, -3.8220,
+    ...                                       -5.2211, -5.0602,  4.7118,
+    ...                                        6.8989, 3.4592,  4.6322,
+    ...                                        5.7048,  4.6567, 5.5026,
+    ...                                        4.5605,  5.2043,  6.2734]).reshape(5,3))
+    >>> model = GaussianMixtureEM.train(clusterdata_2, 2, 0.0001, 150, 10)
+    >>> labels = model.predictLabels(clusterdata_2).collect()
+    >>> labels[0]==labels[1]==labels[2]
+    True
+    >>> labels[3]==labels[4]
+    True
+    """
+
+
+def __init__(self, weight, mu, sigma):
--- End diff --

FYI, the names were changed to `weights` and `gaussians`. We might want to 
discuss how the Python API should change.





[GitHub] spark pull request: [SPARK-5012][MLLib][PySpark]Python API for Gau...

2015-01-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/4059#discussion_r23399766
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -280,6 +280,48 @@ class PythonMLLibAPI extends Serializable {
   }
 
   /**
+   * Java stub for Python mllib GaussianMixtureEM.train()
+   */
+  def trainGaussianMixtureEM(
+  data: JavaRDD[Vector], 
+  k: Int, 
+  convergenceTol: Double, 
+  seed: Long, 
+  maxIterations: Int): JList[Object]  = {
+val gmmAlg = new GaussianMixtureEM()
+  .setK(k)
+  .setConvergenceTol(convergenceTol)
+  .setSeed(seed)
+  .setMaxIterations(maxIterations)
+try {
+  val model = 
gmmAlg.run(data.rdd.persist(StorageLevel.MEMORY_AND_DISK))
+  List(model.weight, model.mu, model.sigma).
+  map(_.asInstanceOf[Object]).asJava
+} finally {
+  data.rdd.unpersist(blocking = false)
+}
+  }
+
+  /**
+   * Java stub for Python mllib GaussianMixtureModel.predictSoft()
+   */
+  def findPredict(
--- End diff --

This is inside `PythonMLlibAPI`. It is necessary to mention GMM in the 
method name. Btw, I don't quite understand what `find` means here.





[GitHub] spark pull request: [SPARK-5012][MLLib][PySpark]Python API for Gau...

2015-01-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/4059#discussion_r23399715
  
--- Diff: examples/src/main/python/mllib/gaussian_mixture_model.py ---
@@ -0,0 +1,65 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the License); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an AS IS BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+
+A Gaussian Mixture Model clustering program using MLlib.
+
+This example requires NumPy (http://www.numpy.org/).
+
+
+import sys
+import random
+import argparse
+import numpy as np
+from pyspark import SparkConf, SparkContext
--- End diff --

separate spark imports from python imports




