[GitHub] spark pull request: [SPARK-4939] revive offers periodically in Loc...

2015-02-03 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/4147


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5578][SQL][DataFrame] Provide a conveni...

2015-02-03 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/4345





[GitHub] spark pull request: [SPARK-4795][Core] Redesign the primitive typ...

2015-02-03 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/3642





[GitHub] spark pull request: [SPARK-4587] [mllib] ML model import/export

2015-02-03 Thread selvinsource
Github user selvinsource commented on a diff in the pull request:

https://github.com/apache/spark/pull/4233#discussion_r24066997
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
 ---
@@ -17,14 +17,17 @@
 
 package org.apache.spark.mllib.classification
 
+import org.apache.spark.SparkContext
 import org.apache.spark.annotation.Experimental
+import org.apache.spark.mllib.classification.impl.GLMClassificationModel
 import org.apache.spark.mllib.linalg.BLAS.dot
 import org.apache.spark.mllib.linalg.{DenseVector, Vector}
 import org.apache.spark.mllib.optimization._
 import org.apache.spark.mllib.regression._
-import org.apache.spark.mllib.util.{DataValidators, MLUtils}
+import org.apache.spark.mllib.util.{DataValidators, Exportable, Importable}
--- End diff --

Good for me too.





[GitHub] spark pull request: [SPARK-5484] Checkpoint every 25 iterations in...

2015-02-03 Thread maropu
Github user maropu commented on the pull request:

https://github.com/apache/spark/pull/4273#issuecomment-72804614
  
And also, this issue seems to be related to SPARK-5561.





[GitHub] spark pull request: [SPARK-5068] [SQL] Fix bug query data when pat...

2015-02-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4356#issuecomment-72805635
  
  [Test build #26729 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26729/consoleFull)
 for   PR 4356 at commit 
[`1f033cd`](https://github.com/apache/spark/commit/1f033cd8901bd97c8a4677e284847a2e975c6987).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-5068] [SQL] Fix bug query data when pat...

2015-02-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4356#issuecomment-72805642
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26729/
Test PASSed.





[GitHub] spark pull request: [SPARK-5068] [SQL] Fix bug query data when pat...

2015-02-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4356#issuecomment-72799255
  
  [Test build #26729 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26729/consoleFull)
 for   PR 4356 at commit 
[`1f033cd`](https://github.com/apache/spark/commit/1f033cd8901bd97c8a4677e284847a2e975c6987).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4939] move to next locality when no pen...

2015-02-03 Thread kayousterhout
Github user kayousterhout commented on a diff in the pull request:

https://github.com/apache/spark/pull/3779#discussion_r24065502
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -506,13 +506,59 @@ private[spark] class TaskSetManager(
    * Get the level we can launch tasks according to delay scheduling, based on current wait time.
    */
   private def getAllowedLocalityLevel(curTime: Long): TaskLocality.TaskLocality = {
-    while (curTime - lastLaunchTime >= localityWaits(currentLocalityIndex) &&
-        currentLocalityIndex < myLocalityLevels.length - 1)
-    {
-      // Jump to the next locality level, and remove our waiting time for the current one since
-      // we don't want to count it again on the next one
-      lastLaunchTime += localityWaits(currentLocalityIndex)
-      currentLocalityIndex += 1
+    // Remove the scheduled or finished tasks lazily
+    def hasNotScheduledTasks(taskIndexes: ArrayBuffer[Int]): Boolean = {
+      var indexOffset = taskIndexes.size
+      while (indexOffset > 0) {
+        indexOffset -= 1
+        val index = taskIndexes(indexOffset)
+        if (copiesRunning(index) == 0 && !successful(index)) {
+          return true
+        } else {
+          taskIndexes.remove(indexOffset)
+        }
+      }
+      false
+    }
+    // It removes the empty lists after check
+    def hasMoreTasks(pendingTasks: HashMap[String, ArrayBuffer[Int]]): Boolean = {
+      val emptyKeys = new ArrayBuffer[String]
+      val hasTasks = pendingTasks.exists {
+        case (id: String, tasks: ArrayBuffer[Int]) =>
+          if (hasNotScheduledTasks(tasks)) {
+            true
+          } else {
+            emptyKeys += id
+            false
+          }
+      }
+      emptyKeys.foreach(x => pendingTasks.remove(x))
+      hasTasks
+    }
+
+    while (currentLocalityIndex < myLocalityLevels.length - 1) {
+      val moreTasks = myLocalityLevels(currentLocalityIndex) match {
+        case TaskLocality.PROCESS_LOCAL => hasMoreTasks(pendingTasksForExecutor)
+        case TaskLocality.NODE_LOCAL => hasMoreTasks(pendingTasksForHost)
+        case TaskLocality.NO_PREF => pendingTasksWithNoPrefs.isEmpty
+        case TaskLocality.RACK_LOCAL => hasMoreTasks(pendingTasksForRack)
+      }
+      if (!moreTasks) {
+        // Move to next locality level if there is no task for current level
+        lastLaunchTime = curTime
+        logDebug(s"No tasks for locality level ${myLocalityLevels(currentLocalityIndex)}, " +
+          s"move to ${myLocalityLevels(currentLocalityIndex + 1)}")
+        currentLocalityIndex += 1
+      } else if (curTime - lastLaunchTime >= localityWaits(currentLocalityIndex)) {
+        // Jump to the next locality level, and remove our waiting time for the current one since
+        // we don't want to count it again on the next one
--- End diff --

I know this existed before your change, but can you change this comment to: 
"Jump to the next locality level, and reset lastLaunchTime so that the next 
locality wait timer doesn't immediately expire"? I think that would make this 
easier to understand.
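
A sketch of how that branch reads with the suggested comment (surrounding
names taken from the quoted diff):

    } else if (curTime - lastLaunchTime >= localityWaits(currentLocalityIndex)) {
      // Jump to the next locality level, and reset lastLaunchTime so that the
      // next locality wait timer doesn't immediately expire
      lastLaunchTime += localityWaits(currentLocalityIndex)
      currentLocalityIndex += 1
    }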





[GitHub] spark pull request: [SPARK-4939] move to next locality when no pen...

2015-02-03 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/3779#discussion_r24066639
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -506,13 +506,59 @@ private[spark] class TaskSetManager(
    * Get the level we can launch tasks according to delay scheduling, based on current wait time.
    */
   private def getAllowedLocalityLevel(curTime: Long): TaskLocality.TaskLocality = {
-    while (curTime - lastLaunchTime >= localityWaits(currentLocalityIndex) &&
-        currentLocalityIndex < myLocalityLevels.length - 1)
-    {
-      // Jump to the next locality level, and remove our waiting time for the current one since
-      // we don't want to count it again on the next one
-      lastLaunchTime += localityWaits(currentLocalityIndex)
-      currentLocalityIndex += 1
+    // Remove the scheduled or finished tasks lazily
+    def hasNotScheduledTasks(taskIndexes: ArrayBuffer[Int]): Boolean = {
+      var indexOffset = taskIndexes.size
+      while (indexOffset > 0) {
+        indexOffset -= 1
+        val index = taskIndexes(indexOffset)
+        if (copiesRunning(index) == 0 && !successful(index)) {
+          return true
+        } else {
+          taskIndexes.remove(indexOffset)
+        }
+      }
+      false
+    }
+    // It removes the empty lists after check
+    def hasMoreTasks(pendingTasks: HashMap[String, ArrayBuffer[Int]]): Boolean = {
+      val emptyKeys = new ArrayBuffer[String]
+      val hasTasks = pendingTasks.exists {
+        case (id: String, tasks: ArrayBuffer[Int]) =>
+          if (hasNotScheduledTasks(tasks)) {
+            true
+          } else {
+            emptyKeys += id
+            false
+          }
+      }
+      emptyKeys.foreach(x => pendingTasks.remove(x))
--- End diff --

Also, the key could be executorId, host, or rackId





[GitHub] spark pull request: [SPARK-4939] move to next locality when no pen...

2015-02-03 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/3779#issuecomment-72803247
  
@kayousterhout @markhamstra @JoshRosen I should have addressed all your 
comments; they were very helpful, thank you all!





[GitHub] spark pull request: [SPARK-4939] move to next locality when no pen...

2015-02-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3779#issuecomment-72803322
  
  [Test build #26732 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26732/consoleFull)
 for   PR 3779 at commit 
[`1550668`](https://github.com/apache/spark/commit/1550668b047f8702a6c7b19aad56cfd7a56e47c3).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4969][STREAMING][PYTHON] Add binaryReco...

2015-02-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3803#issuecomment-72799033
  
  [Test build #26725 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26725/consoleFull)
 for   PR 3803 at commit 
[`b676534`](https://github.com/apache/spark/commit/b676534067a626260b6921ba17a04b6e03ff587a).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class FileInputDStream[K, V, F : NewInputFormat[K,V]](`






[GitHub] spark pull request: [SPARK-4939] move to next locality when no pen...

2015-02-03 Thread kayousterhout
Github user kayousterhout commented on a diff in the pull request:

https://github.com/apache/spark/pull/3779#discussion_r24065454
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -506,13 +506,59 @@ private[spark] class TaskSetManager(
    * Get the level we can launch tasks according to delay scheduling, based on current wait time.
    */
   private def getAllowedLocalityLevel(curTime: Long): TaskLocality.TaskLocality = {
-    while (curTime - lastLaunchTime >= localityWaits(currentLocalityIndex) &&
-        currentLocalityIndex < myLocalityLevels.length - 1)
-    {
-      // Jump to the next locality level, and remove our waiting time for the current one since
-      // we don't want to count it again on the next one
-      lastLaunchTime += localityWaits(currentLocalityIndex)
-      currentLocalityIndex += 1
+    // Remove the scheduled or finished tasks lazily
+    def hasNotScheduledTasks(taskIndexes: ArrayBuffer[Int]): Boolean = {
+      var indexOffset = taskIndexes.size
+      while (indexOffset > 0) {
+        indexOffset -= 1
+        val index = taskIndexes(indexOffset)
+        if (copiesRunning(index) == 0 && !successful(index)) {
+          return true
+        } else {
+          taskIndexes.remove(indexOffset)
+        }
+      }
+      false
+    }
+    // It removes the empty lists after check
+    def hasMoreTasks(pendingTasks: HashMap[String, ArrayBuffer[Int]]): Boolean = {
+      val emptyKeys = new ArrayBuffer[String]
+      val hasTasks = pendingTasks.exists {
+        case (id: String, tasks: ArrayBuffer[Int]) =>
+          if (hasNotScheduledTasks(tasks)) {
+            true
+          } else {
+            emptyKeys += id
+            false
+          }
+      }
+      emptyKeys.foreach(x => pendingTasks.remove(x))
+      hasTasks
+    }
+
+    while (currentLocalityIndex < myLocalityLevels.length - 1) {
+      val moreTasks = myLocalityLevels(currentLocalityIndex) match {
+        case TaskLocality.PROCESS_LOCAL => hasMoreTasks(pendingTasksForExecutor)
+        case TaskLocality.NODE_LOCAL => hasMoreTasks(pendingTasksForHost)
+        case TaskLocality.NO_PREF => pendingTasksWithNoPrefs.isEmpty
--- End diff --

+1 (is this a mistake?)





[GitHub] spark pull request: [SPARK-4939] move to next locality when no pen...

2015-02-03 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/3779#discussion_r24065965
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -506,13 +506,59 @@ private[spark] class TaskSetManager(
    * Get the level we can launch tasks according to delay scheduling, based on current wait time.
    */
   private def getAllowedLocalityLevel(curTime: Long): TaskLocality.TaskLocality = {
-    while (curTime - lastLaunchTime >= localityWaits(currentLocalityIndex) &&
-        currentLocalityIndex < myLocalityLevels.length - 1)
-    {
-      // Jump to the next locality level, and remove our waiting time for the current one since
-      // we don't want to count it again on the next one
-      lastLaunchTime += localityWaits(currentLocalityIndex)
-      currentLocalityIndex += 1
+    // Remove the scheduled or finished tasks lazily
+    def hasNotScheduledTasks(taskIndexes: ArrayBuffer[Int]): Boolean = {
+      var indexOffset = taskIndexes.size
+      while (indexOffset > 0) {
+        indexOffset -= 1
+        val index = taskIndexes(indexOffset)
+        if (copiesRunning(index) == 0 && !successful(index)) {
+          return true
+        } else {
+          taskIndexes.remove(indexOffset)
+        }
+      }
+      false
+    }
+    // It removes the empty lists after check
+    def hasMoreTasks(pendingTasks: HashMap[String, ArrayBuffer[Int]]): Boolean = {
+      val emptyKeys = new ArrayBuffer[String]
+      val hasTasks = pendingTasks.exists {
+        case (id: String, tasks: ArrayBuffer[Int]) =>
+          if (hasNotScheduledTasks(tasks)) {
+            true
+          } else {
+            emptyKeys += id
+            false
+          }
+      }
+      emptyKeys.foreach(x => pendingTasks.remove(x))
+      hasTasks
+    }
+
+    while (currentLocalityIndex < myLocalityLevels.length - 1) {
+      val moreTasks = myLocalityLevels(currentLocalityIndex) match {
+        case TaskLocality.PROCESS_LOCAL => hasMoreTasks(pendingTasksForExecutor)
+        case TaskLocality.NODE_LOCAL => hasMoreTasks(pendingTasksForHost)
+        case TaskLocality.NO_PREF => pendingTasksWithNoPrefs.isEmpty
--- End diff --

Oh, yes, it's really a mistake! Will fix it.





[GitHub] spark pull request: [SPARK-4939] move to next locality when no pen...

2015-02-03 Thread kayousterhout
Github user kayousterhout commented on the pull request:

https://github.com/apache/spark/pull/3779#issuecomment-72800486
  
Echoing the comments from @markhamstra and @JoshRosen, this change looks 
correct but is very subtle, so it would be great to improve the commenting 
throughout to avoid confusion for others who read this code later.  I suggested 
a few comment improvements; if you have time to make these soon, that would be 
great since I know you wanted this in for 1.3 and I know the merge window for 
that is closing rapidly.  Thanks @davies!





[GitHub] spark pull request: [SPARK-5587][SQL] Support change database owne...

2015-02-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4357#issuecomment-72800621
  
  [Test build #26730 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26730/consoleFull)
 for   PR 4357 at commit 
[`79413c6`](https://github.com/apache/spark/commit/79413c6eeb4031a676e278fd6aa10679c9ab48a5).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4939] move to next locality when no pen...

2015-02-03 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/3779#discussion_r24067444
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -506,13 +506,63 @@ private[spark] class TaskSetManager(
    * Get the level we can launch tasks according to delay scheduling, based on current wait time.
    */
   private def getAllowedLocalityLevel(curTime: Long): TaskLocality.TaskLocality = {
-    while (curTime - lastLaunchTime >= localityWaits(currentLocalityIndex) &&
-        currentLocalityIndex < myLocalityLevels.length - 1)
-    {
-      // Jump to the next locality level, and remove our waiting time for the current one since
-      // we don't want to count it again on the next one
-      lastLaunchTime += localityWaits(currentLocalityIndex)
-      currentLocalityIndex += 1
+    // Remove the scheduled or finished tasks lazily
+    def tasksNeedToBeScheduledFrom(pendingTaskIds: ArrayBuffer[Int]): Boolean = {
+      var indexOffset = pendingTaskIds.size
+      while (indexOffset > 0) {
+        indexOffset -= 1
+        val index = pendingTaskIds(indexOffset)
+        if (copiesRunning(index) == 0 && !successful(index)) {
+          return true
+        } else {
+          pendingTaskIds.remove(indexOffset)
+        }
+      }
+      false
+    }
+    // Walk through the list of tasks that can be scheduled at each location and returns true
+    // if there are any tasks that still need to be scheduled. Lazily cleans up tasks that have
+    // already been scheduled.
+    def noMoreTasksToRunIn(pendingTasks: HashMap[String, ArrayBuffer[Int]]): Boolean = {
+      val emptyKeys = new ArrayBuffer[String]
+      val hasTasks = pendingTasks.exists {
+        case (id: String, tasks: ArrayBuffer[Int]) =>
+          if (tasksNeedToBeScheduledFrom(tasks)) {
+            true
+          } else {
+            emptyKeys += id
+            false
+          }
+      }
+      emptyKeys.foreach(x => pendingTasks.remove(x))
+      hasTasks
+    }
+
+    while (currentLocalityIndex < myLocalityLevels.length - 1) {
+      val moreTasks = myLocalityLevels(currentLocalityIndex) match {
+        case TaskLocality.PROCESS_LOCAL => noMoreTasksToRunIn(pendingTasksForExecutor)
+        case TaskLocality.NODE_LOCAL => noMoreTasksToRunIn(pendingTasksForHost)
+        case TaskLocality.NO_PREF => pendingTasksWithNoPrefs.nonEmpty
--- End diff --

noMoreTasksToRunIn => moreTasksToRunIn()





[GitHub] spark pull request: [SPARK-2945][YARN][Doc]add doc for spark.execu...

2015-02-03 Thread WangTaoTheTonic
Github user WangTaoTheTonic commented on the pull request:

https://github.com/apache/spark/pull/4350#issuecomment-72799695
  
Seems like the UT is broken, but I think it is unrelated to this patch.





[GitHub] spark pull request: [SPARK-4939] move to next locality when no pen...

2015-02-03 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/3779#discussion_r24066617
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -506,13 +506,59 @@ private[spark] class TaskSetManager(
    * Get the level we can launch tasks according to delay scheduling, based on current wait time.
    */
   private def getAllowedLocalityLevel(curTime: Long): TaskLocality.TaskLocality = {
-    while (curTime - lastLaunchTime >= localityWaits(currentLocalityIndex) &&
-        currentLocalityIndex < myLocalityLevels.length - 1)
-    {
-      // Jump to the next locality level, and remove our waiting time for the current one since
-      // we don't want to count it again on the next one
-      lastLaunchTime += localityWaits(currentLocalityIndex)
-      currentLocalityIndex += 1
+    // Remove the scheduled or finished tasks lazily
+    def hasNotScheduledTasks(taskIndexes: ArrayBuffer[Int]): Boolean = {
+      var indexOffset = taskIndexes.size
+      while (indexOffset > 0) {
+        indexOffset -= 1
+        val index = taskIndexes(indexOffset)
+        if (copiesRunning(index) == 0 && !successful(index)) {
+          return true
+        } else {
+          taskIndexes.remove(indexOffset)
+        }
+      }
+      false
+    }
+    // It removes the empty lists after check
+    def hasMoreTasks(pendingTasks: HashMap[String, ArrayBuffer[Int]]): Boolean = {
+      val emptyKeys = new ArrayBuffer[String]
+      val hasTasks = pendingTasks.exists {
+        case (id: String, tasks: ArrayBuffer[Int]) =>
+          if (hasNotScheduledTasks(tasks)) {
+            true
+          } else {
+            emptyKeys += id
--- End diff --

The `id` could be executorId, host or rackId.





[GitHub] spark pull request: [SPARK-4939] move to next locality when no pen...

2015-02-03 Thread markhamstra
Github user markhamstra commented on a diff in the pull request:

https://github.com/apache/spark/pull/3779#discussion_r24067073
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -506,13 +506,63 @@ private[spark] class TaskSetManager(
    * Get the level we can launch tasks according to delay scheduling, based on current wait time.
    */
   private def getAllowedLocalityLevel(curTime: Long): TaskLocality.TaskLocality = {
-    while (curTime - lastLaunchTime >= localityWaits(currentLocalityIndex) &&
-        currentLocalityIndex < myLocalityLevels.length - 1)
-    {
-      // Jump to the next locality level, and remove our waiting time for the current one since
-      // we don't want to count it again on the next one
-      lastLaunchTime += localityWaits(currentLocalityIndex)
-      currentLocalityIndex += 1
+    // Remove the scheduled or finished tasks lazily
+    def tasksNeedToBeScheduledFrom(pendingTaskIds: ArrayBuffer[Int]): Boolean = {
+      var indexOffset = pendingTaskIds.size
+      while (indexOffset > 0) {
+        indexOffset -= 1
+        val index = pendingTaskIds(indexOffset)
+        if (copiesRunning(index) == 0 && !successful(index)) {
+          return true
+        } else {
+          pendingTaskIds.remove(indexOffset)
+        }
+      }
+      false
+    }
+    // Walk through the list of tasks that can be scheduled at each location and returns true
+    // if there are any tasks that still need to be scheduled. Lazily cleans up tasks that have
+    // already been scheduled.
+    def noMoreTasksToRunIn(pendingTasks: HashMap[String, ArrayBuffer[Int]]): Boolean = {
+      val emptyKeys = new ArrayBuffer[String]
+      val hasTasks = pendingTasks.exists {
+        case (id: String, tasks: ArrayBuffer[Int]) =>
+          if (tasksNeedToBeScheduledFrom(tasks)) {
+            true
+          } else {
+            emptyKeys += id
+            false
+          }
+      }
+      emptyKeys.foreach(x => pendingTasks.remove(x))
+      hasTasks
+    }
+
+    while (currentLocalityIndex < myLocalityLevels.length - 1) {
+      val moreTasks = myLocalityLevels(currentLocalityIndex) match {
+        case TaskLocality.PROCESS_LOCAL => noMoreTasksToRunIn(pendingTasksForExecutor)
+        case TaskLocality.NODE_LOCAL => noMoreTasksToRunIn(pendingTasksForHost)
+        case TaskLocality.NO_PREF => pendingTasksWithNoPrefs.nonEmpty
--- End diff --

Wait, are we going with moreTasks or noMoreTasks?  I think this needs to be 
`val noMoreTasks`, `case TaskLocality.NO_PREF => 
pendingTasksWithNoPrefs.isEmpty` and `if (noMoreTasks)` a couple lines down -- 
or consistently reverse the boolean sense of all that.
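
For concreteness, a minimal sketch of the positive-sense resolution (names
follow the quoted diff plus the rename suggested in the other thread;
illustrative, not necessarily the PR's final code):

    while (currentLocalityIndex < myLocalityLevels.length - 1) {
      val moreTasks = myLocalityLevels(currentLocalityIndex) match {
        case TaskLocality.PROCESS_LOCAL => moreTasksToRunIn(pendingTasksForExecutor)
        case TaskLocality.NODE_LOCAL => moreTasksToRunIn(pendingTasksForHost)
        case TaskLocality.NO_PREF => pendingTasksWithNoPrefs.nonEmpty
        case TaskLocality.RACK_LOCAL => moreTasksToRunIn(pendingTasksForRack)
      }
      if (!moreTasks) {
        // No pending tasks at this level: advance immediately and reset the timer
        lastLaunchTime = curTime
        currentLocalityIndex += 1
      } else if (curTime - lastLaunchTime >= localityWaits(currentLocalityIndex)) {
        // Locality wait expired: move on, crediting the elapsed wait
        lastLaunchTime += localityWaits(currentLocalityIndex)
        currentLocalityIndex += 1
      } else {
        return myLocalityLevels(currentLocalityIndex)
      }
    }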





[GitHub] spark pull request: [SQL] Minor changes for DataFrame Implementati...

2015-02-03 Thread chenghao-intel
GitHub user chenghao-intel opened a pull request:

https://github.com/apache/spark/pull/4339

[SQL] Minor changes for DataFrame Implementation



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/chenghao-intel/spark dataframe_minor2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/4339.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4339


commit f63614bee7fe098ddc6d41ab0b6bd31b0abb5eca
Author: Cheng Hao hao.ch...@intel.com
Date:   2015-02-03T14:44:45Z

minor change for DataFrame Implementation







[GitHub] spark pull request: [SPARK-4964] [Streaming] Exactly-once semantic...

2015-02-03 Thread dibbhatt
Github user dibbhatt commented on a diff in the pull request:

https://github.com/apache/spark/pull/3798#discussion_r24010956
  
--- Diff: external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaCluster.scala ---
@@ -0,0 +1,344 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.streaming.kafka
+
+import scala.util.control.NonFatal
+import scala.util.Random
+import scala.collection.mutable.ArrayBuffer
+import java.util.Properties
+import kafka.api._
+import kafka.common.{ErrorMapping, OffsetMetadataAndError, TopicAndPartition}
+import kafka.consumer.{ConsumerConfig, SimpleConsumer}
+import org.apache.spark.SparkException
+
+/**
+ * Convenience methods for interacting with a Kafka cluster.
+ * @param kafkaParams Kafka <a href="http://kafka.apache.org/documentation.html#configuration">
+ * configuration parameters</a>.
+ *   Requires "metadata.broker.list" or "bootstrap.servers" to be set with Kafka broker(s),
+ *   NOT zookeeper servers, specified in host1:port1,host2:port2 form
+ */
+private[spark]
+class KafkaCluster(val kafkaParams: Map[String, String]) extends Serializable {
+  import KafkaCluster.{Err, LeaderOffset}
+
+  val seedBrokers: Array[(String, Int)] =
+    kafkaParams.get("metadata.broker.list")
+      .orElse(kafkaParams.get("bootstrap.servers"))
+      .getOrElse(throw new SparkException("Must specify metadata.broker.list or bootstrap.servers"))
+      .split(",").map { hp =>
+        val hpa = hp.split(":")
+        (hpa(0), hpa(1).toInt)
+      }
+
+  // ConsumerConfig isn't serializable
+  @transient private var _config: ConsumerConfig = null
+
+  def config: ConsumerConfig = this.synchronized {
+    if (_config == null) {
+      _config = KafkaCluster.consumerConfig(kafkaParams)
+    }
+    _config
+  }
+
+  def connect(host: String, port: Int): SimpleConsumer =
+    new SimpleConsumer(host, port, config.socketTimeoutMs,
+      config.socketReceiveBufferBytes, config.clientId)
+
+  def connectLeader(topic: String, partition: Int): Either[Err, SimpleConsumer] =
+    findLeader(topic, partition).right.map(hp => connect(hp._1, hp._2))
+
+  // Metadata api
+  // scalastyle:off
+  // https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-MetadataAPI
+  // scalastyle:on
+
+  def findLeader(topic: String, partition: Int): Either[Err, (String, Int)] = {
+    val req = TopicMetadataRequest(TopicMetadataRequest.CurrentVersion,
+      0, config.clientId, Seq(topic))
+    val errs = new Err
+    withBrokers(Random.shuffle(seedBrokers), errs) { consumer =>
+      val resp: TopicMetadataResponse = consumer.send(req)
+      resp.topicsMetadata.find(_.topic == topic).flatMap { tm: TopicMetadata =>
+        tm.partitionsMetadata.find(_.partitionId == partition)
+      }.foreach { pm: PartitionMetadata =>
+        pm.leader.foreach { leader =>
+          return Right((leader.host, leader.port))
+        }
+      }
+    }
+    Left(errs)
+  }
+
+  def findLeaders(
+      topicAndPartitions: Set[TopicAndPartition]
+    ): Either[Err, Map[TopicAndPartition, (String, Int)]] = {
+    val topics = topicAndPartitions.map(_.topic)
+    val response = getPartitionMetadata(topics).right
+    val answer = response.flatMap { tms: Set[TopicMetadata] =>
+      val leaderMap = tms.flatMap { tm: TopicMetadata =>
+        tm.partitionsMetadata.flatMap { pm: PartitionMetadata =>
+          val tp = TopicAndPartition(tm.topic, pm.partitionId)
+          if (topicAndPartitions(tp)) {
+            pm.leader.map { l =>
+              tp -> (l.host -> l.port)
+            }
+          } else {
+            None
+          }
+        }
+      }.toMap
+
+      if (leaderMap.keys.size == topicAndPartitions.size) {
+        Right(leaderMap)
+      } else {
+val 

[GitHub] spark pull request: [SQL] Minor changes for DataFrame Implementati...

2015-02-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4339#issuecomment-72664280
  
  [Test build #26655 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26655/consoleFull)
 for   PR 4339 at commit 
[`f63614b`](https://github.com/apache/spark/commit/f63614bee7fe098ddc6d41ab0b6bd31b0abb5eca).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-5498][SQL]fix bug when query the data w...

2015-02-03 Thread chenghao-intel
Github user chenghao-intel commented on the pull request:

https://github.com/apache/spark/pull/4289#issuecomment-72671869
  
Sorry for the late reply @jeanlyn!
I think it's a bug in Hive DDL, which was probably resolved in Hive 0.14 / 
0.15, and I am not sure we really want to fix it in Spark SQL. @yhuai, do 
you have any comment on this?
However, in this particular case, there is another workaround for your 
production setup (sketched below):
1) Rename the existing table;
2) Create a new table with the schema you altered, and also the partitions;
3) Manually move the previous data into the new table's folder on HDFS;
4) Drop the old table.
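
A rough sketch of those steps using the `sql(...)` style from the test suite
quoted elsewhere in this thread (table and column names are hypothetical):

    sql("ALTER TABLE events RENAME TO events_old")  // 1) rename the existing table
    // 2) recreate it with the altered schema and the same partitioning
    sql("CREATE TABLE events (key BIGINT, value STRING) PARTITIONED BY (ds STRING)")
    // 3) move the old partition directories under the new table's folder on HDFS
    //    (e.g. `hadoop fs -mv .../events_old/ds=1 .../events/ds=1`), then register them:
    sql("ALTER TABLE events ADD PARTITION (ds='1')")
    sql("DROP TABLE events_old")  // 4) drop the old table once the data is verified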





[GitHub] spark pull request: [STREAMING] SPARK-4986 Wait for receivers to d...

2015-02-03 Thread cleaton
GitHub user cleaton opened a pull request:

https://github.com/apache/spark/pull/4338

[STREAMING] SPARK-4986 Wait for receivers to deregister and receiver job to 
terminate

A slow receiver might not have enough time to shut down cleanly even when 
graceful shutdown is used. This PR extends the graceful wait to make sure all 
receivers have deregistered and that the receiver job has terminated.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cleaton/spark stopreceivers

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/4338.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4338


commit d179372ae1d93adf67f8cd7ea818c443c584a8a0
Author: Jesper Lundgren jesper.lundg...@vpon.com
Date:   2015-01-25T14:31:19Z

Add graceful shutdown unit test covering slow receiver onStop

Signed-off-by: Jesper Lundgren jesper.lundg...@vpon.com

commit 9a9ff884d878824548f9a08d724f60b0c14f8310
Author: Jesper Lundgren jesper.lundg...@vpon.com
Date:   2015-01-30T07:34:03Z

wait for receivers to shutdown and receiver job to terminate

Signed-off-by: Jesper Lundgren jesper.lundg...@vpon.com

commit 3d0bd3501d24a8ab5e10e1867b464d471b3bbb67
Author: Jesper Lundgren jesper.lundg...@vpon.com
Date:   2015-01-30T07:41:00Z

switch boleans to match running status instead of terminated

Signed-off-by: Jesper Lundgren jesper.lundg...@vpon.com







[GitHub] spark pull request: [STREAMING] SPARK-4986 Wait for receivers to d...

2015-02-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4338#issuecomment-72662235
  
Can one of the admins verify this patch?





[GitHub] spark pull request: [SQL] Minor changes for dataframe implementati...

2015-02-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4336#issuecomment-72667971
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26653/
Test PASSed.





[GitHub] spark pull request: [SQL] Minor changes for dataframe implementati...

2015-02-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4336#issuecomment-72667963
  
  [Test build #26653 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26653/consoleFull)
 for   PR 4336 at commit 
[`3293408`](https://github.com/apache/spark/commit/3293408b812944735e03f2a41221851faffb3669).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [STREAMING] SPARK-4986 Wait for receivers to d...

2015-02-03 Thread cleaton
Github user cleaton commented on the pull request:

https://github.com/apache/spark/pull/3868#issuecomment-72662664
  
@tdas I have created a new master branch PR. You can find it here: 
https://github.com/apache/spark/pull/4338





[GitHub] spark pull request: [SPARK-5498][SQL]fix bug when query the data w...

2015-02-03 Thread chenghao-intel
Github user chenghao-intel commented on a diff in the pull request:

https://github.com/apache/spark/pull/4289#discussion_r24010127
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertIntoHiveTableSuite.scala ---
@@ -172,4 +172,19 @@ class InsertIntoHiveTableSuite extends QueryTest {
 
     sql("DROP TABLE hiveTableWithStructValue")
   }
+
+  test("SPARK-5498:partition schema does not match table schema") {
+    val testData = TestHive.sparkContext.parallelize(
+      (1 to 10).map(i => TestData(i, i.toString)))
+    testData.registerTempTable("testData")
+    val tmpDir = Files.createTempDir()
+    sql(s"CREATE TABLE table_with_partition(key int,value string) PARTITIONED by (ds string) location '${tmpDir.toURI.toString}' ")
+    sql("INSERT OVERWRITE TABLE table_with_partition partition (ds='1') SELECT key,value FROM testData")
+    sql("ALTER TABLE table_with_partition CHANGE COLUMN key key BIGINT")
--- End diff --

I just checked the [Hive 
Document](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable).
It says:
`The CASCADE|RESTRICT clause is available in Hive 0.15.0. ALTER TABLE 
CHANGE COLUMN with CASCADE command changes the columns of a table's metadata, 
and cascades the same change to all the partition metadata. RESTRICT is the 
default, limiting column change only to table metadata.`
I guess that in Hive 0.13.1, when the table schema is changed via `ALTER 
TABLE`, only the table metadata is updated; can you double-check whether the 
above query works for Hive 0.13.1?
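
For reference, a sketch of the CASCADE variant the quoted doc describes
(needs Hive 0.15.0+ per that doc; reusing the test's table name for
illustration):

    // Changes the column type in the table metadata AND in all partition
    // metadata, avoiding the partition/table schema mismatch above.
    sql("ALTER TABLE table_with_partition CHANGE COLUMN key key BIGINT CASCADE")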





[GitHub] spark pull request: [SPARK-5559] [Streaming] [Test] Remove oppotun...

2015-02-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4337#issuecomment-72669182
  
  [Test build #26654 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26654/consoleFull)
 for   PR 4337 at commit 
[`8212e42`](https://github.com/apache/spark/commit/8212e42cfd79b5b92c7c664a9ecff7da68e062a5).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-5559] [Streaming] [Test] Remove oppotun...

2015-02-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4337#issuecomment-72669193
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26654/
Test PASSed.





[GitHub] spark pull request: [SPARK-1405][MLLIB] LDA on GraphX

2015-02-03 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/2388





[GitHub] spark pull request: [SPARK-1405] [mllib] Latent Dirichlet Allocati...

2015-02-03 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/4047





[GitHub] spark pull request: [minor] update streaming linear algorithms

2015-02-03 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/4329#issuecomment-72610576
  
This is a minor update. I've merged this into master.





[GitHub] spark pull request: [SPARK-4987] [SQL] parquet timestamp type supp...

2015-02-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3820#issuecomment-72611319
  
  [Test build #26622 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26622/consoleFull)
 for   PR 3820 at commit 
[`b1e2a0d`](https://github.com/apache/spark/commit/b1e2a0d8b40f6651a0a2b36cdc9070e67e9d6bf3).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4879] [WIP] Use driver to coordinate Ha...

2015-02-03 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/4066#discussion_r23988557
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala ---
@@ -0,0 +1,179 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.scheduler
+
+import scala.collection.mutable
+
+import akka.actor.{ActorRef, Actor}
+
+import org.apache.spark._
+import org.apache.spark.util.{AkkaUtils, ActorLogReceive}
+
+private[spark] sealed trait OutputCommitCoordinationMessage extends 
Serializable
+
+private[spark] case class StageStarted(stage: Int) extends 
OutputCommitCoordinationMessage
+private[spark] case class StageEnded(stage: Int) extends 
OutputCommitCoordinationMessage
+private[spark] case object StopCoordinator extends 
OutputCommitCoordinationMessage
+
+private[spark] case class AskPermissionToCommitOutput(
+stage: Int,
+task: Long,
+taskAttempt: Long)
+extends OutputCommitCoordinationMessage
+
+private[spark] case class TaskCompleted(
+stage: Int,
+task: Long,
+attempt: Long,
+reason: TaskEndReason)
+extends OutputCommitCoordinationMessage
+
+/**
+ * Authority that decides whether tasks can commit output to HDFS.
+ *
+ * This lives on the driver, but the actor allows the tasks that commit
+ * to Hadoop to invoke it.
+ */
+private[spark] class OutputCommitCoordinator(conf: SparkConf) extends 
Logging {
+
+  // Initialized by SparkEnv
+  var coordinatorActor: Option[ActorRef] = None
+  private val timeout = AkkaUtils.askTimeout(conf)
+  private val maxAttempts = AkkaUtils.numRetries(conf)
+  private val retryInterval = AkkaUtils.retryWaitMs(conf)
+
+  private type StageId = Int
+  private type TaskId = Long
+  private type TaskAttemptId = Long
+  private type CommittersByStageMap = mutable.Map[StageId, 
mutable.Map[TaskId, TaskAttemptId]]
+
+  private val authorizedCommittersByStage: CommittersByStageMap = 
mutable.Map()
+
+  def stageStart(stage: StageId) {
+sendToActor(StageStarted(stage))
+  }
+  def stageEnd(stage: StageId) {
+sendToActor(StageEnded(stage))
+  }
+
+  def canCommit(
+  stage: StageId,
+  task: TaskId,
+  attempt: TaskAttemptId): Boolean = {
+askActor(AskPermissionToCommitOutput(stage, task, attempt))
+  }
+
+  def taskCompleted(
+  stage: StageId,
+  task: TaskId,
+  attempt: TaskAttemptId,
+  reason: TaskEndReason) {
+sendToActor(TaskCompleted(stage, task, attempt, reason))
+  }
+
+  def stop() {
+sendToActor(StopCoordinator)
+coordinatorActor = None
+authorizedCommittersByStage.foreach(_._2.clear)
+authorizedCommittersByStage.clear
+  }
+
+  private def handleStageStart(stage: StageId): Unit = {
+authorizedCommittersByStage(stage) = mutable.HashMap[TaskId, 
TaskAttemptId]()
+  }
+
+  private def handleStageEnd(stage: StageId): Unit = {
+authorizedCommittersByStage.remove(stage)
+  }
+
+  private def handleAskPermissionToCommit(
+  stage: StageId,
+  task: TaskId,
+  attempt: TaskAttemptId):
+  Boolean = {
+authorizedCommittersByStage.get(stage) match {
+      case Some(authorizedCommitters) =>
+authorizedCommitters.get(stage) match {
--- End diff --

Alas, `StageId` and `TaskId` are mere type aliases, not actual types, so it 
looks like there's a subtle typo here that could have been caught by stronger 
typechecking: `authorizedCommitters` is indexed by `TaskId`, not `StageId`. I 
think this is why the streaming checkpoint suite tests were failing.
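
A sketch of how stronger typing could catch this class of typo (value classes
instead of the PR's type aliases; hypothetical, not part of the patch):

    // With wrappers, passing a StageId where a TaskId is expected no longer compiles.
    case class StageId(id: Int) extends AnyVal
    case class TaskId(id: Long) extends AnyVal
    case class TaskAttemptId(id: Long) extends AnyVal

    val authorizedCommitters = scala.collection.mutable.Map[TaskId, TaskAttemptId]()
    // authorizedCommitters.get(stage)  // type error: found StageId, required TaskId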



[GitHub] spark pull request: [SPARK-4964] [Streaming] Exactly-once semantic...

2015-02-03 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/3798#discussion_r23988445
  
--- Diff: external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaUtils.scala ---
@@ -144,4 +150,174 @@ object KafkaUtils {
     createStream[K, V, U, T](
       jssc.ssc, kafkaParams.toMap, Map(topics.mapValues(_.intValue()).toSeq: _*), storageLevel)
   }
+
+  /** A batch-oriented interface for consuming from Kafka.
+   * Starting and ending offsets are specified in advance,
+   * so that you can control exactly-once semantics.
+   * @param sc SparkContext object
+   * @param kafkaParams Kafka <a href="http://kafka.apache.org/documentation.html#configuration">
+   * configuration parameters</a>.
+   *   Requires "metadata.broker.list" or "bootstrap.servers" to be set with Kafka broker(s),
+   *   NOT zookeeper servers, specified in host1:port1,host2:port2 form.
+   * @param offsetRanges Each OffsetRange in the batch corresponds to a
+   *   range of offsets for a given Kafka topic/partition
+   */
+  @Experimental
+  def createRDD[
+    K: ClassTag,
+    V: ClassTag,
+    U <: Decoder[_]: ClassTag,
+    T <: Decoder[_]: ClassTag] (
+      sc: SparkContext,
+      kafkaParams: Map[String, String],
+      offsetRanges: Array[OffsetRange]
+  ): RDD[(K, V)] with HasOffsetRanges = {
--- End diff --

I noticed that KafkaRDD isn't exposed, so maybe this is why. Not sure I see 
a big issue with exposing KafkaRDD and its constructor given that it's 
basically the same level of visibility as this static factory function.





[GitHub] spark pull request: [SPARK-4964] [Streaming] Exactly-once semantic...

2015-02-03 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/3798#discussion_r23988871
  
--- Diff: external/kafka/src/main/scala/org/apache/spark/streaming/kafka/OffsetRange.scala ---
@@ -0,0 +1,69 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.streaming.kafka
+
+import kafka.common.TopicAndPartition
+
+/** Something that has a collection of OffsetRanges */
+trait HasOffsetRanges {
+  def offsetRanges: Array[OffsetRange]
+}
+
+/** Represents a range of offsets from a single Kafka TopicAndPartition */
+final class OffsetRange private(
+    /** kafka topic name */
+    val topic: String,
+    /** kafka partition id */
+    val partition: Int,
+    /** inclusive starting offset */
+    val fromOffset: Long,
+    /** exclusive ending offset */
+    val untilOffset: Long) extends Serializable {
+  import OffsetRange.OffsetRangeTuple
+
+  /** this is to avoid ClassNotFoundException during checkpoint restore */
+  private[streaming]
+  def toTuple: OffsetRangeTuple = (topic, partition, fromOffset, untilOffset)
+}
+
+object OffsetRange {
+  private[spark]
+  type OffsetRangeTuple = (String, Int, Long, Long)
+
+  def create(topic: String, partition: Int, fromOffset: Long, untilOffset: Long): OffsetRange =
--- End diff --

It's confusing to have both `create` and the apply methods here. Why not 
just have one way of creating these?
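
(For comparison, a minimal sketch of the single-entry-point alternative; it assumes the private constructor stays as-is and only the companion `apply` remains:)

```
final class OffsetRange private(
    val topic: String,
    val partition: Int,
    val fromOffset: Long,
    val untilOffset: Long) extends Serializable

object OffsetRange {
  // One way to construct; Java callers can still use the static forwarder
  // OffsetRange.apply(...) that scalac generates for the companion object.
  def apply(topic: String, partition: Int, fromOffset: Long, untilOffset: Long): OffsetRange =
    new OffsetRange(topic, partition, fromOffset, untilOffset)
}
```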





[GitHub] spark pull request: [SPARK-4879] [WIP] Use driver to coordinate Ha...

2015-02-03 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/4066#discussion_r23988843
  
--- Diff: core/src/main/scala/org/apache/spark/SparkHadoopWriter.scala ---
@@ -105,24 +106,61 @@ class SparkHadoopWriter(@transient jobConf: JobConf)
   def commit() {
     val taCtxt = getTaskContext()
     val cmtr = getOutputCommitter()
-    if (cmtr.needsTaskCommit(taCtxt)) {
+
+    // Called after we have decided to commit
+    def performCommit(): Unit = {
       try {
         cmtr.commitTask(taCtxt)
-        logInfo (taID + ": Committed")
+        logInfo (s"$taID: Committed")
       } catch {
-        case e: IOException => {
+        case e: IOException =>
           logError("Error committing the output of task: " + taID.value, e)
           cmtr.abortTask(taCtxt)
           throw e
+      }
+    }
+
+    // First, check whether the task's output has already been committed by some other attempt
+    if (cmtr.needsTaskCommit(taCtxt)) {
+      // The task output needs to be committed, but we don't know whether some other task attempt
+      // might be racing to commit the same output partition. Therefore, coordinate with the driver
+      // in order to determine whether this attempt can commit.
+      val shouldCoordinateWithDriver: Boolean = {
+        val sparkConf = SparkEnv.get.conf
+        // We only need to coordinate with the driver if there are multiple concurrent task
+        // attempts, which should only occur if speculation is enabled
--- End diff --

Can anyone think of cases where this assumption would be violated? Can
this ever be violated due to, say, transient network failures?
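
(For context, the gating condition that the truncated hunk above builds toward presumably reads the speculation flag; a minimal sketch, assuming only that the standard `spark.speculation` setting is what enables concurrent attempts:)

```
import org.apache.spark.SparkConf

// Sketch: coordinate with the driver only when multiple concurrent
// attempts of the same task are possible, i.e. when speculation is on.
def shouldCoordinateWithDriver(conf: SparkConf): Boolean =
  conf.getBoolean("spark.speculation", false)
```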





[GitHub] spark pull request: [SPARK-5551][SQL] Create type alias for Schema...

2015-02-03 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/4327





[GitHub] spark pull request: [SQL][DataFrame] Remove DataFrameApi, Expressi...

2015-02-03 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/4328





[GitHub] spark pull request: [SPARK-4795][Core] Redesign the primitive typ...

2015-02-03 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/3642#discussion_r23989500
  
--- Diff: graphx/src/test/scala/org/apache/spark/graphx/lib/ShortestPathsSuite.scala ---
@@ -40,7 +40,7 @@ class ShortestPathsSuite extends FunSuite with LocalSparkContext {
       val graph = Graph.fromEdgeTuples(edges, 1)
       val landmarks = Seq(1, 4).map(_.toLong)
       val results = ShortestPaths.run(graph, landmarks).vertices.collect.map {
-        case (v, spMap) => (v, spMap.mapValues(_.get))
--- End diff --

Do you have an example? I added `implicit` to them and the code compiled
successfully.





[GitHub] spark pull request: [SPARK-5549] Define TaskContext interface in S...

2015-02-03 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/4324#issuecomment-72613931
  
Going to merge since tests passed previously, and the latest failure was 
due to a flaky test in streaming.






[GitHub] spark pull request: [SPARK-5278][SQL] complete the check of ambigu...

2015-02-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4068#issuecomment-72614165
  
  [Test build #26641 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26641/consoleFull)
 for   PR 4068 at commit 
[`3295858`](https://github.com/apache/spark/commit/3295858b6d7e1738c11837207f564b4f2c4c503c).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4508] [SQL] build native date type to c...

2015-02-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4325#issuecomment-72614700
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26639/
Test FAILed.





[GitHub] spark pull request: [SPARK-4508] [SQL] build native date type to c...

2015-02-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4325#issuecomment-72614694
  
  [Test build #26639 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26639/consoleFull)
 for   PR 4325 at commit 
[`e46735c`](https://github.com/apache/spark/commit/e46735c612879bb46317efe73155b4611bb51afc).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-5426][SQL] Add SparkSQL Java API helper...

2015-02-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4243#issuecomment-72614630
  
Can one of the admins verify this patch?





[GitHub] spark pull request: [SPARK-4964] [Streaming] Exactly-once semantic...

2015-02-03 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/3798#discussion_r23989829
  
--- Diff: external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaUtils.scala ---
@@ -144,4 +150,174 @@ object KafkaUtils {
     createStream[K, V, U, T](
       jssc.ssc, kafkaParams.toMap, Map(topics.mapValues(_.intValue()).toSeq: _*), storageLevel)
   }
+
+  /** A batch-oriented interface for consuming from Kafka.
+   * Starting and ending offsets are specified in advance,
+   * so that you can control exactly-once semantics.
+   * @param sc SparkContext object
+   * @param kafkaParams Kafka <a href="http://kafka.apache.org/documentation.html#configuration">
+   * configuration parameters</a>.
+   *   Requires "metadata.broker.list" or "bootstrap.servers" to be set with Kafka broker(s),
+   *   NOT zookeeper servers, specified in host1:port1,host2:port2 form.
+   * @param offsetRanges Each OffsetRange in the batch corresponds to a
+   *   range of offsets for a given Kafka topic/partition
+   */
+  @Experimental
+  def createRDD[
+    K: ClassTag,
+    V: ClassTag,
+    U <: Decoder[_]: ClassTag,
+    T <: Decoder[_]: ClassTag] (
+      sc: SparkContext,
+      kafkaParams: Map[String, String],
+      offsetRanges: Array[OffsetRange]
+  ): RDD[(K, V)] with HasOffsetRanges = {
+    val messageHandler = (mmd: MessageAndMetadata[K, V]) => (mmd.key, mmd.message)
+    val kc = new KafkaCluster(kafkaParams)
+    val topics = offsetRanges.map(o => TopicAndPartition(o.topic, o.partition)).toSet
+    val leaders = kc.findLeaders(topics).fold(
+      errs => throw new SparkException(errs.mkString("\n")),
+      ok => ok
+    )
+    new KafkaRDD[K, V, U, T, (K, V)](sc, kafkaParams, offsetRanges, leaders, messageHandler)
+  }
+
+  /** A batch-oriented interface for consuming from Kafka.
+   * Starting and ending offsets are specified in advance,
+   * so that you can control exactly-once semantics.
+   * @param sc SparkContext object
+   * @param kafkaParams Kafka <a href="http://kafka.apache.org/documentation.html#configuration">
+   * configuration parameters</a>.
+   *   Requires "metadata.broker.list" or "bootstrap.servers" to be set with Kafka broker(s),
+   *   NOT zookeeper servers, specified in host1:port1,host2:port2 form.
+   * @param offsetRanges Each OffsetRange in the batch corresponds to a
+   *   range of offsets for a given Kafka topic/partition
+   * @param leaders Kafka leaders for each offset range in batch
+   * @param messageHandler function for translating each message into the desired type
--- End diff --

How is this message handler different from having the user just call a map
function on a returned RDD? It seems a little risky because it exposes a
Kafka class in the bytecode signature, which Kafka could relocate in a future
release in a way that causes this to break for callers.
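
(For comparison, a rough sketch of the map-based alternative, using the simpler `createRDD` overload above and assuming `sc`, `kafkaParams` and `offsetRanges` are already in scope. The trade-off: a plain `map` over the `(key, value)` pairs never sees per-record metadata such as offset or partition, which is what the handler gives access to.)

```
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Instead of passing a messageHandler into createRDD, transform afterwards:
val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
  sc, kafkaParams, offsetRanges)
val projected = rdd.map { case (key, value) => s"$key -> $value" }
```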





[GitHub] spark pull request: [SPARK-4964] [Streaming] Exactly-once semantic...

2015-02-03 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/3798#discussion_r23989786
  
--- Diff: external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaUtils.scala ---
@@ -144,4 +150,174 @@ object KafkaUtils {
     createStream[K, V, U, T](
       jssc.ssc, kafkaParams.toMap, Map(topics.mapValues(_.intValue()).toSeq: _*), storageLevel)
   }
+
+  /** A batch-oriented interface for consuming from Kafka.
+   * Starting and ending offsets are specified in advance,
+   * so that you can control exactly-once semantics.
+   * @param sc SparkContext object
+   * @param kafkaParams Kafka <a href="http://kafka.apache.org/documentation.html#configuration">
+   * configuration parameters</a>.
+   *   Requires "metadata.broker.list" or "bootstrap.servers" to be set with Kafka broker(s),
+   *   NOT zookeeper servers, specified in host1:port1,host2:port2 form.
+   * @param offsetRanges Each OffsetRange in the batch corresponds to a
+   *   range of offsets for a given Kafka topic/partition
+   */
+  @Experimental
+  def createRDD[
+    K: ClassTag,
+    V: ClassTag,
+    U <: Decoder[_]: ClassTag,
+    T <: Decoder[_]: ClassTag] (
+      sc: SparkContext,
+      kafkaParams: Map[String, String],
+      offsetRanges: Array[OffsetRange]
+  ): RDD[(K, V)] with HasOffsetRanges = {
+    val messageHandler = (mmd: MessageAndMetadata[K, V]) => (mmd.key, mmd.message)
+    val kc = new KafkaCluster(kafkaParams)
+    val topics = offsetRanges.map(o => TopicAndPartition(o.topic, o.partition)).toSet
+    val leaders = kc.findLeaders(topics).fold(
+      errs => throw new SparkException(errs.mkString("\n")),
+      ok => ok
+    )
+    new KafkaRDD[K, V, U, T, (K, V)](sc, kafkaParams, offsetRanges, leaders, messageHandler)
+  }
+
+  /** A batch-oriented interface for consuming from Kafka.
+   * Starting and ending offsets are specified in advance,
+   * so that you can control exactly-once semantics.
+   * @param sc SparkContext object
+   * @param kafkaParams Kafka <a href="http://kafka.apache.org/documentation.html#configuration">
+   * configuration parameters</a>.
+   *   Requires "metadata.broker.list" or "bootstrap.servers" to be set with Kafka broker(s),
+   *   NOT zookeeper servers, specified in host1:port1,host2:port2 form.
+   * @param offsetRanges Each OffsetRange in the batch corresponds to a
+   *   range of offsets for a given Kafka topic/partition
+   * @param leaders Kafka leaders for each offset range in batch
+   * @param messageHandler function for translating each message into the desired type
+   */
+  @Experimental
+  def createRDD[
+    K: ClassTag,
+    V: ClassTag,
+    U <: Decoder[_]: ClassTag,
+    T <: Decoder[_]: ClassTag,
+    R: ClassTag] (
+      sc: SparkContext,
+      kafkaParams: Map[String, String],
+      offsetRanges: Array[OffsetRange],
+      leaders: Array[Leader],
--- End diff --

Is this version of the constructor assuming that the caller has their own
code for finding the leaders? From what I can tell we've locked down the
utility function for doing that.





[GitHub] spark pull request: [SPARK-4795][Core] Redesign the primitive typ...

2015-02-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3642#issuecomment-72615909
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26636/
Test FAILed.





[GitHub] spark pull request: [SQL][DataFrame] Remove DataFrameApi, Expressi...

2015-02-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4328#issuecomment-72615989
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26634/
Test PASSed.





[GitHub] spark pull request: [SQL][DataFrame] Remove DataFrameApi, Expressi...

2015-02-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4328#issuecomment-72615985
  
  [Test build #26634 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26634/consoleFull)
 for   PR 4328 at commit 
[`723d600`](https://github.com/apache/spark/commit/723d60054b369573bbe8035cad266aef9300356b).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4964] [Streaming] Exactly-once semantic...

2015-02-03 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/3798#discussion_r23990370
  
--- Diff: external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaUtils.scala ---
@@ -144,4 +150,174 @@ object KafkaUtils {
     createStream[K, V, U, T](
       jssc.ssc, kafkaParams.toMap, Map(topics.mapValues(_.intValue()).toSeq: _*), storageLevel)
   }
+
+  /** A batch-oriented interface for consuming from Kafka.
+   * Starting and ending offsets are specified in advance,
+   * so that you can control exactly-once semantics.
+   * @param sc SparkContext object
+   * @param kafkaParams Kafka <a href="http://kafka.apache.org/documentation.html#configuration">
+   * configuration parameters</a>.
+   *   Requires "metadata.broker.list" or "bootstrap.servers" to be set with Kafka broker(s),
+   *   NOT zookeeper servers, specified in host1:port1,host2:port2 form.
+   * @param offsetRanges Each OffsetRange in the batch corresponds to a
+   *   range of offsets for a given Kafka topic/partition
+   */
+  @Experimental
+  def createRDD[
+    K: ClassTag,
+    V: ClassTag,
+    U <: Decoder[_]: ClassTag,
+    T <: Decoder[_]: ClassTag] (
+      sc: SparkContext,
+      kafkaParams: Map[String, String],
+      offsetRanges: Array[OffsetRange]
+  ): RDD[(K, V)] with HasOffsetRanges = {
+    val messageHandler = (mmd: MessageAndMetadata[K, V]) => (mmd.key, mmd.message)
+    val kc = new KafkaCluster(kafkaParams)
+    val topics = offsetRanges.map(o => TopicAndPartition(o.topic, o.partition)).toSet
+    val leaders = kc.findLeaders(topics).fold(
+      errs => throw new SparkException(errs.mkString("\n")),
+      ok => ok
+    )
+    new KafkaRDD[K, V, U, T, (K, V)](sc, kafkaParams, offsetRanges, leaders, messageHandler)
+  }
+
+  /** A batch-oriented interface for consuming from Kafka.
+   * Starting and ending offsets are specified in advance,
+   * so that you can control exactly-once semantics.
+   * @param sc SparkContext object
+   * @param kafkaParams Kafka <a href="http://kafka.apache.org/documentation.html#configuration">
+   * configuration parameters</a>.
+   *   Requires "metadata.broker.list" or "bootstrap.servers" to be set with Kafka broker(s),
+   *   NOT zookeeper servers, specified in host1:port1,host2:port2 form.
+   * @param offsetRanges Each OffsetRange in the batch corresponds to a
+   *   range of offsets for a given Kafka topic/partition
+   * @param leaders Kafka leaders for each offset range in batch
+   * @param messageHandler function for translating each message into the desired type
+   */
+  @Experimental
+  def createRDD[
+    K: ClassTag,
+    V: ClassTag,
+    U <: Decoder[_]: ClassTag,
+    T <: Decoder[_]: ClassTag,
+    R: ClassTag] (
+      sc: SparkContext,
+      kafkaParams: Map[String, String],
+      offsetRanges: Array[OffsetRange],
+      leaders: Array[Leader],
+      messageHandler: MessageAndMetadata[K, V] => R
+  ): RDD[R] with HasOffsetRanges = {
+
+    val leaderMap = leaders
+      .map(l => TopicAndPartition(l.topic, l.partition) -> (l.host, l.port))
+      .toMap
+    new KafkaRDD[K, V, U, T, R](sc, kafkaParams, offsetRanges, leaderMap, messageHandler)
+  }
+
+  /**
+   * This stream can guarantee that each message from Kafka is included in transformations
+   * (as opposed to output actions) exactly once, even in most failure situations.
+   *
+   * Points to note:
+   *
+   * Failure Recovery - You must checkpoint this stream, or save offsets yourself and provide them
+   * as the fromOffsets parameter on restart.
+   * Kafka must have sufficient log retention to obtain messages after failure.
+   *
+   * Getting offsets from the stream - see programming guide
+   *
+   * Zookeeper - This does not use Zookeeper to store offsets.  For interop with Kafka monitors
+   * that depend on Zookeeper, you must store offsets in ZK yourself.
+   *
+   * End-to-end semantics - This does not guarantee that any output operation will push each record
+   * exactly once. To ensure end-to-end exactly-once semantics (that is, receiving exactly once and
+   * outputting exactly once), you have to either ensure that the output operation is
+   * idempotent, or transactionally store offsets with the output. See the programming guide for
+   * more details.
+   *
+   * @param ssc StreamingContext object
+   * @param kafkaParams Kafka <a href="http://kafka.apache.org/documentation.html#configuration">
+   * configuration parameters</a>.
+   *   Requires "metadata.broker.list" or "bootstrap.servers" to be set with Kafka broker(s),
+   *   NOT zookeeper servers, specified in

[GitHub] spark pull request: [SPARK-4795][Core] Redesign the primitive typ...

2015-02-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3642#issuecomment-72615904
  
  [Test build #26636 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26636/consoleFull)
 for   PR 3642 at commit 
[`0b9017f`](https://github.com/apache/spark/commit/0b9017fef57e5512d539146fafd9aa1e12b966ae).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-1405] [mllib] Latent Dirichlet Allocati...

2015-02-03 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/4047#issuecomment-72609253
  
LGTM. Merged into master. Thanks everyone for collaborating on LDA!
@jkbradley Please create follow-up JIRAs and see who is interested in working
on LDA features.





[GitHub] spark pull request: [SPARK-5550] [SQL] Support the case insensitiv...

2015-02-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4326#issuecomment-72609562
  
  [Test build #26625 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26625/consoleFull)
 for   PR 4326 at commit 
[`485cf66`](https://github.com/apache/spark/commit/485cf66a92ce31b53153b87f178e335cd52a2f97).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class SimpleFunctionRegistry(val caseSensitive: Boolean) extends FunctionRegistry `
  * `class StringKeyHashMap[T](normalizer: (String) => String) `






[GitHub] spark pull request: [SQL][DataFrame] Remove DataFrameApi, Expressi...

2015-02-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4328#issuecomment-72610285
  
  [Test build #26626 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26626/consoleFull)
 for   PR 4328 at commit 
[`e00ffcb`](https://github.com/apache/spark/commit/e00ffcb83bc3485ed7381ec2ab8bbc6d2878ee9e).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-5554] [SQL] [PySpark] add more tests fo...

2015-02-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4331#issuecomment-72610330
  
  [Test build #26635 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26635/consoleFull)
 for   PR 4331 at commit 
[`3ab2661`](https://github.com/apache/spark/commit/3ab26614b5278edce6e8571e5c51fe0b67e3124e).
 * This patch merges cleanly.





[GitHub] spark pull request: [minor] update streaming linear algorithms

2015-02-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4329#issuecomment-72610392
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26628/
Test PASSed.





[GitHub] spark pull request: [SPARK-4795][Core] Redesign the primitive typ...

2015-02-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3642#issuecomment-72610354
  
  [Test build #26636 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26636/consoleFull)
 for   PR 3642 at commit 
[`0b9017f`](https://github.com/apache/spark/commit/0b9017fef57e5512d539146fafd9aa1e12b966ae).
 * This patch merges cleanly.





[GitHub] spark pull request: [minor] update streaming linear algorithms

2015-02-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4329#issuecomment-72610387
  
  [Test build #26628 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26628/consoleFull)
 for   PR 4329 at commit 
[`78731e1`](https://github.com/apache/spark/commit/78731e165b370eafd575c93d9601237e211ec75c).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SQL][DataFrame] Remove DataFrameApi, Expressi...

2015-02-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4328#issuecomment-72610291
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26626/
Test PASSed.





[GitHub] spark pull request: [SPARK-4508] [SQL] build native date type to c...

2015-02-03 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/4325#discussion_r23988701
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/DateUtils.scala ---
@@ -0,0 +1,60 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.types
+
+import java.sql.Date
+import java.util.{Calendar, TimeZone}
+
+import org.apache.spark.sql.catalyst.expressions.Cast
+
+/**
+ * helper function to convert between Int value of days since 1970-01-01 and java.sql.Date
+ */
+object DateUtils {
--- End diff --

If it's not necessary, just do not make it public, or we will not be able to
change it anymore.
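
(A minimal sketch of the suggestion, assuming package-private scoping is enough for the current callers:)

```
package org.apache.spark.sql.types

// Keep the helper out of the public API surface until it has to be there.
private[sql] object DateUtils {
  // Int (days since 1970-01-01) <-> java.sql.Date conversions live here
}
```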





[GitHub] spark pull request: [SPARK-5182] [SPARK-5528] [SPARK-5509] [SPARK-...

2015-02-03 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/4308#discussion_r23988104
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala ---
@@ -175,7 +175,10 @@ case class EqualTo(left: Expression, right: Expression) extends BinaryComparison
       null
     } else {
       val r = right.eval(input)
-      if (r == null) null else l == r
+      if (r == null) null
+      else if (left.dataType != BinaryType) l == r
+      else BinaryType.ordering.compare(
+        l.asInstanceOf[Array[Byte]], r.asInstanceOf[Array[Byte]]) == 0
--- End diff --

Filed SPARK-5553 to track this. I'd like to make sure equality comparison 
for binary types works properly in this PR. Also, we're already using 
`Ordering` to compare binary values in `LessThan` and `GreaterThan` etc., so at 
least this isn't a regression.
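
(The underlying pitfall, as a small self-contained sketch: on the JVM, `==` on `Array[Byte]` is reference equality, so an element-wise check is required, whether via an `Ordering` as in the patch or otherwise.)

```
val a = Array[Byte](1, 2, 3)
val b = Array[Byte](1, 2, 3)

a == b                          // false: arrays compare by reference
java.util.Arrays.equals(a, b)   // true: element-wise equality
a.sameElements(b)               // true: Scala's element-wise check
```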





[GitHub] spark pull request: [SPARK-4964] [Streaming] Exactly-once semantic...

2015-02-03 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/3798#discussion_r23988318
  
--- Diff: external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaUtils.scala ---
@@ -144,4 +150,174 @@ object KafkaUtils {
     createStream[K, V, U, T](
       jssc.ssc, kafkaParams.toMap, Map(topics.mapValues(_.intValue()).toSeq: _*), storageLevel)
   }
+
+  /** A batch-oriented interface for consuming from Kafka.
+   * Starting and ending offsets are specified in advance,
+   * so that you can control exactly-once semantics.
+   * @param sc SparkContext object
+   * @param kafkaParams Kafka <a href="http://kafka.apache.org/documentation.html#configuration">
+   * configuration parameters</a>.
+   *   Requires "metadata.broker.list" or "bootstrap.servers" to be set with Kafka broker(s),
+   *   NOT zookeeper servers, specified in host1:port1,host2:port2 form.
+   * @param offsetRanges Each OffsetRange in the batch corresponds to a
+   *   range of offsets for a given Kafka topic/partition
+   */
+  @Experimental
+  def createRDD[
+    K: ClassTag,
+    V: ClassTag,
+    U <: Decoder[_]: ClassTag,
+    T <: Decoder[_]: ClassTag] (
+      sc: SparkContext,
+      kafkaParams: Map[String, String],
+      offsetRanges: Array[OffsetRange]
+  ): RDD[(K, V)] with HasOffsetRanges = {
--- End diff --

I've never seen a trait mixin in a return type. What does this actually 
mean? I looked at the compiled byte code and the byte code signature is still 
RDD.

Can we just return a `KafkaRDD` here? If this is enforced somehow by the 
scala compiler, returning an interface here ties our hands in the future, 
because we can't add functionality to the returned type without breaking binary 
compatibility. For instance, we may want to return an RDD that has additional 
methods beyond just accessing its offset ranges.

I ran a simple example and I couldn't see any byte code reference to the
mixed-in trait:

```
trait Trait {}

class Class extends Trait {}

object Object {
  def getTrait: Class with Trait = {new Class()}
}

> javap -v Object
  public static Class getTrait();
    flags: ACC_PUBLIC, ACC_STATIC
    Code:
      stack=1, locals=0, args_size=0
         0: getstatic     #16    // Field Object$.MODULE$:LObject$;
         3: invokevirtual #18    // Method Object$.getTrait:()LClass;
         6: areturn
```
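
(One note on the source-level side: even though erasure drops the mixin from the bytecode signature, scalac does enforce the compound type at call sites. A tiny sketch:)

```
trait Extras
class Base
class Both extends Base with Extras

object CompoundTypeSketch {
  def wantsBoth(x: Base with Extras): Unit = ()

  wantsBoth(new Both) // OK
  // wantsBoth(new Base) // does not compile: Base is not a Base with Extras
}
```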







[GitHub] spark pull request: [SPARK-4508] [SQL] build native date type to c...

2015-02-03 Thread adrian-wang
Github user adrian-wang commented on a diff in the pull request:

https://github.com/apache/spark/pull/4325#discussion_r23988264
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/DateUtils.scala ---
@@ -0,0 +1,60 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.types
+
+import java.sql.Date
+import java.util.{Calendar, TimeZone}
+
+import org.apache.spark.sql.catalyst.expressions.Cast
+
+/**
+ * helper function to convert between Int value of days since 1970-01-01 and java.sql.Date
+ */
+object DateUtils {
--- End diff --

I think this could be useful even outside of Spark.





[GitHub] spark pull request: [minor] update streaming linear algorithms

2015-02-03 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/4329





[GitHub] spark pull request: [SPARK-4969][STREAMING][PYTHON] Add binaryReco...

2015-02-03 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/3803#issuecomment-72612066
  
The Python parts look good to me, thanks!





[GitHub] spark pull request: [SPARK-4879] [WIP] Use driver to coordinate Ha...

2015-02-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4066#issuecomment-72611873
  
  [Test build #26637 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26637/consoleFull)
 for   PR 4066 at commit 
[`92e6dc9`](https://github.com/apache/spark/commit/92e6dc96530351b54cb8eb9944d90b7664776a79).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-5278][SQL] complete the check of ambigu...

2015-02-03 Thread cloud-fan
Github user cloud-fan commented on the pull request:

https://github.com/apache/spark/pull/4068#issuecomment-72613023
  
Hi @yhuai, I have updated this PR, introducing `UnresolvedGetField` to fix
this issue. Do you have time to review it? Thanks!





[GitHub] spark pull request: [SPARK-4879] Use the Spark driver to authorize...

2015-02-03 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/4155#discussion_r23989226
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
@@ -808,6 +810,7 @@ class DAGScheduler(
     // will be posted, which should always come after a corresponding SparkListenerStageSubmitted
     // event.
     stage.latestInfo = StageInfo.fromStage(stage, Some(partitionsToCompute.size))
+    outputCommitCoordinator.stageStart(stage.id)
--- End diff --

I think that this introduces a race between commit requests and the stage 
start event.  If the listener bus is slow in delivering events, then it's 
possible that the output commit coordinator could receive a commit request via 
Akka for a stage that it doesn't know about yet.
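
(One defensive pattern for this kind of race, sketched under the assumption that a commit request really can arrive before the coordinator learns about the stage; an illustration, not necessarily what the patch does:)

```
import scala.collection.mutable

object CoordinatorSketch {
  // stage -> (task -> first attempt authorized to commit)
  private val authorizedCommittersByStage =
    mutable.Map[Int, mutable.Map[Long, Long]]()

  def canCommit(stage: Int, task: Long, attempt: Long): Boolean =
    authorizedCommittersByStage.get(stage) match {
      case Some(committers) =>
        // First attempt to ask wins; later asks for the same task are denied.
        committers.getOrElseUpdate(task, attempt) == attempt
      case None =>
        // Stage not registered yet (start message still in flight): deny safely.
        false
    }
}
```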





[GitHub] spark pull request: [SPARK-5345][DEPLOY] Fix unstable test case in...

2015-02-03 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/4133#issuecomment-72614051
  
Jenkins, retest this please.





[GitHub] spark pull request: [SPARK-5549] Define TaskContext interface in S...

2015-02-03 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/4324





[GitHub] spark pull request: [SPARK-5345][DEPLOY] Fix unstable test case in...

2015-02-03 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/4133#issuecomment-72614122
  
I think we've observed this test's flakiness even after fixing #4220, so we 
should continue investigating it.





[GitHub] spark pull request: [SPARK-5345][DEPLOY] Fix unstable test case in...

2015-02-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4133#issuecomment-72614113
  
  [Test build #26640 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26640/consoleFull)
 for   PR 4133 at commit 
[`77678fe`](https://github.com/apache/spark/commit/77678fe01372ec272005900ae701ca553d0ee01e).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-5521] PCA wrapper for easy transform ve...

2015-02-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4304#issuecomment-72614765
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26632/
Test PASSed.





[GitHub] spark pull request: [SPARK-5521] PCA wrapper for easy transform ve...

2015-02-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4304#issuecomment-72614756
  
  [Test build #26632 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26632/consoleFull)
 for   PR 4304 at commit 
[`8b29946`](https://github.com/apache/spark/commit/8b29946d0b1aabfcb393f71f9de6202cee51a39d).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class PCA(val k: Int) `






[GitHub] spark pull request: [SPARK-1405][MLLIB] LDA on GraphX

2015-02-03 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2388#issuecomment-72616260
  
@witgo We've merged #4047 and closed this PR. Thanks for your contribution! 
Please create JIRAs and propose new features that can be added to the LDA 
implementation in master.





[GitHub] spark pull request: [SPARK-5550] [SQL] Support the case insensitiv...

2015-02-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4326#issuecomment-72609568
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26625/
Test PASSed.





[GitHub] spark pull request: [SPARK-5554] [SQL] [PySpark] add more tests fo...

2015-02-03 Thread davies
GitHub user davies opened a pull request:

https://github.com/apache/spark/pull/4331

[SPARK-5554] [SQL] [PySpark] add more tests for DataFrame

Add more tests and docs for DataFrame Python API, improve test coverage, 
fix bugs.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/davies/spark fix_df

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/4331.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4331


commit 3ab26614b5278edce6e8571e5c51fe0b67e3124e
Author: Davies Liu dav...@databricks.com
Date:   2015-02-03T08:08:00Z

add more tests for DataFrame







[GitHub] spark pull request: [SPARK-5551][SQL] Create type alias for Schema...

2015-02-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4327#issuecomment-72610169
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26627/
Test PASSed.





[GitHub] spark pull request: [SPARK-5551][SQL] Create type alias for Schema...

2015-02-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4327#issuecomment-72610160
  
  [Test build #26627 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26627/consoleFull) for PR 4327 at commit [`e5a8ff3`](https://github.com/apache/spark/commit/e5a8ff3f7a5c6aebe84f180aca590d3ba41610f3).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4964] [Streaming] Exactly-once semantic...

2015-02-03 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/3798#discussion_r23988994
  
--- Diff: 
external/kafka/src/main/scala/org/apache/spark/streaming/kafka/Leader.scala ---
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.streaming.kafka
+
+import kafka.common.TopicAndPartition
+
+/** Host info for the leader of a Kafka TopicAndPartition */
+final class Leader private(
+/** kafka topic name */
+val topic: String,
+/** kafka partition id */
+val partition: Int,
+/** kafka hostname */
+val host: String,
+/** kafka host's port */
+val port: Int) extends Serializable
+
+object Leader {
+  def create(topic: String, partition: Int, host: String, port: Int): Leader =
--- End diff --

As with offset ranges, can't we just have a single way to construct 
these?
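
For readers following the thread, a minimal sketch of the single-construction-path
style being suggested, assuming a companion-object `apply` replaces `create`
(illustrative only, not the PR's final code):

    final class Leader private(
        val topic: String,
        val partition: Int,
        val host: String,
        val port: Int) extends Serializable

    object Leader {
      // The one entry point: callers write Leader("topic", 0, "broker1", 9092)
      def apply(topic: String, partition: Int, host: String, port: Int): Leader =
        new Leader(topic, partition, host, port)
    }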





[GitHub] spark pull request: [SPARK-4964] [Streaming] Exactly-once semantic...

2015-02-03 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/3798#discussion_r23988976
  
--- Diff: 
external/kafka/src/main/scala/org/apache/spark/streaming/kafka/OffsetRange.scala
 ---
@@ -0,0 +1,69 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.streaming.kafka
+
+import kafka.common.TopicAndPartition
+
+/** Something that has a collection of OffsetRanges */
+trait HasOffsetRanges {
+  def offsetRanges: Array[OffsetRange]
+}
+
+/** Represents a range of offsets from a single Kafka TopicAndPartition */
+final class OffsetRange private(
+/** kafka topic name */
+val topic: String,
+/** kafka partition id */
+val partition: Int,
+/** inclusive starting offset */
+val fromOffset: Long,
+/** exclusive ending offset */
+val untilOffset: Long) extends Serializable {
+  import OffsetRange.OffsetRangeTuple
+
+  /** this is to avoid ClassNotFoundException during checkpoint restore */
--- End diff --

This comment might be more helpful to include where `OffsetRangeTuple` is 
defined rather than here. I spent a long time trying to figure out why this 
extra class existed.

Also, can you give a bit more detail? I'm not sure I see why you can't recover 
from a checkpoint safely, provided that the recovering JVM has the class 
`OffsetRangeTuple` defined.
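
For context, the round-trip pattern under discussion looks roughly like the
sketch below (simplified from the diff; `fromTuple` is an illustrative name).
Serializing as a plain tuple means the checkpoint data references only
standard-library classes, so it can be deserialized even before the Kafka
external classes are resolvable:

    final class OffsetRange(
        val topic: String,
        val partition: Int,
        val fromOffset: Long,
        val untilOffset: Long) extends Serializable {
      // Checkpoint-friendly form: a tuple of standard-library types only
      def toTuple: (String, Int, Long, Long) =
        (topic, partition, fromOffset, untilOffset)
    }

    object OffsetRange {
      // Rebuild the class after the checkpoint has been read back
      def fromTuple(t: (String, Int, Long, Long)): OffsetRange =
        new OffsetRange(t._1, t._2, t._3, t._4)
    }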





[GitHub] spark pull request: [SPARK-4964] [Streaming] Exactly-once semantic...

2015-02-03 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/3798#discussion_r23989107
  
--- Diff: 
external/kafka/src/main/scala/org/apache/spark/streaming/kafka/OffsetRange.scala
 ---
@@ -0,0 +1,69 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.streaming.kafka
+
+import kafka.common.TopicAndPartition
+
+/** Something that has a collection of OffsetRanges */
+trait HasOffsetRanges {
+  def offsetRanges: Array[OffsetRange]
+}
+
+/** Represents a range of offsets from a single Kafka TopicAndPartition */
+final class OffsetRange private(
+/** kafka topic name */
+val topic: String,
+/** kafka partition id */
+val partition: Int,
+/** inclusive starting offset */
+val fromOffset: Long,
+/** exclusive ending offset */
+val untilOffset: Long) extends Serializable {
+  import OffsetRange.OffsetRangeTuple
+
+  /** this is to avoid ClassNotFoundException during checkpoint restore */
+  private[streaming]
+  def toTuple: OffsetRangeTuple = (topic, partition, fromOffset, untilOffset)
+}
+
+object OffsetRange {
+  private[spark]
+  type OffsetRangeTuple = (String, Int, Long, Long)
--- End diff --

Can you group this at the bottom with the related `apply` method? 





[GitHub] spark pull request: [SPARK-4879] [WIP] Use driver to coordinate Ha...

2015-02-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4066#issuecomment-72613283
  
  [Test build #26629 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26629/consoleFull) for PR 4066 at commit [`63a7707`](https://github.com/apache/spark/commit/63a7707cad01f4dcc2c74c4a6bffded9c887f9d4).
 * This patch **fails Spark unit tests**.
 * This patch **does not merge cleanly**.
 * This patch adds the following public classes _(experimental)_:
  * `case class AskPermissionToCommitOutput(`






[GitHub] spark pull request: [SPARK-5549] Define TaskContext interface in S...

2015-02-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4324#issuecomment-72613282
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26633/
Test FAILed.





[GitHub] spark pull request: [SPARK-4879] [WIP] Use driver to coordinate Ha...

2015-02-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4066#issuecomment-72613292
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26629/
Test FAILed.





[GitHub] spark pull request: [SPARK-5549] Define TaskContext interface in S...

2015-02-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4324#issuecomment-72613272
  
  [Test build #26633 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26633/consoleFull) for PR 4324 at commit [`2480a17`](https://github.com/apache/spark/commit/2480a177df2023238778fb217fb9b3e4794a9b82).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4508] [SQL] build native date type to c...

2015-02-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4325#issuecomment-72613604
  
  [Test build #26639 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26639/consoleFull) for PR 4325 at commit [`e46735c`](https://github.com/apache/spark/commit/e46735c612879bb46317efe73155b4611bb51afc).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4795][Core] Redesign the primitive typ...

2015-02-03 Thread zsxwing
Github user zsxwing commented on the pull request:

https://github.com/apache/spark/pull/3642#issuecomment-72614319
  
> @zsxwing if we merge this one, are there any other use cases for importing 
> `SparkContext._`?

There will be no implicit methods/objects left in the `SparkContext` object, so 
people won't need to import `SparkContext._` for implicit methods/objects.

Since compatibility is very important, it's better to get more pairs of eyes on 
this change.
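
The mechanism that makes the wildcard import unnecessary can be sketched with a
self-contained toy (not Spark's actual code): an implicit conversion defined in
the companion object of the source type sits in the compiler's implicit scope
for values of that type, so no import is needed at the call site:

    import scala.language.implicitConversions

    class MyRDD[T](val data: Seq[T])

    object MyRDD {
      // Lives in the companion object, so it is found via implicit scope
      implicit def toPairFunctions[K, V](rdd: MyRDD[(K, V)]): PairFunctions[K, V] =
        new PairFunctions(rdd)
    }

    class PairFunctions[K, V](self: MyRDD[(K, V)]) {
      def keysSeq: Seq[K] = self.data.map(_._1)
    }

    object Example extends App {
      val rdd = new MyRDD(Seq(("a", 1), ("b", 2)))
      println(rdd.keysSeq)  // compiles with no wildcard import in sight
    }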





[GitHub] spark pull request: [SPARK-4964] [Streaming] Exactly-once semantic...

2015-02-03 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/3798#discussion_r23989943
  
--- Diff: 
external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaUtils.scala 
---
@@ -144,4 +150,249 @@ object KafkaUtils {
 createStream[K, V, U, T](
   jssc.ssc, kafkaParams.toMap, 
Map(topics.mapValues(_.intValue()).toSeq: _*), storageLevel)
   }
+
+  /** A batch-oriented interface for consuming from Kafka.
+   * Starting and ending offsets are specified in advance,
+   * so that you can control exactly-once semantics.
+   * @param sc SparkContext object
+   * @param kafkaParams Kafka <a href="http://kafka.apache.org/documentation.html#configuration">
+   *   configuration parameters</a>.
+   *   Requires "metadata.broker.list" or "bootstrap.servers" to be set with Kafka broker(s),
+   *   NOT zookeeper servers, specified in host1:port1,host2:port2 form.
+   * @param batch Each OffsetRange in the batch corresponds to a
+   *   range of offsets for a given Kafka topic/partition
+   */
+  @Experimental
+  def createRDD[
+K: ClassTag,
+V: ClassTag,
+U <: Decoder[_]: ClassTag,
+T <: Decoder[_]: ClassTag,
+R: ClassTag] (
--- End diff --

Isn't the returned RDD of type `RDD[R]`?
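
The point of the question, in miniature (plain collections standing in for
RDDs; `createFromRecords` is a made-up name): when a factory takes a record
handler producing `R`, the result is parameterized by `R`, not by the key and
value types:

    object ReturnTypeDemo extends App {
      // The handler maps each raw (key, value) record to an R, so the result
      // is Seq[R] here, analogous to RDD[R] in the method under review.
      def createFromRecords[K, V, R](records: Seq[(K, V)])(handler: ((K, V)) => R): Seq[R] =
        records.map(handler)

      val lengths: Seq[Int] =
        createFromRecords(Seq(("a", "xx"), ("b", "yyy"))) { case (_, v) => v.length }

      println(lengths)  // List(2, 3)
    }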





[GitHub] spark pull request: [SPARK-4508] [SQL] build native date type to c...

2015-02-03 Thread adrian-wang
Github user adrian-wang commented on a diff in the pull request:

https://github.com/apache/spark/pull/4325#discussion_r23990278
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/types/DateUtils.scala ---
@@ -0,0 +1,60 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.types
+
+import java.sql.Date
+import java.util.{Calendar, TimeZone}
+
+import org.apache.spark.sql.catalyst.expressions.Cast
+
+/**
+ * helper function to convert between an Int value of days since 1970-01-01 and java.sql.Date
+ */
+object DateUtils {
--- End diff --

Code gen needs that.
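
For readers, a sketch of the kind of helper being reviewed (illustrative names;
the PR's actual implementation may differ). Converting between an Int count of
days since 1970-01-01 and java.sql.Date needs a timezone adjustment, because
java.sql.Date is interpreted in the JVM's default timezone while the day count
is timezone-free:

    import java.sql.Date
    import java.util.TimeZone

    object DateUtilsSketch {
      private val MillisPerDay = 24L * 60 * 60 * 1000

      // Int days since epoch -> java.sql.Date rendered in the default timezone
      def toJavaDate(daysSinceEpoch: Int): Date = {
        val millisUtc = daysSinceEpoch * MillisPerDay
        new Date(millisUtc - TimeZone.getDefault.getOffset(millisUtc))
      }

      // java.sql.Date -> Int days since epoch, undoing the timezone offset
      def fromJavaDate(date: Date): Int = {
        val millis = date.getTime
        ((millis + TimeZone.getDefault.getOffset(millis)) / MillisPerDay).toInt
      }
    }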





[GitHub] spark pull request: [SPARK-4508] [SQL] build native date type to c...

2015-02-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4325#issuecomment-72615808
  
  [Test build #26642 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26642/consoleFull) for PR 4325 at commit [`096e20d`](https://github.com/apache/spark/commit/096e20d5de068157910372a03a6face9edc829e6).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4508] [SQL] build native date type to c...

2015-02-03 Thread adrian-wang
Github user adrian-wang commented on the pull request:

https://github.com/apache/spark/pull/4325#issuecomment-72615722
  
retest this please.





[GitHub] spark pull request: [SPARK-4879] [WIP] Use driver to coordinate Ha...

2015-02-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4066#issuecomment-72744869
  
  [Test build #26677 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26677/consoleFull) for PR 4066 at commit [`dd00b7c`](https://github.com/apache/spark/commit/dd00b7c83fd0a4fa1cbd9115f2e0a8e69bc519b9).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-5554] [SQL] [PySpark] add more tests fo...

2015-02-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4331#issuecomment-72745468
  
  [Test build #26672 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26672/consoleFull) for PR 4331 at commit [`467332c`](https://github.com/apache/spark/commit/467332cacca8754f04271a70bbaf15c8f2afd5c6).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class Dsl(object):`
  * `class ExamplePointUDT(UserDefinedType):`
  * `class SQLTests(ReusedPySparkTestCase):`






[GitHub] spark pull request: [SPARK-5554] [SQL] [PySpark] add more tests fo...

2015-02-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4331#issuecomment-72745480
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26672/
Test FAILed.

