date:20151027

[GitHub] spark pull request: [SPARK-10484][SQL] Optimize the cartesian join...

2015-10-27 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8652#issuecomment-151389719
  
Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-2629][STREAMING] Basic implementation o...

2015-10-27 Thread jerryshao

Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/9256#discussion_r43088193
  
--- Diff: streaming/src/main/scala/org/apache/spark/streaming/State.scala 
---
@@ -0,0 +1,138 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.streaming
+
+/**
+ * Abstract class for getting and updating the tracked state in the 
`trackStateByKey` operation of
+ * [[org.apache.spark.streaming.dstream.PairDStreamFunctions pair 
DStream]] and
+ * [[org.apache.spark.streaming.api.java.JavaPairDStream]].
+ * {{{
+ *
+ * }}}
+ */
+sealed abstract class State[S] {
+  
+  /** Whether the state already exists */
+  def exists(): Boolean
+  
+  /**
+   * Get the state if it exists, otherwise wise it will throw an exception.
+   * Check with `exists()` whether the state exists or not before calling 
`get()`.
+   */
+  def get(): S
+
+  /**
+   * Update the state with a new value. Note that you cannot update the 
state if the state is
+   * timing out (that is, `isTimingOut() return true`, or if the state has 
already been removed by
+   * `remove()`.
+   */
+  def update(newState: S): Unit
+
+  /** Remove the state if it exists. */
+  def remove(): Unit
+
+  /** Is the state going to be timed out by the system after this batch 
interval */
+  def isTimingOut(): Boolean
+
+  @inline final def getOption(): Option[S] = Option(get())
+
+  /** Get the state if it exists, otherwise return the default value */
+  @inline final def getOrElse[S1 >: S](default: => S1): S1 = {
--- End diff --

Not sure is this "call-by-name" parameter Java friendly? Assuming this 
`State` should also be used in Java code :).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-2629][STREAMING] Basic implementation o...

2015-10-27 Thread jerryshao

Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/9256#discussion_r43089727
  
--- Diff: 
streaming/src/main/scala/org/apache/spark/streaming/dstream/TrackeStateDStream.scala
 ---
@@ -0,0 +1,192 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.streaming.dstream
+
+import java.io.{IOException, ObjectOutputStream}
+
+import scala.collection.mutable.ArrayBuffer
+import scala.reflect.ClassTag
+
+import org.apache.spark._
+import org.apache.spark.rdd.{EmptyRDD, RDD}
+import org.apache.spark.storage.StorageLevel
+import org.apache.spark.streaming._
+import org.apache.spark.streaming.util.StateMap
+import org.apache.spark.util.Utils
+
+private[streaming] case class TrackStateRDDRecord[K: ClassTag, S: 
ClassTag, T: ClassTag](
+stateMap: StateMap[K, S], emittedRecords: Seq[T])
+
+
+private[streaming] class TrackStateRDDPartition(
+idx: Int,
+@transient private var prevStateRDD: RDD[_],
+@transient private var partitionedDataRDD: RDD[_]) extends Partition {
+
+  private[dstream] var previousSessionRDDPartition: Partition = null
+  private[dstream] var partitionedDataRDDPartition: Partition = null
+
+  override def index: Int = idx
+  override def hashCode(): Int = idx
+
+  @throws(classOf[IOException])
+  private def writeObject(oos: ObjectOutputStream): Unit = 
Utils.tryOrIOException {
+// Update the reference to parent split at the time of task 
serialization
+previousSessionRDDPartition = prevStateRDD.partitions(index)
+partitionedDataRDDPartition = partitionedDataRDD.partitions(index)
+oos.defaultWriteObject()
+  }
+}
+
+private[streaming] class TrackStateRDD[K: ClassTag, V: ClassTag, S: 
ClassTag, T: ClassTag](
+_sc: SparkContext,
+private var prevStateRDD: RDD[TrackStateRDDRecord[K, S, T]],
+private var partitionedDataRDD: RDD[(K, V)],
+trackingFunction: (K, Option[V], State[S]) => Option[T],
+currentTime: Long, timeoutThresholdTime: Option[Long]
+  ) extends RDD[TrackStateRDDRecord[K, S, T]](
+_sc,
+List(
+  new OneToOneDependency[TrackStateRDDRecord[K, S, T]](prevStateRDD),
+  new OneToOneDependency(partitionedDataRDD))
+  ) {
+
+  @volatile private var doFullScan = false
+
+  require(partitionedDataRDD.partitioner.nonEmpty)
+  require(partitionedDataRDD.partitioner == prevStateRDD.partitioner)
+
+  override val partitioner = prevStateRDD.partitioner
+
+  override def checkpoint(): Unit = {
+super.checkpoint()
+doFullScan = true
+  }
+
+  override def compute(
+  partition: Partition, context: TaskContext): 
Iterator[TrackStateRDDRecord[K, S, T]] = {
+
+val stateRDDPartition = partition.asInstanceOf[TrackStateRDDPartition]
+val prevStateRDDIterator = prevStateRDD.iterator(
+  stateRDDPartition.previousSessionRDDPartition, context)
+val dataIterator = partitionedDataRDD.iterator(
+  stateRDDPartition.partitionedDataRDDPartition, context)
+if (!prevStateRDDIterator.hasNext) {
+  throw new SparkException(s"Could not find state map in previous 
state RDD")
+}
+
+val newStateMap = prevStateRDDIterator.next().stateMap.copy()
+val emittedRecords = new ArrayBuffer[T]
+
+val wrappedState = new StateImpl[S]()
+
+dataIterator.foreach { case (key, value) =>
+  wrappedState.wrap(newStateMap.get(key))
+  val emittedRecord = trackingFunction(key, Some(value), wrappedState)
--- End diff --

Is it possible `value` will be `null`, would it be better to use 
`Option(value)`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at

[GitHub] spark pull request: [SPARK-9319][SPARKR] Add support for setting c...

2015-10-27 Thread sun-rui

Github user sun-rui commented on the pull request:

https://github.com/apache/spark/pull/9218#issuecomment-151399054
  
@felixcheung, type inferring works for complex types in createDataFrame(). 
You can refer to the test case for "create DataFrame with complex types" in 
test_sparkSQL.R.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-11334] numRunningTasks can't be less th...

2015-10-27 Thread srowen

Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/9288#issuecomment-151400535
  
Hm, is it better to check the status of the task, and only decrement if it 
wasn't dead already? (I may not know what I'm talking about there)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-9298][SQL] Add pearson correlation aggr...

2015-10-27 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/8587#discussion_r43090251
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/functions.scala
 ---
@@ -524,6 +525,116 @@ case class Sum(child: Expression) extends 
DeclarativeAggregate {
   override val evaluateExpression = Cast(currentSum, resultType)
 }
 
+case class Corr(
--- End diff --

* Please add ScalaDoc and document the behavior for `null` and `NaN` values.
* Provide a link to the wikipedia page that contains the update formula.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-9298][SQL] Add pearson correlation aggr...

2015-10-27 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/8587#discussion_r43090262
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/functions.scala
 ---
@@ -524,6 +525,116 @@ case class Sum(child: Expression) extends 
DeclarativeAggregate {
   override val evaluateExpression = Cast(currentSum, resultType)
 }
 
+case class Corr(
+left: Expression,
+right: Expression,
+mutableAggBufferOffset: Int = 0,
+inputAggBufferOffset: Int = 0)
+  extends ImperativeAggregate {
+
+  def children: Seq[Expression] = Seq(left, right)
+
+  def nullable: Boolean = false
+
+  def dataType: DataType = DoubleType
+
+  def inputTypes: Seq[AbstractDataType] = Seq(DoubleType)
+
+  def aggBufferSchema: StructType = 
StructType.fromAttributes(aggBufferAttributes)
+
+  def inputAggBufferAttributes: Seq[AttributeReference] = 
aggBufferAttributes.map(_.newInstance())
+
+  val aggBufferAttributes: Seq[AttributeReference] = Seq(
+AttributeReference("xAvg", DoubleType)(),
+AttributeReference("yAvg", DoubleType)(),
+AttributeReference("Ck", DoubleType)(),
+AttributeReference("MkX", DoubleType)(),
+AttributeReference("MkY", DoubleType)(),
+AttributeReference("count", LongType)())
+
+  override def withNewMutableAggBufferOffset(newMutableAggBufferOffset: 
Int): ImperativeAggregate =
+copy(mutableAggBufferOffset = newMutableAggBufferOffset)
+
+  override def withNewInputAggBufferOffset(newInputAggBufferOffset: Int): 
ImperativeAggregate =
+copy(inputAggBufferOffset = newInputAggBufferOffset)
+
+  override def initialize(buffer: MutableRow): Unit = {
+(0 until 5).map(idx => buffer.setDouble(mutableAggBufferOffset + idx, 
0.0))
+buffer.setLong(mutableAggBufferOffset + 5, 0L)
+  }
+
+  override def update(buffer: MutableRow, input: InternalRow): Unit = {
+val x = left.eval(input).asInstanceOf[Double]
+val y = right.eval(input).asInstanceOf[Double]
+
+var xAvg = buffer.getDouble(mutableAggBufferOffset)
+var yAvg = buffer.getDouble(mutableAggBufferOffset + 1)
--- End diff --

It might be faster to cache the offsets as `private[this] var`s.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-2629][STREAMING] Basic implementation o...

2015-10-27 Thread jerryshao

Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/9256#discussion_r43090719
  
--- Diff: 
streaming/src/main/scala/org/apache/spark/streaming/dstream/TrackeStateDStream.scala
 ---
@@ -0,0 +1,192 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.streaming.dstream
+
+import java.io.{IOException, ObjectOutputStream}
+
+import scala.collection.mutable.ArrayBuffer
+import scala.reflect.ClassTag
+
+import org.apache.spark._
+import org.apache.spark.rdd.{EmptyRDD, RDD}
+import org.apache.spark.storage.StorageLevel
+import org.apache.spark.streaming._
+import org.apache.spark.streaming.util.StateMap
+import org.apache.spark.util.Utils
+
+private[streaming] case class TrackStateRDDRecord[K: ClassTag, S: 
ClassTag, T: ClassTag](
+stateMap: StateMap[K, S], emittedRecords: Seq[T])
+
+
+private[streaming] class TrackStateRDDPartition(
+idx: Int,
+@transient private var prevStateRDD: RDD[_],
+@transient private var partitionedDataRDD: RDD[_]) extends Partition {
+
+  private[dstream] var previousSessionRDDPartition: Partition = null
+  private[dstream] var partitionedDataRDDPartition: Partition = null
+
+  override def index: Int = idx
+  override def hashCode(): Int = idx
+
+  @throws(classOf[IOException])
+  private def writeObject(oos: ObjectOutputStream): Unit = 
Utils.tryOrIOException {
+// Update the reference to parent split at the time of task 
serialization
+previousSessionRDDPartition = prevStateRDD.partitions(index)
+partitionedDataRDDPartition = partitionedDataRDD.partitions(index)
+oos.defaultWriteObject()
+  }
+}
+
+private[streaming] class TrackStateRDD[K: ClassTag, V: ClassTag, S: 
ClassTag, T: ClassTag](
+_sc: SparkContext,
+private var prevStateRDD: RDD[TrackStateRDDRecord[K, S, T]],
+private var partitionedDataRDD: RDD[(K, V)],
+trackingFunction: (K, Option[V], State[S]) => Option[T],
+currentTime: Long, timeoutThresholdTime: Option[Long]
+  ) extends RDD[TrackStateRDDRecord[K, S, T]](
+_sc,
+List(
+  new OneToOneDependency[TrackStateRDDRecord[K, S, T]](prevStateRDD),
+  new OneToOneDependency(partitionedDataRDD))
+  ) {
+
+  @volatile private var doFullScan = false
+
+  require(partitionedDataRDD.partitioner.nonEmpty)
+  require(partitionedDataRDD.partitioner == prevStateRDD.partitioner)
+
+  override val partitioner = prevStateRDD.partitioner
+
+  override def checkpoint(): Unit = {
+super.checkpoint()
+doFullScan = true
+  }
+
+  override def compute(
+  partition: Partition, context: TaskContext): 
Iterator[TrackStateRDDRecord[K, S, T]] = {
+
+val stateRDDPartition = partition.asInstanceOf[TrackStateRDDPartition]
+val prevStateRDDIterator = prevStateRDD.iterator(
+  stateRDDPartition.previousSessionRDDPartition, context)
+val dataIterator = partitionedDataRDD.iterator(
+  stateRDDPartition.partitionedDataRDDPartition, context)
+if (!prevStateRDDIterator.hasNext) {
+  throw new SparkException(s"Could not find state map in previous 
state RDD")
+}
+
+val newStateMap = prevStateRDDIterator.next().stateMap.copy()
+val emittedRecords = new ArrayBuffer[T]
+
+val wrappedState = new StateImpl[S]()
+
+dataIterator.foreach { case (key, value) =>
+  wrappedState.wrap(newStateMap.get(key))
+  val emittedRecord = trackingFunction(key, Some(value), wrappedState)
+  if (wrappedState.isRemoved) {
+newStateMap.remove(key)
+  } else if (wrappedState.isUpdated) {
+newStateMap.put(key, wrappedState.get(), currentTime)
+  }
+  emittedRecords ++= emittedRecord
--- End diff --

It is better not to materialize all the `emittedRecords` in `compute()`, 
just one by each iteration, from

[GitHub] spark pull request: [SPARK-2960][Deploy] Support executing Spark f...

2015-10-27 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8669#issuecomment-151405232
  
**[Test build #44408 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44408/consoleFull)**
 for PR 8669 at commit 
[`41511b4`](https://github.com/apache/spark/commit/41511b417b1009bfda3f3cf8104ef2025f50227d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:\n  * `
\"$`\n  * `\"$`\n  * `exec \"$`\n  * `exec \"$`\n  * `  nohup nice -n 
\"$SPARK_NICENESS\" \"$`\n  * `  nohup nice -n \"$SPARK_NICENESS\" \"$`\n  
* `  \"$`\n


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-11303] [SQL] filter should not be pushe...

2015-10-27 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9294#issuecomment-151406889
  
 Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-11226] [SQL] Empty line in json file sh...

2015-10-27 Thread zjffdu

Github user zjffdu commented on the pull request:

https://github.com/apache/spark/pull/9211#issuecomment-151407652
  
@rxin @srowen will add unit test soon


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10024] [pyspark] Python API RF and GBT ...

2015-10-27 Thread vectorijk

Github user vectorijk commented on the pull request:

https://github.com/apache/spark/pull/9233#issuecomment-151409768
  
@mengxr Sure. Added since version back.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10024] [pyspark] Python API RF and GBT ...

2015-10-27 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9233#issuecomment-151411345
  
**[Test build #44413 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44413/consoleFull)**
 for PR 9233 at commit 
[`1f5444b`](https://github.com/apache/spark/commit/1f5444bd9e9ee4d507fb32bef8b155e10c4c7990).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-11270] [STREAMING] Add improved equalit...

2015-10-27 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/9236


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10024] [pyspark] Python API RF and GBT ...

2015-10-27 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9233#issuecomment-151414186
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44413/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-11335][STREAMING] update kafka direct p...

2015-10-27 Thread tdas

Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/9289#discussion_r43094153
  
--- Diff: docs/streaming-kafka-integration.md ---
@@ -181,7 +181,20 @@ Next, we discuss how to use this approach in your 
streaming application.
);


-   Not supported yet
+   offsetRanges = []
+
+   def storeOffsetRanges(rdd):
+   del offsetRanges[:]
+   offsetRanges.extend(rdd.offsetRanges())
--- End diff --

cant we simply use `offsetRanges = rdd.offsetRanges()`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10024] [pyspark] Python API RF and GBT ...

2015-10-27 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9233#issuecomment-151414084
  
**[Test build #44413 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44413/consoleFull)**
 for PR 9233 at commit 
[`1f5444b`](https://github.com/apache/spark/commit/1f5444bd9e9ee4d507fb32bef8b155e10c4c7990).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:\n  * 
`class GBTParams(TreeEnsembleParams):`\n  * `class 
TreeEnsembleParams(DecisionTreeParams):`\n  * `class 
TreeRegressorParams(Params):`\n  * `class 
RandomForestParams(TreeEnsembleParams):`\n  * `class 
GBTParams(TreeEnsembleParams):`\n


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10024] [pyspark] Python API RF and GBT ...

2015-10-27 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9233#issuecomment-151414185
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-9298][SQL] Add pearson correlation aggr...

2015-10-27 Thread viirya

Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/8587#discussion_r43094562
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/functions.scala
 ---
@@ -524,6 +525,116 @@ case class Sum(child: Expression) extends 
DeclarativeAggregate {
   override val evaluateExpression = Cast(currentSum, resultType)
 }
 
+case class Corr(
+left: Expression,
+right: Expression,
+mutableAggBufferOffset: Int = 0,
+inputAggBufferOffset: Int = 0)
+  extends ImperativeAggregate {
+
+  def children: Seq[Expression] = Seq(left, right)
+
+  def nullable: Boolean = false
+
+  def dataType: DataType = DoubleType
+
+  def inputTypes: Seq[AbstractDataType] = Seq(DoubleType)
+
+  def aggBufferSchema: StructType = 
StructType.fromAttributes(aggBufferAttributes)
+
+  def inputAggBufferAttributes: Seq[AttributeReference] = 
aggBufferAttributes.map(_.newInstance())
+
+  val aggBufferAttributes: Seq[AttributeReference] = Seq(
+AttributeReference("xAvg", DoubleType)(),
+AttributeReference("yAvg", DoubleType)(),
+AttributeReference("Ck", DoubleType)(),
+AttributeReference("MkX", DoubleType)(),
+AttributeReference("MkY", DoubleType)(),
+AttributeReference("count", LongType)())
+
+  override def withNewMutableAggBufferOffset(newMutableAggBufferOffset: 
Int): ImperativeAggregate =
+copy(mutableAggBufferOffset = newMutableAggBufferOffset)
+
+  override def withNewInputAggBufferOffset(newInputAggBufferOffset: Int): 
ImperativeAggregate =
+copy(inputAggBufferOffset = newInputAggBufferOffset)
+
+  override def initialize(buffer: MutableRow): Unit = {
+(0 until 5).map(idx => buffer.setDouble(mutableAggBufferOffset + idx, 
0.0))
+buffer.setLong(mutableAggBufferOffset + 5, 0L)
+  }
+
+  override def update(buffer: MutableRow, input: InternalRow): Unit = {
+val x = left.eval(input).asInstanceOf[Double]
+val y = right.eval(input).asInstanceOf[Double]
+
+var xAvg = buffer.getDouble(mutableAggBufferOffset)
+var yAvg = buffer.getDouble(mutableAggBufferOffset + 1)
--- End diff --

I think we don't want to modify mutableAggBufferOffset but only want to use 
mutableAggBufferOffset + 1, mutableAggBufferOffset + 2...etc.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-11342] [TESTS] Allow to set hadoop prof...

2015-10-27 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9295#issuecomment-151415155
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-11298] When driver sends message "GetEx...

2015-10-27 Thread vanzin

Github user vanzin commented on the pull request:

https://github.com/apache/spark/pull/9268#issuecomment-151415322
  
#8887 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-9298][SQL] Add pearson correlation aggr...

2015-10-27 Thread viirya

Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/8587#discussion_r43094591
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/functions.scala
 ---
@@ -524,6 +525,116 @@ case class Sum(child: Expression) extends 
DeclarativeAggregate {
   override val evaluateExpression = Cast(currentSum, resultType)
 }
 
+case class Corr(
+left: Expression,
+right: Expression,
+mutableAggBufferOffset: Int = 0,
+inputAggBufferOffset: Int = 0)
+  extends ImperativeAggregate {
+
+  def children: Seq[Expression] = Seq(left, right)
+
+  def nullable: Boolean = false
+
+  def dataType: DataType = DoubleType
+
+  def inputTypes: Seq[AbstractDataType] = Seq(DoubleType)
+
+  def aggBufferSchema: StructType = 
StructType.fromAttributes(aggBufferAttributes)
+
+  def inputAggBufferAttributes: Seq[AttributeReference] = 
aggBufferAttributes.map(_.newInstance())
+
+  val aggBufferAttributes: Seq[AttributeReference] = Seq(
+AttributeReference("xAvg", DoubleType)(),
+AttributeReference("yAvg", DoubleType)(),
+AttributeReference("Ck", DoubleType)(),
+AttributeReference("MkX", DoubleType)(),
+AttributeReference("MkY", DoubleType)(),
+AttributeReference("count", LongType)())
+
+  override def withNewMutableAggBufferOffset(newMutableAggBufferOffset: 
Int): ImperativeAggregate =
+copy(mutableAggBufferOffset = newMutableAggBufferOffset)
+
+  override def withNewInputAggBufferOffset(newInputAggBufferOffset: Int): 
ImperativeAggregate =
+copy(inputAggBufferOffset = newInputAggBufferOffset)
+
+  override def initialize(buffer: MutableRow): Unit = {
+(0 until 5).map(idx => buffer.setDouble(mutableAggBufferOffset + idx, 
0.0))
+buffer.setLong(mutableAggBufferOffset + 5, 0L)
+  }
+
+  override def update(buffer: MutableRow, input: InternalRow): Unit = {
+val x = left.eval(input).asInstanceOf[Double]
+val y = right.eval(input).asInstanceOf[Double]
+
+var xAvg = buffer.getDouble(mutableAggBufferOffset)
+var yAvg = buffer.getDouble(mutableAggBufferOffset + 1)
+var Ck = buffer.getDouble(mutableAggBufferOffset + 2)
+var MkX = buffer.getDouble(mutableAggBufferOffset + 3)
+var MkY = buffer.getDouble(mutableAggBufferOffset + 4)
+var count = buffer.getLong(mutableAggBufferOffset + 5)
+
+val deltaX = x - xAvg
+val deltaY = y - yAvg
+count += 1
+xAvg += deltaX / count
+yAvg += deltaY / count
+Ck += deltaX * (y - yAvg)
+MkX += deltaX * (x - xAvg)
+MkY += deltaY * (y - yAvg)
+
+buffer.setDouble(mutableAggBufferOffset, xAvg)
+buffer.setDouble(mutableAggBufferOffset + 1, yAvg)
+buffer.setDouble(mutableAggBufferOffset + 2, Ck)
+buffer.setDouble(mutableAggBufferOffset + 3, MkX)
+buffer.setDouble(mutableAggBufferOffset + 4, MkY)
+buffer.setLong(mutableAggBufferOffset + 5, count)
+  }
+
+  override def merge(buffer1: MutableRow, buffer2: InternalRow): Unit = {
+val count2 = buffer2.getLong(inputAggBufferOffset + 5)
+
+if (count2 > 0) {
+  var xAvg = buffer1.getDouble(mutableAggBufferOffset)
+  var yAvg = buffer1.getDouble(mutableAggBufferOffset + 1)
+  var Ck = buffer1.getDouble(mutableAggBufferOffset + 2)
+  var MkX = buffer1.getDouble(mutableAggBufferOffset + 3)
+  var MkY = buffer1.getDouble(mutableAggBufferOffset + 4)
+  var count = buffer1.getLong(mutableAggBufferOffset + 5)
+
+  val xAvg2 = buffer2.getDouble(inputAggBufferOffset)
+  val yAvg2 = buffer2.getDouble(inputAggBufferOffset + 1)
+  val Ck2 = buffer2.getDouble(inputAggBufferOffset + 2)
+  val MkX2 = buffer2.getDouble(inputAggBufferOffset + 3)
+  val MkY2 = buffer2.getDouble(inputAggBufferOffset + 4)
+
+  val totalCount = count + count2
+  val deltaX = xAvg - xAvg2
+  val deltaY = yAvg - yAvg2
+  Ck += Ck2 + deltaX * deltaY * count / totalCount * count2
+  xAvg = (xAvg * count + xAvg2 * count2) / totalCount
+  yAvg = (yAvg * count + yAvg2 * count2) / totalCount
+  MkX += MkX2 + deltaX * deltaX * count / totalCount * count2
+  MkY += MkY2 + deltaY * deltaY * count / totalCount * count2
+  count = totalCount
+
+  buffer1.setDouble(mutableAggBufferOffset, xAvg)
+  buffer1.setDouble(mutableAggBufferOffset + 1, yAvg)
+  buffer1.setDouble(mutableAggBufferOffset + 2, Ck)
+  buffer1.setDouble(mutableAggBufferOffset + 3, MkX)
+  buffer1.setDouble(mutableAggBufferOffset + 4, MkY)
+  buffer1.setLong(mutableAggBufferOffset + 5, count)
+}
+  }
+

[GitHub] spark pull request: [SPARK-11284] [ML] ALS produces float predicti...

2015-10-27 Thread dahlem

Github user dahlem closed the pull request at:

https://github.com/apache/spark/pull/9252


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-8582][Core]Optimize checkpointing to av...

2015-10-27 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9258#issuecomment-151418271
  
 Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10484][SQL] Optimize the cartesian join...

2015-10-27 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8652#issuecomment-151389533
  
 Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-2960][Deploy] Support executing Spark f...

2015-10-27 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8669#issuecomment-151393309
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44400/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-2960][Deploy] Support executing Spark f...

2015-10-27 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8669#issuecomment-151393307
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-2960][Deploy] Support executing Spark f...

2015-10-27 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8669#issuecomment-151393236
  
**[Test build #44400 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44400/consoleFull)**
 for PR 8669 at commit 
[`c796397`](https://github.com/apache/spark/commit/c796397da5159335de3c11e62ecedc427a9c5a64).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:\n  * `
\"$`\n  * `\"$`\n  * `exec \"$`\n  * `exec \"$`\n  * `  nohup nice -n 
\"$SPARK_NICENESS\" \"$`\n  * `  nohup nice -n \"$SPARK_NICENESS\" \"$`\n  
* `  \"$`\n


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-11184] [MLLIB] Declare most of .mllib c...

2015-10-27 Thread srowen

Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/9169#issuecomment-151396799
  
Thanks! On `@Experimental`, I assume it simply means that it is intended 
for end users (i.e. not a `@DeveloperApi`) but may be removed or changed at any 
time (without even deprecation). 

While it seems nice to try to keep these stable or use deprecation, as you 
say, it tends to make them act like stable methods. This can be a small problem 
if an API is added under the theory that it's just `@Experimental` and is thus 
low-risk, but is then treated like it can't be changed. 

I imagine that most users don't pay a lot of attention to the tag, and 
therefore might generally be surprised if such a method went away. I think all 
of this simply argues for more rapidly reflecting reality: lots of 
`@Experimental` methods are long since really stable and should be untagged. 
This is a good first crack at that.

I'll put down a to-do to do the same for core and streaming, as I think 
most of those tags can probably go.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-2629][STREAMING] Basic implementation o...

2015-10-27 Thread jerryshao

Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/9256#discussion_r43089007
  
--- Diff: 
streaming/src/main/scala/org/apache/spark/streaming/TrackStateSpec.scala ---
@@ -0,0 +1,111 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.streaming
+
+import scala.reflect.ClassTag
+
+import org.apache.spark.{HashPartitioner, Partitioner}
+import org.apache.spark.api.java.JavaPairRDD
+import org.apache.spark.rdd.RDD
+
+
+/**
+ * Abstract class having all the specifications of 
DStream.trackStateByKey().
+ * Use the `TrackStateSpec.create()` or `TrackStateSpec.create()` to 
create instances of this class.
+ *
+ * {{{
+ *TrackStateSpec(trackingFunction)// in Scala
+ *TrackStateSpec.create(trackingFunction) // in Java
+ * }}}
+ */
+sealed abstract class TrackStateSpec[K: ClassTag, V: ClassTag, S: 
ClassTag, T: ClassTag]
+  extends Serializable {
+
+  def initialState(rdd: RDD[(K, S)]): this.type
+  def initialState(javaPairRDD: JavaPairRDD[K, S]): this.type
+
+  def numPartitions(numPartitions: Int): this.type
+  def partitioner(partitioner: Partitioner): this.type
+
+  def timeout(interval: Duration): this.type
+}
+
+
+/** Builder object for creating instances of TrackStateSpec */
+object TrackStateSpec {
+
+  def apply[K: ClassTag, V: ClassTag, S: ClassTag, T: ClassTag](
+  trackingFunction: (K, Option[V], State[S]) => Option[T]): 
TrackStateSpec[K, V, S, T] = {
+new TrackStateSpecImpl[K, V, S, T](trackingFunction)
+  }
+
+  def create[K: ClassTag, V: ClassTag, S: ClassTag, T: ClassTag](
+  trackingFunction: (K, Option[V], State[S]) => Option[T]): 
TrackStateSpec[K, V, S, T] = {
+apply(trackingFunction)
+  }
--- End diff --

I think here Java friendly constructor is necessary, `create` might not be 
directly used in Java code.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-11340][SPARKR] Support setting driver p...

2015-10-27 Thread sun-rui

Github user sun-rui commented on a diff in the pull request:

https://github.com/apache/spark/pull/9290#discussion_r43089490
  
--- Diff: R/pkg/R/sparkR.R ---
@@ -123,16 +123,30 @@ sparkR.init <- function(
 uriSep <- ""
   }
 
+  sparkEnvirMap <- convertNamedListToEnv(sparkEnvir)
+
   existingPort <- Sys.getenv("EXISTING_SPARKR_BACKEND_PORT", "")
   if (existingPort != "") {
 backendPort <- existingPort
   } else {
 path <- tempfile(pattern = "backend_port")
+submitOps <- Sys.getenv("SPARKR_SUBMIT_ARGS", "sparkr-shell")
+# spark.driver.memory cannot be set in env:
+# 
http://spark.apache.org/docs/latest/configuration.html#application-properties
+# Add spark.driver.memory if set in sparkEnvir and not already set in 
SPARKR_SUBMIT_ARGS
+if (!grepl("--driver-memory", submitOps)) {
--- End diff --

Lets add the support for other options of the same limitation, according to 
http://spark.apache.org/docs/latest/configuration.html, we also have:
spark.driver.extraClassPath
spark.driver.extraJavaOptions
spark.driver.extraLibraryPath


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-11334] numRunningTasks can't be less th...

2015-10-27 Thread XuTingjun

Github user XuTingjun commented on the pull request:

https://github.com/apache/spark/pull/9288#issuecomment-151400153
  
Yeah, I know the root cause is the wrong ordering of events. 
The code of these event order are: [kill 
Task](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1443),
 
[StageCompleted](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1444),
 
[JobEnd](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1458).
 
Because the TaskEnd is not serial with StageComplete and JobEnd, so I think 
maybe we can't control the ordering.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-11102] [SQL] Uninformative exception wh...

2015-10-27 Thread chenghao-intel

Github user chenghao-intel commented on the pull request:

https://github.com/apache/spark/pull/9223#issuecomment-151400091
  
Probably we also need to add the assertion at:

```scala
class HadoopFsRelation {
...
  final private[sql] def buildScan(
  requiredColumns: Array[String],
  filters: Array[Filter],
  inputPaths: Array[String],
  broadcastedConf: Broadcast[SerializableConfiguration])
...
}
```

I'll agree the JsonRelation is not exactly the same case, as we will try to 
infer the schema in another code path. 
So will that be OK if we add assertion at the `buildScan` in the parent 
class(as above), and also add the assertion at the `override lazy val 
dataSchema = ...` in `JsonRelation`(not the `createBaseRdd`)?

/cc @liancheng 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-9298][SQL] Add pearson correlation aggr...

2015-10-27 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/8587#discussion_r43090314
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/functions.scala
 ---
@@ -524,6 +525,116 @@ case class Sum(child: Expression) extends 
DeclarativeAggregate {
   override val evaluateExpression = Cast(currentSum, resultType)
 }
 
+case class Corr(
+left: Expression,
+right: Expression,
+mutableAggBufferOffset: Int = 0,
+inputAggBufferOffset: Int = 0)
+  extends ImperativeAggregate {
+
+  def children: Seq[Expression] = Seq(left, right)
+
+  def nullable: Boolean = false
+
+  def dataType: DataType = DoubleType
+
+  def inputTypes: Seq[AbstractDataType] = Seq(DoubleType)
+
+  def aggBufferSchema: StructType = 
StructType.fromAttributes(aggBufferAttributes)
+
+  def inputAggBufferAttributes: Seq[AttributeReference] = 
aggBufferAttributes.map(_.newInstance())
+
+  val aggBufferAttributes: Seq[AttributeReference] = Seq(
+AttributeReference("xAvg", DoubleType)(),
+AttributeReference("yAvg", DoubleType)(),
+AttributeReference("Ck", DoubleType)(),
+AttributeReference("MkX", DoubleType)(),
+AttributeReference("MkY", DoubleType)(),
+AttributeReference("count", LongType)())
+
+  override def withNewMutableAggBufferOffset(newMutableAggBufferOffset: 
Int): ImperativeAggregate =
+copy(mutableAggBufferOffset = newMutableAggBufferOffset)
+
+  override def withNewInputAggBufferOffset(newInputAggBufferOffset: Int): 
ImperativeAggregate =
+copy(inputAggBufferOffset = newInputAggBufferOffset)
+
+  override def initialize(buffer: MutableRow): Unit = {
+(0 until 5).map(idx => buffer.setDouble(mutableAggBufferOffset + idx, 
0.0))
+buffer.setLong(mutableAggBufferOffset + 5, 0L)
+  }
+
+  override def update(buffer: MutableRow, input: InternalRow): Unit = {
+val x = left.eval(input).asInstanceOf[Double]
+val y = right.eval(input).asInstanceOf[Double]
+
+var xAvg = buffer.getDouble(mutableAggBufferOffset)
+var yAvg = buffer.getDouble(mutableAggBufferOffset + 1)
+var Ck = buffer.getDouble(mutableAggBufferOffset + 2)
+var MkX = buffer.getDouble(mutableAggBufferOffset + 3)
+var MkY = buffer.getDouble(mutableAggBufferOffset + 4)
+var count = buffer.getLong(mutableAggBufferOffset + 5)
+
+val deltaX = x - xAvg
+val deltaY = y - yAvg
+count += 1
+xAvg += deltaX / count
+yAvg += deltaY / count
+Ck += deltaX * (y - yAvg)
+MkX += deltaX * (x - xAvg)
+MkY += deltaY * (y - yAvg)
+
+buffer.setDouble(mutableAggBufferOffset, xAvg)
+buffer.setDouble(mutableAggBufferOffset + 1, yAvg)
+buffer.setDouble(mutableAggBufferOffset + 2, Ck)
+buffer.setDouble(mutableAggBufferOffset + 3, MkX)
+buffer.setDouble(mutableAggBufferOffset + 4, MkY)
+buffer.setLong(mutableAggBufferOffset + 5, count)
+  }
+
+  override def merge(buffer1: MutableRow, buffer2: InternalRow): Unit = {
+val count2 = buffer2.getLong(inputAggBufferOffset + 5)
+
+if (count2 > 0) {
+  var xAvg = buffer1.getDouble(mutableAggBufferOffset)
+  var yAvg = buffer1.getDouble(mutableAggBufferOffset + 1)
+  var Ck = buffer1.getDouble(mutableAggBufferOffset + 2)
+  var MkX = buffer1.getDouble(mutableAggBufferOffset + 3)
+  var MkY = buffer1.getDouble(mutableAggBufferOffset + 4)
+  var count = buffer1.getLong(mutableAggBufferOffset + 5)
+
+  val xAvg2 = buffer2.getDouble(inputAggBufferOffset)
+  val yAvg2 = buffer2.getDouble(inputAggBufferOffset + 1)
+  val Ck2 = buffer2.getDouble(inputAggBufferOffset + 2)
+  val MkX2 = buffer2.getDouble(inputAggBufferOffset + 3)
+  val MkY2 = buffer2.getDouble(inputAggBufferOffset + 4)
+
+  val totalCount = count + count2
+  val deltaX = xAvg - xAvg2
+  val deltaY = yAvg - yAvg2
+  Ck += Ck2 + deltaX * deltaY * count / totalCount * count2
+  xAvg = (xAvg * count + xAvg2 * count2) / totalCount
+  yAvg = (yAvg * count + yAvg2 * count2) / totalCount
+  MkX += MkX2 + deltaX * deltaX * count / totalCount * count2
+  MkY += MkY2 + deltaY * deltaY * count / totalCount * count2
+  count = totalCount
+
+  buffer1.setDouble(mutableAggBufferOffset, xAvg)
+  buffer1.setDouble(mutableAggBufferOffset + 1, yAvg)
+  buffer1.setDouble(mutableAggBufferOffset + 2, Ck)
+  buffer1.setDouble(mutableAggBufferOffset + 3, MkX)
+  buffer1.setDouble(mutableAggBufferOffset + 4, MkY)
+  buffer1.setLong(mutableAggBufferOffset + 5, count)
+}
+  }
+

[GitHub] spark pull request: [SPARK-11341][SQL] Given non-zero ordinal toRo...

2015-10-27 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9292#issuecomment-151403244
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44407/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-11341][SQL] Given non-zero ordinal toRo...

2015-10-27 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9292#issuecomment-151403241
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-11302] [MLLIB] Multivariate Gaussian Mo...

2015-10-27 Thread srowen

GitHub user srowen opened a pull request:

https://github.com/apache/spark/pull/9293

[SPARK-11302] [MLLIB] Multivariate Gaussian Model with Covariance matrix 
returns incorrect answer in some cases

Compute sigma pseudo-inverse without square root to avoid precision problems

CC @mengxr @jkbradley 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/srowen/spark SPARK-11302

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/9293.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #9293


commit b375ce5029711998728daad1235f293fbd3bfa9a
Author: Sean Owen 
Date:   2015-10-27T07:47:47Z

Compute sigma pseudo-inverse without square root to avoid precision problems




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-11226] [SQL] Empty line in json file sh...

2015-10-27 Thread srowen

Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/9211#issuecomment-151406183
  
@zjffdu are you working on this?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10958] Use json4s 3.3.0. Formats is now...

2015-10-27 Thread srowen

Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/8992#issuecomment-151405945
  
I'm going to close this unless there's movement, given the discussion.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-11302] [MLLIB] Multivariate Gaussian Mo...

2015-10-27 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9293#issuecomment-151408073
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44411/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10484][SQL] Optimize the cartesian join...

2015-10-27 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8652#issuecomment-151411865
  
**[Test build #44410 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44410/consoleFull)**
 for PR 8652 at commit 
[`7fda511`](https://github.com/apache/spark/commit/7fda51170c1c994c608be9e362f5464990b3204f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-11342] [TESTS] Allow to set hadoop prof...

2015-10-27 Thread zjffdu

GitHub user zjffdu opened a pull request:

https://github.com/apache/spark/pull/9295

[SPARK-11342] [TESTS] Allow to set hadoop profile when running dev/ruâ¦

â¦n_tests

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zjffdu/spark SPARK-11342

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/9295.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #9295


commit 17a1eb8053a0a286a5a687418618ab90c3accbdf
Author: Jeff Zhang 
Date:   2015-10-27T08:23:05Z

[SPARK-11342] [TESTS] Allow to set hadoop profile when running dev/run_tests




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-5569] [STREAMING] fix ObjectInputStream...

2015-10-27 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/8955


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-11342] [TESTS] Allow to set hadoop prof...

2015-10-27 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9295#issuecomment-151415158
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44415/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-11226] [SQL] Empty line in json file sh...

2015-10-27 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9211#issuecomment-151415652
  
**[Test build #44416 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44416/consoleFull)**
 for PR 9211 at commit 
[`6dad278`](https://github.com/apache/spark/commit/6dad2783261bfcf2fd19729c4024bff1f74b600e).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-2629][STREAMING] Basic implementation o...

2015-10-27 Thread tdas

Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/9256#discussion_r43095278
  
--- Diff: 
streaming/src/main/scala/org/apache/spark/streaming/TrackStateSpec.scala ---
@@ -0,0 +1,111 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.streaming
+
+import scala.reflect.ClassTag
+
+import org.apache.spark.{HashPartitioner, Partitioner}
+import org.apache.spark.api.java.JavaPairRDD
+import org.apache.spark.rdd.RDD
+
+
+/**
+ * Abstract class having all the specifications of 
DStream.trackStateByKey().
+ * Use the `TrackStateSpec.create()` or `TrackStateSpec.create()` to 
create instances of this class.
+ *
+ * {{{
+ *TrackStateSpec(trackingFunction)// in Scala
+ *TrackStateSpec.create(trackingFunction) // in Java
+ * }}}
+ */
+sealed abstract class TrackStateSpec[K: ClassTag, V: ClassTag, S: 
ClassTag, T: ClassTag]
+  extends Serializable {
+
+  def initialState(rdd: RDD[(K, S)]): this.type
+  def initialState(javaPairRDD: JavaPairRDD[K, S]): this.type
+
+  def numPartitions(numPartitions: Int): this.type
+  def partitioner(partitioner: Partitioner): this.type
+
+  def timeout(interval: Duration): this.type
+}
+
+
+/** Builder object for creating instances of TrackStateSpec */
+object TrackStateSpec {
+
+  def apply[K: ClassTag, V: ClassTag, S: ClassTag, T: ClassTag](
+  trackingFunction: (K, Option[V], State[S]) => Option[T]): 
TrackStateSpec[K, V, S, T] = {
+new TrackStateSpecImpl[K, V, S, T](trackingFunction)
+  }
+
+  def create[K: ClassTag, V: ClassTag, S: ClassTag, T: ClassTag](
+  trackingFunction: (K, Option[V], State[S]) => Option[T]): 
TrackStateSpec[K, V, S, T] = {
+apply(trackingFunction)
+  }
--- End diff --

Yeah, I was planning to add all the Java friendly stuff on in a later PR, 
and focus on core functionality in this PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-2629][STREAMING] Basic implementation o...

2015-10-27 Thread tdas

Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/9256#discussion_r43095474
  
--- Diff: 
streaming/src/main/scala/org/apache/spark/streaming/dstream/TrackeStateDStream.scala
 ---
@@ -0,0 +1,192 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.streaming.dstream
+
+import java.io.{IOException, ObjectOutputStream}
+
+import scala.collection.mutable.ArrayBuffer
+import scala.reflect.ClassTag
+
+import org.apache.spark._
+import org.apache.spark.rdd.{EmptyRDD, RDD}
+import org.apache.spark.storage.StorageLevel
+import org.apache.spark.streaming._
+import org.apache.spark.streaming.util.StateMap
+import org.apache.spark.util.Utils
+
+private[streaming] case class TrackStateRDDRecord[K: ClassTag, S: 
ClassTag, T: ClassTag](
+stateMap: StateMap[K, S], emittedRecords: Seq[T])
+
+
+private[streaming] class TrackStateRDDPartition(
+idx: Int,
+@transient private var prevStateRDD: RDD[_],
+@transient private var partitionedDataRDD: RDD[_]) extends Partition {
+
+  private[dstream] var previousSessionRDDPartition: Partition = null
+  private[dstream] var partitionedDataRDDPartition: Partition = null
+
+  override def index: Int = idx
+  override def hashCode(): Int = idx
+
+  @throws(classOf[IOException])
+  private def writeObject(oos: ObjectOutputStream): Unit = 
Utils.tryOrIOException {
+// Update the reference to parent split at the time of task 
serialization
+previousSessionRDDPartition = prevStateRDD.partitions(index)
+partitionedDataRDDPartition = partitionedDataRDD.partitions(index)
+oos.defaultWriteObject()
+  }
+}
+
+private[streaming] class TrackStateRDD[K: ClassTag, V: ClassTag, S: 
ClassTag, T: ClassTag](
+_sc: SparkContext,
+private var prevStateRDD: RDD[TrackStateRDDRecord[K, S, T]],
+private var partitionedDataRDD: RDD[(K, V)],
+trackingFunction: (K, Option[V], State[S]) => Option[T],
+currentTime: Long, timeoutThresholdTime: Option[Long]
+  ) extends RDD[TrackStateRDDRecord[K, S, T]](
+_sc,
+List(
+  new OneToOneDependency[TrackStateRDDRecord[K, S, T]](prevStateRDD),
+  new OneToOneDependency(partitionedDataRDD))
+  ) {
+
+  @volatile private var doFullScan = false
+
+  require(partitionedDataRDD.partitioner.nonEmpty)
+  require(partitionedDataRDD.partitioner == prevStateRDD.partitioner)
+
+  override val partitioner = prevStateRDD.partitioner
+
+  override def checkpoint(): Unit = {
+super.checkpoint()
+doFullScan = true
+  }
+
+  override def compute(
+  partition: Partition, context: TaskContext): 
Iterator[TrackStateRDDRecord[K, S, T]] = {
+
+val stateRDDPartition = partition.asInstanceOf[TrackStateRDDPartition]
+val prevStateRDDIterator = prevStateRDD.iterator(
+  stateRDDPartition.previousSessionRDDPartition, context)
+val dataIterator = partitionedDataRDD.iterator(
+  stateRDDPartition.partitionedDataRDDPartition, context)
+if (!prevStateRDDIterator.hasNext) {
+  throw new SparkException(s"Could not find state map in previous 
state RDD")
+}
+
+val newStateMap = prevStateRDDIterator.next().stateMap.copy()
+val emittedRecords = new ArrayBuffer[T]
+
+val wrappedState = new StateImpl[S]()
+
+dataIterator.foreach { case (key, value) =>
+  wrappedState.wrap(newStateMap.get(key))
+  val emittedRecord = trackingFunction(key, Some(value), wrappedState)
--- End diff --

If the parentDStream[(K, V)] generates data such that the corresponding 
value is null, then the system should respect that null and pass it on. So odd 
as it may sounds, I think it is right to call the track state function with 
`Some(null)`.

But you do raise a good point. I will get more feedback on this.


---
If your project is set up for it, you

[GitHub] spark pull request: [SPARK-8582][Core]Optimize checkpointing to av...

2015-10-27 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9258#issuecomment-151418311
  
Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [spark-11252][network]ShuffleClient should rel...

2015-10-27 Thread vanzin

Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/9227#discussion_r43096557
  
--- Diff: 
network/common/src/main/java/org/apache/spark/network/server/TransportChannelHandler.java
 ---
@@ -109,17 +109,20 @@ public void channelRead0(ChannelHandlerContext ctx, 
Message request) {
   public void userEventTriggered(ChannelHandlerContext ctx, Object evt) 
throws Exception {
 if (evt instanceof IdleStateEvent) {
   IdleStateEvent e = (IdleStateEvent) evt;
-  // See class comment for timeout semantics. In addition to ensuring 
we only timeout while
-  // there are outstanding requests, we also do a secondary 
consistency check to ensure
-  // there's no race between the idle timeout and incrementing the 
numOutstandingRequests.
-  boolean hasInFlightRequests = 
responseHandler.numOutstandingRequests() > 0;
+  // While an IdleStateEvent has been triggered, we can close idle 
connection
+  // because it has no read/write events for requestTimeoutNs.
   boolean isActuallyOverdue =
 System.nanoTime() - responseHandler.getTimeOfLastRequestNs() > 
requestTimeoutNs;
-  if (e.state() == IdleState.ALL_IDLE && hasInFlightRequests && 
isActuallyOverdue) {
-String address = NettyUtils.getRemoteAddress(ctx.channel());
-logger.error("Connection to {} has been quiet for {} ms while 
there are outstanding " +
-  "requests. Assuming connection is dead; please adjust 
spark.network.timeout if this " +
-  "is wrong.", address, requestTimeoutNs / 1000 / 1000);
+  if (e.state() == IdleState.ALL_IDLE && isActuallyOverdue) {
+// In addition to ensuring we only timeout while there are 
outstanding requests, we also
--- End diff --

Not sure what's going on here. You're not fixing a race, because there's no 
synchronization anywhere. And now you're closing the connection anytime the 
connection is idle, regardless of whether there are outstanding requests. This 
is not good for the netty-based RpcEnv implementation, which does not like when 
its sockets are closed (e.g. executors will die and things like that).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-2960][Deploy] Support executing Spark f...

2015-10-27 Thread jerryshao

Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/8669#discussion_r43088776
  
--- Diff: bin/beeline ---
@@ -23,8 +23,10 @@
 # Enter posix mode for bash
 set -o posix
 
-# Figure out where Spark is installed
-FWDIR="$(cd "`dirname "$0"`"/..; pwd)"
+# Figure out if SPARK_HOME is set
+if [ -z "${SPARK_HOME}" ]; then
+export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
--- End diff --

Thanks @srowen for your comments, I will change to the 2-space indent.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-2629][STREAMING] Basic implementation o...

2015-10-27 Thread jerryshao

Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/9256#discussion_r43088697
  
--- Diff: streaming/src/main/scala/org/apache/spark/streaming/State.scala 
---
@@ -0,0 +1,138 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.streaming
+
+/**
+ * Abstract class for getting and updating the tracked state in the 
`trackStateByKey` operation of
+ * [[org.apache.spark.streaming.dstream.PairDStreamFunctions pair 
DStream]] and
+ * [[org.apache.spark.streaming.api.java.JavaPairDStream]].
+ * {{{
+ *
+ * }}}
+ */
+sealed abstract class State[S] {
+  
+  /** Whether the state already exists */
+  def exists(): Boolean
+  
+  /**
+   * Get the state if it exists, otherwise wise it will throw an exception.
+   * Check with `exists()` whether the state exists or not before calling 
`get()`.
+   */
+  def get(): S
+
+  /**
+   * Update the state with a new value. Note that you cannot update the 
state if the state is
+   * timing out (that is, `isTimingOut() return true`, or if the state has 
already been removed by
+   * `remove()`.
+   */
+  def update(newState: S): Unit
+
+  /** Remove the state if it exists. */
+  def remove(): Unit
+
+  /** Is the state going to be timed out by the system after this batch 
interval */
+  def isTimingOut(): Boolean
+
+  @inline final def getOption(): Option[S] = Option(get())
+
+  /** Get the state if it exists, otherwise return the default value */
+  @inline final def getOrElse[S1 >: S](default: => S1): S1 = {
+if (exists) this.get else default
+  }
+
+  @inline final override def toString() = getOption.map { _.toString 
}.getOrElse("")
+}
+
+/** Internal implementation of the [[State]] interface */
+private[streaming] class StateImpl[S] extends State[S] {
+
+  private var state: S = null.asInstanceOf[S]
+  private var defined: Boolean = true
+  private var timingOut: Boolean = false
+  private var updated: Boolean = false
+  private var removed: Boolean = false
+
+  // = Public API =
+  def exists(): Boolean = {
+defined
+  }
+
+  def get(): S = {
+state
+  }
+
+  def update(newState: S): Unit = {
+require(!removed, "Cannot update the state after it has been removed")
+require(!timingOut, "Cannot update the state that is timing out")
+state = newState
--- End diff --

Is this required for defensive guard `require(!updated, "cannot update the 
state this is already updated")`? 




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-11297] Add new code tags

2015-10-27 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/9265


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-11297] Add new code tags

2015-10-27 Thread mengxr

Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/9265#issuecomment-151396021
  
LGTM. Merged into master. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SAPRK-8546] Add PMML export for Naive Bayes

2015-10-27 Thread mengxr

Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/9057#issuecomment-151396091
  
@yinxusen Could you update the PR title? `SAPRK` is a typo.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-11334] numRunningTasks can't be less th...

2015-10-27 Thread srowen

Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/9288#discussion_r43089305
  
--- Diff: 
core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
@@ -615,7 +615,11 @@ private[spark] class ExecutorAllocationManager(
   val taskIndex = taskEnd.taskInfo.index
   val stageId = taskEnd.stageId
   allocationManager.synchronized {
-numRunningTasks -= 1
+if (numRunningTasks > 0) {
--- End diff --

Hm, this seems like a band-aid though. Is this because the task-end event 
happens in the wrong order with the job end event? can we catch and guard for 
that case more directly?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-11302] [MLLIB] Multivariate Gaussian Mo...

2015-10-27 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9293#issuecomment-151404857
  
Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-11302] [MLLIB] Multivariate Gaussian Mo...

2015-10-27 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9293#issuecomment-151404807
  
 Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-2960][Deploy] Support executing Spark f...

2015-10-27 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8669#issuecomment-151405504
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-2960][Deploy] Support executing Spark f...

2015-10-27 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8669#issuecomment-151405505
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44408/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-11276][CORE] SizeEstimator prevents cla...

2015-10-27 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/9244


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-11276][CORE] SizeEstimator prevents cla...

2015-10-27 Thread srowen

Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/9244#issuecomment-151405796
  
merged to master


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-11303] [SQL] filter should not be pushe...

2015-10-27 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9294#issuecomment-151406913
  
Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [spark-11252][network]ShuffleClient should rel...

2015-10-27 Thread lianhuiwang

Github user lianhuiwang commented on the pull request:

https://github.com/apache/spark/pull/9227#issuecomment-151409731
  
@zsxwing @vanzin thanks, i agree with you. now i use IdleStateHandler to 
monitor idle connection. after idle state event has been triggered, we can 
close this connection.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10024] [pyspark] Python API RF and GBT ...

2015-10-27 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9233#issuecomment-151409704
  
Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10024] [pyspark] Python API RF and GBT ...

2015-10-27 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9233#issuecomment-151409660
  
 Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [spark-11252][network]ShuffleClient should rel...

2015-10-27 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9227#issuecomment-151409662
  
 Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [spark-11252][network]ShuffleClient should rel...

2015-10-27 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9227#issuecomment-151409703
  
Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-11212][Core][Streaming]Make preferred l...

2015-10-27 Thread tdas

Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/9181#discussion_r43093391
  
--- Diff: 
streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverSchedulingPolicy.scala
 ---
@@ -163,40 +169,59 @@ private[streaming] class ReceiverSchedulingPolicy {
   receiverId: Int,
   preferredLocation: Option[String],
   receiverTrackingInfoMap: Map[Int, ReceiverTrackingInfo],
-  executors: Seq[String]): Seq[String] = {
+  executors: Seq[ExecutorCacheTaskLocation]): Seq[TaskLocation] = {
 if (executors.isEmpty) {
   return Seq.empty
 }
 
 // Always try to schedule to the preferred locations
-val scheduledExecutors = mutable.Set[String]()
-scheduledExecutors ++= preferredLocation
-
-val executorWeights = receiverTrackingInfoMap.values.flatMap { 
receiverTrackingInfo =>
-  receiverTrackingInfo.state match {
-case ReceiverState.INACTIVE => Nil
-case ReceiverState.SCHEDULED =>
-  val scheduledExecutors = 
receiverTrackingInfo.scheduledExecutors.get
-  // The probability that a scheduled receiver will run in an 
executor is
-  // 1.0 / scheduledLocations.size
-  scheduledExecutors.map(location => location -> (1.0 / 
scheduledExecutors.size))
-case ReceiverState.ACTIVE => 
Seq(receiverTrackingInfo.runningExecutor.get -> 1.0)
-  }
-}.groupBy(_._1).mapValues(_.map(_._2).sum) // Sum weights for each 
executor
+val scheduledLocations = mutable.Set[TaskLocation]()
+// Note: preferredLocation could be `HDFSCacheTaskLocation`, so use 
`TaskLocation.apply` to
+// handle this case
+scheduledLocations ++= preferredLocation.map(TaskLocation(_))
+
+val executorWeights: Map[ExecutorCacheTaskLocation, Double] = {
+  
receiverTrackingInfoMap.values.flatMap(convertReceiverTrackingInfoToExecutorWeights)
+.groupBy(_._1).mapValues(_.map(_._2).sum) // Sum weights for each 
executor
+}
 
 val idleExecutors = executors.toSet -- executorWeights.keys
 if (idleExecutors.nonEmpty) {
-  scheduledExecutors ++= idleExecutors
+  scheduledLocations ++= idleExecutors
 } else {
   // There is no idle executor. So select all executors that have the 
minimum weight.
   val sortedExecutors = executorWeights.toSeq.sortBy(_._2)
   if (sortedExecutors.nonEmpty) {
 val minWeight = sortedExecutors(0)._2
-scheduledExecutors ++= sortedExecutors.takeWhile(_._2 == 
minWeight).map(_._1)
+scheduledLocations ++= sortedExecutors.takeWhile(_._2 == 
minWeight).map(_._1)
   } else {
 // This should not happen since "executors" is not empty
   }
 }
-scheduledExecutors.toSeq
+scheduledLocations.toSeq
+  }
+
+  /**
+   * This method tries to convert a receiver tracking info to executor 
weights. Every executor will
+   * be assigned to a weight according to the receivers running or 
scheduling on it:
+   *
+   * - If a receiver is running on an executor, it contributes 1.0 to the 
executor's weight.
+   * - If a receiver is scheduled to an executor but has not yet run, it 
contributes
+   * `1.0 / #candidate_executors_of_this_receiver` to the executor's 
weight.
+   */
+  private def convertReceiverTrackingInfoToExecutorWeights(
+  receiverTrackingInfo: ReceiverTrackingInfo): 
Seq[(ExecutorCacheTaskLocation, Double)] = {
+receiverTrackingInfo.state match {
+  case ReceiverState.INACTIVE => Nil
+  case ReceiverState.SCHEDULED =>
+val scheduledLocations = 
receiverTrackingInfo.scheduledLocations.get
+// The probability that a scheduled receiver will run in an 
executor is
+// 1.0 / scheduledLocations.size
+
scheduledLocations.filter(_.isInstanceOf[ExecutorCacheTaskLocation]).
+  map { location =>
--- End diff --

nit:shouldnt this map be in the line above? I think `.map { loc =>` would 
fit.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10484][SQL] Optimize the cartesian join...

2015-10-27 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8652#issuecomment-151411970
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44410/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10484][SQL] Optimize the cartesian join...

2015-10-27 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8652#issuecomment-151411969
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-11342] [TESTS] Allow to set hadoop prof...

2015-10-27 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9295#issuecomment-151412491
  
 Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-11342] [TESTS] Allow to set hadoop prof...

2015-10-27 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9295#issuecomment-151412515
  
Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-11226] [SQL] Empty line in json file sh...

2015-10-27 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9211#issuecomment-151414326
  
Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-11226] [SQL] Empty line in json file sh...

2015-10-27 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9211#issuecomment-151414301
  
 Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-2629][STREAMING] Basic implementation o...

2015-10-27 Thread tdas

Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/9256#discussion_r43095228
  
--- Diff: streaming/src/main/scala/org/apache/spark/streaming/State.scala 
---
@@ -0,0 +1,138 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.streaming
+
+/**
+ * Abstract class for getting and updating the tracked state in the 
`trackStateByKey` operation of
+ * [[org.apache.spark.streaming.dstream.PairDStreamFunctions pair 
DStream]] and
+ * [[org.apache.spark.streaming.api.java.JavaPairDStream]].
+ * {{{
+ *
+ * }}}
+ */
+sealed abstract class State[S] {
+  
+  /** Whether the state already exists */
+  def exists(): Boolean
+  
+  /**
+   * Get the state if it exists, otherwise wise it will throw an exception.
+   * Check with `exists()` whether the state exists or not before calling 
`get()`.
+   */
+  def get(): S
+
+  /**
+   * Update the state with a new value. Note that you cannot update the 
state if the state is
+   * timing out (that is, `isTimingOut() return true`, or if the state has 
already been removed by
+   * `remove()`.
+   */
+  def update(newState: S): Unit
+
+  /** Remove the state if it exists. */
+  def remove(): Unit
+
+  /** Is the state going to be timed out by the system after this batch 
interval */
+  def isTimingOut(): Boolean
+
+  @inline final def getOption(): Option[S] = Option(get())
+
+  /** Get the state if it exists, otherwise return the default value */
+  @inline final def getOrElse[S1 >: S](default: => S1): S1 = {
+if (exists) this.get else default
+  }
+
+  @inline final override def toString() = getOption.map { _.toString 
}.getOrElse("")
+}
+
+/** Internal implementation of the [[State]] interface */
+private[streaming] class StateImpl[S] extends State[S] {
+
+  private var state: S = null.asInstanceOf[S]
+  private var defined: Boolean = true
+  private var timingOut: Boolean = false
+  private var updated: Boolean = false
+  private var removed: Boolean = false
+
+  // = Public API =
+  def exists(): Boolean = {
+defined
+  }
+
+  def get(): S = {
+state
+  }
+
+  def update(newState: S): Unit = {
+require(!removed, "Cannot update the state after it has been removed")
+require(!timingOut, "Cannot update the state that is timing out")
+state = newState
--- End diff --

Yeah, if the user accidentally tries to update the state that is going to 
be removed by timeout anyways, the system should throw an error rather than 
silently allowing him/her to update without actually being updated.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-11342] [TESTS] Allow to set hadoop prof...

2015-10-27 Thread zjffdu

Github user zjffdu commented on the pull request:

https://github.com/apache/spark/pull/9295#issuecomment-151420163
  
Test Failed due to network issue
```
ERROR: Timeout after 15 minutes
ERROR: Error fetching remote repo 'origin'
hudson.plugins.git.GitException: Failed to fetch from 
https://github.com/apache/spark.git
at hudson.plugins.git.GitSCM.fetchFrom(GitSCM.java:735)
at hudson.plugins.git.GitSCM.retrieveChanges(GitSCM.java:983)
at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1016)
at hudson.scm.SCM.checkout(SCM.java:485)
at hudson.model.AbstractProject.checkout(AbstractProject.java:1277)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:610)
at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:86)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:532)
at hudson.model.Run.execute(Run.java:1741)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
at hudson.model.ResourceController.execute(ResourceController.java:98)
at hudson.model.Executor.run(Executor.java:408)
Caused by: hudson.plugins.git.GitException: Command "git fetch --tags 
--progress https://github.com/apache/spark.git 
+refs/pull/9295/*:refs/remotes/origin/pr/9295/*" returned status code 143:
```



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-10484][SQL] Optimize the cartesian join...

2015-10-27 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8652#issuecomment-151391381
  
**[Test build #44410 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44410/consoleFull)**
 for PR 8652 at commit 
[`7fda511`](https://github.com/apache/spark/commit/7fda51170c1c994c608be9e362f5464990b3204f).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-2960][Deploy] Support executing Spark f...

2015-10-27 Thread srowen

Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/8669#discussion_r43088465
  
--- Diff: bin/beeline ---
@@ -23,8 +23,10 @@
 # Enter posix mode for bash
 set -o posix
 
-# Figure out where Spark is installed
-FWDIR="$(cd "`dirname "$0"`"/..; pwd)"
+# Figure out if SPARK_HOME is set
+if [ -z "${SPARK_HOME}" ]; then
+export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
--- End diff --

Yes, I very much like how this has standardized everything to use 
`SPARK_HOME` instead of `FWDIR`, and doesn't overwrite the value if already 
set. (Nit: all of the occurrences of this line have a 4-space indent instead of 
2)

LGTM; does anyone see a reason this isn't a good idea? I suppose now 
`SPARK_HOME`, if set, has an effect everywhere, but it looks like the desired 
effect. Docs also make reference to `SPARK_HOME` as if it has this effect.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-9298][SQL] Add pearson correlation aggr...

2015-10-27 Thread mengxr

Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/8587#issuecomment-151401142
  
LGTM except inline comments. Ping @yhuai @rxin for another pass on SQL side.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-11341][SQL] Given non-zero ordinal toRo...

2015-10-27 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9292#issuecomment-151403096
  
**[Test build #44407 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44407/consoleFull)**
 for PR 9292 at commit 
[`0506fa7`](https://github.com/apache/spark/commit/0506fa738089573cd3fd97629b24add26118e178).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-11334] numRunningTasks can't be less th...

2015-10-27 Thread jerryshao

Github user jerryshao commented on the pull request:

https://github.com/apache/spark/pull/9288#issuecomment-151404693
  
Can we do this by adding pending to kill tasks into list, only when all the 
tasks marked as finished, then call `markStageAsFinished`, also post 
`SparkListenerJobEnd`. AFAIK, this is the way to manage executor killing in 
`CoarseGrainedSchedulerBacked`, from the code level (not sure the complexity) 
this can also be achieved here in `DAGScheduler`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-11303] [SQL] filter should not be pushe...

2015-10-27 Thread yanboliang

GitHub user yanboliang opened a pull request:

https://github.com/apache/spark/pull/9294

[SPARK-11303] [SQL] filter should not be pushed down into sample

When sampling and then filtering DataFrame, the SQL Optimizer will push 
down filter into sample and produce wrong result.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yanboliang/spark spark-11303

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/9294.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #9294


commit ef16f3d2d8aff441a92061c9f44b700161f7f9dd
Author: Yanbo Liang 
Date:   2015-10-27T07:55:06Z

filter should not be pushed down into sample




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-11302] [MLLIB] Multivariate Gaussian Mo...

2015-10-27 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9293#issuecomment-151408072
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-11303] [SQL] filter should not be pushe...

2015-10-27 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9294#issuecomment-151409429
  
**[Test build #44412 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44412/consoleFull)**
 for PR 9294 at commit 
[`ef16f3d`](https://github.com/apache/spark/commit/ef16f3d2d8aff441a92061c9f44b700161f7f9dd).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [spark-11252][network]ShuffleClient should rel...

2015-10-27 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9227#issuecomment-151411277
  
**[Test build #44414 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44414/consoleFull)**
 for PR 9227 at commit 
[`9052688`](https://github.com/apache/spark/commit/90526884d1662ca7ebec7b205b856111ae147e6d).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-11324][STREAMING] Flag for closing Writ...

2015-10-27 Thread tdas

Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/9285#discussion_r43093488
  
--- Diff: 
streaming/src/main/scala/org/apache/spark/streaming/util/FileBasedWriteAheadLog.scala
 ---
@@ -47,7 +47,8 @@ private[streaming] class FileBasedWriteAheadLog(
 logDirectory: String,
 hadoopConf: Configuration,
 rollingIntervalSecs: Int,
-maxFailures: Int
+maxFailures: Int,
+closeAfterWrite: Boolean = false
--- End diff --

Yeah, it is not exposed. So its better to actually keep it as non-default 
parameter, and add the parameter judiciously in the all the 4 spots.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-2629][STREAMING] Basic implementation o...

2015-10-27 Thread tdas

Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/9256#discussion_r43095076
  
--- Diff: streaming/src/main/scala/org/apache/spark/streaming/State.scala 
---
@@ -0,0 +1,138 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.streaming
+
+/**
+ * Abstract class for getting and updating the tracked state in the 
`trackStateByKey` operation of
+ * [[org.apache.spark.streaming.dstream.PairDStreamFunctions pair 
DStream]] and
+ * [[org.apache.spark.streaming.api.java.JavaPairDStream]].
+ * {{{
+ *
+ * }}}
+ */
+sealed abstract class State[S] {
+  
+  /** Whether the state already exists */
+  def exists(): Boolean
+  
+  /**
+   * Get the state if it exists, otherwise wise it will throw an exception.
+   * Check with `exists()` whether the state exists or not before calling 
`get()`.
+   */
+  def get(): S
+
+  /**
+   * Update the state with a new value. Note that you cannot update the 
state if the state is
+   * timing out (that is, `isTimingOut() return true`, or if the state has 
already been removed by
+   * `remove()`.
+   */
+  def update(newState: S): Unit
+
+  /** Remove the state if it exists. */
+  def remove(): Unit
+
+  /** Is the state going to be timed out by the system after this batch 
interval */
+  def isTimingOut(): Boolean
+
+  @inline final def getOption(): Option[S] = Option(get())
+
+  /** Get the state if it exists, otherwise return the default value */
+  @inline final def getOrElse[S1 >: S](default: => S1): S1 = {
--- End diff --

Yeah, that is probably a valid concern. If users cannot call it from Java, 
this could be a Scala-only thing. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [spark-11252][network]ShuffleClient should rel...

2015-10-27 Thread vanzin

Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/9227#discussion_r43096749
  
--- Diff: 
network/common/src/main/java/org/apache/spark/network/client/TransportClientFactory.java
 ---
@@ -158,6 +158,13 @@ public TransportClient createClient(String remoteHost, 
int remotePort) throws IO
 }
   }
 
+  /**
+   * Create a completely new {@link TransportClient} to the given remote 
host / port */
+  public TransportClient createNewClient(String remoteHost, int 
remotePort) throws IOException {
--- End diff --

I actually wanted to add this method for a different change (no PR yet), so 
it's welcome. But the javadoc needs to mention that this connection is not 
pooled, like the ones returned by `createClient`.

Because of that, I'd also name the method `createUnmanagedClient`, and make 
it clear that it's the caller's responsibility to close it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-11212][Core][Streaming]Make preferred l...

2015-10-27 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9181#issuecomment-151441563
  
Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [spark-11252][network]ShuffleClient should rel...

2015-10-27 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9227#issuecomment-151441541
  
**[Test build #44414 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44414/consoleFull)**
 for PR 9227 at commit 
[`9052688`](https://github.com/apache/spark/commit/90526884d1662ca7ebec7b205b856111ae147e6d).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-11212][Core][Streaming]Make preferred l...

2015-10-27 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9181#issuecomment-151441532
  
 Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [spark-11252][network]ShuffleClient should rel...

2015-10-27 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9227#issuecomment-151441611
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-11303] [SQL] filter should not be pushe...

2015-10-27 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9294#issuecomment-151445258
  
**[Test build #44412 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44412/consoleFull)**
 for PR 9294 at commit 
[`ef16f3d`](https://github.com/apache/spark/commit/ef16f3d2d8aff441a92061c9f44b700161f7f9dd).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-11303] [SQL] filter should not be pushe...

2015-10-27 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9294#issuecomment-151445422
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44412/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-9298][SQL] Add pearson correlation aggr...

2015-10-27 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/8587#issuecomment-151422462
  
Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-9298][SQL] Add pearson correlation aggr...

2015-10-27 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/8587#issuecomment-151425036
  
**[Test build #44418 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44418/consoleFull)**
 for PR 8587 at commit 
[`02562f3`](https://github.com/apache/spark/commit/02562f3a9ab9cda941b260a834d505f7cacd46f2).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-11206] Support SQL UI on the history se...

2015-10-27 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9297#issuecomment-151427044
  
**[Test build #44419 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44419/consoleFull)**
 for PR 9297 at commit 
[`d52288b`](https://github.com/apache/spark/commit/d52288bb2e18a5f0e898110893a091919e13ea84).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [spark-11252][network]ShuffleClient should rel...

2015-10-27 Thread lianhuiwang

Github user lianhuiwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9227#discussion_r43099624
  
--- Diff: 
network/common/src/main/java/org/apache/spark/network/client/TransportClientFactory.java
 ---
@@ -158,6 +158,13 @@ public TransportClient createClient(String remoteHost, 
int remotePort) throws IO
 }
   }
 
+  /**
+   * Create a completely new {@link TransportClient} to the given remote 
host / port */
+  public TransportClient createNewClient(String remoteHost, int 
remotePort) throws IOException {
--- End diff --

yes, i will do it as you said.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-11341][SQL] Given non-zero ordinal toRo...

2015-10-27 Thread marmbrus

Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/9292#issuecomment-151438992
  
Thanks for working on this, but I'm actually probably going to delete these 
class as part of the work I'm doing here: 
https://github.com/apache/spark/compare/master...marmbrus:datasets-tuples


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-11305] [DOCS] Remove Third-Party Hadoop...

2015-10-27 Thread srowen

GitHub user srowen opened a pull request:

https://github.com/apache/spark/pull/9298

[SPARK-11305] [DOCS] Remove Third-Party Hadoop Distributions Doc Page

Remove Hadoop third party distro page, and move Hadoop cluster config info 
to configuration page

CC @pwendell 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/srowen/spark SPARK-11305

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/9298.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #9298


commit bf36334f0a5ef0e38e39ee278bf7d7b76ba2ce18
Author: Sean Owen 
Date:   2015-10-27T10:22:02Z

Remove Hadoop third party distro page, and move Hadoop cluster config info 
to configuration page




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

1 2 3 4 5 6 7 8 >

1 - 100 of 775 matches

Mail list logo