[GitHub] spark issue #16826: Fork SparkSession with option to inherit a copy of the S...

2017-02-08 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/16826
  
@kunalkhamar  by the way, please add the JIRA number to the title.





[GitHub] spark pull request #16826: Fork SparkSession with option to inherit a copy o...

2017-02-08 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/16826#discussion_r100192972
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala ---
@@ -38,16 +38,31 @@ import 
org.apache.spark.sql.util.ExecutionListenerManager
 
 /**
  * A class that holds all session-specific state in a given 
[[SparkSession]].
+ * If an `existingSessionState` is supplied, then its members will be 
copied over.
  */
-private[sql] class SessionState(sparkSession: SparkSession) {
+private[sql] class SessionState(
+sparkSession: SparkSession,
+existingSessionState: Option[SessionState]) {
--- End diff --

nit: `existingSessionState` -> `parentSessionState` to indicate that we should copy its internal state.
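For illustration, a sketch of the constructor with the suggested rename applied (the rest of the class body is unchanged and elided here):

```scala
private[sql] class SessionState(
    sparkSession: SparkSession,
    parentSessionState: Option[SessionState]) {

  // Convenience constructor for the common case of a fresh session with no parent.
  private[sql] def this(sparkSession: SparkSession) = this(sparkSession, None)
}
```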





[GitHub] spark pull request #16826: Fork SparkSession with option to inherit a copy o...

2017-02-08 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/16826#discussion_r100197297
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala 
---
@@ -213,6 +218,24 @@ class SparkSession private(
 new SparkSession(sparkContext, Some(sharedState))
--- End diff --

It's better to add `final` to avoid overriding this incorrectly. The method that should be overridden is `newSession(inheritSessionState: Boolean)`.
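A rough sketch of the resulting pair of overloads (the body of the boolean variant is illustrative, not the PR's exact wiring):

```scala
// Making the no-arg variant final prevents subclasses from overriding the wrong
// method; it simply delegates to the boolean variant with inheritance disabled.
final def newSession(): SparkSession = newSession(inheritSessionState = false)

// Subclasses such as TestHiveSparkSession would override this variant instead.
def newSession(inheritSessionState: Boolean): SparkSession = {
  ???  // child-session construction elided in this sketch; see the PR diff
}
```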





[GitHub] spark pull request #16826: Fork SparkSession with option to inherit a copy o...

2017-02-08 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/16826#discussion_r100195293
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala 
---
@@ -213,6 +218,24 @@ class SparkSession private(
 new SparkSession(sparkContext, Some(sharedState))
   }
 
+  /**
+   * Start a new session, sharing the underlying `SparkContext` and cached 
data.
--- End diff --

nit: add `:: Experimental ::`





[GitHub] spark issue #16715: [Spark-18080][ML] Python API & Examples for Locality Sen...

2017-02-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16715
  
**[Test build #72612 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72612/testReport)**
 for PR 16715 at commit 
[`6e85e1a`](https://github.com/apache/spark/commit/6e85e1a04b02dea26e82c8bc77151b8e389f4fe5).





[GitHub] spark pull request #16826: Fork SparkSession with option to inherit a copy o...

2017-02-08 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/16826#discussion_r100190232
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/ExperimentalMethods.scala ---
@@ -46,4 +48,23 @@ class ExperimentalMethods private[sql]() {
 
   @volatile var extraOptimizations: Seq[Rule[LogicalPlan]] = Nil
 
+  /**
+   * Get an identical copy of this `ExperimentalMethods` instance.
+   * @note This is used when forking a `SparkSession`.
+   * `clone` is provided here instead of implementing equivalent 
functionality
+   * in `SparkSession.clone` since it needs to be updated
+   * as the class `ExperimentalMethods` is extended or  modified.
+   */
+  override def clone: ExperimentalMethods = {
+def cloneSeq[T](seq: Seq[T]): Seq[T] = {
+  val newSeq = new ListBuffer[T]
+  newSeq ++= seq
+  newSeq
+}
+
+val result = new ExperimentalMethods
+result.extraStrategies = cloneSeq(extraStrategies)
--- End diff --

You don't need to copy these two `Seq`s since they are not mutable `Seq`s.
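Under that assumption (both fields hold immutable `Seq`s), `clone` could arguably be reduced to something like this sketch:

```scala
override def clone: ExperimentalMethods = {
  val result = new ExperimentalMethods
  // The default Seq is immutable, so sharing the references is safe here;
  // only the @volatile fields themselves need to be reassigned.
  result.extraStrategies = extraStrategies
  result.extraOptimizations = extraOptimizations
  result
}
```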





[GitHub] spark pull request #16826: Fork SparkSession with option to inherit a copy o...

2017-02-08 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/16826#discussion_r100195235
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala 
---
@@ -213,6 +218,24 @@ class SparkSession private(
 new SparkSession(sparkContext, Some(sharedState))
   }
 
+  /**
+   * Start a new session, sharing the underlying `SparkContext` and cached 
data.
+   * If inheritSessionState is enabled, then SQL configurations, temporary 
tables,
+   * registered functions are copied over from parent `SparkSession`.
+   *
+   * @note Other than the `SparkContext`, all shared state is initialized 
lazily.
+   * This method will force the initialization of the shared state to 
ensure that parent
+   * and child sessions are set up with the same shared state. If the 
underlying catalog
+   * implementation is Hive, this will initialize the metastore, which may 
take some time.
+   */
--- End diff --

nit: please add `@Experimental` and `@InterfaceStability.Evolving`
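For reference, the annotated signature would presumably look roughly like this once the neighboring nits (the `:: Experimental ::` Scaladoc marker and the `@since` tag) are folded in:

```scala
/**
 * :: Experimental ::
 * Start a new session, sharing the underlying `SparkContext` and cached data.
 *
 * @since 2.1.1
 */
@Experimental
@InterfaceStability.Evolving
def newSession(inheritSessionState: Boolean): SparkSession = {
  ???  // body as in the PR diff; elided in this sketch
}
```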





[GitHub] spark pull request #16826: Fork SparkSession with option to inherit a copy o...

2017-02-08 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/16826#discussion_r100193350
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala ---
@@ -38,16 +38,31 @@ import 
org.apache.spark.sql.util.ExecutionListenerManager
 
 /**
  * A class that holds all session-specific state in a given 
[[SparkSession]].
+ * If an `existingSessionState` is supplied, then its members will be 
copied over.
  */
-private[sql] class SessionState(sparkSession: SparkSession) {
+private[sql] class SessionState(
+sparkSession: SparkSession,
+existingSessionState: Option[SessionState]) {
+
+  private[sql] def this(sparkSession: SparkSession) = {
+this(sparkSession, None)
+  }
 
   // Note: These are all lazy vals because they depend on each other (e.g. 
conf) and we
   // want subclasses to override some of the fields. Otherwise, we would 
get a lot of NPEs.
 
   /**
* SQL-specific key-value configurations.
*/
-  lazy val conf: SQLConf = new SQLConf
+  lazy val conf: SQLConf = {
+val result = new SQLConf
+if (existingSessionState.nonEmpty) {
--- End diff --

nit:
```scala
existingSessionState.foreach(_.conf.getAllConfs.foreach {
  case (k, v) => if (v ne null) result.setConfString(k, v)
})
```





[GitHub] spark pull request #16826: Fork SparkSession with option to inherit a copy o...

2017-02-08 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/16826#discussion_r100194991
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/test/TestHive.scala ---
@@ -151,7 +155,7 @@ private[hive] class TestHiveSparkSession(
 new TestHiveSessionState(self)
 
   override def newSession(): TestHiveSparkSession = {
--- End diff --

You can change it to override `def newSession(inheritSessionState: 
Boolean)` instead
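A hedged sketch of that override; the constructor arguments below are assumptions based on the quoted diff, not the PR's final code:

```scala
override def newSession(inheritSessionState: Boolean): TestHiveSparkSession = {
  // Hand the current session state to the child only when inheritance is requested.
  val parentState = if (inheritSessionState) Some(sessionState) else None
  new TestHiveSparkSession(sc, Some(sharedState), parentState, loadTestTables)
}
```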





[GitHub] spark pull request #16826: Fork SparkSession with option to inherit a copy o...

2017-02-08 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/16826#discussion_r100195109
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala 
---
@@ -213,6 +218,24 @@ class SparkSession private(
 new SparkSession(sparkContext, Some(sharedState))
--- End diff --

nit: use `newSession(inheritSessionState = false)`





[GitHub] spark pull request #16826: Fork SparkSession with option to inherit a copy o...

2017-02-08 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/16826#discussion_r100195324
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala 
---
@@ -213,6 +218,24 @@ class SparkSession private(
 new SparkSession(sparkContext, Some(sharedState))
   }
 
+  /**
+   * Start a new session, sharing the underlying `SparkContext` and cached 
data.
+   * If inheritSessionState is enabled, then SQL configurations, temporary 
tables,
+   * registered functions are copied over from parent `SparkSession`.
+   *
+   * @note Other than the `SparkContext`, all shared state is initialized 
lazily.
+   * This method will force the initialization of the shared state to 
ensure that parent
+   * and child sessions are set up with the same shared state. If the 
underlying catalog
+   * implementation is Hive, this will initialize the metastore, which may 
take some time.
--- End diff --

nit: add `@since 2.1.1`





[GitHub] spark pull request #16826: Fork SparkSession with option to inherit a copy o...

2017-02-08 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/16826#discussion_r100198503
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/SparkSessionBuilderSuite.scala ---
@@ -123,4 +125,70 @@ class SparkSessionBuilderSuite extends SparkFunSuite {
   session.stop()
 }
   }
+
+  test("fork new session and inherit a copy of the session state") {
+val activeSession = 
SparkSession.builder().master("local").getOrCreate()
+val forkedSession = activeSession.newSession(inheritSessionState = 
true)
+
+assert(forkedSession ne activeSession)
+assert(forkedSession.sessionState ne activeSession.sessionState)
+
+forkedSession.stop()
+activeSession.stop()
+  }
+
+  test("fork new session and inherit sql config options") {
+val activeSession = SparkSession
+  .builder()
+  .master("local")
+  .config("spark-configb", "b")
--- End diff --

This is in the shared state. You should use `SparkSession.conf.set` instead.
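A sketch of the test with a session-level option set via `conf.set` after the session is created (the assertion shown is illustrative):

```scala
test("fork new session and inherit sql config options") {
  val activeSession = SparkSession.builder().master("local").getOrCreate()
  // Set a session-scoped SQL config rather than a builder (shared-state) config.
  activeSession.conf.set("spark-configb", "b")

  val forkedSession = activeSession.newSession(inheritSessionState = true)
  assert(forkedSession.conf.get("spark-configb") == "b")

  forkedSession.stop()
  activeSession.stop()
}
```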





[GitHub] spark pull request #16826: Fork SparkSession with option to inherit a copy o...

2017-02-08 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/16826#discussion_r100194351
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala ---
@@ -65,12 +80,29 @@ private[sql] class SessionState(sparkSession: 
SparkSession) {
 hadoopConf
   }
 
-  lazy val experimentalMethods = new ExperimentalMethods
+  lazy val experimentalMethods: ExperimentalMethods = {
+existingSessionState
+  .map(_.experimentalMethods.clone)
+  .getOrElse(new ExperimentalMethods)
+  }
 
   /**
* Internal catalog for managing functions registered by the user.
*/
-  lazy val functionRegistry: FunctionRegistry = 
FunctionRegistry.builtin.copy()
+  lazy val functionRegistry: FunctionRegistry = {
--- End diff --

It's better to just add a `copy` method to `FunctionRegistry` to simplify this code.
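Assuming `FunctionRegistry` gained such a `copy` method (one that clones the registered functions), the lazy val could shrink to roughly:

```scala
lazy val functionRegistry: FunctionRegistry = {
  // Copy the parent's registry when one exists; otherwise start from the builtins.
  existingSessionState
    .map(_.functionRegistry.copy())
    .getOrElse(FunctionRegistry.builtin.copy())
}
```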





[GitHub] spark pull request #16826: Fork SparkSession with option to inherit a copy o...

2017-02-08 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/16826#discussion_r100194640
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/test/TestHive.scala ---
@@ -115,16 +113,22 @@ class TestHiveContext(
 private[hive] class TestHiveSparkSession(
 @transient private val sc: SparkContext,
 @transient private val existingSharedState: Option[SharedState],
+existingSessionState: Option[SessionState],
--- End diff --

Looks like you don't need to change this file?





[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-08 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100199037
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,200 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
+"""
+Mixin for Locality Sensitive Hashing(LSH) algorithm parameters.
+"""
+
+numHashTables = Param(Params._dummy(), "numHashTables", "number of 
hash tables, where " +
+  "increasing number of hash tables lowers the 
false negative rate, " +
+  "and decreasing it improves the running 
performance.",
+  typeConverter=TypeConverters.toInt)
+
+def __init__(self):
+super(LSHParams, self).__init__()
+
+@since("2.2.0")
+def setNumHashTables(self, value):
+"""
+Sets the value of :py:attr:`numHashTables`.
+"""
+return self._set(numHashTables=value)
+
+@since("2.2.0")
+def getNumHashTables(self):
+"""
+Gets the value of numHashTables or its default value.
+"""
+return self.getOrDefault(self.numHashTables)
+
+
+class LSHModel():
+"""
+Mixin for Locality Sensitive Hashing(LSH) models.
+"""
+
+@since("2.2.0")
+def approxNearestNeighbors(self, dataset, key, numNearestNeighbors, 
singleProbing=True,
+   distCol="distCol"):
+"""
+Given a large dataset and an item, approximately find at most k 
items which have the
+closest distance to the item. If the :py:attr:`outputCol` is 
missing, the method will
+transform the data; if the :py:attr:`outputCol` exists, it will 
use that. This allows
+caching of the transformed data when necessary.
+
+:param dataset: The dataset to search for nearest neighbors of the 
key.
+:param key: Feature vector representing the item to search for.
+:param numNearestNeighbors: The maximum number of nearest 
neighbors.
+:param distCol: Output column for storing the distance between 
each result row and the key.
+Use "distCol" as default value if it's not 
specified.
+:return: A dataset containing at most k items closest to the key. 
A distCol is added
+ to show the distance between each row and the key.
+"""
+return self._call_java("approxNearestNeighbors", dataset, key, 
numNearestNeighbors,
+   distCol)
+
+@since("2.2.0")
+def approxSimilarityJoin(self, datasetA, datasetB, threshold, 
distCol="distCol"):
+"""
+Join two dataset to approximately find all pairs of rows whose 
distance are smaller than
+the threshold. If the :py:attr:`outputCol` is missing, the method 
will transform the data;
+if the :py:attr:`outputCol` exists, it will use that. This allows 
caching of the
+transformed data when necessary.
+
+:param datasetA: One of the datasets to join.
+:param datasetB: Another dataset to join.
+:param threshold: The threshold for the distance of row pairs.
+:param distCol: Output column for storing the distance between 
each result row and the key.
+Use "distCol" as default value if it's not 
specified.
+:return: A joined dataset containing pairs of rows. The original 
rows are in columns
+"datasetA" and "datasetB", and a distCol is added to show 
the distance of
+each pair.
+"""
+return self._call_java("approxSimilarityJoin", datasetA, datasetB, 
threshold, distCol)
+
+
+@inherit_doc
+class BucketedRandomProjectionLSH(JavaEstimator, LSHParams, HasInputCol, 
HasOutputCol, HasSeed,
+  JavaMLReadable, JavaMLWritable):
+"""
+.. note:: Experimental
+
+LSH class for Euclidean distance metrics.
+The input is dense or sparse vectors, each of which represents a point 
in the Euclidean
+distance space. The output will be vectors of configurable dimension. 
Hash value in the
+same dimension is calculated by the same hash function.
+
+.. seealso:: `Stable Distributions \
+
`_
+.. seealso:: `Hashing for Similarity Search: A Survey 
`_
+
+>>> from pyspark.ml.linalg import Vectors
+>>> data = [(Vectors.dense([-1.0, -1.0 ]),),
+... (Vectors.dense([-1.0, 1.0 ]),),
+... (Vectors.dense(

[GitHub] spark issue #16859: [SPARK-17714][Core][maven][test-hadoop2.6]Avoid using Ex...

2017-02-08 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/16859
  
cc @vanzin 





[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-08 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100198559
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,200 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
+"""
+Mixin for Locality Sensitive Hashing(LSH) algorithm parameters.
+"""
+
+numHashTables = Param(Params._dummy(), "numHashTables", "number of 
hash tables, where " +
+  "increasing number of hash tables lowers the 
false negative rate, " +
+  "and decreasing it improves the running 
performance.",
+  typeConverter=TypeConverters.toInt)
+
+def __init__(self):
+super(LSHParams, self).__init__()
+
+@since("2.2.0")
+def setNumHashTables(self, value):
+"""
+Sets the value of :py:attr:`numHashTables`.
+"""
+return self._set(numHashTables=value)
+
+@since("2.2.0")
+def getNumHashTables(self):
+"""
+Gets the value of numHashTables or its default value.
+"""
+return self.getOrDefault(self.numHashTables)
+
+
+class LSHModel():
+"""
+Mixin for Locality Sensitive Hashing(LSH) models.
+"""
+
+@since("2.2.0")
+def approxNearestNeighbors(self, dataset, key, numNearestNeighbors, 
singleProbing=True,
+   distCol="distCol"):
+"""
+Given a large dataset and an item, approximately find at most k 
items which have the
+closest distance to the item. If the :py:attr:`outputCol` is 
missing, the method will
+transform the data; if the :py:attr:`outputCol` exists, it will 
use that. This allows
+caching of the transformed data when necessary.
+
+:param dataset: The dataset to search for nearest neighbors of the 
key.
+:param key: Feature vector representing the item to search for.
+:param numNearestNeighbors: The maximum number of nearest 
neighbors.
+:param distCol: Output column for storing the distance between 
each result row and the key.
+Use "distCol" as default value if it's not 
specified.
+:return: A dataset containing at most k items closest to the key. 
A distCol is added
+ to show the distance between each row and the key.
+"""
+return self._call_java("approxNearestNeighbors", dataset, key, 
numNearestNeighbors,
+   distCol)
+
+@since("2.2.0")
+def approxSimilarityJoin(self, datasetA, datasetB, threshold, 
distCol="distCol"):
+"""
+Join two dataset to approximately find all pairs of rows whose 
distance are smaller than
+the threshold. If the :py:attr:`outputCol` is missing, the method 
will transform the data;
+if the :py:attr:`outputCol` exists, it will use that. This allows 
caching of the
+transformed data when necessary.
+
+:param datasetA: One of the datasets to join.
+:param datasetB: Another dataset to join.
+:param threshold: The threshold for the distance of row pairs.
+:param distCol: Output column for storing the distance between 
each result row and the key.
+Use "distCol" as default value if it's not 
specified.
+:return: A joined dataset containing pairs of rows. The original 
rows are in columns
+"datasetA" and "datasetB", and a distCol is added to show 
the distance of
+each pair.
+"""
+return self._call_java("approxSimilarityJoin", datasetA, datasetB, 
threshold, distCol)
+
+
+@inherit_doc
+class BucketedRandomProjectionLSH(JavaEstimator, LSHParams, HasInputCol, 
HasOutputCol, HasSeed,
+  JavaMLReadable, JavaMLWritable):
+"""
+.. note:: Experimental
+
+LSH class for Euclidean distance metrics.
+The input is dense or sparse vectors, each of which represents a point 
in the Euclidean
+distance space. The output will be vectors of configurable dimension. 
Hash value in the
+same dimension is calculated by the same hash function.
+
+.. seealso:: `Stable Distributions \
+
`_
+.. seealso:: `Hashing for Similarity Search: A Survey 
`_
+
+>>> from pyspark.ml.linalg import Vectors
+>>> data = [(Vectors.dense([-1.0, -1.0 ]),),
+... (Vectors.dense([-1.0, 1.0 ]),),
+... (Vectors.dense(

[GitHub] spark issue #16859: [SPARK-17714][Core][maven][test-hadoop2.6]Avoid using Ex...

2017-02-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16859
  
Merged build finished. Test PASSed.





[GitHub] spark issue #16859: [SPARK-17714][Core][maven][test-hadoop2.6]Avoid using Ex...

2017-02-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16859
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72599/
Test PASSed.





[GitHub] spark issue #16859: [SPARK-17714][Core][maven][test-hadoop2.6]Avoid using Ex...

2017-02-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16859
  
**[Test build #72599 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72599/testReport)**
 for PR 16859 at commit 
[`1c88474`](https://github.com/apache/spark/commit/1c8847494c29d4b51182ecfeebb5cc85e000e7a1).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `public class TransportChannelHandler extends 
ChannelInboundHandlerAdapter `





[GitHub] spark issue #15415: [SPARK-14503][ML] spark.ml API for FPGrowth

2017-02-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15415
  
**[Test build #72611 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72611/testReport)**
 for PR 15415 at commit 
[`049e1a3`](https://github.com/apache/spark/commit/049e1a326daee4c55edb6d65090fafd229b93b6a).





[GitHub] spark issue #16858: [SPARK-19464][BUILD][HOTFIX] run-tests should use hadoop...

2017-02-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16858
  
**[Test build #3567 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3567/testReport)**
 for PR 16858 at commit 
[`03f9bfd`](https://github.com/apache/spark/commit/03f9bfd985a5a272d99258971fd83e4613e9b0fd).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15009: [SPARK-17443][SPARK-11035] Stop Spark Application if lau...

2017-02-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15009
  
**[Test build #72610 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72610/testReport)**
 for PR 15009 at commit 
[`2707d21`](https://github.com/apache/spark/commit/2707d219f3f2c3fa1e89553809cf3a8d118fc084).





[GitHub] spark issue #15604: [SPARK-18066] [CORE] [TESTS] Add Pool usage policies tes...

2017-02-08 Thread kayousterhout
Github user kayousterhout commented on the issue:

https://github.com/apache/spark/pull/15604
  
Ok thanks for the update!








[GitHub] spark pull request #16813: [SPARK-19466][CORE][SCHEDULER] Improve Fair Sched...

2017-02-08 Thread kayousterhout
Github user kayousterhout commented on a diff in the pull request:

https://github.com/apache/spark/pull/16813#discussion_r100193866
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/SchedulableBuilder.scala ---
@@ -69,19 +70,31 @@ private[spark] class FairSchedulableBuilder(val 
rootPool: Pool, conf: SparkConf)
   val DEFAULT_WEIGHT = 1
 
   override def buildPools() {
-var is: Option[InputStream] = None
+var fileData: Option[(InputStream, String)] = None
 try {
-  is = Option {
-schedulerAllocFile.map { f =>
-  new FileInputStream(f)
-}.getOrElse {
-  
Utils.getSparkClassLoader.getResourceAsStream(DEFAULT_SCHEDULER_FILE)
+  fileData = schedulerAllocFile.map { f =>
+val fis = new FileInputStream(f)
+logInfo(s"Creating Fair Scheduler pools from $f")
+Some((fis, f))
+  }.getOrElse {
+val is = 
Utils.getSparkClassLoader.getResourceAsStream(DEFAULT_SCHEDULER_FILE)
+if (is != null) {
+  logInfo(s"Creating Fair Scheduler pools from default file: 
$DEFAULT_SCHEDULER_FILE")
+  Some((is, DEFAULT_SCHEDULER_FILE))
+} else {
+  logWarning("Fair Scheduler configuration file not found so jobs 
will be scheduled " +
+"in FIFO order")
+  None
 }
   }
 
-  is.foreach { i => buildFairSchedulerPool(i) }
+  fileData.foreach { case (is, fileName) => buildFairSchedulerPool(is, 
fileName) }
+} catch {
+  case NonFatal(t) =>
+logError("Error while building the fair scheduler pools: ", t)
--- End diff --

can you add the filename (if it's defined) to this error message?
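Something along these lines would pick up the file name when it was resolved (a sketch; `fileData` is the local `Option` defined earlier in the quoted diff):

```scala
case NonFatal(t) =>
  // Append the originating file name, if one was resolved, to make the failure traceable.
  val source = fileData.map { case (_, fileName) => s" from $fileName" }.getOrElse("")
  logError(s"Error while building the fair scheduler pools$source", t)
```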





[GitHub] spark issue #16860: [SPARK-18613][ML] make spark.mllib LDA dependencies in s...

2017-02-08 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/16860
  
Try running dev/mima locally.  I bet you'll have to add stuff to the 
MiMaExcludes.scala file b/c of false positives.
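For context, MiMa exclusions are added to the excludes file as `ProblemFilters` entries; a hypothetical entry looks like the sketch below (the problem type and the fully qualified name are placeholders, and the real ones would come from the dev/mima failure output):

```scala
// Hypothetical exclusion; substitute the problem type and member reported by dev/mima.
ProblemFilters.exclude[DirectMissingMethodProblem](
  "org.apache.spark.mllib.clustering.SomeLDAClass.someChangedMethod")
```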





[GitHub] spark pull request #16813: [SPARK-19466][CORE][SCHEDULER] Improve Fair Sched...

2017-02-08 Thread kayousterhout
Github user kayousterhout commented on a diff in the pull request:

https://github.com/apache/spark/pull/16813#discussion_r100194712
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/SchedulableBuilder.scala ---
@@ -140,14 +157,15 @@ private[spark] class FairSchedulableBuilder(val 
rootPool: Pool, conf: SparkConf)
   private def getIntValue(
   poolNode: Node,
   poolName: String,
-  propertyName: String, defaultValue: Int): Int = {
+  propertyName: String,
+  defaultValue: Int, fileName: String): Int = {
--- End diff --

nit: can you fix the spacing here? (fileName should be on its own line)





[GitHub] spark pull request #16813: [SPARK-19466][CORE][SCHEDULER] Improve Fair Sched...

2017-02-08 Thread kayousterhout
Github user kayousterhout commented on a diff in the pull request:

https://github.com/apache/spark/pull/16813#discussion_r100194109
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/SchedulableBuilder.scala ---
@@ -69,19 +70,31 @@ private[spark] class FairSchedulableBuilder(val 
rootPool: Pool, conf: SparkConf)
   val DEFAULT_WEIGHT = 1
 
   override def buildPools() {
-var is: Option[InputStream] = None
+var fileData: Option[(InputStream, String)] = None
 try {
-  is = Option {
-schedulerAllocFile.map { f =>
-  new FileInputStream(f)
-}.getOrElse {
-  
Utils.getSparkClassLoader.getResourceAsStream(DEFAULT_SCHEDULER_FILE)
+  fileData = schedulerAllocFile.map { f =>
+val fis = new FileInputStream(f)
+logInfo(s"Creating Fair Scheduler pools from $f")
+Some((fis, f))
+  }.getOrElse {
+val is = 
Utils.getSparkClassLoader.getResourceAsStream(DEFAULT_SCHEDULER_FILE)
+if (is != null) {
+  logInfo(s"Creating Fair Scheduler pools from default file: 
$DEFAULT_SCHEDULER_FILE")
+  Some((is, DEFAULT_SCHEDULER_FILE))
+} else {
+  logWarning("Fair Scheduler configuration file not found so jobs 
will be scheduled " +
--- End diff --

can you add "($DEFAULT_SCHEDULER_FILE)" after "file "?





[GitHub] spark issue #15604: [SPARK-18066] [CORE] [TESTS] Add Pool usage policies tes...

2017-02-08 Thread erenavsarogullari
Github user erenavsarogullari commented on the issue:

https://github.com/apache/spark/pull/15604
  
@kayousterhout thanks for the query on this PR. I will be submitting this and the other one (#15326) once #16813 is merged.





[GitHub] spark issue #16860: [SPARK-18613][ML] make spark.mllib LDA dependencies in s...

2017-02-08 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/16860
  
Oh, the class is private already, right on. That seems OK then.





[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-08 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100192059
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,200 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
+"""
+Mixin for Locality Sensitive Hashing(LSH) algorithm parameters.
--- End diff --

Done.





[GitHub] spark issue #16715: [Spark-18080][ML] Python API & Examples for Locality Sen...

2017-02-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16715
  
**[Test build #72609 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72609/testReport)**
 for PR 16715 at commit 
[`8e5468f`](https://github.com/apache/spark/commit/8e5468f6946a8f2c051746ddb0c0e65586bd1eed).





[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-08 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100193058
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -755,6 +951,102 @@ def maxAbs(self):
 
 
 @inherit_doc
+class MinHashLSH(JavaEstimator, LSHParams, HasInputCol, HasOutputCol, 
HasSeed,
+ JavaMLReadable, JavaMLWritable):
+
+"""
+.. note:: Experimental
+
+LSH class for Jaccard distance.
+The input can be dense or sparse vectors, but it is more efficient if 
it is sparse.
+For example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])`
--- End diff --

Fixed





[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-08 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100193020
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -755,6 +951,102 @@ def maxAbs(self):
 
 
 @inherit_doc
+class MinHashLSH(JavaEstimator, LSHParams, HasInputCol, HasOutputCol, 
HasSeed,
+ JavaMLReadable, JavaMLWritable):
+
+"""
+.. note:: Experimental
+
+LSH class for Jaccard distance.
+The input can be dense or sparse vectors, but it is more efficient if 
it is sparse.
+For example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])`
+means there are 10 elements in the space. This set contains elem 2, 
elem 3 and elem 5.
+Also, any input vector must have at least 1 non-zero indices, and all 
non-zero values
+are treated as binary "1" values.
+
+.. seealso:: `MinHash `_
+
+>>> from pyspark.ml.linalg import Vectors
+>>> data = [(Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0]),),
+... (Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0]),),
+... (Vectors.sparse(6, [0, 2, 4], [1.0, 1.0, 1.0]),)]
+>>> df = spark.createDataFrame(data, ["keys"])
+>>> mh = MinHashLSH(inputCol="keys", outputCol="values", seed=12345)
+>>> model = mh.fit(df)
+>>> model.transform(df).head()
+Row(keys=SparseVector(6, {0: 1.0, 1: 1.0, 2: 1.0}), 
values=[DenseVector([-1638925712.0])])
+>>> data2 = [(Vectors.sparse(6, [1, 3, 5], [1.0, 1.0, 1.0]),),
+...  (Vectors.sparse(6, [2, 3, 5], [1.0, 1.0, 1.0]),),
+...  (Vectors.sparse(6, [1, 2, 4], [1.0, 1.0, 1.0]),)]
+>>> df2 = spark.createDataFrame(data2, ["keys"])
+>>> key = Vectors.sparse(6, [1], [1.0])
+>>> model.approxNearestNeighbors(df2, key, 
1).select("distCol").head()[0]
+0.6...
+>>> model.approxSimilarityJoin(df, df2, 
1.0).select("distCol").head()[0]
+0.5
+>>> mhPath = temp_path + "/mh"
+>>> mh.save(mhPath)
+>>> mh2 = MinHashLSH.load(mhPath)
+>>> mh2.getOutputCol() == mh.getOutputCol()
+True
+>>> modelPath = temp_path + "/mh-model"
+>>> model.save(modelPath)
+>>> model2 = MinHashLSHModel.load(modelPath)
+
+.. versionadded:: 2.2.0
+"""
+
+@keyword_only
+def __init__(self, inputCol=None, outputCol=None, seed=None, 
numHashTables=1):
+"""
+__init__(self, inputCol=None, outputCol=None, seed=None, 
numHashTables=1)
+"""
+super(MinHashLSH, self).__init__()
+self._java_obj = 
self._new_java_obj("org.apache.spark.ml.feature.MinHashLSH", self.uid)
+self._setDefault(numHashTables=1)
+kwargs = self.__init__._input_kwargs
+self.setParams(**kwargs)
+
+@keyword_only
+@since("2.2.0")
+def setParams(self, inputCol=None, outputCol=None, seed=None, 
numHashTables=1):
+"""
+setParams(self, inputCol=None, outputCol=None, seed=None, 
numHashTables=1)
+Sets params for this MinHashLSH.
+"""
+kwargs = self.setParams._input_kwargs
+return self._set(**kwargs)
+
+def _create_model(self, java_model):
+return MinHashLSHModel(java_model)
+
+
+class MinHashLSHModel(JavaModel, LSHModel, JavaMLReadable, JavaMLWritable):
+"""
+.. note:: Experimental
+
+Model produced by :py:class:`MinHashLSH`, where multiple hash 
functions are stored. Each
+hash function is picked from the following family of hash functions, 
where :math:`a_i` and
+:math:`b_i` are randomly chosen integers less than prime:
+:math:`h_i(x) = ((x \cdot a_i + b_i) \mod prime)` This hash family is 
approximately min-wise
+independent according to the reference.
+
+.. seealso:: Tom Bohman, Colin Cooper, and Alan Frieze. "Min-wise 
independent linear \
+permutations." Electronic Journal of Combinatorics 7 (2000): R26.
+
+.. versionadded:: 2.2.0
+"""
+
+@property
+@since("2.2.0")
+def randCoefficients(self):
--- End diff --

Removed





[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-08 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100193043
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -755,6 +951,102 @@ def maxAbs(self):
 
 
 @inherit_doc
+class MinHashLSH(JavaEstimator, LSHParams, HasInputCol, HasOutputCol, 
HasSeed,
+ JavaMLReadable, JavaMLWritable):
+
+"""
+.. note:: Experimental
+
+LSH class for Jaccard distance.
+The input can be dense or sparse vectors, but it is more efficient if 
it is sparse.
+For example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])`
+means there are 10 elements in the space. This set contains elem 2, 
elem 3 and elem 5.
--- End diff --

Done.





[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-08 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100192985
  
--- Diff: examples/src/main/python/ml/bucketed_random_projection_lsh.py ---
@@ -0,0 +1,76 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+
+from __future__ import print_function
+
+# $example on$
+from pyspark.ml.feature import BucketedRandomProjectionLSH
+from pyspark.ml.linalg import Vectors
+# $example off$
+from pyspark.sql import SparkSession
+
+"""
+An example demonstrating BucketedRandomProjectionLSH.
+Run with:
+  bin/spark-submit 
examples/src/main/python/ml/bucketed_random_projection_lsh.py
--- End diff --

That was a mistake. Sorry about it!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-08 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100192933
  
--- Diff: examples/src/main/python/ml/bucketed_random_projection_lsh.py ---
@@ -0,0 +1,76 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+
+from __future__ import print_function
+
+# $example on$
+from pyspark.ml.feature import BucketedRandomProjectionLSH
+from pyspark.ml.linalg import Vectors
+# $example off$
+from pyspark.sql import SparkSession
+
+"""
+An example demonstrating BucketedRandomProjectionLSH.
+Run with:
+  bin/spark-submit 
examples/src/main/python/ml/bucketed_random_projection_lsh.py
+"""
+
+if __name__ == "__main__":
+spark = SparkSession \
+.builder \
+.appName("BucketedRandomProjectionLSHExample") \
+.getOrCreate()
+
+# $example on$
+dataA = [(0, Vectors.dense([1.0, 1.0]),),
+ (1, Vectors.dense([1.0, -1.0]),),
+ (2, Vectors.dense([-1.0, -1.0]),),
+ (3, Vectors.dense([-1.0, 1.0]),)]
+dfA = spark.createDataFrame(dataA, ["id", "keys"])
+
+dataB = [(4, Vectors.dense([1.0, 0.0]),),
+ (5, Vectors.dense([-1.0, 0.0]),),
+ (6, Vectors.dense([0.0, 1.0]),),
+ (7, Vectors.dense([0.0, -1.0]),)]
+dfB = spark.createDataFrame(dataB, ["id", "keys"])
+
+key = Vectors.dense([1.0, 0.0])
+
+brp = BucketedRandomProjectionLSH(inputCol="keys", outputCol="values", 
bucketLength=2.0,
+  numHashTables=3)
+model = brp.fit(dfA)
+
+# Feature Transformation
+model.transform(dfA).show()
--- End diff --

Done for Scala/Java/Python Examples.





[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-08 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100192881
  
--- Diff: examples/src/main/python/ml/min_hash_lsh.py ---
@@ -0,0 +1,75 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+
+from __future__ import print_function
+
+# $example on$
+from pyspark.ml.feature import MinHashLSH
+from pyspark.ml.linalg import Vectors
+# $example off$
+from pyspark.sql import SparkSession
+
+"""
+An example demonstrating MinHashLSH.
+Run with:
+  bin/spark-submit examples/src/main/python/ml/min_hash_lsh.py
+"""
+
+if __name__ == "__main__":
+spark = SparkSession \
+.builder \
+.appName("MinHashLSHExample") \
+.getOrCreate()
+
+# $example on$
+dataA = [(0, Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0]),),
+ (1, Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0]),),
+ (2, Vectors.sparse(6, [0, 2, 4], [1.0, 1.0, 1.0]),)]
+dfA = spark.createDataFrame(dataA, ["id", "keys"])
+
+dataB = [(3, Vectors.sparse(6, [1, 3, 5], [1.0, 1.0, 1.0]),),
+ (4, Vectors.sparse(6, [2, 3, 5], [1.0, 1.0, 1.0]),),
+ (5, Vectors.sparse(6, [1, 2, 4], [1.0, 1.0, 1.0]),)]
+dfB = spark.createDataFrame(dataB, ["id", "keys"])
+
+key = Vectors.sparse(6, [1, 3], [1.0, 1.0])
+
+mh = MinHashLSH(inputCol="keys", outputCol="values", numHashTables=3)
+model = mh.fit(dfA)
+
+# Feature Transformation
+model.transform(dfA).show()
--- End diff --

Done.





[GitHub] spark issue #13077: [SPARK-10748] [Mesos] Log error instead of crashing Spar...

2017-02-08 Thread devaraj-kavali
Github user devaraj-kavali commented on the issue:

https://github.com/apache/spark/pull/13077
  
@srowen I think it is still needed. @tnachen mentioned that it would be closed as 'no longer being updated' if the conflicts could not be resolved. I have updated the PR and there are no conflicts now.





[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-08 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100192685
  
--- Diff: examples/src/main/python/ml/min_hash_lsh.py ---
@@ -0,0 +1,75 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+
+from __future__ import print_function
+
+# $example on$
+from pyspark.ml.feature import MinHashLSH
+from pyspark.ml.linalg import Vectors
+# $example off$
+from pyspark.sql import SparkSession
+
+"""
+An example demonstrating MinHashLSH.
+Run with:
+  bin/spark-submit examples/src/main/python/ml/min_hash_lsh.py
+"""
+
+if __name__ == "__main__":
+spark = SparkSession \
+.builder \
+.appName("MinHashLSHExample") \
+.getOrCreate()
+
+# $example on$
+dataA = [(0, Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0]),),
+ (1, Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0]),),
+ (2, Vectors.sparse(6, [0, 2, 4], [1.0, 1.0, 1.0]),)]
+dfA = spark.createDataFrame(dataA, ["id", "keys"])
+
+dataB = [(3, Vectors.sparse(6, [1, 3, 5], [1.0, 1.0, 1.0]),),
+ (4, Vectors.sparse(6, [2, 3, 5], [1.0, 1.0, 1.0]),),
+ (5, Vectors.sparse(6, [1, 2, 4], [1.0, 1.0, 1.0]),)]
+dfB = spark.createDataFrame(dataB, ["id", "keys"])
+
+key = Vectors.sparse(6, [1, 3], [1.0, 1.0])
+
+mh = MinHashLSH(inputCol="keys", outputCol="values", numHashTables=3)
+model = mh.fit(dfA)
+
+# Feature Transformation
+model.transform(dfA).show()
+
+# Cache the transformed columns
+transformedA = model.transform(dfA).cache()
+transformedB = model.transform(dfB).cache()
+
+# Approximate similarity join
+model.approxSimilarityJoin(dfA, dfB, 0.6).show()
+model.approxSimilarityJoin(transformedA, transformedB, 0.6).show()
+
+# Self Join
+model.approxSimilarityJoin(dfA, dfA, 0.6).filter("datasetA.id < 
datasetB.id").show()
+
+# Approximate nearest neighbor search
+model.approxNearestNeighbors(dfA, key, 2).show()
--- End diff --

Increased the number of HashTables.





[GitHub] spark issue #16813: [SPARK-19466][CORE][SCHEDULER] Improve Fair Scheduler Lo...

2017-02-08 Thread erenavsarogullari
Github user erenavsarogullari commented on the issue:

https://github.com/apache/spark/pull/16813
  
Hi @kayousterhout and @markhamstra,
All comments have been addressed via the latest patch (790097e).






[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-08 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100192402
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,200 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
+"""
+Mixin for Locality Sensitive Hashing(LSH) algorithm parameters.
+"""
+
+numHashTables = Param(Params._dummy(), "numHashTables", "number of 
hash tables, where " +
+  "increasing number of hash tables lowers the 
false negative rate, " +
+  "and decreasing it improves the running 
performance.",
+  typeConverter=TypeConverters.toInt)
+
+def __init__(self):
+super(LSHParams, self).__init__()
+
+@since("2.2.0")
+def setNumHashTables(self, value):
+"""
+Sets the value of :py:attr:`numHashTables`.
+"""
+return self._set(numHashTables=value)
+
+@since("2.2.0")
+def getNumHashTables(self):
+"""
+Gets the value of numHashTables or its default value.
+"""
+return self.getOrDefault(self.numHashTables)
+
+
+class LSHModel():
+"""
+Mixin for Locality Sensitive Hashing(LSH) models.
+"""
+
+@since("2.2.0")
+def approxNearestNeighbors(self, dataset, key, numNearestNeighbors, 
singleProbing=True,
+   distCol="distCol"):
+"""
+Given a large dataset and an item, approximately find at most k 
items which have the
+closest distance to the item. If the :py:attr:`outputCol` is 
missing, the method will
+transform the data; if the :py:attr:`outputCol` exists, it will 
use that. This allows
+caching of the transformed data when necessary.
+
+:param dataset: The dataset to search for nearest neighbors of the 
key.
+:param key: Feature vector representing the item to search for.
+:param numNearestNeighbors: The maximum number of nearest 
neighbors.
+:param distCol: Output column for storing the distance between 
each result row and the key.
+Use "distCol" as default value if it's not 
specified.
+:return: A dataset containing at most k items closest to the key. 
A distCol is added
+ to show the distance between each row and the key.
+"""
+return self._call_java("approxNearestNeighbors", dataset, key, 
numNearestNeighbors,
+   distCol)
+
+@since("2.2.0")
+def approxSimilarityJoin(self, datasetA, datasetB, threshold, 
distCol="distCol"):
+"""
+Join two dataset to approximately find all pairs of rows whose 
distance are smaller than
+the threshold. If the :py:attr:`outputCol` is missing, the method 
will transform the data;
+if the :py:attr:`outputCol` exists, it will use that. This allows 
caching of the
+transformed data when necessary.
+
+:param datasetA: One of the datasets to join.
+:param datasetB: Another dataset to join.
+:param threshold: The threshold for the distance of row pairs.
+:param distCol: Output column for storing the distance between 
each result row and the key.
+Use "distCol" as default value if it's not 
specified.
+:return: A joined dataset containing pairs of rows. The original 
rows are in columns
+"datasetA" and "datasetB", and a distCol is added to show 
the distance of
+each pair.
+"""
+return self._call_java("approxSimilarityJoin", datasetA, datasetB, 
threshold, distCol)
+
+
+@inherit_doc
+class BucketedRandomProjectionLSH(JavaEstimator, LSHParams, HasInputCol, 
HasOutputCol, HasSeed,
+  JavaMLReadable, JavaMLWritable):
+"""
+.. note:: Experimental
+
+LSH class for Euclidean distance metrics.
+The input is dense or sparse vectors, each of which represents a point 
in the Euclidean
+distance space. The output will be vectors of configurable dimension. 
Hash value in the
--- End diff --

Done in Scala/Java doc as well.



[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-08 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100192347
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,200 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
+"""
+Mixin for Locality Sensitive Hashing(LSH) algorithm parameters.
+"""
+
+numHashTables = Param(Params._dummy(), "numHashTables", "number of 
hash tables, where " +
+  "increasing number of hash tables lowers the 
false negative rate, " +
+  "and decreasing it improves the running 
performance.",
+  typeConverter=TypeConverters.toInt)
+
+def __init__(self):
+super(LSHParams, self).__init__()
+
+@since("2.2.0")
+def setNumHashTables(self, value):
+"""
+Sets the value of :py:attr:`numHashTables`.
+"""
+return self._set(numHashTables=value)
+
+@since("2.2.0")
+def getNumHashTables(self):
+"""
+Gets the value of numHashTables or its default value.
+"""
+return self.getOrDefault(self.numHashTables)
+
+
+class LSHModel():
+"""
+Mixin for Locality Sensitive Hashing(LSH) models.
+"""
+
+@since("2.2.0")
+def approxNearestNeighbors(self, dataset, key, numNearestNeighbors, 
singleProbing=True,
+   distCol="distCol"):
+"""
+Given a large dataset and an item, approximately find at most k 
items which have the
+closest distance to the item. If the :py:attr:`outputCol` is 
missing, the method will
+transform the data; if the :py:attr:`outputCol` exists, it will 
use that. This allows
+caching of the transformed data when necessary.
+
+:param dataset: The dataset to search for nearest neighbors of the 
key.
+:param key: Feature vector representing the item to search for.
+:param numNearestNeighbors: The maximum number of nearest 
neighbors.
+:param distCol: Output column for storing the distance between 
each result row and the key.
+Use "distCol" as default value if it's not 
specified.
+:return: A dataset containing at most k items closest to the key. 
A distCol is added
+ to show the distance between each row and the key.
+"""
+return self._call_java("approxNearestNeighbors", dataset, key, 
numNearestNeighbors,
+   distCol)
+
+@since("2.2.0")
+def approxSimilarityJoin(self, datasetA, datasetB, threshold, 
distCol="distCol"):
+"""
+Join two dataset to approximately find all pairs of rows whose 
distance are smaller than
--- End diff --

Done.





[GitHub] spark issue #16860: [SPARK-18613][ML] make spark.mllib LDA dependencies in s...

2017-02-08 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/16860
  
@srowen  This actually isn't a breaking change.  This is making protected 
methods private within classes which cannot be extended outside of Spark.  The 
classes are not final, but they only provide private constructors.
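A minimal sketch of the pattern being described, with purely illustrative 
names (these are not the actual spark.ml classes): when a class offers only 
private constructors, user code cannot subclass it, so narrowing a member from 
`protected` to package-private does not take away anything callers could have 
reached.

```scala
package org.apache.spark.ml.clustering

// Illustrative stand-in for the wrapped spark.mllib model.
private[clustering] class OldStyleModel

// The only constructor is private, so this class cannot be extended outside
// Spark, and a `protected` member was never reachable by user code anyway.
class ExampleLDAModel private (oldModel: OldStyleModel) {

  // Previously this might have been `protected`; package-private keeps it
  // usable by spark.ml internals while hiding it from the public API.
  private[clustering] def parentModel: OldStyleModel = oldModel
}
```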





[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-08 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100192333
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,200 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
+"""
+Mixin for Locality Sensitive Hashing(LSH) algorithm parameters.
+"""
+
+numHashTables = Param(Params._dummy(), "numHashTables", "number of 
hash tables, where " +
+  "increasing number of hash tables lowers the 
false negative rate, " +
+  "and decreasing it improves the running 
performance.",
+  typeConverter=TypeConverters.toInt)
+
+def __init__(self):
+super(LSHParams, self).__init__()
+
+@since("2.2.0")
+def setNumHashTables(self, value):
+"""
+Sets the value of :py:attr:`numHashTables`.
+"""
+return self._set(numHashTables=value)
+
+@since("2.2.0")
+def getNumHashTables(self):
+"""
+Gets the value of numHashTables or its default value.
+"""
+return self.getOrDefault(self.numHashTables)
+
+
+class LSHModel():
+"""
+Mixin for Locality Sensitive Hashing(LSH) models.
+"""
+
+@since("2.2.0")
+def approxNearestNeighbors(self, dataset, key, numNearestNeighbors, 
singleProbing=True,
+   distCol="distCol"):
+"""
+Given a large dataset and an item, approximately find at most k 
items which have the
+closest distance to the item. If the :py:attr:`outputCol` is 
missing, the method will
+transform the data; if the :py:attr:`outputCol` exists, it will 
use that. This allows
+caching of the transformed data when necessary.
+
+:param dataset: The dataset to search for nearest neighbors of the 
key.
--- End diff --

Done.





[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-08 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100192298
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,200 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
+"""
+Mixin for Locality Sensitive Hashing(LSH) algorithm parameters.
+"""
+
+numHashTables = Param(Params._dummy(), "numHashTables", "number of 
hash tables, where " +
+  "increasing number of hash tables lowers the 
false negative rate, " +
+  "and decreasing it improves the running 
performance.",
+  typeConverter=TypeConverters.toInt)
+
+def __init__(self):
+super(LSHParams, self).__init__()
+
+@since("2.2.0")
+def setNumHashTables(self, value):
+"""
+Sets the value of :py:attr:`numHashTables`.
+"""
+return self._set(numHashTables=value)
+
+@since("2.2.0")
+def getNumHashTables(self):
+"""
+Gets the value of numHashTables or its default value.
+"""
+return self.getOrDefault(self.numHashTables)
+
+
+class LSHModel():
+"""
+Mixin for Locality Sensitive Hashing(LSH) models.
--- End diff --

Done.





[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-08 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100192314
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,200 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
+"""
+Mixin for Locality Sensitive Hashing(LSH) algorithm parameters.
+"""
+
+numHashTables = Param(Params._dummy(), "numHashTables", "number of 
hash tables, where " +
+  "increasing number of hash tables lowers the 
false negative rate, " +
+  "and decreasing it improves the running 
performance.",
+  typeConverter=TypeConverters.toInt)
+
+def __init__(self):
+super(LSHParams, self).__init__()
+
+@since("2.2.0")
+def setNumHashTables(self, value):
+"""
+Sets the value of :py:attr:`numHashTables`.
+"""
+return self._set(numHashTables=value)
+
+@since("2.2.0")
+def getNumHashTables(self):
+"""
+Gets the value of numHashTables or its default value.
+"""
+return self.getOrDefault(self.numHashTables)
+
+
+class LSHModel():
+"""
+Mixin for Locality Sensitive Hashing(LSH) models.
+"""
+
+@since("2.2.0")
+def approxNearestNeighbors(self, dataset, key, numNearestNeighbors, 
singleProbing=True,
--- End diff --

Done.





[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-08 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100192074
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,200 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
+"""
+Mixin for Locality Sensitive Hashing(LSH) algorithm parameters.
+"""
+
+numHashTables = Param(Params._dummy(), "numHashTables", "number of 
hash tables, where " +
+  "increasing number of hash tables lowers the 
false negative rate, " +
+  "and decreasing it improves the running 
performance.",
+  typeConverter=TypeConverters.toInt)
+
+def __init__(self):
+super(LSHParams, self).__init__()
+
+@since("2.2.0")
+def setNumHashTables(self, value):
+"""
+Sets the value of :py:attr:`numHashTables`.
+"""
+return self._set(numHashTables=value)
+
+@since("2.2.0")
+def getNumHashTables(self):
+"""
+Gets the value of numHashTables or its default value.
+"""
+return self.getOrDefault(self.numHashTables)
+
+
+class LSHModel():
--- End diff --

Done.





[GitHub] spark pull request #16715: [Spark-18080][ML] Python API & Examples for Local...

2017-02-08 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100192026
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,200 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
--- End diff --

It's not alphabetized here because the declaration order matters for the 
PySpark shell.





[GitHub] spark issue #16800: [SPARK-19456][SparkR]:Add LinearSVC R API

2017-02-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16800
  
**[Test build #72608 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72608/testReport)**
 for PR 16800 at commit 
[`a180126`](https://github.com/apache/spark/commit/a1801261335cf08fc87f17d76979fda6fdcc1969).





[GitHub] spark issue #16860: [SPARK-18613][ML] make spark.mllib LDA dependencies in s...

2017-02-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16860
  
**[Test build #72607 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72607/testReport)**
 for PR 16860 at commit 
[`ce8abb3`](https://github.com/apache/spark/commit/ce8abb308c0b555754d332712d922a7aa9a8b220).





[GitHub] spark pull request #16855: [SPARK-13931] Resolve stage hanging up problem in...

2017-02-08 Thread kayousterhout
Github user kayousterhout commented on a diff in the pull request:

https://github.com/apache/spark/pull/16855#discussion_r100190769
  
--- Diff: 
core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala ---
@@ -664,6 +665,55 @@ class TaskSetManagerSuite extends SparkFunSuite with 
LocalSparkContext with Logg
 assert(thrown2.getMessage().contains("bigger than 
spark.driver.maxResultSize"))
   }
 
+  test("taskSetManager should not send Resubmitted tasks after being a 
zombie") {
+// Regression test for SPARK-13931
+val conf = new SparkConf().set("spark.speculation", "true")
+sc = new SparkContext("local", "test", conf)
+
+val sched = new FakeTaskScheduler(sc, ("execA", "host1"), ("execB", 
"host2"))
+sched.initialize(new FakeSchedulerBackend() {
+  override def killTask(taskId: Long, executorId: String, 
interruptThread: Boolean): Unit = {}
+})
+
+// count for Resubmitted tasks
+var resubmittedTasks = 0
+val dagScheduler = new FakeDAGScheduler(sc, sched) {
+  override def taskEnded(task: Task[_], reason: TaskEndReason, result: 
Any,
+ accumUpdates: Seq[AccumulatorV2[_, _]], 
taskInfo: TaskInfo): Unit = {
+super.taskEnded(task, reason, result, accumUpdates, taskInfo)
+reason match {
+  case Resubmitted => resubmittedTasks += 1
+  case _ =>
+}
+  }
+}
+sched.setDAGScheduler(dagScheduler)
+
+val tasks = Array.tabulate[Task[_]](1) { i =>
+  new ShuffleMapTask(i, 0, null, new Partition {
+override def index: Int = 0
+  }, Seq(TaskLocation("host1", "execA")), new Properties, null)
+}
+val taskSet = new TaskSet(tasks, 0, 0, 0, null)
+val manager = new TaskSetManager(sched, taskSet, MAX_TASK_FAILURES)
+manager.speculatableTasks += tasks.head.partitionId
+val task1 = manager.resourceOffer("execA", "host1", 
TaskLocality.PROCESS_LOCAL).get
+val task2 = manager.resourceOffer("execB", "host2", 
TaskLocality.ANY).get
+
+assert(manager.runningTasks == 2)
+assert(manager.isZombie == false)
+
+val directTaskResult = new DirectTaskResult[String](null, Seq()) {
+  override def value(resultSer: SerializerInstance): String = ""
+}
+manager.handleSuccessfulTask(task1.taskId, directTaskResult)
+assert(manager.isZombie == true)
+assert(resubmittedTasks == 0)
--- End diff --

can you check that manager.runningTasks is 1 here, and 0 below?
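In other words, the extra checks might look like this (a sketch only, reusing 
the identifiers from the quoted test; the later `0` check assumes the test 
goes on to finish the remaining speculative copy):

```scala
// After the first copy of the task succeeds, the manager becomes a zombie and
// only the speculative copy should still be counted as running.
manager.handleSuccessfulTask(task1.taskId, directTaskResult)
assert(manager.isZombie === true)
assert(resubmittedTasks === 0)
assert(manager.runningTasks === 1)

// ...and once the remaining copy finishes later in the test,
// manager.runningTasks should drop back to 0.
```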





[GitHub] spark pull request #16855: [SPARK-13931] Resolve stage hanging up problem in...

2017-02-08 Thread kayousterhout
Github user kayousterhout commented on a diff in the pull request:

https://github.com/apache/spark/pull/16855#discussion_r100188875
  
--- Diff: 
core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala ---
@@ -664,6 +665,55 @@ class TaskSetManagerSuite extends SparkFunSuite with 
LocalSparkContext with Logg
 assert(thrown2.getMessage().contains("bigger than 
spark.driver.maxResultSize"))
   }
 
+  test("taskSetManager should not send Resubmitted tasks after being a 
zombie") {
+// Regression test for SPARK-13931
+val conf = new SparkConf().set("spark.speculation", "true")
+sc = new SparkContext("local", "test", conf)
+
+val sched = new FakeTaskScheduler(sc, ("execA", "host1"), ("execB", 
"host2"))
+sched.initialize(new FakeSchedulerBackend() {
+  override def killTask(taskId: Long, executorId: String, 
interruptThread: Boolean): Unit = {}
+})
+
+// count for Resubmitted tasks
+var resubmittedTasks = 0
+val dagScheduler = new FakeDAGScheduler(sc, sched) {
+  override def taskEnded(task: Task[_], reason: TaskEndReason, result: 
Any,
+ accumUpdates: Seq[AccumulatorV2[_, _]], 
taskInfo: TaskInfo): Unit = {
+super.taskEnded(task, reason, result, accumUpdates, taskInfo)
+reason match {
+  case Resubmitted => resubmittedTasks += 1
+  case _ =>
+}
+  }
+}
+sched.setDAGScheduler(dagScheduler)
+
+val tasks = Array.tabulate[Task[_]](1) { i =>
+  new ShuffleMapTask(i, 0, null, new Partition {
+override def index: Int = 0
+  }, Seq(TaskLocation("host1", "execA")), new Properties, null)
+}
+val taskSet = new TaskSet(tasks, 0, 0, 0, null)
+val manager = new TaskSetManager(sched, taskSet, MAX_TASK_FAILURES)
+manager.speculatableTasks += tasks.head.partitionId
--- End diff --

can you add a comment here about what's going on? I think it would be clearer 
if you moved this line below task 1. Then, before task 1, add a comment saying 
"Offer host1, which should be accepted as a PROCESS_LOCAL location by the one 
task in the task set". Then, before this speculatableTasks line, add something 
like "Mark the task as available for speculation, and then offer another 
resource, which should be used to launch a speculative copy of the task."
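Roughly, the reordered snippet could read as follows (a sketch that reuses the 
identifiers from the quoted test and the comments suggested above; not part of 
the actual patch):

```scala
// Offer host1, which should be accepted as a PROCESS_LOCAL location by the
// one task in the task set.
val task1 = manager.resourceOffer("execA", "host1", TaskLocality.PROCESS_LOCAL).get

// Mark the task as available for speculation, and then offer another resource,
// which should be used to launch a speculative copy of the task.
manager.speculatableTasks += tasks.head.partitionId
val task2 = manager.resourceOffer("execB", "host2", TaskLocality.ANY).get
```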





[GitHub] spark pull request #16855: [SPARK-13931] Resolve stage hanging up problem in...

2017-02-08 Thread kayousterhout
Github user kayousterhout commented on a diff in the pull request:

https://github.com/apache/spark/pull/16855#discussion_r100187460
  
--- Diff: 
core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala ---
@@ -664,6 +665,55 @@ class TaskSetManagerSuite extends SparkFunSuite with 
LocalSparkContext with Logg
 assert(thrown2.getMessage().contains("bigger than 
spark.driver.maxResultSize"))
   }
 
+  test("taskSetManager should not send Resubmitted tasks after being a 
zombie") {
+// Regression test for SPARK-13931
+val conf = new SparkConf().set("spark.speculation", "true")
+sc = new SparkContext("local", "test", conf)
+
+val sched = new FakeTaskScheduler(sc, ("execA", "host1"), ("execB", 
"host2"))
+sched.initialize(new FakeSchedulerBackend() {
+  override def killTask(taskId: Long, executorId: String, 
interruptThread: Boolean): Unit = {}
+})
+
+// count for Resubmitted tasks
+var resubmittedTasks = 0
+val dagScheduler = new FakeDAGScheduler(sc, sched) {
+  override def taskEnded(task: Task[_], reason: TaskEndReason, result: 
Any,
+ accumUpdates: Seq[AccumulatorV2[_, _]], 
taskInfo: TaskInfo): Unit = {
+super.taskEnded(task, reason, result, accumUpdates, taskInfo)
+reason match {
+  case Resubmitted => resubmittedTasks += 1
+  case _ =>
+}
+  }
+}
+sched.setDAGScheduler(dagScheduler)
+
+val tasks = Array.tabulate[Task[_]](1) { i =>
--- End diff --

do you need Array.tabulate here, given that you're only creating one task / 
task location?





[GitHub] spark issue #16860: [SPARK-18613][ML] make spark.mllib LDA dependencies in s...

2017-02-08 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/16860
  
I get it, but making something less visible after it's released is also a 
breaking change. Without a good reason for that we shouldn't make changes like 
this.





[GitHub] spark pull request #16855: [SPARK-13931] Resolve stage hanging up problem in...

2017-02-08 Thread kayousterhout
Github user kayousterhout commented on a diff in the pull request:

https://github.com/apache/spark/pull/16855#discussion_r100186410
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -874,7 +874,8 @@ private[spark] class TaskSetManager(
 // and we are not using an external shuffle server which could serve 
the shuffle outputs.
 // The reason is the next stage wouldn't be able to fetch the data 
from this dead executor
 // so we would need to rerun these tasks on other executors.
-if (tasks(0).isInstanceOf[ShuffleMapTask] && 
!env.blockManager.externalShuffleServiceEnabled) {
+if (tasks(0).isInstanceOf[ShuffleMapTask] && 
!env.blockManager.externalShuffleServiceEnabled
+  && !isZombie) {
--- End diff --

nit: indentation (add two spaces)
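That is, the continuation line would get a four-space indent (formatting 
sketch of the quoted condition only):

```scala
if (tasks(0).isInstanceOf[ShuffleMapTask] && !env.blockManager.externalShuffleServiceEnabled
    && !isZombie) {
  // ...
}
```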





[GitHub] spark pull request #16855: [SPARK-13931] Resolve stage hanging up problem in...

2017-02-08 Thread kayousterhout
Github user kayousterhout commented on a diff in the pull request:

https://github.com/apache/spark/pull/16855#discussion_r100189090
  
--- Diff: 
core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala ---
@@ -664,6 +665,55 @@ class TaskSetManagerSuite extends SparkFunSuite with 
LocalSparkContext with Logg
 assert(thrown2.getMessage().contains("bigger than 
spark.driver.maxResultSize"))
   }
 
+  test("taskSetManager should not send Resubmitted tasks after being a 
zombie") {
+// Regression test for SPARK-13931
+val conf = new SparkConf().set("spark.speculation", "true")
+sc = new SparkContext("local", "test", conf)
+
+val sched = new FakeTaskScheduler(sc, ("execA", "host1"), ("execB", 
"host2"))
+sched.initialize(new FakeSchedulerBackend() {
+  override def killTask(taskId: Long, executorId: String, 
interruptThread: Boolean): Unit = {}
+})
+
+// count for Resubmitted tasks
+var resubmittedTasks = 0
+val dagScheduler = new FakeDAGScheduler(sc, sched) {
+  override def taskEnded(task: Task[_], reason: TaskEndReason, result: 
Any,
+ accumUpdates: Seq[AccumulatorV2[_, _]], 
taskInfo: TaskInfo): Unit = {
+super.taskEnded(task, reason, result, accumUpdates, taskInfo)
+reason match {
+  case Resubmitted => resubmittedTasks += 1
+  case _ =>
+}
+  }
+}
+sched.setDAGScheduler(dagScheduler)
+
+val tasks = Array.tabulate[Task[_]](1) { i =>
+  new ShuffleMapTask(i, 0, null, new Partition {
+override def index: Int = 0
+  }, Seq(TaskLocation("host1", "execA")), new Properties, null)
+}
+val taskSet = new TaskSet(tasks, 0, 0, 0, null)
+val manager = new TaskSetManager(sched, taskSet, MAX_TASK_FAILURES)
+manager.speculatableTasks += tasks.head.partitionId
+val task1 = manager.resourceOffer("execA", "host1", 
TaskLocality.PROCESS_LOCAL).get
+val task2 = manager.resourceOffer("execB", "host2", 
TaskLocality.ANY).get
+
+assert(manager.runningTasks == 2)
--- End diff --

can you use triple equals here and below? That way ScalaTest will print out 
the expected and actual values automatically.
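For instance (a minimal sketch; `===` comes from ScalaTest's Assertions, which 
Spark's test suites already mix in):

```scala
// With plain ==, a failure only says the boolean assertion was false:
assert(manager.runningTasks == 2)

// With ===, ScalaTest reports both values, e.g. "3 did not equal 2",
// which makes the failure much easier to diagnose.
assert(manager.runningTasks === 2)
```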





[GitHub] spark pull request #16855: [SPARK-13931] Resolve stage hanging up problem in...

2017-02-08 Thread kayousterhout
Github user kayousterhout commented on a diff in the pull request:

https://github.com/apache/spark/pull/16855#discussion_r100187114
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -874,7 +874,8 @@ private[spark] class TaskSetManager(
 // and we are not using an external shuffle server which could serve 
the shuffle outputs.
 // The reason is the next stage wouldn't be able to fetch the data 
from this dead executor
 // so we would need to rerun these tasks on other executors.
-if (tasks(0).isInstanceOf[ShuffleMapTask] && 
!env.blockManager.externalShuffleServiceEnabled) {
+if (tasks(0).isInstanceOf[ShuffleMapTask] && 
!env.blockManager.externalShuffleServiceEnabled
+  && !isZombie) {
--- End diff --

Also I'm concerned that we might need some of the functionality below even 
when the TSM is a zombie.  While the TSM shouldn't tell the DAGScheduler that 
the task was resubmitted, I think it does need to notify the DAGScheduler that 
tasks on the executor are finished (otherwise they'll never be marked as 
finished in the UI, for example), and I also think it needs to properly update 
the running copies of the task.





[GitHub] spark pull request #16855: [SPARK-13931] Resolve stage hanging up problem in...

2017-02-08 Thread kayousterhout
Github user kayousterhout commented on a diff in the pull request:

https://github.com/apache/spark/pull/16855#discussion_r100190631
  
--- Diff: 
core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala ---
@@ -664,6 +665,55 @@ class TaskSetManagerSuite extends SparkFunSuite with 
LocalSparkContext with Logg
 assert(thrown2.getMessage().contains("bigger than 
spark.driver.maxResultSize"))
   }
 
+  test("taskSetManager should not send Resubmitted tasks after being a 
zombie") {
+// Regression test for SPARK-13931
+val conf = new SparkConf().set("spark.speculation", "true")
+sc = new SparkContext("local", "test", conf)
+
+val sched = new FakeTaskScheduler(sc, ("execA", "host1"), ("execB", 
"host2"))
+sched.initialize(new FakeSchedulerBackend() {
+  override def killTask(taskId: Long, executorId: String, 
interruptThread: Boolean): Unit = {}
+})
+
+// count for Resubmitted tasks
+var resubmittedTasks = 0
+val dagScheduler = new FakeDAGScheduler(sc, sched) {
--- End diff --

Rather than defining your own DAGScheduler, can you use the existing 
FakeDAGScheduler, and then use the FakeTaskScheduler to make sure that the task 
was recorded as ended for the correct reason (i.e., not because it was 
resubmitted)?





[GitHub] spark pull request #16855: [SPARK-13931] Resolve stage hanging up problem in...

2017-02-08 Thread kayousterhout
Github user kayousterhout commented on a diff in the pull request:

https://github.com/apache/spark/pull/16855#discussion_r100189015
  
--- Diff: 
core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala ---
@@ -664,6 +665,55 @@ class TaskSetManagerSuite extends SparkFunSuite with 
LocalSparkContext with Logg
 assert(thrown2.getMessage().contains("bigger than 
spark.driver.maxResultSize"))
   }
 
+  test("taskSetManager should not send Resubmitted tasks after being a 
zombie") {
+// Regression test for SPARK-13931
+val conf = new SparkConf().set("spark.speculation", "true")
+sc = new SparkContext("local", "test", conf)
+
+val sched = new FakeTaskScheduler(sc, ("execA", "host1"), ("execB", 
"host2"))
+sched.initialize(new FakeSchedulerBackend() {
+  override def killTask(taskId: Long, executorId: String, 
interruptThread: Boolean): Unit = {}
+})
+
+// count for Resubmitted tasks
+var resubmittedTasks = 0
+val dagScheduler = new FakeDAGScheduler(sc, sched) {
+  override def taskEnded(task: Task[_], reason: TaskEndReason, result: 
Any,
+ accumUpdates: Seq[AccumulatorV2[_, _]], 
taskInfo: TaskInfo): Unit = {
+super.taskEnded(task, reason, result, accumUpdates, taskInfo)
+reason match {
+  case Resubmitted => resubmittedTasks += 1
+  case _ =>
+}
+  }
+}
+sched.setDAGScheduler(dagScheduler)
+
+val tasks = Array.tabulate[Task[_]](1) { i =>
+  new ShuffleMapTask(i, 0, null, new Partition {
+override def index: Int = 0
+  }, Seq(TaskLocation("host1", "execA")), new Properties, null)
+}
+val taskSet = new TaskSet(tasks, 0, 0, 0, null)
+val manager = new TaskSetManager(sched, taskSet, MAX_TASK_FAILURES)
+manager.speculatableTasks += tasks.head.partitionId
+val task1 = manager.resourceOffer("execA", "host1", 
TaskLocality.PROCESS_LOCAL).get
+val task2 = manager.resourceOffer("execB", "host2", 
TaskLocality.ANY).get
+
+assert(manager.runningTasks == 2)
+assert(manager.isZombie == false)
+
+val directTaskResult = new DirectTaskResult[String](null, Seq()) {
--- End diff --

here, can you add a comment with something like "Complete one copy of the 
task, which should result in the task set manager being marked as a zombie, 
because at least one copy of its only task has completed."





[GitHub] spark pull request #16860: [SPARK-18613][ML] make spark.mllib LDA dependenci...

2017-02-08 Thread sueann
GitHub user sueann opened a pull request:

https://github.com/apache/spark/pull/16860

[SPARK-18613][ML] make spark.mllib LDA dependencies in spark.ml LDA private

## What changes were proposed in this pull request?
spark.ml.*LDAModel classes were exposing spark.mllib LDA models via 
protected methods. Made them package (clustering) private.

## How was this patch tested?
```
build/sbt doc  # "mllib.clustering" no longer appears in the docs for 
*LDA* classes
build/sbt compile  # compiles
build/sbt
> mllib/testOnly   # tests pass
```


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sueann/spark SPARK-18613

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16860.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16860


commit ce8abb308c0b555754d332712d922a7aa9a8b220
Author: sueann 
Date:   2017-02-08T22:38:44Z

make spark.mllib LDA dependencies private







[GitHub] spark issue #16625: [SPARK-17874][core] Add SSL port configuration.

2017-02-08 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/16625
  
Unless I hear back I'll push this tomorrow morning.





[GitHub] spark issue #16826: Fork SparkSession with option to inherit a copy of the S...

2017-02-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16826
  
**[Test build #72606 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72606/testReport)**
 for PR 16826 at commit 
[`a343d8a`](https://github.com/apache/spark/commit/a343d8af9c577158042e4af9f8832f46aeecd509).





[GitHub] spark issue #13072: [SPARK-15288] [Mesos] Mesos dispatcher should handle gra...

2017-02-08 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/13072
  
I can't really evaluate the change; is anyone else familiar with Mesos around 
to comment? That would be best.





[GitHub] spark issue #13077: [SPARK-10748] [Mesos] Log error instead of crashing Spar...

2017-02-08 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/13077
  
Looks like @tnachen suggested closing this?





[GitHub] spark issue #13326: [SPARK-15560] [Mesos] Queued/Supervise drivers waiting f...

2017-02-08 Thread devaraj-kavali
Github user devaraj-kavali commented on the issue:

https://github.com/apache/spark/pull/13326
  
@tnachen Can you check this?





[GitHub] spark issue #13077: [SPARK-10748] [Mesos] Log error instead of crashing Spar...

2017-02-08 Thread devaraj-kavali
Github user devaraj-kavali commented on the issue:

https://github.com/apache/spark/pull/13077
  
@srowen /@tnachen Can you check this?





[GitHub] spark issue #16826: Fork SparkSession with option to inherit a copy of the S...

2017-02-08 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/16826
  
ok to test





[GitHub] spark issue #13072: [SPARK-15288] [Mesos] Mesos dispatcher should handle gra...

2017-02-08 Thread devaraj-kavali
Github user devaraj-kavali commented on the issue:

https://github.com/apache/spark/pull/13072
  
@srowen Can you check this?





[GitHub] spark issue #13143: [SPARK-15359] [Mesos] Mesos dispatcher should handle DRI...

2017-02-08 Thread devaraj-kavali
Github user devaraj-kavali commented on the issue:

https://github.com/apache/spark/pull/13143
  
@tnachen Can you check this?





[GitHub] spark pull request #16664: [SPARK-18120 ][SQL] Call QueryExecutionListener c...

2017-02-08 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16664#discussion_r100186876
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala ---
@@ -218,7 +247,14 @@ final class DataFrameWriter[T] private[sql](ds: 
Dataset[T]) {
   bucketSpec = getBucketSpec,
   options = extraOptions.toMap)
 
-dataSource.write(mode, df)
+val destination = source match {
+  case "jdbc" => extraOptions.get(JDBCOptions.JDBC_TABLE_NAME)
+  case _ => extraOptions.get("path")
--- End diff --

See my 
[comment](https://github.com/apache/spark/pull/16664#issuecomment-277597939) 
about the other code paths that also need `QueryExecutionListener`. 





[GitHub] spark issue #16846: [SPARK-19506][ML][PYTHON] Import warnings in pyspark.ml....

2017-02-08 Thread holdenk
Github user holdenk commented on the issue:

https://github.com/apache/spark/pull/16846
  
This looks good. Normally I'd say we should have tests, but since we are going 
to remove this in 2.2, that is probably OK as is. I'll merge this to master and 
the 2.1 branch if no one objects by EOD.





[GitHub] spark issue #16855: [SPARK-13931] Resolve stage hanging up problem in a part...

2017-02-08 Thread kayousterhout
Github user kayousterhout commented on the issue:

https://github.com/apache/spark/pull/16855
  
Can you make the PR and JIRA description something more specific? Maybe 
"[SPARK-13931] Stage can hang if an executor fails while speculated tasks are 
running"





[GitHub] spark issue #16825: [SPARK-19481][REPL][maven]Avoid to leak SparkContext in ...

2017-02-08 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/16825
  
This PR doesn't fix all leaks, though. I noticed that there are many 
finalizers and they slow down GC. Most of them come from `JarFile` and 
`Inflater`.





[GitHub] spark issue #16855: [SPARK-13931] Resolve stage hanging up problem in a part...

2017-02-08 Thread kayousterhout
Github user kayousterhout commented on the issue:

https://github.com/apache/spark/pull/16855
  
Jenkins, this is OK to test





[GitHub] spark pull request #16664: [SPARK-18120 ][SQL] Call QueryExecutionListener c...

2017-02-08 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/16664#discussion_r100184896
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala ---
@@ -218,7 +247,14 @@ final class DataFrameWriter[T] private[sql](ds: 
Dataset[T]) {
   bucketSpec = getBucketSpec,
   options = extraOptions.toMap)
 
-dataSource.write(mode, df)
+val destination = source match {
+  case "jdbc" => extraOptions.get(JDBCOptions.JDBC_TABLE_NAME)
+  case _ => extraOptions.get("path")
--- End diff --

What other parameters are you thinking about?

It's pretty easy to wrap this class in some other class that has these 
output params and any other params you want to expose, but it would be nice to 
understand exactly what you're referring to here.
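As a hedged illustration of the "wrap this class in some other class" idea 
(every name below is hypothetical, not existing Spark API): the listener 
payload could bundle the computed destination with any other write-side 
parameters worth exposing later.

```scala
// Hypothetical sketch only; none of these types exist in Spark.
case class OutputParams(
    source: String,                    // e.g. "jdbc" or "parquet"
    destination: Option[String],       // table name for JDBC, path otherwise
    extraOptions: Map[String, String]) // room for additional params later
```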





[GitHub] spark issue #16736: [SPARK-19265][SQL][Follow-up] Configurable `tableRelatio...

2017-02-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16736
  
**[Test build #72605 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72605/testReport)**
 for PR 16736 at commit 
[`314f6f8`](https://github.com/apache/spark/commit/314f6f8de6990b1c3bfddea503490a1797e25117).





[GitHub] spark issue #16804: [SPARK-19459][SQL] Add Hive datatype (char/varchar) to S...

2017-02-08 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/16804
  
LGTM pending test





[GitHub] spark issue #16736: [SPARK-19265][SQL][Follow-up] Configurable `tableRelatio...

2017-02-08 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/16736
  
retest this please





[GitHub] spark pull request #16664: [SPARK-18120 ][SQL] Call QueryExecutionListener c...

2017-02-08 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16664#discussion_r100182956
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala ---
@@ -218,7 +247,14 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) {
   bucketSpec = getBucketSpec,
   options = extraOptions.toMap)
 
-dataSource.write(mode, df)
+val destination = source match {
+  case "jdbc" => extraOptions.get(JDBCOptions.JDBC_TABLE_NAME)
+  case _ => extraOptions.get("path")
--- End diff --

I think we need to make it more general instead of introducing a class for 
the write path only. 
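
For illustration, a rough sketch of the more general shape being suggested — a reusable container of output parameters rather than a write-only destination string. The `OutputParams` name, its fields, and the option keys are hypothetical, not part of this PR:

```scala
// Hypothetical sketch only: a general holder for output parameters that a
// QueryExecutionListener could receive, instead of a write-path-only field.
case class OutputParams(
    datasourceType: String,                  // e.g. "jdbc" or "parquet"
    destination: Option[String],             // table name or path, when known
    options: Map[String, String] = Map.empty // any other parameters to expose
)

// Populating it from the writer's options, mirroring the destination logic
// in the diff above (option keys assumed for illustration):
def outputParamsFor(source: String, extraOptions: Map[String, String]): OutputParams = {
  val destination = source match {
    case "jdbc" => extraOptions.get("dbtable")
    case _      => extraOptions.get("path")
  }
  OutputParams(source, destination, extraOptions)
}
```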





[GitHub] spark issue #16804: [SPARK-19459][SQL] Add Hive datatype (char/varchar) to S...

2017-02-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16804
  
**[Test build #72604 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72604/testReport)**
 for PR 16804 at commit 
[`e7ca0ea`](https://github.com/apache/spark/commit/e7ca0ead843f2c9650e690fe649be18fa6389e48).





[GitHub] spark issue #16804: [SPARK-19459][SQL] Add Hive datatype (char/varchar) to S...

2017-02-08 Thread hvanhovell
Github user hvanhovell commented on the issue:

https://github.com/apache/spark/pull/16804
  
retest this please





[GitHub] spark issue #16386: [SPARK-18352][SQL] Support parsing multiline json files

2017-02-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16386
  
**[Test build #72603 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72603/testReport)**
 for PR 16386 at commit 
[`30fb509`](https://github.com/apache/spark/commit/30fb509c71ed1f919e0198f2366aa817d96fc0ca).





[GitHub] spark issue #16837: [SPARK-19359][SQL] renaming partition should not leave u...

2017-02-08 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/16837
  
LGTM except one comment.





[GitHub] spark issue #16386: [SPARK-18352][SQL] Support parsing multiline json files

2017-02-08 Thread NathanHowell
Github user NathanHowell commented on the issue:

https://github.com/apache/spark/pull/16386
  
Rebased again to pick up the build-break hotfix in 
c618ccdbe9ac103dfa3182346e2a14a1e7fca91a





[GitHub] spark pull request #15666: [SPARK-11421] [Core][Python][R] Added ability for...

2017-02-08 Thread mariusvniekerk
Github user mariusvniekerk commented on a diff in the pull request:

https://github.com/apache/spark/pull/15666#discussion_r100177099
  
--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -1802,19 +1802,34 @@ class SparkContext(config: SparkConf) extends 
Logging {
* Adds a JAR dependency for all tasks to be executed on this 
`SparkContext` in the future.
* @param path can be either a local file, a file in HDFS (or other 
Hadoop-supported filesystems),
* an HTTP, HTTPS or FTP URI, or local:/path for a file on every worker 
node.
+   * If addToCurrentClassLoader is true, attempt to add the new class to 
the current threads' class
--- End diff --

Add to the doc that adding a URL which is already present in the class loader 
has no effect.
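
For reference, a tiny standalone sketch of that behaviour — `java.net.URLClassLoader` documents that adding a URL which is already in its list has no effect. The class name and jar path below are made up for illustration:

```scala
import java.net.{URL, URLClassLoader}

// addURL is protected on URLClassLoader, so widen it for the demo.
class DemoMutableLoader(urls: Array[URL]) extends URLClassLoader(urls, null) {
  override def addURL(url: URL): Unit = super.addURL(url)
}

object AlreadyPresentUrlDemo extends App {
  val jar = new URL("file:///tmp/example.jar") // hypothetical path
  val loader = new DemoMutableLoader(Array(jar))
  loader.addURL(jar)                           // no effect: URL already in the list
  println(loader.getURLs.length)               // prints 1, not 2
}
```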





[GitHub] spark pull request #15666: [SPARK-11421] [Core][Python][R] Added ability for...

2017-02-08 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/15666#discussion_r100176735
  
--- Diff: python/pyspark/context.py ---
@@ -836,6 +836,17 @@ def addPyFile(self, path):
 import importlib
 importlib.invalidate_caches()
 
+def addJar(self, path, addToCurrentClassLoader=False):
+"""
+Adds a JAR dependency for all tasks to be executed on this 
SparkContext in the future.
+The `path` passed can be either a local file, a file in HDFS (or 
other Hadoop-supported
+filesystems), an HTTP, HTTPS or FTP URI, or local:/path for a file 
on every worker node.
+If addToCurrentClassLoader is true, attempt to add the new class 
to the current threads'
+class loader. In general adding to the current threads' class 
loader will impact all other
+application threads unless they have explicitly changed their 
class loader.
--- End diff --

Add a `:param:` annotation here as well for path & addToCurrentClassLoader
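
A sketch of how the docstring might look with the requested annotations; the wording is assembled from the quoted diff plus this suggestion, not the wording that ended up in the PR, and the method body is omitted:

```python
def addJar(self, path, addToCurrentClassLoader=False):
    """
    Adds a JAR dependency for all tasks to be executed on this SparkContext
    in the future.

    :param path: can be either a local file, a file in HDFS (or other
        Hadoop-supported filesystems), an HTTP, HTTPS or FTP URI, or
        local:/path for a file on every worker node.
    :param addToCurrentClassLoader: if True, attempt to add the new classes
        to the current thread's class loader. In general, adding to the
        current thread's class loader will impact all other application
        threads unless they have explicitly changed their class loader.
    """
    # body omitted; see the pull request for the actual implementation
```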





[GitHub] spark issue #16830: [MINOR][CORE] Fix incorrect documentation of WritableCon...

2017-02-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16830
  
**[Test build #3569 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3569/testReport)**
 for PR 16830 at commit 
[`c300ff6`](https://github.com/apache/spark/commit/c300ff6e0a2d802c474b2af5e1bea9afb8101a2c).





[GitHub] spark pull request #15666: [SPARK-11421] [Core][Python][R] Added ability for...

2017-02-08 Thread mariusvniekerk
Github user mariusvniekerk commented on a diff in the pull request:

https://github.com/apache/spark/pull/15666#discussion_r100176188
  
--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -1802,19 +1802,34 @@ class SparkContext(config: SparkConf) extends 
Logging {
* Adds a JAR dependency for all tasks to be executed on this 
`SparkContext` in the future.
* @param path can be either a local file, a file in HDFS (or other 
Hadoop-supported filesystems),
* an HTTP, HTTPS or FTP URI, or local:/path for a file on every worker 
node.
+   * If addToCurrentClassLoader is true, attempt to add the new class to 
the current threads' class
+   * loader. In general adding to the current threads' class loader will 
impact all other
+   * application threads unless they have explicitly changed their class 
loader.
*/
   def addJar(path: String) {
+addJar(path, false)
+  }
+
+  def addJar(path: String, addToCurrentClassLoader: Boolean) {
 if (path == null) {
   logWarning("null specified as parameter to addJar")
 } else {
   var key = ""
-  if (path.contains("\\")) {
+
+  val uri = if (path.contains("\\")) {
 // For local paths with backslashes on Windows, URI throws an 
exception
-key = env.rpcEnv.fileServer.addJar(new File(path))
+new File(path).toURI
   } else {
 val uri = new URI(path)
 // SPARK-17650: Make sure this is a valid URL before adding it to 
the list of dependencies
 Utils.validateURL(uri)
+uri
+  }
+
+  if (path.contains("\\")) {
+// For local paths with backslashes on Windows, URI throws an 
exception
+key = env.rpcEnv.fileServer.addJar(new File(uri))
--- End diff --

If we have backslashes, we are dealing with a local path on Windows.





[GitHub] spark issue #16852: [SPARK-19512][SQL] codegen for compare structs fails

2017-02-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16852
  
**[Test build #3568 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3568/testReport)**
 for PR 16852 at commit 
[`9a8d853`](https://github.com/apache/spark/commit/9a8d8537748f38a4276188b3f60f6852010e3387).





[GitHub] spark pull request #16845: [SPARK-19505][Python] AttributeError on Exception...

2017-02-08 Thread dgingrich
Github user dgingrich commented on a diff in the pull request:

https://github.com/apache/spark/pull/16845#discussion_r100175968
  
--- Diff: python/pyspark/broadcast.py ---
@@ -82,7 +82,7 @@ def dump(self, value, f):
 except pickle.PickleError:
 raise
 except Exception as e:
-msg = "Could not serialize broadcast: " + e.__class__.__name__ 
+ ": " + e.message
+msg = "Could not serialize broadcast: " + e.__class__.__name__ 
+ ": " + str(e)
--- End diff --

Good call, I forgot about Python 2's `str`/`unicode` behavior.  It could be 
a problem if an exception includes user input as part of the message.  IMO it's 
worth handling rigorously, since throwing inside the `except:` block hides the 
original error.  I'll look at fixes tonight; in the worst case we could add a 
`get_exception_message()` helper.  
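
A rough sketch of the kind of helper being floated here — the name comes from the comment above, but the exact behaviour is an assumption, not the eventual fix:

```python
import sys


def get_exception_message(exc):
    """Return the exception message without raising on Python 2 unicode.

    On Python 2, ``str(exc)`` can raise UnicodeEncodeError when the message
    contains non-ASCII user input; fall back to encoding the unicode form.
    """
    if sys.version_info[0] >= 3:
        return str(exc)
    try:
        return str(exc)
    except UnicodeEncodeError:
        return unicode(exc).encode("utf-8")  # noqa: F821 (Python 2 only)
```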





[GitHub] spark issue #16858: [SPARK-19464][BUILD][HOTFIX] run-tests should use hadoop...

2017-02-08 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/16858
  
Thank you for review and merging, @srowen !





[GitHub] spark pull request #16858: [SPARK-19464][BUILD][HOTFIX] run-tests should use...

2017-02-08 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/16858





[GitHub] spark pull request #16837: [SPARK-19359][SQL] renaming partition should not ...

2017-02-08 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16837#discussion_r100174684
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -892,21 +892,58 @@ private[spark] class HiveExternalCatalog(conf: 
SparkConf, hadoopConf: Configurat
 val hasUpperCasePartitionColumn = partitionColumnNames.exists(col => 
col.toLowerCase != col)
 if (tableMeta.tableType == MANAGED && hasUpperCasePartitionColumn) {
   val tablePath = new Path(tableMeta.location)
+  val fs = tablePath.getFileSystem(hadoopConf)
   val newParts = newSpecs.map { spec =>
+val rightPath = renamePartitionDirectory(fs, tablePath, 
partitionColumnNames, spec)
 val partition = client.getPartition(db, table, 
lowerCasePartitionSpec(spec))
-val wrongPath = new Path(partition.location)
-val rightPath = ExternalCatalogUtils.generatePartitionPath(
-  spec, partitionColumnNames, tablePath)
+partition.copy(storage = partition.storage.copy(locationUri = 
Some(rightPath.toString)))
+  }
+  alterPartitions(db, table, newParts)
+}
+  }
+
+  /**
+   * Rename the partition directory w.r.t. the actual partition columns.
+   *
+   * It will recursively rename the partition directory from the first 
partition column, to be most
+   * compatible with different file systems. e.g. in some file systems, 
renaming `a=1/b=2` to
+   * `A=1/B=2` will result to `a=1/B=2`, while in some other file systems, 
the renaming works, but
+   * will leave an empty directory `a=1`.
+   */
+  private def renamePartitionDirectory(
+  fs: FileSystem,
+  tablePath: Path,
+  partCols: Seq[String],
+  newSpec: TablePartitionSpec): Path = {
+import ExternalCatalogUtils.getPartitionPathString
+
+var totalPath = tablePath
--- End diff --

How about `currFullPath` or `fullPath`?
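
For context, a standalone sketch of the recursive rename that the quoted doc comment describes, using the variable name suggested here; the lower-case-on-disk assumption and the partition-path formatting are simplified stand-ins for the helpers in the actual patch:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch only: rename a partition directory one column level at a time,
// e.g. a=1/b=2 -> A=1/B=2, so file systems that would otherwise leave an
// empty `a=1` behind (or only rename the last level) still end up correct.
def renamePartitionDirectorySketch(
    fs: FileSystem,
    tablePath: Path,
    partCols: Seq[String],
    newSpec: Map[String, String]): Path = {
  var fullPath = tablePath  // the name suggested in this comment
  partCols.foreach { col =>
    // Simplified stand-in for ExternalCatalogUtils.getPartitionPathString.
    val rightPath = new Path(fullPath, s"$col=${newSpec(col)}")
    // Assumption: the existing directory uses the lower-cased column name.
    val wrongPath = new Path(fullPath, s"${col.toLowerCase}=${newSpec(col)}")
    if (wrongPath != rightPath && fs.exists(wrongPath)) {
      fs.rename(wrongPath, rightPath)
    }
    fullPath = rightPath
  }
  fullPath
}
```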





[GitHub] spark issue #16858: [SPARK-19464][BUILD][HOTFIX] run-tests should use hadoop...

2017-02-08 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/16858
  
Merged to master





[GitHub] spark issue #16858: [SPARK-19464][BUILD][HOTFIX] run-tests should use hadoop...

2017-02-08 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/16858
  
Finally, it passed.





[GitHub] spark issue #16228: [SPARK-17076] [SQL] Cardinality estimation for join base...

2017-02-08 Thread sameeragarwal
Github user sameeragarwal commented on the issue:

https://github.com/apache/spark/pull/16228
  
cc @cloud-fan can you please take a look?





[GitHub] spark issue #16858: [SPARK-19464][BUILD][HOTFIX] run-tests should use hadoop...

2017-02-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16858
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72597/
Test PASSed.





[GitHub] spark issue #16858: [SPARK-19464][BUILD][HOTFIX] run-tests should use hadoop...

2017-02-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16858
  
Merged build finished. Test PASSed.




