[GitHub] spark pull request #19509: [SPARK-22290][core] Avoid creating Hive delegatio...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19509#discussion_r145329972 --- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/config.scala --- @@ -347,6 +347,10 @@ package object config { .timeConf(TimeUnit.MILLISECONDS) .createWithDefault(Long.MaxValue) + private[spark] val KERBEROS_RELOGIN_PERIOD = ConfigBuilder("spark.yarn.kerberos.relogin.period") +.timeConf(TimeUnit.SECONDS) +.createWithDefaultString("1m") --- End diff -- I think we should put this into doc. Also is it too frequent to call? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19525: [SPARK-22289] [ML] Add JSON support for Matrix pa...
GitHub user hhbyyh opened a pull request: https://github.com/apache/spark/pull/19525 [SPARK-22289] [ML] Add JSON support for Matrix parameters (LR with coefficients bound) ## What changes were proposed in this pull request? jira: https://issues.apache.org/jira/browse/SPARK-22289 Add JSON encode/decode for Param[Matrix]. The issue was reported by Nic Eggert while saving an LR model with LowerBoundsOnCoefficients. There are two ways I see to resolve this: 1. Support save/load on LogisticRegressionParams, and also adjust the save/load in LogisticRegression and LogisticRegressionModel. 2. Directly support Matrix in Param.jsonEncode, similar to what we have done for Vector. After some discussion in the jira, we prefer the fix that supports Matrix as a valid Param type, for simplicity and convenience for other classes. ## How was this patch tested? A new unit test covers the LR case and JsonMatrixConverter. You can merge this pull request into a Git repository by running: $ git pull https://github.com/hhbyyh/spark lrsave Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19525.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19525 commit 92f599d3fb7159cfc1694d2c65321dd2abdc4ccc Author: Yuhao Yang Date: 2017-10-18T06:49:01Z add json for Matrix
[GitHub] spark issue #19525: [SPARK-22289] [ML] Add JSON support for Matrix parameter...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19525 **[Test build #82875 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82875/testReport)** for PR 19525 at commit [`92f599d`](https://github.com/apache/spark/commit/92f599d3fb7159cfc1694d2c65321dd2abdc4ccc).
[GitHub] spark issue #19488: [SPARK-22266][SQL] The same aggregate function was evalu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19488 **[Test build #82874 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82874/testReport)** for PR 19488 at commit [`506c410`](https://github.com/apache/spark/commit/506c410eefb5131c6cf50e947ff01e0bda9c28fe). * This patch **fails due to an unknown error code, -9**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #19525: [SPARK-22289] [ML] Add JSON support for Matrix pa...
Github user hhbyyh commented on a diff in the pull request: https://github.com/apache/spark/pull/19525#discussion_r145330685 --- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala --- @@ -122,17 +124,33 @@ private[ml] object Param { /** Decodes a param value from JSON. */ def jsonDecode[T](json: String): T = { -parse(json) match { +val jValue = parse(json) +jValue match { case JString(x) => x.asInstanceOf[T] case JObject(v) => val keys = v.map(_._1) -assert(keys.contains("type") && keys.contains("values"), - s"Expect a JSON serialized vector but cannot find fields 'type' and 'values' in $json.") -JsonVectorConverter.fromJson(json).asInstanceOf[T] +if (keys.contains("class")) { + implicit val formats = DefaultFormats + val className = (jValue \ "class").extract[String] + className match { +case JsonMatrixConverter.className => + val checkFields = Array("numRows", "numCols", "values", "isTransposed") + require(checkFields.forall(keys.contains), s"Expect a JSON serialized Matrix" + +s" but cannot find fields ${checkFields.mkString(", ")} in $json.") + JsonMatrixConverter.fromJson(json).asInstanceOf[T] + +case s => throw new SparkException(s"unrecognized class $s in $json") + } +} else { // Vector does not have class info in json + require(keys.contains("type") && keys.contains("values"), s"Expect a JSON serialized" + +s" vector/matrix but cannot find fields 'type' and 'values' in $json.") --- End diff -- /matrix should be removed here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
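The dispatch logic in the diff above — branch on an optional "class" key, fall back to the legacy class-less vector format — can be sketched in plain Python. This is an illustrative re-implementation for discussion, not Spark's actual code; `decode_param` and `MATRIX_CLASS` are names chosen here, with the field names taken from the diff.

```python
import json

# Identifier mirroring JsonMatrixConverter.className in the patch
MATRIX_CLASS = "org.apache.spark.ml.linalg.Matrix"
MATRIX_FIELDS = ("numRows", "numCols", "values", "isTransposed")

def decode_param(s):
    """Dispatch a serialized param the way the patched jsonDecode does."""
    obj = json.loads(s)
    if not isinstance(obj, dict):
        return obj  # plain JSON scalar, e.g. a string param
    if "class" in obj:
        # New-style objects carry a class identifier
        if obj["class"] != MATRIX_CLASS:
            raise ValueError("unrecognized class %s" % obj["class"])
        missing = [f for f in MATRIX_FIELDS if f not in obj]
        if missing:
            raise ValueError("missing Matrix fields: %s" % missing)
        return obj  # Spark would call JsonMatrixConverter.fromJson here
    # Vectors carry no class info in JSON, to avoid breaking old models
    if "type" not in obj or "values" not in obj:
        raise ValueError("expect a JSON serialized vector")
    return obj  # Spark would call JsonVectorConverter.fromJson here
```

Keeping vectors class-less while tagging matrices is what preserves backward compatibility with models saved before this change.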
[GitHub] spark issue #19522: [SPARK-22249][FOLLOWUP][SQL] Check if list of value for ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19522 **[Test build #82873 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82873/testReport)** for PR 19522 at commit [`e95bc7b`](https://github.com/apache/spark/commit/e95bc7b395e027aa3d1e719d987b4f5a4461c34b). * This patch **fails due to an unknown error code, -9**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19522: [SPARK-22249][FOLLOWUP][SQL] Check if list of value for ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19522 Merged build finished. Test FAILed.
[GitHub] spark issue #19488: [SPARK-22266][SQL] The same aggregate function was evalu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19488 Merged build finished. Test FAILed.
[GitHub] spark issue #19522: [SPARK-22249][FOLLOWUP][SQL] Check if list of value for ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19522 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82873/ Test FAILed.
[GitHub] spark issue #19488: [SPARK-22266][SQL] The same aggregate function was evalu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19488 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82874/ Test FAILed.
[GitHub] spark issue #19509: [SPARK-22290][core] Avoid creating Hive delegation token...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19509 LGTM, just one minor comment.
[GitHub] spark issue #19522: [SPARK-22249][FOLLOWUP][SQL] Check if list of value for ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19522 **[Test build #3952 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3952/testReport)** for PR 19522 at commit [`e95bc7b`](https://github.com/apache/spark/commit/e95bc7b395e027aa3d1e719d987b4f5a4461c34b).
[GitHub] spark pull request #19360: [SPARK-22139][CORE]Remove the variable which is n...
GitHub user guoxiaolongzte reopened a pull request: https://github.com/apache/spark/pull/19360 [SPARK-22139][CORE]Remove the variable which is never used in SparkConf.scala ## What changes were proposed in this pull request? Remove the variables which are never used in SparkConf.scala. val executorClasspathKey = "spark.executor.extraClassPath" val driverOptsKey = "spark.driver.extraJavaOptions" val driverClassPathKey = "spark.driver.extraClassPath" val sparkExecutorInstances = "spark.executor.instances" These variables are never used, because the implementation code for the validation rule was removed in SPARK-17979. ## How was this patch tested? manual tests Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/guoxiaolongzte/spark SPARK-22139 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19360.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19360 commit f345aa8487a64a0256c6965bc198ba8842cd0a51 Author: guoxiaolong Date: 2017-09-27T06:58:37Z [SPARK-22139] Remove the variable which is never used in SparkConf.scala
[GitHub] spark issue #19469: [SPARK-22243][DStreams]spark.yarn.jars reload from confi...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19469 @ssaavedra , yes I think so. With the pull-in of k8s support, I would guess more configurations will need to be added to the exclusion rule. With the current solution, one-by-one PRs don't make much sense. We should either figure out a general solution or refactor this part. Besides, as we have moved to Structured Streaming, do we need to spend more effort on these issues? @zsxwing
[GitHub] spark issue #19360: [SPARK-22139][CORE]Remove the variable which is never us...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19360 Can one of the admins verify this patch?
[GitHub] spark pull request #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark ...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/19459#discussion_r145334576 --- Diff: python/pyspark/sql/session.py --- @@ -414,6 +415,39 @@ def _createFromLocal(self, data, schema): data = [schema.toInternal(row) for row in data] return self._sc.parallelize(data), schema +def _createFromPandasWithArrow(self, pdf, schema): +""" +Create a DataFrame from a given pandas.DataFrame by slicing it into partitions, converting +to Arrow data, then sending to the JVM to parallelize. If a schema is passed in, the +data types will be used to coerce the data in Pandas to Arrow conversion. +""" +from pyspark.serializers import ArrowSerializer +from pyspark.sql.types import from_arrow_schema, to_arrow_schema +import pyarrow as pa + +# Slice the DataFrame into batches +step = -(-len(pdf) // self.sparkContext.defaultParallelism) # round int up +pdf_slices = (pdf[start:start + step] for start in xrange(0, len(pdf), step)) +arrow_schema = to_arrow_schema(schema) if schema is not None else None +batches = [pa.RecordBatch.from_pandas(pdf_slice, schema=arrow_schema, preserve_index=False) + for pdf_slice in pdf_slices] + +# Verify schema, there will be at least 1 batch from pandas.DataFrame +schema_from_arrow = from_arrow_schema(batches[0].schema) +if schema is not None and schema != schema_from_arrow: +raise ValueError("Supplied schema does not match result from Arrow\nsupplied: " + + "%s\n!=\nfrom Arrow: %s" % (str(schema), str(schema_from_arrow))) --- End diff -- It's okay to fallback with warnings, but I think we should try to adjust types specified by users before that. Otherwise, users can never get the benefit from Arrow when users don't know how to adjust types especially integral types including NaN values. We can split pandas DataFrame into Series once and adjust types during building RecordBatches. I guess we should modify the timestamp values to have timezone for each Series when we support timestamp type anyway. 
---
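The slicing step in the diff above uses the `-(-n // d)` ceiling-division idiom to split a pandas DataFrame into at most `defaultParallelism` batches. A minimal sketch of just that partitioning logic, using a plain sequence in place of a DataFrame (`slice_into_batches` is a name chosen here for illustration):

```python
def slice_into_batches(rows, parallelism):
    """Split a sequence into at most `parallelism` contiguous slices.

    step = ceil(len(rows) / parallelism), computed with the
    -(-n // d) trick so no math.ceil import is needed.
    """
    step = -(-len(rows) // parallelism)  # round int up, as in the diff
    return [rows[start:start + step] for start in range(0, len(rows), step)]
```

Because the step is rounded up, every element lands in some slice and the last slice may be shorter; when there are fewer rows than `parallelism`, each slice holds a single row.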
[GitHub] spark issue #19419: [SPARK-22188] [CORE] Adding security headers for prevent...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19419 **[Test build #82876 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82876/testReport)** for PR 19419 at commit [`de54313`](https://github.com/apache/spark/commit/de54313479383be54de6bb075afe228617c244f2).
[GitHub] spark pull request #19525: [SPARK-22289] [ML] Add JSON support for Matrix pa...
Github user hhbyyh commented on a diff in the pull request: https://github.com/apache/spark/pull/19525#discussion_r145331849 --- Diff: mllib/src/main/scala/org/apache/spark/ml/linalg/JsonMatrixConverter.scala --- @@ -0,0 +1,79 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.ml.linalg + +import org.json4s.DefaultFormats +import org.json4s.JsonDSL._ +import org.json4s.jackson.JsonMethods.{compact, parse => parseJson, render} + +private[ml] object JsonMatrixConverter { + + /** Unique class name for identifying JSON object encoded by this class. */ + val className = "org.apache.spark.ml.linalg.Matrix" --- End diff -- I added this as an identifier, so during loading we know which JSON converter to invoke. Yet I didn't add it for JsonVectorConverter, to avoid breaking old models. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19525: [SPARK-22289] [ML] Add JSON support for Matrix pa...
Github user hhbyyh commented on a diff in the pull request: https://github.com/apache/spark/pull/19525#discussion_r145333064 --- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala --- @@ -122,17 +124,33 @@ private[ml] object Param { /** Decodes a param value from JSON. */ def jsonDecode[T](json: String): T = { -parse(json) match { +val jValue = parse(json) +jValue match { case JString(x) => x.asInstanceOf[T] case JObject(v) => val keys = v.map(_._1) -assert(keys.contains("type") && keys.contains("values"), - s"Expect a JSON serialized vector but cannot find fields 'type' and 'values' in $json.") -JsonVectorConverter.fromJson(json).asInstanceOf[T] +if (keys.contains("class")) { + implicit val formats = DefaultFormats + val className = (jValue \ "class").extract[String] + className match { +case JsonMatrixConverter.className => + val checkFields = Array("numRows", "numCols", "values", "isTransposed") + require(checkFields.forall(keys.contains), s"Expect a JSON serialized Matrix" + +s" but cannot find fields ${checkFields.mkString(", ")} in $json.") + JsonMatrixConverter.fromJson(json).asInstanceOf[T] + +case s => throw new SparkException(s"unrecognized class $s in $json") + } +} else { // Vector does not have class info in json + require(keys.contains("type") && keys.contains("values"), s"Expect a JSON serialized" + +s" vector/matrix but cannot find fields 'type' and 'values' in $json.") --- End diff -- /matrix should be removed here --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19459#discussion_r145336060 --- Diff: python/pyspark/sql/session.py --- @@ -414,6 +415,39 @@ def _createFromLocal(self, data, schema): data = [schema.toInternal(row) for row in data] return self._sc.parallelize(data), schema +def _createFromPandasWithArrow(self, pdf, schema): +""" +Create a DataFrame from a given pandas.DataFrame by slicing it into partitions, converting +to Arrow data, then sending to the JVM to parallelize. If a schema is passed in, the +data types will be used to coerce the data in Pandas to Arrow conversion. +""" +from pyspark.serializers import ArrowSerializer +from pyspark.sql.types import from_arrow_schema, to_arrow_schema +import pyarrow as pa + +# Slice the DataFrame into batches +step = -(-len(pdf) // self.sparkContext.defaultParallelism) # round int up +pdf_slices = (pdf[start:start + step] for start in xrange(0, len(pdf), step)) +arrow_schema = to_arrow_schema(schema) if schema is not None else None +batches = [pa.RecordBatch.from_pandas(pdf_slice, schema=arrow_schema, preserve_index=False) + for pdf_slice in pdf_slices] + +# Verify schema, there will be at least 1 batch from pandas.DataFrame +schema_from_arrow = from_arrow_schema(batches[0].schema) +if schema is not None and schema != schema_from_arrow: +raise ValueError("Supplied schema does not match result from Arrow\nsupplied: " + + "%s\n!=\nfrom Arrow: %s" % (str(schema), str(schema_from_arrow))) --- End diff -- Sure, I couldn't agree more with ^. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19488: [SPARK-22266][SQL] The same aggregate function was evalu...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19488 retest this please
[GitHub] spark issue #19488: [SPARK-22266][SQL] The same aggregate function was evalu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19488 **[Test build #82877 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82877/testReport)** for PR 19488 at commit [`506c410`](https://github.com/apache/spark/commit/506c410eefb5131c6cf50e947ff01e0bda9c28fe).
[GitHub] spark issue #19419: [SPARK-22188] [CORE] Adding security headers for prevent...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19419 **[Test build #82878 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82878/testReport)** for PR 19419 at commit [`b6d4885`](https://github.com/apache/spark/commit/b6d4885e9ad9a03a40b3c28df41d7b263b89369f).
[GitHub] spark pull request #19523: [SPARK-22301][SQL] Add rule to Optimizer for In w...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19523#discussion_r145340100 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala --- @@ -204,6 +204,7 @@ case class In(value: Expression, list: Seq[Expression]) extends Predicate { override def children: Seq[Expression] = value +: list lazy val inSetConvertible = list.forall(_.isInstanceOf[Literal]) + lazy val isListEmpty = list.isEmpty --- End diff -- I am using it to be consistent with the current implementation (see the line above).
[GitHub] spark issue #19360: [SPARK-22139][CORE]Remove the variable which is never us...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/19360 Let's leave this closed.
[GitHub] spark pull request #19360: [SPARK-22139][CORE]Remove the variable which is n...
Github user guoxiaolongzte closed the pull request at: https://github.com/apache/spark/pull/19360
[GitHub] spark issue #19525: [SPARK-22289] [ML] Add JSON support for Matrix parameter...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19525 **[Test build #82875 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82875/testReport)** for PR 19525 at commit [`92f599d`](https://github.com/apache/spark/commit/92f599d3fb7159cfc1694d2c65321dd2abdc4ccc). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case s => throw new SparkException(s\"unrecognized class $s in $json\")` * ` else `
[GitHub] spark issue #19525: [SPARK-22289] [ML] Add JSON support for Matrix parameter...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19525 Merged build finished. Test PASSed.
[GitHub] spark issue #19242: [CORE][DOC]Add event log conf.
Github user guoxiaolongzte commented on the issue: https://github.com/apache/spark/pull/19242 @srowen Could you help review the code? Thanks.
[GitHub] spark issue #19525: [SPARK-22289] [ML] Add JSON support for Matrix parameter...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19525 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82875/ Test PASSed.
[GitHub] spark pull request #19242: [CORE][DOC]Add event log conf.
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/19242#discussion_r145343263 --- Diff: docs/configuration.md --- @@ -740,6 +740,20 @@ Apart from these, the following properties are also available, and may be useful + spark.eventLog.overwrite + false + +Whether to overwrite any existing files. + + + + spark.eventLog.buffer.kb + 100 + +Buffer size to use when writing to output streams.Buffer size in KB. --- End diff -- Might want to proofread these before pinging for review. You're missing a space before the second sentence, but why not just: "Buffer size in KB to use when writing to output streams"
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r145343991 --- Diff: python/pyspark/ml/image.py --- @@ -0,0 +1,124 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import pyspark +from pyspark import SparkContext +from pyspark.sql.types import * +from pyspark.sql.types import Row, _create_row +from pyspark.sql import DataFrame +from pyspark.ml.param.shared import * +import numpy as np + +undefinedImageType = "Undefined" + +ImageFields = ["origin", "height", "width", "nChannels", "mode", "data"] + +ocvTypes = { +undefinedImageType: -1, +"CV_8U": 0, "CV_8UC1": 0, "CV_8UC2": 8, "CV_8UC3": 16, "CV_8UC4": 24, +"CV_8S": 1, "CV_8SC1": 1, "CV_8SC2": 9, "CV_8SC3": 17, "CV_8SC4": 25, +"CV_16U": 2, "CV_16UC1": 2, "CV_16UC2": 10, "CV_16UC3": 18, "CV_16UC4": 26, +"CV_16S": 3, "CV_16SC1": 3, "CV_16SC2": 11, "CV_16SC3": 19, "CV_16SC4": 27, +"CV_32S": 4, "CV_32SC1": 4, "CV_32SC2": 12, "CV_32SC3": 20, "CV_32SC4": 28, +"CV_32F": 5, "CV_32FC1": 5, "CV_32FC2": 13, "CV_32FC3": 21, "CV_32FC4": 29, +"CV_64F": 6, "CV_64FC1": 6, "CV_64FC2": 14, "CV_64FC3": 22, "CV_64FC4": 30 +} + +# DataFrame with a single column of images named "image" (nullable) +imageSchema = 
StructType(StructField("image", StructType([ +StructField(ImageFields[0], StringType(), True), +StructField(ImageFields[1], IntegerType(), False), +StructField(ImageFields[2], IntegerType(), False), +StructField(ImageFields[3], IntegerType(), False), +# OpenCV-compatible type: CV_8UC3 in most cases +StructField(ImageFields[4], StringType(), False), +# bytes in OpenCV-compatible order: row-wise BGR in most cases +StructField(ImageFields[5], BinaryType(), False)]), True)) + + +# TODO: generalize to other datatypes and number of channels +def toNDArray(image): +""" +Converts an image to a one-dimensional array + +:param image (object): The image to be converted +:rtype array: The image as a one-dimensional array + +.. versionadded:: 2.3.0 +""" +height = image.height +width = image.width +nChannels = image.nChannels +return np.ndarray( +shape=(height, width, nChannels), +dtype=np.uint8, +buffer=image.data, +strides=(width * nChannels, nChannels, 1)) + + +# TODO: generalize to other datatypes and number of channels +def toImage(array, origin="", mode=ocvTypes["CV_8UC3"]): +""" + --- End diff -- Not a big deal but I'd remove this extra newline. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
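The `toNDArray` helper quoted above reinterprets the packed image bytes as a `(height, width, nChannels)` array by constructing an `np.ndarray` over the existing buffer with explicit strides, so no copy is made. The core of that conversion can be exercised standalone (`bytes_to_array` is a name chosen here; the shape, dtype, and strides are exactly those in the diff):

```python
import numpy as np

def bytes_to_array(data, height, width, n_channels):
    """View row-wise packed uint8 image bytes as a (H, W, C) array.

    strides = (width * n_channels, n_channels, 1): moving one row down
    skips a full row of pixels, one column right skips one pixel,
    and one channel over skips a single byte.
    """
    return np.ndarray(
        shape=(height, width, n_channels),
        dtype=np.uint8,
        buffer=data,
        strides=(width * n_channels, n_channels, 1))
```

Note these strides are the default C-contiguous layout for a uint8 array of that shape; spelling them out makes the assumed byte order (row-wise, channels interleaved, BGR in the OpenCV case) explicit.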
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r145342026 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala --- @@ -0,0 +1,258 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.image + +import java.awt.Color +import java.awt.color.ColorSpace +import java.io.ByteArrayInputStream +import javax.imageio.ImageIO + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.input.PortableDataStream +import org.apache.spark.sql.{DataFrame, Row, SparkSession} +import org.apache.spark.sql.types._ + +@Experimental +@Since("2.3.0") +object ImageSchema { + + val undefinedImageType = "Undefined" + + val imageFields = Array("origin", "height", "width", "nChannels", "mode", "data") + + val ocvTypes = Map( +undefinedImageType -> -1, +"CV_8U" -> 0, "CV_8UC1" -> 0, "CV_8UC2" -> 8, "CV_8UC3" -> 16, "CV_8UC4" -> 24, +"CV_8S" -> 1, "CV_8SC1" -> 1, "CV_8SC2" -> 9, "CV_8SC3" -> 17, "CV_8SC4" -> 25, +"CV_16U" -> 2, "CV_16UC1" -> 2, "CV_16UC2" -> 10, "CV_16UC3" -> 18, "CV_16UC4" -> 26, +"CV_16S" -> 3, "CV_16SC1" -> 3, "CV_16SC2" -> 11, "CV_16SC3" -> 19, "CV_16SC4" -> 27, +"CV_32S" -> 4, "CV_32SC1" -> 4, "CV_32SC2" -> 12, "CV_32SC3" -> 20, "CV_32SC4" -> 28, +"CV_32F" -> 5, "CV_32FC1" -> 5, "CV_32FC2" -> 13, "CV_32FC3" -> 21, "CV_32FC4" -> 29, +"CV_64F" -> 6, "CV_64FC1" -> 6, "CV_64FC2" -> 14, "CV_64FC3" -> 22, "CV_64FC4" -> 30 + ) + + /** + * Schema for the image column: Row(String, Int, Int, Int, Array[Byte]) + */ + private val columnSchema = StructType( +StructField(imageFields(0), StringType, true) :: +StructField(imageFields(1), IntegerType, false) :: +StructField(imageFields(2), IntegerType, false) :: +StructField(imageFields(3), IntegerType, false) :: +// OpenCV-compatible type: CV_8UC3 in most cases +StructField(imageFields(4), IntegerType, false) :: +// Bytes in OpenCV-compatible order: row-wise BGR in most cases +StructField(imageFields(5), BinaryType, false) :: Nil) + + /** + * DataFrame with a single column of images named "image" (nullable) + */ + val imageSchema = StructType(StructField("image", columnSchema, true) :: Nil) + + /** + * :: Experimental :: + * Gets the origin of the image + * + * 
@return The origin of the image + */ + def getOrigin(row: Row): String = row.getString(0) + + /** + * :: Experimental :: + * Gets the height of the image + * + * @return The height of the image + */ + def getHeight(row: Row): Int = row.getInt(1) + + /** + * :: Experimental :: + * Gets the width of the image + * + * @return The width of the image + */ + def getWidth(row: Row): Int = row.getInt(2) + + /** + * :: Experimental :: + * Gets the number of channels in the image + * + * @return The number of channels in the image + */ + def getNChannels(row: Row): Int = row.getInt(3) + + /** + * :: Experimental :: + * Gets the OpenCV representation as an int + * + * @return The OpenCV representation as an int + */ + def getMode(row: Row): Int = row.getInt(4) + + /** + * :: Experimental :: + * Gets the image data + * + * @return The image data + */ + def getData(row: Row): Array[Byte] = row.getAs[Array[Byte]](5) + + /** + * :: Experimental :: + * Check if the DataFrame column contains images (i.e. has ImageSchema) + * + * @param df DataFrame + * @param column Column name + * @return True if the given column matches the image schema + */ + def isImageColumn(df: DataFrame, column: String): Boolean = +df.schema(column).dataType == columnSchema + + /*
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r145342355 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/HadoopUtils.scala --- @@ -0,0 +1,119 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.image + +import scala.language.existentials +import scala.util.Random + +import org.apache.commons.io.FilenameUtils +import org.apache.hadoop.conf.{Configuration, Configured} +import org.apache.hadoop.fs.{Path, PathFilter} +import org.apache.hadoop.mapreduce.lib.input.FileInputFormat + +import org.apache.spark.sql.SparkSession + +private object RecursiveFlag { + + /** + * Sets a value of spark recursive flag. + * If value is a None, it unsets the flag. 
+ * + * @param value value to set + * @param spark existing spark session + * @return previous value of this flag + */ + def setRecursiveFlag(value: Option[String], spark: SparkSession): Option[String] = { +val flagName = FileInputFormat.INPUT_DIR_RECURSIVE +val hadoopConf = spark.sparkContext.hadoopConfiguration +val old = Option(hadoopConf.get(flagName)) + +value match { + case Some(v) => hadoopConf.set(flagName, v) + case None => hadoopConf.unset(flagName) +} + +old + } +} + +/** + * Filter that allows loading a fraction of HDFS files. + */ +private class SamplePathFilter extends Configured with PathFilter { + val random = new Random() + + // Ratio of files to be read from disk + var sampleRatio: Double = 1 + + override def setConf(conf: Configuration): Unit = { +if (conf != null) { + sampleRatio = conf.getDouble(SamplePathFilter.ratioParam, 1) +} + } + + override def accept(path: Path): Boolean = { +// Note: checking fileSystem.isDirectory is very slow here, so we use basic rules instead +!SamplePathFilter.isFile(path) || random.nextDouble() < sampleRatio + } +} + +private object SamplePathFilter { + val ratioParam = "sampleRatio" + + def isFile(path: Path): Boolean = FilenameUtils.getExtension(path.toString) != "" + + /** + * Sets HDFS PathFilter + * + * @param value Filter class that is passed to HDFS + * @param sampleRatio Fraction of the files that the filter picks + * @param spark Existing Spark session + * @return Returns the previous HDFS path filter + */ + def setPathFilter(value: Option[Class[_]], +sampleRatio: Double, +spark: SparkSession): Option[Class[_]] = { --- End diff -- nit: ```scala def setPathFilter( value: Option[Class[_]], sampleRatio: Double, spark: SparkSession): Option[Class[_]] = { ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r145341631 --- Diff: python/pyspark/ml/image.py --- @@ -0,0 +1,124 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import pyspark +from pyspark import SparkContext +from pyspark.sql.types import * +from pyspark.sql.types import Row, _create_row +from pyspark.sql import DataFrame +from pyspark.ml.param.shared import * +import numpy as np + +undefinedImageType = "Undefined" + +ImageFields = ["origin", "height", "width", "nChannels", "mode", "data"] --- End diff -- `ImageFields ` -> `imageFields` (to match with Scala one) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19439: [SPARK-21866][ML][PySpark] Adding spark image rea...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19439#discussion_r145344124 --- Diff: python/pyspark/ml/image.py --- @@ -0,0 +1,124 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import pyspark +from pyspark import SparkContext +from pyspark.sql.types import * +from pyspark.sql.types import Row, _create_row +from pyspark.sql import DataFrame +from pyspark.ml.param.shared import * +import numpy as np + +undefinedImageType = "Undefined" + +ImageFields = ["origin", "height", "width", "nChannels", "mode", "data"] + +ocvTypes = { +undefinedImageType: -1, +"CV_8U": 0, "CV_8UC1": 0, "CV_8UC2": 8, "CV_8UC3": 16, "CV_8UC4": 24, +"CV_8S": 1, "CV_8SC1": 1, "CV_8SC2": 9, "CV_8SC3": 17, "CV_8SC4": 25, +"CV_16U": 2, "CV_16UC1": 2, "CV_16UC2": 10, "CV_16UC3": 18, "CV_16UC4": 26, +"CV_16S": 3, "CV_16SC1": 3, "CV_16SC2": 11, "CV_16SC3": 19, "CV_16SC4": 27, +"CV_32S": 4, "CV_32SC1": 4, "CV_32SC2": 12, "CV_32SC3": 20, "CV_32SC4": 28, +"CV_32F": 5, "CV_32FC1": 5, "CV_32FC2": 13, "CV_32FC3": 21, "CV_32FC4": 29, +"CV_64F": 6, "CV_64FC1": 6, "CV_64FC2": 14, "CV_64FC3": 22, "CV_64FC4": 30 +} + +# DataFrame with a single column of images named "image" (nullable) +imageSchema = 
StructType(StructField("image", StructType([ +StructField(ImageFields[0], StringType(), True), +StructField(ImageFields[1], IntegerType(), False), +StructField(ImageFields[2], IntegerType(), False), +StructField(ImageFields[3], IntegerType(), False), +# OpenCV-compatible type: CV_8UC3 in most cases +StructField(ImageFields[4], StringType(), False), +# bytes in OpenCV-compatible order: row-wise BGR in most cases +StructField(ImageFields[5], BinaryType(), False)]), True)) + + +# TODO: generalize to other datatypes and number of channels +def toNDArray(image): +""" +Converts an image to a one-dimensional array + +:param image (object): The image to be converted +:rtype array: The image as a one-dimensional array + +.. versionadded:: 2.3.0 +""" +height = image.height +width = image.width +nChannels = image.nChannels +return np.ndarray( +shape=(height, width, nChannels), +dtype=np.uint8, +buffer=image.data, +strides=(width * nChannels, nChannels, 1)) + + +# TODO: generalize to other datatypes and number of channels +def toImage(array, origin="", mode=ocvTypes["CV_8UC3"]): +""" + +Converts a one-dimensional array to a two-dimensional image + +:param array (array): The array to convert to image +:param origin (str): Path to the image +:param mode (str): OpenCV compatible type + +:rtype object: Two dimensional image + +.. versionadded:: 2.3.0 +""" +data = bytearray(array.astype(dtype=np.uint8).ravel()) +height = array.shape[0] +width = array.shape[1] +nChannels = array.shape[2] +# Creating new Row with _create_row(), because Row(name = value, ... ) +# orders fields by name, which conflicts with expected schema order +# when the new DataFrame is created by UDF +return _create_row(ImageFields, + [origin, height, width, nChannels, mode, data]) + + +def readImages(path, recursive=False, numPartitions=0, + dropImageFailures=False, sampleRatio=1.0): +""" +Reads the directory of images from the local or remote source. 
--- End diff -- tiny nit: I'd add a newline between this description and `:param` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
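The `toNDArray` helper in the diff above reinterprets the flat `data` bytes as a height × width × nChannels array using strides `(width * nChannels, nChannels, 1)`. The equivalent index arithmetic in pure Python (no NumPy; `pixel` is an illustrative name):

```python
def pixel(data, width, n_channels, y, x, c):
    """Byte value of channel c of pixel (y, x) in row-wise interleaved data."""
    return data[y * width * n_channels + x * n_channels + c]

# A 2x2 BGR image: each pixel stores its (b, g, r) bytes consecutively,
# rows laid out one after another.
data = bytes([1, 2, 3,   4, 5, 6,
              7, 8, 9,   10, 11, 12])
assert pixel(data, width=2, n_channels=3, y=0, x=1, c=0) == 4   # blue of (0,1)
assert pixel(data, width=2, n_channels=3, y=1, x=0, c=2) == 9   # red of (1,0)
```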
[GitHub] spark pull request #19242: [CORE][DOC]Add event log conf.
Github user guoxiaolongzte commented on a diff in the pull request: https://github.com/apache/spark/pull/19242#discussion_r145346521 --- Diff: docs/configuration.md --- @@ -740,6 +740,20 @@ Apart from these, the following properties are also available, and may be useful + spark.eventLog.overwrite + false + +Whether to overwrite any existing files. + + + + spark.eventLog.buffer.kb + 100 + +Buffer size to use when writing to output streams.Buffer size in KB. --- End diff -- I have fixed the description and correction unit. Please check org.apache.spark.internal.config#EVENT_LOG_OUTPUT_BUFFER_SIZE private[spark] val EVENT_LOG_OUTPUT_BUFFER_SIZE = ConfigBuilder("spark.eventLog.buffer.kb") .bytesConf(ByteUnit.KiB) .createWithDefaultString("100k") --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19526: [SPARK-22014][SQL] removed TypeCheckFailure: slid...
GitHub user SimonUzL opened a pull request: https://github.com/apache/spark/pull/19526 [SPARK-22014][SQL] removed TypeCheckFailure: slide duration <= windowDuration It is possible to create sampling windows in Spark Streaming, where the duration of the window is smaller than the slide, but it throws a TypeCheckFailure in Spark SQL. I think there should be no difference (duration and slide) in a "Spark Streaming window" and a "Spark SQL window" function. @brkyvz You can merge this pull request into a Git repository by running: $ git pull https://github.com/SimonUzL/spark sampling_window Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19526.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19526 commit 295872b23af94f514263ec7512a572c88eb172db Author: Simon Schiff Date: 2017-10-17T08:52:42Z removed TypeCheckFailure: slide duration must be less than or equal to the windowDuration --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19526: [SPARK-22014][SQL] removed TypeCheckFailure: slide durat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19526 Can one of the admins verify this patch?

[GitHub] spark issue #19419: [SPARK-22188] [CORE] Adding security headers for prevent...
Github user krishna-pandey commented on the issue: https://github.com/apache/spark/pull/19419 @srowen @rxin Made changes to enable the X-Content-Type-Options and X-XSS-Protection values by default. Please review.
[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...
GitHub user viirya opened a pull request: https://github.com/apache/spark/pull/19527 [SPARK-13030][ML] Create OneHotEncoderEstimator for OneHotEncoder as Estimator ## What changes were proposed in this pull request? This patch adds a new class `OneHotEncoderEstimator` which extends `Estimator`. The `fit` method returns `OneHotEncoderModel`. Common methods between existing `OneHotEncoder` and new `OneHotEncoderEstimator`, such as transforming schema, are extracted and put into `OneHotEncoderCommon`. ### Multi-column support `OneHotEncoderEstimator` adds simpler multi-column support because it is new API and can be free from backward compatibility. ### handleInvalid Param support `OneHotEncoderEstimator` supports `handleInvalid` Param. It supports `error` and `skip`. Note that `skip` can't be used at the same time with `dropLast` as true. Because they will conflict in encoded vector. ## How was this patch tested? Added new test suite `OneHotEncoderEstimatorSuite`. You can merge this pull request into a Git repository by running: $ git pull https://github.com/viirya/spark-1 SPARK-13030 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19527.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19527 commit 8fd4677fd0e729d99d8777010e78bb5cfea3cf86 Author: Liang-Chi Hsieh Date: 2017-10-18T07:31:32Z Add OneHotEncoderEstimator and related tests. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
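The `dropLast`/`handleInvalid` interaction described in this PR can be seen in a toy pure-Python encoder (illustrative only — Spark's estimator works on DataFrame columns): with `dropLast=True` the last category maps to the all-zeros vector, while `handleInvalid="skip"` drops the offending row entirely.

```python
def one_hot(values, size, drop_last=True, handle_invalid="error"):
    """Encode integer categories in [0, size) as 0/1 vectors."""
    width = size - 1 if drop_last else size
    rows = []
    for v in values:
        if not 0 <= v < size:
            if handle_invalid == "skip":
                continue                      # drop the whole row
            raise ValueError("unseen category: %r" % v)
        vec = [0] * width
        if v < width:                         # with drop_last, the last
            vec[v] = 1                        # category stays all zeros
        rows.append(vec)
    return rows

assert one_hot([0, 2], size=3) == [[1, 0], [0, 0]]             # 2 -> all zeros
assert one_hot([0, 5], size=3, handle_invalid="skip") == [[1, 0]]
```

The conflict the PR notes falls out of this: under `dropLast=True` an all-zeros vector already means "last category", so it cannot double as a marker for skipped/invalid input.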
[GitHub] spark issue #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator for OneH...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19527 **[Test build #82879 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82879/testReport)** for PR 19527 at commit [`8fd4677`](https://github.com/apache/spark/commit/8fd4677fd0e729d99d8777010e78bb5cfea3cf86).
[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...
Github user akopich commented on the issue: https://github.com/apache/spark/pull/18924 @jkbradley, no problem. The test build seems to be aborted. What's wrong?
[GitHub] spark issue #19522: [SPARK-22249][FOLLOWUP][SQL] Check if list of value for ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19522 **[Test build #3952 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3952/testReport)** for PR 19522 at commit [`e95bc7b`](https://github.com/apache/spark/commit/e95bc7b395e027aa3d1e719d987b4f5a4461c34b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...
Github user akopich commented on the issue: https://github.com/apache/spark/pull/18924 retest this please
[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18924 **[Test build #82880 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82880/testReport)** for PR 18924 at commit [`a81dae5`](https://github.com/apache/spark/commit/a81dae574f2085ec390effd1b9b1962970f00239).
[GitHub] spark issue #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator for OneH...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19527 **[Test build #82879 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82879/testReport)** for PR 19527 at commit [`8fd4677`](https://github.com/apache/spark/commit/8fd4677fd0e729d99d8777010e78bb5cfea3cf86). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class OneHotEncoderEstimator @Since(\"2.3.0\") (@Since(\"2.3.0\") override val uid: String)` * ` class OneHotEncoderModelWriter(instance: OneHotEncoderModel) extends MLWriter `
[GitHub] spark issue #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator for OneH...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19527 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82879/ Test PASSed.
[GitHub] spark issue #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator for OneH...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19527 Merged build finished. Test PASSed.
[GitHub] spark pull request #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squa...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/17862#discussion_r145371704 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala --- @@ -42,7 +44,26 @@ import org.apache.spark.sql.functions.{col, lit} /** Params for linear SVM Classifier. */ private[classification] trait LinearSVCParams extends ClassifierParams with HasRegParam with HasMaxIter with HasFitIntercept with HasTol with HasStandardization with HasWeightCol - with HasAggregationDepth with HasThreshold { + with HasAggregationDepth with HasThreshold with HasSolver { + + /** + * Specifies the loss function. Currently "hinge" and "squared_hinge" are supported. + * "hinge" is the standard SVM loss (a.k.a. L1 loss) while "squared_hinge" is the square of + * the hinge loss (a.k.a. L2 loss). + * + * @see https://en.wikipedia.org/wiki/Hinge_loss";>Hinge loss (Wikipedia) + * + * @group param + */ + @Since("2.3.0") + final val loss: Param[String] = new Param(this, "loss", "Specifies the loss " + +"function. hinge is the standard SVM loss while squared_hinge is the square of the hinge loss.", +(s: String) => LinearSVC.supportedLoss.contains(s.toLowerCase(Locale.ROOT))) --- End diff -- The `isValid` function you can use `ParamValidators.inArray[String](supportedLosses))` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squa...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/17862#discussion_r145369694 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala --- @@ -282,8 +348,27 @@ class LinearSVC @Since("2.2.0") ( @Since("2.2.0") object LinearSVC extends DefaultParamsReadable[LinearSVC] { + /** String name for Limited-memory BFGS. */ + private[classification] val LBFGS: String = "l-bfgs".toLowerCase(Locale.ROOT) + + /** String name for Orthant-Wise Limited-memory Quasi-Newton. */ + private[classification] val OWLQN: String = "owlqn".toLowerCase(Locale.ROOT) + + /* Set of optimizers that LinearSVC supports */ + private[classification] val supportedSolvers = Array(LBFGS, OWLQN) + + /** String name for Hinge Loss. */ + private[classification] val HINGE: String = "hinge".toLowerCase(Locale.ROOT) + + /** String name for Squared Hinge Loss. */ + private[classification] val SQUARED_HINGE: String = "squared_hinge".toLowerCase(Locale.ROOT) --- End diff -- Why need `.toLowerCase(Locale.ROOT)` here ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17862: [SPARK-20602] [ML]Adding LBFGS optimizer and Squa...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/17862#discussion_r145371903 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala --- @@ -282,8 +348,27 @@ class LinearSVC @Since("2.2.0") ( @Since("2.2.0") object LinearSVC extends DefaultParamsReadable[LinearSVC] { + /** String name for Limited-memory BFGS. */ + private[classification] val LBFGS: String = "l-bfgs".toLowerCase(Locale.ROOT) + + /** String name for Orthant-Wise Limited-memory Quasi-Newton. */ + private[classification] val OWLQN: String = "owlqn".toLowerCase(Locale.ROOT) + + /* Set of optimizers that LinearSVC supports */ + private[classification] val supportedSolvers = Array(LBFGS, OWLQN) + + /** String name for Hinge Loss. */ + private[classification] val HINGE: String = "hinge".toLowerCase(Locale.ROOT) + + /** String name for Squared Hinge Loss. */ + private[classification] val SQUARED_HINGE: String = "squared_hinge".toLowerCase(Locale.ROOT) + + /* Set of loss function that LinearSVC supports */ + private[classification] val supportedLoss = Array(HINGE, SQUARED_HINGE) --- End diff -- supportedLoss ==> supportedLosses --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
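For reference, the two losses named in these review comments, written on the margin m = y·(w·x + b) (plain Python; the squared hinge is simply the square of the hinge):

```python
def hinge(margin):
    """Standard SVM loss (a.k.a. L1 loss): max(0, 1 - margin)."""
    return max(0.0, 1.0 - margin)

def squared_hinge(margin):
    """Square of the hinge loss (a.k.a. L2 loss); smooth at margin = 1."""
    return hinge(margin) ** 2

assert hinge(2.0) == 0.0           # correctly classified with margin >= 1
assert hinge(0.5) == 0.5
assert squared_hinge(0.5) == 0.25
assert squared_hinge(-1.0) == 4.0  # the squared variant penalizes errors harder
```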
[GitHub] spark issue #19488: [SPARK-22266][SQL] The same aggregate function was evalu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19488 **[Test build #82877 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82877/testReport)** for PR 19488 at commit [`506c410`](https://github.com/apache/spark/commit/506c410eefb5131c6cf50e947ff01e0bda9c28fe). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19488: [SPARK-22266][SQL] The same aggregate function was evalu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19488 Merged build finished. Test PASSed.
[GitHub] spark issue #19488: [SPARK-22266][SQL] The same aggregate function was evalu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19488 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82877/ Test PASSed.
[GitHub] spark issue #19419: [SPARK-22188] [CORE] Adding security headers for prevent...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19419 **[Test build #82876 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82876/testReport)** for PR 19419 at commit [`de54313`](https://github.com/apache/spark/commit/de54313479383be54de6bb075afe228617c244f2). * This patch passes all tests. * This patch **does not merge cleanly**. * This patch adds no public classes.
[GitHub] spark issue #19419: [SPARK-22188] [CORE] Adding security headers for prevent...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19419 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82876/ Test PASSed.
[GitHub] spark issue #19419: [SPARK-22188] [CORE] Adding security headers for prevent...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19419 Build finished. Test PASSed.
[GitHub] spark issue #19419: [SPARK-22188] [CORE] Adding security headers for prevent...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19419 **[Test build #82878 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82878/testReport)** for PR 19419 at commit [`b6d4885`](https://github.com/apache/spark/commit/b6d4885e9ad9a03a40b3c28df41d7b263b89369f). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19419: [SPARK-22188] [CORE] Adding security headers for prevent...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19419 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82878/ Test PASSed.
[GitHub] spark issue #19419: [SPARK-22188] [CORE] Adding security headers for prevent...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19419 Merged build finished. Test PASSed.
[GitHub] spark pull request #19510: [SPARK-22292][Mesos] Added spark.mem.max support ...
Github user skonto commented on a diff in the pull request: https://github.com/apache/spark/pull/19510#discussion_r145383146 --- Diff: resource-managers/mesos/src/test/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackendSuite.scala --- @@ -152,6 +152,23 @@ class MesosCoarseGrainedSchedulerBackendSuite extends SparkFunSuite assert(cpus == maxCores) } + test("mesos does not acquire more than spark.mem.max") { +setBackend(Map("spark.mem.max" -> "2g", + "spark.executor.memory" -> "1g", + "spark.executor.cores" -> "1")) + +val executorMemory = backend.executorMemory(sc) + --- End diff -- remove space --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18924 **[Test build #82880 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82880/testReport)** for PR 18924 at commit [`a81dae5`](https://github.com/apache/spark/commit/a81dae574f2085ec390effd1b9b1962970f00239). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18924 Merged build finished. Test PASSed.
[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18924 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82880/ Test PASSed.
[GitHub] spark issue #18924: [SPARK-14371] [MLLIB] OnlineLDAOptimizer should not coll...
Github user akopich commented on the issue: https://github.com/apache/spark/pull/18924 ping @jkbradley. Anyway, tests are passed now.
[GitHub] spark issue #19510: [SPARK-22292][Mesos] Added spark.mem.max support for Mes...
Github user skonto commented on the issue: https://github.com/apache/spark/pull/19510 @windkit there is an open issue here: https://issues.apache.org/jira/browse/SPARK-22133 Could you please add documentation of the new property along with the old ones.
[GitHub] spark pull request #19510: [SPARK-22292][Mesos] Added spark.mem.max support ...
Github user skonto commented on a diff in the pull request: https://github.com/apache/spark/pull/19510#discussion_r145389586 --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala --- @@ -64,6 +64,7 @@ private[spark] class MesosCoarseGrainedSchedulerBackend( private val MAX_SLAVE_FAILURES = 2 private val maxCoresOption = conf.getOption("spark.cores.max").map(_.toInt) + private val maxMemOption = conf.getOption("spark.mem.max").map(Utils.memoryStringToMb) --- End diff -- Can we defend against minimum values? For example default executor memory is 1.4MB. We could calculate the value returned by MesosSchedulerUtils.executorMemory. I don't think these values calculated in canLaunchTask ever change. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
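The defensive check suggested above amounts to: never accept an offer if launching one more executor would push total acquired memory past `spark.mem.max`. A schematic version (illustrative names; not the actual `canLaunchTask` code, and real executor memory includes overhead computed by `MesosSchedulerUtils.executorMemory`):

```python
def can_launch(executor_mem_mb, launched_mem_mb, max_mem_mb=None):
    """True if one more executor fits under the optional spark.mem.max budget."""
    if max_mem_mb is None:
        return True                      # no cap configured
    return launched_mem_mb + executor_mem_mb <= max_mem_mb

# spark.mem.max=2g, spark.executor.memory=1g (overhead ignored for the sketch):
assert can_launch(1024, 0, max_mem_mb=2048)
assert can_launch(1024, 1024, max_mem_mb=2048)
assert not can_launch(1024, 2048, max_mem_mb=2048)
```

Validating at configuration time that `spark.mem.max` is at least one executor's memory (the minimum-value concern raised in the comment) would keep the scheduler from accepting a cap it can never use.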
[GitHub] spark pull request #19374: [SPARK-22145][MESOS] fix supervise with checkpoin...
Github user skonto commented on a diff in the pull request: https://github.com/apache/spark/pull/19374#discussion_r145391350 --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala --- @@ -135,22 +135,24 @@ private[spark] class MesosClusterScheduler( private val useFetchCache = conf.getBoolean("spark.mesos.fetchCache.enable", false) private val schedulerState = engineFactory.createEngine("scheduler") private val stateLock = new Object() + // Keyed by submission id private val finishedDrivers = new mutable.ArrayBuffer[MesosClusterSubmissionState](retainedDrivers) private var frameworkId: String = null - // Holds all the launched drivers and current launch state, keyed by driver id. + // Holds all the launched drivers and current launch state, keyed by submission id. private val launchedDrivers = new mutable.HashMap[String, MesosClusterSubmissionState]() // Holds a map of driver id to expected slave id that is passed to Mesos for reconciliation. // All drivers that are loaded after failover are added here, as we need get the latest - // state of the tasks from Mesos. + // state of the tasks from Mesos. Keyed by task Id. private val pendingRecover = new mutable.HashMap[String, SlaveID]() - // Stores all the submitted drivers that hasn't been launched. + // Stores all the submitted drivers that hasn't been launched, keyed by submission id private val queuedDrivers = new ArrayBuffer[MesosDriverDescription]() - // All supervised drivers that are waiting to retry after termination. 
+ // All supervised drivers that are waiting to retry after termination, keyed by submission id private val pendingRetryDrivers = new ArrayBuffer[MesosDriverDescription]() private val queuedDriversState = engineFactory.createEngine("driverQueue") private val launchedDriversState = engineFactory.createEngine("launchedDrivers") private val pendingRetryDriversState = engineFactory.createEngine("retryList") + private final val RETRY_ID = "-retry-" --- End diff -- np will update. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
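The `RETRY_ID = "-retry-"` constant in the diff suggests an id scheme where a retried driver's task id embeds the original submission id plus a retry marker, so the submission id can be recovered from a task id. The exact encoding below (marker plus attempt counter) is an assumption for illustration, not Spark's actual format.

```python
# Hypothetical sketch of composing and parsing task ids around a retry
# marker, mirroring the RETRY_ID constant from the diff above.

RETRY_ID = "-retry-"

def retry_task_id(submission_id, attempt):
    """Build a task id for a supervised driver's Nth retry (assumed format)."""
    return f"{submission_id}{RETRY_ID}{attempt}"

def submission_id_of(task_id):
    """Recover the submission id, whether or not the task is a retry."""
    return task_id.split(RETRY_ID)[0] if RETRY_ID in task_id else task_id
```

Keying all of the maps by submission id, as the updated comments say, makes this recovery step the single place where task ids and submission ids are reconciled.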
[GitHub] spark issue #19374: [SPARK-22145][MESOS] fix supervise with checkpointing on...
Github user skonto commented on the issue: https://github.com/apache/spark/pull/19374 @srowen Could I get a merge or a review pls?
[GitHub] spark issue #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator for OneH...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19527 cc @MLnick @WeichenXu123 @jkbradley This adds a new class `OneHotEncoderEstimator` which extends `Estimator`. Please review this when you can. Thanks.
[GitHub] spark issue #19374: [SPARK-22145][MESOS] fix supervise with checkpointing on...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19374 **[Test build #82881 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82881/testReport)** for PR 19374 at commit [`0699917`](https://github.com/apache/spark/commit/06999177331b7323813eca3c06d6a9a55e054f7d).
[GitHub] spark pull request #19390: [SPARK-18935][MESOS] Fix dynamic reservations on ...
Github user skonto commented on a diff in the pull request: https://github.com/apache/spark/pull/19390#discussion_r145397272 --- Diff: resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala --- @@ -380,7 +389,8 @@ private[spark] class MesosCoarseGrainedSchedulerBackend( } else { declineOffer( driver, - offer) + offer, --- End diff -- There is a debug message if canLaunchTask fails. I will create a jira and improve logging across the module.
[GitHub] spark issue #19390: [SPARK-18935][MESOS] Fix dynamic reservations on mesos
Github user skonto commented on the issue: https://github.com/apache/spark/pull/19390 @srowen could I get a review or a merge pls?
[GitHub] spark issue #19390: [SPARK-18935][MESOS] Fix dynamic reservations on mesos
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19390 **[Test build #82882 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82882/testReport)** for PR 19390 at commit [`4c51a1f`](https://github.com/apache/spark/commit/4c51a1fdb5638adba3b9bb9506ba3478cea9385f). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19390: [SPARK-18935][MESOS] Fix dynamic reservations on mesos
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19390 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82882/
[GitHub] spark issue #19390: [SPARK-18935][MESOS] Fix dynamic reservations on mesos
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19390 Merged build finished. Test PASSed.
[GitHub] spark issue #19390: [SPARK-18935][MESOS] Fix dynamic reservations on mesos
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19390 **[Test build #82882 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82882/testReport)** for PR 19390 at commit [`4c51a1f`](https://github.com/apache/spark/commit/4c51a1fdb5638adba3b9bb9506ba3478cea9385f).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19521: [SPARK-22300][BUILD] Update ORC to 1.4.1
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19521 Oh, I confused it with what I've been watching these days. For your example, Parquet doesn't support it either. We may create an issue for both Parquet and ORC on empty schemas.
```scala
scala> val rddNoCols = sparkContext.parallelize(1 to 10).map(_ => Row.empty)
scala> val dfNoCols = spark.createDataFrame(rddNoCols, StructType(Seq.empty))
scala> dfNoCols.write.format("parquet").saveAsTable("px")
17/10/18 05:46:17 ERROR Utils: Aborting task
org.apache.parquet.schema.InvalidSchemaException: Cannot write a schema with an empty group: message spark_schema { }
```
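Since both writers reject a zero-column schema, a caller can fail fast before handing the data to the file format. The helper below is a hypothetical stdlib-Python sketch of such a guard, modeled on the `InvalidSchemaException` shown above; it is not actual Spark validation.

```python
# Pre-write guard for the empty-schema failure: raise a clear error
# before the format-specific writer aborts the task. Hypothetical helper.

def assert_writable_schema(field_names):
    """Reject a schema with no columns before attempting a write."""
    if not field_names:
        raise ValueError("Cannot write a schema with an empty group (no columns)")
    return list(field_names)
```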
[GitHub] spark issue #19521: [SPARK-22300][BUILD] Update ORC to 1.4.1
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19521 LGTM too BTW.
[GitHub] spark pull request #19488: [SPARK-22266][SQL] The same aggregate function wa...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19488
[GitHub] spark issue #19521: [SPARK-22300][BUILD] Update ORC to 1.4.1
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19521 The empty-schema path is probably related to this, IIRC (not double-checked): https://github.com/apache/spark/blob/cca945b6aa679e61864c1cabae91e6ae7703362e/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileOperator.scala#L52-L58
[GitHub] spark issue #19488: [SPARK-22266][SQL] The same aggregate function was evalu...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19488 thanks, merging to master!
[GitHub] spark issue #19524: [SPARK-22302][INFRA] Remove manual backports for subproc...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19524 @holdenk, hm, should I maybe add this to `run-tests.py` and `run-tests-jenkins.py`? I wasn't sure where I should put this.
[GitHub] spark pull request #18747: [SPARK-20822][SQL] Generate code to directly get ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18747#discussion_r145410797 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala --- @@ -23,21 +23,72 @@ import org.apache.spark.sql.catalyst.dsl.expressions._ import org.apache.spark.sql.catalyst.expressions._ import org.apache.spark.sql.catalyst.plans.QueryPlan import org.apache.spark.sql.catalyst.plans.physical.{HashPartitioning, Partitioning} -import org.apache.spark.sql.execution.LeafExecNode -import org.apache.spark.sql.execution.metric.SQLMetrics -import org.apache.spark.sql.types.UserDefinedType +import org.apache.spark.sql.execution.{ColumnarBatchScan, LeafExecNode, WholeStageCodegenExec} +import org.apache.spark.sql.execution.vectorized._ +import org.apache.spark.sql.types._ case class InMemoryTableScanExec( attributes: Seq[Attribute], predicates: Seq[Expression], @transient relation: InMemoryRelation) - extends LeafExecNode { + extends LeafExecNode with ColumnarBatchScan { override protected def innerChildren: Seq[QueryPlan[_]] = Seq(relation) ++ super.innerChildren - override lazy val metrics = Map( -"numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows")) + override def vectorTypes: Option[Seq[String]] = + Option(Seq.fill(attributes.length)(classOf[OnHeapColumnVector].getName)) + + /** + * If true, get data from ColumnVector in ColumnarBatch, which are generally faster. 
+ * If false, get data from UnsafeRow build from ColumnVector + */ + override val supportCodegen: Boolean = { +// In the initial implementation, for ease of review +// support only primitive data types and # of fields is less than wholeStageMaxNumFields +val schema = StructType.fromAttributes(relation.output) +schema.fields.find(f => f.dataType match { + case BooleanType | ByteType | ShortType | IntegerType | LongType | + FloatType | DoubleType => false + case _ => true +}).isEmpty && + !WholeStageCodegenExec.isTooManyFields(conf, relation.schema) && + children.find(p => WholeStageCodegenExec.isTooManyFields(conf, p.schema)).isEmpty --- End diff -- this check is unnecessary: this is a `LeafExecNode`, so it has no children.
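cloud-fan's point is that the `children.find(...).isEmpty` clause is vacuous on a leaf node: `find` over an empty collection always returns `None`/`None`-like, so the predicate can never fire. A stdlib-Python sketch of the same vacuity (names are illustrative, not Spark's):

```python
# On a LeafExecNode, `children` is empty, so any "no child violates the
# predicate" check passes unconditionally -- the clause adds nothing.

def no_child_has_too_many_fields(children, max_fields):
    # mirrors: children.find(p => isTooManyFields(conf, p.schema)).isEmpty
    return next((c for c in children if len(c) > max_fields), None) is None

leaf_children = []  # a leaf node has no children
result = no_child_has_too_many_fields(leaf_children, max_fields=100)
```

`result` is always `True` for a leaf, regardless of `max_fields`, which is why the check can simply be dropped.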
[GitHub] spark pull request #18747: [SPARK-20822][SQL] Generate code to directly get ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18747#discussion_r145411743 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala --- @@ -23,21 +23,72 @@ import org.apache.spark.sql.catalyst.dsl.expressions._ import org.apache.spark.sql.catalyst.expressions._ import org.apache.spark.sql.catalyst.plans.QueryPlan import org.apache.spark.sql.catalyst.plans.physical.{HashPartitioning, Partitioning} -import org.apache.spark.sql.execution.LeafExecNode -import org.apache.spark.sql.execution.metric.SQLMetrics -import org.apache.spark.sql.types.UserDefinedType +import org.apache.spark.sql.execution.{ColumnarBatchScan, LeafExecNode, WholeStageCodegenExec} +import org.apache.spark.sql.execution.vectorized._ +import org.apache.spark.sql.types._ case class InMemoryTableScanExec( attributes: Seq[Attribute], predicates: Seq[Expression], @transient relation: InMemoryRelation) - extends LeafExecNode { + extends LeafExecNode with ColumnarBatchScan { override protected def innerChildren: Seq[QueryPlan[_]] = Seq(relation) ++ super.innerChildren - override lazy val metrics = Map( -"numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows")) + override def vectorTypes: Option[Seq[String]] = + Option(Seq.fill(attributes.length)(classOf[OnHeapColumnVector].getName)) + + /** + * If true, get data from ColumnVector in ColumnarBatch, which are generally faster. 
+ * If false, get data from UnsafeRow build from ColumnVector + */ + override val supportCodegen: Boolean = { +// In the initial implementation, for ease of review +// support only primitive data types and # of fields is less than wholeStageMaxNumFields +val schema = StructType.fromAttributes(relation.output) +schema.fields.find(f => f.dataType match { + case BooleanType | ByteType | ShortType | IntegerType | LongType | + FloatType | DoubleType => false + case _ => true +}).isEmpty && + !WholeStageCodegenExec.isTooManyFields(conf, relation.schema) && + children.find(p => WholeStageCodegenExec.isTooManyFields(conf, p.schema)).isEmpty + } + + private val columnIndices = +attributes.map(a => relation.output.map(o => o.exprId).indexOf(a.exprId)).toArray + + private val relationSchema = relation.schema.toArray + + private lazy val columnarBatchSchema = new StructType(columnIndices.map(i => relationSchema(i))) + + private def createAndDecompressColumn(cachedColumnarBatch: CachedBatch): ColumnarBatch = { +val rowCount = cachedColumnarBatch.numRows +val columnVectors = OnHeapColumnVector.allocateColumns(rowCount, columnarBatchSchema) --- End diff -- Can we reuse the `OnHeapColumnVector` for the cached batches? It's a little inefficient to create one column vector for each cached batch. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19513: [SPARK-21551][Python] Increase timeout for PythonRDD.ser...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19513 retest this please
[GitHub] spark issue #19514: [SPARK-21551][Python] Increase timeout for PythonRDD.ser...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19514 retest this please
[GitHub] spark issue #19513: [SPARK-21551][Python] Increase timeout for PythonRDD.ser...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19513 **[Test build #82884 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82884/testReport)** for PR 19513 at commit [`96b465b`](https://github.com/apache/spark/commit/96b465b33fe8490d53d7a45f48d658958a6b43cb).
[GitHub] spark issue #19514: [SPARK-21551][Python] Increase timeout for PythonRDD.ser...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19514 **[Test build #82883 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82883/consoleFull)** for PR 19514 at commit [`7ec2dc7`](https://github.com/apache/spark/commit/7ec2dc70ce26ff754c0ea38f3cf9964d67bc62f7).
[GitHub] spark pull request #18664: [SPARK-21375][PYSPARK][SQL] Add Date and Timestam...
Github user wesm commented on a diff in the pull request: https://github.com/apache/spark/pull/18664#discussion_r145415011 --- Diff: python/pyspark/sql/types.py --- @@ -1619,11 +1619,39 @@ def to_arrow_type(dt): arrow_type = pa.decimal(dt.precision, dt.scale) elif type(dt) == StringType: arrow_type = pa.string() +elif type(dt) == DateType: +arrow_type = pa.date32() +elif type(dt) == TimestampType: +# Timestamps should be in UTC, JVM Arrow timestamps require a timezone to be read +arrow_type = pa.timestamp('us', tz='UTC') else: raise TypeError("Unsupported type in conversion to Arrow: " + str(dt)) return arrow_type +def _check_dataframe_localize_timestamps(df): +""" Convert timezone aware timestamps to timezone-naive in local time +""" +from pandas.types.common import is_datetime64tz_dtype --- End diff -- I am not sure this is the right public API, @jreback could you advise? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18664: [SPARK-21375][PYSPARK][SQL] Add Date and Timestam...
Github user wesm commented on a diff in the pull request: https://github.com/apache/spark/pull/18664#discussion_r145415575 --- Diff: python/pyspark/sql/types.py --- @@ -1619,11 +1619,39 @@ def to_arrow_type(dt): arrow_type = pa.decimal(dt.precision, dt.scale) elif type(dt) == StringType: arrow_type = pa.string() +elif type(dt) == DateType: +arrow_type = pa.date32() +elif type(dt) == TimestampType: +# Timestamps should be in UTC, JVM Arrow timestamps require a timezone to be read +arrow_type = pa.timestamp('us', tz='UTC') else: raise TypeError("Unsupported type in conversion to Arrow: " + str(dt)) return arrow_type +def _check_dataframe_localize_timestamps(df): +""" Convert timezone aware timestamps to timezone-naive in local time +""" +from pandas.types.common import is_datetime64tz_dtype +for column, series in df.iteritems(): +# TODO: handle nested timestamps? +if is_datetime64tz_dtype(series.dtype): +df[column] = series.dt.tz_convert('tzlocal()').dt.tz_localize(None) +return df + + +def _check_series_convert_timestamps_internal(s): +""" Convert a tz-naive timestamp in local tz to UTC normalized for Spark internal storage +""" +from pandas.types.common import is_datetime64_dtype --- End diff -- API --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18664: [SPARK-21375][PYSPARK][SQL] Add Date and Timestam...
Github user wesm commented on a diff in the pull request: https://github.com/apache/spark/pull/18664#discussion_r145415564 --- Diff: python/pyspark/sql/types.py --- @@ -1619,11 +1619,39 @@ def to_arrow_type(dt): arrow_type = pa.decimal(dt.precision, dt.scale) elif type(dt) == StringType: arrow_type = pa.string() +elif type(dt) == DateType: +arrow_type = pa.date32() +elif type(dt) == TimestampType: +# Timestamps should be in UTC, JVM Arrow timestamps require a timezone to be read +arrow_type = pa.timestamp('us', tz='UTC') else: raise TypeError("Unsupported type in conversion to Arrow: " + str(dt)) return arrow_type +def _check_dataframe_localize_timestamps(df): +""" Convert timezone aware timestamps to timezone-naive in local time +""" +from pandas.types.common import is_datetime64tz_dtype +for column, series in df.iteritems(): +# TODO: handle nested timestamps? +if is_datetime64tz_dtype(series.dtype): +df[column] = series.dt.tz_convert('tzlocal()').dt.tz_localize(None) +return df + + +def _check_series_convert_timestamps_internal(s): +""" Convert a tz-naive timestamp in local tz to UTC normalized for Spark internal storage +""" +from pandas.types.common import is_datetime64_dtype +# TODO: handle nested timestamps? +if is_datetime64_dtype(s.dtype): +# NOTE: convert to 'us' with astype here, unit is ignored in `from_pandas` see ARROW-1680 +return s.dt.tz_localize('tzlocal()').dt.tz_convert('UTC').values.astype('datetime64[us]') --- End diff -- Can you create a follow up JIRA to fix this as soon as Arrow 0.8.0 lands? I will make sure that this gets fixed --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18664: [SPARK-21375][PYSPARK][SQL] Add Date and Timestam...
Github user wesm commented on a diff in the pull request: https://github.com/apache/spark/pull/18664#discussion_r145414587 --- Diff: python/pyspark/serializers.py --- @@ -223,12 +224,13 @@ def _create_batch(series): # If a nullable integer series has been promoted to floating point with NaNs, need to cast # NOTE: this is not necessary with Arrow >= 0.7 def cast_series(s, t): -if t is None or s.dtype == t.to_pandas_dtype(): +if t is None or s.dtype == t.to_pandas_dtype() or type(t) == pa.TimestampType: --- End diff -- Here `TimestampType` was removed from the pyarrow namespace in 0.7.0, but I opened a JIRA to add it back: https://issues.apache.org/jira/browse/ARROW-1683 We created a new `pyarrow.types` API which should replace these checks with `pa.types.is_timestamp(t)`, but that requires Arrow 0.8.0. I would recommend making this transition before Spark 2.3.0. The timeline for Arrow 0.8.0 is early November.
[GitHub] spark pull request #18664: [SPARK-21375][PYSPARK][SQL] Add Date and Timestam...
Github user wesm commented on a diff in the pull request: https://github.com/apache/spark/pull/18664#discussion_r145415365 --- Diff: python/pyspark/sql/types.py --- @@ -1619,11 +1619,39 @@ def to_arrow_type(dt): arrow_type = pa.decimal(dt.precision, dt.scale) elif type(dt) == StringType: arrow_type = pa.string() +elif type(dt) == DateType: +arrow_type = pa.date32() +elif type(dt) == TimestampType: +# Timestamps should be in UTC, JVM Arrow timestamps require a timezone to be read +arrow_type = pa.timestamp('us', tz='UTC') else: raise TypeError("Unsupported type in conversion to Arrow: " + str(dt)) return arrow_type +def _check_dataframe_localize_timestamps(df): +""" Convert timezone aware timestamps to timezone-naive in local time +""" +from pandas.types.common import is_datetime64tz_dtype +for column, series in df.iteritems(): +# TODO: handle nested timestamps? +if is_datetime64tz_dtype(series.dtype): +df[column] = series.dt.tz_convert('tzlocal()').dt.tz_localize(None) --- End diff -- @jreback is this the best route to obtain tz-naive datetimes in localtime? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
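The two chained pandas calls under review have simple semantics: `tz_convert` changes the wall clock to the target zone, then `tz_localize(None)` drops the zone while keeping that wall clock. The same two steps can be sketched with stdlib `datetime`, using a fixed offset in place of `tzlocal()` so the result is deterministic; pandas applies the equivalent operations column-wise.

```python
from datetime import datetime, timezone, timedelta

# tz-aware UTC timestamp, as produced by the Arrow path discussed above
utc_aware = datetime(2017, 10, 18, 12, 0, tzinfo=timezone.utc)

local_zone = timezone(timedelta(hours=-7))    # illustrative stand-in for tzlocal()

converted = utc_aware.astimezone(local_zone)  # ~ series.dt.tz_convert(...)
naive_local = converted.replace(tzinfo=None)  # ~ .dt.tz_localize(None)

assert naive_local == datetime(2017, 10, 18, 5, 0)  # 12:00 UTC at UTC-7
```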
[GitHub] spark issue #19521: [SPARK-22300][BUILD] Update ORC to 1.4.1
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19521 Thank you for the review, @HyukjinKwon.
[GitHub] spark issue #19505: [SPARK-20396][SQL][PySpark][FOLLOW-UP] groupby().apply()...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19505 So, looks like we are good to go?
[GitHub] spark pull request #18747: [SPARK-20822][SQL] Generate code to directly get ...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/18747#discussion_r145424106 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala --- @@ -23,21 +23,72 @@ import org.apache.spark.sql.catalyst.dsl.expressions._ import org.apache.spark.sql.catalyst.expressions._ import org.apache.spark.sql.catalyst.plans.QueryPlan import org.apache.spark.sql.catalyst.plans.physical.{HashPartitioning, Partitioning} -import org.apache.spark.sql.execution.LeafExecNode -import org.apache.spark.sql.execution.metric.SQLMetrics -import org.apache.spark.sql.types.UserDefinedType +import org.apache.spark.sql.execution.{ColumnarBatchScan, LeafExecNode, WholeStageCodegenExec} +import org.apache.spark.sql.execution.vectorized._ +import org.apache.spark.sql.types._ case class InMemoryTableScanExec( attributes: Seq[Attribute], predicates: Seq[Expression], @transient relation: InMemoryRelation) - extends LeafExecNode { + extends LeafExecNode with ColumnarBatchScan { override protected def innerChildren: Seq[QueryPlan[_]] = Seq(relation) ++ super.innerChildren - override lazy val metrics = Map( -"numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows")) + override def vectorTypes: Option[Seq[String]] = + Option(Seq.fill(attributes.length)(classOf[OnHeapColumnVector].getName)) + + /** + * If true, get data from ColumnVector in ColumnarBatch, which are generally faster. 
+ * If false, get data from UnsafeRow build from ColumnVector + */ + override val supportCodegen: Boolean = { +// In the initial implementation, for ease of review +// support only primitive data types and # of fields is less than wholeStageMaxNumFields +val schema = StructType.fromAttributes(relation.output) +schema.fields.find(f => f.dataType match { + case BooleanType | ByteType | ShortType | IntegerType | LongType | + FloatType | DoubleType => false + case _ => true +}).isEmpty && + !WholeStageCodegenExec.isTooManyFields(conf, relation.schema) && + children.find(p => WholeStageCodegenExec.isTooManyFields(conf, p.schema)).isEmpty --- End diff -- Sure --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18747: [SPARK-20822][SQL] Generate code to directly get ...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/18747#discussion_r145424996 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala --- @@ -23,21 +23,72 @@ import org.apache.spark.sql.catalyst.dsl.expressions._ import org.apache.spark.sql.catalyst.expressions._ import org.apache.spark.sql.catalyst.plans.QueryPlan import org.apache.spark.sql.catalyst.plans.physical.{HashPartitioning, Partitioning} -import org.apache.spark.sql.execution.LeafExecNode -import org.apache.spark.sql.execution.metric.SQLMetrics -import org.apache.spark.sql.types.UserDefinedType +import org.apache.spark.sql.execution.{ColumnarBatchScan, LeafExecNode, WholeStageCodegenExec} +import org.apache.spark.sql.execution.vectorized._ +import org.apache.spark.sql.types._ case class InMemoryTableScanExec( attributes: Seq[Attribute], predicates: Seq[Expression], @transient relation: InMemoryRelation) - extends LeafExecNode { + extends LeafExecNode with ColumnarBatchScan { override protected def innerChildren: Seq[QueryPlan[_]] = Seq(relation) ++ super.innerChildren - override lazy val metrics = Map( -"numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows")) + override def vectorTypes: Option[Seq[String]] = + Option(Seq.fill(attributes.length)(classOf[OnHeapColumnVector].getName)) + + /** + * If true, get data from ColumnVector in ColumnarBatch, which are generally faster. 
+ * If false, get data from UnsafeRow build from ColumnVector + */ + override val supportCodegen: Boolean = { +// In the initial implementation, for ease of review +// support only primitive data types and # of fields is less than wholeStageMaxNumFields +val schema = StructType.fromAttributes(relation.output) +schema.fields.find(f => f.dataType match { + case BooleanType | ByteType | ShortType | IntegerType | LongType | + FloatType | DoubleType => false + case _ => true +}).isEmpty && + !WholeStageCodegenExec.isTooManyFields(conf, relation.schema) && + children.find(p => WholeStageCodegenExec.isTooManyFields(conf, p.schema)).isEmpty + } + + private val columnIndices = +attributes.map(a => relation.output.map(o => o.exprId).indexOf(a.exprId)).toArray + + private val relationSchema = relation.schema.toArray + + private lazy val columnarBatchSchema = new StructType(columnIndices.map(i => relationSchema(i))) + + private def createAndDecompressColumn(cachedColumnarBatch: CachedBatch): ColumnarBatch = { +val rowCount = cachedColumnarBatch.numRows +val columnVectors = OnHeapColumnVector.allocateColumns(rowCount, columnarBatchSchema) --- End diff -- I agree that we can improve efficiency if we can reuse the `OnHeapColumnVector`. I think that it is not easy to reuse the `OnHeapColumnVector` between different cached batches. IIUC there is no point to know a cached batch will not be referenced. We rely the management of the lifetime on GC by creating `OnHeapColumnVector` every time. If we reuse the `OnHeapColumnVector` (i.e. keep a reference to `OnHeapColumnVector`), GC will not dispose `OnHeapColumnVector` even if the generated code will not use the `OnHeapColumnVector`. It means that uncompressed (huge) data lives for a long time. If we know the point where a cache batch will not be referenced, we could set null to `data` in `OnHeapColumnVector`. Thus, I currently create `OnHeapColumnVector`. What do you think? 
[GitHub] spark issue #18664: [SPARK-21375][PYSPARK][SQL] Add Date and Timestamp suppo...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/18664 Sorry for joining this discussion so late. My 2 cents: Spark SQL only has timestamp without timezone, and internally we use a Java long to store the value of a timestamp. This is OK, as a timestamp is the milliseconds since the Unix epoch, which is timezone independent. However, the timezone information is needed when doing calculations like adding a day to a timestamp or converting a timestamp to a string. At that point, Spark SQL will pick SESSION_LOCAL_TIMEZONE instead of the Java local timezone (`TimeZone.getDefault()`). BTW, if Spark supported timestamp with timezone, then the timezone carried by the timestamp value should be picked. One special case is `Dataset.collect`: Spark converts internal rows to external rows, and thus converts the long value of a timestamp column to `java.sql.Timestamp`. At this point, timezone information is not needed, as `java.sql.Timestamp` is also timezone independent. Ideally users should use `SimpleDateFormat` to print a `java.sql.Timestamp` with a specific timezone, but if they just call `java.sql.Timestamp.toString`, the Java local timezone is picked, and they will get inconsistent results if the Java local timezone differs from Spark's SESSION_LOCAL_TIMEZONE. For PySpark's `DataFrame.collect`, we have a similar issue. When we convert the long value of a timestamp column to a Python timestamp, we use `datetime.datetime.fromtimestamp`, which respects the Python local timezone. So users will also get inconsistent results if the Python local timezone differs from Spark's SESSION_LOCAL_TIMEZONE. `DataFrame.toPandas` is a little different: I think we are able to set the timezone of the pandas DataFrame to Spark's SESSION_LOCAL_TIMEZONE instead of sticking with the Python local timezone. We can create a new JIRA to resolve it. Looking back at this PR, I think we should follow the existing rule: we should write the tz-naive timestamp (the long value in Spark SQL) to Arrow.
Then on the Python side we create a pandas DataFrame with this tz-naive timestamp. This behavior is exactly the same as the non-arrow-optimized `toPandas`.
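The inconsistency cloud-fan describes can be shown in miniature with stdlib `datetime`: a long epoch value is one timezone-independent instant, but it renders as different wall-clock times depending on which timezone interprets it. The two fixed offsets below are illustrative stand-ins for the Python local timezone and SESSION_LOCAL_TIMEZONE, not values Spark actually uses.

```python
from datetime import datetime, timezone, timedelta

epoch_micros = 1_508_306_400_000_000       # 2017-10-18 06:00:00 UTC, tz-independent
seconds = epoch_micros / 1_000_000

session_tz = timezone(timedelta(hours=8))  # stand-in for SESSION_LOCAL_TIMEZONE
python_tz = timezone(timedelta(hours=-5))  # stand-in for the Python local timezone

as_session = datetime.fromtimestamp(seconds, tz=session_tz)
as_local = datetime.fromtimestamp(seconds, tz=python_tz)

assert as_session == as_local              # same instant...
assert as_session.hour != as_local.hour    # ...different wall-clock rendering
```

This is why `collect` results look "wrong" whenever the interpreting timezone differs from the session's, even though the stored long never changes.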