[GitHub] spark issue #20147: [SPARK-22940][SQL] HiveExternalCatalogVersionsSuite shou...

2018-01-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20147
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20147: [SPARK-22940][SQL] HiveExternalCatalogVersionsSuite shou...

2018-01-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20147
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85718/
Test PASSed.


---




[GitHub] spark issue #20147: [SPARK-22940][SQL] HiveExternalCatalogVersionsSuite shou...

2018-01-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20147
  
**[Test build #85718 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85718/testReport)** for PR 20147 at commit [`7b58d99`](https://github.com/apache/spark/commit/7b58d994f485255ffad59ed9ffb480a64a00cea2).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20132: [SPARK-13030][ML] Follow-up cleanups for OneHotEncoderEs...

2018-01-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20132
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85719/
Test PASSed.


---




[GitHub] spark issue #20132: [SPARK-13030][ML] Follow-up cleanups for OneHotEncoderEs...

2018-01-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20132
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20132: [SPARK-13030][ML] Follow-up cleanups for OneHotEncoderEs...

2018-01-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20132
  
**[Test build #85719 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85719/testReport)** for PR 20132 at commit [`c547d0f`](https://github.com/apache/spark/commit/c547d0fa7c4e1bb61312219df4d273af4a5a9db9).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #20160: [SPARK-22757][K8S] Enable spark.jars and spark.fi...

2018-01-04 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20160


---




[GitHub] spark issue #20160: [SPARK-22757][K8S] Enable spark.jars and spark.files in ...

2018-01-04 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/20160
  
test passed, merged to master/2.3.


---




[GitHub] spark pull request #20151: [SPARK-22959][PYTHON] Configuration to select the...

2018-01-04 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/20151#discussion_r159819670
  
--- Diff: core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala ---
@@ -34,17 +34,25 @@ private[spark] class PythonWorkerFactory(pythonExec: String, envVars: Map[String
 
   import PythonWorkerFactory._
 
-  // Because forking processes from Java is expensive, we prefer to launch a single Python daemon
-  // (pyspark/daemon.py) and tell it to fork new workers for our tasks. This daemon currently
-  // only works on UNIX-based systems now because it uses signals for child management, so we can
-  // also fall back to launching workers (pyspark/worker.py) directly.
+  // Because forking processes from Java is expensive, we prefer to launch a single Python daemon,
+  // pyspark/daemon.py (by default) and tell it to fork new workers for our tasks. This daemon
+  // currently only works on UNIX-based systems now because it uses signals for child management,
+  // so we can also fall back to launching workers, pyspark/worker.py (by default) directly.
   val useDaemon = {
     val useDaemonEnabled = SparkEnv.get.conf.getBoolean("spark.python.use.daemon", true)
 
     // This flag is ignored on Windows as it's unable to fork.
     !System.getProperty("os.name").startsWith("Windows") && useDaemonEnabled
   }
 
+  // This configuration indicates the module to run the daemon to execute its Python workers.
+  val daemonModule = SparkEnv.get.conf.get("spark.python.daemon.module", "pyspark.daemon")
--- End diff --

Generally, I thought we use the name "command" for the thing to execute.
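
For orientation, the daemon-selection logic in the Scala diff above can be sketched in plain Python. This is only an illustrative sketch: a plain dict stands in for `SparkConf`, and the helper name is hypothetical; the config keys and defaults are taken from the diff itself.

```python
import platform


def resolve_python_worker(conf: dict) -> tuple:
    """Sketch of the selection logic in the diff: decide whether to use the
    forking daemon, and which module to run for it (pyspark.daemon by default)."""
    enabled = conf.get("spark.python.use.daemon", "true") == "true"
    # The daemon relies on fork()/signals for child management, so it is
    # skipped on Windows regardless of the flag.
    use_daemon = enabled and not platform.system().startswith("Windows")
    daemon_module = conf.get("spark.python.daemon.module", "pyspark.daemon")
    return use_daemon, daemon_module
```

With an empty config on a UNIX-like system this yields the daemon with module `pyspark.daemon`; setting `spark.python.use.daemon` to `"false"` falls back to direct workers.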


---




[GitHub] spark pull request #20143: [SPARK-22949][ML] Apply CrossValidator approach t...

2018-01-04 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20143


---




[GitHub] spark issue #20143: [SPARK-22949][ML] Apply CrossValidator approach to Drive...

2018-01-04 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/20143
  
LGTM
I'm going to merge this with master and branch-2.3, backporting as a fairly important bug fix.

Thanks @MrBago !


---




[GitHub] spark issue #20024: [SPARK-22825][SQL] Fix incorrect results of Casting Arra...

2018-01-04 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/20024
  
Thanks! I'll do that.


---




[GitHub] spark issue #20132: [SPARK-13030][ML] Follow-up cleanups for OneHotEncoderEs...

2018-01-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20132
  
**[Test build #85719 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85719/testReport)** for PR 20132 at commit [`c547d0f`](https://github.com/apache/spark/commit/c547d0fa7c4e1bb61312219df4d273af4a5a9db9).


---




[GitHub] spark issue #20150: [SPARK-22956][SS] Bug fix for 2 streams union failover s...

2018-01-04 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/20150
  
cc @zsxwing 


---




[GitHub] spark issue #20132: [SPARK-13030][ML] Follow-up cleanups for OneHotEncoderEs...

2018-01-04 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/20132
  
Updated!


---




[GitHub] spark pull request #20132: [SPARK-13030][ML] Follow-up cleanups for OneHotEn...

2018-01-04 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20132#discussion_r159815032
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala ---
@@ -205,60 +210,58 @@ class OneHotEncoderModel private[ml] (
 
   import OneHotEncoderModel._
 
-  // Returns the category size for a given index with `dropLast` and `handleInvalid`
+  // Returns the category size for each index with `dropLast` and `handleInvalid`
   // taken into account.
-  private def configedCategorySize(orgCategorySize: Int, idx: Int): Int = {
+  private def getConfigedCategorySizes: Array[Int] = {
     val dropLast = getDropLast
     val keepInvalid = getHandleInvalid == OneHotEncoderEstimator.KEEP_INVALID
 
     if (!dropLast && keepInvalid) {
       // When `handleInvalid` is "keep", an extra category is added as last category
       // for invalid data.
-      orgCategorySize + 1
+      categorySizes.map(_ + 1)
     } else if (dropLast && !keepInvalid) {
       // When `dropLast` is true, the last category is removed.
-      orgCategorySize - 1
+      categorySizes.map(_ - 1)
     } else {
       // When `dropLast` is true and `handleInvalid` is "keep", the extra category for invalid
       // data is removed. Thus, it is the same as the plain number of categories.
-      orgCategorySize
+      categorySizes
     }
   }
 
   private def encoder: UserDefinedFunction = {
-    val oneValue = Array(1.0)
-    val emptyValues = Array.empty[Double]
-    val emptyIndices = Array.empty[Int]
-    val dropLast = getDropLast
-    val handleInvalid = getHandleInvalid
-    val keepInvalid = handleInvalid == OneHotEncoderEstimator.KEEP_INVALID
+    val keepInvalid = getHandleInvalid == OneHotEncoderEstimator.KEEP_INVALID
+    val configedSizes = getConfigedCategorySizes
+    val localCategorySizes = categorySizes
 
     // The udf performed on input data. The first parameter is the input value. The second
-    // parameter is the index of input.
-    udf { (label: Double, idx: Int) =>
-      val plainNumCategories = categorySizes(idx)
-      val size = configedCategorySize(plainNumCategories, idx)
-
-      if (label < 0) {
-        throw new SparkException(s"Negative value: $label. Input can't be negative.")
-      } else if (label == size && dropLast && !keepInvalid) {
-        // When `dropLast` is true and `handleInvalid` is not "keep",
-        // the last category is removed.
-        Vectors.sparse(size, emptyIndices, emptyValues)
-      } else if (label >= plainNumCategories && keepInvalid) {
-        // When `handleInvalid` is "keep", encodes invalid data to last category (and removed
-        // if `dropLast` is true)
-        if (dropLast) {
-          Vectors.sparse(size, emptyIndices, emptyValues)
+    // parameter is the index in inputCols of the column being encoded.
+    udf { (label: Double, colIdx: Int) =>
+      val origCategorySize = localCategorySizes(colIdx)
+      // idx: index in vector of the single 1-valued element
+      val idx = if (label >= 0 && label < origCategorySize) {
+        label
+      } else {
+        if (keepInvalid) {
+          origCategorySize
         } else {
-          Vectors.sparse(size, Array(size - 1), oneValue)
+          if (label < 0) {
+            throw new SparkException(s"Negative value: $label. Input can't be negative. " +
--- End diff --

Great point, I'll make the change so that negative values are treated just 
like any other invalid value.  We could add null/NaN in the future too.
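
The size bookkeeping being discussed in this diff can be sketched in plain Python (function and parameter names are my own, not from the Spark codebase; the `dropLast`/`handleInvalid` semantics follow the comments in the diff):

```python
def configed_category_sizes(category_sizes, drop_last, keep_invalid):
    """Adjust each column's category count for dropLast/handleInvalid,
    mirroring the intent of getConfigedCategorySizes in the diff."""
    if not drop_last and keep_invalid:
        # "keep" reserves one extra slot at the end for invalid values.
        return [s + 1 for s in category_sizes]
    if drop_last and not keep_invalid:
        # dropLast removes the last category from the output vector.
        return [s - 1 for s in category_sizes]
    # Both flags set (or neither): the two adjustments cancel out.
    return list(category_sizes)
```

For two input columns with 3 and 2 categories, "keep" without dropLast yields vector sizes [4, 3], while dropLast alone yields [2, 1].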


---




[GitHub] spark pull request #20024: [SPARK-22825][SQL] Fix incorrect results of Casti...

2018-01-04 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20024


---




[GitHub] spark issue #20024: [SPARK-22825][SQL] Fix incorrect results of Casting Arra...

2018-01-04 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/20024
  
as a follow-up, we should do the same thing for struct and map type too.


---




[GitHub] spark issue #20024: [SPARK-22825][SQL] Fix incorrect results of Casting Arra...

2018-01-04 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/20024
  
thanks, merging to master/2.3!


---




[GitHub] spark issue #20076: [SPARK-21786][SQL] When acquiring 'compressionCodecClass...

2018-01-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20076
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20076: [SPARK-21786][SQL] When acquiring 'compressionCodecClass...

2018-01-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20076
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85716/
Test PASSed.


---




[GitHub] spark issue #20076: [SPARK-21786][SQL] When acquiring 'compressionCodecClass...

2018-01-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20076
  
**[Test build #85716 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85716/testReport)** for PR 20076 at commit [`b5cd809`](https://github.com/apache/spark/commit/b5cd809c680d089d4fa8da9bd43cd64ed1a3b138).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19080: [SPARK-21865][SQL] simplify the distribution semantic of...

2018-01-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19080
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85715/
Test PASSed.


---




[GitHub] spark issue #20147: [SPARK-22940][SQL] HiveExternalCatalogVersionsSuite shou...

2018-01-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20147
  
**[Test build #85718 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85718/testReport)** for PR 20147 at commit [`7b58d99`](https://github.com/apache/spark/commit/7b58d994f485255ffad59ed9ffb480a64a00cea2).


---




[GitHub] spark issue #19080: [SPARK-21865][SQL] simplify the distribution semantic of...

2018-01-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19080
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19080: [SPARK-21865][SQL] simplify the distribution semantic of...

2018-01-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19080
  
**[Test build #85715 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85715/testReport)** for PR 19080 at commit [`639e9b6`](https://github.com/apache/spark/commit/639e9b67ed55ea109d6a89a0319d83124bbd7d30).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `sealed trait Distribution`
   * `case class HashPartitionedDistribution(expressions: Seq[Expression]) extends Distribution`
   * `case class BroadcastDistribution(mode: BroadcastMode) extends Distribution`
   * `case class UnknownPartitioning(numPartitions: Int) extends Partitioning`
   * `case class RoundRobinPartitioning(numPartitions: Int) extends Partitioning`


---




[GitHub] spark issue #20147: [SPARK-22940][SQL] HiveExternalCatalogVersionsSuite shou...

2018-01-04 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20147
  
retest this please


---




[GitHub] spark issue #20082: [SPARK-22897][CORE]: Expose stageAttemptId in TaskContex...

2018-01-04 Thread squito
Github user squito commented on the issue:

https://github.com/apache/spark/pull/20082
  
sorry this is a little late, but lgtm too.  agree with the points above 
about leaving the old name deprecated and moving to the new name


---




[GitHub] spark issue #20147: [SPARK-22940][SQL] HiveExternalCatalogVersionsSuite shou...

2018-01-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20147
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #20147: [SPARK-22940][SQL] HiveExternalCatalogVersionsSuite shou...

2018-01-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20147
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85717/
Test FAILed.


---




[GitHub] spark issue #20147: [SPARK-22940][SQL] HiveExternalCatalogVersionsSuite shou...

2018-01-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20147
  
**[Test build #85717 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85717/testReport)** for PR 20147 at commit [`7b58d99`](https://github.com/apache/spark/commit/7b58d994f485255ffad59ed9ffb480a64a00cea2).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19992: [SPARK-22805][CORE] Use StorageLevel aliases in event lo...

2018-01-04 Thread squito
Github user squito commented on the issue:

https://github.com/apache/spark/pull/19992
  
change is fine, but from discussion on the jira I'm unclear if this is 
really worth it -- gain seems pretty small after the other fix in 2.3.


---




[GitHub] spark pull request #19992: [SPARK-22805][CORE] Use StorageLevel aliases in e...

2018-01-04 Thread squito
Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/19992#discussion_r159812291
  
--- Diff: core/src/test/scala/org/apache/spark/util/JsonProtocolSuite.scala 
---
@@ -2022,12 +1947,7 @@ private[spark] object JsonProtocolSuite extends 
Assertions {
   |  "Port": 300
   |},
   |"Block ID": "rdd_0_0",
-  |"Storage Level": {
--- End diff --

Yup, I completely agree that off heap is not respected in the json format. Can you file a bug? I think it's still relevant even after this goes in, for custom levels.


---




[GitHub] spark issue #20024: [SPARK-22825][SQL] Fix incorrect results of Casting Arra...

2018-01-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20024
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20024: [SPARK-22825][SQL] Fix incorrect results of Casting Arra...

2018-01-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20024
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85714/
Test PASSed.


---




[GitHub] spark issue #20024: [SPARK-22825][SQL] Fix incorrect results of Casting Arra...

2018-01-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20024
  
**[Test build #85714 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85714/testReport)** for PR 20024 at commit [`dc15b93`](https://github.com/apache/spark/commit/dc15b93fe76a675136dd1bf08ce25ad3c55959b3).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader

2018-01-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20153
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader

2018-01-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20153
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85712/
Test PASSed.


---




[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader

2018-01-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20153
  
**[Test build #85712 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85712/testReport)** for PR 20153 at commit [`a019886`](https://github.com/apache/spark/commit/a01988624d0cde682aa820e59c89019812c3ef73).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `class DataSourceRDDPartition[T : ClassTag](val index: Int, val readTask: ReadTask[T])`
   * `class DataSourceRDD[T: ClassTag](`


---




[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...

2018-01-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20096
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85713/
Test FAILed.


---




[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...

2018-01-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20096
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...

2018-01-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20096
  
**[Test build #85713 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85713/testReport)** for PR 20096 at commit [`f9ad94e`](https://github.com/apache/spark/commit/f9ad94e8aa753892f40ee8fc069969563347764c).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20024: [SPARK-22825][SQL] Fix incorrect results of Casting Arra...

2018-01-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20024
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85710/
Test PASSed.


---




[GitHub] spark issue #20024: [SPARK-22825][SQL] Fix incorrect results of Casting Arra...

2018-01-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20024
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20024: [SPARK-22825][SQL] Fix incorrect results of Casting Arra...

2018-01-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20024
  
**[Test build #85710 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85710/testReport)** for PR 20024 at commit [`449e2c9`](https://github.com/apache/spark/commit/449e2c9c8c5c48a14a9b2efec728b350463188bf).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20163: [SPARK-22966][PySpark] Spark SQL should handle Python UD...

2018-01-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20163
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85709/
Test PASSed.


---




[GitHub] spark issue #20163: [SPARK-22966][PySpark] Spark SQL should handle Python UD...

2018-01-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20163
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20163: [SPARK-22966][PySpark] Spark SQL should handle Python UD...

2018-01-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20163
  
**[Test build #85709 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85709/testReport)** for PR 20163 at commit [`ca026d3`](https://github.com/apache/spark/commit/ca026d31a489f1e0eb451fe85df97083659d0f67).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20163: [SPARK-22966][PySpark] Spark SQL should handle Python UD...

2018-01-04 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20163
  
Wait .. Isn't this because we failed to call `toInternal` by the return type? Please give me a few days .. I will double-check tonight.


---




[GitHub] spark issue #20147: [SPARK-22940][SQL] HiveExternalCatalogVersionsSuite shou...

2018-01-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20147
  
**[Test build #85717 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85717/testReport)** for PR 20147 at commit [`7b58d99`](https://github.com/apache/spark/commit/7b58d994f485255ffad59ed9ffb480a64a00cea2).


---




[GitHub] spark issue #19848: [SPARK-22162] Executors and the driver should use consis...

2018-01-04 Thread squito
Github user squito commented on the issue:

https://github.com/apache/spark/pull/19848
  
@steveloughran can you bring this up on dev@?  we should move this 
discussion off of this PR.

(sorry haven't had a chance to look yet, but I appreciate you doing this)


---




[GitHub] spark issue #20147: [SPARK-22940][SQL] HiveExternalCatalogVersionsSuite shou...

2018-01-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20147
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85711/
Test PASSed.


---




[GitHub] spark issue #20147: [SPARK-22940][SQL] HiveExternalCatalogVersionsSuite shou...

2018-01-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20147
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20147: [SPARK-22940][SQL] HiveExternalCatalogVersionsSuite shou...

2018-01-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20147
  
**[Test build #85711 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85711/testReport)** for PR 20147 at commit [`3dbfffd`](https://github.com/apache/spark/commit/3dbfffd9a764cb35e05b85fc5c691a7708f31a0e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #20163: [SPARK-22966][PySpark] Spark SQL should handle Py...

2018-01-04 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20163#discussion_r159805825
  
--- Diff: python/pyspark/sql/udf.py ---
@@ -26,6 +26,28 @@
 
 
 def _wrap_function(sc, func, returnType):
+def coerce_to_str(v):
+import datetime
+if type(v) == datetime.date or type(v) == datetime.datetime:
+return str(v)
--- End diff --

I think it's weird that we have a cast here alone ... Can't we register a 
custom Pyrolite unpickler? Does it make things more complicated?
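For context, a minimal standalone sketch of the coercion the diff introduces (Python stdlib only; `coerce_to_str` mirrors the diff, the surrounding calls are illustrative):

```python
import datetime

def coerce_to_str(v):
    # Mirrors the helper in the diff: only date/datetime values are
    # eagerly stringified; all other values pass through unchanged.
    # type() equality (not isinstance) is what the diff uses, which also
    # avoids matching subclasses.
    if type(v) == datetime.date or type(v) == datetime.datetime:
        return str(v)
    return v

# str() produces a different format per type; this per-type distinction is
# exactly what is lost once Pyrolite unpickles both kinds of value into
# java.util.Calendar objects on the JVM side.
print(coerce_to_str(datetime.date(2018, 1, 4)))              # 2018-01-04
print(coerce_to_str(datetime.datetime(2018, 1, 4, 12, 30)))  # 2018-01-04 12:30:00
print(coerce_to_str(42))                                     # 42
```

A custom Pyrolite unpickler, as suggested above, would move this distinction to the JVM side instead of special-casing it in Python.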


---




[GitHub] spark issue #20098: [SPARK-22914][DEPLOY] Register history.ui.port

2018-01-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20098
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85705/
Test PASSed.


---




[GitHub] spark issue #20098: [SPARK-22914][DEPLOY] Register history.ui.port

2018-01-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20098
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20098: [SPARK-22914][DEPLOY] Register history.ui.port

2018-01-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20098
  
**[Test build #85705 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85705/testReport)**
 for PR 20098 at commit 
[`03e0e27`](https://github.com/apache/spark/commit/03e0e271c58e1a4fe7381b60b06e87e5fe2c3a77).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #20163: [SPARK-22966][PySpark] Spark SQL should handle Py...

2018-01-04 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20163#discussion_r159804507
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala
 ---
@@ -120,10 +121,18 @@ object EvaluatePython {
 case (c: java.math.BigDecimal, dt: DecimalType) => Decimal(c, 
dt.precision, dt.scale)
 
 case (c: Int, DateType) => c
--- End diff --

Of course, separate change obviously.


---




[GitHub] spark issue #20162: [SPARK-22965] [PySpark] [SQL] Add deterministic paramete...

2018-01-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20162
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85708/
Test FAILed.


---




[GitHub] spark issue #20162: [SPARK-22965] [PySpark] [SQL] Add deterministic paramete...

2018-01-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20162
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #20162: [SPARK-22965] [PySpark] [SQL] Add deterministic paramete...

2018-01-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20162
  
**[Test build #85708 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85708/testReport)**
 for PR 20162 at commit 
[`7e4f3c0`](https://github.com/apache/spark/commit/7e4f3c0f0c4082bf166cec0e72f9f86f5d23aac8).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #20163: [SPARK-22966][PySpark] Spark SQL should handle Py...

2018-01-04 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20163#discussion_r159804356
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala
 ---
@@ -120,10 +121,18 @@ object EvaluatePython {
 case (c: java.math.BigDecimal, dt: DecimalType) => Decimal(c, 
dt.precision, dt.scale)
 
 case (c: Int, DateType) => c
--- End diff --

BTW, as a side note, I think we can make a converter for the type and 
then reuse it.


---




[GitHub] spark issue #20163: [SPARK-22966][PySpark] Spark SQL should handle Python UD...

2018-01-04 Thread icexelloss
Github user icexelloss commented on the issue:

https://github.com/apache/spark/pull/20163
  
I think scalar and grouped map UDFs expect a pandas Series of datetime64[ns] 
(the native pandas timestamp type) rather than a pandas Series of datetime.date 
or datetime.datetime objects. I don't think it's necessary for pandas UDFs to 
work with a Series of datetime.date or datetime.datetime objects, since the 
standard timestamp type in pandas is datetime64[ns].



---




[GitHub] spark issue #20076: [SPARK-21786][SQL] When acquiring 'compressionCodecClass...

2018-01-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20076
  
**[Test build #85716 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85716/testReport)**
 for PR 20076 at commit 
[`b5cd809`](https://github.com/apache/spark/commit/b5cd809c680d089d4fa8da9bd43cd64ed1a3b138).


---




[GitHub] spark issue #20160: [SPARK-22757][K8S] Enable spark.jars and spark.files in ...

2018-01-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20160
  
Merged build finished. Test PASSed.


---




[GitHub] spark pull request #20076: [SPARK-21786][SQL] When acquiring 'compressionCod...

2018-01-04 Thread fjh100456
Github user fjh100456 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20076#discussion_r159802320
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetOptions.scala
 ---
@@ -27,7 +28,7 @@ import org.apache.spark.sql.internal.SQLConf
 /**
  * Options for the Parquet data source.
  */
-private[parquet] class ParquetOptions(
--- End diff --

Yes, it should be revived. Thanks.


---




[GitHub] spark issue #20160: [SPARK-22757][K8S] Enable spark.jars and spark.files in ...

2018-01-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20160
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85703/
Test PASSed.


---




[GitHub] spark issue #20160: [SPARK-22757][K8S] Enable spark.jars and spark.files in ...

2018-01-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20160
  
**[Test build #85703 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85703/testReport)**
 for PR 20160 at commit 
[`1075cfc`](https://github.com/apache/spark/commit/1075cfc9ce18f3717bbbeb9950eafdda219f7233).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #20147: [SPARK-22940][SQL] HiveExternalCatalogVersionsSui...

2018-01-04 Thread bersprockets
Github user bersprockets commented on a diff in the pull request:

https://github.com/apache/spark/pull/20147#discussion_r159802199
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala
 ---
@@ -85,6 +93,34 @@ class HiveExternalCatalogVersionsSuite extends 
SparkSubmitTestUtils {
 new File(tmpDataDir, name).getCanonicalPath
   }
 
+  private def getFileFromUrl(urlString: String, targetDir: String, 
filename: String): Unit = {
+val conf = new SparkConf
+// if the caller passes the name of an existing file, we want 
doFetchFile to write over it with
+// the contents from the specified url.
+conf.set("spark.files.overwrite", "true")
+val securityManager = new SecurityManager(conf)
+val hadoopConf = new Configuration
+
+val outDir = new File(targetDir)
+if (!outDir.exists()) {
+  outDir.mkdirs()
+}
+
+// propagate exceptions up to the caller of getFileFromUrl
+Utils.doFetchFile(urlString, outDir, filename, conf, securityManager, 
hadoopConf)
+  }
+
+  private def getStringFromUrl(urlString: String, encoding: String = 
"UTF-8"): String = {
--- End diff --

Oops. I should have caught that. Will fix.
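The helper under discussion delegates to Spark's `Utils.doFetchFile` with `spark.files.overwrite=true`. As a rough, Spark-free sketch of the same contract (ensure the target dir exists, fetch the URL, overwrite any existing file), written in Python for brevity; names here are hypothetical, not the PR's code:

```python
import os
import shutil
import urllib.request

def get_file_from_url(url_string, target_dir, filename):
    """Fetch url_string into target_dir/filename, creating target_dir if
    needed and overwriting any existing file (the Scala helper gets the
    overwrite behavior from spark.files.overwrite=true)."""
    os.makedirs(target_dir, exist_ok=True)
    dest = os.path.join(target_dir, filename)
    # Exceptions propagate to the caller, as in the Scala helper.
    with urllib.request.urlopen(url_string) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)
    return dest
```

`urlopen` also accepts `file://` URLs, so the sketch can be exercised offline.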


---




[GitHub] spark issue #19080: [SPARK-21865][SQL] simplify the distribution semantic of...

2018-01-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19080
  
**[Test build #85715 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85715/testReport)**
 for PR 19080 at commit 
[`639e9b6`](https://github.com/apache/spark/commit/639e9b67ed55ea109d6a89a0319d83124bbd7d30).


---




[GitHub] spark issue #19080: [SPARK-21865][SQL] simplify the distribution semantic of...

2018-01-04 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19080
  
retest this please


---




[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...

2018-01-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20096
  
**[Test build #85713 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85713/testReport)**
 for PR 20096 at commit 
[`f9ad94e`](https://github.com/apache/spark/commit/f9ad94e8aa753892f40ee8fc069969563347764c).


---




[GitHub] spark issue #20024: [SPARK-22825][SQL] Fix incorrect results of Casting Arra...

2018-01-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20024
  
**[Test build #85714 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85714/testReport)**
 for PR 20024 at commit 
[`dc15b93`](https://github.com/apache/spark/commit/dc15b93fe76a675136dd1bf08ce25ad3c55959b3).


---




[GitHub] spark pull request #20024: [SPARK-22825][SQL] Fix incorrect results of Casti...

2018-01-04 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/20024#discussion_r159800597
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
 ---
@@ -608,6 +665,17 @@ case class Cast(child: Expression, dataType: DataType, 
timeZoneId: Option[String
 val tz = ctx.addReferenceObj("timeZone", timeZone)
 (c, evPrim, evNull) => s"""$evPrim = UTF8String.fromString(
   
org.apache.spark.sql.catalyst.util.DateTimeUtils.timestampToString($c, $tz));"""
+  case ArrayType(et, _) =>
+(c, evPrim, evNull) => {
+  val bufferTerm = ctx.freshName("bufferTerm")
--- End diff --

ok


---




[GitHub] spark pull request #20152: [SPARK-22957] ApproxQuantile breaks if the number...

2018-01-04 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20152


---




[GitHub] spark pull request #20024: [SPARK-22825][SQL] Fix incorrect results of Casti...

2018-01-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20024#discussion_r159799552
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
 ---
@@ -608,6 +665,17 @@ case class Cast(child: Expression, dataType: DataType, 
timeZoneId: Option[String
 val tz = ctx.addReferenceObj("timeZone", timeZone)
 (c, evPrim, evNull) => s"""$evPrim = UTF8String.fromString(
   
org.apache.spark.sql.catalyst.util.DateTimeUtils.timestampToString($c, $tz));"""
+  case ArrayType(et, _) =>
+(c, evPrim, evNull) => {
+  val bufferTerm = ctx.freshName("bufferTerm")
--- End diff --

super nit: In codegen we usually don't add a `term` postfix, just call it 
`buffer`, `array`, etc.


---




[GitHub] spark issue #20152: [SPARK-22957] ApproxQuantile breaks if the number of row...

2018-01-04 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/20152
  
LGTM, merging to master/2.3!


---




[GitHub] spark pull request #20147: [SPARK-22940][SQL] HiveExternalCatalogVersionsSui...

2018-01-04 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20147#discussion_r159798210
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogVersionsSuite.scala
 ---
@@ -85,6 +93,34 @@ class HiveExternalCatalogVersionsSuite extends 
SparkSubmitTestUtils {
 new File(tmpDataDir, name).getCanonicalPath
   }
 
+  private def getFileFromUrl(urlString: String, targetDir: String, 
filename: String): Unit = {
+val conf = new SparkConf
+// if the caller passes the name of an existing file, we want 
doFetchFile to write over it with
+// the contents from the specified url.
+conf.set("spark.files.overwrite", "true")
+val securityManager = new SecurityManager(conf)
+val hadoopConf = new Configuration
+
+val outDir = new File(targetDir)
+if (!outDir.exists()) {
+  outDir.mkdirs()
+}
+
+// propagate exceptions up to the caller of getFileFromUrl
+Utils.doFetchFile(urlString, outDir, filename, conf, securityManager, 
hadoopConf)
+  }
+
+  private def getStringFromUrl(urlString: String, encoding: String = 
"UTF-8"): String = {
--- End diff --

Seems `encoding: String = "UTF-8"` is not used.
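The fix the comment asks for is to actually thread the `encoding` argument through to the decode step. A minimal Python sketch of the intended shape (not the PR's Scala code):

```python
import urllib.request

def get_string_from_url(url_string, encoding="UTF-8"):
    """Fetch url_string and decode the body with the given encoding,
    i.e. the parameter is used rather than accepted and ignored."""
    with urllib.request.urlopen(url_string) as resp:
        return resp.read().decode(encoding)
```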


---




[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader

2018-01-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20153
  
**[Test build #85712 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85712/testReport)**
 for PR 20153 at commit 
[`a019886`](https://github.com/apache/spark/commit/a01988624d0cde682aa820e59c89019812c3ef73).


---




[GitHub] spark issue #20147: [SPARK-22940][SQL] HiveExternalCatalogVersionsSuite shou...

2018-01-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20147
  
**[Test build #85711 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85711/testReport)**
 for PR 20147 at commit 
[`3dbfffd`](https://github.com/apache/spark/commit/3dbfffd9a764cb35e05b85fc5c691a7708f31a0e).


---




[GitHub] spark pull request #20096: [SPARK-22908] Add kafka source and sink for conti...

2018-01-04 Thread jose-torres
Github user jose-torres commented on a diff in the pull request:

https://github.com/apache/spark/pull/20096#discussion_r159797164
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala
 ---
@@ -418,11 +418,16 @@ abstract class StreamExecution(
* Blocks the current thread until processing for data from the given 
`source` has reached at
* least the given `Offset`. This method is intended for use primarily 
when writing tests.
*/
-  private[sql] def awaitOffset(source: Source, newOffset: Offset): Unit = {
+  private[sql] def awaitOffset(sourceIndex: Int, newOffset: Offset): Unit 
= {
 assertAwaitThread()
 def notDone = {
   val localCommittedOffsets = committedOffsets
-  !localCommittedOffsets.contains(source) || 
localCommittedOffsets(source) != newOffset
+  if (sources.length <= sourceIndex) {
+false
--- End diff --

`sources` is a var which might not be populated yet. (This race condition 
showed up in AddKafkaData in my tests.)


---




[GitHub] spark issue #20024: [SPARK-22825][SQL] Fix incorrect results of Casting Arra...

2018-01-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20024
  
**[Test build #85710 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85710/testReport)**
 for PR 20024 at commit 
[`449e2c9`](https://github.com/apache/spark/commit/449e2c9c8c5c48a14a9b2efec728b350463188bf).


---




[GitHub] spark pull request #20096: [SPARK-22908] Add kafka source and sink for conti...

2018-01-04 Thread jose-torres
Github user jose-torres commented on a diff in the pull request:

https://github.com/apache/spark/pull/20096#discussion_r159796525
  
--- Diff: 
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSinkV2.scala
 ---
@@ -0,0 +1,142 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.kafka010
+
+import org.apache.kafka.clients.producer.{Callback, ProducerRecord, 
RecordMetadata}
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.{Row, SparkSession}
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.{Attribute, Cast, 
Literal, UnsafeProjection}
+import 
org.apache.spark.sql.kafka010.KafkaSourceProvider.{kafkaParamsForProducer, 
TOPIC_OPTION_KEY}
+import org.apache.spark.sql.sources.v2.streaming.writer.ContinuousWriter
+import org.apache.spark.sql.sources.v2.writer._
+import org.apache.spark.sql.streaming.OutputMode
+import org.apache.spark.sql.types.{BinaryType, StringType, StructType}
+
+class ContinuousKafkaWriter(
+topic: Option[String], producerParams: Map[String, String], schema: 
StructType)
+  extends ContinuousWriter with SupportsWriteInternalRow {
+
+  override def createInternalRowWriterFactory(): KafkaWriterFactory =
+KafkaWriterFactory(topic, producerParams, schema)
+
+  override def commit(epochId: Long, messages: 
Array[WriterCommitMessage]): Unit = {}
+  override def abort(messages: Array[WriterCommitMessage]): Unit = {}
+}
+
+case class KafkaWriterFactory(
+topic: Option[String], producerParams: Map[String, String], schema: 
StructType)
+  extends DataWriterFactory[InternalRow] {
+
+  override def createDataWriter(partitionId: Int, attemptNumber: Int): 
DataWriter[InternalRow] = {
+new KafkaDataWriter(topic, producerParams, schema.toAttributes)
+  }
+}
+
+case class KafkaWriterCommitMessage() extends WriterCommitMessage {}
+
+class KafkaDataWriter(
+topic: Option[String], producerParams: Map[String, String], 
inputSchema: Seq[Attribute])
+  extends DataWriter[InternalRow] {
+  import scala.collection.JavaConverters._
+
+  @volatile private var failedWrite: Exception = _
+  private val projection = createProjection
+  private lazy val producer = CachedKafkaProducer.getOrCreate(
+new java.util.HashMap[String, Object](producerParams.asJava))
+
+  private val callback = new Callback() {
+override def onCompletion(recordMetadata: RecordMetadata, e: 
Exception): Unit = {
+  if (failedWrite == null && e != null) {
+failedWrite = e
+  }
+}
+  }
+
+  def write(row: InternalRow): Unit = {
+if (failedWrite != null) return
+
+val projectedRow = projection(row)
+val topic = projectedRow.getUTF8String(0)
+val key = projectedRow.getBinary(1)
+val value = projectedRow.getBinary(2)
+
+if (topic == null) {
+  throw new NullPointerException(s"null topic present in the data. Use 
the " +
+s"${KafkaSourceProvider.TOPIC_OPTION_KEY} option for setting a 
default topic.")
+}
+val record = new ProducerRecord[Array[Byte], 
Array[Byte]](topic.toString, key, value)
+producer.send(record, callback)
+  }
+
+  def commit(): WriterCommitMessage = KafkaWriterCommitMessage()
+  def abort(): Unit = {}
+
+  def close(): Unit = {
+checkForErrors()
+if (producer != null) {
+  producer.flush()
+  checkForErrors()
+}
--- End diff --

I think CachedKafkaProducer handles closing automatically, but since these 
are long-lived I can do it explicitly too.
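The writer in the diff above uses a common pattern for asynchronous sends: the completion callback records the first failure, `write` skips work once a failure is recorded, and `close` re-checks after flushing. A Kafka-free sketch of that pattern (assumptions: `send_async` invokes its callback with an error or `None`; names are hypothetical):

```python
class AsyncErrorCapturingWriter:
    def __init__(self, send_async, flush=lambda: None):
        self._send_async = send_async   # calls send_async(record, callback)
        self._flush = flush
        self._failed_write = None       # first asynchronous failure, if any

    def _callback(self, err):
        # Keep only the first error, like the Kafka Callback in the diff.
        if self._failed_write is None and err is not None:
            self._failed_write = err

    def write(self, record):
        if self._failed_write is not None:
            return  # mirrors `if (failedWrite != null) return` in the diff
        self._send_async(record, self._callback)

    def close(self):
        # Check, flush, check again: flushing can surface late failures.
        self._check_for_errors()
        self._flush()
        self._check_for_errors()

    def _check_for_errors(self):
        if self._failed_write is not None:
            raise RuntimeError("async write failed") from self._failed_write
```

The point of the double check in `close` is that a send still in flight when `close` begins can only report its failure once the flush has completed.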


---




[GitHub] spark issue #20024: [SPARK-22825][SQL] Fix incorrect results of Casting Arra...

2018-01-04 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/20024
  
retest this please


---




[GitHub] spark issue #20163: [SPARK-22966][PySpark] Spark SQL should handle Python UD...

2018-01-04 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/20163
  
LGTM. cc @ueshin @icexelloss: is this behavior consistent with pandas UDFs?


---




[GitHub] spark pull request #20096: [SPARK-22908] Add kafka source and sink for conti...

2018-01-04 Thread jose-torres
Github user jose-torres commented on a diff in the pull request:

https://github.com/apache/spark/pull/20096#discussion_r159796099
  
--- Diff: 
external/kafka-0-10-sql/src/main/scala/ContinuousKafkaReader.scala ---
@@ -0,0 +1,246 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.kafka010
+
+import java.io._
+import java.nio.charset.StandardCharsets
+import java.util.concurrent.atomic.AtomicBoolean
+
+import org.apache.commons.io.IOUtils
+import org.apache.kafka.clients.consumer.ConsumerRecord
+import org.apache.kafka.common.TopicPartition
+import org.apache.kafka.common.errors.WakeupException
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.{DataFrame, Row, SparkSession, SQLContext}
+import org.apache.spark.sql.catalyst.expressions.{Attribute, Cast, 
Literal, UnsafeProjection, UnsafeRow}
+import org.apache.spark.sql.catalyst.expressions.codegen.{BufferHolder, 
UnsafeRowWriter}
+import org.apache.spark.sql.catalyst.util.DateTimeUtils
+import org.apache.spark.sql.execution.streaming.{HDFSMetadataLog, 
SerializedOffset}
+import 
org.apache.spark.sql.kafka010.KafkaSource.{INSTRUCTION_FOR_FAIL_ON_DATA_LOSS_FALSE,
 INSTRUCTION_FOR_FAIL_ON_DATA_LOSS_TRUE, VERSION}
+import org.apache.spark.sql.sources.v2.reader._
+import 
org.apache.spark.sql.sources.v2.streaming.reader.{ContinuousDataReader, 
ContinuousReader, Offset, PartitionOffset}
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.unsafe.types.UTF8String
+
+class ContinuousKafkaReader(
+kafkaReader: KafkaOffsetReader,
+executorKafkaParams: java.util.Map[String, Object],
+sourceOptions: Map[String, String],
+metadataPath: String,
+initialOffsets: KafkaOffsetRangeLimit,
+failOnDataLoss: Boolean)
+  extends ContinuousReader with SupportsScanUnsafeRow with Logging {
+
+  override def mergeOffsets(offsets: Array[PartitionOffset]): Offset = {
+val mergedMap = offsets.map {
+  case KafkaSourcePartitionOffset(p, o) => Map(p -> o)
+}.reduce(_ ++ _)
+KafkaSourceOffset(mergedMap)
+  }
+
+  private lazy val session = SparkSession.getActiveSession.get
+  private lazy val sc = session.sparkContext
+
+  private lazy val pollTimeoutMs = sourceOptions.getOrElse(
+"kafkaConsumer.pollTimeoutMs",
+sc.conf.getTimeAsMs("spark.network.timeout", "120s").toString
+  ).toLong
+
+  private val maxOffsetsPerTrigger =
+sourceOptions.get("maxOffsetsPerTrigger").map(_.toLong)
+
+  /**
+   * Lazily initialize `initialPartitionOffsets` to make sure that 
`KafkaConsumer.poll` is only
+   * called in StreamExecutionThread. Otherwise, interrupting a thread 
while running
+   * `KafkaConsumer.poll` may hang forever (KAFKA-1894).
+   */
+  private lazy val initialPartitionOffsets = {
+  val offsets = initialOffsets match {
+case EarliestOffsetRangeLimit => 
KafkaSourceOffset(kafkaReader.fetchEarliestOffsets())
+case LatestOffsetRangeLimit => 
KafkaSourceOffset(kafkaReader.fetchLatestOffsets())
+case SpecificOffsetRangeLimit(p) => fetchAndVerify(p)
+  }
+  logInfo(s"Initial offsets: $offsets")
+  offsets.partitionToOffsets
+  }
+
+  private def fetchAndVerify(specificOffsets: Map[TopicPartition, Long]) = 
{
+val result = kafkaReader.fetchSpecificOffsets(specificOffsets)
+specificOffsets.foreach {
+  case (tp, off) if off != KafkaOffsetRangeLimit.LATEST &&
+off != KafkaOffsetRangeLimit.EARLIEST =>
+if (result(tp) != off) {
+  reportDataLoss(
+s"startingOffsets for $tp was $off but consumer reset to 
${result(tp)}")
+}
+  case _ =>
+  // no real way to check that beginning or end is reasonable
+}
+KafkaSourceOffset(result)
+  }
+
+  // Initialized when creating read tasks. If this diverges from the 
partitions 

[GitHub] spark pull request #20163: [SPARK-22966][PySpark] Spark SQL should handle Py...

2018-01-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20163#discussion_r159796028
  
--- Diff: python/pyspark/sql/udf.py ---
@@ -26,6 +26,28 @@
 
 
 def _wrap_function(sc, func, returnType):
+def coerce_to_str(v):
+import datetime
+if type(v) == datetime.date or type(v) == datetime.datetime:
+return str(v)
+else:
+return v
+
+# Pyrolite will unpickle both Python datetime.date and 
datetime.datetime objects
+# into java.util.Calendar objects, so the type information on the 
Python side is lost.
+# This is problematic when Spark SQL needs to cast such objects into 
Spark SQL string type,
+# because the format of the string should be different, depending on 
the type of the input
+# object. So for those two specific types we eagerly convert them to 
string here, where the
+# Python type information is still intact.
+if returnType == StringType():
--- End diff --

This is to handle the case when a Python UDF returns `date` or `datetime` but 
marks the return type as string?
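That is the reading the diff suggests: the coercion is applied only when the declared return type is string, so that `str()` runs while the Python type information is still intact. A stripped-down sketch of that conditional wrapping (here `return_type_is_string` stands in for the `returnType == StringType()` check; names are illustrative):

```python
import datetime

def wrap_for_string_return(func, return_type_is_string):
    """Apply the date/datetime-to-str coercion only when the UDF's
    declared return type is string; other return types are untouched."""
    if not return_type_is_string:
        return func

    def wrapped(*args, **kwargs):
        v = func(*args, **kwargs)
        if type(v) == datetime.date or type(v) == datetime.datetime:
            return str(v)
        return v

    return wrapped
```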


---




[GitHub] spark issue #20024: [SPARK-22825][SQL] Fix incorrect results of Casting Arra...

2018-01-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20024
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85701/
Test FAILed.


---




[GitHub] spark issue #20024: [SPARK-22825][SQL] Fix incorrect results of Casting Arra...

2018-01-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20024
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #20024: [SPARK-22825][SQL] Fix incorrect results of Casting Arra...

2018-01-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20024
  
**[Test build #85701 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85701/testReport)** for PR 20024 at commit [`449e2c9`](https://github.com/apache/spark/commit/449e2c9c8c5c48a14a9b2efec728b350463188bf).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #20096: [SPARK-22908] Add kafka source and sink for conti...

2018-01-04 Thread jose-torres
Github user jose-torres commented on a diff in the pull request:

https://github.com/apache/spark/pull/20096#discussion_r159795887
  
--- Diff: external/kafka-0-10-sql/src/main/scala/ContinuousKafkaReader.scala ---
@@ -0,0 +1,246 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.kafka010
+
+import java.io._
+import java.nio.charset.StandardCharsets
+import java.util.concurrent.atomic.AtomicBoolean
+
+import org.apache.commons.io.IOUtils
+import org.apache.kafka.clients.consumer.ConsumerRecord
+import org.apache.kafka.common.TopicPartition
+import org.apache.kafka.common.errors.WakeupException
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.{DataFrame, Row, SparkSession, SQLContext}
+import org.apache.spark.sql.catalyst.expressions.{Attribute, Cast, Literal, UnsafeProjection, UnsafeRow}
+import org.apache.spark.sql.catalyst.expressions.codegen.{BufferHolder, UnsafeRowWriter}
+import org.apache.spark.sql.catalyst.util.DateTimeUtils
+import org.apache.spark.sql.execution.streaming.{HDFSMetadataLog, SerializedOffset}
+import org.apache.spark.sql.kafka010.KafkaSource.{INSTRUCTION_FOR_FAIL_ON_DATA_LOSS_FALSE, INSTRUCTION_FOR_FAIL_ON_DATA_LOSS_TRUE, VERSION}
+import org.apache.spark.sql.sources.v2.reader._
+import org.apache.spark.sql.sources.v2.streaming.reader.{ContinuousDataReader, ContinuousReader, Offset, PartitionOffset}
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.unsafe.types.UTF8String
+
+class ContinuousKafkaReader(
+    kafkaReader: KafkaOffsetReader,
+    executorKafkaParams: java.util.Map[String, Object],
+    sourceOptions: Map[String, String],
+    metadataPath: String,
+    initialOffsets: KafkaOffsetRangeLimit,
+    failOnDataLoss: Boolean)
+  extends ContinuousReader with SupportsScanUnsafeRow with Logging {
+
+  override def mergeOffsets(offsets: Array[PartitionOffset]): Offset = {
+    val mergedMap = offsets.map {
+      case KafkaSourcePartitionOffset(p, o) => Map(p -> o)
+    }.reduce(_ ++ _)
+    KafkaSourceOffset(mergedMap)
+  }
+
+  private lazy val session = SparkSession.getActiveSession.get
+  private lazy val sc = session.sparkContext
+
+  private lazy val pollTimeoutMs = sourceOptions.getOrElse(
+    "kafkaConsumer.pollTimeoutMs",
+    sc.conf.getTimeAsMs("spark.network.timeout", "120s").toString
+  ).toLong
+
+  private val maxOffsetsPerTrigger =
+    sourceOptions.get("maxOffsetsPerTrigger").map(_.toLong)
+
+  /**
+   * Lazily initialize `initialPartitionOffsets` to make sure that `KafkaConsumer.poll` is only
+   * called in StreamExecutionThread. Otherwise, interrupting a thread while running
+   * `KafkaConsumer.poll` may hang forever (KAFKA-1894).
+   */
+  private lazy val initialPartitionOffsets = {
+    val offsets = initialOffsets match {
+      case EarliestOffsetRangeLimit => KafkaSourceOffset(kafkaReader.fetchEarliestOffsets())
+      case LatestOffsetRangeLimit => KafkaSourceOffset(kafkaReader.fetchLatestOffsets())
+      case SpecificOffsetRangeLimit(p) => fetchAndVerify(p)
+    }
+    logInfo(s"Initial offsets: $offsets")
+    offsets.partitionToOffsets
+  }
+
+  private def fetchAndVerify(specificOffsets: Map[TopicPartition, Long]) = {
+    val result = kafkaReader.fetchSpecificOffsets(specificOffsets)
+    specificOffsets.foreach {
+      case (tp, off) if off != KafkaOffsetRangeLimit.LATEST &&
+          off != KafkaOffsetRangeLimit.EARLIEST =>
+        if (result(tp) != off) {
+          reportDataLoss(
+            s"startingOffsets for $tp was $off but consumer reset to ${result(tp)}")
+        }
+      case _ =>
+        // no real way to check that beginning or end is reasonable
+    }
+    KafkaSourceOffset(result)
+  }
+
+  // Initialized when creating read tasks. If this diverges from the partitions
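
The `mergeOffsets` override in the quoted diff folds per-partition offsets into a single map. In plain Python the same fold looks like this (a sketch with illustrative names; the real code wraps the result in a `KafkaSourceOffset`):

```python
def merge_offsets(partition_offsets):
    # Fold (topic_partition, offset) pairs into one dict, mirroring the
    # Map(p -> o) ... reduce(_ ++ _) step in the Scala diff above.
    merged = {}
    for tp, off in partition_offsets:
        merged[tp] = off  # later entries win, as with Scala's ++
    return merged

print(merge_offsets([("topic-0", 5), ("topic-1", 7)]))  # {'topic-0': 5, 'topic-1': 7}
```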

[GitHub] spark pull request #20096: [SPARK-22908] Add kafka source and sink for conti...

2018-01-04 Thread jose-torres
Github user jose-torres commented on a diff in the pull request:

https://github.com/apache/spark/pull/20096#discussion_r159795747
  
--- Diff: external/kafka-0-10-sql/src/main/scala/ContinuousKafkaReader.scala ---
@@ -0,0 +1,246 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.kafka010
+
+import java.io._
+import java.nio.charset.StandardCharsets
+import java.util.concurrent.atomic.AtomicBoolean
+
+import org.apache.commons.io.IOUtils
+import org.apache.kafka.clients.consumer.ConsumerRecord
+import org.apache.kafka.common.TopicPartition
+import org.apache.kafka.common.errors.WakeupException
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.{DataFrame, Row, SparkSession, SQLContext}
+import org.apache.spark.sql.catalyst.expressions.{Attribute, Cast, Literal, UnsafeProjection, UnsafeRow}
+import org.apache.spark.sql.catalyst.expressions.codegen.{BufferHolder, UnsafeRowWriter}
+import org.apache.spark.sql.catalyst.util.DateTimeUtils
+import org.apache.spark.sql.execution.streaming.{HDFSMetadataLog, SerializedOffset}
+import org.apache.spark.sql.kafka010.KafkaSource.{INSTRUCTION_FOR_FAIL_ON_DATA_LOSS_FALSE, INSTRUCTION_FOR_FAIL_ON_DATA_LOSS_TRUE, VERSION}
+import org.apache.spark.sql.sources.v2.reader._
+import org.apache.spark.sql.sources.v2.streaming.reader.{ContinuousDataReader, ContinuousReader, Offset, PartitionOffset}
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.unsafe.types.UTF8String
+
+class ContinuousKafkaReader(
+    kafkaReader: KafkaOffsetReader,
+    executorKafkaParams: java.util.Map[String, Object],
+    sourceOptions: Map[String, String],
+    metadataPath: String,
+    initialOffsets: KafkaOffsetRangeLimit,
+    failOnDataLoss: Boolean)
+  extends ContinuousReader with SupportsScanUnsafeRow with Logging {
+
+  override def mergeOffsets(offsets: Array[PartitionOffset]): Offset = {
+    val mergedMap = offsets.map {
+      case KafkaSourcePartitionOffset(p, o) => Map(p -> o)
+    }.reduce(_ ++ _)
+    KafkaSourceOffset(mergedMap)
+  }
+
+  private lazy val session = SparkSession.getActiveSession.get
+  private lazy val sc = session.sparkContext
+
+  private lazy val pollTimeoutMs = sourceOptions.getOrElse(
+    "kafkaConsumer.pollTimeoutMs",
+    sc.conf.getTimeAsMs("spark.network.timeout", "120s").toString
+  ).toLong
+
+  private val maxOffsetsPerTrigger =
+    sourceOptions.get("maxOffsetsPerTrigger").map(_.toLong)
+
+  /**
+   * Lazily initialize `initialPartitionOffsets` to make sure that `KafkaConsumer.poll` is only
+   * called in StreamExecutionThread. Otherwise, interrupting a thread while running
+   * `KafkaConsumer.poll` may hang forever (KAFKA-1894).
+   */
+  private lazy val initialPartitionOffsets = {
+    val offsets = initialOffsets match {
+      case EarliestOffsetRangeLimit => KafkaSourceOffset(kafkaReader.fetchEarliestOffsets())
+      case LatestOffsetRangeLimit => KafkaSourceOffset(kafkaReader.fetchLatestOffsets())
+      case SpecificOffsetRangeLimit(p) => fetchAndVerify(p)
+    }
+    logInfo(s"Initial offsets: $offsets")
+    offsets.partitionToOffsets
+  }
+
+  private def fetchAndVerify(specificOffsets: Map[TopicPartition, Long]) = {
+    val result = kafkaReader.fetchSpecificOffsets(specificOffsets)
+    specificOffsets.foreach {
+      case (tp, off) if off != KafkaOffsetRangeLimit.LATEST &&
+          off != KafkaOffsetRangeLimit.EARLIEST =>
+        if (result(tp) != off) {
+          reportDataLoss(
+            s"startingOffsets for $tp was $off but consumer reset to ${result(tp)}")
+        }
+      case _ =>
+        // no real way to check that beginning or end is reasonable
+    }
+    KafkaSourceOffset(result)
+  }
+
+  // Initialized when creating read tasks. If this diverges from the partitions
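
The `fetchAndVerify` helper in the quoted diff only validates partitions whose requested offset is concrete; the earliest/latest sentinels cannot be checked. A hedged Python sketch of that validation step (the sentinel values and function names are assumptions for illustration, not Spark API):

```python
EARLIEST, LATEST = -2, -1  # assumed sentinels, standing in for KafkaOffsetRangeLimit

def verify_offsets(requested, resolved):
    # Report every partition where the consumer reset a concretely
    # requested starting offset, mirroring fetchAndVerify above.
    problems = []
    for tp, off in requested.items():
        if off in (EARLIEST, LATEST):
            continue  # no real way to check that beginning or end is reasonable
        if resolved[tp] != off:
            problems.append(
                f"startingOffsets for {tp} was {off} but consumer reset to {resolved[tp]}")
    return problems

print(verify_offsets({"t-0": 5, "t-1": LATEST}, {"t-0": 3, "t-1": 9}))
# ['startingOffsets for t-0 was 5 but consumer reset to 3']
```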

[GitHub] spark pull request #20096: [SPARK-22908] Add kafka source and sink for conti...

2018-01-04 Thread jose-torres
Github user jose-torres commented on a diff in the pull request:

https://github.com/apache/spark/pull/20096#discussion_r159795735
  
--- Diff: external/kafka-0-10-sql/src/main/scala/ContinuousKafkaReader.scala ---
@@ -0,0 +1,246 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.kafka010
+
+import java.io._
+import java.nio.charset.StandardCharsets
+import java.util.concurrent.atomic.AtomicBoolean
+
+import org.apache.commons.io.IOUtils
+import org.apache.kafka.clients.consumer.ConsumerRecord
+import org.apache.kafka.common.TopicPartition
+import org.apache.kafka.common.errors.WakeupException
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.{DataFrame, Row, SparkSession, SQLContext}
+import org.apache.spark.sql.catalyst.expressions.{Attribute, Cast, Literal, UnsafeProjection, UnsafeRow}
+import org.apache.spark.sql.catalyst.expressions.codegen.{BufferHolder, UnsafeRowWriter}
+import org.apache.spark.sql.catalyst.util.DateTimeUtils
+import org.apache.spark.sql.execution.streaming.{HDFSMetadataLog, SerializedOffset}
+import org.apache.spark.sql.kafka010.KafkaSource.{INSTRUCTION_FOR_FAIL_ON_DATA_LOSS_FALSE, INSTRUCTION_FOR_FAIL_ON_DATA_LOSS_TRUE, VERSION}
+import org.apache.spark.sql.sources.v2.reader._
+import org.apache.spark.sql.sources.v2.streaming.reader.{ContinuousDataReader, ContinuousReader, Offset, PartitionOffset}
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.unsafe.types.UTF8String
+
+class ContinuousKafkaReader(
+    kafkaReader: KafkaOffsetReader,
+    executorKafkaParams: java.util.Map[String, Object],
+    sourceOptions: Map[String, String],
+    metadataPath: String,
+    initialOffsets: KafkaOffsetRangeLimit,
+    failOnDataLoss: Boolean)
+  extends ContinuousReader with SupportsScanUnsafeRow with Logging {
+
+  override def mergeOffsets(offsets: Array[PartitionOffset]): Offset = {
+    val mergedMap = offsets.map {
+      case KafkaSourcePartitionOffset(p, o) => Map(p -> o)
+    }.reduce(_ ++ _)
+    KafkaSourceOffset(mergedMap)
+  }
+
+  private lazy val session = SparkSession.getActiveSession.get
+  private lazy val sc = session.sparkContext
+
+  private lazy val pollTimeoutMs = sourceOptions.getOrElse(
+    "kafkaConsumer.pollTimeoutMs",
+    sc.conf.getTimeAsMs("spark.network.timeout", "120s").toString
+  ).toLong
+
+  private val maxOffsetsPerTrigger =
+    sourceOptions.get("maxOffsetsPerTrigger").map(_.toLong)
+
+  /**
+   * Lazily initialize `initialPartitionOffsets` to make sure that `KafkaConsumer.poll` is only
+   * called in StreamExecutionThread. Otherwise, interrupting a thread while running
+   * `KafkaConsumer.poll` may hang forever (KAFKA-1894).
+   */
+  private lazy val initialPartitionOffsets = {
--- End diff --

oops, no


---




[GitHub] spark issue #20163: [SPARK-22966][PySpark] Spark SQL should handle Python UD...

2018-01-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20163
  
**[Test build #85709 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85709/testReport)** for PR 20163 at commit [`ca026d3`](https://github.com/apache/spark/commit/ca026d31a489f1e0eb451fe85df97083659d0f67).


---




[GitHub] spark pull request #20163: [SPARK-22966][PySpark] Spark SQL should handle Py...

2018-01-04 Thread rednaxelafx
GitHub user rednaxelafx opened a pull request:

https://github.com/apache/spark/pull/20163

[SPARK-22966][PySpark] Spark SQL should handle Python UDFs that return a datetime.date or datetime.datetime

## What changes were proposed in this pull request?

Perform appropriate conversions for results coming from Python UDFs that return `datetime.date` or `datetime.datetime`.

Before this PR, Pyrolite would unpickle both `datetime.date` and `datetime.datetime` into a `java.util.Calendar`, which Spark SQL doesn't understand, which then leads to incorrect results. An example of such an incorrect result is:

```
>>> py_date = udf(datetime.date)
>>> spark.range(1).select(py_date(lit(2017), lit(10), lit(30))).show(truncate=False)
+--+
|date(2017, 10, 30)|
+--+
|java.util.GregorianCalendar[time=?,areFieldsSet=false,areAllFieldsSet=false,lenient=true,zone=sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-2880,dstSavings=360,useDaylight=true,transitions=185,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-2880,dstSavings=360,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=720,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=720,endTimeMode=0]],firstDayOfWeek=1,minimalDaysInFirstWeek=1,ERA=?,YEAR=2017,MONTH=9,WEEK_OF_YEAR=?,WEEK_OF_MONTH=?,DAY_OF_MONTH=30,DAY_OF_YEAR=?,DAY_OF_WEEK=?,DAY_OF_WEEK_IN_MONTH=?,AM_PM=0,HOUR=0,HOUR_OF_DAY=0,MINUTE=0,SECOND=0,MILLISECOND=?,ZONE_OFFSET=?,DST_OFFSET=?]|
+--+
```
After this PR, the same query above would give correct results:
```
>>> spark.range(1).select(py_date(lit(2017), lit(10), lit(30))).show(truncate=False)
+--+
|date(2017, 10, 30)|
+--+
|2017-10-30|
+--+
```

An explicit non-goal of this PR is to change the behavior of timezone awareness or 

[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...

2018-01-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20096
  
Build finished. Test FAILed.


---




[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...

2018-01-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20096
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85699/
Test FAILed.


---




[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...

2018-01-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20096
  
**[Test build #85699 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85699/testReport)** for PR 20096 at commit [`dae3a09`](https://github.com/apache/spark/commit/dae3a09e48565439ed8c22dda857a7747e518b3b).
 * This patch **fails Spark unit tests**.
 * This patch **does not merge cleanly**.
 * This patch adds no public classes.


---




[GitHub] spark issue #20161: [SPARK-21525][streaming] Check error code from superviso...

2018-01-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20161
  
Merged build finished. Test PASSed.


---



