[GitHub] spark issue #17239: Using map function in spark for huge operation
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17239 Can one of the admins verify this patch?
[GitHub] spark pull request #17239: Using map function in spark for huge operation
GitHub user nischay21 opened a pull request: https://github.com/apache/spark/pull/17239

Using map function in spark for huge operation

We need to calculate a distance matrix like Jaccard over a huge collection of Datasets in Spark. We are facing a couple of issues. Kindly give us directions.

Issue 1.

```java
import info.debatty.java.stringsimilarity.Jaccard;

// sample Dataset creation
List<Row> data = Arrays.asList(
    RowFactory.create("Hi I heard about Spark", "Hi I Know about Spark"),
    RowFactory.create("I wish Java could use case classes", "I wish C# could use case classes"),
    RowFactory.create("Logistic,regression,models,are,neat", "Logistic,regression,models,are,neat"));
StructType schema = new StructType(new StructField[] {
    new StructField("label", DataTypes.StringType, false, Metadata.empty()),
    new StructField("sentence", DataTypes.StringType, false, Metadata.empty())});
Dataset<Row> sentenceDataFrame = spark.createDataFrame(data, schema);

// Distance matrix object creation
Jaccard jaccard = new Jaccard();

// Work on each member element of the dataset and apply the distance measure.
Dataset<String> sentenceDataFrame1 = sentenceDataFrame.map(
    (MapFunction<Row, String>) row -> "Name: " + jaccard.similarity(row.getString(0), row.getString(1)),
    Encoders.STRING());
sentenceDataFrame1.show();
```

There are no compile-time errors, but at run time we get an exception like `org.apache.spark.SparkException: Task not serializable`.

Issue 2.

Moreover, we need to find which pair has the highest score, for which we need to declare some variables, and we need to perform other calculations as well; we are facing a lot of difficulty. Even if we declare a simple variable like a counter within the map block, we are not able to capture the incremented value. If we declare it outside the map block, we get lots of compile-time errors.

```java
int counter = 0;
Dataset<String> sentenceDataFrame1 = sentenceDataFrame.map(
    (MapFunction<Row, String>) row -> {
      System.out.println("Name: " + row.getString(1));
      // int counter = 0;
      counter++;
      System.out.println("Counter: " + counter);
      return counter + "";
    }, Encoders.STRING());
```

Please give us directions.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/apache/spark branch-2.1
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17239.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17239

commit 1cafc76ea1e9eef40b24060d1cd7c4aaf9f16a49
Author: Shixiong Zhu
Date: 2016-12-09T01:58:44Z
[SPARK-18774][CORE][SQL] Ignore non-existing files when ignoreCorruptFiles is enabled (branch 2.1)
## What changes were proposed in this pull request?
Backport #16203 to branch 2.1.
## How was this patch tested?
Jenkins
Author: Shixiong Zhu
Closes #16216 from zsxwing/SPARK-18774-2.1.

commit ef5646b4c6792a96e85d1dd4bb3103ba8306949b
Author: Shivaram Venkataraman
Date: 2016-12-09T02:26:54Z
[SPARKR][PYSPARK] Fix R source package name to match Spark version. Remove pip tar.gz from distribution
## What changes were proposed in this pull request?
Fixes the name of the R source package so that the `cp` in release-build.sh works correctly. Issue discussed in https://github.com/apache/spark/pull/16014#issuecomment-265867125
Author: Shivaram Venkataraman
Closes #16221 from shivaram/fix-sparkr-release-build-name.
(cherry picked from commit 4ac8b20bf2f962d9b8b6b209468896758d49efe3)
Signed-off-by: Shivaram Venkataraman

commit 4ceed95b43d0cd9665004865095a40926efcc289
Author: wm...@hotmail.com
Date: 2016-12-09T06:08:19Z
[SPARK-18349][SPARKR]
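Not part of the PR thread itself, but for readers hitting the same two problems, a minimal Scala sketch of the usual remedies: construct the non-serializable helper inside the task rather than capturing it from the driver closure, and replace the captured counter with an accumulator (or better, an aggregation). The `info.debatty` Jaccard class comes from the question above; the rest is illustrative.

```scala
import org.apache.spark.sql.SparkSession

object JaccardSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("jaccard-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq(
      ("Hi I heard about Spark", "Hi I Know about Spark"),
      ("I wish Java could use case classes", "I wish C# could use case classes")
    ).toDF("label", "sentence")

    // Issue 1: build the non-serializable Jaccard instance *inside* the task
    // (one per partition), so it never has to be shipped from the driver.
    val scored = df.mapPartitions { rows =>
      val jaccard = new info.debatty.java.stringsimilarity.Jaccard()
      rows.map(r => (r.getString(0), jaccard.similarity(r.getString(0), r.getString(1))))
    }.toDF("label", "score")

    // Issue 2: a local counter mutated inside map() is copied to each executor
    // and never flows back; an accumulator gives a driver-visible count.
    val counter = spark.sparkContext.longAccumulator("pairs")
    scored.foreach(_ => counter.add(1))

    // And "which pair has the highest score" is an aggregation, not mutable state.
    scored.orderBy($"score".desc).show(1)
  }
}
```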
[GitHub] spark issue #17177: [SPARK-19834][SQL] csv escape of quote escape
Github user jbax commented on the issue: https://github.com/apache/spark/pull/17177 Doesn't seem correct to me. All test cases are using broken CSV and trigger the parser handling of unescaped quotes, where it tries to rescue the data and produce something sensible. See my test case here: https://github.com/uniVocity/univocity-parsers/issues/143
[GitHub] spark issue #17237: [SPARK-19852][PYSPARK][ML] Update Python API setHandleIn...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17237

**[Test build #74305 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74305/testReport)** for PR 17237 at commit [`f1d9bcb`](https://github.com/apache/spark/commit/f1d9bcb3d615444a3c326f0b6b0f7999edecdf4f).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #17237: [SPARK-19852][PYSPARK][ML] Update Python API setHandleIn...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17237 Merged build finished. Test FAILed.
[GitHub] spark issue #17237: [SPARK-19852][PYSPARK][ML] Update Python API setHandleIn...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17237 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74305/ Test FAILed.
[GitHub] spark issue #14033: [SPARK-16286][SQL] Implement stack table generating func...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/14033 just found out that we didn't implement a type coercion rule for `stack`...
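As background, a small illustration of `stack` and of where a type coercion rule matters; this is a sketch against a hypothetical session, not code from the PR:

```scala
// stack(n, v1, v2, ...) spreads the values over n rows.
spark.sql("SELECT stack(2, 1, 2, 3, 4) AS (a, b)").show()
// +---+---+
// |  a|  b|
// +---+---+
// |  1|  2|
// |  3|  4|
// +---+---+

// Both literals below land in the same output column, one integral and one
// decimal; without a type coercion rule for stack, the analyzer rejects
// this instead of widening the column type (assumption based on the
// comment above).
spark.sql("SELECT stack(2, 1, 2.5)").show()
```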
[GitHub] spark issue #17226: [SPARK-19882][SQL] Pivot with null as a distinct pivot v...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/17226 For me, LGTM
[GitHub] spark pull request #17188: [SPARK-19751][SQL] Throw an exception if bean cla...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/17188#discussion_r105343821

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala ---
```
@@ -69,7 +69,8 @@ object JavaTypeInference {
  * @param typeToken Java type
  * @return (SQL data type, nullable)
  */
- private def inferDataType(typeToken: TypeToken[_]): (DataType, Boolean) = {
+ private def inferDataType(typeToken: TypeToken[_], seenTypeSet: Set[Class[_]] = Set.empty)
```
--- End diff --

You mean this case?
```
scala> :paste
case class classA(i: Int, cls: classB)
case class classB(cls: classA)

scala> Seq(classA(0, null)).toDS()
java.lang.StackOverflowError
  at scala.reflect.internal.Symbols$Symbol.info(Symbols.scala:1494)
  at scala.reflect.runtime.JavaMirrors$JavaMirror$$anon$1.scala$reflect$runtime$SynchronizedSymbols$SynchronizedSymbol$$super$info(JavaMirrors.scala:66)
  at scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$$anonfun$info$1.apply(SynchronizedSymbols.scala:127)
  at scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$$anonfun$info$1.apply(SynchronizedSymbols.scala:127)
  at scala.reflect.runtime.Gil$class.gilSynchronized(Gil.scala:19)
  at scala.reflect.runtime.JavaUniverse.gilSynchronized(JavaUniverse.scala:16)
  at scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$class.gilSynchronizedIfNotThreadsafe(SynchronizedSymbols.scala:123)
  at scala.reflect.runtime.JavaMirrors$JavaMirror$$anon$1.gilSynchronizedIfNotThreadsafe(JavaMirrors.scala:66)
  at scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$class.info(SynchronizedSymbols.scala:127)
  at scala.reflect.runtime.JavaMirrors$JavaMirror$$anon$1.info(JavaMirrors.scala:66)
  at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48)
  at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:45)
  at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:45)
  at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:45)
  at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:45)
```
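Not from the PR, but to sketch what the new `seenTypeSet` parameter enables, a simplified cycle guard under the assumption that recursion into bean properties passes the grown set downward:

```scala
import com.google.common.reflect.TypeToken

// Refuse to recurse into a bean class already seen on the current path;
// this is what turns the StackOverflowError above into a clear exception.
def checkNotCyclic(typeToken: TypeToken[_], seenTypeSet: Set[Class[_]]): Set[Class[_]] = {
  val cls = typeToken.getRawType
  if (seenTypeSet.contains(cls)) {
    throw new UnsupportedOperationException(
      s"Cannot infer a SQL type for $cls because it has a cyclic reference")
  }
  seenTypeSet + cls // the caller recurses into the bean's properties with this set
}
```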
[GitHub] spark issue #17238: getRackForHost returns None if host is unknown by driver
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17238 Can one of the admins verify this patch?
[GitHub] spark pull request #17238: getRackForHost returns None if host is unknown by...
GitHub user morenn520 opened a pull request: https://github.com/apache/spark/pull/17238

getRackForHost returns None if host is unknown by driver

## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-19894

## How was this patch tested?
Tested on our production cluster (YARN) in yarn-cluster mode; applying this patch resolves our rack-locality problems.

Problem: In our production cluster (YARN), one node (the "missing-rack-info" node) is missing rack information for other nodes. A Spark Streaming program (data source: Kafka, mode: yarn-cluster) runs its driver on this missing-rack-info node. Hosts unknown to the driver node, as well as the Kafka broker hosts unknown to YARN, are all resolved to "/default-rack" by the YARN scheduler, so every task ends up being assigned to those nodes as RACK_LOCAL.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/morenn520/spark master
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17238.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17238

commit 6630e747efa52bff5ca48bb0a5610357c7754c10
Author: Chen Yuechen
Date: 2017-03-10T07:24:48Z
getRackForHost returns None if host is unknown by driver
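A minimal sketch of the idea as described above (simplified; the resolver call is a hypothetical stand-in for the YARN rack lookup, not the patch's actual code): treat YARN's "/default-rack" fallback as "rack unknown" so unknown hosts cannot make every task look RACK_LOCAL.

```scala
// Hypothetical rack lookup wrapper: returns None instead of the
// "/default-rack" placeholder that YARN uses for unknown hosts.
def getRackForHost(host: String): Option[String] = {
  val rack = resolveRack(host) // assumption: delegates to YARN's RackResolver
  Option(rack).filter(_ != "/default-rack")
}

// Stub standing in for the real resolver, so the sketch is self-contained.
def resolveRack(host: String): String =
  if (host.startsWith("known-")) "/rack-1" else "/default-rack"
```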
[GitHub] spark issue #17237: [SPARK-19852][PYSPARK][ML] Update Python API setHandleIn...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17237 **[Test build #74305 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74305/testReport)** for PR 17237 at commit [`f1d9bcb`](https://github.com/apache/spark/commit/f1d9bcb3d615444a3c326f0b6b0f7999edecdf4f).
[GitHub] spark pull request #17175: [SPARK-19468][SQL] Rewrite physical Project opera...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/17175#discussion_r105342800

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala ---
```
@@ -78,9 +78,42 @@ case class ProjectExec(projectList: Seq[NamedExpression], child: SparkPlan)
     }
   }

-  override def outputOrdering: Seq[SortOrder] = child.outputOrdering
```
--- End diff --

My original thought was to support the cases where `requiresChildDistribution` refers to nested fields created and aliased in `Project`, as in the example I showed. The example you give is another kind of case. To support it, if we want to, we would at least need to improve expression canonicalization and partitioning/distribution matching.
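For readers following along, a small sketch of the kind of case being discussed (my construction, not an example from the PR): a Project that merely renames the partitioning column hides the child's partitioning from the parent, so an avoidable exchange is planned.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("alias-partitioning").getOrCreate()
import spark.implicits._

val df = Seq((1, "x"), (2, "y")).toDF("a", "v")

// Child is hash-partitioned by `a`; the Project only renames `a` to `b`.
val renamed = df.repartition($"a").select($"a".as("b"), $"v")

// The aggregate requires distribution on `b`. Unless partitioning is
// matched through the alias, the plan contains an extra Exchange here
// (assumption: Spark's behavior around the time of this PR).
renamed.groupBy($"b").count().explain()
```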
[GitHub] spark pull request #17224: [SPARK-19882][SQL] Pivot with null as the dictinc...
Github user HyukjinKwon closed the pull request at: https://github.com/apache/spark/pull/17224
[GitHub] spark issue #17224: [SPARK-19882][SQL] Pivot with null as the dictinct pivot...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/17224 I am closing this per https://github.com/apache/spark/pull/17226#issuecomment-285597434
[GitHub] spark issue #17226: [SPARK-19882][SQL] Pivot with null as a distinct pivot v...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/17226

I see. So, `count` in "**Spark 2.1.0** (and presumably 2.0.x/master)" was unexpectedly introduced by the optimization in SPARK-13749, and this behaviour change between 1.6 and master (whether it is right or not) has only been found now, together. So, if I understood correctly, several problems are mixed together here:

1. counting `null` problem (for both optimized and non-optimized)
2. NPE (for optimized)
3. `0` vs `null` for missing values in `count`

and this PR tries to fix both 1. and 2., whereas mine tries to fix 1., 2., and a few specific cases of 3. (by avoiding the optimization as a workaround). Okay, I am fine with closing mine (honestly, the initial version of my PR was almost identical to this one). Thanks for elaborating and bearing with me.
[GitHub] spark pull request #17225: [CORE] Support ZStandard Compression
Github user dongjinleekr commented on a diff in the pull request: https://github.com/apache/spark/pull/17225#discussion_r105342714

--- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
```
@@ -49,13 +50,14 @@ private[spark] object CompressionCodec {
   private[spark] def supportsConcatenationOfSerializedStreams(codec: CompressionCodec): Boolean = {
     (codec.isInstanceOf[SnappyCompressionCodec] || codec.isInstanceOf[LZFCompressionCodec]
-      || codec.isInstanceOf[LZ4CompressionCodec])
+      || codec.isInstanceOf[LZ4CompressionCodec] || codec.isInstanceOf[ZStandardCompressionCodec])
   }

   private val shortCompressionCodecNames = Map(
     "lz4" -> classOf[LZ4CompressionCodec].getName,
     "lzf" -> classOf[LZFCompressionCodec].getName,
-    "snappy" -> classOf[SnappyCompressionCodec].getName)
+    "snappy" -> classOf[SnappyCompressionCodec].getName,
+    "zstd" -> classOf[SnappyCompressionCodec].getName)
```
--- End diff --

OMG, it is a typo.
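For clarity, the mapping the typo presumably should have been (in the diff above, the `zstd` entry points at the Snappy codec class):

```scala
private val shortCompressionCodecNames = Map(
  "lz4" -> classOf[LZ4CompressionCodec].getName,
  "lzf" -> classOf[LZFCompressionCodec].getName,
  "snappy" -> classOf[SnappyCompressionCodec].getName,
  "zstd" -> classOf[ZStandardCompressionCodec].getName) // fixed: was SnappyCompressionCodec
```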
[GitHub] spark pull request #17188: [SPARK-19751][SQL] Throw an exception if bean cla...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/17188#discussion_r105342563

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala ---
```
@@ -69,7 +69,8 @@ object JavaTypeInference {
  * @param typeToken Java type
  * @return (SQL data type, nullable)
  */
- private def inferDataType(typeToken: TypeToken[_]): (DataType, Boolean) = {
+ private def inferDataType(typeToken: TypeToken[_], seenTypeSet: Set[Class[_]] = Set.empty)
```
--- End diff --

does scala case class have the same problem?
[GitHub] spark issue #17188: [SPARK-19751][SQL] Throw an exception if bean class has ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17188 **[Test build #74304 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74304/testReport)** for PR 17188 at commit [`5e519b1`](https://github.com/apache/spark/commit/5e519b180d7905e9c4e60f708db11ce6ccf86866).
[GitHub] spark issue #17188: [SPARK-19751][SQL] Throw an exception if bean class has ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17188 **[Test build #74303 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74303/testReport)** for PR 17188 at commit [`9abe861`](https://github.com/apache/spark/commit/9abe861f8153431d30c06c42243adbf346a85772).
[GitHub] spark issue #17237: [SPARK-19852][PYSPARK][ML] Update Python API setHandleIn...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17237 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74301/ Test FAILed.
[GitHub] spark issue #17237: [SPARK-19852][PYSPARK][ML] Update Python API setHandleIn...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17237

**[Test build #74301 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74301/testReport)** for PR 17237 at commit [`d94dc68`](https://github.com/apache/spark/commit/d94dc68a8c1a5c082cf3de6c7e4d429bfd24d817).
* This patch **fails Python style tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class StringIndexer(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadable, JavaMLWritable):`
[GitHub] spark issue #17237: [SPARK-19852][PYSPARK][ML] Update Python API setHandleIn...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17237 Merged build finished. Test FAILed.
[GitHub] spark issue #17237: [SPARK-19852][PYSPARK][ML] Update Python API setHandleIn...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17237 **[Test build #74301 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74301/testReport)** for PR 17237 at commit [`d94dc68`](https://github.com/apache/spark/commit/d94dc68a8c1a5c082cf3de6c7e4d429bfd24d817).
[GitHub] spark issue #17188: [SPARK-19751][SQL] Throw an exception if bean class has ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17188 **[Test build #74302 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74302/testReport)** for PR 17188 at commit [`b7eba26`](https://github.com/apache/spark/commit/b7eba26f5131ab0c99142aca3ab46e868226026e).
[GitHub] spark pull request #17231: [SPARK-19891][SS] Await Batch Lock notified on st...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/17231
[GitHub] spark pull request #17237: [SPARK-19852][PYSPARK][ML] Update Python API setH...
GitHub user VinceShieh opened a pull request: https://github.com/apache/spark/pull/17237

[SPARK-19852][PYSPARK][ML] Update Python API setHandleInvalid for StringIndexer

## What changes were proposed in this pull request?
This PR maintains API parity with the changes made in SPARK-17498, supporting the new 'keep' option in StringIndexer for handling unseen labels from pyspark.

## How was this patch tested?
Existing tests; the new behavior is covered by new doctests.

Signed-off-by: VinceShieh

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/VinceShieh/spark spark-19852
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17237.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17237

commit d94dc68a8c1a5c082cf3de6c7e4d429bfd24d817
Author: VinceShieh
Date: 2017-03-10T06:50:41Z
[SPARK-19852][PYSPARK][ML] Update Python API for StringIndexer setHandleInvalid
This PR reflects the changes made in SPARK-17498 on pyspark to support a new option 'keep' in StringIndexer to handle unseen labels.
Signed-off-by: VinceShieh
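For context, a short example of what the mirrored option does on the Scala side, whose semantics this Python change follows per SPARK-17498 (`trainDF`/`testDF` are hypothetical DataFrames with a string `category` column):

```scala
import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .setHandleInvalid("keep") // other values: "error" (default) and "skip"

// With "keep", labels unseen during fit() are mapped to one extra index
// at transform time instead of failing the job or dropping the row.
val model = indexer.fit(trainDF)
val indexed = model.transform(testDF)
```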
[GitHub] spark issue #17231: [SPARK-19891][SS] Await Batch Lock notified on stream ex...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/17231 LGTM. Merging to master and 2.1.
[GitHub] spark pull request #17172: [SPARK-19008][SQL] Improve performance of Dataset...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/17172
[GitHub] spark issue #17172: [SPARK-19008][SQL] Improve performance of Dataset.map by...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/17172 cool! merging to master!
[GitHub] spark issue #17236: [SPARK-xxxx][SQL] Cannot run intersect/except with map t...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17236 **[Test build #74300 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74300/testReport)** for PR 17236 at commit [`d670a11`](https://github.com/apache/spark/commit/d670a11b8391c4e44e699372c567980fa9d29fed).
[GitHub] spark issue #17236: [SPARK-xxxx][SQL] Cannot run intersect/except with map t...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/17236 cc @yhuai @sameeragarwal @hvanhovell
[GitHub] spark pull request #17236: [SPARK-xxxx][SQL] Cannot run intersect/except wit...
GitHub user cloud-fan opened a pull request: https://github.com/apache/spark/pull/17236

[SPARK-xxxx][SQL] Cannot run intersect/except with map type

## What changes were proposed in this pull request?
In Spark SQL, the map type can't be used in equality tests/comparisons, and `Intersect`/`Except` need equality tests for all columns, so we should not allow the map type in `Intersect`/`Except`.

## How was this patch tested?
new regression test

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/cloud-fan/spark map
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17236.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17236

commit d670a11b8391c4e44e699372c567980fa9d29fed
Author: Wenchen Fan
Date: 2017-03-10T06:41:53Z
Cannot run intersect/except with map type
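A minimal reproduction of what the new analysis check guards against (my sketch; the exact error message is an assumption):

```scala
// Map columns have no well-defined equality, so set operations over them
// are rejected at analysis time by this patch.
val df = spark.sql("SELECT map(1, 'a') AS m")
df.intersect(df) // expected: AnalysisException pointing at the map-type column
df.except(df)    // same for EXCEPT
```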
[GitHub] spark issue #17138: [SPARK-17080] [SQL] join reorder
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/17138 @nsyca Yes, you're right. There's still much room for optimization. We will improve Spark's optimizer gradually :)
[GitHub] spark issue #17109: [SPARK-19740][MESOS]Add support in Spark to pass arbitra...
Github user tnachen commented on the issue: https://github.com/apache/spark/pull/17109 @srowen @mgummelt PTAL
[GitHub] spark pull request #17192: [SPARK-19849][SQL] Support ArrayType in to_json t...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17192#discussion_r105337154

--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
```
@@ -1339,6 +1339,11 @@ test_that("column functions", {
   expect_equal(collect(select(df, bround(df$x, 0)))[[1]][2], 4)

   # Test to_json(), from_json()
+  arr <- list(listToStruct(list("name" = "bob")))
+  df <- as.DataFrame(list(listToStruct(list("people" = arr))))
+  j <- collect(select(df, alias(to_json(df$people), "json")))
+  expect_equal(j[order(j$json), ][1], "[{\"name\":\"bob\"}]")
+
```
--- End diff --

Let me propose this with it for now and try to find another way until it is merged.
[GitHub] spark pull request #17177: [SPARK-19834][SQL] csv escape of quote escape
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17177#discussion_r105337093

--- Diff: python/pyspark/sql/readwriter.py ---
```
@@ -693,8 +697,8 @@ def text(self, path, compression=None):

     @since(2.0)
     def csv(self, path, mode=None, compression=None, sep=None, quote=None, escape=None,
-            header=None, nullValue=None, escapeQuotes=None, quoteAll=None, dateFormat=None,
-            timestampFormat=None, timeZone=None):
+            escapeQuoteEscaping=None, header=None, nullValue=None, escapeQuotes=None, quoteAll=None,
```
--- End diff --

Oh, @ep1804, we should add the new option at the end of these arguments, because otherwise it breaks existing code in lower versions that passes those options as positional arguments (not keyword arguments).
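To make the compatibility point concrete, a tiny sketch with hypothetical signatures (not the real reader API): inserting a parameter mid-signature silently changes what existing positional calls mean, while appending keeps them valid.

```scala
// v1 signature and an existing positional call:
def csvV1(path: String, sep: String, quote: String): Unit = ()
csvV1("data.csv", ";", "\"") // binds quote = "\""

// Inserting escapeQuoteEscaping *between* sep and quote would make the same
// positional call bind "\"" to escapeQuoteEscaping -- a silent behavior change:
def csvBad(path: String, sep: String, escapeQuoteEscaping: String, quote: String = "\""): Unit = ()

// Appending the new option instead keeps every old positional call intact:
def csvV2(path: String, sep: String, quote: String, escapeQuoteEscaping: String = "\\"): Unit = ()
csvV2("data.csv", ";", "\"") // unchanged meaning
```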
[GitHub] spark issue #17164: [SPARK-16844][SQL] Support codegen for sort-based aggrea...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/17164 @hvanhovell ping
[GitHub] spark pull request #17192: [SPARK-19849][SQL] Support ArrayType in to_json t...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/17192#discussion_r105336865

--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
```
@@ -1339,6 +1339,11 @@ test_that("column functions", {
   expect_equal(collect(select(df, bround(df$x, 0)))[[1]][2], 4)

   # Test to_json(), from_json()
+  arr <- list(listToStruct(list("name" = "bob")))
+  df <- as.DataFrame(list(listToStruct(list("people" = arr))))
+  j <- collect(select(df, alias(to_json(df$people), "json")))
+  expect_equal(j[order(j$json), ][1], "[{\"name\":\"bob\"}]")
+
```
--- End diff --

yes `listToStruct` is internal and it's mucking with types (though it's legitimate to do in R)
[GitHub] spark pull request #17192: [SPARK-19849][SQL] Support ArrayType in to_json t...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/17192#discussion_r105336776

--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
```
@@ -1339,6 +1339,11 @@ test_that("column functions", {
   expect_equal(collect(select(df, bround(df$x, 0)))[[1]][2], 4)

   # Test to_json(), from_json()
+  arr <- list(listToStruct(list("name" = "bob")))
+  df <- as.DataFrame(list(listToStruct(list("people" = arr))))
+  j <- collect(select(df, alias(to_json(df$people), "json")))
+  expect_equal(j[order(j$json), ][1], "[{\"name\":\"bob\"}]")
+
```
--- End diff --

possibly if it does
[GitHub] spark pull request #17234: [SPARK-19892][MLlib] Implement findAnalogies meth...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/17234#discussion_r105336208

--- Diff: R/pkg/DESCRIPTION ---
```
@@ -54,5 +54,5 @@ Collate:
     'types.R'
     'utils.R'
     'window.R'
-RoxygenNote: 5.0.1
+RoxygenNote: 6.0.1
```
--- End diff --

please revert this.
[GitHub] spark issue #17109: [SPARK-19740][MESOS]Add support in Spark to pass arbitra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17109

**[Test build #74299 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74299/testReport)** for PR 17109 at commit [`03e89eb`](https://github.com/apache/spark/commit/03e89eb2bae3b08207429a6b772d7dcae45b554c).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #17109: [SPARK-19740][MESOS]Add support in Spark to pass arbitra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17109 Merged build finished. Test PASSed.
[GitHub] spark issue #17109: [SPARK-19740][MESOS]Add support in Spark to pass arbitra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17109 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74299/ Test PASSed.
[GitHub] spark issue #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/17170 I can see your point, but renaming it only on the R side is not really addressing the issue. Please feel free to open a JIRA on spark.ml FPGrowth and start a discussion there.
[GitHub] spark issue #17109: [SPARK-19740][MESOS]Add support in Spark to pass arbitra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17109 **[Test build #74299 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74299/testReport)** for PR 17109 at commit [`03e89eb`](https://github.com/apache/spark/commit/03e89eb2bae3b08207429a6b772d7dcae45b554c).
[GitHub] spark issue #17177: [SPARK-19834][SQL] csv escape of quote escape
Github user ep1804 commented on the issue: https://github.com/apache/spark/pull/17177 Documentation for DataFrameReader, DataFrameWriter, DataStreamReader, readwriter.py and streaming.py is written. Please check.
[GitHub] spark issue #17232: [SPARK-18112] [SQL] Support reading data from Hive 2.1 m...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17232 Merged build finished. Test FAILed.
[GitHub] spark issue #17232: [SPARK-18112] [SQL] Support reading data from Hive 2.1 m...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17232 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74298/ Test FAILed.
[GitHub] spark issue #17232: [SPARK-18112] [SQL] Support reading data from Hive 2.1 m...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17232

**[Test build #74298 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74298/testReport)** for PR 17232 at commit [`af81cee`](https://github.com/apache/spark/commit/af81cee9f54abc13d7d07a12e4b499e49cd0dbcb).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #17226: [SPARK-19882][SQL] Pivot with null as a distinct pivot v...
Github user aray commented on the issue: https://github.com/apache/spark/pull/17226

@HyukjinKwon There is an inconsistency/regression, but it is not being introduced in this PR; it is already there. Take an example without null as a pivot column value, as below. The only difference is in the `count(1)` aggregate for cells with no values aggregated in the pivot table. Again, I don't think it's clear which is "correct" here.

**Spark 2.1.0** (and presumably 2.0.x/master)
```
scala> Seq(1,2).toDF("a").groupBy("a").pivot("a").count().show
+---+----+----+
|  a|   1|   2|
+---+----+----+
|  1|   1|null|
|  2|null|   1|
+---+----+----+

scala> Seq(1,2).toDF("a").groupBy("a").pivot("a").sum("a").show
+---+----+----+
|  a|   1|   2|
+---+----+----+
|  1|   1|null|
|  2|null|   2|
+---+----+----+
```

**Spark 1.6.0**
```
scala> sc.parallelize(Seq(1,2)).toDF("a").groupBy("a").pivot("a").count().show
+---+---+---+
|  a|  1|  2|
+---+---+---+
|  1|  1|  0|
|  2|  0|  1|
+---+---+---+

scala> sc.parallelize(Seq(1,2)).toDF("a").groupBy("a").pivot("a").sum("a").show
+---+----+----+
|  a|   1|   2|
+---+----+----+
|  1|   1|null|
|  2|null|   2|
+---+----+----+
```
[GitHub] spark issue #17235: [SPARK-19320][MESOS][WIP]allow specifying a hard limit o...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17235 Can one of the admins verify this patch?
[GitHub] spark pull request #17235: [SPARK-19320][MESOS][WIP]allow specifying a hard ...
GitHub user yanji84 opened a pull request: https://github.com/apache/spark/pull/17235

[SPARK-19320][MESOS][WIP] Allow specifying a hard limit on the number of gpus required in each spark executor when running on mesos

## What changes were proposed in this pull request?
Currently, Spark only allows specifying GPU resources as an upper limit. This adds a new conf parameter to allow specifying a hard limit on the number of GPU cores. If this hard limit is greater than 0, it overrides the effect of spark.mesos.gpus.max.

## How was this patch tested?
Tests pending

Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/yanji84/spark ji/hard_limit_on_gpu
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17235.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17235

commit 5f8ccd5789137363e035d1dfb9a05d3b9bf3ce6b
Author: Ji Yan
Date: 2017-03-10T05:30:11Z
respect both gpu and maxgpu
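A sketch of how the two settings would interact per the description above; only `spark.mesos.gpus.max` is named in the PR text, so the hard-limit key used here is hypothetical:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.mesos.gpus.max", "4")       // existing: soft upper bound on GPUs accepted from offers
  .set("spark.mesos.executor.gpus", "2")  // hypothetical name for the new hard per-executor limit

// Per the PR description: if the hard limit is > 0, it takes precedence
// over spark.mesos.gpus.max when sizing each executor.
```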
[GitHub] spark issue #17234: [SPARK-19892][MLlib] Implement findAnalogies method for ...
Github user benradford commented on the issue: https://github.com/apache/spark/pull/17234 ok to test Jenkins, add to whitelist
[GitHub] spark issue #17234: [SPARK-19892][MLlib] Implement findAnalogies method for ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17234 Can one of the admins verify this patch?
[GitHub] spark pull request #17234: [SPARK-19892][MLlib] Implement findAnalogies meth...
GitHub user benradford opened a pull request: https://github.com/apache/spark/pull/17234

[SPARK-19892][MLlib] Implement findAnalogies method for Word2VecModel

## What changes were proposed in this pull request?
Added a findAnalogies method to Word2VecModel for performing vector-algebra-based queries (e.g. King + Woman - Man).

## How was this patch tested?
Followed the contributor's guide for Spark and ran run-tests. Compiled and tested the functionality in spark-shell. This is an original work that I license to the project under the project's open source license.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/benradford/spark feature/findAnalogies
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17234.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17234

commit 2e7f1a3bd519d79ce9b08d388247e9a1d7f67635
Author: Benjamin Radford
Date: 2017-03-10T04:42:33Z
Added findAnalogies method to Word2VecModel

commit 9aefebfcd2e6eaad117727901ad70d0d26b03a1a
Author: Benjamin Radford
Date: 2017-03-10T05:16:46Z
Fixed comment indentation to conform to style guide.
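For readers, a sketch of the analogy computation in question (king - man + woman ≈ queen), expressed with the existing mllib Word2VecModel API; the PR packages this pattern as a method, and its exact signature may differ:

```scala
import org.apache.spark.mllib.feature.Word2VecModel
import org.apache.spark.mllib.linalg.Vectors

// Combine word vectors arithmetically, then look up nearest neighbors.
def analogies(model: Word2VecModel, pos: String, neg: String, pos2: String,
              num: Int): Array[(String, Double)] = {
  val a = model.transform(pos).toArray   // e.g. "king"
  val b = model.transform(neg).toArray   // e.g. "man"
  val c = model.transform(pos2).toArray  // e.g. "woman"
  val query = Vectors.dense(a.indices.map(i => a(i) - b(i) + c(i)).toArray)
  model.findSynonyms(query, num)         // "queen" should rank near the top
}
```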
[GitHub] spark pull request #17174: [SPARK-19145][SQL] Timestamp to String casting is...
Github user tanejagagan commented on a diff in the pull request: https://github.com/apache/spark/pull/17174#discussion_r105333124

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ---
@@ -324,14 +324,22 @@ object TypeCoercion {
      // We should cast all relative timestamp/date/string comparison into string comparisons
      // This behaves as a user would expect because timestamp strings sort lexicographically.
      // i.e. TimeStamp(2013-01-01 00:00 ...) < "2014" = true
-     case p @ BinaryComparison(left @ StringType(), right @ DateType()) =>
-       p.makeCopy(Array(left, Cast(right, StringType)))
-     case p @ BinaryComparison(left @ DateType(), right @ StringType()) =>
-       p.makeCopy(Array(Cast(left, StringType), right))
-     case p @ BinaryComparison(left @ StringType(), right @ TimestampType()) =>
-       p.makeCopy(Array(left, Cast(right, StringType)))
-     case p @ BinaryComparison(left @ TimestampType(), right @ StringType()) =>
-       p.makeCopy(Array(Cast(left, StringType), right))
+     // If StringType is foldable then we need to cast String to Date or Timestamp type,
+     // which gives an order-of-magnitude performance gain and preserves the behavior
+     // achieved by the expressions above, i.e.
+     // TimeStamp(2013-01-01 00:00 ...) < Cast("2014" as timestamp) = true
+     case p @ BinaryComparison(left @ StringType(), right) if dateOrTimestampType(right) =>
+       if (left.foldable) {
+         p.makeCopy(Array(Cast(left, right.dataType), right))
--- End diff --

Yes, you can explicitly cast the string to timestamp and the speed-up is substantial. By default, without the cast, the query just runs silently, picks a very bad plan with no indication to the user whatsoever, and is about an order of magnitude slower. Some related comparisons such as `time < 'abc'` will also run just fine, which I think should instead fail fast and let the user know about the casting issue. The other problem is with BI tools that generate these SQLs, where the user has no direct control over the SQL. We came across this issue when the same query was running 10 times faster in Impala than in Spark; investigating that surfaced this bug and hence this fix.
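To make the workaround concrete, a hedged spark-shell example; the `events` table and `ts` column are hypothetical:

```scala
// Without the explicit cast, every row's ts is cast to string and compared
// lexicographically; casting the foldable literal once keeps the comparison
// in timestamp space and lets the planner do much better.
spark.sql("SELECT * FROM events WHERE ts < CAST('2014-01-01' AS TIMESTAMP)")
```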
[GitHub] spark issue #16952: [SPARK-19620][SQL]Fix incorrect exchange coordinator id ...
Github user carsonwang commented on the issue: https://github.com/apache/spark/pull/16952 @gatorsmile @cloud-fan @yhuai, can you help review and merge this minor one-line fix? The code change itself is straightforward.
[GitHub] spark issue #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in Bucket...
Github user crackcell commented on the issue: https://github.com/apache/spark/pull/17123 @cloud-fan Would you please review my code again? I'm now using `Option` to handle NULLs. :-)
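A minimal sketch of that idea, assuming nothing about the PR's actual code: lift the possibly-NULL boxed value into `Option` so downstream logic can pattern-match instead of hitting a NullPointerException:

```scala
// None signals a NULL input, to be skipped, kept, or errored on per policy.
def toOption(raw: java.lang.Double): Option[Double] = Option(raw).map(_.doubleValue)

toOption(null) // None
toOption(1.5)  // Some(1.5)
```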
[GitHub] spark issue #17177: [SPARK-19834][SQL] csv escape of quote escape
Github user ep1804 commented on the issue: https://github.com/apache/spark/pull/17177 An issue has been raised for the uniVocity parser: https://github.com/uniVocity/univocity-parsers/issues/143
[GitHub] spark pull request #17233: [SPARK-11569][ML] Fix StringIndexer to handle nul...
GitHub user crackcell opened a pull request: https://github.com/apache/spark/pull/17233 [SPARK-11569][ML] Fix StringIndexer to handle null value properly

## What changes were proposed in this pull request?

This PR enhances StringIndexer with NULL value handling. Before the PR, StringIndexer throws an exception when it encounters NULL values. With this PR:
- handleInvalid=error: throw an exception, as before
- handleInvalid=skip: skip null values as well as unseen labels
- handleInvalid=keep: give null values an additional index, as for unseen labels

BTW, I noticed someone was trying to solve the same problem (#9920) but seems to have gotten no progress or response for a long time. Would you mind giving me a chance to solve it?

## How was this patch tested?

New unit tests.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/crackcell/spark 11569_StringIndexer_NULL Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17233.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17233 commit 75e3975597aa6271f4f8ab688922edda88b03045 Author: Menglong TAN Date: 2017-03-08T03:50:17Z Merge pull request #1 from apache/master merge master to my repo commit 79d706085e8371fb1724ce73377767c38d551e5d Author: Menglong TAN Date: 2017-03-10T04:45:56Z Enhance StringIndexer with NULL values commit 0cb121c65f592b9623bdeef2746d7c2a3c281ae1 Author: Menglong TAN Date: 2017-03-10T04:52:30Z filter out NULLs when transform dataset
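A hedged sketch of how the proposed modes would look to a user. The `"keep"` value and the NULL behavior are what this PR adds, so the snippet describes the patch's intent rather than released behavior; `train` and `test` are hypothetical DataFrames:

```scala
import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .setHandleInvalid("keep") // per this PR: NULLs and unseen labels get an extra index

val indexed = indexer.fit(train).transform(test)
```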
[GitHub] spark issue #17177: [SPARK-19834][SQL] csv escape of quote escape
Github user ep1804 commented on the issue: https://github.com/apache/spark/pull/17177 Thank you @HyukjinKwon. I made changes following your comments:
* `escapeQuoteEscaping` instead of `escapeEscape`
* default value to `\u` (unset)
* `withTempPath`
* no `orderBy`
* `checkAnswer`
* styles :)
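For context, a sketch of how the option under review would be used once merged; the option name comes from this thread, but the snippet is illustrative and the path is hypothetical:

```scala
// escapeQuoteEscaping sets the character that escapes the escape character itself
// inside quoted CSV values (mirroring uniVocity's setting of the same name).
val df = spark.read
  .option("quote", "\"")
  .option("escape", "\\")
  .option("escapeQuoteEscaping", "\\")
  .csv("/path/to/data.csv")
```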
[GitHub] spark issue #17172: [SPARK-19008][SQL] Improve performance of Dataset.map by...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17172 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74297/ Test PASSed.
[GitHub] spark issue #17172: [SPARK-19008][SQL] Improve performance of Dataset.map by...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17172 Merged build finished. Test PASSed.
[GitHub] spark issue #17172: [SPARK-19008][SQL] Improve performance of Dataset.map by...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17172 **[Test build #74297 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74297/testReport)** for PR 17172 at commit [`b25b191`](https://github.com/apache/spark/commit/b25b191687259303df5ab2fad0c64687a88de5bd).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #17172: [SPARK-19008][SQL] Improve performance of Dataset.map by...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17172 Merged build finished. Test PASSed.
[GitHub] spark issue #17172: [SPARK-19008][SQL] Improve performance of Dataset.map by...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17172 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74296/ Test PASSed.
[GitHub] spark issue #17172: [SPARK-19008][SQL] Improve performance of Dataset.map by...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17172 **[Test build #74296 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74296/testReport)** for PR 17172 at commit [`200cec7`](https://github.com/apache/spark/commit/200cec783f33de21d9895f90161a9d11877d0877).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #17226: [SPARK-19882][SQL] Pivot with null as a distinct pivot v...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17226 Merged build finished. Test PASSed.
[GitHub] spark issue #17226: [SPARK-19882][SQL] Pivot with null as a distinct pivot v...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17226 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74295/ Test PASSed.
[GitHub] spark issue #17226: [SPARK-19882][SQL] Pivot with null as a distinct pivot v...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17226 **[Test build #74295 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74295/testReport)** for PR 17226 at commit [`a81c062`](https://github.com/apache/spark/commit/a81c06201259b246c7b9e8b56ecc4183e6279410).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #17232: [SPARK-18112] [SQL] Support reading data from Hive 2.1 m...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17232 **[Test build #74298 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74298/testReport)** for PR 17232 at commit [`af81cee`](https://github.com/apache/spark/commit/af81cee9f54abc13d7d07a12e4b499e49cd0dbcb).
[GitHub] spark pull request #17232: [SPARK-18112] [SQL] Support reading data from Hiv...
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/17232 [SPARK-18112] [SQL] Support reading data from Hive 2.1 metastore [WIP]

### What changes were proposed in this pull request?

This PR is to support reading data from a Hive 2.1 metastore. The shim class needs updating because of the Hive API changes introduced by the following two Hive JIRAs:
- [HIVE-12730 MetadataUpdater: provide a mechanism to edit the basic statistics of a table (or a partition)](https://issues.apache.org/jira/browse/HIVE-12730)
- [HIVE-13341 Stats state is not captured correctly: differentiate load table and create table](https://issues.apache.org/jira/browse/HIVE-13341)

Two new fields have been added in Hive:
- `EnvironmentContext environmentContext`. So far, this is always set to `null`. This was introduced to support the DDL `alter table s update statistics set ('numRows'='NaN')`, with which users can specify the statistics. So far, Spark SQL does not need it, because we use different table properties to store our generated statistics values. However, when Spark SQL issues ALTER TABLE DDL statements, the Hive metastore automatically invalidates the Hive-generated statistics. In a follow-up PR, we can fix this by explicitly adding a property to `environmentContext`:
```JAVA
putToProperties(StatsSetupConst.STATS_GENERATED, StatsSetupConst.USER)
```
- `boolean hasFollowingStatsTask`. We always set it to `false`. TODO: more investigation about this.

### How was this patch tested?

Added test cases to VersionsSuite.scala.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/gatorsmile/spark Hive21 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17232.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17232 commit af81cee9f54abc13d7d07a12e4b499e49cd0dbcb Author: Xiao Li Date: 2017-03-10T03:35:38Z fix
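A hedged sketch of the follow-up fix described above, using the Hive client types the PR names; treat it as an assumption about how the shim would wire this in, not the merged code:

```scala
import org.apache.hadoop.hive.common.StatsSetupConst
import org.apache.hadoop.hive.metastore.api.EnvironmentContext

// Mark our statistics as user-provided so the metastore's alter_table path
// does not invalidate them on every Spark-issued ALTER TABLE.
val context = new EnvironmentContext()
context.putToProperties(StatsSetupConst.STATS_GENERATED, StatsSetupConst.USER)
// The shim would then pass `context` into Hive's alter_table overload.
```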
[GitHub] spark issue #17231: [SPARK-19891][SS] Await Batch Lock notified on stream ex...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17231 Merged build finished. Test PASSed.
[GitHub] spark issue #17231: [SPARK-19891][SS] Await Batch Lock notified on stream ex...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17231 Merged build finished. Test PASSed.
[GitHub] spark issue #17231: [SPARK-19891][SS] Await Batch Lock notified on stream ex...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17231 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74293/ Test PASSed.
[GitHub] spark issue #17231: [SPARK-19891][SS] Await Batch Lock notified on stream ex...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17231 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74294/ Test PASSed.
[GitHub] spark issue #17226: [SPARK-19882][SQL] Pivot with null as a distinct pivot v...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/17226

> we're not introducing a regression in this PR by fixing the NPE; the answer given by 1.6 was incorrect under any interpretation

Right, if it was a bug, then this PR introduces an inconsistency between the optimized path and the non-optimized one. We should fix both together or not at all.

> there is a completely separate issue of what the proper value of count(1) on no values should be in a pivot that does not depend at all on nulls in the pivot column.

Then why are we partially fixing a completely separate issue here?
[GitHub] spark issue #17231: [SPARK-19891][SS] Await Batch Lock notified on stream ex...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17231 **[Test build #74293 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74293/testReport)** for PR 17231 at commit [`c49e183`](https://github.com/apache/spark/commit/c49e183bf391ad0d9352400b19a2e191a3f0be11).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #17231: [SPARK-19891][SS] Await Batch Lock notified on stream ex...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17231 **[Test build #74294 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74294/testReport)** for PR 17231 at commit [`8170792`](https://github.com/apache/spark/commit/817079234203dc428032c5a110a03f43ec7a813f).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #17231: [SPARK-19891][SS] Await Batch Lock notified on stream ex...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17231 Merged build finished. Test PASSed.
[GitHub] spark issue #17231: [SPARK-19891][SS] Await Batch Lock notified on stream ex...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17231 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74292/ Test PASSed.
[GitHub] spark issue #17231: [SPARK-19891][SS] Await Batch Lock notified on stream ex...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17231 **[Test build #74292 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74292/testReport)** for PR 17231 at commit [`a23162a`](https://github.com/apache/spark/commit/a23162ac7322af2b662e870ea4f889f19d4a8b2c).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #17226: [SPARK-19882][SQL] Pivot with null as a distinct pivot v...
Github user aray commented on the issue: https://github.com/apache/spark/pull/17226 @HyukjinKwon we're not introducing a regression in this PR by fixing the NPE; the answer given by 1.6 was incorrect under any interpretation. Again, there is a completely separate issue of what the proper value of count(1) on no values should be in a pivot that does not depend at all on nulls in the pivot column.
[GitHub] spark pull request #17224: [SPARK-19882][SQL] Pivot with null as the dictinc...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17224#discussion_r105327492

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -524,15 +529,21 @@ class Analyzer(
       def ifExpr(expr: Expression) = {
         If(EqualTo(pivotColumn, value), expr, Literal(null))
       }
+      def ifNullSafeExpr(expr: Expression) = {
+        If(EqualNullSafe(pivotColumn, value), expr, Literal(null))
+      }
       aggregates.map { aggregate =>
         val filteredAggregate = aggregate.transformDown {
           // Assumption is the aggregate function ignores nulls. This is true for all current
-          // AggregateFunction's with the exception of First and Last in their default mode
-          // (which we handle) and possibly some Hive UDAF's.
+          // AggregateFunction's with the exception of First, Last and Count in their
+          // default mode (which we handle) and possibly some Hive UDAF's.
           case First(expr, _) => First(ifExpr(expr), Literal(true))
           case Last(expr, _) => Last(ifExpr(expr), Literal(true))
+          case c: Count =>
+            // In case of count, `null` should be counted.
+            c.withNewChildren(c.children.map(ifNullSafeExpr))
--- End diff --

Let me update this path as soon as we decide what we want in another PR for this JIRA.
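The distinction the new `ifNullSafeExpr` helper turns on is SQL's null-safe equality, which can be seen directly in spark-shell:

```scala
// EqualTo yields NULL when either side is NULL; EqualNullSafe (<=>) never does.
spark.sql("SELECT NULL = NULL").show()   // null
spark.sql("SELECT NULL <=> NULL").show() // true
```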
[GitHub] spark issue #17226: [SPARK-19882][SQL] Pivot with null as a distinct pivot v...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/17226 cc @cloud-fan and @yhuai, could you pick up one of them? Let me follow.
[GitHub] spark issue #17226: [SPARK-19882][SQL] Pivot with null as a distinct pivot v...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/17226

> Spark 2.0+ with PivotFirst gives an NPE when one of the pivot column values is null. The main thing fixed in this PR.

I meant to say it is not fully fixed, because it no longer NPEs but now introduces a regression. Why don't we fix the NPE and resolve the regression first, and then add the optimization for it? I think we have two options:
- Deal with this case in both the optimization path and the non-optimization path, which introduces a regression/inconsistency between them that should be a separate JIRA. In this case, let me close mine.
- Fall back to the non-optimization path and leave a JIRA to put this in the optimized path. In this case, I think this PR should be closed.
[GitHub] spark issue #17226: [SPARK-19882][SQL] Pivot with null as a distinct pivot v...
Github user aray commented on the issue: https://github.com/apache/spark/pull/17226 BTW for 3 above, if we decide it should be 0, we can add an initial value for `PivotFirst` to make the fix.
[GitHub] spark issue #17226: [SPARK-19882][SQL] Pivot with null as a distinct pivot v...
Github user aray commented on the issue: https://github.com/apache/spark/pull/17226 There are three things going on here in your one example.
1. Spark 1.6 (the first version with pivot), and Spark 2.0+ with an aggregate output type unsupported by PivotFirst, gives incorrect answers when one of the pivot column values is null (this only affects the 'null' column). This is fixed by doing a null-safe equals in the injected if statement: https://github.com/apache/spark/pull/17226/files#diff-57b3d87be744b7d79a9beacf8e5e5eb2R525
2. Spark 2.0+ with PivotFirst gives an NPE when one of the pivot column values is null. The main thing fixed in this PR.
3. There is an inconsistency between Spark 1.6 and 2.0+ on the result of a pivot with a `count(1)` aggregate when no values are aggregated for a cell. This is separate from the issues above, and it's not clear which version is naturally correct (pandas leaves those values as null, Oracle 11g gives 0, and I need to test others).
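A hedged spark-shell repro of the empty-cell case in item 3 (toy data, not from the PR):

```scala
import org.apache.spark.sql.functions.{count, lit}
import spark.implicits._

val df = Seq(("a", "x"), ("a", "y"), ("b", "x")).toDF("k", "p")
// The ("b", "y") cell aggregates no rows: Spark 2.x reports null there,
// whereas e.g. Oracle 11g would report 0 for COUNT.
df.groupBy("k").pivot("p").agg(count(lit(1))).show()
```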
[GitHub] spark pull request #17226: [SPARK-19882][SQL] Pivot with null as a distinct ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/17226#discussion_r105325964

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -522,7 +522,7 @@ class Analyzer(
       } else {
         val pivotAggregates: Seq[NamedExpression] = pivotValues.flatMap { value =>
           def ifExpr(expr: Expression) = {
-            If(EqualTo(pivotColumn, value), expr, Literal(null))
+            If(EqualNullSafe(pivotColumn, value), expr, Literal(null))
--- End diff --

Ah, yes. You are right. I think I was mistaken.
[GitHub] spark pull request #15821: [SPARK-13534][PySpark] Using Apache Arrow to incr...
Github user julienledem commented on a diff in the pull request: https://github.com/apache/spark/pull/15821#discussion_r105323172

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/ArrowConverters.scala ---
@@ -0,0 +1,411 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import java.io.ByteArrayOutputStream
+import java.nio.ByteBuffer
+import java.nio.channels.{Channels, SeekableByteChannel}
+
+import scala.collection.JavaConverters._
+
+import io.netty.buffer.ArrowBuf
+import org.apache.arrow.memory.{BaseAllocator, RootAllocator}
+import org.apache.arrow.vector._
+import org.apache.arrow.vector.BaseValueVector.BaseMutator
+import org.apache.arrow.vector.file.{ArrowReader, ArrowWriter}
+import org.apache.arrow.vector.schema.{ArrowFieldNode, ArrowRecordBatch}
+import org.apache.arrow.vector.types.{FloatingPointPrecision, TimeUnit}
+import org.apache.arrow.vector.types.pojo.{ArrowType, Field, Schema}
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.types._
+
+
+/**
+ * ArrowReader requires a seekable byte channel.
+ * NOTE - this is taken from test org.apache.vector.file, see about moving to public util pkg
+ */
+private[sql] class ByteArrayReadableSeekableByteChannel(var byteArray: Array[Byte])
+  extends SeekableByteChannel {
+  var _position: Long = 0L
+
+  override def isOpen: Boolean = {
+    byteArray != null
+  }
+
+  override def close(): Unit = {
+    byteArray = null
+  }
+
+  override def read(dst: ByteBuffer): Int = {
+    val remainingBuf = byteArray.length - _position
+    val length = Math.min(dst.remaining(), remainingBuf).toInt
+    dst.put(byteArray, _position.toInt, length)
+    _position += length
+    length.toInt
+  }
+
+  override def position(): Long = _position
+
+  override def position(newPosition: Long): SeekableByteChannel = {
+    _position = newPosition.toLong
+    this
+  }
+
+  override def size: Long = {
+    byteArray.length.toLong
+  }
+
+  override def write(src: ByteBuffer): Int = {
+    throw new UnsupportedOperationException("Read Only")
+  }
+
+  override def truncate(size: Long): SeekableByteChannel = {
+    throw new UnsupportedOperationException("Read Only")
+  }
+}
+
+/**
+ * Intermediate data structure returned from Arrow conversions
+ */
+private[sql] abstract class ArrowPayload extends Iterator[ArrowRecordBatch]
+
+/**
+ * Build a payload from existing ArrowRecordBatches
+ */
+private[sql] class ArrowStaticPayload(batches: ArrowRecordBatch*) extends ArrowPayload {
+  private val iter = batches.iterator
+  override def next(): ArrowRecordBatch = iter.next()
+  override def hasNext: Boolean = iter.hasNext
+}
+
+/**
+ * Class that wraps an Arrow RootAllocator used in conversion
+ */
+private[sql] class ArrowConverters {
+  private val _allocator = new RootAllocator(Long.MaxValue)
+
+  private[sql] def allocator: RootAllocator = _allocator
+
+  def interalRowIterToPayload(rowIter: Iterator[InternalRow], schema: StructType): ArrowPayload = {
+    val batch = ArrowConverters.internalRowIterToArrowBatch(rowIter, schema, allocator)
+    new ArrowStaticPayload(batch)
+  }
+
+  def readPayloadByteArrays(payloadByteArrays: Array[Array[Byte]]): ArrowPayload = {
+    val batches = scala.collection.mutable.ArrayBuffer.empty[ArrowRecordBatch]
+    var i = 0
+    while (i < payloadByteArrays.length) {
+      val payloadBytes = payloadByteArrays(i)
+      val in = new ByteArrayReadableSeekableByteChannel(payloadBytes)
+      val reader = new ArrowReader(in, _allocator)
+      val footer = reader.readFooter()
+      val batchBlocks = footer.getRecordBatches.asScala.toArray
+      batchBlocks.foreach(block => batches += reader.readRecordBatch(block))
+      i += 1
[GitHub] spark pull request #15821: [SPARK-13534][PySpark] Using Apache Arrow to incr...
Github user julienledem commented on a diff in the pull request: https://github.com/apache/spark/pull/15821#discussion_r105322928

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/ArrowConverters.scala ---
[GitHub] spark pull request #15821: [SPARK-13534][PySpark] Using Apache Arrow to incr...
Github user julienledem commented on a diff in the pull request: https://github.com/apache/spark/pull/15821#discussion_r105324269

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/ArrowConvertersSuite.scala ---
@@ -0,0 +1,567 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql
+
+import java.io.File
+import java.nio.charset.StandardCharsets
+import java.sql.{Date, Timestamp}
+import java.text.SimpleDateFormat
+import java.util.Locale
+
+import com.google.common.io.Files
+import org.apache.arrow.vector.{VectorLoader, VectorSchemaRoot}
+import org.apache.arrow.vector.file.json.JsonFileReader
+import org.apache.arrow.vector.util.Validator
+import org.json4s.jackson.JsonMethods._
+import org.json4s.JsonAST._
+import org.json4s.JsonDSL._
+import org.scalatest.BeforeAndAfterAll
+
+import org.apache.spark.SparkException
+import org.apache.spark.sql.test.SharedSQLContext
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.util.Utils
+
+
+class ArrowConvertersSuite extends SharedSQLContext with BeforeAndAfterAll {
+  import testImplicits._
+
+  private var tempDataPath: String = _
+
+  private def collectAsArrow(
+      df: DataFrame,
+      converter: Option[ArrowConverters] = None): ArrowPayload = {
+    val cnvtr = converter.getOrElse(new ArrowConverters)
+    val payloadByteArrays = df.toArrowPayloadBytes().collect()
+    cnvtr.readPayloadByteArrays(payloadByteArrays)
+  }
+
+  override def beforeAll(): Unit = {
+    super.beforeAll()
+    tempDataPath = Utils.createTempDir(namePrefix = "arrow").getAbsolutePath
+  }
+
+  test("collect to arrow record batch") {
+    val indexData = (1 to 6).toDF("i")
+    val arrowPayload = collectAsArrow(indexData)
+    assert(arrowPayload.nonEmpty)
+    val arrowBatches = arrowPayload.toArray
+    assert(arrowBatches.length == indexData.rdd.getNumPartitions)
+    val rowCount = arrowBatches.map(batch => batch.getLength).sum
+    assert(rowCount === indexData.count())
+    arrowBatches.foreach(batch => assert(batch.getNodes.size() > 0))
+    arrowBatches.foreach(batch => batch.close())
+  }
+
+  test("numeric type conversion") {
+    collectAndValidate(indexData)
+    collectAndValidate(shortData)
+    collectAndValidate(intData)
+    collectAndValidate(longData)
+    collectAndValidate(floatData)
+    collectAndValidate(doubleData)
+  }
+
+  test("mixed numeric type conversion") {
+    collectAndValidate(mixedNumericData)
+  }
+
+  test("boolean type conversion") {
+    collectAndValidate(boolData)
+  }
+
+  test("string type conversion") {
+    collectAndValidate(stringData)
+  }
+
+  test("byte type conversion") {
+    collectAndValidate(byteData)
+  }
+
+  test("timestamp conversion") {
+    collectAndValidate(timestampData)
+  }
+
+  // TODO: Not currently supported in Arrow JSON reader
+  ignore("date conversion") {
+    // collectAndValidate(dateTimeData)
+  }
+
+  // TODO: Not currently supported in Arrow JSON reader
+  ignore("binary type conversion") {
+    // collectAndValidate(binaryData)
+  }
+
+  test("floating-point NaN") {
+    collectAndValidate(floatNaNData)
+  }
+
+  test("partitioned DataFrame") {
+    val converter = new ArrowConverters
+    val schema = testData2.schema
+    val arrowPayload = collectAsArrow(testData2, Some(converter))
+    val arrowBatches = arrowPayload.toArray
+    // NOTE: testData2 should have 2 partitions -> 2 arrow batches in payload
+    assert(arrowBatches.length === 2)
+    val pl1 = new ArrowStaticPayload(arrowBatches(0))
+    val pl2 = new ArrowStaticPayload(arrowBatches(1))
+    // Generate JSON files
+    val a = List[Int](1, 1, 2, 2, 3, 3)
+    val b = List[Int](1, 2, 1, 2, 1, 2)
+    val fields
[GitHub] spark pull request #15821: [SPARK-13534][PySpark] Using Apache Arrow to incr...
Github user julienledem commented on a diff in the pull request: https://github.com/apache/spark/pull/15821#discussion_r105321731

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/ArrowConverters.scala ---
+/**
+ * Class that wraps an Arrow RootAllocator used in conversion
+ */
+private[sql] class ArrowConverters {
+  private val _allocator = new RootAllocator(Long.MaxValue)
--- End diff --

if we create an allocator we should have a way to close it in the end.
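A sketch of the reviewer's suggestion, assuming nothing beyond Arrow's allocator being closeable: tie the allocator's lifetime to the converter so callers can release the off-heap memory deterministically. The class below is a hypothetical variant, not the PR's code:

```scala
import org.apache.arrow.memory.RootAllocator

// Owning the allocator and exposing close() lets callers free Arrow's
// off-heap buffers deterministically, e.g. via try-with-resources in Java.
class ClosableArrowConverters extends AutoCloseable {
  private val allocator = new RootAllocator(Long.MaxValue)
  // Throws if child buffers are still outstanding, surfacing leaks early.
  override def close(): Unit = allocator.close()
}
```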
[GitHub] spark pull request #15821: [SPARK-13534][PySpark] Using Apache Arrow to incr...
Github user julienledem commented on a diff in the pull request: https://github.com/apache/spark/pull/15821#discussion_r105322470

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/ArrowConverters.scala ---
[GitHub] spark pull request #15821: [SPARK-13534][PySpark] Using Apache Arrow to incr...
Github user julienledem commented on a diff in the pull request: https://github.com/apache/spark/pull/15821#discussion_r105322707
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/ArrowConverters.scala --- (quoted diff context identical to the comment above)
[GitHub] spark issue #17226: [SPARK-19882][SQL] Pivot with null as a distinct pivot v...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/17226

@aray, this is a regression introduced by this optimization, as I described in my PR. Spark 1.6 produced:
```
+----+----+---+
|   a|null|  1|
+----+----+---+
|null|   0|  0|
|   1|   0|  1|
+----+----+---+
```
For this input/output, this PR does not fully resolve the regression. That's why I proposed avoiding this case in the optimization path; we could leave another JIRA open to support it there later. It is a little odd that the output differs depending on an internal optimization.
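For context, a minimal, self-contained sketch of the pivot-on-null case being discussed; the object name and data here are illustrative and not taken from the PR's test suite.
```scala
import org.apache.spark.sql.SparkSession

// Sketch: pivot on a column whose distinct values include null. This is
// the shape of input whose output changed between Spark 1.6 and the
// optimized pivot path discussed above.
object PivotNullRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("pivot-null-repro")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      (Option.empty[Int], 0), // null pivot value
      (Some(1), 1)
    ).toDF("a", "b")

    // pivot("a") computes the distinct pivot values, one of which is null,
    // so the result gains a "null" column alongside the "1" column.
    df.groupBy($"a").pivot("a").count().show()

    spark.stop()
  }
}
```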
[GitHub] spark pull request #15821: [SPARK-13534][PySpark] Using Apache Arrow to incr...
Github user julienledem commented on a diff in the pull request: https://github.com/apache/spark/pull/15821#discussion_r105322098
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/ArrowConverters.scala --- (quoted diff context identical to the comment above)
[GitHub] spark pull request #15821: [SPARK-13534][PySpark] Using Apache Arrow to incr...
Github user julienledem commented on a diff in the pull request: https://github.com/apache/spark/pull/15821#discussion_r105322283
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/ArrowConverters.scala --- (quoted diff context identical to the comment above)
[GitHub] spark pull request #15821: [SPARK-13534][PySpark] Using Apache Arrow to incr...
Github user julienledem commented on a diff in the pull request: https://github.com/apache/spark/pull/15821#discussion_r105321652
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/ArrowConverters.scala --- (quoted diff context identical to the comment above)