[GitHub] spark issue #17239: Using map function in spark for huge operation

2017-03-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17239
  
Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17239: Using map function in spark for huge operation

2017-03-09 Thread nischay21
GitHub user nischay21 opened a pull request:

https://github.com/apache/spark/pull/17239

Using map function in spark for huge operation

We need to compute a distance measure such as Jaccard similarity over a huge 
Dataset in Spark.
We are facing a couple of issues; kindly point us in the right direction.

Issue 1.

import info.debatty.java.stringsimilarity.Jaccard;
import java.util.Arrays;
import java.util.List;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Sample Dataset creation
List<Row> data = Arrays.asList(
    RowFactory.create("Hi I heard about Spark", "Hi I Know about Spark"),
    RowFactory.create("I wish Java could use case classes", "I wish C# could use case classes"),
    RowFactory.create("Logistic,regression,models,are,neat", "Logistic,regression,models,are,neat"));

StructType schema = new StructType(new StructField[] {
    new StructField("label", DataTypes.StringType, false, Metadata.empty()),
    new StructField("sentence", DataTypes.StringType, false, Metadata.empty())});
Dataset<Row> sentenceDataFrame = spark.createDataFrame(data, schema);

// Distance measure object, created on the driver
Jaccard jaccard = new Jaccard();

// Apply the distance measure to each element of the Dataset
Dataset<String> sentenceDataFrame1 = sentenceDataFrame.map(
    (MapFunction<Row, String>) row ->
        "Name: " + jaccard.similarity(row.getString(0), row.getString(1)),
    Encoders.STRING());
sentenceDataFrame1.show();

There are no compile-time errors, but at run time we get an exception: 
org.apache.spark.SparkException: Task not serializable
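
The exception most likely occurs because the driver-side `jaccard` instance is captured by the closure and is not serializable. A minimal sketch of one way around it (assuming the Dataset and schema above): create the Jaccard instance inside the lambda, so the closure never captures a non-serializable driver-side object.

```
import info.debatty.java.stringsimilarity.Jaccard;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

// Sketch: nothing non-serializable is captured; the lambda itself is serializable
// because MapFunction extends Serializable.
Dataset<String> similarities = sentenceDataFrame.map(
    (MapFunction<Row, String>) row -> {
        Jaccard jaccard = new Jaccard();  // created on the executor, never shipped from the driver
        double score = jaccard.similarity(row.getString(0), row.getString(1));
        return row.getString(0) + " -> " + score;
    },
    Encoders.STRING());
similarities.show(false);
```

An alternative with the same effect is to hold the Jaccard instance in a static field of a helper class, so each executor JVM initializes its own copy.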

Issue 2.
We also need to find which pair has the highest score, which requires declaring 
some variables, and we need to perform other calculations as well; here we are 
facing a lot of difficulty. If we declare a simple variable such as a counter 
inside the map block, we cannot capture the incremented value. If we declare it 
outside the map block, we get compile-time errors.


int counter = 0;
Dataset<String> sentenceDataFrame1 = sentenceDataFrame.map(
    (MapFunction<Row, String>) row -> {
        System.out.println("Name: " + row.getString(1));
        // int counter = 0;
        counter++;  // does not compile: a lambda may only read (effectively) final local variables
        System.out.println("Counter: " + counter);
        return counter + "";
    },
    Encoders.STRING());

Please give us some direction.
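
Rather than mutating a driver-side counter from inside the closure (the lambda runs on executors, and a Java lambda cannot modify a captured local variable in any case), the usual pattern is to return the values you need and aggregate them afterwards. A hedged sketch, assuming the two-column sentenceDataFrame above:

```
import scala.Tuple3;
import info.debatty.java.stringsimilarity.Jaccard;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

// Emit (left, right, score) records instead of printing or counting inside the closure.
Dataset<Tuple3<String, String, Double>> scored = sentenceDataFrame.map(
    (MapFunction<Row, Tuple3<String, String, Double>>) row -> {
        double score = new Jaccard().similarity(row.getString(0), row.getString(1));
        return new Tuple3<>(row.getString(0), row.getString(1), score);
    },
    Encoders.tuple(Encoders.STRING(), Encoders.STRING(), Encoders.DOUBLE()));

// The "counter" is simply the number of records processed.
long pairCount = scored.count();

// The pair with the highest similarity (tuple columns are named _1, _2, _3).
Tuple3<String, String, Double> best = scored.orderBy(scored.col("_3").desc()).first();
System.out.println(best._1() + " / " + best._2() + " -> " + best._3());
```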



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/apache/spark branch-2.1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17239.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17239


commit 1cafc76ea1e9eef40b24060d1cd7c4aaf9f16a49
Author: Shixiong Zhu 
Date:   2016-12-09T01:58:44Z

[SPARK-18774][CORE][SQL] Ignore non-existing files when ignoreCorruptFiles 
is enabled (branch 2.1)

## What changes were proposed in this pull request?

Backport #16203 to branch 2.1.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu 

Closes #16216 from zsxwing/SPARK-18774-2.1.

commit ef5646b4c6792a96e85d1dd4bb3103ba8306949b
Author: Shivaram Venkataraman 
Date:   2016-12-09T02:26:54Z

[SPARKR][PYSPARK] Fix R source package name to match Spark version. Remove 
pip tar.gz from distribution

## What changes were proposed in this pull request?

Fixes name of R source package so that the `cp` in release-build.sh works 
correctly.

Issue discussed in 
https://github.com/apache/spark/pull/16014#issuecomment-265867125

Author: Shivaram Venkataraman 

Closes #16221 from shivaram/fix-sparkr-release-build-name.

(cherry picked from commit 4ac8b20bf2f962d9b8b6b209468896758d49efe3)
Signed-off-by: Shivaram Venkataraman 

commit 4ceed95b43d0cd9665004865095a40926efcc289
Author: wm...@hotmail.com 
Date:   2016-12-09T06:08:19Z

[SPARK-18349][SPARKR] 

[GitHub] spark issue #17177: [SPARK-19834][SQL] csv escape of quote escape

2017-03-09 Thread jbax
Github user jbax commented on the issue:

https://github.com/apache/spark/pull/17177
  
Doesn't seem correct to me. All test cases are using broken CSV and trigger 
the parser handling of unescaped quotes, where it tries to rescue the data and 
produce something sensible. See my test case here: 
https://github.com/uniVocity/univocity-parsers/issues/143





[GitHub] spark issue #17237: [SPARK-19852][PYSPARK][ML] Update Python API setHandleIn...

2017-03-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17237
  
**[Test build #74305 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74305/testReport)**
 for PR 17237 at commit 
[`f1d9bcb`](https://github.com/apache/spark/commit/f1d9bcb3d615444a3c326f0b6b0f7999edecdf4f).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17237: [SPARK-19852][PYSPARK][ML] Update Python API setHandleIn...

2017-03-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17237
  
Merged build finished. Test FAILed.





[GitHub] spark issue #17237: [SPARK-19852][PYSPARK][ML] Update Python API setHandleIn...

2017-03-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17237
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74305/
Test FAILed.





[GitHub] spark issue #14033: [SPARK-16286][SQL] Implement stack table generating func...

2017-03-09 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/14033
  
just found out that we didn't implement a type coercion rule for `stack`...





[GitHub] spark issue #17226: [SPARK-19882][SQL] Pivot with null as a distinct pivot v...

2017-03-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17226
  
For me, LGTM





[GitHub] spark pull request #17188: [SPARK-19751][SQL] Throw an exception if bean cla...

2017-03-09 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/17188#discussion_r105343821
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala
 ---
@@ -69,7 +69,8 @@ object JavaTypeInference {
* @param typeToken Java type
* @return (SQL data type, nullable)
*/
-  private def inferDataType(typeToken: TypeToken[_]): (DataType, Boolean) 
= {
+  private def inferDataType(typeToken: TypeToken[_], seenTypeSet: 
Set[Class[_]] = Set.empty)
--- End diff --

You mean this case?
```
scala> :paste
case class classA(i: Int, cls: classB)
case class classB(cls: classA)

scala> Seq(classA(0, null)).toDS()
java.lang.StackOverflowError
  at scala.reflect.internal.Symbols$Symbol.info(Symbols.scala:1494)
  at scala.reflect.runtime.JavaMirrors$JavaMirror$$anon$1.scala$reflect$runtime$SynchronizedSymbols$SynchronizedSymbol$$super$info(JavaMirrors.scala:66)
  at scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$$anonfun$info$1.apply(SynchronizedSymbols.scala:127)
  at scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$$anonfun$info$1.apply(SynchronizedSymbols.scala:127)
  at scala.reflect.runtime.Gil$class.gilSynchronized(Gil.scala:19)
  at scala.reflect.runtime.JavaUniverse.gilSynchronized(JavaUniverse.scala:16)
  at scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$class.gilSynchronizedIfNotThreadsafe(SynchronizedSymbols.scala:123)
  at scala.reflect.runtime.JavaMirrors$JavaMirror$$anon$1.gilSynchronizedIfNotThreadsafe(JavaMirrors.scala:66)
  at scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$class.info(SynchronizedSymbols.scala:127)
  at scala.reflect.runtime.JavaMirrors$JavaMirror$$anon$1.info(JavaMirrors.scala:66)
  at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48)
  at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:45)
  at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:45)
  at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:45)
  at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:45)
```
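
For the Java-bean side that this PR actually guards, a minimal illustrative sketch (class names are made up): without a cycle check such as the `seenTypeSet` parameter above, inferring a schema for such a bean can recurse until the stack overflows.

```
import java.io.Serializable;
import org.apache.spark.sql.Encoders;

public class CircularBeans {
    // Two beans that reference each other -- the circular case SPARK-19751 rejects.
    public static class BeanA implements Serializable {
        private int i;
        private BeanB b;
        public int getI() { return i; }
        public void setI(int i) { this.i = i; }
        public BeanB getB() { return b; }
        public void setB(BeanB b) { this.b = b; }
    }

    public static class BeanB implements Serializable {
        private BeanA a;
        public BeanA getA() { return a; }
        public void setA(BeanA a) { this.a = a; }
    }

    public static void main(String[] args) {
        // Bean schema inference walks BeanA -> BeanB -> BeanA -> ...
        // With this patch the cycle should be detected and reported as an exception.
        Encoders.bean(BeanA.class);
    }
}
```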





[GitHub] spark issue #17238: getRackForHost returns None if host is unknown by driver

2017-03-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17238
  
Can one of the admins verify this patch?





[GitHub] spark pull request #17238: getRackForHost returns None if host is unknown by...

2017-03-09 Thread morenn520
GitHub user morenn520 opened a pull request:

https://github.com/apache/spark/pull/17238

getRackForHost returns None if host is unknown by driver

## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-19894

## How was this patch tested?

Tested on our production YARN cluster in yarn-cluster mode; applying this patch 
resolves our rack-locality problems.
Problem:
In our production YARN cluster, one node (the "missing-rack-info" node) is 
missing rack information for some other nodes. A Spark Streaming program 
(data source: Kafka, mode: yarn-cluster) runs its driver on this 
missing-rack-info node.
The nodes whose hosts are unknown to the driver node, and the Kafka broker node 
whose host is also unknown to YARN, are both resolved to "/default-rack" by the 
YARN scheduler, so all tasks end up being assigned to those nodes as 
RACK_LOCAL.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/morenn520/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17238.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17238


commit 6630e747efa52bff5ca48bb0a5610357c7754c10
Author: Chen Yuechen 
Date:   2017-03-10T07:24:48Z

getRackForHost returns None if host is unknown by driver







[GitHub] spark issue #17237: [SPARK-19852][PYSPARK][ML] Update Python API setHandleIn...

2017-03-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17237
  
**[Test build #74305 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74305/testReport)**
 for PR 17237 at commit 
[`f1d9bcb`](https://github.com/apache/spark/commit/f1d9bcb3d615444a3c326f0b6b0f7999edecdf4f).





[GitHub] spark pull request #17175: [SPARK-19468][SQL] Rewrite physical Project opera...

2017-03-09 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/17175#discussion_r105342800
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala
 ---
@@ -78,9 +78,42 @@ case class ProjectExec(projectList: 
Seq[NamedExpression], child: SparkPlan)
 }
   }
 
-  override def outputOrdering: Seq[SortOrder] = child.outputOrdering
--- End diff --

My original thought was to support the cases where 
`requiredChildDistribution` refers to nested fields created and aliased in 
`Project`, as in the example I showed.

The example you give is a different kind of case. To support it, if we 
want to, we would at least need to improve expression canonicalization and 
partitioning/distribution matching.






[GitHub] spark pull request #17224: [SPARK-19882][SQL] Pivot with null as the dictinc...

2017-03-09 Thread HyukjinKwon
Github user HyukjinKwon closed the pull request at:

https://github.com/apache/spark/pull/17224





[GitHub] spark issue #17224: [SPARK-19882][SQL] Pivot with null as the distinct pivot...

2017-03-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17224
  
I am closing this per 
https://github.com/apache/spark/pull/17226#issuecomment-285597434





[GitHub] spark issue #17226: [SPARK-19882][SQL] Pivot with null as a distinct pivot v...

2017-03-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17226
  
I see. So, the `count` behaviour in "**Spark 2.1.0** (and presumably 2.0.x/master)" 
was unexpectedly introduced by the optimization in SPARK-13749, and this behaviour 
change between 1.6 and master (whether it is right or not) has now been discovered 
along with it.

So.. if I understood correctly, several problems are mixed and found here:

 1. counting `null` problem (for both optimized and non-optimized)

 2. NPE (for optimized)

 3. `0` vs `null` for missing values in `count`

and this PR tries to fix 1. and 2., whereas mine tries to fix 1., 2., and a few 
specific cases in 3. (by avoiding the optimization as a workaround).

Okay, I am fine with closing mine (honestly, the initial version in my PR 
was almost identical to this PR).

Thanks for elaborating it and bearing with me.






[GitHub] spark pull request #17225: [CORE] Support ZStandard Compression

2017-03-09 Thread dongjinleekr
Github user dongjinleekr commented on a diff in the pull request:

https://github.com/apache/spark/pull/17225#discussion_r105342714
  
--- Diff: core/src/main/scala/org/apache/spark/io/CompressionCodec.scala ---
@@ -49,13 +50,14 @@ private[spark] object CompressionCodec {
 
   private[spark] def supportsConcatenationOfSerializedStreams(codec: 
CompressionCodec): Boolean = {
 (codec.isInstanceOf[SnappyCompressionCodec] || 
codec.isInstanceOf[LZFCompressionCodec]
-  || codec.isInstanceOf[LZ4CompressionCodec])
+  || codec.isInstanceOf[LZ4CompressionCodec] || 
codec.isInstanceOf[ZStandardCompressionCodec])
   }
 
   private val shortCompressionCodecNames = Map(
 "lz4" -> classOf[LZ4CompressionCodec].getName,
 "lzf" -> classOf[LZFCompressionCodec].getName,
-"snappy" -> classOf[SnappyCompressionCodec].getName)
+"snappy" -> classOf[SnappyCompressionCodec].getName,
+"zstd" -> classOf[SnappyCompressionCodec].getName)
--- End diff --

OMG, it is a typo.





[GitHub] spark pull request #17188: [SPARK-19751][SQL] Throw an exception if bean cla...

2017-03-09 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17188#discussion_r105342563
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala
 ---
@@ -69,7 +69,8 @@ object JavaTypeInference {
* @param typeToken Java type
* @return (SQL data type, nullable)
*/
-  private def inferDataType(typeToken: TypeToken[_]): (DataType, Boolean) 
= {
+  private def inferDataType(typeToken: TypeToken[_], seenTypeSet: 
Set[Class[_]] = Set.empty)
--- End diff --

does scala case class have the same problem?





[GitHub] spark issue #17188: [SPARK-19751][SQL] Throw an exception if bean class has ...

2017-03-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17188
  
**[Test build #74304 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74304/testReport)**
 for PR 17188 at commit 
[`5e519b1`](https://github.com/apache/spark/commit/5e519b180d7905e9c4e60f708db11ce6ccf86866).





[GitHub] spark issue #17188: [SPARK-19751][SQL] Throw an exception if bean class has ...

2017-03-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17188
  
**[Test build #74303 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74303/testReport)**
 for PR 17188 at commit 
[`9abe861`](https://github.com/apache/spark/commit/9abe861f8153431d30c06c42243adbf346a85772).





[GitHub] spark issue #17237: [SPARK-19852][PYSPARK][ML] Update Python API setHandleIn...

2017-03-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17237
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74301/
Test FAILed.





[GitHub] spark issue #17237: [SPARK-19852][PYSPARK][ML] Update Python API setHandleIn...

2017-03-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17237
  
**[Test build #74301 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74301/testReport)**
 for PR 17237 at commit 
[`d94dc68`](https://github.com/apache/spark/commit/d94dc68a8c1a5c082cf3de6c7e4d429bfd24d817).
 * This patch **fails Python style tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class StringIndexer(JavaEstimator, HasInputCol, HasOutputCol, 
JavaMLReadable, JavaMLWritable):`





[GitHub] spark issue #17237: [SPARK-19852][PYSPARK][ML] Update Python API setHandleIn...

2017-03-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17237
  
Merged build finished. Test FAILed.





[GitHub] spark issue #17237: [SPARK-19852][PYSPARK][ML] Update Python API setHandleIn...

2017-03-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17237
  
**[Test build #74301 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74301/testReport)**
 for PR 17237 at commit 
[`d94dc68`](https://github.com/apache/spark/commit/d94dc68a8c1a5c082cf3de6c7e4d429bfd24d817).





[GitHub] spark issue #17188: [SPARK-19751][SQL] Throw an exception if bean class has ...

2017-03-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17188
  
**[Test build #74302 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74302/testReport)**
 for PR 17188 at commit 
[`b7eba26`](https://github.com/apache/spark/commit/b7eba26f5131ab0c99142aca3ab46e868226026e).





[GitHub] spark pull request #17231: [SPARK-19891][SS] Await Batch Lock notified on st...

2017-03-09 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/17231





[GitHub] spark pull request #17237: [SPARK-19852][PYSPARK][ML] Update Python API setH...

2017-03-09 Thread VinceShieh
GitHub user VinceShieh opened a pull request:

https://github.com/apache/spark/pull/17237

[SPARK-19852][PYSPARK][ML] Update Python API setHandleInvalid for 
StringIndexer

## What changes were proposed in this pull request?

This PR maintains API parity with the changes made in SPARK-17498, supporting 
the new option 'keep' in StringIndexer for handling unseen labels from PySpark.
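
For reference, a hedged Java sketch of the Scala/Java-side behaviour this PR ports to the Python API (`trainingDf` and `testDf` are assumed Dataset<Row> values with a "category" column):

```
import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.ml.feature.StringIndexerModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// With handleInvalid = "keep", labels unseen during fit() are mapped to an extra
// index at transform() time instead of raising an error.
StringIndexerModel model = new StringIndexer()
    .setInputCol("category")
    .setOutputCol("categoryIndex")
    .setHandleInvalid("keep")   // other values: "error" (default), "skip"
    .fit(trainingDf);

Dataset<Row> indexed = model.transform(testDf);  // testDf may contain unseen categories
```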

## How was this patch tested?
Existing tests; additional testing is done with new doctests.

Signed-off-by: VinceShieh 


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/VinceShieh/spark spark-19852

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17237.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17237


commit d94dc68a8c1a5c082cf3de6c7e4d429bfd24d817
Author: VinceShieh 
Date:   2017-03-10T06:50:41Z

[SPARK-19852][PYSPARK][ML] Update Python API for StringIndexer 
setHandleInvalid

This PR reflects the changes made in SPARK-17498 on the PySpark side, supporting 
the new option
'keep' in StringIndexer to handle unseen labels

Signed-off-by: VinceShieh 







[GitHub] spark issue #17231: [SPARK-19891][SS] Await Batch Lock notified on stream ex...

2017-03-09 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/17231
  
LGTM. Merging to master and 2.1.





[GitHub] spark pull request #17172: [SPARK-19008][SQL] Improve performance of Dataset...

2017-03-09 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/17172





[GitHub] spark issue #17172: [SPARK-19008][SQL] Improve performance of Dataset.map by...

2017-03-09 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/17172
  
cool! merging to master!





[GitHub] spark issue #17236: [SPARK-xxxx][SQL] Cannot run intersect/except with map t...

2017-03-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17236
  
**[Test build #74300 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74300/testReport)**
 for PR 17236 at commit 
[`d670a11`](https://github.com/apache/spark/commit/d670a11b8391c4e44e699372c567980fa9d29fed).





[GitHub] spark issue #17236: [SPARK-xxxx][SQL] Cannot run intersect/except with map t...

2017-03-09 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/17236
  
cc @yhuai @sameeragarwal @hvanhovell 





[GitHub] spark pull request #17236: [SPARK-xxxx][SQL] Cannot run intersect/except wit...

2017-03-09 Thread cloud-fan
GitHub user cloud-fan opened a pull request:

https://github.com/apache/spark/pull/17236

[SPARK-][SQL] Cannot run intersect/except with map type

## What changes were proposed in this pull request?

In Spark SQL, the map type cannot be used in equality tests/comparisons, and 
`Intersect`/`Except` need equality tests on all columns, so we should not allow 
map-type columns in `Intersect`/`Except`.
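
A minimal sketch of the situation (assuming a SparkSession named `spark`); after this change, analysis should fail with an error instead of attempting an equality comparison on map columns:

```
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Both relations expose a map-typed column.
Dataset<Row> left  = spark.sql("SELECT map('a', 1) AS m");
Dataset<Row> right = spark.sql("SELECT map('a', 1) AS m");

// Intersect/Except require equality tests on every column, which is undefined
// for map type, so this should now be rejected at analysis time.
left.intersect(right);
```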

## How was this patch tested?

new regression test

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cloud-fan/spark map

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17236.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17236


commit d670a11b8391c4e44e699372c567980fa9d29fed
Author: Wenchen Fan 
Date:   2017-03-10T06:41:53Z

Cannot run intersect/except with map type







[GitHub] spark issue #17138: [SPARK-17080] [SQL] join reorder

2017-03-09 Thread wzhfy
Github user wzhfy commented on the issue:

https://github.com/apache/spark/pull/17138
  
@nsyca Yes, you're right. There's still much room for optimization. We will 
improve Spark's optimizer gradually :)





[GitHub] spark issue #17109: [SPARK-19740][MESOS]Add support in Spark to pass arbitra...

2017-03-09 Thread tnachen
Github user tnachen commented on the issue:

https://github.com/apache/spark/pull/17109
  
@srowen @mgummelt PTAL





[GitHub] spark pull request #17192: [SPARK-19849][SQL] Support ArrayType in to_json t...

2017-03-09 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17192#discussion_r105337154
  
--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
@@ -1339,6 +1339,11 @@ test_that("column functions", {
   expect_equal(collect(select(df, bround(df$x, 0)))[[1]][2], 4)
 
   # Test to_json(), from_json()
+  arr <- list(listToStruct(list("name" = "bob")))
+  df <- as.DataFrame(list(listToStruct(list("people" = arr))))
+  j <- collect(select(df, alias(to_json(df$people), "json")))
+  expect_equal(j[order(j$json), ][1], "[{\"name\":\"bob\"}]")
+
--- End diff --

Let me propose this with it for now and try to find another way until it is 
merged.





[GitHub] spark pull request #17177: [SPARK-19834][SQL] csv escape of quote escape

2017-03-09 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17177#discussion_r105337093
  
--- Diff: python/pyspark/sql/readwriter.py ---
@@ -693,8 +697,8 @@ def text(self, path, compression=None):
 
 @since(2.0)
 def csv(self, path, mode=None, compression=None, sep=None, quote=None, 
escape=None,
-header=None, nullValue=None, escapeQuotes=None, quoteAll=None, 
dateFormat=None,
-timestampFormat=None, timeZone=None):
+escapeQuoteEscaping=None, header=None, nullValue=None, 
escapeQuotes=None, quoteAll=None,
--- End diff --

Oh, @ep1804, we should add the new option at the end of this argument list; 
otherwise it breaks existing code written against lower versions of this API 
that passes these options positionally (not as keyword arguments).





[GitHub] spark issue #17164: [SPARK-16844][SQL] Support codegen for sort-based aggrea...

2017-03-09 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/17164
  
@hvanhovell ping





[GitHub] spark pull request #17192: [SPARK-19849][SQL] Support ArrayType in to_json t...

2017-03-09 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/17192#discussion_r105336865
  
--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
@@ -1339,6 +1339,11 @@ test_that("column functions", {
   expect_equal(collect(select(df, bround(df$x, 0)))[[1]][2], 4)
 
   # Test to_json(), from_json()
+  arr <- list(listToStruct(list("name" = "bob")))
+  df <- as.DataFrame(list(listToStruct(list("people" = arr))))
+  j <- collect(select(df, alias(to_json(df$people), "json")))
+  expect_equal(j[order(j$json), ][1], "[{\"name\":\"bob\"}]")
+
--- End diff --

yes `listToStruct` is internal and it's mucking with types (though it's 
legitimate to do in R)





[GitHub] spark pull request #17192: [SPARK-19849][SQL] Support ArrayType in to_json t...

2017-03-09 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/17192#discussion_r105336776
  
--- Diff: R/pkg/inst/tests/testthat/test_sparkSQL.R ---
@@ -1339,6 +1339,11 @@ test_that("column functions", {
   expect_equal(collect(select(df, bround(df$x, 0)))[[1]][2], 4)
 
   # Test to_json(), from_json()
+  arr <- list(listToStruct(list("name" = "bob")))
+  df <- as.DataFrame(list(listToStruct(list("people" = arr))))
+  j <- collect(select(df, alias(to_json(df$people), "json")))
+  expect_equal(j[order(j$json), ][1], "[{\"name\":\"bob\"}]")
+
--- End diff --

possibly if it does





[GitHub] spark pull request #17234: [SPARK-19892][MLlib] Implement findAnalogies meth...

2017-03-09 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/17234#discussion_r105336208
  
--- Diff: R/pkg/DESCRIPTION ---
@@ -54,5 +54,5 @@ Collate:
 'types.R'
 'utils.R'
 'window.R'
-RoxygenNote: 5.0.1
+RoxygenNote: 6.0.1
--- End diff --

please revert this.





[GitHub] spark issue #17109: [SPARK-19740][MESOS]Add support in Spark to pass arbitra...

2017-03-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17109
  
**[Test build #74299 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74299/testReport)**
 for PR 17109 at commit 
[`03e89eb`](https://github.com/apache/spark/commit/03e89eb2bae3b08207429a6b772d7dcae45b554c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17109: [SPARK-19740][MESOS]Add support in Spark to pass arbitra...

2017-03-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17109
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17109: [SPARK-19740][MESOS]Add support in Spark to pass arbitra...

2017-03-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17109
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74299/
Test PASSed.





[GitHub] spark issue #17170: [SPARK-19825][R][ML] spark.ml R API for FPGrowth

2017-03-09 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/17170
  
I can see your point, but renaming it only on the R side is not really 
addressing the issue.
Please feel free to open a JIRA on spark.ml FPGrowth and start a discussion 
there.





[GitHub] spark issue #17109: [SPARK-19740][MESOS]Add support in Spark to pass arbitra...

2017-03-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17109
  
**[Test build #74299 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74299/testReport)**
 for PR 17109 at commit 
[`03e89eb`](https://github.com/apache/spark/commit/03e89eb2bae3b08207429a6b772d7dcae45b554c).





[GitHub] spark issue #17177: [SPARK-19834][SQL] csv escape of quote escape

2017-03-09 Thread ep1804
Github user ep1804 commented on the issue:

https://github.com/apache/spark/pull/17177
  
Documentation for DataFrameReader, DataFrameWriter, DataStreamReader, 
readwriter.py and streaming.py has been written. Please check.





[GitHub] spark issue #17232: [SPARK-18112] [SQL] Support reading data from Hive 2.1 m...

2017-03-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17232
  
Merged build finished. Test FAILed.





[GitHub] spark issue #17232: [SPARK-18112] [SQL] Support reading data from Hive 2.1 m...

2017-03-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17232
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74298/
Test FAILed.





[GitHub] spark issue #17232: [SPARK-18112] [SQL] Support reading data from Hive 2.1 m...

2017-03-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17232
  
**[Test build #74298 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74298/testReport)**
 for PR 17232 at commit 
[`af81cee`](https://github.com/apache/spark/commit/af81cee9f54abc13d7d07a12e4b499e49cd0dbcb).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17226: [SPARK-19882][SQL] Pivot with null as a distinct pivot v...

2017-03-09 Thread aray
Github user aray commented on the issue:

https://github.com/apache/spark/pull/17226
  
@HyukjinKwon There is an inconsistency/regression, but it's not being 
introduced in this PR; it's already there. Take an example without null as a 
pivot column value, like below. The only difference is for the `count(1)` 
aggregate on cells with no values aggregated in the pivot table. Again, I don't 
think it's clear which is "correct" here.

**Spark 2.1.0** (and presumably 2.0.x/master)
```
scala> Seq(1,2).toDF("a").groupBy("a").pivot("a").count().show
+---+++
|  a|   1|   2|
+---+++
|  1|   1|null|
|  2|null|   1|
+---+++
scala> Seq(1,2).toDF("a").groupBy("a").pivot("a").sum("a").show
+---+++
|  a|   1|   2|
+---+++
|  1|   1|null|
|  2|null|   2|
+---+++
```

**Spark 1.6.0**
```
scala> sc.parallelize(Seq(1,2)).toDF("a").groupBy("a").pivot("a").count().show
+---+---+---+
|  a|  1|  2|
+---+---+---+
|  1|  1|  0|
|  2|  0|  1|
+---+---+---+
scala> sc.parallelize(Seq(1,2)).toDF("a").groupBy("a").pivot("a").sum("a").show
+---+++
|  a|   1|   2|
+---+++
|  1|   1|null|
|  2|null|   2|
+---+++
```





[GitHub] spark issue #17235: [SPARK-19320][MESOS][WIP]allow specifying a hard limit o...

2017-03-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17235
  
Can one of the admins verify this patch?





[GitHub] spark pull request #17235: [SPARK-19320][MESOS][WIP]allow specifying a hard ...

2017-03-09 Thread yanji84
GitHub user yanji84 opened a pull request:

https://github.com/apache/spark/pull/17235

[SPARK-19320][MESOS][WIP]allow specifying a hard limit on number of gpus 
required in each spark executor when running on mesos

## What changes were proposed in this pull request?

Currently, Spark only allows specifying GPU resources as an upper limit. This 
adds a new conf parameter that allows specifying a hard limit on the number of 
GPU cores; if this hard limit is greater than 0, it overrides the effect of 
spark.mesos.gpus.max.

## How was this patch tested?

Tests pending

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yanji84/spark ji/hard_limit_on_gpu

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17235.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17235


commit 5f8ccd5789137363e035d1dfb9a05d3b9bf3ce6b
Author: Ji Yan 
Date:   2017-03-10T05:30:11Z

respect both gpu and maxgpu







[GitHub] spark issue #17234: [SPARK-19892][MLlib] Implement findAnalogies method for ...

2017-03-09 Thread benradford
Github user benradford commented on the issue:

https://github.com/apache/spark/pull/17234
  
ok to test
Jenkins, add to whitelist





[GitHub] spark issue #17234: [SPARK-19892][MLlib] Implement findAnalogies method for ...

2017-03-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17234
  
Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17234: [SPARK-19892][MLlib] Implement findAnalogies meth...

2017-03-09 Thread benradford
GitHub user benradford opened a pull request:

https://github.com/apache/spark/pull/17234

[SPARK-19892][MLlib] Implement findAnalogies method for Word2VecModel

## What changes were proposed in this pull request?

Added findAnalogies method to Word2VecModel for performing 
vector-algebra-based queries (e.g. King + Woman - Man).

## How was this patch tested?

Followed the contributor's guide for Spark and ran the run-tests. Compiled 
and tested functionality in spark-shell.

This is an original work that I license to the project under the project's 
open source license.
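
A minimal sketch of the kind of query `findAnalogies` is meant to answer, 
composed here from the existing RDD-based `Word2VecModel` API; `model` is 
assumed to be an already trained model and the helper name is made up, not 
the PR's actual method signature:

```scala
import org.apache.spark.mllib.feature.Word2VecModel
import org.apache.spark.mllib.linalg.Vectors

// Sketch of the underlying vector algebra: king - man + woman ~ queen.
def analogy(model: Word2VecModel, a: String, b: String, c: String, num: Int) = {
  val va = model.transform(a).toArray   // e.g. "king"
  val vb = model.transform(b).toArray   // e.g. "man"
  val vc = model.transform(c).toArray   // e.g. "woman"
  // Combine the word vectors element-wise and look up the nearest words.
  val query = Vectors.dense(va.indices.map(i => va(i) - vb(i) + vc(i)).toArray)
  model.findSynonyms(query, num)
}
```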

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/benradford/spark feature/findAnalogies

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17234.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17234


commit 2e7f1a3bd519d79ce9b08d388247e9a1d7f67635
Author: Benjamin Radford 
Date:   2017-03-10T04:42:33Z

Added findAnalogies method to Word2VecModel

commit 9aefebfcd2e6eaad117727901ad70d0d26b03a1a
Author: Benjamin Radford 
Date:   2017-03-10T05:16:46Z

Fixed comment indentation to conform to style guide.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17174: [SPARK-19145][SQL] Timestamp to String casting is...

2017-03-09 Thread tanejagagan
Github user tanejagagan commented on a diff in the pull request:

https://github.com/apache/spark/pull/17174#discussion_r105333124
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
 ---
@@ -324,14 +324,22 @@ object TypeCoercion {
   // We should cast all relative timestamp/date/string comparison into 
string comparisons
   // This behaves as a user would expect because timestamp strings 
sort lexicographically.
   // i.e. TimeStamp(2013-01-01 00:00 ...) < "2014" = true
-  case p @ BinaryComparison(left @ StringType(), right @ DateType()) =>
-p.makeCopy(Array(left, Cast(right, StringType)))
-  case p @ BinaryComparison(left @ DateType(), right @ StringType()) =>
-p.makeCopy(Array(Cast(left, StringType), right))
-  case p @ BinaryComparison(left @ StringType(), right @ 
TimestampType()) =>
-p.makeCopy(Array(left, Cast(right, StringType)))
-  case p @ BinaryComparison(left @ TimestampType(), right @ 
StringType()) =>
-p.makeCopy(Array(Cast(left, StringType), right))
+  // If StringType is foldable then we need to cast String to Date or 
Timestamp type
+  // which would give order of magnitude performance gain as well as 
preserve the behavior
+  // achieved by expressed above
+  // TimeStamp(2013-01-01 00:00 ...) < Cast( "2014" as timestamp) = 
true
+  case p @ BinaryComparison(left @ StringType(), right) if 
dateOrTimestampType(right) =>
+if (left.foldable) {
+  p.makeCopy(Array(Cast(left, right.dataType), right))
--- End diff --

Yes. You can explicitly cast the string to timestamp and the speedup is 
substantial. By default, without the cast, the query silently runs fine but 
picks a very bad plan, with no indication to the user whatsoever, and is about 
an order of magnitude slower.
Some related comparisons, such as `time < 'abc'`, will also run just fine; I 
think those should fail fast and let the user know about the casting issue.
Another problem is with BI tools that generate these SQLs, where the user has 
no direct control over the SQL.
We came across this issue when the same query in Impala was running 10 times 
faster than in Spark; investigating that led to this bug report and therefore 
this fix. A minimal sketch of the two query shapes is below.
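
A minimal sketch of the two query shapes being compared, assuming a 
SparkSession named `spark`; the table and column names are made up for 
illustration:

```scala
// Without a cast, the timestamp column is coerced to string and compared
// lexicographically, which silently produces the slow plan described above:
spark.sql("SELECT * FROM events WHERE event_time < '2014-01-01'")

// Casting the foldable string literal to timestamp keeps the comparison in
// the timestamp domain, so the planner can choose a much better plan:
spark.sql("SELECT * FROM events WHERE event_time < CAST('2014-01-01' AS TIMESTAMP)")
```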


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16952: [SPARK-19620][SQL]Fix incorrect exchange coordinator id ...

2017-03-09 Thread carsonwang
Github user carsonwang commented on the issue:

https://github.com/apache/spark/pull/16952
  
@gatorsmile  @cloud-fan @yhuai , can you help review and merge this minor 
one line fix? The code change itself is straightforward.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17123: [SPARK-19781][ML] Handle NULLs as well as NaNs in Bucket...

2017-03-09 Thread crackcell
Github user crackcell commented on the issue:

https://github.com/apache/spark/pull/17123
  
@cloud-fan Would you please review my code again? I'm now using `Option` to 
handle NULLs. :-)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17177: [SPARK-19834][SQL] csv escape of quote escape

2017-03-09 Thread ep1804
Github user ep1804 commented on the issue:

https://github.com/apache/spark/pull/17177
  
An issue is raised for uniVocity parser: 
https://github.com/uniVocity/univocity-parsers/issues/143


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17233: [SPARK-11569][ML] Fix StringIndexer to handle nul...

2017-03-09 Thread crackcell
GitHub user crackcell opened a pull request:

https://github.com/apache/spark/pull/17233

[SPARK-11569][ML] Fix StringIndexer to handle null value properly

## What changes were proposed in this pull request?

This PR enhances StringIndexer with NULL value handling.

Before this PR, StringIndexer throws an exception when it encounters NULL 
values.
With this PR:
- handleInvalid=error: throw an exception, as before
- handleInvalid=skip: skip null values as well as unseen labels
- handleInvalid=keep: give null values an additional index, as for unseen 
labels (see the sketch below)
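
A minimal usage sketch of the proposed behavior, assuming a SparkSession 
named `spark`; the data and column names are made up:

```scala
import org.apache.spark.ml.feature.StringIndexer

// "category" contains a null; with the proposed handleInvalid = "keep",
// that null (like an unseen label) gets its own extra index instead of
// causing an exception.
val df = spark.createDataFrame(Seq(
  (0, "a"), (1, "b"), (2, null), (3, "a")
)).toDF("id", "category")

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .setHandleInvalid("keep")

indexer.fit(df).transform(df).show()
```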

BTW, I noticed someone was trying to solve the same problem ( #9920 ), but it 
seems to have had no progress or response for a long time. Would you mind 
giving me a chance to solve it?

## How was this patch tested?

new unit tests

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/crackcell/spark 11569_StringIndexer_NULL

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17233.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17233


commit 75e3975597aa6271f4f8ab688922edda88b03045
Author: Menglong TAN 
Date:   2017-03-08T03:50:17Z

Merge pull request #1 from apache/master

merge master to my repo

commit 79d706085e8371fb1724ce73377767c38d551e5d
Author: Menglong TAN 
Date:   2017-03-10T04:45:56Z

Enhance StringIndexer with NULL values

commit 0cb121c65f592b9623bdeef2746d7c2a3c281ae1
Author: Menglong TAN 
Date:   2017-03-10T04:52:30Z

filter out NULLs when transform dataset




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17177: [SPARK-19834][SQL] csv escape of quote escape

2017-03-09 Thread ep1804
Github user ep1804 commented on the issue:

https://github.com/apache/spark/pull/17177
  
Thank you @HyukjinKwon .

I made changes following your comments:
 * `escapeQuoteEscaping` instead of `escapeEscape` (a usage sketch follows below)
 * default value set to `\u` (unset)
 * `withTempPath`
 * no `orderBy`
 * `checkAnswer`
 * styles : )
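
A minimal sketch of how the option under discussion might be used when 
reading CSV, assuming a SparkSession named `spark`; the option name 
`escapeQuoteEscaping` is the one proposed in this PR and the file path is 
made up:

```scala
// Reading CSV where the escape character itself is escaped by a dedicated
// character, as supported by the underlying uniVocity parser settings.
val df = spark.read
  .option("quote", "\"")
  .option("escape", "\\")
  .option("escapeQuoteEscaping", "\\") // proposed option: escapes the escape character inside quotes
  .csv("/tmp/sample.csv")
```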


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17172: [SPARK-19008][SQL] Improve performance of Dataset.map by...

2017-03-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17172
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74297/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17172: [SPARK-19008][SQL] Improve performance of Dataset.map by...

2017-03-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17172
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17172: [SPARK-19008][SQL] Improve performance of Dataset.map by...

2017-03-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17172
  
**[Test build #74297 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74297/testReport)**
 for PR 17172 at commit 
[`b25b191`](https://github.com/apache/spark/commit/b25b191687259303df5ab2fad0c64687a88de5bd).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17172: [SPARK-19008][SQL] Improve performance of Dataset.map by...

2017-03-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17172
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17172: [SPARK-19008][SQL] Improve performance of Dataset.map by...

2017-03-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17172
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74296/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17172: [SPARK-19008][SQL] Improve performance of Dataset.map by...

2017-03-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17172
  
**[Test build #74296 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74296/testReport)**
 for PR 17172 at commit 
[`200cec7`](https://github.com/apache/spark/commit/200cec783f33de21d9895f90161a9d11877d0877).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17226: [SPARK-19882][SQL] Pivot with null as a distinct pivot v...

2017-03-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17226
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17226: [SPARK-19882][SQL] Pivot with null as a distinct pivot v...

2017-03-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17226
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74295/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17226: [SPARK-19882][SQL] Pivot with null as a distinct pivot v...

2017-03-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17226
  
**[Test build #74295 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74295/testReport)**
 for PR 17226 at commit 
[`a81c062`](https://github.com/apache/spark/commit/a81c06201259b246c7b9e8b56ecc4183e6279410).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17232: [SPARK-18112] [SQL] Support reading data from Hive 2.1 m...

2017-03-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17232
  
**[Test build #74298 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74298/testReport)**
 for PR 17232 at commit 
[`af81cee`](https://github.com/apache/spark/commit/af81cee9f54abc13d7d07a12e4b499e49cd0dbcb).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17232: [SPARK-18112] [SQL] Support reading data from Hiv...

2017-03-09 Thread gatorsmile
GitHub user gatorsmile opened a pull request:

https://github.com/apache/spark/pull/17232

[SPARK-18112] [SQL] Support reading data from Hive 2.1 metastore [WIP]

### What changes were proposed in this pull request?
This PR is to support reading data from the Hive 2.1 metastore. We need to 
update the shim class because of the Hive API changes caused by the following 
two Hive JIRAs:
- [HIVE-12730 MetadataUpdater: provide a mechanism to edit the basic 
statistics of a table (or a 
partition)](https://issues.apache.org/jira/browse/HIVE-12730)
- [Hive-13341 Stats state is not captured correctly: differentiate load 
table and create table](https://issues.apache.org/jira/browse/HIVE-13341)

Two new fields have been added in Hive:
- `EnvironmentContext environmentContext`. So far, this is always set to 
`null`. It was introduced to support the DDL `alter table s update statistics 
set ('numRows'='NaN')`, which lets users specify the statistics directly. So 
far, Spark SQL does not need it, because we use different table properties to 
store our generated statistics values. However, when Spark SQL issues ALTER 
TABLE DDL statements, the Hive metastore always automatically invalidates the 
Hive-generated statistics. In a follow-up PR, we can fix this by explicitly 
adding a property to `environmentContext`.
```JAVA
putToProperties(StatsSetupConst.STATS_GENERATED, StatsSetupConst.USER)
```
- `boolean hasFollowingStatsTask`. We always set it to `false`. TODO: more 
investigation about this

### How was this patch tested?
Added test cases to VersionsSuite.scala

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gatorsmile/spark Hive21

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17232.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17232


commit af81cee9f54abc13d7d07a12e4b499e49cd0dbcb
Author: Xiao Li 
Date:   2017-03-10T03:35:38Z

fix




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17231: [SPARK-19891][SS] Await Batch Lock notified on stream ex...

2017-03-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17231
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17231: [SPARK-19891][SS] Await Batch Lock notified on stream ex...

2017-03-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17231
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17231: [SPARK-19891][SS] Await Batch Lock notified on stream ex...

2017-03-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17231
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74293/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17231: [SPARK-19891][SS] Await Batch Lock notified on stream ex...

2017-03-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17231
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74294/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17226: [SPARK-19882][SQL] Pivot with null as a distinct pivot v...

2017-03-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17226
  
> we're not introducing a regression in this PR by fixing the NPE, the answer 
given by 1.6 was incorrect under any interpretation

Right, but if it was a bug, then this PR introduces an inconsistency between 
the optimized and non-optimized paths. We should fix both together or not at 
all.

> there is a completely separate issue of what the proper value of count(1) 
on no values should be in a pivot that does not depend at all on nulls in the 
pivot column.

Then why are we partially fixing a completely separate issue here?




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17231: [SPARK-19891][SS] Await Batch Lock notified on stream ex...

2017-03-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17231
  
**[Test build #74293 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74293/testReport)**
 for PR 17231 at commit 
[`c49e183`](https://github.com/apache/spark/commit/c49e183bf391ad0d9352400b19a2e191a3f0be11).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17231: [SPARK-19891][SS] Await Batch Lock notified on stream ex...

2017-03-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17231
  
**[Test build #74294 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74294/testReport)**
 for PR 17231 at commit 
[`8170792`](https://github.com/apache/spark/commit/817079234203dc428032c5a110a03f43ec7a813f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17231: [SPARK-19891][SS] Await Batch Lock notified on stream ex...

2017-03-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17231
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17231: [SPARK-19891][SS] Await Batch Lock notified on stream ex...

2017-03-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17231
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74292/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17231: [SPARK-19891][SS] Await Batch Lock notified on stream ex...

2017-03-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17231
  
**[Test build #74292 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74292/testReport)**
 for PR 17231 at commit 
[`a23162a`](https://github.com/apache/spark/commit/a23162ac7322af2b662e870ea4f889f19d4a8b2c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17226: [SPARK-19882][SQL] Pivot with null as a distinct pivot v...

2017-03-09 Thread aray
Github user aray commented on the issue:

https://github.com/apache/spark/pull/17226
  
@HyukjinKwon we're not introducing a regression in this PR by fixing the NPE; 
the answer given by 1.6 was incorrect under any interpretation. Again, there 
is a completely separate issue of what the proper value of count(1) on no 
aggregated values should be in a pivot, which does not depend at all on nulls 
in the pivot column.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17224: [SPARK-19882][SQL] Pivot with null as the dictinc...

2017-03-09 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17224#discussion_r105327492
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -524,15 +529,21 @@ class Analyzer(
 def ifExpr(expr: Expression) = {
   If(EqualTo(pivotColumn, value), expr, Literal(null))
 }
+def ifNullSafeExpr(expr: Expression) = {
+  If(EqualNullSafe(pivotColumn, value), expr, Literal(null))
+}
 aggregates.map { aggregate =>
   val filteredAggregate = aggregate.transformDown {
 // Assumption is the aggregate function ignores nulls. 
This is true for all current
-// AggregateFunction's with the exception of First and 
Last in their default mode
-// (which we handle) and possibly some Hive UDAF's.
+// AggregateFunction's with the exception of First, Last 
and Count in their
+// default mode (which we handle) and possibly some Hive 
UDAF's.
 case First(expr, _) =>
   First(ifExpr(expr), Literal(true))
 case Last(expr, _) =>
   Last(ifExpr(expr), Literal(true))
+case c: Count =>
+  // In case of count, `null` should be counted.
+  c.withNewChildren(c.children.map(ifNullSafeExpr))
--- End diff --

Let me update this path as soon as we decide what we want in another PR for 
this JIRA.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17226: [SPARK-19882][SQL] Pivot with null as a distinct pivot v...

2017-03-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17226
  
cc @cloud-fan and @yhuai could you pick up one of them? Let me follow.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17226: [SPARK-19882][SQL] Pivot with null as a distinct pivot v...

2017-03-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17226
  
> Spark 2.0+ with PivotFirst gives a NPE when one of the pivot column values 
is null. The main thing fixed in this PR.

I meant to say it is not fully fixed, because it no longer throws an NPE but 
now introduces a regression.
Why don't we fix the NPE and resolve the regression first, and then add the 
optimization for it?

I think we have two options:

- Deal with this case in both the optimized and non-optimized paths; the 
regression/inconsistency this introduces between them should be a separate 
JIRA. In this case, let me close mine.

- Fall back to the non-optimized path and leave a JIRA to put this in the 
optimized path. In this case, I think this PR should be closed.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17226: [SPARK-19882][SQL] Pivot with null as a distinct pivot v...

2017-03-09 Thread aray
Github user aray commented on the issue:

https://github.com/apache/spark/pull/17226
  
BTW, for point 3 above, if we decide it should be 0, we can add an initial 
value to `PivotFirst` to make the fix.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17226: [SPARK-19882][SQL] Pivot with null as a distinct pivot v...

2017-03-09 Thread aray
Github user aray commented on the issue:

https://github.com/apache/spark/pull/17226
  
There are three things going on here in your one example.

1. Spark 1.6 (the first version with pivot), and Spark 2.0+ with an aggregate 
output type unsupported by PivotFirst, gives incorrect answers when one of the 
pivot column values is null (this only affects the 'null' column). It is fixed 
by doing a null-safe equals in the injected if statement: 
https://github.com/apache/spark/pull/17226/files#diff-57b3d87be744b7d79a9beacf8e5e5eb2R525

2. Spark 2.0+ with PivotFirst gives an NPE when one of the pivot column 
values is null. This is the main thing fixed in this PR.

3. There is an inconsistency between Spark 1.6 and 2.0+ in the result of a 
pivot with a `count(1)` aggregate when no values are aggregated for a cell. 
This is separate from the issues above, and it's not clear which version is 
naturally correct (pandas leaves those values as null, Oracle 11g gives 0, and 
I need to test others). A minimal example of the case follows below.
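
A minimal sketch of the case being discussed, assuming a SparkSession named 
`spark`; the data is made up:

```scala
import org.apache.spark.sql.functions.{count, lit}

// "b" contains a null, so the pivoted result gets a "null" column (points 1/2),
// and cells with no matching rows expose the count(1) question in point 3
// (null vs. 0, depending on the version and code path).
val df = spark.createDataFrame(Seq(
  (1, Some("x")), (2, None), (3, Some("x"))
)).toDF("a", "b")

df.groupBy("a").pivot("b").agg(count(lit(1))).show()
```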


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17226: [SPARK-19882][SQL] Pivot with null as a distinct ...

2017-03-09 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17226#discussion_r105325964
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -522,7 +522,7 @@ class Analyzer(
 } else {
   val pivotAggregates: Seq[NamedExpression] = pivotValues.flatMap 
{ value =>
 def ifExpr(expr: Expression) = {
-  If(EqualTo(pivotColumn, value), expr, Literal(null))
+  If(EqualNullSafe(pivotColumn, value), expr, Literal(null))
--- End diff --

Ah, yes. You are right. I think I was mistaken.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15821: [SPARK-13534][PySpark] Using Apache Arrow to incr...

2017-03-09 Thread julienledem
Github user julienledem commented on a diff in the pull request:

https://github.com/apache/spark/pull/15821#discussion_r105323172
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/ArrowConverters.scala ---
@@ -0,0 +1,411 @@
+/*
+* Licensed to the Apache Software Foundation (ASF) under one or more
+* contributor license agreements.  See the NOTICE file distributed with
+* this work for additional information regarding copyright ownership.
+* The ASF licenses this file to You under the Apache License, Version 2.0
+* (the "License"); you may not use this file except in compliance with
+* the License.  You may obtain a copy of the License at
+*
+*http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+package org.apache.spark.sql
+
+import java.io.ByteArrayOutputStream
+import java.nio.ByteBuffer
+import java.nio.channels.{Channels, SeekableByteChannel}
+
+import scala.collection.JavaConverters._
+
+import io.netty.buffer.ArrowBuf
+import org.apache.arrow.memory.{BaseAllocator, RootAllocator}
+import org.apache.arrow.vector._
+import org.apache.arrow.vector.BaseValueVector.BaseMutator
+import org.apache.arrow.vector.file.{ArrowReader, ArrowWriter}
+import org.apache.arrow.vector.schema.{ArrowFieldNode, ArrowRecordBatch}
+import org.apache.arrow.vector.types.{FloatingPointPrecision, TimeUnit}
+import org.apache.arrow.vector.types.pojo.{ArrowType, Field, Schema}
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.types._
+
+
+/**
+ * ArrowReader requires a seekable byte channel.
+ * NOTE - this is taken from test org.apache.vector.file, see about moving 
to public util pkg
+ */
+private[sql] class ByteArrayReadableSeekableByteChannel(var byteArray: 
Array[Byte])
+  extends SeekableByteChannel {
+  var _position: Long = 0L
+
+  override def isOpen: Boolean = {
+byteArray != null
+  }
+
+  override def close(): Unit = {
+byteArray = null
+  }
+
+  override def read(dst: ByteBuffer): Int = {
+val remainingBuf = byteArray.length - _position
+val length = Math.min(dst.remaining(), remainingBuf).toInt
+dst.put(byteArray, _position.toInt, length)
+_position += length
+length.toInt
+  }
+
+  override def position(): Long = _position
+
+  override def position(newPosition: Long): SeekableByteChannel = {
+_position = newPosition.toLong
+this
+  }
+
+  override def size: Long = {
+byteArray.length.toLong
+  }
+
+  override def write(src: ByteBuffer): Int = {
+throw new UnsupportedOperationException("Read Only")
+  }
+
+  override def truncate(size: Long): SeekableByteChannel = {
+throw new UnsupportedOperationException("Read Only")
+  }
+}
+
+/**
+ * Intermediate data structure returned from Arrow conversions
+ */
+private[sql] abstract class ArrowPayload extends Iterator[ArrowRecordBatch]
+
+/**
+ * Build a payload from existing ArrowRecordBatches
+ */
+private[sql] class ArrowStaticPayload(batches: ArrowRecordBatch*) extends 
ArrowPayload {
+  private val iter = batches.iterator
+  override def next(): ArrowRecordBatch = iter.next()
+  override def hasNext: Boolean = iter.hasNext
+}
+
+/**
+ * Class that wraps an Arrow RootAllocator used in conversion
+ */
+private[sql] class ArrowConverters {
+  private val _allocator = new RootAllocator(Long.MaxValue)
+
+  private[sql] def allocator: RootAllocator = _allocator
+
+  def interalRowIterToPayload(rowIter: Iterator[InternalRow], schema: 
StructType): ArrowPayload = {
+val batch = ArrowConverters.internalRowIterToArrowBatch(rowIter, 
schema, allocator)
+new ArrowStaticPayload(batch)
+  }
+
+  def readPayloadByteArrays(payloadByteArrays: Array[Array[Byte]]): 
ArrowPayload = {
+val batches = 
scala.collection.mutable.ArrayBuffer.empty[ArrowRecordBatch]
+var i = 0
+while (i < payloadByteArrays.length) {
+  val payloadBytes = payloadByteArrays(i)
+  val in = new ByteArrayReadableSeekableByteChannel(payloadBytes)
+  val reader = new ArrowReader(in, _allocator)
+  val footer = reader.readFooter()
+  val batchBlocks = footer.getRecordBatches.asScala.toArray
+  batchBlocks.foreach(block => batches += 
reader.readRecordBatch(block))
+  i += 1
+

[GitHub] spark pull request #15821: [SPARK-13534][PySpark] Using Apache Arrow to incr...

2017-03-09 Thread julienledem
Github user julienledem commented on a diff in the pull request:

https://github.com/apache/spark/pull/15821#discussion_r105322928
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/ArrowConverters.scala ---
@@ -0,0 +1,411 @@
+/*
+* Licensed to the Apache Software Foundation (ASF) under one or more
+* contributor license agreements.  See the NOTICE file distributed with
+* this work for additional information regarding copyright ownership.
+* The ASF licenses this file to You under the Apache License, Version 2.0
+* (the "License"); you may not use this file except in compliance with
+* the License.  You may obtain a copy of the License at
+*
+*http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+package org.apache.spark.sql
+
+import java.io.ByteArrayOutputStream
+import java.nio.ByteBuffer
+import java.nio.channels.{Channels, SeekableByteChannel}
+
+import scala.collection.JavaConverters._
+
+import io.netty.buffer.ArrowBuf
+import org.apache.arrow.memory.{BaseAllocator, RootAllocator}
+import org.apache.arrow.vector._
+import org.apache.arrow.vector.BaseValueVector.BaseMutator
+import org.apache.arrow.vector.file.{ArrowReader, ArrowWriter}
+import org.apache.arrow.vector.schema.{ArrowFieldNode, ArrowRecordBatch}
+import org.apache.arrow.vector.types.{FloatingPointPrecision, TimeUnit}
+import org.apache.arrow.vector.types.pojo.{ArrowType, Field, Schema}
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.types._
+
+
+/**
+ * ArrowReader requires a seekable byte channel.
+ * NOTE - this is taken from test org.apache.vector.file, see about moving 
to public util pkg
+ */
+private[sql] class ByteArrayReadableSeekableByteChannel(var byteArray: 
Array[Byte])
+  extends SeekableByteChannel {
+  var _position: Long = 0L
+
+  override def isOpen: Boolean = {
+byteArray != null
+  }
+
+  override def close(): Unit = {
+byteArray = null
+  }
+
+  override def read(dst: ByteBuffer): Int = {
+val remainingBuf = byteArray.length - _position
+val length = Math.min(dst.remaining(), remainingBuf).toInt
+dst.put(byteArray, _position.toInt, length)
+_position += length
+length.toInt
+  }
+
+  override def position(): Long = _position
+
+  override def position(newPosition: Long): SeekableByteChannel = {
+_position = newPosition.toLong
+this
+  }
+
+  override def size: Long = {
+byteArray.length.toLong
+  }
+
+  override def write(src: ByteBuffer): Int = {
+throw new UnsupportedOperationException("Read Only")
+  }
+
+  override def truncate(size: Long): SeekableByteChannel = {
+throw new UnsupportedOperationException("Read Only")
+  }
+}
+
+/**
+ * Intermediate data structure returned from Arrow conversions
+ */
+private[sql] abstract class ArrowPayload extends Iterator[ArrowRecordBatch]
+
+/**
+ * Build a payload from existing ArrowRecordBatches
+ */
+private[sql] class ArrowStaticPayload(batches: ArrowRecordBatch*) extends 
ArrowPayload {
+  private val iter = batches.iterator
+  override def next(): ArrowRecordBatch = iter.next()
+  override def hasNext: Boolean = iter.hasNext
+}
+
+/**
+ * Class that wraps an Arrow RootAllocator used in conversion
+ */
+private[sql] class ArrowConverters {
+  private val _allocator = new RootAllocator(Long.MaxValue)
+
+  private[sql] def allocator: RootAllocator = _allocator
+
+  def interalRowIterToPayload(rowIter: Iterator[InternalRow], schema: 
StructType): ArrowPayload = {
+val batch = ArrowConverters.internalRowIterToArrowBatch(rowIter, 
schema, allocator)
+new ArrowStaticPayload(batch)
+  }
+
+  def readPayloadByteArrays(payloadByteArrays: Array[Array[Byte]]): 
ArrowPayload = {
+val batches = 
scala.collection.mutable.ArrayBuffer.empty[ArrowRecordBatch]
+var i = 0
+while (i < payloadByteArrays.length) {
+  val payloadBytes = payloadByteArrays(i)
+  val in = new ByteArrayReadableSeekableByteChannel(payloadBytes)
+  val reader = new ArrowReader(in, _allocator)
+  val footer = reader.readFooter()
+  val batchBlocks = footer.getRecordBatches.asScala.toArray
+  batchBlocks.foreach(block => batches += 
reader.readRecordBatch(block))
+  i += 1
+

[GitHub] spark pull request #15821: [SPARK-13534][PySpark] Using Apache Arrow to incr...

2017-03-09 Thread julienledem
Github user julienledem commented on a diff in the pull request:

https://github.com/apache/spark/pull/15821#discussion_r105324269
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/ArrowConvertersSuite.scala ---
@@ -0,0 +1,567 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql
+
+import java.io.File
+import java.nio.charset.StandardCharsets
+import java.sql.{Date, Timestamp}
+import java.text.SimpleDateFormat
+import java.util.Locale
+
+import com.google.common.io.Files
+import org.apache.arrow.vector.{VectorLoader, VectorSchemaRoot}
+import org.apache.arrow.vector.file.json.JsonFileReader
+import org.apache.arrow.vector.util.Validator
+import org.json4s.jackson.JsonMethods._
+import org.json4s.JsonAST._
+import org.json4s.JsonDSL._
+import org.scalatest.BeforeAndAfterAll
+
+import org.apache.spark.SparkException
+import org.apache.spark.sql.test.SharedSQLContext
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.util.Utils
+
+
+class ArrowConvertersSuite extends SharedSQLContext with BeforeAndAfterAll 
{
+  import testImplicits._
+
+  private var tempDataPath: String = _
+
+  private def collectAsArrow(df: DataFrame,
+ converter: Option[ArrowConverters] = None): 
ArrowPayload = {
+val cnvtr = converter.getOrElse(new ArrowConverters)
+val payloadByteArrays = df.toArrowPayloadBytes().collect()
+cnvtr.readPayloadByteArrays(payloadByteArrays)
+  }
+
+  override def beforeAll(): Unit = {
+super.beforeAll()
+tempDataPath = Utils.createTempDir(namePrefix = 
"arrow").getAbsolutePath
+  }
+
+  test("collect to arrow record batch") {
+val indexData = (1 to 6).toDF("i")
+val arrowPayload = collectAsArrow(indexData)
+assert(arrowPayload.nonEmpty)
+val arrowBatches = arrowPayload.toArray
+assert(arrowBatches.length == indexData.rdd.getNumPartitions)
+val rowCount = arrowBatches.map(batch => batch.getLength).sum
+assert(rowCount === indexData.count())
+arrowBatches.foreach(batch => assert(batch.getNodes.size() > 0))
+arrowBatches.foreach(batch => batch.close())
+  }
+
+  test("numeric type conversion") {
+collectAndValidate(indexData)
+collectAndValidate(shortData)
+collectAndValidate(intData)
+collectAndValidate(longData)
+collectAndValidate(floatData)
+collectAndValidate(doubleData)
+  }
+
+  test("mixed numeric type conversion") {
+collectAndValidate(mixedNumericData)
+  }
+
+  test("boolean type conversion") {
+collectAndValidate(boolData)
+  }
+
+  test("string type conversion") {
+collectAndValidate(stringData)
+  }
+
+  test("byte type conversion") {
+collectAndValidate(byteData)
+  }
+
+  test("timestamp conversion") {
+collectAndValidate(timestampData)
+  }
+
+  // TODO: Not currently supported in Arrow JSON reader
+  ignore("date conversion") {
+// collectAndValidate(dateTimeData)
+  }
+
+  // TODO: Not currently supported in Arrow JSON reader
+  ignore("binary type conversion") {
+// collectAndValidate(binaryData)
+  }
+
+  test("floating-point NaN") {
+collectAndValidate(floatNaNData)
+  }
+
+  test("partitioned DataFrame") {
+val converter = new ArrowConverters
+val schema = testData2.schema
+val arrowPayload = collectAsArrow(testData2, Some(converter))
+val arrowBatches = arrowPayload.toArray
+// NOTE: testData2 should have 2 partitions -> 2 arrow batches in 
payload
+assert(arrowBatches.length === 2)
+val pl1 = new ArrowStaticPayload(arrowBatches(0))
+val pl2 = new ArrowStaticPayload(arrowBatches(1))
+// Generate JSON files
+val a = List[Int](1, 1, 2, 2, 3, 3)
+val b = List[Int](1, 2, 1, 2, 1, 2)
+val fields 

[GitHub] spark pull request #15821: [SPARK-13534][PySpark] Using Apache Arrow to incr...

2017-03-09 Thread julienledem
Github user julienledem commented on a diff in the pull request:

https://github.com/apache/spark/pull/15821#discussion_r105321731
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/ArrowConverters.scala ---
@@ -0,0 +1,411 @@
+/*
+* Licensed to the Apache Software Foundation (ASF) under one or more
+* contributor license agreements.  See the NOTICE file distributed with
+* this work for additional information regarding copyright ownership.
+* The ASF licenses this file to You under the Apache License, Version 2.0
+* (the "License"); you may not use this file except in compliance with
+* the License.  You may obtain a copy of the License at
+*
+*http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+package org.apache.spark.sql
+
+import java.io.ByteArrayOutputStream
+import java.nio.ByteBuffer
+import java.nio.channels.{Channels, SeekableByteChannel}
+
+import scala.collection.JavaConverters._
+
+import io.netty.buffer.ArrowBuf
+import org.apache.arrow.memory.{BaseAllocator, RootAllocator}
+import org.apache.arrow.vector._
+import org.apache.arrow.vector.BaseValueVector.BaseMutator
+import org.apache.arrow.vector.file.{ArrowReader, ArrowWriter}
+import org.apache.arrow.vector.schema.{ArrowFieldNode, ArrowRecordBatch}
+import org.apache.arrow.vector.types.{FloatingPointPrecision, TimeUnit}
+import org.apache.arrow.vector.types.pojo.{ArrowType, Field, Schema}
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.types._
+
+
+/**
+ * ArrowReader requires a seekable byte channel.
+ * NOTE - this is taken from test org.apache.vector.file, see about moving 
to public util pkg
+ */
+private[sql] class ByteArrayReadableSeekableByteChannel(var byteArray: 
Array[Byte])
+  extends SeekableByteChannel {
+  var _position: Long = 0L
+
+  override def isOpen: Boolean = {
+byteArray != null
+  }
+
+  override def close(): Unit = {
+byteArray = null
+  }
+
+  override def read(dst: ByteBuffer): Int = {
+val remainingBuf = byteArray.length - _position
+val length = Math.min(dst.remaining(), remainingBuf).toInt
+dst.put(byteArray, _position.toInt, length)
+_position += length
+length.toInt
+  }
+
+  override def position(): Long = _position
+
+  override def position(newPosition: Long): SeekableByteChannel = {
+_position = newPosition.toLong
+this
+  }
+
+  override def size: Long = {
+byteArray.length.toLong
+  }
+
+  override def write(src: ByteBuffer): Int = {
+throw new UnsupportedOperationException("Read Only")
+  }
+
+  override def truncate(size: Long): SeekableByteChannel = {
+throw new UnsupportedOperationException("Read Only")
+  }
+}
+
+/**
+ * Intermediate data structure returned from Arrow conversions
+ */
+private[sql] abstract class ArrowPayload extends Iterator[ArrowRecordBatch]
+
+/**
+ * Build a payload from existing ArrowRecordBatches
+ */
+private[sql] class ArrowStaticPayload(batches: ArrowRecordBatch*) extends 
ArrowPayload {
+  private val iter = batches.iterator
+  override def next(): ArrowRecordBatch = iter.next()
+  override def hasNext: Boolean = iter.hasNext
+}
+
+/**
+ * Class that wraps an Arrow RootAllocator used in conversion
+ */
+private[sql] class ArrowConverters {
+  private val _allocator = new RootAllocator(Long.MaxValue)
--- End diff --

if we create an allocator we should have a way to close it in the end.
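
A minimal sketch of that suggestion, not the PR's actual code: give the 
wrapper an explicit `close()` so the `RootAllocator` it creates can be 
released once the conversion is done.

```scala
import org.apache.arrow.memory.RootAllocator

// Sketch only: same wrapper as in the diff above, but implementing
// AutoCloseable so callers can release the allocator deterministically.
private[sql] class ArrowConverters extends AutoCloseable {
  private val _allocator = new RootAllocator(Long.MaxValue)

  private[sql] def allocator: RootAllocator = _allocator

  override def close(): Unit = _allocator.close()
}
```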





[GitHub] spark pull request #15821: [SPARK-13534][PySpark] Using Apache Arrow to incr...

2017-03-09 Thread julienledem
Github user julienledem commented on a diff in the pull request:

https://github.com/apache/spark/pull/15821#discussion_r105322470
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/ArrowConverters.scala ---
@@ -0,0 +1,411 @@
+/*
+* Licensed to the Apache Software Foundation (ASF) under one or more
+* contributor license agreements.  See the NOTICE file distributed with
+* this work for additional information regarding copyright ownership.
+* The ASF licenses this file to You under the Apache License, Version 2.0
+* (the "License"); you may not use this file except in compliance with
+* the License.  You may obtain a copy of the License at
+*
+*http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+*/
+
+package org.apache.spark.sql
+
+import java.io.ByteArrayOutputStream
+import java.nio.ByteBuffer
+import java.nio.channels.{Channels, SeekableByteChannel}
+
+import scala.collection.JavaConverters._
+
+import io.netty.buffer.ArrowBuf
+import org.apache.arrow.memory.{BaseAllocator, RootAllocator}
+import org.apache.arrow.vector._
+import org.apache.arrow.vector.BaseValueVector.BaseMutator
+import org.apache.arrow.vector.file.{ArrowReader, ArrowWriter}
+import org.apache.arrow.vector.schema.{ArrowFieldNode, ArrowRecordBatch}
+import org.apache.arrow.vector.types.{FloatingPointPrecision, TimeUnit}
+import org.apache.arrow.vector.types.pojo.{ArrowType, Field, Schema}
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.types._
+
+
+/**
+ * ArrowReader requires a seekable byte channel.
+ * NOTE - this is taken from test org.apache.vector.file, see about moving to public util pkg
+ */
+private[sql] class ByteArrayReadableSeekableByteChannel(var byteArray: Array[Byte])
+  extends SeekableByteChannel {
+  var _position: Long = 0L
+
+  override def isOpen: Boolean = {
+    byteArray != null
+  }
+
+  override def close(): Unit = {
+    byteArray = null
+  }
+
+  override def read(dst: ByteBuffer): Int = {
+    val remainingBuf = byteArray.length - _position
+    val length = Math.min(dst.remaining(), remainingBuf).toInt
+    dst.put(byteArray, _position.toInt, length)
+    _position += length
+    length.toInt
+  }
+
+  override def position(): Long = _position
+
+  override def position(newPosition: Long): SeekableByteChannel = {
+    _position = newPosition.toLong
+    this
+  }
+
+  override def size: Long = {
+    byteArray.length.toLong
+  }
+
+  override def write(src: ByteBuffer): Int = {
+    throw new UnsupportedOperationException("Read Only")
+  }
+
+  override def truncate(size: Long): SeekableByteChannel = {
+    throw new UnsupportedOperationException("Read Only")
+  }
+}
+
+/**
+ * Intermediate data structure returned from Arrow conversions
+ */
+private[sql] abstract class ArrowPayload extends Iterator[ArrowRecordBatch]
+
+/**
+ * Build a payload from existing ArrowRecordBatches
+ */
+private[sql] class ArrowStaticPayload(batches: ArrowRecordBatch*) extends ArrowPayload {
+  private val iter = batches.iterator
+  override def next(): ArrowRecordBatch = iter.next()
+  override def hasNext: Boolean = iter.hasNext
+}
+
+/**
+ * Class that wraps an Arrow RootAllocator used in conversion
+ */
+private[sql] class ArrowConverters {
+  private val _allocator = new RootAllocator(Long.MaxValue)
+
+  private[sql] def allocator: RootAllocator = _allocator
+
+  def interalRowIterToPayload(rowIter: Iterator[InternalRow], schema: StructType): ArrowPayload = {
+    val batch = ArrowConverters.internalRowIterToArrowBatch(rowIter, schema, allocator)
+    new ArrowStaticPayload(batch)
+  }
+
+  def readPayloadByteArrays(payloadByteArrays: Array[Array[Byte]]): ArrowPayload = {
+    val batches = scala.collection.mutable.ArrayBuffer.empty[ArrowRecordBatch]
+    var i = 0
+    while (i < payloadByteArrays.length) {
+      val payloadBytes = payloadByteArrays(i)
+      val in = new ByteArrayReadableSeekableByteChannel(payloadBytes)
+      val reader = new ArrowReader(in, _allocator)
+      val footer = reader.readFooter()
+      val batchBlocks = footer.getRecordBatches.asScala.toArray
+      batchBlocks.foreach(block => batches += reader.readRecordBatch(block))
+      i += 1
+

[GitHub] spark pull request #15821: [SPARK-13534][PySpark] Using Apache Arrow to incr...

2017-03-09 Thread julienledem
Github user julienledem commented on a diff in the pull request:

https://github.com/apache/spark/pull/15821#discussion_r105322707
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/ArrowConverters.scala ---

[GitHub] spark issue #17226: [SPARK-19882][SQL] Pivot with null as a distinct pivot v...

2017-03-09 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/17226
  
@aray, this is a regression introduced by this optimization, as I described in my PR.

Spark 1.6:

```
+----+----+---+
|   a|null|  1|
+----+----+---+
|null|   0|  0|
|   1|   0|  1|
+----+----+---+
```

In terms of input/output behavior, this PR does not fully resolve the regression; that is why I proposed avoiding this case in the optimization path.

We could leave another JIRA open to support this case in the optimization path. It is a little odd that the output differs depending on an internal optimization.
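
A minimal sketch of the kind of query being discussed, pivoting on a column whose distinct values include null; the DataFrame and column names are illustrative assumptions, not taken from the PR:

```
import org.apache.spark.sql.SparkSession

object PivotNullSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("pivot-null").getOrCreate()
    import spark.implicits._

    // Column "a" contains null and 1 as its distinct values.
    val df = Seq((None: Option[Int], 0), (Some(1), 1)).toDF("a", "b")

    // Grouping and pivoting on the same nullable column: null becomes its own
    // column in the pivoted result schema, which is the case at issue here.
    df.groupBy($"a").pivot("a").count().show()

    spark.stop()
  }
}
```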






[GitHub] spark pull request #15821: [SPARK-13534][PySpark] Using Apache Arrow to incr...

2017-03-09 Thread julienledem
Github user julienledem commented on a diff in the pull request:

https://github.com/apache/spark/pull/15821#discussion_r105322098
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/ArrowConverters.scala ---

[GitHub] spark pull request #15821: [SPARK-13534][PySpark] Using Apache Arrow to incr...

2017-03-09 Thread julienledem
Github user julienledem commented on a diff in the pull request:

https://github.com/apache/spark/pull/15821#discussion_r105322283
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/ArrowConverters.scala ---

[GitHub] spark pull request #15821: [SPARK-13534][PySpark] Using Apache Arrow to incr...

2017-03-09 Thread julienledem
Github user julienledem commented on a diff in the pull request:

https://github.com/apache/spark/pull/15821#discussion_r105321652
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/ArrowConverters.scala ---