date:20180822

[GitHub] spark issue #22063: [WIP][SPARK-25044][SQL] Address translation of LMF closu...

2018-08-22 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22063
  
**[Test build #95106 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95106/testReport)** for PR 22063 at commit [`e7abb67`](https://github.com/apache/spark/commit/e7abb67bcb66b21a41818e435b8ec848df62edd8).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22063: [WIP][SPARK-25044][SQL] Address translation of LMF closu...

2018-08-22 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22063
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2446/
Test PASSed.


---


[GitHub] spark issue #22063: [WIP][SPARK-25044][SQL] Address translation of LMF closu...

2018-08-22 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22063
  
Merged build finished. Test PASSed.


---


[GitHub] spark pull request #22121: [SPARK-25133][SQL][Doc]Avro data source guide

2018-08-22 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22121#discussion_r211985668
  
--- Diff: docs/avro-data-source-guide.md ---
@@ -0,0 +1,377 @@
+---
+layout: global
+title: Apache Avro Data Source Guide
+---
+
+* This will become a table of contents (this text will be scraped).
+{:toc}
+
+Since Spark 2.4 release, [Spark 
SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html) provides 
built-in support for reading and writing Apache Avro data.
+
+## Deploying
+The `spark-avro` module is external and not included in `spark-submit` or 
`spark-shell` by default.
+
+As with any Spark applications, `spark-submit` is used to launch your 
application. `spark-avro_{{site.SCALA_BINARY_VERSION}}`
+and its dependencies can be directly added to `spark-submit` using 
`--packages`, such as,
+
+./bin/spark-submit --packages 
org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}}
 ...
+
+For experimenting on `spark-shell`, you can also use `--packages` to add 
`org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}` and its 
dependencies directly,
+
+./bin/spark-shell --packages 
org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}}
 ...
+
+See [Application Submission Guide](submitting-applications.html) for more 
details about submitting applications with external dependencies.
+
+## Load and Save Functions
+
+Since `spark-avro` module is external, there is no `.avro` API in 
+`DataFrameReader` or `DataFrameWriter`.
+
+To load/save data in Avro format, you need to specify the data source 
option `format` as `avro`(or `org.apache.spark.sql.avro`).
+
+
+{% highlight scala %}
+
+val usersDF = 
spark.read.format("avro").load("examples/src/main/resources/users.avro")
+usersDF.select("name", 
"favorite_color").write.format("avro").save("namesAndFavColors.avro")
+
+{% endhighlight %}
+
+
+{% highlight java %}
+
+Dataset usersDF = 
spark.read().format("avro").load("examples/src/main/resources/users.avro");
+usersDF.select("name", 
"favorite_color").write().format("avro").save("namesAndFavColors.avro");
+
+{% endhighlight %}
+
+
+{% highlight python %}
+
+df = 
spark.read.format("avro").load("examples/src/main/resources/users.avro")
+df.select("name", 
"favorite_color").write.format("avro").save("namesAndFavColors.avro")
+
+{% endhighlight %}
+
+
+{% highlight r %}
+
+df <- read.df("examples/src/main/resources/users.avro", "avro")
+write.df(select(df, "name", "favorite_color"), "namesAndFavColors.avro", 
"avro")
+
+{% endhighlight %}
+
+
+
+## to_avro() and from_avro()
+Spark SQL provides function `to_avro` to encode a struct as a string and 
`from_avro()` to retrieve the struct as a complex type.
+
+Using Avro record as columns are useful when reading from or writing to a 
streaming source like Kafka. Each 
+Kafka key-value record will be augmented with some metadata, such as the 
ingestion timestamp into Kafka, the offset in Kafka, etc.
+* If the "value" field that contains your data is in Avro, you could use 
`from_avro()` to extract your data, enrich it, clean it, and then push it 
downstream to Kafka again or write it out to a file.
+* `to_avro()` can be used to turn structs into Avro records. This method 
is particularly useful when you would like to re-encode multiple columns into a 
single one when writing data out to Kafka.
+
+Both methods are presently only available in Scala and Java.
--- End diff --

Do not use `presently`, we should say `As of Spark 2.4, ...`


---


[GitHub] spark pull request #22141: [SPARK-25154][SQL] Support NOT IN sub-queries ins...

2018-08-22 Thread dilipbiswal

Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22141#discussion_r211985537
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala
 ---
@@ -137,13 +137,21 @@ object RewritePredicateSubquery extends 
Rule[LogicalPlan] with PredicateHelper {
   plan: LogicalPlan): (Option[Expression], LogicalPlan) = {
 var newPlan = plan
 val newExprs = exprs.map { e =>
-  e transformUp {
+  e transformDown {
--- End diff --

@mgaido91 How can I say "no" to more testing? :-) I will add the tests. 
Thanks!


---


[GitHub] spark issue #22140: [SPARK-25072][PySpark] Forbid extra value for custom Row

2018-08-22 Thread xuanyuanking

Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/22140
  
AFAIC, the fix should only forbid passing illegal extra values. If fewer 
values than fields are passed, accessing a missing field should raise an 
`AttributeError`, as the current implementation does, rather than being 
banned here. What do you think? :) @HyukjinKwon @BryanCutler 
Thanks.
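
The behavior proposed in this comment can be sketched in plain Python. `MiniRow` is a hypothetical stand-in written for illustration only; it is not PySpark's `Row` implementation, and the field names are made up. It assumes only what the comment states: extra values are rejected at construction time, while missing values surface as an `AttributeError` on access.

```python
class MiniRow(tuple):
    """Hypothetical, simplified custom row type (not PySpark's Row)."""

    # Illustrative field names; any fixed tuple of names would do.
    fields = ("name", "age")

    def __new__(cls, *values):
        # Extra values can never be reached by field name, so reject them early.
        if len(values) > len(cls.fields):
            raise ValueError(
                "Expected at most %d values, got %d"
                % (len(cls.fields), len(values))
            )
        return super().__new__(cls, values)

    def __getattr__(self, item):
        # Only called when normal attribute lookup fails.
        try:
            return self[self.fields.index(item)]
        except (ValueError, IndexError):
            # ValueError: unknown field name.
            # IndexError: known field, but no value was supplied for it.
            raise AttributeError(item)


full = MiniRow("Alice", 11)   # one value per field
partial = MiniRow("Alice")    # fewer values than fields is still allowed
```

With this sketch, `partial.name` works, `partial.age` raises `AttributeError`, and `MiniRow("Alice", 11, "extra")` raises `ValueError` immediately.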


---


[GitHub] spark pull request #22121: [SPARK-25133][SQL][Doc]Avro data source guide

2018-08-22 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22121#discussion_r211985059
  
--- Diff: docs/avro-data-source-guide.md ---
@@ -0,0 +1,377 @@
+---
+layout: global
+title: Apache Avro Data Source Guide
+---
+
+* This will become a table of contents (this text will be scraped).
+{:toc}
+
+Since Spark 2.4 release, [Spark 
SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html) provides 
built-in support for reading and writing Apache Avro data.
+
+## Deploying
+The `spark-avro` module is external and not included in `spark-submit` or 
`spark-shell` by default.
+
+As with any Spark applications, `spark-submit` is used to launch your 
application. `spark-avro_{{site.SCALA_BINARY_VERSION}}`
+and its dependencies can be directly added to `spark-submit` using 
`--packages`, such as,
+
+./bin/spark-submit --packages 
org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}}
 ...
+
+For experimenting on `spark-shell`, you can also use `--packages` to add 
`org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}` and its 
dependencies directly,
+
+./bin/spark-shell --packages 
org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}}
 ...
+
+See [Application Submission Guide](submitting-applications.html) for more 
details about submitting applications with external dependencies.
+
+## Load and Save Functions
+
+Since `spark-avro` module is external, there is no `.avro` API in 
+`DataFrameReader` or `DataFrameWriter`.
+
+To load/save data in Avro format, you need to specify the data source 
option `format` as `avro`(or `org.apache.spark.sql.avro`).
+
+
+{% highlight scala %}
+
+val usersDF = 
spark.read.format("avro").load("examples/src/main/resources/users.avro")
+usersDF.select("name", 
"favorite_color").write.format("avro").save("namesAndFavColors.avro")
+
+{% endhighlight %}
+
+
+{% highlight java %}
+
+Dataset usersDF = 
spark.read().format("avro").load("examples/src/main/resources/users.avro");
+usersDF.select("name", 
"favorite_color").write().format("avro").save("namesAndFavColors.avro");
+
+{% endhighlight %}
+
+
+{% highlight python %}
+
+df = 
spark.read.format("avro").load("examples/src/main/resources/users.avro")
+df.select("name", 
"favorite_color").write.format("avro").save("namesAndFavColors.avro")
+
+{% endhighlight %}
+
+
+{% highlight r %}
+
+df <- read.df("examples/src/main/resources/users.avro", "avro")
+write.df(select(df, "name", "favorite_color"), "namesAndFavColors.avro", 
"avro")
+
+{% endhighlight %}
+
+
+
+## to_avro() and from_avro()
+Spark SQL provides function `to_avro` to encode a struct as a string and 
`from_avro()` to retrieve the struct as a complex type.
--- End diff --

`encode a struct as a string`, I think it's not "string", but "binary"?


---


[GitHub] spark pull request #22121: [SPARK-25133][SQL][Doc]Avro data source guide

2018-08-22 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22121#discussion_r211984616
  
--- Diff: docs/avro-data-source-guide.md ---
@@ -0,0 +1,377 @@
+---
+layout: global
+title: Apache Avro Data Source Guide
+---
+
+* This will become a table of contents (this text will be scraped).
+{:toc}
+
+Since Spark 2.4 release, [Spark 
SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html) provides 
built-in support for reading and writing Apache Avro data.
+
+## Deploying
+The `spark-avro` module is external and not included in `spark-submit` or 
`spark-shell` by default.
+
+As with any Spark applications, `spark-submit` is used to launch your 
application. `spark-avro_{{site.SCALA_BINARY_VERSION}}`
+and its dependencies can be directly added to `spark-submit` using 
`--packages`, such as,
+
+./bin/spark-submit --packages 
org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}}
 ...
+
+For experimenting on `spark-shell`, you can also use `--packages` to add 
`org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}` and its 
dependencies directly,
+
+./bin/spark-shell --packages 
org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}}
 ...
+
+See [Application Submission Guide](submitting-applications.html) for more 
details about submitting applications with external dependencies.
+
+## Load and Save Functions
+
+Since `spark-avro` module is external, there is no `.avro` API in 
+`DataFrameReader` or `DataFrameWriter`.
+
+To load/save data in Avro format, you need to specify the data source 
option `format` as `avro`(or `org.apache.spark.sql.avro`).
+
+
+{% highlight scala %}
+
+val usersDF = 
spark.read.format("avro").load("examples/src/main/resources/users.avro")
+usersDF.select("name", 
"favorite_color").write.format("avro").save("namesAndFavColors.avro")
+
+{% endhighlight %}
+
+
+{% highlight java %}
+
+Dataset usersDF = 
spark.read().format("avro").load("examples/src/main/resources/users.avro");
+usersDF.select("name", 
"favorite_color").write().format("avro").save("namesAndFavColors.avro");
+
+{% endhighlight %}
+
+
+{% highlight python %}
+
+df = 
spark.read.format("avro").load("examples/src/main/resources/users.avro")
+df.select("name", 
"favorite_color").write.format("avro").save("namesAndFavColors.avro")
+
+{% endhighlight %}
+
+
+{% highlight r %}
+
+df <- read.df("examples/src/main/resources/users.avro", "avro")
+write.df(select(df, "name", "favorite_color"), "namesAndFavColors.avro", 
"avro")
+
+{% endhighlight %}
+
+
+
+## to_avro() and from_avro()
+Spark SQL provides function `to_avro` to encode a struct as a string and 
`from_avro()` to retrieve the struct as a complex type.
--- End diff --

not "Spark SQL", it should be "The Avro package"


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

2018-08-22 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22121
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95105/
Test PASSed.


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

2018-08-22 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22121
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

2018-08-22 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22121
  
**[Test build #95105 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95105/testReport)** for PR 22121 at commit [`8da8250`](https://github.com/apache/spark/commit/8da82506e06e36d63bf91fdda194a866f2d977ea).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark issue #22181: [SPARK-25163][SQL] Fix flaky test: o.a.s.util.collection...

2018-08-22 Thread cloud-fan

Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22181
  
good catch! LGTM


---


[GitHub] spark issue #22165: [SPARK-25017][Core] Add test suite for BarrierCoordinato...

2018-08-22 Thread xuanyuanking

Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/22165
  
My pleasure, I just found this while glancing over JIRA in recent days. :)


---


[GitHub] spark pull request #22141: [SPARK-25154][SQL] Support NOT IN sub-queries ins...

2018-08-22 Thread mgaido91

Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22141#discussion_r211979620
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala
 ---
@@ -137,13 +137,21 @@ object RewritePredicateSubquery extends 
Rule[LogicalPlan] with PredicateHelper {
   plan: LogicalPlan): (Option[Expression], LogicalPlan) = {
 var newPlan = plan
 val newExprs = exprs.map { e =>
-  e transformUp {
+  e transformDown {
--- End diff --

Yes, thanks, but that doesn't test the case where the outer values are null, 
right? I think it would also be good to have cases with:
 - more than 2 attributes;
 - with the outer values being null;
 - complex data types involved (eg. structs)

What do you think?
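
To make the null cases above concrete, here is a minimal sketch of standard SQL three-valued logic for a multi-column `NOT IN`, written in plain Python. It models the semantics only and is not Spark's implementation; `None` stands for SQL `NULL`, and all function names are hypothetical.

```python
def sql_eq(a, b):
    """SQL equality: comparing with NULL (None) yields unknown (None)."""
    if a is None or b is None:
        return None
    return a == b


def sql_and(values):
    """Three-valued AND: False dominates, then unknown, otherwise True."""
    values = list(values)
    if any(v is False for v in values):
        return False
    if any(v is None for v in values):
        return None
    return True


def sql_or(values):
    """Three-valued OR: True dominates, then unknown, otherwise False."""
    values = list(values)
    if any(v is True for v in values):
        return True
    if any(v is None for v in values):
        return None
    return False


def not_in(outer, rows):
    """Evaluate `outer NOT IN (rows)` where outer and each row are tuples."""
    in_result = sql_or(
        sql_and(sql_eq(o, r) for o, r in zip(outer, row)) for row in rows
    )
    return None if in_result is None else not in_result
```

A `WHERE` clause keeps a row only when the predicate is `True`, so an outer tuple such as `(None, 2)` checked against the subquery row `(3, 2)` evaluates to unknown and is filtered out, while the same outer tuple checked against `(3, 4)` evaluates to `True`. Cases like these are what the extra tests would need to pin down.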


---


[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-22 Thread cloud-fan

Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22112
  
FYI, I've implemented support for "repeatable" RDD actions in my local 
branch. It requires adding a new parameter to the public `SparkContext#runJob`, 
so I'm a little hesitant to push it. Please let me know if you have different 
ideas. Thanks!


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

2018-08-22 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22121
  
**[Test build #95105 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95105/testReport)** for PR 22121 at commit [`8da8250`](https://github.com/apache/spark/commit/8da82506e06e36d63bf91fdda194a866f2d977ea).


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

2018-08-22 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22121
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2445/
Test PASSed.


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

2018-08-22 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22121
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #22157: [SPARK-25126] Avoid creating Reader for all orc files

2018-08-22 Thread srowen

Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22157
  
The failure in OrcQuerySuite looks legitimate. It's because it corrupts the 
third file of three, then sets the reader to not ignore corrupt files, but 
never actually reads the third file now with this change. I think that might be 
a good thing. @dongjoon-hyun do you have an opinion?
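
One way to picture why the third file is no longer read is the sketch below. It assumes the change makes schema inference stop at the first file that yields a schema instead of opening every file; that is an assumption made for illustration, since the PR's code is not quoted in this thread. All names are hypothetical, and no real ORC files are touched.

```python
class CorruptFileError(Exception):
    """Raised by the fake reader when a file is marked as corrupt."""


def read_schema(path, corrupt_paths, opened):
    # Record every file we open so the two strategies can be compared.
    opened.append(path)
    if path in corrupt_paths:
        raise CorruptFileError(path)
    return "schema(%s)" % path


def infer_schema_eager(paths, corrupt_paths, opened):
    # Opens every file, so a corrupt file anywhere in the list fails.
    schemas = [read_schema(p, corrupt_paths, opened) for p in paths]
    return schemas[0]


def infer_schema_lazy(paths, corrupt_paths, opened):
    # The generator is consumed only until the first schema is produced,
    # so later files (including corrupt ones) are never opened.
    return next(read_schema(p, corrupt_paths, opened) for p in paths)
```

With three files where only the last is corrupt, the eager variant raises, while the lazy variant returns the first file's schema and never notices the corruption. That is the situation the failing test appears to run into.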


---


[GitHub] spark pull request #22141: [SPARK-25154][SQL] Support NOT IN sub-queries ins...

2018-08-22 Thread dilipbiswal

Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22141#discussion_r211971929
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala
 ---
@@ -137,13 +137,21 @@ object RewritePredicateSubquery extends 
Rule[LogicalPlan] with PredicateHelper {
   plan: LogicalPlan): (Option[Expression], LogicalPlan) = {
 var newPlan = plan
 val newExprs = exprs.map { e =>
-  e transformUp {
+  e transformDown {
--- End diff --

@mgaido91 
>> I don't see any test (please correct me if I am wrong) where multiple 
attributes are used as output of the subquery. Can we add and compare with 
other RDBMS? Thanks.

In [here](https://github.com/apache/spark/blob/844a3ff82a688e7398bb130a44750aec78420698/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/nested-not-in.sql#L113-L134)? Is this what you meant?


---


[GitHub] spark issue #22179: [SPARK-23131][BUILD] Upgrade Kryo to 4.0.2

2018-08-22 Thread srowen

Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22179
  
That looks like a major version bump -- the usual question here -- what are 
the key changes we need, what are possible incompatible changes?


---


[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-22 Thread cloud-fan

Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22112
  
> how does the user then tell spark that the result stage becomes 
repeatable because they did the checkpoint?

There are 2 concepts here:
1. The random level of the RDD computing function (see my PR description). 
There are 3 random levels: IDEMPOTENT, RANDOM_ORDER, COMPLETE_RANDOM. e.g. file 
reading is IDEMPOTENT, shuffle fetching is RANDOM_ORDER, shuffle fetching + 
repartition/zip is COMPLETE_RANDOM. Spark only needs to retry the succeeding 
stages if we retry a stage which is COMPLETE_RANDOM.
2. Whether the result stage is repeatable. e.g. "collect" is repeatable, 
writing with hadoop output committer is not.

For concept 1, it's a property of the RDD, so users can specify it by 
implementing a custom RDD or by marking the RDD map function as 
order-sensitive (e.g. `zip`). This PR does not design proper public APIs for it.

For concept 2, it's a property of the RDD action. Users usually don't need 
to specify it, as we will specify it for each RDD action. e.g. `collect` is 
repeatable. `saveAsHadoopDataset` is not.

Spark only fails the job if the RDD is COMPLETE_RANDOM (shuffle + 
repartition/zip) and the action is not repeatable. If users checkpoint the RDD 
before repartition/zip (e.g. shuffle + checkpoint + repartition/zip), then the 
RDD becomes IDEMPOTENT (see ) and Spark will not fail the job even if the action 
is not repeatable.
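
The two concepts above can be sketched as small predicates. The level names come from the comment; the function names and the `after_checkpoint` helper are hypothetical, written only to restate the rules, and are not Spark APIs.

```python
from enum import Enum


class RandomLevel(Enum):
    """The three random levels named in the comment."""
    IDEMPOTENT = "idempotent"            # e.g. file reading
    RANDOM_ORDER = "random_order"        # e.g. shuffle fetching
    COMPLETE_RANDOM = "complete_random"  # e.g. shuffle fetching + repartition/zip


def must_retry_succeeding_stages(retried_stage_level):
    # Succeeding stages need a retry only if the retried stage is COMPLETE_RANDOM.
    return retried_stage_level is RandomLevel.COMPLETE_RANDOM


def must_fail_job(rdd_level, action_is_repeatable):
    # The job fails only when both conditions hold.
    return rdd_level is RandomLevel.COMPLETE_RANDOM and not action_is_repeatable


def after_checkpoint(level):
    # Assumption taken from the comment: checkpointing before
    # repartition/zip makes the RDD IDEMPOTENT, whatever it was before.
    return RandomLevel.IDEMPOTENT
```

For example, a COMPLETE_RANDOM RDD followed by a non-repeatable action such as `saveAsHadoopDataset` would fail the job under these rules, while the same pipeline with a checkpoint inserted would not.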




---


[GitHub] spark pull request #22141: [SPARK-25154][SQL] Support NOT IN sub-queries ins...

2018-08-22 Thread dilipbiswal

Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22141#discussion_r211969009
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala
 ---
@@ -137,13 +137,21 @@ object RewritePredicateSubquery extends 
Rule[LogicalPlan] with PredicateHelper {
   plan: LogicalPlan): (Option[Expression], LogicalPlan) = {
 var newPlan = plan
 val newExprs = exprs.map { e =>
-  e transformUp {
+  e transformDown {
--- End diff --

to be able to see Not(In) first before (In) ?
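
The effect of traversal order can be illustrated with a toy expression tree in plain Python. This is a sketch only: expressions are nested tuples, and the rule and the `not_in_join` / `in_join` markers are hypothetical stand-ins for the real rewrite, not Catalyst code.

```python
def is_call(node, name):
    """True if node is a tuple expression whose head is `name`."""
    return isinstance(node, tuple) and len(node) > 0 and node[0] == name


def rewrite(node):
    # Prefer the combined NOT IN pattern when it is still visible.
    if is_call(node, "not") and len(node) == 2 and is_call(node[1], "in"):
        return ("not_in_join",) + node[1][1:]
    # Otherwise rewrite a bare IN.
    if is_call(node, "in"):
        return ("in_join",) + node[1:]
    return node


def transform_down(node, rule):
    # Apply the rule to the parent first, then recurse into its children.
    node = rule(node)
    if isinstance(node, tuple):
        return (node[0],) + tuple(transform_down(c, rule) for c in node[1:])
    return node


def transform_up(node, rule):
    # Recurse into the children first, then apply the rule to the parent.
    if isinstance(node, tuple):
        node = (node[0],) + tuple(transform_up(c, rule) for c in node[1:])
    return rule(node)


expr = ("or", ("not", ("in", "a", "subquery")), ("gt", "b", 1))
top_down = transform_down(expr, rewrite)
bottom_up = transform_up(expr, rewrite)
```

Going top-down, the rule meets `not(in(...))` as a whole and can treat it as NOT IN. Going bottom-up, the inner `in` is rewritten before its parent is visited, so the NOT IN pattern never matches.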


---


[GitHub] spark issue #20345: [SPARK-23172][SQL] Expand the ReorderJoin rule to handle...

2018-08-22 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20345
  
**[Test build #95104 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95104/testReport)** for PR 20345 at commit [`39462fb`](https://github.com/apache/spark/commit/39462fbee952ec574b4c04d7718fd73bb5f56d9d).


---


[GitHub] spark issue #21770: [SPARK-24806][SQL] Brush up generated code so that JDK c...

2018-08-22 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21770
  
**[Test build #95103 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95103/testReport)** for PR 21770 at commit [`5a70a7c`](https://github.com/apache/spark/commit/5a70a7cb33c6fbdf114b39fc8f0196b8d01f8582).


---


[GitHub] spark issue #20345: [SPARK-23172][SQL] Expand the ReorderJoin rule to handle...

2018-08-22 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20345
  
Merged build finished. Test PASSed.


---


[GitHub] spark pull request #22141: [SPARK-25154][SQL] Support NOT IN sub-queries ins...

2018-08-22 Thread mgaido91

Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22141#discussion_r211961738
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala
 ---
@@ -137,13 +137,21 @@ object RewritePredicateSubquery extends 
Rule[LogicalPlan] with PredicateHelper {
   plan: LogicalPlan): (Option[Expression], LogicalPlan) = {
 var newPlan = plan
 val newExprs = exprs.map { e =>
-  e transformUp {
+  e transformDown {
--- End diff --

why did you change this?


---


[GitHub] spark issue #20345: [SPARK-23172][SQL] Expand the ReorderJoin rule to handle...

2018-08-22 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20345
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2444/
Test PASSed.


---


[GitHub] spark issue #21770: [SPARK-24806][SQL] Brush up generated code so that JDK c...

2018-08-22 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21770
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2443/
Test PASSed.


---


[GitHub] spark issue #21770: [SPARK-24806][SQL] Brush up generated code so that JDK c...

2018-08-22 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21770
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #20345: [SPARK-23172][SQL] Expand the ReorderJoin rule to handle...

2018-08-22 Thread maropu

Github user maropu commented on the issue:

https://github.com/apache/spark/pull/20345
  
retest this please


---


[GitHub] spark issue #21770: [SPARK-24806][SQL] Brush up generated code so that JDK c...

2018-08-22 Thread maropu

Github user maropu commented on the issue:

https://github.com/apache/spark/pull/21770
  
retest this please


---


[GitHub] spark pull request #21546: [SPARK-23030][SQL][PYTHON] Use Arrow stream forma...

2018-08-22 Thread icexelloss

Github user icexelloss commented on a diff in the pull request:

https://github.com/apache/spark/pull/21546#discussion_r211964996
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala
 ---
@@ -183,34 +178,106 @@ private[sql] object ArrowConverters {
   }
 
   /**
-   * Convert a byte array to an ArrowRecordBatch.
+   * Load a serialized ArrowRecordBatch.
*/
-  private[arrow] def byteArrayToBatch(
+  private[arrow] def loadBatch(
   batchBytes: Array[Byte],
   allocator: BufferAllocator): ArrowRecordBatch = {
-val in = new ByteArrayReadableSeekableByteChannel(batchBytes)
-val reader = new ArrowFileReader(in, allocator)
-
-// Read a batch from a byte stream, ensure the reader is closed
-Utils.tryWithSafeFinally {
-  val root = reader.getVectorSchemaRoot  // throws IOException
-  val unloader = new VectorUnloader(root)
-  reader.loadNextBatch()  // throws IOException
-  unloader.getRecordBatch
-} {
-  reader.close()
-}
+val in = new ByteArrayInputStream(batchBytes)
+MessageSerializer.deserializeRecordBatch(
+  new ReadChannel(Channels.newChannel(in)), allocator)  // throws 
IOException
   }
 
+  /**
+   * Create a DataFrame from a JavaRDD of serialized ArrowRecordBatches.
+   */
   private[sql] def toDataFrame(
-  payloadRDD: JavaRDD[Array[Byte]],
+  arrowBatchRDD: JavaRDD[Array[Byte]],
   schemaString: String,
   sqlContext: SQLContext): DataFrame = {
-val rdd = payloadRDD.rdd.mapPartitions { iter =>
+val schema = DataType.fromJson(schemaString).asInstanceOf[StructType]
+val timeZoneId = sqlContext.sessionState.conf.sessionLocalTimeZone
+val rdd = arrowBatchRDD.rdd.mapPartitions { iter =>
   val context = TaskContext.get()
-  ArrowConverters.fromPayloadIterator(iter.map(new ArrowPayload(_)), 
context)
+  ArrowConverters.fromBatchIterator(iter, schema, timeZoneId, context)
 }
-val schema = DataType.fromJson(schemaString).asInstanceOf[StructType]
 sqlContext.internalCreateDataFrame(rdd, schema)
   }
+
+  /**
+   * Read a file as an Arrow stream and parallelize as an RDD of 
serialized ArrowRecordBatches.
+   */
+  private[sql] def readArrowStreamFromFile(
+  sqlContext: SQLContext,
+  filename: String): JavaRDD[Array[Byte]] = {
+val fileStream = new FileInputStream(filename)
+try {
+  // Create array so that we can safely close the file
+  val batches = getBatchesFromStream(fileStream.getChannel).toArray
+  // Parallelize the record batches to create an RDD
+  JavaRDD.fromRDD(sqlContext.sparkContext.parallelize(batches, 
batches.length))
+} finally {
+  fileStream.close()
+}
+  }
+
+  /**
+   * Read an Arrow stream input and return an iterator of serialized 
ArrowRecordBatches.
+   */
+  private[sql] def getBatchesFromStream(in: SeekableByteChannel): 
Iterator[Array[Byte]] = {
+
+// Create an iterator to get each serialized ArrowRecordBatch from a 
stream
+new Iterator[Array[Byte]] {
+  var batch: Array[Byte] = readNextBatch()
+
+  override def hasNext: Boolean = batch != null
+
+  override def next(): Array[Byte] = {
+val prevBatch = batch
+batch = readNextBatch()
+prevBatch
+  }
+
+  def readNextBatch(): Array[Byte] = {
+val msgMetadata = MessageSerializer.readMessage(new 
ReadChannel(in))
+if (msgMetadata == null) {
+  return null
+}
+
+// Get the length of the body, which has not been read at this point
+val bodyLength = msgMetadata.getMessageBodyLength.toInt
+
+// Only care about RecordBatch data, skip Schema and unsupported 
Dictionary messages
+if (msgMetadata.getMessage.headerType() == 
MessageHeader.RecordBatch) {
+
+  // Create output backed by buffer to hold msg length (int32), 
msg metadata, msg body
+  val bbout = new ByteBufferOutputStream(4 + 
msgMetadata.getMessageLength + bodyLength)
--- End diff --

Add a comment that this is the deserialized form of an Arrow Record Batch? 
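
The pattern in the quoted `getBatchesFromStream` (read framed messages from a stream, skip the ones that are not record batches, and hand back each batch as raw bytes) can be sketched in plain Python. The framing below is a deliberately simplified toy layout chosen for illustration: a 1-byte kind followed by two little-endian int32 lengths. It is not the Arrow IPC format, and every name in it is hypothetical.

```python
import io
import struct

# Toy header: message kind (int8), metadata length (int32), body length (int32).
HEADER = struct.Struct("<bii")
KIND_SCHEMA = 1
KIND_RECORD_BATCH = 2


def frame(kind, metadata, body):
    """Serialize one message as header + metadata + body."""
    return HEADER.pack(kind, len(metadata), len(body)) + metadata + body


def iter_record_batches(stream):
    """Yield each record-batch message from the stream as raw bytes."""
    while True:
        header = stream.read(HEADER.size)
        if len(header) < HEADER.size:
            return  # end of stream
        kind, meta_len, body_len = HEADER.unpack(header)
        # Always consume the payload so the stream stays aligned,
        # even for messages that are skipped.
        payload = stream.read(meta_len + body_len)
        if kind == KIND_RECORD_BATCH:
            yield header + payload


data = (
    frame(KIND_SCHEMA, b"schema", b"")
    + frame(KIND_RECORD_BATCH, b"m1", b"body-1")
    + frame(KIND_RECORD_BATCH, b"m2", b"b2")
)
batches = list(iter_record_batches(io.BytesIO(data)))
```

Each yielded element is a complete frame (lengths, metadata, body), which is the same idea as sizing the output buffer to hold the message length, the metadata, and the body in the quoted Scala code.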


---


[GitHub] spark pull request #22141: [SPARK-25154][SQL] Support NOT IN sub-queries ins...

2018-08-22 Thread mgaido91

Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22141#discussion_r211961021
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala
 ---
@@ -137,13 +137,21 @@ object RewritePredicateSubquery extends 
Rule[LogicalPlan] with PredicateHelper {
   plan: LogicalPlan): (Option[Expression], LogicalPlan) = {
 var newPlan = plan
 val newExprs = exprs.map { e =>
-  e transformUp {
+  e transformDown {
 case Exists(sub, conditions, _) =>
   val exists = AttributeReference("exists", BooleanType, nullable 
= false)()
   // Deduplicate conflicting attributes if any.
   newPlan = dedupJoin(
 Join(newPlan, sub, ExistenceJoin(exists), 
conditions.reduceLeftOption(And)))
   exists
+case (Not(InSubquery(values, ListQuery(sub, conditions, _, _)))) =>
+  val exists = AttributeReference("exists", BooleanType, nullable 
= false)()
+  val inConditions = values.zip(sub.output).map(EqualTo.tupled)
+  val nullAwareJoinConds = inConditions.map(c => Or(c, IsNull(c)))
--- End diff --

makes sense, thanks for your answer @dilipbiswal 


---


[GitHub] spark pull request #22121: [SPARK-25133][SQL][Doc]Avro data source guide

2018-08-22 Thread tgravescs

Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/22121#discussion_r211959406
  
--- Diff: docs/avro-data-source-guide.md ---
@@ -0,0 +1,377 @@
+---
+layout: global
+title: Apache Avro Data Source Guide
+---
+
+* This will become a table of contents (this text will be scraped).
+{:toc}
+
+Since Spark 2.4 release, [Spark 
SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html) provides 
built-in support for reading and writing Apache Avro data.
+
+## Deploying
+The `spark-avro` module is external and not included in `spark-submit` or 
`spark-shell` by default.
+
+As with any Spark applications, `spark-submit` is used to launch your 
application. `spark-avro_{{site.SCALA_BINARY_VERSION}}`
+and its dependencies can be directly added to `spark-submit` using 
`--packages`, such as,
+
+./bin/spark-submit --packages 
org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}}
 ...
+
+For experimenting on `spark-shell`, you can also use `--packages` to add 
`org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}` and its 
dependencies directly,
+
+./bin/spark-shell --packages 
org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}}
 ...
+
+See [Application Submission Guide](submitting-applications.html) for more 
details about submitting applications with external dependencies.
+
+## Load and Save Functions
+
+Since `spark-avro` module is external, there is not such API as `.avro` in 
--- End diff --

there is no '.avro' API in


---


[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-22 Thread tgravescs

Github user tgravescs commented on the issue:

https://github.com/apache/spark/pull/22112
  
> I'm proposing an option 3:
> Retry all the tasks of all the succeeding stages if a stage with 
repartition/zip failed. All RDD actions should tell Spark if it's "repeatable", 
which becomes a property of the result stage. When we retry a result stage that 
has several tasks finished, if the result stage is "repeatable" (e.g. collect), 
retry it. If the result stage is not "repeatable", fail the job with the error 
message to ask users to checkpoint the RDD before repartition/zip.

How does the user then tell Spark that the result stage becomes repeatable 
because they did the checkpoint? Add an option to the API? Or does Spark 
automatically try to figure that out? I'm still a bit hesitant about making 
our long-term solution that these operations aren't resilient, but as long as 
the user can make them resilient, perhaps it's OK.
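
To make the question concrete, the quoted option 3 can be modelled as a small decision function. This is only a sketch of the proposal as worded above, not Spark scheduler code: `RetryPolicySketch`, `Decision` and `onUpstreamFailure` are hypothetical names, and treating a result stage with no finished tasks as always safe to retry is an assumption of this sketch.

```scala
object RetryPolicySketch {
  sealed trait Decision
  case object RetryResultStage extends Decision
  final case class FailJob(message: String) extends Decision

  // Hypothetical model: called when a stage containing repartition/zip failed
  // and the downstream result stage has to be retried.
  def onUpstreamFailure(resultStageRepeatable: Boolean, finishedTasks: Int): Decision =
    if (finishedTasks == 0 || resultStageRepeatable) {
      // Nothing was produced yet (assumption), or the action is "repeatable" (e.g. collect).
      RetryResultStage
    } else {
      // Partial, non-repeatable output already exists: ask the user to checkpoint.
      FailJob("Checkpoint the RDD before repartition/zip")
    }
}
```

In this model the open question is where `resultStageRepeatable` comes from after a checkpoint: an explicit API option, or something Spark infers on its own.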


---


[GitHub] spark pull request #22141: [SPARK-25154][SQL] Support NOT IN sub-queries ins...

2018-08-22 Thread dilipbiswal

Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22141#discussion_r211955605
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala
 ---
@@ -137,13 +137,21 @@ object RewritePredicateSubquery extends 
Rule[LogicalPlan] with PredicateHelper {
   plan: LogicalPlan): (Option[Expression], LogicalPlan) = {
 var newPlan = plan
 val newExprs = exprs.map { e =>
-  e transformUp {
+  e transformDown {
 case Exists(sub, conditions, _) =>
   val exists = AttributeReference("exists", BooleanType, nullable 
= false)()
   // Deduplicate conflicting attributes if any.
   newPlan = dedupJoin(
 Join(newPlan, sub, ExistenceJoin(exists), 
conditions.reduceLeftOption(And)))
   exists
+case (Not(InSubquery(values, ListQuery(sub, conditions, _, _)))) =>
+  val exists = AttributeReference("exists", BooleanType, nullable 
= false)()
+  val inConditions = values.zip(sub.output).map(EqualTo.tupled)
+  val nullAwareJoinConds = inConditions.map(c => Or(c, IsNull(c)))
--- End diff --

@mgaido91 Thanks !! Actually I have been thinking about it for the last few 
days :-). We probably need a new optimizer rule that simplifies the join 
conditions based on the constraints of the join's children. So we should be 
able to simplify

```SQL
select * from t1 join t2 on (t1c1 = t2c1 OR isnull(t1c1 = t2c1)) where t1c1 
is not null and t2c1 is not null
```
to
```SQL
select * from t1 join t2 on (t1c1 = t2c1) where t1c1 is not null and t2c1 
is not null
```

I wanted to handle it as a follow-up.
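
Why the simplification is safe can be checked with a tiny model of SQL three-valued logic. The sketch below is standalone Scala, not Catalyst code; `NullAwareJoinCond` and its helpers are hypothetical names, and `None` stands for SQL NULL.

```scala
object NullAwareJoinCond {
  // SQL three-valued boolean: None stands for NULL.
  type SqlBool = Option[Boolean]

  // a = b evaluates to NULL if either side is NULL.
  def equalTo(a: Option[Int], b: Option[Int]): SqlBool =
    for (x <- a; y <- b) yield x == y

  // isnull(e) is never NULL itself.
  def isNull(e: SqlBool): SqlBool = Some(e.isEmpty)

  // SQL OR: TRUE dominates NULL, NULL dominates FALSE.
  def or(l: SqlBool, r: SqlBool): SqlBool = (l, r) match {
    case (Some(true), _) | (_, Some(true)) => Some(true)
    case (Some(false), Some(false))        => Some(false)
    case _                                 => None
  }

  // The null-aware condition from the diff: c OR isnull(c), where c is (a = b).
  def nullAware(a: Option[Int], b: Option[Int]): SqlBool = {
    val c = equalTo(a, b)
    or(c, isNull(c))
  }
}
```

When both inputs are non-null, `nullAware` returns exactly what `equalTo` returns, which is the case covered by the `is not null` filters in the query above. The two conditions differ only when an input is NULL, where `nullAware` yields TRUE.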



---


[GitHub] spark pull request #22163: [SPARK-25166][CORE]Reduce the number of write ope...

2018-08-22 Thread Ngone51

Github user Ngone51 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22163#discussion_r211954019
  
--- Diff: 
core/src/main/java/org/apache/spark/shuffle/sort/ShuffleExternalSorter.java ---
@@ -206,14 +211,21 @@ private void writeSortedFile(boolean isLastFile) {
   long recordReadPosition = recordOffsetInPage + uaoSize; // skip over 
record length
   while (dataRemaining > 0) {
 final int toTransfer = Math.min(diskWriteBufferSize, 
dataRemaining);
-Platform.copyMemory(
-  recordPage, recordReadPosition, writeBuffer, 
Platform.BYTE_ARRAY_OFFSET, toTransfer);
-writer.write(writeBuffer, 0, toTransfer);
+if (bufferOffset > 0 && bufferOffset + toTransfer > 
DISK_WRITE_BUFFER_SIZE) {
--- End diff --

Not a bad idea, but the code here may not work as you expect. If we get a 
record with size `X` < `diskWriteBufferSize` (same as `DISK_WRITE_BUFFER_SIZE`), 
then we call `writer.write()` only once. And if we get a record with 
size `Y` >= `diskWriteBufferSize`, then we call `writer.write()` 
(`Y` + `diskWriteBufferSize` - 1) / `diskWriteBufferSize` times. This does 
not change with the new code.
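
The arithmetic can be checked with a small standalone sketch. It is not Spark code: `WriteCount`, `writesForRecord` and `simulate` are hypothetical helpers, and the loop only mirrors the shape of the copy loop in the quoted diff.

```scala
object WriteCount {
  // Number of write calls needed for one record of `recordSize` bytes when each
  // call transfers at most `bufferSize` bytes (integer ceiling division).
  def writesForRecord(recordSize: Int, bufferSize: Int): Int = {
    require(recordSize > 0 && bufferSize > 0, "sizes must be positive")
    (recordSize + bufferSize - 1) / bufferSize
  }

  // Same count, obtained by simulating the chunked copy loop.
  def simulate(recordSize: Int, bufferSize: Int): Int = {
    var remaining = recordSize
    var writes = 0
    while (remaining > 0) {
      val toTransfer = math.min(bufferSize, remaining)
      remaining -= toTransfer
      writes += 1
    }
    writes
  }
}
```

A record smaller than the buffer, or exactly as large, needs one write. A record one byte larger needs two.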


---


[GitHub] spark issue #18099: [SPARK-18406][CORE][Backport-2.1] Race between end-of-ta...

2018-08-22 Thread appleyuchi

Github user appleyuchi commented on the issue:

https://github.com/apache/spark/pull/18099
  
The following happened to me when I ran a lab with ALS in Spark:

8/08/22 21:24:14 ERROR Utils: Uncaught exception in thread stdout writer for python
**java.lang.AssertionError: assertion failed: Block rdd_7_0 is not locked for reading**
at scala.Predef$.assert(Predef.scala:170)
at org.apache.spark.storage.BlockInfoManager.unlock(BlockInfoManager.scala:299)
at org.apache.spark.storage.BlockManager.releaseLock(BlockManager.scala:769)
at org.apache.spark.storage.BlockManager$$anonfun$1.apply$mcV$sp(BlockManager.scala:540)
at org.apache.spark.util.CompletionIterator$$anon$1.completion(CompletionIterator.scala:44)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:33)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:213)
at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:407)
at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:215)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1991)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:170)
Exception in thread "stdout writer for python" java.lang.AssertionError: assertion failed: Block rdd_7_0 is not locked for reading
at scala.Predef$.assert(Predef.scala:170)
at org.apache.spark.storage.BlockInfoManager.unlock(BlockInfoManager.scala:299)
at org.apache.spark.storage.BlockManager.releaseLock(BlockManager.scala:769)
at org.apache.spark.storage.BlockManager$$anonfun$1.apply$mcV$sp(BlockManager.scala:540)
at org.apache.spark.util.CompletionIterator$$anon$1.completion(CompletionIterator.scala:44)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:33)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:213)
at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:407)
at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:215)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1991)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:170)



---


[GitHub] spark issue #18099: [SPARK-18406][CORE][Backport-2.1] Race between end-of-ta...

2018-08-22 Thread appleyuchi

Github user appleyuchi commented on the issue:

https://github.com/apache/spark/pull/18099
  
Is this fix available in Spark 2.3.1?
Thanks


---


[GitHub] spark pull request #22121: [SPARK-25133][SQL][Doc]Avro data source guide

2018-08-22 Thread dongjoon-hyun

Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/22121#discussion_r211940709
  
--- Diff: docs/avro-data-source-guide.md ---
@@ -0,0 +1,377 @@
+---
+layout: global
+title: Apache Avro Data Source Guide
+---
+
+* This will become a table of contents (this text will be scraped).
+{:toc}
+
+Since Spark 2.4 release, [Spark 
SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html) provides 
built-in support for reading and writing Apache Avro data.
+
+## Deploying
+The `spark-avro` module is external and not included in `spark-submit` or 
`spark-shell` by default.
+
+As with any Spark applications, `spark-submit` is used to launch your 
application. `spark-avro_{{site.SCALA_BINARY_VERSION}}`
+and its dependencies can be directly added to `spark-submit` using 
`--packages`, such as,
+
+./bin/spark-submit --packages 
org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}}
 ...
+
+For experimenting on `spark-shell`, you can also use `--packages` to add 
`org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}` and its 
dependencies directly,
+
+./bin/spark-shell --packages 
org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}}
 ...
+
+See [Application Submission Guide](submitting-applications.html) for more 
details about submitting applications with external dependencies.
+
+## Load and Save Functions
+
+Since `spark-avro` module is external, there is not such API as `.avro` in 
+`DataFrameReader` or `DataFrameWriter`.
+
+To load/save data in Avro format, you need to specify the data source 
option `format` as `avro`(or `org.apache.spark.sql.avro`).
+
+
+{% highlight scala %}
+
+val usersDF = 
spark.read.format("avro").load("examples/src/main/resources/users.avro")
+usersDF.select("name", 
"favorite_color").write.format("avro").save("namesAndFavColors.avro")
+
+{% endhighlight %}
+
+
+{% highlight java %}
+
+Dataset usersDF = 
spark.read().format("avro").load("examples/src/main/resources/users.avro");
+usersDF.select("name", 
"favorite_color").write().format("avro").save("namesAndFavColors.avro");
+
+{% endhighlight %}
+
+
+{% highlight python %}
+
+df = 
spark.read.format("avro").load("examples/src/main/resources/users.avro")
+df.select("name", 
"favorite_color").write.format("avro").save("namesAndFavColors.avro")
+
+{% endhighlight %}
+
+
+{% highlight r %}
+
+df <- read.df("examples/src/main/resources/users.avro", "avro")
+write.df(select(df, "name", "favorite_color"), "namesAndFavColors.avro", 
"avro")
+
+{% endhighlight %}
+
+
+
+## to_avro() and from_avro()
+Spark SQL provides function `to_avro` to encode a struct as a string and 
`from_avro()` to retrieve the struct as a complex type.
+
+Using Avro record as columns are useful when reading from or writing to a 
streaming source like Kafka. Each 
+Kafka key-value record will be augmented with some metadata, such as the 
ingestion timestamp into Kafka, the offset in Kafka, etc.
+* If the "value" field that contains your data is in Avro, you could use 
`from_avro()` to extract your data, enrich it, clean it, and then push it 
downstream to Kafka again or write it out to a file.
+* `to_avro()` can be used to turn structs into Avro records. This method 
is particularly useful when you would like to re-encode multiple columns into a 
single one when writing data out to Kafka.
+
+Both methods are presently only available in Scala and Java.
+
+
+
+{% highlight scala %}
+import org.apache.spark.sql.avro._
+
+// `from_avro` requires Avro schema in JSON string format.
+val jsonFormatSchema = new 
String(Files.readAllBytes(Paths.get("./examples/src/main/resources/user.avsc")))
+
+val df = spark
+  .readStream
+  .format("kafka")
+  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
+  .option("subscribe", "topic1")
+  .load()
+
+// 1. Decode the Avro data into a struct;
+// 2. Filter by column `favorite_color`;
+// 3. Encode the column `name` in Avro format.
+val output = df
+  .select(from_avro('value, jsonFormatSchema) as 'user)
+  .where("user.favorite_color == \"red\"")
+  .select(to_avro($"user.name") as 'value)
+
+val ds = output
+  .writeStream
+  .format("kafka")
+  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
+  .option("topic", "topic2")
+  .start()
+
+{% endhighlight %}
+
+
+{% highlight java %}
+import org.apache.spark.sql.avro.*
+
+// `from_avro` requires Avro schema in JSON string format.
+String jsonFormatSchema = new

[GitHub] spark issue #17400: [SPARK-19981][SQL] Respect aliases in output partitionin...

2018-08-22 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17400
  
**[Test build #95102 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95102/testReport)**
 for PR 17400 at commit 
[`5482b1b`](https://github.com/apache/spark/commit/5482b1be6308ddf7e77dc25c0bdfca3ede2d61a7).


---


[GitHub] spark issue #17400: [SPARK-19981][SQL] Respect aliases in output partitionin...

2018-08-22 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17400
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #17400: [SPARK-19981][SQL] Respect aliases in output partitionin...

2018-08-22 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17400
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2442/
Test PASSed.


---


[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-22 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22112
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95096/
Test PASSed.


---


[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-22 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22112
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-22 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22112
  
**[Test build #95096 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95096/testReport)**
 for PR 22112 at commit 
[`2a88a47`](https://github.com/apache/spark/commit/2a88a473f036c2da3612f3e53e17d1c05dff4458).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark issue #22182: [SPARK-25184][SS] Fixed race condition in StreamExecutio...

2018-08-22 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22182
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #22182: [SPARK-25184][SS] Fixed race condition in StreamExecutio...

2018-08-22 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22182
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95098/
Test PASSed.


---


[GitHub] spark issue #22182: [SPARK-25184][SS] Fixed race condition in StreamExecutio...

2018-08-22 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22182
  
**[Test build #95098 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95098/testReport)**
 for PR 22182 at commit 
[`319990f`](https://github.com/apache/spark/commit/319990ff60ad7b6fad6fd0cea5cada0b22e3f3c9).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark issue #22181: [SPARK-25163][SQL] Fix flaky test: o.a.s.util.collection...

2018-08-22 Thread viirya

Github user viirya commented on the issue:

https://github.com/apache/spark/pull/22181
  
cc @zsxwing @cloud-fan 


---


[GitHub] spark issue #22181: [SPARK-25163][SQL] Fix flaky test: o.a.s.util.collection...

2018-08-22 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22181
  
**[Test build #95101 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95101/testReport)**
 for PR 22181 at commit 
[`77e108a`](https://github.com/apache/spark/commit/77e108a18788502d05b1b3dacc21c3e72eac4264).


---


[GitHub] spark issue #22181: [SPARK-25163][SQL] Fix flaky test: o.a.s.util.collection...

2018-08-22 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22181
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #22181: [SPARK-25163][SQL] Fix flaky test: o.a.s.util.collection...

2018-08-22 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22181
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2441/
Test PASSed.


---


[GitHub] spark issue #16478: [SPARK-7768][SQL] Revise user defined types (UDT)

2018-08-22 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16478
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95094/
Test PASSed.


---


[GitHub] spark issue #16478: [SPARK-7768][SQL] Revise user defined types (UDT)

2018-08-22 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16478
  
**[Test build #95094 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95094/testReport)**
 for PR 16478 at commit 
[`8b83ec7`](https://github.com/apache/spark/commit/8b83ec7242fe44847485c0591c90bc41dbdfea4a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark issue #16478: [SPARK-7768][SQL] Revise user defined types (UDT)

2018-08-22 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16478
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #22181: [SPARK-25163][SQL] Fix flaky test: o.a.s.util.collection...

2018-08-22 Thread viirya

Github user viirya commented on the issue:

https://github.com/apache/spark/pull/22181
  
retest this please.


---


[GitHub] spark issue #21546: [SPARK-23030][SQL][PYTHON] Use Arrow stream format for c...

2018-08-22 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21546
  
**[Test build #95093 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95093/testReport)**
 for PR 21546 at commit 
[`89d7836`](https://github.com/apache/spark/commit/89d78364d93490b1b301c5ec766e4390bdc0b8a7).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class BarrierTaskContext(TaskContext):`
  * `class BarrierTaskInfo(object):`
  * `case class StateStoreCustomSumMetric(name: String, desc: String) 
extends StateStoreCustomMetric`
  * `sealed trait StreamingAggregationStateManager extends Serializable `
  * `abstract class StreamingAggregationStateManagerBaseImpl(`
  * `class StreamingAggregationStateManagerImplV1(`
  * `class StreamingAggregationStateManagerImplV2(`


---


[GitHub] spark issue #21546: [SPARK-23030][SQL][PYTHON] Use Arrow stream format for c...

2018-08-22 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21546
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #21546: [SPARK-23030][SQL][PYTHON] Use Arrow stream format for c...

2018-08-22 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21546
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95093/
Test PASSed.


---


[GitHub] spark issue #22163: [SPARK-25166][CORE]Reduce the number of write operations...

2018-08-22 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22163
  
Merged build finished. Test FAILed.


---


[GitHub] spark issue #22163: [SPARK-25166][CORE]Reduce the number of write operations...

2018-08-22 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22163
  
**[Test build #95095 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95095/testReport)**
 for PR 22163 at commit 
[`f91e18c`](https://github.com/apache/spark/commit/f91e18c7d4b8eab53c4983320a0eab0403c37a48).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark issue #22163: [SPARK-25166][CORE]Reduce the number of write operations...

2018-08-22 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22163
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95095/
Test FAILed.


---


[GitHub] spark issue #22141: [SPARK-25154][SQL] Support NOT IN sub-queries inside nes...

2018-08-22 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22141
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95091/
Test PASSed.


---


[GitHub] spark issue #22141: [SPARK-25154][SQL] Support NOT IN sub-queries inside nes...

2018-08-22 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22141
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #22141: [SPARK-25154][SQL] Support NOT IN sub-queries inside nes...

2018-08-22 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22141
  
**[Test build #95091 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95091/testReport)**
 for PR 22141 at commit 
[`844a3ff`](https://github.com/apache/spark/commit/844a3ff82a688e7398bb130a44750aec78420698).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark issue #22181: [SPARK-25163][SQL] Fix flaky test: o.a.s.util.collection...

2018-08-22 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22181
  
Merged build finished. Test FAILed.


---


[GitHub] spark issue #22181: [SPARK-25163][SQL] Fix flaky test: o.a.s.util.collection...

2018-08-22 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22181
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95088/
Test FAILed.


---


[GitHub] spark issue #22181: [SPARK-25163][SQL] Fix flaky test: o.a.s.util.collection...

2018-08-22 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22181
  
**[Test build #95088 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95088/testReport)**
 for PR 22181 at commit 
[`77e108a`](https://github.com/apache/spark/commit/77e108a18788502d05b1b3dacc21c3e72eac4264).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark issue #22184: [SPARK-25132][SQL][DOC] Add migration doc for case-insen...

2018-08-22 Thread seancxmao

Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/22184
  
@gatorsmile Could you kindly help trigger Jenkins and review?


---


[GitHub] spark issue #22153: [SPARK-23034][SQL] Show RDD/relation names in RDD/In-Mem...

2018-08-22 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22153
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #22153: [SPARK-23034][SQL] Show RDD/relation names in RDD/In-Mem...

2018-08-22 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22153
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95089/
Test PASSed.


---


[GitHub] spark issue #22153: [SPARK-23034][SQL] Show RDD/relation names in RDD/In-Mem...

2018-08-22 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22153
  
**[Test build #95089 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95089/testReport)**
 for PR 22153 at commit 
[`da76a1b`](https://github.com/apache/spark/commit/da76a1beb31e972b41b7015e666bc1ee4e18007f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark issue #17400: [SPARK-19981][SQL] Respect aliases in output partitionin...

2018-08-22 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17400
  
Merged build finished. Test FAILed.


---


[GitHub] spark issue #17400: [SPARK-19981][SQL] Respect aliases in output partitionin...

2018-08-22 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17400
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95097/
Test FAILed.


---


[GitHub] spark issue #17400: [SPARK-19981][SQL] Respect aliases in output partitionin...

2018-08-22 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17400
  
**[Test build #95097 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95097/testReport)**
 for PR 17400 at commit 
[`91809e5`](https://github.com/apache/spark/commit/91809e5942e5f90c802234f815593ccec92a0c54).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `trait AliasAwareOutputPartitioning extends UnaryExecNode `


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

2018-08-22 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22121
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95100/
Test PASSed.


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

2018-08-22 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22121
  
**[Test build #95100 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95100/testReport)**
 for PR 22121 at commit 
[`d9c5352`](https://github.com/apache/spark/commit/d9c5352c8ffc70d271a8aa68c3ffec41b4158ece).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

2018-08-22 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22121
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #21931: [SPARK-24978][SQL]Add spark.sql.fast.hash.aggregate.row....

2018-08-22 Thread heary-cao

Github user heary-cao commented on the issue:

https://github.com/apache/spark/pull/21931
  
cc @cloud-fan @hvanhovell



---


[GitHub] spark issue #21860: [SPARK-24901][SQL]Merge the codegen of RegularHashMap an...

2018-08-22 Thread heary-cao

Github user heary-cao commented on the issue:

https://github.com/apache/spark/pull/21860
  
cc @cloud-fan @hvanhovell


---


[GitHub] spark issue #22184: [SPARK-25132][SQL][DOC] Add migration doc for case-insen...

2018-08-22 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22184
  
Can one of the admins verify this patch?


---


[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...

2018-08-22 Thread seancxmao

GitHub user seancxmao opened a pull request:

https://github.com/apache/spark/pull/22184

[SPARK-25132][SQL][DOC] Add migration doc for case-insensitive field 
resolution when reading from Parquet

## What changes were proposed in this pull request?
#22148 introduces a behavior change. We need to document it in the 
migration guide.

## How was this patch tested?
N/A


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/seancxmao/spark SPARK-25132-DOC

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22184.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22184


commit eae8a3c98f146765d25bbf529421ce3c7a92639b
Author: seancxmao 
Date:   2018-08-22T09:17:55Z

[SPARK-25132][SQL][DOC] Case-insensitive field resolution when reading from 
Parquet




---


[GitHub] spark issue #22184: [SPARK-25132][SQL][DOC] Add migration doc for case-insen...

2018-08-22 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22184
  
Can one of the admins verify this patch?


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

2018-08-22 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22121
  
**[Test build #95100 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95100/testReport)**
 for PR 22121 at commit 
[`d9c5352`](https://github.com/apache/spark/commit/d9c5352c8ffc70d271a8aa68c3ffec41b4158ece).


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

2018-08-22 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22121
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2440/
Test PASSed.


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

2018-08-22 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22121
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

2018-08-22 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22121
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

2018-08-22 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22121
  
**[Test build #95099 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95099/testReport)**
 for PR 22121 at commit 
[`d2681ec`](https://github.com/apache/spark/commit/d2681ec51a7dbc0296800cdbedb3d46827bf2b6f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

2018-08-22 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22121
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95099/
Test PASSed.


---


[GitHub] spark issue #22171: [SPARK-25177][SQL] When dataframe decimal type column ha...

2018-08-22 Thread viirya

Github user viirya commented on the issue:

https://github.com/apache/spark/pull/22171
  
So this is an issue only related to `Dataset.show`?


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

2018-08-22 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22121
  
Merged build finished. Test PASSed.


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

2018-08-22 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22121
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2439/
Test PASSed.


---


[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

2018-08-22 Thread gengliangwang

Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/22121
  
@srowen @tgravescs @gatorsmile @HyukjinKwon @dongjoon-hyun Thanks for the 
reviews! I have added the sections `to_avro() and from_avro()` and 
`Compatibility with Databricks spark-avro`. 

I have also attached an HTML file for preview; please check it in the PR 
description.


---


[GitHub] spark issue #22121: [WIP][SPARK-25133][SQL][Doc]Avro data source guide

2018-08-22 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22121
  
**[Test build #95099 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95099/testReport)**
 for PR 22121 at commit 
[`d2681ec`](https://github.com/apache/spark/commit/d2681ec51a7dbc0296800cdbedb3d46827bf2b6f).


---


[GitHub] spark pull request #22121: [WIP][SPARK-25133][SQL][Doc]Avro data source guid...

2018-08-22 Thread gengliangwang

Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/22121#discussion_r211875231
  
--- Diff: docs/avro-data-source-guide.md ---
@@ -0,0 +1,260 @@
+---
+layout: global
+title: Apache Avro Data Source Guide
+---
+
+* This will become a table of contents (this text will be scraped).
+{:toc}
+
+Since Spark 2.4 release, [Spark SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html) provides built-in support for reading and writing Apache Avro data.
+
--- End diff --

@tgravescs I have added an independent section for it :)


---


[GitHub] spark issue #21794: [SPARK-24834][CORE] use java comparison for float and do...

2018-08-22 Thread bavardage

Github user bavardage commented on the issue:

https://github.com/apache/spark/pull/21794
  
yep fair - the intent I think was clarity rather than necessarily perf: 
it's misleading to have a method named 'nan safe' which has no special handling 
of nans. I'll look at opening a different PR which could increase clarity/may 
have minor perf benefit.
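
For readers following the NaN discussion, here is a minimal sketch of how primitive `==` and `java.lang.Double.compare` differ on NaN. It is not Spark's utility code; the object and method names (`NaNComparisonDemo`, `primitiveEquals`, `javaCompare`) are made up for illustration, and only the standard `java.lang.Double.compare` call is assumed.

```scala
object NaNComparisonDemo {
  // Primitive comparison: NaN is never equal to anything, including itself.
  def primitiveEquals(a: Double, b: Double): Boolean = a == b

  // java.lang.Double.compare defines a total order in which NaN equals
  // itself and is greater than every other value, including +Infinity.
  def javaCompare(a: Double, b: Double): Int = java.lang.Double.compare(a, b)

  def main(args: Array[String]): Unit = {
    val nan = Double.NaN
    println(primitiveEquals(nan, nan))                      // false
    println(javaCompare(nan, nan))                          // 0
    println(javaCompare(nan, Double.PositiveInfinity) > 0)  // true
  }
}
```

Because `Double.compare` already orders NaN consistently, a wrapper that only delegates to it adds no NaN handling of its own, which appears to be the naming concern raised above.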


---


[GitHub] spark pull request #22171: [SPARK-25177][SQL] When dataframe decimal type co...

2018-08-22 Thread mengxr

Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/22171#discussion_r211867603
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala ---
@@ -197,7 +197,7 @@ final class Decimal extends Ordered[Decimal] with Serializable {
 }
   }
 
-  override def toString: String = toBigDecimal.toString()
+  override def toString: String = toBigDecimal.bigDecimal.toPlainString()
--- End diff --

I don't recall anything that is relevant:)
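
As context for the diff above, a small sketch of what the two calls return for a tiny value. It uses plain `scala.math.BigDecimal` rather than Spark's `Decimal` class, and the object and method names (`DecimalToStringDemo`, `scientific`, `plain`) are hypothetical, so treat it as an illustration of the underlying `java.math.BigDecimal` behavior only.

```scala
object DecimalToStringDemo {
  // toString may switch to scientific notation for very small values.
  def scientific(s: String): String = BigDecimal(s).toString()

  // toPlainString on the underlying java.math.BigDecimal never uses an exponent.
  def plain(s: String): String = BigDecimal(s).bigDecimal.toPlainString()

  def main(args: Array[String]): Unit = {
    println(scientific("0.0000000001")) // 1E-10
    println(plain("0.0000000001"))      // 0.0000000001
  }
}
```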


---

