[GitHub] spark issue #21859: [SPARK-24900][SQL]Speed up sort when the dataset is smal...

2018-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21859
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21859: [SPARK-24900][SQL]Speed up sort when the dataset is smal...

2018-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21859
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94931/
Test PASSed.


---




[GitHub] spark issue #21859: [SPARK-24900][SQL]Speed up sort when the dataset is smal...

2018-08-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21859
  
**[Test build #94931 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94931/testReport)**
 for PR 21859 at commit 
[`46bab16`](https://github.com/apache/spark/commit/46bab165af68c1ef2dd1fc57e7f27f5d27c72015).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #22135: [SPARK-25093][SQL] Avoid recompiling regexp for c...

2018-08-18 Thread igreenfield
Github user igreenfield commented on a diff in the pull request:

https://github.com/apache/spark/pull/22135#discussion_r211091975
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeFormatter.scala
 ---
@@ -91,10 +94,7 @@ object CodeFormatter {
   }
 
   def stripExtraNewLinesAndComments(input: String): String = {
-val commentReg =
-  ("""([ |\t]*?\/\*[\s|\S]*?\*\/[ |\t]*?)|""" +// strip /*comment*/
-   """([ |\t]*?\/\/[\s\S]*?\n)""").r   // strip //comment
-val codeWithoutComment = commentReg.replaceAllIn(input, "")
+val codeWithoutComment = commentRegexp.replaceAllIn(input, "")
 codeWithoutComment.replaceAll("""\n\s*\n""", "\n") // strip ExtraNewLines
--- End diff --

This line also compiles a regex on every call and could be replaced with a precompiled pattern!
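
The idea behind this comment can be sketched in Python rather than the PR's Scala: build each pattern once at module level and reuse it, instead of constructing it inside the function on every call. The names below are illustrative and the patterns are simplified relative to the ones in the diff, so treat this as a sketch of the technique, not as the code in `CodeFormatter`.

```python
import re

# Compiled once at import time and reused by every call below.
_COMMENT_RE = re.compile(
    r"([ \t]*/\*[\s\S]*?\*/[ \t]*)"   # strip /* comment */
    r"|([ \t]*//[^\n]*\n)"            # strip // comment (including its newline)
)
_EXTRA_NEWLINES_RE = re.compile(r"\n\s*\n")


def strip_extra_newlines_and_comments(code: str) -> str:
    """Remove comments, then collapse runs of blank lines into one newline."""
    without_comments = _COMMENT_RE.sub("", code)
    return _EXTRA_NEWLINES_RE.sub("\n", without_comments)


print(repr(strip_extra_newlines_and_comments("a; /* x */\n\n\nb;\n")))
```

In Java and Scala the motivation is stronger than in Python, because `String.replaceAll` compiles its pattern argument each time it is invoked.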


---




[GitHub] spark issue #22140: [SPARK-25072][PySpark] Forbid extra value for custom Row

2018-08-18 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/22140
  
cc @HyukjinKwon


---




[GitHub] spark issue #22141: [SPARK-25154][SQL] Support NOT IN sub-queries inside nes...

2018-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22141
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2304/
Test PASSed.


---




[GitHub] spark issue #22141: [SPARK-25154][SQL] Support NOT IN sub-queries inside nes...

2018-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22141
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22141: [SPARK-25154][SQL] Support NOT IN sub-queries inside nes...

2018-08-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22141
  
**[Test build #94932 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94932/testReport)**
 for PR 22141 at commit 
[`926c7fc`](https://github.com/apache/spark/commit/926c7fc7a766b1e310798a87ae7a485f731ade99).


---




[GitHub] spark pull request #21931: [SPARK-24978][SQL]Add spark.sql.fast.hash.aggrega...

2018-08-18 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21931#discussion_r211089742
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/AggregateBenchmark.scala
 ---
@@ -366,6 +366,43 @@ class AggregateBenchmark extends BenchmarkBase {
  */
   }
 
+  ignore("capacity for fast hash aggregate") {
+val N = 20 << 20
+val M = 1 << 19
+
+val benchmark = new Benchmark("Aggregate w multiple keys", N)
--- End diff --

`Benchmark("Capacity for fast hash aggregate")`


---




[GitHub] spark issue #21931: [SPARK-24978][SQL]Add spark.sql.fast.hash.aggregate.row....

2018-08-18 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/21931
  
Minor comments. LGTM.


---




[GitHub] spark pull request #21931: [SPARK-24978][SQL]Add spark.sql.fast.hash.aggrega...

2018-08-18 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21931#discussion_r211089705
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/AggregateBenchmark.scala
 ---
@@ -366,6 +366,43 @@ class AggregateBenchmark extends BenchmarkBase {
  */
   }
 
+  ignore("capacity for fast hash aggregate") {
+val N = 20 << 20
+val M = 1 << 19
+
+val benchmark = new Benchmark("Aggregate w multiple keys", N)
+sparkSession.range(N)
+  .selectExpr(
+"id",
+s"(id % $M) as k1",
+s"cast(id % $M as int) as k2",
+s"cast(id % $M as double) as k3",
+s"cast(id % $M as float) as k4") .createOrReplaceTempView("test")
+
+def f(): Unit = sparkSession.sql("select k1, k2, k3, k4, sum(k1), sum(k2), sum(k3), " +
+  "sum(k4) from test group by k1, k2, k3, k4").collect()
+
+benchmark.addCase(s"fasthash = default") { iter =>
+  sparkSession.conf.set("spark.sql.codegen.aggregate.fastHashMap.capacityBit", "16")
+  f()
+}
+
+benchmark.addCase(s"fasthash = config") { iter =>
--- End diff --

"fasthash = 20"?


---




[GitHub] spark pull request #21931: [SPARK-24978][SQL]Add spark.sql.fast.hash.aggrega...

2018-08-18 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21931#discussion_r211089695
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -1437,6 +1437,16 @@ object SQLConf {
 .intConf
 .createWithDefault(20)
 
+  val FAST_HASH_AGGREGATE_MAX_ROWS_CAPACITY_BIT =
+buildConf("spark.sql.codegen.aggregate.fastHashMap.capacityBit")
+  .internal()
+  .doc("Capacity for the max number of rows to be held in memory by the fast hash aggregate " +
+"product operator. the bit not for actual value, but the actual numBuckets is determined " +
--- End diff --

`the bit` -> `The bit is`.
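
To make the numbers in this thread concrete, here is a small arithmetic sketch in Python. It assumes the capacity is simply `1 << capacityBit`, which is how the doc string reads but is not confirmed by the quoted diff. `N` and `M` are copied from the benchmark under review, and `capacity_for_bit` is a hypothetical helper.

```python
N = 20 << 20   # rows generated by sparkSession.range(N) in the benchmark
M = 1 << 19    # distinct values per grouping key (id % M)


def capacity_for_bit(capacity_bit: int) -> int:
    # Assumption: the bit count is an exponent, so capacity = 2 ** capacity_bit.
    return 1 << capacity_bit


for bit in (16, 20):
    capacity = capacity_for_bit(bit)
    # Columns: bit count, assumed row capacity, whether all M groups would fit.
    print(bit, capacity, capacity >= M)
```

Under that assumption, 16 bits gives 65536 rows, fewer than the 524288 distinct groups, while 20 bits gives 1048576, which may be why the benchmark compares those two settings.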


---




[GitHub] spark pull request #21931: [SPARK-24978][SQL]Add spark.sql.fast.hash.aggrega...

2018-08-18 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21931#discussion_r211089716
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/AggregateBenchmark.scala
 ---
@@ -366,6 +366,43 @@ class AggregateBenchmark extends BenchmarkBase {
  */
   }
 
+  ignore("capacity for fast hash aggregate") {
+val N = 20 << 20
+val M = 1 << 19
+
+val benchmark = new Benchmark("Aggregate w multiple keys", N)
+sparkSession.range(N)
+  .selectExpr(
+"id",
+s"(id % $M) as k1",
+s"cast(id % $M as int) as k2",
+s"cast(id % $M as double) as k3",
+s"cast(id % $M as float) as k4") .createOrReplaceTempView("test")
+
+def f(): Unit = sparkSession.sql("select k1, k2, k3, k4, sum(k1), sum(k2), sum(k3), " +
+  "sum(k4) from test group by k1, k2, k3, k4").collect()
+
+benchmark.addCase(s"fasthash = default") { iter =>
--- End diff --

"fasthash = 16"?


---




[GitHub] spark issue #20838: [SPARK-23698] Resolve undefined names in Python 3

2018-08-18 Thread cclauss
Github user cclauss commented on the issue:

https://github.com/apache/spark/pull/20838
  
This is not working at all...  I am wasting way too much time.  5+ months 
and 80+ comments for 12 lines of code is 

I do not have the skills to solve the following undefined name 'long' in a 
satisfactory manner:
```
./python/pyspark/streaming/dstream.py:405:35: F821 undefined name 'long'
return self._sc._jvm.Time(long(timestamp * 1000))
  ^
```
If someone with more skills would be willing to take that one undefined 
name off my plate and solve it with a test in a separate PR then I would be 
grateful.  I will study that PR carefully and can then proceed with the others 
that are in this PR.

My recommended fix is at 
https://github.com/apache/spark/pull/20838/files#diff-6c576c52abc0624ccb6a2f45828dc6a7
 and my proposed test (it is failing!) is immediately following.
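
One common way to handle this kind of undefined name, shown only as a sketch and not as the fix adopted in this PR, is to alias `long` to `int` when running on Python 3. `to_jvm_millis` is a hypothetical stand-in for the `dstream.py` line flagged above; it returns the integer instead of passing it to the JVM.

```python
import sys

if sys.version_info[0] >= 3:
    # Python 3 merged `long` into `int`, so the old name no longer exists.
    long = int


def to_jvm_millis(timestamp):
    # Same expression as the flagged line, minus the self._sc._jvm.Time(...) call.
    return long(timestamp * 1000)


print(to_jvm_millis(1.5))
```

With the alias in place, the name is defined on both major versions, so a linter no longer reports F821 for it.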



---




[GitHub] spark issue #21931: [SPARK-24978][SQL]Add spark.sql.fast.hash.aggregate.row....

2018-08-18 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/21931
  
LGTM, cc @cloud-fan @hvanhovell 


---




[GitHub] spark issue #21859: [SPARK-24900][SQL]Speed up sort when the dataset is smal...

2018-08-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21859
  
**[Test build #94931 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94931/testReport)**
 for PR 21859 at commit 
[`46bab16`](https://github.com/apache/spark/commit/46bab165af68c1ef2dd1fc57e7f27f5d27c72015).


---




[GitHub] spark issue #21859: [SPARK-24900][SQL]Speed up sort when the dataset is smal...

2018-08-18 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/21859
  
retest this please.


---




[GitHub] spark issue #22141: [SPARK-25154][SQL] Support NOT IN sub-queries inside nes...

2018-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22141
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94930/
Test FAILed.


---




[GitHub] spark issue #22141: [SPARK-25154][SQL] Support NOT IN sub-queries inside nes...

2018-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22141
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #22141: [SPARK-25154][SQL] Support NOT IN sub-queries inside nes...

2018-08-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22141
  
**[Test build #94930 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94930/testReport)**
 for PR 22141 at commit 
[`473bfb5`](https://github.com/apache/spark/commit/473bfb500b07626ff42a9e5ddc167970299bde21).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #22131: [SPARK-25141][SQL][TEST] Modify tests for higher-...

2018-08-18 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22131


---




[GitHub] spark issue #22131: [SPARK-25141][SQL][TEST] Modify tests for higher-order f...

2018-08-18 Thread ueshin
Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/22131
  
Thanks! I'd use this one. Merging to master.


---




[GitHub] spark issue #22141: [SPARK-25154][SQL] Support NOT IN sub-queries inside nes...

2018-08-18 Thread dmateusp
Github user dmateusp commented on the issue:

https://github.com/apache/spark/pull/22141
  
I reproduced the issue with the following code (was a bit surprised with 
the behavior)

The tables:
```scala
scala> spark.sql("SELECT * FROM users").show
+---+-------+
| id|country|
+---+-------+
|  0|     10|
|  1|     20|
+---+-------+


scala> spark.sql("SELECT * FROM countries").show
+---+--------+
| id|    name|
+---+--------+
| 10|Portugal|
+---+--------+
```

Without the OR:
```scala
scala> spark.sql("SELECT * FROM users u WHERE u.country NOT IN (SELECT id from countries)").show
+---+-------+
| id|country|
+---+-------+
|  1|     20|
+---+-------+
```

With an OR and IN:
```scala
scala> spark.sql("SELECT * FROM users u WHERE 1=0 OR u.country IN (SELECT id from countries)").show
+---+-------+
| id|country|
+---+-------+
|  0|     10|
+---+-------+
```

With an OR and NOT IN:
```scala
scala> spark.sql("SELECT * FROM users u WHERE 1=0 OR u.country NOT IN (SELECT id from countries)").show
org.apache.spark.sql.AnalysisException: Null-aware predicate sub-queries cannot be used in nested conditions: ((1 = 0) || NOT country#9 IN (list#62 []));;
```

+1 to get that fixed
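
For reference, the result one would expect from the failing query can be worked out with a small Python model of SQL's three-valued logic, with `None` standing in for NULL. This is a toy model of the semantics, not Spark's subquery rewrite; `sql_not_in` and `sql_or` are hypothetical helpers.

```python
def sql_not_in(value, candidates):
    """Three-valued `value NOT IN (candidates)`: True, False, or None for NULL."""
    candidates = list(candidates)
    if not candidates:
        return True            # NOT IN over an empty set is always TRUE
    if value is None:
        return None            # NULL compared with anything is unknown
    if any(c is not None and c == value for c in candidates):
        return False           # a definite match
    if any(c is None for c in candidates):
        return None            # no match, but a NULL candidate makes it unknown
    return True


def sql_or(left, right):
    """Three-valued OR."""
    if left is True or right is True:
        return True
    if left is None or right is None:
        return None
    return False


users = [(0, 10), (1, 20)]   # (id, country)
country_ids = [10]           # SELECT id FROM countries

# WHERE 1=0 OR u.country NOT IN (SELECT id FROM countries)
# A WHERE clause keeps a row only when the predicate is TRUE.
kept = [row for row in users
        if sql_or(1 == 0, sql_not_in(row[1], country_ids)) is True]
print(kept)
```

The NULL handling is what makes the predicate null-aware: if the subquery also returned a NULL, `sql_not_in(20, [10, None])` would be `None` and the row would be filtered out.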


---




[GitHub] spark issue #21899: [SPARK-24912][SQL] Don't obscure source of OOM during br...

2018-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21899
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21899: [SPARK-24912][SQL] Don't obscure source of OOM during br...

2018-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21899
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94929/
Test PASSed.


---




[GitHub] spark issue #21899: [SPARK-24912][SQL] Don't obscure source of OOM during br...

2018-08-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21899
  
**[Test build #94929 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94929/testReport)**
 for PR 21899 at commit 
[`829a333`](https://github.com/apache/spark/commit/829a333ad3dc152b90e5257cf67e2134c31e839e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22141: [SPARK-25154][SQL] Support NOT IN sub-queries inside nes...

2018-08-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22141
  
**[Test build #94930 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94930/testReport)**
 for PR 22141 at commit 
[`473bfb5`](https://github.com/apache/spark/commit/473bfb500b07626ff42a9e5ddc167970299bde21).


---




[GitHub] spark issue #22141: [SPARK-25154][SQL] Support NOT IN sub-queries inside nes...

2018-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22141
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2303/
Test PASSed.


---




[GitHub] spark issue #22141: [SPARK-25154][SQL] Support NOT IN sub-queries inside nes...

2018-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22141
  
Merged build finished. Test PASSed.


---




[GitHub] spark pull request #22141: [SPARK-25154] Support NOT IN sub-queries inside n...

2018-08-18 Thread dilipbiswal
GitHub user dilipbiswal opened a pull request:

https://github.com/apache/spark/pull/22141

[SPARK-25154] Support NOT IN sub-queries inside nested OR conditions.

## What changes were proposed in this pull request?
Currently, NOT IN subqueries (null-aware predicate subqueries) are not 
allowed inside OR expressions. We catch this condition in 
checkAnalysis and throw an error.

This PR enhances the subquery rewrite to support this type of query.

Query
```SQL
SELECT * FROM s1 WHERE a > 5 or b NOT IN (SELECT c FROM s2);
```
Optimized Plan
```SQL
a: int, b: int
Project [a#16, b#17]
+- Filter ((a#16 > 5) || NOT b#17 IN (list#13 []))
   :  +- Project [c#18]
   : +- SubqueryAlias `default`.`s2`
   :+- HiveTableRelation `default`.`s2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c#18, d#19]
   +- SubqueryAlias `default`.`s1`
  +- HiveTableRelation `default`.`s1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#16, b#17]
```
## How was this patch tested?
Added new tests in SQLQueryTestSuite, RewriteSubquerySuite and SubquerySuite.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dilipbiswal/spark SPARK-25154

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22141.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22141


commit 473bfb500b07626ff42a9e5ddc167970299bde21
Author: Dilip Biswal 
Date:   2018-08-18T21:22:37Z

[SPARK-25154] Support NOT IN sub-queries inside nested OR conditions.




---




[GitHub] spark issue #21669: [SPARK-23257][K8S][WIP] Kerberos Support for Spark on K8...

2018-08-18 Thread witten
Github user witten commented on the issue:

https://github.com/apache/spark/pull/21669
  
I see that this branch currently has merge conflicts, but any idea on when 
this might land? This is the last feature we're waiting for in order to switch 
from the abandoned [apache-spark-on-k8s 
fork](https://github.com/apache-spark-on-k8s/spark) to Spark actual! Thanks.


---




[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22112
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94923/
Test FAILed.


---




[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22112
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22112
  
**[Test build #94923 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94923/testReport)**
 for PR 22112 at commit 
[`739f210`](https://github.com/apache/spark/commit/739f210eb8f70499b56fb75fe573099fcad63541).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20838: [SPARK-23698] Resolve undefined names in Python 3

2018-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20838
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94927/
Test FAILed.


---




[GitHub] spark issue #20838: [SPARK-23698] Resolve undefined names in Python 3

2018-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20838
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #22135: [SPARK-25093][SQL] Avoid recompiling regexp for comments...

2018-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22135
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94926/
Test PASSed.


---




[GitHub] spark issue #22135: [SPARK-25093][SQL] Avoid recompiling regexp for comments...

2018-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22135
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20838: [SPARK-23698] Resolve undefined names in Python 3

2018-08-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20838
  
**[Test build #94927 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94927/testReport)**
 for PR 20838 at commit 
[`73a4fd2`](https://github.com/apache/spark/commit/73a4fd26eed19256da80d27b07b2e5f4d85eb9f6).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22135: [SPARK-25093][SQL] Avoid recompiling regexp for comments...

2018-08-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22135
  
**[Test build #94926 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94926/testReport)**
 for PR 22135 at commit 
[`5731825`](https://github.com/apache/spark/commit/5731825c9171cfe20591a0e7a34d927402881470).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22078: [SPARK-25085][SQL] Insert overwrite a non-partitioned ta...

2018-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22078
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22078: [SPARK-25085][SQL] Insert overwrite a non-partitioned ta...

2018-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22078
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94925/
Test PASSed.


---




[GitHub] spark issue #22078: [SPARK-25085][SQL] Insert overwrite a non-partitioned ta...

2018-08-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22078
  
**[Test build #94925 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94925/testReport)**
 for PR 22078 at commit 
[`f1574d5`](https://github.com/apache/spark/commit/f1574d5a3c4bfbd1a202154e69cff0dc81283e35).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22078: [SPARK-25085][SQL] Insert overwrite a non-partitioned ta...

2018-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22078
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94924/
Test PASSed.


---




[GitHub] spark issue #22078: [SPARK-25085][SQL] Insert overwrite a non-partitioned ta...

2018-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22078
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22078: [SPARK-25085][SQL] Insert overwrite a non-partitioned ta...

2018-08-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22078
  
**[Test build #94924 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94924/testReport)**
 for PR 22078 at commit 
[`bef1d17`](https://github.com/apache/spark/commit/bef1d17ced8865c3e95eaa451424e153b4b7214a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21899: [SPARK-24912][SQL] Don't obscure source of OOM during br...

2018-08-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21899
  
**[Test build #94929 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94929/testReport)**
 for PR 21899 at commit 
[`829a333`](https://github.com/apache/spark/commit/829a333ad3dc152b90e5257cf67e2134c31e839e).


---




[GitHub] spark issue #21899: [SPARK-24912][SQL] Don't obscure source of OOM during br...

2018-08-18 Thread bersprockets
Github user bersprockets commented on the issue:

https://github.com/apache/spark/pull/21899
  
retest this please


---




[GitHub] spark pull request #22123: [SPARK-25134][SQL] Csv column pruning with checki...

2018-08-18 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22123#discussion_r211081732
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 ---
@@ -1603,6 +1603,25 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
   .exists(msg => msg.getRenderedMessage.contains("CSV header does not conform to the schema")))
   }
 
+  test("SPARK-25134: check header on parsing of dataset with projection and column pruning") {
+withSQLConf(SQLConf.CSV_PARSER_COLUMN_PRUNING.key -> "true") {
+  withTempPath { path =>
+val dir = path.getAbsolutePath
+Seq(("a", "b")).toDF("columnA", "columnB").write
+  .format("csv")
+  .option("header", true)
+  .save(dir)
+checkAnswer(spark.read
+  .format("csv")
+  .option("header", true)
+  .option("enforceSchema", false)
+  .load(dir)
+  .select("columnA"),
--- End diff --

Could you check the corner case where the required schema is empty? For example, 
`.option("enforceSchema", false)` + `count()`.
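
The corner case can be pictured with a toy reader built on Python's `csv` module. This is not Spark's CSV data source: `read_csv_checked` is a hypothetical function that validates the header against the expected names before pruning columns, so the check still runs when no columns are requested, as happens for a `count()`.

```python
import csv
import io


def read_csv_checked(text, expected_header, required_columns):
    """Validate the header, then return rows pruned to required_columns."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    # The check uses the full header, independent of which columns are required.
    if header != list(expected_header):
        raise ValueError("CSV header does not conform to the schema")
    indices = [header.index(name) for name in required_columns]
    return [tuple(row[i] for i in indices) for row in reader]


data = "columnA,columnB\na,b\n"

# Projection on one column, as in the test under review.
print(read_csv_checked(data, ["columnA", "columnB"], ["columnA"]))

# Empty required schema: every row prunes to an empty tuple, yet the row
# count is still correct and the header check has still been applied.
print(len(read_csv_checked(data, ["columnA", "columnB"], [])))
```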


---




[GitHub] spark issue #22123: [SPARK-25134][SQL] Csv column pruning with checking of h...

2018-08-18 Thread MaxGekk
Github user MaxGekk commented on the issue:

https://github.com/apache/spark/pull/22123
  
May I ask you to also check the `multiLine` mode, since we use 
different methods of the uniVocity parser? When `multiLine` is disabled, the 
`parseLine` method is used, but `multiLine` mode goes through this code instead:

https://github.com/apache/spark/blob/a8a1ac01c4732f8a738b973c8486514cd88bf99b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala#L303-L307



---




[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

2018-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22121
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

2018-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22121
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94928/
Test PASSed.


---




[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

2018-08-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22121
  
**[Test build #94928 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94928/testReport)**
 for PR 22121 at commit 
[`8b191bd`](https://github.com/apache/spark/commit/8b191bd37af24ff27b6416ee6af4d885f1c94852).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #21909: [SPARK-24959][SQL] Speed up count() for JSON and ...

2018-08-18 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/21909


---




[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

2018-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22121
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

2018-08-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22121
  
**[Test build #94928 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94928/testReport)**
 for PR 22121 at commit 
[`8b191bd`](https://github.com/apache/spark/commit/8b191bd37af24ff27b6416ee6af4d885f1c94852).


---




[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide

2018-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22121
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2302/
Test PASSed.


---




[GitHub] spark pull request #22121: [SPARK-25133][SQL][Doc]Avro data source guide

2018-08-18 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/22121#discussion_r211081684
  
--- Diff: docs/avro-data-source-guide.md ---
@@ -0,0 +1,260 @@
+---
+layout: global
+title: Apache Avro Data Source Guide
+---
+
+* This will become a table of contents (this text will be scraped).
+{:toc}
+
+Since Spark 2.4 release, [Spark 
SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html) provides 
built-in support for reading and writing Apache Avro data.
+
+## Deploying
+The `spark-avro` module is external and not included in `spark-submit` or 
`spark-shell` by default.
+
+As with any Spark applications, `spark-submit` is used to launch your 
application. `spark-avro_{{site.SCALA_BINARY_VERSION}}`
+and its dependencies can be directly added to `spark-submit` using 
`--packages`, such as,
+
+    ./bin/spark-submit --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
+
+For experimenting on `spark-shell`, you can also use `--packages` to add 
`org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}` and its 
dependencies directly,
+
+    ./bin/spark-shell --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
+
+See [Application Submission Guide](submitting-applications.html) for more 
details about submitting applications with external dependencies.
+
+## Load/Save Functions
+
+Since `spark-avro` module is external, there is not such API as `.avro` in 
+`DataFrameReader` or `DataFrameWriter`.
+To load/save data in Avro format, you need to specify the data source 
option `format` as short name `avro` or full name `org.apache.spark.sql.avro`.
+
+
+{% highlight scala %}
+
+val usersDF = 
spark.read.format("avro").load("examples/src/main/resources/users.avro")
+usersDF.select("name", 
"favorite_color").write.format("avro").save("namesAndFavColors.avro")
+
+{% endhighlight %}
+
+
+{% highlight java %}
+
+Dataset usersDF = 
spark.read().format("avro").load("examples/src/main/resources/users.avro");
+usersDF.select("name", 
"favorite_color").write().format("avro").save("namesAndFavColors.avro");
+
+{% endhighlight %}
+
+
+{% highlight python %}
+
+df = 
spark.read.format("avro").load("examples/src/main/resources/users.avro")
+df.select("name", 
"favorite_color").write.format("avro").save("namesAndFavColors.avro")
+
+{% endhighlight %}
+
+
+{% highlight r %}
+
+df <- read.df("examples/src/main/resources/users.avro", "avro")
+write.df(select(df, "name", "favorite_color"), "namesAndFavColors.avro", 
"avro")
+
+{% endhighlight %}
+
+
+
+## Data Source Options
+
+Data source options of Avro can be set using the `.option` method on 
`DataFrameReader` or `DataFrameWriter`.
+
+| Property Name | Default | Meaning | Scope |
+|---|---|---|---|
+| avroSchema | None | Optional Avro schema provided by an user in JSON format. | read and write |
+| recordName | topLevelRecord | Top level record name in write result, which is required in Avro spec. | write |
+| recordNamespace | "" | Record namespace in write result. | write |
+| ignoreExtension | true | The option controls ignoring of files without .avro extensions in read. If the option is enabled, all files (with and without .avro extension) are loaded. | read |
+| compression | snappy | The compression option allows to specify a compression codec used in write. Currently supported codecs are uncompressed, snappy, deflate, bzip2 and xz. If the option is not set, the configuration spark.sql.avro.compression.codec config is taken into account. | write |
+
+
+## Supported types for Avro -> Spark SQL conversion
+Currently Spark supports reading all [primitive 
types](https://avro.apache.org/docs/1.8.2/spec.html#schema_primitive) and 
[complex types](https://avro.apache.org/docs/1.8.2/spec.html#schema_complex) of 
Avro.
+
+| Avro type | Spark SQL type |
+|---|---|
+| boolean | BooleanType |
+| int | IntegerType |
+| long | LongType |
+| float | FloatType |
+| double | DoubleType |
+| string | StringType |
+| enum | StringType |
+| fixed | BinaryType |
+| bytes | BinaryType |
+| record | StructType |
+| array | ArrayType |
+| map | MapType |

[GitHub] spark issue #21909: [SPARK-24959][SQL] Speed up count() for JSON and CSV

2018-08-18 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21909
  
LGTM.

Thanks for patiently addressing all the comments! Merged to master.


---




[GitHub] spark issue #21909: [SPARK-24959][SQL] Speed up count() for JSON and CSV

2018-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21909
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21909: [SPARK-24959][SQL] Speed up count() for JSON and CSV

2018-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21909
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94922/
Test PASSed.


---




[GitHub] spark issue #21909: [SPARK-24959][SQL] Speed up count() for JSON and CSV

2018-08-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21909
  
**[Test build #94922 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94922/testReport)**
 for PR 21909 at commit 
[`050c8ce`](https://github.com/apache/spark/commit/050c8ce73f35791c4adb1a4d11f120288865cae8).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22123: [SPARK-25134][SQL] Csv column pruning with checking of h...

2018-08-18 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/22123
  
cc @MaxGekk 


---




[GitHub] spark pull request #21087: [SPARK-23997][SQL] Configurable maximum number of...

2018-08-18 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/21087#discussion_r211080067
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala
 ---
@@ -164,9 +165,12 @@ case class BucketSpec(
     numBuckets: Int,
     bucketColumnNames: Seq[String],
     sortColumnNames: Seq[String]) {
-  if (numBuckets <= 0 || numBuckets >= 10) {
+  def conf: SQLConf = SQLConf.get
+
+  if (numBuckets <= 0 || numBuckets > conf.bucketingMaxBuckets) {
--- End diff --

Since the condition changed from `>=` to `>`, the condition and the error 
message are now inconsistent.

If this condition is true, the message would be something like `... but less 
than or equal to bucketing.maxBuckets ...`.
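
The boundary in question can be sketched in a few lines. This is a minimal illustration in Python rather than the Scala of the patch; `MAX_BUCKETS` and `validate_num_buckets` are hypothetical stand-ins for `conf.bucketingMaxBuckets` and the check in `BucketSpec`, and the message wording is only assumed from the quote above.

```python
MAX_BUCKETS = 100_000  # hypothetical value standing in for conf.bucketingMaxBuckets


def validate_num_buckets(num_buckets: int, max_buckets: int = MAX_BUCKETS) -> None:
    """Reject bucket counts outside the inclusive range 1..max_buckets."""
    # Mirrors `numBuckets <= 0 || numBuckets > conf.bucketingMaxBuckets`.
    # With a strict `>`, the upper bound itself is accepted, so the message
    # has to say "less than or equal to" rather than "less than".
    if num_buckets <= 0 or num_buckets > max_buckets:
        raise ValueError(
            "Number of buckets should be greater than 0 but less than or "
            f"equal to {max_buckets}. Got {num_buckets}."
        )
```

With `>=` instead, `validate_num_buckets(MAX_BUCKETS)` would be rejected and a "less than" message would be the matching wording.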


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21860: [SPARK-24901][SQL]Merge the codegen of RegularHashMap an...

2018-08-18 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/21860
  
cc @hvanhovell


---




[GitHub] spark issue #20838: [SPARK-23698] Resolve undefined names in Python 3

2018-08-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20838
  
**[Test build #94927 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94927/testReport)**
 for PR 20838 at commit 
[`73a4fd2`](https://github.com/apache/spark/commit/73a4fd26eed19256da80d27b07b2e5f4d85eb9f6).


---




[GitHub] spark issue #22135: [SPARK-25093][SQL] Avoid recompiling regexp for comments...

2018-08-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22135
  
**[Test build #94926 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94926/testReport)**
 for PR 22135 at commit 
[`5731825`](https://github.com/apache/spark/commit/5731825c9171cfe20591a0e7a34d927402881470).


---




[GitHub] spark issue #22135: [SPARK-25093][SQL] Avoid recompiling regexp for comments...

2018-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22135
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22135: [SPARK-25093][SQL] Avoid recompiling regexp for comments...

2018-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22135
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2301/
Test PASSed.


---




[GitHub] spark issue #22124: [SPARK-25135][SQL] Insert datasource table may all null ...

2018-08-18 Thread wangyum
Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/22124
  
Comes from: [[SPARK-22834][SQL] Make insertion commands have real children 
to fix UI issues](https://github.com/apache/spark/pull/20020).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22078: [SPARK-25085][SQL] Insert overwrite a non-partitioned ta...

2018-08-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22078
  
**[Test build #94925 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94925/testReport)**
 for PR 22078 at commit 
[`f1574d5`](https://github.com/apache/spark/commit/f1574d5a3c4bfbd1a202154e69cff0dc81283e35).


---




[GitHub] spark issue #22078: [SPARK-25085][SQL] Insert overwrite a non-partitioned ta...

2018-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22078
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2300/
Test PASSed.


---




[GitHub] spark issue #22078: [SPARK-25085][SQL] Insert overwrite a non-partitioned ta...

2018-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22078
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22124: [SPARK-25135][SQL] Insert datasource table may all null ...

2018-08-18 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22124
  
> But it is inconsistent now.

Can you point out in the codebase where the inconsistency comes from?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-18 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22112
  
I've removed the concept of "order sensitive partitioner" and come up with 
a better abstraction. Please take a look at the updated PR description, thanks!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22078: [SPARK-25085][SQL] Insert overwrite a non-partitioned ta...

2018-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22078
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22078: [SPARK-25085][SQL] Insert overwrite a non-partitioned ta...

2018-08-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22078
  
**[Test build #94924 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94924/testReport)**
 for PR 22078 at commit 
[`bef1d17`](https://github.com/apache/spark/commit/bef1d17ced8865c3e95eaa451424e153b4b7214a).


---




[GitHub] spark issue #22078: [SPARK-25085][SQL] Insert overwrite a non-partitioned ta...

2018-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22078
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2299/
Test PASSed.


---




[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-18 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22112
  
@tgravescs The `FileCommitProtocol` is an internal API, and our current 
implementation does store task-level data temporarily in a staging directory 
(see `HadoopMapReduceCommitProtocol`). That said, we can fix the 
`FileCommitProtocol` so that it can roll back a committed task, as long as the 
job is not committed.
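
To make the idea concrete, here is a toy model of a staging-directory commit protocol, written in Python for brevity. It is only a sketch under stated assumptions: `ToyCommitProtocol` and its method names are hypothetical and are not the API of `FileCommitProtocol` or `HadoopMapReduceCommitProtocol`; it just shows that a committed task can be undone while its output still sits in staging.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


class ToyCommitProtocol:
    """Toy model: task output stays in a staging directory until job commit."""

    def __init__(self, output_dir: Path) -> None:
        self.output_dir = output_dir
        self.staging_dir = output_dir / "_staging"
        self.staging_dir.mkdir(parents=True, exist_ok=True)
        self.job_committed = False

    def commit_task(self, task_id: str, data: str) -> None:
        # A committed task is only visible under the staging directory.
        (self.staging_dir / f"part-{task_id}").write_text(data)

    def rollback_task(self, task_id: str) -> None:
        # Rolling back is only possible while the job itself is uncommitted.
        if self.job_committed:
            raise RuntimeError("job already committed; cannot roll back")
        (self.staging_dir / f"part-{task_id}").unlink()

    def commit_job(self) -> None:
        # Publish every staged file, then drop the empty staging directory.
        for staged in sorted(self.staging_dir.iterdir()):
            shutil.move(str(staged), str(self.output_dir / staged.name))
        self.staging_dir.rmdir()
        self.job_committed = True


def demo() -> List[str]:
    """Commit a task, roll it back, retry it, then commit the job."""
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "out"
        protocol = ToyCommitProtocol(out)
        protocol.commit_task("0", "first attempt")
        protocol.rollback_task("0")  # e.g. the stage is being retried
        protocol.commit_task("0", "retried attempt")
        protocol.commit_task("1", "other task")
        protocol.commit_job()
        return sorted(p.name for p in out.iterdir())
```

Once `commit_job()` has run, the toy refuses to roll back, which matches the "as long as the job is not committed" condition.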


---




[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22112
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2298/
Test PASSed.


---




[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22112
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22112
  
**[Test build #94923 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94923/testReport)**
 for PR 22112 at commit 
[`739f210`](https://github.com/apache/spark/commit/739f210eb8f70499b56fb75fe573099fcad63541).


---




[GitHub] spark issue #22124: [SPARK-25135][SQL] Insert datasource table may all null ...

2018-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22124
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22124: [SPARK-25135][SQL] Insert datasource table may all null ...

2018-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22124
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94921/
Test PASSed.


---




[GitHub] spark issue #22124: [SPARK-25135][SQL] Insert datasource table may all null ...

2018-08-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22124
  
**[Test build #94921 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94921/testReport)**
 for PR 22124 at commit 
[`9b16ff0`](https://github.com/apache/spark/commit/9b16ff0f0581366f587db735658e3110237ceef0).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21909: [SPARK-24959][SQL] Speed up count() for JSON and CSV

2018-08-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21909
  
**[Test build #94922 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94922/testReport)**
 for PR 21909 at commit 
[`050c8ce`](https://github.com/apache/spark/commit/050c8ce73f35791c4adb1a4d11f120288865cae8).


---




[GitHub] spark pull request #21909: [SPARK-24959][SQL] Speed up count() for JSON and ...

2018-08-18 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/21909#discussion_r211075385
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonDataSource.scala
 ---
@@ -223,7 +224,8 @@ object MultiLineJsonDataSource extends JsonDataSource {
   input => parser.parse[InputStream](input, streamParser, 
partitionedFileString),
   parser.options.parseMode,
   schema,
-  parser.options.columnNameOfCorruptRecord)
+  parser.options.columnNameOfCorruptRecord,
+  optimizeEmptySchema = false)
--- End diff --

renamed


---




[GitHub] spark pull request #21909: [SPARK-24959][SQL] Speed up count() for JSON and ...

2018-08-18 Thread MaxGekk
Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/21909#discussion_r211075384
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -1492,6 +1492,15 @@ object SQLConf {
         "This usually speeds up commands that need to list many 
directories.")
       .booleanConf
       .createWithDefault(true)
+
+  val BYPASS_PARSER_FOR_EMPTY_SCHEMA =
+    buildConf("spark.sql.legacy.bypassParserForEmptySchema")
--- End diff --

It seems we don't need it anymore


---




[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...

2018-08-18 Thread tgravescs
Github user tgravescs commented on the issue:

https://github.com/apache/spark/pull/22112
  
I can't envision how that would work. You can't change how output 
committers work. You would have to either not store anything until all pass or 
store it temporarily, and in my opinion neither is good.


---




[GitHub] spark issue #22135: [SPARK-25093][SQL] Avoid recompiling regexp for comments...

2018-08-18 Thread mgaido91
Github user mgaido91 commented on the issue:

https://github.com/apache/spark/pull/22135
  
Thanks for the comment @kiszk, I am doing it!


---




[GitHub] spark issue #22124: [SPARK-25135][SQL] Insert datasource table may all null ...

2018-08-18 Thread wangyum
Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/22124
  
The root project should be consistent with the schema of the target table. 
But now it is inconsistent.

**Before this PR**:

[dataColumns](https://github.com/apache/spark/blob/e6c6f90a55241905c420afbc803dd3bd6961d66b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L84):
`col1#8L,col2#9L`

[plan](https://github.com/apache/spark/blob/e6c6f90a55241905c420afbc803dd3bd6961d66b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L67):
```
*(1) Project [col1#8L, col2#9L]
+- *(1) Filter (isnotnull(col1#8L) && (col1#8L > -20))
   +- *(1) FileScan parquet default.table1[col1#8L,col2#9L] Batched: true, 
Format: Parquet, Location: InMemoryFileIndex[file:/tmp/yumwang/spark/parquet], 
PartitionFilters: [], PushedFilters: [IsNotNull(col1), GreaterThan(col1,-20)], 
ReadSchema: struct
```
**After this PR**:

[dataColumns](https://github.com/apache/spark/blob/e6c6f90a55241905c420afbc803dd3bd6961d66b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L84):
`COL1#14L,COL2#15L`

[plan](https://github.com/apache/spark/blob/e6c6f90a55241905c420afbc803dd3bd6961d66b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L67):
```
*(1) Project [col1#8L AS COL1#14L, col2#9L AS COL2#15L]
+- *(1) Filter (isnotnull(col1#8L) && (col1#8L > -20))
   +- *(1) FileScan parquet default.table1[col1#8L,col2#9L] Batched: true, 
Format: Parquet, Location: InMemoryFileIndex[file:/tmp/yumwang/spark/parquet], 
PartitionFilters: [], PushedFilters: [IsNotNull(col1), GreaterThan(col1,-20)], 
ReadSchema: struct
```

Before [SPARK-22834](https://issues.apache.org/jira/browse/SPARK-22834)

[dataColumns](https://github.com/apache/spark/blob/ec122209fb35a65637df42eded64b0203e105aae/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L124):
`COL1#19L,COL2#20L`


[queryExecution](https://github.com/apache/spark/blob/ec122209fb35a65637df42eded64b0203e105aae/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L104):
```
== Parsed Logical Plan ==
Project [COL1#19L, COL2#20L]
+- SubqueryAlias view1
   +- View (`default`.`view1`, [col1#19L,col2#20L])
      +- Project [col1#15L, col2#16L]
         +- Filter (col1#15L > cast(-20 as bigint))
            +- SubqueryAlias table1
               +- Relation[col1#15L,col2#16L] parquet

== Analyzed Logical Plan ==
COL1: bigint, COL2: bigint
Project [COL1#19L, COL2#20L]
+- SubqueryAlias view1
   +- View (`default`.`view1`, [col1#19L,col2#20L])
      +- Project [cast(col1#15L as bigint) AS col1#19L, cast(col2#16L as 
bigint) AS col2#20L]
         +- Project [col1#15L, col2#16L]
            +- Filter (col1#15L > cast(-20 as bigint))
               +- SubqueryAlias table1
                  +- Relation[col1#15L,col2#16L] parquet

== Optimized Logical Plan ==
Filter (isnotnull(col1#15L) && (col1#15L > -20))
+- Relation[col1#15L,col2#16L] parquet

== Physical Plan ==
*Project [col1#15L, col2#16L]
+- *Filter (isnotnull(col1#15L) && (col1#15L > -20))
   +- *FileScan parquet default.table1[col1#15L,col2#16L] Batched: true, 
Format: Parquet, Location: InMemoryFileIndex[file:/tmp/yumwang/spark/parquet], 
PartitionFilters: [], PushedFilters: [IsNotNull(col1), GreaterThan(col1,-20)], 
ReadSchema: struct
```


---




[GitHub] spark issue #22124: [SPARK-25135][SQL] Insert datasource table may all null ...

2018-08-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22124
  
**[Test build #94921 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94921/testReport)**
 for PR 22124 at commit 
[`9b16ff0`](https://github.com/apache/spark/commit/9b16ff0f0581366f587db735658e3110237ceef0).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22124: [SPARK-25135][SQL] Insert datasource table may all null ...

2018-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22124
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22124: [SPARK-25135][SQL] Insert datasource table may all null ...

2018-08-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22124
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2297/
Test PASSed.


---




[GitHub] spark issue #21860: [SPARK-24901][SQL]Merge the codegen of RegularHashMap an...

2018-08-18 Thread heary-cao
Github user heary-cao commented on the issue:

https://github.com/apache/spark/pull/21860
  
cc @cloud-fan @maropu 


---




[GitHub] spark pull request #21819: [SPARK-24863][SS] Report Kafka offset lag as a cu...

2018-08-18 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/21819


---




[GitHub] spark pull request #22098: [SPARK-24886][INFRA] Fix the testing script to in...

2018-08-18 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22098


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21819: [SPARK-24863][SS] Report Kafka offset lag as a custom me...

2018-08-18 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21819
  
Merged to master.


---




[GitHub] spark issue #22098: [SPARK-24886][INFRA] Fix the testing script to increase ...

2018-08-18 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22098
  
Let me just push this in.

Merged to master.


---




[GitHub] spark pull request #22132: [SPARK-25142][PYSPARK] Add error messages when Py...

2018-08-18 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22132


---



