[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-08 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/21070
  
thanks, merging to master!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21070
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90375/
Test PASSed.


---




[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21070
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21070
  
**[Test build #90375 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90375/testReport)**
 for PR 21070 at commit 
[`95ecde0`](https://github.com/apache/spark/commit/95ecde09392ab4f2aab2f1165fb180c5466c26b4).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21070
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3045/
Test PASSed.


---




[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21070
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21070
  
**[Test build #90375 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90375/testReport)**
 for PR 21070 at commit 
[`95ecde0`](https://github.com/apache/spark/commit/95ecde09392ab4f2aab2f1165fb180c5466c26b4).


---




[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-07 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21070
  
(While I am here) seems fine to me otherwise.


---




[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-07 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/21070
  
LGTM except 2 comments


---




[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21070
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90327/
Test PASSed.


---




[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21070
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21070
  
**[Test build #90327 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90327/testReport)**
 for PR 21070 at commit 
[`6c9d47b`](https://github.com/apache/spark/commit/6c9d47babd16b067923014d49b83bfd1afb33c9b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21070
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3007/
Test PASSed.


---




[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21070
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21070
  
**[Test build #90327 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90327/testReport)**
 for PR 21070 at commit 
[`6c9d47b`](https://github.com/apache/spark/commit/6c9d47babd16b067923014d49b83bfd1afb33c9b).


---




[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-07 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/21070
  
@gatorsmile ok, I'll do later.


---




[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-06 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21070
  
> sql("select * from parquetTable where value = '0'")
> sql("select * from parquetTable where value = 0")

Semantically, they are different, so the results should also be different. The
plan shows whether these predicates are pushed down or not.

Without the upgrade to 1.10.0, we can't get the benefit from the predicate
pushdown between the above two queries. I also double checked it in my local
environment. Now, it sounds like there is a good reason to upgrade the version,
since string predicates are pretty common.
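
A minimal plain-Scala sketch of why the two predicates are semantically different. The column contents are hypothetical, and `Try(_.toInt)` is only a rough stand-in for Spark's string-to-int cast (an assumption, not Spark's actual cast logic):

```scala
import scala.util.Try

object PredicateSemanticsDemo {
  // Hypothetical contents of a string column named `value`.
  val values: Seq[String] = Seq("0", "00", "7", "abc")

  // value = '0': plain string equality.
  val stringMatches: Seq[String] = values.filter(_ == "0")

  // value = 0: the column is cast to a number before comparing.
  // Try(_.toInt) is an illustrative stand-in for the cast; strings that
  // do not parse become None and never match.
  val numericMatches: Seq[String] =
    values.filter(v => Try(v.toInt).toOption.contains(0))
}
```

Under these assumptions the numeric comparison also accepts `"00"`, which the string comparison rejects, so the two queries cannot be treated as interchangeable.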

@rdblue @maropu Thank you for your great efforts and patience! This upgrade 
is definitely good to have. We should try to make it in the next release.

@maropu Could you also submit a separate PR for your micro benchmark? 
Thanks!

cc @cloud-fan @hvanhovell @mswit-databricks Do you have any comment about 
the implementation of this PR? 


---




[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-06 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21070
  
For documentation, I think it's fine. It's a trackable history in JIRA,
although it's a bit hard to search. Also, there might be some differences we
should check, since I haven't checked the details of the JIRAs or whether it's
still true.


---




[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-06 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/21070
  
@HyukjinKwon Thanks for the explanation! I feel it's ok to keep this as it
is for now. Btw, don't we need to document the story somewhere?


---




[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-06 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21070
  
@liancheng should also remember all the story too since I talked with him 
at that time IIRC.


---




[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21070
  
Yup, https://github.com/apache/spark/pull/21070#issuecomment-386793202.
Just for clarification, the given attribute and literal are castable but they
are not being treated as such, right?

I believe this is a known issue and there were several attempts:

One approach was to always cast directly, and it was reverted (roughly 2
years ago?). Another approach was constant folding at the optimizer level, but
it was rejected as too messy. Another approach was to cast directly and
compare both values, but it was also rejected since it sounded unsafe.

It is a long, old story, so it is probably worth double checking the history,
but I feel sure that I remember it correctly.

The key point, IIUC, was that `translateFilter` should be super conservative,
and it sounds like we need to check every possibility.
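
To make the "super conservative" idea concrete, here is a toy sketch in plain Scala. None of these types are Spark's Catalyst or data source classes; `Expr`, `Attr`, `Lit`, `Cast`, `EqualsExpr`, `EqualTo`, and `translate` are all hypothetical names invented for illustration. The translation only succeeds for a bare attribute compared with a literal and gives up on anything else, such as a cast wrapped around the attribute:

```scala
object ConservativeTranslateDemo {
  // Toy expression tree (illustrative only, not Spark's classes).
  sealed trait Expr
  final case class Attr(name: String) extends Expr
  final case class Lit(value: Any) extends Expr
  final case class Cast(child: Expr, to: String) extends Expr
  final case class EqualsExpr(left: Expr, right: Expr) extends Expr

  // Toy data source filter that a reader could evaluate against file statistics.
  final case class EqualTo(attribute: String, value: Any)

  // Translate only the shape known to be safe: attribute = literal (either order).
  // Everything else returns None and stays as a filter evaluated after the scan.
  def translate(e: Expr): Option[EqualTo] = e match {
    case EqualsExpr(Attr(name), Lit(v)) => Some(EqualTo(name, v))
    case EqualsExpr(Lit(v), Attr(name)) => Some(EqualTo(name, v))
    case _                              => None
  }

  // Shape of `value = '0'` on a string column.
  val stringPredicate: Expr = EqualsExpr(Attr("value"), Lit("0"))
  // Shape of `value = 0` on a string column: the attribute is wrapped in a cast.
  val castPredicate: Expr = EqualsExpr(Cast(Attr("value"), "int"), Lit(0))
}
```

In this sketch `translate(stringPredicate)` yields a pushable filter while `translate(castPredicate)` yields `None`, which mirrors the difference in `PushedFilters` between the two queries discussed in this thread.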


---




[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-05 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/21070
  
ok, I checked string pushdown worked well;
https://gist.github.com/maropu/e7cbfb8dc73a91806088e159f64e113f


---




[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-05 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/21070
  
I looked over the code and found this issue could be fixed by:
https://github.com/apache/spark/compare/master...maropu:FixFilterPushdown
```
scala> sql("select * from parquetTable where value = 0").explain
== Physical Plan ==
*(1) Project [c1#35, c2#36, c3#37, c4#38, c5#39, c6#40, value#41]
+- *(1) Filter (isnotnull(value#41) && (cast(value#41 as int) = 0))
   +- *(1) FileScan parquet [c1#35,c2#36,c3#37,c4#38,c5#39,c6#40,value#41] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/maropu/Desktop/parquet-test], PartitionFilters: [], PushedFilters: [IsNotNull(value), EqualTo(value,0)], ReadSchema: struct
```
But I don't know whether this is a known issue or whether this fix is the right approach.
cc: @HyukjinKwon 


---




[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-05 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/21070
  
ok, I found the reason:
```

$ ./bin/spark-shell --master=local[1]

scala> import scala.util.Random
scala> :paste
// Runs `f` five times, printing each run's wall-clock time and the average, in seconds.
def timer[R](f: => R): Unit = {
  val count = 5
  val iters = (0 until count).map { i =>
    val t0 = System.nanoTime()
    f
    val t1 = System.nanoTime()
    val elapsed = t1 - t0 + 0.0
    // System.nanoTime() is in nanoseconds; divide by 1e9 to report seconds.
    println(s"#$i: ${elapsed / 1e9}")
    elapsed
  }
  println("Avg. Elapsed Time: " + ((iters.sum / count) / 1e9) + "s")
}
val dir = "/home/ec2-user/parquet-test-string"
val numRows = 1024 * 1024 * 15
val width = 6
// Six string columns c1..c6 holding random longs, plus a sorted string `value` column.
val selectExpr = (1 to width).map(i => s"CAST(value AS STRING) c$i")
val valueCol = monotonically_increasing_id().cast("string")
val df = spark.range(numRows).map(_ => Random.nextLong)
  .selectExpr(selectExpr: _*)
  .withColumn("value", valueCol)
  .sort("value")
df.write.mode("overwrite").parquet(dir)
spark.read.parquet(dir).createOrReplaceTempView("parquetTable")

scala> sql("SET spark.sql.parquet.filterPushdown=true")
scala> timer { sql("select * from parquetTable where value = '0'").collect }
#0: 1.041495043
#1: 0.53017502
#2: 0.468505896
#3: 0.437655119
#4: 0.429151435
Avg. Elapsed Time: 0.5813965026s

scala> sql("select * from parquetTable where value = 0").explain
== Physical Plan ==
*(1) Project [c1#35, c2#36, c3#37, c4#38, c5#39, c6#40, value#41]
+- *(1) Filter (isnotnull(value#41) && (cast(value#41 as int) = 0))
   +- *(1) FileScan parquet [c1#35,c2#36,c3#37,c4#38,c5#39,c6#40,value#41] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/home/ec2-user/parquet-test-string], PartitionFilters: [], PushedFilters: [IsNotNull(value)], ReadSchema: struct

scala> timer { sql("select * from parquetTable where value = 0").collect }
#0: 10.656159769
#1: 10.583965537
#2: 10.486018192
#3: 10.475532898
#4: 10.494059857
Avg. Elapsed Time: 10.53914725061s

scala> sql("select * from parquetTable where value = '0'").explain
== Physical Plan ==
*(1) Project [c1#35, c2#36, c3#37, c4#38, c5#39, c6#40, value#41]
+- *(1) Filter (isnotnull(value#41) && (value#41 = 0))
   +- *(1) FileScan parquet [c1#35,c2#36,c3#37,c4#38,c5#39,c6#40,value#41] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/home/ec2-user/parquet-test-string], PartitionFilters: [], PushedFilters: [IsNotNull(value), EqualTo(value,0)], ReadSchema: struct
```
We do push down the predicate `value = '0'` into the Parquet filter, but we
don't push down the predicate `value = 0`. So, this is a Spark-side issue in
push-down handling.


---




[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-04 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/21070
  
ok, give me more time to do so.


---




[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-04 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/21070
  
@maropu, I'd recommend looking at the Parquet files using 
[`parquet-cli`](http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22parquet-cli%22)
 to see if you're getting reasonable min/max stats for your string columns. 
That will give you an idea of when predicate push-down will help.


---




[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-04 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/21070
  
@rdblue [I made the input test data sorted
(clustered)](https://github.com/maropu/spark/blob/dfb6f3bad10058b28ff66a6d2b72225b8847296c/sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala#L77),
but I had no luck in the string case:
https://gist.github.com/maropu/b7ff6d528547de03a4a5cbabd120e78d
I'm still looking into why (any suggestions are welcome).


---




[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-03 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/21070
  
Aha, I see; that might be true. I'll check the benchmark again.


---




[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-03 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/21070
  
@maropu, I suspect that the problem is that comparison is different for 
strings: `"17297598712"` is less than `"5"` with string comparison.
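
A minimal plain-Scala sketch of that ordering difference (no Spark needed; `OrderingDemo` and the sample ids are made up for illustration). Ids that increase numerically are no longer in increasing order once they are compared as strings:

```scala
object OrderingDemo {
  // Hypothetical ids, increasing when read as numbers.
  val ids: Seq[Long] = Seq(5L, 40L, 300L, 17297598712L)

  // The same ids rendered as strings, still in numeric order.
  val asStrings: Seq[String] = ids.map(_.toString)

  // True: the Longs are already in ascending numeric order.
  val numericallySorted: Boolean = ids == ids.sorted

  // Strings are compared character by character, so "17297598712"
  // sorts before "300", "40", and "5".
  val lexicographicallySorted: Seq[String] = asStrings.sorted
}
```

So min/max statistics over a string-typed id column describe lexicographic ranges, not numeric ones, which changes which row groups a given predicate can rule out.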


---




[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-03 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/21070
  
@rdblue Aha, thanks for the explanation! 

> I think that you were expecting a string comparison case to have a
> significant benefit over non-pushdown. But I would only expect that if ORC
> had a similar benefit. That's because this is dependent on the clustering of
> values in the file so that Parquet can eliminate row groups. If ORC didn't
> have a benefit, then I would expect that the data just isn't clustered in a
> way that helps.

The `string` case had the same test data set (monotonically-increasing ids)
as the `int` case, but only in the string case did we not get the benefit of
push-down. Is the logic to eliminate row groups different between the `int`
and `string` cases even though we use the same dataset?

https://github.com/maropu/spark/blob/465aa420b1399aba7199aa2868ad6ae58d877d50/sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala#L70



---




[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-03 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/21070
  
@gatorsmile, are you happy committing this with the benchmark results?

@maropu, thanks for taking the time to add these benchmarks, it is really 
great to have them so we can monitor the performance over time!


---




[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-03 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/21070
  
@rdblue ya, my bad for the simple scan case, you're right.


---




[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-03 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/21070
  
@maropu, looking at the pushdown benchmark, it looks like ORC and Parquet 
either both benefit or both do not benefit from pushdown. In some cases ORC is 
much faster, which is due to the fact that ORC will skip reading pages, not 
just row groups. But, when ORC benefits from pushdown so does Parquet, for 
example the `Select 1 int row (value = 7864320)` case.

I think that you were expecting a string comparison case to have a 
significant benefit over non-pushdown. But I would only expect that if ORC had 
a similar benefit. That's because this is dependent on the clustering of values 
in the file so that Parquet can eliminate row groups. If ORC didn't have a 
benefit, then I would expect that the data just isn't clustered in a way that 
helps.

I'm not sure how you're generating data, but I'd recommend adding a sorted 
column case with enough data to create multiple row groups (or stripes for 
ORC). That would write data so that you can ignore some row groups and you 
should see a speed up.
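
A toy model in plain Scala of why sorted (clustered) data lets min/max statistics rule out row groups. This is a sketch under stated assumptions, not Parquet's implementation: `RowGroupStats`, `stats`, and `groupsToRead` are hypothetical names, and the group size of 100 values is arbitrary.

```scala
import scala.util.Random

object RowGroupSkippingDemo {
  // Toy stand-in for one row group's column statistics (not a Parquet class).
  final case class RowGroupStats(min: Long, max: Long)

  // Split the values into fixed-size groups and record min/max per group.
  def stats(values: Seq[Long], groupSize: Int): Seq[RowGroupStats] =
    values.grouped(groupSize).map(g => RowGroupStats(g.min, g.max)).toSeq

  // A group has to be read only if the target may fall inside [min, max].
  def groupsToRead(groups: Seq[RowGroupStats], target: Long): Int =
    groups.count(s => s.min <= target && target <= s.max)

  val sorted: Seq[Long] = 0L until 1000L
  // Same values in an unclustered layout (fixed seed so runs are repeatable).
  val shuffled: Seq[Long] = new Random(42L).shuffle(sorted)

  val sortedReads: Int = groupsToRead(stats(sorted, 100), 250L)
  val shuffledReads: Int = groupsToRead(stats(shuffled, 100), 250L)
}
```

With sorted input exactly one of the ten groups can contain 250, so `sortedReads` is 1. With shuffled input each group's min/max range tends to span most of the value domain, so far fewer groups (often none) can be skipped.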

Parquet also supports dictionary-based row group filtering. To test that, 
make sure you have a column that is entirely dictionary-encoded: pick a small 
set of values and randomly draw from that set. Then if you search for a value 
that isn't in that set you should see a speedup. Also make sure that you have 
`parquet.filter.dictionary.enabled=true` set in the Hadoop configuration so 
that Parquet uses dictionary filtering.
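
The dictionary case can be modelled the same way. This is only an illustrative sketch in plain Scala, not Parquet's code: `ColumnChunk`, `encode`, and `canSkip` are hypothetical names, and the "dictionary" is simply the set of distinct values in the chunk.

```scala
import scala.util.Random

object DictionaryFilterDemo {
  // Toy stand-in for a fully dictionary-encoded column chunk.
  final case class ColumnChunk(dictionary: Set[String], rows: Seq[String])

  // The dictionary is the set of distinct values that occur in the rows.
  def encode(rows: Seq[String]): ColumnChunk = ColumnChunk(rows.toSet, rows)

  // If the target is not in the dictionary, no row can match,
  // so the whole chunk can be skipped without reading its rows.
  def canSkip(chunk: ColumnChunk, target: String): Boolean =
    !chunk.dictionary.contains(target)

  // Small fixed set of values; every value appears at least once,
  // followed by random draws from the same set.
  val palette: Vector[String] = Vector("red", "green", "blue")
  private val rng = new Random(7L)
  val chunk: ColumnChunk =
    encode(palette ++ Seq.fill(997)(palette(rng.nextInt(palette.size))))
}
```

Here `canSkip(chunk, "purple")` is true while `canSkip(chunk, "red")` is false. In a Spark session, one way to set the Hadoop option mentioned above would be `spark.sparkContext.hadoopConfiguration.setBoolean("parquet.filter.dictionary.enabled", true)`, assuming a session named `spark`.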


---




[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-03 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/21070
  
@maropu, are you sure about the INT and FLOAT columns? I think you might 
have that assessment backwards. Here's the INT results from the PR gist:

```
SQL Single INT Column Scan:       Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-----------------------------------------------------------------------------------------
SQL Parquet Vectorized                   149 /  162        105.5           9.5       1.0X
SQL Parquet MR                          1825 / 1836          8.6         116.1       0.1X
```

And here are the INT results from the master gist:

```
SQL Single INT Column Scan:       Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-----------------------------------------------------------------------------------------
SQL Parquet Vectorized                   250 /  292         63.0          15.9       1.0X
SQL Parquet MR                          3175 / 3202          5.0         201.8       0.1X
```

I think that shows that the PR result was significantly faster, not slower. 
(The other INT test was about the same.)

Here's the FLOAT column from the PR gist:

```
SQL Single FLOAT Column Scan:     Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-----------------------------------------------------------------------------------------
SQL Parquet Vectorized                   145 /  158        108.8           9.2       1.0X
SQL Parquet MR                          1840 / 1843          8.5         117.0       0.1X
```

And FLOAT from the master gist:

```
SQL Single FLOAT Column Scan:     Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-----------------------------------------------------------------------------------------
SQL Parquet Vectorized                   261 /  316         60.2          16.6       1.0X
SQL Parquet MR                          3267 / 3284          4.8         207.7       0.1X
```

Am I reading this incorrectly? I'm considering lower time values and higher 
rate values to be better.


---




[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-03 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/21070
  
Simple scan benchmarks:
code: 
https://github.com/apache/spark/compare/master...maropu:DataSourceReadBenchmark
master: https://gist.github.com/maropu/a767d21ed1dd047ec2bdca92915dc5c5
this pr: https://gist.github.com/maropu/b9312488ec791caec9711ba64027e3e2

Pushdown benchmarks:
code: 
https://github.com/apache/spark/compare/master...maropu:UpdateParquetBenchmark
this pr: https://gist.github.com/maropu/e7cbfb8dc73a91806088e159f64e113f

Obviously, the pushdown for strings didn't work, and I haven't looked into
why yet.
Can you check this? @rdblue


---




[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-02 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/21070
  
I'll post the benchmark results today (the benchmark is currently running),
just a sec.


---




[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-01 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/21070
  
@mswit-databricks, I wouldn't worry about that. We've limited the length of 
binary and string fields.

In the next version of Parquet, we're planning on releasing page indexes, 
which are lower and upper bounds instead of min and max values. That gives us 
more flexibility to shorten values and avoid the case that you're worried about.


---




[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-01 Thread mswit-databricks
Github user mswit-databricks commented on the issue:

https://github.com/apache/spark/pull/21070
  
@rdblue Do you see any risk of additional overhead coming from the extra 
stats? For example, if the data contains very long strings, performing 
comparison on them to generate stats will be expensive and might not be worth 
it in certain scenarios.


---




[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-30 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21070
  
TPC-DS is just end-to-end perf testing. To ensure no perf regressions,
we need micro-benchmark tests. That is just a normal procedure. We did the
same in the ORC upgrade too.


---




[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-30 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/21070
  
Is it necessary to block this commit on benchmarks? We know that it doesn't 
degrade performance from the Parquet benchmarks and TPC-DS run. What do you 
think needs to be done before this can move forward?


---




[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-30 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21070
  
@maropu Could you help verify the perf improvement in this string filter 
pushdown in the microbenchmark? You can also refer to another microbenchmark we 
did in ORC. 
https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala

If possible, please make a new microbenchmark suite that is extensible to 
all the data sources. 


---




[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-30 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/21070
  
Yes, it is safe to use push-down for string columns in data written with 
this version.


---




[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-30 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21070
  
@iddoav Thank you for your feedback. It sounds like CDH5.X's Spark has a
bug, right? Our Apache Spark does not have such an issue, based on my
understanding.

@rdblue Does that mean we can see better perf numbers after upgrading to
1.10.0, since it has the correct stats and can use these stats for string
filter pushdown?


---




[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-29 Thread iddoav
Github user iddoav commented on the issue:

https://github.com/apache/spark/pull/21070
  
Our R&D in SimilarWeb has a hard time with PARQUET-686, and merging this PR
will help us a lot. Note that, unlike Spark 2.1+ readers, which have read-time
mitigations (SPARK-17213 et al.), other systems like CDH5.X's Spark and AWS
Athena (probably also Presto) do predicate pushdown on Spark 2.3 Parquet
outputs and return wrong answers when string columns are involved.
@gatorsmile @rdblue 


---




[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-27 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/21070
  
ok, I'll work on it.


---




[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-27 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21070
  
@maropu Thank you for your work! Could you also run/improve the
micro-benchmark?
https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadBenchmark.scala

If possible, also add a new one for the write path?


---




[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-25 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/21070
  
I just ran `TPCDSQueryBenchmark` on an EC2 instance (m4.2xlarge). The job
to check performance regression is too heavy, so we just check if `tpcds`,
`tpch`, and `ssb` can be correctly compiled in `TPCDSQuerySuite`,
`TPCHQuerySuite`, and `SSBQuerySuite`.

BTW, I sometimes check for regressions myself in my repo:
https://github.com/maropu/spark-tpcds-datagen.


---




[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-25 Thread henryr
Github user henryr commented on the issue:

https://github.com/apache/spark/pull/21070
  
yes, thanks @maropu! +1 to the idea of making this a jenkins job. 





[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-25 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/21070
  
Thank you @maropu!

What resources does the run require? Is it something we could create a 
Jenkins job to run?





[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-25 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/21070
  
@gatorsmile @rdblue I checked the numbers and found no regression, at 
least in TPC-DS. For the actual numbers, see 
https://issues.apache.org/jira/browse/SPARK-24070.





[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-25 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/21070
  
The fix for PARQUET-686 was to suppress min/max stats. It is safe to push 
filters, but those filters can't be used without the stats. 1.10.0 has the 
correct stats and can use those filters.
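
To illustrate why a pushed filter is harmless but useless without stats, here is a minimal sketch in plain Python. It is not Parquet's API: `RowGroupStats`, `can_skip_for_equals`, and `row_groups_to_read` are hypothetical names, and a `None` bound stands in for suppressed min/max stats.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RowGroupStats:
    """Min/max statistics for one column in one row group (hypothetical model)."""
    min_value: Optional[int]  # None models suppressed or missing stats
    max_value: Optional[int]


def can_skip_for_equals(stats: RowGroupStats, target: int) -> bool:
    """Return True only when the stats prove no row in the group equals target."""
    if stats.min_value is None or stats.max_value is None:
        # Without stats the filter cannot rule the row group out, so it is read.
        return False
    return target < stats.min_value or target > stats.max_value


def row_groups_to_read(groups: List[RowGroupStats], target: int) -> List[int]:
    """Indexes of row groups that still have to be scanned for `col = target`."""
    return [i for i, g in enumerate(groups) if not can_skip_for_equals(g, target)]


groups = [
    RowGroupStats(0, 9),
    RowGroupStats(10, 19),
    RowGroupStats(None, None),  # stats suppressed
]
print(row_groups_to_read(groups, 15))  # prints [1, 2]
```

The group with suppressed stats is always read, so results stay correct; the filter only starts saving work once trustworthy min/max values are present.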





[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-25 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21070
  
Regarding 
[`SPARK-17213`](https://issues.apache.org/jira/browse/SPARK-17213), we already 
enabled the pushdown after upgrading to 1.8.2. Is my understanding right?





[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-25 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/21070
  
There are two main reasons to update. First, the problem behind SPARK-17213 
is fixed, hence the new min and max fields. Second, this updates the internal 
byte array management, which is needed for page skipping in the next few 
versions.





[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-25 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21070
  
Parquet is the default file format, so the impact could be huge and we need 
to be more cautious. Since Parquet 1.10 was officially released only this month, 
we might need a more careful review for such a big version upgrade. 

@rdblue Could you summarize the major benefits brought by this upgrade from 
1.8 to 1.10? 





[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-24 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/21070
  
I'll check the performance changes with this PR in the next few days and put the results in 
https://issues.apache.org/jira/browse/SPARK-24070





[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-24 Thread henryr
Github user henryr commented on the issue:

https://github.com/apache/spark/pull/21070
  
Ok, thanks for the context. I've worked on other projects where it's 
sometimes ok to take a small risk of a perf hit on trunk as long as the 
community is committed to addressing the issues before the next release. If the 
norms here are not to absorb any risk on trunk, that also seems reasonable.





[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-24 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/21070
  
> Based on the previous upgrade (e.g., #13280 (comment)), we hit 
performance regressions when we upgraded Parquet and we reverted it in the 
end.

I should point out that the regression wasn't reproducible, so we aren't 
sure what the cause was. We also didn't have performance numbers on the Parquet 
side or a case of anyone running it in production (we have been for a couple 
months). But, I can understand wanting to be thorough.





[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-24 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21070
  
cc @liancheng @yhuai @rxin 





[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-24 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21070
  
Based on the previous upgrade (e.g., 
https://github.com/apache/spark/pull/13280#issuecomment-223022415), we hit 
performance regressions when we upgraded Parquet and we reverted it in the 
end. That might be the major reason why we have not upgraded to 1.9 yet. 

Jumping to 1.10.0 from 1.8.2 is big and risky especially when it was 
recently released. Even if we merge it now, we might still revert it if we hit 
ANY performance regression. That is why we prefer to do the performance testing 
before we do the merge. 

Not sure whether the community can help with the performance tests? cc @maropu 
You did many performance tests with TPC-DS for past releases. Could you help 
with this too? If possible, we also need a micro-benchmark like the one 
@dongjoon-hyun did for ORC.
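
As a rough illustration of what "checking for a regression" can mean in practice, here is a minimal plain-Python sketch that compares per-query timings from two builds against a noise tolerance. The function name, the 5% tolerance, the query names, and the timings are all made up for the example; they are not taken from this PR or from any Spark benchmark.

```python
from typing import Dict, List


def find_regressions(
    baseline_ms: Dict[str, float],
    candidate_ms: Dict[str, float],
    tolerance: float = 0.05,
) -> List[str]:
    """Names of queries whose candidate time exceeds baseline by more than `tolerance`."""
    regressions = []
    for query, base in sorted(baseline_ms.items()):
        cand = candidate_ms.get(query)
        if cand is None:
            continue  # query was not measured on the candidate build
        if cand > base * (1.0 + tolerance):
            regressions.append(query)
    return regressions


# Hypothetical timings in milliseconds, for illustration only.
baseline = {"q1": 1000.0, "q2": 2500.0, "q3": 400.0}
candidate = {"q1": 1010.0, "q2": 2900.0, "q3": 390.0}
print(find_regressions(baseline, candidate))  # prints ['q2']
```

A tolerance is needed because run-to-run noise would otherwise flag almost every query; the right value depends on how stable the benchmark environment is.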





[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-24 Thread henryr
Github user henryr commented on the issue:

https://github.com/apache/spark/pull/21070
  
@cloud-fan since there’s probably quite some time before this lands in a 
release, what do you think about merging this now if it’s ready, and filing 
the perf jira as a blocker against 2.4? My guess is that Spark will want to 
move to 1.10 at some point no matter what, so doing it this way has the 
advantage of giving plenty of time for any other issues to shake out before the 
next release. 





[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-24 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/21070
  
Okay, I don't have the time to set up and run benchmarks without a stronger 
case for this causing a regression (given the Parquet testing), but other 
people should feel free to pick this up.





[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-24 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/21070
  
No. But there are a bunch of open source projects to help you easily run 
TPCDS on Spark with parquet files, e.g. 
https://github.com/maropu/spark-tpcds-datagen





[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-24 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/21070
  
@cloud-fan, is there a Jenkins job to run it?





[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-23 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/21070
  
Can we run TPC-DS and show that this upgrade doesn't cause a performance 
regression in Spark? I can see that this new version doesn't have a perf 
regression on the Parquet side; I just want to be sure the Spark-Parquet integration 
is also OK.





[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-23 Thread henryr
Github user henryr commented on the issue:

https://github.com/apache/spark/pull/21070
  
This looks pretty good to me - are there any committers that can give it a 
(hopefully) final review?





[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21070
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89644/
Test PASSed.





[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21070
  
Merged build finished. Test PASSed.





[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21070
  
**[Test build #89644 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89644/testReport)**
 for PR 21070 at commit 
[`40a9cdd`](https://github.com/apache/spark/commit/40a9cdd969d3695a9dab5d6e18ed1b062be254bf).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21070
  
Merged build finished. Test PASSed.





[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21070
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2539/
Test PASSed.





[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21070
  
**[Test build #89644 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89644/testReport)**
 for PR 21070 at commit 
[`40a9cdd`](https://github.com/apache/spark/commit/40a9cdd969d3695a9dab5d6e18ed1b062be254bf).





[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-20 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/21070
  
Retest this please.





[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21070
  
Merged build finished. Test FAILed.





[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21070
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89600/
Test FAILed.





[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21070
  
**[Test build #89600 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89600/testReport)**
 for PR 21070 at commit 
[`40a9cdd`](https://github.com/apache/spark/commit/40a9cdd969d3695a9dab5d6e18ed1b062be254bf).
 * This patch **fails from timeout after a configured wait of `300m`**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21070
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2504/
Test PASSed.





[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21070
  
Merged build finished. Test PASSed.





[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-19 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21070
  
**[Test build #89600 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89600/testReport)**
 for PR 21070 at commit 
[`40a9cdd`](https://github.com/apache/spark/commit/40a9cdd969d3695a9dab5d6e18ed1b062be254bf).





[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21070
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89533/
Test FAILed.





[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21070
  
Merged build finished. Test FAILed.





[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21070
  
**[Test build #89533 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89533/testReport)**
 for PR 21070 at commit 
[`5fca3ce`](https://github.com/apache/spark/commit/5fca3ce1e199b0f3ee9d2b2178cf1f2418d4324f).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21070
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89531/
Test FAILed.





[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21070
  
Merged build finished. Test FAILed.





[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21070
  
**[Test build #89531 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89531/testReport)**
 for PR 21070 at commit 
[`5a78030`](https://github.com/apache/spark/commit/5a78030f2ba0b8df1098ccabad4afd30650d2336).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-18 Thread scottcarey
Github user scottcarey commented on the issue:

https://github.com/apache/spark/pull/21070
  
> This is about getting Parquet updated, not about worrying whether users 
can easily add compression implementations to their classpath.

Yes, of course.

My hunch is that someone else will read the release notes that spark 2.3.0 
supports zstandard, and parquet 1.10.0 supports zstandard, then realize it 
doesn't work in combination and end up here.  So I feel that this is the right 
place to discuss the state of these features until there is another more 
specific place to do so.

The discussion here has been useful to get closer to understanding what 
further tasks there may be.  If there are any follow-on issues, the discussion 
can move there.


I would love to be able to test this with my full use case, and give it a 
big thumbs up if it works.  Unfortunately my only motivation for this upgrade 
is access to ZStandard, and I'm not as excited to say 'works for me if I don't 
use new parquet codecs'.







[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-18 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/21070
  
I backported the Hadoop zstd codec to 2.7.3 without much trouble. But 
either way, I think that's a separate concern. This is about getting Parquet 
updated, not about worrying whether users can easily add compression 
implementations to their classpath.





[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-18 Thread scottcarey
Github user scottcarey commented on the issue:

https://github.com/apache/spark/pull/21070
  
@rdblue 
The problem with zstd is that it is only in Hadoop 3.0, and dropping _that_ 
jar in breaks things as it is a major release.  Extracting out only the 
ZStandardCodec from that and recompiling to a 2.x release does not work either, 
because it depends on some low level hadoop native library management to load 
the native library (it does not appear to use  
https://github.com/luben/zstd-jni).

The alternative is to write a custom ZStandardCodec implementation that 
uses luben:zstd-jni.

Furthermore, if you add an `o.a.h.io.compress.ZStandardCodec` class to a jar 
on the client side, it is still not found. My guess is that there is some 
classloader isolation between client code and Spark itself, and Spark itself is 
what needs to find the class. So one has to have it installed inside the 
Spark distribution.

I may take you up on fixing the compression codec dependency mess in a 
couple months.  The hardest part will be lining up the configuration options 
with what users already expect -- the raw codecs aren't that hard to do.





[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21070
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2455/
Test PASSed.





[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21070
  
Merged build finished. Test PASSed.





[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21070
  
**[Test build #89533 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89533/testReport)**
 for PR 21070 at commit 
[`5fca3ce`](https://github.com/apache/spark/commit/5fca3ce1e199b0f3ee9d2b2178cf1f2418d4324f).





[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-18 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/21070
  
@scottcarey, Parquet will use the compressors if they are available. You 
can add them from an external Jar and it will work. LZ4 should also work out of 
the box because it is included in Hadoop 2.7.

I agree that it would be nice if Parquet didn't rely on Hadoop for 
compression libraries, but that's how it is at the moment. Feel free to fix it.





[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21070
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2453/
Test PASSed.





[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21070
  
Merged build finished. Test PASSed.





[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21070
  
**[Test build #89531 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89531/testReport)**
 for PR 21070 at commit 
[`5a78030`](https://github.com/apache/spark/commit/5a78030f2ba0b8df1098ccabad4afd30650d2336).





[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-18 Thread scottcarey
Github user scottcarey commented on the issue:

https://github.com/apache/spark/pull/21070
  
I tested this with the addition of some changes to ParquetOptions.scala, 
but this alone does not allow for writing or reading zstd compressed parquet 
files, because it is using reflection to acquire hadoop classes for compression 
which are not in the supplied dependencies.

From what I can see, anyone who wants to use the new compression codecs 
is going to have to build their own custom version of Spark, and probably 
with modified versions of the Hadoop libraries as well, including changing how the 
native bindings are built, because that would be easier than updating the 
whole thing to hadoop-common 3.0 where the required compressors exist.

Alternatively, spark+parquet should avoid the hadoop dependencies like the 
plague for compression / decompression.  They bring in a steaming heap of 
dependencies and possible library conflicts and users often have versions (or 
CDH versions) that don't exactly match.
In my mind, parquet should handle the compression itself, or with a 
light-weight dependency.
Perhaps it can use either the hadoop flavor, or if that is not found, 
another one, or even a user-supplied one so that it works stand-alone or from 
inside hadoop without issue.
Right now it is bound together with reflection and an awkward stack of 
brittle dependencies with no escape hatch.
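
The fallback order suggested above (preferred implementation, then another one, then a user-supplied one) can be sketched in plain Python. This is only an analogy, not how Parquet or Hadoop load codecs: `load_codec` and `resolve_codec` are hypothetical names, `hypothetical_zstd_codec` is a deliberately missing module, and Python's `importlib` stands in for Java reflection.

```python
import importlib
from typing import Callable, List, Optional, Tuple

# A codec is modelled as a (compress, decompress) pair of bytes -> bytes functions.
Codec = Tuple[Callable[[bytes], bytes], Callable[[bytes], bytes]]


def load_codec(module_name: str) -> Optional[Codec]:
    """Try to load a codec by module name; return None if it is not available."""
    try:
        module = importlib.import_module(module_name)
        return module.compress, module.decompress
    except (ImportError, AttributeError):
        return None


def resolve_codec(
    module_names: List[str], user_codec: Optional[Codec] = None
) -> Tuple[str, Codec]:
    """Return the first loadable codec, else the user-supplied one, else fail."""
    for name in module_names:
        codec = load_codec(name)
        if codec is not None:
            return name, codec
    if user_codec is not None:
        return "user-supplied", user_codec
    raise LookupError(f"no codec available among {module_names}")


# The first name is assumed not to be installed, so the lookup falls back to zlib.
name, (compress, decompress) = resolve_codec(["hypothetical_zstd_codec", "zlib"])
data = b"parquet page bytes" * 10
assert decompress(compress(data)) == data
print(name)
```

The point of the design is that a missing preferred implementation degrades to another provider instead of failing outright, and the user-supplied hook is the escape hatch.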

Or am I missing something here, and it is possible to read/write with the 
new codecs if I configure it differently?







[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-17 Thread henryr
Github user henryr commented on the issue:

https://github.com/apache/spark/pull/21070
  
@scottcarey I agree that's important. Perhaps it could be done as a 
follow-up PR?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-17 Thread scottcarey
Github user scottcarey commented on the issue:

https://github.com/apache/spark/pull/21070
  
This PR should include changes to `ParquetOptions.scala` to expose the 
`LZ4`, `ZSTD` and `BROTLI` parquet compression codecs, or else spark users 
won't be able to leverage those parquet 1.10.0 features.
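
The kind of change being asked for amounts to extending a short-name lookup table. Below is a minimal plain-Python sketch of that idea, not the actual Scala code: the table contents, `SHORT_CODEC_NAMES`, and `resolve_codec_name` are assumptions made for illustration, with the three requested codecs added at the end.

```python
# Hypothetical short-name table; the real one lives in ParquetOptions.scala
# and its exact contents are assumed here.
SHORT_CODEC_NAMES = {
    "none": "UNCOMPRESSED",
    "uncompressed": "UNCOMPRESSED",
    "snappy": "SNAPPY",
    "gzip": "GZIP",
    "lzo": "LZO",
    # Entries the comment asks to expose:
    "lz4": "LZ4",
    "zstd": "ZSTD",
    "brotli": "BROTLI",
}


def resolve_codec_name(user_value: str) -> str:
    """Map a user-facing option value to a codec name, rejecting unknown values."""
    key = user_value.strip().lower()
    if key not in SHORT_CODEC_NAMES:
        available = ", ".join(sorted(SHORT_CODEC_NAMES))
        raise ValueError(
            f"Codec [{key}] is not available. Available codecs are {available}."
        )
    return SHORT_CODEC_NAMES[key]


print(resolve_codec_name("ZSTD"))  # prints ZSTD
```

Accepting a name in the option table only makes it selectable; the codec implementation still has to be loadable at runtime.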





[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21070
  
Merged build finished. Test FAILed.





[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21070
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89415/
Test FAILed.





[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21070
  
**[Test build #89415 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89415/testReport)**
 for PR 21070 at commit 
[`27a66d8`](https://github.com/apache/spark/commit/27a66d8114552e199389c944517d42719861b9de).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.




