GitHub user HyukjinKwon opened a pull request:
https://github.com/apache/spark/pull/14660
[SPARK-17071][SQL] Fetch Parquet schema without another Spark job when it
is a single file to touch
## What changes were proposed in this pull request?
It seems Spark always executes another job to figure out the schema
([ParquetFileFormat#L739-L778](https://github.com/apache/spark/blob/abff92bfdc7d4c9d2308794f0350561fe0ceb4dd/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L739-L778)).
However, this seems to be unnecessary overhead when there is only a single file
to touch. I ran a benchmark with the code below:
```scala
test("Benchmark for Parquet schema reading") {
  withTempPath { path =>
    // Write a single small Parquet file to read the schema from.
    Seq((1, 2D, 3L, "4")).toDF("a", "b", "c", "d")
      .write.format("parquet").save(path.getAbsolutePath)

    val benchmark = new Benchmark("Parquet - read schema", 1)
    benchmark.addCase("Parquet - read schema", 10) { _ =>
      // Accessing `schema` triggers schema inference on the file.
      spark.read.format("parquet").load(path.getCanonicalPath).schema
    }
    benchmark.run()
  }
}
```
with the results as below:
- **Before**
```scala
Parquet - read schema:    Best/Avg Time(ms)    Rate(M/s)    Per Row(ns)    Relative
Parquet - read schema               47 / 49          0.0     46728419.0        1.0X
```
- **After**
```scala
Parquet - read schema:    Best/Avg Time(ms)    Rate(M/s)    Per Row(ns)    Relative
Parquet - read schema                 2 / 3          0.0      1811673.0        1.0X
```
It seems reading the schema became more than 20X faster (best time of 47 ms
down to 2 ms), although this is only a small part of the total job run time.
For reference, it seems ORC already does this on the driver side:
[OrcFileOperator.scala#L74-L83](https://github.com/apache/spark/blob/a95252823e09939b654dd425db38dadc4100bc87/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileOperator.scala#L74-L83).
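
The idea can be sketched roughly as below. This is only a minimal illustration and not the actual change in this PR: `SchemaInferenceSketch`, `inferSchema`, `readFooterOnDriver`, `readFootersWithJob` and `merge` are all hypothetical names, and the schema is modelled as a plain list of (column name, type name) pairs instead of Spark's `StructType`.

```scala
// Hypothetical sketch: none of these names exist in Spark.
object SchemaInferenceSketch {
  // Simplified stand-in for a schema: (column name, type name) pairs.
  type Schema = Seq[(String, String)]

  // Picks the cheapest way to obtain a schema for the given files.
  // `readFooterOnDriver` stands in for reading one footer in the driver;
  // `readFootersWithJob` stands in for launching a distributed job.
  def inferSchema(
      files: Seq[String],
      readFooterOnDriver: String => Schema,
      readFootersWithJob: Seq[String] => Seq[Schema]): Option[Schema] = files match {
    case Seq() => None                                    // nothing to read
    case Seq(single) => Some(readFooterOnDriver(single))  // no extra job needed
    case many => readFootersWithJob(many).reduceOption(merge)
  }

  // Naive merge: keep the left columns, append unseen columns from the right.
  def merge(left: Schema, right: Schema): Schema = {
    val known = left.map(_._1).toSet
    left ++ right.filterNot { case (name, _) => known(name) }
  }
}
```

With a single file only the driver-side reader is called, so no job is launched; with several files the distributed path is still used and the per-file schemas are merged.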
## How was this patch tested?
Existing tests should cover this.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/HyukjinKwon/spark SPARK-17071
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/14660.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #14660
commit 614abbc6b7a03ff0d3e505697c0bbfec3b330c2b
Author: hyukjinkwon
Date: 2016-08-16T05:42:29Z
Fetch Parquet schema within driver-side when there is single file to touch
without another Spark job
commit e1214d50035441fb96551683cf38ae3e49f07b7d
Author: hyukjinkwon
Date: 2016-08-16T05:46:12Z
Fix modifier