[GitHub] spark pull request #14660: [SPARK-17071][SQL] Fetch Parquet schema without a...

2016-08-16 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14660#discussion_r74889938
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
@@ -224,7 +226,16 @@ class ParquetFileFormat
 .orElse(filesByType.data.headOption)
 .toSeq
   }
-ParquetFileFormat.mergeSchemasInParallel(filesToTouch, sparkSession)
+
+filesToTouch match {
+  case f :: Nil =>
--- End diff --

if you are doing this, we should probably add an undocumented config option 
to control this. I also think the number can be bigger. We'd need to experiment 
with it though.
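
The threshold-plus-config idea above can be sketched in plain Scala. All names here, including the config value, are illustrative and not part of the PR; in Spark the threshold would come from an (undocumented) SQLConf entry:

```scala
// Sketch of the reviewer's suggestion: instead of special-casing exactly one
// file, gate the driver-side path behind a configurable threshold.
// `driverSideSchemaThreshold` stands in for an undocumented config entry.
object SchemaInferenceStrategy {
  // Would be read from config; the right default needs experimentation.
  val driverSideSchemaThreshold: Int = 4

  // Returns which code path would handle schema inference for `fileCount` files.
  def choose(fileCount: Int): String =
    if (fileCount <= driverSideSchemaThreshold) "driver-side footer read"
    else "distributed merge job"
}
```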


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14660: [SPARK-17071][SQL] Fetch Parquet schema without a...

2016-08-16 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14660#discussion_r74889501
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala ---
@@ -794,13 +805,44 @@ object ParquetFileFormat extends Logging {
   }
 
   /**
-   * Reads Spark SQL schema from a Parquet footer.  If a valid serialized 
Spark SQL schema string
+   * Figures out a Parquet schema in driver-side.
+   */
+  def readSchemaFromSingleFile(
--- End diff --

shouldn't mergeSchemasInParallel call readSchemaFromSingleFile?
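
The refactor being suggested, where the parallel merge reuses the per-file reader, might look roughly like this. Types are simplified to sets of column names, and both functions are stand-ins, not the actual Spark implementation:

```scala
// Simplified sketch: mergeSchemasInParallel delegates per-file reading to
// readSchemaFromSingleFile, so both code paths share one footer-parsing routine.
object ParquetSchemaSketch {
  // Stand-in for parsing one Parquet footer; a "schema" is just column names,
  // looked up from a pre-built map of path -> columns.
  def readSchemaFromSingleFile(footers: Map[String, Set[String]], path: String): Set[String] =
    footers(path)

  // The real version would map this over files inside a Spark job; here it is
  // a plain sequential fold for illustration.
  def mergeSchemasInParallel(footers: Map[String, Set[String]]): Set[String] =
    footers.keys.foldLeft(Set.empty[String]) { (acc, path) =>
      acc union readSchemaFromSingleFile(footers, path)
    }
}
```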





[GitHub] spark pull request #14660: [SPARK-17071][SQL] Fetch Parquet schema without a...

2016-08-15 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/14660

[SPARK-17071][SQL] Fetch Parquet schema without another Spark job when it 
is a single file to touch 

## What changes were proposed in this pull request?

It seems Spark always executes another job to figure out the schema 
([ParquetFileFormat#L739-L778](https://github.com/apache/spark/blob/abff92bfdc7d4c9d2308794f0350561fe0ceb4dd/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L739-L778)).

However, launching a separate job is unnecessary overhead when there is only a 
single file to touch. I ran a benchmark with the code below:

```scala
test("Benchmark for Parquet schema reading") {
  withTempPath { path =>
    Seq((1, 2D, 3L, "4")).toDF("a", "b", "c", "d")
      .write.format("parquet").save(path.getAbsolutePath)

    val benchmark = new Benchmark("Parquet - read schema", 1)
    benchmark.addCase("Parquet - read schema", 10) { _ =>
      spark.read.format("parquet").load(path.getCanonicalPath).schema
    }
    benchmark.run()
  }
}
```

with the results as below:

- **Before**

  ```
  Parquet - read schema:              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
  --------------------------------------------------------------------------------------------
  Parquet - read schema                      47 /   49          0.0    46728419.0       1.0X
  ```

- **After**

  ```
  Parquet - read schema:              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
  --------------------------------------------------------------------------------------------
  Parquet - read schema                       2 /    3          0.0     1811673.0       1.0X
  ```

Schema reading became roughly 20X faster (although it is only a small fraction 
of total job run-time).

As a reference, ORC already does this on the driver side: 
[OrcFileOperator.scala#L74-L83](https://github.com/apache/spark/blob/a95252823e09939b654dd425db38dadc4100bc87/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileOperator.scala#L74-L83).
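
The driver-side pattern referenced above (try files locally and take the first schema that parses, instead of submitting a job) can be sketched as follows; `tryReadFooter` is a stand-in for actual footer parsing, and all names are illustrative:

```scala
// Illustrative sketch of the driver-side shortcut, analogous to what
// OrcFileOperator does: probe files one by one on the driver and return the
// first schema that is readable, instead of launching a Spark job.
object DriverSideSchema {
  // Stand-in for footer parsing; None models a corrupt or unreadable footer.
  def tryReadFooter(footers: Map[String, Option[String]], path: String): Option[String] =
    footers.getOrElse(path, None)

  // Lazily probes paths in order and stops at the first readable schema.
  def firstReadableSchema(footers: Map[String, Option[String]],
                          paths: Seq[String]): Option[String] =
    paths.view.flatMap(p => tryReadFooter(footers, p)).headOption
}
```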

## How was this patch tested?

Existing tests should cover this



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark SPARK-17071

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14660.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14660


commit 614abbc6b7a03ff0d3e505697c0bbfec3b330c2b
Author: hyukjinkwon 
Date:   2016-08-16T05:42:29Z

Fetch Parquet schema within driver-side when there is single file to touch 
without another Spark job

commit e1214d50035441fb96551683cf38ae3e49f07b7d
Author: hyukjinkwon 
Date:   2016-08-16T05:46:12Z

Fix modifier



