[GitHub] spark pull request: [SPARK-2119][SQL] Improved Parquet performance...

2014-07-16 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1370#issuecomment-49193554
  
Thanks!  I've merged this into master.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2119][SQL] Improved Parquet performance...

2014-07-16 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1370




[GitHub] spark pull request: [SPARK-2119][SQL] Improved Parquet performance...

2014-07-15 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/1370#discussion_r14967550
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTypes.scala ---
@@ -365,20 +366,23 @@ private[parquet] object ParquetTypesConverter extends Logging {
         s"Expected $path for be a directory with Parquet files/metadata")
     }
     ParquetRelation.enableLogForwarding()
-    val metadataPath = new Path(path, ParquetFileWriter.PARQUET_METADATA_FILE)
-    // if this is a new table that was just created we will find only the metadata file
-    if (fs.exists(metadataPath) && fs.isFile(metadataPath)) {
-      ParquetFileReader.readFooter(conf, metadataPath)
-    } else {
-      // there may be one or more Parquet files in the given directory
-      val footers = ParquetFileReader.readFooters(conf, fs.getFileStatus(path))
-      // TODO: for now we assume that all footers (if there is more than one) have identical
-      // metadata; we may want to add a check here at some point
-      if (footers.size() == 0) {
-        throw new IllegalArgumentException(s"Could not find Parquet metadata at path $path")
-      }
-      footers(0).getParquetMetadata
+
+    val children = fs.listStatus(path).filterNot {
+      _.getPath.getName == FileOutputCommitter.SUCCEEDED_FILE_NAME
     }
+
+    // NOTE (lian): Parquet _metadata file can be very slow if the file consists of lots of row
+    // groups. Since Parquet schema is replicated among all row groups, we only need to touch a
--- End diff --

Are we making a new assumption here that all of the data has the same 
schema? I know we don't promise support for that now, but it would be nice to 
do in the future.




[GitHub] spark pull request: [SPARK-2119][SQL] Improved Parquet performance...

2014-07-15 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/1370#discussion_r14979405
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTypes.scala ---
@@ -365,20 +366,23 @@ private[parquet] object ParquetTypesConverter extends Logging {
         s"Expected $path for be a directory with Parquet files/metadata")
     }
     ParquetRelation.enableLogForwarding()
-    val metadataPath = new Path(path, ParquetFileWriter.PARQUET_METADATA_FILE)
-    // if this is a new table that was just created we will find only the metadata file
-    if (fs.exists(metadataPath) && fs.isFile(metadataPath)) {
-      ParquetFileReader.readFooter(conf, metadataPath)
-    } else {
-      // there may be one or more Parquet files in the given directory
-      val footers = ParquetFileReader.readFooters(conf, fs.getFileStatus(path))
-      // TODO: for now we assume that all footers (if there is more than one) have identical
-      // metadata; we may want to add a check here at some point
-      if (footers.size() == 0) {
-        throw new IllegalArgumentException(s"Could not find Parquet metadata at path $path")
-      }
-      footers(0).getParquetMetadata
+
+    val children = fs.listStatus(path).filterNot {
+      _.getPath.getName == FileOutputCommitter.SUCCEEDED_FILE_NAME
     }
+
+    // NOTE (lian): Parquet _metadata file can be very slow if the file consists of lots of row
+    // groups. Since Parquet schema is replicated among all row groups, we only need to touch a
--- End diff --

Yes, we are making this assumption; will add a comment here. (And checking 
schema consistency can be potentially inefficient for large Parquet files with 
lots of row groups.)
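
The consistency check discussed above could be sketched roughly as follows. This is an illustrative Python sketch, not Spark code; the function name and the idea of comparing footer schemas as strings are assumptions for the example:

```python
def schemas_consistent(footer_schemas):
    """Return True if every part-file footer reports the same schema.

    footer_schemas: list of schema strings, one per footer. Comparing the
    full list costs one footer read per part-file, which is exactly the
    overhead the reply above notes can be expensive for large files.
    """
    return len(set(footer_schemas)) <= 1


# With identical schemas the check passes; a single divergent footer fails it.
same = ["message spark { required int32 a; }"] * 3
mixed = ["message spark { required int32 a; }",
         "message spark { required int64 a; }"]
print(schemas_consistent(same), schemas_consistent(mixed))
```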




[GitHub] spark pull request: [SPARK-2119][SQL] Improved Parquet performance...

2014-07-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1370#issuecomment-49120350
  
QA tests have started for PR 1370. This patch merges cleanly.
View progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16704/consoleFull




[GitHub] spark pull request: [SPARK-2119][SQL] Improved Parquet performance...

2014-07-11 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1370#issuecomment-48700541
  
QA results for PR 1370:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16555/consoleFull




[GitHub] spark pull request: [SPARK-2119][SQL] Improved Parquet performance...

2014-07-10 Thread liancheng
GitHub user liancheng opened a pull request:

https://github.com/apache/spark/pull/1370

[SPARK-2119][SQL] Improved Parquet performance when reading off S3

JIRA issue: [SPARK-2119](https://issues.apache.org/jira/browse/SPARK-2119)

Essentially this PR fixes three issues to gain much better performance when 
reading large Parquet files off S3.

1. When reading the schema, fetching Parquet metadata from a part-file 
rather than the `_metadata` file

   The `_metadata` file contains metadata of all row groups, and can be 
very large if there are many row groups. Since schema information and row group 
metadata are coupled within a single Thrift object, we have to read the whole 
`_metadata` file to fetch the schema. On the other hand, the schema is replicated 
among the footers of all part-files, which are fairly small.

1. Only add the root directory of the Parquet file rather than all the 
part-files to input paths

   The HDFS API can automatically filter out all hidden files and underscore 
files (`_SUCCESS` & `_metadata`), so there's no need to enumerate all part-files 
and add them individually to input paths. What makes it much worse is that 
`FileInputFormat.listStatus()` calls `FileSystem.globStatus()` on each 
individual input path sequentially, each resulting in a blocking remote S3 HTTP 
request.

1. Worked around 
[PARQUET-16](https://issues.apache.org/jira/browse/PARQUET-16)

   Essentially PARQUET-16 is similar to the above issue, and results in lots 
of sequential `FileSystem.getFileStatus()` calls, which are further translated 
into a bunch of remote S3 HTTP requests.

   `FilteringParquetRowInputFormat` should be cleaned up once PARQUET-16 is 
fixed.
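
The gist of points 1 and 2 can be sketched as a small illustrative example. This is not Spark's actual code; the function and file names below are hypothetical, and only the selection logic (skip underscore/hidden files, read the schema from any one part-file footer instead of the large `_metadata` file) mirrors the PR:

```python
def pick_footer_file(children):
    """Pick one data file whose footer carries the (replicated) schema.

    children: file names listed under the Parquet output directory.
    Underscore files such as _SUCCESS and _metadata, and hidden dot-files,
    are skipped, matching the hidden-file convention the HDFS API applies.
    """
    data_files = [name for name in children
                  if not name.startswith(("_", "."))]
    if not data_files:
        raise ValueError("Could not find Parquet metadata")
    # Any part-file works: the schema is replicated in every footer,
    # and a single footer is far cheaper to read than _metadata.
    return data_files[0]


print(pick_footer_file(["_SUCCESS", "_metadata", "part-r-00001.parquet"]))
# part-r-00001.parquet
```

The point of reading a part-file footer is that its size is independent of the total number of row groups in the table, whereas `_metadata` grows with every row group written.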

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liancheng/spark faster-parquet

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1370.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1370


commit 5bd3d29f9fa118719c94d1f5acffa24d6f1a755d
Author: Cheng Lian lian.cs@gmail.com
Date:   2014-07-06T04:59:01Z

Fixed Parquet log level

commit 1c0d1b923a57fddd1fe67270c71e28ac0324de04
Author: Cheng Lian lian.cs@gmail.com
Date:   2014-07-09T01:53:38Z

Accelerated Parquet schema retrieving

commit d2c4417a45dff48ad52a830695f9d68f9ed8531f
Author: Cheng Lian lian.cs@gmail.com
Date:   2014-07-10T20:17:57Z

Worked around PARQUET-16 to improve Parquet performance






[GitHub] spark pull request: [SPARK-2119][SQL] Improved Parquet performance...

2014-07-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1370#issuecomment-48695685
  
QA tests have started for PR 1370. This patch merges cleanly.
View progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16555/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---