[GitHub] spark pull request: [SPARK-2119][SQL] Improved Parquet performance...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/1370#issuecomment-49193554

Thanks! I've merged this into master.
[GitHub] spark pull request: [SPARK-2119][SQL] Improved Parquet performance...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1370
[GitHub] spark pull request: [SPARK-2119][SQL] Improved Parquet performance...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/1370#discussion_r14967550

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTypes.scala ---

@@ -365,20 +366,23 @@ private[parquet] object ParquetTypesConverter extends Logging {
         s"Expected $path for be a directory with Parquet files/metadata")
     }
     ParquetRelation.enableLogForwarding()
-    val metadataPath = new Path(path, ParquetFileWriter.PARQUET_METADATA_FILE)
-    // if this is a new table that was just created we will find only the metadata file
-    if (fs.exists(metadataPath) && fs.isFile(metadataPath)) {
-      ParquetFileReader.readFooter(conf, metadataPath)
-    } else {
-      // there may be one or more Parquet files in the given directory
-      val footers = ParquetFileReader.readFooters(conf, fs.getFileStatus(path))
-      // TODO: for now we assume that all footers (if there is more than one) have identical
-      // metadata; we may want to add a check here at some point
-      if (footers.size() == 0) {
-        throw new IllegalArgumentException(s"Could not find Parquet metadata at path $path")
-      }
-      footers(0).getParquetMetadata
+
+    val children = fs.listStatus(path).filterNot {
+      _.getPath.getName == FileOutputCommitter.SUCCEEDED_FILE_NAME
     }
+
+    // NOTE (lian): Parquet _metadata file can be very slow if the file consists of lots of row
+    // groups. Since Parquet schema is replicated among all row groups, we only need to touch a

--- End diff --

Are we making a new assumption here that all of the data has the same schema? I know we don't promise support for that now, but it would be nice to do in the future.
[GitHub] spark pull request: [SPARK-2119][SQL] Improved Parquet performance...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/1370#discussion_r14979405

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTypes.scala ---

@@ -365,20 +366,23 @@ private[parquet] object ParquetTypesConverter extends Logging {
         s"Expected $path for be a directory with Parquet files/metadata")
     }
     ParquetRelation.enableLogForwarding()
-    val metadataPath = new Path(path, ParquetFileWriter.PARQUET_METADATA_FILE)
-    // if this is a new table that was just created we will find only the metadata file
-    if (fs.exists(metadataPath) && fs.isFile(metadataPath)) {
-      ParquetFileReader.readFooter(conf, metadataPath)
-    } else {
-      // there may be one or more Parquet files in the given directory
-      val footers = ParquetFileReader.readFooters(conf, fs.getFileStatus(path))
-      // TODO: for now we assume that all footers (if there is more than one) have identical
-      // metadata; we may want to add a check here at some point
-      if (footers.size() == 0) {
-        throw new IllegalArgumentException(s"Could not find Parquet metadata at path $path")
-      }
-      footers(0).getParquetMetadata
+
+    val children = fs.listStatus(path).filterNot {
+      _.getPath.getName == FileOutputCommitter.SUCCEEDED_FILE_NAME
     }
+
+    // NOTE (lian): Parquet _metadata file can be very slow if the file consists of lots of row
+    // groups. Since Parquet schema is replicated among all row groups, we only need to touch a

--- End diff --

Yes, we are making this assumption; I will add a comment here. (Also, checking schema consistency could be inefficient for a large Parquet file with lots of row groups.)
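To make the assumption above concrete, here is a minimal sketch (not part of the patch) of reading the schema from a single part-file footer, plus the kind of consistency check suggested above. The helper names `readSchemaFromOneFooter` and `schemasAreConsistent` are hypothetical; the sketch assumes the parquet-mr and Hadoop APIs already referenced in the quoted hunk (`ParquetFileReader.readFooter`, `FileOutputCommitter.SUCCEEDED_FILE_NAME`, `ParquetFileWriter.PARQUET_METADATA_FILE`).

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
import parquet.hadoop.{ParquetFileReader, ParquetFileWriter}
import parquet.schema.MessageType

// Hypothetical helper: read the schema from the footer of one part-file
// instead of parsing the (possibly huge) _metadata summary file.
def readSchemaFromOneFooter(conf: Configuration, dir: Path): MessageType = {
  val fs = dir.getFileSystem(conf)
  // Skip the _SUCCESS marker and the _metadata summary file, keeping only part-files.
  val children = fs.listStatus(dir).filterNot { status =>
    val name = status.getPath.getName
    name == FileOutputCommitter.SUCCEEDED_FILE_NAME ||
      name == ParquetFileWriter.PARQUET_METADATA_FILE
  }
  require(children.nonEmpty, s"Could not find Parquet metadata at path $dir")
  // Assumption discussed above: every part-file carries the same schema in its
  // footer, so reading a single footer is enough.
  ParquetFileReader.readFooter(conf, children.head.getPath).getFileMetaData.getSchema
}

// The consistency check suggested above would have to read every footer,
// i.e. one blocking round trip per part-file on S3.
def schemasAreConsistent(conf: Configuration, partFiles: Seq[Path]): Boolean =
  partFiles
    .map(p => ParquetFileReader.readFooter(conf, p).getFileMetaData.getSchema)
    .distinct
    .size <= 1
```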
[GitHub] spark pull request: [SPARK-2119][SQL] Improved Parquet performance...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1370#issuecomment-49120350

QA tests have started for PR 1370. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16704/consoleFull
[GitHub] spark pull request: [SPARK-2119][SQL] Improved Parquet performance...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1370#issuecomment-48700541

QA results for PR 1370:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16555/consoleFull
[GitHub] spark pull request: [SPARK-2119][SQL] Improved Parquet performance...
GitHub user liancheng opened a pull request: https://github.com/apache/spark/pull/1370

[SPARK-2119][SQL] Improved Parquet performance when reading off S3

JIRA issue: [SPARK-2119](https://issues.apache.org/jira/browse/SPARK-2119)

Essentially this PR fixes three issues to gain much better performance when reading large Parquet files off S3.

1. When reading the schema, fetch Parquet metadata from a part-file rather than the `_metadata` file.

   The `_metadata` file contains the metadata of all row groups, and can be very large if there are many row groups. Since the schema information and the row group metadata are coupled within a single Thrift object, we have to read the whole `_metadata` file to fetch the schema. On the other hand, the schema is replicated in the footers of all part-files, which are fairly small.

2. Add only the root directory of the Parquet file, rather than all the part-files, to the input paths.

   The HDFS API automatically filters out hidden files and underscore files (`_SUCCESS`, `_metadata`), so there's no need to filter out the part-files and add them individually to the input paths. What makes it much worse is that `FileInputFormat.listStatus()` calls `FileSystem.globStatus()` on each individual input path sequentially, each call resulting in a blocking remote S3 HTTP request.

3. Worked around [PARQUET-16](https://issues.apache.org/jira/browse/PARQUET-16).

   Essentially PARQUET-16 is similar to the issue above, and results in lots of sequential `FileSystem.getFileStatus()` calls, which are further translated into a bunch of remote S3 HTTP requests. `FilteringParquetRowInputFormat` should be cleaned up once PARQUET-16 is fixed.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liancheng/spark faster-parquet

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1370.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #1370

commit 5bd3d29f9fa118719c94d1f5acffa24d6f1a755d
Author: Cheng Lian lian.cs@gmail.com
Date: 2014-07-06T04:59:01Z

    Fixed Parquet log level

commit 1c0d1b923a57fddd1fe67270c71e28ac0324de04
Author: Cheng Lian lian.cs@gmail.com
Date: 2014-07-09T01:53:38Z

    Accelerated Parquet schema retrieving

commit d2c4417a45dff48ad52a830695f9d68f9ed8531f
Author: Cheng Lian lian.cs@gmail.com
Date: 2014-07-10T20:17:57Z

    Worked around PARQUET-16 to improve Parquet performance
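To illustrate the second point above, here is a minimal sketch (assuming the standard Hadoop `FileInputFormat` API; the S3 path and the `partFilePaths` collection are hypothetical) of registering only the table's root directory as an input path instead of enumerating every part-file:

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

val job = Job.getInstance()

// Slow: registering every part-file triggers one blocking
// FileSystem.globStatus() call (an S3 HTTP round trip) per path when
// FileInputFormat.listStatus() runs.
// partFilePaths.foreach(p => FileInputFormat.addInputPath(job, p))

// Fast: register only the root directory. listStatus() expands it in a single
// listing, and the default hidden-file filter skips names starting with
// '_' or '.' (e.g. _SUCCESS, _metadata).
FileInputFormat.addInputPath(job, new Path("s3n://some-bucket/path/to/parquet-table"))
```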
[GitHub] spark pull request: [SPARK-2119][SQL] Improved Parquet performance...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1370#issuecomment-48695685

QA tests have started for PR 1370. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16555/consoleFull