[GitHub] spark pull request: [SPARK-7051] [SQL] Configuration for parquet d...
Github user chenghao-intel closed the pull request at: https://github.com/apache/spark/pull/5630

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/5630#issuecomment-101372584

Honestly all of this session stuff seems pretty confusing and half-baked. I think we should have a full design of configuration and session state before we do any further development.
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/5630#issuecomment-101355347

Here is a summary per our offline discussion:

1. Add `SQLContext.hadoopConf`.
2. Let `HiveContext.hiveconf` override `SQLContext.hadoopConf`.
3. `SQLContext.setConf` should also set `SQLContext.hadoopConf`.
4. Use `SQLContext.hadoopConf` throughout all table scan and insertion jobs.

A tricky part here is that `SQLContext.hadoopConf` must play well with multi-session support.
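The layering in the four points above can be sketched as follows. This is a minimal illustration using plain Scala `Map`s in place of Hadoop's `Configuration` and Hive's `HiveConf`; the names mirror the summary but are not actual Spark API.

```scala
// A sketch of the proposed configuration layering, assuming plain Maps
// stand in for Hadoop's Configuration and Hive's HiveConf.
object ConfLayeringSketch {
  // Stands in for SQLContext.hadoopConf (point 1).
  var hadoopConf: Map[String, String] =
    Map("parquet.compression" -> "UNCOMPRESSED")

  // Stands in for HiveContext.hiveconf, which overrides hadoopConf (point 2).
  var hiveConf: Map[String, String] = Map.empty

  // setConf writes through to hadoopConf (point 3).
  def setConf(key: String, value: String): Unit =
    hadoopConf += (key -> value)

  // The configuration a table scan or insertion job would see (point 4):
  // hadoopConf as the base, with hiveconf entries winning on conflict.
  def effectiveConf: Map[String, String] = hadoopConf ++ hiveConf
}
```

The multi-session concern mentioned at the end is why mutable state like this cannot simply be global: each session would need its own copy of the base map.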
Github user chenghao-intel commented on the pull request: https://github.com/apache/spark/pull/5630#issuecomment-101110417

After talking with @liancheng offline, I will work on this again after #5526 is merged.
Github user chenghao-intel commented on a diff in the pull request: https://github.com/apache/spark/pull/5630#discussion_r29310766

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala ---

    @@ -36,6 +36,8 @@ private[spark] object SQLConf {
       val PARQUET_INT96_AS_TIMESTAMP = "spark.sql.parquet.int96AsTimestamp"
       val PARQUET_CACHE_METADATA = "spark.sql.parquet.cacheMetadata"
       val PARQUET_COMPRESSION = "spark.sql.parquet.compression.codec"
    +  val PARQUET_BLOCK_SIZE = "spark.sql.parquet.blocksize"
    +  val PARQUET_PAGE_SIZE = "spark.sql.parquet.pagesize"

--- End diff ---

@yhuai, I just confirmed: it still keeps the Parquet file uncompressed even if we set the property with `SET parquet.compression=GZIP`. From the source code, https://github.com/chenghao-intel/spark/blob/parquet/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala#L636 settings in `SQLConf` are not appended to the `Configuration`; that's why none of the settings take effect.
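The fix implied by this comment is to forward SQL-level settings into the Hadoop `Configuration` handed to the Parquet write job. A hedged sketch, with a mutable `Map` standing in for the `Configuration` and an illustrative (made-up) function name:

```scala
// Forward SQL-level settings into the job's Hadoop configuration, so that
// e.g. `SET parquet.compression=GZIP` actually reaches the Parquet writer.
// The mutable Map stands in for org.apache.hadoop.conf.Configuration.
def propagateParquetSettings(
    sqlConfSettings: Map[String, String],
    hadoopConf: scala.collection.mutable.Map[String, String]): Unit = {
  // Forward only keys the Parquet output format understands.
  sqlConfSettings
    .filter { case (k, _) => k.startsWith("parquet.") }
    .foreach { case (k, v) => hadoopConf(k) = v }
}
```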
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5630#issuecomment-96767366

[Test build #30992 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30992/consoleFull) for PR 5630 at commit [`62e587f`](https://github.com/apache/spark/commit/62e587f40f7b06127c761152cddab7e75dc83879).
Github user chenghao-intel commented on a diff in the pull request: https://github.com/apache/spark/pull/5630#discussion_r28937606

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala ---

At least we need to add the prefix `spark.sql.`, right? Like what we did for `spark.sql.parquet.compression.codec`.
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/5630#discussion_r28937324

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala ---

I meant that Parquet has its own conf keys for those, right? We do not need to add Spark SQL ones. See https://github.com/apache/parquet-mr/blob/parquet-1.6.0rc3/parquet-hadoop/src/main/java/parquet/hadoop/ParquetOutputFormat.java#L98-L106
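For reference, the Parquet-native keys defined as constants in parquet-mr's `ParquetOutputFormat` (linked above) are the following; the key strings are real, while the defaults noted in the comments are the usual parquet-1.6-era values:

```scala
// Parquet's own Hadoop configuration keys, from parquet-mr's
// ParquetOutputFormat; set these instead of inventing spark.sql.* variants.
val ParquetCompression = "parquet.compression"  // e.g. UNCOMPRESSED, SNAPPY, GZIP
val ParquetBlockSize   = "parquet.block.size"   // row group size in bytes (commonly 128 MB)
val ParquetPageSize    = "parquet.page.size"    // page size in bytes (commonly 1 MB)
```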
Github user chenghao-intel commented on a diff in the pull request: https://github.com/apache/spark/pull/5630#discussion_r28937128

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala ---

Yes, users have to use the `SET` command in Hive, rather than specifying Hive table properties. See http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_ig_parquet.html
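A hypothetical session following the Hive convention cited above would configure the writer via `SET` statements. So the sketch runs standalone, `sql` here is a stub that only records the statements it is given; in Spark it would be `hiveContext.sql(...)`:

```scala
// Stub standing in for HiveContext.sql so the sketch is self-contained;
// it records each statement rather than executing it.
val issued = scala.collection.mutable.ListBuffer[String]()
def sql(statement: String): Unit = issued += statement

sql("SET parquet.compression=GZIP")      // session-level Parquet writer option
sql("SET parquet.block.size=134217728")  // 128 MB row groups
sql("CREATE TABLE dst STORED AS PARQUET AS SELECT * FROM src")
```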
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/5630#discussion_r28936699

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala ---

You meant when users create a Hive table using `CREATE TABLE ... STORED AS PARQUET`? For that case, the user should use `SET` or Hive's table properties, right?
Github user chenghao-intel commented on a diff in the pull request: https://github.com/apache/spark/pull/5630#discussion_r28927882

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala ---

@yhuai, I know it will work if we create the table via the external data source API; this is for making it consistent with Hive tables. Sorry, I should have made that clearer; I've updated the description.
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/5630#discussion_r28892756

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala ---

I think Parquet has conf keys for these, right? Users could just pass those properties in the `Options` (the `parameters` in `ParquetRelation2`). Then we would need to change the Parquet relation to set those properties correctly.
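The alternative suggested here can be sketched as a small translation step: accept writer settings as data source options (the `parameters` of `ParquetRelation2`) and map them onto Parquet's own Hadoop conf keys. The option names and the function name below are illustrative, not actual Spark API:

```scala
// Translate hypothetical data source options into Parquet's own conf keys.
// Only options the caller actually supplied are forwarded.
def parquetConfFromOptions(parameters: Map[String, String]): Map[String, String] = {
  val conf = scala.collection.mutable.Map[String, String]()
  parameters.get("compression").foreach(v => conf("parquet.compression") = v)
  parameters.get("blockSize").foreach(v => conf("parquet.block.size") = v)
  parameters.get("pageSize").foreach(v => conf("parquet.page.size") = v)
  conf.toMap
}
```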
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5630#issuecomment-95100781

[Test build #30741 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30741/consoleFull) for PR 5630 at commit [`62e587f`](https://github.com/apache/spark/commit/62e587f40f7b06127c761152cddab7e75dc83879).

* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
* This patch does not change any dependencies.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5630#issuecomment-95100822

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30741/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5630#issuecomment-95068115

[Test build #30741 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30741/consoleFull) for PR 5630 at commit [`62e587f`](https://github.com/apache/spark/commit/62e587f40f7b06127c761152cddab7e75dc83879).
GitHub user chenghao-intel opened a pull request: https://github.com/apache/spark/pull/5630

[SPARK-7051] [SQL] Configuration for parquet data writing

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/chenghao-intel/spark parquet

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5630.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5630

commit 62e587f40f7b06127c761152cddab7e75dc83879
Author: Cheng Hao
Date: 2015-04-22T07:50:54Z

    Parquet Configuration for writting