[GitHub] spark pull request: [SPARK-7051] [SQL] Configuration for parquet d...

2015-05-18 Thread chenghao-intel
Github user chenghao-intel closed the pull request at:

https://github.com/apache/spark/pull/5630


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-7051] [SQL] Configuration for parquet d...

2015-05-12 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/5630#issuecomment-101372584
  
Honestly all of this session stuff seems pretty confusing and half-baked.  
I think we should have a full design of configuration and session state before 
we do any further development.





[GitHub] spark pull request: [SPARK-7051] [SQL] Configuration for parquet d...

2015-05-12 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/5630#issuecomment-101355347
  
Here is a summary per our offline discussion:

1. Add `SQLContext.hadoopConf`
2. Let `HiveContext.hiveconf` override `SQLContext.hadoopConf`
3. `SQLContext.setConf` should also set `SQLContext.hadoopConf`
4. Use `SQLContext.hadoopConf` throughout all table scan and insertion jobs

A tricky part here is that `SQLContext.hadoopConf` must play well with 
multi-session support.
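The layering described in the summary above (a base `SQLContext.hadoopConf`, overridden by `HiveContext.hiveconf`, with `setConf` writing through to both) can be sketched with plain maps. This is an illustrative sketch only: `SessionConf`, `effective`, and the map representation are hypothetical stand-ins, not actual Spark or Hadoop APIs.

```scala
// Hedged sketch of the proposed config layering, using immutable Maps
// in place of Hadoop's Configuration. All names here are illustrative.
object SessionConf {
  type Conf = Map[String, String]

  // HiveContext.hiveconf entries win over the base SQLContext.hadoopConf.
  def effective(base: Conf, hiveOverrides: Conf): Conf =
    base ++ hiveOverrides

  // SQLContext.setConf should also update hadoopConf; here we simply
  // return the updated base map.
  def setConf(base: Conf, key: String, value: String): Conf =
    base + (key -> value)
}
```

Holding one such map per session, rather than a single global one, is one way to address the multi-session concern raised above.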





[GitHub] spark pull request: [SPARK-7051] [SQL] Configuration for parquet d...

2015-05-11 Thread chenghao-intel
Github user chenghao-intel commented on the pull request:

https://github.com/apache/spark/pull/5630#issuecomment-101110417
  
After talking with @liancheng offline, I will work on this again after #5526 
is merged.





[GitHub] spark pull request: [SPARK-7051] [SQL] Configuration for parquet d...

2015-04-28 Thread chenghao-intel
Github user chenghao-intel commented on a diff in the pull request:

https://github.com/apache/spark/pull/5630#discussion_r29310766
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala ---
@@ -36,6 +36,8 @@ private[spark] object SQLConf {
   val PARQUET_INT96_AS_TIMESTAMP = "spark.sql.parquet.int96AsTimestamp"
   val PARQUET_CACHE_METADATA = "spark.sql.parquet.cacheMetadata"
   val PARQUET_COMPRESSION = "spark.sql.parquet.compression.codec"
+  val PARQUET_BLOCK_SIZE = "spark.sql.parquet.blocksize"
+  val PARQUET_PAGE_SIZE = "spark.sql.parquet.pagesize"
--- End diff --

@yhuai, I just confirmed: the Parquet file is still written uncompressed even 
if we set the property with `SET parquet.compression=GZIP`.

From the source code, 
https://github.com/chenghao-intel/spark/blob/parquet/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala#L636
settings in `SQLConf` are not appended to the `Configuration`, which is why 
none of these settings take effect.
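The missing step described above — copying `SET` values from `SQLConf` into the Hadoop `Configuration` used by the write job — might look like the following sketch. Maps stand in for `SQLConf` and `Configuration`, and `copyParquetSettings` is a hypothetical helper, not Spark code:

```scala
// Hedged sketch: propagate parquet.* settings from the SQL conf into the
// job configuration, so e.g. `SET parquet.compression=GZIP` reaches the writer.
object ConfPropagation {
  def copyParquetSettings(
      sqlConf: Map[String, String],
      jobConf: Map[String, String]): Map[String, String] =
    jobConf ++ sqlConf.filter { case (k, _) => k.startsWith("parquet.") }
}
```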





[GitHub] spark pull request: [SPARK-7051] [SQL] Configuration for parquet d...

2015-04-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5630#issuecomment-96767366
  
  [Test build #30992 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30992/consoleFull)
 for   PR 5630 at commit 
[`62e587f`](https://github.com/apache/spark/commit/62e587f40f7b06127c761152cddab7e75dc83879).





[GitHub] spark pull request: [SPARK-7051] [SQL] Configuration for parquet d...

2015-04-22 Thread chenghao-intel
Github user chenghao-intel commented on a diff in the pull request:

https://github.com/apache/spark/pull/5630#discussion_r28937606
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala ---
@@ -36,6 +36,8 @@ private[spark] object SQLConf {
   val PARQUET_INT96_AS_TIMESTAMP = "spark.sql.parquet.int96AsTimestamp"
   val PARQUET_CACHE_METADATA = "spark.sql.parquet.cacheMetadata"
   val PARQUET_COMPRESSION = "spark.sql.parquet.compression.codec"
+  val PARQUET_BLOCK_SIZE = "spark.sql.parquet.blocksize"
+  val PARQUET_PAGE_SIZE = "spark.sql.parquet.pagesize"
--- End diff --

At least we need to add the prefix `spark.sql.`, right? Like what we did 
for `spark.sql.parquet.compression.codec`.





[GitHub] spark pull request: [SPARK-7051] [SQL] Configuration for parquet d...

2015-04-22 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/5630#discussion_r28937324
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala ---
@@ -36,6 +36,8 @@ private[spark] object SQLConf {
   val PARQUET_INT96_AS_TIMESTAMP = "spark.sql.parquet.int96AsTimestamp"
   val PARQUET_CACHE_METADATA = "spark.sql.parquet.cacheMetadata"
   val PARQUET_COMPRESSION = "spark.sql.parquet.compression.codec"
+  val PARQUET_BLOCK_SIZE = "spark.sql.parquet.blocksize"
+  val PARQUET_PAGE_SIZE = "spark.sql.parquet.pagesize"
--- End diff --

I meant that Parquet has its own conf keys for those, right? We do not 
need to add Spark SQL ones.

See 
https://github.com/apache/parquet-mr/blob/parquet-1.6.0rc3/parquet-hadoop/src/main/java/parquet/hadoop/ParquetOutputFormat.java#L98-L106
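The Parquet-native keys referenced here (defined in parquet-mr's `ParquetOutputFormat`) are `parquet.block.size`, `parquet.page.size`, and `parquet.compression`. A hedged sketch of setting them directly, again with a plain map standing in for a Hadoop `Configuration` (the helper name is illustrative):

```scala
// Sketch: Parquet's own conf keys need no spark.sql.* aliases at all.
object ParquetKeys {
  // Conf keys defined by parquet-mr's ParquetOutputFormat.
  val BlockSize   = "parquet.block.size"
  val PageSize    = "parquet.page.size"
  val Compression = "parquet.compression"

  def withSizes(conf: Map[String, String],
                blockBytes: Long,
                pageBytes: Long): Map[String, String] =
    conf + (BlockSize -> blockBytes.toString) + (PageSize -> pageBytes.toString)
}
```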
 





[GitHub] spark pull request: [SPARK-7051] [SQL] Configuration for parquet d...

2015-04-22 Thread chenghao-intel
Github user chenghao-intel commented on a diff in the pull request:

https://github.com/apache/spark/pull/5630#discussion_r28937128
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala ---
@@ -36,6 +36,8 @@ private[spark] object SQLConf {
   val PARQUET_INT96_AS_TIMESTAMP = "spark.sql.parquet.int96AsTimestamp"
   val PARQUET_CACHE_METADATA = "spark.sql.parquet.cacheMetadata"
   val PARQUET_COMPRESSION = "spark.sql.parquet.compression.codec"
+  val PARQUET_BLOCK_SIZE = "spark.sql.parquet.blocksize"
+  val PARQUET_PAGE_SIZE = "spark.sql.parquet.pagesize"
--- End diff --

Yes, users have to use the `SET` command in Hive, rather than specifying 
Hive's table properties.


http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_ig_parquet.html






[GitHub] spark pull request: [SPARK-7051] [SQL] Configuration for parquet d...

2015-04-22 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/5630#discussion_r28936699
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala ---
@@ -36,6 +36,8 @@ private[spark] object SQLConf {
   val PARQUET_INT96_AS_TIMESTAMP = "spark.sql.parquet.int96AsTimestamp"
   val PARQUET_CACHE_METADATA = "spark.sql.parquet.cacheMetadata"
   val PARQUET_COMPRESSION = "spark.sql.parquet.compression.codec"
+  val PARQUET_BLOCK_SIZE = "spark.sql.parquet.blocksize"
+  val PARQUET_PAGE_SIZE = "spark.sql.parquet.pagesize"
--- End diff --

You mean when users create a Hive table using `CREATE TABLE ... STORED 
AS PARQUET`? For that case, the user should use `SET` or Hive's table 
properties, right?





[GitHub] spark pull request: [SPARK-7051] [SQL] Configuration for parquet d...

2015-04-22 Thread chenghao-intel
Github user chenghao-intel commented on a diff in the pull request:

https://github.com/apache/spark/pull/5630#discussion_r28927882
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala ---
@@ -36,6 +36,8 @@ private[spark] object SQLConf {
   val PARQUET_INT96_AS_TIMESTAMP = "spark.sql.parquet.int96AsTimestamp"
   val PARQUET_CACHE_METADATA = "spark.sql.parquet.cacheMetadata"
   val PARQUET_COMPRESSION = "spark.sql.parquet.compression.codec"
+  val PARQUET_BLOCK_SIZE = "spark.sql.parquet.blocksize"
+  val PARQUET_PAGE_SIZE = "spark.sql.parquet.pagesize"
--- End diff --

@yhuai I know it will work if we create the table via the external data source 
API; this is about making it consistent with Hive tables.

Sorry, I should have made it clearer; I've updated the description.





[GitHub] spark pull request: [SPARK-7051] [SQL] Configuration for parquet d...

2015-04-22 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/5630#discussion_r28892756
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala ---
@@ -36,6 +36,8 @@ private[spark] object SQLConf {
   val PARQUET_INT96_AS_TIMESTAMP = "spark.sql.parquet.int96AsTimestamp"
   val PARQUET_CACHE_METADATA = "spark.sql.parquet.cacheMetadata"
   val PARQUET_COMPRESSION = "spark.sql.parquet.compression.codec"
+  val PARQUET_BLOCK_SIZE = "spark.sql.parquet.blocksize"
+  val PARQUET_PAGE_SIZE = "spark.sql.parquet.pagesize"
--- End diff --

I think that parquet has conf keys for these, right? Users just pass those 
properties in the `Options` (`parameters` in `ParquetRelation2`). Then, we will 
need to change parquet relation to set those properties correctly.
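The `Options` route suggested here — user-supplied data source parameters forwarded into the write configuration — could look like the following sketch. `forwardOptions` and the option names `blockSize`/`pageSize` are hypothetical; maps stand in for `parameters` and the Hadoop conf:

```scala
// Sketch: forward recognized size options from the data source `parameters`
// map into Parquet's native conf keys.
object OptionForwarding {
  private val mapping = Map(
    "blockSize" -> "parquet.block.size",
    "pageSize"  -> "parquet.page.size")

  def forwardOptions(parameters: Map[String, String],
                     conf: Map[String, String]): Map[String, String] =
    mapping.foldLeft(conf) { case (c, (opt, key)) =>
      parameters.get(opt).fold(c)(v => c + (key -> v))
    }
}
```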





[GitHub] spark pull request: [SPARK-7051] [SQL] Configuration for parquet d...

2015-04-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5630#issuecomment-95100781
  
  [Test build #30741 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30741/consoleFull)
 for   PR 5630 at commit 
[`62e587f`](https://github.com/apache/spark/commit/62e587f40f7b06127c761152cddab7e75dc83879).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
 * This patch does not change any dependencies.





[GitHub] spark pull request: [SPARK-7051] [SQL] Configuration for parquet d...

2015-04-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5630#issuecomment-95100822
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30741/
Test PASSed.





[GitHub] spark pull request: [SPARK-7051] [SQL] Configuration for parquet d...

2015-04-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5630#issuecomment-95068115
  
  [Test build #30741 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30741/consoleFull)
 for   PR 5630 at commit 
[`62e587f`](https://github.com/apache/spark/commit/62e587f40f7b06127c761152cddab7e75dc83879).





[GitHub] spark pull request: [SPARK-7051] [SQL] Configuration for parquet d...

2015-04-22 Thread chenghao-intel
GitHub user chenghao-intel opened a pull request:

https://github.com/apache/spark/pull/5630

[SPARK-7051] [SQL] Configuration for parquet data writing



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/chenghao-intel/spark parquet

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/5630.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #5630


commit 62e587f40f7b06127c761152cddab7e75dc83879
Author: Cheng Hao 
Date:   2015-04-22T07:50:54Z

Parquet Configuration for writing



