git commit: [SQL] More aggressive defaults

2014-11-03 Thread marmbrus
Repository: spark
Updated Branches:
  refs/heads/branch-1.2 6104754f7 -> 51985f78c


[SQL] More aggressive defaults

 - Turns on compression for in-memory cached data by default
 - Changes the default parquet compression format back to gzip (we have seen more OOMs with production workloads due to the way Snappy allocates memory)
 - Ups the batch size to 10,000 rows
 - Increases the broadcast threshold to 10mb.
 - Uses our parquet implementation instead of the hive one by default.
 - Cache parquet metadata by default.
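
These are all ordinary SQLConf keys, so anyone who prefers the old behavior can override them per context. A minimal sketch, assuming an existing SQLContext named sqlContext (the metastore option only matters on a HiveContext):

    // Illustrative only: restore the pre-1.2 values that this commit changes.
    sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "false")
    sqlContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "1000")
    sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
    sqlContext.setConf("spark.sql.parquet.cacheMetadata", "false")
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "10000")
    // Hive-specific key; assumes sqlContext is actually a HiveContext.
    sqlContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")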

Author: Michael Armbrust <mich...@databricks.com>

Closes #3064 from marmbrus/fasterDefaults and squashes the following commits:

97ee9f8 [Michael Armbrust] parquet codec docs
e641694 [Michael Armbrust] Remote also
a12866a [Michael Armbrust] Cache metadata.
2d73acc [Michael Armbrust] Update docs defaults.
d63d2d5 [Michael Armbrust] document parquet option
da373f9 [Michael Armbrust] More aggressive defaults

(cherry picked from commit 25bef7e6951301e93004567fc0cef96bf8d1a224)
Signed-off-by: Michael Armbrust <mich...@databricks.com>


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/51985f78
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/51985f78
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/51985f78

Branch: refs/heads/branch-1.2
Commit: 51985f78ca5f728f8b9233b703110f541d27b274
Parents: 6104754
Author: Michael Armbrust <mich...@databricks.com>
Authored: Mon Nov 3 14:08:27 2014 -0800
Committer: Michael Armbrust <mich...@databricks.com>
Committed: Mon Nov 3 14:08:40 2014 -0800

--
 docs/sql-programming-guide.md | 18 +-
 .../main/scala/org/apache/spark/sql/SQLConf.scala | 10 +-
 .../sql/parquet/ParquetTableOperations.scala  |  6 +++---
 .../org/apache/spark/sql/hive/HiveContext.scala   |  2 +-
 4 files changed, 22 insertions(+), 14 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/51985f78/docs/sql-programming-guide.md
--
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index d4ade93..e399fec 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -582,19 +582,27 @@ Configuration of Parquet can be done using the `setConf` method on SQLContext or
 </tr>
 <tr>
   <td><code>spark.sql.parquet.cacheMetadata</code></td>
-  <td>false</td>
+  <td>true</td>
   <td>
     Turns on caching of Parquet schema metadata.  Can speed up querying of static data.
   </td>
 </tr>
 <tr>
   <td><code>spark.sql.parquet.compression.codec</code></td>
-  <td>snappy</td>
+  <td>gzip</td>
   <td>
     Sets the compression codec use when writing Parquet files. Acceptable values include: 
     uncompressed, snappy, gzip, lzo.
   </td>
 </tr>
+<tr>
+  <td><code>spark.sql.hive.convertMetastoreParquet</code></td>
+  <td>true</td>
+  <td>
+    When set to false, Spark SQL will use the Hive SerDe for parquet tables instead of the built in
+    support.
+  </td>
+</tr>
 </table>
 
 ## JSON Datasets
@@ -815,7 +823,7 @@ Configuration of in-memory caching can be done using the `setConf` method on SQL
 <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
 <tr>
   <td><code>spark.sql.inMemoryColumnarStorage.compressed</code></td>
-  <td>false</td>
+  <td>true</td>
   <td>
     When set to true Spark SQL will automatically select a compression codec for each column based
     on statistics of the data.
@@ -823,7 +831,7 @@ Configuration of in-memory caching can be done using the `setConf` method on SQL
 </tr>
 <tr>
   <td><code>spark.sql.inMemoryColumnarStorage.batchSize</code></td>
-  <td>1000</td>
+  <td>10000</td>
   <td>
     Controls the size of batches for columnar caching.  Larger batch sizes can improve memory utilization
     and compression, but risk OOMs when caching data.
@@ -841,7 +849,7 @@ that these options will be deprecated in future release as more optimizations ar
   <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
   <tr>
     <td><code>spark.sql.autoBroadcastJoinThreshold</code></td>
-    <td>10000</td>
+    <td>10485760 (10 MB)</td>
     <td>
      Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when
      performing a join.  By setting this value to -1 broadcasting can be disabled.  Note that currently
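
For context, a short sketch of how the retuned in-memory caching defaults surface at the API level; the table name logs is hypothetical:

    // Caching now compresses each column (based on its statistics) and
    // builds batches of 10,000 rows; bigger batches compress better but
    // need more memory while being assembled.
    sqlContext.cacheTable("logs")
    sqlContext.sql("SELECT COUNT(*) FROM logs").collect()  // materializes the cache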

http://git-wip-us.apache.org/repos/asf/spark/blob/51985f78/sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala b/sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala
index 07e6e2e..279495a 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala
@@ -79,13 +79,13 @@ private[sql] trait SQLConf {
   private[spark] def dialect: String = getConf(DIALECT, "sql")
 
   /** When true tables cached using the in-memory columnar caching will be compressed. */
-  private[spark] def useCompression: Boolean =

git commit: [SQL] More aggressive defaults

2014-11-03 Thread marmbrus
Repository: spark
Updated Branches:
  refs/heads/master e83f13e8d -> 25bef7e69


[SQL] More aggressive defaults

 - Turns on compression for in-memory cached data by default
 - Changes the default parquet compression format back to gzip (we have seen more OOMs with production workloads due to the way Snappy allocates memory)
 - Ups the batch size to 10,000 rows
 - Increases the broadcast threshold to 10mb.
 - Uses our parquet implementation instead of the hive one by default.
 - Cache parquet metadata by default.
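
As a sketch of what the new broadcast threshold means in practice (the table names facts and dims are hypothetical): any table whose estimated size is under 10 MB is now shipped to every worker instead of shuffled, and the old opt-out still applies:

    // Small dimension tables under the 10 MB threshold broadcast automatically.
    val joined = sqlContext.sql("SELECT * FROM facts JOIN dims ON facts.id = dims.id")
    // Set the threshold to -1 to disable broadcast joins entirely.
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "-1")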

Author: Michael Armbrust <mich...@databricks.com>

Closes #3064 from marmbrus/fasterDefaults and squashes the following commits:

97ee9f8 [Michael Armbrust] parquet codec docs
e641694 [Michael Armbrust] Remote also
a12866a [Michael Armbrust] Cache metadata.
2d73acc [Michael Armbrust] Update docs defaults.
d63d2d5 [Michael Armbrust] document parquet option
da373f9 [Michael Armbrust] More aggressive defaults


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/25bef7e6
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/25bef7e6
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/25bef7e6

Branch: refs/heads/master
Commit: 25bef7e6951301e93004567fc0cef96bf8d1a224
Parents: e83f13e
Author: Michael Armbrust <mich...@databricks.com>
Authored: Mon Nov 3 14:08:27 2014 -0800
Committer: Michael Armbrust <mich...@databricks.com>
Committed: Mon Nov 3 14:08:27 2014 -0800

--
 docs/sql-programming-guide.md | 18 +-
 .../main/scala/org/apache/spark/sql/SQLConf.scala | 10 +-
 .../sql/parquet/ParquetTableOperations.scala  |  6 +++---
 .../org/apache/spark/sql/hive/HiveContext.scala   |  2 +-
 4 files changed, 22 insertions(+), 14 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/25bef7e6/docs/sql-programming-guide.md
--
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index d4ade93..e399fec 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -582,19 +582,27 @@ Configuration of Parquet can be done using the `setConf` method on SQLContext or
 </tr>
 <tr>
   <td><code>spark.sql.parquet.cacheMetadata</code></td>
-  <td>false</td>
+  <td>true</td>
   <td>
     Turns on caching of Parquet schema metadata.  Can speed up querying of static data.
   </td>
 </tr>
 <tr>
   <td><code>spark.sql.parquet.compression.codec</code></td>
-  <td>snappy</td>
+  <td>gzip</td>
   <td>
     Sets the compression codec use when writing Parquet files. Acceptable values include: 
     uncompressed, snappy, gzip, lzo.
   </td>
 </tr>
+<tr>
+  <td><code>spark.sql.hive.convertMetastoreParquet</code></td>
+  <td>true</td>
+  <td>
+    When set to false, Spark SQL will use the Hive SerDe for parquet tables instead of the built in
+    support.
+  </td>
+</tr>
 </table>
 
 ## JSON Datasets
@@ -815,7 +823,7 @@ Configuration of in-memory caching can be done using the `setConf` method on SQL
 <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
 <tr>
   <td><code>spark.sql.inMemoryColumnarStorage.compressed</code></td>
-  <td>false</td>
+  <td>true</td>
   <td>
     When set to true Spark SQL will automatically select a compression codec for each column based
     on statistics of the data.
@@ -823,7 +831,7 @@ Configuration of in-memory caching can be done using the `setConf` method on SQL
 </tr>
 <tr>
   <td><code>spark.sql.inMemoryColumnarStorage.batchSize</code></td>
-  <td>1000</td>
+  <td>10000</td>
   <td>
     Controls the size of batches for columnar caching.  Larger batch sizes can improve memory utilization
     and compression, but risk OOMs when caching data.
@@ -841,7 +849,7 @@ that these options will be deprecated in future release as more optimizations ar
   <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
   <tr>
     <td><code>spark.sql.autoBroadcastJoinThreshold</code></td>
-    <td>10000</td>
+    <td>10485760 (10 MB)</td>
     <td>
      Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when
      performing a join.  By setting this value to -1 broadcasting can be disabled.  Note that currently
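
A hedged illustration of the convertMetastoreParquet switch documented above, assuming a HiveContext named hiveContext and a metastore Parquet table events:

    // Default (true): metastore Parquet tables are read through Spark SQL's
    // built-in Parquet support rather than the Hive SerDe.
    hiveContext.sql("SELECT * FROM events LIMIT 10").collect()
    // Flip it off to fall back to the Hive SerDe read path.
    hiveContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")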

http://git-wip-us.apache.org/repos/asf/spark/blob/25bef7e6/sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala b/sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala
index 07e6e2e..279495a 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala
@@ -79,13 +79,13 @@ private[sql] trait SQLConf {
   private[spark] def dialect: String = getConf(DIALECT, "sql")
 
   /** When true tables cached using the in-memory columnar caching will be compressed. */
-  private[spark] def useCompression: Boolean =