Re: Parquet compression codecs not applied
Hi Ayoub,

The doc page isn't wrong, but it is indeed confusing. "spark.sql.parquet.compression.codec" is used when you write a Parquet file directly with something like data.saveAsParquetFile(...). However, you are using Hive DDL in your example code. All Hive DDL statements and commands like SET are delegated directly to Hive, which unfortunately ignores Spark configurations. That said, the documentation should be updated to make this distinction clear.

Best,
Cheng
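To make the distinction concrete, here is a minimal spark-shell sketch of the two write paths (assuming the Spark 1.2-era SchemaRDD API; the table and path names are illustrative, not from the original thread):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)

    // Path 1: Spark SQL's native Parquet writer reads the Spark setting,
    // so this write comes out gzip-compressed.
    hiveContext.setConf("spark.sql.parquet.compression.codec", "gzip")
    val data = hiveContext.sql("select * from raw_foo")
    data.saveAsParquetFile("hdfs://path/data/foo_native")

    // Path 2: an INSERT into a Hive table is executed by Hive, which only
    // honors its own property, so it must be set through SET instead.
    hiveContext.sql("SET parquet.compression=GZIP")
    hiveContext.sql("insert into table foo select * from raw_foo")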
Re: Parquet compression codecs not applied
I was using a Hive context and not a SQL context, therefore ("SET spark.sql.parquet.compression.codec=gzip") was "ignored". Michael Armbrust pointed out that "parquet.compression" should be used instead, which solved the issue.

I am still wondering if this behavior is "normal"; it would be better if "spark.sql.parquet.compression.codec" were "translated" to "parquet.compression" in the case of a Hive context. Otherwise the documentation should be updated to be more precise.
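In the meantime, the translation Ayoub suggests can be approximated with a small helper. This is a hypothetical workaround, not part of Spark, and it assumes the 1.2-era SQLContext.getConf(key, default) method:

    // Hypothetical helper: copy the Spark SQL codec setting into Hive's
    // Parquet property before running Hive DML through the HiveContext.
    def forwardParquetCodec(hc: org.apache.spark.sql.hive.HiveContext): Unit = {
      val codec = hc.getConf("spark.sql.parquet.compression.codec", "gzip")
      hc.sql(s"SET parquet.compression=${codec.toUpperCase}")
    }

    forwardParquetCodec(hiveContext)
    hiveContext.sql("insert into table foo select * from raw_foo")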
Re: Parquet compression codecs not applied
Hi Ayoub,

You could try using SQL syntax to set the compression type:

    sc = SparkContext()
    sqc = SQLContext(sc)
    sqc.sql("SET spark.sql.parquet.compression.codec=gzip")

You get a notification on screen while running the Spark job when you set the compression codec like this. I haven't compared it with different compression methods; please let the mailing list know if this works for you.

Best
Sahan
Re: Parquet compression codecs not applied
It worked, thanks.

This doc page <https://spark.apache.org/docs/1.2.0/sql-programming-guide.html> recommends using "spark.sql.parquet.compression.codec" to set the compression codec, and I thought this setting would be forwarded to the Hive context given that HiveContext extends SQLContext, but it was not. I am wondering if this behavior is normal; if not, I could open an issue with a potential fix so that "spark.sql.parquet.compression.codec" is translated to "parquet.compression" in the Hive context. Otherwise the documentation should be updated to mention that the compression codec is set differently with HiveContext.

Ayoub.
Re: Parquet compression codecs not applied
This is a little confusing, but that code path is actually going through Hive, so the Spark SQL configuration does not help. Perhaps try:

    set parquet.compression=GZIP;
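Applied to the snippet from the original post (quoted in full below), the fix amounts to replacing the setConf call with Hive's own property before the INSERT; a sketch:

    // Set Hive's Parquet property instead of the Spark SQL one, since the
    // INSERT below is executed by Hive rather than Spark's native writer.
    hiveContext.sql("SET parquet.compression=GZIP")
    hiveContext.sql("insert into table foo partition(year, month, day) select *, year(from_unixtime(ts)) as year, month(from_unixtime(ts)) as month, day(from_unixtime(ts)) as day from raw_foo")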
Parquet compression codecs not applied
Hello,

I tried to save a table created via the Hive context as a Parquet file, but whatever compression codec (uncompressed, snappy, gzip or lzo) I set via setConf, like:

    setConf("spark.sql.parquet.compression.codec", "gzip")

the size of the generated files is always the same, so it seems like the Spark context ignores the compression codec that I set.

Here is a code sample applied via the spark shell:

    import org.apache.spark.sql.hive.HiveContext
    val hiveContext = new HiveContext(sc)

    hiveContext.sql("SET hive.exec.dynamic.partition = true")
    hiveContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
    // required to make data compatible with Impala
    hiveContext.setConf("spark.sql.parquet.binaryAsString", "true")
    hiveContext.setConf("spark.sql.parquet.compression.codec", "gzip")

    hiveContext.sql("create external table if not exists foo (bar STRING, ts INT) partitioned by (year INT, month INT, day INT) STORED AS PARQUET LOCATION 'hdfs://path/data/foo'")

    hiveContext.sql("insert into table foo partition(year, month, day) select *, year(from_unixtime(ts)) as year, month(from_unixtime(ts)) as month, day(from_unixtime(ts)) as day from raw_foo")

I tried that with Spark 1.2 and a 1.3 snapshot against Hive 0.13, and I also tried it with Impala on the same cluster, which applied the compression codecs correctly.

Does anyone know what could be the problem?

Thanks,
Ayoub.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Parquet-compression-codecs-not-applied-tp21058.html
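One way to check whether a codec change actually took effect is to compare on-disk sizes from the shell. A sketch, assuming files sit directly under a single partition directory (nested partitions would need a recursive listing; the path is illustrative):

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Sum the file sizes under one partition directory before and after
    // changing the codec; identical totals suggest the setting was ignored.
    val fs = FileSystem.get(sc.hadoopConfiguration)
    val totalBytes = fs.listStatus(new Path("hdfs://path/data/foo/year=2015/month=1/day=9"))
      .map(_.getLen).sum
    println(s"total bytes: $totalBytes")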