Re: Parquet compression codecs not applied

2015-02-05 Thread Cheng Lian

Hi Ayoub,

The doc page isn’t wrong, but it is indeed confusing. 
|spark.sql.parquet.compression.codec| is used when you’re writing a Parquet 
file with something like |data.saveAsParquetFile(...)|. However, you are 
using Hive DDL in the example code. All Hive DDL statements and commands like 
|SET| are delegated directly to Hive, which unfortunately ignores Spark 
configurations. That said, the doc should still be updated to make this clear.
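
For illustration, here is a minimal spark-shell sketch of the code path 
where |spark.sql.parquet.compression.codec| does take effect (the input 
and output paths below are hypothetical):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Applies to Spark SQL's own Parquet writer (e.g. saveAsParquetFile in
// the 1.2 API), not to Hive INSERT ... STORED AS PARQUET.
sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")

val data = sqlContext.jsonFile("hdfs://path/data/raw_foo.json") // hypothetical input
data.saveAsParquetFile("hdfs://path/data/foo_parquet")          // written with gzip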


Best,
Cheng

On 1/10/15 5:49 AM, Ayoub Benali wrote:


it worked thanks.

this doc page 
<https://spark.apache.org/docs/1.2.0/sql-programming-guide.html> recommends 
using "spark.sql.parquet.compression.codec" to set the compression 
codec, and I thought this setting would be forwarded to the hive 
context given that HiveContext extends SQLContext, but it was not.


I am wondering if this behavior is normal; if not, I could open an 
issue with a potential fix so that 
"spark.sql.parquet.compression.codec" would be translated to 
"parquet.compression" in the hive context.


Or the documentation should be updated to mention that the compression 
codec is set differently with HiveContext.


Ayoub.



2015-01-09 17:51 GMT+01:00 Michael Armbrust <mailto:mich...@databricks.com>:


This is a little confusing, but that code path is actually going
through hive.  So the spark sql configuration does not help.

Perhaps, try:
set parquet.compression=GZIP;

On Fri, Jan 9, 2015 at 2:41 AM, Ayoub <mailto:benali.ayoub.i...@gmail.com> wrote:

Hello,

I tried to save a table created via the hive context as a parquet
file, but whatever compression codec (uncompressed, snappy, gzip or
lzo) I set via setConf like:

setConf("spark.sql.parquet.compression.codec", "gzip")

the size of the generated files is always the same, so it seems like
the spark context ignores the compression codec that I set.

Here is a code sample applied via the spark shell:

import org.apache.spark.sql.hive.HiveContext
val hiveContext = new HiveContext(sc)

hiveContext.sql("SET hive.exec.dynamic.partition = true")
hiveContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
hiveContext.setConf("spark.sql.parquet.binaryAsString", "true") // required to make data compatible with impala
hiveContext.setConf("spark.sql.parquet.compression.codec", "gzip")

hiveContext.sql("create external table if not exists foo (bar STRING, ts INT) Partitioned by (year INT, month INT, day INT) STORED AS PARQUET Location 'hdfs://path/data/foo'")

hiveContext.sql("insert into table foo partition(year, month, day) select *, year(from_unixtime(ts)) as year, month(from_unixtime(ts)) as month, day(from_unixtime(ts)) as day from raw_foo")

I tried that with Spark 1.2 and a 1.3 snapshot against Hive 0.13,
and I also tried it with Impala on the same cluster, which applied
the compression codecs correctly.

Does anyone know what the problem could be?

    Thanks,
    Ayoub.




--
View this message in context:

http://apache-spark-user-list.1001560.n3.nabble.com/Parquet-compression-codecs-not-applied-tp21058.html
Sent from the Apache Spark User List mailing list archive at
Nabble.com.







Re: Parquet compression codecs not applied

2015-02-04 Thread Ayoub
I was using a hive context and not a sql context, therefore ("SET
spark.sql.parquet.compression.codec=gzip") was "ignored".

Michael Armbrust pointed out that "parquet.compression" should be used
instead, which solved the issue.

I am still wondering if this behavior is "normal"; it would be better if
"spark.sql.parquet.compression.codec" were "translated" to
"parquet.compression" in the case of a hive context.
Otherwise, the documentation should be updated to be more precise.
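
As a rough sketch of the kind of "translation" I mean (a hypothetical
helper, not existing Spark behaviour), one could copy the Spark setting
over to the Hive property before running the insert:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// Hypothetical workaround: forward the Spark SQL codec setting to the
// Hive-side Parquet property so both write paths use the same codec.
def syncParquetCodec(hc: HiveContext): Unit = {
  val codec = hc.getConf("spark.sql.parquet.compression.codec", "gzip")
  hc.sql("SET parquet.compression=" + codec.toUpperCase)
}

hiveContext.setConf("spark.sql.parquet.compression.codec", "gzip")
syncParquetCodec(hiveContext) // Hive INSERT ... STORED AS PARQUET then uses gzip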



2015-02-04 19:13 GMT+01:00 sahanbull :

> Hi Ayoub,
>
> You could try using the sql format to set the compression type:
>
> sc = SparkContext()
> sqc = SQLContext(sc)
> sqc.sql("SET spark.sql.parquet.compression.codec=gzip")
>
> You get a notification on screen while running the spark job when you set
> the compression codec like this. I haven't compared it with different
> compression methods. Please let the mailing list know if this works for
> you.
>
> Best
> Sahan
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Parquet-compression-codecs-not-applied-tp21058p21498.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Re-Parquet-compression-codecs-not-applied-tp21499.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Parquet compression codecs not applied

2015-02-04 Thread sahanbull
Hi Ayoub,

You could try using the sql format to set the compression type:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqc = SQLContext(sc)
sqc.sql("SET spark.sql.parquet.compression.codec=gzip")

You get a notification on screen while running the spark job when you set
the compression codec like this. I haven't compared it with different
compression methods. Please let the mailing list know if this works for
you.
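
For the Scala shell, a rough equivalent would be the sketch below; note
that, per the rest of this thread, it affects Spark SQL's own Parquet
writes rather than the Hive insert path:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// SET issued through sql() updates the same Spark SQL conf as setConf.
sqlContext.sql("SET spark.sql.parquet.compression.codec=gzip")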

Best
Sahan



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Parquet-compression-codecs-not-applied-tp21058p21498.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Parquet compression codecs not applied

2015-01-10 Thread Ayoub Benali
it worked thanks.

this doc page
<https://spark.apache.org/docs/1.2.0/sql-programming-guide.html> recommends
using "spark.sql.parquet.compression.codec" to set the compression codec,
and I thought this setting would be forwarded to the hive context given
that HiveContext extends SQLContext, but it was not.

I am wondering if this behavior is normal; if not, I could open an issue
with a potential fix so that "spark.sql.parquet.compression.codec" would be
translated to "parquet.compression" in the hive context.

Or the documentation should be updated to mention that the compression
codec is set differently with HiveContext.

Ayoub.



2015-01-09 17:51 GMT+01:00 Michael Armbrust :

> This is a little confusing, but that code path is actually going through
> hive.  So the spark sql configuration does not help.
>
> Perhaps, try:
> set parquet.compression=GZIP;
>
> On Fri, Jan 9, 2015 at 2:41 AM, Ayoub  wrote:
>
>> Hello,
>>
>> I tried to save a table created via the hive context as a parquet file but
>> whatever compression codec (uncompressed, snappy, gzip or lzo) I set via
>> setConf like:
>>
>> setConf("spark.sql.parquet.compression.codec", "gzip")
>>
>> the size of the generated files is always the same, so it seems like
>> spark context ignores the compression codec that I set.
>>
>> Here is a code sample applied via the spark shell:
>>
>> import org.apache.spark.sql.hive.HiveContext
>> val hiveContext = new HiveContext(sc)
>>
>> hiveContext.sql("SET hive.exec.dynamic.partition = true")
>> hiveContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
>> hiveContext.setConf("spark.sql.parquet.binaryAsString", "true") //
>> required
>> to make data compatible with impala
>> hiveContext.setConf("spark.sql.parquet.compression.codec", "gzip")
>>
>> hiveContext.sql("create external table if not exists foo (bar STRING, ts
>> INT) Partitioned by (year INT, month INT, day INT) STORED AS PARQUET
>> Location 'hdfs://path/data/foo'")
>>
>> hiveContext.sql("insert into table foo partition(year, month,day) select
>> *,
>> year(from_unixtime(ts)) as year, month(from_unixtime(ts)) as month,
>> day(from_unixtime(ts)) as day from raw_foo")
>>
>> I tried that with Spark 1.2 and a 1.3 snapshot against Hive 0.13,
>> and I also tried it with Impala on the same cluster, which applied
>> the compression codecs correctly.
>>
>> Does anyone know what the problem could be?
>>
>> Thanks,
>> Ayoub.
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Parquet-compression-codecs-not-applied-tp21058.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>>
>


Re: Parquet compression codecs not applied

2015-01-09 Thread Michael Armbrust
This is a little confusing, but that code path is actually going through
hive.  So the spark sql configuration does not help.

Perhaps, try:
set parquet.compression=GZIP;
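
Applied to the session quoted below, that would look roughly like this
(a sketch, reusing the table and query from the original mail):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// Set the Hive-side property so the INSERT that writes the Parquet
// files picks up gzip compression.
hiveContext.sql("SET parquet.compression=GZIP")

hiveContext.sql("insert into table foo partition(year, month, day) select *, " +
  "year(from_unixtime(ts)) as year, month(from_unixtime(ts)) as month, " +
  "day(from_unixtime(ts)) as day from raw_foo")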

On Fri, Jan 9, 2015 at 2:41 AM, Ayoub  wrote:

> Hello,
>
> I tried to save a table created via the hive context as a parquet file but
> whatever compression codec (uncompressed, snappy, gzip or lzo) I set via
> setConf like:
>
> setConf("spark.sql.parquet.compression.codec", "gzip")
>
> the size of the generated files is always the same, so it seems like
> spark context ignores the compression codec that I set.
>
> Here is a code sample applied via the spark shell:
>
> import org.apache.spark.sql.hive.HiveContext
> val hiveContext = new HiveContext(sc)
>
> hiveContext.sql("SET hive.exec.dynamic.partition = true")
> hiveContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
> hiveContext.setConf("spark.sql.parquet.binaryAsString", "true") // required
> to make data compatible with impala
> hiveContext.setConf("spark.sql.parquet.compression.codec", "gzip")
>
> hiveContext.sql("create external table if not exists foo (bar STRING, ts
> INT) Partitioned by (year INT, month INT, day INT) STORED AS PARQUET
> Location 'hdfs://path/data/foo'")
>
> hiveContext.sql("insert into table foo partition(year, month,day) select *,
> year(from_unixtime(ts)) as year, month(from_unixtime(ts)) as month,
> day(from_unixtime(ts)) as day from raw_foo")
>
> I tried that with Spark 1.2 and a 1.3 snapshot against Hive 0.13,
> and I also tried it with Impala on the same cluster, which applied
> the compression codecs correctly.
>
> Does anyone know what the problem could be?
>
> Thanks,
> Ayoub.
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Parquet-compression-codecs-not-applied-tp21058.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>


Parquet compression codecs not applied

2015-01-09 Thread Ayoub
Hello, 

I tried to save a table created via the hive context as a parquet file but
whatever compression codec (uncompressed, snappy, gzip or lzo) I set via
setConf like: 

setConf("spark.sql.parquet.compression.codec", "gzip") 

the size of the generated files is always the same, so it seems like
spark context ignores the compression codec that I set. 

Here is a code sample applied via the spark shell: 

import org.apache.spark.sql.hive.HiveContext 
val hiveContext = new HiveContext(sc) 

hiveContext.sql("SET hive.exec.dynamic.partition = true") 
hiveContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict") 
hiveContext.setConf("spark.sql.parquet.binaryAsString", "true") // required
to make data compatible with impala 
hiveContext.setConf("spark.sql.parquet.compression.codec", "gzip") 

hiveContext.sql("create external table if not exists foo (bar STRING, ts
INT) Partitioned by (year INT, month INT, day INT) STORED AS PARQUET
Location 'hdfs://path/data/foo'") 

hiveContext.sql("insert into table foo partition(year, month,day) select *,
year(from_unixtime(ts)) as year, month(from_unixtime(ts)) as month, 
day(from_unixtime(ts)) as day from raw_foo") 

I tried that with Spark 1.2 and a 1.3 snapshot against Hive 0.13,
and I also tried it with Impala on the same cluster, which applied
the compression codecs correctly.

Does anyone know what the problem could be?

Thanks, 
Ayoub.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Parquet-compression-codecs-not-applied-tp21058.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.



