Using a compression codec in saveAsSequenceFile in Pyspark (Python API)

2014-11-13 Thread sahanbull
Hi, 

I am trying to save an RDD to an S3 bucket using the
RDD.saveAsSequenceFile(path, compressionCodecClass) function in Python. I
need to save the RDD with gzip compression. Can anyone tell me how to pass
the gzip codec class as a parameter to the function?

I tried
RDD.saveAsSequenceFile('{0}{1}'.format(outputFolder, datePath), compressionCodecClass=gzip.GzipFile)

but it fails with: AttributeError: type object 'GzipFile' has no
attribute '_get_object_id'
which I think is because the JVM can't find a Scala mapping for gzip.

Any guidance on a method to write the RDD to disk as gzip (.gz) files
would be very much appreciated.

Many thanks
SahanB




Re: Using a compression codec in saveAsSequenceFile in Pyspark (Python API)

2014-11-13 Thread Davies Liu
You could use the following as compressionCodecClass:

DEFLATE   org.apache.hadoop.io.compress.DefaultCodec
gzip      org.apache.hadoop.io.compress.GzipCodec
bzip2     org.apache.hadoop.io.compress.BZip2Codec
LZO       com.hadoop.compression.lzo.LzopCodec

For gzip, compressionCodecClass should be
org.apache.hadoop.io.compress.GzipCodec.
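
For example, a minimal sketch of the call (the sample data and the
s3n:// output path below are placeholders, not taken from the original
thread):

from pyspark import SparkContext

sc = SparkContext(appName="GzipSequenceFileExample")

# saveAsSequenceFile expects an RDD of key/value pairs.
rdd = sc.parallelize([(1, "a"), (2, "b"), (3, "c")])

# Pass the fully qualified Hadoop codec class name as a string,
# not a Python class such as gzip.GzipFile.
rdd.saveAsSequenceFile(
    "s3n://my-bucket/output",  # placeholder output path
    compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")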


