Re: Using a compression codec in saveAsSequenceFile in Pyspark (Python API)

Davies Liu Thu, 13 Nov 2014 23:35:11 -0800

You can pass any of the following class names, as a string, as
compressionCodecClass:

DEFLATE    org.apache.hadoop.io.compress.DefaultCodec
gzip       org.apache.hadoop.io.compress.GzipCodec
bzip2      org.apache.hadoop.io.compress.BZip2Codec
LZO        com.hadoop.compression.lzo.LzopCodec


For gzip, compressionCodecClass should be the string
'org.apache.hadoop.io.compress.GzipCodec' (the Hadoop class name, not a
Python object such as gzip.GzipFile).
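
A minimal sketch of how that could look in practice. The names
CODEC_CLASSES and save_compressed are hypothetical helpers made up for
this example, not part of PySpark; the only Spark call used is
rdd.saveAsSequenceFile(path, compressionCodecClass=...). The note about
LZO needing an extra library assumes the usual hadoop-lzo setup.

```python
# Short codec names mapped to the Hadoop class names listed above.
# CODEC_CLASSES is a hypothetical lookup table for this sketch.
CODEC_CLASSES = {
    "deflate": "org.apache.hadoop.io.compress.DefaultCodec",
    "gzip": "org.apache.hadoop.io.compress.GzipCodec",
    "bzip2": "org.apache.hadoop.io.compress.BZip2Codec",
    # Assumption: LZO needs the separate hadoop-lzo library on the classpath.
    "lzo": "com.hadoop.compression.lzo.LzopCodec",
}


def save_compressed(rdd, path, codec="gzip"):
    """Save a key/value RDD as a SequenceFile using the named codec.

    Hypothetical helper: it only looks up the class name and forwards it
    to rdd.saveAsSequenceFile. Returns the class name that was used.
    """
    try:
        codec_class = CODEC_CLASSES[codec.lower()]
    except KeyError:
        raise ValueError("unknown codec: %r" % codec)
    # The codec is passed as a plain string; the JVM side resolves the
    # class by name, so no Python object is handed across.
    rdd.saveAsSequenceFile(path, compressionCodecClass=codec_class)
    return codec_class
```

With the variables from the original question, the call would be
something like
save_compressed(rdd, '{0}{1}'.format(outputFolder, datePath), 'gzip').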



On Thu, Nov 13, 2014 at 8:28 PM, sahanbull <sa...@skimlinks.com> wrote:
> Hi,
>
> I am trying to save an RDD to an S3 bucket using the
> RDD.saveAsSequenceFile(self, path, CompressionCodec) function in Python. I
> need to save the RDD as gzip. Can anyone tell me how to pass the gzip codec
> class as a parameter to the function?
>
> I tried
> RDD.saveAsSequenceFile('{0}{1}'.format(outputFolder,datePath),compressionCodecClass=gzip.GzipFile)
>
> but it fails with: AttributeError: type object 'GzipFile' has no
> attribute '_get_object_id'
> which I think is because the JVM can't find a Scala mapping for gzip.
>
> If you can point me to any method for writing the RDD as gzip (.gz) to
> disk, that would be very much appreciated.
>
> Many thanks
> SahanB
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Using-a-compression-codec-in-saveAsSequenceFile-in-Pyspark-Python-API-tp18899.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

