Re: Spark output compression on HDFS

Azuryy Fri, 04 Apr 2014 15:48:14 -0700

There is no compression type setting for Snappy.
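
For reference, RECORD versus BLOCK is a setting of the Hadoop SequenceFile writer rather than of the Snappy codec, so it can be set on the job configuration instead. A minimal sketch, assuming Hadoop's old `mapred` API and Spark's `saveAsHadoopFile` overload that accepts a `JobConf`. The object name, the sample data, and the output path argument are hypothetical, and this has not been checked against the Spark version discussed in this thread:

```scala
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.io.SequenceFile.CompressionType
import org.apache.hadoop.io.compress.SnappyCodec
import org.apache.hadoop.mapred.{FileOutputFormat, JobConf, SequenceFileOutputFormat}
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // pair-RDD implicits on older Spark releases

// Hypothetical driver; assumes Spark, Hadoop, and native Snappy are available.
object CompressionTypeSketch {
  def main(args: Array[String]): Unit = {
    val output = args(0) // hypothetical output path
    val sc = new SparkContext("local", "compression-type-sketch")

    // Stand-in for the `counts` RDD of (IntWritable, Text) from the thread.
    val counts = sc.parallelize(Seq(1 -> "a", 2 -> "b"))
      .map { case (k, v) => (new IntWritable(k), new Text(v)) }

    // Put codec and compression type on the job configuration.
    val conf = new JobConf(sc.hadoopConfiguration)
    FileOutputFormat.setCompressOutput(conf, true)
    FileOutputFormat.setOutputCompressorClass(conf, classOf[SnappyCodec])
    SequenceFileOutputFormat.setOutputCompressionType(conf, CompressionType.RECORD)

    // Write through the generic Hadoop path so the JobConf above is used.
    counts.saveAsHadoopFile(output, classOf[IntWritable], classOf[Text],
      classOf[SequenceFileOutputFormat[IntWritable, Text]], conf)

    sc.stop()
  }
}
```

Swapping `CompressionType.RECORD` for `CompressionType.BLOCK` selects block-level compression in this sketch.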


Sent from my iPhone5s

> On April 4, 2014, at 23:06, Konstantin Kudryavtsev 
> <kudryavtsev.konstan...@gmail.com> wrote:
> 
> Can anybody suggest how to change the compression type (Record, Block) for 
> Snappy, if that is possible?
> 
> Thank you in advance,
> Konstantin Kudryavtsev
> 
> 
>> On Thu, Apr 3, 2014 at 10:28 PM, Konstantin Kudryavtsev 
>> <kudryavtsev.konstan...@gmail.com> wrote:
>> Thanks all, it works fine now and I managed to compress the output. However, 
>> I am still stuck: how can I set the compression type for Snappy? I mean 
>> choosing record-level or block-level compression for the output.
>> 
>>> On Apr 3, 2014 1:15 AM, "Nicholas Chammas" <nicholas.cham...@gmail.com> 
>>> wrote:
>>> Thanks for pointing that out.
>>> 
>>> 
>>>> On Wed, Apr 2, 2014 at 6:11 PM, Mark Hamstra <m...@clearstorydata.com> 
>>>> wrote:
>>>> First, you shouldn't be using spark.incubator.apache.org anymore, just 
>>>> spark.apache.org.  Second, saveAsSequenceFile doesn't appear to exist in 
>>>> the Python API at this point. 
>>>> 
>>>> 
>>>>> On Wed, Apr 2, 2014 at 3:00 PM, Nicholas Chammas 
>>>>> <nicholas.cham...@gmail.com> wrote:
>>>>> Is this a Scala-only feature?
>>>>> 
>>>>> 
>>>>>> On Wed, Apr 2, 2014 at 5:55 PM, Patrick Wendell <pwend...@gmail.com> 
>>>>>> wrote:
>>>>>> For saveAsTextFile I believe we overload it and let you set a codec directly:
>>>>>> 
>>>>>> https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/FileSuite.scala#L59
>>>>>> 
>>>>>> For saveAsSequenceFile, yes, I think Mark is right: you need an Option.
>>>>>> 
>>>>>> 
>>>>>>> On Wed, Apr 2, 2014 at 12:36 PM, Mark Hamstra <m...@clearstorydata.com> 
>>>>>>> wrote:
>>>>>>> http://www.scala-lang.org/api/2.10.3/index.html#scala.Option
>>>>>>> 
>>>>>>> The signature is 'def saveAsSequenceFile(path: String, codec: 
>>>>>>> Option[Class[_ <: CompressionCodec]] = None)', but you are providing a 
>>>>>>> Class, not an Option[Class].  
>>>>>>> 
>>>>>>> Try counts.saveAsSequenceFile(output, 
>>>>>>> Some(classOf[org.apache.hadoop.io.compress.SnappyCodec]))
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> On Wed, Apr 2, 2014 at 12:18 PM, Kostiantyn Kudriavtsev 
>>>>>>>> <kudryavtsev.konstan...@gmail.com> wrote:
>>>>>>>> Hi there,
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> I've started using Spark recently and am evaluating possible use cases 
>>>>>>>> at our company.
>>>>>>>> 
>>>>>>>> I'm trying to save an RDD as a compressed SequenceFile. I'm able to save 
>>>>>>>> an uncompressed file by calling:
>>>>>>>> 
>>>>>>>> counts.saveAsSequenceFile(output)
>>>>>>>> 
>>>>>>>> where counts is my RDD of (IntWritable, Text) pairs. However, I didn't 
>>>>>>>> manage to compress the output. I tried several configurations and always 
>>>>>>>> got a type-mismatch error:
>>>>>>>> 
>>>>>>>> counts.saveAsSequenceFile(output, classOf[org.apache.hadoop.io.compress.SnappyCodec])
>>>>>>>> <console>:21: error: type mismatch;
>>>>>>>>  found   : Class[org.apache.hadoop.io.compress.SnappyCodec](classOf[org.apache.hadoop.io.compress.SnappyCodec])
>>>>>>>>  required: Option[Class[_ <: org.apache.hadoop.io.compress.CompressionCodec]]
>>>>>>>>               counts.saveAsSequenceFile(output, classOf[org.apache.hadoop.io.compress.SnappyCodec])
>>>>>>>> 
>>>>>>>> counts.saveAsSequenceFile(output, classOf[org.apache.spark.io.SnappyCompressionCodec])
>>>>>>>> <console>:21: error: type mismatch;
>>>>>>>>  found   : Class[org.apache.spark.io.SnappyCompressionCodec](classOf[org.apache.spark.io.SnappyCompressionCodec])
>>>>>>>>  required: Option[Class[_ <: org.apache.hadoop.io.compress.CompressionCodec]]
>>>>>>>>               counts.saveAsSequenceFile(output, classOf[org.apache.spark.io.SnappyCompressionCodec])
>>>>>>>> 
>>>>>>>> and it doesn't work even for Gzip:
>>>>>>>> 
>>>>>>>> counts.saveAsSequenceFile(output, classOf[org.apache.hadoop.io.compress.GzipCodec])
>>>>>>>> <console>:21: error: type mismatch;
>>>>>>>>  found   : Class[org.apache.hadoop.io.compress.GzipCodec](classOf[org.apache.hadoop.io.compress.GzipCodec])
>>>>>>>>  required: Option[Class[_ <: org.apache.hadoop.io.compress.CompressionCodec]]
>>>>>>>>               counts.saveAsSequenceFile(output, classOf[org.apache.hadoop.io.compress.GzipCodec])
>>>>>>>> 
>>>>>>>> Could you please suggest a solution? Also, I couldn't find out how to 
>>>>>>>> specify compression parameters (e.g. the compression type for 
>>>>>>>> Snappy). Could you share code snippets for writing and reading an RDD 
>>>>>>>> with compression?
>>>>>>>> 
>>>>>>>> Thank you in advance,
>>>>>>>> 
>>>>>>>> Konstantin Kudryavtsev
>
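
The type mismatch discussed in this thread, and the `Some(...)` fix, can be reproduced without Spark or Hadoop on the classpath. A minimal sketch using stand-in types: `CompressionCodec`, `SnappyCodec`, and this `saveAsSequenceFile` are hypothetical local stubs that only mirror the shape of the quoted signature, not the real Hadoop or Spark classes:

```scala
// Stand-ins for the Hadoop types named in the thread (hypothetical stubs).
trait CompressionCodec
class SnappyCodec extends CompressionCodec

object OptionCodecSketch {
  // Same parameter shape as the signature quoted in the thread; returns a
  // description instead of writing a file.
  def saveAsSequenceFile(
      path: String,
      codec: Option[Class[_ <: CompressionCodec]] = None): String =
    codec match {
      case Some(c) => s"$path compressed with ${c.getSimpleName}"
      case None    => s"$path uncompressed"
    }

  def main(args: Array[String]): Unit = {
    // saveAsSequenceFile("out", classOf[SnappyCodec])
    //   does not compile: a Class is supplied where an Option[Class] is required.

    // Omitting the argument falls back to the default, None.
    println(saveAsSequenceFile("out"))

    // Wrapping the class in Some(...) satisfies the Option parameter.
    println(saveAsSequenceFile("out", Some(classOf[SnappyCodec])))
  }
}
```

The default of `None` is what lets the one-argument call keep working, while callers who want compression opt in explicitly with `Some(...)`.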
