On Fri, Jan 3, 2014 at 7:10 PM, Andrew Ash <[email protected]> wrote:

> saveAsHadoopFile and saveAsNewAPIHadoopFile are on PairRDDFunctions, which
> uses some Scala magic (an implicit conversion) to become available when you
> have an RDD[Key, Value]
>
>
> https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L648
>

I see. So if my data is of type RDD[Value], I cannot use compression? Why
does it have to be RDD[Key, Value] in order to be saved in Hadoop format?
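If pairing really is required, I guess the workaround looks something like
this (myRdd and the bucket path are placeholders of mine): wrap each value
with a dummy key so the implicit conversion to PairRDDFunctions applies,
which makes the codec-taking saveAsHadoopFile overload available:

import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.hadoop.mapred.TextOutputFormat

// Pair each value with a NullWritable key; TextOutputFormat drops
// NullWritable keys, so only the values end up in the gzipped output.
val pairs = myRdd.map(v => (NullWritable.get, new Text(v.toString)))
pairs.saveAsHadoopFile("s3n://my-bucket/out",
  classOf[NullWritable], classOf[Text],
  classOf[TextOutputFormat[NullWritable, Text]],
  classOf[GzipCodec])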

Also, doesn't saveAsObjectFile("hdfs://...") save data to Hadoop? This is
confusing.

I'm only interested in saving data to S3 ("s3n://..."); does it matter
whether I use saveAsHadoopFile or saveAsObjectFile?


>
> Agreed, something like Chill would make this much easier for the default
> cases.
>

It seems Chill is already in use:

https://github.com/apache/incubator-spark/blob/3713f8129a618a633a7aca8c944960c3e7ac9d3b/core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala#L26

But what we need is something like chill-hadoop:

https://github.com/twitter/chill/tree/develop/chill-hadoop
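For reference, this is roughly how Kryo gets wired in on the application
side today (MyRecord and mypackage are placeholders; this is the 0.8-style
system-property configuration):

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

class MyRegistrator extends KryoRegistrator {
  // Register the classes we actually ship around, to avoid writing
  // full class names into the serialized stream.
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[MyRecord])
  }
}

// Set before the SparkContext is created.
System.setProperty("spark.serializer",
  "org.apache.spark.serializer.KryoSerializer")
System.setProperty("spark.kryo.registrator", "mypackage.MyRegistrator")

As far as I can tell, though, this only affects shuffle and cache
serialization; saveAsObjectFile still uses Java serialization regardless,
which is exactly the gap something like chill-hadoop would fill.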


>
>
> On Fri, Jan 3, 2014 at 2:04 PM, Aureliano Buendia <[email protected]> wrote:
>
>> RDD only defines saveAsTextFile and saveAsObjectFile. I think
>> saveAsHadoopFile and saveAsNewAPIHadoopFile belong to the older versions.
>>
>> saveAsObjectFile definitely outputs a Hadoop format (a SequenceFile of
>> serialized objects).
>>
>> I'm not trying to save big objects with saveAsObjectFile; I'm just trying
>> to minimize the Java serialization overhead when saving to a binary file.
>>
>> I can see Spark could benefit from something like
>> https://github.com/twitter/chill in this matter.
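>>
>> Hand-rolling it looks roughly like this (rdd and the output path are
>> placeholders, and creating a bare Kryo per partition is naive):
>>
>> import java.io.ByteArrayOutputStream
>> import com.esotericsoftware.kryo.Kryo
>> import com.esotericsoftware.kryo.io.Output
>> import org.apache.hadoop.io.{BytesWritable, NullWritable}
>>
>> // Serialize each element with Kryo into a byte array, then save as a
>> // SequenceFile of BytesWritable, which Hadoop can compress.
>> val asBytes = rdd.mapPartitions { iter =>
>>   val kryo = new Kryo()
>>   iter.map { elem =>
>>     val buffer = new ByteArrayOutputStream()
>>     val output = new Output(buffer)
>>     kryo.writeClassAndObject(output, elem)
>>     output.close()
>>     (NullWritable.get, new BytesWritable(buffer.toByteArray))
>>   }
>> }
>> asBytes.saveAsSequenceFile("hdfs://host/path/out")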
>>
>>
>> On Fri, Jan 3, 2014 at 6:42 PM, Guillaume Pitel <
>> [email protected]> wrote:
>>
>>>  Hi,
>>>
>>> After a little bit of thinking, I'm not sure anymore whether
>>> saveAsObjectFile uses the spark.hadoop.* properties.
>>>
>>> Also, I made a mistake earlier: the choice of *.mapred.* vs. *.mapreduce.*
>>> does not depend on the Hadoop version you use, but on the API version you
>>> use.
>>>
>>> So I can assure you that if you use saveAsNewAPIHadoopFile with the
>>> spark.hadoop.mapreduce.* properties, the compression will be used.
>>>
>>> If you use saveAsHadoopFile, use the mapred.* properties instead.
>>>
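>>> Concretely, something like this (exact key names vary with the Hadoop
>>> version; these are from memory, so double-check them):
>>>
>>> // Old ("mapred") API, picked up by saveAsHadoopFile:
>>> System.setProperty("spark.hadoop.mapred.output.compress", "true")
>>> System.setProperty("spark.hadoop.mapred.output.compression.codec",
>>>   "org.apache.hadoop.io.compress.GzipCodec")
>>>
>>> // New ("mapreduce") API, picked up by saveAsNewAPIHadoopFile:
>>> System.setProperty("spark.hadoop.mapreduce.output.fileoutputformat.compress", "true")
>>> System.setProperty("spark.hadoop.mapreduce.output.fileoutputformat.compress.codec",
>>>   "org.apache.hadoop.io.compress.GzipCodec")
>>>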
>>> If you use the saveAsObjectFile to a hdfs path, I'm not sure if the
>>> output is compressed.
>>>
>>> Anyway, saveAsObjectFile should be used for small objects, in my opinion.
>>>
>>> Guillaume
>>>
>>> Even
>>>
>>> someMap.saveAsTextFile("out", classOf[GzipCodec])
>>>
>>> has no effect.
>>>
>>> Also, I noticed that saving sequence files has no compression option
>>> (my original question was about compressing binary output).
>>>
>>> Having said this, I still do not understand why Kryo cannot be helpful
>>> when saving binary output. Binary output uses Java serialization, which
>>> has a pretty hefty overhead.
>>>
>>> How can Kryo be applied to T when calling RDD[T]#saveAsObjectFile()?
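>>>
>>> The closest thing I can picture is a helper along these lines (not an
>>> existing API; rdd and path are placeholders, and it reuses whatever
>>> serializer Spark was configured with):
>>>
>>> import org.apache.hadoop.io.{BytesWritable, NullWritable}
>>> import org.apache.spark.SparkEnv
>>> import org.apache.spark.rdd.RDD
>>>
>>> def saveWithSparkSerializer[T](rdd: RDD[T], path: String) {
>>>   rdd.mapPartitions { iter =>
>>>     // SparkEnv holds the serializer selected by spark.serializer,
>>>     // so this picks up Kryo when it is enabled.
>>>     val ser = SparkEnv.get.serializer.newInstance()
>>>     iter.map { x =>
>>>       val buf = ser.serialize(x)
>>>       val arr = new Array[Byte](buf.remaining())
>>>       buf.get(arr)
>>>       (NullWritable.get, new BytesWritable(arr))
>>>     }
>>>   }.saveAsSequenceFile(path)
>>> }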
>>>
>>>
>>> --
>>>  *Guillaume PITEL, Président*
>>> +33(0)6 25 48 86 80 / +33(0)9 70 44 67 53
>>>
>>>  eXenSa S.A.S. <http://www.exensa.com/>
>>>  41, rue Périer - 92120 Montrouge - FRANCE
>>> Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05
>>>
>>
>>
>
