On Fri, Jan 3, 2014 at 7:41 PM, Imran Rashid <[email protected]> wrote:

> I think a lot of the confusion is cleared up with a quick look at the code:
>
>
> https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L901
>
> saveAsObjectFile is just a thin wrapper around saveAsSequenceFile, which
> makes a null key and calls the Java serializer.
>
> If you want to use Kryo, just do the same thing yourself, but use the Kryo
> serializer in place of the Java one.
>

Thanks!
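
For context on Imran's point, the linked RDD.scala shows that saveAsObjectFile
batches records and Java-serializes each batch into bytes, which are then
wrapped in a BytesWritable under a NullWritable key. Below is a minimal,
Spark-free illustration of just that serialization step — an untested sketch,
and the helper names (javaSerialize / javaDeserialize) are made up, not Spark
API:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream,
  ObjectInputStream, ObjectOutputStream}

// Java-serialize one batch of records into a byte array, the way
// saveAsObjectFile does before wrapping the bytes in a BytesWritable.
// (Hypothetical helper names; not part of Spark's API.)
def javaSerialize(obj: AnyRef): Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bos)
  try oos.writeObject(obj) finally oos.close()
  bos.toByteArray
}

// Reverse of the above: read one Java-serialized object back from bytes.
def javaDeserialize(bytes: Array[Byte]): AnyRef = {
  val ois = new ObjectInputStream(new ByteArrayInputStream(bytes))
  try ois.readObject() finally ois.close()
}

// Round-trip a small batch to show the bytes are self-describing.
val batch: Array[String] = Array("a", "b", "c")
val roundTripped =
  javaDeserialize(javaSerialize(batch)).asInstanceOf[Array[String]]
assert(roundTripped.toSeq == Seq("a", "b", "c"))
```

Swapping Kryo in, as Imran suggests, would mean replacing just these two
helpers with Kryo (de)serialization while keeping the null-key sequence-file
layout the same.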

But why is it that Hadoop compression doesn't work for saveAsObjectFile(), but
does work (according to Guillaume) for saveAsHadoopFile()?
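
For the archive: Guillaume's (key, null) trick combined with the codec
overload of saveAsHadoopFile might look roughly like the sketch below. This is
untested; the local master, app name, output path, and codec choice are all
placeholders:

```scala
import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.hadoop.mapred.SequenceFileOutputFormat
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // implicit conversion to PairRDDFunctions

// Placeholder master and app name for illustration only.
val sc = new SparkContext("local", "compressed-save-sketch")
val rdd = sc.parallelize(Seq("a", "b", "c"))

// The (key, null) trick: wrap each element in a pair so the
// PairRDDFunctions save methods (which accept a compression codec)
// become available on what is logically an RDD of single values.
rdd.map(v => (new Text(v), NullWritable.get()))
  .saveAsHadoopFile(
    "s3n://some-bucket/output", // placeholder; an hdfs:// path works the same way
    classOf[Text],
    classOf[NullWritable],
    classOf[SequenceFileOutputFormat[Text, NullWritable]],
    classOf[GzipCodec])

sc.stop()
```

The codec parameter on saveAsHadoopFile is exactly what saveAsObjectFile never
exposes, which would explain why compression works through one but not the
other.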


>
>
>
>
> On Fri, Jan 3, 2014 at 1:33 PM, Aureliano Buendia <[email protected]>wrote:
>
>>
>>
>>
>> On Fri, Jan 3, 2014 at 7:26 PM, Guillaume Pitel <
>> [email protected]> wrote:
>>
>>>  Actually, the interesting part of Hadoop files is the SequenceFile
>>> format, which allows the data to be split into multiple blocks. Other
>>> files in HDFS are single blocks; they do not scale.
>>>
>>
>> But the output of saveAsObjectFile looks like part-00000, part-00001,
>> part-00002, ... It does output split data, making it scalable, no?
>>
>>
>>>
>>> An ObjectFile cannot naturally be split.
>>>
>>> Usually in Hadoop, when storing a sequence of elements instead of a
>>> sequence of (key, value) pairs, the trick is to store (key, null).
>>>
>>> I don't know what the most effective way to do that in Scala/Spark is.
>>> Actually, that would be a good thing to add to RDD[U].
>>>
>>> Guillaume
>>>
>>>
>>>
>>>
>>> On Fri, Jan 3, 2014 at 7:10 PM, Andrew Ash <[email protected]> wrote:
>>>
>>>> saveAsHadoopFile and saveAsNewAPIHadoopFile are on PairRDDFunctions,
>>>> which uses some Scala magic to become available when you have an
>>>> RDD[Key, Value]
>>>>
>>>>
>>>> https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L648
>>>>
>>>
>>>  I see. So if my data is of RDD[Value] type, I cannot use compression?
>>> Why does it have to be RDD[Key, Value] in order to save it in Hadoop?
>>>
>>>  Also, doesn't saveAsObjectFile("hdfs://...") save data in Hadoop? This
>>> is confusing.
>>>
>>>  I'm only interested in saving data on s3 ("s3n://..."), does it matter
>>> whether I use saveAsHadoopFile or saveAsObjectFile?
>>>
>>>
>>>>
>>>>
>>> --
>>>  *Guillaume PITEL, Président*
>>> +33(0)6 25 48 86 80 / +33(0)9 70 44 67 53
>>>
>>>  eXenSa S.A.S. <http://www.exensa.com/>
>>>  41, rue Périer - 92120 Montrouge - FRANCE
>>> Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05
>>>
>>
>>
>
