On Fri, Jan 3, 2014 at 7:26 PM, Guillaume Pitel
<[email protected]>wrote:

>  Actually, the interesting part of Hadoop files is the SequenceFile
> format, which allows the data to be split into multiple blocks. Other
> formats in HDFS cannot be split; they do not scale.
>

But the output of saveAsObjectFile looks like part-00000, part-00001,
part-00002, ... It does output the data in multiple parts, making it
scalable, no?


>
> An ObjectFile cannot be naturally split.
>
> Usually, in Hadoop, when storing a sequence of elements instead of a
> sequence of (key, value) pairs, the trick is to store (key, null).
>
> I don't know what the most effective way to do that in Scala/Spark is.
> Actually, it would be a good thing to add to RDD[U].
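>
> One possible sketch (untested; assuming an RDD[String] named rdd, and the
> implicits from SparkContext._):
>
>   import org.apache.spark.SparkContext._
>   import org.apache.hadoop.io.NullWritable
>
>   // Pair each element with a null value so the data can be written
>   // as a splittable SequenceFile of (element, null) records.
>   rdd.map(x => (x, NullWritable.get()))
>      .saveAsSequenceFile("hdfs://path/to/output")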
>
> Guillaume
>
>
>
>
> On Fri, Jan 3, 2014 at 7:10 PM, Andrew Ash <[email protected]> wrote:
>
>> saveAsHadoopFile and saveAsNewAPIHadoopFile are on PairRDDFunctions,
>> which uses some Scala magic to become available when you have an
>> RDD[(Key, Value)]:
>>
>>
>> https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L648
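>>
>> Roughly, the implicit behind that magic looks something like this
>> (simplified sketch; the real signature in SparkContext has extra
>> ClassManifest bounds):
>>
>>   // Any RDD[(K, V)] silently picks up PairRDDFunctions methods such as
>>   // saveAsHadoopFile and reduceByKey through this implicit conversion.
>>   implicit def rddToPairRDDFunctions[K, V](rdd: RDD[(K, V)]) =
>>     new PairRDDFunctions(rdd)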
>>
>
>  I see. So if my data is of type RDD[Value], I cannot use compression?
> Why does it have to be RDD[(Key, Value)] in order to be saved to Hadoop?
>
>  Also, doesn't saveAsObjectFile("hdfs://...") save data to Hadoop? This
> is confusing.
>
>  I'm only interested in saving data to S3 ("s3n://..."). Does it matter
> whether I use saveAsHadoopFile or saveAsObjectFile?
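>
>  Concretely, I mean something like this (just a sketch; the bucket and
> paths are made up, and I'm assuming an RDD[String] named rdd):
>
>   import org.apache.hadoop.io.NullWritable
>   import org.apache.hadoop.mapred.TextOutputFormat
>
>   // saveAsObjectFile works directly on an RDD[Value]:
>   rdd.saveAsObjectFile("s3n://my-bucket/output/objects")
>
>   // saveAsHadoopFile needs an RDD of pairs plus an OutputFormat:
>   rdd.map(x => (x, NullWritable.get()))
>      .saveAsHadoopFile[TextOutputFormat[String, NullWritable]](
>        "s3n://my-bucket/output/text")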
>
>
>>
>>
> --
>  *Guillaume PITEL, Président*
> +33(0)6 25 48 86 80 / +33(0)9 70 44 67 53
>
>  eXenSa S.A.S. <http://www.exensa.com/>
>  41, rue Périer - 92120 Montrouge - FRANCE
> Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05
>
