On Fri, Jan 3, 2014 at 7:26 PM, Guillaume Pitel <[email protected]> wrote:
> Actually, the interesting part in Hadoop files is the SequenceFile
> format, which allows splitting the data into multiple blocks. Other
> files in HDFS are single-block; they do not scale.

But the output of saveAsObjectFile looks like part-00000, part-00001,
part-00002, ... It does output split data, making it scalable, no?

> An ObjectFile cannot be naturally split.
>
> Usually, in Hadoop, when storing a sequence of elements instead of a
> sequence of (key, value) pairs, the trick is to store (key, null).
>
> I don't know what's the most effective way to do that in Scala/Spark.
> Actually, it would be a good thing to add to RDD[U].
>
> Guillaume
>
> On Fri, Jan 3, 2014 at 7:10 PM, Andrew Ash <[email protected]> wrote:
>
>> saveAsHadoopFile and saveAsNewAPIHadoopFile are on PairRDDFunctions,
>> which uses some Scala magic to become available when you have an RDD
>> that's RDD[Key, Value].
>>
>> https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L648
>
> I see. So if my data is of RDD[Value] type, I cannot use compression?
> Why does it have to be RDD[Key, Value] in order to save it in Hadoop?
>
> Also, doesn't saveAsObjectFile("hdfs://...") save data in Hadoop? This
> is confusing.
>
> I'm only interested in saving data on S3 ("s3n://..."), so does it
> matter whether I use saveAsHadoopFile or saveAsObjectFile?
>
> --
> Guillaume PITEL, Président
> +33(0)6 25 48 86 80 / +33(0)9 70 44 67 53
>
> eXenSa S.A.S. <http://www.exensa.com/>
> 41, rue Périer - 92120 Montrouge - FRANCE
> Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05
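
If I understand the (key, null) trick correctly, a minimal sketch would
look like the following (assuming an RDD[String], a SparkContext called
"sc", and a placeholder output path; the NullWritable key stands in for
"null", and the codec overload of saveAsHadoopFile turns on compression):

import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.hadoop.mapred.SequenceFileOutputFormat
import org.apache.spark.SparkContext._  // implicit conversion to PairRDDFunctions (needed pre-1.3)

// rdd: RDD[String] -- plain values, no keys (made-up example data)
val rdd = sc.parallelize(Seq("a", "b", "c"))

// Pair each value with an empty NullWritable key, so this becomes an
// RDD[(NullWritable, Text)] and the pair-RDD save methods apply.
val pairs = rdd.map(v => (NullWritable.get(), new Text(v)))

// Write a compressed SequenceFile; the bucket/path is a placeholder.
pairs.saveAsHadoopFile(
  "s3n://my-bucket/out",
  classOf[NullWritable],
  classOf[Text],
  classOf[SequenceFileOutputFormat[NullWritable, Text]],
  classOf[GzipCodec])

If that's right, the nice property is that a SequenceFile stays
splittable even when compressed, because the codec is applied per
record/block between sync markers, unlike a gzipped text file.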
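Reading the linked PairRDDFunctions source, the "Scala magic" Andrew
mentions appears to be an implicit conversion, roughly along these lines
(simplified; the real signature in that era also takes ClassManifests):

// In org.apache.spark.SparkContext (pre-1.3 Spark), approximately:
implicit def rddToPairRDDFunctions[K, V](rdd: RDD[(K, V)]): PairRDDFunctions[K, V] =
  new PairRDDFunctions(rdd)

Since the conversion only exists for RDD[(K, V)], the compiler can never
attach saveAsHadoopFile (and its compression overloads) to an RDD[Value];
hence the need to wrap plain values in pairs first.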

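And saveAsObjectFile itself seems to explain both the part-NNNNN files
and the lack of a codec option. Paraphrasing RDD.scala from around this
version (so possibly not exact), it is just the (null, value) trick over
a SequenceFile, with values batched and Java-serialized:

// Inside RDD[T], approximately:
def saveAsObjectFile(path: String) {
  this.mapPartitions(iter => iter.grouped(10).map(_.toArray))  // batches of 10 elements
    .map(x => (NullWritable.get(), new BytesWritable(Utils.serialize(x))))  // Utils is private[spark]
    .saveAsSequenceFile(path)  // one part-file per partition, no codec parameter
}

If that reading is correct, then for "s3n://..." the scheme only selects
the Hadoop FileSystem implementation, so both methods work against S3;
the real difference is the on-disk format and that only the
saveAsHadoopFile family exposes a compression codec.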