On Fri, Jan 3, 2014 at 7:41 PM, Imran Rashid <[email protected]> wrote:
> I think a lot of the confusion is cleared up with a quick look at the code:
>
> https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L901
>
> saveAsObjectFile is just a thin wrapper around saveAsSequenceFile, which
> makes a null key and calls the java serializer.
>
> if you want to use kryo, just do the same thing yourself, but use the kryo
> serializer in place of the java one.

Thanks! But why is it that hadoop compression doesn't work for saveAsObjectFile(), while (according to Guillaume) it does work for saveAsHadoopFile()?

> On Fri, Jan 3, 2014 at 1:33 PM, Aureliano Buendia <[email protected]> wrote:
>
>> On Fri, Jan 3, 2014 at 7:26 PM, Guillaume Pitel <[email protected]> wrote:
>>
>>> Actually, the interesting part in hadoop files is the SequenceFile
>>> format, which allows splitting the data into various blocks. Other
>>> files in HDFS are single blocks. They do not scale.
>>
>> But the output of saveAsObjectFile looks like: part-00000, part-00001,
>> part-00002, ... . It does output split data, making it scalable, no?
>>
>>> An ObjectFile cannot be naturally split.
>>>
>>> Usually, in Hadoop, when storing a sequence of elements instead of a
>>> sequence of (key, value) pairs, the trick is to store (key, null).
>>>
>>> I don't know what's the most effective way to do that in scala/spark.
>>> Actually, that would be a good thing to add to RDD[U].
>>>
>>> Guillaume
>>>
>>> On Fri, Jan 3, 2014 at 7:10 PM, Andrew Ash <[email protected]> wrote:
>>>
>>>> saveAsHadoopFile and saveAsNewAPIHadoopFile are on PairRDDFunctions,
>>>> which uses some Scala magic to become available when you have an RDD
>>>> that's RDD[Key, Value]:
>>>>
>>>> https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L648
>>>
>>> I see. So if my data is of RDD[Value] type, I cannot use compression?
>>> Why does it have to be of RDD[Key, Value] in order to save it in hadoop?
>>>
>>> Also, doesn't saveAsObjectFile("hdfs://...") save data in hadoop? This
>>> is confusing.
>>>
>>> I'm only interested in saving data on s3 ("s3n://..."), does it matter
>>> if I use saveAsHadoopFile or saveAsObjectFile?
>>>
>>> --
>>> Guillaume PITEL, Président
>>> +33(0)6 25 48 86 80 / +33(0)9 70 44 67 53
>>>
>>> eXenSa S.A.S. <http://www.exensa.com/>
>>> 41, rue Périer - 92120 Montrouge - FRANCE
>>> Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05
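Putting the pieces of the thread together — the (key, null) trick to get into PairRDDFunctions, Kryo in place of Java serialization, and compression via saveAsHadoopFile — might look roughly like this. This is an untested sketch against the 0.8-era API; saveAsKryoObjectFile is a hypothetical helper (not part of RDD), and the JobConf settings are the standard Hadoop output-compression properties:

```scala
import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.hadoop.mapred.{JobConf, SequenceFileOutputFormat}
import org.apache.spark.SparkContext._  // brings PairRDDFunctions into scope
import org.apache.spark.rdd.RDD
import org.apache.spark.serializer.KryoSerializer

// Hypothetical helper: store (null, bytes) pairs so a plain RDD[T] can go
// through PairRDDFunctions, serialize each element with Kryo instead of
// Java serialization, and enable block compression on the SequenceFile.
def saveAsKryoObjectFile[T](rdd: RDD[T], path: String) {
  val pairs = rdd.mapPartitions { iter =>
    val ser = new KryoSerializer().newInstance()  // created on the worker
    iter.map { x =>
      val buf = ser.serialize(x)
      val bytes = new Array[Byte](buf.remaining())
      buf.get(bytes)
      (NullWritable.get(), new BytesWritable(bytes))
    }
  }
  // Hadoop-level compression settings, which saveAsObjectFile ignores but
  // saveAsHadoopFile honours:
  val conf = new JobConf(pairs.context.hadoopConfiguration)
  conf.set("mapred.output.compress", "true")
  conf.set("mapred.output.compression.type", "BLOCK")
  conf.set("mapred.output.compression.codec", classOf[GzipCodec].getName)
  pairs.saveAsHadoopFile(path, classOf[NullWritable], classOf[BytesWritable],
    classOf[SequenceFileOutputFormat[NullWritable, BytesWritable]], conf)
}
```

The path can be "hdfs://..." or "s3n://..." alike — the FileSystem is picked from the URI scheme, so the same call should work for s3. Custom classes would still need a Kryo registrator configured via spark.kryo.registrator.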
