Actually, the interesting part of Hadoop files is the SequenceFile format, which allows the data to be split into multiple blocks. Other file formats in HDFS are effectively single-block; they do not scale.

An ObjectFile cannot be naturally split.

Usually in Hadoop, when storing a sequence of elements instead of a sequence of (key, value) pairs, the trick is to store (key, null).

I don't know the most effective way to do that in Scala/Spark. Actually, it would be a good thing to add to RDD[U].
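For illustration, here is a minimal sketch of that trick, assuming a SparkContext named sc (as in spark-shell); the output path, the element type, and the choice of GzipCodec are placeholders:

    import org.apache.spark.SparkContext._  // implicit conversion to PairRDDFunctions
    import org.apache.hadoop.io.{NullWritable, Text}
    import org.apache.hadoop.io.compress.GzipCodec
    import org.apache.hadoop.mapred.SequenceFileOutputFormat

    // Wrap each element as (element, null) so the RDD becomes an RDD of pairs,
    // which makes the compressed Hadoop save methods available.
    val elements = sc.parallelize(Seq("a", "b", "c"))
    elements
      .map(v => (new Text(v), NullWritable.get()))
      .saveAsHadoopFile(
        "hdfs://namenode/path/out",  // placeholder path
        classOf[Text],
        classOf[NullWritable],
        classOf[SequenceFileOutputFormat[Text, NullWritable]],
        classOf[GzipCodec])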

Guillaume



On Fri, Jan 3, 2014 at 7:10 PM, Andrew Ash <[email protected]> wrote:
saveAsHadoopFile and saveAsNewAPIHadoopFile are on PairRDDFunctions, which uses some Scala magic (an implicit conversion) to become available when you have an RDD[(Key, Value)].
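A quick sketch of what that looks like in practice; the path and data are placeholders, and GzipCodec is just one possible codec:

    import org.apache.spark.SparkContext._  // implicit conversions for RDDs of pairs
    import org.apache.hadoop.io.compress.GzipCodec

    val pairs = sc.parallelize(Seq(("key1", "value1"), ("key2", "value2")))
    // Because pairs is an RDD of 2-tuples, the pair-based save methods become
    // available, including compressed output:
    pairs.saveAsSequenceFile("hdfs://namenode/path/out", Some(classOf[GzipCodec]))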


I see. So if my data is of type RDD[Value], I cannot use compression? Why does it have to be RDD[Key, Value] in order to save it in Hadoop?

Also, doesn't saveAsObjectFile("hdfs://...") save data in Hadoop? This is confusing.

I'm only interested in saving data on S3 ("s3n://..."). Does it matter whether I use saveAsHadoopFile or saveAsObjectFile?
 


--
eXenSa
Guillaume PITEL, Président
+33(0)6 25 48 86 80 / +33(0)9 70 44 67 53

eXenSa S.A.S.
41, rue Périer - 92120 Montrouge - FRANCE
Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05
