On Fri, Jan 3, 2014 at 7:10 PM, Andrew Ash <[email protected]> wrote:
> saveAsHadoopFile and saveAsNewAPIHadoopFile are on PairRDDFunctions, which
> uses some Scala magic to become available when you have an RDD that's
> RDD[Key, Value]
>
> https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L648

I see. So if my data is of RDD[Value] type, I cannot use compression? Why
does it have to be of RDD[Key, Value] in order to save it in Hadoop? Also,
doesn't saveAsObjectFile("hdfs://...") save data in Hadoop? This is
confusing.

I'm only interested in saving data to S3 ("s3n://..."). Does it matter
whether I use saveAsHadoopFile or saveAsObjectFile?

> Agreed, something like Chill would make this much easier for the default
> cases.

It seems chill is already in use:
https://github.com/apache/incubator-spark/blob/3713f8129a618a633a7aca8c944960c3e7ac9d3b/core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala#L26

But what we need is something like chill-hadoop:
https://github.com/twitter/chill/tree/develop/chill-hadoop

(I've put a few sketches of what I mean at the bottom of this mail, below
the quoted thread.)

> On Fri, Jan 3, 2014 at 2:04 PM, Aureliano Buendia <[email protected]> wrote:
>
>> RDD only defines saveAsTextFile and saveAsObjectFile. I think
>> saveAsHadoopFile and saveAsNewAPIHadoopFile belong to the older versions.
>>
>> saveAsObjectFile definitely outputs Hadoop format.
>>
>> I'm not trying to save big objects with saveAsObjectFile; I'm just trying
>> to minimize the Java serialization overhead when saving to a binary file.
>>
>> I can see Spark could benefit from something like
>> https://github.com/twitter/chill in this matter.
>>
>> On Fri, Jan 3, 2014 at 6:42 PM, Guillaume Pitel <
>> [email protected]> wrote:
>>
>>> Hi,
>>>
>>> After a little bit of thinking, I'm not sure anymore if saveAsObjectFile
>>> uses the spark.hadoop.* properties.
>>>
>>> Also, I made a mistake: the use of *.mapred.* or *.mapreduce.* does not
>>> depend on the Hadoop version you use, but on the API version you use.
>>>
>>> So, I can assure you that if you use saveAsNewAPIHadoopFile with the
>>> spark.hadoop.mapreduce.* properties, the compression will be used.
>>>
>>> If you use saveAsHadoopFile, it should be used with the mapred.*
>>> properties.
>>>
>>> If you use saveAsObjectFile with an HDFS path, I'm not sure whether the
>>> output is compressed.
>>>
>>> Anyway, saveAsObjectFile should be used for small objects, in my opinion.
>>>
>>> Guillaume
>>>
>>> Even
>>>
>>> someMap.saveAsTextFile("out", classOf[GzipCodec])
>>>
>>> has no effect.
>>>
>>> Also, I noticed that saving sequence files has no compression option
>>> (my original question was about compressing binary output).
>>>
>>> Having said this, I still do not understand why Kryo cannot be helpful
>>> when saving binary output. Binary output uses Java serialization, which
>>> has a pretty hefty overhead.
>>>
>>> How can Kryo be applied to T when calling RDD[T]#saveAsObjectFile()?
>>>
>>> --
>>> Guillaume PITEL, Président
>>> +33(0)6 25 48 86 80 / +33(0)9 70 44 67 53
>>>
>>> eXenSa S.A.S. <http://www.exensa.com/>
>>> 41, rue Périer - 92120 Montrouge - FRANCE
>>> Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05
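To make the RDD[Value] question concrete: if I understand the
PairRDDFunctions trick correctly, wrapping each value with a throwaway
NullWritable key should be enough to unlock saveAsHadoopFile and its codec
overload. A minimal sketch of what I mean (the byte-array values and the
output path are just placeholders for my real data):

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._ // implicit conversion to PairRDDFunctions
    import org.apache.hadoop.io.{BytesWritable, NullWritable}
    import org.apache.hadoop.io.compress.GzipCodec
    import org.apache.hadoop.mapred.SequenceFileOutputFormat

    object CompressedBinarySave {
      def main(args: Array[String]) {
        val sc = new SparkContext("local", "compressed-binary-save")

        // Stand-in for my real RDD[Value]; here each value is a byte array.
        val values = sc.parallelize(1 to 1000).map(i => ("record-" + i).getBytes("UTF-8"))

        // Wrap each value with a NullWritable key so this becomes an
        // RDD[(NullWritable, BytesWritable)] and PairRDDFunctions kicks in.
        val pairs = values.map(bytes => (NullWritable.get(), new BytesWritable(bytes)))

        // The codec argument compresses the output; the path could equally
        // be "s3n://bucket/out" or "hdfs://...".
        pairs.saveAsHadoopFile(
          "out-gzip",
          classOf[NullWritable],
          classOf[BytesWritable],
          classOf[SequenceFileOutputFormat[NullWritable, BytesWritable]],
          classOf[GzipCodec])

        sc.stop()
      }
    }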
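For Guillaume's point about the spark.hadoop.* properties, this is how I
read it. I'm not certain of the exact property names (they seem to differ
between Hadoop releases), so take these as the classic mapred.* names plus
the Hadoop 2 mapreduce.* equivalents; they have to be set before the
SparkContext is created:

    // Properties with a "spark.hadoop." prefix get copied by Spark into
    // the Hadoop configuration it builds for the save.

    // Old "mapred" API, i.e. saveAsHadoopFile:
    System.setProperty("spark.hadoop.mapred.output.compress", "true")
    System.setProperty("spark.hadoop.mapred.output.compression.codec",
      "org.apache.hadoop.io.compress.GzipCodec")
    System.setProperty("spark.hadoop.mapred.output.compression.type", "BLOCK")

    // New "mapreduce" API, i.e. saveAsNewAPIHadoopFile (names as in Hadoop 2):
    System.setProperty("spark.hadoop.mapreduce.output.fileoutputformat.compress", "true")
    System.setProperty("spark.hadoop.mapreduce.output.fileoutputformat.compress.codec",
      "org.apache.hadoop.io.compress.GzipCodec")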
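And on applying Kryo to T: as far as I can tell, saveAsObjectFile is
hard-wired to Java serialization regardless of spark.serializer, so the
only route I can see today is to Kryo-serialize the elements myself with
chill and save the raw bytes. A rough sketch, assuming chill's KryoPool /
ScalaKryoInstantiator API (names taken from the chill repo linked above):

    import com.twitter.chill.{KryoPool, ScalaKryoInstantiator}
    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._ // implicit conversion to PairRDDFunctions
    import org.apache.hadoop.io.{BytesWritable, NullWritable}
    import org.apache.hadoop.mapred.SequenceFileOutputFormat

    // Keep the pool in an object so each executor JVM builds its own copy
    // lazily, instead of Spark trying to serialize it inside the closure.
    object KryoHolder {
      lazy val pool = KryoPool.withByteArrayOutputStream(1, new ScalaKryoInstantiator)
    }

    object KryoSave {
      def main(args: Array[String]) {
        val sc = new SparkContext("local", "kryo-save")
        val rdd = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))

        // Kryo-serialize each element to bytes, then save as a sequence file.
        val asBytes = rdd.map { t =>
          (NullWritable.get(), new BytesWritable(KryoHolder.pool.toBytesWithClass(t)))
        }
        asBytes.saveAsHadoopFile(
          "out-kryo",
          classOf[NullWritable],
          classOf[BytesWritable],
          classOf[SequenceFileOutputFormat[NullWritable, BytesWritable]])

        sc.stop()
      }
    }

Reading back would then be sc.sequenceFile plus pool.fromBytes on each
BytesWritable, which is more or less what chill-hadoop seems to package up.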
