Ah, somehow after all this time I've never seen that!
On Wed, Jan 22, 2014 at 4:45 PM, Aureliano Buendia <[email protected]> wrote:
>
> On Thu, Jan 23, 2014 at 12:37 AM, Patrick Wendell <[email protected]>
> wrote:
>>
>> What is the ++ operator here? Is this something you defined?
>
> No, it's an alias for union defined in RDD.scala:
>
>   def ++(other: RDD[T]): RDD[T] = this.union(other)
>
>>
>> Another issue is that RDDs are not ordered, so when you union two
>> together the result doesn't have a well-defined ordering.
>>
>> If you do want to do this you could coalesce into one partition, then
>> call mapPartitions and return an iterator that first adds your header
>> and then the rest of the file, then call saveAsTextFile. Keep in mind
>> this will only work if you coalesce into a single partition.
>
> Thanks! I'll give this a try.
>
>>
>> myRdd.coalesce(1)
>>   .map(_.mkString(","))
>>   .mapPartitions(it => (Seq("col1,col2,col3") ++ it).iterator)
>>   .saveAsTextFile("out.csv")
>>
>> - Patrick
>>
>> On Wed, Jan 22, 2014 at 11:12 AM, Aureliano Buendia
>> <[email protected]> wrote:
>> > Hi,
>> >
>> > I'm trying to find a way to create a csv header when using
>> > saveAsTextFile, and I came up with this:
>> >
>> > (sc.makeRDD(Array("col1,col2,col3"), 1) ++
>> >   myRdd.coalesce(1).map(_.mkString(",")))
>> >   .saveAsTextFile("out.csv")
>> >
>> > But it only saves the header part. Why is it that the union method
>> > does not return both RDDs?
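
For anyone reading this thread later: here is a rough, Spark-free sketch of why the mapPartitions approach quoted above needs a single partition. It is not Spark code. The object name CsvHeaderSketch and the helpers mapPartitions and withHeader are hypothetical names made up for this illustration, and each inner Seq merely stands in for one partition. The function passed to mapPartitions runs once per partition, so the header is prepended once per partition.

```scala
// Hypothetical model only: an "RDD" here is a Seq of partitions,
// and each partition is a Seq of rows. No Spark dependency.
object CsvHeaderSketch {
  val header = "col1,col2,col3"

  // Models mapPartitions: apply f to each partition's iterator independently.
  def mapPartitions[A, B](parts: Seq[Seq[A]])(f: Iterator[A] => Iterator[B]): Seq[Seq[B]] =
    parts.map(p => f(p.iterator).toList)

  // Prepend the header line to the rows of one partition.
  def withHeader(rows: Iterator[String]): Iterator[String] =
    Iterator(header) ++ rows

  def main(args: Array[String]): Unit = {
    val rows = Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9)).map(_.mkString(","))

    // One partition (the coalesce(1) case): the header appears once, at the top.
    val single = mapPartitions(Seq(rows))(withHeader).flatten
    single.foreach(println)

    // Two partitions: the header is emitted at the start of each partition.
    val split = mapPartitions(Seq(rows.take(2), rows.drop(2)))(withHeader).flatten
    println(s"header lines with two partitions: ${split.count(_ == header)}")
  }
}
```

With more than one partition, each partition's output would start with its own header line, which is why the coalesce(1) step matters in the suggestion above.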
