Re: Union of 2 RDD's only returns the first one

Patrick Wendell Tue, 29 Apr 2014 22:27:15 -0700

You are right: once you sort() the RDD, it has a well-defined ordering.

But that ordering is lost as soon as you transform the RDD, including
if you union it with another RDD.
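
For reference, here is a minimal sketch of the single-partition header approach from the quoted message below. It only illustrates the function that would be passed to mapPartitions and runs without Spark. The names CsvHeader and withHeader and the sample rows are made up for illustration; myRdd and the column names come from the thread.

```scala
object CsvHeader {
  // Prepend a header line to the rows of one partition.
  // Iterator.single keeps this lazy; no intermediate Seq is built.
  def withHeader(header: String)(rows: Iterator[String]): Iterator[String] =
    Iterator.single(header) ++ rows

  def main(args: Array[String]): Unit = {
    // Stand-in for the single partition left after coalesce(1) (sample data).
    val partition = Iterator("1,2,3", "4,5,6")
    withHeader("col1,col2,col3")(partition).foreach(println)
  }
}

// With Spark, assuming myRdd from the thread, the same function would plug in as:
//   myRdd.map(_.mkString(","))
//     .coalesce(1)
//     .mapPartitions(rows => CsvHeader.withHeader("col1,col2,col3")(rows))
//     .saveAsTextFile("out.csv")
```

The header lands first only because there is exactly one partition; with more partitions, every output part file would start with its own copy of the header.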

On Tue, Apr 29, 2014 at 10:22 PM, Mingyu Kim <m...@palantir.com> wrote:
> Hi Patrick,
>
> I'm a little confused about your comment that RDDs are not ordered. As far
> as I know, RDDs keep a list of partitions that are ordered, and this is why I
> can call RDD.take() and get the same first k rows every time I call it, and
> why RDD.take() returns the same entries as RDD.map(...).take(): map
> preserves the partition order. RDD order is also what allows me to get the
> top k out of an RDD by doing RDD.sort().take().
>
> Am I misunderstanding it? Or, is it just when RDD is written to disk that
> the order is not well preserved? Thanks in advance!
>
> Mingyu
>
>
>
>
> On 1/22/14, 4:46 PM, "Patrick Wendell" <pwend...@gmail.com> wrote:
>
>>Ah somehow after all this time I've never seen that!
>>
>>On Wed, Jan 22, 2014 at 4:45 PM, Aureliano Buendia <buendia...@gmail.com>
>>wrote:
>>>
>>>
>>>
>>> On Thu, Jan 23, 2014 at 12:37 AM, Patrick Wendell <pwend...@gmail.com>
>>> wrote:
>>>>
>>>> What is the ++ operator here? Is this something you defined?
>>>
>>>
>>> No, it's an alias for union defined in RDD.scala:
>>>
>>> def ++(other: RDD[T]): RDD[T] = this.union(other)
>>>
>>>>
>>>>
>>>> Another issue is that RDD's are not ordered, so when you union two
>>>> together it doesn't have a well defined ordering.
>>>>
>>>> If you do want to do this you could coalesce into one partition, then
>>>> call mapPartitions and return an iterator that first adds your header
>>>> and then the rest of the file, then call saveAsTextFile. Keep in mind
>>>> this will only work if you coalesce into a single partition.
>>>
>>>
>>> Thanks! I'll give this a try.
>>>
>>>>
>>>>
>>>> myRdd.coalesce(1)
>>>> .map(_.mkString(","))
>>>> .mapPartitions(it => (Seq("col1,col2,col3") ++ it).iterator)
>>>> .saveAsTextFile("out.csv")
>>>>
>>>> - Patrick
>>>>
>>>> On Wed, Jan 22, 2014 at 11:12 AM, Aureliano Buendia
>>>> <buendia...@gmail.com> wrote:
>>>> > Hi,
>>>> >
>>>> > I'm trying to find a way to create a csv header when using
>>>> > saveAsTextFile,
>>>> > and I came up with this:
>>>> >
>>>> > (sc.makeRDD(Array("col1,col2,col3"), 1) ++
>>>> > myRdd.coalesce(1).map(_.mkString(",")))
>>>> >       .saveAsTextFile("out.csv")
>>>> >
>>>> > But it only saves the header part. Why is it that the union method does
>>>> > not return both RDDs?
>>>
>>>
