Gabriel,
Thanks. That's exactly what I was talking about. I get the same results
when I run multiple times on the same cluster with the same input (to be
expected, since it's based on the MR framework's sort), but when we went to an
entirely new cluster architecture (different hardware/layout) there were
some changes. I suspect this is causing the slight mismatches, but wanted to
make sure that was a reasonable assumption.
Thanks!
Dave
-----Original Message-----
From: Gabriel Reid [mailto:[email protected]]
Sent: Wednesday, August 03, 2016 10:52 AM
To: [email protected]
Subject: Re: Cogroup sort order
Hi David,
I take it you're referring to the ordering of the two Collections returned
within the value Pair of a cogroup result?
As you probably know, there isn't any kind of guaranteed ordering on these
collections, although given the same input and cluster layout, it's perfectly
possible that you would get the same iteration order on the results each time.
However, there are probably quite a few underlying factors which could
change the iteration order on these Collections: for example, a different
number of partitions used by the reducers, or different settings which
influence when spills are done during the shuffle phase (assuming we're
talking about MR-based Crunch here). Note that these are things that impact
the overall ordering of output in MapReduce itself, and nothing specific to
Crunch.
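If the downstream logic is sensitive to that iteration order, one way to make the output independent of the cluster is to impose an explicit order on each collection before using it. The following is only a minimal sketch in plain Java, not Crunch API: the class CogroupOrderSketch and the methods sortedCopy and render are hypothetical names, and the lists stand in for the two Collections a cogroup would hand back for a single key. It assumes the values for one key fit in memory and have a natural ordering.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; none of these names come from Crunch.
final class CogroupOrderSketch {

    // Copies the values and sorts them by natural order, so later steps see
    // the same sequence regardless of the order in which they were delivered.
    static <T extends Comparable<? super T>> List<T> sortedCopy(Collection<T> values) {
        List<T> copy = new ArrayList<>(values);
        Collections.sort(copy);
        return copy;
    }

    // Stand-in for an order-sensitive step: concatenates both sides of the
    // cogroup value after sorting each side.
    static String render(Collection<String> left, Collection<String> right) {
        return String.join(",", sortedCopy(left)) + "|" + String.join(",", sortedCopy(right));
    }

    public static void main(String[] args) {
        // The same values for one key, in two different arrival orders
        // (as might happen on two differently configured clusters).
        List<String> leftOnClusterA = Arrays.asList("b", "a", "c");
        List<String> leftOnClusterB = Arrays.asList("c", "b", "a");
        List<String> right = Arrays.asList("y", "x");

        System.out.println(render(leftOnClusterA, right)); // prints a,b,c|x,y
        System.out.println(render(leftOnClusterB, right)); // prints a,b,c|x,y
    }
}
```

The trade-off is that each collection is copied and sorted in memory, so this only suits keys with a modest number of values.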
- Gabriel
On Wed, Aug 3, 2016 at 4:40 PM, David Ortiz <[email protected]> wrote:
> Hey everyone,
>
> Just curious based on something I'm seeing as we move a job
> around between different EC2 cluster types. Does the underlying
> architecture of the system have an effect on the sort order in a
> cogroup? It looks like our job output changes when moving from the
> cc2 architecture we were using to an m4-based system. The changes
> I am seeing line up with the iterator returning records in a
> different order, so I was curious.
>
> Thanks,
> Dave