Re: Shuffle guarantees
Nevermind, a look @ the ExternalSorter class tells me that the iterator for each key that's only partially ordered ends up being merge sorted by equality after the fact. Wanted to post my finding on here for others who may have the same questions. On Tue, Mar 1, 2016 at 3:05 PM, Corey Noletwrote: > The reason I'm asking, I see a comment in the ExternalSorter class that > says this: > > "If we need to aggregate by key, we either use a total ordering from the > ordering parameter, or read the keys with the same hash code and compare > them with each other for equality to merge values". > > How can this be assumed if the object used for the key, for instance, in > the case where a HashPartitioner is used, cannot assume ordering and > therefore cannot assume a comparator can be used? > > On Tue, Mar 1, 2016 at 2:56 PM, Corey Nolet wrote: > >> So if I'm using reduceByKey() with a HashPartitioner, I understand that >> the hashCode() of my key is used to create the underlying shuffle files. >> >> Is anything other than hashCode() used in the shuffle files when the data >> is pulled into the reducers and run through the reduce function? The reason >> I'm asking is because there's a possibility of hashCode() colliding in two >> different objects which end up hashing to the same number, right? >> >> >> >
Re: Shuffle guarantees
The reason I'm asking, I see a comment in the ExternalSorter class that says this: "If we need to aggregate by key, we either use a total ordering from the ordering parameter, or read the keys with the same hash code and compare them with each other for equality to merge values". How can this be assumed if the object used for the key, for instance, in the case where a HashPartitioner is used, cannot assume ordering and therefore cannot assume a comparator can be used? On Tue, Mar 1, 2016 at 2:56 PM, Corey Noletwrote: > So if I'm using reduceByKey() with a HashPartitioner, I understand that > the hashCode() of my key is used to create the underlying shuffle files. > > Is anything other than hashCode() used in the shuffle files when the data > is pulled into the reducers and run through the reduce function? The reason > I'm asking is because there's a possibility of hashCode() colliding in two > different objects which end up hashing to the same number, right? > > >
Shuffle guarantees
So if I'm using reduceByKey() with a HashPartitioner, I understand that the hashCode() of my key is used to create the underlying shuffle files. Is anything other than hashCode() used in the shuffle files when the data is pulled into the reducers and run through the reduce function? The reason I'm asking is because there's a possibility of hashCode() colliding in two different objects which end up hashing to the same number, right?