Be careful shoving arbitrary binary data into a string: in my experience, invalid UTF byte sequences can cause significant computational overhead.

On Jun 11, 2015 10:09 AM, "Mark Tse" <mark....@d2l.com> wrote:
> Makes sense – I suspect what you suggested should work.
>
> However, I think the overhead of this and of using `String` would be
> similar enough to warrant just using `String`.
>
> Mark
>
> *From:* Sonal Goyal [mailto:sonalgoy...@gmail.com]
> *Sent:* June-11-15 12:58 PM
> *To:* Mark Tse
> *Cc:* user@spark.apache.org
> *Subject:* Re: ReduceByKey with a byte array as the key
>
> I think if you wrap the byte[] in an object and implement the equals and
> hashCode methods, you may be able to do this. There will be the overhead of
> an extra object, but conceptually it should work unless I am missing
> something.
>
> Best Regards,
> Sonal
> Founder, Nube Technologies <http://www.nubetech.co>
>
> Check out Reifier at Spark Summit 2015
> <https://spark-summit.org/2015/events/real-time-fuzzy-matching-with-spark-and-elastic-search/>
>
> On Thu, Jun 11, 2015 at 9:27 PM, Mark Tse <mark....@d2l.com> wrote:
>
> I would like to work with RDD pairs of Tuple2<byte[], obj>, but byte[]s
> with the same contents are treated as different keys because their
> references are different.
>
> I didn't see any way to pass in a custom comparer. I could convert the
> byte[] into a String with an explicit charset, but I'm wondering if there's
> a more efficient way.
>
> Also posted on SO: http://stackoverflow.com/q/30785615/2687324
>
> Thanks,
>
> Mark
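
For reference, here is a minimal sketch of the wrapper approach suggested in the thread, written in plain Java so it runs without Spark. `ByteArrayKeyDemo`, `ByteArrayKey`, and `roundTrips` are hypothetical names chosen for illustration, not Spark classes. A `HashMap` stands in for the grouping that `reduceByKey` would do, on the assumption that key grouping relies on `equals` and `hashCode`. The last two lines also show why the choice of charset matters if you go the `String` route instead: arbitrary bytes do not necessarily survive a UTF-8 round trip.

```java
import java.io.Serializable;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class ByteArrayKeyDemo {

    // Hypothetical wrapper: gives a byte[] content-based equality and hashing.
    // Serializable on the assumption that keys get shipped between JVMs.
    static final class ByteArrayKey implements Serializable {
        private static final long serialVersionUID = 1L;
        private final byte[] bytes;

        ByteArrayKey(byte[] bytes) {
            // Defensive copy so later mutation of the caller's array
            // cannot change this key's hash code.
            this.bytes = bytes.clone();
        }

        byte[] bytes() {
            return bytes.clone();
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof ByteArrayKey)) return false;
            return Arrays.equals(bytes, ((ByteArrayKey) o).bytes);
        }

        @Override
        public int hashCode() {
            return Arrays.hashCode(bytes);
        }
    }

    // True if decoding the bytes to a String and encoding them back
    // with the same charset reproduces the original bytes.
    static boolean roundTrips(byte[] data, Charset charset) {
        byte[] back = new String(data, charset).getBytes(charset);
        return Arrays.equals(data, back);
    }

    public static void main(String[] args) {
        // 0xC3 followed by 0x28 is not a valid UTF-8 sequence.
        byte[] a = {(byte) 0xC3, 0x28, 0x01};
        byte[] b = {(byte) 0xC3, 0x28, 0x01};

        // Raw arrays compare by reference; wrapped arrays compare by content.
        System.out.println(a.equals(b));                                      // false
        System.out.println(new ByteArrayKey(a).equals(new ByteArrayKey(b)));  // true

        // Stand-in for reduceByKey: sum the values per key.
        Map<ByteArrayKey, Integer> counts = new HashMap<>();
        counts.merge(new ByteArrayKey(a), 1, Integer::sum);
        counts.merge(new ByteArrayKey(b), 1, Integer::sum);
        System.out.println(counts.get(new ByteArrayKey(a)));                  // 2

        // String route: UTF-8 replaces the malformed input, ISO-8859-1 does not.
        System.out.println(roundTrips(a, StandardCharsets.UTF_8));            // false
        System.out.println(roundTrips(a, StandardCharsets.ISO_8859_1));       // true
    }
}
```

The wrapper costs one extra object (plus a copy of the array) per key, which is the overhead mentioned above. If the `String` route is preferred, a single-byte charset such as ISO-8859-1 at least keeps distinct byte arrays distinct.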