We use Thrift structures within Hadoop Map/Reduce, and occasionally a Thrift object serves as our grouping or join key. This usually works well, but we have trouble with structs that contain maps and sets. The internal ordering of a map or set is arbitrary, and we serialize its elements in that arbitrary order. As a result, two 'equal' objects might not serialize into the same byte array, so they fail any equality check that looks only at the serialized data.

I was wondering if it would make sense to enforce some sort of ordering scheme for collections where order might be arbitrary, at least during serialization. This would necessitate implementing a decent compareTo on generated Thrift structs so we could sort before writing, and obviously, it would include sorting overhead.
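To make the idea concrete, here is a rough sketch in plain Java. It does not use Thrift at all: the length-prefixed encoding, the class name CanonicalMapSerialization, and the methods serialize and serializeSorted are all made up for illustration, and String keys stand in for structs that would need a generated compareTo. It only shows that two equal maps can produce different bytes when written in iteration order, and identical bytes once the entries are sorted first.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

public class CanonicalMapSerialization {

    // Hypothetical toy encoding (NOT the Thrift wire format):
    // entry count, then each key/value pair in the map's own iteration order.
    static byte[] serialize(Map<String, Integer> map) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(map.size());
            for (Map.Entry<String, Integer> entry : map.entrySet()) {
                out.writeUTF(entry.getKey());
                out.writeInt(entry.getValue());
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Same encoding, but the entries are copied into a TreeMap first so they
    // are written in key order. This copy is the sorting overhead.
    static byte[] serializeSorted(Map<String, Integer> map) {
        return serialize(new TreeMap<>(map));
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so the two maps below are
        // equal as maps but iterate in different orders.
        Map<String, Integer> a = new LinkedHashMap<>();
        a.put("x", 1);
        a.put("y", 2);

        Map<String, Integer> b = new LinkedHashMap<>();
        b.put("y", 2);
        b.put("x", 1);

        System.out.println("equal as maps:      " + a.equals(b));
        System.out.println("same bytes, as-is:  "
                + Arrays.equals(serialize(a), serialize(b)));
        System.out.println("same bytes, sorted: "
                + Arrays.equals(serializeSorted(a), serializeSorted(b)));
    }
}
```

In a real implementation the sort would presumably happen inside the generated write method, and only for map and set fields, so structs without those fields would pay nothing extra.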

Are other people interested in making this use case work acceptably?

-Bryan
