We use Thrift structures within Hadoop Map/Reduce, and occasionally a
Thrift object serves as our grouping or join key. This usually works
well, but we run into trouble when the struct contains maps or sets.
The internal ordering of a map or set is arbitrary, and we serialize
its elements in that arbitrary order. As a result, two 'equal' objects
might not serialize to the same byte array, and they then fail
equality checks based only on the serialized data.
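To make the failure concrete, here is a minimal sketch in plain Java.
It is not actual Thrift code: the serialize helper is a hypothetical
stand-in for a protocol writer, and LinkedHashSet is used only to force
two different iteration orders deterministically.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

class UnorderedSerializationDemo {
    // Hypothetical stand-in for a serializer: writes the element count,
    // then each element in whatever order the set iterates.
    static byte[] serialize(Set<String> set) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeInt(set.size());
            for (String s : set) {
                out.writeUTF(s);
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return buf.toByteArray();
    }

    // Builds a set whose iteration order is the insertion order.
    static Set<String> setOf(String... items) {
        return new LinkedHashSet<>(Arrays.asList(items));
    }

    public static void main(String[] args) {
        Set<String> a = setOf("x", "y");
        Set<String> b = setOf("y", "x");
        System.out.println(a.equals(b));                               // true
        System.out.println(Arrays.equals(serialize(a), serialize(b))); // false
    }
}
```

The two sets are equal as objects, yet a byte-wise comparison of their
serialized forms treats them as different keys.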
I was wondering if it would make sense to enforce some sort of
ordering scheme, at least during serialization, for collections whose
order might otherwise be arbitrary. This would require implementing a
decent compareTo on generated Thrift structs so that we could sort
elements before writing, and it would obviously add sorting overhead.
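A rough sketch of the sort-before-write idea, again in plain Java
rather than generated Thrift code. The serializeSorted helper and the
class name are hypothetical. String keys are used because String
already implements Comparable; struct keys would need the compareTo
described above.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

class SortedSerializationSketch {
    // Copies the map into a TreeMap so entries are written in key order,
    // regardless of the source map's iteration order. The copy is the
    // sorting overhead: O(n log n) per map written.
    static byte[] serializeSorted(Map<String, Integer> map) {
        Map<String, Integer> sorted = new TreeMap<>(map);
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeInt(sorted.size());
            for (Map.Entry<String, Integer> entry : sorted.entrySet()) {
                out.writeUTF(entry.getKey());
                out.writeInt(entry.getValue());
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        Map<String, Integer> a = new LinkedHashMap<>();
        a.put("x", 1);
        a.put("y", 2);
        Map<String, Integer> b = new LinkedHashMap<>();
        b.put("y", 2);
        b.put("x", 1);
        System.out.println(Arrays.equals(serializeSorted(a), serializeSorted(b))); // true
    }
}
```

With a canonical order, equal maps produce identical bytes, so
comparisons on the raw serialized key would agree with object equality.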
Are other people interested in making this use case work acceptably?
-Bryan