Yeah, it looks like Avro doesn't support comparison on map fields: https://github.com/apache/avro/blob/trunk/lang/java/avro/src/main/java/org/apache/avro/generic/GenericData.java
Assuming the values of the map fields matter for comparison purposes, it seems like your best bet is to serialize the data as a List of pairs or two Lists with corresponding entries, ensuring that the lists are sorted based on the key of the map. Not a pretty solution, but it should work.

J

On Thu, Apr 2, 2015 at 2:00 PM, Lucy Chen <[email protected]> wrote:

> Hi,
>
> I am trying to do Set difference as follows:
>
> PCollection<MyClass> C = Set.difference(A, B);
>
> Here both A and B are PCollection<MyClass> type.
>
> MyClass is defined as follows:
>
> public class MyClass implements java.io.Serializable, Cloneable{
>   private String a;
>   private String b;
>   private int c;
>   private Map<String, Double> d;
>   private int e;
>
>   public MyClass(){
>     this(null, null, 0, new HashMap<String, Double>());
>   }
>
>   public MyClass(String labelID, String sampleID, Integer pos_neg_ind, HashMap<String, Double> feat_val_pair){
>     ......
>   }
>
>   public MyClass(String input){
>     .....
>   }
>
>   .....
> }
>
> From running the set difference, I got the following error. Was that because of MyClass including a Map member d? If so, is there another way to generate the set diff by having these inputs?
>
> Thanks!
>
> Lucy
>
> java.lang.Exception: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: Error while doing final merge
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
> Caused by: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: Error while doing final merge
>     at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:160)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:744)
> Caused by: org.apache.avro.AvroRuntimeException: Can't compare maps!
>     at org.apache.avro.io.BinaryData.compare(BinaryData.java:134)
>     at org.apache.avro.io.BinaryData.compare(BinaryData.java:139)
>     at org.apache.avro.io.BinaryData.compare(BinaryData.java:92)
>     at org.apache.avro.io.BinaryData.compare(BinaryData.java:72)
>     at org.apache.avro.mapred.AvroKeyComparator.compare(AvroKeyComparator.java:43)
>     at org.apache.hadoop.mapred.Merger$MergeQueue.lessThan(Merger.java:578)
>     at org.apache.hadoop.util.PriorityQueue.downHeap(PriorityQueue.java:144)
>     at org.apache.hadoop.util.PriorityQueue.adjustTop(PriorityQueue.java:108)
>     at org.apache.hadoop.mapred.Merger$MergeQueue.adjustPriorityQueue(Merger.java:524)
>     at org.apache.hadoop.mapred.Merger$MergeQueue.next(Merger.java:539)
>     at org.apache.hadoop.mapred.Merger.writeFile(Merger.java:209)
>     at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.finalMerge(MergeManagerImpl.java:731)
>     at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.close(MergeManagerImpl.java:370)
>     at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:158)
>     ... 7 more

--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
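
P.S. For what it's worth, here is a rough sketch of the "List of pairs sorted by key" idea in plain Java. The class and method names (SortedMapPairs, toSortedPairs, toMap) are made up for illustration and are not part of Crunch or Avro. It only shows the conversion itself; how the resulting list gets mapped to a PType in your pipeline is not covered here. It also assumes the map has no null keys, since TreeMap rejects them.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: converts a Map<String, Double> to and from a
// key-sorted list of (key, value) pairs.
final class SortedMapPairs {

    private SortedMapPairs() {}

    // Flatten the map into a list of pairs ordered by key, so that two equal
    // maps always yield identical lists regardless of HashMap iteration order.
    static List<Map.Entry<String, Double>> toSortedPairs(Map<String, Double> map) {
        List<Map.Entry<String, Double>> pairs = new ArrayList<>();
        // TreeMap iterates in the natural (lexicographic) order of the String keys.
        for (Map.Entry<String, Double> e : new TreeMap<>(map).entrySet()) {
            pairs.add(new AbstractMap.SimpleImmutableEntry<>(e.getKey(), e.getValue()));
        }
        return pairs;
    }

    // Rebuild a map from the list of pairs once the comparison is done.
    static Map<String, Double> toMap(List<Map.Entry<String, Double>> pairs) {
        Map<String, Double> map = new HashMap<>();
        for (Map.Entry<String, Double> e : pairs) {
            map.put(e.getKey(), e.getValue());
        }
        return map;
    }

    public static void main(String[] args) {
        Map<String, Double> features = new HashMap<>();
        features.put("b", 2.0);
        features.put("a", 1.0);
        // The pairs come out ordered by key: "a" first, then "b".
        System.out.println(toSortedPairs(features));
    }
}
```

The idea would be for MyClass to hold the sorted list in place of the Map field d, so the records being compared contain only lists and primitives, and to call toMap wherever the code still needs map-style lookups.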
