Thanks to a bug fix put in by a colleague of mine, merge joins now work for tables loaded into Pig via HBaseStorage. In our test environment, and in Pig's own test environment, I'm able to do all sorts of fairly complex data merging without issue.
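For context, the script involved has roughly the following shape. This is a hypothetical sketch only (the table names, column families, and join key below are invented for illustration, and the real script is not reproduced here):

```pig
-- Hypothetical sketch; names are invented, not the actual production script.
a = LOAD 'hbase://table_a'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:val', '-loadKey true')
    AS (rowkey:chararray, a_val:chararray);
b = LOAD 'hbase://table_b'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:val', '-loadKey true')
    AS (rowkey:chararray, b_val:chararray);

-- Merge join requires both inputs pre-sorted on the join key;
-- HBase row keys are already sorted, which is why this should work.
c = JOIN a BY rowkey, b BY rowkey USING 'merge';

-- A count over the joined relation, like the "count portion" described in this post.
d = FOREACH (GROUP c ALL) GENERATE COUNT(c);
DUMP d;
```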
However, when I run that same code against larger data sets in a production environment, the merge join fails. If I run it on the exact same tables on the same cluster after trimming the data down to just a few rows, the merge join works fine. Here is the most basic version of the Pig script I've been able to get; I've been taking out pieces and parts trying to narrow it down, but it still fails:

If I change the count portion to a LIMIT 5 or something, I'm able to dump the relation. The merge join finishes all of its mappers, but when it gets to the reduce step and starts doing a sort (don't ask me why it's even doing a sort on pre-sorted data), it throws the following error:

2016-03-09 19:36:01,738 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: Error while doing final merge
    at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:160)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.ClassCastException: org.apache.pig.backend.hadoop.hbase.TableSplitComparable cannot be cast to org.apache.hadoop.hbase.mapreduce.TableSplit
    at org.apache.pig.backend.hadoop.hbase.TableSplitComparable.compareTo(TableSplitComparable.java:26)
    at org.apache.pig.data.DataType.compare(DataType.java:566)
    at org.apache.pig.data.DataType.compare(DataType.java:464)
    at org.apache.pig.data.BinInterSedes$BinInterSedesTupleRawComparator.compareDatum(BinInterSedes.java:1106)
    at org.apache.pig.data.BinInterSedes$BinInterSedesTupleRawComparator.compare(BinInterSedes.java:1082)
    at org.apache.pig.data.BinInterSedes$BinInterSedesTupleRawComparator.compareBinSedesTuple(BinInterSedes.java:787)
    at org.apache.pig.data.BinInterSedes$BinInterSedesTupleRawComparator.compare(BinInterSedes.java:728)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTupleSortComparator.compare(PigTupleSortComparator.java:100)
    at org.apache.hadoop.mapred.Merger$MergeQueue.lessThan(Merger.java:587)
    at org.apache.hadoop.util.PriorityQueue.upHeap(PriorityQueue.java:128)
    at org.apache.hadoop.util.PriorityQueue.put(PriorityQueue.java:55)
    at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:678)
    at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:596)
    at org.apache.hadoop.mapred.Merger.merge(Merger.java:131)
    at org.apache.hadoop.mapred.Merger.merge(Merger.java:115)
    at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.finalMerge(MergeManagerImpl.java:722)
    at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.close(MergeManagerImpl.java:370)
    at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:158)

If I switch the order of the two relations in the merge join, I get a different error, which looks more promising, but I still don't know what to do about it:

2016-03-09 19:55:24,789 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: c: Local Rearrange[tuple]{chararray}(false) - scope-334 Operator Key: scope-334): org.apache.pig.backend.executionengine.ExecException: ERROR 0: Error while executing ForEach at [c[62,4]]
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:316)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNextTuple(POLocalRearrange.java:291)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:279)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:274)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Error while executing ForEach at [c[62,4]]
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:325)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:307)
    ... 12 more
Caused by: java.lang.NullPointerException
    at org.apache.pig.impl.builtin.DefaultIndexableLoader.seekNear(DefaultIndexableLoader.java:190)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.seekInRightStream(POMergeJoin.java:542)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNextTuple(POMergeJoin.java:299)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:307)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPreCombinerLocalRearrange.getNextTuple(POPreCombinerLocalRearrange.java:126)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:307)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:252)

Again, I've tried replicating the exact scenario (and more complicated ones) in local environments, and I can't get it to fail. I think it's related to YARN/MapReduce, but I can't figure out why that would matter or what it's really doing. I'm trying to set up the e2e (end-to-end) tests in the Pig repo, but I'm not having any luck there, either. If I can't get a failing test, I'm afraid I won't be able to fix the bug. Can anyone point me in the right direction on next debugging steps or what might be wrong?

William Watson
Lead Software Engineer