Hello,

We have a Pig/Tez application which is exhibiting a strange problem. This application was recently migrated from Pig/MR to Pig/Tez. During QA we carefully verified that the MR and Tez versions produced identical results. However, after deploying to production, we noticed that the results occasionally differ (compared either with the MR results, or with the results of Tez processing the same data on a QA cluster).
We're still looking into the root cause, but I'd like to reach out to the user group in case anyone has seen anything similar, or has suggestions on what might be wrong or what to investigate.

*** What we know so far ***

The results discrepancy occurs ONLY when the number of containers given to the application by YARN is less than the number requested (we have disabled auto-parallelism, and are using SET_DEFAULT_PARALLEL=50 in all pig scripts). When this occurs, we also see a corresponding discrepancy in the file system counters HDFS_READ_OPS and HDFS_BYTES_READ (both are lower when the number of containers is low), despite the fact that in all cases the number of records processed is identical.

Thus, when the production cluster is very busy, we get invalid results. We have kept a separate instance of the Pig/Tez application running on another cluster where it never competes for resources, so we have been able to compare results for each run of the application, which has allowed us to diagnose the problem this far.

By comparing results on these two clusters, we also know that the ratio (actual HDFS_READ_OPS)/(expected HDFS_READ_OPS) correlates with the ratio (actual containers)/(requested containers). Likewise, we see the same correlation between hdfs ops ratio and container ratio.

Below are some relevant counters. For each counter, the first line is the value from the production cluster showing the problem, and the second line is the value from the QA cluster running on the same data.

Any hints/suggestions/questions are most welcome.
Thanks,
Kurt

org.apache.tez.common.counters.DAGCounter
  NUM_SUCCEEDED_TASKS=950
  NUM_SUCCEEDED_TASKS=950
  TOTAL_LAUNCHED_TASKS=950
  TOTAL_LAUNCHED_TASKS=950

File System Counters
  FILE_BYTES_READ=7745801982
  FILE_BYTES_READ=8003771938
  FILE_BYTES_WRITTEN=9725468612
  FILE_BYTES_WRITTEN=9675253887
  *HDFS_BYTES_READ=9487600888 (when number of containers equals the number requested, this counter is the same between the two clusters)
  *HDFS_BYTES_READ=17996466110
  *HDFS_READ_OPS=3080 (when number of containers equals the number requested, this counter is the same between the two clusters)
  *HDFS_READ_OPS=3600
  HDFS_WRITE_OPS=900
  HDFS_WRITE_OPS=900

org.apache.tez.common.counters.TaskCounter
  INPUT_RECORDS_PROCESSED=28729671
  INPUT_RECORDS_PROCESSED=28729671
  OUTPUT_RECORDS=33655895
  OUTPUT_RECORDS=33655895
  OUTPUT_BYTES=28290888628
  OUTPUT_BYTES=28294000270

Input(s):
  Successfully read 2254733 records (1632743360 bytes) from: "input1"
  Successfully read 2254733 records (1632743360 bytes) from: "input1"

Output(s):
  Successfully stored 0 records in: "output1"
  Successfully stored 0 records in: "output1"
  Successfully stored 56019 records (10437069 bytes) in: "output2"
  Successfully stored 56019 records (10437069 bytes) in: "output2"
  Successfully stored 2254733 records (1651936175 bytes) in: "output3"
  Successfully stored 2254733 records (1651936175 bytes) in: "output3"
  Successfully stored 1160599 records (823479742 bytes) in: "output4"
  Successfully stored 1160599 records (823480450 bytes) in: "output4"
  Successfully stored 28605 records (21176320 bytes) in: "output5"
  Successfully stored 28605 records (21177552 bytes) in: "output5"
  Successfully stored 6574 records (4442933 bytes) in: "output6"
  Successfully stored 6574 records (4442933 bytes) in: "output6"
  Successfully stored 111416 records (164375858 bytes) in: "output7"
  Successfully stored 111416 records (164379800 bytes) in: "output7"
  Successfully stored 542 records (387761 bytes) in: "output8"
  Successfully stored 542 records (387762 bytes) in: "output8"
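
For anyone who wants to repeat the comparison on their own runs, here is a minimal sketch of how the production/QA counter pairs listed above can be turned into ratios and screened for large deviations. The counter values are copied from this post; the function names and the 5% tolerance are assumptions chosen for illustration and are not part of Pig or Tez. Container counts are not listed above, so the container ratio is not computed here.

```python
# Counter values copied from the listing above, as (production, QA) pairs.
COUNTERS = {
    "NUM_SUCCEEDED_TASKS": (950, 950),
    "TOTAL_LAUNCHED_TASKS": (950, 950),
    "FILE_BYTES_READ": (7745801982, 8003771938),
    "FILE_BYTES_WRITTEN": (9725468612, 9675253887),
    "HDFS_BYTES_READ": (9487600888, 17996466110),
    "HDFS_READ_OPS": (3080, 3600),
    "HDFS_WRITE_OPS": (900, 900),
    "INPUT_RECORDS_PROCESSED": (28729671, 28729671),
    "OUTPUT_RECORDS": (33655895, 33655895),
    "OUTPUT_BYTES": (28290888628, 28294000270),
}


def counter_ratios(pairs):
    """Return production/QA ratio for each counter (hypothetical helper)."""
    return {name: prod / qa for name, (prod, qa) in pairs.items()}


def mismatches(pairs, tolerance=0.05):
    """Return counters whose ratio deviates from 1.0 by more than `tolerance`.

    The 5% default is an arbitrary illustrative threshold, not a Tez setting.
    """
    return [
        name
        for name, ratio in counter_ratios(pairs).items()
        if abs(ratio - 1.0) > tolerance
    ]


if __name__ == "__main__":
    ratios = counter_ratios(COUNTERS)
    for name in mismatches(COUNTERS):
        print(f"{name}: production/QA = {ratios[name]:.4f}")
```

With the values above, only the two HDFS read counters exceed the tolerance; the small differences in FILE_BYTES_READ, FILE_BYTES_WRITTEN and OUTPUT_BYTES stay within it.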
