data discrepancies related to parallelism

Kurt Muehlner Thu, 05 May 2016 10:39:55 -0700

Hello,

We have a Pig/Tez application which is exhibiting a strange problem.  This 
application was recently migrated from Pig/MR to Pig/Tez.  We carefully vetted 
during QA that both MR and Tez versions produced identical results.  However, 
after deploying to production, we noticed that occasionally, results are not 
the same (either as compared to MR results, or results of Tez processing the 
same data on a QA cluster).


We’re still looking into the root cause, but I’d like to reach out to the user 
group in case anyone has seen anything similar, or has suggestions on what 
might be wrong/what to investigate.

*** What we know so far ***
Results discrepancy occurs ONLY when the number of containers given to the 
application by YARN is less than the number requested (we have disabled 
auto-parallelism, and are using SET_DEFAULT_PARALLEL=50 in all pig scripts).  
When this occurs, we also see a corresponding discrepancy in the file system 
counters HDFS_READ_OPS and HDFS_BYTES_READ (lower when the number of 
containers is low), despite the fact that in all cases the number of records 
processed is identical.

Thus, when the production cluster is very busy, we get invalid results.  We 
have kept a separate instance of the Pig/Tez application running on another 
cluster where it never competes for resources, so we have been able to compare 
results for each run of the application, which has allowed us to diagnose the 
problem this far.  By comparing results on these two clusters, we also know 
that the ratio (actual HDFS_READ_OPS)/(expected HDFS_READ_OPS) correlates with 
the ratio (actual containers)/(requested containers).  Likewise, we see the 
same correlation between hdfs ops ratio and container ratio.

Below are some relevant counters.  For each counter, the first line is the 
value from the production cluster showing the problem, and the second line is 
the value from the QA cluster running on the same data.

Any hints/suggestions/questions are most welcome.

Thanks,
Kurt

org.apache.tez.common.counters.DAGCounter

  NUM_SUCCEEDED_TASKS=950
  NUM_SUCCEEDED_TASKS=950
  
  TOTAL_LAUNCHED_TASKS=950
  TOTAL_LAUNCHED_TASKS=950
  
File System Counters

  FILE_BYTES_READ=7745801982
  FILE_BYTES_READ=8003771938

  FILE_BYTES_WRITTEN=9725468612
  FILE_BYTES_WRITTEN=9675253887

  *HDFS_BYTES_READ=9487600888  (when the number of containers equals the
   number requested, this counter is the same between the two clusters)
  *HDFS_BYTES_READ=17996466110

  *HDFS_READ_OPS=3080  (when the number of containers equals the number
   requested, this counter is the same between the two clusters)
  *HDFS_READ_OPS=3600

  HDFS_WRITE_OPS=900
  HDFS_WRITE_OPS=900

org.apache.tez.common.counters.TaskCounter
  INPUT_RECORDS_PROCESSED=28729671
  INPUT_RECORDS_PROCESSED=28729671


  OUTPUT_RECORDS=33655895
  OUTPUT_RECORDS=33655895

  OUTPUT_BYTES=28290888628
  OUTPUT_BYTES=28294000270
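
To make the divergence easier to see, the counter pairs above can be expressed as production/QA ratios. The following is only a minimal sketch in Python: the dictionary layout and the helper name `ratio_report` are illustrative assumptions, not part of the Pig/Tez application, and the values are copied by hand from the listing above.

```python
# Counter values copied from the listing above as (production, QA) pairs.
COUNTERS = {
    "FILE_BYTES_READ": (7745801982, 8003771938),
    "FILE_BYTES_WRITTEN": (9725468612, 9675253887),
    "HDFS_BYTES_READ": (9487600888, 17996466110),
    "HDFS_READ_OPS": (3080, 3600),
    "HDFS_WRITE_OPS": (900, 900),
    "INPUT_RECORDS_PROCESSED": (28729671, 28729671),
    "OUTPUT_RECORDS": (33655895, 33655895),
    "OUTPUT_BYTES": (28290888628, 28294000270),
}


def ratio_report(counters):
    """Return {counter name: production value / QA value}."""
    return {name: prod / qa for name, (prod, qa) in counters.items()}


if __name__ == "__main__":
    for name, ratio in ratio_report(COUNTERS).items():
        # Mark every counter whose two values are not identical.
        flag = "" if ratio == 1.0 else "  <-- differs"
        print(f"{name:26s} {ratio:.4f}{flag}")
```

With the values above, HDFS_READ_OPS comes out at roughly 0.86 and HDFS_BYTES_READ at roughly 0.53, while the record counters are exactly 1.0.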

Input(s):
Successfully read 2254733 records (1632743360 bytes) from: "input1"
Successfully read 2254733 records (1632743360 bytes) from: "input1"


Output(s):
Successfully stored 0 records in: "output1"
Successfully stored 0 records in: "output1"

Successfully stored 56019 records (10437069 bytes) in: "output2"
Successfully stored 56019 records (10437069 bytes) in: "output2"

Successfully stored 2254733 records (1651936175 bytes) in: "output3"
Successfully stored 2254733 records (1651936175 bytes) in: "output3"

Successfully stored 1160599 records (823479742 bytes) in: "output4"
Successfully stored 1160599 records (823480450 bytes) in: "output4"

Successfully stored 28605 records (21176320 bytes) in: "output5"
Successfully stored 28605 records (21177552 bytes) in: "output5"

Successfully stored 6574 records (4442933 bytes) in: "output6"
Successfully stored 6574 records (4442933 bytes) in: "output6"

Successfully stored 111416 records (164375858 bytes) in: "output7"
Successfully stored 111416 records (164379800 bytes) in: "output7"

Successfully stored 542 records (387761 bytes) in: "output8"
Successfully stored 542 records (387762 bytes) in: "output8"
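
The output lines above show identical record counts for every output, but the stored byte counts differ for some of them. A small Python sketch like the one below can list those outputs. It assumes the same ordering as the counters (production first, QA second); the helper name `byte_deltas` and the data layout are hypothetical, and output1 is left out because no byte count is reported for it.

```python
# (records, bytes) per output, copied from the log lines above.
# Assumed ordering: production first, QA second.
OUTPUTS = {
    "output2": ((56019, 10437069), (56019, 10437069)),
    "output3": ((2254733, 1651936175), (2254733, 1651936175)),
    "output4": ((1160599, 823479742), (1160599, 823480450)),
    "output5": ((28605, 21176320), (28605, 21177552)),
    "output6": ((6574, 4442933), (6574, 4442933)),
    "output7": ((111416, 164375858), (111416, 164379800)),
    "output8": ((542, 387761), (542, 387762)),
}


def byte_deltas(outputs):
    """Return {output: QA bytes - production bytes} for outputs whose
    record counts match but whose stored byte counts differ."""
    deltas = {}
    for name, ((prod_recs, prod_bytes), (qa_recs, qa_bytes)) in outputs.items():
        if prod_recs == qa_recs and prod_bytes != qa_bytes:
            deltas[name] = qa_bytes - prod_bytes
    return deltas


if __name__ == "__main__":
    for name, delta in byte_deltas(OUTPUTS).items():
        print(f"{name}: same record count, byte difference {delta}")
```

For the values shown, this reports output4, output5, output7, and output8 as differing in size despite having matching record counts.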
