Hi everybody,
We are running 6 data nodes plus one master node (HBase 1.0.0-cdh5.6.0) in both our production and our test environment. Each month we export the deltas of the previous month from the production system (using org.apache.hadoop.hbase.mapreduce.Export) and import them into the test system. From time to time we use RowCounter and an analytics MapReduce job we wrote ourselves to check that the restore is complete.

Now we see that the Export/Import has been broken since April 2019. After a lot of investigation and testing we found that the bug described in https://github.com/hortonworks-spark/shc/issues/174 causes the problems. After increasing the timeouts (client and RPC timeout) from 1 minute to 10 minutes, the row counts in the test system seem to be in good shape (we counted the rows for one month via RowCounter and a scan in the hbase shell). We are now about to roll out the changes to the production system (see the sketch at the end of this mail).

But the question remains what causes the long timeouts. Some of our tests revealed scanner timeouts after 60 seconds (the default setting). 60 seconds is nearly an eternity for a machine, so we assume something is wrong, but how can we find out? The HBase locality factor is 1.0, or close to 1.0, for most of the regions.

My questions are:
- Is it possible that "silent timeouts" cause incomplete exports?
- Is it usual for scans to take longer than 1 minute, even though the exports were apparently all fine until April?
- How can one identify regions that are in trouble?

Thank you and best regards
Udo
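P.S. For reference, the change we are about to roll out looks roughly like the sketch below. It is only a sketch: we assume the relevant settings are the standard client properties hbase.client.scanner.timeout.period and hbase.rpc.timeout, and the table name, output directory and time range are placeholders.

    # Export the previous month's deltas with the scanner and RPC timeouts
    # raised from the 60000 ms default to 600000 ms (10 minutes).
    hbase org.apache.hadoop.hbase.mapreduce.Export \
      -D hbase.client.scanner.timeout.period=600000 \
      -D hbase.rpc.timeout=600000 \
      <tablename> /backups/<tablename>/2019-05 1 <start-ms> <end-ms>

As far as we understand, the region servers also read hbase.client.scanner.timeout.period for the scanner lease, so the same value probably needs to go into the server-side hbase-site.xml as well, otherwise the lease can expire before the client timeout does.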
