Unfortunately, without already knowing that that is the cause, it is difficult to get to that point. Container logs, NodeManager logs: nothing indicated anything incorrect was happening other than the inconsistent export/rowcounter results. I had reviewed all the HBase/YARN/HDFS bugs in the list but didn't see one that seemed like a smoking gun, just a bunch of possible ones. My ignorance of the inner workings of HBase/YARN likely played a big part in that, though. I do appreciate you pointing out 'the one'!
From: Ted Yu
Sent: Tuesday, February 20, 11:15 PM
Subject: Re: Inconsistent rows exported/counted when looking at a set, unchanged past time frame.
To: user@hbase.apache.org

If you look at https://www.cloudera.com/documentation/enterprise/release-notes/topics/cdh_rn_fixed_in_58.html#fixed_issues585 , you would see the following:

HBASE-15378 - Scanner cannot handle heartbeat message with no results

which fixed what you observed in the previous release.

FYI

On Tue, Feb 20, 2018 at 9:07 PM, Andrew Kettmann <andrew.kettm...@evolve24.com> wrote:

> Josh,
>
> We upgraded from CDH 5.8.0 -> 5.8.5, which seems to have fixed the issue. Three rowcounts in a row that were not consistent before on a static table are now consistent. We are doing some further testing, but it looks like you called it with:
>
> 'scans on RegionServers stop prematurely before all of the data is read'
>
> Thanks for the pointer in that direction; I was bashing my face against this for two weeks trying to figure out this inconsistency. I appreciate the clue!
>
> Andrew Kettmann
> Consultant, Platform Services Group
>
> -----Original Message-----
> From: Josh Elser [mailto:els...@apache.org]
> Sent: Monday, February 12, 2018 11:59 AM
> To: user@hbase.apache.org
> Subject: Re: Inconsistent rows exported/counted when looking at a set, unchanged past time frame.
>
> Hi Andrew,
>
> Yes. The answer is, of course, that you should see consistent results from HBase if there are no mutations in flight to that table. Whether you're reading "current" or "back-in-time", as long as you're not dealing with raw scans (where compactions may persist delete tombstones), this should hold just the same.
>
> Are you modifying older cells with newer data when you insert data? Remember that MAX_VERSIONS for a table defaults to 1. Consider the following:
>
> * Timestamps are of the form "tX", and t1 < t2 < t3 < ...
> * You are querying the time range [t1, t5].
> * You have a cell for "row1" at t3 with value "foo".
> * RowCounter over [t1, t5] would return "1".
> * Your ingest writes a new cell for "row1" of "bar" at t6.
> * RowCounter over [t1, t5] would return "0" normally, or "1" if you use RAW scans ***
> * A compaction would run over the region containing "row1".
> * RowCounter over [t1, t5] would return "0" (RAW or normal).
>
> It's also possible that you're hitting some sort of bug around missing records at query time. I'm not sure which upstream versions the CDH versions you're using line up to, but there have certainly been issues in the past around query-time data loss (e.g. scans on RegionServers stopping prematurely before all of the data is read).
>
> Good luck!
>
> *** Going off of memory here. I think this is how it works, but you should be able to test easily ;)
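Josh's footnote invites testing, and his walkthrough above is easy to reproduce in the HBase shell. A minimal sketch follows; the table and family names ('tsdemo', 'f') are hypothetical, small integer timestamps stand in for t1..t6, and note that a shell TIMERANGE upper bound is exclusive, so [t1, t5] becomes [1, 6]:

  create 'tsdemo', {NAME => 'f', VERSIONS => 1}     # hypothetical table; MAX_VERSIONS = 1 as in the walkthrough
  put 'tsdemo', 'row1', 'f:q', 'foo', 3             # cell at t3
  scan 'tsdemo', {TIMERANGE => [1, 6]}              # returns row1 -> "foo"
  put 'tsdemo', 'row1', 'f:q', 'bar', 6             # newer cell at t6; only one version is retained
  scan 'tsdemo', {TIMERANGE => [1, 6]}              # returns nothing: the newest version is at t6, outside the range
  scan 'tsdemo', {TIMERANGE => [1, 6], RAW => true, VERSIONS => 10}   # may still show "foo" while the shadowed cell physically remains
  flush 'tsdemo'
  major_compact 'tsdemo'                            # asynchronous; allow it to finish before the next scan
  scan 'tsdemo', {TIMERANGE => [1, 6], RAW => true, VERSIONS => 10}   # "foo" purged; empty, RAW or not

Whether the RAW scan still shows "foo" before the major compaction depends on what is physically present in memstore and store files, which is exactly the behavior Josh hedges on in his footnote.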
> On 2/9/18 5:30 PM, Andrew Kettmann wrote:
> > A simpler question would be this:
> >
> > Given:
> >
> > * a set timeframe in the past (2-3 days, roughly a year ago)
> > * we are NOT removing records from the table at all
> > * we ARE inserting into this table actively
> >
> > Should I expect two consecutive runs of the rowcounter mapreduce job to return an identical number?
> >
> > Andrew Kettmann
> > Consultant, Platform Services Group
> >
> > From: Andrew Kettmann
> > Sent: Thursday, February 08, 2018 11:35 AM
> > To: user@hbase.apache.org
> > Subject: Inconsistent rows exported/counted when looking at a set, unchanged past time frame.
> >
> > First the version details:
> >
> > Running HBase/YARN/HDFS using Cloudera Manager 5.12.1.
> > HBase: Version 1.2.0-cdh5.8.0
> > HDFS/YARN: Hadoop 2.6.0-cdh5.8.0
> > hbck and hdfs fsck return healthy
> >
> > 15 nodes, recently sized down from 30 (requirements for other services, e.g. Solr, were reduced)
> >
> > The simplest example of the inconsistency is using rowcounter. If I run the same mapreduce job twice in a row, I get different counts:
> >
> > hbase org.apache.hadoop.hbase.mapreduce.Driver rowcounter \
> >   -Dmapreduce.map.speculative=false TABLENAME \
> >   --starttime=1485907200000 --endtime=1486058400000
> >
> > Looking at org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper$Counters:
> > Run 1: 4876683
> > Run 2: 4866351
> >
> > Similarly with exports of the same date/time range, consecutive runs of the export get different results:
> >
> > hbase org.apache.hadoop.hbase.mapreduce.Export \
> >   -Dmapred.map.tasks.speculative.execution=false \
> >   -Dmapred.reduce.tasks.speculative.execution=false \
> >   TABLENAME HDFSPATH 1 1485907200000 1486058400000
> >
> > From Map input/output records:
> > Run 1: 4296778
> > Run 2: 4297307
> >
> > None of the results show anything for spilled records, and there are no failed maps. Sometimes the row count increases, sometimes it decreases. We aren't using any row filter queries; we just want to export chunks of the data for a specific time range. This table is actively being read from and written to, but I am asking about a date range in early 2017 in this case, so I would have thought that should have no impact. Another point is that the rowcount job and the export return ridiculously different numbers. There should be no older versions of rows involved, as we are set to keep only the newest version, and I can confirm that there are rows that are consistently missing from the exports. The table definition is below.
> >
> > hbase(main):001:0> describe 'TABLENAME'
> > Table TABLENAME is ENABLED
> > TABLENAME
> > COLUMN FAMILIES DESCRIPTION
> > {NAME => 'text', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW',
> > REPLICATION_SCOPE => '0', COMPRESSION => 'SNAPPY', VERSIONS => '1',
> > MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE',
> > BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
> > 1 row(s) in 0.2800 seconds
> >
> > Any advice/suggestions would be greatly appreciated. Are some of my assumptions wrong regarding import/export, i.e. that results should be consistent given consistent start/end times?
> >
> > Andrew Kettmann
> > Platform Services Group
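Since the early-2017 window is static, two consecutive RowCounter runs over it should match exactly, and any drift points at a read-path bug such as HBASE-15378 rather than at ingest. A minimal sketch of that consistency check, wrapping the exact command from the thread; it assumes the job's console output includes the counter as a line containing "ROWS=<count>":

  #!/usr/bin/env bash
  # Sketch: run RowCounter twice over the same frozen time window and compare
  # the ROWS counter. TABLE/START/END are taken from the thread above; the
  # grep pattern is an assumption about how the counter appears in the output.
  set -euo pipefail

  TABLE=TABLENAME
  START=1485907200000
  END=1486058400000

  count_rows() {
    hbase org.apache.hadoop.hbase.mapreduce.Driver rowcounter \
      -Dmapreduce.map.speculative=false \
      "$TABLE" --starttime="$START" --endtime="$END" 2>&1 \
      | grep -o 'ROWS=[0-9]*' | tail -1 | cut -d= -f2
  }

  run1=$(count_rows)
  run2=$(count_rows)
  echo "run1=$run1 run2=$run2"
  if [ "$run1" = "$run2" ]; then
    echo "consistent"
  else
    echo "INCONSISTENT: a static window should always count the same"
  fi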