Hi Andrew,
Yes. With no mutations in flight to that table, you should see consistent
results from HBase. Whether you're reading "current" or "back-in-time",
this holds just the same, as long as you're not dealing with raw scans
(which return delete tombstones and superseded cells until a compaction
removes them).
Are you modifying older cells with newer data when you insert data?
Remember that the maximum number of versions kept per cell (VERSIONS on
the column family) defaults to 1. Consider the
following:
* Timestamps are of the form "tX", and t1 < t2 < t3 < ..
* You are querying from the time range: [t1, t5].
* You have a cell for "row1" at t3 with value "foo".
* RowCounter over [t1, t5] would return "1"
* Your ingest writes a new cell for "row1" of "bar" at t6.
* RowCounter over [t1, t5] would return "0" normally, or "1" if you use
RAW scans ***
* A compaction would run over the region containing "row1"
* RowCounter over [t1, t5] would return "0" (RAW or normal)
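To make the sequence above concrete, here is a minimal toy model in Python. It is not HBase code: ToyStore, put, compact and row_count are hypothetical names invented for this sketch, timestamps are plain integers, and the range is treated as inclusive to match the [t1, t5] notation. It only models the two states I'm sure about (before the t6 write, and after a compaction has dropped the superseded cell); the in-between state is left out on purpose.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStore:
    """Hypothetical stand-in for one column of a table; not an HBase API."""
    max_versions: int = 1
    # row key -> list of (timestamp, value) pairs
    cells: dict = field(default_factory=dict)

    def put(self, row, ts, value):
        # Writes only append; nothing is discarded until compact() runs.
        self.cells.setdefault(row, []).append((ts, value))

    def compact(self):
        # Keep only the newest max_versions cells per row and drop the rest.
        for versions in self.cells.values():
            versions.sort(reverse=True)
            del versions[self.max_versions:]

    def row_count(self, start, end):
        # Count rows with at least one stored cell inside [start, end].
        return sum(
            1
            for versions in self.cells.values()
            if any(start <= ts <= end for ts, _ in versions)
        )


store = ToyStore(max_versions=1)
store.put("row1", 3, "foo")          # cell at t3
before = store.row_count(1, 5)       # count over [t1, t5] before the new write

store.put("row1", 6, "bar")          # ingest overwrites row1 at t6
store.compact()                      # compaction drops the t3 version
after = store.row_count(1, 5)        # same range, same row, different answer

print(before, after)
```

In this model the count over the same past range goes from 1 to 0 even though nothing was ever deleted, which is the effect to look for if your ingest rewrites existing rows.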
It's also possible that you're hitting some sort of bug around missing
records at query time. I'm not sure which upstream HBase release your CDH
version lines up with, but there have certainly been issues in the past around
query-time data loss (e.g. scans on RegionServers stop prematurely
before all of the data is read).
Good luck!
*** Going off of memory here. I think this is how it works, but you
should be able to test easily ;)
On 2/9/18 5:30 PM, Andrew Kettmann wrote:
A simpler question would be this:
Given:
* a set timeframe in the past (2-3 days roughly a year ago)
* we are NOT removing records from the table at all
* We ARE inserting into this table actively
Should I expect two consecutive runs of the rowcounter mapreduce job to return
an identical number?
Andrew Kettmann
Consultant, Platform Services Group
From: Andrew Kettmann
Sent: Thursday, February 08, 2018 11:35 AM
To: user@hbase.apache.org
Subject: Inconsistent rows exported/counted when looking at a set, unchanged
past time frame.
First the version details:
Running HBase/YARN/HDFS using Cloudera Manager 5.12.1.
HBase: Version 1.2.0-cdh5.8.0
HDFS/YARN: Hadoop 2.6.0-cdh5.8.0
Hbck and hdfs fsck return healthy
15 nodes, sized down recently from 30 (other service requirements were
reduced: Solr, etc.)
The simplest example of the inconsistency is using rowcounter. If I run the
same mapreduce job twice in a row, I get different counts:
hbase org.apache.hadoop.hbase.mapreduce.Driver rowcounter \
-Dmapreduce.map.speculative=false TABLENAME --starttime=1485907200000 \
--endtime=1486058400000
Looking at
org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper$Counters:
Run 1: 4876683
Run 2: 4866351
Similarly with exports of the same date/time. Consecutive runs of the export
get different results:
hbase org.apache.hadoop.hbase.mapreduce.Export \
-Dmapred.map.tasks.speculative.execution=false \
-Dmapred.reduce.tasks.speculative.execution=false \
TABLENAME \
HDFSPATH 1 1485907200000 1486058400000
From Map Input/output records:
Run 1: 4296778
Run 2: 4297307
None of the results show anything for spilled records, no failed maps.
Sometimes the row count increases, sometimes it decreases. We aren’t using any
row filter queries, we just want to export chunks of the data for a specific
time range. This table is actively being read/written to, but I am asking about
a date range in early 2017 in this case, so that should have no impact I would
have thought. Another point is that the rowcount job and the export return
ridiculously different numbers. There should be no older versions of rows
involved as we are set to only keep the newest, and I can confirm that there
are rows that are consistently missing from the exports. Table definition is
below.
hbase(main):001:0> describe 'TABLENAME'
Table TABLENAME is ENABLED
TABLENAME
COLUMN FAMILIES DESCRIPTION
{NAME => 'text', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', REPLICATION_SCOPE =>
'0', COMPRESSION => 'SNAPPY', VERSIONS => '1', MIN_VERSIONS => '0', TTL => 'FOREVER',
KEEP_DELETED_CELLS => 'FALSE', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}
1 row(s) in 0.2800 seconds
Any advice/suggestions would be greatly appreciated. Are some of my assumptions
wrong regarding import/export, namely that it should be consistent given
consistent date/times?
Andrew Kettmann
Platform Services Group