RE: Inconsistent rows exported/counted when looking at a set, unchanged past time frame.

2018-02-14 Thread Andrew Kettmann
Took a dump of the involved table, reimported to the same cluster under a 
different name. This is a separate table now that is not being modified at all. 
Two consecutive/concurrent counts were different:

hbase org.apache.hadoop.hbase.mapreduce.Driver rowcounter 
-Dmapreduce.map.speculative=false ImportedTableName --starttime=148590720 
--endtime=148605840

Count#1 3508052
Count#2 3584553

Do you happen to have more information regarding some of the query-time data 
loss issues you were mentioning?

My versioning SHOULD map roughly to version 1.2.0 for HBASE. 

Hbase: Version 1.2.0-cdh5.8.0
HDFS/YARN: Hadoop 2.6.0-cdh5.8.0

Andrew Kettmann
Consultant, Platform Services Group

-----Original Message-----
From: Josh Elser [mailto:els...@apache.org] 
Sent: Monday, February 12, 2018 11:59 AM
To: user@hbase.apache.org
Subject: Re: Inconsistent rows exported/counted when looking at a set, 
unchanged past time frame.

Hi Andrew,

Yes. The answer is, of course, that you should see consistent results from 
HBase if there are no mutations in flight to that table. Whether you're reading 
"current" or "back-in-time", as long as you're not dealing with raw scans 
(where compactions may persist delete tombstones), this should hold just the 
same.

Are you modifying older cells with newer data when you insert data? 
Remember that MAX_VERSIONS for a table defaults to 1. Consider the
following:

* Timestamps are of the form "tX", and t1 < t2 < t3 < ..
* You are querying from the time range: [t1, t5].
* You have a cell for "row1" at t3 with value "foo".
* RowCounter over [t1, t5] would return "1"
* Your ingest writes a new cell for "row1" of "bar" at t6.
* RowCounter over [t1, t5] would return "0" normally, or "1" if you use RAW 
scans ***
* A compaction would run over the region containing "row1"
* RowCounter over [t1, t5] would return "0" (RAW or normal)
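
The sequence above can be sketched as a small toy model. This is not HBase code and uses no HBase API: `ToyTable`, `put`, `compact`, and `row_count` are hypothetical names invented for illustration, and the pre-compaction behaviour of normal versus RAW scans is taken from the steps as described here (which are flagged as being from memory), so treat it as an assumption to verify against a real cluster.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Toy model of a table that keeps max_versions cells per row (not HBase)."""
    max_versions: int = 1
    cells: dict = field(default_factory=dict)  # row key -> list of (timestamp, value)

    def put(self, row, ts, value):
        # Writes only append; older versions stay on "disk" until compaction.
        self.cells.setdefault(row, []).append((ts, value))

    def compact(self):
        # Physically drop everything except the newest max_versions cells per row.
        for versions in self.cells.values():
            versions.sort(reverse=True)
            del versions[self.max_versions:]

    def row_count(self, start, end, raw=False):
        # Count rows with at least one visible cell whose timestamp is in [start, end].
        # Assumption from the steps above: a normal scan only sees the newest
        # max_versions cells, while a raw scan sees every cell still stored.
        count = 0
        for versions in self.cells.values():
            ordered = sorted(versions, reverse=True)
            visible = ordered if raw else ordered[:self.max_versions]
            if any(start <= ts <= end for ts, _ in visible):
                count += 1
        return count


table = ToyTable()
table.put("row1", 3, "foo")                          # cell at t3
count_initial = table.row_count(1, 5)                # query [t1, t5]
table.put("row1", 6, "bar")                          # ingest overwrites at t6
count_normal = table.row_count(1, 5)
count_raw = table.row_count(1, 5, raw=True)
table.compact()
count_after_normal = table.row_count(1, 5)
count_after_raw = table.row_count(1, 5, raw=True)
print(count_initial, count_normal, count_raw, count_after_normal, count_after_raw)
```

The point of the sketch is only that a count over a fixed past time range can change when a later write supersedes a row that used to fall inside that range, even though nothing was deleted.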

It's also possible that you're hitting some sort of bug around missing records 
at query time. I'm not sure what the CDH versions you're using line up to, but 
there have certainly been issues in the past around query-time data loss (e.g. 
scans on RegionServers stop prematurely before all of the data is read).

Good luck!

*** Going off of memory here. I think this is how it works, but you should be 
able to test easily ;)

On 2/9/18 5:30 PM, Andrew Kettmann wrote:
> A simpler question would be this:
> 
> Given:
> 
> 
> * a set timeframe in the past (2-3 days roughly a year ago)
> * we are NOT removing records from the table at all
> * We ARE inserting into this table actively
> 
> Should I expect two consecutive runs of the rowcounter mapreduce job to 
> return an identical number?
> 
> 
> Andrew Kettmann
> Consultant, Platform Services Group
> 
> From: Andrew Kettmann
> Sent: Thursday, February 08, 2018 11:35 AM
> To: user@hbase.apache.org
> Subject: Inconsistent rows exported/counted when looking at a set, unchanged 
> past time frame.
> 
> First the version details:
> 
> Running HBASE/Yarn/HDFS using Cloudera manager 5.12.1.
> Hbase: Version 1.2.0-cdh5.8.0
> HDFS/YARN: Hadoop 2.6.0-cdh5.8.0
> Hbck and hdfs fsck return healthy
> 
> 15 nodes, sized down recently from 30 (other service requirements 
> reduced. Solr, etc)
> 
> 
> The simplest example of the inconsistency is using rowcounter. If I run the 
> same mapreduce job twice in a row, I get different counts:
> 
> hbase org.apache.hadoop.hbase.mapreduce.Driver rowcounter 
> -Dmapreduce.map.speculative=false TABLENAME --starttime=148590720 
> --endtime=148605840
> 
> Looking at 
> org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper$Counters:
> Run 1: 4876683
> Run 2: 4866351
> 
> Similarly with exports of the same date/time. Consecutive runs of the export 
> get different results:
> hbase org.apache.hadoop.hbase.mapreduce.Export \
> -Dmapred.map.tasks.speculative.execution=false \
> -Dmapred.reduce.tasks.speculative.execution=false \
> TABLENAME HDFSPATH 1 148590720 148605840
> 
> From Map Input/output records:
> Run 1: 4296778
> Run 2: 4297307
> 
> None of the results show anything for spilled records, no failed maps. 
> Sometimes the row count increases, sometimes it decreases. We aren’t using 
> any row filter queries, we just want to export chunks of the data for a 
> specific time range. This table is actively being read/written to, but I am 
> asking about a date range in early 2017 in this case, so that should have no 
> impact I would have thought. Another point is that the rowcount job and the 
> export return ridiculously different numbers. There should be no older 
> versions of rows involved as we are set to only keep the newest, and I can 
> confirm that there are rows that are consistently missing from the exports. 
> Table definition is below.
> 
> hbase(main):001:0> describe 'TABLENAME'
> Table TABLENAME is ENABLED
> TABLENAME
> COLUMN FAMILIES DESCRIPTION
> {NAME => 'text', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', 
> RE

Fwd: Travel Assistance applications open. Please inform your communities

2018-02-14 Thread Misty Stanley-Jones
---------- Forwarded message ----------
From: "Gavin McDonald" 
Date: Feb 14, 2018 3:34 AM
Subject: Travel Assistance applications open. Please inform your communities
To: 
Cc:

Hello PMCs.

Please could you forward on the below email to your dev and user lists.

Thanks

Gav…

—
The Travel Assistance Committee (TAC) are pleased to announce that travel
assistance applications for ApacheCon NA 2018 are now open!

We will be supporting ApacheCon NA in Montreal, Canada, on 24th - 29th
September 2018.

TAC exists to help those that would like to attend ApacheCon events, but
are unable to do so for financial reasons.
For more info on this year's applications and qualifying criteria, please
visit the TAC website at <http://www.apache.org/travel/>. Applications are
now open and will close 1st May.

Important: Applications close on May 1st, 2018. Applicants have until the
closing date above to submit their applications (which should contain as
much supporting material as required to efficiently and accurately process
their request). This will enable TAC to announce successful awards shortly
afterwards.

As usual, TAC expects to deal with a range of applications from a diverse
range of backgrounds. We therefore encourage (as always) anyone thinking
about sending in an application to do so ASAP.
We look forward to greeting many of you in Montreal.

Kind Regards,
Gavin - (On behalf of the Travel Assistance Committee)
—