Re: Inconsistent rows exported/counted when looking at a set, unchanged past time frame.

2018-02-20 Thread Ted Yu
If you look at
https://www.cloudera.com/documentation/enterprise/release-notes/topics/cdh_rn_fixed_in_58.html#fixed_issues585
you will see the following:

HBASE-15378 - Scanner cannot handle heartbeat message with no results

which fixed the behavior you observed in the previous release.

FYI


RE: Inconsistent rows exported/counted when looking at a set, unchanged past time frame.

2018-02-20 Thread Andrew Kettmann
Josh,

We upgraded from CDH 5.8.0 to 5.8.5, which seems to have fixed the issue. Three 
row counts in a row that were inconsistent before on a static table are now 
consistent. We are doing some further testing, but it looks like you called it 
with:

'scans on RegionServers stop prematurely before all of the data is read'

Thanks for the pointer in that direction, I was bashing my face against this 
for two weeks trying to figure out this inconsistency. I appreciate the clue!

Andrew Kettmann
Consultant, Platform Services Group

-----Original Message-----
From: Josh Elser [mailto:els...@apache.org] 
Sent: Monday, February 12, 2018 11:59 AM
To: user@hbase.apache.org
Subject: Re: Inconsistent rows exported/counted when looking at a set, 
unchanged past time frame.

Hi Andrew,

Yes. The answer is, of course, that you should see consistent results from 
HBase if there are no mutations in flight to that table. Whether you're reading 
"current" or "back-in-time", as long as you're not dealing with raw scans 
(where compactions may persist delete tombstones), this should hold just the 
same.

Are you modifying older cells with newer data when you insert data? 
Remember that MAX_VERSIONS for a table defaults to 1. Consider the
following:

* Timestamps are of the form "tX", and t1 < t2 < t3 < ...
* You are querying over the time range [t1, t5].
* You have a cell for "row1" at t3 with value "foo".
* RowCounter over [t1, t5] would return "1".
* Your ingest writes a new cell for "row1" of "bar" at t6.
* RowCounter over [t1, t5] would return "0" normally, or "1" if you use RAW 
scans ***
* A compaction would run over the region containing "row1".
* RowCounter over [t1, t5] would return "0" (RAW or normal).
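
For reference, here is a minimal hbase shell sketch of that sequence (the 
table, family, and qualifier names are made up, and the small-integer 
timestamps just mirror the tX notation above):

create 't1', {NAME => 'f', VERSIONS => 1}
put 't1', 'row1', 'f:q', 'foo', 3    # cell for "row1" at t3
scan 't1', {TIMERANGE => [1, 5]}     # shows row1 => "foo"
put 't1', 'row1', 'f:q', 'bar', 6    # newer cell at t6; VERSIONS => 1 shadows t3
scan 't1', {TIMERANGE => [1, 5]}     # now returns nothing
scan 't1', {TIMERANGE => [1, 5], RAW => true, VERSIONS => 2}
                                     # may still show "foo" until a compaction runs
major_compact 't1'                   # once it completes, the RAW scan is empty too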

It's also possible that you're hitting some sort of bug around missing records 
at query time. I'm not sure what the CDH versions you're using line up to, but 
there have certainly been issues in the past around query-time data loss (e.g. 
scans on RegionServers stop prematurely before all of the data is read).

Good luck!

*** Going off of memory here. I think this is how it works, but you should be 
able to test easily ;)

On 2/9/18 5:30 PM, Andrew Kettmann wrote:
> A simpler question would be this:
> 
> Given:
> 
> 
>*   a set time frame in the past (2-3 days, roughly a year ago)
>*   we are NOT removing records from the table at all
>*   we ARE inserting into this table actively
> 
> Should I expect two consecutive runs of the rowcounter mapreduce job to 
> return an identical number?
> 
> 
> Andrew Kettmann
> Consultant, Platform Services Group
> 
> From: Andrew Kettmann
> Sent: Thursday, February 08, 2018 11:35 AM
> To: user@hbase.apache.org
> Subject: Inconsistent rows exported/counted when looking at a set, unchanged 
> past time frame.
> 
> First the version details:
> 
> Running HBase/YARN/HDFS using Cloudera Manager 5.12.1.
> HBase: Version 1.2.0-cdh5.8.0
> HDFS/YARN: Hadoop 2.6.0-cdh5.8.0
> hbck and hdfs fsck report healthy
> 
> 15 nodes, recently sized down from 30 (requirements for other services, 
> e.g. Solr, were reduced)
> 
> 
> The simplest example of the inconsistency is using rowcounter. If I run the 
> same mapreduce job twice in a row, I get different counts:
> 
> hbase org.apache.hadoop.hbase.mapreduce.Driver rowcounter 
> -Dmapreduce.map.speculative=false TABLENAME --starttime=148590720 
> --endtime=148605840
> 
> Looking at 
> org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper$Counters:
> Run 1: 4876683
> Run 2: 4866351
> 
> The same happens with exports over the same date/time range. Consecutive runs 
> of the export get different results:
> hbase org.apache.hadoop.hbase.mapreduce.Export \
> -Dmapred.map.tasks.speculative.execution=false \
> -Dmapred.reduce.tasks.speculative.execution=false \
> TABLENAME HDFSPATH 1 148590720 148605840
> 
> From Map input/output records:
> Run 1: 4296778
> Run 2: 4297307
> 
> None of the results show anything for spilled records, and there are no 
> failed maps. Sometimes the row count increases, sometimes it decreases. We 
> aren't using any row filter queries; we just want to export chunks of the 
> data for a specific time range. This table is actively being read/written 
> to, but I am asking about a date range in early 2017 in this case, so I 
> would have thought that should have no impact. Another point is that the 
> rowcount job and the export return wildly different numbers. There should be 
> no older versions of rows involved, as we are set to keep only the newest, 
> and I can confirm that there are rows that are consistently missing from the 
> exports. Table definition is below.
> 
> hbase(main):001:0> describe 'TABLENAME'
> Table TABLENAME is ENABLED
> TABLENAME
> COLUMN FAMILIES DESCRIPTION
> {NAME => 'text', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'ROW', 
> REPLICATION_SCOPE => '0', COMPRESSION => 'SNAPPY', VERSIONS => '1', 
> MIN_VERSIONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'FALSE', 
> BLOCKSIZE => '65536', IN_MEM

Re: Want to change key structure

2018-02-20 Thread anil gupta
Hi Marcell,

Since the key is changing, you will need to rewrite the entire table. I think
generating HFiles (rather than doing puts) will be the most efficient approach
here. IIRC, you will need to use HFileOutputFormat in your MR job.
For locality, I don't think you should worry that much, because major
compaction usually takes care of it. If you want very high locality from the
beginning, you can run a major compaction on the new table after your
initial load.
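
For what it's worth, a rough sketch of the loading end from the command line
(the staging path and table name below are made up). The MR job itself would
write HFiles via HFileOutputFormat2.configureIncrementalLoad(job, table,
regionLocator), which wires in a partitioner so the output files line up with
the new table's region boundaries; after that job finishes:

# pre-split NEW_TABLE on the new key distribution first, then bulk load
# the HFiles the job wrote to the staging directory:
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /staging/hfiles NEW_TABLE

# optionally, force locality right away rather than waiting (hbase shell):
major_compact 'NEW_TABLE'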

HTH,
Anil Gupta

On Mon, Feb 19, 2018 at 11:46 PM, Marcell Ortutay wrote:

> I have a large HBase table (~10 TB) that has an existing key structure.
> Based on some recent analysis, the key structure is causing performance
> problems for our current query load. I would like to re-write the table
> with a new key structure that performs substantially better.
>
> What is the best way to go about re-writing this table? Since the key
> structure will change, it will affect locality, so all the data will have
> to move to a new location. If anyone can point to examples of code that
> does something like this, that would be very helpful.
>
> Thanks,
> Marcell
>



-- 
Thanks & Regards,
Anil Gupta


Save the date: ApacheCon North America, September 24-27 in Montréal

2018-02-20 Thread Rich Bowen

Dear Apache Enthusiast,

(You’re receiving this message because you’re subscribed to a user@ or 
dev@ list of one or more Apache Software Foundation projects.)


We’re pleased to announce the upcoming ApacheCon [1] in Montréal, 
September 24-27. This event is all about you — the Apache project community.


We’ll have four tracks of technical content this time, as well as lots 
of opportunities to connect with your project community, hack on the 
code, and learn about other related (and unrelated!) projects across the 
foundation.


The Call For Papers (CFP) [2] and registration are now open. Register 
early to take advantage of the early bird prices and secure your place 
at the event hotel.


Important dates
March 30: CFP closes
April 20: CFP notifications sent
August 24: Hotel room block closes (please do not wait until the last 
minute)


Follow @ApacheCon on Twitter to be the first to hear announcements about 
keynotes, the schedule, evening events, and everything you can expect to 
see at the event.


See you in Montréal!

Sincerely, Rich Bowen, V.P. Events,
on behalf of the entire ApacheCon team

[1] http://www.apachecon.com/acna18
[2] https://cfp.apachecon.com/conference.html?apachecon-north-america-2018