Re: How to avoid major compaction during restart?

2018-06-28 Thread Marcell Ortutay
Er, I made a mistake in the question above; the issue is not so much the major
compaction itself, but rather that during the restart (as nodes go up and
down), Hadoop and HBase attempt to rebalance blocks and regions, causing
unnecessary data movement. So what I'm actually looking for is a way to turn
off that balancing for the duration of the restart, which would in turn avoid
the need for a major compaction afterwards.
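
For the HBase side, something along these lines is what I had in mind: a
minimal sketch using the 1.x Admin API (the shell equivalent would be
balance_switch false / true). It only covers the HBase region balancer, not
HDFS block rebalancing, which is a separate concern:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    public class BalancerToggle {
      public static void main(String[] args) throws Exception {
        try (Connection conn =
                 ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
          // Turn the region balancer off before the rolling restart.
          boolean wasEnabled = admin.setBalancerRunning(false, true);
          System.out.println("Balancer previously enabled: " + wasEnabled);

          // ... perform the rolling restart of region servers here ...

          // Re-enable the balancer once all servers are back up.
          admin.setBalancerRunning(true, true);
        }
      }
    }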

Marcell

On Thu, Jun 28, 2018 at 12:55 PM, Marcell Ortutay 
wrote:

> Hi all,
>
> I'm interested in ways to avoid a major compaction when restarting all the
> HBase region servers in a cluster (for example, for a version upgrade). Are
> there any recommended techniques for achieving this?
>
> Thanks,
> Marcell
>
>


How to avoid major compaction during restart?

2018-06-28 Thread Marcell Ortutay
Hi all,

I'm interested in ways to avoid a major compaction when restarting all the
HBase region servers in a cluster (for example, for a version upgrade). Are
there any recommended techniques for achieving this?

Thanks,
Marcell


Re: How to improve HBase read performance.

2018-05-16 Thread Marcell Ortutay
This ticket, https://issues.apache.org/jira/browse/HBASE-20459, was fixed in
the latest version of HBase; upgrading to the latest release may help with
performance.

On Wed, May 16, 2018 at 3:55 AM, Kang Minwoo 
wrote:

> Hi, Users.
>
> I store a lot of logs in HBase.
> However, reading the logs is very slow compared to reading the equivalent
> Hive ORC files.
> I know that HBase is expected to be slower than Hive ORC, but the gap seems
> too large: HBase is about 6 times slower.
>
> Is there a good way to speed up HBase's reading speed?
> Should I put a lot of servers?
>
> I am using HBase 1.2.6.
>
> Best regards,
> Minwoo Kang
>


HBase scans seem slow, compute bound. How to improve?

2018-04-16 Thread Marcell Ortutay
I'm new to HBase and am doing some performance testing for my use case.
I've noticed that HBase scans seem "slow" relative to the machine's capabilities.

Here is a bit more detail on the testing I am running. I have loaded 3 test
tables into both HBase and sqlite3 for comparison; I'm using sqlite3 as a
stand-in for the "peak" performance achievable for this operation. For HBase,
I'm running a 2-node (1 name node / 1 data node) cluster on EMR with
m3.2xlarge instances. The test tables each have 1 million rows with data
like this:

(1) 1 bigint column, 1 float column
(2) 1 bigint column, 1 float column, 100 bytes of filler data
(3) 1 bigint column, 1 float column, 1000 bytes of filler data

I randomized the filler data to attempt to limit the effects of
compression, and also ran tests with compression turned off, but that
didn't seem to have much impact.

I ran the following test queries on both HBase and sqlite3:

(a) SELECT count(*) FROM table WHERE val > .5
(b) SELECT count(*) FROM table WHERE filler like '%x%'
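
For reference, this is roughly how I run and time a query through Phoenix's
JDBC driver; the ZooKeeper quorum and table name below are placeholders
rather than my actual setup:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class QueryTimer {
      public static void main(String[] args) throws Exception {
        // "zkhost" stands in for the cluster's ZooKeeper quorum.
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zkhost");
             Statement stmt = conn.createStatement()) {
          long start = System.nanoTime();
          // Query (a) against test table (2); swap in (b) / other tables as needed.
          try (ResultSet rs = stmt.executeQuery(
              "SELECT count(*) FROM test_table_2 WHERE val > 0.5")) {
            rs.next();
            System.out.printf("count=%d elapsed=%.3fs%n",
                rs.getLong(1), (System.nanoTime() - start) / 1e9);
          }
        }
      }
    }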

In each case I ran the query twice to account for block cache in HBase.
Below are the performance numbers on the 2nd (block cached) run:

           HBase      sqlite3
1a:        1.373s     0.156s
2a:        1.538s     0.212s
2b:        3.582s     0.660s
3a:        0.980s     0.252s
3b:       11.354s     4.364s

(Row labels are table number + query letter, e.g. "2b" is query (b) run
against table (2).)

In each case, sqlite3 performs much better (2x-9x) than HBase for an
equivalent operation.

I ran some rudimentary profiling on HBase, and it seems like the bulk of the
time is spent in this function:
https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java#L6485
so I'm guessing the computation in there is what is taking so long.

I have two questions that I'm hoping to get some guidance on:

(1) Is this expected performance for HBase range scans? I'm told that HBase
is optimized for random key access, not range scans, so perhaps this
performance is a result of that tradeoff?
(2) Is there anything I can do to improve HBase range scan performance in
terms of configuration, data layout, etc.?

A few other notes:
- I'm using Phoenix for the SQL layer on top of HBase, but my profiling
revealed that the limiting factor is HBase, specifically the function cited
above
- The table sizes are 33MB, 130MB and 990MB in HBase and similar in sqlite3.

Thanks,
Marcell


HBase range scan seems slow?

2018-03-16 Thread Marcell Ortutay
Hi all,

I'm fairly new to HBase and was a bit surprised by the performance I am seeing
for a range scan. I am running a range scan over ~3 million rows in an
HBase cluster with 4 region servers, each a fairly large instance on AWS (24
HDDs). I'm pulling a single float value from each row and computing the
average.

When I run this range scan, it takes ~.5sec to execute, and repeated runs
don't get much faster. This seems long to me. 3 million floats should take
maybe 10-20MB to read from disk and transfer, and the scan should be much
faster the second time around since the data is supposed to be in the block
cache in memory by that point.
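
For concreteness, here is roughly the shape of the scan-and-average I mean,
sketched with the plain Java client (table, column, and key names below are
placeholders, not my actual schema):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ScanAverage {
      public static void main(String[] args) throws Exception {
        byte[] cf = Bytes.toBytes("cf");
        byte[] col = Bytes.toBytes("val");
        try (Connection conn =
                 ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("my_table"))) {
          Scan scan = new Scan(Bytes.toBytes("startKey"), Bytes.toBytes("stopKey"));
          scan.addColumn(cf, col);   // fetch only the float column
          scan.setCaching(1000);     // rows per RPC; small values make long scans chatty
          double sum = 0;
          long n = 0;
          try (ResultScanner scanner = table.getScanner(scan)) {
            for (Result r : scanner) {
              sum += Bytes.toFloat(r.getValue(cf, col));
              n++;
            }
          }
          System.out.println("avg = " + (sum / n));
        }
      }
    }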

Additionally, I tried running a number of these range scans concurrently
against HBase. Again, the performance seemed worse than I expected. The
average execution time goes up quite a bit at what seems like low QPS. For
example at 1 QPS, the average response time is several seconds.

Are these performance numbers typical, or is there some user error on my part
that is making them worse than normal?

Thanks,
Marcell


Re: Want to change key structure

2018-02-23 Thread Marcell Ortutay
Thanks for the info, Anil. I first tried an MR job that did Puts, based on the
examples at [1], but this was much too slow, as you said. Switching to writing
HFiles directly via HFileOutputFormat solved the issue.

Also, I wanted to post an issue I ran into, in case anyone else hits it in the
future. For a table re-write, running a reduce phase can be bad, because the
MR framework will try to sort the whole table, which is potentially multiple
TB of data. You can avoid this by calling job.setNumReduceTasks(0). However,
if you use HFileOutputFormat.configureIncrementalLoad(), that call will also
set up the reducer, which may be a bit surprising (at least it was to me). So
the order matters:

    // This ordering will have a (potentially long) reduce phase. Bad for large tables.
    job.setNumReduceTasks(0);
    HFileOutputFormat.configureIncrementalLoad(job, hTable);  // Overrides # of reduce tasks

Instead, this works better for large tables:

    // This ordering will skip the reduce phase
    HFileOutputFormat.configureIncrementalLoad(job, hTable);
    job.setNumReduceTasks(0);

Follow this up with a major compaction, which will do the sorting needed for
locality.
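
In case it is useful to anyone, here is a rough sketch of the kind of map-only
driver this implies. It is written against the 1.x-era client and MapReduce
utilities, and the class, table, and key-derivation names are placeholders
rather than our actual code:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.Cell;
    import org.apache.hadoop.hbase.CellUtil;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.RegionLocator;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class RekeyTable {

      // Reads each row of the old table and re-emits its cells under the new key.
      static class RekeyMapper extends TableMapper<ImmutableBytesWritable, Put> {
        @Override
        protected void map(ImmutableBytesWritable oldKey, Result row, Context ctx)
            throws java.io.IOException, InterruptedException {
          byte[] newKey = deriveNewKey(row);  // placeholder for the new key scheme
          Put put = new Put(newKey);
          for (Cell cell : row.rawCells()) {
            put.addColumn(CellUtil.cloneFamily(cell), CellUtil.cloneQualifier(cell),
                cell.getTimestamp(), CellUtil.cloneValue(cell));
          }
          ctx.write(new ImmutableBytesWritable(newKey), put);
        }

        private byte[] deriveNewKey(Result row) { return row.getRow(); }  // placeholder
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "rekey-table");
        job.setJarByClass(RekeyTable.class);

        Scan scan = new Scan();
        scan.setCaching(500);
        scan.setCacheBlocks(false);  // don't churn the block cache from an MR job

        TableMapReduceUtil.initTableMapperJob("old_table", scan, RekeyMapper.class,
            ImmutableBytesWritable.class, Put.class, job);

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table newTable = conn.getTable(TableName.valueOf("new_table"));
             RegionLocator locator =
                 conn.getRegionLocator(TableName.valueOf("new_table"))) {
          // Order matters: configureIncrementalLoad() sets up reducers, so call
          // setNumReduceTasks(0) afterwards to keep the job map-only.
          HFileOutputFormat2.configureIncrementalLoad(job, newTable, locator);
          job.setNumReduceTasks(0);
        }

        FileOutputFormat.setOutputPath(job, new Path(args[0]));  // HFile staging dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
        // Afterwards: bulk-load the staged HFiles into the new table, then major compact.
      }
    }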

[1] http://hbase.apache.org/0.94/book/mapreduce.example.html

On Tue, Feb 20, 2018 at 6:44 AM, anil gupta <anilgupt...@gmail.com> wrote:

> Hi Marcell,
>
> Since the key is changing, you will need to rewrite the entire table. I think
> generating HFiles (rather than doing Puts) will be the most efficient here.
> IIRC, you will need to use HFileOutputFormat in your MR job.
> For locality, I don't think you should worry that much, because major
> compaction usually takes care of it. If you want very high locality from the
> beginning, then you can run a major compaction on the new table after your
> initial load.
>
> HTH,
> Anil Gupta
>
> On Mon, Feb 19, 2018 at 11:46 PM, Marcell Ortutay <mortu...@23andme.com>
> wrote:
>
> > I have a large HBase table (~10 TB) that has an existing key structure.
> > Based on some recent analysis, the key structure is causing performance
> > problems for our current query load. I would like to re-write the table
> > with a new key structure that performs substantially better.
> >
> > What is the best way to go about re-writing this table? Since the key
> > structure will change, it will affect locality, so all the data will have
> > to move to a new location. If anyone can point to examples of code that
> > does something like this, that would be very helpful.
> >
> > Thanks,
> > Marcell
> >
>
>
>
> --
> Thanks & Regards,
> Anil Gupta
>


Want to change key structure

2018-02-19 Thread Marcell Ortutay
I have a large HBase table (~10 TB) that has an existing key structure.
Based on some recent analysis, the key structure is causing performance
problems for our current query load. I would like to re-write the table
with a new key structure that performs substantially better.

What is the best way to go about re-writing this table? Since the key
structure will change, it will affect locality, so all the data will have
to move to a new location. If anyone can point to examples of code that
does something like this, that would be very helpful.

Thanks,
Marcell