RE: Materialized views in Hbase/Phoenix

Gautham Acharya Fri, 27 Sep 2019 09:23:44 -0700

We will be reaching 100million rows early next year, and then billions shortly 
after that. So, Hbase will be needed to scale to that degree.

If one of the tables fails to write, we need some kind of a rollback mechanism, 
which is why I was considering a transaction. We cannot be in a partial state 
where some of the ‘views’ are written and some aren’t.

From: Pedro Boado [mailto:[email protected]]
Sent: Friday, September 27, 2019 7:22 AM
To: [email protected]
Subject: Re: Materialized views in Hbase/Phoenix

CAUTION: This email originated from outside the Allen Institute. Please do not 
click links or open attachments unless you've validated the sender and know the 
content is safe.
________________________________
For just a few million rows I would go for a RDBMS and not Phoenix / HBase.

You don't really need transactions to control completion, just write a flag (a 
COMPLETED empty file, for instance) as a final step in your job.

On Fri, 27 Sep 2019, 15:03 Gautham Acharya, 
<[email protected]<mailto:[email protected]>> wrote:
Thanks Anil.

So, what you’re essentially advocating for is to use some kind of Spark/compute 
framework (I was going to use AWS Glue) job to write the ‘materialized views’ 
as separate tables (maybe tied together with some kind of a naming convention?)

In this case, we’d end up with some sticky data consistency issues if the write 
job failed halfway through (some ‘materialized view’ tables would be updated, 
and some wouldn’t). Can I use Phoenix transactions to wrap the write jobs 
together, to make sure either all the data is updated, or none?

--gautham

From: anil gupta [mailto:[email protected]<mailto:[email protected]>]
Sent: Friday, September 27, 2019 6:58 AM
To: [email protected]<mailto:[email protected]>
Subject: Re: Materialized views in Hbase/Phoenix

CAUTION: This email originated from outside the Allen Institute. Please do not 
click links or open attachments unless you've validated the sender and know the 
content is safe.
________________________________
For your use case, i would suggest to create another table that stores the 
matrix. Since this data doesnt change that often, maybe you can write a nightly 
spark/MR job to update/rebuild the matrix table.(If you want near real time 
that is also possible with any streaming system) Have you looked into bloom 
filters? It might help if you have sparse dataset and you are using Phoenix 
dynamic columns.
We use dynamic columns for a table that has columns upto 40k. Here is the 
presentation and optimizations we made for that use case: 
https://www.slideshare.net/anilgupta84/phoenix-con2017-truecarfinal<https://nam05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.slideshare.net%2Fanilgupta84%2Fphoenix-con2017-truecarfinal&data=02%7C01%7C%7C63db5e769d074a7ec9c908d743562e01%7C32669cd6737f4b398bddd6951120d3fc%7C0%7C1%7C637051909727164641&sdata=lKAo7Zw%2FWYQyIRg6kfQ2lqx4yO55AAgVaJ6kgXvqqRc%3D&reserved=0>
IMO, Hive integration with HBase is not fully baked and it has a lot of rough 
edges. So, it better to stick with native Phoenix/HBase if you care about 
performance and ease of operations.

HTH,
Anil Gupta

On Wed, Sep 25, 2019 at 10:01 AM Gautham Acharya 
<[email protected]<mailto:[email protected]>> wrote:
Hi,

Currently I'm using Hbase to store large, sparse matrices of 50,000 columns 10+ 
million rows of integers.

This matrix is used for fast, random access - we need to be able to fetch 
random row/column subsets, as well as entire columns. We also want to very 
quickly fetch aggregates (Mean, median, etc) on this matrix.

The data does not change very often for these matrices (a few times a week at 
most), so pre-computing is very feasible here. What I would like to do is 
maintain a column store (store the column names as row keys, and a compressed 
list of all the row values) for the use case where we select an entire column. 
Additionally, I would like to maintain a separate table for each precomputed 
aggregate (median table, mean table, etc).

The query time for all these use cases needs to be low latency - under 100ms.

When the data does change for a certain matrix, it would be nice to easily 
update the optimized table. Ideally, I would like the column store/aggregation 
tables to just be materialized views of the original matrix. It doesn't look 
like Apache Phoenix supports materialized views. It looks like Hive does, but 
unfortunately Hive doesn't normally offer low latency queries.

Maybe Hive can create the materialized view, and we can just query the 
underlying Hbase store for lower latency responses?

What would be a good solution for this?

--gautham

--gautham

--
Thanks & Regards,
Anil Gupta

RE: Materialized views in Hbase/Phoenix

Reply via email to