Re: Materialized views in Hbase/Phoenix

2019-10-12 Thread sudhir patil
A few options:

1. Have you checked OLAP engines on top of HBase, like Apache Kylin? They would fit the
aggregation requirement.

2. Write an HBase coprocessor to aggregate the data and save it to another table or
column family? Implementing a coprocessor is not trivial, though.

On Sat, 12 Oct 2019 at 12:49 AM, Nicolas Paris 
wrote:

> > If one of the tables fails to write, we need some kind of a rollback
> mechanism,
> > which is why I was considering a transaction. We cannot be in a partial
> state
> > where some of the ‘views’ are written and some aren’t.
>
> why not write the two tables to phoenix and, if both writes succeed,
> then swap the names of the original tables and the new tables.
> if that succeeds, then drop the old tables.
>
>
> On Fri, Sep 27, 2019 at 04:22:54PM +, Gautham Acharya wrote:
> > We will be reaching 100million rows early next year, and then billions
> shortly
> > after that. So, Hbase will be needed to scale to that degree.
> >
> >
> >
> > If one of the tables fails to write, we need some kind of a rollback
> mechanism,
> > which is why I was considering a transaction. We cannot be in a partial
> state
> > where some of the ‘views’ are written and some aren’t.
> >
> >
> >
> >
> >
> > From: Pedro Boado [mailto:pedro.bo...@gmail.com]
> > Sent: Friday, September 27, 2019 7:22 AM
> > To: user@phoenix.apache.org
> > Subject: Re: Materialized views in Hbase/Phoenix
> >
> >
> >
> >
> > For just a few million rows I would go for a RDBMS and not Phoenix /
> HBase.
> >
> >
> >
> > You don't really need transactions to control completion, just write a
> flag (a
> > COMPLETED empty file, for instance) as a final step in your job.
> >
> >
> >
> >
> >
> >
> >
> > On Fri, 27 Sep 2019, 15:03 Gautham Acharya,  >
> > wrote:
> >
> > Thanks Anil.
> >
> >
> >
> > So, what you’re essentially advocating for is to use some kind of
> Spark/
> > compute framework (I was going to use AWS Glue) job to write the
> > ‘materialized views’ as separate tables (maybe tied together with
> some kind
> > of a naming convention?)
> >
> >
> >
> > In this case, we’d end up with some sticky data consistency issues
> if the
> > write job failed halfway through (some ‘materialized view’ tables
> would be
> > updated, and some wouldn’t). Can I use Phoenix transactions to wrap
> the
> > write jobs together, to make sure either all the data is updated, or
> none?
> >
> >
> >
> > --gautham
> >
> >
> >
> >
> >
> > From: anil gupta [mailto:anilgupt...@gmail.com]
> > Sent: Friday, September 27, 2019 6:58 AM
> > To: user@phoenix.apache.org
> > Subject: Re: Materialized views in Hbase/Phoenix
> >
> >
> >
> >
> > For your use case, i would suggest to create another table that
> stores the
> > matrix. Since this data doesnt change that often, maybe you can
> write a
> > nightly spark/MR job to update/rebuild the matrix table.(If you want
> near
> > real time that is also possible with any streaming system) Have you
> looked
> > into bloom filters? It might help if you have sparse dataset and you
> are
> > using Phoenix dynamic columns.
> > We use dynamic columns for a table that has columns upto 40k. Here
> is the
> > presentation and optimizations we made for that use case:
> > https://www.slideshare.net/anilgupta84/phoenix-con2017-truecarfinal
> >
> > IMO, Hive integration with HBase is not fully baked and it has a lot
> of
> > rough edges. So, it better to stick with native Phoenix/HBase if you
> care
> > about performance and ease of operations.
> >
> >
> >
> > HTH,
> >
> > Anil Gupta
> >
> >
> >
> >
> >
> > On Wed, Sep 25, 2019 at 10:01 AM Gautham Acharya <
> >

Re: Materialized views in Hbase/Phoenix

2019-10-11 Thread Nicolas Paris
> If one of the tables fails to write, we need some kind of a rollback 
> mechanism,
> which is why I was considering a transaction. We cannot be in a partial state
> where some of the ‘views’ are written and some aren’t.

why not write the two tables to phoenix and, if both writes succeed,
then swap the names of the original tables and the new tables.
if that succeeds, then drop the old tables.
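
As far as I know, Phoenix itself has no table rename statement, so a close
approximation of this swap is to build each run into freshly suffixed tables
and, only once every write has succeeded, flip a small pointer row that readers
consult; the old tables are dropped afterwards. A minimal sketch of that
pattern, assuming the python phoenixdb driver, a Phoenix Query Server at
PQS_URL, and hypothetical table and column names:

# Sketch only: build suffixed tables, flip a pointer row as the last step,
# drop the old tables later. All names below are hypothetical.
import time
import phoenixdb

PQS_URL = "http://localhost:8765/"
suffix = time.strftime("%Y%m%d%H%M%S")

conn = phoenixdb.connect(PQS_URL, autocommit=True)
cur = conn.cursor()

# 1. Build the new 'view' tables under versioned names (bulk writes go here).
cur.execute(f"CREATE TABLE COLSTORE_{suffix} (COL_NAME VARCHAR PRIMARY KEY, ROW_VALUES VARBINARY)")
cur.execute(f"CREATE TABLE MEDIANS_{suffix} (COL_NAME VARCHAR PRIMARY KEY, MEDIAN_VAL DOUBLE)")

# 2. Only after every write has succeeded, point readers at the new suffix.
cur.execute("CREATE TABLE IF NOT EXISTS CURRENT_VIEWS (MATRIX_ID VARCHAR PRIMARY KEY, SUFFIX VARCHAR)")
cur.execute("UPSERT INTO CURRENT_VIEWS (MATRIX_ID, SUFFIX) VALUES (?, ?)", ("matrix_1", suffix))

# 3. Drop the previous generation once no reader still uses it, e.g.
#    cur.execute("DROP TABLE COLSTORE_20190926010000")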


On Fri, Sep 27, 2019 at 04:22:54PM +, Gautham Acharya wrote:
> We will be reaching 100million rows early next year, and then billions shortly
> after that. So, Hbase will be needed to scale to that degree.
> 
>  
> 
> If one of the tables fails to write, we need some kind of a rollback 
> mechanism,
> which is why I was considering a transaction. We cannot be in a partial state
> where some of the ‘views’ are written and some aren’t.
> 
>  
> 
>  
> 
> From: Pedro Boado [mailto:pedro.bo...@gmail.com]
> Sent: Friday, September 27, 2019 7:22 AM
> To: user@phoenix.apache.org
> Subject: Re: Materialized views in Hbase/Phoenix
> 
>  
> 
> 
> For just a few million rows I would go for a RDBMS and not Phoenix / HBase.
> 
>  
> 
> You don't really need transactions to control completion, just write a flag (a
> COMPLETED empty file, for instance) as a final step in your job.
> 
>  
> 
>  
> 
>  
> 
> On Fri, 27 Sep 2019, 15:03 Gautham Acharya, 
> wrote:
> 
> Thanks Anil.
> 
>  
> 
> So, what you’re essentially advocating for is to use some kind of Spark/
> compute framework (I was going to use AWS Glue) job to write the
> ‘materialized views’ as separate tables (maybe tied together with some 
> kind
> of a naming convention?)
> 
>  
> 
> In this case, we’d end up with some sticky data consistency issues if the
> write job failed halfway through (some ‘materialized view’ tables would be
> updated, and some wouldn’t). Can I use Phoenix transactions to wrap the
> write jobs together, to make sure either all the data is updated, or none?
> 
>  
> 
>     --gautham
> 
>  
> 
>  
> 
> From: anil gupta [mailto:anilgupt...@gmail.com]
> Sent: Friday, September 27, 2019 6:58 AM
> To: user@phoenix.apache.org
> Subject: Re: Materialized views in Hbase/Phoenix
> 
>  
> 
>
> For your use case, i would suggest to create another table that stores the
> matrix. Since this data doesnt change that often, maybe you can write a
> nightly spark/MR job to update/rebuild the matrix table.(If you want near
> real time that is also possible with any streaming system) Have you looked
> into bloom filters? It might help if you have sparse dataset and you are
> using Phoenix dynamic columns.
> We use dynamic columns for a table that has columns upto 40k. Here is the
> presentation and optimizations we made for that use case:
> https://www.slideshare.net/anilgupta84/phoenix-con2017-truecarfinal
> 
> IMO, Hive integration with HBase is not fully baked and it has a lot of
> rough edges. So, it better to stick with native Phoenix/HBase if you care
> about performance and ease of operations.
> 
>  
> 
> HTH,
> 
> Anil Gupta
> 
>  
> 
>  
> 
> On Wed, Sep 25, 2019 at 10:01 AM Gautham Acharya <
> gauth...@alleninstitute.org> wrote:
> 
> Hi,
> 
>  
> 
> Currently I'm using Hbase to store large, sparse matrices of 50,000
> columns 10+ million rows of integers.
> 
>  
> 
> This matrix is used for fast, random access - we need to be able to
> fetch random row/column subsets, as well as entire columns. We also
> want to very quickly fetch aggregates (Mean, median, etc) on this
> matrix.
> 
>  
> 
> The data does not change very often for these matrices (a few times a
> week at most), so pre-computing is very feasible here. What I would
> like to do is maintain a column store (store the column names as row
> keys, and a compressed list of all the row values) for 

Re: Materialized views in Hbase/Phoenix

2019-09-30 Thread Josh Elser
Bulk loading would help a little bit with the "all-or-nothing" problem,
but it still would not be foolproof.


You could have a set of files which are destined for different tables and
very clear data that needs to be loaded, but if a file (or files) failed
to load, you would have to take some steps to keep retrying.
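
For what it's worth, that retrying step can be a thin wrapper around HBase's
bulk-load tool. A rough sketch in Python, where the tool's class name depends
on the HBase version and the staging paths and table names are made up:

# Rough sketch of retrying the bulk load per target table. The class below is
# the newer location (older releases ship
# org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles); directories and
# table names are made up.
import subprocess
import time

BULK_LOAD_CLASS = "org.apache.hadoop.hbase.tool.LoadIncrementalHFiles"

def bulk_load_with_retry(hfile_dir, table, attempts=3):
    """Load one directory of HFiles into one table, retrying on failure."""
    for attempt in range(1, attempts + 1):
        result = subprocess.run(["hbase", BULK_LOAD_CLASS, hfile_dir, table])
        if result.returncode == 0:
            return
        time.sleep(30 * attempt)  # crude backoff before the next try
    raise RuntimeError("bulk load of %s into %s failed after %d attempts"
                       % (hfile_dir, table, attempts))

# Every directory/table pair has to load before the run counts as complete.
for hfile_dir, table in [("/staging/colstore", "COLSTORE"),
                         ("/staging/medians", "MEDIANS")]:
    bulk_load_with_retry(hfile_dir, table)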


On 9/27/19 12:22 PM, Gautham Acharya wrote:
We will be reaching 100million rows early next year, and then billions 
shortly after that. So, Hbase will be needed to scale to that degree.


If one of the tables fails to write, we need some kind of a rollback 
mechanism, which is why I was considering a transaction. We cannot be in 
a partial state where some of the ‘views’ are written and some aren’t.


*From:*Pedro Boado [mailto:pedro.bo...@gmail.com]
*Sent:* Friday, September 27, 2019 7:22 AM
*To:* user@phoenix.apache.org
*Subject:* Re: Materialized views in Hbase/Phoenix





For just a few million rows I would go for a RDBMS and not Phoenix / HBase.

You don't really need transactions to control completion, just write a 
flag (a COMPLETED empty file, for instance) as a final step in your job.


On Fri, 27 Sep 2019, 15:03 Gautham Acharya <gauth...@alleninstitute.org> wrote:


Thanks Anil.

So, what you’re essentially advocating for is to use some kind of
Spark/compute framework (I was going to use AWS Glue) job to write
the ‘materialized views’ as separate tables (maybe tied together
with some kind of a naming convention?)

In this case, we’d end up with some sticky data consistency issues
if the write job failed halfway through (some ‘materialized view’
tables would be updated, and some wouldn’t). Can I use Phoenix
transactions to wrap the write jobs together, to make sure either
all the data is updated, or none?

--gautham

*From:* anil gupta [mailto:anilgupt...@gmail.com]
*Sent:* Friday, September 27, 2019 6:58 AM
*To:* user@phoenix.apache.org
    *Subject:* Re: Materialized views in Hbase/Phoenix




For your use case, i would suggest to create another table that
stores the matrix. Since this data doesnt change that often, maybe
you can write a nightly spark/MR job to update/rebuild the matrix
table.(If you want near real time that is also possible with any
streaming system) Have you looked into bloom filters? It might help
if you have sparse dataset and you are using Phoenix dynamic columns.
We use dynamic columns for a table that has columns upto 40k. Here
is the presentation and optimizations we made for that use case:
https://www.slideshare.net/anilgupta84/phoenix-con2017-truecarfinal


IMO, Hive integration with HBase is not fully baked and it has a lot
of rough edges. So, it better to stick with native Phoenix/HBase if
you care about performance and ease of operations.

HTH,

Anil Gupta

On Wed, Sep 25, 2019 at 10:01 AM Gautham Acharya
<gauth...@alleninstitute.org> wrote:

Hi,

Currently I'm using Hbase to store large, sparse matrices of
50,000 columns 10+ million rows of integers.

This matrix is used for fast, random access - we need to be able
to fetch random row/column subsets, as well as entire columns.
We also want to very quickly fetch aggregates (Mean, median,
etc) on this matrix.

The data does not change very often for these matrices (a few
times a week at most), so pre-computing is very feasible here.
What I would like to do is maintain a column store (store the
column names as row keys, and a compressed list of all the row
values) for the use case where we select an entire column.
Additionally, I would like to maintain a separate table for each
precomputed aggregate (median table, mean table, etc).

The query time for all these use cases needs to be low latency -
under 100ms.

When the data does change for a certain matrix, it would be nice
to easily update the optimized table. Ide

Re: Materialized views in Hbase/Phoenix

2019-09-27 Thread Pedro Boado
For 2), how many rows per column are we talking about in this sparse
matrix? I mean, how sparse is it? If I understand correctly you are talking
about storing the transposed matrix for this use case.

If rows are too large (and they will keep growing with table size) it will
end up polluting the block cache.

For 3) are these groupings predefined? Or random ranges of keys? If
predefined, how big? Also 500 rows max?


Data consistency will be a problem indeed (no multitable transactions in
hbase).

Regarding using phoenix transactions: is the data append-only or can it be
updated? If updates are not that frequent and each one changes a single data
point, the aggregations could be recomputed on write. I don't really
know whether transaction performance would be a bottleneck, though.




On Fri, 27 Sep 2019, 21:06 Gautham Acharya, 
wrote:

> My first email details the use cases:
>
>
>
> 1.   Get values for certain row/column sets – this is where Hbase
> comes in handy, as we can easily query based on row key and column. No more
> than 500 rows and 30 columns will be queried.
>
> 2.   Get an entire column
>
> 3.   Get Aggregations per column based on groupings of row keys.
>
>
>
> Number (1) is easily satisfied by Hbase, with fast lookups on the rowkey.
>
> (2) and (3) will have to be precomputed, storing compressed data by column
> for (2) and the aggregations for (3). My main concern here was maintaining
> data consistency between the tables created for each matrix.
>
>
>
> *From:* Pedro Boado [mailto:pedro.bo...@gmail.com]
> *Sent:* Friday, September 27, 2019 10:53 AM
> *To:* user@phoenix.apache.org
> *Subject:* Re: Materialized views in Hbase/Phoenix
>
>
>
>
> Yeah, phoenix won't aggregate billions of rows in under 100ms (probably,
> nothing will).
>
>
>
> This sounds more and more like an OLAP use case, doesn't it? Facts table
> with billions of rows (still, you can handle that volumes with a shared
> RDBMS) that will never be queried directly.. And precomputed aggregations
> to be queried interactively (maybe you could use Phoenix here, but you
> could also use a RDBMS, that additionally can give you all guarantees
> you're looking for).
>
>
>
> If that's the case, I don't really think HBase/Phoenix is the right
> choice, (which is good doing gets by key or running scans/aggregations over
> reasonable key intervals).
>
>
>
> Maybe explaining the use case could help (we are getting more info drop by
> drop in each new message in terms of volume, different query patterns
> expected, concurrency, etc etc). For instance, how are this 100s of queries
> interacting with the DB? Via a REST API?
>
>
>
>
>
>
>
>
>
>
>
> On Fri, 27 Sep 2019, 17:39 Gautham Acharya, 
> wrote:
>
> We are looking at being able to support hundreds of concurrent queries,
> but not too many more.
>
>
>
> Will aggregations be performant across these large datasets? (e.g. give me
> the mean value of each column when all rows are grouped by a certain row
> property).
>
>
>
> Precomputing seems much more efficient.
>
>
>
> *From:* Pedro Boado [mailto:pedro.bo...@gmail.com]
> *Sent:* Friday, September 27, 2019 9:27 AM
> *To:* user@phoenix.apache.org
> *Subject:* Re: Materialized views in Hbase/Phoenix
>
>
>
>
> Can the aggregation be run on the flight in a phoenix query? 100ms
> response time but... With how many concurrent queries?
>
>
>
> On Fri, 27 Sep 2019, 17:23 Gautham Acharya, 
> wrote:
>
> We will be reaching 100million rows early next year, and then billions
> shortly after that. So, Hbase will be needed to scale to that degree.
>
>
>
> If one of the tables fails to write, we need some kind of a rollback
> mechanism, which is why I was considering a transaction. We cannot be in a
> partial state where some of the ‘views’ are written and some aren’t.
>
>
>
>
>
> *From:* Pedro Boado [mailto:pedro.bo...@gmail.com]
> *Sent:* Friday, September 27, 2019 7:22 AM
> *To:* user@phoenix.apache.org
> *Subject:* Re: Materialized views in Hbase/Phoenix
>
>
>

RE: Materialized views in Hbase/Phoenix

2019-09-27 Thread Gautham Acharya
My first email details the use cases:


1.   Get values for certain row/column sets – this is where Hbase comes in 
handy, as we can easily query based on row key and column. No more than 500 
rows and 30 columns will be queried.

2.   Get an entire column

3.   Get Aggregations per column based on groupings of row keys.

Number (1) is easily satisfied by Hbase, with fast lookups on the rowkey.
(2) and (3) will have to be precomputed, storing compressed data by column for 
(2) and the aggregations for (3). My main concern here was maintaining data 
consistency between the tables created for each matrix.
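
For use case (1), the lookup can stay a plain Phoenix query against the matrix
table; a sketch with the python phoenixdb driver, where the table, row-key and
column names are purely illustrative:

# Sketch of use case (1): a small row/column subset fetched with a row-key IN
# list and an explicit column list. Names are illustrative only.
import phoenixdb

conn = phoenixdb.connect("http://localhost:8765/", autocommit=True)
cur = conn.cursor()

row_keys = ["cell_00017", "cell_00942", "cell_31405"]   # no more than 500 rows
columns = ["GENE_7", "GENE_42", "GENE_105"]             # no more than 30 columns

placeholders = ", ".join("?" for _ in row_keys)
sql = "SELECT ROW_KEY, %s FROM MATRIX WHERE ROW_KEY IN (%s)" % (", ".join(columns), placeholders)
cur.execute(sql, row_keys)
for row in cur.fetchall():
    print(row)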

From: Pedro Boado [mailto:pedro.bo...@gmail.com]
Sent: Friday, September 27, 2019 10:53 AM
To: user@phoenix.apache.org
Subject: Re: Materialized views in Hbase/Phoenix


Yeah, phoenix won't aggregate billions of rows in under 100ms (probably, 
nothing will).

This sounds more and more like an OLAP use case, doesn't it? Facts table with 
billions of rows (still, you can handle that volumes with a shared RDBMS) that 
will never be queried directly.. And precomputed aggregations to be queried 
interactively (maybe you could use Phoenix here, but you could also use a 
RDBMS, that additionally can give you all guarantees you're looking for).

If that's the case, I don't really think HBase/Phoenix is the right choice, 
(which is good doing gets by key or running scans/aggregations over reasonable 
key intervals).

Maybe explaining the use case could help (we are getting more info drop by drop 
in each new message in terms of volume, different query patterns expected, 
concurrency, etc etc). For instance, how are this 100s of queries interacting 
with the DB? Via a REST API?





On Fri, 27 Sep 2019, 17:39 Gautham Acharya <gauth...@alleninstitute.org> wrote:
We are looking at being able to support hundreds of concurrent queries, but not 
too many more.

Will aggregations be performant across these large datasets? (e.g. give me the 
mean value of each column when all rows are grouped by a certain row property).

Precomputing seems much more efficient.

From: Pedro Boado [mailto:pedro.bo...@gmail.com]
Sent: Friday, September 27, 2019 9:27 AM
To: user@phoenix.apache.org
Subject: Re: Materialized views in Hbase/Phoenix


Can the aggregation be run on the flight in a phoenix query? 100ms response 
time but... With how many concurrent queries?

On Fri, 27 Sep 2019, 17:23 Gautham Acharya <gauth...@alleninstitute.org> wrote:
We will be reaching 100million rows early next year, and then billions shortly 
after that. So, Hbase will be needed to scale to that degree.

If one of the tables fails to write, we need some kind of a rollback mechanism, 
which is why I was considering a transaction. We cannot be in a partial state 
where some of the ‘views’ are written and some aren’t.


From: Pedro Boado [mailto:pedro.bo...@gmail.com]
Sent: Friday, September 27, 2019 7:22 AM
To: user@phoenix.apache.org
Subject: Re: Materialized views in Hbase/Phoenix


For just a few million rows I would go for a RDBMS and not Phoenix / HBase.

You don't really need transactions to control completion, just write a flag (a 
COMPLETED empty file, for instance) as a final step in your job.



On Fri, 27 Sep 2019, 15:03 Gautham Acharya <gauth...@alleninstitute.org> wrote:
Thanks Anil.

So, what you’re essentially advocating for is to use some kind of Spark/compute 
framework (I was going to use AWS Glue) job to write the ‘materialized views’ 
as separate tables (maybe tied together with some kind of a naming convention?)

In this case, we’d end up with some sticky data consistency issues if the write 
job failed halfway through (some ‘materialized view’ tables would be updated, 
and some wouldn’t). Can I use Phoenix transactions to wrap the write jobs 
together, to make sure either all the data is updated, or none?

--gautham


From: anil gupta [mailto:anilgupt...@gmail.com]
Sent: Friday, September 27, 2019 6:58 AM
To: user@phoenix.apache.org
Subject: Re: Materialized views in Hbase/Phoenix


Re: Materialized views in Hbase/Phoenix

2019-09-27 Thread Pedro Boado
Yeah, phoenix won't aggregate billions of rows in under 100ms (probably,
nothing will).

This sounds more and more like an OLAP use case, doesn't it? A facts table
with billions of rows (still, you can handle those volumes with a sharded
RDBMS) that will never be queried directly, and precomputed aggregations
to be queried interactively (maybe you could use Phoenix here, but you
could also use an RDBMS, which additionally can give you all the guarantees
you're looking for).

If that's the case, I don't really think HBase/Phoenix is the right choice
(it is good at doing gets by key or running scans/aggregations over
reasonable key intervals).

Maybe explaining the use case could help (we are getting more info drop by
drop in each new message in terms of volume, different query patterns
expected, concurrency, etc.). For instance, how are these 100s of queries
interacting with the DB? Via a REST API?





On Fri, 27 Sep 2019, 17:39 Gautham Acharya, 
wrote:

> We are looking at being able to support hundreds of concurrent queries,
> but not too many more.
>
>
>
> Will aggregations be performant across these large datasets? (e.g. give me
> the mean value of each column when all rows are grouped by a certain row
> property).
>
>
>
> Precomputing seems much more efficient.
>
>
>
> *From:* Pedro Boado [mailto:pedro.bo...@gmail.com]
> *Sent:* Friday, September 27, 2019 9:27 AM
> *To:* user@phoenix.apache.org
> *Subject:* Re: Materialized views in Hbase/Phoenix
>
>
>
>
> Can the aggregation be run on the flight in a phoenix query? 100ms
> response time but... With how many concurrent queries?
>
>
>
> On Fri, 27 Sep 2019, 17:23 Gautham Acharya, 
> wrote:
>
> We will be reaching 100million rows early next year, and then billions
> shortly after that. So, Hbase will be needed to scale to that degree.
>
>
>
> If one of the tables fails to write, we need some kind of a rollback
> mechanism, which is why I was considering a transaction. We cannot be in a
> partial state where some of the ‘views’ are written and some aren’t.
>
>
>
>
>
> *From:* Pedro Boado [mailto:pedro.bo...@gmail.com]
> *Sent:* Friday, September 27, 2019 7:22 AM
> *To:* user@phoenix.apache.org
> *Subject:* Re: Materialized views in Hbase/Phoenix
>
>
>
>
> For just a few million rows I would go for a RDBMS and not Phoenix / HBase.
>
>
>
> You don't really need transactions to control completion, just write a
> flag (a COMPLETED empty file, for instance) as a final step in your job.
>
>
>
>
>
>
>
> On Fri, 27 Sep 2019, 15:03 Gautham Acharya, 
> wrote:
>
> Thanks Anil.
>
>
>
> So, what you’re essentially advocating for is to use some kind of
> Spark/compute framework (I was going to use AWS Glue) job to write the
> ‘materialized views’ as separate tables (maybe tied together with some kind
> of a naming convention?)
>
>
>
> In this case, we’d end up with some sticky data consistency issues if the
> write job failed halfway through (some ‘materialized view’ tables would be
> updated, and some wouldn’t). Can I use Phoenix transactions to wrap the
> write jobs together, to make sure either all the data is updated, or none?
>
>
>
> --gautham
>
>
>
>
>
> *From:* anil gupta [mailto:anilgupt...@gmail.com]
> *Sent:* Friday, September 27, 2019 6:58 AM
> *To:* user@phoenix.apache.org
> *Subject:* Re: Materialized views in Hbase/Phoenix
>
>
>
>
> For your use case, i would suggest to create another table that stores the
> matrix. Since this data doesnt change that often, maybe you can write a
> nightly spark/MR job to update/rebuild the matrix table.(If you want near
> real time that is also possible with any streaming system) Have you looked
> into bloom filters? It might help if you have sparse dataset and you are
> using Phoenix dynamic columns.
> We use dynamic columns for a table that has columns upto 40k. Here is the
> presentation and optimizations we made for that use case:
> https://www.slideshare.net/anilgupta84/phoenix-con2017-truecarfinal

RE: Materialized views in Hbase/Phoenix

2019-09-27 Thread Gautham Acharya
We are looking at being able to support hundreds of concurrent queries, but not 
too many more.

Will aggregations be performant across these large datasets? (e.g. give me the 
mean value of each column when all rows are grouped by a certain row property).

Precomputing seems much more efficient.
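
To make the comparison concrete, the same aggregate expressed both ways, as a
sketch with the python phoenixdb driver (table, column and grouping names are
illustrative):

# On the fly vs. precomputed lookup for "mean of a column per row group".
import phoenixdb

cur = phoenixdb.connect("http://localhost:8765/", autocommit=True).cursor()

# On the fly: scans every row in each group - simple, but unlikely to stay
# under 100ms once the matrix reaches billions of rows.
cur.execute("SELECT GROUP_ID, AVG(GENE_42) FROM MATRIX GROUP BY GROUP_ID")

# Precomputed: a small range scan against a table rebuilt by the batch job.
cur.execute("SELECT GROUP_ID, MEAN_VAL FROM MEANS WHERE COL_NAME = ?", ("GENE_42",))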

From: Pedro Boado [mailto:pedro.bo...@gmail.com]
Sent: Friday, September 27, 2019 9:27 AM
To: user@phoenix.apache.org
Subject: Re: Materialized views in Hbase/Phoenix


Can the aggregation be run on the flight in a phoenix query? 100ms response 
time but... With how many concurrent queries?

On Fri, 27 Sep 2019, 17:23 Gautham Acharya <gauth...@alleninstitute.org> wrote:
We will be reaching 100million rows early next year, and then billions shortly 
after that. So, Hbase will be needed to scale to that degree.

If one of the tables fails to write, we need some kind of a rollback mechanism, 
which is why I was considering a transaction. We cannot be in a partial state 
where some of the ‘views’ are written and some aren’t.


From: Pedro Boado [mailto:pedro.bo...@gmail.com]
Sent: Friday, September 27, 2019 7:22 AM
To: user@phoenix.apache.org
Subject: Re: Materialized views in Hbase/Phoenix


For just a few million rows I would go for a RDBMS and not Phoenix / HBase.

You don't really need transactions to control completion, just write a flag (a 
COMPLETED empty file, for instance) as a final step in your job.



On Fri, 27 Sep 2019, 15:03 Gautham Acharya <gauth...@alleninstitute.org> wrote:
Thanks Anil.

So, what you’re essentially advocating for is to use some kind of Spark/compute 
framework (I was going to use AWS Glue) job to write the ‘materialized views’ 
as separate tables (maybe tied together with some kind of a naming convention?)

In this case, we’d end up with some sticky data consistency issues if the write 
job failed halfway through (some ‘materialized view’ tables would be updated, 
and some wouldn’t). Can I use Phoenix transactions to wrap the write jobs 
together, to make sure either all the data is updated, or none?

--gautham


From: anil gupta [mailto:anilgupt...@gmail.com]
Sent: Friday, September 27, 2019 6:58 AM
To: user@phoenix.apache.org
Subject: Re: Materialized views in Hbase/Phoenix


For your use case, i would suggest to create another table that stores the 
matrix. Since this data doesnt change that often, maybe you can write a nightly 
spark/MR job to update/rebuild the matrix table.(If you want near real time 
that is also possible with any streaming system) Have you looked into bloom 
filters? It might help if you have sparse dataset and you are using Phoenix 
dynamic columns.
We use dynamic columns for a table that has columns upto 40k. Here is the 
presentation and optimizations we made for that use case: 
https://www.slideshare.net/anilgupta84/phoenix-con2017-truecarfinal
IMO, Hive integration with HBase is not fully baked and it has a lot of rough 
edges. So, it better to stick with native Phoenix/HBase if you care about 
performance and ease of operations.

HTH,
Anil Gupta


On Wed, Sep 25, 2019 at 10:01 AM Gautham Acharya <gauth...@alleninstitute.org> wrote:
Hi,

Currently I'm using Hbase to store large, sparse matrices of 50,000 columns 10+ 
million rows of integers.

This matrix is used for fast, random access - we need to be able to fetch 
random row/column subsets, as well as entire columns. We also want to very 
quickly fetch aggregates (Mean, median, etc) on this matrix.

The data does not change very often for these matrices (a few times a week at 
most), so pre-computing is very feasible here. What I would like to do is 
maintain a column store (store the column names as row keys, and a compressed 
list of all the row values) for the use case where we select an entire column. 
Additionally, I would like to maintain a separate table for each precomputed 
aggregate 

Re: Materialized views in Hbase/Phoenix

2019-09-27 Thread Pedro Boado
Can the aggregation be run on the fly in a phoenix query? 100ms response
time, but... with how many concurrent queries?

On Fri, 27 Sep 2019, 17:23 Gautham Acharya, 
wrote:

> We will be reaching 100million rows early next year, and then billions
> shortly after that. So, Hbase will be needed to scale to that degree.
>
>
>
> If one of the tables fails to write, we need some kind of a rollback
> mechanism, which is why I was considering a transaction. We cannot be in a
> partial state where some of the ‘views’ are written and some aren’t.
>
>
>
>
>
> *From:* Pedro Boado [mailto:pedro.bo...@gmail.com]
> *Sent:* Friday, September 27, 2019 7:22 AM
> *To:* user@phoenix.apache.org
> *Subject:* Re: Materialized views in Hbase/Phoenix
>
>
>
>
> For just a few million rows I would go for a RDBMS and not Phoenix / HBase.
>
>
>
> You don't really need transactions to control completion, just write a
> flag (a COMPLETED empty file, for instance) as a final step in your job.
>
>
>
>
>
>
>
> On Fri, 27 Sep 2019, 15:03 Gautham Acharya, 
> wrote:
>
> Thanks Anil.
>
>
>
> So, what you’re essentially advocating for is to use some kind of
> Spark/compute framework (I was going to use AWS Glue) job to write the
> ‘materialized views’ as separate tables (maybe tied together with some kind
> of a naming convention?)
>
>
>
> In this case, we’d end up with some sticky data consistency issues if the
> write job failed halfway through (some ‘materialized view’ tables would be
> updated, and some wouldn’t). Can I use Phoenix transactions to wrap the
> write jobs together, to make sure either all the data is updated, or none?
>
>
>
> --gautham
>
>
>
>
>
> *From:* anil gupta [mailto:anilgupt...@gmail.com]
> *Sent:* Friday, September 27, 2019 6:58 AM
> *To:* user@phoenix.apache.org
> *Subject:* Re: Materialized views in Hbase/Phoenix
>
>
>
>
> For your use case, i would suggest to create another table that stores the
> matrix. Since this data doesnt change that often, maybe you can write a
> nightly spark/MR job to update/rebuild the matrix table.(If you want near
> real time that is also possible with any streaming system) Have you looked
> into bloom filters? It might help if you have sparse dataset and you are
> using Phoenix dynamic columns.
> We use dynamic columns for a table that has columns upto 40k. Here is the
> presentation and optimizations we made for that use case:
> https://www.slideshare.net/anilgupta84/phoenix-con2017-truecarfinal
>
> IMO, Hive integration with HBase is not fully baked and it has a lot of
> rough edges. So, it better to stick with native Phoenix/HBase if you care
> about performance and ease of operations.
>
>
>
> HTH,
>
> Anil Gupta
>
>
>
>
>
> On Wed, Sep 25, 2019 at 10:01 AM Gautham Acharya <
> gauth...@alleninstitute.org> wrote:
>
> Hi,
>
>
>
> Currently I'm using Hbase to store large, sparse matrices of 50,000
> columns 10+ million rows of integers.
>
>
>
> This matrix is used for fast, random access - we need to be able to fetch
> random row/column subsets, as well as entire columns. We also want to very
> quickly fetch aggregates (Mean, median, etc) on this matrix.
>
>
>
> The data does not change very often for these matrices (a few times a week
> at most), so pre-computing is very feasible here. What I would like to do
> is maintain a column store (store the column names as row keys, and a
> compressed list of all the row values) for the use case where we select an
> entire column. Additionally, I would like to maintain a separate table for
> each precomputed aggregate (median table, mean table, etc).
>
>
>
> The query time for all these use cases needs to be low latency - under
> 100ms.
>
>
>
> When the data does change for a certain matrix, it would be nice to easily
> update the optimized table. Ideally, I would like the column
> store/aggregation

RE: Materialized views in Hbase/Phoenix

2019-09-27 Thread Gautham Acharya
We will be reaching 100 million rows early next year, and then billions shortly
after that. So, HBase will be needed to scale to that degree.

If one of the tables fails to write, we need some kind of a rollback mechanism, 
which is why I was considering a transaction. We cannot be in a partial state 
where some of the ‘views’ are written and some aren’t.


From: Pedro Boado [mailto:pedro.bo...@gmail.com]
Sent: Friday, September 27, 2019 7:22 AM
To: user@phoenix.apache.org
Subject: Re: Materialized views in Hbase/Phoenix


For just a few million rows I would go for a RDBMS and not Phoenix / HBase.

You don't really need transactions to control completion, just write a flag (a 
COMPLETED empty file, for instance) as a final step in your job.



On Fri, 27 Sep 2019, 15:03 Gautham Acharya <gauth...@alleninstitute.org> wrote:
Thanks Anil.

So, what you’re essentially advocating for is to use some kind of Spark/compute 
framework (I was going to use AWS Glue) job to write the ‘materialized views’ 
as separate tables (maybe tied together with some kind of a naming convention?)

In this case, we’d end up with some sticky data consistency issues if the write 
job failed halfway through (some ‘materialized view’ tables would be updated, 
and some wouldn’t). Can I use Phoenix transactions to wrap the write jobs 
together, to make sure either all the data is updated, or none?

--gautham


From: anil gupta [mailto:anilgupt...@gmail.com]
Sent: Friday, September 27, 2019 6:58 AM
To: user@phoenix.apache.org
Subject: Re: Materialized views in Hbase/Phoenix


For your use case, i would suggest to create another table that stores the 
matrix. Since this data doesnt change that often, maybe you can write a nightly 
spark/MR job to update/rebuild the matrix table.(If you want near real time 
that is also possible with any streaming system) Have you looked into bloom 
filters? It might help if you have sparse dataset and you are using Phoenix 
dynamic columns.
We use dynamic columns for a table that has columns upto 40k. Here is the 
presentation and optimizations we made for that use case: 
https://www.slideshare.net/anilgupta84/phoenix-con2017-truecarfinal
IMO, Hive integration with HBase is not fully baked and it has a lot of rough 
edges. So, it better to stick with native Phoenix/HBase if you care about 
performance and ease of operations.

HTH,
Anil Gupta


On Wed, Sep 25, 2019 at 10:01 AM Gautham Acharya <gauth...@alleninstitute.org> wrote:
Hi,

Currently I'm using Hbase to store large, sparse matrices of 50,000 columns 10+ 
million rows of integers.

This matrix is used for fast, random access - we need to be able to fetch 
random row/column subsets, as well as entire columns. We also want to very 
quickly fetch aggregates (Mean, median, etc) on this matrix.

The data does not change very often for these matrices (a few times a week at 
most), so pre-computing is very feasible here. What I would like to do is 
maintain a column store (store the column names as row keys, and a compressed 
list of all the row values) for the use case where we select an entire column. 
Additionally, I would like to maintain a separate table for each precomputed 
aggregate (median table, mean table, etc).

The query time for all these use cases needs to be low latency - under 100ms.

When the data does change for a certain matrix, it would be nice to easily 
update the optimized table. Ideally, I would like the column store/aggregation 
tables to just be materialized views of the original matrix. It doesn't look 
like Apache Phoenix supports materialized views. It looks like Hive does, but 
unfortunately Hive doesn't normally offer low latency queries.

Maybe Hive can create the materialized view, and we can just query the 
underlying Hbase store for lower latency responses?

What would be a good solution for this?

--gautham



--gautham



--
Thanks & Regards,
Anil Gupta


Re: Materialized views in Hbase/Phoenix

2019-09-27 Thread Pedro Boado
For just a few million rows I would go for a RDBMS and not Phoenix / HBase.

You don't really need transactions to control completion, just write a flag
(a COMPLETED empty file, for instance) as a final step in your job.
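
A minimal sketch of that final step, assuming the job stages its output on S3
and uses boto3 (bucket and key names are made up); downstream readers only
trust a run whose marker exists:

# Write an empty COMPLETED marker as the very last step of the job, and check
# for it before trusting the run. Bucket and prefix are made up.
import boto3

s3 = boto3.client("s3")

def mark_run_complete(bucket, run_prefix):
    # Called only after every table write in the run has succeeded.
    s3.put_object(Bucket=bucket, Key=run_prefix + "/COMPLETED", Body=b"")

def run_is_complete(bucket, run_prefix):
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=run_prefix + "/COMPLETED")
    return resp.get("KeyCount", 0) > 0

mark_run_complete("matrix-view-builds", "runs/2019-09-27")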



On Fri, 27 Sep 2019, 15:03 Gautham Acharya, 
wrote:

> Thanks Anil.
>
>
>
> So, what you’re essentially advocating for is to use some kind of
> Spark/compute framework (I was going to use AWS Glue) job to write the
> ‘materialized views’ as separate tables (maybe tied together with some kind
> of a naming convention?)
>
>
>
> In this case, we’d end up with some sticky data consistency issues if the
> write job failed halfway through (some ‘materialized view’ tables would be
> updated, and some wouldn’t). Can I use Phoenix transactions to wrap the
> write jobs together, to make sure either all the data is updated, or none?
>
>
>
> --gautham
>
>
>
>
>
> *From:* anil gupta [mailto:anilgupt...@gmail.com]
> *Sent:* Friday, September 27, 2019 6:58 AM
> *To:* user@phoenix.apache.org
> *Subject:* Re: Materialized views in Hbase/Phoenix
>
>
>
>
> For your use case, i would suggest to create another table that stores the
> matrix. Since this data doesnt change that often, maybe you can write a
> nightly spark/MR job to update/rebuild the matrix table.(If you want near
> real time that is also possible with any streaming system) Have you looked
> into bloom filters? It might help if you have sparse dataset and you are
> using Phoenix dynamic columns.
> We use dynamic columns for a table that has columns upto 40k. Here is the
> presentation and optimizations we made for that use case:
> https://www.slideshare.net/anilgupta84/phoenix-con2017-truecarfinal
>
> IMO, Hive integration with HBase is not fully baked and it has a lot of
> rough edges. So, it better to stick with native Phoenix/HBase if you care
> about performance and ease of operations.
>
>
>
> HTH,
>
> Anil Gupta
>
>
>
>
>
> On Wed, Sep 25, 2019 at 10:01 AM Gautham Acharya <
> gauth...@alleninstitute.org> wrote:
>
> Hi,
>
>
>
> Currently I'm using Hbase to store large, sparse matrices of 50,000
> columns 10+ million rows of integers.
>
>
>
> This matrix is used for fast, random access - we need to be able to fetch
> random row/column subsets, as well as entire columns. We also want to very
> quickly fetch aggregates (Mean, median, etc) on this matrix.
>
>
>
> The data does not change very often for these matrices (a few times a week
> at most), so pre-computing is very feasible here. What I would like to do
> is maintain a column store (store the column names as row keys, and a
> compressed list of all the row values) for the use case where we select an
> entire column. Additionally, I would like to maintain a separate table for
> each precomputed aggregate (median table, mean table, etc).
>
>
>
> The query time for all these use cases needs to be low latency - under
> 100ms.
>
>
>
> When the data does change for a certain matrix, it would be nice to easily
> update the optimized table. Ideally, I would like the column
> store/aggregation tables to just be materialized views of the original
> matrix. It doesn't look like Apache Phoenix supports materialized views. It
> looks like Hive does, but unfortunately Hive doesn't normally offer low
> latency queries.
>
>
>
> Maybe Hive can create the materialized view, and we can just query the
> underlying Hbase store for lower latency responses?
>
>
>
> What would be a good solution for this?
>
>
>
> --gautham
>
>
>
>
>
>
>
> --gautham
>
>
>
>
>
> --
>
> Thanks & Regards,
> Anil Gupta
>


RE: Materialized views in Hbase/Phoenix

2019-09-27 Thread Gautham Acharya
Thanks Anil.

So, what you’re essentially advocating for is to use some kind of Spark/compute 
framework (I was going to use AWS Glue) job to write the ‘materialized views’ 
as separate tables (maybe tied together with some kind of a naming convention?)

In this case, we’d end up with some sticky data consistency issues if the write 
job failed halfway through (some ‘materialized view’ tables would be updated, 
and some wouldn’t). Can I use Phoenix transactions to wrap the write jobs 
together, to make sure either all the data is updated, or none?
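
For reference, Phoenix does offer optional transactional tables (declared with
TRANSACTIONAL=true and backed by an external transaction manager), which is
the feature this question is about. A minimal sketch of wrapping two view
writes in one commit, assuming both tables were created as transactional, the
python phoenixdb driver is used, and all table and column names are
hypothetical; whether the write throughput holds up at this volume would need
testing:

# Sketch only: commit two 'materialized view' upserts as one unit, or roll
# both back. Assumes TRANSACTIONAL=true tables and a transaction manager.
import phoenixdb

conn = phoenixdb.connect("http://localhost:8765/", autocommit=False)
cur = conn.cursor()
try:
    cur.execute("UPSERT INTO COLSTORE (COL_NAME, ROW_VALUES) VALUES (?, ?)",
                ("gene_42", b"\x00\x01\x02"))
    cur.execute("UPSERT INTO MEDIANS (COL_NAME, MEDIAN_VAL) VALUES (?, ?)",
                ("gene_42", 17.5))
    conn.commit()      # both writes become visible together
except Exception:
    conn.rollback()    # neither write sticks if anything failed
    raise
finally:
    conn.close()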

--gautham


From: anil gupta [mailto:anilgupt...@gmail.com]
Sent: Friday, September 27, 2019 6:58 AM
To: user@phoenix.apache.org
Subject: Re: Materialized views in Hbase/Phoenix


For your use case, i would suggest to create another table that stores the 
matrix. Since this data doesnt change that often, maybe you can write a nightly 
spark/MR job to update/rebuild the matrix table.(If you want near real time 
that is also possible with any streaming system) Have you looked into bloom 
filters? It might help if you have sparse dataset and you are using Phoenix 
dynamic columns.
We use dynamic columns for a table that has columns upto 40k. Here is the 
presentation and optimizations we made for that use case: 
https://www.slideshare.net/anilgupta84/phoenix-con2017-truecarfinal
IMO, Hive integration with HBase is not fully baked and it has a lot of rough 
edges. So, it better to stick with native Phoenix/HBase if you care about 
performance and ease of operations.

HTH,
Anil Gupta


On Wed, Sep 25, 2019 at 10:01 AM Gautham Acharya <gauth...@alleninstitute.org> wrote:
Hi,

Currently I'm using Hbase to store large, sparse matrices of 50,000 columns 10+ 
million rows of integers.

This matrix is used for fast, random access - we need to be able to fetch 
random row/column subsets, as well as entire columns. We also want to very 
quickly fetch aggregates (Mean, median, etc) on this matrix.

The data does not change very often for these matrices (a few times a week at 
most), so pre-computing is very feasible here. What I would like to do is 
maintain a column store (store the column names as row keys, and a compressed 
list of all the row values) for the use case where we select an entire column. 
Additionally, I would like to maintain a separate table for each precomputed 
aggregate (median table, mean table, etc).

The query time for all these use cases needs to be low latency - under 100ms.

When the data does change for a certain matrix, it would be nice to easily 
update the optimized table. Ideally, I would like the column store/aggregation 
tables to just be materialized views of the original matrix. It doesn't look 
like Apache Phoenix supports materialized views. It looks like Hive does, but 
unfortunately Hive doesn't normally offer low latency queries.

Maybe Hive can create the materialized view, and we can just query the 
underlying Hbase store for lower latency responses?

What would be a good solution for this?

--gautham



--gautham



--
Thanks & Regards,
Anil Gupta


Re: Materialized views in Hbase/Phoenix

2019-09-27 Thread anil gupta
For your use case, I would suggest creating another table that stores the
matrix. Since this data doesn't change that often, maybe you can write a
nightly Spark/MR job to update/rebuild the matrix table. (If you want near
real time, that is also possible with any streaming system.) Have you looked
into bloom filters? They might help if you have a sparse dataset and you are
using Phoenix dynamic columns.
We use dynamic columns for a table that has up to 40k columns. Here are the
presentation and the optimizations we made for that use case:
https://www.slideshare.net/anilgupta84/phoenix-con2017-truecarfinal
IMO, Hive integration with HBase is not fully baked and it has a lot of
rough edges. So, it is better to stick with native Phoenix/HBase if you care
about performance and ease of operations.
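
To make the nightly-rebuild idea concrete, a rough sketch of such a job using
PySpark and the Phoenix-Spark connector (the connector jar has to be on the
Spark classpath; the table names, ZooKeeper quorum and grouping column are
assumptions):

# Read the matrix through Phoenix, recompute one aggregate table, write it back.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("rebuild-matrix-views").getOrCreate()
ZK = "zk1,zk2,zk3:2181"

matrix = (spark.read.format("org.apache.phoenix.spark")
          .option("table", "MATRIX")
          .option("zkUrl", ZK)
          .load())

# Mean of one column per row group; a real job would loop over all columns.
means = (matrix.groupBy("GROUP_ID")
         .agg(F.avg("GENE_42").alias("MEAN_VAL"))
         .withColumn("COL_NAME", F.lit("GENE_42")))

# The connector upserts rows; it expects SaveMode "overwrite".
(means.write.format("org.apache.phoenix.spark")
 .mode("overwrite")
 .option("table", "MEANS")
 .option("zkUrl", ZK)
 .save())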

HTH,
Anil Gupta


On Wed, Sep 25, 2019 at 10:01 AM Gautham Acharya <
gauth...@alleninstitute.org> wrote:

> Hi,
>
>
>
> Currently I'm using Hbase to store large, sparse matrices of 50,000
> columns 10+ million rows of integers.
>
>
>
> This matrix is used for fast, random access - we need to be able to fetch
> random row/column subsets, as well as entire columns. We also want to very
> quickly fetch aggregates (Mean, median, etc) on this matrix.
>
>
>
> The data does not change very often for these matrices (a few times a week
> at most), so pre-computing is very feasible here. What I would like to do
> is maintain a column store (store the column names as row keys, and a
> compressed list of all the row values) for the use case where we select an
> entire column. Additionally, I would like to maintain a separate table for
> each precomputed aggregate (median table, mean table, etc).
>
>
>
> The query time for all these use cases needs to be low latency - under
> 100ms.
>
>
>
> When the data does change for a certain matrix, it would be nice to easily
> update the optimized table. Ideally, I would like the column
> store/aggregation tables to just be materialized views of the original
> matrix. It doesn't look like Apache Phoenix supports materialized views. It
> looks like Hive does, but unfortunately Hive doesn't normally offer low
> latency queries.
>
>
>
> Maybe Hive can create the materialized view, and we can just query the
> underlying Hbase store for lower latency responses?
>
>
>
> What would be a good solution for this?
>
>
>
> --gautham
>
>
>
>
>
>
>
> --gautham
>
>
>


-- 
Thanks & Regards,
Anil Gupta


Materialized views in Hbase/Phoenix

2019-09-25 Thread Gautham Acharya
Hi,

Currently I'm using HBase to store large, sparse matrices of integers, 50,000
columns by 10+ million rows.

This matrix is used for fast, random access - we need to be able to fetch 
random row/column subsets, as well as entire columns. We also want to very 
quickly fetch aggregates (Mean, median, etc) on this matrix.

The data does not change very often for these matrices (a few times a week at 
most), so pre-computing is very feasible here. What I would like to do is 
maintain a column store (store the column names as row keys, and a compressed 
list of all the row values) for the use case where we select an entire column. 
Additionally, I would like to maintain a separate table for each precomputed 
aggregate (median table, mean table, etc).
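
As a concrete illustration of the layout described above, a sketch of what the
column-store table and one precomputed aggregate table could look like in
Phoenix DDL, issued here through the python phoenixdb driver (names and types
are illustrative only, not a recommendation):

# 'Column store': one row per matrix column, the column's values packed into a
# single compressed blob, so fetching a whole column is a single point lookup.
import phoenixdb

conn = phoenixdb.connect("http://localhost:8765/", autocommit=True)
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS COLSTORE (
        COL_NAME   VARCHAR PRIMARY KEY,
        ROW_VALUES VARBINARY
    )
""")

# One table per precomputed aggregate, keyed by column and by row grouping.
cur.execute("""
    CREATE TABLE IF NOT EXISTS MEDIANS (
        COL_NAME   VARCHAR NOT NULL,
        GROUP_ID   VARCHAR NOT NULL,
        MEDIAN_VAL DOUBLE
        CONSTRAINT PK PRIMARY KEY (COL_NAME, GROUP_ID)
    )
""")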

The query time for all these use cases needs to be low latency - under 100ms.

When the data does change for a certain matrix, it would be nice to easily 
update the optimized table. Ideally, I would like the column store/aggregation 
tables to just be materialized views of the original matrix. It doesn't look 
like Apache Phoenix supports materialized views. It looks like Hive does, but 
unfortunately Hive doesn't normally offer low latency queries.

Maybe Hive can create the materialized view, and we can just query the 
underlying Hbase store for lower latency responses?

What would be a good solution for this?

--gautham



--gautham