Can the aggregation be run on the fly in a Phoenix query? A 100ms response time, but with how many concurrent queries?
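(A minimal sketch of what such an on-the-fly aggregation could look like through Phoenix JDBC; the table name, column names, and ZooKeeper host below are invented for illustration.)

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class OnTheFlyAggregate {
        public static void main(String[] args) throws Exception {
            // The Phoenix JDBC URL points at the cluster's ZooKeeper quorum.
            try (Connection conn =
                     DriverManager.getConnection("jdbc:phoenix:zk-host:2181")) {
                // The aggregate is computed at query time over a row subset,
                // rather than read from a pre-computed table.
                String sql = "SELECT AVG(EXPR_VALUE), MAX(EXPR_VALUE) "
                           + "FROM MATRIX WHERE GENE_ID = ?";
                try (PreparedStatement ps = conn.prepareStatement(sql)) {
                    ps.setString(1, "GENE_0042");
                    try (ResultSet rs = ps.executeQuery()) {
                        if (rs.next()) {
                            System.out.println("mean=" + rs.getDouble(1)
                                             + " max=" + rs.getDouble(2));
                        }
                    }
                }
            }
        }
    }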
On Fri, 27 Sep 2019, 17:23 Gautham Acharya, <gauth...@alleninstitute.org> wrote:

> We will be reaching 100 million rows early next year, and then billions
> shortly after that, so HBase will be needed to scale to that degree.
>
> If one of the tables fails to write, we need some kind of rollback
> mechanism, which is why I was considering a transaction. We cannot be in
> a partial state where some of the ‘views’ are written and some aren’t.
>
> *From:* Pedro Boado [mailto:pedro.bo...@gmail.com]
> *Sent:* Friday, September 27, 2019 7:22 AM
> *To:* user@phoenix.apache.org
> *Subject:* Re: Materialized views in Hbase/Phoenix
>
> For just a few million rows I would go for an RDBMS and not
> Phoenix/HBase.
>
> You don't really need transactions to control completion; just write a
> flag (a COMPLETED empty file, for instance) as the final step of your
> job.
>
> On Fri, 27 Sep 2019, 15:03 Gautham Acharya, <gauth...@alleninstitute.org>
> wrote:
>
> Thanks Anil.
>
> So what you’re essentially advocating is using some kind of
> Spark/compute-framework job (I was going to use AWS Glue) to write the
> ‘materialized views’ as separate tables (maybe tied together with some
> kind of naming convention?).
>
> In that case, we’d end up with some sticky data-consistency issues if
> the write job failed halfway through (some ‘materialized view’ tables
> would be updated, and some wouldn’t). Can I use Phoenix transactions to
> wrap the write jobs together, to make sure that either all of the data
> is updated or none of it is?
>
> --gautham
>
> *From:* anil gupta [mailto:anilgupt...@gmail.com]
> *Sent:* Friday, September 27, 2019 6:58 AM
> *To:* user@phoenix.apache.org
> *Subject:* Re: Materialized views in Hbase/Phoenix
>
> For your use case, I would suggest creating another table that stores
> the matrix. Since this data doesn't change that often, you could write a
> nightly Spark/MapReduce job to update/rebuild the matrix table. (If you
> want near real time, that is also possible with any streaming system.)
> Have you looked into bloom filters? They might help if you have a sparse
> dataset and you are using Phoenix dynamic columns.
>
> We use dynamic columns for a table that has up to 40k columns. Here is
> the presentation on the optimizations we made for that use case:
> https://www.slideshare.net/anilgupta84/phoenix-con2017-truecarfinal
>
> IMO, Hive integration with HBase is not fully baked and has a lot of
> rough edges, so it is better to stick with native Phoenix/HBase if you
> care about performance and ease of operations.
>
> HTH,
> Anil Gupta
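(A minimal sketch of the dynamic-column pattern Anil describes above; all table, column, and host names are invented, and Phoenix's documented inline dynamic-column syntax is assumed.)

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class DynamicColumnsSketch {
        public static void main(String[] args) throws Exception {
            try (Connection conn =
                     DriverManager.getConnection("jdbc:phoenix:zk-host:2181");
                 Statement st = conn.createStatement()) {
                // Only the row key is declared up front; the per-sample
                // columns are supplied dynamically at write/read time.
                st.execute("CREATE TABLE IF NOT EXISTS MATRIX "
                         + "(GENE_ID VARCHAR PRIMARY KEY)");
                // The dynamic column and its type are declared inline in
                // the UPSERT itself.
                st.execute("UPSERT INTO MATRIX (GENE_ID, SAMPLE_0001 INTEGER) "
                         + "VALUES ('GENE_0042', 7)");
                conn.commit(); // Phoenix connections do not autocommit by default.
                // Dynamic columns must be re-declared when reading.
                try (ResultSet rs = st.executeQuery(
                        "SELECT GENE_ID, SAMPLE_0001 "
                      + "FROM MATRIX (SAMPLE_0001 INTEGER)")) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1) + " = " + rs.getInt(2));
                    }
                }
            }
        }
    }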
On Wed, Sep 25, 2019 at 10:01 AM Gautham Acharya <gauth...@alleninstitute.org> wrote:

> Hi,
>
> Currently I'm using HBase to store large, sparse integer matrices of
> 50,000 columns and 10+ million rows.
>
> These matrices are used for fast, random access - we need to be able to
> fetch random row/column subsets, as well as entire columns. We also want
> to fetch aggregates (mean, median, etc.) on a matrix very quickly.
>
> The data does not change very often for these matrices (a few times a
> week at most), so pre-computing is very feasible here. What I would like
> to do is maintain a column store (storing the column names as row keys,
> each with a compressed list of all the row values) for the use case
> where we select an entire column. Additionally, I would like to maintain
> a separate table for each pre-computed aggregate (a median table, a mean
> table, etc.).
>
> The query time for all of these use cases needs to be low latency -
> under 100ms.
>
> When the data does change for a certain matrix, it would be nice to be
> able to easily update the optimized tables. Ideally, I would like the
> column-store/aggregation tables to just be materialized views of the
> original matrix. It doesn't look like Apache Phoenix supports
> materialized views. It looks like Hive does, but unfortunately Hive
> doesn't normally offer low-latency queries.
>
> Maybe Hive can create the materialized view, and we can just query the
> underlying HBase store for lower-latency responses?
>
> What would be a good solution for this?
>
> --gautham
>
> --
> Thanks & Regards,
> Anil Gupta
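(Putting the thread's suggestions together - a minimal sketch of the pre-computed "view" tables plus Pedro's completion-flag idea. Every table, column, path, and host name here is invented, and an HDFS-style filesystem is assumed for the marker file.)

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RebuildViewTables {
        public static void main(String[] args) throws Exception {
            try (Connection conn =
                     DriverManager.getConnection("jdbc:phoenix:zk-host:2181");
                 Statement st = conn.createStatement()) {
                // One table per pre-computed aggregate.
                st.execute("CREATE TABLE IF NOT EXISTS MATRIX_MEAN "
                         + "(GENE_ID VARCHAR PRIMARY KEY, MEAN_VALUE DOUBLE)");
                // Column store: sample (column) names as row keys, with each
                // sample's values serialized into a single compressed blob.
                st.execute("CREATE TABLE IF NOT EXISTS MATRIX_BY_SAMPLE "
                         + "(SAMPLE_ID VARCHAR PRIMARY KEY, "
                         + "GENE_VALUES VARBINARY)");
                // ... the rebuild job upserts into every view table here ...
                conn.commit();
            }
            // Final step (Pedro's suggestion): publish an empty COMPLETED
            // marker. Readers treat the rebuild as successful only once this
            // file exists, so a job that dies halfway is simply ignored.
            FileSystem fs = FileSystem.get(new Configuration());
            fs.create(new Path("/views/2019-09-27/_COMPLETED")).close();
        }
    }

With this layout it is the flag, rather than a cross-table transaction, that makes the set of 'views' appear atomic to readers.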