For just a few million rows I would go for an RDBMS rather than Phoenix/HBase.

You don't really need transactions to control completion; just write a flag (an empty COMPLETED file, for instance) as the final step of your job.
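As a minimal sketch of that idea, assuming your Glue/Spark job writes its output somewhere on S3 (the bucket name, key layout, and run id below are placeholders; a status row in a small Phoenix table would work the same way):

    # Hypothetical completion marker for a multi-table rebuild job.
    # All of the 'materialized view' tables are written first; only if every
    # write succeeds does the job create the empty COMPLETED object, so readers
    # can treat its presence as "this run's tables are mutually consistent".
    import boto3

    def mark_completed(bucket: str, run_id: str) -> None:
        """Write an empty COMPLETED marker as the very last step of the job."""
        s3 = boto3.client("s3")
        s3.put_object(Bucket=bucket, Key=f"matrix-views/{run_id}/COMPLETED", Body=b"")

    # ... after all view/aggregate tables for this run have been written:
    mark_completed("my-output-bucket", "run-2019-09-27")

If the job dies halfway through, the marker is never written and consumers simply keep using the last complete run.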
On Fri, 27 Sep 2019, 15:03 Gautham Acharya, <gauth...@alleninstitute.org> wrote:

> Thanks Anil.
>
> So, what you’re essentially advocating for is to use some kind of Spark/compute framework (I was going to use AWS Glue) job to write the ‘materialized views’ as separate tables (maybe tied together with some kind of a naming convention?).
>
> In this case, we’d end up with some sticky data consistency issues if the write job failed halfway through (some ‘materialized view’ tables would be updated, and some wouldn’t). Can I use Phoenix transactions to wrap the write jobs together, to make sure either all the data is updated, or none?
>
> --gautham
>
> From: anil gupta [mailto:anilgupt...@gmail.com]
> Sent: Friday, September 27, 2019 6:58 AM
> To: user@phoenix.apache.org
> Subject: Re: Materialized views in Hbase/Phoenix
>
> For your use case, I would suggest creating another table that stores the matrix. Since this data doesn't change that often, maybe you can write a nightly Spark/MR job to update/rebuild the matrix table. (If you want near real time, that is also possible with any streaming system.) Have you looked into bloom filters? They might help if you have a sparse dataset and you are using Phoenix dynamic columns.
>
> We use dynamic columns for a table that has up to 40k columns. Here is the presentation and the optimizations we made for that use case:
> https://www.slideshare.net/anilgupta84/phoenix-con2017-truecarfinal
>
> IMO, Hive integration with HBase is not fully baked and it has a lot of rough edges. So it is better to stick with native Phoenix/HBase if you care about performance and ease of operations.
>
> HTH,
> Anil Gupta
>
> On Wed, Sep 25, 2019 at 10:01 AM Gautham Acharya <gauth...@alleninstitute.org> wrote:
>
> Hi,
>
> Currently I'm using HBase to store large, sparse matrices of 50,000 columns and 10+ million rows of integers.
>
> This matrix is used for fast, random access - we need to be able to fetch random row/column subsets, as well as entire columns. We also want to very quickly fetch aggregates (mean, median, etc.) on this matrix.
>
> The data does not change very often for these matrices (a few times a week at most), so pre-computing is very feasible here. What I would like to do is maintain a column store (store the column names as row keys, and a compressed list of all the row values) for the use case where we select an entire column. Additionally, I would like to maintain a separate table for each precomputed aggregate (median table, mean table, etc.).
>
> The query time for all these use cases needs to be low latency - under 100ms.
>
> When the data does change for a certain matrix, it would be nice to easily update the optimized table.
> Ideally, I would like the column store/aggregation tables to just be materialized views of the original matrix. It doesn't look like Apache Phoenix supports materialized views. It looks like Hive does, but unfortunately Hive doesn't normally offer low-latency queries.
>
> Maybe Hive can create the materialized view, and we can just query the underlying HBase store for lower-latency responses?
>
> What would be a good solution for this?
>
> --gautham
>
> --
> Thanks & Regards,
> Anil Gupta
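For concreteness, a rough sketch of the nightly rebuild Anil describes above, producing one row per matrix column with its precomputed aggregates. This assumes the matrix is also available in long form as (ROW_ID, COLUMN_ID, VALUE), that the phoenix-spark connector is on the classpath, and uses placeholder table names and ZooKeeper quorum:

    # Nightly rebuild of a pre-computed aggregate table with PySpark (sketch).
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("rebuild-column-aggregates").getOrCreate()

    # Read the matrix in long form from a (hypothetical) Phoenix table.
    matrix = (spark.read
              .format("org.apache.phoenix.spark")
              .option("table", "MATRIX_LONG")
              .option("zkUrl", "zk-host:2181")
              .load())

    # One output row per matrix column: its mean and approximate median.
    aggregates = (matrix.groupBy("COLUMN_ID")
                  .agg(F.avg("VALUE").alias("MEAN"),
                       F.expr("percentile_approx(VALUE, 0.5)").alias("MEDIAN")))

    # Phoenix writes are upserts, so existing aggregate rows are replaced in
    # place; finish the run with the COMPLETED marker sketched earlier.
    (aggregates.write
     .format("org.apache.phoenix.spark")
     .mode("overwrite")
     .option("table", "COLUMN_AGGREGATES")
     .option("zkUrl", "zk-host:2181")
     .save())

Scheduling this nightly (or whenever a matrix changes) keeps the precomputed tables close to the source data while leaving point reads to Phoenix/HBase.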