Can the aggregation be run on the fly in a Phoenix query? A 100ms response
time, perhaps, but with how many concurrent queries?
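
For reference, this is roughly what the on-the-fly version would look like
(a sketch only; the table and column names are made up, and it assumes a
cell-per-row layout):

    -- Hypothetical layout: one row per (row_id, col_id) cell of the matrix.
    CREATE TABLE IF NOT EXISTS matrix_cells (
        row_id VARCHAR NOT NULL,
        col_id VARCHAR NOT NULL,
        val    INTEGER
        CONSTRAINT pk PRIMARY KEY (row_id, col_id));

    SELECT AVG(val)                                             AS mean_val,
           PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY val ASC) AS median_val
    FROM matrix_cells
    WHERE col_id = 'gene_00042';

Filtering on the second key column means scanning essentially the whole
table per query, and that is the cost that concurrent queries multiply.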

On Fri, 27 Sep 2019, 17:23 Gautham Acharya, <gauth...@alleninstitute.org>
wrote:

> We will be reaching 100 million rows early next year, and then billions
> shortly after that, so HBase will be needed to scale to that degree.
>
>
>
> If one of the tables fails to write, we need some kind of rollback
> mechanism, which is why I was considering a transaction. We cannot be in a
> partial state where some of the ‘views’ are written and some aren’t.
>
>
>
>
>
> *From:* Pedro Boado [mailto:pedro.bo...@gmail.com]
> *Sent:* Friday, September 27, 2019 7:22 AM
> *To:* user@phoenix.apache.org
> *Subject:* Re: Materialized views in Hbase/Phoenix
>
>
>
>
> For just a few million rows I would go for an RDBMS rather than Phoenix / HBase.
>
>
>
> You don't really need transactions to control completion; just write a
> flag (an empty COMPLETED file, for instance) as the final step in your job.
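>
> (As a sketch of the same idea kept inside Phoenix, a status row instead of
> a file; the table and column names below are made up:)
>
>     CREATE TABLE IF NOT EXISTS view_refresh_status (
>         view_set  VARCHAR NOT NULL,
>         build_id  VARCHAR NOT NULL,
>         completed BOOLEAN
>         CONSTRAINT pk PRIMARY KEY (view_set, build_id));
>
>     -- Flipped as the very last step of the job:
>     UPSERT INTO view_refresh_status VALUES ('matrix_views', '20190927', TRUE);
>
>     -- Readers only trust builds whose flag made it in:
>     SELECT MAX(build_id) FROM view_refresh_status
>     WHERE view_set = 'matrix_views' AND completed = TRUE;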
>
>
>
>
>
>
>
> On Fri, 27 Sep 2019, 15:03 Gautham Acharya, <gauth...@alleninstitute.org>
> wrote:
>
> Thanks Anil.
>
>
>
> So, what you’re essentially advocating is using some kind of Spark/compute
> framework (I was going to use AWS Glue) to write the ‘materialized views’
> as separate tables (maybe tied together with some kind of naming
> convention?)
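>
> (For concreteness, this is the kind of convention I mean; everything below
> is a made-up sketch:)
>
>     -- Each run writes fresh tables suffixed with a build id:
>     CREATE TABLE IF NOT EXISTS matrix_median_20190927 (
>         col_id VARCHAR PRIMARY KEY,
>         median DOUBLE);
>
>     -- A tiny pointer table tells readers which build is current;
>     -- repointing it is the last step of the job.
>     CREATE TABLE IF NOT EXISTS current_build (
>         view_set VARCHAR PRIMARY KEY,
>         build_id VARCHAR);
>     UPSERT INTO current_build VALUES ('matrix_views', '20190927');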
>
>
>
> In this case, we’d end up with some sticky data consistency issues if the
> write job failed halfway through (some ‘materialized view’ tables would be
> updated, and some wouldn’t). Can I use Phoenix transactions to wrap the
> write jobs together, so that either all of the data is updated or none of
> it is?
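>
> (Per the Phoenix docs, transactions are available as a beta feature via
> Tephra: the cluster has to enable phoenix.transactions.enabled and the
> tables must be declared transactional. A sketch under that assumption,
> with made-up names:)
>
>     CREATE TABLE median_view (col_id VARCHAR PRIMARY KEY, median DOUBLE)
>         TRANSACTIONAL=true;
>     CREATE TABLE mean_view   (col_id VARCHAR PRIMARY KEY, mean   DOUBLE)
>         TRANSACTIONAL=true;
>
>     -- With auto-commit off on the JDBC connection, these upserts land
>     -- together: commit() applies both, a failure or rollback() applies
>     -- neither.
>     UPSERT INTO median_view VALUES ('gene_00042', 3.0);
>     UPSERT INTO mean_view   VALUES ('gene_00042', 3.7);
>     -- then conn.commit() in code, or !commit in sqlline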
>
>
>
> --gautham
>
>
>
>
>
> *From:* anil gupta [mailto:anilgupt...@gmail.com]
> *Sent:* Friday, September 27, 2019 6:58 AM
> *To:* user@phoenix.apache.org
> *Subject:* Re: Materialized views in Hbase/Phoenix
>
>
>
>
> For your use case, I would suggest creating another table that stores the
> matrix. Since this data doesn't change that often, maybe you can write a
> nightly Spark/MR job to update/rebuild the matrix table. (If you want near
> real time, that is also possible with any streaming system.) Have you
> looked into bloom filters? They might help if you have a sparse dataset
> and you are using Phoenix dynamic columns.
> We use dynamic columns for a table with up to 40k columns. Here is the
> presentation and the optimizations we made for that use case:
> https://www.slideshare.net/anilgupta84/phoenix-con2017-truecarfinal
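>
> (For anyone following along, dynamic columns in Phoenix are declared at
> statement time rather than in the DDL. A sketch with made-up names; the
> BLOOMFILTER option is a plain HBase property that Phoenix DDL passes
> through to the underlying table:)
>
>     -- Base table declares only the row key; the wide columns stay dynamic.
>     CREATE TABLE matrix_wide (row_id VARCHAR PRIMARY KEY)
>         BLOOMFILTER='ROWCOL';
>
>     -- Writes declare each dynamic column and its type inline:
>     UPSERT INTO matrix_wide (row_id, gene_00042 INTEGER)
>         VALUES ('cell_001', 7);
>
>     -- Reads do the same:
>     SELECT row_id, gene_00042
>     FROM matrix_wide (gene_00042 INTEGER)
>     WHERE row_id = 'cell_001';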
>
> IMO, Hive integration with HBase is not fully baked and has a lot of rough
> edges, so it's better to stick with native Phoenix/HBase if you care about
> performance and ease of operations.
>
>
>
> HTH,
>
> Anil Gupta
>
>
>
>
>
> On Wed, Sep 25, 2019 at 10:01 AM Gautham Acharya <
> gauth...@alleninstitute.org> wrote:
>
> Hi,
>
>
>
> Currently I'm using HBase to store large, sparse matrices of integers,
> with 50,000 columns and 10+ million rows.
>
>
>
> This matrix is used for fast, random access - we need to be able to fetch
> random row/column subsets, as well as entire columns. We also want to very
> quickly fetch aggregates (mean, median, etc.) on this matrix.
>
>
>
> The data does not change very often for these matrices (a few times a week
> at most), so pre-computing is very feasible here. What I would like to do
> is maintain a column store (the column names as row keys, each mapped to a
> compressed list of all the row values) for the use case where we select an
> entire column. Additionally, I would like to maintain a separate table for
> each precomputed aggregate (a median table, a mean table, etc.).
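>
> (Spelled out as Phoenix DDL, this is roughly what I mean; the names and
> the VARBINARY encoding are placeholders:)
>
>     -- Column store: one row per matrix column, the whole column in one cell.
>     CREATE TABLE matrix_colstore (
>         col_id VARCHAR PRIMARY KEY,
>         vals   VARBINARY);   -- compressed list of all row values
>
>     -- One small table per precomputed aggregate:
>     CREATE TABLE matrix_median (col_id VARCHAR PRIMARY KEY, median DOUBLE);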
>
>
>
> The query time for all these use cases needs to be low latency - under
> 100ms.
>
>
>
> When the data does change for a certain matrix, it would be nice to easily
> update the optimized tables. Ideally, I would like the column
> store/aggregation tables to just be materialized views of the original
> matrix, but it doesn't look like Apache Phoenix supports materialized
> views. Hive does, but unfortunately Hive doesn't normally offer
> low-latency queries.
>
>
>
> Maybe Hive could create the materialized view, and we could just query the
> underlying HBase store for lower-latency responses?
>
>
>
> What would be a good solution for this?
>
>
>
> --gautham
>
>
>
>
>
>
>
>
>
>
> --
>
> Thanks & Regards,
> Anil Gupta
>
>
