Region server fails consistently? Can you provide logs from the failing process?
On Monday, January 5, 2015, [email protected] <[email protected]> wrote:

> Hi, Lars
> Thanks for your reply and advice. You are right, we are considering that
> sort of aggregation work.
> Our requirements need to assure a full scan over a table with approximately
> 50 million rows and nearly 100+ columns. We are using the latest 4.2.2
> release, and we are actually using Spark to read from and write to Phoenix
> tables. We use the MapReduce integration over Phoenix tables to do the full
> table scan in Spark, and then we use the created RDD to write or bulkload
> to new Phoenix tables. That's just our production flow.
>
> Regarding #1 vs #2 performance, we found that #1 would always fail to
> complete, and we can see a regionserver falling down during the job. #2
> would cause some kind of ScannerTimeoutException; after we configured
> parameters for our HBase cluster those problems went away. However, we are
> still looking for more efficient approaches for doing such full table scans
> over Phoenix datasets.
>
> Thanks,
> Sun.
>
> ------------------------------
>
> CertusNet
>
>
> *From:* lars hofhansl <[email protected]>
> *Date:* 2015-01-06 12:52
> *To:* [email protected]; user <[email protected]>
> *Subject:* Re: Performance options for doing Phoenix full table scans to
> complete some data statistics and summary collection work
> Hi Sun,
>
> assuming that you are mostly talking about aggregates (in the sense of
> scanning a lot of data, but the resulting set is small), it's interesting
> that option #1 would not satisfy your performance expectations, but #2
> would.
>
> Which version of Phoenix are you using? From 4.2 on, Phoenix is aware of
> the distribution of the data and will farm out full scans in parallel
> chunks.
> In #2 you would make a copy of the entire dataset in order to be able
> to "query" it via Spark?
>
> What kind of performance do you see with option #1 vs #2?
>
> Thanks.
>
> -- Lars
>
> ------------------------------
> *From:* "[email protected]" <[email protected]>
> *To:* user <[email protected]>; dev <[email protected]>
> *Sent:* Monday, January 5, 2015 6:42 PM
> *Subject:* Performance options for doing Phoenix full table scans to
> complete some data statistics and summary collection work
>
> Hi, all
> Currently we are using Phoenix to store and query large datasets of KPIs
> for our projects. Note that we definitely need to do full table scans of
> Phoenix KPI tables for data statistics and summary collection, e.g. from a
> five-minute data table to an hour-based summary table, and on to day-based
> and week-based data tables, and so on.
> The approaches we currently use are as follows:
> 1. Using the Phoenix upsert into ... select ... syntax; however, the query
> performance does not satisfy our expectations.
> 2. Using Apache Spark with the phoenix_mr integration to read data from
> Phoenix tables and create an RDD, which we then transform into a summary
> RDD and bulkload into a new Phoenix data table. This approach satisfies
> most of our application requirements, but in some cases we cannot complete
> the full scan job.
>
> Here are my questions:
> 1. Are there any more efficient approaches for improving the performance of
> Phoenix full table scans over large data sets? Anything you can kindly share
> is greatly appreciated.
> 2. Noting that full table scans are not really appropriate for HBase tables,
> are there any alternative options for doing such work under our current HDFS
> and HBase environments? Please kindly share any good points.
>
> Best regards,
> Sun.
>
>
> CertusNet
>
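
For the UPSERT ... SELECT route (#1), running the roll-up in bounded time slices is usually the first thing to try, so no single statement has to scan the whole table. Below is only a rough sketch over JDBC: the table names, columns, and ZooKeeper quorum are made up, and the timeout properties are just the usual ones to look at when you hit ScannerTimeoutException (names and defaults vary by HBase/Phoenix version).

import java.sql.DriverManager
import java.util.Properties

object HourlyRollup {
  def main(args: Array[String]): Unit = {
    Class.forName("org.apache.phoenix.jdbc.PhoenixDriver")

    // Client-side timeouts often need raising for long scans; check the
    // exact property names against your HBase/Phoenix versions.
    val props = new Properties()
    props.setProperty("phoenix.query.timeoutMs", "1800000")
    props.setProperty("hbase.client.scanner.timeout.period", "600000")
    props.setProperty("hbase.rpc.timeout", "600000")

    // "zk1,zk2,zk3" is a placeholder ZooKeeper quorum.
    val conn = DriverManager.getConnection("jdbc:phoenix:zk1,zk2,zk3", props)
    try {
      val stmt = conn.createStatement()
      // KPI_5MIN / KPI_HOURLY and their columns are hypothetical. The WHERE
      // clause bounds the slice so one statement never scans everything.
      stmt.executeUpdate(
        """UPSERT INTO KPI_HOURLY (HOST, METRIC, TS, VAL)
          |SELECT HOST, METRIC, TRUNC(TS, 'HOUR'), SUM(VAL)
          |FROM KPI_5MIN
          |WHERE TS >= TO_DATE('2015-01-05 00:00:00')
          |  AND TS <  TO_DATE('2015-01-06 00:00:00')
          |GROUP BY HOST, METRIC, TRUNC(TS, 'HOUR')""".stripMargin)
      conn.commit() // Phoenix connections are not auto-commit by default
      stmt.close()
    } finally {
      conn.close()
    }
  }
}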

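For the Spark route (#2), once the RDD has been created from the Phoenix MapReduce input format, the summary step and the write-back can stay fairly small. The sketch below is only illustrative: the Kpi record type and the KPI_HOURLY table and columns are invented, and it writes back with batched UPSERTs over JDBC from each partition rather than a bulkload.

import java.sql.{DriverManager, Timestamp}
import org.apache.spark.SparkContext._ // pair-RDD implicits for older Spark releases
import org.apache.spark.rdd.RDD

// Hypothetical shape of one 5-minute KPI row after mapping the Phoenix input records.
case class Kpi(host: String, metric: String, tsMillis: Long, value: Double)

object SparkRollup {
  def rollupAndWrite(fiveMin: RDD[Kpi], zkQuorum: String): Unit = {
    // Truncate each timestamp to the hour and sum the 5-minute values.
    val hourly = fiveMin
      .map { k =>
        val hour = k.tsMillis - (k.tsMillis % 3600000L)
        ((k.host, k.metric, hour), k.value)
      }
      .reduceByKey(_ + _)

    // Write the summary back with one Phoenix connection per partition.
    hourly.foreachPartition { rows =>
      val conn = DriverManager.getConnection(s"jdbc:phoenix:$zkQuorum")
      val ps = conn.prepareStatement(
        "UPSERT INTO KPI_HOURLY (HOST, METRIC, TS, VAL) VALUES (?, ?, ?, ?)")
      try {
        var n = 0
        rows.foreach { case ((host, metric, hour), sum) =>
          ps.setString(1, host)
          ps.setString(2, metric)
          ps.setTimestamp(3, new Timestamp(hour))
          ps.setDouble(4, sum)
          ps.executeUpdate()
          n += 1
          if (n % 1000 == 0) conn.commit() // commit in batches to bound client-side buffering
        }
        conn.commit()
      } finally {
        ps.close()
        conn.close()
      }
    }
  }
}

If bulkloading HFiles is a hard requirement, the foreachPartition step would be swapped for your existing bulkload path; the batched UPSERTs are just the simpler thing to show here.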