Hi, Lars

Thanks for your reply and advice. You are right, we are doing a kind of aggregation work. Our requirements involve full scans over a table with approximately 50 million rows and around 100 columns. We are using the latest 4.2.2 release. Specifically, we use Spark to read from and write to Phoenix tables: we apply the MapReduce-over-Phoenix-tables scheme to do the full table scan in Spark, and then use the resulting RDD to write or bulkload into new Phoenix tables. That is our production flow.
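Roughly, the read side of that flow looks like the sketch below. The table and column names (KPI_5MIN, METRIC, VALUE) are simplified placeholders rather than our real schema, and the Phoenix MapReduce helper classes and signatures (PhoenixInputFormat, PhoenixMapReduceUtil) may differ between Phoenix versions:

    import java.sql.{PreparedStatement, ResultSet}

    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapreduce.Job
    import org.apache.hadoop.mapreduce.lib.db.DBWritable
    import org.apache.phoenix.mapreduce.PhoenixInputFormat
    import org.apache.phoenix.mapreduce.util.PhoenixMapReduceUtil
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._

    // Placeholder writable mirroring the selected columns; Phoenix fills one per row.
    class KpiWritable extends DBWritable {
      var metric: String = _
      var value: Long = _
      override def readFields(rs: ResultSet): Unit = {
        metric = rs.getString("METRIC")
        value = rs.getLong("VALUE")
      }
      override def write(ps: PreparedStatement): Unit = {
        ps.setString(1, metric)
        ps.setLong(2, value)
      }
    }

    object PhoenixFullScan {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("phoenix-full-scan"))
        // hbase-site.xml with the ZooKeeper quorum is assumed to be on the classpath.
        val job = Job.getInstance(sc.hadoopConfiguration)
        PhoenixMapReduceUtil.setInput(job, classOf[KpiWritable], "KPI_5MIN",
          "SELECT METRIC, VALUE FROM KPI_5MIN")
        val rows = sc.newAPIHadoopRDD(job.getConfiguration,
          classOf[PhoenixInputFormat[KpiWritable]],
          classOf[NullWritable], classOf[KpiWritable])
        // Copy fields out immediately: Hadoop input formats reuse writable instances.
        val summary = rows.map { case (_, w) => (w.metric, w.value) }
                          .reduceByKey(_ + _)
        // ...then write the summary RDD to the target table, or bulkload it.
        summary.take(10).foreach(println)
        sc.stop()
      }
    }

With this shape the aggregation happens in Spark rather than in the region servers, so the scan itself is the only heavy Phoenix-side operation.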
Regarding the performance of #1 vs. #2: #1 always fails to complete, and we can see region servers going down during the job. #2 initially hit ScannerTimeoutExceptions; after we tuned the scanner timeout parameters on our HBase cluster (a sketch of the keys we mean is at the end of this mail), those problems went away. However, we are still hoping for more efficient approaches to such full table scans over Phoenix datasets.

Thanks,
Sun.

CertusNet

From: lars hofhansl
Date: 2015-01-06 12:52
To: [email protected]; user
Subject: Re: Performance options for doing Phoenix full table scans to complete some data statistics and summary collection work

Hi Sun,

assuming that you are mostly talking about aggregates (in the sense of scanning a lot of data, but the resulting set is small), it's interesting that option #1 would not satisfy your performance expectations, but #2 would.

Which version of Phoenix are you using? From 4.2 on, Phoenix is well aware of the distribution of the data and will farm out full scans in parallel chunks.

In #2 you would make a copy of the entire dataset in order to be able to "query" it via Spark? What kind of performance do you see with option #1 vs. #2?

Thanks.

-- Lars

From: "[email protected]" <[email protected]>
To: user <[email protected]>; dev <[email protected]>
Sent: Monday, January 5, 2015 6:42 PM
Subject: Performance options for doing Phoenix full table scans to complete some data statistics and summary collection work

Hi, all

Currently we are using Phoenix to store and query large KPI datasets for our projects. We definitely need to do full table scans of the Phoenix KPI tables for data statistics and summary collection, e.g. rolling a five-minute data table up into hour-based, day-based, and week-based summary tables, and so on. The approaches we currently use are as follows:

1. Using the Phoenix UPSERT INTO ... SELECT ... grammar (sketched below, after this mail); however, the query performance does not satisfy our expectations.
2. Using Apache Spark with the phoenix_mr integration to read data from Phoenix tables and create an RDD, then transforming it into a summary RDD and bulkloading that into a new Phoenix data table. This approach satisfies most of our application requirements, but in some cases we cannot complete the full scan job.

Here are my questions:

1. Are there any more efficient approaches for improving the performance of Phoenix full table scans over large datasets? Any sharing is greatly appreciated.
2. Given that full table scans are not really appropriate for HBase tables, are there alternative options for doing such work under the current HDFS and HBase environments? Please kindly share any good pointers.

Best regards,
Sun.

CertusNet
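For reference, the UPSERT INTO ... SELECT rollup of approach 1 above would look roughly like the following, driven through the Phoenix JDBC driver. The table and column names (KPI_5MIN, KPI_HOURLY, METRIC, TS, VALUE) are again placeholders:

    import java.sql.DriverManager

    object UpsertSelectRollup {
      def main(args: Array[String]): Unit = {
        // Thick Phoenix JDBC driver; the URL names the ZooKeeper quorum.
        val conn = DriverManager.getConnection("jdbc:phoenix:zk1,zk2,zk3")
        conn.setAutoCommit(true) // let Phoenix commit the upsert in batches as it scans
        val stmt = conn.createStatement()
        // One statement rolls the 5-minute table up into the hourly table,
        // but it still has to scan every row of KPI_5MIN on the server side.
        stmt.executeUpdate(
          """UPSERT INTO KPI_HOURLY (METRIC, HOUR_TS, TOTAL)
            |SELECT METRIC, TRUNC(TS, 'HOUR'), SUM(VALUE)
            |FROM KPI_5MIN
            |GROUP BY METRIC, TRUNC(TS, 'HOUR')""".stripMargin)
        stmt.close()
        conn.close()
      }
    }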

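And these are the kinds of scanner and RPC timeout settings we mean for the ScannerTimeoutException in approach 2. The property names are the standard HBase/Phoenix keys; the values are only illustrative, and they would normally live in hbase-site.xml on both the clients and the region servers:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.hbase.HBaseConfiguration

    object ScanTimeoutTuning {
      def tunedConf(): Configuration = {
        val conf = HBaseConfiguration.create()
        // How long a scanner lease may sit idle before the server expires it.
        conf.set("hbase.client.scanner.timeout.period", "600000") // 10 minutes
        // Each scanner next() call is an RPC and must also fit in the RPC timeout.
        conf.set("hbase.rpc.timeout", "600000")
        // Fetch fewer rows per next() so each RPC returns well within the lease.
        conf.set("hbase.client.scanner.caching", "100")
        // Phoenix's own end-to-end query timeout, in milliseconds.
        conf.set("phoenix.query.timeoutMs", "600000")
        conf
      }
    }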