Re: Re: Performance options for doing Phoenix full table scans to complete some data statistics and summary collection work

2015-01-08 Thread su...@certusnet.com.cn
Hi guys,
Thanks for all of your kind advice. For #1, we are planning to retry that.
Mujtaba, compression is already set to Snappy. Actually we are using only one
column family, and we would be glad to utilize multiple column families. The
table schema is tall and narrow.
For example, our table uses one default column family and has over 90 columns.
How many column families would you recommend we apply? Maybe only two to three
column families are enough?
We have one cluster with 5 nodes.

Thanks ,
Sun.





CertusNet 

From: Mujtaba Chohan
Date: 2015-01-09 00:42
To: user@phoenix.apache.org
Subject: Re: Performance options for doing Phoenix full table scans to complete 
some data statistics and summary collection work
With 100+ columns, using multiple column families will help a lot if your full
scan uses only a few columns.

Also, if columns are wide, then turning on compression would help if you are
seeing disk I/O contention on region servers.
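
Both suggestions can be expressed directly in the table definition. As a
minimal sketch (table and column names here are hypothetical, not Sun's actual
schema), the few columns a full scan touches can live in one family while the
remaining detail columns go to another, with Snappy enabled for the store files:

```sql
-- Hypothetical schema: family A holds the frequently scanned KPI columns,
-- family B the rarely read detail columns, so a scan that references only
-- A-prefixed columns never reads B's store files. COMPRESSION applies to
-- the underlying HBase store files.
CREATE TABLE IF NOT EXISTS KPI_5MIN (
    HOST       VARCHAR NOT NULL,
    TS         DATE    NOT NULL,
    A.CPU_PCT  DOUBLE,
    A.MEM_PCT  DOUBLE,
    B.DETAIL1  VARCHAR,
    B.DETAIL2  VARCHAR,
    CONSTRAINT PK PRIMARY KEY (HOST, TS)
) COMPRESSION='SNAPPY';

-- Touches only family A's store files:
SELECT HOST, AVG(CPU_PCT) FROM KPI_5MIN GROUP BY HOST;
```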

On Wednesday, January 7, 2015, James Taylor jamestay...@apache.org wrote:
Hi Sun,
Can you give us a sample DDL and upsert/select query for #1? What's the 
approximate cluster size and what does the client look like? How much data are 
you scanning? Are you using multiple column families? We should be able to help 
tune things to improve #1.
Thanks,
James

On Monday, January 5, 2015, su...@certusnet.com.cn su...@certusnet.com.cn 
wrote:
We first ran the test using #1, and the result did not satisfy our
expectations.
Unfortunately I did not save a copy of the log, but under the same conditions
and datasets,
#2 performed better than #1.

Thanks,
Sun.






From: Nick Dimiduk
Date: 2015-01-06 14:03
To: user@phoenix.apache.org
CC: lars hofhansl
Subject: Re: Performance options for doing Phoenix full table scans to complete 
some data statistics and summary collection work
Region server fails consistently? Can you provide logs from the failing process?

On Monday, January 5, 2015, su...@certusnet.com.cn su...@certusnet.com.cn 
wrote:
Hi Lars,
Thanks for your reply and advice. You are right, we are considering that sort
of aggregation work.
Our requirements need to ensure a full scan over a table with approximately 50
million rows containing
nearly 100+ columns. We are using the latest 4.2.2 release; actually we are
using Spark to read from and write to
Phoenix tables. We use the MapReduce-over-Phoenix-tables scheme to do the full
table scan in Spark, and
then we use the created RDD to write or bulk load into new Phoenix tables.
That's just our production flow.

Regarding #1 vs. #2 performance, we found that #1 would always fail to
complete, and we could see a region server
going down during the job. #2 would cause some kind of
ScannerTimeoutException; we then tuned the configuration
of our HBase cluster and such problems went away. However, we are still
looking for more efficient approaches for doing
such full table scans over Phoenix datasets.

Thanks,
Sun.





CertusNet 

From: lars hofhansl
Date: 2015-01-06 12:52
To: d...@phoenix.apache.org; user
Subject: Re: Performance options for doing Phoenix full table scans to complete 
some data statistics and summary collection work
Hi Sun,

assuming that you are mostly talking about aggregates (in the sense of scanning
a lot of data, but the resulting set is small), it's interesting that option #1
would not satisfy your performance expectations, but #2 would.

Which version of Phoenix are you using? From 4.2, Phoenix is well aware of the
distribution of the data and will farm out full scans in parallel chunks.
In #2 you would make a copy of the entire dataset in order to be able to
query it via Spark?
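
For example, pre-splitting a table via salting is another way to keep a full
scan spread across region servers; a sketch with hypothetical table and column
names:

```sql
-- Hypothetical: SALT_BUCKETS pre-splits the table into 16 buckets by
-- prefixing each row key with a one-byte hash, so a full scan is served
-- by (at least) 16 parallel region scans instead of hot-spotting on one
-- region.
CREATE TABLE KPI_5MIN_SALTED (
    HOST     VARCHAR NOT NULL,
    TS       DATE    NOT NULL,
    CPU_PCT  DOUBLE,
    CONSTRAINT PK PRIMARY KEY (HOST, TS)
) SALT_BUCKETS = 16;
```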

What kind of performance do you see with option #1 vs #2?

Thanks. 

-- Lars



From: su...@certusnet.com.cn su...@certusnet.com.cn
To: user user@phoenix.apache.org; dev d...@phoenix.apache.org 
Sent: Monday, January 5, 2015 6:42 PM
Subject: Performance options for doing Phoenix full table scans to complete 
some data statistics and summary collection work

Hi all,
Currently we are using Phoenix to store and query large datasets of KPIs for
our projects. Note that we definitely need
to do full table scans of Phoenix KPI tables for data statistics and summary
collection, e.g. rolling up from a five-minute data table to
an hour-based summary table, and on to day-based and week-based data tables,
and so on.
The approaches we currently use are as follows:
1. Using the Phoenix UPSERT INTO ... SELECT ... syntax; however, the query
performance does not satisfy our expectations.
2. Using Apache Spark with the phoenix_mr integration to read data from Phoenix
tables and create RDDs; then we can transform
these RDDs into summary RDDs and bulk load into a new Phoenix data table. This
approach satisfies most of our application requirements, but
in some cases we cannot complete the full scan job.
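
For the five-minute-to-hourly rollup, approach #1 would look roughly like this
(table and column names are illustrative only, not our actual schema):

```sql
-- Hypothetical rollup executed server-side by Phoenix: aggregate the
-- five-minute KPI rows into an hourly summary table, truncating each
-- timestamp to the hour it falls in.
UPSERT INTO KPI_HOURLY (HOST, HOUR_TS, AVG_CPU, MAX_CPU)
SELECT HOST,
       TRUNC(TS, 'HOUR'),
       AVG(CPU_PCT),
       MAX(CPU_PCT)
FROM KPI_5MIN
GROUP BY HOST, TRUNC(TS, 'HOUR');
```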

Here are my questions:
1. Are there any more efficient approaches for improving the performance of
Phoenix full table scans over large datasets? Any sharing is greatly
appreciated.
2. Noting that full table scans are not quite appropriate for HBase tables,
are there any alternative options for doing such work under our current HDFS
and HBase environments? Please kindly share any good points.

Best regards,
Sun.
