Re: Re: Performance options for doing Phoenix full table scans to complete some data statistics and summary collection work
Hi guys,

Thanks for all of your kind advice. For #1, we are planning to retry that. Mujtaba, compression is already set to Snappy. At the moment we use only a single column family, and we are free to move to multiple column families. The table schema is tall-narrow; for example, our table uses one default column family and has over 90 columns. How many column families would you recommend we use? Would two or three column families be enough? Our cluster has 5 nodes.

Thanks,
Sun.

CertusNet
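To make the column-family discussion concrete, here is a minimal sketch of a two-family layout in Phoenix DDL. All table and column names are invented for illustration; the idea is to keep the few columns the summary scans read in one family (A) and the remaining detail columns in another (B), with Snappy compression enabled as a table option:

-- Hypothetical wide KPI table split into two column families.
-- Primary-key columns live in the row key and take no family.
CREATE TABLE IF NOT EXISTS KPI_5MIN (
    HOST VARCHAR NOT NULL,
    TS DATE NOT NULL,
    -- family A: the few metrics the rollup scans actually read
    A.BYTES_IN BIGINT,
    A.BYTES_OUT BIGINT,
    -- family B: the remaining ~90 detail columns that rollups can skip
    B.DETAIL_1 VARCHAR,
    B.DETAIL_2 VARCHAR,
    CONSTRAINT PK PRIMARY KEY (HOST, TS)
) COMPRESSION='SNAPPY';

Because HBase stores each column family in its own set of HFiles, a scan that references only A.* columns never reads family B from disk. Two or three families grouped by access pattern are usually enough; many small families add overhead per region.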
Re: Performance options for doing Phoenix full table scans to complete some data statistics and summary collection work
From: Mujtaba Chohan
Date: 2015-01-09 00:42
To: user@phoenix.apache.org

With 100+ columns, using multiple column families will help a lot if your full scan uses only a few of them. Also, if the columns are wide, turning on compression will help if you are seeing disk I/O contention on the region servers.

On Wednesday, January 7, 2015, James Taylor jamestay...@apache.org wrote:

Hi Sun, Can you give us a sample DDL and upsert/select query for #1? What's the approximate cluster size, and what does the client look like? How much data are you scanning? Are you using multiple column families? We should be able to help tune things to improve #1. Thanks, James

On Monday, January 5, 2015, su...@certusnet.com.cn wrote:

We first ran the test using #1 and the result did not meet our expectations. Unfortunately I did not save a copy of the logs, but on the same datasets #2 was better than #1. Thanks, Sun.
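To illustrate Mujtaba's point, assuming the hypothetical two-family layout sketched earlier in the thread: an aggregate that touches only family A's columns lets HBase skip family B's store files entirely.

-- Reads only column family A (BYTES_IN lives in A);
-- family B's HFiles are never opened for this scan.
SELECT HOST, SUM(BYTES_IN) AS TOTAL_BYTES_IN
FROM KPI_5MIN
GROUP BY HOST;

Prefixing the query with EXPLAIN shows the scan plan Phoenix chooses.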
Performance options for doing Phoenix full table scans to complete some data statistics and summary collection work
From: su...@certusnet.com.cn
To: user user@phoenix.apache.org; dev d...@phoenix.apache.org
Sent: Monday, January 5, 2015 6:42 PM

Hi all,

Currently we are using Phoenix to store and query large KPI datasets for our projects. We regularly need full table scans over the Phoenix KPI tables for statistics and summary collection, e.g. rolling the five-minute data table up into an hour-based summary table, then into day-based and week-based tables, and so on. The approaches we currently use are:

1. The Phoenix UPSERT INTO ... SELECT statement (sketched below this message); however, the query performance does not meet our expectations.
2. Apache Spark with the phoenix_mr integration: we read data from the Phoenix tables into an RDD, transform it into a summary RDD, and bulkload the result into a new Phoenix table. This satisfies most of our application requirements, but in some cases the full-scan job cannot complete.

Here are my questions:

1. Are there more efficient approaches to improve the performance of Phoenix full table scans over large datasets? Any sharing is greatly appreciated.
2. Given that full table scans are not a natural fit for HBase tables, are there alternative options for doing this kind of work in our current HDFS and HBase environment? Please share any good pointers.

Best regards,
Sun.

CertusNet
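As a concrete reading of approach #1, here is a minimal UPSERT INTO ... SELECT rollup. The table and column names reuse the hypothetical KPI_5MIN sketch from earlier in the thread; none of them come from the original posts. TRUNC rounds each five-minute timestamp down to its hour, and the GROUP BY aggregation runs server-side in Phoenix coprocessors:

-- Hypothetical hourly summary table, mirroring the 5-minute sketch.
CREATE TABLE IF NOT EXISTS KPI_HOUR (
    HOST VARCHAR NOT NULL,
    TS DATE NOT NULL,
    BYTES_IN BIGINT,
    BYTES_OUT BIGINT,
    CONSTRAINT PK PRIMARY KEY (HOST, TS)
);

UPSERT INTO KPI_HOUR (HOST, TS, BYTES_IN, BYTES_OUT)
SELECT HOST,
       TRUNC(TS, 'HOUR'),  -- round each timestamp down to the hour
       SUM(BYTES_IN),
       SUM(BYTES_OUT)
FROM KPI_5MIN
GROUP BY HOST, TRUNC(TS, 'HOUR');

Only the grouped summary rows travel back through the client; the cost that remains is the full scan of KPI_5MIN itself, which is what the rest of this thread is about.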
Re: Performance options for doing Phoenix full table scans to complete some data statistics and summary collection work
From: lars hofhansl
Date: 2015-01-06 12:52
To: d...@phoenix.apache.org; user

Hi Sun,

Assuming that you are mostly talking about aggregates (in the sense of scanning a lot of data while the resulting set is small), it's interesting that option #1 would not satisfy your performance expectations but #2 would.

Which version of Phoenix are you using? From 4.2 on, Phoenix is aware of the distribution of the data and will farm out full scans in parallel chunks.

In #2 you would make a copy of the entire dataset in order to be able to query it via Spark? What kind of performance do you see with option #1 vs. #2?

Thanks.

-- Lars
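A note on the parallel chunks Lars mentions: Phoenix bases that parallelization on per-table statistics (guideposts) gathered during major compactions. In releases around this time frame the collection can also be triggered by hand; the statement below is a sketch against the hypothetical table from earlier, and its availability should be verified against the exact Phoenix version in use (it may postdate 4.2 itself):

-- Assumption: manual stats collection may not exist in 4.2 proper;
-- it was added in releases close to this era.
UPDATE STATISTICS KPI_5MIN;

The chunk granularity is governed by the server-side phoenix.stats.guidepost.width setting: smaller guideposts yield more, finer-grained parallel scans at the cost of more client-side merge work.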
Re: Performance options for doing Phoenix full table scans to complete some data statistics and summary collection work
From: Nick Dimiduk
Date: 2015-01-06 14:03
To: user@phoenix.apache.org
CC: lars hofhansl

The region server fails consistently? Can you provide logs from the failing process?

On Monday, January 5, 2015, su...@certusnet.com.cn wrote:

Hi Lars,

Thanks for your reply and advice. You are right, we are looking at aggregate-style work. Our requirements call for full scans over a table with approximately 50 million rows and roughly 100 columns. We are using the latest 4.2.2 release. In practice we use Spark to read from and write to the Phoenix tables: we use the MapReduce-over-Phoenix integration to perform the full table scan from Spark, then use the resulting RDD to write or bulkload into new Phoenix tables. That's our production flow.

Regarding #1 vs. #2 performance: #1 always failed to complete, and we could see a region server going down during the job. #2 initially hit ScannerTimeoutExceptions; after we tuned our HBase cluster configuration those problems went away. Even so, we are still looking for more efficient approaches to these full table scans over Phoenix datasets.

Thanks,
Sun.

CertusNet