With 100+ columns, using multiple column families will help a lot if your
full scan uses only a few columns. Also, if the columns are wide, turning on
compression would help if you are seeing disk I/O contention on the region
servers.
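For illustration, a minimal sketch of what such a DDL could look like is
below. The table name, column names, ZooKeeper quorum, and compression codec
are all assumptions, not taken from your schema: the few columns the summary
scan actually reads go into one column family, the remaining wide columns go
into another, and compression is enabled at the table level.

import java.sql.DriverManager

object CreateKpiTable {
  def main(args: Array[String]): Unit = {
    // Hypothetical Phoenix JDBC URL; substitute your own ZooKeeper quorum.
    val conn = DriverManager.getConnection("jdbc:phoenix:zk1,zk2,zk3:2181")
    val stmt = conn.createStatement()
    // Family S holds the columns the statistics job scans; family D holds
    // the remaining wide, rarely-scanned columns. COMPRESSION is passed
    // through to HBase (SNAPPY needs the native libraries installed;
    // GZ works out of the box).
    stmt.executeUpdate(
      """CREATE TABLE IF NOT EXISTS KPI_5MIN (
        |  METRIC_ID VARCHAR NOT NULL,
        |  TS        DATE    NOT NULL,
        |  S.VAL     DOUBLE,
        |  S.CNT     BIGINT,
        |  D.DETAIL  VARCHAR
        |  CONSTRAINT PK PRIMARY KEY (METRIC_ID, TS)
        |) COMPRESSION='SNAPPY'""".stripMargin)
    conn.close()
  }
}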
On Wednesday, January 7, 2015, James Taylor <[email protected]> wrote:
> Hi Sun,
> Can you give us a sample DDL and upsert/select query for #1? What's the
> approximate cluster size, and what does the client look like? How much
> data are you scanning? Are you using multiple column families? We should
> be able to help tune things to improve #1.
> Thanks,
> James
>
> On Monday, January 5, 2015, [email protected] wrote:
>
>> We had first done the test using #1, and the result did not satisfy our
>> expectations. Unfortunately I did not keep a copy of the log, but under
>> the same conditions and datasets, #2 is better than #1.
>>
>> Thanks,
>> Sun.
>>
>> ------------------------------
>> From: Nick Dimiduk
>> Date: 2015-01-06 14:03
>> To: [email protected]
>> CC: lars hofhansl
>> Subject: Re: Performance options for doing Phoenix full table scans to
>> complete some data statistics and summary collection work
>>
>> Region server fails consistently? Can you provide logs from the failing
>> process?
>>
>> On Monday, January 5, 2015, [email protected] wrote:
>>
>>> Hi, Lars,
>>> Thanks for your reply and advice. You are right, we are considering
>>> that sort of aggregation work. Our requirements call for a full scan
>>> over a table with approximately 50 million rows and nearly 100+
>>> columns. We are using the latest 4.2.2 release; actually, we are using
>>> Spark to read from and write to Phoenix tables. We use the Phoenix
>>> MapReduce integration over the Phoenix tables to do the full table
>>> scan in Spark, and we then use the resulting RDD to write or bulk-load
>>> into new Phoenix tables. That is just our production flow.
>>>
>>> Regarding #1 vs. #2 performance, we found that #1 would always fail to
>>> complete, and we could see a region server going down during the job.
>>> #2 caused some ScannerTimeoutExceptions; after we tuned the relevant
>>> parameters on our HBase cluster, those problems went away. However, we
>>> are still looking for more efficient approaches to such full table
>>> scans over Phoenix datasets.
>>>
>>> Thanks,
>>> Sun.
>>>
>>> ------------------------------
>>> CertusNet
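(Regarding the ScannerTimeoutException above: the thread doesn't say which
HBase parameters were tuned. A rough sketch of the client-side settings that
usually matter for long-running full scans is below; the key names are the
standard HBase/Phoenix ones, but the values are placeholders, not
recommendations. The same keys can be set in hbase-site.xml on clients and
region servers instead.)

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.HBaseConfiguration

// Rough sketch only: settings that commonly need raising when a long
// Phoenix full scan hits ScannerTimeoutException.
object ScanTuning {
  def tunedConf(): Configuration = {
    val conf = HBaseConfiguration.create()
    // How long a scanner lease may stay idle between next() calls.
    conf.setLong("hbase.client.scanner.timeout.period", 600000L)
    // The RPC timeout should be at least as large as the scanner timeout.
    conf.setLong("hbase.rpc.timeout", 600000L)
    // Fewer rows per next() call keeps each RPC well under the lease limit.
    conf.setInt("hbase.client.scanner.caching", 1000)
    // Phoenix's own overall query timeout, in milliseconds.
    conf.setLong("phoenix.query.timeoutMs", 1800000L)
    conf
  }
}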
>>> ------------------------------
>>> From: lars hofhansl
>>> Date: 2015-01-06 12:52
>>> To: [email protected]; user
>>> Subject: Re: Performance options for doing Phoenix full table scans to
>>> complete some data statistics and summary collection work
>>>
>>> Hi Sun,
>>>
>>> Assuming that you are mostly talking about aggregates (in the sense of
>>> scanning a lot of data while the resulting set is small), it's
>>> interesting that option #1 would not satisfy your performance
>>> expectations but #2 would.
>>>
>>> Which version of Phoenix are you using? From 4.2 on, Phoenix is well
>>> aware of the distribution of the data and will farm out full scans in
>>> parallel chunks.
>>> In #2, would you make a copy of the entire dataset in order to be able
>>> to "query" it via Spark?
>>>
>>> What kind of performance do you see with option #1 vs. #2?
>>>
>>> Thanks.
>>>
>>> -- Lars
>>>
>>> ------------------------------
>>> From: "[email protected]" <[email protected]>
>>> To: user <[email protected]>; dev <[email protected]>
>>> Sent: Monday, January 5, 2015 6:42 PM
>>> Subject: Performance options for doing Phoenix full table scans to
>>> complete some data statistics and summary collection work
>>>
>>> Hi, all,
>>> Currently we are using Phoenix to store and query large datasets of
>>> KPIs for our projects. Note that we definitely need to do full table
>>> scans of the Phoenix KPI tables for data statistics and summary
>>> collection, e.g. from the five-minute data table to the hour-based
>>> summary table, then on to day-based and week-based tables, and so on.
>>> The approaches we currently use are as follows:
>>> 1. Using the Phoenix UPSERT INTO ... SELECT ... grammar; however, the
>>> query performance does not satisfy our expectations.
>>> 2. Using Apache Spark with the Phoenix MapReduce integration to read
>>> data from Phoenix tables and create an RDD, then transforming these
>>> RDDs into a summary RDD and bulk-loading it into a new Phoenix table.
>>> This approach satisfies most of our application requirements, but in
>>> some cases we cannot complete the full scan job.
>>>
>>> Here are my questions:
>>> 1. Are there any more efficient approaches for improving the
>>> performance of Phoenix full table scans over large datasets? Anything
>>> you can kindly share is greatly appreciated.
>>> 2. Given that full table scans are not really a good fit for HBase
>>> tables, are there any alternative options for doing such work in our
>>> current HDFS and HBase environments? Please kindly share any good
>>> pointers.
>>>
>>> Best regards,
>>> Sun.
>>>
>>> CertusNet
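(For reference, a minimal sketch of the kind of UPSERT INTO ... SELECT
rollup described as approach #1 above, reusing the hypothetical KPI_5MIN
table sketched earlier together with an equally hypothetical KPI_HOUR
summary table with the same columns. None of these names, nor the
TRUNC-based grouping, come from the thread.)

import java.sql.DriverManager

object HourlyRollup {
  def main(args: Array[String]): Unit = {
    // Hypothetical Phoenix JDBC URL; substitute your own ZooKeeper quorum.
    val conn = DriverManager.getConnection("jdbc:phoenix:zk1,zk2,zk3:2181")
    val stmt = conn.createStatement()
    // Aggregate five-minute rows into hourly rows; TRUNC rounds each
    // timestamp down to the start of its hour.
    val rows = stmt.executeUpdate(
      """UPSERT INTO KPI_HOUR (METRIC_ID, TS, VAL, CNT)
        |SELECT METRIC_ID, TRUNC(TS, 'HOUR'), SUM(VAL), SUM(CNT)
        |FROM KPI_5MIN
        |GROUP BY METRIC_ID, TRUNC(TS, 'HOUR')""".stripMargin)
    // With auto-commit off, mutations are buffered on the client until commit().
    conn.commit()
    println(s"Upserted $rows hourly rows")
    conn.close()
  }
}

The point of the sketch is only to show the shape of the statement; the
actual tables discussed in the thread have 100+ columns.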
