Hi Sun - Yes, 2 to 3 column families is a good number to experiment with; cluster the frequently used columns together in each column family.
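For example, a rough sketch of what such a DDL could look like (the table
and column names here are invented for illustration, not taken from your
actual schema), with the frequently scanned KPI columns grouped into one
family, the rarely read detail columns in the others, and SNAPPY
compression turned on:

    CREATE TABLE IF NOT EXISTS kpi_5min (
        host          VARCHAR NOT NULL,
        metric_time   DATE NOT NULL,
        -- family A: the few columns the full-scan rollups actually read
        a.bytes_in    BIGINT,
        a.bytes_out   BIGINT,
        -- families B and C: wide, rarely scanned detail/audit columns
        b.detail_info VARCHAR,
        c.audit_info  VARCHAR
        CONSTRAINT pk PRIMARY KEY (host, metric_time)
    ) COMPRESSION='SNAPPY';

That way a rollup that touches only a.bytes_in and a.bytes_out never has
to read the B and C families off disk.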
On Thu, Jan 8, 2015 at 5:52 PM, [email protected] <[email protected]> wrote:

> Hi guys,
> Thanks for all of your kind advice. For #1, we are planning to retry
> that. Mujtaba, compression is already set to SNAPPY. Actually we are
> using only one column family at the moment, and we are free to utilize
> multiple column families. The table schema is tall-narrow.
> For example, our table uses one default column family and has over 90+
> columns. How many column families would you recommend we use then?
> Maybe two to three column families are enough?
> We have one cluster with 5 nodes.
>
> Thanks,
> Sun.
>
> ------------------------------
>
> CertusNet
>
>
> *From:* Mujtaba Chohan <[email protected]>
> *Date:* 2015-01-09 00:42
> *To:* [email protected]
> *Subject:* Re: Performance options for doing Phoenix full table scans to
> complete some data statistics and summary collection work
>
> With 100+ columns, using multiple column families will help a lot if
> your full scan uses only a few columns.
>
> Also, if columns are wide then turning on compression would help if you
> are seeing disk I/O contention on region servers.
>
> On Wednesday, January 7, 2015, James Taylor <[email protected]>
> wrote:
>
>> Hi Sun,
>> Can you give us a sample DDL and upsert/select query for #1? What's the
>> approximate cluster size, and what does the client look like? How much
>> data are you scanning? Are you using multiple column families? We
>> should be able to help tune things to improve #1.
>> Thanks,
>> James
>>
>> On Monday, January 5, 2015, [email protected] <
>> [email protected]> wrote:
>>
>>> We first did the test using #1, and the result did not satisfy our
>>> expectations.
>>> Unfortunately I did not save a copy of the logs, but under the same
>>> dataset conditions, #2 performed better than #1.
>>>
>>> Thanks,
>>> Sun.
>>>
>>> ------------------------------
>>>
>>> *From:* Nick Dimiduk
>>> *Date:* 2015-01-06 14:03
>>> *To:* [email protected]
>>> *CC:* lars hofhansl
>>> *Subject:* Re: Performance options for doing Phoenix full table scans
>>> to complete some data statistics and summary collection work
>>>
>>> Region server fails consistently? Can you provide logs from the
>>> failing process?
>>>
>>> On Monday, January 5, 2015, [email protected] <
>>> [email protected]> wrote:
>>>
>>>> Hi Lars,
>>>> Thanks for your reply and advice. You are right, we are doing a sort
>>>> of aggregation work.
>>>> Our requirements involve full scans over a table with approximately
>>>> 50 million rows and nearly 100+ columns. We are using the latest
>>>> 4.2.2 release; actually we are using Spark to read from and write to
>>>> Phoenix tables. We use the Phoenix MapReduce integration to do the
>>>> full table scan in Spark, and then we use the resulting RDD to write
>>>> or bulkload to new Phoenix tables. That's our production flow.
>>>>
>>>> Regarding #1 vs #2 performance, we found that #1 would always fail to
>>>> complete, and we could see region servers going down during the job.
>>>> #2 would cause some kind of ScannerTimeoutException; after we tuned
>>>> the relevant parameters for our HBase cluster, those problems went
>>>> away. However, we are still looking for more efficient approaches for
>>>> doing such full table scans over Phoenix datasets.
>>>>
>>>> Thanks,
>>>> Sun.
>>>>
>>>> ------------------------------
>>>>
>>>> CertusNet
>>>>
>>>>
>>>> *From:* lars hofhansl
>>>> *Date:* 2015-01-06 12:52
>>>> *To:* [email protected]; user
>>>> *Subject:* Re: Performance options for doing Phoenix full table scans
>>>> to complete some data statistics and summary collection work
>>>>
>>>> Hi Sun,
>>>>
>>>> Assuming that you are mostly talking about aggregates (in the sense
>>>> of scanning a lot of data, but the resulting set is small), it's
>>>> interesting that option #1 would not satisfy your performance
>>>> expectations, but #2 would.
>>>>
>>>> Which version of Phoenix are you using? From 4.2 on, Phoenix is aware
>>>> of the distribution of the data and will farm out full scans in
>>>> parallel chunks.
>>>> In #2 you would make a copy of the entire dataset in order to be able
>>>> to "query" it via Spark?
>>>>
>>>> What kind of performance do you see with option #1 vs #2?
>>>>
>>>> Thanks.
>>>>
>>>> -- Lars
>>>>
>>>> ------------------------------
>>>> *From:* "[email protected]" <[email protected]>
>>>> *To:* user <[email protected]>; dev <[email protected]>
>>>> *Sent:* Monday, January 5, 2015 6:42 PM
>>>> *Subject:* Performance options for doing Phoenix full table scans to
>>>> complete some data statistics and summary collection work
>>>>
>>>> Hi all,
>>>> Currently we are using Phoenix to store and query large datasets of
>>>> KPIs for our projects. We definitely need to do full table scans of
>>>> the Phoenix KPI tables for data statistics and summary collection,
>>>> e.g. rolling up from the five-minute data table to an hour-based
>>>> summary table, and on to day-based and week-based data tables, and
>>>> so on.
>>>> The approaches we currently use are as follows:
>>>> 1. Using the Phoenix UPSERT INTO ... SELECT ... grammar; however, the
>>>> query performance does not satisfy our expectations.
>>>> 2. Using Apache Spark with the phoenix_mr integration to read data
>>>> from Phoenix tables and create an RDD, then transforming these RDDs
>>>> into a summary RDD and bulkloading it into a new Phoenix data table.
>>>> This approach satisfies most of our application requirements, but in
>>>> some cases we cannot complete the full scan job.
>>>>
>>>> Here are my questions:
>>>> 1. Are there any more efficient approaches for improving the
>>>> performance of Phoenix full table scans over large datasets? Any kind
>>>> sharing is greatly appreciated.
>>>> 2. Given that full table scans are not really appropriate for HBase
>>>> tables, are there any alternative options for doing such work in our
>>>> current HDFS and HBase environments? Please kindly share any good
>>>> pointers.
>>>>
>>>> Best regards,
>>>> Sun.
>>>>
>>>>
>>>> CertusNet
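As an illustration of approach #1 from the original mail above, a minimal
sketch of the kind of five-minute-to-hourly rollup being discussed (again
with invented table and column names, reusing the hypothetical kpi_5min
layout sketched earlier):

    UPSERT INTO kpi_hourly (host, hour_start, total_bytes_in, max_bytes_out)
    SELECT host,
           TRUNC(metric_time, 'HOUR'),
           SUM(bytes_in),
           MAX(bytes_out)
    FROM kpi_5min
    WHERE metric_time >= TO_DATE('2015-01-05 00:00:00')
      AND metric_time <  TO_DATE('2015-01-06 00:00:00')
    GROUP BY host, TRUNC(metric_time, 'HOUR');

Bounding the scanned time range per run, as in the WHERE clause above,
also keeps each rollup job from re-reading all 50 million rows in one
unbounded full scan.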
