Region server fails consistently? Can you provide logs from the failing process?
On Monday, January 5, 2015, [email protected] <[email protected]> wrote:

> Hi, Lars
> Thanks for your reply and advice. You are right, we are considering that
> sort of aggregation work.
> Our requirements need to assure a full scan over a table with approximately
> 50 million rows and nearly 100+ columns. We are using the latest 4.2.2
> release, and we are actually using Spark to read from and write to Phoenix
> tables. We use the MapReduce integration over Phoenix tables to do the full
> table scan in Spark, and then we use the created RDD to write or bulkload
> to new Phoenix tables. That's just our production flow.
>
> Regarding #1 vs #2 performance, we found that #1 would always fail to
> complete, and we can see a regionserver falling down during the job. #2
> would cause some kind of ScannerTimeoutException; after we configured
> parameters for our HBase cluster those problems went away. However, we are
> still looking for more efficient approaches for doing such full table scans
> over Phoenix datasets.
>
> Thanks,
> Sun.
>
> ------------------------------
>
> CertusNet
>
>
> *From:* lars hofhansl <[email protected]>
> *Date:* 2015-01-06 12:52
> *To:* [email protected]; user <[email protected]>
> *Subject:* Re: Performance options for doing Phoenix full table scans to
> complete some data statistics and summary collection work
> Hi Sun,
>
> assuming that you are mostly talking about aggregates (in the sense of
> scanning a lot of data, but the resulting set is small), it's interesting
> that option #1 would not satisfy your performance expectations, but #2
> would.
>
> Which version of Phoenix are you using? From 4.2 on, Phoenix is aware of
> the distribution of the data and will farm out full scans in parallel
> chunks.
> In #2 you would make a copy of the entire dataset in order to be able
> to "query" it via Spark?
>
> What kind of performance do you see with option #1 vs #2?
>
> Thanks.
>
> -- Lars
>
> ------------------------------
> *From:* "[email protected]" <[email protected]>
> *To:* user <[email protected]>; dev <[email protected]>
> *Sent:* Monday, January 5, 2015 6:42 PM
> *Subject:* Performance options for doing Phoenix full table scans to
> complete some data statistics and summary collection work
>
> Hi, all
> Currently we are using Phoenix to store and query large datasets of KPIs
> for our projects. Note that we definitely need to do full table scans of
> Phoenix KPI tables for data statistics and summary collection, e.g. from a
> five-minute data table to an hour-based summary table, and on to day-based
> and week-based data tables, and so on.
> The approaches we currently use are as follows:
> 1. Using the Phoenix upsert into ... select ... syntax; however, the query
> performance does not satisfy our expectations.
> 2. Using Apache Spark with the phoenix_mr integration to read data from
> Phoenix tables and create an RDD, which we then transform into a summary
> RDD and bulkload into a new Phoenix data table. This approach satisfies
> most of our application requirements, but in some cases we cannot complete
> the full scan job.
>
> Here are my questions:
> 1. Are there any more efficient approaches for improving the performance of
> Phoenix full table scans over large data sets? Anything you can kindly share
> is greatly appreciated.
> 2. Noting that full table scans are not really appropriate for HBase tables,
> are there any alternative options for doing such work under our current HDFS
> and HBase environments? Please kindly share any good points.
>
> Best regards,
> Sun.
>
>
> CertusNet
>
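
For the UPSERT ... SELECT route (#1), running the roll-up in bounded time slices is usually the first thing to try, so no single statement has to scan the whole table. Below is only a rough sketch over JDBC: the table names, columns, and ZooKeeper quorum are made up, and the timeout properties are just the usual ones to look at when you hit ScannerTimeoutException (names and defaults vary by HBase/Phoenix version).

import java.sql.DriverManager
import java.util.Properties

object HourlyRollup {
  def main(args: Array[String]): Unit = {
    Class.forName("org.apache.phoenix.jdbc.PhoenixDriver")

    // Client-side timeouts often need raising for long scans; check the
    // exact property names against your HBase/Phoenix versions.
    val props = new Properties()
    props.setProperty("phoenix.query.timeoutMs", "1800000")
    props.setProperty("hbase.client.scanner.timeout.period", "600000")
    props.setProperty("hbase.rpc.timeout", "600000")

    // "zk1,zk2,zk3" is a placeholder ZooKeeper quorum.
    val conn = DriverManager.getConnection("jdbc:phoenix:zk1,zk2,zk3", props)
    try {
      val stmt = conn.createStatement()
      // KPI_5MIN / KPI_HOURLY and their columns are hypothetical. The WHERE
      // clause bounds the slice so one statement never scans everything.
      stmt.executeUpdate(
        """UPSERT INTO KPI_HOURLY (HOST, METRIC, TS, VAL)
          |SELECT HOST, METRIC, TRUNC(TS, 'HOUR'), SUM(VAL)
          |FROM KPI_5MIN
          |WHERE TS >= TO_DATE('2015-01-05 00:00:00')
          |  AND TS <  TO_DATE('2015-01-06 00:00:00')
          |GROUP BY HOST, METRIC, TRUNC(TS, 'HOUR')""".stripMargin)
      conn.commit() // Phoenix connections are not auto-commit by default
      stmt.close()
    } finally {
      conn.close()
    }
  }
}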

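For the Spark route (#2), once the RDD has been created from the Phoenix MapReduce input format, the summary step and the write-back can stay fairly small. The sketch below is only illustrative: the Kpi record type and the KPI_HOURLY table and columns are invented, and it writes back with batched UPSERTs over JDBC from each partition rather than a bulkload.

import java.sql.{DriverManager, Timestamp}
import org.apache.spark.SparkContext._ // pair-RDD implicits for older Spark releases
import org.apache.spark.rdd.RDD

// Hypothetical shape of one 5-minute KPI row after mapping the Phoenix input records.
case class Kpi(host: String, metric: String, tsMillis: Long, value: Double)

object SparkRollup {
  def rollupAndWrite(fiveMin: RDD[Kpi], zkQuorum: String): Unit = {
    // Truncate each timestamp to the hour and sum the 5-minute values.
    val hourly = fiveMin
      .map { k =>
        val hour = k.tsMillis - (k.tsMillis % 3600000L)
        ((k.host, k.metric, hour), k.value)
      }
      .reduceByKey(_ + _)

    // Write the summary back with one Phoenix connection per partition.
    hourly.foreachPartition { rows =>
      val conn = DriverManager.getConnection(s"jdbc:phoenix:$zkQuorum")
      val ps = conn.prepareStatement(
        "UPSERT INTO KPI_HOURLY (HOST, METRIC, TS, VAL) VALUES (?, ?, ?, ?)")
      try {
        var n = 0
        rows.foreach { case ((host, metric, hour), sum) =>
          ps.setString(1, host)
          ps.setString(2, metric)
          ps.setTimestamp(3, new Timestamp(hour))
          ps.setDouble(4, sum)
          ps.executeUpdate()
          n += 1
          if (n % 1000 == 0) conn.commit() // commit in batches to bound client-side buffering
        }
        conn.commit()
      } finally {
        ps.close()
        conn.close()
      }
    }
  }
}

If bulkloading HFiles is a hard requirement, the foreachPartition step would be swapped for your existing bulkload path; the batched UPSERTs are just the simpler thing to show here.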