With 100+ columns, using multiple column families will help a lot if your
full scan uses only a few columns. Also, if the columns are wide, turning on
compression would help if you are seeing disk I/O contention on the region
servers.
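For illustration, a minimal sketch of what such a DDL could look like is
below. The table name, column names, ZooKeeper quorum, and compression codec
are all assumptions, not taken from your schema: the few columns the summary
scan actually reads go into one column family, the remaining wide columns go
into another, and compression is enabled at the table level.

import java.sql.DriverManager

object CreateKpiTable {
  def main(args: Array[String]): Unit = {
    // Hypothetical Phoenix JDBC URL; substitute your own ZooKeeper quorum.
    val conn = DriverManager.getConnection("jdbc:phoenix:zk1,zk2,zk3:2181")
    val stmt = conn.createStatement()
    // Family S holds the columns the statistics job scans; family D holds
    // the remaining wide, rarely-scanned columns. COMPRESSION is passed
    // through to HBase (SNAPPY needs the native libraries installed;
    // GZ works out of the box).
    stmt.executeUpdate(
      """CREATE TABLE IF NOT EXISTS KPI_5MIN (
        |  METRIC_ID VARCHAR NOT NULL,
        |  TS        DATE    NOT NULL,
        |  S.VAL     DOUBLE,
        |  S.CNT     BIGINT,
        |  D.DETAIL  VARCHAR
        |  CONSTRAINT PK PRIMARY KEY (METRIC_ID, TS)
        |) COMPRESSION='SNAPPY'""".stripMargin)
    conn.close()
  }
}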
On Wednesday, January 7, 2015, James Taylor <[email protected]> wrote:
> Hi Sun,
> Can you give us a sample DDL and upsert/select query for #1? What's the
> approximate cluster size, and what does the client look like? How much
> data are you scanning? Are you using multiple column families? We should
> be able to help tune things to improve #1.
> Thanks,
> James
>
> On Monday, January 5, 2015, [email protected] wrote:
>
>> We had first done the test using #1, and the result did not satisfy our
>> expectations. Unfortunately I did not keep a copy of the log, but under
>> the same conditions and datasets, #2 is better than #1.
>>
>> Thanks,
>> Sun.
>>
>> ------------------------------
>> From: Nick Dimiduk
>> Date: 2015-01-06 14:03
>> To: [email protected]
>> CC: lars hofhansl
>> Subject: Re: Performance options for doing Phoenix full table scans to
>> complete some data statistics and summary collection work
>>
>> Region server fails consistently? Can you provide logs from the failing
>> process?
>>
>> On Monday, January 5, 2015, [email protected] wrote:
>>
>>> Hi, Lars,
>>> Thanks for your reply and advice. You are right, we are considering
>>> that sort of aggregation work. Our requirements call for a full scan
>>> over a table with approximately 50 million rows and nearly 100+
>>> columns. We are using the latest 4.2.2 release; actually, we are using
>>> Spark to read from and write to Phoenix tables. We use the Phoenix
>>> MapReduce integration over the Phoenix tables to do the full table
>>> scan in Spark, and we then use the resulting RDD to write or bulk-load
>>> into new Phoenix tables. That is just our production flow.
>>>
>>> Regarding #1 vs. #2 performance, we found that #1 would always fail to
>>> complete, and we could see a region server going down during the job.
>>> #2 caused some ScannerTimeoutExceptions; after we tuned the relevant
>>> parameters on our HBase cluster, those problems went away. However, we
>>> are still looking for more efficient approaches to such full table
>>> scans over Phoenix datasets.
>>>
>>> Thanks,
>>> Sun.
>>>
>>> ------------------------------
>>> CertusNet
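(Regarding the ScannerTimeoutException above: the thread doesn't say which
HBase parameters were tuned. A rough sketch of the client-side settings that
usually matter for long-running full scans is below; the key names are the
standard HBase/Phoenix ones, but the values are placeholders, not
recommendations. The same keys can be set in hbase-site.xml on clients and
region servers instead.)

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.HBaseConfiguration

// Rough sketch only: settings that commonly need raising when a long
// Phoenix full scan hits ScannerTimeoutException.
object ScanTuning {
  def tunedConf(): Configuration = {
    val conf = HBaseConfiguration.create()
    // How long a scanner lease may stay idle between next() calls.
    conf.setLong("hbase.client.scanner.timeout.period", 600000L)
    // The RPC timeout should be at least as large as the scanner timeout.
    conf.setLong("hbase.rpc.timeout", 600000L)
    // Fewer rows per next() call keeps each RPC well under the lease limit.
    conf.setInt("hbase.client.scanner.caching", 1000)
    // Phoenix's own overall query timeout, in milliseconds.
    conf.setLong("phoenix.query.timeoutMs", 1800000L)
    conf
  }
}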
>>> ------------------------------
>>> From: lars hofhansl
>>> Date: 2015-01-06 12:52
>>> To: [email protected]; user
>>> Subject: Re: Performance options for doing Phoenix full table scans to
>>> complete some data statistics and summary collection work
>>>
>>> Hi Sun,
>>>
>>> Assuming that you are mostly talking about aggregates (in the sense of
>>> scanning a lot of data while the resulting set is small), it's
>>> interesting that option #1 would not satisfy your performance
>>> expectations but #2 would.
>>>
>>> Which version of Phoenix are you using? From 4.2 on, Phoenix is well
>>> aware of the distribution of the data and will farm out full scans in
>>> parallel chunks.
>>> In #2, would you make a copy of the entire dataset in order to be able
>>> to "query" it via Spark?
>>>
>>> What kind of performance do you see with option #1 vs. #2?
>>>
>>> Thanks.
>>>
>>> -- Lars
>>>
>>> ------------------------------
>>> From: "[email protected]" <[email protected]>
>>> To: user <[email protected]>; dev <[email protected]>
>>> Sent: Monday, January 5, 2015 6:42 PM
>>> Subject: Performance options for doing Phoenix full table scans to
>>> complete some data statistics and summary collection work
>>>
>>> Hi, all,
>>> Currently we are using Phoenix to store and query large datasets of
>>> KPIs for our projects. Note that we definitely need to do full table
>>> scans of the Phoenix KPI tables for data statistics and summary
>>> collection, e.g. from the five-minute data table to the hour-based
>>> summary table, then on to day-based and week-based tables, and so on.
>>> The approaches we currently use are as follows:
>>> 1. Using the Phoenix UPSERT INTO ... SELECT ... grammar; however, the
>>> query performance does not satisfy our expectations.
>>> 2. Using Apache Spark with the Phoenix MapReduce integration to read
>>> data from Phoenix tables and create an RDD, then transforming these
>>> RDDs into a summary RDD and bulk-loading it into a new Phoenix table.
>>> This approach satisfies most of our application requirements, but in
>>> some cases we cannot complete the full scan job.
>>>
>>> Here are my questions:
>>> 1. Are there any more efficient approaches for improving the
>>> performance of Phoenix full table scans over large datasets? Anything
>>> you can kindly share is greatly appreciated.
>>> 2. Given that full table scans are not really a good fit for HBase
>>> tables, are there any alternative options for doing such work in our
>>> current HDFS and HBase environments? Please kindly share any good
>>> pointers.
>>>
>>> Best regards,
>>> Sun.
>>>
>>> CertusNet
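(For reference, a minimal sketch of the kind of UPSERT INTO ... SELECT
rollup described as approach #1 above, reusing the hypothetical KPI_5MIN
table sketched earlier together with an equally hypothetical KPI_HOUR
summary table with the same columns. None of these names, nor the
TRUNC-based grouping, come from the thread.)

import java.sql.DriverManager

object HourlyRollup {
  def main(args: Array[String]): Unit = {
    // Hypothetical Phoenix JDBC URL; substitute your own ZooKeeper quorum.
    val conn = DriverManager.getConnection("jdbc:phoenix:zk1,zk2,zk3:2181")
    val stmt = conn.createStatement()
    // Aggregate five-minute rows into hourly rows; TRUNC rounds each
    // timestamp down to the start of its hour.
    val rows = stmt.executeUpdate(
      """UPSERT INTO KPI_HOUR (METRIC_ID, TS, VAL, CNT)
        |SELECT METRIC_ID, TRUNC(TS, 'HOUR'), SUM(VAL), SUM(CNT)
        |FROM KPI_5MIN
        |GROUP BY METRIC_ID, TRUNC(TS, 'HOUR')""".stripMargin)
    // With auto-commit off, mutations are buffered on the client until commit().
    conn.commit()
    println(s"Upserted $rows hourly rows")
    conn.close()
  }
}

The point of the sketch is only to show the shape of the statement; the
actual tables discussed in the thread have 100+ columns.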
