Hi Sun - Yes, 2 to 3 column families is a good number to experiment with; cluster the frequently used columns together in each column family.
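For example, a rough sketch of what such a DDL could look like (the table
and column names here are invented for illustration, not taken from your
actual schema), with the frequently scanned KPI columns grouped into one
family, the rarely read detail columns in the others, and SNAPPY
compression turned on:

    CREATE TABLE IF NOT EXISTS kpi_5min (
        host          VARCHAR NOT NULL,
        metric_time   DATE NOT NULL,
        -- family A: the few columns the full-scan rollups actually read
        a.bytes_in    BIGINT,
        a.bytes_out   BIGINT,
        -- families B and C: wide, rarely scanned detail/audit columns
        b.detail_info VARCHAR,
        c.audit_info  VARCHAR
        CONSTRAINT pk PRIMARY KEY (host, metric_time)
    ) COMPRESSION='SNAPPY';

That way a rollup that touches only a.bytes_in and a.bytes_out never has
to read the B and C families off disk.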
On Thu, Jan 8, 2015 at 5:52 PM, [email protected] <[email protected]> wrote:

> Hi guys,
> Thanks for all of your kind advice. For #1, we are planning to retry
> that. Mujtaba, compression is already set to SNAPPY. Actually we are
> using only one column family at the moment, and we are free to utilize
> multiple column families. The table schema is tall-narrow.
> For example, our table uses one default column family and has over 90+
> columns. How many column families would you recommend we use then?
> Maybe two to three column families are enough?
> We have one cluster with 5 nodes.
>
> Thanks,
> Sun.
>
> ------------------------------
>
> CertusNet
>
>
> *From:* Mujtaba Chohan <[email protected]>
> *Date:* 2015-01-09 00:42
> *To:* [email protected]
> *Subject:* Re: Performance options for doing Phoenix full table scans to
> complete some data statistics and summary collection work
>
> With 100+ columns, using multiple column families will help a lot if
> your full scan uses only a few columns.
>
> Also, if columns are wide then turning on compression would help if you
> are seeing disk I/O contention on region servers.
>
> On Wednesday, January 7, 2015, James Taylor <[email protected]>
> wrote:
>
>> Hi Sun,
>> Can you give us a sample DDL and upsert/select query for #1? What's the
>> approximate cluster size, and what does the client look like? How much
>> data are you scanning? Are you using multiple column families? We
>> should be able to help tune things to improve #1.
>> Thanks,
>> James
>>
>> On Monday, January 5, 2015, [email protected] <
>> [email protected]> wrote:
>>
>>> We first did the test using #1, and the result did not satisfy our
>>> expectations.
>>> Unfortunately I did not save a copy of the logs, but under the same
>>> dataset conditions, #2 performed better than #1.
>>>
>>> Thanks,
>>> Sun.
>>>
>>> ------------------------------
>>>
>>> *From:* Nick Dimiduk
>>> *Date:* 2015-01-06 14:03
>>> *To:* [email protected]
>>> *CC:* lars hofhansl
>>> *Subject:* Re: Performance options for doing Phoenix full table scans
>>> to complete some data statistics and summary collection work
>>>
>>> Region server fails consistently? Can you provide logs from the
>>> failing process?
>>>
>>> On Monday, January 5, 2015, [email protected] <
>>> [email protected]> wrote:
>>>
>>>> Hi Lars,
>>>> Thanks for your reply and advice. You are right, we are doing a sort
>>>> of aggregation work.
>>>> Our requirements involve full scans over a table with approximately
>>>> 50 million rows and nearly 100+ columns. We are using the latest
>>>> 4.2.2 release; actually we are using Spark to read from and write to
>>>> Phoenix tables. We use the Phoenix MapReduce integration to do the
>>>> full table scan in Spark, and then we use the resulting RDD to write
>>>> or bulkload to new Phoenix tables. That's our production flow.
>>>>
>>>> Regarding #1 vs #2 performance, we found that #1 would always fail to
>>>> complete, and we could see region servers going down during the job.
>>>> #2 would cause some kind of ScannerTimeoutException; after we tuned
>>>> the relevant parameters for our HBase cluster, those problems went
>>>> away. However, we are still looking for more efficient approaches for
>>>> doing such full table scans over Phoenix datasets.
>>>>
>>>> Thanks,
>>>> Sun.
>>>>
>>>> ------------------------------
>>>>
>>>> CertusNet
>>>>
>>>>
>>>> *From:* lars hofhansl
>>>> *Date:* 2015-01-06 12:52
>>>> *To:* [email protected]; user
>>>> *Subject:* Re: Performance options for doing Phoenix full table scans
>>>> to complete some data statistics and summary collection work
>>>>
>>>> Hi Sun,
>>>>
>>>> Assuming that you are mostly talking about aggregates (in the sense
>>>> of scanning a lot of data, but the resulting set is small), it's
>>>> interesting that option #1 would not satisfy your performance
>>>> expectations, but #2 would.
>>>>
>>>> Which version of Phoenix are you using? From 4.2 on, Phoenix is aware
>>>> of the distribution of the data and will farm out full scans in
>>>> parallel chunks.
>>>> In #2 you would make a copy of the entire dataset in order to be able
>>>> to "query" it via Spark?
>>>>
>>>> What kind of performance do you see with option #1 vs #2?
>>>>
>>>> Thanks.
>>>>
>>>> -- Lars
>>>>
>>>> ------------------------------
>>>> *From:* "[email protected]" <[email protected]>
>>>> *To:* user <[email protected]>; dev <[email protected]>
>>>> *Sent:* Monday, January 5, 2015 6:42 PM
>>>> *Subject:* Performance options for doing Phoenix full table scans to
>>>> complete some data statistics and summary collection work
>>>>
>>>> Hi all,
>>>> Currently we are using Phoenix to store and query large datasets of
>>>> KPIs for our projects. We definitely need to do full table scans of
>>>> the Phoenix KPI tables for data statistics and summary collection,
>>>> e.g. rolling up from the five-minute data table to an hour-based
>>>> summary table, and on to day-based and week-based data tables, and
>>>> so on.
>>>> The approaches we currently use are as follows:
>>>> 1. Using the Phoenix UPSERT INTO ... SELECT ... grammar; however, the
>>>> query performance does not satisfy our expectations.
>>>> 2. Using Apache Spark with the phoenix_mr integration to read data
>>>> from Phoenix tables and create an RDD, then transforming these RDDs
>>>> into a summary RDD and bulkloading it into a new Phoenix data table.
>>>> This approach satisfies most of our application requirements, but in
>>>> some cases we cannot complete the full scan job.
>>>>
>>>> Here are my questions:
>>>> 1. Are there any more efficient approaches for improving the
>>>> performance of Phoenix full table scans over large datasets? Any kind
>>>> sharing is greatly appreciated.
>>>> 2. Given that full table scans are not really appropriate for HBase
>>>> tables, are there any alternative options for doing such work in our
>>>> current HDFS and HBase environments? Please kindly share any good
>>>> pointers.
>>>>
>>>> Best regards,
>>>> Sun.
>>>>
>>>>
>>>> CertusNet
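As an illustration of approach #1 from the original mail above, a minimal
sketch of the kind of five-minute-to-hourly rollup being discussed (again
with invented table and column names, reusing the hypothetical kpi_5min
layout sketched earlier):

    UPSERT INTO kpi_hourly (host, hour_start, total_bytes_in, max_bytes_out)
    SELECT host,
           TRUNC(metric_time, 'HOUR'),
           SUM(bytes_in),
           MAX(bytes_out)
    FROM kpi_5min
    WHERE metric_time >= TO_DATE('2015-01-05 00:00:00')
      AND metric_time <  TO_DATE('2015-01-06 00:00:00')
    GROUP BY host, TRUNC(metric_time, 'HOUR');

Bounding the scanned time range per run, as in the WHERE clause above,
also keeps each rollup job from re-reading all 50 million rows in one
unbounded full scan.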
