Hello Qiao,
When you say "whatever the table DDL", that limits things a lot, because the
table DDL defines several things that can drastically improve the degree of
parallelism and the table size.

The DOP (degree of parallelism) of the scan operator is constrained by the
number of regions your table uses. So if you want to increase the DOP, you
need to salt (partition) your table, using syntax like:
create table customer_demographics_salt
(
 cd_demo_sk int not null
 , cd_gender char(1)
 , cd_marital_status char(1)
 , cd_education_status char(20)
 , cd_purchase_estimate int
 , cd_credit_rating char(10)
 , cd_dep_count int
 , cd_dep_employed_count int
 , cd_dep_college_count int
 , primary key (cd_demo_sk)
)
salt using 12 partitions
ATTRIBUTES ALIGNED FORMAT
HBASE_OPTIONS
  (
    DATA_BLOCK_ENCODING = 'FAST_DIFF',
    COMPRESSION = 'SNAPPY'
  );

You can experiment with different numbers of partitions (but pick a multiple
of 3, since you have 3 nodes).

The optimizer will then pick a DOP as a compromise between resource usage
and gain in speed.
If you want to force a higher DOP than what the optimizer selected, you can
use:
CQD parallel_num_esps '12';
to force one ESP per partition (assuming you picked 12 partitions).
You can verify which DOP the optimizer picked by doing an explain on the
query.
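For example, here is a minimal sqlci/trafci sketch; the count/sum query is
just an illustration against the salted table above, and the 'f' option asks
for the compact plan format:

cqd parallel_num_esps '12';

prepare s1 from
  select count(*), sum(cd_purchase_estimate)
  from customer_demographics_salt;

-- the plan shows the chosen DOP; look at the esp_exchange operator
-- and its fan-out to see how many ESPs the plan uses
explain options 'f' s1;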

Other important factors also play into performance:
- use of aligned format (see the example above)
- careful choice of the primary key (it is a bad idea to use a varchar with a
big maximum size)
- it is a good idea to use compression and encoding (see the most common
options we use in the example above)
- you can also increase the HBASE_OPTIONS BLOCKSIZE parameter, but there is a
drawback: increasing it improves SCAN performance but decreases keyed-access
performance. That is why I did not include it in the example above; a sketch
of it follows below.
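If you do want to experiment with it, you would simply add BLOCKSIZE to the
HBASE_OPTIONS clause of the create table above, something like this (the
128 KB value is only an illustration; the HBase default block size is 64 KB):

HBASE_OPTIONS
  (
    DATA_BLOCK_ENCODING = 'FAST_DIFF',
    COMPRESSION = 'SNAPPY',
    BLOCKSIZE = '131072'
  );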

Hope this helps,
Regards,
Eric


-----Original Message-----
From: 乔彦克 [mailto:qya...@gmail.com]
Sent: Monday, September 12, 2016 1:22 AM
To: user@trafodion.incubator.apache.org; d...@trafodion.incubator.apache.org
Subject: trafodion query optimization

Hi all,
     I executed sum and count queries on my table; the cluster has three
nodes. I found that the sum query is not well parallelized (not all three
nodes get a high load while it runs), and the CPU load is very high while the
memory usage is very low (the machines have 16 cores and 16 GB of memory). My
sum query on the 12 million row data set takes about two and a half minutes.
    So my question is: is there any optimization advice I can use to improve
the query performance and maximize the usage of my machines, whether in the
configuration or in the table DDL?
    Any replies are appreciated.

Thanks,
Qiao
