Hi Amit,
A couple more performance tips on top of what Mujtaba already mentioned:
- Use the latest Phoenix (4.8.0). There are some great performance
enhancements in this release, especially around usage of DISTINCT. We've
also got some new encoding schemes to reduce table sizes in our
encodecolumns branch, which will make it into master in a few weeks.
- Specify the UPDATE_CACHE_FREQUENCY property in your CREATE TABLE and
CREATE INDEX calls. This will greatly reduce the amount of RPC traffic. See
here for more info[1].
- Make sure to create secondary indexes to prevent full table scans.
- Declare tables as IMMUTABLE_ROWS=true if they're write-once/append-only.
- Depending on the use case/query patterns and the schema involved, you may
want to use multiple column families. This prevents having to scan data
that you don't need. More info on how to declare column families can be
found here[2].
- Consider using the SMALL, SERIAL, NO_CACHE, and SKIP_SCAN hints as
necessary. We haven't yet enhanced our optimizer to do this automatically,
so some experience in understanding what HBase is doing under the covers
will help. For example, for point lookups, use the SMALL and SERIAL hint.
For large scans, use the NO_CACHE hint. For low cardinality columns in the
PK, try the SKIP_SCAN hint. For more on these hints, see here[3].
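To illustrate the UPDATE_CACHE_FREQUENCY tip, here's a minimal sketch (the table and column names are made up for the example; the value is in milliseconds):

```sql
-- Hypothetical table: clients re-check its metadata at most once
-- every 15 minutes instead of issuing an RPC on every statement.
CREATE TABLE metrics (
    host VARCHAR NOT NULL,
    ts   DATE NOT NULL,
    val  DOUBLE
    CONSTRAINT pk PRIMARY KEY (host, ts))
    UPDATE_CACHE_FREQUENCY = 900000;

-- The property can also be added to an existing table:
ALTER TABLE metrics SET UPDATE_CACHE_FREQUENCY = 900000;
```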
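A sketch of the secondary-index and IMMUTABLE_ROWS tips on a made-up write-once schema:

```sql
-- Append-only table, so declaring it immutable keeps index
-- maintenance cheap (no need to handle updates to indexed values).
CREATE TABLE events (
    id      BIGINT NOT NULL PRIMARY KEY,
    user_id BIGINT,
    created DATE)
    IMMUTABLE_ROWS = true;

-- Covered index: queries filtering on user_id avoid a full table
-- scan, and the INCLUDE'd column is served from the index itself.
CREATE INDEX idx_events_user ON events (user_id) INCLUDE (created);
```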
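For the column-family tip: Phoenix places a column into a named HBase family when you prefix the column name with that family (hypothetical schema again):

```sql
-- a.* and b.* land in separate HBase column families, so a query
-- that only reads a.status never has to load the large b.payload.
CREATE TABLE orders (
    id        BIGINT NOT NULL PRIMARY KEY,
    a.status  CHAR(1),
    b.payload VARCHAR);
```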
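The hints from the last bullet look like this in practice (table and column names are illustrative only; which hint helps depends on your data):

```sql
-- Point lookup on the full PK: a small, serial scan avoids the
-- overhead of parallel chunking for a handful of rows.
SELECT /*+ SMALL SERIAL */ val FROM metrics
WHERE host = 'h1' AND ts = CURRENT_DATE();

-- Large scan that shouldn't churn the HBase block cache.
SELECT /*+ NO_CACHE */ COUNT(*) FROM events;

-- Low-cardinality leading PK column: try a skip scan over the key space.
SELECT /*+ SKIP_SCAN */ * FROM metrics WHERE host IN ('h1', 'h2');
```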

Thanks,
James

[1] https://phoenix.apache.org/#Altering
[2] https://phoenix.apache.org/language/index.html#create_table
[3] https://phoenix.apache.org/language/index.html#select_statement

On Fri, Aug 12, 2016 at 2:57 PM, Mujtaba Chohan <mujt...@apache.org> wrote:

> Hi Amit,
>
> * What's the heap size of each of your region servers?
> * Do you see huge amount of disk reads when you do a select count(*) from
> tpch.lineitem? If yes then try setting snappy compression on your table
> followed by major compaction
> * Were there any deleted rows in this table? What's the row count via HBase
> shell?
> * What's the schema of your table? How did you load your data?
> * Update statistics with a 100M guidepost width and check the explain plan
> after this async task finishes to see if it shows an approximately correct
> row count: UPDATE STATISTICS TPCH.LINEITEM SET
> "phoenix.stats.guidepost.width" = 100000000;
>
> I get the following numbers with 600M rows (uncompressed, default Phoenix
> FAST_DIFF encoded) in TPCH.LINEITEM on an 8-node HBase 0.98.20 cluster with
> 12G heap and 12+12 cores (virtual+physical). Data is fetched from the OS
> page cache.
>
> select count(*) from lineitem;
> +------------+
> |  COUNT(1)  |
> +------------+
> | 600037902  |
> +------------+
> 1 row selected (*57.012 seconds*)
>
> select l_returnflag, l_linestatus, sum(l_quantity) as sum_qty,
>   sum(l_extendedprice) as sum_base_price,
>   sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
>   sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
>   avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price,
>   avg(l_discount) as avg_disc, count(*) as count_order
> from lineitem
> where l_shipdate <= current_date() - 90
> group by l_returnflag, l_linestatus
> order by l_returnflag, l_linestatus;
> | L_RETURNFLAG | L_LINESTATUS | SUM_QTY       | SUM_BASE_PRICE    | SUM_DISC_PRICE      | SUM_CHARGE           | AVG_QTY | AVG_PRICE  | AVG_DISC | COUNT_ORDER |
> | A            | F            | 3775127758    | 5660776097194.45  | 5377736398183.9374  | 5592847429515.927026 | 25.4993 | 38236.1169 | 0.05     | 148047881   |
> | N            | F            | 98553062      | 147771098385.98   | 140384965965.0348   | 145999793032.775829  | 25.5015 | 38237.1993 | 0.0499   | 3864590     |
> | N            | O            | 7651423419    | 11473321691083.27 | 10899667121317.2466 | 11335664103186.27932 | 25.4998 | 38236.9914 | 0.0499   | 300058170   |
> | R            | F            | 3.77572497E+9 | 5661603032745.34  | 5378513563915.4097  | 5593662252666.916161 | 25.5    | 38236.6972 | 0.05     | 148067261   |
> 4 rows selected (*146.677 seconds*)
>
> explain select count(*) from lineitem;
> | PLAN                                                                                       |
> | CLIENT 4204-CHUNK 589522747 ROWS 409200095701 BYTES PARALLEL 1-WAY FULL SCAN OVER LINEITEM |
> |     SERVER FILTER BY FIRST KEY ONLY                                                        |
> |     SERVER AGGREGATE INTO SINGLE ROW                                                       |
>
> DDL: CREATE TABLE LINEITEM(
>   L_ORDERKEY INTEGER NOT NULL, L_PARTKEY INTEGER, L_SUPPKEY INTEGER,
>   L_LINENUMBER INTEGER NOT NULL, L_QUANTITY DECIMAL(15,2),
>   L_EXTENDEDPRICE DECIMAL(15,2), L_DISCOUNT DECIMAL(15,2),
>   L_TAX DECIMAL(15,2), L_RETURNFLAG CHAR(1), L_LINESTATUS CHAR(1),
>   L_SHIPDATE DATE, L_COMMITDATE DATE, L_RECEIPTDATE DATE,
>   L_SHIPINSTRUCT CHAR(25), L_SHIPMODE CHAR(10), L_COMMENT VARCHAR(44)
>   CONSTRAINT pk PRIMARY KEY (l_orderkey, l_linenumber));
>
> Raw data: do curl -O
> http://static.druid.io/data/benchmarks/tpch/100/lineitem.tbl.$i.gz ; done
>
> //mujtaba
>
> On Thu, Aug 11, 2016 at 1:33 PM, Amit Mudgal <amud...@splicemachine.com>
> wrote:
>
> >
> > Hi team,
> >
> > I was evaluating Apache Phoenix against the TPC-H data based on the
> > presentation given at the Hadoop summit in June stating that most TPC-H
> > queries should run.
> > Here are the setup details I have in my local environment:
> >
> > 1. One master node and 3 region servers with 3.6 TB disk space, 62.9 GB
> > memory and 24 CPU cores (OS: centos-release-6-8.el6.centos.12.3.x86_64)
> > 2. I am running the Phoenix parcel (4.7.0) on Cloudera
> > 5.7.2-1.cdh5.7.2.p0.18.
> >
> > The data got uploaded and a compaction was manually triggered on HBase.
> > There were 2 problems we were trying to find the answer to:
> >
> > 1. An explain plan on the standard TPC-H LINEITEM table shows
> > 8,649,179,394 rows, but only 600,000,000 records were uploaded.
> >
> > 0: jdbc:phoenix> explain select * from TPCH.LINEITEM where L_SUPPKEY = 768951;
> > | PLAN                                                                                                         |
> > | CLIENT 1458-CHUNK 8649179394 ROWS 424044167376 BYTES PARALLEL 1-WAY ROUND ROBIN FULL SCAN OVER TPCH.LINEITEM |
> > |     SERVER FILTER BY L_SUPPKEY = 768951                                                                      |
> > 2 rows selected (3.036 seconds)
> >
> > I could not do a count(*) on the table, as it always failed with
> > Error: Operation timed out. (state=TIM01,code=6000)
> >
> > 2. Secondly, I was also not able to run the simple Query 1 published by
> > TPC-H, as it times out regularly:
> >
> >
> > 0: jdbc:phoenix:stl-colo-srv050> select l_returnflag, l_linestatus,
> >   sum(l_quantity) as sum_qty, sum(l_extendedprice) as sum_base_price,
> >   sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
> >   sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
> >   avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price,
> >   avg(l_discount) as avg_disc, count(*) as count_order
> > from TPCH.lineitem where l_shipdate <= current_date() - 90
> > group by l_returnflag, l_linestatus order by l_returnflag, l_linestatus;
> > Error: Operation timed out. (state=TIM01,code=6000)
> > java.sql.SQLTimeoutException: Operation timed out.
> > at org.apache.phoenix.exception.SQLExceptionCode$14.newException(SQLExceptionCode.java:359)
> > at org.apache.phoenix.exception.SQLExceptionInfo.buildException(SQLExceptionInfo.java:145)
> > at org.apache.phoenix.iterate.BaseResultIterators.getIterators(BaseResultIterators.java:728)
> > at org.apache.phoenix.iterate.BaseResultIterators.getIterators(BaseResultIterators.java:638)
> > at org.apache.phoenix.iterate.MergeSortResultIterator.getMinHeap(MergeSortResultIterator.java:72)
> > at org.apache.phoenix.iterate.MergeSortResultIterator.minIterator(MergeSortResultIterator.java:93)
> > at org.apache.phoenix.iterate.MergeSortResultIterator.next(MergeSortResultIterator.java:58)
> > at org.apache.phoenix.iterate.BaseGroupedAggregatingResultIterator.next(BaseGroupedAggregatingResultIterator.java:64)
> > at org.apache.phoenix.jdbc.PhoenixResultSet.next(PhoenixResultSet.java:778)
> > at sqlline.BufferedRows.<init>(BufferedRows.java:37)
> > at sqlline.SqlLine.print(SqlLine.java:1650)
> > at sqlline.Commands.execute(Commands.java:833)
> > at sqlline.Commands.sql(Commands.java:732)
> > at sqlline.SqlLine.dispatch(SqlLine.java:808)
> > at sqlline.SqlLine.begin(SqlLine.java:681)
> > at sqlline.SqlLine.start(SqlLine.java:398)
> > at sqlline.SqlLine.main(SqlLine.java:292)
> > 0: jdbc:phoenix:stl-colo-srv050>
> >
> > Smaller queries, e.g. with a LIMIT attached, return fine:
> >
> >
> > 0: jdbc:phoenix:stl-colo-srv052> select * from TPCH.LINEITEM limit 10;
> > | L_ORDERKEY | L_PARTKEY | L_SUPPKEY | L_LINENUMBER | L_QUANTITY | L_EXTENDEDPRICE | L_DISCOUNT | L_TAX | L_RETURNFLAG | L_LINESTATUS | L_SHIPDATE              | L_COMMITDATE            | L_      |
> > | 1          | 15518935  | 768951    | 1            | 17         | 33203.72        | 0.04       | 0.02  | N            | O            | 1996-03-13 00:00:00.000 | 1996-02-12 00:00:00.000 | 1996-03 |
> > | 1          | 6730908   | 730909    | 2            | 36         | 69788.52        | 0.09       | 0.06  | N            | O            | 1996-04-12 00:00:00.000 | 1996-02-28 00:00:00.000 | 1996-04 |
> > | 1          | 6369978   | 369979    | 3            | 8          | 16381.28        | 0.1        | 0.02  | N            | O            | 1996-01-29 00:00:00.000 | 1996-03-05 00:00:00.000 | 1996-01 |
> > | 1          | 213150    | 463151    | 4            | 28         | 29767.92        | 0.09       | 0.06  | N            | O            | 1996-04-21 00:00:00.000 | 1996-03-30 00:00:00.000 | 1996-05 |
> > | 1          | 2402664   | 152671    | 5            | 24         | 37596.96        | 0.1        | 0.04  | N            | O            | 1996-03-30 00:00:00.000 | 1996-03-14 00:00:00.000 | 1996-04 |
> > | 1          | 1563445   | 63448     | 6            | 32         | 48267.84        | 0.07       | 0.02  | N            | O            | 1996-01-30 00:00:00.000 | 1996-02-07 00:00:00.000 | 1996-02 |
> > | 2          | 10616973  | 116994    | 1            | 38         | 71798.72        | 0          | 0.05  | N            | O            | 1997-01-28 00:00:00.000 | 1997-01-14 00:00:00.000 | 1997-02 |
> > | 3          | 429697    | 179698    | 1            | 45         | 73200.15        | 0.06       | 0     | R            | F            | 1994-02-02 00:00:00.000 | 1994-01-04 00:00:00.000 | 1994-02 |
> > | 3          | 1903543   | 653547    | 2            | 49         | 75776.05        | 0.1        | 0     | R            | F            | 1993-11-09 00:00:00.000 | 1993-12-20 00:00:00.000 | 1993-11 |
> > | 3          | 12844823  | 344848    | 3            | 27         | 47713.86        | 0.06       | 0.07  | A            | F            | 1994-01-16 00:00:00.000 | 1993-11-22 00:00:00.000 | 1994-01 |
> > 10 rows selected (0.603 seconds)
> > 0: jdbc:phoenix:stl-colo-srv052>
> >
> >
> > I am sure I am doing something wrong here and would greatly appreciate it
> > if you could point me in the right direction.
> >
> > Thanks Again
> >
> > Amit
> >
