Further, I'd try implementing James' suggestions _without_ the Phoenix Query Server. Remember that the thin client goes through PQS, which adds a level of indirection and re-serialization.

By using the "thick" driver, you can avoid this overhead, which will help you get repeatable test results with fewer moving pieces.
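For reference, here's roughly how the two clients are invoked (the host names and ports below are just placeholders for your PQS and ZooKeeper endpoints):

    # thin client: goes through the Phoenix Query Server
    sqlline-thin.py http://pqs-host:8765
    # thick client: talks to HBase directly via ZooKeeper
    sqlline.py zk-host:2181:/hbase

The corresponding JDBC URLs are jdbc:phoenix:thin:url=http://pqs-host:8765;serialization=PROTOBUF for the thin driver and jdbc:phoenix:zk-host:2181:/hbase for the thick driver.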

On 4/17/19 11:30 AM, James Taylor wrote:
Hi Hieu,
You could try adding the /*+ SERIAL */ hint to see if that has any impact. Also, have you tried not salting the table? A SALT_BUCKETS value of 128 is pretty high.
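For example, the hint goes right after the SELECT keyword, so your first query below would look something like:

SELECT /*+ SERIAL */ "pk","doubleCol","intCol","intCol2","stringCol","stringCol2","intCol3"
FROM "extended"
WHERE "doubleCol" > 100
LIMIT 10000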

For the other issue, do you have a lot of deleted cells? You might try running a major compaction. You might also try adding a secondary index on "doubleCol" if that's a common query.
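As a rough sketch (the index name is just a placeholder; the INCLUDE list covers the other columns your query selects so it can be served entirely from the index):

CREATE INDEX "doubleCol_idx" ON "extended" ("doubleCol")
    INCLUDE ("intCol", "intCol2", "intCol3", "stringCol", "stringCol2");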

Thanks,
James

On Thu, Apr 11, 2019 at 5:44 PM Hieu Nguyen <h...@box.com> wrote:

    Hi,

    I am using Phoenix 4.14-cdh5.11, with sqlline-thin as the client.  I
    am seeing strange patterns around SELECT query execution time:
    1. Increasing the LIMIT past a certain "threshold" results in
    significantly slower execution time.
    2. Adding just one column (BIGINT) to the SELECT results in
    significantly slower execution time.

    This is our schema (names are changed for readability):
    CREATE TABLE "metadata" (
       "pk"                           VARCHAR PRIMARY KEY
    )
    SALT_BUCKETS = 128,
    COLUMN_ENCODED_BYTES = 0,
    BLOOMFILTER = 'ROWCOL',
    COMPRESSION = 'GZ';

    CREATE VIEW "extended" (
    "doubleCol" DOUBLE,
    "intCol" BIGINT,
    "intCol2" BIGINT,
    "intCol3" BIGINT,
    "stringCol" VARCHAR,
    "stringCol2" VARCHAR,
    "stringCol3" VARCHAR,
    "stringCol4" VARCHAR,
    "stringCol5" VARCHAR,
    "stringCol6" VARCHAR,
    "stringCol7" VARCHAR,
    ) AS SELECT * FROM "metadata"

    We have other views that also select from "metadata" and define
    their own columns.  Overall, there are 1 million rows in this
    table, and 20k rows match the condition "doubleCol" > 100.

    Base query:
    SELECT
    "pk","doubleCol","intCol","intCol2","stringCol","stringCol2","intCol3"
    FROM "templatealldatattype-7d55c5a6-efe3-419d-9bce-9fea7c14f8bc"
    WHERE "doubleCol" > 100
    LIMIT 10000
    -> 1.976 seconds

    Decreasing LIMIT to 9500 (only 5% decrease in number of rows):
    SELECT
    "pk","doubleCol","intCol","intCol2","stringCol","stringCol2","intCol3"
    FROM "templatealldatattype-7d55c5a6-efe3-419d-9bce-9fea7c14f8bc"
    WHERE "doubleCol" > 100
    LIMIT 9500
    -> 0.409 seconds

    Removing "intCol3" from SELECT, keeping LIMIT at 10000:
    SELECT "pk","doubleCol","intCol","intCol2","stringCol","stringCol2"
    FROM "templatealldatattype-7d55c5a6-efe3-419d-9bce-9fea7c14f8bc"
    WHERE "doubleCol" > 100
    LIMIT 10000
    -> 0.339 seconds

    I ran each of these queries a few times in a row.  There was small
    variation in execution time, but the 2nd and 3rd queries were never
    slower than the 1st query.

    The EXPLAIN plan did not change, except for the ROW LIMIT value when
    explaining the 2nd query (9500 instead of 10000).
    
    +----------------------------------------------------------------------------------------------------+-----------------+----------------+----------------+
    |                                                PLAN                                                |  EST_BYTES_READ | EST_ROWS_READ  |  EST_INFO_TS   |
    +----------------------------------------------------------------------------------------------------+-----------------+----------------+----------------+
    | CLIENT 128-CHUNK 382226 ROWS 314572800 BYTES PARALLEL 128-WAY ROUND ROBIN FULL SCAN OVER metadata  | 314572800       | 382226         | 1554973434637  |
    |     SERVER FILTER BY "doubleCol" > 100.0                                                           | 314572800       | 382226         | 1554973434637  |
    |     SERVER 10000 ROW LIMIT                                                                         | 314572800       | 382226         | 1554973434637  |
    | CLIENT 10000 ROW LIMIT                                                                             | 314572800       | 382226         | 1554973434637  |
    +----------------------------------------------------------------------------------------------------+-----------------+----------------+----------------+

    I tried adding the SEEK_TO_COLUMN and NO_SEEK_TO_COLUMN hints as
    suggested in a similar thread
    
(https://lists.apache.org/thread.html/4ef8384ecd31f30fdaf5837e3abc613142426d899e916c7aae4a46d4@%3Cuser.phoenix.apache.org%3E),
    but they had no effect.

    Any pointers on how we can investigate the 4-5x slowdown when
    increasing LIMIT by only ~5% or when selecting just one more BIGINT
    column?  Could we have exceeded some threshold in the result size
    that caused the query to perform a lot slower for seemingly small
    changes in the query?

    Thanks,
    -Hieu
