Hi,

I am using Phoenix 4.14-cdh5.11, with sqlline-thin as the client. I am seeing strange patterns around SELECT query execution time:

1. Increasing the LIMIT past a certain "threshold" results in significantly slower execution.
2. Adding just one column (BIGINT) to the SELECT results in significantly slower execution.
This is our schema (names are changed for readability):

CREATE TABLE "metadata" (
    "pk" VARCHAR PRIMARY KEY
) SALT_BUCKETS = 128, COLUMN_ENCODED_BYTES = 0, BLOOMFILTER = 'ROWCOL', COMPRESSION = 'GZ';

CREATE VIEW "extended" (
    "doubleCol" DOUBLE,
    "intCol" BIGINT,
    "intCol2" BIGINT,
    "intCol3" BIGINT,
    "stringCol" VARCHAR,
    "stringCol2" VARCHAR,
    "stringCol3" VARCHAR,
    "stringCol4" VARCHAR,
    "stringCol5" VARCHAR,
    "stringCol6" VARCHAR,
    "stringCol7" VARCHAR
) AS SELECT * FROM "metadata";

We have other views that also select from "metadata" and define their own columns. Overall, there are 1 million rows in this table, and 20k rows match the condition "doubleCol" > 100.

Base query:

SELECT "pk", "doubleCol", "intCol", "intCol2", "stringCol", "stringCol2", "intCol3"
FROM "extended"
WHERE "doubleCol" > 100 LIMIT 10000
-> 1.976 seconds

Decreasing LIMIT to 9500 (only a 5% decrease in the number of rows):

SELECT "pk", "doubleCol", "intCol", "intCol2", "stringCol", "stringCol2", "intCol3"
FROM "extended"
WHERE "doubleCol" > 100 LIMIT 9500
-> 0.409 seconds

Removing "intCol3" from the SELECT, keeping LIMIT at 10000:

SELECT "pk", "doubleCol", "intCol", "intCol2", "stringCol", "stringCol2"
FROM "extended"
WHERE "doubleCol" > 100 LIMIT 10000
-> 0.339 seconds

I ran each of these queries a few times in a row. There was some small variation in execution time, but the 2nd and 3rd queries were never slower than the 1st. The EXPLAIN plan did not change, except for the ROW LIMIT value when explaining the 2nd query (9500 instead of 10000):

+----------------------------------------------------------------------------------------------------+-----------------+----------------+----------------+
|                                                PLAN                                                 | EST_BYTES_READ  | EST_ROWS_READ  |  EST_INFO_TS   |
+----------------------------------------------------------------------------------------------------+-----------------+----------------+----------------+
| CLIENT 128-CHUNK 382226 ROWS 314572800 BYTES PARALLEL 128-WAY ROUND ROBIN FULL SCAN OVER metadata   | 314572800       | 382226         | 1554973434637  |
| SERVER FILTER BY "doubleCol" > 100.0                                                                | 314572800       | 382226         | 1554973434637  |
| SERVER 10000 ROW LIMIT                                                                              | 314572800       | 382226         | 1554973434637  |
| CLIENT 10000 ROW LIMIT                                                                              | 314572800       | 382226         | 1554973434637  |
+----------------------------------------------------------------------------------------------------+-----------------+----------------+----------------+

I tried adding the SEEK_TO_COLUMN and NO_SEEK_TO_COLUMN hints as suggested in a similar thread ( https://lists.apache.org/thread.html/4ef8384ecd31f30fdaf5837e3abc613142426d899e916c7aae4a46d4@%3Cuser.phoenix.apache.org%3E ), but they had no effect.

Any pointers on how we can investigate the 4-5x slowdown when increasing LIMIT by only ~5%, or when selecting just one more BIGINT column? Could we be crossing some threshold in result size that causes the query to perform much more slowly for seemingly small changes in the query?

Thanks,
-Hieu
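P.S. In case it helps with reproducing the measurements: below is a minimal sketch of the JDBC harness we can use against the Phoenix Query Server to separate statement execution time from the time spent draining the result set (the host, port, and URL are placeholders for our environment; the query is the base query from above). It only assumes the standard thin-client jar on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class QueryTiming {
    public static void main(String[] args) throws Exception {
        // Placeholder Query Server host/port; adjust for your deployment.
        String url = "jdbc:phoenix:thin:url=http://localhost:8765;serialization=PROTOBUF";
        String sql = "SELECT \"pk\",\"doubleCol\",\"intCol\",\"intCol2\",\"stringCol\",\"stringCol2\",\"intCol3\" "
                   + "FROM \"extended\" WHERE \"doubleCol\" > 100 LIMIT 10000";
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement()) {
            long t0 = System.nanoTime();
            try (ResultSet rs = stmt.executeQuery(sql)) {
                long t1 = System.nanoTime();   // time until the first server response
                int rows = 0;
                while (rs.next()) {
                    rows++;                    // drain all rows to include fetch RPCs
                }
                long t2 = System.nanoTime();
                System.out.printf("execute: %.3f s, fetch %d rows: %.3f s%n",
                        (t1 - t0) / 1e9, rows, (t2 - t1) / 1e9);
            }
        }
    }
}

If the extra time shows up in the drain phase rather than in executeQuery, that would at least tell us whether to look at result transfer/fetching rather than the scan itself.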