Please file a JIRA and include the explain plan for each of the queries. I
suspect your index is not being used in the first query due to the
selection of all the columns. You can try hinting the query to force your
index to be used. See
https://phoenix.apache.org/secondary_indexing.html#Index_Usage

Thanks,
James

On Mon, Oct 30, 2017 at 7:02 AM, Marcin Januszkiewicz <
januszkiewicz.mar...@gmail.com> wrote:

> We have a wide table with 100M records created with the following DDL:
>
> CREATE TABLE traces (
>   rowkey VARCHAR PRIMARY KEY,
>   time VARCHAR,
>   number VARCHAR,
>   +40 more columns)
>
> We want to select a large (~30M records) subset of this data with the
> query:
>
> SELECT *all columns*
>   FROM traces
>   WHERE (UPPER(number) LIKE 'PO %')
>   ORDER BY time DESC, ROWKEY
>   LIMIT 101;
>
> This times out after 15 minutes and puts a huge load on our cluster.
> We have an alternate way of selecting this data:
>
> SELECT t.rowkey, *all columns*
> FROM TRACES t
> JOIN (
>   SELECT rowkey
>   FROM TRACES
>   WHERE (UPPER(number) LIKE 'PO %')
>   ORDER BY time DESC, ROWKEY
>   LIMIT 101
> ) ix
> ON t.ROWKEY = ix.ROWKEY
> order by t.ROWKEY;
>
> Which completes in just under a minute.
> Is there a better way to construct this query?
> When is using the self-join a worse choice than the simple select?
> Given that we have a functional index on UPPER(number), could this
> potentially be a statistics-based optimizer decision?
>
> --
> Pozdrawiam,
> Marcin Januszkiewicz
>

Reply via email to