[ 
https://issues.apache.org/jira/browse/PHOENIX-3590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15822241#comment-15822241
 ] 

Samarth Jain edited comment on PHOENIX-3590 at 1/13/17 7:44 PM:
----------------------------------------------------------------

This turned out to be interesting. The data generator script uses random column
names starting with a letter from a(A) to z(Z). For ease of querying, a few
columns are hard-coded as c0, c1000, c2000, c3000 and c4000. HBase sorts the
data in a row by column qualifier. So for non-encoded data, these columns occur
toward the start of the row. For encoded data, on the other hand, we use
number-based column qualifiers, where columns are assigned qualifiers 0, 1, 2,
.. in the order in which they are declared in the DDL. So column c4000, which
was added as the 4000th column in the DDL, occurs at the 4000th position in the
row. As a result, when we filter on the c4000 column, HBase has to read much
less data before applying the filter for non-encoded data than for encoded
data, which makes the non-encoded query run faster. To validate my theory, I
sorted the column names for the non-encoded table and picked the last column to
use in the filter. For the encoded table, I selected the column that was
declared last in the DDL. The query then turned out to be faster for the
encoded table, which is expected since the amount of data to scan is smaller
for encoded than for non-encoded.
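
To make the ordering effect concrete, here is a minimal Python sketch. It is
only an illustration, not the actual data generator or Phoenix code: the
make_columns and cells_before helpers, the 8-letter uppercase random names, the
seed and the 4001-column count are all assumptions. It counts how many cells
sit in front of C4000 in a row when qualifiers are the column names (sorted
byte-wise) versus when they are the declaration index.

```python
import random
import string


def make_columns(total=4001, seed=0):
    """Build a hypothetical DDL column list in declaration order.

    Every 1000th slot is hard-coded as C0, C1000, ...; the rest get random
    8-letter uppercase names (an assumption, not the real generator).
    """
    rng = random.Random(seed)
    seen = set()
    columns = []
    for i in range(total):
        if i % 1000 == 0:
            name = f"C{i}"
        else:
            name = "".join(rng.choices(string.ascii_uppercase, k=8))
            while name in seen:  # keep column names unique
                name = "".join(rng.choices(string.ascii_uppercase, k=8))
        seen.add(name)
        columns.append(name)
    return columns


def cells_before(columns, target, encoded):
    """Count the cells that precede `target` within one row."""
    if encoded:
        # Encoded: the qualifier is the declaration index (0, 1, 2, ..),
        # so the order within the row equals the DDL order.
        return columns.index(target)
    # Non-encoded: the qualifier is the column name, sorted byte-wise.
    return sorted(columns, key=str.encode).index(target)


columns = make_columns()
print("non-encoded:", cells_before(columns, "C4000", encoded=False))
print("encoded:    ", cells_before(columns, "C4000", encoded=True))
```

Under these assumptions the encoded count is 4000 by construction, while the
non-encoded count only covers the names that sort ahead of C4000. Picking the
last name in sorted order for the non-encoded case puts both layouts at the
same position in the row, which is the comparison described above.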


was (Author: samarthjain):
This turned out to be interesting. The data generated script used random column 
names with names starting from a to z. For ease of querying a few columns are 
hard-coded as c0, c1000, c2000, c3000, c4000. HBase sorts the data in a row by 
column qualifiers. So for non-encoded data, these columns occur toward the 
start of the row. On the other hand, for encoded data, we use number based 
column qualifiers where columns get assigned qualifiers 0, 1, 2, .. depending 
on the order in which they are declared in the DDL. So column c4000, which was 
added as the 4000th column in the DDL, occurs at the 4000th position in the 
row. Now, when we are filtering on the c4000 column, HBase has to read much 
less data before applying the filter for non-encoded data as compared to for 
encoded. To validate my theory, I sorted the column names for non-encoded table 
and picked the last column to use in the filter. For encoded table, I selected 
the column that was declared last in the DDL. The query then turned out to be 
faster for encoded table which is expected since the amount of data to scan is 
smaller for encoded vs non-encoded.

> Filter on value column for mutable encoded table is > 3X slower compared to 
> non encoded table
> ---------------------------------------------------------------------------------------------
>
>                 Key: PHOENIX-3590
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-3590
>             Project: Phoenix
>          Issue Type: Sub-task
>            Reporter: Mujtaba Chohan
>            Assignee: Samarth Jain
>
> {noformat}
> select /*+ SERIAL NO_CACHE*/ k2,c0,c1000,c2000,c3000 from $T where c4000='50' 
> limit 1
> {noformat}
> Query gets progressively slower when the filtered column is the last
> column of the table.
> For data and schema see data generator script in 
> https://issues.apache.org/jira/browse/PHOENIX-3560



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
