[ https://issues.apache.org/jira/browse/SPARK-25643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16650866#comment-16650866 ]

Bruce Robbins edited comment on SPARK-25643 at 10/15/18 10:08 PM:
------------------------------------------------------------------

[~viirya] Yes, in the case where I said "matching rows are sprinkled fairly 
evenly throughout the table", I meant that predicate pushdown is working 
correctly, but that every data page for the filtered column contains at least 
one matching value. It's not about predicate pushdown so much as about how 
records are distributed in the files.

For example, I have a 1-million-record table with 6000 columns, where {{select 
* from table where id1 = 1}} returns 2330 rows. The table is not sorted by id1, 
so the matching records are scattered arbitrarily throughout the table. In this 
case, every data page for column id1 contains at least one entry with value 1, 
so no page can be skipped. As a result, every row gets realized and passed to 
FilterExec.

So the "small subset of rows" in this case is 2330 rows. When I say "the 
returned result includes just a few rows", I meant a few rows relative to the 
size of the table. There has to be enough matching rows so that many data pages 
are involved.
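
For anyone who wants to see this locally, here is a minimal sketch of how such 
a table can be generated. The path, the column names, and the modulus used to 
scatter the matching values are all assumptions for illustration, not the 
actual data behind the numbers above:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("wide-row-repro").getOrCreate()

val numRows = 1000000
val numCols = 6000 // wide table, as in the example above

// id1 cycles every 429 rows, so value 1 occurs roughly 2300 times and the
// matches are spread evenly: every data page for id1 contains at least one.
val base = spark.range(numRows.toLong)
  .withColumn("id1", (col("id") % 429).cast("int"))
val padding = (0 until numCols).map(i => lit(i).as(s"c$i"))
val wide = base.select(col("id") +: col("id1") +: padding: _*)
wide.write.mode("overwrite").parquet("/tmp/wide_table")

// select * realizes all 6000 columns for every row handed to FilterExec,
// even though only ~2300 rows survive the filter. collect() (rather than
// count()) keeps the full projection so the optimizer cannot prune columns.
spark.read.parquet("/tmp/wide_table").createOrReplaceTempView("t")
val rows = spark.sql("select * from t where id1 = 1").collect()
{code}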



> Performance issues querying wide rows
> -------------------------------------
>
>                 Key: SPARK-25643
>                 URL: https://issues.apache.org/jira/browse/SPARK-25643
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Bruce Robbins
>            Priority: Major
>
> Querying a small subset of rows from a wide table (e.g., a table with 6000 
> columns) can be quite slow in the following case:
>  * the table has many rows (most of which will be filtered out)
>  * the projection includes every column of the table (i.e., select *)
>  * predicate pushdown is not helping: either matching rows are sprinkled 
> fairly evenly throughout the table, or predicate pushdown is switched off
> Even if the filter involves only a single column and the returned result 
> includes just a few rows, the query can run much longer than an equivalent 
> query against a similar table with fewer columns.
> According to initial profiling, it appears that most time is spent realizing 
> the entire row in the scan, just so the filter can look at a tiny subset of 
> columns and almost certainly throw the row away. The profiling shows 74% of 
> the time is spent in FileSourceScanExec, across numerous writeFields_0_xxx 
> method calls.
> If Spark must realize the entire row just to check a tiny subset of columns, 
> this all sounds reasonable. However, I wonder if there is an optimization 
> here where we can avoid realizing the entire row until after the filter has 
> selected the row.
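
A query-level way to approximate that idea today is a two-phase scan: filter 
on a narrow projection first, then join back to fetch the wide rows for the 
survivors. A minimal sketch, reusing the illustrative table from the earlier 
snippet and assuming {{id}} is a unique key (both assumptions, not part of the 
original report):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.appName("two-phase-scan").getOrCreate()

// Phase 1: touch only the key and filter columns; no wide rows are realized.
val keys = spark.read.parquet("/tmp/wide_table")
  .select("id", "id1")
  .filter(col("id1") === 1)
  .select("id")

// Phase 2: fetch full rows only for the handful of matching keys.
val result = spark.read.parquet("/tmp/wide_table").join(keys, "id")
result.collect()
{code}

This reads the file twice, so it is a workaround sketch rather than the 
internal optimization the issue proposes.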


