Bruce Robbins created SPARK-23963:
-------------------------------------

             Summary: Queries on text-based Hive tables grow disproportionately 
slower as the number of columns increases
                 Key: SPARK-23963
                 URL: https://issues.apache.org/jira/browse/SPARK-23963
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.3.0
            Reporter: Bruce Robbins


TableReader gets disproportionately slower as the number of columns in the 
query increases.

For example, reading a table with 6000 columns is 4 times more expensive per 
record than reading a table with 3000 columns, rather than twice as expensive.

The increase in processing time is due to several Lists (fieldRefs, 
fieldOrdinals, and unwrappers), each of which the reader indexes by column 
number for every column of every record. Because positional lookup on a List is 
O(n), filling a single record with n columns costs O(n^2) rather than O(n), so 
these lookups grow increasingly expensive as the column count increases.
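
To make the cost concrete, here is a minimal Scala sketch of that access 
pattern. The names fillRecord and getFieldData and the simplified types are 
hypothetical stand-ins, not the actual TableReader signatures; only the three 
List names come from the code in question.

{code:scala}
// Minimal sketch (hypothetical names/types, not the real TableReader API):
// three per-column structures are indexed by position for every column of
// every record.
object ListLookupSketch {
  def fillRecord(
      record: AnyRef,
      getFieldData: (AnyRef, String) => AnyRef,  // stand-in for ObjectInspector access
      fieldRefs: List[String],                   // per-column field references
      fieldOrdinals: List[Int],                  // per-column output positions
      unwrappers: List[AnyRef => AnyRef],        // per-column value converters
      row: Array[AnyRef]): Unit = {
    val n = fieldRefs.length
    var i = 0
    while (i < n) {
      // Each List apply below walks roughly i nodes: O(i) per access, so one
      // record of n columns costs O(n^2) overall.
      row(fieldOrdinals(i)) = unwrappers(i)(getFieldData(record, fieldRefs(i)))
      i += 1
    }
  }
}
{code}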

When I patched the code to change those three Lists to Arrays, the query times 
became proportional to the column count.
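
A sketch of that change, again with hypothetical stand-in types: converting the 
Lists to Arrays once, when the reader is set up, pays a one-time O(n) cost and 
makes every positional lookup inside the per-record loop O(1).

{code:scala}
// Hypothetical sketch of the fix: materialize the three per-column structures
// as Arrays once per reader, outside the per-record loop.
object ArrayConversionSketch {
  def prepare(
      fieldRefs: List[String],
      fieldOrdinals: List[Int],
      unwrappers: List[AnyRef => AnyRef])
    : (Array[String], Array[Int], Array[AnyRef => AnyRef]) = {
    // One-time O(n) conversion, paid per reader rather than per record.
    (fieldRefs.toArray, fieldOrdinals.toArray, unwrappers.toArray)
  }
}
{code}

The per-record loop then indexes the three Arrays exactly as in the sketch 
above, but with constant-time access, so per-record cost is linear in the 
column count.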
