Bruce Robbins created SPARK-25164:
-------------------------------------

             Summary: Parquet reader builds entire list of columns once for each column
                 Key: SPARK-25164
                 URL: https://issues.apache.org/jira/browse/SPARK-25164
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.4.0
            Reporter: Bruce Robbins


{{VectorizedParquetRecordReader.initializeInternal}} loops through each column, and for each column it calls
{noformat}
requestedSchema.getColumns().get(i)
{noformat}
However, {{MessageType.getColumns}} rebuilds the entire column list from {{getPaths(0)}} on every call:
{noformat}
  public List<ColumnDescriptor> getColumns() {
    List<String[]> paths = this.getPaths(0);
    List<ColumnDescriptor> columns = new ArrayList<ColumnDescriptor>(paths.size());
    for (String[] path : paths) {
      // TODO: optimize this
      PrimitiveType primitiveType = getType(path).asPrimitiveType();
      columns.add(new ColumnDescriptor(
                      path,
                      primitiveType,
                      getMaxRepetitionLevel(path),
                      getMaxDefinitionLevel(path)));
    }
    return columns;
  }
{noformat}
This means that for each parquet file, this routine indirectly iterates colCount*colCount times: each of the colCount calls to {{getColumns}} walks all colCount paths.
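
To see the quadratic cost in isolation, here is a rough spark-shell sketch against the parquet-mr schema API (an illustration only, not the Spark source; the schema shape mimics the test table described below, and the cached variant mirrors the fix described at the end):
{noformat}
import org.apache.parquet.column.ColumnDescriptor
import org.apache.parquet.schema.{MessageType, PrimitiveType, Type}
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName

// Build a 6000-column schema of optional doubles.
val fields: Seq[Type] = (1 to 6000).map { i =>
  new PrimitiveType(Type.Repetition.OPTIONAL, PrimitiveTypeName.DOUBLE, s"id$i")
}
val schema = new MessageType("spark_schema", fields: _*)

def timeMs(body: => Unit): Long = {
  val start = System.nanoTime(); body; (System.nanoTime() - start) / 1000000
}

// Current pattern: getColumns() rebuilds the full list once per column.
val perColumn = timeMs {
  (0 until schema.getFieldCount).foreach(i => schema.getColumns.get(i))
}

// Cached pattern: build the list once, then index into it.
val cached = timeMs {
  val columnCache: java.util.List[ColumnDescriptor] = schema.getColumns
  (0 until schema.getFieldCount).foreach(i => columnCache.get(i))
}

println(s"per-column getColumns: $perColumn ms, cached: $cached ms")
{noformat}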

This is actually not particularly noticeable unless you have:
 - many parquet files
 - many columns

To verify that this is an issue, I created a 1 million record parquet table with 6000 columns of type double, spread across 67 files (so {{initializeInternal}} is called 67 times).
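A table of roughly that shape can be generated with something like the following (an illustrative sketch rather than the exact setup; the values only need to make {{id1 = 1}} match a small fraction of rows):
{noformat}
import org.apache.spark.sql.functions._

// 6000 double columns named id1..id6000, 1 million rows, 67 parquet files.
val cols = (1 to 6000).map(i => floor(rand() * 1000).cast("double").as(s"id$i"))
spark.range(1000000)
  .select(cols: _*)
  .repartition(67)
  .write.format("parquet").saveAsTable("6000_1m_double")
{noformat}
I ran the following query: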
{noformat}
sql("select * from 6000_1m_double where id1 = 1").collect
{noformat}
I used Spark built from the master branch, with 8 executor threads. The filter matches only a few thousand records. The query ran for 6.4 minutes on average.

Then I cached the column list at the top of {{initializeInternal}} as follows:
{noformat}
List<ColumnDescriptor> columnCache = requestedSchema.getColumns();
{noformat}
I then changed {{initializeInternal}} to use {{columnCache}} rather than {{requestedSchema.getColumns()}}.

With the column cache variable, the same query runs in 5 minutes on average. So even for this simple query, not rebuilding the column list for each column saves about 22% of the run time ((6.4 - 5.0) / 6.4 ≈ 0.22).



