Gabor Szadovszky created PARQUET-1415:
-----------------------------------------

             Summary: Improve logic when to write column indexes
                 Key: PARQUET-1415
                 URL: https://issues.apache.org/jira/browse/PARQUET-1415
             Project: Parquet
          Issue Type: Improvement
            Reporter: Gabor Szadovszky
            Assignee: Gabor Szadovszky


Currently, we always write column indexes. If the data is ordered 
(ASCENDING or DESCENDING), filtering benefits greatly from column 
indexes. If the data is UNORDERED, however, it is not obvious whether filtering 
based on column indexes makes sense. For example, if the data is random, the 
min/max values of the different pages might be close to each other, so in most 
cases filtering based on these values would not drop any of the pages. On the 
other hand, UNORDERED does not mean that the values are random. The values 
may be clustered or semi-ordered. We should detect these cases before writing 
the column indexes and write them only if the min/max ranges of the pages do 
not overlap too much.

Another simple case is when we have only one page. In this case writing column 
indexes is useless, because there are no other pages to skip.
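
One possible way to express the two checks above is sketched below. This is 
only an illustration and not existing parquet-mr code: the class name 
ColumnIndexHeuristic, the method names, the use of long values for the page 
min/max, and the 0.5 cutoff are all assumptions made for the example. It 
counts the fraction of page pairs whose [min, max] ranges intersect and 
suggests writing the column index only when there is more than one page and 
that fraction is at or below the cutoff.

```java
// Hypothetical helper, not part of parquet-mr: a sketch of a
// "should we write the column index?" decision for one column chunk.
final class ColumnIndexHeuristic {
  // Assumed cutoff: skip the index when more than half of the page pairs overlap.
  static final double MAX_OVERLAP_RATIO = 0.5;

  // Fraction of page pairs whose [min, max] ranges intersect.
  // Returns 0.0 when there are fewer than two pages.
  static double overlapRatio(long[] mins, long[] maxs) {
    int n = mins.length;
    if (n < 2) {
      return 0.0;
    }
    long overlapping = 0;
    // Quadratic in the number of pages; acceptable for a sketch only.
    for (int i = 0; i < n; i++) {
      for (int j = i + 1; j < n; j++) {
        if (mins[i] <= maxs[j] && mins[j] <= maxs[i]) {
          overlapping++;
        }
      }
    }
    long pairs = (long) n * (n - 1) / 2;
    return (double) overlapping / pairs;
  }

  // True if writing the column index looks worthwhile under the assumed cutoff.
  static boolean shouldWriteColumnIndex(long[] mins, long[] maxs) {
    if (mins.length != maxs.length) {
      throw new IllegalArgumentException("mins and maxs must have the same length");
    }
    if (mins.length <= 1) {
      return false; // a single page leaves nothing to filter out
    }
    return overlapRatio(mins, maxs) <= MAX_OVERLAP_RATIO;
  }
}
```

With this sketch, pages with disjoint ranges (ordered or clustered data) pass 
the check, while pages that all span roughly the same range (random data) do 
not. A real implementation would need to handle all column types and their 
comparators, and would likely want a cheaper overlap estimate than comparing 
every pair of pages.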



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
