[ 
https://issues.apache.org/jira/browse/IMPALA-6964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16840447#comment-16840447
 ] 

Csaba Ringhofer edited comment on IMPALA-6964 at 5/15/19 2:24 PM:
------------------------------------------------------------------

While implementing IMPALA-6433 I noticed that only data pages are tracked, 
not dictionary pages. The difference can be pretty large for (long) string 
columns or if the column index filters out most data pages.

Another possible issue is that pages are only tracked if there is a slot for 
them, so they won't be tracked if only def+rep levels are read. With V1 data 
pages the whole data page has to be decompressed to get the def/rep levels, 
while V2 data pages store the def/rep levels uncompressed, so no decompression 
is needed in this case.
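
The V1/V2 difference can be sketched roughly as below. This is only an 
illustration, not Impala's code: the PageHeader class, its fields and the 
helper function are hypothetical, loosely modeled on the Parquet data page 
headers, and the sketch assumes the column chunk uses a compression codec.

```python
from dataclasses import dataclass


@dataclass
class PageHeader:
    """Hypothetical, simplified stand-in for a Parquet data page header."""
    version: int            # 1 for V1 data pages, 2 for V2 data pages
    compressed_size: int    # size of the page body as stored, in bytes
    uncompressed_size: int  # size of the page body after decompression


def bytes_to_decompress_for_levels(page: PageHeader) -> int:
    """Compressed bytes that must be decompressed to read only def/rep levels."""
    if page.version == 1:
        # V1: levels and values are compressed together, so the whole
        # page body has to be decompressed before the levels are readable.
        return page.compressed_size
    if page.version == 2:
        # V2: levels are stored uncompressed ahead of the values,
        # so reading them requires no decompression at all.
        return 0
    raise ValueError(f"unknown data page version: {page.version}")
```

A reader that only needs def/rep levels (no slot for the column) would 
therefore do decompression work for V1 pages but none for V2 pages.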


was (Author: csringhofer):
I have not noticed during the implementation of IMPALA-6433 that only data 
pages are tracked, but not dictionary pages. The difference can be pretty large 
for string columns or if column indexes filter out most data pages. 

> Track stats about column and page sizes in Parquet reader
> ---------------------------------------------------------
>
>                 Key: IMPALA-6964
>                 URL: https://issues.apache.org/jira/browse/IMPALA-6964
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>            Reporter: Tim Armstrong
>            Assignee: Sahil Takiar
>            Priority: Major
>              Labels: observability, parquet, ramp-up
>             Fix For: Impala 3.2.0
>
>
> It would be good to have stats for scanned parquet data about page sizes. We 
> currently can't tell much about the "shape" of the parquet pages from the 
> profile. Some questions that are interesting:
> * How big is each column? I.e. total compressed and decompressed size read.
> * How big are pages on average? Either compressed or decompressed size
> * What is the compression ratio for pages? Could be inferred from the above 
> two.
> I think storing all the stats in the profile per-column would be too much 
> data, but we could probably infer most useful things from higher-level 
> aggregates.
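
The aggregates asked about in the description could be derived along these 
lines. This is a minimal sketch under assumptions: PageStats and summarize 
are hypothetical names, not Impala counters, and the compression ratio is 
taken here as uncompressed size divided by compressed size.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PageStats:
    """Hypothetical per-page record of the bytes read for one page."""
    compressed_bytes: int
    uncompressed_bytes: int


def summarize(pages: List[PageStats]) -> Dict[str, float]:
    """Collapse per-page sizes into a few higher-level aggregates."""
    n = len(pages)
    total_c = sum(p.compressed_bytes for p in pages)
    total_u = sum(p.uncompressed_bytes for p in pages)
    return {
        "total_compressed": total_c,
        "total_uncompressed": total_u,
        # Averages are 0.0 when no pages were read, to avoid dividing by zero.
        "avg_compressed_page": total_c / n if n else 0.0,
        "avg_uncompressed_page": total_u / n if n else 0.0,
        # Ratio convention assumed here: uncompressed / compressed.
        "compression_ratio": total_u / total_c if total_c else 0.0,
    }
```

Only the totals and the page count need to be kept to recover the averages 
and the ratio, which is much less data than per-column stats in the profile.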



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

