[ https://issues.apache.org/jira/browse/FLINK-9753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kostas Kloudas closed FLINK-9753.
---------------------------------
    Resolution: Fixed

Merged on master with commit 66b1f854a0250bdd048808d40f93aa2990476841

> Support Parquet for StreamingFileSink
> -------------------------------------
>
>                 Key: FLINK-9753
>                 URL: https://issues.apache.org/jira/browse/FLINK-9753
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Streaming Connectors
>            Reporter: Stephan Ewen
>            Assignee: Kostas Kloudas
>            Priority: Major
>             Fix For: 1.6.0
>
>
> Formats like Parquet and ORC are great at compressing data and making it fast 
> to scan/filter/project the data.
> However, these formats are only efficient if they can columnarize and 
> compress a significant amount of data in their columnar format. If they 
> compress only a few rows at a time, they produce many short column vectors 
> and are thus much less efficient.
> The Bucketing Sink has the requirement that data be persisted on the target 
> FileSystem on each checkpoint.
> Pushing data through a Parquet or ORC encoder and flushing on each checkpoint 
> means that for frequent checkpoints, the amount of data 
> compressed/columnarized in a block is small. Hence, the result is an 
> inefficiently compressed file.
> Making this efficient independently of the checkpoint interval would mean 
> that the sink needs to first collect (and persist) a good amount of data and 
> then push it through the Parquet/ORC writers.
> I would suggest approaching this as follows:
>  - When writing to the "in progress files" write the raw records 
> (TypeSerializer encoding)
>  - When the "in progress file" is rolled over (published), the sink pushes 
> the data through the encoder.
>  - This is not much work on top of the new abstraction and will result in 
> large blocks and hence in efficient compression (see the sketch below).
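> 
> As an illustration of the two phases, here is a minimal sketch (a 
> hypothetical helper, not the merged implementation; TwoPhaseBulkStager and 
> BulkEncoder are illustrative names): records are staged in raw 
> TypeSerializer encoding, which is cheap to flush on every checkpoint, and 
> are only pushed through the columnar encoder when the file rolls over.
> {code:java}
> import org.apache.flink.api.common.typeutils.TypeSerializer;
> import org.apache.flink.core.memory.DataInputViewStreamWrapper;
> import org.apache.flink.core.memory.DataOutputViewStreamWrapper;
> 
> import java.io.*;
> 
> public class TwoPhaseBulkStager<T> {
> 
>     private final TypeSerializer<T> serializer;
>     private final File inProgressFile;
>     private final DataOutputViewStreamWrapper out;
> 
>     public TwoPhaseBulkStager(TypeSerializer<T> serializer, File inProgressFile)
>             throws IOException {
>         this.serializer = serializer;
>         this.inProgressFile = inProgressFile;
>         this.out = new DataOutputViewStreamWrapper(
>                 new BufferedOutputStream(new FileOutputStream(inProgressFile)));
>     }
> 
>     /** Phase 1: stage the raw record in TypeSerializer encoding. */
>     public void write(T record) throws IOException {
>         serializer.serialize(record, out);
>     }
> 
>     /** On checkpoint: only the raw staged bytes need to become durable. */
>     public void flush() throws IOException {
>         out.flush();
>     }
> 
>     /**
>      * Phase 2: on roll-over, replay the staged records through the columnar
>      * encoder so that Parquet/ORC sees one large block of rows.
>      */
>     public void rollOver(BulkEncoder<T> encoder) throws IOException {
>         out.close();
>         try (DataInputViewStreamWrapper in = new DataInputViewStreamWrapper(
>                 new BufferedInputStream(new FileInputStream(inProgressFile)))) {
>             while (true) {
>                 T record;
>                 try {
>                     record = serializer.deserialize(in);
>                 } catch (EOFException e) {
>                     break; // end of the staged data
>                 }
>                 encoder.addElement(record);
>             }
>         }
>         encoder.finish();
>     }
> 
>     /** Hypothetical encoder interface; maps naturally onto Parquet/ORC writers. */
>     public interface BulkEncoder<T> {
>         void addElement(T element) throws IOException;
>         void finish() throws IOException;
>     }
> }
> {code}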
> Alternatively, we can support directly encoding the stream to the "in 
> progress files" via Parquet/ORC, if users know that their combination of 
> data rate and checkpoint interval will result in large enough chunks of 
> data per checkpoint interval.
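> 
> For this alternative, the bulk-format variant of the StreamingFileSink can 
> encode Parquet directly. A minimal usage sketch (MyEvent and the output 
> path are illustrative; the Avro schema is derived by reflection):
> {code:java}
> import org.apache.flink.core.fs.Path;
> import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
> import org.apache.flink.streaming.api.datastream.DataStream;
> import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
> 
> public class ParquetSinkExample {
> 
>     /** Illustrative user record. */
>     public static class MyEvent {
>         public long timestamp;
>         public String payload;
>     }
> 
>     public static void attachSink(DataStream<MyEvent> stream) {
>         StreamingFileSink<MyEvent> sink = StreamingFileSink
>                 .forBulkFormat(
>                         new Path("hdfs:///data/events"),
>                         ParquetAvroWriters.forReflectRecord(MyEvent.class))
>                 .build();
> 
>         stream.addSink(sink);
>     }
> }
> {code}
> Note that bulk formats roll the in-progress file on every checkpoint, so 
> each part file covers one checkpoint interval; hence the caveat about data 
> rate above.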



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
