Hello,

I’m exploring ways to reduce overall latency/cost when writing intermediate
data out as a dataset. Currently I have a union dataset composed of in-memory
tables which is partitioned and written out to storage. One way I can reduce
the overhead is to store metadata for each partition in storage and have the
reader side resolve that partition metadata to the actual data. The assumption
here is that both the reader and the writer are Arrow-based implementations.
A rough outline of the partition metadata files in the filesystem, and of the
metadata content, follows.

metadata files:
/base/part=A/_partition_metadata/01.parquet
/base/part=A/_partition_metadata/02.parquet
…

metadata content (stored as Parquet):

uri                      | row-group | row-offsets
-------------------------+-----------+-------------
file:///path/01.parquet  | 1         | [3,5,7]
file:///path/01.parquet  | 3         | [17,19,23]
…                        | …         | …
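
For concreteness, here is a minimal pyarrow sketch of how one such partition
metadata file could be produced. The paths, values, and column names are only
illustrative working assumptions on my part:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # One row per (data file, row group) that a reader should resolve;
    # row_offsets lists the row positions of interest within that row group.
    partition_metadata = pa.table({
        "uri": ["file:///path/01.parquet", "file:///path/01.parquet"],
        "row_group": [1, 3],
        "row_offsets": [[3, 5, 7], [17, 19, 23]],
    })

    # Illustrative location under the partition directory.
    pq.write_table(partition_metadata,
                   "/base/part=A/_partition_metadata/01.parquet")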

Questions:

- In the dataset subsystem there is currently a way to build a dataset via a
“_metadata” file, which I think resolves to row groups in Parquet files. Is
this a reasonable approach for implementing the partition data resolver on
the reader side? (A rough pyarrow sketch of what I mean is below.)

- What are the classes I should look at to get an understanding of how to
implement the writer side of the equation? (A sketch of the writer-side
pattern I have in mind is also below.)
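
For reference, this is what I mean by the “_metadata” path on the reader
side, as a pyarrow sketch. It assumes a consolidated _metadata file already
exists under /base; the path and partitioning are illustrative:

    import pyarrow.dataset as ds

    # Build a dataset from the consolidated _metadata file; the factory
    # resolves row groups from the Parquet metadata instead of opening
    # each data file's footer.
    dataset = ds.parquet_dataset("/base/_metadata", partitioning="hive")
    table = dataset.to_table()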
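
And this is the writer-side pattern I am currently aware of, based on the
metadata_collector / write_metadata combination in pyarrow. A rough sketch
only; `table` stands for the in-memory union data and the paths are
illustrative:

    import pyarrow.parquet as pq

    # Collect per-file Parquet FileMetaData while writing the partitioned
    # dataset, then merge it into a single _metadata file under the root.
    collected = []
    pq.write_to_dataset(
        table,                       # in-memory union table (assumed)
        root_path="/base",
        partition_cols=["part"],
        metadata_collector=collected,
    )
    pq.write_metadata(table.schema, "/base/_metadata",
                      metadata_collector=collected)

What I am unsure about is where to hook in the extra per-partition
information (the row-offset lists above) on top of this.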

Any other suggestions and help are welcome.

Thank you.
