- Writing multiple files is an option. I’ve already tested processing (read, 
filter, write) each row group in a separate thread, and that reliably brings the 
whole job under two minutes of latency. But within each processing unit, the 
Parquet write (I suppose the Parquet encode/serialize step) dominates all other 
latencies (including the ADLFS writes), hence my question whether there are any 
additional options in parquet/writer that I could leverage to bring this latency 
down. A rough sketch of the per-row-group approach follows below.
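
For concreteness, here is a sketch of that per-row-group pipeline, assuming a 
recent Arrow C++ release (exact signatures vary between versions). The filter 
step is a placeholder, error handling is minimal, and concurrent ReadRowGroup 
calls on a shared reader may need per-thread FileReader instances in practice:

    // Sketch: one thread per row group, each writing its own Parquet file.
    #include <future>
    #include <memory>
    #include <string>
    #include <vector>

    #include <arrow/api.h>
    #include <arrow/io/file.h>
    #include <parquet/arrow/reader.h>
    #include <parquet/arrow/writer.h>

    arrow::Status WriteRowGroupsInParallel(const std::string& in_path) {
      auto* pool = arrow::default_memory_pool();
      ARROW_ASSIGN_OR_RAISE(auto infile, arrow::io::ReadableFile::Open(in_path));
      std::unique_ptr<parquet::arrow::FileReader> reader;
      ARROW_RETURN_NOT_OK(parquet::arrow::OpenFile(infile, pool, &reader));

      std::vector<std::future<arrow::Status>> futures;
      for (int rg = 0; rg < reader->num_row_groups(); ++rg) {
        futures.push_back(std::async(std::launch::async, [&, rg]() -> arrow::Status {
          std::shared_ptr<arrow::Table> table;
          ARROW_RETURN_NOT_OK(reader->ReadRowGroup(rg, &table));
          // ... project/filter `table` here ...
          ARROW_ASSIGN_OR_RAISE(auto outfile, arrow::io::FileOutputStream::Open(
                                    "part-" + std::to_string(rg) + ".parquet"));
          return parquet::arrow::WriteTable(*table, pool, outfile,
                                            /*chunk_size=*/1 << 20);
        }));
      }
      for (auto& f : futures) ARROW_RETURN_NOT_OK(f.get());
      return arrow::Status::OK();
    }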

- The ADLFS SDK supports an append(pos, bytes) operation and a final flush 
(total bytes) operation, which makes it possible to append from different 
threads and perform the final flush after all futures are complete (sketched 
below). But this latency is a minor factor for this particular PoC.
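
A minimal sketch of that pattern; adls_append() and adls_flush() are 
hypothetical stand-ins for the actual SDK calls:

    // adls_append()/adls_flush() are hypothetical wrappers around the ADLFS
    // SDK's append(pos, bytes) / flush(total_bytes) operations noted above.
    #include <cstdint>
    #include <future>
    #include <vector>

    void adls_append(int64_t pos, const std::vector<uint8_t>& bytes);  // hypothetical
    void adls_flush(int64_t total_bytes);                              // hypothetical

    void ParallelUpload(const std::vector<std::vector<uint8_t>>& chunks) {
      std::vector<std::future<void>> futures;
      int64_t pos = 0;
      for (const auto& chunk : chunks) {
        // Each chunk's target offset is known up front, so appends can run
        // concurrently from different threads.
        futures.push_back(
            std::async(std::launch::async, adls_append, pos, std::cref(chunk)));
        pos += static_cast<int64_t>(chunk.size());
      }
      for (auto& f : futures) f.get();  // wait for all appends to land
      adls_flush(pos);                  // one flush with the total byte count
    }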

I’ll proceed to compare the latency of the existing Spark-based solution with 
what I have so far and will try to publish the numbers here. Thank you again 
for all the help.

I was wondering whether the Arrow/Parquet encode/decode subsystem could offer 
an option to pick any two of the following three optimization targets (a 
purely hypothetical sketch follows the list below):
- ReadOptimized
- WriteOptimized
- ComputeOptimized

Where

RC (read + compute) -> possibly an ML training scenario
WC (write + compute) -> my current use case: raw project/filter and write (no aggregations)
RW (read + write) -> reporting
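
Purely as an illustration of the idea (no such option exists in Arrow/Parquet 
today), the hint might look like:

    // Hypothetical sketch only; not an existing Arrow/Parquet API.
    enum class OptimizeFor {
      kReadAndCompute,   // "RC": e.g. ML training
      kWriteAndCompute,  // "WC": raw project/filter + write, no aggregations
      kReadAndWrite,     // "RW": e.g. reporting
    };

    struct EncodeDecodeOptions {
      // A write+compute mode might, for instance, prefer cheaper encodings
      // (PLAIN over dictionary) and lighter compression to cut encode time.
      OptimizeFor optimize_for = OptimizeFor::kReadAndWrite;
    };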

Yesh


> On Mar 27, 2021, at 2:12 AM, Antoine Pitrou <[email protected]> wrote:
> 
> On Fri, 26 Mar 2021 18:47:26 -1000
> Weston Pace <[email protected]> wrote:
>> I'm fairly certain there is room for improvement in the C++
>> implementation for writing single files to ADLFS.  Others can correct
>> me if I'm wrong but we don't do any kind of pipelined writes.  I'd
>> guess this is partly because there isn't much benefit when writing to
>> local disk (writes are typically synchronous) but also because it's
>> much easier to write multiple files.
> 
> Writes should be asynchronous most of the time.  I don't know anything
> about ADLFS, though.
> 
> Regards
> 
> Antoine.
> 
> 
>> 
>> Is writing multiple files a choice for you?  I would guess using a
>> dataset write with multiple files would be significantly more
>> efficient than one large single file write on ADLFS.
>> 
>> -Weston
>> 
>> On Fri, Mar 26, 2021 at 6:28 PM Yeshwanth Sriram <[email protected]> wrote:
>>> 
>>> Hello,
>>> 
>>> Thank you again for the earlier help on improving overall ADLFS read latency 
>>> using multiple threads, which has worked out really well.
>>> 
>>> I’ve incorporated buffering in the adls/writer implementation (up to 64 MB). 
>>> What I’m noticing is that the parquet_writer->WriteTable(table) latency 
>>> dominates everything else in the output phase of the job (~65 sec vs ~1.2 min). 
>>> I could use multiple threads (like io/s3fs) but I’m not sure it would have any 
>>> effect on the Parquet write-table operation.
>>> 
>>> Question: Is there anything else I can leverage inside the parquet/writer 
>>> subsystem to improve the core parquet/write/table latency?
>>> 
>>> 
>>> schema:
>>>  map<key,array<struct<…>>>
>>>  struct<...>
>>>  map<key,map<key,map<key, struct<…>>>>
>>>  struct<…>
>>>  binary
>>> num_row_groups: 6
>>> num_rows_per_row_group: ~8 million
>>> write buffer size: 64 * 1024 * 1024 (64 MB)
>>> write compression: snappy
>>> total write latency per row group: ~1.2 min
>>> adls append/flush latency (minor factor)
>>> Azure: ESv3 / RAM: 256 GB / Cores: 8
>>> 
>>> Yesh  
>> 
> 
> 
> 
