On Fri, 26 Mar 2021 18:47:26 -1000
Weston Pace <[email protected]> wrote:
> I'm fairly certain there is room for improvement in the C++
> implementation for writing single files to ADLFS. Others can correct
> me if I'm wrong, but we don't do any kind of pipelined writes. I'd
> guess this is partly because there isn't much benefit when writing to
> local disk (writes are typically synchronous), but also because it's
> much easier to write multiple files.
Writes should be asynchronous most of the time. I don't know anything
about ADLFS, though.

Regards

Antoine.

> Is writing multiple files a choice for you? I would guess a dataset
> write with multiple files would be significantly more efficient than
> one large single-file write on ADLFS.
>
> -Weston
>
> On Fri, Mar 26, 2021 at 6:28 PM Yeshwanth Sriram <[email protected]> wrote:
> >
> > Hello,
> >
> > Thank you again for the earlier help on improving overall ADLFS read
> > latency using multiple threads; it has worked out really well.
> >
> > I've incorporated buffering in the ADLS writer implementation (up to
> > 64 MB). What I'm noticing is that the parquet_writer->WriteTable(table)
> > latency dominates everything else in the output phase of the job
> > (~65 s of a ~1.2 min total). I could use multiple threads (as io/s3fs
> > does), but I'm not sure that would have any effect on the Parquet
> > WriteTable operation.
> >
> > Question: Is there anything else I can leverage inside the Parquet
> > writer subsystem to improve the core WriteTable latency?
> >
> > schema:
> >   map<key, array<struct<…>>>
> >   struct<…>
> >   map<key, map<key, map<key, struct<…>>>>
> >   struct<…>
> >   binary
> > num_row_groups: 6
> > num_rows_per_row_group: ~8 million
> > write buffer size: 64 * 1024 * 1024 (~64 MB)
> > write compression: snappy
> > total write latency per row group: ~1.2 min
> > adls append/flush latency: minor factor
> > Azure: ESv3 / RAM: 256 GB / cores: 8
> >
> > Yesh
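
To make Weston's multi-file suggestion concrete, here is a minimal,
untested C++ sketch that slices one table into several Parquet files
and writes them concurrently. The slice count, "part-N.parquet" names,
and the local FileOutputStream are illustrative only; an ADLFS-backed
OutputStream would be substituted in a real job.

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/writer.h>

#include <future>
#include <string>
#include <vector>

// Write `table` as `num_files` Parquet files, one slice per file,
// with each file written on its own thread.
arrow::Status WriteSlices(const std::shared_ptr<arrow::Table>& table,
                          int num_files) {
  const int64_t rows_per_file = table->num_rows() / num_files + 1;
  std::vector<std::future<arrow::Status>> tasks;
  for (int i = 0; i < num_files; ++i) {
    const int64_t offset = i * rows_per_file;
    if (offset >= table->num_rows()) break;
    // Slice is zero-copy; the length is clamped to the table's end.
    std::shared_ptr<arrow::Table> slice = table->Slice(offset, rows_per_file);
    tasks.push_back(std::async(
        std::launch::async, [slice, i]() -> arrow::Status {
          // Local path for illustration; an ADLFS-backed OutputStream
          // would be opened here instead.
          ARROW_ASSIGN_OR_RAISE(auto sink,
                                arrow::io::FileOutputStream::Open(
                                    "part-" + std::to_string(i) + ".parquet"));
          ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(
              *slice, arrow::default_memory_pool(), sink,
              /*chunk_size=*/1 << 20));
          return sink->Close();
        }));
  }
  for (auto& task : tasks) {
    ARROW_RETURN_NOT_OK(task.get());
  }
  return arrow::Status::OK();
}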

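For the single-file path, a sketch of the setup Yeshwanth describes
(64 MB buffered sink, Snappy compression, ~8M-row row groups), assuming
an already-opened ADLFS OutputStream is passed in; the function name
and exact chunk size are illustrative:

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/writer.h>

arrow::Status WriteSingleFile(const std::shared_ptr<arrow::Table>& table,
                              std::shared_ptr<arrow::io::OutputStream> raw) {
  // Wrap the raw (e.g. ADLFS-backed) stream in a 64 MB buffer so that
  // Parquet's many small writes become large appends.
  ARROW_ASSIGN_OR_RAISE(
      auto sink, arrow::io::BufferedOutputStream::Create(
                     64 * 1024 * 1024, arrow::default_memory_pool(), raw));
  std::shared_ptr<parquet::WriterProperties> props =
      parquet::WriterProperties::Builder()
          .compression(parquet::Compression::SNAPPY)
          ->build();
  // chunk_size caps the rows per row group (~8M to match the setup
  // described in the thread).
  ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(
      *table, arrow::default_memory_pool(), sink,
      /*chunk_size=*/8 * 1000 * 1000, props));
  // WriteTable does not close the sink; Close() flushes the remaining
  // buffer through to the raw stream.
  return sink->Close();
}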