- Writing multiple files is an option. I've already tested processing (read, filter, write) each row group in a separate thread, and that reliably brings the whole job under two minutes. But within each processing unit the Parquet write (I suppose the Parquet encode/serialize step) latency dominates everything else (including the ADLFS writes), hence my question: are there any additional options in the parquet/writer that I could leverage to bring this latency down?
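  For concreteness, here is the kind of tuning I had in mind on the parquet/writer side: a minimal sketch (Arrow C++) that passes custom WriterProperties into parquet::arrow::WriteTable. Which knobs actually help for my schema (a cheaper codec, disabling dictionary encoding, larger data pages) is an assumption I'd still need to benchmark, not a known win.

    #include <arrow/io/file.h>
    #include <arrow/result.h>
    #include <arrow/table.h>
    #include <parquet/arrow/writer.h>
    #include <parquet/properties.h>

    // Sketch: write one table with tuned encoding properties.
    arrow::Status WriteTuned(const std::shared_ptr<arrow::Table>& table,
                             const std::string& path) {
      std::shared_ptr<parquet::WriterProperties> props =
          parquet::WriterProperties::Builder()
              .compression(parquet::Compression::SNAPPY)  // or UNCOMPRESSED/LZ4 to trade size for CPU
              ->disable_dictionary()          // skip dictionary build; benchmark both ways
              ->data_pagesize(4 * 1024 * 1024)  // larger pages -> fewer page headers
              ->build();
      ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::FileOutputStream::Open(path));
      // chunk_size here mirrors my ~8 million rows per row group.
      return parquet::arrow::WriteTable(*table, arrow::default_memory_pool(),
                                        sink, /*chunk_size=*/8 * 1000 * 1000,
                                        props);
    }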
- ADLFS/sdk supports append(pos, bytes) plus a final flush (total bytes) operation, which makes it possible to append from different threads and perform a single flush once all futures are complete (a sketch of this pattern follows below the quoted thread). But this latency is a small factor for this particular PoC. I'll proceed to compare the latency of the existing Spark-based solution against what I have so far and try to publish the numbers here. Thank you again for all the help.

I was also wondering whether the Arrow/Parquet encode/decode subsystem could offer an option to pick any two of the following three profiles:

- ReadOptimized
- WriteOptimized
- ComputeOptimized

Where:
- RC -> possibly an ML training scenario
- WC -> my current use case: raw project/filter and write (no aggregations)
- RW -> reporting

Yesh

> On Mar 27, 2021, at 2:12 AM, Antoine Pitrou <[email protected]> wrote:
>
> On Fri, 26 Mar 2021 18:47:26 -1000
> Weston Pace <[email protected]> wrote:
>> I'm fairly certain there is room for improvement in the C++
>> implementation for writing single files to ADLFS. Others can correct
>> me if I'm wrong but we don't do any kind of pipelined writes. I'd
>> guess this is partly because there isn't much benefit when writing to
>> local disk (writes are typically synchronous) but also because it's
>> much easier to write multiple files.
>
> Writes should be asynchronous most of the time. I don't know anything
> about ADLFS, though.
>
> Regards
>
> Antoine.
>
>> Is writing multiple files a choice for you? I would guess using a
>> dataset write with multiple files would be significantly more
>> efficient than one large single file write on ADLFS.
>>
>> -Weston
>>
>> On Fri, Mar 26, 2021 at 6:28 PM Yeshwanth Sriram <[email protected]> wrote:
>>>
>>> Hello,
>>>
>>> Thank you again for earlier help on improving overall ADLFS read latency
>>> using multiple threads, which has worked out really well.
>>>
>>> I've incorporated buffering in the adls/writer implementation (up to 64 MB).
>>> What I'm noticing is that the parquet_writer->WriteTable(table) latency
>>> dominates everything else in the output phase of the job (~65 sec vs ~1.2 min).
>>> I could use multiple threads (like io/s3fs) but I'm not sure it would have
>>> any effect on the parquet write table operation.
>>>
>>> Question: Is there anything else I can leverage inside the parquet/writer
>>> subsystem to improve the core parquet/write/table latency?
>>>
>>> schema:
>>>   map<key,array<struct<…>>>
>>>   struct<...>
>>>   map<key,map<key,map<key, struct<…>>>>
>>>   struct<…>
>>>   binary
>>> num_row_groups: 6
>>> num_rows_per_row_group: ~8mil
>>> write buffer size: 64 * 1024 * 1024 (~64 mb)
>>> write compression: snappy
>>> total write latency per row group: ~1.2min
>>> adls append/flush latency (minor factor)
>>> Azure: ESv3/RAM: 256Gb/Cores: 8
>>>
>>> Yesh
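PS: For concreteness, the append/flush pattern from my second bullet as a minimal sketch. AdlsAppendClient is a hypothetical stand-in for the real ADLFS SDK (the actual method names and signatures differ); only the shape matters here: compute offsets up front, append concurrently, then issue one final flush of the total length.

    #include <cstdint>
    #include <future>
    #include <string>
    #include <vector>

    // Hypothetical stand-in mirroring the append(pos, bytes) / flush(total)
    // semantics described above; replace with the real ADLFS SDK calls.
    struct AdlsAppendClient {
      void Append(int64_t offset, const std::string& bytes) { /* sdk append */ }
      void Flush(int64_t total_length) { /* sdk flush commits the file */ }
    };

    // chunks are already-encoded byte ranges (e.g. one per row group).
    // Offsets are computed up front so all appends can run concurrently.
    void ParallelAppend(AdlsAppendClient& client,
                        const std::vector<std::string>& chunks) {
      std::vector<int64_t> offsets(chunks.size(), 0);
      for (size_t i = 1; i < chunks.size(); ++i) {
        offsets[i] = offsets[i - 1] + static_cast<int64_t>(chunks[i - 1].size());
      }
      std::vector<std::future<void>> futures;
      for (size_t i = 0; i < chunks.size(); ++i) {
        futures.push_back(std::async(std::launch::async,
                                     [&client, &chunks, &offsets, i] {
          client.Append(offsets[i], chunks[i]);
        }));
      }
      for (auto& f : futures) f.get();  // wait for every append to land
      if (!chunks.empty()) {
        client.Flush(offsets.back() +
                     static_cast<int64_t>(chunks.back().size()));
      }
    }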
