Sorry, I didn’t realize you were already writing multiple files. The flame graphs Micah suggested would be extremely helpful. Can you also measure CPU utilization? If the CPU is not close to maxed out, then another possibility is that pipelined writes could help, given that ADLFS supports a high number of concurrent writes.
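To make the pipelined-writes idea concrete, here is a rough, untested sketch. AdlsFile is just a stand-in for whatever wrapper you already have around the SDK's append(pos, bytes)/flush(total bytes) calls you describe below; it is not an Arrow API. The only point is that appends at different offsets can be in flight concurrently, with a single flush committing the file at the end:

  #include <arrow/buffer.h>

  #include <cstdint>
  #include <future>
  #include <memory>
  #include <vector>

  // Stand-in for your ADLS wrapper; the real bodies would call the Azure
  // SDK's append(pos, bytes) and flush(total bytes) operations.
  struct AdlsFile {
    void Append(int64_t /*offset*/, const std::shared_ptr<arrow::Buffer>& /*data*/) {}
    void Flush(int64_t /*total_bytes*/) {}
  };

  // Upload already-encoded Parquet bytes with many appends in flight at
  // once, then commit the file with one flush at its final length.
  void PipelinedUpload(AdlsFile& file,
                       const std::vector<std::shared_ptr<arrow::Buffer>>& chunks) {
    std::vector<std::future<void>> pending;
    int64_t offset = 0;
    for (const auto& chunk : chunks) {
      pending.push_back(std::async(std::launch::async,
          [&file, offset, chunk] { file.Append(offset, chunk); }));
      offset += chunk->size();
    }
    for (auto& f : pending) f.get();  // wait for every append to land
    file.Flush(offset);               // commit at the final total length
  }

The same structure would also let you overlap encoding the next row group with uploading the previous one, which is really what I mean by pipelining; whether it buys anything depends on how much headroom the CPU numbers show.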
Also, regarding ReadOptimized/WriteOptimized/ComputeOptimized: what do you see as the difference between the three? Other than potentially enabling/disabling compression, I'm not sure I follow that point. (I've put a rough sketch of the writer knobs Micah mentions at the bottom of this mail.)

On Sun, Mar 28, 2021 at 8:12 AM Micah Kornfield <[email protected]> wrote:

>> Was thinking if the Arrow/Parquet encode/decode subsystem had an option
>> to pick (any two) from the following three options:
>> - ReadOptimized
>> - WriteOptimized
>> - ComputeOptimized
>
> The only thing that I'm aware of that could potentially impact this is
> the compression used (or not used, I think). There might also be a
> configuration knob to turn dictionary encoding on/off (turning it off
> would reduce computation requirements). The number of rows per row group
> might also impact this, but probably to a lesser extent.
>
> As you experiment, providing a flame graph or a similar profile could
> highlight hot spots that can be optimized.
>
> On Sun, Mar 28, 2021 at 10:58 AM Yeshwanth Sriram <[email protected]> wrote:
>
>> - Writing multiple files is an option. I've already tested processing
>> (read, filter, write) each row group in a separate thread, and that
>> gets the whole job under two minutes of latency. But within each
>> processing unit the Parquet write (I suppose the Parquet
>> encode/serialize step) dominates all other latencies (including the
>> ADLFS writes), hence my question whether there are any additional
>> options in parquet/writer that I could leverage to bring this latency
>> down.
>>
>> - The ADLFS SDK supports append(pos, bytes) and a final flush(total
>> bytes) operation, which makes it possible to append from different
>> threads and perform the final flush after all futures are complete.
>> But this latency is a small factor for this particular PoC.
>>
>> I'll proceed to compare the latency of the existing Spark-based
>> solution with what I have so far and try to publish the numbers here.
>> Thank you again for all the help.
>>
>> Was thinking if the Arrow/Parquet encode/decode subsystem had an option
>> to pick (any two) from the following three options:
>> - ReadOptimized
>> - WriteOptimized
>> - ComputeOptimized
>>
>> Where
>>
>> RC -> possibly an ML training scenario
>> WC -> my current use case: raw project/filter and write (no aggregations)
>> RW -> reporting
>>
>> Yesh
>>
>>
>> > On Mar 27, 2021, at 2:12 AM, Antoine Pitrou <[email protected]> wrote:
>> >
>> > On Fri, 26 Mar 2021 18:47:26 -1000
>> > Weston Pace <[email protected]> wrote:
>> >> I'm fairly certain there is room for improvement in the C++
>> >> implementation for writing single files to ADLFS. Others can correct
>> >> me if I'm wrong, but we don't do any kind of pipelined writes. I'd
>> >> guess this is partly because there isn't much benefit when writing
>> >> to local disk (writes are typically synchronous), but also because
>> >> it's much easier to write multiple files.
>> >
>> > Writes should be asynchronous most of the time. I don't know anything
>> > about ADLFS, though.
>> >
>> > Regards
>> >
>> > Antoine.
>> >
>> >>
>> >> Is writing multiple files a choice for you? I would guess a dataset
>> >> write with multiple files would be significantly more efficient than
>> >> one large single-file write on ADLFS.
>> >>
>> >> -Weston
>> >>
>> >> On Fri, Mar 26, 2021 at 6:28 PM Yeshwanth Sriram <[email protected]> wrote:
>> >>>
>> >>> Hello,
>> >>>
>> >>> Thank you again for the earlier help on improving overall ADLFS read
>> >>> latency using multiple threads, which has worked out really well.
>> >>>
>> >>> I've incorporated buffering in the ADLS writer implementation (up to
>> >>> 64 MB). What I'm noticing is that the parquet_writer->WriteTable(table)
>> >>> latency dominates everything else in the output phase of the job
>> >>> (~65 sec vs. ~1.2 min). I could use multiple threads (like io/s3fs),
>> >>> but I'm not sure that would have any effect on the Parquet
>> >>> write-table operation.
>> >>>
>> >>> Question: Is there anything else I can leverage inside the
>> >>> parquet/writer subsystem to improve the core Parquet write-table
>> >>> latency?
>> >>>
>> >>> schema:
>> >>>   map<key, array<struct<…>>>
>> >>>   struct<...>
>> >>>   map<key, map<key, map<key, struct<…>>>>
>> >>>   struct<…>
>> >>>   binary
>> >>> num_row_groups: 6
>> >>> num_rows_per_row_group: ~8 million
>> >>> write buffer size: 64 * 1024 * 1024 (~64 MB)
>> >>> write compression: snappy
>> >>> total write latency per row group: ~1.2 min
>> >>> adls append/flush latency (minor factor)
>> >>> Azure: ESv3 / RAM: 256 GB / Cores: 8
>> >>>
>> >>> Yesh
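For reference, and to make my comment above concrete: the knobs Micah mentions map onto the C++ writer roughly as in this untested sketch. The table and sink are whatever you already pass today, and the chunk_size value is only an illustration (it caps the rows per row group, where you currently have ~8 million):

  #include <arrow/io/api.h>
  #include <arrow/memory_pool.h>
  #include <arrow/status.h>
  #include <arrow/table.h>
  #include <parquet/arrow/writer.h>
  #include <parquet/properties.h>

  #include <memory>

  arrow::Status WriteWithKnobs(const std::shared_ptr<arrow::Table>& table,
                               const std::shared_ptr<arrow::io::OutputStream>& sink) {
    // Compression and dictionary encoding are writer-level (or per-column)
    // knobs on WriterProperties.
    std::shared_ptr<parquet::WriterProperties> props =
        parquet::WriterProperties::Builder()
            .compression(parquet::Compression::UNCOMPRESSED)  // or SNAPPY, ZSTD, ...
            ->disable_dictionary()   // skip dictionary encoding to save CPU
            ->build();
    // chunk_size is the maximum number of rows per row group.
    return parquet::arrow::WriteTable(*table, arrow::default_memory_pool(), sink,
                                      /*chunk_size=*/1 << 20, props);
  }

Whether turning compression or dictionary encoding off actually helps should show up pretty clearly in the flame graph, so I'd look at that first.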
