Hi,

I have a daily-bucketed dataset of CSV files that is read by a Dataflow batch job using a custom CSV FileSource. The CSV records are parsed into dicts, timestamped, and then assigned to sliding windows. The windows are two days wide with a period of one day, and the idea is to emit diffs between sequential days. After timestamping the data and applying the window, a GroupByKey is performed (implicitly per key and window), followed by a DoFn that applies some logic to emit diffs between the values for each key, i.e. it checks whether there was an update between two days for an id.
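To make the structure concrete, here is a minimal sketch of what the pipeline looks like (parse_csv_line, DiffFn, the field names and the GCS path are simplified placeholders, not the real implementation):

import apache_beam as beam
from apache_beam.transforms.window import SlidingWindows, TimestampedValue

ONE_DAY = 24 * 60 * 60      # window period, in seconds
TWO_DAYS = 2 * ONE_DAY      # window size, in seconds


def parse_csv_line(line):
    # Stand-in for the custom CSV parsing: one CSV record -> one dict.
    fields = line.split(',')
    return {'id': fields[0], 'timestamp': float(fields[1]), 'value': fields[2]}


class DiffFn(beam.DoFn):
    """Compares the values for an id across the two days in a window and
    emits a diff only when something actually changed."""

    def process(self, element):
        key, values = element
        ordered = sorted(values, key=lambda v: v['timestamp'])
        for prev, curr in zip(ordered, ordered[1:]):
            if prev['value'] != curr['value']:
                yield key, {'before': prev, 'after': curr}


with beam.Pipeline() as p:
    _ = (
        p
        | 'Read' >> beam.io.ReadFromText('gs://my-bucket/daily/*.csv')  # stands in for the custom CSV FileSource
        | 'Parse' >> beam.Map(parse_csv_line)
        | 'Timestamp' >> beam.Map(lambda d: TimestampedValue(d, d['timestamp']))
        | 'Window' >> beam.WindowInto(SlidingWindows(size=TWO_DAYS, period=ONE_DAY))
        | 'KeyById' >> beam.Map(lambda d: (d['id'], d))
        | 'Group' >> beam.GroupByKey()   # implicitly per key and window
        | 'Diff' >> beam.ParDo(DiffFn())
    )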
However, running this on Dataflow takes an incredible amount of resources and time. As an example, roughly 11 GB of total data on approximately 40-50 n1-standard workers takes about 40 minutes. Is it just me, or does that seem unreasonably computationally expensive? Each window only contains a few values per key, so HotKeyFanout() is not really applicable. I've also tried Dataflow's experimental Shuffle service, which hasn't brought any improvement. I suspect it might be related to how dicts are encoded in Python PCollections, possibly in combination with the sliding windows. Any input would be greatly appreciated! :)

Kind Regards,
Akash Patel
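P.S. For reference, a minimal sketch of how the Shuffle service experiment is typically enabled for a Python batch job on Dataflow (project, region and bucket names below are placeholders, not the actual job configuration):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',                  # placeholder
    region='europe-west1',                 # placeholder
    temp_location='gs://my-bucket/tmp',    # placeholder
    experiments=['shuffle_mode=service'],  # opt in to the Dataflow Shuffle service
)

with beam.Pipeline(options=options) as p:
    ...  # pipeline as sketched above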
