Hi, 

I have a daily-bucketed dataset of CSV files which is read by a Dataflow 
batch job using a custom CSV FileSource. The CSV records are parsed into dicts, 
timestamped, and then a sliding window is applied. The windows are two 
days long with a period of one day, and the idea is to emit diffs between 
sequential days. After timestamping the data and applying the window, a 
GroupByKey is performed (implicitly per key and window), followed by a DoFn 
which applies some logic for emitting diffs between the values per key, i.e. 
it checks whether there is an update between two days for an id.
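For context, the pipeline is structured roughly like the sketch below, 
simplified to the relevant transforms (MyCsvFileSource, parse_csv_line_to_dict, 
pipeline_options and the field names are placeholders, not the actual code):

    import apache_beam as beam
    from apache_beam.transforms.window import SlidingWindows, TimestampedValue

    ONE_DAY = 24 * 60 * 60  # window size/period in seconds

    class EmitDiffs(beam.DoFn):
        def process(self, element):
            key, records = element
            # Each 2-day window holds the records for this id from two
            # consecutive days; compare the older and newer record and
            # emit a diff if something changed.
            records = sorted(records, key=lambda r: r['timestamp'])
            if len(records) == 2 and records[0] != records[1]:
                yield (key, {'before': records[0], 'after': records[1]})

    with beam.Pipeline(options=pipeline_options) as p:
        (p
         | 'Read' >> beam.io.Read(MyCsvFileSource(input_pattern))
         | 'Parse' >> beam.Map(parse_csv_line_to_dict)
         | 'Timestamp' >> beam.Map(lambda d: TimestampedValue(d, d['timestamp']))
         | 'Window' >> beam.WindowInto(
               SlidingWindows(size=2 * ONE_DAY, period=ONE_DAY))
         | 'KeyById' >> beam.Map(lambda d: (d['id'], d))
         | 'Group' >> beam.GroupByKey()
         | 'Diff' >> beam.ParDo(EmitDiffs()))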

However, running this in Dataflow takes an incredible amount of resources and 
time. As an example, 11 GB of total data on approximately 40-50 n1-standard 
workers takes approximately 40 minutes. Is it just me, or does that seem 
unreasonably computationally expensive? 

In each window there are only a few values per key, so HotKeyFanout() is not 
really applicable. I've also tried Dataflow's experimental shuffle service, 
which hasn't provided any improvement.

I'm thinking there might be something related to the encoding of the dicts in 
the PCollections in Python, combined with the sliding windows, that's causing 
this inefficiency. 
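
One experiment I'm considering along those lines is replacing the dicts with a 
typed NamedTuple and registering a RowCoder, so elements aren't pickled on the 
way through the shuffle. A rough sketch of what I mean (Record and its fields 
are made up for illustration, not my real schema):

    from typing import NamedTuple
    import apache_beam as beam
    from apache_beam import coders

    class Record(NamedTuple):
        id: str
        timestamp: float
        value: str

    # With a registered RowCoder, Beam encodes elements using the schema
    # instead of falling back to PickleCoder for every element.
    coders.registry.register_coder(Record, coders.RowCoder)

    # In the pipeline, the parse step would then declare its output type:
    #   | 'Parse' >> beam.Map(parse_csv_line_to_record).with_output_types(Record)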

Any input would be greatly appreciated! :)

Kind Regards, 
Akash Patel