Worker machines are n1-standard-2 (2 vCPUs and 7.5 GB of RAM).
The pipeline is simple, but it produces a large number of output files: in at
least one case, ~125K temp files were written. The steps are:
1. Scan Bigtable (NoSQL DB)
2. Transform with business logic
3. Convert to GenericRecord
4. WriteDynamic to a Google Cloud Storage bucket as Parquet files partitioned
by 15-minute intervals.
(gs://bucket/root_dir/CATEGORY/YEAR/MONTH/DAY/HOUR/MINUTE_FLOOR_15/FILENAME.parquet)
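
To make the partitioning concrete, here is a minimal sketch of how the
destination directory for a record could be derived. It is not the actual
pipeline code: the class name PartitionPath, the method directoryFor, the use
of UTC, and the zero-padded fields are all assumptions for illustration. A
string like this could serve as the destination key given to
FileIO.writeDynamic().by(...).

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.Locale;

// Hypothetical helper, not taken from the real pipeline.
final class PartitionPath {
    // Root taken from the example path in this email.
    static final String ROOT = "gs://bucket/root_dir";

    // Builds CATEGORY/YEAR/MONTH/DAY/HOUR/MINUTE_FLOOR_15 under ROOT.
    // Assumes UTC and zero-padded fields; the real layout may differ.
    static String directoryFor(String category, Instant eventTime) {
        ZonedDateTime t = eventTime.atZone(ZoneOffset.UTC);
        // Integer division floors the minute to 0, 15, 30, or 45.
        int minuteFloor15 = (t.getMinute() / 15) * 15;
        return String.format(Locale.ROOT, "%s/%s/%04d/%02d/%02d/%02d/%02d",
                ROOT, category, t.getYear(), t.getMonthValue(),
                t.getDayOfMonth(), t.getHour(), minuteFloor15);
    }

    public static void main(String[] args) {
        // prints gs://bucket/root_dir/CATEGORY/2019/11/21/14/30
        System.out.println(
                directoryFor("CATEGORY", Instant.parse("2019-11-21T14:37:12Z")));
    }
}
```

With this layout, each category produces 96 distinct destinations per day, so
the number of output files grows quickly with the time range scanned.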
Everything runs fine until it gets to the writeDynamic. When it reaches the
GroupByKey step
(FileIO.Write/WriteFiles/GatherTempFileResults/Reshuffle.ViaRandomKey/Reshuffle/GroupByKey),
the Stackdriver logs show a ton of GCs triggered by allocation failure that
free up essentially zero space, and the step never progresses. The job hits a
"The worker lost contact with the service." error four times and then fails.
Also worth noting: Dataflow scales down to a single worker during this time, so
it is trying to do it all on one machine.
I am likely not hitting GC thrashing alerts because I am using a snippet I
pulled from a GCP Dataflow template that queries Bigtable. The snippet appears
to disable the GC thrashing monitoring, because Bigtable creates at least 5
objects per row scanned:
DataflowPipelineDebugOptions debugOptions =
    options.as(DataflowPipelineDebugOptions.class);
debugOptions.setGCThrashingPercentagePerPeriod(100.00);
What are my options for splitting this up so that it is processed in smaller
chunks? I tried adding windowing, but it did not seem to help. Maybe I need to
do something beyond the windowing alone, but I don't really have a key to group
by here.
Andrew Kettmann
DevOps Engineer