Thank you, Danny, for your help. I think you are right. I changed the function to my own implementation of Write, which writes to shards, and the process no longer fails. Thanks again.

-- Paweł
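The sharding idea that resolved this thread can be sketched in plain Go. This is an illustrative sketch, not Beam API code: `shardFor` and `shardFileName` are hypothetical helper names. The idea is to hash each element to a shard key (in a Beam pipeline, the key-assignment step before a GroupByKey) and write each shard to its own object, so no single worker has to hold the whole 1-2 TB output:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor deterministically assigns an element to one of n shards by
// hashing its content. In a Beam pipeline this would be the key-assignment
// step placed before a GroupByKey, so each shard is gathered and written
// independently instead of the whole file landing on one worker.
func shardFor(element string, n int) int {
	h := fnv.New32a()
	h.Write([]byte(element))
	return int(h.Sum32() % uint32(n))
}

// shardFileName builds a conventional "prefix-SSSSS-of-NNNNN" object name,
// so consumers can glob "prefix-*" to read the shards back as one dataset.
func shardFileName(prefix string, shard, total int) string {
	return fmt.Sprintf("%s-%05d-of-%05d", prefix, shard, total)
}

func main() {
	const numShards = 10
	for _, line := range []string{"alpha", "beta", "gamma"} {
		s := shardFor(line, numShards)
		fmt.Printf("%q -> %s\n", line, shardFileName("output", s, numShards))
	}
}
```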
On March 31, 2022 at 21:03, Danny McCormick
<[email protected]> wrote:
Hey Pawel,

The size is almost definitely the problem - textio.Write isn't really built to scale to single files of that size (at this time at least). Specifically, it performs an AddFixedKey/GroupByKey operation before writing to a file, which gathers all the file contents onto a single machine before writing the whole file at once. It seems likely that this is causing some sort of memory issue which is eventually causing the crash (I suspect the failed split is a red herring, since failing to split shouldn't be fatal).

If you need to write it all to GCS, I'd recommend writing some sort of sharding function to split the output into several files and then writing those to GCS - there's an example of how to do something like that in Beam Go's large wordcount example, which you could probably adapt to your use case.

Thanks,
Danny

On Thu, Mar 31, 2022 at 3:15 AM Pawel <[email protected]> wrote:
Hi Danny,

Thank you for your reply. I don't use a splittable DoFn explicitly. The problem appears when I attach the write function at the end of my pipeline. After ~10 hours of running a job with 30 workers, I first start getting a lot of errors like this from Dataflow:

Error syncing pod 3cfc1e3bc4c70cd9c668944ce9769a42
("df-go-job-1-1648642256936168-03300511-upar-harness-nn4g_default(3cfc1e3bc4c70cd9c668944ce9769a42)"),
skipping: failed to "StartContainer" for "sdk-0-0" with
CrashLoopBackOff: "back-off 40s restarting failed container=sdk-0-0
pod=df-go-job-1-1648642256936168-03300511-upar-harness-nn4g_default(3cfc1e3bc4c70cd9c668944ce9769a42)

Then the only 2 errors which appeared on the workers were:

unable to split process_bundle-2455586051109759173-4329: failed to split DataSource (at index: 1) at requested splits: {[0 1]}

Just after that, the pipeline was marked as failed in Dataflow, which makes me think this split is causing the pipeline to fail. I am pretty sure this happened in the last step, which is simply a textio.Write. Maybe the size of the output file is the problem - I'm expecting between 1 TB and 2 TB to be written to GCS.

-- Paweł
On March 30, 2022 at 14:47, Danny McCormick
<[email protected]> wrote:
Hey Pawel,

Are you able to provide any more information about the splittable DoFn that is failing and what its SplitRestriction function looks like? I'm not sure that we have enough information to fully diagnose the issue from the error alone.

The error itself is coming from here -
https://github.com/apache/beam/blob/ec62f3d9fa38e67279a70df28f2be8b17f344e27/sdks/go/pkg/beam/core/runtime/exec/datasource.go#L538
- and is pretty much what it sounds like: the runner is trying to initiate a split, but it fails because the SDK can't find a split point. I am kinda surprised that this causes the whole pipeline to fail, though - was there any additional logging around the job crash? I wouldn't expect a failed split to be enough to bring the whole job down. @Daniel Oliveira probably has more context here - Daniel, do you have anything to add?

Thanks,
Danny

On Wed, Mar 30, 2022 at 1:01 AM Pawel <[email protected]> wrote:

Hi,

I have very limited knowledge of Apache Beam and I'm facing a strange problem when starting a job in Dataflow using the Go SDK. First I get this debug message (I see more of such splits):

PB Split:
instruction_id:"process_bundle-2455586051109757402-4328"
desired_splits:{key:"ptransform-11" value:{fraction_of_remainder:0.5
allowed_split_points:0 allowed_split_points:1 estimated_input_elements:1}}
And just after this debug message there is an error from
github.com/apache/beam/sdks/[email protected]/go/pkg/beam/core/runtime/harness/harness.go:432:

unable to split process_bundle-2455586051109757402-4328: failed to split DataSource (at index: 1) at requested splits: {[0 1]}

After this error the whole job crashes. Is this likely a problem in my pipeline, or a bug in Dataflow and/or Beam?

Thanks in advance
-- Paweł