Thank you, Danny, for your help. I think you are right. I changed the function to my own implementation of Write, which writes to shards, and the process no longer fails. Thanks again.

-- Paweł
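The sharding idea that resolved this thread can be sketched in plain Go. This is an illustrative sketch, not Beam API code: `shardFor` and `shardFileName` are hypothetical helper names. The idea is to hash each element to a shard key (in a Beam pipeline, the key-assignment step before a GroupByKey) and write each shard to its own object, so no single worker has to hold the whole 1-2 TB output:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor deterministically assigns an element to one of n shards by
// hashing its content. In a Beam pipeline this would be the key-assignment
// step placed before a GroupByKey, so each shard is gathered and written
// independently instead of the whole file landing on one worker.
func shardFor(element string, n int) int {
	h := fnv.New32a()
	h.Write([]byte(element))
	return int(h.Sum32() % uint32(n))
}

// shardFileName builds a conventional "prefix-SSSSS-of-NNNNN" object name,
// so consumers can glob "prefix-*" to read the shards back as one dataset.
func shardFileName(prefix string, shard, total int) string {
	return fmt.Sprintf("%s-%05d-of-%05d", prefix, shard, total)
}

func main() {
	const numShards = 10
	for _, line := range []string{"alpha", "beta", "gamma"} {
		s := shardFor(line, numShards)
		fmt.Printf("%q -> %s\n", line, shardFileName("output", s, numShards))
	}
}
```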
On March 31, 2022 at 21:03, Danny McCormick
<[email protected]> wrote:
Hey Pawel,

The size is almost definitely the problem - textio.Write isn't really built to scale to single files of that size (at this time at least). Specifically, it performs an AddFixedKey/GroupByKey operation before writing to a file, which gathers all the file contents onto a single machine before writing the whole file at once. It seems likely that this is causing some sort of memory issue which is eventually causing the crash (I suspect the failed split is a red herring, since failing to split shouldn't be fatal).

If you need to write it all to GCS, I'd recommend writing some sort of sharding function to split the output into several files and then writing those to GCS - there's an example of how to do something like that in Beam Go's large wordcount example, which you could probably adapt to your use case.

Thanks,
Danny

On Thu, Mar 31, 2022 at 3:15 AM Pawel <[email protected]> wrote:
Hi Danny,

Thank you for your reply. I don't use a splittable DoFn explicitly. The problem appears when I attach the write function at the end of my pipeline. After ~10 hours of running a job with 30 workers, I first start getting a lot of errors like this from Dataflow:

Error syncing pod 3cfc1e3bc4c70cd9c668944ce9769a42
("df-go-job-1-1648642256936168-03300511-upar-harness-nn4g_default(3cfc1e3bc4c70cd9c668944ce9769a42)"),
skipping: failed to "StartContainer" for "sdk-0-0" with
CrashLoopBackOff: "back-off 40s restarting failed container=sdk-0-0
pod=df-go-job-1-1648642256936168-03300511-upar-harness-nn4g_default(3cfc1e3bc4c70cd9c668944ce9769a42)

Then the only 2 errors which appeared on the workers were:

unable to split process_bundle-2455586051109759173-4329: failed to split DataSource (at index: 1) at requested splits: {[0 1]}

Just after that, the pipeline was marked as failed in Dataflow, which makes me think this split is causing the pipeline to fail. I am pretty sure this happened in the last step, which is simply a textio.Write. Maybe the size of the output file is the problem - I'm expecting between 1 TB and 2 TB to be written to GCS.

-- Paweł
On March 30, 2022 at 14:47, Danny McCormick
<[email protected]> wrote:
Hey Pawel,

Are you able to provide any more information about the splittable DoFn that is failing and what its SplitRestriction function looks like? I'm not sure that we have enough information to fully diagnose the issue from the error alone.

The error itself is coming from here -
https://github.com/apache/beam/blob/ec62f3d9fa38e67279a70df28f2be8b17f344e27/sdks/go/pkg/beam/core/runtime/exec/datasource.go#L538
- and is pretty much what it sounds like: the runner is trying to initiate a split, but it fails because the SDK can't find a split point. I am kinda surprised that this causes the whole pipeline to fail, though - was there any additional logging around the job crash? I wouldn't expect a failed split to be enough to bring the whole job down. @Daniel Oliveira probably has more context here - Daniel, do you have anything to add?

Thanks,
Danny

On Wed, Mar 30, 2022 at 1:01 AM Pawel <[email protected]> wrote:

Hi,

I have very limited knowledge of Apache Beam and I'm facing a strange problem when starting a job in Dataflow using the Go SDK. First I get this debug message (I see more of such splits):

PB Split:
instruction_id:"process_bundle-2455586051109757402-4328"
desired_splits:{key:"ptransform-11" value:{fraction_of_remainder:0.5
allowed_split_points:0 allowed_split_points:1 estimated_input_elements:1}}
And just after this debug message there is an error from
github.com/apache/beam/sdks/[email protected]/go/pkg/beam/core/runtime/harness/harness.go:432:

unable to split process_bundle-2455586051109757402-4328: failed to split DataSource (at index: 1) at requested splits: {[0 1]}

After this error the whole job crashes. Is this likely a problem in my pipeline, or a bug in Dataflow and/or Beam?

Thanks in advance
-- Paweł