[ 
https://issues.apache.org/jira/browse/BEAM-1841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Kirpichov updated BEAM-1841:
-----------------------------------
    Summary: FileBasedSource should have safeguards for when set of files grows 
while job is running  (was: FileBasedSource should better handle when set of 
files grows while job is running)

> FileBasedSource should have safeguards for when set of files grows while job 
> is running
> ---------------------------------------------------------------------------------------
>
>                 Key: BEAM-1841
>                 URL: https://issues.apache.org/jira/browse/BEAM-1841
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-dataflow, sdk-java-core
>            Reporter: Eugene Kirpichov
>            Assignee: Eugene Kirpichov
>
> In some cases people run pipelines over directories where the set of files in 
> the directory grows while the job runs. This may lead to the following 
> situation, in particular with the Dataflow runner:
> At job submission time, the FileBasedSource estimates the current size of the 
> filepattern and ends up with a small number. The Dataflow runner therefore 
> chooses a small desiredBundleSizeBytes to pass to splitIntoBundles(). However, 
> by the time splitIntoBundles() runs, the set of files has grown greatly, and 
> we produce many more bundles than anticipated, each unnecessarily small.
> I see a few things we could do:
> - In splitIntoBundles(), compute the actual size and detect when the desired 
> size is unreasonably small for it; e.g. set an upper threshold on how many 
> bundles we produce in total.
> - Somehow remember, at submission time, what the estimated size was. Then, in 
> splitIntoBundles(), compute the actual current size and scale 
> desiredBundleSizeBytes accordingly to get approximately the intended number 
> of bundles. Caveat: files may still change between the moment the size is 
> estimated and the moment splitting happens.
> - (much larger in scope) Change the whole protocol to use a number of bundles 
> instead of a bundle size in bytes. This probably won't happen with 
> BoundedSource, but it is going to be the case with Splittable DoFn.
> The first option seems by far the simplest.
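
To make the arithmetic behind the first two options concrete, here is a minimal Java sketch. It is not Beam code: the class BundleSizeSafeguard, its method names, and the cap of 1000 bundles used in the usage note are all hypothetical and chosen only for illustration. It assumes the actual total size in bytes has already been computed at split time.

```java
// Hypothetical helper, not part of the Beam SDK. It only illustrates the
// arithmetic behind the first two options in the description above.
final class BundleSizeSafeguard {

    private BundleSizeSafeguard() {}

    // Ceiling division for non-negative a and positive b.
    static long ceilDiv(long a, long b) {
        return a / b + (a % b == 0 ? 0 : 1);
    }

    // Number of bundles produced when totalBytes is cut into pieces of
    // bundleSizeBytes each.
    static long bundleCount(long totalBytes, long bundleSizeBytes) {
        return ceilDiv(totalBytes, bundleSizeBytes);
    }

    // First option: enforce an upper threshold on the number of bundles.
    // If the desired size would yield more than maxBundles bundles for the
    // actual total size, raise the bundle size just enough to respect the cap.
    static long capBundleCount(
            long desiredBundleSizeBytes, long actualTotalBytes, long maxBundles) {
        if (desiredBundleSizeBytes <= 0 || maxBundles <= 0 || actualTotalBytes < 0) {
            throw new IllegalArgumentException("sizes and cap must be positive");
        }
        long minBundleSizeBytes = ceilDiv(actualTotalBytes, maxBundles);
        return Math.max(desiredBundleSizeBytes, minBundleSizeBytes);
    }

    // Second option: scale the desired bundle size by the growth factor
    // between the size estimated at submission time and the actual size at
    // split time, so the bundle count stays close to what was intended.
    // This assumes the submission-time estimate was remembered somewhere.
    static long scaleToActualSize(
            long desiredBundleSizeBytes, long estimatedTotalBytes, long actualTotalBytes) {
        if (desiredBundleSizeBytes <= 0) {
            throw new IllegalArgumentException("desired bundle size must be positive");
        }
        // Without a usable estimate, or if the input did not grow, keep the
        // original size.
        if (estimatedTotalBytes <= 0 || actualTotalBytes <= estimatedTotalBytes) {
            return desiredBundleSizeBytes;
        }
        double growth = (double) actualTotalBytes / estimatedTotalBytes;
        return (long) Math.ceil(desiredBundleSizeBytes * growth);
    }
}
```

As a usage example with made-up numbers: with 1 GiB estimated at submission and a desired bundle size of 64 MiB, 16 bundles are intended. If the input has grown to 1 TiB by split time, the unmodified size would give 16384 bundles. capBundleCount(1L << 26, 1L << 40, 1000) raises the bundle size so that at most 1000 bundles result, and scaleToActualSize(1L << 26, 1L << 30, 1L << 40) returns 64 GiB, restoring the intended 16 bundles. Neither variant addresses the caveat that files may keep changing after the actual size is measured.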



