[ https://issues.apache.org/jira/browse/BEAM-1822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Eugene Kirpichov closed BEAM-1822.
----------------------------------
       Resolution: Duplicate
    Fix Version/s: Not applicable

> Improve handling of eventually-consistent filepatterns
> ------------------------------------------------------
>
>                 Key: BEAM-1822
>                 URL: https://issues.apache.org/jira/browse/BEAM-1822
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-java-core
>            Reporter: Eugene Kirpichov
>             Fix For: Not applicable
>
>
> Reading from an eventually consistent filepattern (e.g. one located in a
> multi-regional Google Cloud Storage bucket) using FileBasedSource is
> dangerous, because it may silently process less data than the user expects
> if the match call does not return all files.
> We should improve our handling of this case. I'd suggest aiming to minimize
> the chance of silent data loss. Here are a couple of things we could do:
> - Let the user supply an expected number of files to be matched, and fail the
> pipeline if the actual number is different. For special filepatterns like
> XXX-of-YYY, we can autodetect the expected number.
> - Poll the filepattern for a while (perhaps for a period determined by the
> underlying IOChannelFactory, which knows the typical eventual consistency
> convergence times of its filesystem), and either wait until it quiesces, or
> fail the pipeline if it doesn't.
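
The two suggestions in the issue description could look roughly like the sketch below. This is only an illustration under stated assumptions: the class FilepatternChecks and all of its methods are hypothetical names, not part of the Beam SDK, and the match call is modeled as a plain Supplier of file names rather than any real IOChannelFactory or FileBasedSource API. The shard-name regex is likewise an assumed reading of "XXX-of-YYY".

```java
import java.util.HashSet;
import java.util.List;
import java.util.OptionalInt;
import java.util.Set;
import java.util.function.Supplier;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; none of these names come from the Beam SDK. */
public final class FilepatternChecks {
  // Assumed shard naming: ...NNNNN-of-MMMMM, optionally followed by an extension.
  private static final Pattern SHARD = Pattern.compile("(\\d+)-of-(\\d+)(\\.[^/]*)?$");

  private FilepatternChecks() {}

  /** Returns the shard total encoded in the names, if every name agrees on one. */
  public static OptionalInt expectedShardCount(List<String> files) {
    OptionalInt expected = OptionalInt.empty();
    for (String file : files) {
      Matcher m = SHARD.matcher(file);
      if (!m.find()) {
        return OptionalInt.empty(); // not a shard-style name; cannot autodetect
      }
      int total = Integer.parseInt(m.group(2));
      if (expected.isPresent() && expected.getAsInt() != total) {
        return OptionalInt.empty(); // names disagree on the total; cannot autodetect
      }
      expected = OptionalInt.of(total);
    }
    return expected;
  }

  /** Fails loudly if the number of matched files differs from the expected number. */
  public static void checkCount(List<String> files, int expected) {
    if (files.size() != expected) {
      throw new IllegalStateException(
          "Expected " + expected + " files but matched " + files.size());
    }
  }

  /**
   * Re-runs the match until the result equals the previous result for stableRounds
   * consecutive polls, and fails if that has not happened within maxPolls polls.
   */
  public static Set<String> pollUntilStable(
      Supplier<List<String>> match, int stableRounds, int maxPolls, long sleepMillis) {
    Set<String> previous = null;
    int stable = 0;
    for (int poll = 0; poll < maxPolls; poll++) {
      Set<String> current = new HashSet<>(match.get());
      stable = current.equals(previous) ? stable + 1 : 0;
      if (stable >= stableRounds) {
        return current;
      }
      previous = current;
      try {
        Thread.sleep(sleepMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("Interrupted while polling the filepattern", e);
      }
    }
    throw new IllegalStateException("Filepattern did not quiesce after " + maxPolls + " polls");
  }
}
```

A caller might first run pollUntilStable, then pass the result to checkCount using either a user-supplied number or the value from expectedShardCount. Both checks fail the run instead of continuing with a partial listing, which is the "minimize silent data loss" goal stated in the issue. A stable listing is still not proof of completeness, so the count check remains the stronger of the two.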