[ https://issues.apache.org/jira/browse/BEAM-60?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eugene Kirpichov reassigned BEAM-60:
------------------------------------

    Assignee:     (was: Pei He)

> FileBasedSource/IOChannelFactory: Custom glob expansion
> -------------------------------------------------------
>
>                 Key: BEAM-60
>                 URL: https://issues.apache.org/jira/browse/BEAM-60
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-java-core
>            Reporter: Daniel Halperin
>
> Many cloud and distributed filesystems are eventually consistent, for
> instance Amazon S3 and Google Cloud Storage, so a file listing taken
> shortly after a write may not yet show every file.
> To work around this, many systems that produce files, such as Beam's
> FileBasedSink or Google BigQuery, provide a way to determine the
> number and set of files produced. E.g.,
> * Beam FileBasedSink uses -00000-of-NNNNN
> * BigQuery export jobs use -000000 -000001 -000002 ... until an empty file
>   is produced
> * Another system may produce a .filelist suffix that contains a list of all
>   files.
> Users should be able to supply a glob to FileBasedSource but additionally
> supply a "glob expander" that can provide a custom implementation for file
> expansion. That way, e.g., Beam pipelines can be run back-to-back-to-back,
> where each consumes the output of the previous, on an eventually consistent
> filesystem, without data loss.
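
To make the proposal concrete, here is a minimal sketch of what a "glob expander" for the -00000-of-NNNNN convention might look like. The names GlobExpander, ShardNameExpander and expand are hypothetical and are not part of the Beam SDK; the sketch assumes five-digit shard numbers and that at least one shard is visible in the listing. It derives the full expected file set from any one listed shard name, so shards that the listing has not yet surfaced are still included.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ShardedGlobExpanderSketch {

    /** Hypothetical hook (not a Beam API): maps a possibly incomplete listing to the full file set. */
    interface GlobExpander {
        List<String> expand(List<String> listedFiles);
    }

    /** Expander for names of the form prefix-SSSSS-of-NNNNN, with an optional suffix. */
    static class ShardNameExpander implements GlobExpander {
        private static final Pattern SHARD =
            Pattern.compile("^(.*)-(\\d{5})-of-(\\d{5})(.*)$");

        @Override
        public List<String> expand(List<String> listedFiles) {
            for (String name : listedFiles) {
                Matcher m = SHARD.matcher(name);
                if (m.matches()) {
                    String prefix = m.group(1);
                    int total = Integer.parseInt(m.group(3));
                    String suffix = m.group(4);
                    // Any single visible shard tells us how many shards exist,
                    // so enumerate all of them rather than trusting the listing.
                    List<String> all = new ArrayList<>(total);
                    for (int i = 0; i < total; i++) {
                        all.add(String.format(Locale.ROOT, "%s-%05d-of-%05d%s",
                            prefix, i, total, suffix));
                    }
                    return all;
                }
            }
            // No shard-style name found: fall back to the listing as given.
            return new ArrayList<>(listedFiles);
        }
    }

    public static void main(String[] args) {
        // Simulate a listing in which only one of three shards is visible so far.
        List<String> listed = Arrays.asList("gs://bucket/out-00002-of-00003");
        System.out.println(new ShardNameExpander().expand(listed));
    }
}
```

A reader built on such a hook would still have to wait or retry until each expected file becomes readable. The BigQuery and .filelist conventions would need their own expanders, since they discover the file set by probing for a terminating empty file or by reading a manifest rather than by parsing a name.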