[jira] [Commented] (BEAM-9316) FileIO.Write.relativeFileNaming should not be public
[ https://issues.apache.org/jira/browse/BEAM-9316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17548355#comment-17548355 ] Danny McCormick commented on BEAM-9316: --- This issue has been migrated to https://github.com/apache/beam/issues/19997 > FileIO.Write.relativeFileNaming should not be public > > > Key: BEAM-9316 > URL: https://issues.apache.org/jira/browse/BEAM-9316 > Project: Beam > Issue Type: Improvement > Components: io-java-files >Reporter: Claire McGinty >Priority: P3 > > I think the existing FileIO.writeDynamic is a bit easy to misuse, as > something like this looks correct, and compiles: > > {{ FileIO.writeDynamic()}} > {{ .by(...)}} > {{ .withNaming(new SerializableFunction[String, FileNaming] {}} > {{ override def apply(str: String): FileNaming =}} > {{ FileIO.Write.relativeFileNaming(}} > {{ "some/directory",}} > {{ new FileNaming {}} > {{ override defFilename(window: BoundedWindow, pane: PaneInfo, > numShards: Int, shardIndex: Int, compression: Compression): String = > "some_filename.txt"}}{{}}} > {{ .via(...)}} > {{ .to("gs://some/bucket")}} > > However, for dynamic writes, if `outputDirectory` (.to("...")) is set, under > the hood, Beam will wrap the provided `fileNamingFn` in > `FileIO.Write.relativeFileNaming(...)` as well, so it ends up as a nested > `relativeFileNaming` function. > ([https://github.com/apache/beam/blob/da9e17288e8473925674a4691d9e86252e67d7d7/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java#L1243)|https://github.com/apache/beam/blob/da9e17288e8473925674a4691d9e86252e67d7d7/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java#L1243])] > > IMO, `relativeFileNaming` should either be made private, so that it's only > used internally by FileIO.Write, or a precondition should be added when a > dynamic FileIO.Write is expanded, to check that `outputDirectory` can't be > set if the provided `fileNamingFn` is relative. > > wdyt? > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (BEAM-9316) FileIO.Write.relativeFileNaming should not be public
[ https://issues.apache.org/jira/browse/BEAM-9316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17136886#comment-17136886 ] Beam JIRA Bot commented on BEAM-9316: - This issue was marked "stale-P2" and has not received a public comment in 14 days. It is now automatically moved to P3. If you are still affected by it, you can comment and move it back to P2. > FileIO.Write.relativeFileNaming should not be public > > > Key: BEAM-9316 > URL: https://issues.apache.org/jira/browse/BEAM-9316 > Project: Beam > Issue Type: Improvement > Components: io-java-files >Reporter: Claire McGinty >Priority: P3 > > I think the existing FileIO.writeDynamic is a bit easy to misuse, as > something like this looks correct, and compiles: > > {{ FileIO.writeDynamic()}} > {{ .by(...)}} > {{ .withNaming(new SerializableFunction[String, FileNaming] {}} > {{ override def apply(str: String): FileNaming =}} > {{ FileIO.Write.relativeFileNaming(}} > {{ "some/directory",}} > {{ new FileNaming {}} > {{ override defFilename(window: BoundedWindow, pane: PaneInfo, > numShards: Int, shardIndex: Int, compression: Compression): String = > "some_filename.txt"}}{{}}} > {{ .via(...)}} > {{ .to("gs://some/bucket")}} > > However, for dynamic writes, if `outputDirectory` (.to("...")) is set, under > the hood, Beam will wrap the provided `fileNamingFn` in > `FileIO.Write.relativeFileNaming(...)` as well, so it ends up as a nested > `relativeFileNaming` function. > ([https://github.com/apache/beam/blob/da9e17288e8473925674a4691d9e86252e67d7d7/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java#L1243)|https://github.com/apache/beam/blob/da9e17288e8473925674a4691d9e86252e67d7d7/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java#L1243])] > > IMO, `relativeFileNaming` should either be made private, so that it's only > used internally by FileIO.Write, or a precondition should be added when a > dynamic FileIO.Write is expanded, to check that `outputDirectory` can't be > set if the provided `fileNamingFn` is relative. > > wdyt? > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-9316) FileIO.Write.relativeFileNaming should not be public
[ https://issues.apache.org/jira/browse/BEAM-9316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17122381#comment-17122381 ] Beam JIRA Bot commented on BEAM-9316: - This issue is P2 but has been unassigned without any comment for 60 days so it has been labeled "stale-P2". If this issue is still affecting you, we care! Please comment and remove the label. Otherwise, in 14 days the issue will be moved to P3. Please see https://beam.apache.org/contribute/jira-priorities/ for a detailed explanation of what these priorities mean. > FileIO.Write.relativeFileNaming should not be public > > > Key: BEAM-9316 > URL: https://issues.apache.org/jira/browse/BEAM-9316 > Project: Beam > Issue Type: Improvement > Components: io-java-files >Reporter: Claire McGinty >Priority: P2 > Labels: stale-P2 > > I think the existing FileIO.writeDynamic is a bit easy to misuse, as > something like this looks correct, and compiles: > > {{ FileIO.writeDynamic()}} > {{ .by(...)}} > {{ .withNaming(new SerializableFunction[String, FileNaming] {}} > {{ override def apply(str: String): FileNaming =}} > {{ FileIO.Write.relativeFileNaming(}} > {{ "some/directory",}} > {{ new FileNaming {}} > {{ override defFilename(window: BoundedWindow, pane: PaneInfo, > numShards: Int, shardIndex: Int, compression: Compression): String = > "some_filename.txt"}}{{}}} > {{ .via(...)}} > {{ .to("gs://some/bucket")}} > > However, for dynamic writes, if `outputDirectory` (.to("...")) is set, under > the hood, Beam will wrap the provided `fileNamingFn` in > `FileIO.Write.relativeFileNaming(...)` as well, so it ends up as a nested > `relativeFileNaming` function. > ([https://github.com/apache/beam/blob/da9e17288e8473925674a4691d9e86252e67d7d7/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java#L1243)|https://github.com/apache/beam/blob/da9e17288e8473925674a4691d9e86252e67d7d7/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java#L1243])] > > IMO, `relativeFileNaming` should either be made private, so that it's only > used internally by FileIO.Write, or a precondition should be added when a > dynamic FileIO.Write is expanded, to check that `outputDirectory` can't be > set if the provided `fileNamingFn` is relative. > > wdyt? > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-9316) FileIO.Write.relativeFileNaming should not be public
[ https://issues.apache.org/jira/browse/BEAM-9316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17040149#comment-17040149 ] Claire McGinty commented on BEAM-9316: -- thanks for taking a look [~jkff] and [~iemejia] ! my issue with it is that it actually has a bug resolving the relative filename when it's nested like this. For example, I ran this snippet on DirectRunner and DataflowRunner (sorry, more Scala code...): {code:java} val pipeline: Pipeline = ... pipeline.apply( Create.of(ImmutableList.of("a1", "b1")) ).apply( FileIO.writeDynamic() .by((element: String) => element.charAt(0).toString) .to("gs://some_bucket/write_dynamic") .via(TextIO.sink()) .withNaming((dst: String) => FileIO.Write.relativeFileNaming( StaticValueProvider.of("nested/directory"), new FileNaming { override def getFilename( window: BoundedWindow, pane: PaneInfo, numShards: Int, shardIndex: Int, compression: Compression ): String = s"file_${shardIndex}_${dst}.txt" } )) .withDestinationCoder(StringUtf8Coder.of()) ) {code} In DataflowRunner, files are written to gs://some_bucket/write_dynamic//nested/directory/file_0_b.txt and gs://some_bucket/write_dynamic//nested/directory/file_0_a.txt (note the extra forward slash–I think the filesystems API is prepending it to nested/directory when it matches the resource). In DirectRunner, files are written to gs://some_bucket/write_dynamic//Users/clairemcginty/nested/directory/file_0_a.txt and gs://some_bucket/write_dynamic//Users/clairemcginty/nested/directory/file_0_b.txt. In this case it resolves the absolute path of "nested/directory" on my local FS and appends that to the gs://some_bucket/write_dynamic, again with the double forward slash. It seems that the double forward slash is getting prepended when FileIO.Write.relativeFileNaming resolves the filename with the given baseDirectory (in this case "nested/directory" becomes "/nested/directory"). (If I get rid of .to() and just use FileIO.Write.relativeFileNaming( StaticValueProvider.of("gs://some_bucket/write_dynamic/nested/directory"), it works as expected.) I think this is actually a difficult thing to fix since FileSystems.matchNewResource tries to parse the path scheme, so passing just a directory name doesn't work correctly. It seems more straightforward to just make the API easier to use correctly by renaming/deprecating things as you suggested. > FileIO.Write.relativeFileNaming should not be public > > > Key: BEAM-9316 > URL: https://issues.apache.org/jira/browse/BEAM-9316 > Project: Beam > Issue Type: Improvement > Components: io-java-files >Reporter: Claire McGinty >Priority: Major > > I think the existing FileIO.writeDynamic is a bit easy to misuse, as > something like this looks correct, and compiles: > > {{ FileIO.writeDynamic()}} > {{ .by(...)}} > {{ .withNaming(new SerializableFunction[String, FileNaming] {}} > {{ override def apply(str: String): FileNaming =}} > {{ FileIO.Write.relativeFileNaming(}} > {{ "some/directory",}} > {{ new FileNaming {}} > {{ override defFilename(window: BoundedWindow, pane: PaneInfo, > numShards: Int, shardIndex: Int, compression: Compression): String = > "some_filename.txt"}}{{}}} > {{ .via(...)}} > {{ .to("gs://some/bucket")}} > > However, for dynamic writes, if `outputDirectory` (.to("...")) is set, under > the hood, Beam will wrap the provided `fileNamingFn` in > `FileIO.Write.relativeFileNaming(...)` as well, so it ends up as a nested > `relativeFileNaming` function. > ([https://github.com/apache/beam/blob/da9e17288e8473925674a4691d9e86252e67d7d7/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java#L1243)|https://github.com/apache/beam/blob/da9e17288e8473925674a4691d9e86252e67d7d7/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java#L1243])] > > IMO, `relativeFileNaming` should either be made private, so that it's only > used internally by FileIO.Write, or a precondition should be added when a > dynamic FileIO.Write is expanded, to check that `outputDirectory` can't be > set if the provided `fileNamingFn` is relative. > > wdyt? > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-9316) FileIO.Write.relativeFileNaming should not be public
[ https://issues.apache.org/jira/browse/BEAM-9316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17037332#comment-17037332 ] Eugene Kirpichov commented on BEAM-9316: Could you elaborate why the current behavior is undesirable? Do I understand correctly that in the given example it'll write to gs://some/bucket/some/directory/ ? If so, that seems fairly intuitive to me and arguably a useful thing to have: writing to subdirectories within a base directory specified with .to(). I think making relativeFileNaming private would be unfortunate: it's a backwards incompatible change (can break people who use it), and it seems useful to have something of that sort. As for checking whether a fileNaming is relative - I don't think it can be done in general (what if somebody writes a new FileNaming that, logically, is relative, but isn't constructed via relativeFileNaming). I guess you could throw runtime errors if .to() is specified and the FileNaming *returns* something that contains a slash? But again I'm not sure that's desirable. I acknowledge that there is some confusion but I think the best way to address it would be to rename things more clearly (and deprecate the old names). E.g. maybe something like: to() -> toBaseDirectory() and relativeFileNaming() -> toSubDirectory()? There's probably better options too. > FileIO.Write.relativeFileNaming should not be public > > > Key: BEAM-9316 > URL: https://issues.apache.org/jira/browse/BEAM-9316 > Project: Beam > Issue Type: Improvement > Components: io-java-files >Reporter: Claire McGinty >Priority: Major > > I think the existing FileIO.writeDynamic is a bit easy to misuse, as > something like this looks correct, and compiles: > > {{ FileIO.writeDynamic()}} > {{ .by(...)}} > {{ .withNaming(new SerializableFunction[String, FileNaming] {}} > {{ override def apply(str: String): FileNaming =}} > {{ FileIO.Write.relativeFileNaming(}} > {{ "some/directory",}} > {{ new FileNaming {}} > {{ override defFilename(window: BoundedWindow, pane: PaneInfo, > numShards: Int, shardIndex: Int, compression: Compression): String = > "some_filename.txt"}}{{}}} > {{ .via(...)}} > {{ .to("gs://some/bucket")}} > > However, for dynamic writes, if `outputDirectory` (.to("...")) is set, under > the hood, Beam will wrap the provided `fileNamingFn` in > `FileIO.Write.relativeFileNaming(...)` as well, so it ends up as a nested > `relativeFileNaming` function. > ([https://github.com/apache/beam/blob/da9e17288e8473925674a4691d9e86252e67d7d7/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java#L1243)|https://github.com/apache/beam/blob/da9e17288e8473925674a4691d9e86252e67d7d7/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java#L1243])] > > IMO, `relativeFileNaming` should either be made private, so that it's only > used internally by FileIO.Write, or a precondition should be added when a > dynamic FileIO.Write is expanded, to check that `outputDirectory` can't be > set if the provided `fileNamingFn` is relative. > > wdyt? > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-9316) FileIO.Write.relativeFileNaming should not be public
[ https://issues.apache.org/jira/browse/BEAM-9316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17037326#comment-17037326 ] Ismaël Mejía commented on BEAM-9316: I think making it private as Claire proposes is a good idea, any objection/comments on this [~jkff] [~chamikara] [~pabloem] > FileIO.Write.relativeFileNaming should not be public > > > Key: BEAM-9316 > URL: https://issues.apache.org/jira/browse/BEAM-9316 > Project: Beam > Issue Type: Improvement > Components: io-java-files >Reporter: Claire McGinty >Priority: Major > > I think the existing FileIO.writeDynamic is a bit easy to misuse, as > something like this looks correct, and compiles: > > {{ FileIO.writeDynamic()}} > {{ .by(...)}} > {{ .withNaming(new SerializableFunction[String, FileNaming] {}} > {{ override def apply(str: String): FileNaming =}} > {{ FileIO.Write.relativeFileNaming(}} > {{ "some/directory",}} > {{ new FileNaming {}} > {{ override defFilename(window: BoundedWindow, pane: PaneInfo, > numShards: Int, shardIndex: Int, compression: Compression): String = > "some_filename.txt"}}{{}}} > {{ .via(...)}} > {{ .to("gs://some/bucket")}} > > However, for dynamic writes, if `outputDirectory` (.to("...")) is set, under > the hood, Beam will wrap the provided `fileNamingFn` in > `FileIO.Write.relativeFileNaming(...)` as well, so it ends up as a nested > `relativeFileNaming` function. > ([https://github.com/apache/beam/blob/da9e17288e8473925674a4691d9e86252e67d7d7/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java#L1243)|https://github.com/apache/beam/blob/da9e17288e8473925674a4691d9e86252e67d7d7/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java#L1243])] > > IMO, `relativeFileNaming` should either be made private, so that it's only > used internally by FileIO.Write, or a precondition should be added when a > dynamic FileIO.Write is expanded, to check that `outputDirectory` can't be > set if the provided `fileNamingFn` is relative. > > wdyt? > -- This message was sent by Atlassian Jira (v8.3.4#803005)