Re: potentially using SplittableDoFn to zipWithIndex for a large file

Scott Wegner Fri, 14 Dec 2018 11:03:57 -0800

> For A, I feel like this is only doable sequentially in Beam or in
> a preprocessing stage before Beam?
> Any file I/O library needs to check all characters in the file to find
> "\n" to determine the end of a line. Files are mostly not indexed, hence
> there is no metadata we can use to partition a large file in parallel.


Yes, the source transform would be the one responsible for reading a file
shard and splitting it on newlines. In the Beam Java SDK, this is
TextIO/TextSource [1]. I don't believe the implementation has hooks that
would allow for this type of customization, but you could create your own
clone of TextIO to test the idea. If this turns out to be generally useful,
we should add the necessary extensibility points.

[1]
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextSource.java
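
To make "reading a file shard and splitting on newlines" concrete, here is a
minimal sketch of the usual convention for byte-range shards: a shard owns
every line that starts inside its range, skips a partial first line, and reads
past its end to finish its last line. ShardLineReader and readShard are
hypothetical names for illustration only. This is not TextSource's actual
code, and it assumes "\n" delimiters and a file that fits in memory.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of the Beam SDK. */
final class ShardLineReader {

    /**
     * Returns the lines that START inside the byte range [start, end).
     * A line that begins before {@code end} is read to completion even if
     * it runs past {@code end}.
     */
    static List<String> readShard(byte[] file, int start, int end) {
        int pos = start;
        if (start > 0) {
            // The shard may begin mid-line; that partial line belongs to the
            // previous shard. Back up one byte first so that a shard starting
            // exactly on a line boundary still keeps its first line.
            pos = start - 1;
            while (pos < file.length && file[pos] != '\n') {
                pos++;
            }
            pos++; // first byte after the newline
        }

        List<String> lines = new ArrayList<>();
        while (pos < end && pos < file.length) {
            int lineStart = pos;
            // Scan forward to the end of the line, possibly past `end`.
            while (pos < file.length && file[pos] != '\n') {
                pos++;
            }
            lines.add(new String(file, lineStart, pos - lineStart,
                    StandardCharsets.UTF_8));
            pos++; // skip the '\n'
        }
        return lines;
    }
}
```

With this convention, shards can be read independently and every line is
emitted exactly once. A shard still cannot know how many lines precede it
without scanning the earlier bytes, which is the limitation raised in the
quoted question.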
