> For A, I feel like this is only doable sequentially in Beam or in preprocessing stage before Beam?
> Any file io library needs to check all characters in the file to find "\n" to determine the end of line. Files are mostly not indexed hence no metadata we can use to partition large file in a parallel way.

Yes, the source transform would be the one responsible for reading a file shard and splitting on newlines. In the Beam Java SDK, this is TextIO/TextSource [1]. I don't believe the implementation has hooks that would allow this type of customization, but you could create your own clone of TextIO to test the idea. If this turns out to be generally useful, we should add the necessary extensibility points.

[1] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextSource.java
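
To make "reading a file shard and splitting on newlines" concrete, here is a minimal plain-Java sketch of the general technique. It is not Beam's TextSource code: `ShardLineReader` and `readShard` are hypothetical names, the file is modeled as an in-memory byte array, and the convention that a line belongs to the shard containing its first byte is an assumption made for illustration. A clone of TextIO would be the place to experiment with a different rule.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Apache Beam.
class ShardLineReader {

    /**
     * Returns the lines whose first byte lies in [start, end).
     * Assumes '\n' line endings and UTF-8 text.
     */
    static List<String> readShard(byte[] data, int start, int end) {
        List<String> lines = new ArrayList<>();
        int pos = start;

        // A shard that does not begin at offset 0 may begin mid-line.
        // That partial line belongs to the previous shard, so skip it.
        // Scanning from start - 1 keeps a line that begins exactly at start.
        if (start > 0) {
            pos = start - 1;
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++; // first byte after the newline
        }

        // Emit every line that starts before end, even if it runs past end.
        while (pos < end && pos < data.length) {
            int lineEnd = pos;
            while (lineEnd < data.length && data[lineEnd] != '\n') {
                lineEnd++;
            }
            lines.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = lineEnd + 1;
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbeta\ngamma\n".getBytes(StandardCharsets.UTF_8);
        // Two byte ranges chosen without knowing where the lines are.
        System.out.println(readShard(data, 0, 8));           // [alpha, beta]
        System.out.println(readShard(data, 8, data.length)); // [gamma]
    }
}
```

Each shard only scans its own bytes plus the tail of its last line, so the ranges can be read in parallel without an index. Every line is emitted by exactly one shard, wherever the split offset falls.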
