Hi Reece,

As you may have already found out, TextIO.java doesn't help here (unless you modify your input), since it is meant for reading lines from full text files.
Please see https://beam.apache.org/documentation/io/authoring-overview/ for some information on implementing read transforms for Beam (this guide is not complete yet). As described there, you can implement your read transform with or without the source API.

For your use case, I would *only* suggest using the source API (you can use the abstraction FileBasedSource instead of BoundedSource) if you hope to implement a generalized solution (not one for reading a single book) and if you wish to support dynamic work rebalancing when reading your file.

Otherwise I would go with a solution that consists of a ParDo transform that reads data from your input file and produces <chapter, line> KVs. You can parallelize reading by having a preceding ParDo that produces byte ranges to read, followed by a GroupByKey.

Also, you will probably only be able to parallelize reading efficiently (either using FileBasedSource or a ParDo that produces byte ranges) if you have a pre-computed index that gives the starting byte positions of the chapters.

- Cham

On Mon, Jul 31, 2017 at 12:02 PM Reece <[email protected]> wrote:

> Hello,
>
> I'm creating a beam pipeline that counts words by chapter from a single
> input file formatted as such:
>
> CHAPTER I. Down the Rabbit-Hole
>
> Alice was beginning to get very tired of sitting by her sister on the
> bank, and of having nothing to do: once or twice...
>
> CHAPTER II. The Pool of Tears
>
> 'Curiouser and curiouser!' cried Alice (she was so much surprised, that
> for the moment she quite forgot how to speak good English); 'now I'm
> opening out like the largest telescope that...
>
>
> Is there a way to achieve this in beam? Does it require extending
> BoundedSource (and if so, does anyone have any similar examples I could
> work off of?) I can think of a couple ways to do this by modifying the
> input file before the pipeline, but I'm interested to know if there's a way
> to do this purely in beam.
>
> Many thanks,
>
> ________________________
>
> Reece
> [email protected]
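
To make the ParDo suggestion above a bit more concrete, here is a rough sketch of the two pieces of logic involved, written as plain Java with no Beam dependency so it can be run on its own. The class name ChapterLines, the methods tagLines and chapterRanges, and the heading pattern are all made up for illustration; they are not part of Beam. I am also assuming that every chapter starts on a line of the form "CHAPTER <roman numeral>. <title>", as in your sample. In a real pipeline the tagLines loop would sit inside a DoFn and emit KV.of(chapter, line) instead of collecting into a list, and chapterRanges would be fed from whatever pre-computed index you have.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper class, not a Beam API.
final class ChapterLines {
    // Assumption: a chapter heading looks like "CHAPTER I. Down the Rabbit-Hole".
    private static final Pattern HEADING = Pattern.compile("^CHAPTER [IVXLCDM]+\\..*");

    /** Pairs each non-blank body line with the most recent chapter heading. */
    static List<Map.Entry<String, String>> tagLines(Iterable<String> lines) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        String chapter = null;
        for (String line : lines) {
            if (HEADING.matcher(line).matches()) {
                chapter = line.trim();          // a new chapter begins here
            } else if (chapter != null && !line.trim().isEmpty()) {
                out.add(new SimpleImmutableEntry<>(chapter, line));
            }
            // Blank lines and any text before the first heading are skipped.
        }
        return out;
    }

    /**
     * Turns sorted chapter start offsets (the pre-computed index) into
     * [start, end) byte ranges, one per chapter.
     */
    static List<long[]> chapterRanges(long[] starts, long fileSize) {
        List<long[]> ranges = new ArrayList<>();
        for (int i = 0; i < starts.length; i++) {
            long end = (i + 1 < starts.length) ? starts[i + 1] : fileSize;
            ranges.add(new long[] {starts[i], end});
        }
        return ranges;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "CHAPTER I. Down the Rabbit-Hole",
            "",
            "Alice was beginning to get very tired of sitting by her sister on the",
            "CHAPTER II. The Pool of Tears",
            "",
            "'Curiouser and curiouser!' cried Alice");
        for (Map.Entry<String, String> kv : tagLines(sample)) {
            System.out.println(kv.getKey() + " -> " + kv.getValue());
        }
    }
}
```

Splitting by chapter offsets rather than by fixed-size blocks means each range starts exactly at a heading, so a worker never has to guess which chapter a line belongs to. Without such an index, a reader starting in the middle of the file cannot know the current chapter without scanning backwards.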
