Hi Reece,

As you may have already found out, TextIO.java doesn't help here (unless
you modify your input), since it reads text files line by line and emits
each line as a separate element, with no record of the chapter it came from.

Please see https://beam.apache.org/documentation/io/authoring-overview/ for
some information on implementing read transforms for Beam (this guide is
not complete yet). As described there, you can implement your read transform
with or without the source API.

For your use case, I would *only* suggest using the source API (you can use
the abstraction FileBasedSource instead of BoundedSource) if you hope to
implement a generalized solution (not just for reading a single book) and if
you wish to support dynamic work rebalancing when reading your file.
Otherwise I would go with a solution that consists of a ParDo transform that
reads data from your input file and produces <chapter, line> KVs. You can
parallelize reading by having a preceding ParDo that produces byte ranges
to read, followed by a GroupByKey, so that the ranges can be redistributed
across workers before the reading ParDo runs. Also, you will probably only
be able to efficiently parallelize reading (either using FileBasedSource or
a ParDo that produces byte ranges) if you have a pre-computed index that
gives the starting byte positions of chapters.
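
To make the byte-range idea concrete, here is a rough plain-Java sketch (no
Beam dependencies) of the two pieces of logic involved: building a chapter
index of byte offsets, and reading one byte range back as lines. The names
ChapterIndex, Range, buildIndex and readLines are made up for illustration,
and the sketch assumes a UTF-8 file small enough to scan in memory, where
every chapter heading is a line starting with "CHAPTER ". In a pipeline, the
body of readLines is roughly what the reading ParDo would do for each range
it receives.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Beam.
final class ChapterIndex {

    /** Half-open byte range [start, end) covering one chapter, heading included. */
    static final class Range {
        final String chapter;
        final long start;
        final long end;

        Range(String chapter, long start, long end) {
            this.chapter = chapter;
            this.start = start;
            this.end = end;
        }
    }

    /** Scans the file once and records the byte offset where each chapter heading starts. */
    static List<Range> buildIndex(Path file) throws IOException {
        // Assumption: the file fits in memory; a streaming scan would be needed otherwise.
        byte[] data = Files.readAllBytes(file);
        List<String> titles = new ArrayList<>();
        List<Long> starts = new ArrayList<>();
        int lineStart = 0;
        for (int i = 0; i <= data.length; i++) {
            // In UTF-8 the byte 0x0A only ever encodes '\n', so a byte scan is safe.
            if (i == data.length || data[i] == '\n') {
                if (i > lineStart) {
                    String line = new String(data, lineStart, i - lineStart,
                            StandardCharsets.UTF_8).trim();
                    if (line.startsWith("CHAPTER ")) {
                        titles.add(line);
                        starts.add((long) lineStart);
                    }
                }
                lineStart = i + 1;
            }
        }
        // Each chapter ends where the next one starts; the last one ends at end of file.
        List<Range> ranges = new ArrayList<>();
        for (int k = 0; k < starts.size(); k++) {
            long end = (k + 1 < starts.size()) ? starts.get(k + 1) : data.length;
            ranges.add(new Range(titles.get(k), starts.get(k), end));
        }
        return ranges;
    }

    /** Reads one range from the file and returns its non-empty lines, heading excluded. */
    static List<String> readLines(Path file, Range range) throws IOException {
        byte[] buf = new byte[(int) (range.end - range.start)];
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(range.start);
            raf.readFully(buf);
        }
        String[] split = new String(buf, StandardCharsets.UTF_8).split("\r?\n");
        List<String> lines = new ArrayList<>();
        // Index 0 is the heading itself, so start at 1.
        for (int i = 1; i < split.length; i++) {
            if (!split[i].trim().isEmpty()) {
                lines.add(split[i]);
            }
        }
        return lines;
    }
}
```

Pairing each returned line with range.chapter gives the <chapter, line> KVs
mentioned above. Since every range starts at a heading, each one can be read
independently of the others.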

- Cham

On Mon, Jul 31, 2017 at 12:02 PM Reece <[email protected]> wrote:

> Hello,
>
> I'm creating a beam pipeline that counts words by chapter from a single
> input file formatted as such:
>
> CHAPTER I. Down the Rabbit-Hole
>
> Alice was beginning to get very tired of sitting by her sister on the
> bank, and of having nothing to do: once or twice...
>
> CHAPTER II. The Pool of Tears
>
> 'Curiouser and curiouser!' cried Alice (she was so much surprised, that
> for the moment she quite forgot how to speak good English); 'now I'm
> opening out like the largest telescope that...
>
>
> Is there a way to achieve this in beam? Does it require extending
> BoundedSource (and if so, does anyone have any similar examples I could
> work off of?) I can think of a couple ways to do this by modifying the
> input file before the pipeline, but I'm interested to know if there's a way
> to do this purely in beam.
>
> Many thanks,
>
>
>
> ________________________
>
>
> Reece
> [email protected]
>
