Hi Matt, I am much more familiar with Python, so I usually answer questions using that SDK. Also, it's quicker to type a fully detailed pipeline on an email and the SDKs are similar enough that it should not be too difficult to translate to Java from an IDE.
To your questions: 1. Grouping like that should not create a single in-memory iterable. Runners provide lazy iterables that load elements as you go through them. 2. The Java SDK provides the FileIO.match(), and FileIO.readMatches() transforms, which generate a list of file metadata objects, and convert them into file descriptors that can be read in a ParDo (respectively). I think those should do the trick for you. I guess, in Java you'd do something like so: p.apply(FileIO.match().filepattern(...)) .apply(ParDo.of(new AddSingleKeyDoFn()) .apply(GroupByKey.create()) .apply(ParDo.of(new DropKeyDoFn()) .apply(FileIO.readMatches()) .apply(ParDo.of(new ConsumeFileDescriptors()); You'll have to follow the FileIO documentation <https://beam.apache.org/releases/javadoc/2.9.0/org/apache/beam/sdk/io/FileIO.html>[1] to see what the file descriptors look like, and how to use the match function. Best -P. [1] https://beam.apache.org/releases/javadoc/2.9.0/org/apache/beam/sdk/io/FileIO.html On Mon, Jan 7, 2019 at 2:41 PM Matt Casters <mattcast...@gmail.com> wrote: > Hi Pablo, > > Apologies, I thought the cases were very simple and clear. Obviously I > should have also mentioned I'm in Java land, not used to the script kiddy > stuff :-) > > On the output side: thanks for the grouping "trick". However, doesn't > that mean that all rows will end up in a single in-memory Iterable? > > On the input side, reading a JSON or XML file shouldn't be black magic but > apparently it's completely undocumented as far as I could tell. Any tips > there? > > Mind you, if it's not possible in the Beam API, feel free to let me know > and I'll dive into the Beam source code to figure something out. > > Cheers, > > Matt > > > > --- > Matt Casters <m <mcast...@pentaho.org>attcast...@gmail.com> > Senior Solution Architect, Kettle Project Founder > > > > Op ma 7 jan. 2019 om 23:09 schreef Pablo Estrada <pabl...@google.com>: > >> Hi Matt, >> is this computation running as part of a larger pipeline that does run >> some parallel processing? Otherwise, it's odd that it needs to run on Beam. >> Nonetheless, you can certainly do this with a pipeline that has a single >> element. Here's what that looks like in python: >> >> p | beam.Create(['gs://myfile.json']) | beam.ParDo(LoadEachFile()) | >> WriteToMyDatabase() >> >> If, on the other hand, you may have a PCollection with multiple elements >> (e.g. filenames), and you want to process them one-by-one, you can group >> them all on a single key, like so: >> >> my_filename_pcolll | GroupIntoSingleThread() | beam.PArDo(LoadEachFile()) >> | WriteToMyDatabase() >> >> Where the GroupIntoSingleThread transform looks basically like so: >> >> input | beam.Map(lambda x: ('singlekey', x)) | beam.GroupByKey() | >> beam.FlatMap(lambda x: x[1]) >> >> In this example, we are adding a single key to all elements, grouping >> them all together, and then throwing away the key, to get each of the >> elements one-by-one in a single thread. You can do something similar using >> side inputs (with AsIter(my_filename_pcoll)). >> >> Does that help? Or perhaps you could clarify a bit more about your use >> case. >> Best >> -P. >> >> On Mon, Jan 7, 2019 at 1:33 PM Matt Casters <mattcast...@gmail.com> >> wrote: >> >>> Hi Beam! >>> >>> There's a bunch of stuff that I would like to support and it's probably >>> something silly but I couldn't find it immediately ... or I'm completely >>> dim and making too much of certain things. >>> >>> The thing is, sometimes you just want to do a single threaded operations. >>> For example, we sometimes need to read generic JSON or XML documents or >>> perform single threaded bulk loads into certain databases. >>> There's also simple relational database data you might want to side-load >>> or data from some web service somewhere. >>> >>> So, how can I instruct Beam not to fire up a bunch of readers or >>> writers, what is a good alternative for ParDo? >>> >>> Thanks in advance for any suggestions! >>> >>> Matt >>> --- >>> Matt Casters <m <mcast...@pentaho.org>attcast...@gmail.com> >>> Senior Solution Architect, Kettle Project Founder >>> >>> >>>