Hi Pablo,

Apologies, I thought the cases were very simple and clear. Obviously I should have also mentioned that I'm in Java land and not used to the script kiddie stuff :-)

On the output side: thanks for the grouping "trick". However, doesn't that mean that all rows will end up in a single in-memory Iterable?

On the input side: reading a JSON or XML file shouldn't be black magic, but as far as I could tell it's completely undocumented. Any tips there?

Mind you, if it's not possible in the Beam API, feel free to let me know and I'll dive into the Beam source code to figure something out.

Cheers,
Matt
---
Matt Casters <[email protected]>
Senior Solution Architect, Kettle Project Founder

On Mon, Jan 7, 2019 at 23:09, Pablo Estrada <[email protected]> wrote:

> Hi Matt,
> Is this computation running as part of a larger pipeline that does run
> some parallel processing? Otherwise, it's odd that it needs to run on Beam.
> Nonetheless, you can certainly do this with a pipeline that has a single
> element. Here's what that looks like in Python:
>
> p | beam.Create(['gs://myfile.json']) | beam.ParDo(LoadEachFile()) | WriteToMyDatabase()
>
> If, on the other hand, you may have a PCollection with multiple elements
> (e.g. filenames), and you want to process them one-by-one, you can group
> them all on a single key, like so:
>
> my_filename_pcoll | GroupIntoSingleThread() | beam.ParDo(LoadEachFile()) | WriteToMyDatabase()
>
> Where the GroupIntoSingleThread transform looks basically like so:
>
> input | beam.Map(lambda x: ('singlekey', x)) | beam.GroupByKey() | beam.FlatMap(lambda x: x[1])
>
> In this example, we are adding a single key to all elements, grouping them
> all together, and then throwing away the key, to get each of the elements
> one-by-one in a single thread. You can do something similar using side
> inputs (with AsIter(my_filename_pcoll)).
>
> Does that help? Or perhaps you could clarify a bit more about your use
> case.
> Best
> -P.
>
> On Mon, Jan 7, 2019 at 1:33 PM Matt Casters <[email protected]> wrote:
>
>> Hi Beam!
>>
>> There's a bunch of stuff that I would like to support and it's probably
>> something silly but I couldn't find it immediately ... or I'm completely
>> dim and making too much of certain things.
>>
>> The thing is, sometimes you just want to do single-threaded operations.
>> For example, we sometimes need to read generic JSON or XML documents or
>> perform single-threaded bulk loads into certain databases.
>> There's also simple relational database data you might want to side-load
>> or data from some web service somewhere.
>>
>> So, how can I instruct Beam not to fire up a bunch of readers or writers,
>> what is a good alternative for ParDo?
>>
>> Thanks in advance for any suggestions!
>>
>> Matt
>> ---
>> Matt Casters <[email protected]>
>> Senior Solution Architect, Kettle Project Founder
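
Note: the three steps of the GroupIntoSingleThread "trick" quoted above (add one key, group, drop the key) can be mimicked in plain Python to see what they do to the data. This is only a rough sketch of the semantics, not Beam code: the function name group_into_single_thread and the file names are made up for illustration, and it says nothing about how a particular Beam runner materialises the grouped values or schedules the work.

```python
from itertools import groupby


def group_into_single_thread(elements):
    """Mimic Map -> GroupByKey -> FlatMap on a plain iterable (no Beam involved)."""
    # Step 1: attach the same key to every element.
    keyed = [("singlekey", x) for x in elements]

    # Step 2: group by key. Every element shares one key, so there is
    # exactly one group. In this sketch the group is a fully built list,
    # which is the "single in-memory Iterable" concern raised in the reply.
    grouped = [
        (key, [value for _, value in pairs])
        for key, pairs in groupby(keyed, key=lambda kv: kv[0])
    ]

    # Step 3: throw away the key and emit the values one by one,
    # sequentially, from a single place.
    for _, values in grouped:
        yield from values


# Hypothetical file names, standing in for the contents of my_filename_pcoll.
filenames = ["a.json", "b.json", "c.json"]
result = list(group_into_single_thread(filenames))
```

In this sketch the elements come out in their original order, one at a time, from a single loop. Inside a real pipeline, element order after a GroupByKey should not be relied on.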
