Yes, you could use Apache Beam to execute such a pipeline. The main questions are where the set of remote directories would come from (the input) and where you would write your results (the output). And yes, you would have to write this without asyncio. There's also the question of what advantages you hope to gain from using Beam: a simpler programming model? The ability to run on a managed service?
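For illustration, here is a rough sketch of how the steps you describe might map onto Beam transforms in the Python SDK. Several things here are assumptions rather than anything from your setup: `INDEX_NAME`, the `fetch` callable (any synchronous function that takes a remote path and returns its text), and the reduce step, which simply counts how often each string occurs since I don't know what your key is. Beam is imported inside `run` only so the two helpers can be exercised without it.

```python
from typing import Callable, Iterable, Iterator, Tuple

# Hypothetical: the file name shared by every remote directory.
INDEX_NAME = "index.txt"


def list_paths(directory: str, fetch: Callable[[str], str]) -> Iterator[str]:
    """Download one directory's index file and yield the remote paths it lists."""
    for line in fetch(f"{directory.rstrip('/')}/{INDEX_NAME}").splitlines():
        if line.strip():
            yield line.strip()


def read_lines(path: str, fetch: Callable[[str], str]) -> Iterator[Tuple[str, int]]:
    """Download one listed file and yield (string, 1) pairs for the reduce step."""
    for line in fetch(path).splitlines():
        if line.strip():
            yield line.strip(), 1


def run(directories: Iterable[str], fetch: Callable[[str], str], output_prefix: str) -> None:
    """Build and run the pipeline on the default (local) runner."""
    import apache_beam as beam

    with beam.Pipeline() as p:
        (
            p
            | "Directories" >> beam.Create(list(directories))
            # One directory in, many file paths out.
            | "ListPaths" >> beam.FlatMap(list_paths, fetch)
            # One file path in, many (string, 1) pairs out.
            | "ReadLines" >> beam.FlatMap(read_lines, fetch)
            # Assumed reduce-by-key: sum the counts per string.
            | "ReduceByKey" >> beam.CombinePerKey(sum)
            | "Format" >> beam.MapTuple(lambda key, total: f"{key}\t{total}")
            | "Write" >> beam.io.WriteToText(output_prefix)
        )
```

The downloads inside `fetch` are plain blocking calls; the runner parallelizes across elements, which takes over the role asyncio plays in your current code. On a distributed runner `fetch` would also need to be picklable.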
On Thu, Nov 28, 2019 at 2:10 PM Marco Mistroni <[email protected]> wrote:
>
> Hi all
> i am currently getting acquainted with Apache beam to replace my current
> workflow, and was wondering if Beam can handle it.
> Currently, my workflow is based entirely on python asyncio plus some groupby
> operations, and it consists of the following
>
> - have a list of remote directories from which i need to download a file -
> file has same name across directories
> - for each of the file above, i need to scan the content (which is itself a
> list of remote file paths)
> - for each of the file paths above i need to extract the content to a list of
> string
> - i need to do a reducebYkey operation out of all the lists extracted above
>
> To me, it seems suitable... the only thing that concerns me is that i
> probably have to drop asyncio....
> Could anyone advise?
>
> kind regards
> Marco
