I have a fairly simple python word count job (although the packaging is a little more complicated) that I'm trying to run. (note: I'm explicitly NOT using save_main_session.)
In it is a method to tokenize the incoming text to words, and I used something similar to how the wordcount example worked. def tokenize(row): import re return re.findall(r'[A-Za-z\']+', row.text) which is then used as the function for a FlatMap: | 'Split' >> ( beam.FlatMap(tokenize).with_output_types(str)) However, if I run this job on dataflow (2.33), the python runner fails with a bizarre error: INFO:apache_beam.runners.dataflow.dataflow_runner:2021-12-07T20:59:59.704Z: JOB_MESSAGE_ERROR: Traceback (most recent call last): File "apache_beam/runners/common.py", line 1232, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 572, in apache_beam.runners.common.SimpleInvoker.invoke_process File "/tmp/tmpq_8l154y/wordcount_test.py", line 75, in tokenize ImportError: __import__ not found I was able to find an example in the streaming wordcount snippet that did something similar, but very strange [1]: | 'ExtractWords' >> beam.FlatMap(lambda x: __import__('re').findall(r'[A-Za-z\']+', x)) For whatever reason this actually fixed the issue in my job as well. I can't for the life of me understand why this works, or why the normal import fails. Someone else must have run into this same issue though for that streaming wordcount example to be like that. Any ideas what's going on here? [1] https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/snippets.py#L692