I have a fairly simple python word count job (although the packaging is a
little more complicated) that I'm trying to run.  (note: I'm explicitly NOT
using save_main_session.)

In it is a method to tokenize the incoming text to words, and I used
something similar to how the wordcount example worked.

def tokenize(row):
  import re
  return re.findall(r'[A-Za-z\']+', row.text)

which is then used as the function for a FlatMap:
| 'Split' >> (
        beam.FlatMap(tokenize).with_output_types(str))

However, if I run this job on dataflow (2.33), the python runner fails with
a bizarre error:
INFO:apache_beam.runners.dataflow.dataflow_runner:2021-12-07T20:59:59.704Z:
JOB_MESSAGE_ERROR: Traceback (most recent call last):
  File "apache_beam/runners/common.py", line 1232, in
apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 572, in
apache_beam.runners.common.SimpleInvoker.invoke_process
  File "/tmp/tmpq_8l154y/wordcount_test.py", line 75, in tokenize
ImportError: __import__ not found

I was able to find an example in the streaming wordcount snippet that did
something similar, but very strange [1]:
        | 'ExtractWords' >>
        beam.FlatMap(lambda x: __import__('re').findall(r'[A-Za-z\']+', x))

For whatever reason this actually fixed the issue in my job as well.  I
can't for the life of me understand why this works, or why the normal
import fails.  Someone else must have run into this same issue though for
that streaming wordcount example to be like that.  Any ideas what's going
on here?

[1]
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/snippets.py#L692

Reply via email to