hm, interesting, are we using pybind anywhere? I didn't see any references to it. I can give it a try on python 3.8 too though.
On Wed, Dec 8, 2021 at 9:19 AM Brian Hulette <[email protected]> wrote: > A google search for "__import__ not found" turned up an issue filed with > pybind [1]. I can't deduce a root cause from the discussion there, but it > looks like they didn't experience the issue in Python 3.8 - it could be > interesting to see if your problem goes away there. > > It looks like +Charles Chen <[email protected]> added the __import__('re') > workaround in [2], maybe he remembers what was going on? > > [1] https://github.com/pybind/pybind11/issues/2557 > [2] https://github.com/apache/beam/pull/5071 > > On Wed, Dec 8, 2021 at 5:30 AM Steve Niemitz <[email protected]> wrote: > >> Yeah, I can't imagine this is a "normal" problem. >> >> I'm on linux w/ py 3.7. My script does have a __name__ == '__main__' >> block. >> >> On Wed, Dec 8, 2021 at 12:38 AM Ning Kang <[email protected]> wrote: >> >>> I tried a pipeline: >>> >>> p = beam.Pipeline(DataflowRunner(), options=options) >>> text = p | beam.Create(['Hello World, Hello You']) >>> >>> >>> def tokenize(x): >>> import re >>> return re.findall('Hello', x) >>> >>> >>> flatten = text | 'Split' >> >>> (beam.FlatMap(tokenize).with_output_types(str)) >>> pipeline_result = p.run() >>> >>> >>> Didn't run into the issue. >>> >>> What OS and Python version are you using? Does your script come with a >>> `if __name__ == '__main__': `? >>> >>> On Tue, Dec 7, 2021 at 6:58 PM Steve Niemitz <[email protected]> >>> wrote: >>> >>>> I have a fairly simple python word count job (although the packaging is >>>> a little more complicated) that I'm trying to run. (note: I'm explicitly >>>> NOT using save_main_session.) >>>> >>>> In it is a method to tokenize the incoming text to words, and I used >>>> something similar to how the wordcount example worked. >>>> >>>> def tokenize(row): >>>> import re >>>> return re.findall(r'[A-Za-z\']+', row.text) >>>> >>>> which is then used as the function for a FlatMap: >>>> | 'Split' >> ( >>>> beam.FlatMap(tokenize).with_output_types(str)) >>>> >>>> However, if I run this job on dataflow (2.33), the python runner fails >>>> with a bizarre error: >>>> INFO:apache_beam.runners.dataflow.dataflow_runner:2021-12-07T20:59:59.704Z: >>>> JOB_MESSAGE_ERROR: Traceback (most recent call last): >>>> File "apache_beam/runners/common.py", line 1232, in >>>> apache_beam.runners.common.DoFnRunner.process >>>> File "apache_beam/runners/common.py", line 572, in >>>> apache_beam.runners.common.SimpleInvoker.invoke_process >>>> File "/tmp/tmpq_8l154y/wordcount_test.py", line 75, in tokenize >>>> ImportError: __import__ not found >>>> >>>> I was able to find an example in the streaming wordcount snippet that >>>> did something similar, but very strange [1]: >>>> | 'ExtractWords' >> >>>> beam.FlatMap(lambda x: __import__('re').findall(r'[A-Za-z\']+', >>>> x)) >>>> >>>> For whatever reason this actually fixed the issue in my job as well. I >>>> can't for the life of me understand why this works, or why the normal >>>> import fails. Someone else must have run into this same issue though for >>>> that streaming wordcount example to be like that. Any ideas what's going >>>> on here? >>>> >>>> [1] >>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/snippets.py#L692 >>>> >>>
