Yeah, I can't imagine this is a "normal" problem.

I'm on linux w/ py 3.7.  My script does have a __name__ == '__main__' block.

On Wed, Dec 8, 2021 at 12:38 AM Ning Kang <ni...@google.com> wrote:

> I tried a pipeline:
>
> p = beam.Pipeline(DataflowRunner(), options=options)
> text = p | beam.Create(['Hello World, Hello You'])
>
>
> def tokenize(x):
>     import re
>     return re.findall('Hello', x)
>
>
> flatten = text | 'Split' >> (beam.FlatMap(tokenize).with_output_types(str))
> pipeline_result = p.run()
>
>
> Didn't run into the issue.
>
> What OS and Python version are you using? Does your script come with a `if
> __name__ == '__main__': `?
>
> On Tue, Dec 7, 2021 at 6:58 PM Steve Niemitz <sniem...@apache.org> wrote:
>
>> I have a fairly simple python word count job (although the packaging is a
>> little more complicated) that I'm trying to run.  (note: I'm explicitly NOT
>> using save_main_session.)
>>
>> In it is a method to tokenize the incoming text to words, and I used
>> something similar to how the wordcount example worked.
>>
>> def tokenize(row):
>>   import re
>>   return re.findall(r'[A-Za-z\']+', row.text)
>>
>> which is then used as the function for a FlatMap:
>> | 'Split' >> (
>>         beam.FlatMap(tokenize).with_output_types(str))
>>
>> However, if I run this job on dataflow (2.33), the python runner fails
>> with a bizarre error:
>> INFO:apache_beam.runners.dataflow.dataflow_runner:2021-12-07T20:59:59.704Z:
>> JOB_MESSAGE_ERROR: Traceback (most recent call last):
>>   File "apache_beam/runners/common.py", line 1232, in
>> apache_beam.runners.common.DoFnRunner.process
>>   File "apache_beam/runners/common.py", line 572, in
>> apache_beam.runners.common.SimpleInvoker.invoke_process
>>   File "/tmp/tmpq_8l154y/wordcount_test.py", line 75, in tokenize
>> ImportError: __import__ not found
>>
>> I was able to find an example in the streaming wordcount snippet that did
>> something similar, but very strange [1]:
>>         | 'ExtractWords' >>
>>         beam.FlatMap(lambda x: __import__('re').findall(r'[A-Za-z\']+',
>> x))
>>
>> For whatever reason this actually fixed the issue in my job as well.  I
>> can't for the life of me understand why this works, or why the normal
>> import fails.  Someone else must have run into this same issue though for
>> that streaming wordcount example to be like that.  Any ideas what's going
>> on here?
>>
>> [1]
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/snippets.py#L692
>>
>

Reply via email to