Re: ImportError: __import__ not found on python job

2021-12-08 Thread Steve Niemitz
hm, interesting, are we using pybind anywhere?  I didn't see any references
to it.  I can give it a try on python 3.8 too though.

On Wed, Dec 8, 2021 at 9:19 AM Brian Hulette  wrote:

> A google search for "__import__ not found" turned up an issue filed with
> pybind [1]. I can't deduce a root cause from the discussion there, but it
> looks like they didn't experience the issue in Python 3.8 - it could be
> interesting to see if your problem goes away there.
>
> It looks like +Charles Chen added the __import__('re') workaround in [2];
> maybe he remembers what was going on?
>
> [1] https://github.com/pybind/pybind11/issues/2557
> [2] https://github.com/apache/beam/pull/5071
>
> On Wed, Dec 8, 2021 at 5:30 AM Steve Niemitz  wrote:
>
>> Yeah, I can't imagine this is a "normal" problem.
>>
>> I'm on linux w/ py 3.7.  My script does have a __name__ == '__main__'
>> block.
>>
>> On Wed, Dec 8, 2021 at 12:38 AM Ning Kang  wrote:
>>
>>> I tried a pipeline:
>>>
>>> p = beam.Pipeline(DataflowRunner(), options=options)
>>> text = p | beam.Create(['Hello World, Hello You'])
>>>
>>>
>>> def tokenize(x):
>>>     import re
>>>     return re.findall('Hello', x)
>>>
>>>
>>> flatten = text | 'Split' >> (beam.FlatMap(tokenize).with_output_types(str))
>>> pipeline_result = p.run()
>>>
>>>
>>> Didn't run into the issue.
>>>
>>> What OS and Python version are you using? Does your script come with an
>>> `if __name__ == '__main__':` block?
>>>
>>> On Tue, Dec 7, 2021 at 6:58 PM Steve Niemitz 
>>> wrote:
>>>
 I have a fairly simple python word count job (although the packaging is
 a little more complicated) that I'm trying to run.  (note: I'm explicitly
 NOT using save_main_session.)

 It contains a method that tokenizes the incoming text into words, similar
 to how the wordcount example works.

 def tokenize(row):
   import re
   return re.findall(r'[A-Za-z\']+', row.text)

 which is then used as the function for a FlatMap:
 | 'Split' >> (beam.FlatMap(tokenize).with_output_types(str))
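
 FlatMap applies the function to every element and flattens each returned
 iterable into the output collection. A plain-Python sketch of those
 semantics (illustration only, no Beam involved):

```python
import re

def tokenize(row):
    # same tokenizer as above, minus the row.text attribute access
    return re.findall(r"[A-Za-z']+", row)

# FlatMap ~ map + flatten: one output element per regex match, across all rows
lines = ["Hello World,", "Hello You"]
words = [w for line in lines for w in tokenize(line)]
print(words)  # ['Hello', 'World', 'Hello', 'You']
```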

 However, if I run this job on dataflow (2.33), the python runner fails
 with a bizarre error:
 INFO:apache_beam.runners.dataflow.dataflow_runner:2021-12-07T20:59:59.704Z: JOB_MESSAGE_ERROR: Traceback (most recent call last):
   File "apache_beam/runners/common.py", line 1232, in apache_beam.runners.common.DoFnRunner.process
   File "apache_beam/runners/common.py", line 572, in apache_beam.runners.common.SimpleInvoker.invoke_process
   File "/tmp/tmpq_8l154y/wordcount_test.py", line 75, in tokenize
 ImportError: __import__ not found

 I was able to find an example in the streaming wordcount snippet that
 did something similar, but very strange [1]:
 | 'ExtractWords' >> beam.FlatMap(lambda x: __import__('re').findall(r'[A-Za-z\']+', x))
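
 Under normal conditions the two spellings are interchangeable: an `import`
 statement is just syntax over the builtin `__import__` function, and both
 bind the same module object. A quick sketch of the equivalence
 (illustrative, not from the snippet):

```python
def tokenize_import(x):
    import re
    return re.findall(r"[A-Za-z']+", x)

def tokenize_dunder(x):
    # __import__('re') returns the same module that `import re` would bind
    return __import__('re').findall(r"[A-Za-z']+", x)

print(tokenize_import("Hello World"))  # ['Hello', 'World']
print(tokenize_dunder("Hello World"))  # ['Hello', 'World']
```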

 For whatever reason this actually fixed the issue in my job as well.  I
 can't for the life of me understand why this works, or why the normal
 import fails.  Someone else must have run into this same issue though for
 that streaming wordcount example to be like that.  Any ideas what's going
 on here?

 [1]
 https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/snippets.py#L692
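
 For what it's worth, the exact message can be reproduced in plain CPython:
 an `import` statement executes by looking up the builtin `__import__`
 through the function's namespace, so a function reconstructed over a
 globals dict whose `__builtins__` is empty fails the same way. A
 hypothetical sketch of that mechanism (my own illustration, not taken from
 the job):

```python
import types

def tokenize(row):
    import re  # executed at call time via the builtin __import__
    return re.findall(r"[A-Za-z']+", row)

# Rebuild the function over a globals dict with empty builtins, roughly
# simulating a deserialized function that lost its original namespace.
broken = types.FunctionType(tokenize.__code__, {'__builtins__': {}})
try:
    broken('Hello')
except ImportError as e:
    print(e)  # __import__ not found
```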

>>>

