[jira] [Work logged] (BEAM-9029) Two bugs in Python SDK S3 filesystem support
[ https://issues.apache.org/jira/browse/BEAM-9029?focusedWorklogId=375263&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-375263 ]

ASF GitHub Bot logged work on BEAM-9029:

Author: ASF GitHub Bot
Created on: 21/Jan/20 22:48
Start Date: 21/Jan/20 22:48
Worklog Time Spent: 10m

Work Description: pabloem commented on pull request #10459: [BEAM-9029]Fix two bugs in Python SDK S3 filesystem support
URL: https://github.com/apache/beam/pull/10459

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
---

Worklog Id: (was: 375263)
Remaining Estimate: 21h (was: 21h 10m)
Time Spent: 3h (was: 2h 50m)

> Two bugs in Python SDK S3 filesystem support
>
> Key: BEAM-9029
> URL: https://issues.apache.org/jira/browse/BEAM-9029
> Project: Beam
> Issue Type: Bug
> Components: sdk-py-core
> Reporter: Wenhai Pan
> Assignee: Wenhai Pan
> Priority: Major
> Labels: pull-request-available
> Original Estimate: 24h
> Time Spent: 3h
> Remaining Estimate: 21h
>
> Hi :)
> There seem to be 2 bugs in the S3 filesystem support.
> I tried to use S3 storage for a simple wordcount demo with DirectRunner.
> The demo script:
> {code:python}
> def main():
>     options = PipelineOptions().view_as(StandardOptions)
>     options.runner = 'DirectRunner'
>     pipeline = beam.Pipeline(options=options)
>     (
>         pipeline
>         | ReadFromText("s3://mx-machine-learning/panwenhai/beam_test/test_data")
>         | "extract_words" >> beam.FlatMap(lambda x: re.findall(r"[A-Za-z\']+", x))
>         | beam.combiners.Count.PerElement()
>         | beam.MapTuple(lambda word, count: "%s: %s" % (word, count))
>         | WriteToText("s3://mx-machine-learning/panwenhai/beam_test/output")
>     )
>     result = pipeline.run()
>     result.wait_until_finish()
>     return
> {code}
>
> Error message 1:
> {noformat}
> apache_beam.io.filesystem.BeamIOError: Match operation failed with exceptions
> {'s3://mx-machine-learning/panwenhai/beam_test/output-*-of-1':
> BeamIOError("List operation failed with exceptions
> {'s3://mx-machine-learning/panwenhai/beam_test/output-': S3ClientError('Tried
> to list nonexistent S3 path:
> s3://mx-machine-learning/panwenhai/beam_test/output-', 404)}")} [while
> running 'WriteToText/Write/WriteImpl/PreFinalize'] with exceptions
> None{noformat}
>
> After digging into the code, it seems the Boto3 client's list function raises
> an exception when asked to list a nonexistent S3 path
> (beam/sdks/python/apache_beam/io/aws/clients/s3/boto3_client.py line 111), and
> the S3IO class does not handle this exception in its list_prefix function
> (beam/sdks/python/apache_beam/io/aws/s3io.py line 121).
> When the runner lists existing output files in order to delete them, and no
> output file exists yet, it lists a nonexistent S3 path and triggers the
> exception. That is harmless in this context, so I think we can safely ignore
> the exception in the S3IO list_prefix function.
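The ignore-the-404 behavior described for error 1 can be sketched outside of Beam. The `S3ClientError`, `FakeClient`, and `list_prefix` names below are hypothetical stand-ins for illustration (Beam's real code lives in s3io.py and boto3_client.py); the sketch only demonstrates the control flow: a 404 from the list call is treated as "no matching objects" rather than a failure.

```python
# Hedged sketch of the proposed list_prefix fix. S3ClientError and
# FakeClient are illustrative stand-ins, not Beam's actual classes.

class S3ClientError(Exception):
    def __init__(self, message, code):
        super().__init__(message)
        self.code = code


class FakeClient:
    """Simulates the Boto3-backed client: listing a missing path raises."""

    def list(self, prefix):
        raise S3ClientError(
            'Tried to list nonexistent S3 path: %s' % prefix, 404)


def list_prefix(client, prefix):
    """Return {object_name: size} for objects under prefix."""
    file_info = {}
    try:
        for item in client.list(prefix):
            file_info[item.key] = item.size
    except S3ClientError as e:
        # A 404 just means nothing exists under the prefix yet; any
        # other client error is still a real failure, so re-raise it.
        if e.code != 404:
            raise
    return file_info


print(list_prefix(FakeClient(), 's3://bucket/no-such-prefix/'))  # {}
```

With this shape, the pre-finalize match on a fresh output location simply sees an empty listing instead of a propagated `S3ClientError`.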
> Error message 2:
> {noformat}
> File
> "/Users/wenhai.pan/venvs/tfx/lib/python3.7/site-packages/apache_beam-2.19.0.dev0-py3.7.egg/apache_beam/io/aws/s3filesystem.py",
> line 272, in delete
>     exceptions = {path: error for (path, error) in results
> File
> "/Users/wenhai.pan/venvs/tfx/lib/python3.7/site-packages/apache_beam-2.19.0.dev0-py3.7.egg/apache_beam/io/aws/s3filesystem.py",
> line 272, in <dictcomp>
>     exceptions = {path: error for (path, error) in results
> ValueError: too many values to unpack (expected 2) [while running
> 'WriteToText/Write/WriteImpl/FinalizeWrite']{noformat}
>
> When the runner tries to delete the temporary output directory, it triggers
> this exception. The exception is caused by unpacking (path, error) directly
> from "results", which is a dict
> (beam/sdks/python/apache_beam/io/aws/s3filesystem.py line 272). I think we
> should iterate over results.items() here.
>
> I have submitted a patch for these 2 bugs:
> https://github.com/apache/beam/pull/10459
>
> Thank you.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
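Error 2 is reproducible with plain Python: iterating a dict yields its keys, so unpacking each key (a path string) into `(path, error)` raises exactly this ValueError whenever the string is longer than two characters. The snippet below is a hypothetical, self-contained reproduction (the `results` value is made up; only the comprehension shape matches s3filesystem.py):

```python
# results maps each path to the error raised while deleting it, as in
# s3filesystem.py's delete(); the exact key and value are illustrative.
results = {'s3://bucket/beam_temp/file-00000': IOError('delete failed')}

# Buggy form: iterating the dict yields key strings, and unpacking a
# long string into two names raises ValueError.
try:
    exceptions = {path: error for (path, error) in results}
except ValueError as e:
    print(e)  # too many values to unpack (expected 2)

# Fixed form: .items() yields the (path, error) pairs the
# comprehension actually expects.
exceptions = {path: error for (path, error) in results.items()}
print(sorted(exceptions))  # ['s3://bucket/beam_temp/file-00000']
```

The bug only surfaces at runtime when `delete` hits its error-collecting path, which is why it appears during FinalizeWrite rather than in normal reads.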
[jira] [Work logged] (BEAM-9029) Two bugs in Python SDK S3 filesystem support
[ https://issues.apache.org/jira/browse/BEAM-9029?focusedWorklogId=375262&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-375262 ]

ASF GitHub Bot logged work on BEAM-9029:

Author: ASF GitHub Bot
Created on: 21/Jan/20 22:47
Start Date: 21/Jan/20 22:47
Worklog Time Spent: 10m

Work Description: pabloem commented on issue #10459: [BEAM-9029]Fix two bugs in Python SDK S3 filesystem support
URL: https://github.com/apache/beam/pull/10459#issuecomment-576922546

LGTM. Thanks!

Issue Time Tracking
---

Worklog Id: (was: 375262)
Remaining Estimate: 21h 10m (was: 21h 20m)
Time Spent: 2h 50m (was: 2h 40m)
[jira] [Work logged] (BEAM-9029) Two bugs in Python SDK S3 filesystem support
[ https://issues.apache.org/jira/browse/BEAM-9029?focusedWorklogId=374229&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-374229 ]

ASF GitHub Bot logged work on BEAM-9029:

Author: ASF GitHub Bot
Created on: 19/Jan/20 08:24
Start Date: 19/Jan/20 08:24
Worklog Time Spent: 10m

Work Description: icemoon1987 commented on issue #10459: [BEAM-9029]Fix two bugs in Python SDK S3 filesystem support
URL: https://github.com/apache/beam/pull/10459#issuecomment-575979255

Run PythonLint PreCommit

Issue Time Tracking
---

Worklog Id: (was: 374229)
Remaining Estimate: 21h 20m (was: 21.5h)
Time Spent: 2h 40m (was: 2.5h)
[jira] [Work logged] (BEAM-9029) Two bugs in Python SDK S3 filesystem support
[ https://issues.apache.org/jira/browse/BEAM-9029?focusedWorklogId=374228&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-374228 ]

ASF GitHub Bot logged work on BEAM-9029:

Author: ASF GitHub Bot
Created on: 19/Jan/20 08:24
Start Date: 19/Jan/20 08:24
Worklog Time Spent: 10m

Work Description: icemoon1987 commented on issue #10459: [BEAM-9029]Fix two bugs in Python SDK S3 filesystem support
URL: https://github.com/apache/beam/pull/10459#issuecomment-575979242

Run Python PreCommit

Issue Time Tracking
---

Worklog Id: (was: 374228)
Remaining Estimate: 21.5h (was: 21h 40m)
Time Spent: 2.5h (was: 2h 20m)
[jira] [Work logged] (BEAM-9029) Two bugs in Python SDK S3 filesystem support
[ https://issues.apache.org/jira/browse/BEAM-9029?focusedWorklogId=374184&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-374184 ]

ASF GitHub Bot logged work on BEAM-9029:

Author: ASF GitHub Bot
Created on: 19/Jan/20 03:37
Start Date: 19/Jan/20 03:37
Worklog Time Spent: 10m

Work Description: icemoon1987 commented on issue #10459: [BEAM-9029]Fix two bugs in Python SDK S3 filesystem support
URL: https://github.com/apache/beam/pull/10459#issuecomment-575963489

@pabloem Hi, pabloem :) I have changed the code and tried to rerun the lint and precommit tests, but it seems I failed to trigger them and cannot see the results now. Would you please help rerun the tests? Thank you so much.

Issue Time Tracking
---

Worklog Id: (was: 374184)
Remaining Estimate: 21h 40m (was: 21h 50m)
Time Spent: 2h 20m (was: 2h 10m)
[jira] [Work logged] (BEAM-9029) Two bugs in Python SDK S3 filesystem support
[ https://issues.apache.org/jira/browse/BEAM-9029?focusedWorklogId=372107&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-372107 ]

ASF GitHub Bot logged work on BEAM-9029:

Author: ASF GitHub Bot
Created on: 15/Jan/20 03:20
Start Date: 15/Jan/20 03:20
Worklog Time Spent: 10m

Work Description: icemoon1987 commented on issue #10459: [BEAM-9029]Fix two bugs in Python SDK S3 filesystem support
URL: https://github.com/apache/beam/pull/10459#issuecomment-574477470

Run PythonLint PreCommit

Issue Time Tracking
---

Worklog Id: (was: 372107)
Remaining Estimate: 21h 50m (was: 22h)
Time Spent: 2h 10m (was: 2h)
[jira] [Work logged] (BEAM-9029) Two bugs in Python SDK S3 filesystem support
[ https://issues.apache.org/jira/browse/BEAM-9029?focusedWorklogId=372106&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-372106 ]

ASF GitHub Bot logged work on BEAM-9029:

Author: ASF GitHub Bot
Created on: 15/Jan/20 03:20
Start Date: 15/Jan/20 03:20
Worklog Time Spent: 10m

Work Description: icemoon1987 commented on issue #10459: [BEAM-9029]Fix two bugs in Python SDK S3 filesystem support
URL: https://github.com/apache/beam/pull/10459#issuecomment-574477413

Run Python PreCommit

Issue Time Tracking
---

Worklog Id: (was: 372106)
Remaining Estimate: 22h (was: 22h 10m)
Time Spent: 2h (was: 1h 50m)
[jira] [Work logged] (BEAM-9029) Two bugs in Python SDK S3 filesystem support
[ https://issues.apache.org/jira/browse/BEAM-9029?focusedWorklogId=371753&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-371753 ]

ASF GitHub Bot logged work on BEAM-9029:

Author: ASF GitHub Bot
Created on: 14/Jan/20 18:12
Start Date: 14/Jan/20 18:12
Worklog Time Spent: 10m

Work Description: pabloem commented on pull request #10459: [BEAM-9029]Fix two bugs in Python SDK S3 filesystem support
URL: https://github.com/apache/beam/pull/10459#discussion_r366493566

## File path: sdks/python/apache_beam/io/aws/s3io.py

@@ -118,7 +118,18 @@ def list_prefix(self, path):
     logging.info("Starting the size estimation of the input")
     while True:
-      response = self.client.list(request)
+
+      # The list operation will raise an exception when trying to list a
+      # nonexistent S3 path. This should not be an issue here.
+      # Ignore this exception or it will break the procedure.
+      try:
+        response = self.client.list(request)
+      except messages.S3ClientError as e:
+        if e.code == 404:
+          break

Review comment: You are correct. Thanks for pointing that out. FYI, it seems that there are a few lint errors and a few errors in the precommit tests.

Issue Time Tracking
---

Worklog Id: (was: 371753)
Remaining Estimate: 22h 10m (was: 22h 20m)
Time Spent: 1h 50m (was: 1h 40m)
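The loop in the review diff can be exercised end to end with a stub client. Everything below is a hypothetical stand-in (`S3ClientError`, `StubClient`, and `list_all` are invented names, and the pagination handling of the real `list_prefix` is elided); the point is the behavior under discussion: a 404 ends the listing loop quietly, while any other error code still propagates.

```python
class S3ClientError(Exception):
    """Stand-in for Beam's messages.S3ClientError."""

    def __init__(self, message, code):
        super().__init__(message)
        self.code = code


class StubClient:
    """Always fails its list call with the given error code."""

    def __init__(self, code):
        self.code = code

    def list(self, request):
        raise S3ClientError('list failed', self.code)


def list_all(client, request):
    names = []
    while True:
        # Mirror the diff: a 404 means the path has no objects, which
        # is fine for this caller, so stop listing instead of failing.
        try:
            response = client.list(request)
        except S3ClientError as e:
            if e.code == 404:
                break
            raise
        names.extend(response)
        break  # single page; the real loop follows pagination tokens here
    return names


print(list_all(StubClient(404), request=None))  # []
```

Wrapping only the `client.list` call keeps the guard narrow: errors raised while processing a successful response are still surfaced normally.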
> I tried to use S3 storage for a simple wordcount demo with DirectRunner.
> The demo script:
> {code:java}
> def main():
>     options = PipelineOptions().view_as(StandardOptions)
>     options.runner = 'DirectRunner'
>     pipeline = beam.Pipeline(options=options)
>     (
>         pipeline
>         | ReadFromText("s3://mx-machine-learning/panwenhai/beam_test/test_data")
>         | "extract_words" >> beam.FlatMap(lambda x: re.findall(r"[A-Za-z\']+", x))
>         | beam.combiners.Count.PerElement()
>         | beam.MapTuple(lambda word, count: "%s: %s" % (word, count))
>         | WriteToText("s3://mx-machine-learning/panwenhai/beam_test/output")
>     )
>     result = pipeline.run()
>     result.wait_until_finish()
>     return
> {code}
>
> Error message 1:
> {noformat}
> apache_beam.io.filesystem.BeamIOError: Match operation failed with exceptions
> {'s3://mx-machine-learning/panwenhai/beam_test/output-*-of-1':
> BeamIOError("List operation failed with exceptions
> {'s3://mx-machine-learning/panwenhai/beam_test/output-': S3ClientError('Tried
> to list nonexistent S3 path:
> s3://mx-machine-learning/panwenhai/beam_test/output-', 404)}")} [while
> running 'WriteToText/Write/WriteImpl/PreFinalize'] with exceptions
> None
> {noformat}
>
> After digging into the code, it seems the Boto3 client's list function will
> raise an exception when trying to list a nonexistent S3 path
> (beam/sdks/python/apache_beam/io/aws/clients/s3/boto3_client.py line 111), and
> the S3IO class does not handle this exception in its list_prefix function
> (beam/sdks/python/apache_beam/io/aws/s3io.py line 121).
> When the runner tries to list and delete any existing output files, and there
> are none, it will try to list a nonexistent S3 path and will trigger the
> exception.
> This should not be an issue here. I think we can safely ignore this exception
> in the S3IO list_prefix function.
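The fix the reporter proposes can be sketched outside Beam: wrap the list call, swallow only the 404 "nonexistent path" error, and let every other error propagate. The client object, the response fields (`items`, `next_token`), and the request's `continuation_token` below are illustrative stand-ins modeled loosely on the Beam S3 client, not its real API.

```python
class S3ClientError(Exception):
    """Stand-in for apache_beam.io.aws.clients.s3.messages.S3ClientError."""
    def __init__(self, message, code):
        super().__init__(message)
        self.code = code


def list_prefix(client, request):
    """Collect object sizes under a prefix; a 404 means 'nothing to list'."""
    file_sizes = {}
    while True:
        try:
            response = client.list(request)
        except S3ClientError as e:
            if e.code == 404:
                break  # Nonexistent prefix: return the (empty) results.
            raise  # Any other S3 error is still fatal.
        for item in response.items:
            file_sizes[item.key] = item.size
        if response.next_token is None:
            break  # Last page reached.
        request.continuation_token = response.next_token
    return file_sizes
```

With this shape, a missing prefix yields an empty dict instead of an error, which is what the match-then-delete step in WriteToText needs.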
> Error Message 2:
> {noformat}
> File
> "/Users/wenhai.pan/venvs/tfx/lib/python3.7/site-packages/apache_beam-2.19.0.dev0-py3.7.egg/apache_beam/io/aws/s3filesystem.py",
> line 272, in delete
> exceptions = {path: error for (path, error) in results
> File
> "/Users/wenhai.pan/venvs/tfx/lib/python3.7/site-packages/apache_beam-2.19.0.dev0-py3.7.egg/apache_beam/io/aws/s3filesystem.py",
> line 272, in <dictcomp>
> exceptions = {path: error for (path, error) in results
> ValueError: too many values to unpack (expected 2) [while running
> 'WriteToText/Write/WriteImpl/FinalizeWrite']
> {noformat}
>
> When the runner tries to delete the temporary output directory, it will
> trigger this exception. This exception is caused by parsing (path, error)
> directly from "results", which is a dict
> (beam/sdks/python/apache_beam/io/aws/s3filesystem.py line 272). I think we
> should use results.items() here.
> I have submitted a patch for these 2 bugs:
> https://github.com/apache/beam/pull/10459
>
> Thank you.
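The second bug reproduces with a plain dict, independent of Beam: iterating a dict yields only its keys, so unpacking each element into (path, error) fails for any key longer than two characters, while .items() yields the expected pairs. A minimal sketch (the paths and error values are made up):

```python
# Hypothetical delete results keyed by S3 path, shaped like the dict that
# s3filesystem.delete builds.
results = {
    "s3://bucket/tmp/output-0000": None,
    "s3://bucket/tmp/output-0001": IOError("access denied"),
}

# Buggy form: iterating the dict yields its keys (long strings), and
# unpacking a long string into two names raises ValueError.
try:
    exceptions = {path: error for (path, error) in results}
except ValueError as e:
    print(e)  # too many values to unpack (expected 2)

# Fixed form: .items() yields (path, error) tuples.
exceptions = {path: error for (path, error) in results.items()}
print(len(exceptions))  # 2
```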
[ https://issues.apache.org/jira/browse/BEAM-9029?focusedWorklogId=371343=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-371343 ]

ASF GitHub Bot logged work on BEAM-9029: Author: ASF GitHub Bot Created on: 14/Jan/20 04:16 Start Date: 14/Jan/20 04:16 Worklog Time Spent: 10m

Work Description: icemoon1987 commented on pull request #10459: [BEAM-9029]Fix two bugs in Python SDK S3 filesystem support URL: https://github.com/apache/beam/pull/10459#discussion_r366145805
## File path: sdks/python/apache_beam/io/aws/s3io.py ##
Review comment: Hi pabloem, thank you for reviewing :) This "break" exits the "while" loop that starts at line 120. The "response.items" access on line 133 is also inside that loop, so it will not run and will not raise an exception. The variable "response" is only used inside that loop; the "break" stops the loop, and list_prefix returns the variable "file_sizes" (line 147). Since "file_sizes" is initialized as an empty dict on line 114, list_prefix returns an empty dict. It should be safe.
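The control flow described in the comment above — file_sizes starts as an empty dict, and a break taken before the first response simply returns it unchanged — can be checked in miniature. This is a schematic of the loop shape only, not Beam's actual code:

```python
def list_prefix_skeleton(pages):
    # Mirrors the shape of S3IO.list_prefix: file_sizes starts empty (as on
    # line 114 of s3io.py), and breaking out of the loop before any page is
    # consumed returns that empty dict.
    file_sizes = {}
    pages_iter = iter(pages)
    while True:
        page = next(pages_iter, None)
        if page is None:  # stands in for the 404 (or no-more-pages) break
            break
        file_sizes.update(page)
    return file_sizes

print(list_prefix_skeleton([]))          # {}
print(list_prefix_skeleton([{"a": 1}]))  # {'a': 1}
```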
Issue Time Tracking --- Worklog Id: (was: 371343) Remaining Estimate: 22h 20m (was: 22.5h) Time Spent: 1h 40m (was: 1.5h)
[ https://issues.apache.org/jira/browse/BEAM-9029?focusedWorklogId=371342=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-371342 ]

ASF GitHub Bot logged work on BEAM-9029: Author: ASF GitHub Bot Created on: 14/Jan/20 04:15 Start Date: 14/Jan/20 04:15 Worklog Time Spent: 10m

Work Description: icemoon1987 commented on pull request #10459: [BEAM-9029]Fix two bugs in Python SDK S3 filesystem support URL: https://github.com/apache/beam/pull/10459#discussion_r366145805
## File path: sdks/python/apache_beam/io/aws/s3io.py ##
Review comment: Hi pabloem, thank you for reviewing :) This "break" exits the "while" loop that starts at line 120. The "response.items" access on line 133 is also inside that loop, so it will not raise an exception. The variable "response" is only used inside that loop. I used "break" directly here because I saw the "break" operation on line 142; they should have a similar effect. The "break" stops the loop, and list_prefix returns the variable "file_sizes" (line 147). Since "file_sizes" is initialized as an empty dict on line 114, list_prefix returns an empty dict. It should be safe.
Issue Time Tracking --- Worklog Id: (was: 371342) Remaining Estimate: 22.5h (was: 22h 40m) Time Spent: 1.5h (was: 1h 20m)
[ https://issues.apache.org/jira/browse/BEAM-9029?focusedWorklogId=371190=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-371190 ]

ASF GitHub Bot logged work on BEAM-9029: Author: ASF GitHub Bot Created on: 13/Jan/20 23:23 Start Date: 13/Jan/20 23:23 Worklog Time Spent: 10m

Work Description: pabloem commented on issue #10459: [BEAM-9029]Fix two bugs in Python SDK S3 filesystem support URL: https://github.com/apache/beam/pull/10459#issuecomment-573921702
Here are the lint errors: https://scans.gradle.com/s/rlvpwlynymekm/console-log?task=:sdks:python:test-suites:tox:py37:lintPy37
Here are the errors in the precommit tests: https://builds.apache.org/job/beam_PreCommit_Python_Phrase/1367/#showFailuresLink

Issue Time Tracking --- Worklog Id: (was: 371190) Remaining Estimate: 22h 40m (was: 22h 50m) Time Spent: 1h 20m (was: 1h 10m)
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[ https://issues.apache.org/jira/browse/BEAM-9029?focusedWorklogId=371163=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-371163 ]

ASF GitHub Bot logged work on BEAM-9029: Author: ASF GitHub Bot Created on: 13/Jan/20 22:29 Start Date: 13/Jan/20 22:29 Worklog Time Spent: 10m

Work Description: pabloem commented on issue #10459: [BEAM-9029]Fix two bugs in Python SDK S3 filesystem support URL: https://github.com/apache/beam/pull/10459#issuecomment-573903261
Run PythonLint PreCommit

Issue Time Tracking --- Worklog Id: (was: 371163) Remaining Estimate: 22h 50m (was: 23h) Time Spent: 1h 10m (was: 1h)
[ https://issues.apache.org/jira/browse/BEAM-9029?focusedWorklogId=371162=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-371162 ]

ASF GitHub Bot logged work on BEAM-9029: Author: ASF GitHub Bot Created on: 13/Jan/20 22:29 Start Date: 13/Jan/20 22:29 Worklog Time Spent: 10m

Work Description: pabloem commented on issue #10459: [BEAM-9029]Fix two bugs in Python SDK S3 filesystem support URL: https://github.com/apache/beam/pull/10459#issuecomment-573903206
Run Python PreCommit

Issue Time Tracking --- Worklog Id: (was: 371162) Remaining Estimate: 23h (was: 23h 10m) Time Spent: 1h (was: 50m)
[ https://issues.apache.org/jira/browse/BEAM-9029?focusedWorklogId=371149=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-371149 ]

ASF GitHub Bot logged work on BEAM-9029: Author: ASF GitHub Bot Created on: 13/Jan/20 22:16 Start Date: 13/Jan/20 22:16 Worklog Time Spent: 10m

Work Description: pabloem commented on pull request #10459: [BEAM-9029]Fix two bugs in Python SDK S3 filesystem support URL: https://github.com/apache/beam/pull/10459#discussion_r366058089
## File path: sdks/python/apache_beam/io/aws/s3io.py ##
Review comment: @icemoon1987 fyi

Issue Time Tracking --- Worklog Id: (was: 371149) Remaining Estimate: 23h 10m (was: 23h 20m) Time Spent: 50m (was: 40m)
[ https://issues.apache.org/jira/browse/BEAM-9029?focusedWorklogId=370144=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-370144 ]

ASF GitHub Bot logged work on BEAM-9029: Author: ASF GitHub Bot Created on: 10/Jan/20 23:39 Start Date: 10/Jan/20 23:39 Worklog Time Spent: 10m

Work Description: pabloem commented on pull request #10459: [BEAM-9029]Fix two bugs in Python SDK S3 filesystem support URL: https://github.com/apache/beam/pull/10459#discussion_r365472091
## File path: sdks/python/apache_beam/io/aws/s3io.py ##
Review comment: Should the response be some kind of empty response? If we just break out here, `response` will be `None`, and `response.items` will raise an exception, right?

Issue Time Tracking --- Worklog Id: (was: 370144) Remaining Estimate: 23h 20m (was: 23.5h) Time Spent: 40m (was: 0.5h)
> Error Message 2: > {noformat} > File > "/Users/wenhai.pan/venvs/tfx/lib/python3.7/site-packages/apache_beam-2.19.0.dev0-py3.7.egg/apache_beam/io/aws/s3filesystem.py", > line 272, in delete > exceptions = {path: error for (path, error) in results > File > "/Users/wenhai.pan/venvs/tfx/lib/python3.7/site-packages/apache_beam-2.19.0.dev0-py3.7.egg/apache_beam/io/aws/s3filesystem.py", > line 272, in > exceptions = {path: error for (path, error) in results > ValueError: too many values to unpack (expected 2) [while running > 'WriteToText/Write/WriteImpl/FinalizeWrite']{noformat} > > When the runner tries to delete the temporary output directory, it will > trigger
[jira] [Work logged] (BEAM-9029) Two bugs in Python SDK S3 filesystem support
[ https://issues.apache.org/jira/browse/BEAM-9029?focusedWorklogId=369228&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-369228 ]

ASF GitHub Bot logged work on BEAM-9029:
Author: ASF GitHub Bot
Created on: 09/Jan/20 18:23
Start Date: 09/Jan/20 18:23
Worklog Time Spent: 10m

Work Description: pabloem commented on issue #10459: [BEAM-9029] Fix two bugs in Python SDK S3 filesystem support
URL: https://github.com/apache/beam/pull/10459#issuecomment-572688890

I'll take a look...

Issue Time Tracking
---
Worklog Id: (was: 369228)
Remaining Estimate: 23.5h (was: 23h 40m)
Time Spent: 0.5h (was: 20m)
> Error Message 2:
> {noformat}
> File "/Users/wenhai.pan/venvs/tfx/lib/python3.7/site-packages/apache_beam-2.19.0.dev0-py3.7.egg/apache_beam/io/aws/s3filesystem.py", line 272, in delete
>   exceptions = {path: error for (path, error) in results
> File "/Users/wenhai.pan/venvs/tfx/lib/python3.7/site-packages/apache_beam-2.19.0.dev0-py3.7.egg/apache_beam/io/aws/s3filesystem.py", line 272, in <dictcomp>
>   exceptions = {path: error for (path, error) in results
> ValueError: too many values to unpack (expected 2) [while running 'WriteToText/Write/WriteImpl/FinalizeWrite']{noformat}
>
> When the runner tries to delete the temporary output directory, it triggers this exception. It is caused by unpacking (path, error) tuples directly from "results", which is a dict, so iterating it yields only its keys (beam/sdks/python/apache_beam/io/aws/s3filesystem.py line 272). I think we should use results.items() here.
> I have submitted a patch for these 2 bugs: https://github.com/apache/beam/pull/10459
>
> Thank you.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
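The ValueError above is easy to reproduce in isolation (the bucket path below is made up for illustration): iterating a dict yields its keys, so the 2-tuple unpack tries to split an S3 path string character by character, while `results.items()` yields the intended (path, error) pairs.

```python
# "results" maps each S3 path to an error (None on success), as in
# s3filesystem.py's delete(). The path here is a made-up example.
results = {'s3://bucket/tmp/output-00000': None}

# Buggy version: iterating a dict yields keys only, so Python tries to
# unpack the 28-character path string into a 2-tuple and fails.
try:
    exceptions = {path: error for (path, error) in results}
except ValueError as e:
    print(e)  # too many values to unpack (expected 2)

# Fixed version: .items() yields the (path, error) pairs we wanted.
exceptions = {path: error for (path, error) in results.items()}
print(exceptions)  # {'s3://bucket/tmp/output-00000': None}
```

This matches the one-line fix in the patch: replace `results` with `results.items()` in the dict comprehension.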
[jira] [Work logged] (BEAM-9029) Two bugs in Python SDK S3 filesystem support
[ https://issues.apache.org/jira/browse/BEAM-9029?focusedWorklogId=362902&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-362902 ]

ASF GitHub Bot logged work on BEAM-9029:
Author: ASF GitHub Bot
Created on: 24/Dec/19 06:25
Start Date: 24/Dec/19 06:25
Worklog Time Spent: 10m

Work Description: icemoon1987 commented on issue #10459: [BEAM-9029] Fix two bugs in Python SDK S3 filesystem support
URL: https://github.com/apache/beam/pull/10459#issuecomment-568667881

R: @pabloem @robertwb @aaltay @charlesccychen

Issue Time Tracking
---
Worklog Id: (was: 362902)
Remaining Estimate: 23h 40m (was: 23h 50m)
Time Spent: 20m (was: 10m)
[jira] [Work logged] (BEAM-9029) Two bugs in Python SDK S3 filesystem support
[ https://issues.apache.org/jira/browse/BEAM-9029?focusedWorklogId=362879&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-362879 ]

ASF GitHub Bot logged work on BEAM-9029:
Author: ASF GitHub Bot
Created on: 24/Dec/19 04:30
Start Date: 24/Dec/19 04:30
Worklog Time Spent: 10m

Work Description: icemoon1987 commented on pull request #10459: [BEAM-9029] Fix two bugs in Python SDK S3 filesystem support
URL: https://github.com/apache/beam/pull/10459

Trying to fix the bugs on JIRA: https://issues.apache.org/jira/browse/BEAM-9029
1. Ignore the exception raised when trying to list a nonexistent S3 path.
2. Fix the tuple-unpacking issue when deleting the temporary output directory.