[jira] [Work logged] (BEAM-9029) Two bugs in Python SDK S3 filesystem support

2020-01-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9029?focusedWorklogId=375263=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-375263
 ]

ASF GitHub Bot logged work on BEAM-9029:


Author: ASF GitHub Bot
Created on: 21/Jan/20 22:48
Start Date: 21/Jan/20 22:48
Worklog Time Spent: 10m 
  Work Description: pabloem commented on pull request #10459: 
[BEAM-9029]Fix two bugs in Python SDK S3 filesystem support
URL: https://github.com/apache/beam/pull/10459
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 375263)
Remaining Estimate: 21h  (was: 21h 10m)
Time Spent: 3h  (was: 2h 50m)

> Two bugs in Python SDK S3 filesystem support
> 
>
> Key: BEAM-9029
> URL: https://issues.apache.org/jira/browse/BEAM-9029
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Wenhai Pan
>Assignee: Wenhai Pan
>Priority: Major
>  Labels: pull-request-available
>   Original Estimate: 24h
>  Time Spent: 3h
>  Remaining Estimate: 21h
>
> Hi :)
> There seem to be 2 bugs in the S3 filesystem support.
> I tried to use S3 storage for a simple wordcount demo with DirectRunner.
> The demo script:
> {code:java}
> def main():
> options = PipelineOptions().view_as(StandardOptions)
>  options.runner = 'DirectRunner'
> pipeline = beam.Pipeline(options = options)
> (
>  pipeline
>  | ReadFromText("s3://mx-machine-learning/panwenhai/beam_test/test_data")
>  | "extract_words" >> beam.FlatMap(lambda x: re.findall(r" [A-Za-z\']+", x))
>  | beam.combiners.Count.PerElement()
>  | beam.MapTuple(lambda word, count: "%s: %s" % (word, count))
>  | WriteToText("s3://mx-machine-learning/panwenhai/beam_test/output")
>  )
> result = pipeline.run()
>  result.wait_until_finish()
> return
> {code}
>  
> Error message 1:
> {noformat}
> apache_beam.io.filesystem.BeamIOError: Match operation failed with exceptions 
> {'s3://mx-machine-learning/panwenhai/beam_test/output-*-of-1': 
> BeamIOError("List operation failed with exceptions 
> {'s3://mx-machine-learning/panwenhai/beam_test/output-': S3ClientError('Tried 
> to list nonexistent S3 path: 
> s3://mx-machine-learning/panwenhai/beam_test/output-', 404)}")} [while 
> running 'WriteToText/Write/WriteImpl/PreFinalize'] with exceptions 
> None{noformat}
>  
> After digging into the code, it seems the Boto3 client's list function will 
> raise an exception when trying to list a nonexistent S3 path 
> (beam/sdks/pythonapache_beam/io/aws/clients/s3/boto3_client.py line 111). And 
> the S3IO class does not handle this exception in list_prefix function 
> (beam/sdks/python/apache_beam/io/aws/s3io.py line 121).
> When the runner tries to list and delete the existing output file, if there 
> is no existing output file, it will try to list a nonexistent S3 path and 
> will trigger the exception.
> This should not be an issue here. I think we can ignore this exception safely 
> in the S3IO list_prefix function.
> Error Message 2:
> {noformat}
> File 
> "/Users/wenhai.pan/venvs/tfx/lib/python3.7/site-packages/apache_beam-2.19.0.dev0-py3.7.egg/apache_beam/io/aws/s3filesystem.py",
>  line 272, in delete
> exceptions = {path: error for (path, error) in results
> File 
> "/Users/wenhai.pan/venvs/tfx/lib/python3.7/site-packages/apache_beam-2.19.0.dev0-py3.7.egg/apache_beam/io/aws/s3filesystem.py",
>  line 272, in 
> exceptions = {path: error for (path, error) in results
> ValueError: too many values to unpack (expected 2) [while running 
> 'WriteToText/Write/WriteImpl/FinalizeWrite']{noformat}
>  
> When the runner tries to delete the temporary output directory, it will 
> trigger this exception. This exception is caused by parsing (path, error) 
> directly from the "results" which is a dict 
> (beam/sdks/python/apache_beam/io/aws/s3filesystem.py line 272). I think we 
> should use results.items() here.
> I have submitted a patch for these 2 bugs: 
> https://github.com/apache/beam/pull/10459
>  
> Thank you.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-9029) Two bugs in Python SDK S3 filesystem support

2020-01-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9029?focusedWorklogId=375262=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-375262
 ]

ASF GitHub Bot logged work on BEAM-9029:


Author: ASF GitHub Bot
Created on: 21/Jan/20 22:47
Start Date: 21/Jan/20 22:47
Worklog Time Spent: 10m 
  Work Description: pabloem commented on issue #10459: [BEAM-9029]Fix two 
bugs in Python SDK S3 filesystem support
URL: https://github.com/apache/beam/pull/10459#issuecomment-576922546
 
 
   LGTM. Thanks!
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 375262)
Remaining Estimate: 21h 10m  (was: 21h 20m)
Time Spent: 2h 50m  (was: 2h 40m)

> Two bugs in Python SDK S3 filesystem support
> 
>
> Key: BEAM-9029
> URL: https://issues.apache.org/jira/browse/BEAM-9029
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Wenhai Pan
>Assignee: Wenhai Pan
>Priority: Major
>  Labels: pull-request-available
>   Original Estimate: 24h
>  Time Spent: 2h 50m
>  Remaining Estimate: 21h 10m
>
> Hi :)
> There seem to be 2 bugs in the S3 filesystem support.
> I tried to use S3 storage for a simple wordcount demo with DirectRunner.
> The demo script:
> {code:java}
> def main():
> options = PipelineOptions().view_as(StandardOptions)
>  options.runner = 'DirectRunner'
> pipeline = beam.Pipeline(options = options)
> (
>  pipeline
>  | ReadFromText("s3://mx-machine-learning/panwenhai/beam_test/test_data")
>  | "extract_words" >> beam.FlatMap(lambda x: re.findall(r" [A-Za-z\']+", x))
>  | beam.combiners.Count.PerElement()
>  | beam.MapTuple(lambda word, count: "%s: %s" % (word, count))
>  | WriteToText("s3://mx-machine-learning/panwenhai/beam_test/output")
>  )
> result = pipeline.run()
>  result.wait_until_finish()
> return
> {code}
>  
> Error message 1:
> {noformat}
> apache_beam.io.filesystem.BeamIOError: Match operation failed with exceptions 
> {'s3://mx-machine-learning/panwenhai/beam_test/output-*-of-1': 
> BeamIOError("List operation failed with exceptions 
> {'s3://mx-machine-learning/panwenhai/beam_test/output-': S3ClientError('Tried 
> to list nonexistent S3 path: 
> s3://mx-machine-learning/panwenhai/beam_test/output-', 404)}")} [while 
> running 'WriteToText/Write/WriteImpl/PreFinalize'] with exceptions 
> None{noformat}
>  
> After digging into the code, it seems the Boto3 client's list function will 
> raise an exception when trying to list a nonexistent S3 path 
> (beam/sdks/pythonapache_beam/io/aws/clients/s3/boto3_client.py line 111). And 
> the S3IO class does not handle this exception in list_prefix function 
> (beam/sdks/python/apache_beam/io/aws/s3io.py line 121).
> When the runner tries to list and delete the existing output file, if there 
> is no existing output file, it will try to list a nonexistent S3 path and 
> will trigger the exception.
> This should not be an issue here. I think we can ignore this exception safely 
> in the S3IO list_prefix function.
> Error Message 2:
> {noformat}
> File 
> "/Users/wenhai.pan/venvs/tfx/lib/python3.7/site-packages/apache_beam-2.19.0.dev0-py3.7.egg/apache_beam/io/aws/s3filesystem.py",
>  line 272, in delete
> exceptions = {path: error for (path, error) in results
> File 
> "/Users/wenhai.pan/venvs/tfx/lib/python3.7/site-packages/apache_beam-2.19.0.dev0-py3.7.egg/apache_beam/io/aws/s3filesystem.py",
>  line 272, in 
> exceptions = {path: error for (path, error) in results
> ValueError: too many values to unpack (expected 2) [while running 
> 'WriteToText/Write/WriteImpl/FinalizeWrite']{noformat}
>  
> When the runner tries to delete the temporary output directory, it will 
> trigger this exception. This exception is caused by parsing (path, error) 
> directly from the "results" which is a dict 
> (beam/sdks/python/apache_beam/io/aws/s3filesystem.py line 272). I think we 
> should use results.items() here.
> I have submitted a patch for these 2 bugs: 
> https://github.com/apache/beam/pull/10459
>  
> Thank you.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-9029) Two bugs in Python SDK S3 filesystem support

2020-01-19 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9029?focusedWorklogId=374229=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-374229
 ]

ASF GitHub Bot logged work on BEAM-9029:


Author: ASF GitHub Bot
Created on: 19/Jan/20 08:24
Start Date: 19/Jan/20 08:24
Worklog Time Spent: 10m 
  Work Description: icemoon1987 commented on issue #10459: [BEAM-9029]Fix 
two bugs in Python SDK S3 filesystem support
URL: https://github.com/apache/beam/pull/10459#issuecomment-575979255
 
 
   Run PythonLint PreCommit
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 374229)
Remaining Estimate: 21h 20m  (was: 21.5h)
Time Spent: 2h 40m  (was: 2.5h)

> Two bugs in Python SDK S3 filesystem support
> 
>
> Key: BEAM-9029
> URL: https://issues.apache.org/jira/browse/BEAM-9029
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Wenhai Pan
>Assignee: Wenhai Pan
>Priority: Major
>  Labels: pull-request-available
>   Original Estimate: 24h
>  Time Spent: 2h 40m
>  Remaining Estimate: 21h 20m
>
> Hi :)
> There seem to be 2 bugs in the S3 filesystem support.
> I tried to use S3 storage for a simple wordcount demo with DirectRunner.
> The demo script:
> {code:java}
> def main():
> options = PipelineOptions().view_as(StandardOptions)
>  options.runner = 'DirectRunner'
> pipeline = beam.Pipeline(options = options)
> (
>  pipeline
>  | ReadFromText("s3://mx-machine-learning/panwenhai/beam_test/test_data")
>  | "extract_words" >> beam.FlatMap(lambda x: re.findall(r" [A-Za-z\']+", x))
>  | beam.combiners.Count.PerElement()
>  | beam.MapTuple(lambda word, count: "%s: %s" % (word, count))
>  | WriteToText("s3://mx-machine-learning/panwenhai/beam_test/output")
>  )
> result = pipeline.run()
>  result.wait_until_finish()
> return
> {code}
>  
> Error message 1:
> {noformat}
> apache_beam.io.filesystem.BeamIOError: Match operation failed with exceptions 
> {'s3://mx-machine-learning/panwenhai/beam_test/output-*-of-1': 
> BeamIOError("List operation failed with exceptions 
> {'s3://mx-machine-learning/panwenhai/beam_test/output-': S3ClientError('Tried 
> to list nonexistent S3 path: 
> s3://mx-machine-learning/panwenhai/beam_test/output-', 404)}")} [while 
> running 'WriteToText/Write/WriteImpl/PreFinalize'] with exceptions 
> None{noformat}
>  
> After digging into the code, it seems the Boto3 client's list function will 
> raise an exception when trying to list a nonexistent S3 path 
> (beam/sdks/pythonapache_beam/io/aws/clients/s3/boto3_client.py line 111). And 
> the S3IO class does not handle this exception in list_prefix function 
> (beam/sdks/python/apache_beam/io/aws/s3io.py line 121).
> When the runner tries to list and delete the existing output file, if there 
> is no existing output file, it will try to list a nonexistent S3 path and 
> will trigger the exception.
> This should not be an issue here. I think we can ignore this exception safely 
> in the S3IO list_prefix function.
> Error Message 2:
> {noformat}
> File 
> "/Users/wenhai.pan/venvs/tfx/lib/python3.7/site-packages/apache_beam-2.19.0.dev0-py3.7.egg/apache_beam/io/aws/s3filesystem.py",
>  line 272, in delete
> exceptions = {path: error for (path, error) in results
> File 
> "/Users/wenhai.pan/venvs/tfx/lib/python3.7/site-packages/apache_beam-2.19.0.dev0-py3.7.egg/apache_beam/io/aws/s3filesystem.py",
>  line 272, in 
> exceptions = {path: error for (path, error) in results
> ValueError: too many values to unpack (expected 2) [while running 
> 'WriteToText/Write/WriteImpl/FinalizeWrite']{noformat}
>  
> When the runner tries to delete the temporary output directory, it will 
> trigger this exception. This exception is caused by parsing (path, error) 
> directly from the "results" which is a dict 
> (beam/sdks/python/apache_beam/io/aws/s3filesystem.py line 272). I think we 
> should use results.items() here.
> I have submitted a patch for these 2 bugs: 
> https://github.com/apache/beam/pull/10459
>  
> Thank you.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-9029) Two bugs in Python SDK S3 filesystem support

2020-01-19 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9029?focusedWorklogId=374228=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-374228
 ]

ASF GitHub Bot logged work on BEAM-9029:


Author: ASF GitHub Bot
Created on: 19/Jan/20 08:24
Start Date: 19/Jan/20 08:24
Worklog Time Spent: 10m 
  Work Description: icemoon1987 commented on issue #10459: [BEAM-9029]Fix 
two bugs in Python SDK S3 filesystem support
URL: https://github.com/apache/beam/pull/10459#issuecomment-575979242
 
 
   Run Python PreCommit
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 374228)
Remaining Estimate: 21.5h  (was: 21h 40m)
Time Spent: 2.5h  (was: 2h 20m)

> Two bugs in Python SDK S3 filesystem support
> 
>
> Key: BEAM-9029
> URL: https://issues.apache.org/jira/browse/BEAM-9029
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Wenhai Pan
>Assignee: Wenhai Pan
>Priority: Major
>  Labels: pull-request-available
>   Original Estimate: 24h
>  Time Spent: 2.5h
>  Remaining Estimate: 21.5h
>
> Hi :)
> There seem to be 2 bugs in the S3 filesystem support.
> I tried to use S3 storage for a simple wordcount demo with DirectRunner.
> The demo script:
> {code:java}
> def main():
> options = PipelineOptions().view_as(StandardOptions)
>  options.runner = 'DirectRunner'
> pipeline = beam.Pipeline(options = options)
> (
>  pipeline
>  | ReadFromText("s3://mx-machine-learning/panwenhai/beam_test/test_data")
>  | "extract_words" >> beam.FlatMap(lambda x: re.findall(r" [A-Za-z\']+", x))
>  | beam.combiners.Count.PerElement()
>  | beam.MapTuple(lambda word, count: "%s: %s" % (word, count))
>  | WriteToText("s3://mx-machine-learning/panwenhai/beam_test/output")
>  )
> result = pipeline.run()
>  result.wait_until_finish()
> return
> {code}
>  
> Error message 1:
> {noformat}
> apache_beam.io.filesystem.BeamIOError: Match operation failed with exceptions 
> {'s3://mx-machine-learning/panwenhai/beam_test/output-*-of-1': 
> BeamIOError("List operation failed with exceptions 
> {'s3://mx-machine-learning/panwenhai/beam_test/output-': S3ClientError('Tried 
> to list nonexistent S3 path: 
> s3://mx-machine-learning/panwenhai/beam_test/output-', 404)}")} [while 
> running 'WriteToText/Write/WriteImpl/PreFinalize'] with exceptions 
> None{noformat}
>  
> After digging into the code, it seems the Boto3 client's list function will 
> raise an exception when trying to list a nonexistent S3 path 
> (beam/sdks/pythonapache_beam/io/aws/clients/s3/boto3_client.py line 111). And 
> the S3IO class does not handle this exception in list_prefix function 
> (beam/sdks/python/apache_beam/io/aws/s3io.py line 121).
> When the runner tries to list and delete the existing output file, if there 
> is no existing output file, it will try to list a nonexistent S3 path and 
> will trigger the exception.
> This should not be an issue here. I think we can ignore this exception safely 
> in the S3IO list_prefix function.
> Error Message 2:
> {noformat}
> File 
> "/Users/wenhai.pan/venvs/tfx/lib/python3.7/site-packages/apache_beam-2.19.0.dev0-py3.7.egg/apache_beam/io/aws/s3filesystem.py",
>  line 272, in delete
> exceptions = {path: error for (path, error) in results
> File 
> "/Users/wenhai.pan/venvs/tfx/lib/python3.7/site-packages/apache_beam-2.19.0.dev0-py3.7.egg/apache_beam/io/aws/s3filesystem.py",
>  line 272, in 
> exceptions = {path: error for (path, error) in results
> ValueError: too many values to unpack (expected 2) [while running 
> 'WriteToText/Write/WriteImpl/FinalizeWrite']{noformat}
>  
> When the runner tries to delete the temporary output directory, it will 
> trigger this exception. This exception is caused by parsing (path, error) 
> directly from the "results" which is a dict 
> (beam/sdks/python/apache_beam/io/aws/s3filesystem.py line 272). I think we 
> should use results.items() here.
> I have submitted a patch for these 2 bugs: 
> https://github.com/apache/beam/pull/10459
>  
> Thank you.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-9029) Two bugs in Python SDK S3 filesystem support

2020-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9029?focusedWorklogId=374184=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-374184
 ]

ASF GitHub Bot logged work on BEAM-9029:


Author: ASF GitHub Bot
Created on: 19/Jan/20 03:37
Start Date: 19/Jan/20 03:37
Worklog Time Spent: 10m 
  Work Description: icemoon1987 commented on issue #10459: [BEAM-9029]Fix 
two bugs in Python SDK S3 filesystem support
URL: https://github.com/apache/beam/pull/10459#issuecomment-575963489
 
 
   @pabloem 
   
   Hi, pabloem :)
   
   I have changed the codes and try to rerun the lint tests and precommit 
tests, but it seems I failed to trigger the tests, and can not see the results 
now.
   
   Would you please help to rerun the tests? Thank you so much.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 374184)
Remaining Estimate: 21h 40m  (was: 21h 50m)
Time Spent: 2h 20m  (was: 2h 10m)

> Two bugs in Python SDK S3 filesystem support
> 
>
> Key: BEAM-9029
> URL: https://issues.apache.org/jira/browse/BEAM-9029
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Wenhai Pan
>Assignee: Wenhai Pan
>Priority: Major
>  Labels: pull-request-available
>   Original Estimate: 24h
>  Time Spent: 2h 20m
>  Remaining Estimate: 21h 40m
>
> Hi :)
> There seem to be 2 bugs in the S3 filesystem support.
> I tried to use S3 storage for a simple wordcount demo with DirectRunner.
> The demo script:
> {code:java}
> def main():
> options = PipelineOptions().view_as(StandardOptions)
>  options.runner = 'DirectRunner'
> pipeline = beam.Pipeline(options = options)
> (
>  pipeline
>  | ReadFromText("s3://mx-machine-learning/panwenhai/beam_test/test_data")
>  | "extract_words" >> beam.FlatMap(lambda x: re.findall(r" [A-Za-z\']+", x))
>  | beam.combiners.Count.PerElement()
>  | beam.MapTuple(lambda word, count: "%s: %s" % (word, count))
>  | WriteToText("s3://mx-machine-learning/panwenhai/beam_test/output")
>  )
> result = pipeline.run()
>  result.wait_until_finish()
> return
> {code}
>  
> Error message 1:
> {noformat}
> apache_beam.io.filesystem.BeamIOError: Match operation failed with exceptions 
> {'s3://mx-machine-learning/panwenhai/beam_test/output-*-of-1': 
> BeamIOError("List operation failed with exceptions 
> {'s3://mx-machine-learning/panwenhai/beam_test/output-': S3ClientError('Tried 
> to list nonexistent S3 path: 
> s3://mx-machine-learning/panwenhai/beam_test/output-', 404)}")} [while 
> running 'WriteToText/Write/WriteImpl/PreFinalize'] with exceptions 
> None{noformat}
>  
> After digging into the code, it seems the Boto3 client's list function will 
> raise an exception when trying to list a nonexistent S3 path 
> (beam/sdks/pythonapache_beam/io/aws/clients/s3/boto3_client.py line 111). And 
> the S3IO class does not handle this exception in list_prefix function 
> (beam/sdks/python/apache_beam/io/aws/s3io.py line 121).
> When the runner tries to list and delete the existing output file, if there 
> is no existing output file, it will try to list a nonexistent S3 path and 
> will trigger the exception.
> This should not be an issue here. I think we can ignore this exception safely 
> in the S3IO list_prefix function.
> Error Message 2:
> {noformat}
> File 
> "/Users/wenhai.pan/venvs/tfx/lib/python3.7/site-packages/apache_beam-2.19.0.dev0-py3.7.egg/apache_beam/io/aws/s3filesystem.py",
>  line 272, in delete
> exceptions = {path: error for (path, error) in results
> File 
> "/Users/wenhai.pan/venvs/tfx/lib/python3.7/site-packages/apache_beam-2.19.0.dev0-py3.7.egg/apache_beam/io/aws/s3filesystem.py",
>  line 272, in 
> exceptions = {path: error for (path, error) in results
> ValueError: too many values to unpack (expected 2) [while running 
> 'WriteToText/Write/WriteImpl/FinalizeWrite']{noformat}
>  
> When the runner tries to delete the temporary output directory, it will 
> trigger this exception. This exception is caused by parsing (path, error) 
> directly from the "results" which is a dict 
> (beam/sdks/python/apache_beam/io/aws/s3filesystem.py line 272). I think we 
> should use results.items() here.
> I have submitted a patch for these 2 bugs: 
> https://github.com/apache/beam/pull/10459
>  
> Thank you.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-9029) Two bugs in Python SDK S3 filesystem support

2020-01-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9029?focusedWorklogId=372107=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-372107
 ]

ASF GitHub Bot logged work on BEAM-9029:


Author: ASF GitHub Bot
Created on: 15/Jan/20 03:20
Start Date: 15/Jan/20 03:20
Worklog Time Spent: 10m 
  Work Description: icemoon1987 commented on issue #10459: [BEAM-9029]Fix 
two bugs in Python SDK S3 filesystem support
URL: https://github.com/apache/beam/pull/10459#issuecomment-574477470
 
 
   Run PythonLint PreCommit
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 372107)
Remaining Estimate: 21h 50m  (was: 22h)
Time Spent: 2h 10m  (was: 2h)

> Two bugs in Python SDK S3 filesystem support
> 
>
> Key: BEAM-9029
> URL: https://issues.apache.org/jira/browse/BEAM-9029
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Wenhai Pan
>Assignee: Wenhai Pan
>Priority: Major
>  Labels: pull-request-available
>   Original Estimate: 24h
>  Time Spent: 2h 10m
>  Remaining Estimate: 21h 50m
>
> Hi :)
> There seem to be 2 bugs in the S3 filesystem support.
> I tried to use S3 storage for a simple wordcount demo with DirectRunner.
> The demo script:
> {code:java}
> def main():
> options = PipelineOptions().view_as(StandardOptions)
>  options.runner = 'DirectRunner'
> pipeline = beam.Pipeline(options = options)
> (
>  pipeline
>  | ReadFromText("s3://mx-machine-learning/panwenhai/beam_test/test_data")
>  | "extract_words" >> beam.FlatMap(lambda x: re.findall(r" [A-Za-z\']+", x))
>  | beam.combiners.Count.PerElement()
>  | beam.MapTuple(lambda word, count: "%s: %s" % (word, count))
>  | WriteToText("s3://mx-machine-learning/panwenhai/beam_test/output")
>  )
> result = pipeline.run()
>  result.wait_until_finish()
> return
> {code}
>  
> Error message 1:
> {noformat}
> apache_beam.io.filesystem.BeamIOError: Match operation failed with exceptions 
> {'s3://mx-machine-learning/panwenhai/beam_test/output-*-of-1': 
> BeamIOError("List operation failed with exceptions 
> {'s3://mx-machine-learning/panwenhai/beam_test/output-': S3ClientError('Tried 
> to list nonexistent S3 path: 
> s3://mx-machine-learning/panwenhai/beam_test/output-', 404)}")} [while 
> running 'WriteToText/Write/WriteImpl/PreFinalize'] with exceptions 
> None{noformat}
>  
> After digging into the code, it seems the Boto3 client's list function will 
> raise an exception when trying to list a nonexistent S3 path 
> (beam/sdks/pythonapache_beam/io/aws/clients/s3/boto3_client.py line 111). And 
> the S3IO class does not handle this exception in list_prefix function 
> (beam/sdks/python/apache_beam/io/aws/s3io.py line 121).
> When the runner tries to list and delete the existing output file, if there 
> is no existing output file, it will try to list a nonexistent S3 path and 
> will trigger the exception.
> This should not be an issue here. I think we can ignore this exception safely 
> in the S3IO list_prefix function.
> Error Message 2:
> {noformat}
> File 
> "/Users/wenhai.pan/venvs/tfx/lib/python3.7/site-packages/apache_beam-2.19.0.dev0-py3.7.egg/apache_beam/io/aws/s3filesystem.py",
>  line 272, in delete
> exceptions = {path: error for (path, error) in results
> File 
> "/Users/wenhai.pan/venvs/tfx/lib/python3.7/site-packages/apache_beam-2.19.0.dev0-py3.7.egg/apache_beam/io/aws/s3filesystem.py",
>  line 272, in 
> exceptions = {path: error for (path, error) in results
> ValueError: too many values to unpack (expected 2) [while running 
> 'WriteToText/Write/WriteImpl/FinalizeWrite']{noformat}
>  
> When the runner tries to delete the temporary output directory, it will 
> trigger this exception. This exception is caused by parsing (path, error) 
> directly from the "results" which is a dict 
> (beam/sdks/python/apache_beam/io/aws/s3filesystem.py line 272). I think we 
> should use results.items() here.
> I have submitted a patch for these 2 bugs: 
> https://github.com/apache/beam/pull/10459
>  
> Thank you.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-9029) Two bugs in Python SDK S3 filesystem support

2020-01-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9029?focusedWorklogId=372106=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-372106
 ]

ASF GitHub Bot logged work on BEAM-9029:


Author: ASF GitHub Bot
Created on: 15/Jan/20 03:20
Start Date: 15/Jan/20 03:20
Worklog Time Spent: 10m 
  Work Description: icemoon1987 commented on issue #10459: [BEAM-9029]Fix 
two bugs in Python SDK S3 filesystem support
URL: https://github.com/apache/beam/pull/10459#issuecomment-574477413
 
 
   Run Python PreCommit
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 372106)
Remaining Estimate: 22h  (was: 22h 10m)
Time Spent: 2h  (was: 1h 50m)

> Two bugs in Python SDK S3 filesystem support
> 
>
> Key: BEAM-9029
> URL: https://issues.apache.org/jira/browse/BEAM-9029
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Wenhai Pan
>Assignee: Wenhai Pan
>Priority: Major
>  Labels: pull-request-available
>   Original Estimate: 24h
>  Time Spent: 2h
>  Remaining Estimate: 22h
>
> Hi :)
> There seem to be 2 bugs in the S3 filesystem support.
> I tried to use S3 storage for a simple wordcount demo with DirectRunner.
> The demo script:
> {code:java}
> def main():
> options = PipelineOptions().view_as(StandardOptions)
>  options.runner = 'DirectRunner'
> pipeline = beam.Pipeline(options = options)
> (
>  pipeline
>  | ReadFromText("s3://mx-machine-learning/panwenhai/beam_test/test_data")
>  | "extract_words" >> beam.FlatMap(lambda x: re.findall(r" [A-Za-z\']+", x))
>  | beam.combiners.Count.PerElement()
>  | beam.MapTuple(lambda word, count: "%s: %s" % (word, count))
>  | WriteToText("s3://mx-machine-learning/panwenhai/beam_test/output")
>  )
> result = pipeline.run()
>  result.wait_until_finish()
> return
> {code}
>  
> Error message 1:
> {noformat}
> apache_beam.io.filesystem.BeamIOError: Match operation failed with exceptions 
> {'s3://mx-machine-learning/panwenhai/beam_test/output-*-of-1': 
> BeamIOError("List operation failed with exceptions 
> {'s3://mx-machine-learning/panwenhai/beam_test/output-': S3ClientError('Tried 
> to list nonexistent S3 path: 
> s3://mx-machine-learning/panwenhai/beam_test/output-', 404)}")} [while 
> running 'WriteToText/Write/WriteImpl/PreFinalize'] with exceptions 
> None{noformat}
>  
> After digging into the code, it seems the Boto3 client's list function will 
> raise an exception when trying to list a nonexistent S3 path 
> (beam/sdks/pythonapache_beam/io/aws/clients/s3/boto3_client.py line 111). And 
> the S3IO class does not handle this exception in list_prefix function 
> (beam/sdks/python/apache_beam/io/aws/s3io.py line 121).
> When the runner tries to list and delete the existing output file, if there 
> is no existing output file, it will try to list a nonexistent S3 path and 
> will trigger the exception.
> This should not be an issue here. I think we can ignore this exception safely 
> in the S3IO list_prefix function.
> Error Message 2:
> {noformat}
> File 
> "/Users/wenhai.pan/venvs/tfx/lib/python3.7/site-packages/apache_beam-2.19.0.dev0-py3.7.egg/apache_beam/io/aws/s3filesystem.py",
>  line 272, in delete
> exceptions = {path: error for (path, error) in results
> File 
> "/Users/wenhai.pan/venvs/tfx/lib/python3.7/site-packages/apache_beam-2.19.0.dev0-py3.7.egg/apache_beam/io/aws/s3filesystem.py",
>  line 272, in 
> exceptions = {path: error for (path, error) in results
> ValueError: too many values to unpack (expected 2) [while running 
> 'WriteToText/Write/WriteImpl/FinalizeWrite']{noformat}
>  
> When the runner tries to delete the temporary output directory, it will 
> trigger this exception. This exception is caused by parsing (path, error) 
> directly from the "results" which is a dict 
> (beam/sdks/python/apache_beam/io/aws/s3filesystem.py line 272). I think we 
> should use results.items() here.
> I have submitted a patch for these 2 bugs: 
> https://github.com/apache/beam/pull/10459
>  
> Thank you.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-9029) Two bugs in Python SDK S3 filesystem support

2020-01-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9029?focusedWorklogId=371753=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-371753
 ]

ASF GitHub Bot logged work on BEAM-9029:


Author: ASF GitHub Bot
Created on: 14/Jan/20 18:12
Start Date: 14/Jan/20 18:12
Worklog Time Spent: 10m 
  Work Description: pabloem commented on pull request #10459: 
[BEAM-9029]Fix two bugs in Python SDK S3 filesystem support
URL: https://github.com/apache/beam/pull/10459#discussion_r366493566
 
 

 ##
 File path: sdks/python/apache_beam/io/aws/s3io.py
 ##
 @@ -118,7 +118,18 @@ def list_prefix(self, path):
 logging.info("Starting the size estimation of the input")
 
 while True:
-  response = self.client.list(request)
+
+  #The list operation will raise an exception when trying to list a 
nonexistent S3 path. 
+  #This should not be an issue here. 
+  #Ignore this exception or it will break the procedure.
+  try:
+response = self.client.list(request)
+  except messages.S3ClientError as e:
+if e.code == 404:
+  break
 
 Review comment:
   You are correct. Thanks for pointing that out. FYI, it seems that there are 
a few linnt errors and a few errors in precommit tests.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 371753)
Remaining Estimate: 22h 10m  (was: 22h 20m)
Time Spent: 1h 50m  (was: 1h 40m)

> Two bugs in Python SDK S3 filesystem support
> 
>
> Key: BEAM-9029
> URL: https://issues.apache.org/jira/browse/BEAM-9029
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Wenhai Pan
>Assignee: Wenhai Pan
>Priority: Major
>  Labels: pull-request-available
>   Original Estimate: 24h
>  Time Spent: 1h 50m
>  Remaining Estimate: 22h 10m
>
> Hi :)
> There seem to be 2 bugs in the S3 filesystem support.
> I tried to use S3 storage for a simple wordcount demo with DirectRunner.
> The demo script:
> {code:java}
> def main():
> options = PipelineOptions().view_as(StandardOptions)
>  options.runner = 'DirectRunner'
> pipeline = beam.Pipeline(options = options)
> (
>  pipeline
>  | ReadFromText("s3://mx-machine-learning/panwenhai/beam_test/test_data")
>  | "extract_words" >> beam.FlatMap(lambda x: re.findall(r" [A-Za-z\']+", x))
>  | beam.combiners.Count.PerElement()
>  | beam.MapTuple(lambda word, count: "%s: %s" % (word, count))
>  | WriteToText("s3://mx-machine-learning/panwenhai/beam_test/output")
>  )
> result = pipeline.run()
>  result.wait_until_finish()
> return
> {code}
>  
> Error message 1:
> {noformat}
> apache_beam.io.filesystem.BeamIOError: Match operation failed with exceptions 
> {'s3://mx-machine-learning/panwenhai/beam_test/output-*-of-1': 
> BeamIOError("List operation failed with exceptions 
> {'s3://mx-machine-learning/panwenhai/beam_test/output-': S3ClientError('Tried 
> to list nonexistent S3 path: 
> s3://mx-machine-learning/panwenhai/beam_test/output-', 404)}")} [while 
> running 'WriteToText/Write/WriteImpl/PreFinalize'] with exceptions 
> None{noformat}
>  
> After digging into the code, it seems the Boto3 client's list function will 
> raise an exception when trying to list a nonexistent S3 path 
> (beam/sdks/pythonapache_beam/io/aws/clients/s3/boto3_client.py line 111). And 
> the S3IO class does not handle this exception in list_prefix function 
> (beam/sdks/python/apache_beam/io/aws/s3io.py line 121).
> When the runner tries to list and delete the existing output file, if there 
> is no existing output file, it will try to list a nonexistent S3 path and 
> will trigger the exception.
> This should not be an issue here. I think we can ignore this exception safely 
> in the S3IO list_prefix function.
> Error Message 2:
> {noformat}
> File 
> "/Users/wenhai.pan/venvs/tfx/lib/python3.7/site-packages/apache_beam-2.19.0.dev0-py3.7.egg/apache_beam/io/aws/s3filesystem.py",
>  line 272, in delete
> exceptions = {path: error for (path, error) in results
> File 
> "/Users/wenhai.pan/venvs/tfx/lib/python3.7/site-packages/apache_beam-2.19.0.dev0-py3.7.egg/apache_beam/io/aws/s3filesystem.py",
>  line 272, in 
> exceptions = {path: error for (path, error) in results
> ValueError: too many values to unpack (expected 2) [while running 
> 'WriteToText/Write/WriteImpl/FinalizeWrite']{noformat}
>  
> When the runner tries to delete the temporary output directory, it will 
> trigger this exception. This 

[jira] [Work logged] (BEAM-9029) Two bugs in Python SDK S3 filesystem support

2020-01-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9029?focusedWorklogId=371343=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-371343
 ]

ASF GitHub Bot logged work on BEAM-9029:


Author: ASF GitHub Bot
Created on: 14/Jan/20 04:16
Start Date: 14/Jan/20 04:16
Worklog Time Spent: 10m 
  Work Description: icemoon1987 commented on pull request #10459: 
[BEAM-9029]Fix two bugs in Python SDK S3 filesystem support
URL: https://github.com/apache/beam/pull/10459#discussion_r366145805
 
 

 ##
 File path: sdks/python/apache_beam/io/aws/s3io.py
 ##
 @@ -118,7 +118,18 @@ def list_prefix(self, path):
 logging.info("Starting the size estimation of the input")
 
 while True:
-  response = self.client.list(request)
+
+  #The list operation will raise an exception when trying to list a 
nonexistent S3 path. 
+  #This should not be an issue here. 
+  #Ignore this exception or it will break the procedure.
+  try:
+response = self.client.list(request)
+  except messages.S3ClientError as e:
+if e.code == 404:
+  break
 
 Review comment:
   Hi pabloem,
   
   Thank you for reviewing :)
   
   This "break" will exit the "while" loop start at line 120. The 
"response.items" operation in line 133 is also in the "while" loop in line 120. 
So the "response.items" (line 133) will not be run and will not raise an 
exception. The variable "response" is only used in the "while" loop start at 
line 120.
   
   The "break" here will stop the loop and the function: list_prefix will 
return the variable "file_sizes" in line 147. Since the variable "file_sizes" 
is initialized as an empty dict in line 114. The list_prefix function will 
return an empty dict. It should be safe. 
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 371343)
Remaining Estimate: 22h 20m  (was: 22.5h)
Time Spent: 1h 40m  (was: 1.5h)

> Two bugs in Python SDK S3 filesystem support
> 
>
> Key: BEAM-9029
> URL: https://issues.apache.org/jira/browse/BEAM-9029
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Wenhai Pan
>Assignee: Wenhai Pan
>Priority: Major
>  Labels: pull-request-available
>   Original Estimate: 24h
>  Time Spent: 1h 40m
>  Remaining Estimate: 22h 20m
>
> Hi :)
> There seem to be 2 bugs in the S3 filesystem support.
> I tried to use S3 storage for a simple wordcount demo with DirectRunner.
> The demo script:
> {code:java}
> def main():
> options = PipelineOptions().view_as(StandardOptions)
>  options.runner = 'DirectRunner'
> pipeline = beam.Pipeline(options = options)
> (
>  pipeline
>  | ReadFromText("s3://mx-machine-learning/panwenhai/beam_test/test_data")
>  | "extract_words" >> beam.FlatMap(lambda x: re.findall(r" [A-Za-z\']+", x))
>  | beam.combiners.Count.PerElement()
>  | beam.MapTuple(lambda word, count: "%s: %s" % (word, count))
>  | WriteToText("s3://mx-machine-learning/panwenhai/beam_test/output")
>  )
> result = pipeline.run()
>  result.wait_until_finish()
> return
> {code}
>  
> Error message 1:
> {noformat}
> apache_beam.io.filesystem.BeamIOError: Match operation failed with exceptions 
> {'s3://mx-machine-learning/panwenhai/beam_test/output-*-of-1': 
> BeamIOError("List operation failed with exceptions 
> {'s3://mx-machine-learning/panwenhai/beam_test/output-': S3ClientError('Tried 
> to list nonexistent S3 path: 
> s3://mx-machine-learning/panwenhai/beam_test/output-', 404)}")} [while 
> running 'WriteToText/Write/WriteImpl/PreFinalize'] with exceptions 
> None{noformat}
>  
> After digging into the code, it seems the Boto3 client's list function will 
> raise an exception when trying to list a nonexistent S3 path 
> (beam/sdks/pythonapache_beam/io/aws/clients/s3/boto3_client.py line 111). And 
> the S3IO class does not handle this exception in list_prefix function 
> (beam/sdks/python/apache_beam/io/aws/s3io.py line 121).
> When the runner tries to list and delete the existing output file, if there 
> is no existing output file, it will try to list a nonexistent S3 path and 
> will trigger the exception.
> This should not be an issue here. I think we can ignore this exception safely 
> in the S3IO list_prefix function.
> Error Message 2:
> {noformat}
> File 
> "/Users/wenhai.pan/venvs/tfx/lib/python3.7/site-packages/apache_beam-2.19.0.dev0-py3.7.egg/apache_beam/io/aws/s3filesystem.py",
>  line 272, in delete
> 

[jira] [Work logged] (BEAM-9029) Two bugs in Python SDK S3 filesystem support

2020-01-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9029?focusedWorklogId=371342=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-371342
 ]

ASF GitHub Bot logged work on BEAM-9029:


Author: ASF GitHub Bot
Created on: 14/Jan/20 04:15
Start Date: 14/Jan/20 04:15
Worklog Time Spent: 10m 
  Work Description: icemoon1987 commented on pull request #10459: 
[BEAM-9029]Fix two bugs in Python SDK S3 filesystem support
URL: https://github.com/apache/beam/pull/10459#discussion_r366145805
 
 

 ##
 File path: sdks/python/apache_beam/io/aws/s3io.py
 ##
 @@ -118,7 +118,18 @@ def list_prefix(self, path):
 logging.info("Starting the size estimation of the input")
 
 while True:
-  response = self.client.list(request)
+
+  #The list operation will raise an exception when trying to list a 
nonexistent S3 path. 
+  #This should not be an issue here. 
+  #Ignore this exception or it will break the procedure.
+  try:
+response = self.client.list(request)
+  except messages.S3ClientError as e:
+if e.code == 404:
+  break
 
 Review comment:
   Hi pabloem,
   
   Thank you for reviewing :)
   
   This "break" will exit the "while" loop start at line 120. The 
"response.items" operation in line 133 is also in the "while" loop in line 120. 
So the "response.items" (line 133) will not raise an exception. The variable 
"response" is only used in the "while" loop start at line 120.
   
   I directly "break" here because I saw the "break" operation in line 142. 
They should have the similar effect.
   
   The "break" here will stop the loop and the function: list_prefix will 
return the variable "file_sizes" in line 147. Since the variable "file_sizes" 
is initialized as an empty dict in line 114. The list_prefix function will 
return an empty dict. It should be safe. 
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 371342)
Remaining Estimate: 22.5h  (was: 22h 40m)
Time Spent: 1.5h  (was: 1h 20m)

> Two bugs in Python SDK S3 filesystem support
> 
>
> Key: BEAM-9029
> URL: https://issues.apache.org/jira/browse/BEAM-9029
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Wenhai Pan
>Assignee: Wenhai Pan
>Priority: Major
>  Labels: pull-request-available
>   Original Estimate: 24h
>  Time Spent: 1.5h
>  Remaining Estimate: 22.5h
>
> Hi :)
> There seem to be 2 bugs in the S3 filesystem support.
> I tried to use S3 storage for a simple wordcount demo with DirectRunner.
> The demo script:
> {code:java}
> def main():
> options = PipelineOptions().view_as(StandardOptions)
>  options.runner = 'DirectRunner'
> pipeline = beam.Pipeline(options = options)
> (
>  pipeline
>  | ReadFromText("s3://mx-machine-learning/panwenhai/beam_test/test_data")
>  | "extract_words" >> beam.FlatMap(lambda x: re.findall(r" [A-Za-z\']+", x))
>  | beam.combiners.Count.PerElement()
>  | beam.MapTuple(lambda word, count: "%s: %s" % (word, count))
>  | WriteToText("s3://mx-machine-learning/panwenhai/beam_test/output")
>  )
> result = pipeline.run()
>  result.wait_until_finish()
> return
> {code}
>  
> Error message 1:
> {noformat}
> apache_beam.io.filesystem.BeamIOError: Match operation failed with exceptions 
> {'s3://mx-machine-learning/panwenhai/beam_test/output-*-of-1': 
> BeamIOError("List operation failed with exceptions 
> {'s3://mx-machine-learning/panwenhai/beam_test/output-': S3ClientError('Tried 
> to list nonexistent S3 path: 
> s3://mx-machine-learning/panwenhai/beam_test/output-', 404)}")} [while 
> running 'WriteToText/Write/WriteImpl/PreFinalize'] with exceptions 
> None{noformat}
>  
> After digging into the code, it seems the Boto3 client's list function will 
> raise an exception when trying to list a nonexistent S3 path 
> (beam/sdks/pythonapache_beam/io/aws/clients/s3/boto3_client.py line 111). And 
> the S3IO class does not handle this exception in list_prefix function 
> (beam/sdks/python/apache_beam/io/aws/s3io.py line 121).
> When the runner tries to list and delete the existing output file, if there 
> is no existing output file, it will try to list a nonexistent S3 path and 
> will trigger the exception.
> This should not be an issue here. I think we can ignore this exception safely 
> in the S3IO list_prefix function.
> Error Message 2:
> {noformat}
> File 
> 

[jira] [Work logged] (BEAM-9029) Two bugs in Python SDK S3 filesystem support

2020-01-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9029?focusedWorklogId=371190=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-371190
 ]

ASF GitHub Bot logged work on BEAM-9029:


Author: ASF GitHub Bot
Created on: 13/Jan/20 23:23
Start Date: 13/Jan/20 23:23
Worklog Time Spent: 10m 
  Work Description: pabloem commented on issue #10459: [BEAM-9029]Fix two 
bugs in Python SDK S3 filesystem support
URL: https://github.com/apache/beam/pull/10459#issuecomment-573921702
 
 
   Here are the lint errors: 
https://scans.gradle.com/s/rlvpwlynymekm/console-log?task=:sdks:python:test-suites:tox:py37:lintPy37
   
   Here are the errors in the precommit tests: 
https://builds.apache.org/job/beam_PreCommit_Python_Phrase/1367/#showFailuresLink
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 371190)
Remaining Estimate: 22h 40m  (was: 22h 50m)
Time Spent: 1h 20m  (was: 1h 10m)

> Two bugs in Python SDK S3 filesystem support
> 
>
> Key: BEAM-9029
> URL: https://issues.apache.org/jira/browse/BEAM-9029
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Wenhai Pan
>Assignee: Wenhai Pan
>Priority: Major
>  Labels: pull-request-available
>   Original Estimate: 24h
>  Time Spent: 1h 20m
>  Remaining Estimate: 22h 40m
>
> Hi :)
> There seem to be 2 bugs in the S3 filesystem support.
> I tried to use S3 storage for a simple wordcount demo with DirectRunner.
> The demo script:
> {code:java}
> def main():
> options = PipelineOptions().view_as(StandardOptions)
>  options.runner = 'DirectRunner'
> pipeline = beam.Pipeline(options = options)
> (
>  pipeline
>  | ReadFromText("s3://mx-machine-learning/panwenhai/beam_test/test_data")
>  | "extract_words" >> beam.FlatMap(lambda x: re.findall(r" [A-Za-z\']+", x))
>  | beam.combiners.Count.PerElement()
>  | beam.MapTuple(lambda word, count: "%s: %s" % (word, count))
>  | WriteToText("s3://mx-machine-learning/panwenhai/beam_test/output")
>  )
> result = pipeline.run()
>  result.wait_until_finish()
> return
> {code}
>  
> Error message 1:
> {noformat}
> apache_beam.io.filesystem.BeamIOError: Match operation failed with exceptions 
> {'s3://mx-machine-learning/panwenhai/beam_test/output-*-of-1': 
> BeamIOError("List operation failed with exceptions 
> {'s3://mx-machine-learning/panwenhai/beam_test/output-': S3ClientError('Tried 
> to list nonexistent S3 path: 
> s3://mx-machine-learning/panwenhai/beam_test/output-', 404)}")} [while 
> running 'WriteToText/Write/WriteImpl/PreFinalize'] with exceptions 
> None{noformat}
>  
> After digging into the code, it seems the Boto3 client's list function will 
> raise an exception when trying to list a nonexistent S3 path 
> (beam/sdks/pythonapache_beam/io/aws/clients/s3/boto3_client.py line 111). And 
> the S3IO class does not handle this exception in list_prefix function 
> (beam/sdks/python/apache_beam/io/aws/s3io.py line 121).
> When the runner tries to list and delete the existing output file, if there 
> is no existing output file, it will try to list a nonexistent S3 path and 
> will trigger the exception.
> This should not be an issue here. I think we can ignore this exception safely 
> in the S3IO list_prefix function.
> Error Message 2:
> {noformat}
> File 
> "/Users/wenhai.pan/venvs/tfx/lib/python3.7/site-packages/apache_beam-2.19.0.dev0-py3.7.egg/apache_beam/io/aws/s3filesystem.py",
>  line 272, in delete
> exceptions = {path: error for (path, error) in results
> File 
> "/Users/wenhai.pan/venvs/tfx/lib/python3.7/site-packages/apache_beam-2.19.0.dev0-py3.7.egg/apache_beam/io/aws/s3filesystem.py",
>  line 272, in 
> exceptions = {path: error for (path, error) in results
> ValueError: too many values to unpack (expected 2) [while running 
> 'WriteToText/Write/WriteImpl/FinalizeWrite']{noformat}
>  
> When the runner tries to delete the temporary output directory, it will 
> trigger this exception. This exception is caused by parsing (path, error) 
> directly from the "results" which is a dict 
> (beam/sdks/python/apache_beam/io/aws/s3filesystem.py line 272). I think we 
> should use results.items() here.
> I have submitted a patch for these 2 bugs: 
> https://github.com/apache/beam/pull/10459
>  
> Thank you.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-9029) Two bugs in Python SDK S3 filesystem support

2020-01-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9029?focusedWorklogId=371163=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-371163
 ]

ASF GitHub Bot logged work on BEAM-9029:


Author: ASF GitHub Bot
Created on: 13/Jan/20 22:29
Start Date: 13/Jan/20 22:29
Worklog Time Spent: 10m 
  Work Description: pabloem commented on issue #10459: [BEAM-9029]Fix two 
bugs in Python SDK S3 filesystem support
URL: https://github.com/apache/beam/pull/10459#issuecomment-573903261
 
 
   Run PythonLint PreCommit
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 371163)
Remaining Estimate: 22h 50m  (was: 23h)
Time Spent: 1h 10m  (was: 1h)

> Two bugs in Python SDK S3 filesystem support
> 
>
> Key: BEAM-9029
> URL: https://issues.apache.org/jira/browse/BEAM-9029
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Wenhai Pan
>Assignee: Wenhai Pan
>Priority: Major
>  Labels: pull-request-available
>   Original Estimate: 24h
>  Time Spent: 1h 10m
>  Remaining Estimate: 22h 50m
>
> Hi :)
> There seem to be 2 bugs in the S3 filesystem support.
> I tried to use S3 storage for a simple wordcount demo with DirectRunner.
> The demo script:
> {code:java}
> def main():
> options = PipelineOptions().view_as(StandardOptions)
>  options.runner = 'DirectRunner'
> pipeline = beam.Pipeline(options = options)
> (
>  pipeline
>  | ReadFromText("s3://mx-machine-learning/panwenhai/beam_test/test_data")
>  | "extract_words" >> beam.FlatMap(lambda x: re.findall(r" [A-Za-z\']+", x))
>  | beam.combiners.Count.PerElement()
>  | beam.MapTuple(lambda word, count: "%s: %s" % (word, count))
>  | WriteToText("s3://mx-machine-learning/panwenhai/beam_test/output")
>  )
> result = pipeline.run()
>  result.wait_until_finish()
> return
> {code}
>  
> Error message 1:
> {noformat}
> apache_beam.io.filesystem.BeamIOError: Match operation failed with exceptions 
> {'s3://mx-machine-learning/panwenhai/beam_test/output-*-of-1': 
> BeamIOError("List operation failed with exceptions 
> {'s3://mx-machine-learning/panwenhai/beam_test/output-': S3ClientError('Tried 
> to list nonexistent S3 path: 
> s3://mx-machine-learning/panwenhai/beam_test/output-', 404)}")} [while 
> running 'WriteToText/Write/WriteImpl/PreFinalize'] with exceptions 
> None{noformat}
>  
> After digging into the code, it seems the Boto3 client's list function will 
> raise an exception when trying to list a nonexistent S3 path 
> (beam/sdks/pythonapache_beam/io/aws/clients/s3/boto3_client.py line 111). And 
> the S3IO class does not handle this exception in list_prefix function 
> (beam/sdks/python/apache_beam/io/aws/s3io.py line 121).
> When the runner tries to list and delete the existing output file, if there 
> is no existing output file, it will try to list a nonexistent S3 path and 
> will trigger the exception.
> This should not be an issue here. I think we can ignore this exception safely 
> in the S3IO list_prefix function.
> Error Message 2:
> {noformat}
> File 
> "/Users/wenhai.pan/venvs/tfx/lib/python3.7/site-packages/apache_beam-2.19.0.dev0-py3.7.egg/apache_beam/io/aws/s3filesystem.py",
>  line 272, in delete
> exceptions = {path: error for (path, error) in results
> File 
> "/Users/wenhai.pan/venvs/tfx/lib/python3.7/site-packages/apache_beam-2.19.0.dev0-py3.7.egg/apache_beam/io/aws/s3filesystem.py",
>  line 272, in 
> exceptions = {path: error for (path, error) in results
> ValueError: too many values to unpack (expected 2) [while running 
> 'WriteToText/Write/WriteImpl/FinalizeWrite']{noformat}
>  
> When the runner tries to delete the temporary output directory, it will 
> trigger this exception. This exception is caused by parsing (path, error) 
> directly from the "results" which is a dict 
> (beam/sdks/python/apache_beam/io/aws/s3filesystem.py line 272). I think we 
> should use results.items() here.
> I have submitted a patch for these 2 bugs: 
> https://github.com/apache/beam/pull/10459
>  
> Thank you.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-9029) Two bugs in Python SDK S3 filesystem support

2020-01-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9029?focusedWorklogId=371162=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-371162
 ]

ASF GitHub Bot logged work on BEAM-9029:


Author: ASF GitHub Bot
Created on: 13/Jan/20 22:29
Start Date: 13/Jan/20 22:29
Worklog Time Spent: 10m 
  Work Description: pabloem commented on issue #10459: [BEAM-9029]Fix two 
bugs in Python SDK S3 filesystem support
URL: https://github.com/apache/beam/pull/10459#issuecomment-573903206
 
 
   Run Python PreCommit
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 371162)
Remaining Estimate: 23h  (was: 23h 10m)
Time Spent: 1h  (was: 50m)

> Two bugs in Python SDK S3 filesystem support
> 
>
> Key: BEAM-9029
> URL: https://issues.apache.org/jira/browse/BEAM-9029
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Wenhai Pan
>Assignee: Wenhai Pan
>Priority: Major
>  Labels: pull-request-available
>   Original Estimate: 24h
>  Time Spent: 1h
>  Remaining Estimate: 23h
>
> Hi :)
> There seem to be 2 bugs in the S3 filesystem support.
> I tried to use S3 storage for a simple wordcount demo with DirectRunner.
> The demo script:
> {code:java}
> def main():
> options = PipelineOptions().view_as(StandardOptions)
>  options.runner = 'DirectRunner'
> pipeline = beam.Pipeline(options = options)
> (
>  pipeline
>  | ReadFromText("s3://mx-machine-learning/panwenhai/beam_test/test_data")
>  | "extract_words" >> beam.FlatMap(lambda x: re.findall(r" [A-Za-z\']+", x))
>  | beam.combiners.Count.PerElement()
>  | beam.MapTuple(lambda word, count: "%s: %s" % (word, count))
>  | WriteToText("s3://mx-machine-learning/panwenhai/beam_test/output")
>  )
> result = pipeline.run()
>  result.wait_until_finish()
> return
> {code}
>  
> Error message 1:
> {noformat}
> apache_beam.io.filesystem.BeamIOError: Match operation failed with exceptions 
> {'s3://mx-machine-learning/panwenhai/beam_test/output-*-of-1': 
> BeamIOError("List operation failed with exceptions 
> {'s3://mx-machine-learning/panwenhai/beam_test/output-': S3ClientError('Tried 
> to list nonexistent S3 path: 
> s3://mx-machine-learning/panwenhai/beam_test/output-', 404)}")} [while 
> running 'WriteToText/Write/WriteImpl/PreFinalize'] with exceptions 
> None{noformat}
>  
> After digging into the code, it seems the Boto3 client's list function will 
> raise an exception when trying to list a nonexistent S3 path 
> (beam/sdks/pythonapache_beam/io/aws/clients/s3/boto3_client.py line 111). And 
> the S3IO class does not handle this exception in list_prefix function 
> (beam/sdks/python/apache_beam/io/aws/s3io.py line 121).
> When the runner tries to list and delete the existing output file, if there 
> is no existing output file, it will try to list a nonexistent S3 path and 
> will trigger the exception.
> This should not be an issue here. I think we can ignore this exception safely 
> in the S3IO list_prefix function.
> Error Message 2:
> {noformat}
> File 
> "/Users/wenhai.pan/venvs/tfx/lib/python3.7/site-packages/apache_beam-2.19.0.dev0-py3.7.egg/apache_beam/io/aws/s3filesystem.py",
>  line 272, in delete
> exceptions = {path: error for (path, error) in results
> File 
> "/Users/wenhai.pan/venvs/tfx/lib/python3.7/site-packages/apache_beam-2.19.0.dev0-py3.7.egg/apache_beam/io/aws/s3filesystem.py",
>  line 272, in 
> exceptions = {path: error for (path, error) in results
> ValueError: too many values to unpack (expected 2) [while running 
> 'WriteToText/Write/WriteImpl/FinalizeWrite']{noformat}
>  
> When the runner tries to delete the temporary output directory, it will 
> trigger this exception. This exception is caused by parsing (path, error) 
> directly from the "results" which is a dict 
> (beam/sdks/python/apache_beam/io/aws/s3filesystem.py line 272). I think we 
> should use results.items() here.
> I have submitted a patch for these 2 bugs: 
> https://github.com/apache/beam/pull/10459
>  
> Thank you.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-9029) Two bugs in Python SDK S3 filesystem support

2020-01-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9029?focusedWorklogId=371149=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-371149
 ]

ASF GitHub Bot logged work on BEAM-9029:


Author: ASF GitHub Bot
Created on: 13/Jan/20 22:16
Start Date: 13/Jan/20 22:16
Worklog Time Spent: 10m 
  Work Description: pabloem commented on pull request #10459: 
[BEAM-9029]Fix two bugs in Python SDK S3 filesystem support
URL: https://github.com/apache/beam/pull/10459#discussion_r366058089
 
 

 ##
 File path: sdks/python/apache_beam/io/aws/s3io.py
 ##
 @@ -118,7 +118,18 @@ def list_prefix(self, path):
 logging.info("Starting the size estimation of the input")
 
 while True:
-  response = self.client.list(request)
+
+  #The list operation will raise an exception when trying to list a 
nonexistent S3 path. 
+  #This should not be an issue here. 
+  #Ignore this exception or it will break the procedure.
+  try:
+response = self.client.list(request)
+  except messages.S3ClientError as e:
+if e.code == 404:
+  break
 
 Review comment:
   @icemoon1987 fyi
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 371149)
Remaining Estimate: 23h 10m  (was: 23h 20m)
Time Spent: 50m  (was: 40m)

> Two bugs in Python SDK S3 filesystem support
> 
>
> Key: BEAM-9029
> URL: https://issues.apache.org/jira/browse/BEAM-9029
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Wenhai Pan
>Assignee: Wenhai Pan
>Priority: Major
>  Labels: pull-request-available
>   Original Estimate: 24h
>  Time Spent: 50m
>  Remaining Estimate: 23h 10m
>
> Hi :)
> There seem to be 2 bugs in the S3 filesystem support.
> I tried to use S3 storage for a simple wordcount demo with DirectRunner.
> The demo script:
> {code:java}
> def main():
> options = PipelineOptions().view_as(StandardOptions)
>  options.runner = 'DirectRunner'
> pipeline = beam.Pipeline(options = options)
> (
>  pipeline
>  | ReadFromText("s3://mx-machine-learning/panwenhai/beam_test/test_data")
>  | "extract_words" >> beam.FlatMap(lambda x: re.findall(r" [A-Za-z\']+", x))
>  | beam.combiners.Count.PerElement()
>  | beam.MapTuple(lambda word, count: "%s: %s" % (word, count))
>  | WriteToText("s3://mx-machine-learning/panwenhai/beam_test/output")
>  )
> result = pipeline.run()
>  result.wait_until_finish()
> return
> {code}
>  
> Error message 1:
> {noformat}
> apache_beam.io.filesystem.BeamIOError: Match operation failed with exceptions 
> {'s3://mx-machine-learning/panwenhai/beam_test/output-*-of-1': 
> BeamIOError("List operation failed with exceptions 
> {'s3://mx-machine-learning/panwenhai/beam_test/output-': S3ClientError('Tried 
> to list nonexistent S3 path: 
> s3://mx-machine-learning/panwenhai/beam_test/output-', 404)}")} [while 
> running 'WriteToText/Write/WriteImpl/PreFinalize'] with exceptions 
> None{noformat}
>  
> After digging into the code, it seems the Boto3 client's list function will 
> raise an exception when trying to list a nonexistent S3 path 
> (beam/sdks/pythonapache_beam/io/aws/clients/s3/boto3_client.py line 111). And 
> the S3IO class does not handle this exception in list_prefix function 
> (beam/sdks/python/apache_beam/io/aws/s3io.py line 121).
> When the runner tries to list and delete the existing output file, if there 
> is no existing output file, it will try to list a nonexistent S3 path and 
> will trigger the exception.
> This should not be an issue here. I think we can ignore this exception safely 
> in the S3IO list_prefix function.
> Error Message 2:
> {noformat}
> File 
> "/Users/wenhai.pan/venvs/tfx/lib/python3.7/site-packages/apache_beam-2.19.0.dev0-py3.7.egg/apache_beam/io/aws/s3filesystem.py",
>  line 272, in delete
> exceptions = {path: error for (path, error) in results
> File 
> "/Users/wenhai.pan/venvs/tfx/lib/python3.7/site-packages/apache_beam-2.19.0.dev0-py3.7.egg/apache_beam/io/aws/s3filesystem.py",
>  line 272, in 
> exceptions = {path: error for (path, error) in results
> ValueError: too many values to unpack (expected 2) [while running 
> 'WriteToText/Write/WriteImpl/FinalizeWrite']{noformat}
>  
> When the runner tries to delete the temporary output directory, it will 
> trigger this exception. This exception is caused by parsing (path, error) 
> directly from the "results" which is a dict 
> 

[jira] [Work logged] (BEAM-9029) Two bugs in Python SDK S3 filesystem support

2020-01-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9029?focusedWorklogId=370144=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-370144
 ]

ASF GitHub Bot logged work on BEAM-9029:


Author: ASF GitHub Bot
Created on: 10/Jan/20 23:39
Start Date: 10/Jan/20 23:39
Worklog Time Spent: 10m 
  Work Description: pabloem commented on pull request #10459: 
[BEAM-9029]Fix two bugs in Python SDK S3 filesystem support
URL: https://github.com/apache/beam/pull/10459#discussion_r365472091
 
 

 ##
 File path: sdks/python/apache_beam/io/aws/s3io.py
 ##
 @@ -118,7 +118,18 @@ def list_prefix(self, path):
 logging.info("Starting the size estimation of the input")
 
 while True:
-  response = self.client.list(request)
+
+  #The list operation will raise an exception when trying to list a 
nonexistent S3 path. 
+  #This should not be an issue here. 
+  #Ignore this exception or it will break the procedure.
+  try:
+response = self.client.list(request)
+  except messages.S3ClientError as e:
+if e.code == 404:
+  break
 
 Review comment:
   Should the response be some kind of empty response? If we just break out 
here, `response` will be `None`, and `response.items` will raise an exception, 
right?
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 370144)
Remaining Estimate: 23h 20m  (was: 23.5h)
Time Spent: 40m  (was: 0.5h)

> Two bugs in Python SDK S3 filesystem support
> 
>
> Key: BEAM-9029
> URL: https://issues.apache.org/jira/browse/BEAM-9029
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Wenhai Pan
>Assignee: Wenhai Pan
>Priority: Major
>  Labels: pull-request-available
>   Original Estimate: 24h
>  Time Spent: 40m
>  Remaining Estimate: 23h 20m
>
> Hi :)
> There seem to be 2 bugs in the S3 filesystem support.
> I tried to use S3 storage for a simple wordcount demo with DirectRunner.
> The demo script:
> {code:java}
> def main():
> options = PipelineOptions().view_as(StandardOptions)
>  options.runner = 'DirectRunner'
> pipeline = beam.Pipeline(options = options)
> (
>  pipeline
>  | ReadFromText("s3://mx-machine-learning/panwenhai/beam_test/test_data")
>  | "extract_words" >> beam.FlatMap(lambda x: re.findall(r" [A-Za-z\']+", x))
>  | beam.combiners.Count.PerElement()
>  | beam.MapTuple(lambda word, count: "%s: %s" % (word, count))
>  | WriteToText("s3://mx-machine-learning/panwenhai/beam_test/output")
>  )
> result = pipeline.run()
>  result.wait_until_finish()
> return
> {code}
>  
> Error message 1:
> {noformat}
> apache_beam.io.filesystem.BeamIOError: Match operation failed with exceptions 
> {'s3://mx-machine-learning/panwenhai/beam_test/output-*-of-1': 
> BeamIOError("List operation failed with exceptions 
> {'s3://mx-machine-learning/panwenhai/beam_test/output-': S3ClientError('Tried 
> to list nonexistent S3 path: 
> s3://mx-machine-learning/panwenhai/beam_test/output-', 404)}")} [while 
> running 'WriteToText/Write/WriteImpl/PreFinalize'] with exceptions 
> None{noformat}
>  
> After digging into the code, it seems the Boto3 client's list function will 
> raise an exception when trying to list a nonexistent S3 path 
> (beam/sdks/pythonapache_beam/io/aws/clients/s3/boto3_client.py line 111). And 
> the S3IO class does not handle this exception in list_prefix function 
> (beam/sdks/python/apache_beam/io/aws/s3io.py line 121).
> When the runner tries to list and delete the existing output file, if there 
> is no existing output file, it will try to list a nonexistent S3 path and 
> will trigger the exception.
> This should not be an issue here. I think we can ignore this exception safely 
> in the S3IO list_prefix function.
> Error Message 2:
> {noformat}
> File 
> "/Users/wenhai.pan/venvs/tfx/lib/python3.7/site-packages/apache_beam-2.19.0.dev0-py3.7.egg/apache_beam/io/aws/s3filesystem.py",
>  line 272, in delete
> exceptions = {path: error for (path, error) in results
> File 
> "/Users/wenhai.pan/venvs/tfx/lib/python3.7/site-packages/apache_beam-2.19.0.dev0-py3.7.egg/apache_beam/io/aws/s3filesystem.py",
>  line 272, in 
> exceptions = {path: error for (path, error) in results
> ValueError: too many values to unpack (expected 2) [while running 
> 'WriteToText/Write/WriteImpl/FinalizeWrite']{noformat}
>  
> When the runner tries to delete the temporary output directory, it will 
> trigger 

[jira] [Work logged] (BEAM-9029) Two bugs in Python SDK S3 filesystem support

2020-01-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9029?focusedWorklogId=369228=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-369228
 ]

ASF GitHub Bot logged work on BEAM-9029:


Author: ASF GitHub Bot
Created on: 09/Jan/20 18:23
Start Date: 09/Jan/20 18:23
Worklog Time Spent: 10m 
  Work Description: pabloem commented on issue #10459: [BEAM-9029]Fix two 
bugs in Python SDK S3 filesystem support
URL: https://github.com/apache/beam/pull/10459#issuecomment-572688890
 
 
   I'll take a look...
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 369228)
Remaining Estimate: 23.5h  (was: 23h 40m)
Time Spent: 0.5h  (was: 20m)

> Two bugs in Python SDK S3 filesystem support
> 
>
> Key: BEAM-9029
> URL: https://issues.apache.org/jira/browse/BEAM-9029
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Wenhai Pan
>Assignee: Wenhai Pan
>Priority: Major
>  Labels: pull-request-available
>   Original Estimate: 24h
>  Time Spent: 0.5h
>  Remaining Estimate: 23.5h
>
> Hi :)
> There seem to be 2 bugs in the S3 filesystem support.
> I tried to use S3 storage for a simple wordcount demo with DirectRunner.
> The demo script:
> {code:java}
> def main():
> options = PipelineOptions().view_as(StandardOptions)
>  options.runner = 'DirectRunner'
> pipeline = beam.Pipeline(options = options)
> (
>  pipeline
>  | ReadFromText("s3://mx-machine-learning/panwenhai/beam_test/test_data")
>  | "extract_words" >> beam.FlatMap(lambda x: re.findall(r" [A-Za-z\']+", x))
>  | beam.combiners.Count.PerElement()
>  | beam.MapTuple(lambda word, count: "%s: %s" % (word, count))
>  | WriteToText("s3://mx-machine-learning/panwenhai/beam_test/output")
>  )
> result = pipeline.run()
>  result.wait_until_finish()
> return
> {code}
>  
> Error message 1:
> {noformat}
> apache_beam.io.filesystem.BeamIOError: Match operation failed with exceptions 
> {'s3://mx-machine-learning/panwenhai/beam_test/output-*-of-1': 
> BeamIOError("List operation failed with exceptions 
> {'s3://mx-machine-learning/panwenhai/beam_test/output-': S3ClientError('Tried 
> to list nonexistent S3 path: 
> s3://mx-machine-learning/panwenhai/beam_test/output-', 404)}")} [while 
> running 'WriteToText/Write/WriteImpl/PreFinalize'] with exceptions 
> None{noformat}
>  
> After digging into the code, it seems the Boto3 client's list function will 
> raise an exception when trying to list a nonexistent S3 path 
> (beam/sdks/pythonapache_beam/io/aws/clients/s3/boto3_client.py line 111). And 
> the S3IO class does not handle this exception in list_prefix function 
> (beam/sdks/python/apache_beam/io/aws/s3io.py line 121).
> When the runner tries to list and delete the existing output file, if there 
> is no existing output file, it will try to list a nonexistent S3 path and 
> will trigger the exception.
> This should not be an issue here. I think we can ignore this exception safely 
> in the S3IO list_prefix function.
> Error Message 2:
> {noformat}
> File 
> "/Users/wenhai.pan/venvs/tfx/lib/python3.7/site-packages/apache_beam-2.19.0.dev0-py3.7.egg/apache_beam/io/aws/s3filesystem.py",
>  line 272, in delete
> exceptions = {path: error for (path, error) in results
> File 
> "/Users/wenhai.pan/venvs/tfx/lib/python3.7/site-packages/apache_beam-2.19.0.dev0-py3.7.egg/apache_beam/io/aws/s3filesystem.py",
>  line 272, in 
> exceptions = {path: error for (path, error) in results
> ValueError: too many values to unpack (expected 2) [while running 
> 'WriteToText/Write/WriteImpl/FinalizeWrite']{noformat}
>  
> When the runner tries to delete the temporary output directory, it will 
> trigger this exception. This exception is caused by parsing (path, error) 
> directly from the "results" which is a dict 
> (beam/sdks/python/apache_beam/io/aws/s3filesystem.py line 272). I think we 
> should use results.items() here.
> I have submitted a patch for these 2 bugs: 
> https://github.com/apache/beam/pull/10459
>  
> Thank you.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-9029) Two bugs in Python SDK S3 filesystem support

2019-12-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9029?focusedWorklogId=362902=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-362902
 ]

ASF GitHub Bot logged work on BEAM-9029:


Author: ASF GitHub Bot
Created on: 24/Dec/19 06:25
Start Date: 24/Dec/19 06:25
Worklog Time Spent: 10m 
  Work Description: icemoon1987 commented on issue #10459: [BEAM-9029]Fix 
two bugs in Python SDK S3 filesystem support
URL: https://github.com/apache/beam/pull/10459#issuecomment-568667881
 
 
   R: @pabloem @robertwb @aaltay @charlesccychen
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 362902)
Remaining Estimate: 23h 40m  (was: 23h 50m)
Time Spent: 20m  (was: 10m)

> Two bugs in Python SDK S3 filesystem support
> 
>
> Key: BEAM-9029
> URL: https://issues.apache.org/jira/browse/BEAM-9029
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Wenhai Pan
>Priority: Major
>  Labels: pull-request-available
>   Original Estimate: 24h
>  Time Spent: 20m
>  Remaining Estimate: 23h 40m
>
> Hi :)
> There seem to be 2 bugs in the S3 filesystem support.
> I tried to use S3 storage for a simple wordcount demo with DirectRunner.
> The demo script:
> {code:java}
> def main():
> options = PipelineOptions().view_as(StandardOptions)
>  options.runner = 'DirectRunner'
> pipeline = beam.Pipeline(options = options)
> (
>  pipeline
>  | ReadFromText("s3://mx-machine-learning/panwenhai/beam_test/test_data")
>  | "extract_words" >> beam.FlatMap(lambda x: re.findall(r" [A-Za-z\']+", x))
>  | beam.combiners.Count.PerElement()
>  | beam.MapTuple(lambda word, count: "%s: %s" % (word, count))
>  | WriteToText("s3://mx-machine-learning/panwenhai/beam_test/output")
>  )
> result = pipeline.run()
>  result.wait_until_finish()
> return
> {code}
>  
> Error message 1:
> {noformat}
> apache_beam.io.filesystem.BeamIOError: Match operation failed with exceptions 
> {'s3://mx-machine-learning/panwenhai/beam_test/output-*-of-1': 
> BeamIOError("List operation failed with exceptions 
> {'s3://mx-machine-learning/panwenhai/beam_test/output-': S3ClientError('Tried 
> to list nonexistent S3 path: 
> s3://mx-machine-learning/panwenhai/beam_test/output-', 404)}")} [while 
> running 'WriteToText/Write/WriteImpl/PreFinalize'] with exceptions 
> None{noformat}
>  
> After digging into the code, it seems the Boto3 client's list function will 
> raise an exception when trying to list a nonexistent S3 path 
> (beam/sdks/pythonapache_beam/io/aws/clients/s3/boto3_client.py line 111). And 
> the S3IO class does not handle this exception in list_prefix function 
> (beam/sdks/python/apache_beam/io/aws/s3io.py line 121).
> When the runner tries to list and delete the existing output file, if there 
> is no existing output file, it will try to list a nonexistent S3 path and 
> will trigger the exception.
> This should not be an issue here. I think we can ignore this exception safely 
> in the S3IO list_prefix function.
> Error Message 2:
> {noformat}
> File 
> "/Users/wenhai.pan/venvs/tfx/lib/python3.7/site-packages/apache_beam-2.19.0.dev0-py3.7.egg/apache_beam/io/aws/s3filesystem.py",
>  line 272, in delete
> exceptions = {path: error for (path, error) in results
> File 
> "/Users/wenhai.pan/venvs/tfx/lib/python3.7/site-packages/apache_beam-2.19.0.dev0-py3.7.egg/apache_beam/io/aws/s3filesystem.py",
>  line 272, in 
> exceptions = {path: error for (path, error) in results
> ValueError: too many values to unpack (expected 2) [while running 
> 'WriteToText/Write/WriteImpl/FinalizeWrite']{noformat}
>  
> When the runner tries to delete the temporary output directory, it will 
> trigger this exception. This exception is caused by parsing (path, error) 
> directly from the "results" which is a dict 
> (beam/sdks/python/apache_beam/io/aws/s3filesystem.py line 272). I think we 
> should use results.items() here.
> I have submitted a patch for these 2 bugs: 
> https://github.com/apache/beam/pull/10459
>  
> Thank you.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-9029) Two bugs in Python SDK S3 filesystem support

2019-12-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9029?focusedWorklogId=362879=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-362879
 ]

ASF GitHub Bot logged work on BEAM-9029:


Author: ASF GitHub Bot
Created on: 24/Dec/19 04:30
Start Date: 24/Dec/19 04:30
Worklog Time Spent: 10m 
  Work Description: icemoon1987 commented on pull request #10459: 
[BEAM-9029]Fix two bugs in Python SDK S3 filesystem support
URL: https://github.com/apache/beam/pull/10459
 
 
   Trying to fix the bugs on JIRA: 
https://issues.apache.org/jira/browse/BEAM-9029
   
   1. Ignore exception when trying to list a nonexistent S3 path;
   2. Fix parsing issue when deleting the temporary output directory.
   
   
   
   Thank you for your contribution! Follow this checklist to help us 
incorporate your contribution quickly and easily:
   
- [ ] [**Choose 
reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and 
mention them in a comment (`R: @username`).
- [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in 
ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA 
issue, if applicable. This will automatically link the pull request to the 
issue.
- [ ] If this contribution is large, please file an Apache [Individual 
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   
   See the [Contributor Guide](https://beam.apache.org/contribute) for more 
tips on [how to make review process 
smoother](https://beam.apache.org/contribute/#make-reviewers-job-easier).
   
   Post-Commit Tests Status (on master branch)
   

   
   Lang | SDK | Apex | Dataflow | Flink | Gearpump | Samza | Spark
   --- | --- | --- | --- | --- | --- | --- | ---
   Go | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/)
 | --- | --- | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Go_VR_Flink/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Go_VR_Flink/lastCompletedBuild/)
 | --- | --- | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Go_VR_Spark/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Go_VR_Spark/lastCompletedBuild/)
   Java | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Apex/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Apex/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/)[![Build
 
Status](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Batch/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Batch/lastCompletedBuild/)[![Build
 
Status](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Streaming/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Streaming/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Gearpump/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Gearpump/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastCompletedBuild/)[![Build
 
Status](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Spark_Batch/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Spark_Batch/lastCompletedBuild/)[![Build
 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_SparkStructuredStreaming/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_SparkStructuredStreaming/lastCompletedBuild/)
   Python | [![Build