[jira] [Work logged] (BEAM-4011) Python SDK: add glob support for HDFS
[ https://issues.apache.org/jira/browse/BEAM-4011?focusedWorklogId=147285=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-147285 ] ASF GitHub Bot logged work on BEAM-4011: Author: ASF GitHub Bot Created on: 24/Sep/18 20:24 Start Date: 24/Sep/18 20:24 Worklog Time Spent: 10m Work Description: udim commented on issue #5024: [BEAM-4011] Unify Python IO glob implementation. URL: https://github.com/apache/beam/pull/5024#issuecomment-424112346 @knub , I filed a bug for your side-note. Thanks for reporting! https://issues.apache.org/jira/browse/BEAM-5486 This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 147285) Time Spent: 3h 20m (was: 3h 10m) > Python SDK: add glob support for HDFS > - > > Key: BEAM-4011 > URL: https://issues.apache.org/jira/browse/BEAM-4011 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Udi Meiri >Assignee: Udi Meiri >Priority: Major > Fix For: 2.5.0 > > Time Spent: 3h 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Work logged] (BEAM-4011) Python SDK: add glob support for HDFS
[ https://issues.apache.org/jira/browse/BEAM-4011?focusedWorklogId=124874=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-124874 ] ASF GitHub Bot logged work on BEAM-4011: Author: ASF GitHub Bot Created on: 19/Jul/18 09:18 Start Date: 19/Jul/18 09:18 Worklog Time Spent: 10m Work Description: knub commented on issue #5024: [BEAM-4011] Unify Python IO glob implementation. URL: https://github.com/apache/beam/pull/5024#issuecomment-406211816 Hi udim, thanks for your quick response! For future reference, here's the full replacement with the same interface as before:
```python
from apache_beam.io.filesystems import FileSystems

def glob(path):
    match_result = FileSystems.match([path])[0]
    file_metadata_objects = match_result.metadata_list
    return [fm.path for fm in file_metadata_objects]
```
---
Side-note: I think there's a bug in the `FileSystems.match` implementation for certain globs, see the following snippet:
```ipynb
In [1]: from apache_beam.io.filesystems import FileSystems

In [2]: FileSystems.match(["gs:///*"])
/usr/local/lib/python2.7/dist-packages/apache_beam/io/gcp/gcsio.py:176: DeprecationWarning: object() takes no parameters
  super(GcsIO, cls).__new__(cls, storage_client))
WARNING:root:Retry with exponential backoff: waiting for 4.3505648468 seconds before retrying list_prefix because we caught exception: ValueError: GCS path must be in the form gs:///.
 Traceback for above exception (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/utils/retry.py", line 180, in wrapper
    return fun(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/gcp/gcsio.py", line 423, in list_prefix
    bucket, prefix = parse_gcs_path(path)
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/gcp/gcsio.py", line 147, in parse_gcs_path
    raise ValueError('GCS path must be in the form gs:///.')
```
This is an automated message from the Apache Git Service.
[jira] [Work logged] (BEAM-4011) Python SDK: add glob support for HDFS
[ https://issues.apache.org/jira/browse/BEAM-4011?focusedWorklogId=124347=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-124347 ] ASF GitHub Bot logged work on BEAM-4011: Author: ASF GitHub Bot Created on: 18/Jul/18 07:23 Start Date: 18/Jul/18 07:23 Worklog Time Spent: 10m Work Description: udim commented on issue #5024: [BEAM-4011] Unify Python IO glob implementation. URL: https://github.com/apache/beam/pull/5024#issuecomment-405835585 Hi @knub! First of all, I apologize for breaking the API. In the future we will label deprecated methods as such before removing them, and document a migration path. Calling `FileSystems.match()` should work as a drop-in replacement. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 124347) Time Spent: 3h (was: 2h 50m) > Python SDK: add glob support for HDFS > - > > Key: BEAM-4011 > URL: https://issues.apache.org/jira/browse/BEAM-4011 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Udi Meiri >Assignee: Udi Meiri >Priority: Major > Fix For: 2.5.0 > > Time Spent: 3h > Remaining Estimate: 0h > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Work logged] (BEAM-4011) Python SDK: add glob support for HDFS
[ https://issues.apache.org/jira/browse/BEAM-4011?focusedWorklogId=123660=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-123660 ] ASF GitHub Bot logged work on BEAM-4011: Author: ASF GitHub Bot Created on: 16/Jul/18 15:54 Start Date: 16/Jul/18 15:54 Worklog Time Spent: 10m Work Description: knub edited a comment on issue #5024: [BEAM-4011] Unify Python IO glob implementation. URL: https://github.com/apache/beam/pull/5024#issuecomment-405294775 @udim @chamikaramj Hi there! first of all, thanks for your work in Apache Beam! I have a problem: I just updated from `apache_beam==2.4.0` to `apache_beam==2.5.0`. Previously, I was using: ``` gcsio.GcsIO().glob(glob_string) ``` for queries against GCS. Now this previously public interface does not work anymore in `apache_beam==2.5.0`. After doing some digging, I found this PR which seems related. I can't find any documentation/update guide on how to migrate from the former to the latter version and also couldn't find out from reading the code. Can you help me? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 123660) Time Spent: 2h 50m (was: 2h 40m) > Python SDK: add glob support for HDFS > - > > Key: BEAM-4011 > URL: https://issues.apache.org/jira/browse/BEAM-4011 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Udi Meiri >Assignee: Udi Meiri >Priority: Major > Fix For: 2.5.0 > > Time Spent: 2h 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Work logged] (BEAM-4011) Python SDK: add glob support for HDFS
[ https://issues.apache.org/jira/browse/BEAM-4011?focusedWorklogId=123659=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-123659 ] ASF GitHub Bot logged work on BEAM-4011: Author: ASF GitHub Bot Created on: 16/Jul/18 15:53 Start Date: 16/Jul/18 15:53 Worklog Time Spent: 10m Work Description: knub commented on issue #5024: [BEAM-4011] Unify Python IO glob implementation. URL: https://github.com/apache/beam/pull/5024#issuecomment-405294775 @udim @chamikaramj Hi there! first of all, thanks for your work in Apache Beam! I have a problem. I just updated from `apache_beam==2.4.0` to `apache_beam==2.5.0`. I was previously using: ``` gcsio.GcsIO().glob(glob_string) ``` for queries against GCS. Now this previously public interface does not work anymore in `apache_beam==2.5.0`. After doing quite some digging, I found this PR which seems related. I can't find any documentation/update guide on how to migrate from the former to the latter version and also couldn't find out from reading the code. Can you help me? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 123659) Time Spent: 2h 40m (was: 2.5h) > Python SDK: add glob support for HDFS > - > > Key: BEAM-4011 > URL: https://issues.apache.org/jira/browse/BEAM-4011 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Udi Meiri >Assignee: Udi Meiri >Priority: Major > Fix For: 2.5.0 > > Time Spent: 2h 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Work logged] (BEAM-4011) Python SDK: add glob support for HDFS
[ https://issues.apache.org/jira/browse/BEAM-4011?focusedWorklogId=91574=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-91574 ] ASF GitHub Bot logged work on BEAM-4011: Author: ASF GitHub Bot Created on: 17/Apr/18 02:20 Start Date: 17/Apr/18 02:20 Worklog Time Spent: 10m Work Description: chamikaramj closed pull request #5024: [BEAM-4011] Unify Python IO glob implementation. URL: https://github.com/apache/beam/pull/5024 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic):

diff --git a/sdks/python/apache_beam/io/filesystem.py b/sdks/python/apache_beam/io/filesystem.py
index 3f7e9aba847..929b7ef5607 100644
--- a/sdks/python/apache_beam/io/filesystem.py
+++ b/sdks/python/apache_beam/io/filesystem.py
@@ -21,8 +21,11 @@
 import abc
 import bz2
 import cStringIO
+import fnmatch
 import logging
 import os
+import posixpath
+import re
 import time
 import zlib
 
@@ -498,9 +501,49 @@ def mkdirs(self, path):
     raise NotImplementedError
 
   @abc.abstractmethod
+  def has_dirs(self):
+    """Whether this FileSystem supports directories."""
+    raise NotImplementedError
+
+  @abc.abstractmethod
+  def _list(self, dir_or_prefix):
+    """List files in a location.
+
+    Listing is non-recursive (for filesystems that support directories).
+
+    Args:
+      dir_or_prefix: (string) A directory or location prefix (for filesystems
+        that don't have directories).
+
+    Returns:
+      Generator of ``FileMetadata`` objects.
+
+    Raises:
+      ``BeamIOError`` if listing fails, but not if no files were found.
+    """
+    raise NotImplementedError
+
+  @staticmethod
+  def _url_dirname(url_or_path):
+    """Like posixpath.dirname, but preserves scheme:// prefix.
+
+    Args:
+      url_or_path: A string in the form of scheme://some/path OR /some/path.
+    """
+    match = re.match(r'([a-z]+://)(.*)', url_or_path)
+    if match is None:
+      return posixpath.dirname(url_or_path)
+    url_prefix, path = match.groups()
+    return url_prefix + posixpath.dirname(path)
+
   def match(self, patterns, limits=None):
     """Find all matching paths to the patterns provided.
 
+    Pattern matching is done using fnmatch.fnmatch.
+    For filesystems that have directories, matching is not recursive. Patterns
+    like scheme://path/*/foo will not match anything.
+    Patterns ending with '/' will be appended with '*'.
+
     Args:
       patterns: list of string for the file path pattern to match against
       limits: list of maximum number of responses that need to be fetched
@@ -510,7 +553,52 @@ def match(self, patterns, limits=None):
     Raises:
       ``BeamIOError`` if any of the pattern match operations fail
     """
-    raise NotImplementedError
+    if limits is None:
+      limits = [None] * len(patterns)
+    else:
+      err_msg = "Patterns and limits should be equal in length"
+      assert len(patterns) == len(limits), err_msg
+
+    def _match(pattern, limit):
+      """Find all matching paths to the pattern provided."""
+      if pattern.endswith('/'):
+        pattern += '*'
+      # Get the part of the pattern before the first globbing character.
+      # For example scheme://path/foo* will become scheme://path/foo for
+      # filesystems like GCS, or converted to scheme://path for filesystems with
+      # directories.
+      prefix_or_dir = re.match('^[^[*?]*', pattern).group(0)
+
+      file_metadatas = []
+      if prefix_or_dir == pattern:
+        # Short-circuit calling self.list() if there's no glob pattern to match.
+        if self.exists(pattern):
+          file_metadatas = [FileMetadata(pattern, self.size(pattern))]
+      else:
+        if self.has_dirs():
+          prefix_or_dir = self._url_dirname(prefix_or_dir)
+        file_metadatas = self._list(prefix_or_dir)
+
+      metadata_list = []
+      for file_metadata in file_metadatas:
+        if limit is not None and len(metadata_list) >= limit:
+          break
+        if fnmatch.fnmatch(file_metadata.path, pattern):
+          metadata_list.append(file_metadata)
+
+      return MatchResult(pattern, metadata_list)
+
+    exceptions = {}
+    result = []
+    for pattern, limit in zip(patterns, limits):
+      try:
+        result.append(_match(pattern, limit))
+      except Exception as e:  # pylint: disable=broad-except
+        exceptions[pattern] = e
+
+    if exceptions:
+      raise BeamIOError("Match operation failed", exceptions)
+    return result
 
   @abc.abstractmethod
   def create(self, path, mime_type='application/octet-stream',
@@ -579,6 +667,19 @@ def exists(self, path):
     raise NotImplementedError
 
   @abc.abstractmethod
+  def size(self, path):
+    """Get
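
The matching flow in the diff above can be tried out without Beam. The following is only a minimal sketch under stated assumptions: `toy_match` and `url_dirname` are hypothetical names, the "filesystem" is a plain dict of path to size, and the `has_dirs` flag stands in for the abstract `has_dirs()` method. It is not the Beam implementation, and it returns bare paths rather than `FileMetadata` objects.

```python
import fnmatch
import posixpath
import re


def url_dirname(url_or_path):
    """Like posixpath.dirname, but keeps a leading scheme:// prefix."""
    m = re.match(r'([a-z]+://)(.*)', url_or_path)
    if m is None:
        return posixpath.dirname(url_or_path)
    url_prefix, path = m.groups()
    return url_prefix + posixpath.dirname(path)


def toy_match(files, pattern, has_dirs, limit=None):
    """Return paths from `files` (hypothetical dict: path -> size) matching `pattern`."""
    if pattern.endswith('/'):
        pattern += '*'
    # Everything before the first globbing character ('[', '*' or '?').
    prefix_or_dir = re.match(r'^[^\[*?]*', pattern).group(0)

    if prefix_or_dir == pattern:
        # No glob characters: a plain existence check, no listing needed.
        candidates = [pattern] if pattern in files else []
    elif has_dirs:
        # Directory-style listing: non-recursive, direct children only.
        directory = url_dirname(prefix_or_dir)
        candidates = [p for p in sorted(files) if url_dirname(p) == directory]
    else:
        # Prefix-style listing (object stores): every key with the prefix.
        candidates = [p for p in sorted(files) if p.startswith(prefix_or_dir)]

    matched = []
    for path in candidates:
        if limit is not None and len(matched) >= limit:
            break
        if fnmatch.fnmatch(path, pattern):
            matched.append(path)
    return matched


files = {
    'hdfs://dir/a.txt': 1,
    'hdfs://dir/b.txt': 2,
    'hdfs://dir/sub/c.txt': 3,
}
with_dirs = toy_match(files, 'hdfs://dir/*.txt', has_dirs=True)
with_prefixes = toy_match(files, 'hdfs://dir/*.txt', has_dirs=False)
```

With directory-style listing only the two direct children are candidates, while prefix-style listing also offers `hdfs://dir/sub/c.txt` to `fnmatch`, which accepts it because `*` is not path-aware.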
[jira] [Work logged] (BEAM-4011) Python SDK: add glob support for HDFS
[ https://issues.apache.org/jira/browse/BEAM-4011?focusedWorklogId=90665=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-90665 ] ASF GitHub Bot logged work on BEAM-4011: Author: ASF GitHub Bot Created on: 13/Apr/18 01:30 Start Date: 13/Apr/18 01:30 Worklog Time Spent: 10m Work Description: udim commented on a change in pull request #5024: [BEAM-4011] Unify Python IO glob implementation. URL: https://github.com/apache/beam/pull/5024#discussion_r181264323 ## File path: sdks/python/apache_beam/io/filesystem.py ## @@ -510,7 +551,48 @@ def match(self, patterns, limits=None): Raises: ``BeamIOError`` if any of the pattern match operations fail """ -raise NotImplementedError +if limits is None: + limits = [None] * len(patterns) +else: + err_msg = "Patterns and limits should be equal in length" + assert len(patterns) == len(limits), err_msg + +def _match(pattern, limit): + """Find all matching paths to the pattern provided.""" + if pattern.endswith('/'): +pattern += '*' + prefix_or_dir = re.match('^[^[*?]*', pattern).group(0) Review comment: done This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 90665) Time Spent: 2h (was: 1h 50m) > Python SDK: add glob support for HDFS > - > > Key: BEAM-4011 > URL: https://issues.apache.org/jira/browse/BEAM-4011 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Udi Meiri >Assignee: Udi Meiri >Priority: Major > Time Spent: 2h > Remaining Estimate: 0h > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Work logged] (BEAM-4011) Python SDK: add glob support for HDFS
[ https://issues.apache.org/jira/browse/BEAM-4011?focusedWorklogId=90667=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-90667 ] ASF GitHub Bot logged work on BEAM-4011: Author: ASF GitHub Bot Created on: 13/Apr/18 01:30 Start Date: 13/Apr/18 01:30 Worklog Time Spent: 10m Work Description: udim commented on a change in pull request #5024: [BEAM-4011] Unify Python IO glob implementation. URL: https://github.com/apache/beam/pull/5024#discussion_r181263543 ## File path: sdks/python/apache_beam/io/filesystem.py ## @@ -498,9 +501,47 @@ def mkdirs(self, path): raise NotImplementedError @abc.abstractmethod + def has_dirs(self): +"""Whether this FileSystem supports directories.""" +raise NotImplementedError + + @abc.abstractmethod + def list(self, dir_or_prefix): Review comment: done This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 90667) Time Spent: 2h 20m (was: 2h 10m) > Python SDK: add glob support for HDFS > - > > Key: BEAM-4011 > URL: https://issues.apache.org/jira/browse/BEAM-4011 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Udi Meiri >Assignee: Udi Meiri >Priority: Major > Time Spent: 2h 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Work logged] (BEAM-4011) Python SDK: add glob support for HDFS
[ https://issues.apache.org/jira/browse/BEAM-4011?focusedWorklogId=90666=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-90666 ] ASF GitHub Bot logged work on BEAM-4011: Author: ASF GitHub Bot Created on: 13/Apr/18 01:30 Start Date: 13/Apr/18 01:30 Worklog Time Spent: 10m Work Description: udim commented on a change in pull request #5024: [BEAM-4011] Unify Python IO glob implementation. URL: https://github.com/apache/beam/pull/5024#discussion_r181263847 ## File path: sdks/python/apache_beam/io/filesystem.py ## @@ -498,9 +501,47 @@ def mkdirs(self, path): raise NotImplementedError @abc.abstractmethod + def has_dirs(self): +"""Whether this FileSystem supports directories.""" +raise NotImplementedError + + @abc.abstractmethod + def list(self, dir_or_prefix): +"""List files in a location. + +Listing is non-recursive (for filesystems that support directories). + +Args: + dir_or_prefix: (string) A directory or location prefix (for filesystems +that don't have directories). + +Returns: + Generator of ``FileMetadata`` objects. + +Raises: + ``BeamIOError`` if listing fails, but not if no files were found. +""" +raise NotImplementedError + + @staticmethod + def _url_dirname(url_or_path): +"""Like posixpath.dirname, but preserves scheme:// prefix. + +Args: + url_or_path: A string in the form of scheme://some/path OR /some/path. +""" +match = re.match(r'([a-z]+://)(.*)', url_or_path) +if match is None: + return posixpath.dirname(url_or_path) +url_prefix, path = match.groups() +return url_prefix + posixpath.dirname(path) + def match(self, patterns, limits=None): """Find all matching paths to the patterns provided. +Pattern matching is done using fnmatch.fnmatch. Review comment: Clarified. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. 
[jira] [Work logged] (BEAM-4011) Python SDK: add glob support for HDFS
[ https://issues.apache.org/jira/browse/BEAM-4011?focusedWorklogId=90664=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-90664 ] ASF GitHub Bot logged work on BEAM-4011: Author: ASF GitHub Bot Created on: 13/Apr/18 01:30 Start Date: 13/Apr/18 01:30 Worklog Time Spent: 10m Work Description: udim commented on a change in pull request #5024: [BEAM-4011] Unify Python IO glob implementation. URL: https://github.com/apache/beam/pull/5024#discussion_r181265403 ## File path: sdks/python/apache_beam/io/filesystem.py ## @@ -579,6 +661,19 @@ def exists(self, path): raise NotImplementedError @abc.abstractmethod + def size(self, path): Review comment: _list() returns sizes, but it only works on prefixes (may return more than one result) or directories (fails on files). I added this method for the case where the pattern given doesn't end in `/` and has no globbing characters. In this case we return the size of the file or directory pointed to by the pattern (if it exists). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 90664) Time Spent: 2h (was: 1h 50m) > Python SDK: add glob support for HDFS > - > > Key: BEAM-4011 > URL: https://issues.apache.org/jira/browse/BEAM-4011 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Udi Meiri >Assignee: Udi Meiri >Priority: Major > Time Spent: 2h > Remaining Estimate: 0h > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Work logged] (BEAM-4011) Python SDK: add glob support for HDFS
[ https://issues.apache.org/jira/browse/BEAM-4011?focusedWorklogId=90663=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-90663 ] ASF GitHub Bot logged work on BEAM-4011: Author: ASF GitHub Bot Created on: 13/Apr/18 01:30 Start Date: 13/Apr/18 01:30 Worklog Time Spent: 10m Work Description: udim commented on a change in pull request #5024: [BEAM-4011] Unify Python IO glob implementation. URL: https://github.com/apache/beam/pull/5024#discussion_r181264935 ## File path: sdks/python/apache_beam/io/filesystem.py ## @@ -510,7 +551,48 @@ def match(self, patterns, limits=None): Raises: ``BeamIOError`` if any of the pattern match operations fail """ -raise NotImplementedError +if limits is None: + limits = [None] * len(patterns) +else: + err_msg = "Patterns and limits should be equal in length" + assert len(patterns) == len(limits), err_msg + +def _match(pattern, limit): + """Find all matching paths to the pattern provided.""" + if pattern.endswith('/'): +pattern += '*' + prefix_or_dir = re.match('^[^[*?]*', pattern).group(0) + + file_metadatas = [] + if prefix_or_dir == pattern: +# Short-circuit calling self.list() if there's no glob pattern to match. +if self.exists(pattern): + file_metadatas = [FileMetadata(pattern, self.size(pattern))] + else: +if self.has_dirs(): Review comment: Yes. I explained it in the new comment a little above this line. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. 
[jira] [Work logged] (BEAM-4011) Python SDK: add glob support for HDFS
[ https://issues.apache.org/jira/browse/BEAM-4011?focusedWorklogId=90519=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-90519 ] ASF GitHub Bot logged work on BEAM-4011: Author: ASF GitHub Bot Created on: 12/Apr/18 18:15 Start Date: 12/Apr/18 18:15 Worklog Time Spent: 10m Work Description: chamikaramj commented on a change in pull request #5024: [BEAM-4011] Unify Python IO glob implementation. URL: https://github.com/apache/beam/pull/5024#discussion_r181169858 ## File path: sdks/python/apache_beam/io/filesystem.py ## @@ -498,9 +501,47 @@ def mkdirs(self, path): raise NotImplementedError @abc.abstractmethod + def has_dirs(self): +"""Whether this FileSystem supports directories.""" +raise NotImplementedError + + @abc.abstractmethod + def list(self, dir_or_prefix): +"""List files in a location. + +Listing is non-recursive (for filesystems that support directories). + +Args: + dir_or_prefix: (string) A directory or location prefix (for filesystems +that don't have directories). + +Returns: + Generator of ``FileMetadata`` objects. + +Raises: + ``BeamIOError`` if listing fails, but not if no files were found. +""" +raise NotImplementedError + + @staticmethod + def _url_dirname(url_or_path): +"""Like posixpath.dirname, but preserves scheme:// prefix. + +Args: + url_or_path: A string in the form of scheme://some/path OR /some/path. +""" +match = re.match(r'([a-z]+://)(.*)', url_or_path) +if match is None: + return posixpath.dirname(url_or_path) +url_prefix, path = match.groups() +return url_prefix + posixpath.dirname(path) + def match(self, patterns, limits=None): """Find all matching paths to the patterns provided. +Pattern matching is done using fnmatch.fnmatch. Review comment: I think we should clarify whether match for \/* is recursive or not. I think most users will use a pattern in the form \ /\ * to avoid matching all sub-directories anyway. This is an automated message from the Apache Git Service. 
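
On the recursion question raised in this review comment, it may help to see that `fnmatch` itself is not path-aware: its `*` also matches `/`. The sketch below uses made-up paths and a hypothetical `direct_children` filter to contrast the two behaviours; whether a real `match()` call descends into sub-directories therefore depends on what the filesystem listing returns, not on `fnmatch`.

```python
import fnmatch
import posixpath

# Hypothetical paths, for illustration only.
paths = [
    'scheme://dir/a.txt',
    'scheme://dir/sub/b.txt',
]
pattern = 'scheme://dir/*'

# fnmatch treats '/' like any other character, so both paths match.
matches = [p for p in paths if fnmatch.fnmatch(p, pattern)]

# A non-recursive result needs an extra restriction, for example keeping
# only paths whose parent is exactly the listed directory.
direct_children = [p for p in matches if posixpath.dirname(p) == 'scheme://dir']
```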
[jira] [Work logged] (BEAM-4011) Python SDK: add glob support for HDFS
[ https://issues.apache.org/jira/browse/BEAM-4011?focusedWorklogId=90516=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-90516 ] ASF GitHub Bot logged work on BEAM-4011: Author: ASF GitHub Bot Created on: 12/Apr/18 18:15 Start Date: 12/Apr/18 18:15 Worklog Time Spent: 10m Work Description: chamikaramj commented on a change in pull request #5024: [BEAM-4011] Unify Python IO glob implementation. URL: https://github.com/apache/beam/pull/5024#discussion_r181165982 ## File path: sdks/python/apache_beam/io/filesystem.py ## @@ -498,9 +501,47 @@ def mkdirs(self, path): raise NotImplementedError @abc.abstractmethod + def has_dirs(self): +"""Whether this FileSystem supports directories.""" +raise NotImplementedError + + @abc.abstractmethod + def list(self, dir_or_prefix): Review comment: I think having both list() and match() as public can be confusing to users. Let's keep list() as private (and move it to filesystem implementations) if there's no compelling use-case to keep it in the interface. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 90516) Time Spent: 1h 20m (was: 1h 10m) > Python SDK: add glob support for HDFS > - > > Key: BEAM-4011 > URL: https://issues.apache.org/jira/browse/BEAM-4011 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Udi Meiri >Assignee: Udi Meiri >Priority: Major > Time Spent: 1h 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Work logged] (BEAM-4011) Python SDK: add glob support for HDFS
[ https://issues.apache.org/jira/browse/BEAM-4011?focusedWorklogId=90517=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-90517 ] ASF GitHub Bot logged work on BEAM-4011: Author: ASF GitHub Bot Created on: 12/Apr/18 18:15 Start Date: 12/Apr/18 18:15 Worklog Time Spent: 10m Work Description: chamikaramj commented on a change in pull request #5024: [BEAM-4011] Unify Python IO glob implementation. URL: https://github.com/apache/beam/pull/5024#discussion_r181161482 ## File path: sdks/python/apache_beam/io/filesystem.py ## @@ -510,7 +551,48 @@ def match(self, patterns, limits=None): Raises: ``BeamIOError`` if any of the pattern match operations fail """ -raise NotImplementedError +if limits is None: + limits = [None] * len(patterns) +else: + err_msg = "Patterns and limits should be equal in length" + assert len(patterns) == len(limits), err_msg + +def _match(pattern, limit): + """Find all matching paths to the pattern provided.""" + if pattern.endswith('/'): +pattern += '*' + prefix_or_dir = re.match('^[^[*?]*', pattern).group(0) Review comment: Please add a comment explaining this regex. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 90517) Time Spent: 1.5h (was: 1h 20m) > Python SDK: add glob support for HDFS > - > > Key: BEAM-4011 > URL: https://issues.apache.org/jira/browse/BEAM-4011 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Udi Meiri >Assignee: Udi Meiri >Priority: Major > Time Spent: 1.5h > Remaining Estimate: 0h > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
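
For readers wondering about the regex under review: the character class `[^[*?]` means "any character except `[`, `*` or `?`", so the match is the longest glob-free prefix of the pattern. A small sketch follows, with the hypothetical helper name `literal_prefix` and the bracket escaped (`\[`) purely for readability; the escaped form is equivalent to the expression in the diff.

```python
import re

# Longest run of characters from the start that contains no glob
# metacharacter ('[', '*' or '?').
GLOB_FREE_PREFIX = re.compile(r'^[^\[*?]*')


def literal_prefix(pattern):
    """Return the part of `pattern` before its first globbing character."""
    return GLOB_FREE_PREFIX.match(pattern).group(0)


prefix = literal_prefix('scheme://path/foo*')
```

Because the class is followed by `*` (zero or more), the regex always matches, so `.group(0)` never fails; a pattern that starts with a glob character simply yields an empty prefix.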
[jira] [Work logged] (BEAM-4011) Python SDK: add glob support for HDFS
[ https://issues.apache.org/jira/browse/BEAM-4011?focusedWorklogId=90515=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-90515 ] ASF GitHub Bot logged work on BEAM-4011: Author: ASF GitHub Bot Created on: 12/Apr/18 18:15 Start Date: 12/Apr/18 18:15 Worklog Time Spent: 10m Work Description: chamikaramj commented on a change in pull request #5024: [BEAM-4011] Unify Python IO glob implementation. URL: https://github.com/apache/beam/pull/5024#discussion_r181162315 ## File path: sdks/python/apache_beam/io/filesystem.py ## @@ -510,7 +551,48 @@ def match(self, patterns, limits=None): Raises: ``BeamIOError`` if any of the pattern match operations fail """ -raise NotImplementedError +if limits is None: + limits = [None] * len(patterns) +else: + err_msg = "Patterns and limits should be equal in length" + assert len(patterns) == len(limits), err_msg + +def _match(pattern, limit): + """Find all matching paths to the pattern provided.""" + if pattern.endswith('/'): +pattern += '*' + prefix_or_dir = re.match('^[^[*?]*', pattern).group(0) + + file_metadatas = [] + if prefix_or_dir == pattern: +# Short-circuit calling self.list() if there's no glob pattern to match. +if self.exists(pattern): + file_metadatas = [FileMetadata(pattern, self.size(pattern))] + else: +if self.has_dirs(): Review comment: Wasn't sure from code, but do we try to list "\/\*" for "\ /\ *" ? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. 
[jira] [Work logged] (BEAM-4011) Python SDK: add glob support for HDFS
[ https://issues.apache.org/jira/browse/BEAM-4011?focusedWorklogId=90518=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-90518 ] ASF GitHub Bot logged work on BEAM-4011: Author: ASF GitHub Bot Created on: 12/Apr/18 18:15 Start Date: 12/Apr/18 18:15 Worklog Time Spent: 10m Work Description: chamikaramj commented on a change in pull request #5024: [BEAM-4011] Unify Python IO glob implementation. URL: https://github.com/apache/beam/pull/5024#discussion_r181163732 ## File path: sdks/python/apache_beam/io/filesystem.py ## @@ -579,6 +661,19 @@ def exists(self, path): raise NotImplementedError @abc.abstractmethod + def size(self, path): Review comment: Why do we need a separate method for size()? I think we can already stat files using the match() method (it returns FileMetadata objects). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 90518) Time Spent: 1h 40m (was: 1.5h) > Python SDK: add glob support for HDFS > - > > Key: BEAM-4011 > URL: https://issues.apache.org/jira/browse/BEAM-4011 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Udi Meiri >Assignee: Udi Meiri >Priority: Major > Time Spent: 1h 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Work logged] (BEAM-4011) Python SDK: add glob support for HDFS
[ https://issues.apache.org/jira/browse/BEAM-4011?focusedWorklogId=88535&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-88535 ]

ASF GitHub Bot logged work on BEAM-4011:

Author: ASF GitHub Bot
Created on: 06/Apr/18 17:56
Start Date: 06/Apr/18 17:56
Worklog Time Spent: 10m
Work Description: udim commented on issue #5024: [BEAM-4011] Unify Python IO glob implementation.
URL: https://github.com/apache/beam/pull/5024#issuecomment-379329439

This code is working. I've opened BEAM-4027 for the precommit failure with cython.

Issue Time Tracking
---
Worklog Id: (was: 88535)
Time Spent: 40m (was: 0.5h)
[jira] [Work logged] (BEAM-4011) Python SDK: add glob support for HDFS
[ https://issues.apache.org/jira/browse/BEAM-4011?focusedWorklogId=88536&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-88536 ]

ASF GitHub Bot logged work on BEAM-4011:

Author: ASF GitHub Bot
Created on: 06/Apr/18 17:56
Start Date: 06/Apr/18 17:56
Worklog Time Spent: 10m
Work Description: udim commented on issue #5024: [BEAM-4011] Unify Python IO glob implementation.
URL: https://github.com/apache/beam/pull/5024#issuecomment-379329516

R: @chamikaramj

Issue Time Tracking
---
Worklog Id: (was: 88536)
Time Spent: 50m (was: 40m)
[jira] [Work logged] (BEAM-4011) Python SDK: add glob support for HDFS
[ https://issues.apache.org/jira/browse/BEAM-4011?focusedWorklogId=88292&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-88292 ]

ASF GitHub Bot logged work on BEAM-4011:

Author: ASF GitHub Bot
Created on: 06/Apr/18 00:54
Start Date: 06/Apr/18 00:54
Worklog Time Spent: 10m
Work Description: udim commented on issue #5024: [BEAM-4011] Unify Python IO glob implementation.
URL: https://github.com/apache/beam/pull/5024#issuecomment-379118600

run python postcommit

Issue Time Tracking
---
Worklog Id: (was: 88292)
Time Spent: 0.5h (was: 20m)
[jira] [Work logged] (BEAM-4011) Python SDK: add glob support for HDFS
[ https://issues.apache.org/jira/browse/BEAM-4011?focusedWorklogId=87735&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-87735 ]

ASF GitHub Bot logged work on BEAM-4011:

Author: ASF GitHub Bot
Created on: 04/Apr/18 19:26
Start Date: 04/Apr/18 19:26
Worklog Time Spent: 10m
Work Description: udim commented on issue #5024: [BEAM-4011] Normalize Filesystems.match() glob behavior.
URL: https://github.com/apache/beam/pull/5024#issuecomment-378716918

Do not merge before https://github.com/apache/beam/pull/4979

Issue Time Tracking
---
Worklog Id: (was: 87735)
Time Spent: 20m (was: 10m)
[jira] [Work logged] (BEAM-4011) Python SDK: add glob support for HDFS
[ https://issues.apache.org/jira/browse/BEAM-4011?focusedWorklogId=87724&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-87724 ]

ASF GitHub Bot logged work on BEAM-4011:

Author: ASF GitHub Bot
Created on: 04/Apr/18 19:03
Start Date: 04/Apr/18 19:03
Worklog Time Spent: 10m
Work Description: udim opened a new pull request #5024: [BEAM-4011] Normalize Filesystems.match() glob behavior.
URL: https://github.com/apache/beam/pull/5024

- Introduces the FileSystem.list() abstract method, which lists a directory or prefix.
- Implements FileSystem.match(), which is no longer abstract and unifies glob behavior using fnmatch.fnmatch.

DESCRIPTION HERE

Follow this checklist to help us incorporate your contribution quickly and easily:

- [ ] Make sure there is a [JIRA issue](https://issues.apache.org/jira/projects/BEAM/issues/) filed for the change (usually before you start working on it). Trivial changes like typos do not require a JIRA issue. Your pull request should address just this issue, without pulling in other changes.
- [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue.
- [ ] Write a pull request description that is detailed enough to understand:
  - [ ] What the pull request does
  - [ ] Why it does it
  - [ ] How it does it
  - [ ] Why this approach
- [ ] Each commit in the pull request should have a meaningful subject line and body.
- [ ] Run `mvn clean verify` to make sure basic checks pass. A more thorough check will be performed on your pull request automatically.
- [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).

Issue Time Tracking
---
Worklog Id: (was: 87724)
Time Spent: 10m
Remaining Estimate: 0h
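The pull request description above names two pieces: an abstract list() that enumerates a directory or prefix, and a shared match() that filters those results with fnmatch.fnmatch. The following is a rough sketch of how such a split could look, not the code from the pull request. `SketchFileSystem`, `DictFileSystem`, the namedtuple stand-ins and the prefix-extraction regex are all hypothetical names chosen for illustration, and it uses only the standard library.

```python
import abc
import collections
import fnmatch
import re

# Hypothetical stand-ins for the metadata and result types.
FileMetadata = collections.namedtuple('FileMetadata', ['path', 'size_in_bytes'])
MatchResult = collections.namedtuple('MatchResult', ['pattern', 'metadata_list'])

# First glob metacharacter understood by fnmatch: *, ? or [.
_GLOB_CHARS = re.compile(r'[*?\[]')


class SketchFileSystem(abc.ABC):
  """Illustrative base class: subclasses only implement list()."""

  @abc.abstractmethod
  def list(self, dir_or_prefix):
    """Yields FileMetadata for every file whose path starts with the prefix."""

  def match(self, patterns):
    """Shared glob logic: list the literal prefix, then filter with fnmatch."""
    results = []
    for pattern in patterns:
      found = _GLOB_CHARS.search(pattern)
      # Everything before the first metacharacter is a literal prefix.
      prefix = pattern if found is None else pattern[:found.start()]
      matched = [metadata for metadata in self.list(prefix)
                 if fnmatch.fnmatch(metadata.path, pattern)]
      results.append(MatchResult(pattern, matched))
    return results


class DictFileSystem(SketchFileSystem):
  """Toy backend (hypothetical) storing files in a dict of path -> bytes."""

  def __init__(self, files):
    self._files = dict(files)

  def list(self, dir_or_prefix):
    for path in sorted(self._files):
      if path.startswith(dir_or_prefix):
        yield FileMetadata(path, len(self._files[path]))
```

One consequence of relying on the fnmatch module is that `*` also matches the `/` separator, so in this sketch a pattern such as `/data/*.txt` matches `/data/sub/c.txt` as well as `/data/a.txt`. Whether that is the behavior each backend should expose is left open here.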