[jira] [Work logged] (BEAM-4011) Python SDK: add glob support for HDFS

2018-09-24 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4011?focusedWorklogId=147285=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-147285
 ]

ASF GitHub Bot logged work on BEAM-4011:


Author: ASF GitHub Bot
Created on: 24/Sep/18 20:24
Start Date: 24/Sep/18 20:24
Worklog Time Spent: 10m 
  Work Description: udim commented on issue #5024: [BEAM-4011] Unify Python 
IO glob implementation.
URL: https://github.com/apache/beam/pull/5024#issuecomment-424112346
 
 
   @knub , I filed a bug for your side-note. Thanks for reporting!
   https://issues.apache.org/jira/browse/BEAM-5486


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 147285)
Time Spent: 3h 20m  (was: 3h 10m)

> Python SDK: add glob support for HDFS
> -
>
> Key: BEAM-4011
> URL: https://issues.apache.org/jira/browse/BEAM-4011
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Udi Meiri
>Assignee: Udi Meiri
>Priority: Major
> Fix For: 2.5.0
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-4011) Python SDK: add glob support for HDFS

2018-07-19 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4011?focusedWorklogId=124874=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-124874
 ]

ASF GitHub Bot logged work on BEAM-4011:


Author: ASF GitHub Bot
Created on: 19/Jul/18 09:18
Start Date: 19/Jul/18 09:18
Worklog Time Spent: 10m 
  Work Description: knub commented on issue #5024: [BEAM-4011] Unify Python 
IO glob implementation.
URL: https://github.com/apache/beam/pull/5024#issuecomment-406211816
 
 
   Hi udim,
   
   thanks for your quick response! For future reference, here's the full 
replacement with the same interface as before:
   
   ```python
   from apache_beam.io.filesystems import FileSystems
   
   def glob(path):
       match_result = FileSystems.match([path])[0]
       file_metadata_objects = match_result.metadata_list
       return [fm.path for fm in file_metadata_objects]
   ```
   
   ---
   
   Side-note: I think there's a bug in the `FileSystems.match` implementation 
for certain globs, see the following snippet:
   
   ```ipynb
   In [1]: from apache_beam.io.filesystems import FileSystems
   
   In [2]: FileSystems.match(["gs:///*"])
   /usr/local/lib/python2.7/dist-packages/apache_beam/io/gcp/gcsio.py:176: DeprecationWarning: object() takes no parameters
     super(GcsIO, cls).__new__(cls, storage_client))
   WARNING:root:Retry with exponential backoff: waiting for 4.3505648468 seconds before retrying list_prefix because we caught exception: ValueError: GCS path must be in the form gs:///.
   Traceback for above exception (most recent call last):
     File "/usr/local/lib/python2.7/dist-packages/apache_beam/utils/retry.py", line 180, in wrapper
       return fun(*args, **kwargs)
     File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/gcp/gcsio.py", line 423, in list_prefix
       bucket, prefix = parse_gcs_path(path)
     File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/gcp/gcsio.py", line 147, in parse_gcs_path
       raise ValueError('GCS path must be in the form gs:///.')
   ```
   




Issue Time Tracking
---

Worklog Id: (was: 124874)
Time Spent: 3h 10m  (was: 3h)

> Python SDK: add glob support for HDFS
> -
>
> Key: BEAM-4011
> URL: https://issues.apache.org/jira/browse/BEAM-4011
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Udi Meiri
>Assignee: Udi Meiri
>Priority: Major
> Fix For: 2.5.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>






[jira] [Work logged] (BEAM-4011) Python SDK: add glob support for HDFS

2018-07-18 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4011?focusedWorklogId=124347=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-124347
 ]

ASF GitHub Bot logged work on BEAM-4011:


Author: ASF GitHub Bot
Created on: 18/Jul/18 07:23
Start Date: 18/Jul/18 07:23
Worklog Time Spent: 10m 
  Work Description: udim commented on issue #5024: [BEAM-4011] Unify Python 
IO glob implementation.
URL: https://github.com/apache/beam/pull/5024#issuecomment-405835585
 
 
   Hi @knub !
   First of all, I apologize for breaking the API. In the future we will label 
deprecated methods as such before removing them, and document a migration path.
   
   Calling `FileSystems.match()` should work as a drop-in replacement.




Issue Time Tracking
---

Worklog Id: (was: 124347)
Time Spent: 3h  (was: 2h 50m)

> Python SDK: add glob support for HDFS
> -
>
> Key: BEAM-4011
> URL: https://issues.apache.org/jira/browse/BEAM-4011
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Udi Meiri
>Assignee: Udi Meiri
>Priority: Major
> Fix For: 2.5.0
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>






[jira] [Work logged] (BEAM-4011) Python SDK: add glob support for HDFS

2018-07-16 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4011?focusedWorklogId=123660=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-123660
 ]

ASF GitHub Bot logged work on BEAM-4011:


Author: ASF GitHub Bot
Created on: 16/Jul/18 15:54
Start Date: 16/Jul/18 15:54
Worklog Time Spent: 10m 
  Work Description: knub edited a comment on issue #5024: [BEAM-4011] Unify 
Python IO glob implementation.
URL: https://github.com/apache/beam/pull/5024#issuecomment-405294775
 
 
   @udim @chamikaramj 
   Hi there!
   
   first of all, thanks for your work in Apache Beam!
   
   I have a problem: I just updated from `apache_beam==2.4.0` to 
`apache_beam==2.5.0`.
   Previously, I was using:
   ```
   gcsio.GcsIO().glob(glob_string)
   ```
   for queries against GCS. Now this previously public interface does not work 
anymore in `apache_beam==2.5.0`.
   After doing some digging, I found this PR which seems related. I can't find 
any documentation/update guide on how to migrate from the former to the latter 
version and also couldn't find out from reading the code. Can you help me?




Issue Time Tracking
---

Worklog Id: (was: 123660)
Time Spent: 2h 50m  (was: 2h 40m)

> Python SDK: add glob support for HDFS
> -
>
> Key: BEAM-4011
> URL: https://issues.apache.org/jira/browse/BEAM-4011
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Udi Meiri
>Assignee: Udi Meiri
>Priority: Major
> Fix For: 2.5.0
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>






[jira] [Work logged] (BEAM-4011) Python SDK: add glob support for HDFS

2018-07-16 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4011?focusedWorklogId=123659=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-123659
 ]

ASF GitHub Bot logged work on BEAM-4011:


Author: ASF GitHub Bot
Created on: 16/Jul/18 15:53
Start Date: 16/Jul/18 15:53
Worklog Time Spent: 10m 
  Work Description: knub commented on issue #5024: [BEAM-4011] Unify Python 
IO glob implementation.
URL: https://github.com/apache/beam/pull/5024#issuecomment-405294775
 
 
   @udim @chamikaramj 
   Hi there!
   
   first of all, thanks for your work in Apache Beam!
   
   I have a problem. I just updated from `apache_beam==2.4.0` to 
`apache_beam==2.5.0`.
   I was previously using:
   ```
   gcsio.GcsIO().glob(glob_string)
   ```
   for queries against GCS. Now this previously public interface does not work 
anymore in `apache_beam==2.5.0`.
   After doing quite some digging, I found this PR which seems related. I can't 
find any documentation/update guide on how to migrate from the former to the 
latter version and also couldn't find out from reading the code. Can you help 
me?




Issue Time Tracking
---

Worklog Id: (was: 123659)
Time Spent: 2h 40m  (was: 2.5h)

> Python SDK: add glob support for HDFS
> -
>
> Key: BEAM-4011
> URL: https://issues.apache.org/jira/browse/BEAM-4011
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Udi Meiri
>Assignee: Udi Meiri
>Priority: Major
> Fix For: 2.5.0
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>






[jira] [Work logged] (BEAM-4011) Python SDK: add glob support for HDFS

2018-04-16 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-4011?focusedWorklogId=91574=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-91574
 ]

ASF GitHub Bot logged work on BEAM-4011:


Author: ASF GitHub Bot
Created on: 17/Apr/18 02:20
Start Date: 17/Apr/18 02:20
Worklog Time Spent: 10m 
  Work Description: chamikaramj closed pull request #5024: [BEAM-4011] 
Unify Python IO glob implementation.
URL: https://github.com/apache/beam/pull/5024
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/sdks/python/apache_beam/io/filesystem.py b/sdks/python/apache_beam/io/filesystem.py
index 3f7e9aba847..929b7ef5607 100644
--- a/sdks/python/apache_beam/io/filesystem.py
+++ b/sdks/python/apache_beam/io/filesystem.py
@@ -21,8 +21,11 @@
 import abc
 import bz2
 import cStringIO
+import fnmatch
 import logging
 import os
+import posixpath
+import re
 import time
 import zlib
 
@@ -498,9 +501,49 @@ def mkdirs(self, path):
     raise NotImplementedError
 
   @abc.abstractmethod
+  def has_dirs(self):
+    """Whether this FileSystem supports directories."""
+    raise NotImplementedError
+
+  @abc.abstractmethod
+  def _list(self, dir_or_prefix):
+    """List files in a location.
+
+    Listing is non-recursive (for filesystems that support directories).
+
+    Args:
+      dir_or_prefix: (string) A directory or location prefix (for filesystems
+        that don't have directories).
+
+    Returns:
+      Generator of ``FileMetadata`` objects.
+
+    Raises:
+      ``BeamIOError`` if listing fails, but not if no files were found.
+    """
+    raise NotImplementedError
+
+  @staticmethod
+  def _url_dirname(url_or_path):
+    """Like posixpath.dirname, but preserves scheme:// prefix.
+
+    Args:
+      url_or_path: A string in the form of scheme://some/path OR /some/path.
+    """
+    match = re.match(r'([a-z]+://)(.*)', url_or_path)
+    if match is None:
+      return posixpath.dirname(url_or_path)
+    url_prefix, path = match.groups()
+    return url_prefix + posixpath.dirname(path)
+
   def match(self, patterns, limits=None):
     """Find all matching paths to the patterns provided.
 
+    Pattern matching is done using fnmatch.fnmatch.
+    For filesystems that have directories, matching is not recursive. Patterns
+    like scheme://path/*/foo will not match anything.
+    Patterns ending with '/' will be appended with '*'.
+
     Args:
       patterns: list of string for the file path pattern to match against
       limits: list of maximum number of responses that need to be fetched
@@ -510,7 +553,52 @@ def match(self, patterns, limits=None):
     Raises:
       ``BeamIOError`` if any of the pattern match operations fail
     """
-    raise NotImplementedError
+    if limits is None:
+      limits = [None] * len(patterns)
+    else:
+      err_msg = "Patterns and limits should be equal in length"
+      assert len(patterns) == len(limits), err_msg
+
+    def _match(pattern, limit):
+      """Find all matching paths to the pattern provided."""
+      if pattern.endswith('/'):
+        pattern += '*'
+      # Get the part of the pattern before the first globbing character.
+      # For example scheme://path/foo* will become scheme://path/foo for
+      # filesystems like GCS, or converted to scheme://path for filesystems with
+      # directories.
+      prefix_or_dir = re.match('^[^[*?]*', pattern).group(0)
+
+      file_metadatas = []
+      if prefix_or_dir == pattern:
+        # Short-circuit calling self.list() if there's no glob pattern to match.
+        if self.exists(pattern):
+          file_metadatas = [FileMetadata(pattern, self.size(pattern))]
+      else:
+        if self.has_dirs():
+          prefix_or_dir = self._url_dirname(prefix_or_dir)
+        file_metadatas = self._list(prefix_or_dir)
+
+      metadata_list = []
+      for file_metadata in file_metadatas:
+        if limit is not None and len(metadata_list) >= limit:
+          break
+        if fnmatch.fnmatch(file_metadata.path, pattern):
+          metadata_list.append(file_metadata)
+
+      return MatchResult(pattern, metadata_list)
+
+    exceptions = {}
+    result = []
+    for pattern, limit in zip(patterns, limits):
+      try:
+        result.append(_match(pattern, limit))
+      except Exception as e:  # pylint: disable=broad-except
+        exceptions[pattern] = e
+
+    if exceptions:
+      raise BeamIOError("Match operation failed", exceptions)
+    return result
 
   @abc.abstractmethod
   def create(self, path, mime_type='application/octet-stream',
@@ -579,6 +667,19 @@ def exists(self, path):
     raise NotImplementedError
 
   @abc.abstractmethod
+  def size(self, path):
+    """Get 

[jira] [Work logged] (BEAM-4011) Python SDK: add glob support for HDFS

2018-04-12 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-4011?focusedWorklogId=90665=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-90665
 ]

ASF GitHub Bot logged work on BEAM-4011:


Author: ASF GitHub Bot
Created on: 13/Apr/18 01:30
Start Date: 13/Apr/18 01:30
Worklog Time Spent: 10m 
  Work Description: udim commented on a change in pull request #5024: 
[BEAM-4011] Unify Python IO glob implementation.
URL: https://github.com/apache/beam/pull/5024#discussion_r181264323
 
 

 ##
 File path: sdks/python/apache_beam/io/filesystem.py
 ##
 @@ -510,7 +551,48 @@ def match(self, patterns, limits=None):
     Raises:
       ``BeamIOError`` if any of the pattern match operations fail
     """
-    raise NotImplementedError
+    if limits is None:
+      limits = [None] * len(patterns)
+    else:
+      err_msg = "Patterns and limits should be equal in length"
+      assert len(patterns) == len(limits), err_msg
+
+    def _match(pattern, limit):
+      """Find all matching paths to the pattern provided."""
+      if pattern.endswith('/'):
+        pattern += '*'
+      prefix_or_dir = re.match('^[^[*?]*', pattern).group(0)
 
 Review comment:
   done




Issue Time Tracking
---

Worklog Id: (was: 90665)
Time Spent: 2h  (was: 1h 50m)

> Python SDK: add glob support for HDFS
> -
>
> Key: BEAM-4011
> URL: https://issues.apache.org/jira/browse/BEAM-4011
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Udi Meiri
>Assignee: Udi Meiri
>Priority: Major
>  Time Spent: 2h
>  Remaining Estimate: 0h
>






[jira] [Work logged] (BEAM-4011) Python SDK: add glob support for HDFS

2018-04-12 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-4011?focusedWorklogId=90667=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-90667
 ]

ASF GitHub Bot logged work on BEAM-4011:


Author: ASF GitHub Bot
Created on: 13/Apr/18 01:30
Start Date: 13/Apr/18 01:30
Worklog Time Spent: 10m 
  Work Description: udim commented on a change in pull request #5024: 
[BEAM-4011] Unify Python IO glob implementation.
URL: https://github.com/apache/beam/pull/5024#discussion_r181263543
 
 

 ##
 File path: sdks/python/apache_beam/io/filesystem.py
 ##
 @@ -498,9 +501,47 @@ def mkdirs(self, path):
     raise NotImplementedError
 
   @abc.abstractmethod
+  def has_dirs(self):
+    """Whether this FileSystem supports directories."""
+    raise NotImplementedError
+
+  @abc.abstractmethod
+  def list(self, dir_or_prefix):
 
 Review comment:
   done




Issue Time Tracking
---

Worklog Id: (was: 90667)
Time Spent: 2h 20m  (was: 2h 10m)

> Python SDK: add glob support for HDFS
> -
>
> Key: BEAM-4011
> URL: https://issues.apache.org/jira/browse/BEAM-4011
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Udi Meiri
>Assignee: Udi Meiri
>Priority: Major
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>






[jira] [Work logged] (BEAM-4011) Python SDK: add glob support for HDFS

2018-04-12 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-4011?focusedWorklogId=90666=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-90666
 ]

ASF GitHub Bot logged work on BEAM-4011:


Author: ASF GitHub Bot
Created on: 13/Apr/18 01:30
Start Date: 13/Apr/18 01:30
Worklog Time Spent: 10m 
  Work Description: udim commented on a change in pull request #5024: 
[BEAM-4011] Unify Python IO glob implementation.
URL: https://github.com/apache/beam/pull/5024#discussion_r181263847
 
 

 ##
 File path: sdks/python/apache_beam/io/filesystem.py
 ##
 @@ -498,9 +501,47 @@ def mkdirs(self, path):
     raise NotImplementedError
 
   @abc.abstractmethod
+  def has_dirs(self):
+    """Whether this FileSystem supports directories."""
+    raise NotImplementedError
+
+  @abc.abstractmethod
+  def list(self, dir_or_prefix):
+    """List files in a location.
+
+    Listing is non-recursive (for filesystems that support directories).
+
+    Args:
+      dir_or_prefix: (string) A directory or location prefix (for filesystems
+        that don't have directories).
+
+    Returns:
+      Generator of ``FileMetadata`` objects.
+
+    Raises:
+      ``BeamIOError`` if listing fails, but not if no files were found.
+    """
+    raise NotImplementedError
+
+  @staticmethod
+  def _url_dirname(url_or_path):
+    """Like posixpath.dirname, but preserves scheme:// prefix.
+
+    Args:
+      url_or_path: A string in the form of scheme://some/path OR /some/path.
+    """
+    match = re.match(r'([a-z]+://)(.*)', url_or_path)
+    if match is None:
+      return posixpath.dirname(url_or_path)
+    url_prefix, path = match.groups()
+    return url_prefix + posixpath.dirname(path)
+
   def match(self, patterns, limits=None):
     """Find all matching paths to the patterns provided.
 
+    Pattern matching is done using fnmatch.fnmatch.
 
 Review comment:
   Clarified.




Issue Time Tracking
---

Worklog Id: (was: 90666)
Time Spent: 2h 10m  (was: 2h)

> Python SDK: add glob support for HDFS
> -
>
> Key: BEAM-4011
> URL: https://issues.apache.org/jira/browse/BEAM-4011
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Udi Meiri
>Assignee: Udi Meiri
>Priority: Major
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>






[jira] [Work logged] (BEAM-4011) Python SDK: add glob support for HDFS

2018-04-12 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-4011?focusedWorklogId=90664=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-90664
 ]

ASF GitHub Bot logged work on BEAM-4011:


Author: ASF GitHub Bot
Created on: 13/Apr/18 01:30
Start Date: 13/Apr/18 01:30
Worklog Time Spent: 10m 
  Work Description: udim commented on a change in pull request #5024: 
[BEAM-4011] Unify Python IO glob implementation.
URL: https://github.com/apache/beam/pull/5024#discussion_r181265403
 
 

 ##
 File path: sdks/python/apache_beam/io/filesystem.py
 ##
 @@ -579,6 +661,19 @@ def exists(self, path):
     raise NotImplementedError
 
   @abc.abstractmethod
+  def size(self, path):
 
 Review comment:
   _list() returns sizes, but it only works on prefixes (may return more than 
one result) or directories (fails on files).
   I added this method for the case where the pattern given doesn't end in `/` 
and has no globbing characters. In this case we return the size of the file or 
directory pointed to by the pattern (if it exists).
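
The two branches described above can be sketched outside Beam. This is a minimal illustration, not the PR's code: `FAKE_FILES` and `match_one` are hypothetical names, the in-memory dict stands in for a prefix-based filesystem, and `fnmatch.fnmatchcase` is swapped in for `fnmatch.fnmatch` so the result does not depend on the host OS.

```python
import fnmatch
import re

# Hypothetical in-memory stand-in for a prefix-based filesystem: path -> size.
FAKE_FILES = {
    'scheme://bucket/a.txt': 3,
    'scheme://bucket/b.txt': 5,
}


def match_one(pattern):
    """Return (path, size) pairs for `pattern`, mirroring the two branches."""
    if pattern.endswith('/'):
        pattern += '*'
    # Part of the pattern before the first globbing character.
    prefix = re.match(r'^[^[*?]*', pattern).group(0)
    if prefix == pattern:
        # No globbing characters: look up the single path instead of listing.
        if pattern in FAKE_FILES:
            return [(pattern, FAKE_FILES[pattern])]
        return []
    # Globbing characters present: list by prefix, then filter with fnmatch.
    candidates = sorted(p for p in FAKE_FILES if p.startswith(prefix))
    return [(p, FAKE_FILES[p]) for p in candidates
            if fnmatch.fnmatchcase(p, pattern)]
```

A pattern without globbing characters goes through the lookup branch and yields at most one entry, which is the case the `size()` method was added for.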




Issue Time Tracking
---

Worklog Id: (was: 90664)
Time Spent: 2h  (was: 1h 50m)

> Python SDK: add glob support for HDFS
> -
>
> Key: BEAM-4011
> URL: https://issues.apache.org/jira/browse/BEAM-4011
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Udi Meiri
>Assignee: Udi Meiri
>Priority: Major
>  Time Spent: 2h
>  Remaining Estimate: 0h
>






[jira] [Work logged] (BEAM-4011) Python SDK: add glob support for HDFS

2018-04-12 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-4011?focusedWorklogId=90663=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-90663
 ]

ASF GitHub Bot logged work on BEAM-4011:


Author: ASF GitHub Bot
Created on: 13/Apr/18 01:30
Start Date: 13/Apr/18 01:30
Worklog Time Spent: 10m 
  Work Description: udim commented on a change in pull request #5024: 
[BEAM-4011] Unify Python IO glob implementation.
URL: https://github.com/apache/beam/pull/5024#discussion_r181264935
 
 

 ##
 File path: sdks/python/apache_beam/io/filesystem.py
 ##
 @@ -510,7 +551,48 @@ def match(self, patterns, limits=None):
     Raises:
       ``BeamIOError`` if any of the pattern match operations fail
     """
-    raise NotImplementedError
+    if limits is None:
+      limits = [None] * len(patterns)
+    else:
+      err_msg = "Patterns and limits should be equal in length"
+      assert len(patterns) == len(limits), err_msg
+
+    def _match(pattern, limit):
+      """Find all matching paths to the pattern provided."""
+      if pattern.endswith('/'):
+        pattern += '*'
+      prefix_or_dir = re.match('^[^[*?]*', pattern).group(0)
+
+      file_metadatas = []
+      if prefix_or_dir == pattern:
+        # Short-circuit calling self.list() if there's no glob pattern to match.
+        if self.exists(pattern):
+          file_metadatas = [FileMetadata(pattern, self.size(pattern))]
+      else:
+        if self.has_dirs():
 
 Review comment:
   Yes. I explained it in the new comment a little above this line.
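
For filesystems with directories, the diff trims the prefix to its parent directory before listing. Below is a standalone sketch of that step. It copies the logic of `_url_dirname` from the diff into a free function; the name `url_dirname` and the sample paths are illustrative only.

```python
import posixpath
import re


def url_dirname(url_or_path):
    """Like posixpath.dirname, but keeps a leading scheme:// prefix intact."""
    match = re.match(r'([a-z]+://)(.*)', url_or_path)
    if match is None:
        return posixpath.dirname(url_or_path)
    url_prefix, path = match.groups()
    return url_prefix + posixpath.dirname(path)


print(url_dirname('scheme://path/foo'))  # scheme://path
print(url_dirname('/some/path/foo'))     # /some/path
```

So for a pattern such as scheme://path/foo*, the glob-free prefix scheme://path/foo becomes scheme://path, and that directory is what gets listed.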




Issue Time Tracking
---

Worklog Id: (was: 90663)
Time Spent: 1h 50m  (was: 1h 40m)

> Python SDK: add glob support for HDFS
> -
>
> Key: BEAM-4011
> URL: https://issues.apache.org/jira/browse/BEAM-4011
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Udi Meiri
>Assignee: Udi Meiri
>Priority: Major
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>






[jira] [Work logged] (BEAM-4011) Python SDK: add glob support for HDFS

2018-04-12 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-4011?focusedWorklogId=90519=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-90519
 ]

ASF GitHub Bot logged work on BEAM-4011:


Author: ASF GitHub Bot
Created on: 12/Apr/18 18:15
Start Date: 12/Apr/18 18:15
Worklog Time Spent: 10m 
  Work Description: chamikaramj commented on a change in pull request 
#5024: [BEAM-4011] Unify Python IO glob implementation.
URL: https://github.com/apache/beam/pull/5024#discussion_r181169858
 
 

 ##
 File path: sdks/python/apache_beam/io/filesystem.py
 ##
 @@ -498,9 +501,47 @@ def mkdirs(self, path):
     raise NotImplementedError
 
   @abc.abstractmethod
+  def has_dirs(self):
+    """Whether this FileSystem supports directories."""
+    raise NotImplementedError
+
+  @abc.abstractmethod
+  def list(self, dir_or_prefix):
+    """List files in a location.
+
+    Listing is non-recursive (for filesystems that support directories).
+
+    Args:
+      dir_or_prefix: (string) A directory or location prefix (for filesystems
+        that don't have directories).
+
+    Returns:
+      Generator of ``FileMetadata`` objects.
+
+    Raises:
+      ``BeamIOError`` if listing fails, but not if no files were found.
+    """
+    raise NotImplementedError
+
+  @staticmethod
+  def _url_dirname(url_or_path):
+    """Like posixpath.dirname, but preserves scheme:// prefix.
+
+    Args:
+      url_or_path: A string in the form of scheme://some/path OR /some/path.
+    """
+    match = re.match(r'([a-z]+://)(.*)', url_or_path)
+    if match is None:
+      return posixpath.dirname(url_or_path)
+    url_prefix, path = match.groups()
+    return url_prefix + posixpath.dirname(path)
+
   def match(self, patterns, limits=None):
     """Find all matching paths to the patterns provided.
 
+    Pattern matching is done using fnmatch.fnmatch.
 
 Review comment:
   I think we should clarify whether a match for \/* is recursive or not. I 
think most users will use a pattern in the form \/\* to avoid matching all 
sub-directories anyway.
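
As background for the recursion question, here is a small check of how `fnmatch` itself treats `/`. It is only an illustration with made-up paths, and it uses `fnmatch.fnmatchcase` rather than `fnmatch.fnmatch` to avoid OS-dependent normalization.

```python
import fnmatch

pattern = 'scheme://path/*'

# In fnmatch, '*' matches any run of characters, including '/'.
direct_child = fnmatch.fnmatchcase('scheme://path/foo', pattern)
nested_child = fnmatch.fnmatchcase('scheme://path/sub/foo', pattern)

print(direct_child, nested_child)  # True True
```

Because the pattern matcher does not stop at `/`, whether a match is recursive depends on which paths the filesystem listing returns in the first place.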




Issue Time Tracking
---

Worklog Id: (was: 90519)

> Python SDK: add glob support for HDFS
> -
>
> Key: BEAM-4011
> URL: https://issues.apache.org/jira/browse/BEAM-4011
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Udi Meiri
>Assignee: Udi Meiri
>Priority: Major
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>






[jira] [Work logged] (BEAM-4011) Python SDK: add glob support for HDFS

2018-04-12 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-4011?focusedWorklogId=90516=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-90516
 ]

ASF GitHub Bot logged work on BEAM-4011:


Author: ASF GitHub Bot
Created on: 12/Apr/18 18:15
Start Date: 12/Apr/18 18:15
Worklog Time Spent: 10m 
  Work Description: chamikaramj commented on a change in pull request 
#5024: [BEAM-4011] Unify Python IO glob implementation.
URL: https://github.com/apache/beam/pull/5024#discussion_r181165982
 
 

 ##
 File path: sdks/python/apache_beam/io/filesystem.py
 ##
 @@ -498,9 +501,47 @@ def mkdirs(self, path):
     raise NotImplementedError
 
   @abc.abstractmethod
+  def has_dirs(self):
+    """Whether this FileSystem supports directories."""
+    raise NotImplementedError
+
+  @abc.abstractmethod
+  def list(self, dir_or_prefix):
 
 Review comment:
   I think having both list() and match() as public can be confusing to users. 
Let's keep list() private (and move it to filesystem implementations) if 
there's no compelling use-case to keep it in the interface.




Issue Time Tracking
---

Worklog Id: (was: 90516)
Time Spent: 1h 20m  (was: 1h 10m)

> Python SDK: add glob support for HDFS
> -
>
> Key: BEAM-4011
> URL: https://issues.apache.org/jira/browse/BEAM-4011
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Udi Meiri
>Assignee: Udi Meiri
>Priority: Major
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>






[jira] [Work logged] (BEAM-4011) Python SDK: add glob support for HDFS

2018-04-12 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-4011?focusedWorklogId=90517=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-90517
 ]

ASF GitHub Bot logged work on BEAM-4011:


Author: ASF GitHub Bot
Created on: 12/Apr/18 18:15
Start Date: 12/Apr/18 18:15
Worklog Time Spent: 10m 
  Work Description: chamikaramj commented on a change in pull request 
#5024: [BEAM-4011] Unify Python IO glob implementation.
URL: https://github.com/apache/beam/pull/5024#discussion_r181161482
 
 

 ##
 File path: sdks/python/apache_beam/io/filesystem.py
 ##
 @@ -510,7 +551,48 @@ def match(self, patterns, limits=None):
     Raises:
       ``BeamIOError`` if any of the pattern match operations fail
     """
-    raise NotImplementedError
+    if limits is None:
+      limits = [None] * len(patterns)
+    else:
+      err_msg = "Patterns and limits should be equal in length"
+      assert len(patterns) == len(limits), err_msg
+
+    def _match(pattern, limit):
+      """Find all matching paths to the pattern provided."""
+      if pattern.endswith('/'):
+        pattern += '*'
+      prefix_or_dir = re.match('^[^[*?]*', pattern).group(0)
 
 Review comment:
   Please add a comment explaining this regex.
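
For reference, a short sketch of what the regex does: `^[^[*?]*` matches from the start of the string up to, but not including, the first `[`, `*` or `?`. The helper name `literal_prefix` and the sample patterns are hypothetical and not part of the PR.

```python
import re

# Same expression as in the diff: a negated character class containing the
# three globbing characters '[', '*' and '?', repeated from the start.
GLOB_FREE_PREFIX = re.compile(r'^[^[*?]*')


def literal_prefix(pattern):
    """Return the part of `pattern` before the first globbing character."""
    return GLOB_FREE_PREFIX.match(pattern).group(0)


print(literal_prefix('scheme://path/foo*'))      # scheme://path/foo
print(literal_prefix('scheme://path/b[0-9].x'))  # scheme://path/b
print(literal_prefix('scheme://path/file.txt'))  # scheme://path/file.txt
```

When the returned prefix equals the whole pattern, the pattern contains no globbing characters at all.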




Issue Time Tracking
---

Worklog Id: (was: 90517)
Time Spent: 1.5h  (was: 1h 20m)

> Python SDK: add glob support for HDFS
> -
>
> Key: BEAM-4011
> URL: https://issues.apache.org/jira/browse/BEAM-4011
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Udi Meiri
>Assignee: Udi Meiri
>Priority: Major
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>






[jira] [Work logged] (BEAM-4011) Python SDK: add glob support for HDFS

2018-04-12 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-4011?focusedWorklogId=90515=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-90515
 ]

ASF GitHub Bot logged work on BEAM-4011:


Author: ASF GitHub Bot
Created on: 12/Apr/18 18:15
Start Date: 12/Apr/18 18:15
Worklog Time Spent: 10m 
  Work Description: chamikaramj commented on a change in pull request 
#5024: [BEAM-4011] Unify Python IO glob implementation.
URL: https://github.com/apache/beam/pull/5024#discussion_r181162315
 
 

 ##
 File path: sdks/python/apache_beam/io/filesystem.py
 ##
 @@ -510,7 +551,48 @@ def match(self, patterns, limits=None):
     Raises:
       ``BeamIOError`` if any of the pattern match operations fail
     """
-    raise NotImplementedError
+    if limits is None:
+      limits = [None] * len(patterns)
+    else:
+      err_msg = "Patterns and limits should be equal in length"
+      assert len(patterns) == len(limits), err_msg
+
+    def _match(pattern, limit):
+      """Find all matching paths to the pattern provided."""
+      if pattern.endswith('/'):
+        pattern += '*'
+      prefix_or_dir = re.match('^[^[*?]*', pattern).group(0)
+
+      file_metadatas = []
+      if prefix_or_dir == pattern:
+        # Short-circuit calling self.list() if there's no glob pattern to match.
+        if self.exists(pattern):
+          file_metadatas = [FileMetadata(pattern, self.size(pattern))]
+      else:
+        if self.has_dirs():
 
 Review comment:
   Wasn't sure from code, but do we try to list "\/\*" for 
"\/\*" ?




Issue Time Tracking
---

Worklog Id: (was: 90515)
Time Spent: 1h 10m  (was: 1h)

> Python SDK: add glob support for HDFS
> -
>
> Key: BEAM-4011
> URL: https://issues.apache.org/jira/browse/BEAM-4011
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Udi Meiri
>Assignee: Udi Meiri
>Priority: Major
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>






[jira] [Work logged] (BEAM-4011) Python SDK: add glob support for HDFS

2018-04-12 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-4011?focusedWorklogId=90518=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-90518
 ]

ASF GitHub Bot logged work on BEAM-4011:


Author: ASF GitHub Bot
Created on: 12/Apr/18 18:15
Start Date: 12/Apr/18 18:15
Worklog Time Spent: 10m 
  Work Description: chamikaramj commented on a change in pull request 
#5024: [BEAM-4011] Unify Python IO glob implementation.
URL: https://github.com/apache/beam/pull/5024#discussion_r181163732
 
 

 ##
 File path: sdks/python/apache_beam/io/filesystem.py
 ##
 @@ -579,6 +661,19 @@ def exists(self, path):
     raise NotImplementedError
 
   @abc.abstractmethod
+  def size(self, path):
 
 Review comment:
   Why do we need a separate method for size()? I think we can already stat 
files using the match() method (it returns FileMetadata objects).
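
A rough sketch of the alternative being suggested here, assuming match() returns metadata objects that carry the file size. The `FileMetadata` namedtuple, its field names, `size_from_match`, and `fake_match` are stand-ins made up for illustration; they are not taken from the PR.

```python
import collections

# Stand-in for the metadata objects returned by match(); the field names
# used here are assumptions for this sketch.
FileMetadata = collections.namedtuple(
    'FileMetadata', ['path', 'size_in_bytes'])


def size_from_match(match_fn, path):
  """Return the size of `path` using only a match()-style callable.

  `match_fn` is a hypothetical callable that takes a pattern and returns a
  list of FileMetadata objects.
  """
  for metadata in match_fn(path):
    if metadata.path == path:
      return metadata.size_in_bytes
  raise IOError('Path not found: %s' % path)


def fake_match(pattern):
  # Toy matcher that knows about exactly one file.
  known = [FileMetadata('/tmp/a.txt', 12)]
  return [m for m in known if m.path == pattern]


print(size_from_match(fake_match, '/tmp/a.txt'))  # 12
```

One caveat with this approach: in the diff under review, match() itself calls self.size() for non-glob paths, so a size() built on match() would need a different base case.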




Issue Time Tracking
---

Worklog Id: (was: 90518)
Time Spent: 1h 40m  (was: 1.5h)

> Python SDK: add glob support for HDFS
> -
>
> Key: BEAM-4011
> URL: https://issues.apache.org/jira/browse/BEAM-4011
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Udi Meiri
>Assignee: Udi Meiri
>Priority: Major
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>






[jira] [Work logged] (BEAM-4011) Python SDK: add glob support for HDFS

2018-04-06 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-4011?focusedWorklogId=88535=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-88535
 ]

ASF GitHub Bot logged work on BEAM-4011:


Author: ASF GitHub Bot
Created on: 06/Apr/18 17:56
Start Date: 06/Apr/18 17:56
Worklog Time Spent: 10m 
  Work Description: udim commented on issue #5024: [BEAM-4011] Unify Python 
IO glob implementation.
URL: https://github.com/apache/beam/pull/5024#issuecomment-379329439
 
 
   This code is working. I've opened BEAM-4027 for the precommit failure with 
cython.




Issue Time Tracking
---

Worklog Id: (was: 88535)
Time Spent: 40m  (was: 0.5h)

> Python SDK: add glob support for HDFS
> -
>
> Key: BEAM-4011
> URL: https://issues.apache.org/jira/browse/BEAM-4011
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Udi Meiri
>Assignee: Udi Meiri
>Priority: Major
>  Time Spent: 40m
>  Remaining Estimate: 0h
>






[jira] [Work logged] (BEAM-4011) Python SDK: add glob support for HDFS

2018-04-06 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-4011?focusedWorklogId=88536=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-88536
 ]

ASF GitHub Bot logged work on BEAM-4011:


Author: ASF GitHub Bot
Created on: 06/Apr/18 17:56
Start Date: 06/Apr/18 17:56
Worklog Time Spent: 10m 
  Work Description: udim commented on issue #5024: [BEAM-4011] Unify Python 
IO glob implementation.
URL: https://github.com/apache/beam/pull/5024#issuecomment-379329516
 
 
   R: @chamikaramj 




Issue Time Tracking
---

Worklog Id: (was: 88536)
Time Spent: 50m  (was: 40m)

> Python SDK: add glob support for HDFS
> -
>
> Key: BEAM-4011
> URL: https://issues.apache.org/jira/browse/BEAM-4011
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Udi Meiri
>Assignee: Udi Meiri
>Priority: Major
>  Time Spent: 50m
>  Remaining Estimate: 0h
>






[jira] [Work logged] (BEAM-4011) Python SDK: add glob support for HDFS

2018-04-05 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-4011?focusedWorklogId=88292=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-88292
 ]

ASF GitHub Bot logged work on BEAM-4011:


Author: ASF GitHub Bot
Created on: 06/Apr/18 00:54
Start Date: 06/Apr/18 00:54
Worklog Time Spent: 10m 
  Work Description: udim commented on issue #5024: [BEAM-4011] Unify Python 
IO glob implementation.
URL: https://github.com/apache/beam/pull/5024#issuecomment-379118600
 
 
   run python postcommit




Issue Time Tracking
---

Worklog Id: (was: 88292)
Time Spent: 0.5h  (was: 20m)

> Python SDK: add glob support for HDFS
> -
>
> Key: BEAM-4011
> URL: https://issues.apache.org/jira/browse/BEAM-4011
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Udi Meiri
>Assignee: Udi Meiri
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>






[jira] [Work logged] (BEAM-4011) Python SDK: add glob support for HDFS

2018-04-04 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-4011?focusedWorklogId=87735=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-87735
 ]

ASF GitHub Bot logged work on BEAM-4011:


Author: ASF GitHub Bot
Created on: 04/Apr/18 19:26
Start Date: 04/Apr/18 19:26
Worklog Time Spent: 10m 
  Work Description: udim commented on issue #5024: [BEAM-4011] Normalize 
Filesystems.match() glob behavior.
URL: https://github.com/apache/beam/pull/5024#issuecomment-378716918
 
 
   Do not merge before https://github.com/apache/beam/pull/4979




Issue Time Tracking
---

Worklog Id: (was: 87735)
Time Spent: 20m  (was: 10m)

> Python SDK: add glob support for HDFS
> -
>
> Key: BEAM-4011
> URL: https://issues.apache.org/jira/browse/BEAM-4011
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Udi Meiri
>Assignee: Udi Meiri
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>






[jira] [Work logged] (BEAM-4011) Python SDK: add glob support for HDFS

2018-04-04 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-4011?focusedWorklogId=87724=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-87724
 ]

ASF GitHub Bot logged work on BEAM-4011:


Author: ASF GitHub Bot
Created on: 04/Apr/18 19:03
Start Date: 04/Apr/18 19:03
Worklog Time Spent: 10m 
  Work Description: udim opened a new pull request #5024: [BEAM-4011] 
Normalize Filesystems.match() glob behavior.
URL: https://github.com/apache/beam/pull/5024
 
 
   - Introduces the abstract method FileSystem.list(), which lists a directory
   or prefix.
   - Implements FileSystem.match(), which is no longer abstract and unifies
   glob behavior using fnmatch.fnmatch.
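
A minimal sketch of how the two pieces described above could fit together: list everything under the literal prefix of the pattern, then filter the results with fnmatch.fnmatch. The class `InMemoryFileSystem`, its constructor, and the (path, size) tuples are hypothetical simplifications, not the PR's actual code.

```python
import fnmatch
import re


class InMemoryFileSystem(object):
  """Toy stand-in for a FileSystem subclass; not Beam's actual class."""

  def __init__(self, sizes):
    self._sizes = dict(sizes)  # Maps path -> size in bytes.

  def list(self, dir_or_prefix):
    """Yield (path, size) pairs for paths starting with the prefix."""
    for path in sorted(self._sizes):
      if path.startswith(dir_or_prefix):
        yield path, self._sizes[path]

  def match(self, pattern, limit=None):
    """Return (path, size) pairs whose path matches the glob pattern."""
    if pattern.endswith('/'):
      pattern += '*'
    # Literal part of the pattern before the first glob metacharacter.
    prefix = re.match('^[^[*?]*', pattern).group(0)
    results = []
    for path, size in self.list(prefix):
      if limit is not None and len(results) >= limit:
        break
      if fnmatch.fnmatch(path, pattern):
        results.append((path, size))
    return results


fs = InMemoryFileSystem(
    {'/data/a.txt': 3, '/data/b.csv': 5, '/logs/a.txt': 7})
print(fs.match('/data/*.txt'))  # [('/data/a.txt', 3)]
```

Note that `*` in fnmatch also matches path separators, so under this scheme a pattern such as `/data/*` would match files in nested directories as well.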
   
   




Issue Time Tracking
---

Worklog Id: (was: 87724)
Time Spent: 10m
Remaining Estimate: 0h

> Python SDK: add glob support for HDFS
> -
>
> Key: BEAM-4011
> URL: https://issues.apache.org/jira/browse/BEAM-4011
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Udi Meiri
>Assignee: Udi Meiri
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>



