[jira] [Updated] (ARROW-7584) [Python] Improve ergonomics of new FileSystem API

2020-01-24 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-7584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fabian Höring updated ARROW-7584:
---------------------------------
Description: 
The [new Python FileSystem API|https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L185] is nice but seems very verbose to use.

The documentation of the old FS API is [here|https://arrow.apache.org/docs/python/filesystems.html].

h2. Here are some examples

*Filesystem access:*

Before:
fs.ls()
fs.mkdir()
fs.rmdir()

Now:
fs.get_target_stats()
fs.create_dir()
fs.delete_dir()

What is the advantage of the longer method names? The short ones seem clear and are much easier to use, and renaming looks like an easy change. It is also consistent with what hdfs does in the [fs api|https://arrow.apache.org/docs/python/filesystems.html] and works naturally with a local filesystem.
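If aliases are the route taken, a thin adapter could restore the short names without touching the new classes. A minimal sketch (ShortNameFS is hypothetical, not part of pyarrow; the wrapped object is only assumed to provide the long-name methods discussed above):

```python
class ShortNameFS:
    """Adapter exposing short, familiar names on top of the new-style API.

    The wrapped backend is only assumed to provide get_target_stats,
    create_dir and delete_dir (the names quoted in this issue).
    """

    def __init__(self, fs):
        self._fs = fs

    def ls(self, path):
        # Delegate to the verbose name; return whatever stats it yields.
        return self._fs.get_target_stats(path)

    def mkdir(self, path):
        return self._fs.create_dir(path)

    def rmdir(self, path):
        return self._fs.delete_dir(path)
```

Keeping the aliases in a separate adapter class would avoid having two methods per operation on FileSystem itself.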

*File opening:*

Before:
with fs.open(path, mode='rb', buffer_size=None)

Now:
fs.open_input_file()
fs.open_input_stream()
fs.open_output_stream()

It seems more natural to match Python's standard open() function, which works for local file access as well. Not sure whether this is easy to do, as there is also a `_wrap_output_stream` method.
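As a sketch, a single open()-style helper could dispatch on the mode string to the specialized methods (fs_open is a hypothetical helper, the method names are the ones quoted above, and only the binary modes are covered):

```python
def fs_open(fs, path, mode="rb"):
    """Map a Python-style mode string onto the new API's specialized
    openers. `fs` is assumed to provide the methods named in this issue."""
    if mode == "rb":
        return fs.open_input_stream(path)
    if mode == "wb":
        return fs.open_output_stream(path)
    # Text modes would need an encoding layer on top of the byte streams.
    raise ValueError("unsupported mode: %r" % (mode,))
```

A real implementation would also need to decide between open_input_stream and open_input_file (sequential vs. random access) for reads.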

h2. Possible solutions

- If the current Python API is still unused we could just rename the methods
- We could keep everything as is and add some alias methods; it would make the FileSystem class a bit messy I think, because there would always be 2 methods doing the same work
- Make everything compatible with fsspec and reference the spec, see https://issues.apache.org/jira/browse/ARROW-7102. I like the idea of the https://github.com/intake/filesystem_spec repo. Some comments on the solutions proposed there:
-- Make an fsspec wrapper for pyarrow.fs => seems strange to me; it would mean wrapping, in yet another repo, a FileSystem that is not good enough on its own
-- Make a pyarrow.fs wrapper for fsspec => if the wrapper becomes the documented "official" pyarrow FileSystem it is fine I think; otherwise it would be yet another wrapper on top of the pyarrow "official" fs


h2. Tensorflow RFC on FileSystems

TensorFlow is also doing some standardization work on its FileSystem:
https://github.com/tensorflow/community/blob/master/rfcs/20190506-filesystem-plugin-modular-tensorflow.md#python-considerations

It is not clear (to me) what they will do with the Python file API, though. It seems like they will also just wrap the C code back into [tf.Gfile|https://www.tensorflow.org/api_docs/python/tf/io/gfile/GFile]

h2. Other considerations on FS ergonomics

In the long run I would also like to enhance the FileSystem API with more methods that build on the basic ones to provide new features, for example:
- introduce put and get on top of the streams to directly upload/download files
- introduce [touch|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L601] from dask/hdfs3
- introduce [du|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L252] from dask/hdfs3
- check whether the selector works with globs, or add https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L349
- be able to write strings to the file streams (instead of only bytes; already implemented by https://github.com/dask/hdfs3/blob/master/hdfs3/utils.py#L96), which would permit using Python APIs like json.dump directly:

{code}
with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  json.dump(res, fd)
{code}

instead of

{code}
with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  fd.write(json.dumps(res))
{code}

or, as currently required (with the old API, which needs an explicit encode each time; untested with the new one):

{code}
with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  fd.write(json.dumps(res).encode())
{code}
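Until the streams accept str, the standard library can already bridge the gap: io.TextIOWrapper encodes on the fly, so json.dump works unchanged on a bytes-only stream. A sketch against a local file (any file-like object opened in binary mode should behave the same):

```python
import io
import json
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.json")

# TextIOWrapper handles the str -> bytes encoding, so json.dump can
# write directly through it to a stream that only accepts bytes.
with open(path, "wb") as raw:
    with io.TextIOWrapper(raw, encoding="utf-8") as fd:
        json.dump({"a": "bc"}, fd)

with open(path, "rb") as raw:
    assert raw.read() == b'{"a": "bc"}'
```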

- implement [readline|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L809], needed for:

{code}
with hdfs.open("file", 'wb') as outfile:
  pickle.dump({"a": "b"}, outfile)

with hdfs.open("file", 'rb') as infile:
  pickle.load(infile)
{code}

- it is not clear how to make this also work when reading from files (decoding bytes back to str)
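The read direction looks solvable the same way: io.TextIOWrapper also decodes and provides readline(), so the filesystem would not need to implement either itself. A sketch against a local file:

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "lines.txt")
with open(path, "wb") as raw:
    raw.write(b"first line\nsecond line\n")

# Wrapping the binary stream gives str decoding plus readline(),
# without the underlying stream having to implement either.
with open(path, "rb") as raw:
    fd = io.TextIOWrapper(raw, encoding="utf-8")
    assert fd.readline() == "first line\n"
    assert fd.readline() == "second line\n"
```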

[Python] Improve ergonomics of new FileSystem API
-------------------------------------------------

Key: ARROW-7584
URL: https://issues.apache.org/jira/browse/ARROW-7584
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Reporter: Fabian Höring
Priority: Major
Labels: FileSystem



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7584) [Python] Improve ergonomics of new FileSystem API

2020-01-15 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-7584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fabian Höring updated ARROW-7584:
-
Description: 
The [new Python FileSystem API 
|https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L185] is 
nice but seems to be very verbose to use.

The documentation of the old FS API is 
[here|https://arrow.apache.org/docs/python/filesystems.html]

h2. Here are some examples

*Filesystem access:*

Before:
fs.ls()
fs.mkdir()
fs.rmdir()

Now:
fs.get_target_stats()
fs.create_dir()
fs.delete_dir()

What is the advantage of having a longer method ? The short ones seems clear 
and are much easier to use. Seems like an easy change.  Also this is consistent 
with what is doing hdfs in the [fs api| 
https://arrow.apache.org/docs/python/filesystems.html] and works naturally with 
a local filesystem.

*File opening:*

Before:
with fs.open(self, path, mode=u'rb', buffer_size=None)

Now:
fs.open_input_file()
fs.open_input_stream()
fs.open_output_stream()

It seems more natural to fit to Python standard open function which works for 
local file access as well. Not sure if this is possible to do easily as there 
is `_wrap_output_stream` method.

h2. Possible solutions

- If the current Python API is still unused we could just rename the methods
- We could keep everything as is and add some alias methods, it would make the 
FileSystem class a bit messy I think becasue there would be always 2 methods to 
do the work
- Make everything compatible to FSSpec and reference the Spec, see 
https://issues.apache.org/jira/browse/ARROW-7102, 
I like the idea of a https://github.com/intake/filesystem_spec repo. Some 
comments on the proposed solutions there:
Make a fsspec wrapper for pyarrow.fs => seems strange to me, it would be 
having to wrap again a FileSystem that is not good enough in yet another repo
Make a pyarrow.fs wrapper for fsspec => if the wrapper becomes the 
documented "official" pyarow FileSystem it is fine I think, otherwise I would 
be yet another wrapper on top of the pyarrow "official" fs


h2. Tensorflow RFC on FileSystems

Tensorflow is also doing some standardization work on their FileSystem:
https://github.com/tensorflow/community/blob/master/rfcs/20190506-filesystem-plugin-modular-tensorflow.md#python-considerations

Not clear (to me) what they will do with Python file API though. it seems like 
they will also just wrap the C code back to 
[tf.Gfile|https://www.tensorflow.org/api_docs/python/tf/io/gfile/GFile]

h2. Other considerations on FS ergonomics

In the long run I would also like to enhance the FileSystem API and add more 
methods that use the basic ones to provide new features for example:
- introduce put and get on top of the streams that directly upload/download 
files
- introduce 
[touch|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L601] from 
dask/hdfs3
- introduce [du|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L252] 
from dask/hdfs3
- check if selector works with globs or add 
https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L349
- be able to write strings to the file streams (instead of only bytes, already 
implemented by https://github.com/dask/hdfs3/blob/master/hdfs3/utils.py#L96), 
it would permit to directly use some Python API's like json.dump

{code}
with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  json.dump(res, fd)
{code}

instead of

{code}
with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  fd.write(json.dumps(res))
{code}

or like currently (with old API, which required encore each time, untested with 
new one)

{code}with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  fd.write(json.dumps(res).encode())
{code}

- not clear how to make this also work when reading from files 

  was:
The [new Python FileSystem API 
|https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L185] is 
nice but seems to be very verbose to use.

The documentation of the old FS API is 
[here|https://arrow.apache.org/docs/python/filesystems.html]

h2. Here are some examples

*Filesystem access:*

Before:
fs.ls()
fs.mkdir()
fs.rmdir()

Now:
fs.get_target_stats()
fs.create_dir()
fs.delete_dir()

What is the advantage of having a longer method ? The short ones seems clear 
and are much easier to use. Seems like an easy change.  Also this is consistent 
with what is doing hdfs in the [fs api| 
https://arrow.apache.org/docs/python/filesystems.html] and works naturally with 
a local filesystem.

*File opening:*

Before:
with fs.open(self, path, mode=u'rb', buffer_size=None)

Now:
fs.open_input_file()
fs.open_input_stream()
fs.open_output_stream()

It seems more natural to fit to Python standard open function which works for 
local file access as well. Not sure if this is possible to do easily as there 
is `_wrap_output_stream` method.

h2. Possible solutions

- If the current Python API is still unused we could just rename the methods
- 

[jira] [Updated] (ARROW-7584) [Python] Improve ergonomics of new FileSystem API

2020-01-15 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-7584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fabian Höring updated ARROW-7584:
-
Description: 
The [new Python FileSystem API 
|https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L185] is 
nice but seems to be very verbose to use.

The documentation of the old FS API is 
[here|https://arrow.apache.org/docs/python/filesystems.html]

h2. Here are some examples

*Filesystem access:*

Before:
fs.ls()
fs.mkdir()
fs.rmdir()

Now:
fs.get_target_stats()
fs.create_dir()
fs.delete_dir()

What is the advantage of having a longer method ? The short ones seems clear 
and are much easier to use. Seems like an easy change.  Also this is consistent 
with what is doing hdfs in the [fs api| 
https://arrow.apache.org/docs/python/filesystems.html] and works naturally with 
a local filesystem.

*File opening:*

Before:
with fs.open(self, path, mode=u'rb', buffer_size=None)

Now:
fs.open_input_file()
fs.open_input_stream()
fs.open_output_stream()

It seems more natural to fit to Python standard open function which works for 
local file access as well. Not sure if this is possible to do easily as there 
is `_wrap_output_stream` method.

h2. Possible solutions

- If the current Python API is still unused we could just rename the methods
- We could keep everything as is and add some alias methods, it would make the 
FileSystem class a bit messy I think becasue there would be always 2 methods to 
do the work
- Make everything compatible to FSSpec and reference the Spec, see 
https://issues.apache.org/jira/browse/ARROW-7102, 
I like the idea of a https://github.com/intake/filesystem_spec repo. Some 
comments on the proposed solutions there:
Make a fsspec wrapper for pyarrow.fs => seems strange to me, it would be 
having to wrap again a FileSystem that is not good enough in yet another repo
Make a pyarrow.fs wrapper for fsspec => if the wrapper becomes the 
documented "official" pyarow FileSystem it is fine I think, otherwise I would 
be yet another wrapper on top of the pyarrow "official" fs


h2. Tensorflow RFC on FileSystems

Tensorflow is also doing some standardization work on their FileSystem:
https://github.com/tensorflow/community/blob/master/rfcs/20190506-filesystem-plugin-modular-tensorflow.md#python-considerations

Not clear (to me) what they will do with Python file API though. it seems like 
they will also just wrap the C code back to 
[tf.Gfile|https://www.tensorflow.org/api_docs/python/tf/io/gfile/GFile]

h2. Other considerations on FS ergonomics

In the long run I would also like to enhance the FileSystem API and add more 
methods that use the basic ones to provide new features for example:
- introduce put and get on top of the streams that directly upload/download 
files
- introduce 
[touch|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L601] from 
dask/hdfs3
- introduce [du|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L252] 
from dask/hdfs3
- check if selector works with globs or add 
https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L349
- be able to write strings to the file streams (instead of only bytes), it 
would permit to directly use some Python API's like json.dump

{code}
with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  json.dump(res, fd)
{code}

instead of

{code}
with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  fd.write(json.dumps(res))
{code}

or like currently (with old API, which required encore each time, untested with 
new one)

{code}with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  fd.write(json.dumps(res).encode())
{code}

- not clear how to make this also work when reading from files 

  was:
The [new Python FileSystem API 
|https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L185] is 
nice but seems to be very verbose to use.

The documentation of the old FS API is 
[here|https://arrow.apache.org/docs/python/filesystems.html]

h2. Here are some examples

*Filesystem access:*

Before:
fs.ls()
fs.mkdir()
fs.rmdir()

Now:
fs.get_target_stats()
fs.create_dir()
fs.delete_dir()

What is the advantage of having a longer method ? The short ones seems clear 
and are much easier to use. Seems like an easy change.  Also this is consistent 
with what is doing hdfs in the [fs api| 
https://arrow.apache.org/docs/python/filesystems.html] and works naturally with 
a local filesystem.

*File opening:*

Before:
with fs.open(self, path, mode=u'rb', buffer_size=None)

Now:
fs.open_input_file()
fs.open_input_stream()
fs.open_output_stream()

It seems more natural to fit to Python standard open function which works for 
local file access as well. Not sure if this is possible to do easily as there 
is `_wrap_output_stream` method.

h2. Possible solutions

- If the current Python API is still unused we could just rename the methods
- We could keep everything as is and add some alias methods, it would make the 

[jira] [Updated] (ARROW-7584) [Python] Improve ergonomics of new FileSystem API

2020-01-15 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-7584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fabian Höring updated ARROW-7584:
-
Description: 
The [new Python FileSystem API 
|https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L185] is 
nice but seems to be very verbose to use.

The documentation of the old FS API is 
[here|https://arrow.apache.org/docs/python/filesystems.html]

h2. Here are some examples

*Filesystem access:*

Before:
fs.ls()
fs.mkdir()
fs.rmdir()

Now:
fs.get_target_stats()
fs.create_dir()
fs.delete_dir()

What is the advantage of having a longer method ? The short ones seems clear 
and are much easier to use. Seems like an easy change.  Also this is consistent 
with what is doing hdfs in the [fs api| 
https://arrow.apache.org/docs/python/filesystems.html] and works naturally with 
a local filesystem.

*File opening:*

Before:
with fs.open(self, path, mode=u'rb', buffer_size=None)

Now:
fs.open_input_file()
fs.open_input_stream()
fs.open_output_stream()

It seems more natural to fit to Python standard open function which works for 
local file access as well. Not sure if this is possible to do easily as there 
is `_wrap_output_stream` method.

h2. Possible solutions

- If the current Python API is still unused we could just rename the methods
- We could keep everything as is and add some alias methods, it would make the 
FileSystem class a bit messy I think becasue there would be always 2 methods to 
do the work
- Make everything compatible to FSSpec and reference the Spec, see 
https://issues.apache.org/jira/browse/ARROW-7102, 
I like the idea of a https://github.com/intake/filesystem_spec repo. Some 
comments on the proposed solutions there:
Make a fsspec wrapper for pyarrow.fs => seems strange to me, it would be 
having to wrap again a FileSystem the is not good enough in yet another repo
Make a pyarrow.fs wrapper for fsspec => if the wrapper becomes the 
documented "official" pyarow FileSystem it is fine I think, otherwise I would 
be yet another wrapper on top of the pyarrow "official" fs


h2. Tensorflow RFC on FileSystems

Tensorflow is also doing some standardization work on their FileSystem:
https://github.com/tensorflow/community/blob/master/rfcs/20190506-filesystem-plugin-modular-tensorflow.md#python-considerations

It is not clear (to me) what they will do with the Python file API, though. It seems like 
they will also just wrap the C code back into 
[tf.GFile|https://www.tensorflow.org/api_docs/python/tf/io/gfile/GFile]

h2. Other considerations on FS ergonomics

In the long run I would also like to enhance the FileSystem API with more 
methods that build on the basic ones to provide new features, for example:
- introduce put and get on top of the streams to directly upload/download 
files
- introduce 
[touch|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L601] from 
dask/hdfs3
- introduce [du|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L252] 
from dask/hdfs3
- check if the selector works with globs, or add 
https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L349
- be able to write strings (instead of only bytes) to the file streams, which 
would permit directly using Python APIs like json.dump

{code}
with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  json.dump(res, fd)
{code}

instead of

{code}
with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  fd.write(json.dumps(res))
{code}

or like currently (with the old API, which requires calling encode each time; 
untested with the new one)

{code}
with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  fd.write(json.dumps(res).encode())
{code}

- it is not clear how to make this also work when reading from files 
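
One possible answer for both directions, sketched with plain io.BytesIO standing in for a filesystem stream: wrapping the binary stream in io.TextIOWrapper gives str reads and writes without changing the filesystem API itself.

```python
import io
import json

# Writing: json.dump emits str, the wrapper encodes to the binary stream.
raw = io.BytesIO()
with io.TextIOWrapper(raw, encoding="utf-8") as txt:
    json.dump({"a": "bc"}, txt)
    txt.flush()
    data = raw.getvalue()  # capture before the wrapper closes the stream

assert json.loads(data.decode("utf-8")) == {"a": "bc"}

# Reading works symmetrically: json.load consumes str from the wrapper.
with io.TextIOWrapper(io.BytesIO(data), encoding="utf-8") as txt:
    assert json.load(txt) == {"a": "bc"}
```

Note that TextIOWrapper closes the underlying stream when it is closed, which is why the bytes are captured inside the with block here.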

  was:
The [new Python FileSystem API 
|https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L185] is 
nice but seems to be very verbose to use.

The documentation of the old FS API is 
[here|https://arrow.apache.org/docs/python/filesystems.html]

h2. Here are some examples

*Filesystem access:*

Before:
fs.ls()
fs.mkdir()
fs.rmdir()

Now:
fs.get_target_stats()
fs.create_dir()
fs.delete_dir()

What is the advantage of having a longer method ? The short ones seems clear 
and are much easier to use. Seems like an easy change.  Also this is consistent 
with what is doing hdfs in the [fs api| 
https://arrow.apache.org/docs/python/filesystems.html] and works naturally with 
a local filesystem.

*File opening:*

Before:
with fs.open(self, path, mode=u'rb', buffer_size=None)

Now:
fs.open_input_file()
fs.open_input_stream()
fs.open_output_stream()

It seems more natural to fit to Python standard open function which works for 
local file access as well. Not sure if this is possible to do easily as there 
is `_wrap_output_stream` method.

h2. Possible solutions

- If the current Python API is still unused we could just rename the methods
- We could keep everything as is and add some alias methods, it would make the 

[jira] [Updated] (ARROW-7584) [Python] Improve ergonomics of new FileSystem API

2020-01-15 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-7584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fabian Höring updated ARROW-7584:
-
Description: 
The [new Python FileSystem API 
|https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L185] is 
nice but seems to be very verbose to use.

The documentation of the old FS API is 
[here|https://arrow.apache.org/docs/python/filesystems.html]

h2. Here are some examples

*Filesystem access:*

Before:
fs.ls()
fs.mkdir()
fs.rmdir()

Now:
fs.get_target_stats()
fs.create_dir()
fs.delete_dir()

What is the advantage of having a longer method ? The short ones seems clear 
and are much easier to use. Seems like an easy change.  Also this is consistent 
with what is doing hdfs in the [fs api| 
https://arrow.apache.org/docs/python/filesystems.html] and works naturally with 
a local filesystem.

*File opening:*

Before:
with fs.open(self, path, mode=u'rb', buffer_size=None)

Now:
fs.open_input_file()
fs.open_input_stream()
fs.open_output_stream()

It seems more natural to fit to Python standard open function which works for 
local file access as well. Not sure if this is possible to do easily as there 
is `_wrap_output_stream` method.

h2. Possible solutions

- If the current Python API is still unused we could just rename the methods
- We could keep everything as is and add some alias methods, it would make the 
FileSystem class a bit messy I think becasue there would be always 2 methods to 
do the work
- Make everything compatible to FSSpec and reference the Spec, see 
https://issues.apache.org/jira/browse/ARROW-7102, 
I like the idea of a fsspex repo. Some comments on the proposed solutions:
Make a fsspec wrapper for pyarrow.fs => seems strange to me, it would be 
having to wrap again a FileSystem the is not good enough in yet another repo
Make a pyarrow.fs wrapper for fsspec => if the wrapper becomes the 
documented "official" pyarow FileSystem it is fine I think, otherwise I would 
be yet another wrapper on top of the pyarrow "official" fs


h2. Tensorflow RFC on FileSystems

Tensorflow is also doing some standardization work on their FileSystem:
https://github.com/tensorflow/community/blob/master/rfcs/20190506-filesystem-plugin-modular-tensorflow.md#python-considerations

Not clear (to me) what they will do with Python file API though. it seems like 
they will also just wrap the C code back to 
[tf.Gfile|https://www.tensorflow.org/api_docs/python/tf/io/gfile/GFile]

h2. Other considerations on FS ergonomics

In the long run I would also like to enhance the FileSystem API and add more 
methods that use the basic ones to provide new features for example:
- introduce put and get on top of the streams that directly upload/download 
files
- introduce 
[touch|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L601] from 
dask/hdfs3
- introduce [du|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L252] 
from dask/hdfs3
- check if selector works with globs or add 
https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L349
- be able to write strings to the file streams (instead of only bytes), it 
would permit to directly use some Python API's like json.dump

{code}
with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  json.dump(res, fd)
{code}

instead of

{code}
with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  fd.write(json.dumps(res))
{code}

or like currently (with old API, which required encore each time, untested with 
new one)

{code}with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  fd.write(json.dumps(res).encode())
{code}

- not clear how to make this also work when reading from files 

  was:
The [new Python FileSystem API 
|https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L185] is 
nice but seems to be very verbose to use.

The documentation of the old FS API is 
[here|https://arrow.apache.org/docs/python/filesystems.html]

h2. Here are some examples

*Filesystem access:*

Before:
fs.ls()
fs.mkdir()
fs.rmdir()

Now:
fs.get_target_stats()
fs.create_dir()
fs.delete_dir()

What is the advantage of having a longer method ? The short ones seems clear 
and are much easier to use. Seems like an easy change.  Also this is consistent 
with what is doing hdfs in the [fs api| 
https://arrow.apache.org/docs/python/filesystems.html] and works naturally with 
a local filesystem.

*File opening:*

Before:
with fs.open(self, path, mode=u'rb', buffer_size=None)

Now:
fs.open_input_file()
fs.open_input_stream()
fs.open_output_stream()

It seems more natural to fit to Python standard open function which works for 
local file access as well. Not sure if this is possible to do easily as there 
is `_wrap_output_stream` method.

h2. Possible solutions

- If the current Python API is still unused we could just rename the methods
- We could keep everything as is and add some alias methods, it would make the 
FileSystem class a bit messy I think becasue there 

[jira] [Updated] (ARROW-7584) [Python] Improve ergonomics of new FileSystem API

2020-01-15 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-7584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fabian Höring updated ARROW-7584:
-
Description: 
The [new Python FileSystem API 
|https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L185] is 
nice but seems to be very verbose to use.

The documentation of the old FS API is 
[here|https://arrow.apache.org/docs/python/filesystems.html]

h2. Here are some examples

*Filesystem access:*

Before:
fs.ls()
fs.mkdir()
fs.rmdir()

Now:
fs.get_target_stats()
fs.create_dir()
fs.delete_dir()

What is the advantage of having a longer method ? The short ones seems clear 
and are much easier to use. Seems like an easy change.  Also this is consistent 
with what is doing hdfs in the [fs api| 
https://arrow.apache.org/docs/python/filesystems.html] and works naturally with 
a local filesystem.

*File opening:*

Before:
with fs.open(self, path, mode=u'rb', buffer_size=None)

Now:
fs.open_input_file()
fs.open_input_stream()
fs.open_output_stream()

It seems more natural to fit to Python standard open function which works for 
local file access as well. Not sure if this is possible to do easily as there 
is `_wrap_output_stream` method.

h2. Possible solutions

- If the current Python API is still unused we could just rename the methods
- We could keep everything as is and add some alias methods, it would make the 
FileSystem class a bit messy I think becasue there would be always 2 methods to 
do the work
- Make everything compatible to FSSpec and reference the Spec, see 
https://issues.apache.org/jira/browse/ARROW-7102, 
I like the idea of a fsspex repo. Some comments on the proposed solutions:
Make a fsspec wrapper for pyarrow.fs => seems strange to me, it would be 
having to wrap again a FileSystem the is not good enough in yet another repo
Make a pyarrow.fs wrapper for fsspec => if the wrapper becomes the 
documented "official" pyarow FileSystem it is fine I think, otherwise I would 
be yet another wrapper on top of the pyarrow "official" fs


h2. Tensorflow RFC on FileSystems

Tensorflow is also doing some standardization work on their FileSystem:
https://github.com/tensorflow/community/blob/master/rfcs/20190506-filesystem-plugin-modular-tensorflow.md#python-considerations

Not clear (to me) what they will do with Python file API though. it seems like 
they will also just wrap the C code back to 
[tf.Gfile|https://www.tensorflow.org/api_docs/python/tf/io/gfile/GFile]

h2. Other considerations on FS ergonomics

In the long run I would also like to enhance the FileSystem API and add more 
methods that use the basic ones to provide new features for example:
- introduce put and get on top of the streams that directly upload/download 
files
- introduce [touch|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L601]
- introduce [du|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L252]
- check if selector works with globs or add 
https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L349
- be able to write strings to the file streams (instead of only bytes), it 
would permit to directly use some Python API's like json.dump

{code}
with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  json.dump(res, fd)
{code}

instead of

{code}
with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  fd.write(json.dumps(res))
{code}

or like currently (with old API, which required encore each time, untested with 
new one)

{code}with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  fd.write(json.dumps(res).encode())
{code}

- not clear how to make this also work when reading from files 

  was:
The [new Python FileSystem API 
|https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L185] is 
nice but seems to be very verbose to use.

The documentation of the old FS API is 
[here|https://arrow.apache.org/docs/python/filesystems.html]

h2. Here are some examples

*Filesystem access:*

Before:
fs.ls()
fs.mkdir()
fs.rmdir()

Now:
fs.get_target_stats()
fs.create_dir()
fs.delete_dir()

What is the advantage of having a longer method ? The short ones seems clear 
and are much easier to use. Seems like an easy change.  Also this is consistent 
with what is doing hdfs in the [fs api| 
https://arrow.apache.org/docs/python/filesystems.html] and works naturally with 
a local filesystem.

*File opening:*

Before:
with fs.open(self, path, mode=u'rb', buffer_size=None)

Now:
fs.open_input_file()
fs.open_input_stream()
fs.open_output_stream()

It seems more natural to fit to Python standard open function which works for 
local file access as well. Not sure if this is possible to do easily as there 
is `_wrap_output_stream` method.

h2. Proposed solutions

- If the current Python API is still unused we could just rename the methods
- We could keep everything as is and add some alias methods, it would make the 
FileSystem class a bit messy I think becasue there would be always 2 methods to 
do 

[jira] [Updated] (ARROW-7584) [Python] Improve ergonomics of new FileSystem API

2020-01-15 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-7584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fabian Höring updated ARROW-7584:
-
Description: 
The [new Python FileSystem API 
|https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L185] is 
nice but seems to be very verbose to use.

The documentation of the old FS API is 
[here|https://arrow.apache.org/docs/python/filesystems.html]

h2. Here are some examples

*Filesystem access:*

Before:
fs.ls()
fs.mkdir()
fs.rmdir()

Now:
fs.get_target_stats()
fs.create_dir()
fs.delete_dir()

What is the advantage of having a longer method ? The short ones seems clear 
and are much easier to use. Seems like an easy change.  Also this is consistent 
with what is doing hdfs in the [fs api| 
https://arrow.apache.org/docs/python/filesystems.html] and works naturally with 
a local filesystem.

*File opening:*

Before:
with fs.open(self, path, mode=u'rb', buffer_size=None)

Now:
fs.open_input_file()
fs.open_input_stream()
fs.open_output_stream()

It seems more natural to fit to Python standard open function which works for 
local file access as well. Not sure if this is possible to do easily as there 
is `_wrap_output_stream` method.

h2. Proposed solutions

- If the current Python API is still unused we could just rename the methods
- We could keep everything as is and add some alias methods, it would make the 
FileSystem class a bit messy I think becasue there would be always 2 methods to 
do the work

h2. Tensorflow RFC on FileSystems

Tensorflow is also doing some standardization work on their FileSystem:
https://github.com/tensorflow/community/blob/master/rfcs/20190506-filesystem-plugin-modular-tensorflow.md#python-considerations

Not clear (to me) what they will do with Python file API though. it seems like 
they will also just wrap the C code back to 
[tf.Gfile|https://www.tensorflow.org/api_docs/python/tf/io/gfile/GFile]

h2. Other considerations on FS ergonomics

In the long run I would also like to enhance the FileSystem API and add more 
methods that use the basic ones to provide new features for example:
- introduce put and get on top of the streams that directly upload/download 
files
- introduce [touch|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L601]
- introduce [du|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L252]
- check if selector works with globs or add 
https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L349
- be able to write strings to the file streams (instead of only bytes), it 
would permit to directly use some Python API's like json.dump

{code}
with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  json.dump(res, fd)
{code}

instead of

{code}
with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  fd.write(json.dumps(res))
{code}

or like currently (with old API, which required encore each time, untested with 
new one)

{code}with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  fd.write(json.dumps(res).encode())
{code}

- not clear how to make this also work when reading from files 

  was:
The [new Python FileSystem API 
|https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L185] is 
nice but seems to be very verbose to use.

The documentation of the old FS API is 
[here|https://arrow.apache.org/docs/python/filesystems.html]

h2. Here are some examples

*Filesystem access:*

Before:
fs.ls()
fs.mkdir()
fs.rmdir()

Now:
fs.get_target_stats()
fs.create_dir()
fs.delete_dir()

What is the advantage of having a longer method ? The short ones seems clear 
and are much easier to use. Seems like an easy change.  Also this is consistent 
with what is doing hdfs in the [fs api| 
https://arrow.apache.org/docs/python/filesystems.html] and works naturally with 
a local filesystem.

*File opening:*

Before:
with fs.open(self, path, mode=u'rb', buffer_size=None)

Now:
fs.open_input_file()
fs.open_input_stream()
fs.open_output_stream()

It seems more natural to fit to Python standard open function which works for 
local file access as well. Not sure if this is possible to do easily as there 
is `_wrap_output_stream` method.

h2. Proposed solutions

- If the current Python API is still unused we could just rename the methods
- We could keep everything as is and add some alias methods, it would make the 
FileSystem class a bit messy think if there are always 2 method to do the work

h2. Tensorflow RFC on FileSystems

Tensorflow is also doing some standardization work on their FileSystem:
https://github.com/tensorflow/community/blob/master/rfcs/20190506-filesystem-plugin-modular-tensorflow.md#python-considerations

Not clear (to me) what they will do with Python file API though. it seems like 
they will als ojus twrap the C code back to 
[tf.Gfile|https://www.tensorflow.org/api_docs/python/tf/io/gfile/GFile]

h2. Other considerations on FS ergonomics

In the long run I would also like to enhance the FileSystem API and add more 
methods that 

[jira] [Updated] (ARROW-7584) [Python] Improve ergonomics of new FileSystem API

2020-01-15 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-7584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fabian Höring updated ARROW-7584:
-
Description: 
The [new Python FileSystem API 
|https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L185] is 
nice but seems to be very verbose to use.

The documentation of the old FS API is 
[here|https://arrow.apache.org/docs/python/filesystems.html]

h2. Here are some examples

*Filesystem access:*

Before:
fs.ls()
fs.mkdir()
fs.rmdir()

Now:
fs.get_target_stats()
fs.create_dir()
fs.delete_dir()

What is the advantage of having a longer method ? The short ones seems clear 
and are much easier to use. Seems like an easy change.  Also this is consistent 
with what is doing hdfs in the [fs api| 
https://arrow.apache.org/docs/python/filesystems.html] and works naturally with 
a local filesystem.

*File opening:*

Before:
with fs.open(self, path, mode=u'rb', buffer_size=None)

Now:
fs.open_input_file()
fs.open_input_stream()
fs.open_output_stream()

It seems more natural to fit to Python standard open function which works for 
local file access as well. Not sure if this is possible to do easily as there 
is `_wrap_output_stream` method.

h2. Proposed solutions

- If the current Python API is still unused we could just rename the methods
- We could keep everything as is and add some alias methods, it would make the 
FileSystem class a bit messy think if there are always 2 method to do the work

h2. Tensorflow RFC on FileSystems

Tensorflow is also doing some standardization work on their FileSystem:
https://github.com/tensorflow/community/blob/master/rfcs/20190506-filesystem-plugin-modular-tensorflow.md#python-considerations

Not clear (to me) what they will do with Python file API though. it seems like 
they will als ojus twrap the C code back to 
[tf.Gfile|https://www.tensorflow.org/api_docs/python/tf/io/gfile/GFile]

h2. Other considerations on FS ergonomics

In the long run I would also like to enhance the FileSystem API and add more 
methods that use the basic ones to provide new features for example:
- introduce put and get on top of the streams that directly upload/download 
files
- introduce [touch|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L601]
- introduce [du|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L252]
- check if selector works with globs or add 
https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L349
- be able to write strings to the file streams (instead of only bytes), it 
would permit to directly use some Python API's like json.dump

{code}
with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  json.dump(res, fd)
{code}

instead of

{code}
with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  fd.write(json.dumps(res))
{code}

or like currently (with old API, which required encore each time, untested with 
new one)

{code}with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  fd.write(json.dumps(res).encode())
{code}

- not clear how to make this also work when reading from files 

  was:
The [new Python FileSystem API 
|https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L185] is 
nice but seems to be very verbose to use.

The documentation of the old FS API is 
[here|https://arrow.apache.org/docs/python/filesystems.html]

h2. Here are some examples

*Filesystem access:*

Before:
fs.ls()
fs.mkdir()
fs.rmdir()

Now:
fs.get_target_stats()
fs.create_dir()
fs.delete_dir()

What is the advantage of having a longer method ? The short ones seems clear 
and are much easier to use. Seems like an easy change.  Also this is consistent 
with what is doing hdfs in the [fs api| 
https://arrow.apache.org/docs/python/filesystems.html] and works naturally with 
a local filesystem.

*File opening:*

Before:
with fs.open(self, path, mode=u'rb', buffer_size=None)

Now:
fs.open_input_file()
fs.open_input_stream()
fs.open_output_stream()

It seems more natural to fit to Python standard open function which works for 
local file access as well. Not sure if this is possible to do easily as there 
is `_wrap_output_stream` method.

h2. Proposed solutions

- If the current Python API is still unused we could just rename the methods
- We could keep everything as is and add some alias methods, it would make the 
FileSystem class a bit messy think if there are always 2 method to do the work

h2. Tensorflow RFC on FileSystems

Tensorflow is also doing some standardization work on their FileSystem:
https://github.com/tensorflow/community/blob/master/rfcs/20190506-filesystem-plugin-modular-tensorflow.md#python-considerations

Not clear (to me) what they will do with Python file API though. it seems like 
they will als ojus twrap the C code back to 
[tf.Gfile|https://www.tensorflow.org/api_docs/python/tf/io/gfile/GFile]

h2. Other considerations on FS ergonomics

In the long run I would also like to enhance the FileSystem API and add more 
methods that use the basic 

[jira] [Updated] (ARROW-7584) [Python] Improve ergonomics of new FileSystem API

2020-01-15 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-7584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fabian Höring updated ARROW-7584:
-
Description: 
The [new Python FileSystem API 
|https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L185] is 
nice but seems to be very verbose to use.

The documentation of the old FS API is 
[here|https://arrow.apache.org/docs/python/filesystems.html]

h2. Here are some examples

*Filesystem access:*

Before:
fs.ls()
fs.mkdir()
fs.rmdir()

Now:
fs.get_target_stats()
fs.create_dir()
fs.delete_dir()

What is the advantage of having a longer method ? The short ones seems clear 
and are much easier to use. Seems like an easy change.  Also this is consistent 
with what is doing hdfs in the [fs api| 
https://arrow.apache.org/docs/python/filesystems.html] and works naturally with 
a local filesystem.

*File opening:*

Before:
with fs.open(self, path, mode=u'rb', buffer_size=None)

Now:
fs.open_input_file()
fs.open_input_stream()
fs.open_output_stream()

It seems more natural to fit to Python standard open function which works for 
local file access as well. Not sure if this is possible to do easily as there 
is `_wrap_output_stream` method.

h2. Proposed solutions

- If the current Python API is still unused we could just rename the methods
- We could keep everything as is and add some alias methods, it would make the 
FileSystem class a bit messy think if there are always 2 method to do the work

h2. Tensorflow RFC on FileSystems

Tensorflow is also doing some standardization work on their FileSystem:
https://github.com/tensorflow/community/blob/master/rfcs/20190506-filesystem-plugin-modular-tensorflow.md#python-considerations

Not clear (to me) what they will do with Python file API though. it seems like 
they will als ojus twrap the C code back to 
[tf.Gfile|https://www.tensorflow.org/api_docs/python/tf/io/gfile/GFile]

h2. Other considerations on FS ergonomics

In the long run I would also like to enhance the FileSystem API and add more 
methods that use the basic ones to provide new features for example:
- introduce put and get on top of the streams that directly upload/download 
files
- introduce [du|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L252]
- check if selector works with globs or add 
https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L349
- be able to write strings to the file streams (instead of only bytes), it 
would permit to directly use some Python API's like json.dump

{code}
with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  json.dump(res, fd)
{code}

instead of

{code}
with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  fd.write(json.dumps(res))
{code}

or like currently (with old API, which required encore each time, untested with 
new one)

{code}with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  fd.write(json.dumps(res).encode())
{code}

- not clear how to make this also work when reading from files 

  was:
The [new Python FileSystem API 
|https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L185] is 
nice but seems to be very verbose to use.

The documentation of the old FS API is 
[here|https://arrow.apache.org/docs/python/filesystems.html]

h3. Here are some examples

*Filesystem access:*

Before:
fs.ls()
fs.mkdir()
fs.rmdir()

Now:
fs.get_target_stats()
fs.create_dir()
fs.delete_dir()

What is the advantage of having a longer method ? The short ones seems clear 
and are much easier to use. Seems like an easy change.  Also this is consistent 
with what is doing hdfs in the [fs api| 
https://arrow.apache.org/docs/python/filesystems.html] and works naturally with 
a local filesystem.

*File opening:*

Before:
with fs.open(self, path, mode=u'rb', buffer_size=None)

Now:
fs.open_input_file()
fs.open_input_stream()
fs.open_output_stream()

It seems more natural to fit to Python standard open function which works for 
local file access as well. Not sure if this is possible to do easily as there 
is `_wrap_output_stream` method.

h3. Proposed solutions

- If the current Python API is still unused we could just rename the methods
- We could keep everything as is and add some alias methods, it would make the 
FileSystem class a bit messy think if there are always 2 method to do the work

h3. Tensorflow RFC on FileSystems

Tensorflow is also doing some standardization work on their FileSystem:
https://github.com/tensorflow/community/blob/master/rfcs/20190506-filesystem-plugin-modular-tensorflow.md#python-considerations

Not clear (to me) what they will do with Python file API though. it seems like 
they will als ojus twrap the C code back to 
[tf.Gfile|https://www.tensorflow.org/api_docs/python/tf/io/gfile/GFile]

h3. Other considerations on FS ergonomics

In the long run I would also like to enhance the FileSystem API and add more 
methods that use the basic ones to provide new features for example:
- introduce put and get on top of the 

[jira] [Updated] (ARROW-7584) [Python] Improve ergonomics of new FileSystem API

2020-01-15 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-7584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fabian Höring updated ARROW-7584:
-
Description: 
The [new Python FileSystem API 
|https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L185] is 
nice but seems to be very verbose to use.

The documentation of the old FS API is 
[here|https://arrow.apache.org/docs/python/filesystems.html]

h3. Here are some examples

*Filesystem access:*

Before:
fs.ls()
fs.mkdir()
fs.rmdir()

Now:
fs.get_target_stats()
fs.create_dir()
fs.delete_dir()

What is the advantage of having a longer method ? The short ones seems clear 
and are much easier to use. Seems like an easy change.  Also this is consistent 
with what is doing hdfs in the [fs api| 
https://arrow.apache.org/docs/python/filesystems.html] and works naturally with 
a local filesystem.

*File opening:*

Before:
with fs.open(self, path, mode=u'rb', buffer_size=None)

Now:
fs.open_input_file()
fs.open_input_stream()
fs.open_output_stream()

It seems more natural to fit to Python standard open function which works for 
local file access as well. Not sure if this is possible to do easily as there 
is `_wrap_output_stream` method.

h3. Proposed solutions

- If the current Python API is still unused we could just rename the methods
- We could keep everything as is and add some alias methods, it would make the 
FileSystem class a bit messy think if there are always 2 method to do the work

h3. Tensorflow RFC on FileSystems

Tensorflow is also doing some standardization work on their FileSystem:
https://github.com/tensorflow/community/blob/master/rfcs/20190506-filesystem-plugin-modular-tensorflow.md#python-considerations

Not clear (to me) what they will do with Python file API though. it seems like 
they will als ojus twrap the C code back to 
[tf.Gfile|https://www.tensorflow.org/api_docs/python/tf/io/gfile/GFile]

h3. Other considerations on FS ergonomics

In the long run I would also like to enhance the FileSystem API and add more 
methods that use the basic ones to provide new features for example:
- introduce put and get on top of the streams that directly upload/download 
files
- introduce [du|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L252]
- check if selector works with globs or add 
https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L349
- be able to write strings to the file streams (instead of only bytes), it 
would permit to directly use some Python API's like json.dump

{code}
with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  json.dump(res, fd)
{code}

instead of

{code}
with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  fd.write(json.dumps(res))
{code}

or like currently (with old API, which required encore each time, untested with 
new one)

{code}with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  fd.write(json.dumps(res).encode())
{code}

- not clear how to make this also work when reading from files 

  was:
The [new Python FileSystem API 
|https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L185] is 
nice but seems to be very verbose to use.

The documentation of the old FS API is 
[here|https://arrow.apache.org/docs/python/filesystems.html]

h3. Here are some examples

*Filesystem access:*

Before:
fs.ls()
fs.mkdir()
fs.rmdir()

Now:
fs.get_target_stats()
fs.create_dir()
fs.delete_dir()

What is the advantage of having a longer method ? The short ones seems clear 
and are much easier to use. Seems like an easy change.  Also this is consistent 
with what is doing hdfs in the [fs api| 
https://arrow.apache.org/docs/python/filesystems.html] and works naturally with 
a local filesystem.

*File opening:*

Before:
with fs.open(self, path, mode=u'rb', buffer_size=None)

Now:
fs.open_input_file()
fs.open_input_stream()
fs.open_output_stream()

It seems more natural to match Python's standard open function, which works 
for local file access as well. It is not clear whether this is easy to do, as 
there is a `_wrap_output_stream` method.

h3. Proposed solutions

- If the current Python API is still unused, we could just rename the methods
- We could keep everything as is and add some alias methods; I think it would 
make the FileSystem class a bit messy because there would always be 2 methods 
to do the same work
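The alias option could also live in a thin wrapper instead of the FileSystem class itself. A hypothetical sketch (the wrapper and recorder classes are illustration-only; the long method names are the ones from the new API discussed above):

```python
class _Recorder:
    # Stand-in for a pyarrow.fs filesystem, recording calls.
    def __init__(self):
        self.calls = []

    def get_target_stats(self, path):
        self.calls.append(("get_target_stats", path))
        return []

    def create_dir(self, path):
        self.calls.append(("create_dir", path))

    def delete_dir(self, path):
        self.calls.append(("delete_dir", path))

class AliasedFileSystem:
    """Hypothetical wrapper adding short aliases for the long names."""
    def __init__(self, fs):
        self._fs = fs

    def ls(self, path):
        return self._fs.get_target_stats(path)

    def mkdir(self, path):
        return self._fs.create_dir(path)

    def rmdir(self, path):
        return self._fs.delete_dir(path)

rec = _Recorder()
fs = AliasedFileSystem(rec)
fs.mkdir("/data")
fs.ls("/data")
fs.rmdir("/data")
print(rec.calls)
```

This keeps the FileSystem class itself clean at the cost of one extra layer.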

h3. TensorFlow RFC on FileSystems

TensorFlow is also doing some standardization work on their FileSystem:
https://github.com/tensorflow/community/blob/master/rfcs/20190506-filesystem-plugin-modular-tensorflow.md#python-considerations

It is not clear (to me) what they will do with the Python file API though. It 
seems like they will also just wrap the C code back to 
[tf.GFile|https://www.tensorflow.org/api_docs/python/tf/io/gfile/GFile]

h3. Other considerations on FS ergonomics

In the long run I would also like to enhance the FileSystem API and add more 
methods that build on the basic ones to provide new features, for example:
- introduce put and get on top of the streams to directly upload/download 
files

[jira] [Updated] (ARROW-7584) [Python] Improve ergonomics of new FileSystem API

2020-01-15 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-7584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fabian Höring updated ARROW-7584:
-
Description: 
The [new Python FileSystem API 
|https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L185] is 
nice but seems to be very verbose to use.

The documentation of the old FS API is 
[here|https://arrow.apache.org/docs/python/filesystems.html]

h3. Here are some examples

*Filesystem access:*

Before:
fs.ls()
fs.mkdir()
fs.rmdir()

Now:
fs.get_target_stats()
fs.create_dir()
fs.delete_dir()

What is the advantage of having a longer method ? The short ones seems clear 
and are much easier to use. Seems like an easy change.  Also this is consistent 
with what is doing hdfs in the [fs api| 
https://arrow.apache.org/docs/python/filesystems.html] and works naturally with 
a local filesystem.

*File opening:*

Before:
with fs.open(self, path, mode=u'rb', buffer_size=None)

Now:
fs.open_input_file()
fs.open_input_stream()
fs.open_output_stream()

It seems more natural to fit to Python standard open function which works for 
local file access as well. Not sure if this is possible to do easily as there 
is `_wrap_output_stream` method.

h3. Proposed solutions

- If the current Python API is still unused we could just rename the methods
- We could keep everything as is and add some alias methods, it would make the 
FileSystem class a bit messy think if there are always 2 method to do the work

h3: Tensorflow RFC on FileSystems

Tensorflow is also doing some standardization work on their FileSystem:
https://github.com/tensorflow/community/blob/master/rfcs/20190506-filesystem-plugin-modular-tensorflow.md#python-considerations

Not clear (to me) what they will do with Python file API though. it seems like 
they will als ojus twrap the C code back to 
[tf.Gfile|https://www.tensorflow.org/api_docs/python/tf/io/gfile/GFile]

h3. Other considerations on FS ergonomics

In the long run I would also like to enhance the FileSystem API and add more 
methods that use the basic ones to provide new features for example:
- introduce put and get on top of the streams that directly upload/download 
files
- introduce [du|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L252]
- check if selector works with globs or add 
https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L349
- be able to write strings to the file streams (instead of only bytes), it 
would permit to directly use some Python API's like json.dump

{code}
with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  json.dump(res, fd)
{code}

instead of

{code}
with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  fd.write(json.dumps(res))
{code}

or like currently (with old API, which required encore each time, untested with 
new one)

{code}with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  fd.write(json.dumps(res).encode())
{code}

- not clear how to make this also work when reading from files 

  was:
The [new Python FileSystem API 
|https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L185] is 
nice but seems to be very verbose to use.

The documentation of the old FS API is 
[here|https://arrow.apache.org/docs/python/filesystems.html]

h3. Here are some examples

*Filesystem access:*

Before:
fs.ls()
fs.mkdir()
fs.rmdir()

Now:
fs.get_target_stats()
fs.create_dir()
fs.delete_dir()

What is the advantage of having a longer method ? The short ones seems clear 
and are much easier to use. Seems like an easy change.  Also this is consistent 
with what is doing hdfs in the [fs api| 
https://arrow.apache.org/docs/python/filesystems.html] and works naturally with 
a local filesystem.

*File opening:*

Before:
with fs.open(self, path, mode=u'rb', buffer_size=None)

Now:
fs.open_input_file()
fs.open_input_stream()
fs.open_output_stream()

It seems more natural to fit to Python standard open function which works for 
local file access as well. Not sure if this is possible to do easily as there 
is `_wrap_output_stream` method.

h3. Solutions

- If the current Python API is still unused we could just rename the methods
- We could everything as is and add some alias methods, it would make the 
FileSystem class a bit messy think if there are always 2 method to do the work

h3. Other considerations on ergonomics

In the long run I would also like to enhance the FileSystem API and add more 
methods that use the basic ones to provide new features for example:
- introduce put and get on top of the streams that directly upload/download 
files
- introduce [du|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L252]
- check if selector works with globs or add 
https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L349
- be able to write strings to the file streams (instead of only bytes), it 
would permit to directly use some Python API's like json.dump

{code}
with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  json.dump(res, fd)

[jira] [Updated] (ARROW-7584) [Python] Improve ergonomics of new FileSystem API

2020-01-15 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-7584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fabian Höring updated ARROW-7584:
-
Description: 
The [new Python FileSystem API 
|https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L185] is 
nice but seems to be very verbose to use.

The documentation of the old FS API is 
[here|https://arrow.apache.org/docs/python/filesystems.html]

h3. Here are some examples

*Filesystem access:*

Before:
fs.ls()
fs.mkdir()
fs.rmdir()

Now:
fs.get_target_stats()
fs.create_dir()
fs.delete_dir()

What is the advantage of having a longer method ? The short ones seems clear 
and are much easier to use. Seems like an easy change.  Also this is consistent 
with what is doing hdfs in the [fs api| 
https://arrow.apache.org/docs/python/filesystems.html] and works naturally with 
a local filesystem.

*File opening:*

Before:
with fs.open(self, path, mode=u'rb', buffer_size=None)

Now:
fs.open_input_file()
fs.open_input_stream()
fs.open_output_stream()

It seems more natural to fit to Python standard open function which works for 
local file access as well. Not sure if this is possible to do easily as there 
is `_wrap_output_stream` method.

h3. Solutions

- If the current Python API is still unused we could just rename the methods
- We could everything as is and add some alias methods, it would make the 
FileSystem class a bit messy think if there are always 2 method to do the work

h3. Other considerations on ergonomics

In the long run I would also like to enhance the FileSystem API and add more 
methods that use the basic ones to provide new features for example:
- introduce put and get on top of the streams that directly upload/download 
files
- introduce [du|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L252]
- check if selector works with globs or add 
https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L349
- be able to write strings to the file streams (instead of only bytes), it 
would permit to directly use some Python API's like json.dump

{code}
with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  json.dump(res, fd)
{code}

instead of

{code}
with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  fd.write(json.dumps(res))
{code}

or like currently (with old API, which required encore each time, untested with 
new one)

{code}with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  fd.write(json.dumps(res).encode())
{code}

- not clear how to make this also work when reading from files 

  was:
The [new Python FileSystem API 
|https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L185] is 
nice but seems to be very verbose to use.

The documentation of the old FS API is 
[here|https://arrow.apache.org/docs/python/filesystems.html]

h3. Here are some examples

*File access:*

Before:
fs.ls()
fs.mkdir()
fs.rmdir()

Now:
fs.get_target_stats()
fs.create_dir()
fs.delete_dir()

What is the advantage of having a longer method ? The short ones seems clear 
and are much easier to use. Seems like an easy change.  Also this is consistent 
with what is doing hdfs in the [fs api| 
https://arrow.apache.org/docs/python/filesystems.html] and works naturally with 
a local filesystem.

*File opening:*

Before:
with fs.open(self, path, mode=u'rb', buffer_size=None)

Now:
fs.open_input_file()
fs.open_input_stream()
fs.open_output_stream()

It seems more natural to fit to Python standard open function which works for 
local file access as well. Not sure if this is possible to do easily as there 
is `_wrap_output_stream` method.

h3. Solutions

- If the current Python API is still unused we could just rename the methods
- We could everything as is and add some alias methods, it would make the 
FileSystem class a bit messy think if there are always 2 method to do the work

h3. Other considerations on ergonomics

In the long run I would also like to enhance the FileSystem API and add more 
methods that use the basic ones to provide new features for example:
- introduce put and get on top of the streams that directly upload/download 
files
- introduce [du|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L252]
- check if selector works with globs or add 
https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L349
- be able to write strings to the file streams (instead of only bytes), it 
would permit to directly use some Python API's like json.dump

{code}
with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  json.dump(res, fd)
{code}

instead of

{code}
with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  fd.write(json.dumps(res))
{code}

or like currently (with old API, which required encore each time, untested with 
new one)

{code}with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  fd.write(json.dumps(res).encode())
{code}

- not clear how to make this also work when reading from files 


> [Python] Improve ergonomics of new FileSystem API
> 

[jira] [Updated] (ARROW-7584) [Python] Improve ergonomics of new FileSystem API

2020-01-15 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-7584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fabian Höring updated ARROW-7584:
-
Description: 
The [new Python FileSystem API 
|https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L185] is 
nice but seems to be very verbose to use.

The documentation of the old FS API is 
[here|https://arrow.apache.org/docs/python/filesystems.html]

h3. Here are some examples

*File access:*

Before:
fs.ls()
fs.mkdir()
fs.rmdir()

Now:
fs.get_target_stats()
fs.create_dir()
fs.delete_dir()

What is the advantage of having a longer method ? The short ones seems clear 
and are much easier to use. Seems like an easy change.  Also this is consistent 
with what is doing hdfs in the [fs api| 
https://arrow.apache.org/docs/python/filesystems.html] and works naturally with 
a local filesystem.

*File opening:*

Before:
with fs.open(self, path, mode=u'rb', buffer_size=None)

Now:
fs.open_input_file()
fs.open_input_stream()
fs.open_output_stream()

It seems more natural to fit to Python standard open function which works for 
local file access as well. Not sure if this is possible to do easily as there 
is `_wrap_output_stream` method.

h3. Solutions

- If the current Python API is still unused we could just rename the methods
- We could everything as is and add some alias methods, it would make the 
FileSystem class a bit messy think if there are always 2 method to do the work

h3. Other considerations on ergonomics

In the long run I would also like to enhance the FileSystem API and add more 
methods that use the basic ones to provide new features for example:
- introduce put and get on top of the streams that directly upload/download 
files
- introduce [du|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L252]
- check if selector works with globs or add 
https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L349
- be able to write strings to the file streams (instead of only bytes), it 
would permit to directly use some Python API's like json.dump

{code}
with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  json.dump(res, fd)
{code}

instead of

{code}
with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  fd.write(json.dumps(res))
{code}

or like currently (with old API, which required encore each time, untested with 
new one)

{code}with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  fd.write(json.dumps(res).encode())
{code}

- not clear how to make this also work when reading from files 

  was:
The [new Python FileSystem API 
|https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L185] is 
nice but seems to be very verbose to use.

The documentation of the old FS API is 
[here|https://arrow.apache.org/docs/python/filesystems.html]

h3. Here are some examples

*File access:*

Before:
fs.ls()
fs.mkdir()
fs.rmdir()

Now:
fs.get_target_stats()
fs.create_dir()
fs.delete_dir()

What is the advantage of having a longer method ? The short ones seems clear 
and are much easier to use. Seems like an easy change.  Also this is consistent 
with what is doing hdfs in the [fs api| 
https://arrow.apache.org/docs/python/filesystems.html] and works naturally with 
a local filesystem.

*File opening:*

Before:
with fs.open(self, path, mode=u'rb', buffer_size=None)

Now:
fs.open_input_file()
fs.open_input_stream()
fs.open_output_stream()

It seems more natural to fit to Python standard open function which works for 
local file access as well. Not sure if this is possible to do easily as there 
is `_wrap_output_stream` method.

h3. Solutions

- If the current Python API is still unused we could just rename the methods
- We could everything as is and add some alias methods, it would make the 
FileSystem class a bit messy think if there are always 2 method to do the work

h3. Other considerations on ergonomics

In the long run I would also like to enhance the FileSystem API and add more 
methods that use the basic ones to provide new features for example:
- introduce put and get on top of the streams that directly upload/download 
files
- introduce [du|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L252]
- check if selector works with globs or add 
https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L349
- be able to write strings to the file streams (instead of only bytes), it 
would permit to directly use some Python API's like json.dump

{code}
with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  json.dump(res, fd)
{code}

instead of

{code}
with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  fd.write(json.dumps(res)) # instead of fd.write(json.dumps(res).encode())
{code}

or like currently (with old API, untested with new one)

{code}with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  fd.write(json.dumps(res).encode())
{code}



> [Python] Improve ergonomics of new FileSystem API
> -
>
> Key: ARROW-7584
>   

[jira] [Updated] (ARROW-7584) [Python] Improve ergonomics of new FileSystem API

2020-01-15 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-7584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fabian Höring updated ARROW-7584:
-
Description: 
The [new Python FileSystem API 
|https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L185] is 
nice but seems to be very verbose to use.

The documentation of the old FS API is 
[here|https://arrow.apache.org/docs/python/filesystems.html]

h3. Here are some examples

*File access:*

Before:
fs.ls()
fs.mkdir()
fs.rmdir()

Now:
fs.get_target_stats()
fs.create_dir()
fs.delete_dir()

What is the advantage of having a longer method ? The short ones seems clear 
and are much easier to use. Seems like an easy change.  Also this is consistent 
with what is doing hdfs in the [fs api| 
https://arrow.apache.org/docs/python/filesystems.html] and works naturally with 
a local filesystem.

*File opening:*

Before:
with fs.open(self, path, mode=u'rb', buffer_size=None)

Now:
fs.open_input_file()
fs.open_input_stream()
fs.open_output_stream()

It seems more natural to fit to Python standard open function which works for 
local file access as well. Not sure if this is possible to do easily as there 
is `_wrap_output_stream` method.

h3. Solutions

- If the current Python API is still unused we could just rename the methods
- We could everything as is and add some alias methods, it would make the 
FileSystem class a bit messy think if there are always 2 method to do the work

h3. Other considerations on ergonomics

In the long run I would also like to enhance the FileSystem API and add more 
methods that use the basic ones to provide new features for example:
- introduce put and get on top of the streams that directly upload/download 
files
- introduce [du|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L252]
- check if selector works with globs or add 
https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L349
- be able to write strings to the file streams (instead of only bytes), it 
would permit to directly use some Python API's like json.dump

{code}
with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  json.dump(res, fd)
{code}

instead of

{code}
with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  fd.write(json.dumps(res)) # instead of fd.write(json.dumps(res).encode())
{code}

or like currently (with old API, untested with new one)

{code}with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  fd.write(json.dumps(res).encode())
{code}


  was:
The [new Python FileSystem API 
|https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L185] is 
nice but seems to be very verbose to use.

The documentation of the old FS API is 
[here|https://arrow.apache.org/docs/python/filesystems.html]

h3. Here are some examples

*File access:*

Before:
fs.ls()
fs.mkdir()
fs.rmdir()

Now:
fs.get_target_stats()
fs.create_dir()
fs.delete_dir()

What is the advantage of having a longer method ? The short ones seems clear 
and are much easier to use. Seems like an easy change.  Also this is consistent 
with what is doing hdfs in the [fs api| 
https://arrow.apache.org/docs/python/filesystems.html] and works naturally with 
a local filesystem.

*File opening:*

Before:
with fs.open(self, path, mode=u'rb', buffer_size=None)

Now:
fs.open_input_file()
fs.open_input_stream()
fs.open_output_stream()

It seems more natural to fit to Python standard open function which works for 
local file access as well. Not sure if this is possible to do easily as there 
is `_wrap_output_stream` method.

h3. Solutions

- If the current Python API is still unused we could just rename the methods
- We could everything as is and add some alias methods, it would make the 
FileSystem class a bit messy think if there are always 2 method to do the work

h3. Other considerations on ergonomics

In the long run I would also like to enhance the FileSystem API and add more 
methods that use the basic ones to provide new features for example:
- introduce put and get on top of the streams that directly upload/download 
files
- introduce [du|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L252]
- check if selector works with globs or add 
https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L349
- be able to write string to the file streams (instead of only bytes), it would 
permit to directly use some Python API's like json.dump

{code}
with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  json.dump(res, fd)
{code}

instead of

{code}
with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  fd.write(json.dumps(res)) # instead of fd.write(json.dumps(res).encode())
{code}

or like currently (with old API, untested with new one)

{code}with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  fd.write(json.dumps(res).encode())
{code}



> [Python] Improve ergonomics of new FileSystem API
> -
>
> Key: ARROW-7584
> URL: 

[jira] [Updated] (ARROW-7584) [Python] Improve ergonomics of new FileSystem API

2020-01-15 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-7584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fabian Höring updated ARROW-7584:
-
Description: 
The [new Python FileSystem API 
|https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L185] is 
nice but seems to be very verbose to use.

The documentation of the old FS API is 
[here|https://arrow.apache.org/docs/python/filesystems.html]

h3. Here are some examples

*File access:*

Before:
fs.ls()
fs.mkdir()
fs.rmdir()

Now:
fs.get_target_stats()
fs.create_dir()
fs.delete_dir()

What is the advantage of having a longer method ? The short ones seems clear 
and are much easier to use. Seems like an easy change.  Also this is consistent 
with what is doing hdfs in the [fs api| 
https://arrow.apache.org/docs/python/filesystems.html] and works naturally with 
a local filesystem.

*File opening:*

Before:
with fs.open(self, path, mode=u'rb', buffer_size=None)

Now:
fs.open_input_file()
fs.open_input_stream()
fs.open_output_stream()

It seems more natural to fit to Python standard open function which works for 
local file access as well. Not sure if this is possible to do easily as there 
is `_wrap_output_stream` method.

h3. Solutions

- If the current Python API is still unused we could just rename the methods
- We could everything as is and add some alias methods, it would make the 
FileSystem class a bit messy think if there are always 2 method to do the work

h3. Other considerations on ergonomics

In the long run I would also like to enhance the FileSystem API and add more 
methods that use the basic ones to provide new features for example:
- introduce put and get on top of the streams that directly upload/download 
files
- introduce [du|https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L252]
- check if selector works with globs or add 
https://github.com/dask/hdfs3/blob/master/hdfs3/core.py#L349
- be able to write string to the file streams (instead of only bytes), it would 
permit to directly use some Python API's like json.dump

{code}
with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  json.dump(res, fd)
{code}

instead of

{code}
with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  fd.write(json.dumps(res)) # instead of fd.write(json.dumps(res).encode())
{code}

or like currently (with old API, untested with new one)

{code}with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  fd.write(json.dumps(res).encode())
{code}


  was:
The [new Python FileSystem API 
|https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L185] is 
nice but seems to be very verbose to use.

The documentation of the old FS API is 
[here|https://arrow.apache.org/docs/python/filesystems.html]

h3. Here are some examples

*File access:*

Before:
fs.ls()
fs.mkdir()
fs.rmdir()

Now:
fs.get_target_stats()
fs.create_dir()
fs.delete_dir()

What is the advantage of having a longer method ? The short ones seems clear 
and are much easier to use. Seems like an easy change.  Also this is consistent 
with what is doing hdfs in the [fs api| 
https://arrow.apache.org/docs/python/filesystems.html] and works naturally with 
a local filesystem.

*File opening:*

Before:
with fs.open(self, path, mode=u'rb', buffer_size=None)

Now:
fs.open_input_file()
fs.open_input_stream()
fs.open_output_stream()

It seems more natural to fit to Python standard open function which works for 
local file access as well. Not sure if this is possible to doeasily. as there 
is `_wrap_output_stream`

h3. Solutions:
- If the current Python API is still unused we could just rename the methods
- We could everything as is and add some alias methods, it would make the 
FileSystem class a bit messy think if there are always 2 method to do the work

h3. Other considerations:
In the long run I would also enhance the FileSystem API to add more methods 
that use the basic to provide new features for example:
- introduce put and get on top of the streams that directly upload/download 
files
- be able to write string to the file streams (instead of only bytes), it would 
permit to directly use some Python API's like json.dump

{code}
with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  json.dump(res, fd)
{code}

instead of

{code}
with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  fd.write(json.dumps(res)) # instead of fd.write(json.dumps(res).encode())
{code}

or like currently (with old API, untested with new one)

{code}with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  fd.write(json.dumps(res).encode())
{code}



> [Python] Improve ergonomics of new FileSystem API
> -
>
> Key: ARROW-7584
> URL: https://issues.apache.org/jira/browse/ARROW-7584
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Fabian Höring
>Priority: Major
>
> The [new Python FileSystem API 

[jira] [Updated] (ARROW-7584) [Python] Improve ergonomics of new FileSystem API

2020-01-15 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-7584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fabian Höring updated ARROW-7584:
-
Description: 
The [new Python FileSystem API 
|https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L185] is 
nice but seems to be very verbose to use.

The documentation of the old FS API is 
[here|https://arrow.apache.org/docs/python/filesystems.html]

h3. Here are some examples

*File access:*

Before:
fs.ls()
fs.mkdir()
fs.rmdir()

Now:
fs.get_target_stats()
fs.create_dir()
fs.delete_dir()

What is the advantage of having a longer method ? The short ones seems clear 
and are much easier to use. Seems like an easy change.  Also this is consistent 
with what is doing hdfs in the [fs api| 
https://arrow.apache.org/docs/python/filesystems.html] and works naturally with 
a local filesystem.

*File opening:*

Before:
with fs.open(self, path, mode=u'rb', buffer_size=None)

Now:
fs.open_input_file()
fs.open_input_stream()
fs.open_output_stream()

It seems more natural to fit to Python standard open function which works for 
local file access as well. Not sure if this is possible to doeasily. as there 
is `_wrap_output_stream`

h3. Solutions:
- If the current Python API is still unused we could just rename the methods
- We could everything as is and add some alias methods, it would make the 
FileSystem class a bit messy think if there are always 2 method to do the work

h3. Other considerations:
In the long run I would also enhance the FileSystem API to add more methods 
that use the basic to provide new features for example:
- introduce put and get on top of the streams that directly upload/download 
files
- be able to write string to the file streams (instead of only bytes), it would 
permit to directly use some Python API's like json.dump

{code}
with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  json.dump(res, fd)
{code}

instead of

{code}
with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  fd.write(json.dumps(res)) # instead of fd.write(json.dumps(res).encode())
{code}

or like currently (with old API, untested with new one)

{code}with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  fd.write(json.dumps(res).encode())
{code}


  was:
The [new Python FileSystem API 
|https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L185] is 
nice but seems to be very verbose to use.

The documentation of the old FS API is 
[here|https://arrow.apache.org/docs/python/filesystems.html]

Here are some examples:

*File access:*

Before:
fs.ls()
fs.mkdir()
fs.rmdir()

Now:
fs.get_target_stats()
fs.create_dir()
fs.delete_dir()

What is the advantage of having a longer method ? The short ones seems clear 
and are much easier to use. Seems like an easy change.  Also this is consistent 
with what is doing hdfs in the [fs api| 
https://arrow.apache.org/docs/python/filesystems.html] and works naturally with 
a local filesystem.

*File opening:*

Before:
with fs.open(self, path, mode=u'rb', buffer_size=None)

Now:
fs.open_input_file()
fs.open_input_stream()
fs.open_output_stream()

It seems more natural to fit to Python standard open function which works for 
local file access as well. Not sure if this is possible to doeasily. as there 
is `_wrap_output_stream`

Solutions:
- If the current Python API is still unused we could just rename the methods
- We could everything as is and add some alias methods, it would make the 
FileSystem class a bit messy think if there are always 2 method to do the work

Other considerations:
In the long run I would also enhance the FileSystem API to add more methods 
that use the basic to provide new features for example:
- introduce put and get on top of the streams that directly upload/download 
files
- be able to write string to the file streams (instead of only bytes), it would 
permit to directly use some Python API's like json.dump

{code}
with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  json.dump(res, fd)
{code}

instead of

```
with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  fd.write(json.dumps(res)) # instead of fd.write(json.dumps(res).encode())
```

or like currently (with old API, untested with new one)

```
with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  fd.write(json.dumps(res).encode())
```



> [Python] Improve ergonomics of new FileSystem API
> -
>
> Key: ARROW-7584
> URL: https://issues.apache.org/jira/browse/ARROW-7584
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Fabian Höring
>Priority: Major
>
> The [new Python FileSystem API 
> |https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L185] is 
> nice but seems to be very verbose to use.
> The documentation of the old FS API is 
> [here|https://arrow.apache.org/docs/python/filesystems.html]
> h3. Here 

[jira] [Updated] (ARROW-7584) [Python] Improve ergonomics of new FileSystem API

2020-01-15 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-7584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fabian Höring updated ARROW-7584:
-
Description: 
The [new Python FileSystem API 
|https://github.com/apache/arrow/blob/master/python/pyarrow/_fs.pyx#L185] is 
nice but seems to be very verbose to use.

The documentation of the old FS API is 
[here|https://arrow.apache.org/docs/python/filesystems.html]

Here are some examples:

*File access:*

Before:
fs.ls()
fs.mkdir()
fs.rmdir()

Now:
fs.get_target_stats()
fs.create_dir()
fs.delete_dir()

What is the advantage of having a longer method ? The short ones seems clear 
and are much easier to use. Seems like an easy change.  Also this is consistent 
with what is doing hdfs in the [fs api| 
https://arrow.apache.org/docs/python/filesystems.html] and works naturally with 
a local filesystem.

*File opening:*

Before:
with fs.open(self, path, mode=u'rb', buffer_size=None)

Now:
fs.open_input_file()
fs.open_input_stream()
fs.open_output_stream()

It seems more natural to match Python's standard open function, which works for 
local file access as well. Not sure whether this is easy to do, as there 
is a `_wrap_output_stream` method.
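For illustration, a builtin-style open could dispatch on the mode string to the stream methods. This is a hypothetical sketch with stubbed stream methods, not the real pyarrow implementation:

```python
# Sketch only: a single open() dispatching on mode, like Python's builtin.
# The stream methods are stubs standing in for the real binary streams.
class NewFileSystem:
    def open_input_stream(self, path):
        return ("input", path)   # stub for a readable binary stream

    def open_output_stream(self, path):
        return ("output", path)  # stub for a writable binary stream

    def open(self, path, mode="rb"):
        # Hypothetical convenience wrapper; only binary modes sketched here.
        if "r" in mode:
            return self.open_input_stream(path)
        if "w" in mode or "a" in mode:
            return self.open_output_stream(path)
        raise ValueError("unsupported mode: " + mode)


fs = NewFileSystem()
print(fs.open("data.bin"))        # dispatches to open_input_stream
print(fs.open("out.bin", "wb"))   # dispatches to open_output_stream
```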

Solutions:
- If the current Python API is still unused we could just rename the methods
- We could keep everything as is and add some alias methods; I think this would 
make the FileSystem class a bit messy because there would always be 2 methods to 
do the same work

Other considerations:
In the long run I would also enhance the FileSystem API with more methods 
that build on the basic ones to provide new features, for example:
- introduce put and get on top of the streams to directly upload/download 
files
- be able to write strings to the file streams (instead of only bytes), which 
would permit using Python APIs like json.dump directly

{code}
with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  json.dump(res, fd)
{code}

instead of

```
with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  fd.write(json.dumps(res)) # instead of fd.write(json.dumps(res).encode())
```

or, as done currently (with the old API; untested with the new one):

```
with fs.open(path, "wb") as fd:
  res = {"a": "bc"}
  fd.write(json.dumps(res).encode())
```
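Until strings are accepted natively, a workaround that works with plain Python today (untested against the new API; an in-memory BytesIO stands in for the filesystem's output stream here) is to wrap the binary stream in io.TextIOWrapper:

```python
import io
import json

# Stand-in for the binary stream that open_output_stream() / open(path, "wb")
# would return; any writable binary file object behaves the same way.
binary_stream = io.BytesIO()

# TextIOWrapper encodes str -> bytes transparently, so json.dump works as is.
with io.TextIOWrapper(binary_stream, encoding="utf-8", write_through=True) as fd:
    json.dump({"a": "bc"}, fd)  # no manual .encode() needed
    written = binary_stream.getvalue()  # read before the wrapper closes it

print(written)  # b'{"a": "bc"}'
```

Note that closing the TextIOWrapper also closes the underlying binary stream, so the bytes are read back inside the with block.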


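The put/get bullet could be built entirely on top of the existing streams. The sketch below uses a stub filesystem (a dict standing in for remote storage) and hypothetical helper names, purely to show the shape such methods could take:

```python
import io
import os
import shutil
import tempfile


class SketchFileSystem:
    """Stub: 'remote' storage is just a dict of path -> bytes."""

    def __init__(self):
        self._store = {}

    def open_input_stream(self, path):
        return io.BytesIO(self._store[path])

    def open_output_stream(self, path):
        store = self._store

        class _Out(io.BytesIO):
            def close(self):
                store[path] = self.getvalue()  # "upload" on close
                super().close()

        return _Out()

    # Hypothetical helpers built on the streams (names are assumptions):
    def put(self, local_path, remote_path):
        with open(local_path, "rb") as src, self.open_output_stream(remote_path) as dst:
            shutil.copyfileobj(src, dst)

    def get(self, remote_path, local_path):
        with self.open_input_stream(remote_path) as src, open(local_path, "wb") as dst:
            shutil.copyfileobj(src, dst)


fs = SketchFileSystem()
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "local.json")
    with open(src, "wb") as f:
        f.write(b'{"a": "bc"}')
    fs.put(src, "bucket/remote.json")      # upload in one call
    dst = os.path.join(tmp, "copy.json")
    fs.get("bucket/remote.json", dst)      # download in one call
    with open(dst, "rb") as f:
        round_tripped = f.read()

print(round_tripped)  # b'{"a": "bc"}'
```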
