Arrow Dataset API on Ceph

2020-08-26 Thread Ivo Jimenez
Dear Arrow community, We are writing to share our thoughts about designing an Apache Arrow-native storage system leveraging Ceph’s extensibility mechanism as part of the SkyhookDM project and aim for a design that leverages Arrow as much as possible, both on the client API a

Writing parquet to new filesystem API

2020-08-26 Thread Weston Pace
Forgive me if I am missing something obvious but I am unable to write parquet files using the new filesystem API. Here is what I am trying: https://gist.github.com/westonpace/0c5ef01e21a40de5d16608b7f12de80d I receive an error: OSError: Unrecognized filesystem:

Re: Creating filesystems that read local files

2020-08-26 Thread Weston Pace
Ok. I think I have it figured out as: num_rows = 0 dataset = pa.dataset.dataset(short_files, filesystem=subtree_filesystem) for fragment in dataset.get_fragments(): fragment.ensure_complete_metadata() if fragment.row_groups: for row_group in fragment.row_groups: num_ro

conversion between pyspark.DataFrame and pyarrow.Table

2020-08-26 Thread Radu Teodorescu
Hi, I noticed that Arrow is mentioned as an optional intermediary format for converting between pandas DFs and Spark DFs. Is there a way to explicitly convert a pyarrow Table to a Spark DataFrame and the other way around? Absent that, going pyspark->pandas->pyarrow and back works but it’s obviou

Re: Authentication Redesign

2020-08-26 Thread James Duong
Hi everyone, I've updated the PR and responded to comments in the proposal document. The PR now makes handshake optional and sends auth information with every request. The client now needs to supply a CredentialCallOption containing auth information (as a Consumer), which we'll convert to a gRPC C

Re: Creating filesystems that read local files

2020-08-26 Thread Weston Pace
Thanks Joris / Antoine, It appears I will have to learn the new datasets API. I can confirm that SubTreeFileSystem is working for me. In case there is still interest, here is the code I had from before reproducing the issue: https://gist.github.com/westonpace/4107c1c492cdd78d611595d43e72964d It

Re: [Rust] Async record batch reader?

2020-08-26 Thread Andy Grove
Hi Max, I have been experimenting with an async record batch reader and was able to get a working version, but I had to use channels to communicate with the parquet reader, which ran on its own thread. I have taken a step back now that I have some experience of this and look forward to working wi

Re: [Rust] Async record batch reader?

2020-08-26 Thread Vertexclique
Hi Max; There is an open issue in the tracker that needs feedback to finalize how we will design the overall async interface spanning the arrow crates. Please check that issue; it mentions sans-IO and several design considerations. IMO we can carry on the async discussion under it. Best, Ma

[Rust] Async record batch reader?

2020-08-26 Thread Max Burke
Out of curiosity, is anyone working on a record batch reader that's async friendly? Wanting to know if it's something I could wait on/help out with, or if it's something we could start working on too. -- -Max

Re: Unexpected keyword argument 'split_blocks'

2020-08-26 Thread Joris Van den Bossche
Hi Jayant, This keyword was introduced in pyarrow 0.16, so you will need to update your installation. For updating, if `conda update pyarrow` indicates it is already installed, you can also try `conda install pyarrow=0.16`. However, it might be you will need to use the conda-forge channel, as I a

Re: Questions about S3 options

2020-08-26 Thread Joris Van den Bossche
Hi Weston, Sorry for the late reply. For using S3 in pyarrow, there are indeed 2 options: using the implementation provided by arrow (`pyarrow.fs.S3FileSystem`) or using s3fs which gets wrapped by pyarrow. Note that the wrapper is not actually DaskFileSystem: for the legacy filesystems we use s3fs

Unexpected keyword argument 'split_blocks'

2020-08-26 Thread Jayant Singh
Good evening, My name is Jayant and I am using pyarrow version 0.15.0 with Python 3.8 on macOS Catalina (latest version). As per the documentation (link), I ran the following code to avoid memory doubling: df=table.to_pandas(split_blocks=True, self_

Re: Creating filesystems that read local files

2020-08-26 Thread Joris Van den Bossche
Hi Weston, Currently there are two filesystem interfaces in pyarrow, a legacy one in `pyarrow.filesystem` and a new one in `pyarrow.fs` (see https://issues.apache.org/jira/browse/ARROW-9645 and https://arrow.apache.org/docs/python/filesystems_deprecated.html, docs are still a bit scarce). Based

RE: [DISCUSS] Big Endian support in Arrow (was: Re: [Java] Supporting Big Endian)

2020-08-26 Thread Kazuaki Ishizaki
Hi, I waited for comments regarding Java Big-Endian (BE) support during my one-week vacation. Thank you for the good suggestions and comments. I already responded to some questions in another mail. This mail addresses the remaining questions: Use cases, holistic strategy for BE support, and testing

Re: [DISCUSS] Big Endian support in Arrow (was: Re: [Java] Supporting Big Endian)

2020-08-26 Thread Kazuaki Ishizaki
Hi Micah, Thank you for expanding the scope for Big Endian support in Arrow. I am glad to see this now that I am back from a one-week vacation. I agree with this since we have just seen the kickoff of BE support in Go. Hi Wes, Thank you for your positive comments. We should carefully implement BE su

RE: [DISCUSS] Big Endian support in Arrow (was: Re: [Java] Supporting Big Endian)

2020-08-26 Thread Kazuaki Ishizaki
Kazuaki Ishizaki, Ph.D., Senior Technical Staff Member (STSM), IBM Research - Tokyo ACM Distinguished Member - Apache Spark committer - IBM Academy of Technology Member Wes McKinney wrote on 2020/08/26 21:27:49: > From: Wes McKinney > To: dev , Micah Kornfield > Cc: Fan Liya > Date: 2020/08

Re: Creating filesystems that read local files

2020-08-26 Thread Antoine Pitrou
Hi Weston, Can you show the code for your experiment? (or post equivalent code) Regards Antoine. On 25/08/2020 at 23:38, Weston Pace wrote: > I created a RelativeFileSystem that extended FileSystem and proxied > calls to a LocalFileSystem instance. This filesystem allowed me to > specify

Re: [DISCUSS] Big Endian support in Arrow (was: Re: [Java] Supporting Big Endian)

2020-08-26 Thread Wes McKinney
hi Micah, I agree with your reasoning. If supporting BE in some languages (e.g. Java) is impractical due to performance regressions on LE platforms, then I don't think it's worth it. But if it can be handled at compile time or without runtime overhead, and tested / maintained properly on an ongoin

[NIGHTLY] Arrow Build Report for Job nightly-2020-08-26-0

2020-08-26 Thread Crossbow
Arrow Build Report for Job nightly-2020-08-26-0 All tasks: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-26-0 Failed Tasks: - conda-osx-clang-py36: URL: https://github.com/ursa-labs/crossbow/branches/all?query=nightly-2020-08-26-0-azure-conda-osx-clang-py36 - cond