Re: [Python][Discuss] PyArrow Dataset as a Python protocol

2023-09-01 Thread Will Jones
Thanks for pointing that out, Dane. I think that seems like an obvious choice for Dask to be able to consume this protocol.

On Fri, Sep 1, 2023 at 10:13 AM Dane Pitkin wrote:
> The Python Substrait package[1] is on PyPi[2] and currently has python
> wrappers for the Substrait protobuf objects.

Re: [Python][Discuss] PyArrow Dataset as a Python protocol

2023-09-01 Thread Dane Pitkin
The Python Substrait package[1] is on PyPi[2] and currently has python wrappers for the Substrait protobuf objects. I think this will be a great opportunity to identify helper features that users of this protocol would like to see. I'll be keeping an eye out as this develops, but also feel free to

Re: [Python][Discuss] PyArrow Dataset as a Python protocol

2023-08-31 Thread Will Jones
Hello Arrow devs, We discussed this further in the Arrow community call on 2023-08-30 [1], and concluded we should create an entirely new protocol that uses Substrait expressions. I have created an issue [2] to track this and will start a PR soon. It does look like we might block this on

Re: [Python][Discuss] PyArrow Dataset as a Python protocol

2023-08-29 Thread Ian Cook
An update about this: Weston's PR https://github.com/apache/arrow/pull/34834/ merged last week. This makes it possible to convert PyArrow expressions to/from Substrait expressions. As Fokko previously noted, the PR does not change the PyArrow Dataset interface at all. It simply enables a

Re: [Python][Discuss] PyArrow Dataset as a Python protocol

2023-07-03 Thread Will Jones
Hello, After thinking about it, I think I understand the approach David Li and Ian are suggesting with respect to expressions. There will be some arguments that only PyArrow's own datasets support, but that aren't in the generic protocol. Passing PyArrow expressions to the filters argument should

Re: [Python][Discuss] PyArrow Dataset as a Python protocol

2023-07-03 Thread Fokko Driesprong
Hey everyone, Chiming in here from the PyIceberg side. I would love to see the protocol as proposed in the PR. I did a small test, and it seems to be quite straightforward to implement and it brings a lot of potential.

Re: [Python][Discuss] PyArrow Dataset as a Python protocol

2023-06-28 Thread Will Jones
> That wouldn't remove the feature from DuckDB, would it? It would just mean
> that we recognize that PyArrow expressions don't have well-defined
> semantics that we are committing to at this time.

That's a fair point, David. I would be fine excluding it from the protocol initially, and keep

Re: [Python][Discuss] PyArrow Dataset as a Python protocol

2023-06-28 Thread Jonathan Keane
> I would understand this objection more if DuckDB hasn't been relying on
> being able to pass PyArrow expressions for 18 months now [1]. Unless, do we
> just think this isn't widely used enough that we don't care?

This isn't a pro or a con of specifically adopting the PyArrow expression

Re: [Python][Discuss] PyArrow Dataset as a Python protocol

2023-06-28 Thread David Li
That wouldn't remove the feature from DuckDB, would it? It would just mean that we recognize that PyArrow expressions don't have well-defined semantics that we are committing to at this time. As long as we have `**kwargs` everywhere, we can in the future introduce a

Re: [Python][Discuss] PyArrow Dataset as a Python protocol

2023-06-28 Thread Will Jones
Hi Ian,

> I favor option 2 out of concern that option 1 could create a
> temptation for users of this protocol to depend on a feature that we
> intend to deprecate.

I would understand this objection more if DuckDB hasn't been relying on being able to pass PyArrow expressions for 18 months now

Re: [Python][Discuss] PyArrow Dataset as a Python protocol

2023-06-27 Thread Ian Cook
> I think there's three routes we can go here:
>
> 1. We keep PyArrow expressions in the API initially, but once we have
> Substrait-based alternatives we deprecate the PyArrow expression support.
> This is what I intended with the current design, and I think it provides
> the most obvious

Re: [Python][Discuss] PyArrow Dataset as a Python protocol

2023-06-23 Thread Weston Pace
> The trouble is that Dataset was not designed to serve as a
> general-purpose unmaterialized dataframe. For example, the PyArrow
> Dataset constructor [5] exposes options for specifying a list of
> source files and a partitioning scheme, which are irrelevant for many
> of the applications that

Re: [Python][Discuss] PyArrow Dataset as a Python protocol

2023-06-23 Thread Will Jones
Thanks Ian for your extensive feedback.

> I strongly agree with the comments made by David,
> Weston, and Dewey arguing that we should avoid any use of PyArrow
> expressions in this API. Expressions are an implementation detail of
> PyArrow, not a part of the Arrow standard. It would be much safer

Re: [Python][Discuss] PyArrow Dataset as a Python protocol

2023-06-23 Thread Ian Cook
Thanks Will for this proposal! For anyone familiar with PyArrow, this idea has a clear intuitive logic to it. It provides an expedient solution to the current lack of a practical means for interchanging "unmaterialized dataframes" between different Python libraries. To elaborate on that: If you

[Python][Discuss] PyArrow Dataset as a Python protocol

2023-06-21 Thread Will Jones
Hello Arrow devs, I have drafted a PR defining an experimental protocol which would allow third-party libraries to imitate the PyArrow Dataset API [5]. This protocol is intended to endorse an integration pattern that is starting to be used in the Python ecosystem, where some libraries are
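
For readers unfamiliar with the pattern this thread discusses, here is a minimal sketch of a structural "dataset-like" protocol in Python. It is not the protocol from the PR: the names `ScannableDataset`, `InMemoryDataset`, and `consume` are hypothetical, rows are plain dicts rather than Arrow record batches, and only column projection is modelled. The `**kwargs` parameter is included only because the thread mentions keeping `**kwargs` in the signatures; ignoring unknown arguments is an assumption of this sketch.

```python
from typing import Any, Iterable, Optional, Protocol, runtime_checkable


@runtime_checkable
class ScannableDataset(Protocol):
    """Hypothetical structural type: any object with a ``scanner`` method."""

    def scanner(self, columns: Optional[list] = None, **kwargs: Any) -> Iterable[dict]:
        ...


class InMemoryDataset:
    """Toy producer that never imports or subclasses anything from the consumer."""

    def __init__(self, rows):
        self._rows = list(rows)

    def scanner(self, columns=None, **kwargs):
        # Assumption of this sketch: unknown keyword arguments are ignored.
        for row in self._rows:
            if columns is None:
                yield dict(row)
            else:
                # Project each row onto the requested columns only.
                yield {name: row[name] for name in columns}


def consume(dataset, columns=None):
    """Toy consumer: relies only on the protocol, not on a concrete class."""
    if not isinstance(dataset, ScannableDataset):
        raise TypeError("object does not implement scanner()")
    return list(dataset.scanner(columns=columns))
```

The consumer checks only for the presence of the `scanner` method, so a producer can take part without depending on the consumer's package. Note that `runtime_checkable` verifies that the method exists, not that its signature matches.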