There is work ongoing to implement a pandas-like interface to the Arrow C++ compute engine using Ibis [1] and Substrait [2]. The gist of this is: Python users write Ibis code; the ibis-substrait submodule of Ibis [3] compiles this into a Substrait plan; and the Arrow C++ compute engine parses the Substrait plan and executes it, returning an Arrow table.
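
To give a rough idea of the producer side of that pipeline, here is a sketch of compiling an Ibis expression (the sum(a * b) / sum(a) example from earlier in this thread) down to a Substrait plan. Treat it as a sketch only: the unbound table definition is made up for illustration, and since ibis-substrait is young, its import paths and API details may change.

```
import ibis
from ibis_substrait.compiler.core import SubstraitCompiler

# Define an unbound table and an aggregation over it
t = ibis.table([("a", "int64"), ("b", "int64")], name="t")
expr = t.aggregate(result=(t.a * t.b).sum() / t.a.sum())

# Compile the Ibis expression into a Substrait plan (a protobuf message)
compiler = SubstraitCompiler()
plan = compiler.compile(expr)
```

Since the resulting plan is just a serialized protobuf, any Substrait-consuming engine, not only Arrow's, should in principle be able to execute it.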
There is a recent Voltron Data blog post co-authored by Phillip Cloud (the lead developer of Ibis) and Jacques Nadeau (the originator of Substrait) with some additional details [4].

Since Substrait is an open standard, this can all be implemented without introducing tight coupling between the interface and the engine, at least in theory: other Substrait-compatible engines will be able to consume and execute Substrait plans, and other Substrait-compatible interfaces will be able to produce Substrait plans. There is already some work ongoing to support Substrait in other interfaces and engines, including the Arrow R interface [5], the DataFusion engine [6], and the DuckDB engine [7].

I am not sure when the Ibis-Substrait-Arrow stack is expected to reach milestones of "real-world usability." I will leave it to others to comment on that.

Thanks,
Ian

[1] https://ibis-project.org
[2] https://substrait.io
[3] https://github.com/ibis-project/ibis-substrait
[4] https://voltrondata.com/news/introducing-substrait-an-interoperable-data-to-engine-connector/
[5] https://github.com/voltrondata/substrait-r
[6] https://github.com/datafusion-contrib/datafusion-substrait
[7] https://github.com/duckdblabs/duckdb-substrait-demo

On Thu, Mar 31, 2022 at 7:25 AM Suresh V <[email protected]> wrote:
>
> Thanks Aldrin for the response. My goal is to have something like a
> numexpr/pandas equivalent where I can simply pass strings to get the
> results, and to find out whether the Arrow gods have any plans to
> create something like that in the near future :)
>
> On Thu, Mar 31, 2022 at 3:15 AM Aldrin <[email protected]> wrote:
>>
>> I think everything Weston said is totally good.
>>
>> I just wanted to add that there may be a couple of "simpler" options
>> available without using SQL-like approaches:
>> (1) just use the compute API (at least for examples such as
>> "sum(...) / sum(...)")
>> (2) when converting strings to expressions, do it in 2 passes: (a)
>> put expressions supported by the dataset API into projections and
>> scans on the source table, then (b) use the compute API for any other
>> expressions and apply them to the final relation from (a). So, for
>> one of the given examples, "sum(a*b) / sum(a)", this ends up being 2
>> passes of applying expressions: once via the dataset API, and once
>> via the compute API.
>>
>> Note that (1) seems simple in my naive opinion, but (2) may actually
>> be quite complicated. How much extra work it is probably depends on
>> how you were planning on converting strings to expressions anyway.
>> And note that you can do many iterations of (2) to use aggregates in
>> filters or projections, etc.
>>
>> ----
>>
>> The expression:
>>
>> sum(a * b) / sum(a)
>>
>> would be:
>>
>> ```
>> import pyarrow
>> from pyarrow import dataset
>> from pyarrow import compute
>>
>> tab = pyarrow.Table.from_arrays(
>>     [
>>          pyarrow.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
>>         ,pyarrow.array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
>>     ]
>>     ,names=['a', 'b']
>> )
>>
>> # some chain of scans and projections via the dataset API
>> relational_result = dataset.Scanner.from_batches(
>>     tab.to_batches()
>>     ,schema=tab.schema
>>     ,columns={
>>          'a': dataset.field('a')
>>         ,'product': dataset.field('a') * dataset.field('b')
>>     }
>> ).to_table()
>>
>> # a single "expression" that uses the compute API and the final
>> # relation from the previous "phase" (note: pyarrow scalars do not
>> # support the "/" operator, so we call compute.divide explicitly)
>> expr_result = compute.divide(
>>      compute.sum(relational_result.column('product'))
>>     ,compute.sum(relational_result.column('a'))
>> )
>> ```
>>
>> Aldrin Montana
>> Computer Science PhD Student
>> UC Santa Cruz
>>
>>
>> On Wed, Mar 30, 2022 at 8:21 PM Weston Pace <[email protected]> wrote:
>>>
>>> Yes and no :) Disclaimer: this answer is a little out of my
>>> wheelhouse, as I've learned relational algebra through doing, so my
>>> formal theory may be off. Anyone is welcome to correct me. Also,
>>> this answer turned into a bit of a ramble and is a bit scattershot.
>>> You may already be very familiar with some of these concepts. I'm
>>> not trying to be patronizing, but I do tend to ramble.
>>>
>>> TL;DR: Pyarrow has some "relational algebra interfaces" today but
>>> not a lot of "dataframe interfaces". What you're asking for is a
>>> little bit more of a "dataframe" type question, and you will
>>> probably need to go beyond pyarrow to get exactly what you are
>>> asking for. That being said, pyarrow has a lot of the primitives
>>> that can solve parts and pieces of your problem.
>>>
>>> In relational algebra there is a special class of functions called
>>> "scalar functions". These are functions that create a single value
>>> for each row. For example, "less than", "addition", and "to upper
>>> case". A scalar expression is then an expression that consists only
>>> of scalar functions. Projections in dataset scanners can only
>>> contain scalar expressions. In pyarrow you can scan an in-memory
>>> table and apply a scalar expression:
>>>
>>> ```
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>>
>>> arr = pa.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
>>> tab = pa.Table.from_arrays([arr], names=["x"])
>>> expr = (ds.field("x") > 5) & (ds.field("x") < 8)
>>> # This is a little bit unwieldy; we could maybe investigate a
>>> # better API for this kind of thing
>>> ds.Scanner.from_batches(tab.to_batches(), schema=tab.schema,
>>>                         columns={'y': expr}).to_table()
>>> # pyarrow.Table
>>> # y: bool
>>> # ----
>>> # y: [[false,false,false,false,false,true,true,false,false,false]]
>>> ```
>>>
>>> However, the functions that you are listing (sum, avg) are not
>>> scalar functions. They do, however, fit a special class of functions
>>> known as "aggregates" (functions that consume values from multiple
>>> rows to create a single output). Aggregates in relational algebra
>>> are used in a "group by" node. In pyarrow 7 we added the ability to
>>> run group_by functions very easily on in-memory tables, but it looks
>>> like it requires at least one key column, so that isn't going to
>>> help us here.
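>>>
>>> (For illustration, here is roughly what that table group_by API
>>> looks like. Since it needs at least one key column, this sketch
>>> fakes one with a constant "key" column; that dummy column is my own
>>> invention for the example, not something the API requires.)
>>>
>>> ```
>>> # Continuing from the example above: add a constant key column,
>>> # then aggregate the "x" column grouped by that single key
>>> keyed = tab.append_column("key", pa.array([1] * tab.num_rows))
>>> keyed.group_by("key").aggregate([("x", "sum"), ("x", "mean")])
>>> # pyarrow.Table
>>> # x_sum: int64
>>> # x_mean: double
>>> # key: int64
>>> ```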
>>>
>>> Both of these features are powered by the pyarrow compute functions.
>>> These are slightly lower level compute primitives. We can use these
>>> here to get some of the values you want:
>>>
>>> ```
>>> # Continuing from above example
>>> import pyarrow.compute as pc
>>>
>>> pc.sum(tab.column("x"))
>>> # <pyarrow.Int64Scalar: 55>
>>> pc.mean(tab.column("x"))
>>> # <pyarrow.DoubleScalar: 5.5>
>>> ```
>>>
>>> But...while this does give you the capability to run functions, this
>>> doesn't give you the capability to run "expressions".
>>>
>>> Running expressions that contain both aggregate and scalar functions
>>> is a little trickier than it may seem. This is often done by
>>> creating a relational algebra query from the expression. For
>>> example, consider the expression `sum(a * b)/sum(a)`.
>>>
>>> We can create the query (this is totally shorthand pseudocode; the
>>> Substrait text format isn't ready yet):
>>>
>>> SCAN_TABLE(table) ->
>>>   PROJECT({"a*b": multiply(field("a"), field("b"))}) ->
>>>   GROUP_BY(keys=[], aggregates=[("a*b", "sum"), ("a", "sum")]) ->
>>>   PROJECT({"sum(a*b)/sum(a)": divide(field("sum(a*b)"), field("sum(a)"))})
>>>
>>> So if you wanted to create such a query plan, you could express it
>>> in Substrait, which is a spec for expressing query plans, and use
>>> our "very new and not quite ready yet" Substrait consumer API to
>>> process that query plan. So if your goal is purely "how do I express
>>> a series of compute operations as a string", then I will point out
>>> that SQL is a very standard answer for that question, and the
>>> Substrait text format will be a new way to answer that question.
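>>>
>>> (To make that concrete: the consumer surface is still settling, but
>>> it will look roughly like the sketch below. Here plan_bytes stands
>>> in for a serialized Substrait plan produced elsewhere, e.g. by a
>>> tool like ibis-substrait; the module path and signature may well
>>> change before this is ready for real use.)
>>>
>>> ```
>>> import pyarrow as pa
>>> import pyarrow.substrait as pa_substrait
>>>
>>> # plan_bytes: a serialized Substrait plan (protobuf), produced by
>>> # some Substrait producer -- hypothetical input for this sketch
>>> reader = pa_substrait.run_query(pa.py_buffer(plan_bytes))
>>> result_table = reader.read_all()
>>> ```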
>>>
>>> ---
>>>
>>> Ok, now everything I've written has been exclusive to pyarrow. If I
>>> step back and look at your original question, "how can I evaluate a
>>> dataframe-like expression against an Arrow table", I think the
>>> answer is going to lie outside of pyarrow (some cool functions like
>>> table group by are being added, but I'm not aware of any developers
>>> whose goal is to make a full-fledged dataframe API inside of
>>> pyarrow). Fortunately, the whole point of Arrow is to make it super
>>> easy for libraries to add new functionality to your data. There may
>>> be some library out there that does this already, there are
>>> libraries out there that solve similar problems like "how do I run
>>> this SQL query against Arrow data", and there are probably libraries
>>> in development to tackle this. Unfortunately, I'm not knowledgeable
>>> nor qualified enough to speak on any of these.
>>>
>>> On Wed, Mar 30, 2022 at 1:16 AM Suresh V <[email protected]> wrote:
>>> >
>>> > Hi,
>>> >
>>> > Is there a way to evaluate mathematical expressions against
>>> > columns of a pyarrow table which is in memory already, similar to
>>> > how projections work for dataset scanners?
>>> >
>>> > The goal is to be able to specify a bunch of strings like
>>> > sum(a * b)/sum(a), or avg(a[:10]), etc., convert these into
>>> > expressions, and run them against the table.
>>> >
>>> > Thanks
>>> >