Yes and no :)  Disclaimer: this answer is a little out of my
wheelhouse as I've learned relational algebra through doing, so my
formal theory may be off.  Anyone is welcome to correct me.  Also,
this answer turned into a bit of a ramble and is a bit scattershot.
You may already be very familiar with some of these concepts; I'm not
trying to be patronizing, but I do tend to ramble.

TL;DR: Pyarrow has some "relational algebra interfaces" today but not
a lot of "dataframe interfaces".  What you're asking for is a little
bit more of a "dataframe" type question and you will probably need to
go beyond pyarrow to get exactly what you are asking for.  That being
said, pyarrow has a lot of the primitives that can solve parts and
pieces of your problem.

In relational algebra there is a special class of functions called
"scalar functions".  These are functions that create a single value
for each row.  For example, "less than", "addition", and "to upper
case".  A scalar expression is then an expression that consists only
of scalar functions.  Projections in dataset scanners can only contain
scalar expressions.  In pyarrow you can scan an in-memory table and
apply a scalar expression:

```
import pyarrow as pa
import pyarrow.dataset as ds

arr = pa.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
tab = pa.Table.from_arrays([arr], names=["x"])
expr = (ds.field("x") > 5) & (ds.field("x") < 8)
# This is a little bit unwieldy; we could maybe investigate a better
# API for this kind of thing
ds.Scanner.from_batches(tab.to_batches(), schema=tab.schema,
                        columns={'y': expr}).to_table()
# pyarrow.Table
# y: bool
# ----
# y: [[false,false,false,false,false,true,true,false,false,false]]
```

However, the functions that you are listing (sum, avg) are not scalar
functions.  They do, however, fit into a special class of functions
known as "aggregates" (functions that consume values from multiple
rows to create a single output).  Aggregates in relational algebra are
used in a "group by" node.  In pyarrow 7 we added the ability to run
group_by operations very easily on in-memory tables, but it looks like
that requires at least one key column, so it isn't going to help us
here.

Both of these features are powered by the pyarrow compute functions.
These are slightly lower level compute primitives.  We can use these
here to get some of the values you want:

```
# Continuing from above example
import pyarrow.compute as pc

pc.sum(tab.column("x"))
# <pyarrow.Int64Scalar: 55>
pc.mean(tab.column("x"))
# <pyarrow.DoubleScalar: 5.5>
```

But...while this does give you the capability to run functions, this
doesn't give you the capability to run "expressions".

Running expressions that contain both aggregate and scalar functions
is a little trickier than it may seem.  This is often done by creating
a relational algebra query from the expression.  For example, consider
the expression `sum(a * b)/sum(a)`.

We can create the query (this is totally shorthand pseudocode, the
Substrait text format isn't ready yet)

SCAN_TABLE(table) ->
  PROJECT({"a*b": multiply(field("a"), field("b"))}) ->
  GROUP_BY(keys=[], aggregates=[("a*b", "sum"), ("a", "sum")]) ->
  PROJECT({"sum(a*b)/sum(a)": divide(field("sum(a*b)"),field("sum(a)"))})

So if you wanted to create such a query plan then you could express it
in Substrait, which is a spec for expressing query plans, and use our
"very new and not quite ready yet" Substrait consumer API to process
that query plan.  If your goal is purely "how do I express a series of
compute operations as a string" then I will point out that SQL is a
very standard answer to that question, and the Substrait text format
will be a new way to answer it.

---

Ok, now everything I've written has been exclusive to pyarrow.  If I
step back and look at your original question "how can I evaluate a
dataframe-like expression against an Arrow table" I think the answer
is going to lie outside of pyarrow (some cool functions like table
group by are being added but I'm not aware of any developers whose
goal is to make a full fledged dataframe API inside of pyarrow).
Fortunately, the whole point of Arrow is to make it super easy for
libraries to add new functionality to your data.  There may be some
library out there that does this already, there are libraries out
there that solve similar problems like "how do I run this SQL query
against Arrow data", and there are probably libraries in development
to tackle this.  Unfortunately, I'm not knowledgeable enough to speak
to any of these in detail.

On Wed, Mar 30, 2022 at 1:16 AM Suresh V <[email protected]> wrote:
>
> Hi,
>
> Is there a way to evaluate mathematical expressions against columns of a 
> pyarrow table which is in memory already similar to how projections work for 
> dataset scanners?
>
> The goal is to have specify a bunch of strings like sum(a * b)/sum(a), or 
> avg(a[:10]) etc. Convert these into expressions and run against the table.
>
> Thanks
>
>
