I have been playing around with Polars for a few weeks. It has an
expression system that is different from pandas', but it feels really
clean, and it is built on Arrow in Rust. If you need something now, it
would be a good option: it does everything mentioned here, adds a lazy
evaluation system, and is very fast.

pola-rs/polars: Fast multi-threaded DataFrame library in Rust | Python |
Node.js (github.com) <https://github.com/pola-rs/polars>

I just thought I would mention it.

Thanks
Matt

On Thu, Mar 31, 2022 at 8:54 AM Ian Cook <[email protected]> wrote:

> There is work ongoing to implement a pandas-like interface to the
> Arrow C++ compute engine using Ibis [1] and Substrait [2].
>
> The gist of this is: Python users write Ibis code; the ibis-substrait
> submodule of Ibis [3] compiles this into a Substrait plan; and the
> Arrow C++ compute engine parses the Substrait plan and executes it,
> returning an Arrow table.
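>
> As a rough sketch of what that looks like from the user's side (the
> exact API may differ; SubstraitCompiler is the ibis-substrait entry
> point as I understand it):
>
> ```
> import ibis
> from ibis_substrait.compiler.core import SubstraitCompiler
>
> # define a table schema and an expression in Ibis
> t = ibis.table([("a", "float64"), ("b", "float64")], name="t")
> expr = ((t.a * t.b).sum() / t.a.sum()).name("ratio")
>
> # compile the Ibis expression into a Substrait plan (a protobuf)
> # that a Substrait-capable engine can then execute
> plan = SubstraitCompiler().compile(expr)
> ```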
>
> There is a recent Voltron Data blog post co-authored by Phillip Cloud
> (the lead developer of Ibis) and Jacques Nadeau (the originator of
> Substrait) with some additional details [4].
>
> Since Substrait is an open standard, this can all be implemented
> without introducing tight coupling between the interface and the
> engine, at least in theory. Other Substrait-compatible engines will be
> able to consume and execute Substrait plans, and other
> Substrait-compatible interfaces will be able to produce Substrait
> plans. There is already some work ongoing to support Substrait in
> other interfaces and engines including the Arrow R interface [5], the
> DataFusion engine [6], and the DuckDB engine [7].
>
> I am not sure when the Ibis-Substrait-Arrow stack is expected to reach
> milestones of "real-world usability." I will leave it to others to
> comment on that.
>
> Thanks,
> Ian
>
> [1] https://ibis-project.org
> [2] https://substrait.io
> [3] https://github.com/ibis-project/ibis-substrait
> [4]
> https://voltrondata.com/news/introducing-substrait-an-interoperable-data-to-engine-connector/
> [5] https://github.com/voltrondata/substrait-r
> [6] https://github.com/datafusion-contrib/datafusion-substrait
> [7] https://github.com/duckdblabs/duckdb-substrait-demo
>
>
> On Thu, Mar 31, 2022 at 7:25 AM Suresh V <[email protected]> wrote:
> >
> > Thanks Aldrin for the response. My goal is to have something like a
> > numexpr/pandas equivalent where I can simply pass strings to get the
> > results, and to find out whether the Arrow gods have any plans to
> > create something like that in the near future :)
> >
> > On Thu, Mar 31, 2022 at 3:15 AM Aldrin <[email protected]> wrote:
> >>
> >> I think everything Weston said is totally good.
> >>
> >> I just wanted to add that there may be a couple of "simpler" options
> >> available without using SQL-like approaches:
> >> (1) just use the compute API (at least for examples such as
> >> "sum(...) / sum(...)")
> >> (2) when converting strings to expressions, do it in 2 passes: (a) put
> >> expressions supported by the dataset API into projections and scans on
> >> the source table, then (b) use the compute API for any other
> >> expressions and apply them to the final relation from (a). So, for one
> >> of the given examples, "sum(a*b) / sum(a)", this ends up being 2
> >> passes of applying expressions: once via the dataset API, and once via
> >> the compute API.
> >>
> >> Note that (1) seems simple in my naive opinion, but (2) may actually
> >> be quite complicated. How much extra work it is probably depends on
> >> how you were planning to convert strings to expressions anyway. And
> >> note that you can do many iterations of (2) to use aggregates in
> >> filters or projections, etc.
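> >>
> >> As a naive sketch of the "strings to expressions" part: if the
> >> strings are trusted and the column names are known, you could bind
> >> each column name to a dataset field and let Python's own parser build
> >> the expression tree (parse_expr below is a hypothetical helper, and
> >> this only covers scalar expressions):
> >>
> >>     ```
> >>     from pyarrow import dataset
> >>
> >>     def parse_expr(expr_str, column_names):
> >>         # bind each column name to a dataset field, then eval the
> >>         # string with no builtins so only field operators are used
> >>         fields = {name: dataset.field(name) for name in column_names}
> >>         return eval(expr_str, {"__builtins__": {}}, fields)
> >>
> >>     expr = parse_expr("(a > 5) & (a * b < 40)", ["a", "b"])
> >>     ```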
> >>
> >> ----
> >>
> >> The expression:
> >>
> >>     sum(a * b) / sum(a)
> >>
> >> would be:
> >>
> >>     ```
> >>     import pyarrow
> >>     from pyarrow import dataset
> >>     from pyarrow import compute
> >>
> >>     tab = pyarrow.Table.from_arrays(
> >>          [
> >>            pyarrow.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
> >>           ,pyarrow.array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
> >>          ]
> >>         ,names=['a', 'b']
> >>     )
> >>
> >>     # some chain of scans and projections via the dataset API
> >>     relational_result = dataset.Scanner.from_batches(
> >>          tab.to_batches()
> >>         ,schema=tab.schema
> >>         ,columns={
> >>               'a': dataset.field('a')
> >>              ,'product': dataset.field('a') * dataset.field('b')
> >>          }
> >>     ).to_table()
> >>
> >>     # a single "expression" that uses the compute API and the
> >>     # final relation from the previous "phase"; note that pyarrow
> >>     # scalars don't support the / operator, so use compute.divide
> >>     expr_result = compute.divide(
> >>          compute.sum(relational_result.column('product'))
> >>         ,compute.sum(relational_result.column('a'))
> >>     )
> >>     ```
> >>
> >> Aldrin Montana
> >> Computer Science PhD Student
> >> UC Santa Cruz
> >>
> >>
> >> On Wed, Mar 30, 2022 at 8:21 PM Weston Pace <[email protected]>
> wrote:
> >>>
> >>> Yes and no :)  Disclaimer: this answer is a little out of my
> >>> wheelhouse as I've learned relational algebra through doing and so my
> >>> formal theory may be off.  Anyone is welcome to correct me.  Also,
> >>> this answer turned into a bit of a ramble and is a bit scattershot.  You
> >>> may already be very familiar with some of these concepts.  I'm not
> >>> trying to be patronizing but I do tend to ramble.
> >>>
> >>> TL;DR: Pyarrow has some "relational algebra interfaces" today but not
> >>> a lot of "dataframe interfaces".  What you're asking for is a little
> >>> bit more of a "dataframe" type question and you will probably need to
> >>> go beyond pyarrow to get exactly what you are asking for.  That being
> >>> said, pyarrow has a lot of the primitives that can solve parts and
> >>> pieces of your problem.
> >>>
> >>> In relational algebra there is a special class of functions called
> >>> "scalar functions".  These are functions that create a single value
> >>> for each row.  For example, "less than", "addition", and "to upper
> >>> case".  A scalar expression is then an expression that consists only
> >>> of scalar functions.  Projections in dataset scanners can only contain
> >>> scalar expressions.  In pyarrow you can scan an in-memory table and
> >>> apply a scalar expression:
> >>>
> >>> ```
> >>> import pyarrow as pa
> >>> import pyarrow.dataset as ds
> >>>
> >>> arr = pa.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
> >>> tab = pa.Table.from_arrays([arr], names=["x"])
> >>> expr = (ds.field("x") > 5) & (ds.field("x") < 8)
> >>> # This is a little bit unwieldy; we could maybe investigate a better
> >>> # API for this kind of thing
> >>> ds.Scanner.from_batches(tab.to_batches(), schema=tab.schema,
> >>>                         columns={'y': expr}).to_table()
> >>> # pyarrow.Table
> >>> # y: bool
> >>> # ----
> >>> # y: [[false,false,false,false,false,true,true,false,false,false]]
> >>> ```
> >>>
> >>> However, the functions that you are listing (sum, avg) are not scalar
> >>> functions.  They do, however, fit a special class of functions known
> >>> as "aggregates" (functions that consume values from multiple rows to
> >>> create a single output).  Aggregates in relational algebra are used in
> >>> a "group by" node.  In pyarrow 7 we added the ability to run group_by
> >>> functions very easily on in-memory tables but it looks like it
> >>> requires at least one key column so that isn't going to help us here.
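> >>>
> >>> For reference, when you do have a key column it looks like this
> >>> (a quick sketch):
> >>>
> >>> ```
> >>> keyed = pa.Table.from_pydict({"k": ["a", "a", "b"], "x": [1, 2, 3]})
> >>> # aggregate x with sum, grouped by the key column k
> >>> keyed.group_by("k").aggregate([("x", "sum")])
> >>> # -> a table with columns x_sum and k
> >>> ```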
> >>>
> >>> Both of these features are powered by the pyarrow compute functions.
> >>> These are slightly lower level compute primitives.  We can use these
> >>> here to get some of the values you want:
> >>>
> >>> ```
> >>> # Continuing from the above example
> >>> import pyarrow.compute as pc
> >>>
> >>> pc.sum(tab.column("x"))
> >>> # <pyarrow.Int64Scalar: 55>
> >>> pc.mean(tab.column("x"))
> >>> # <pyarrow.DoubleScalar: 5.5>
> >>> ```
> >>>
> >>> But...while this does give you the capability to run functions, this
> >>> doesn't give you the capability to run "expressions".
> >>>
> >>> Running expressions that contain both aggregate and scalar functions
> >>> is a little trickier than it may seem.  This is often done by creating
> >>> a relational algebra query from the expression.  For example, consider
> >>> the expression `sum(a * b)/sum(a)`.
> >>>
> >>> We can create the query (this is totally shorthand pseudocode, the
> >>> Substrait text format isn't ready yet)
> >>>
> >>> SCAN_TABLE(table) ->
> >>>   PROJECT({"a*b": multiply(field("a"), field("b"))}) ->
> >>>   GROUP_BY(keys=[], aggregates=[("a*b", "sum"), ("a", "sum")]) ->
> >>>   PROJECT({"sum(a*b)/sum(a)": divide(field("sum(a*b)"), field("sum(a)"))})
> >>>
> >>> So if you wanted to create such a query plan, you could express it
> >>> in Substrait, which is a spec for expressing query plans, and use our
> >>> "very new and not quite ready yet" Substrait consumer API to process
> >>> it.  If your goal is purely "how do I express a series of compute
> >>> operations as a string" then I will point out that SQL is a very
> >>> standard answer for that question and the Substrait text format will
> >>> be a new way to answer it.
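> >>>
> >>> (As a point of comparison, for this particular expression you can
> >>> also skip plans entirely and spell it with direct compute calls; a
> >>> quick sketch, continuing from the tables above:)
> >>>
> >>> ```
> >>> tab2 = pa.Table.from_pydict({"a": [1.0, 2.0, 3.0],
> >>>                              "b": [2.0, 2.0, 2.0]})
> >>> # sum(a * b) / sum(a) as direct compute calls on table columns
> >>> pc.divide(pc.sum(pc.multiply(tab2.column("a"), tab2.column("b"))),
> >>>           pc.sum(tab2.column("a")))
> >>> # <pyarrow.DoubleScalar: 2.0>
> >>> ```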
> >>>
> >>> ---
> >>>
> >>> Ok, now everything I've written has been exclusive to pyarrow.  If I
> >>> step back and look at your original question "how can I evaluate a
> >>> dataframe-like expression against an Arrow table" I think the answer
> >>> is going to lie outside of pyarrow (some cool functions like table
> >>> group by are being added but I'm not aware of any developers whose
> >>> goal is to make a full-fledged dataframe API inside of pyarrow).
> >>> Fortunately, the whole point of Arrow is to make it super easy for
> >>> libraries to add new functionality to your data.  There may be some
> >>> library out there that does this already, there are libraries out
> >>> there that solve similar problems like "how do I run this SQL query
> >>> against Arrow data", and there are probably libraries in development
> >>> to tackle this.  Unfortunately, I'm not knowledgeable nor qualified
> >>> enough to speak on any of these.
> >>>
> >>> On Wed, Mar 30, 2022 at 1:16 AM Suresh V <[email protected]> wrote:
> >>> >
> >>> > Hi,
> >>> >
> >>> > Is there a way to evaluate mathematical expressions against the
> >>> > columns of a pyarrow table that is already in memory, similar to
> >>> > how projections work for dataset scanners?
> >>> >
> >>> > The goal is to specify a bunch of strings like sum(a * b)/sum(a)
> >>> > or avg(a[:10]), convert these into expressions, and run them
> >>> > against the table.
> >>> >
> >>> > Thanks
> >>> >
> >>> >
>


-- 
-------------------------------------------------------
Matthew Peters
[email protected]
