RE: [Python] Documentation for PyArrow's schema

Natasha Jokinen Tue, 31 Aug 2021 14:02:07 -0700

Is there any documentation about how the linked “schema” functions end up 
expressing things in Arrow? The “from_pandas”[4] function says it returns the 
“implied schema”; how does it decide what is implied?

For me, a multi-header/nested header is any header containing multiple rows of 
header data. Sometimes the multiple header rows can be derived via some form of 
aggregation or column pivot/splitBy (i.e. like in Perspective[5]) on the table 
data, and sometimes it's more of a visual decision based on context outside of 
the table data. As an example of a multi-header, if you take the example 
dataframe from the pandas pivot_table docs[6] and run `pd.pivot_table(df, 
index=['A', 'B'], columns=['C'], aggfunc='first')`, you end up with a 
multi-header where headers D and E have two subheaders apiece named “large” and 
“small”.

                        D               E
        C       large   small   large   small
A       B                               
------------------------------------------------------------
bar     one     4.00    5.00    6.00    8.00
        two     7.00    6.00    9.00    9.00
foo     one     2.00    1.00    4.00    2.00
        two     NaN     3.00    NaN     5.00
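
The table above can be reproduced with the following sketch. The frame is
copied from the pandas pivot_table docs[6]; nothing here is PyArrow-specific
yet, it only shows the two-level header that pandas produces.

```python
import pandas as pd

# Example frame from the pandas pivot_table documentation.
df = pd.DataFrame(
    {
        "A": ["foo", "foo", "foo", "foo", "foo", "bar", "bar", "bar", "bar"],
        "B": ["one", "one", "one", "two", "two", "one", "one", "two", "two"],
        "C": ["small", "large", "large", "small", "small",
              "large", "small", "small", "large"],
        "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
        "E": [2, 4, 5, 5, 6, 6, 8, 9, 9],
    }
)

# Pivot C into the header; D and E each gain "large"/"small" subheaders.
pivoted = pd.pivot_table(df, index=["A", "B"], columns=["C"], aggfunc="first")

# The header is a two-level MultiIndex of (top header, subheader) tuples.
print(pivoted.columns.tolist())
# The row labels are a two-level MultiIndex named A and B.
print(list(pivoted.index.names))
```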

One of the ways to represent this header structure in Arrow is to use nested 
columns, but pandas only supports flat columns, so PyArrow has to preserve the 
pandas indexes in order to have a lossless “from_pandas” and “to_pandas” 
conversion. How is the index preservation represented in Arrow? Any 
documentation on this sort of detail would be nice to have so I don’t have to 
inspect `pyarrow.Table.schema.metadata.get(b'pandas')` for each desired 
feature.

My use case is roughly this: a non-Python backend will set up this kind of 
multi-header and send it over the wire via Arrow, and the Python frontend will 
just want to deserialize and render it.

[4] 
https://arrow.apache.org/docs/python/generated/pyarrow.Schema.html#pyarrow.Schema.from_pandas
[5] https://perspective.finos.org/docs/md/view.html#column-pivots
[6] https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html

Thanks,
Natasha

-----Original Message-----
From: Weston Pace <[email protected]> 
Sent: Monday, August 30, 2021 7:50 PM
To: [email protected]
Subject: Re: [Python] Documentation for PyArrow's schema

The Arrow columnar format defines a "schema"[1].  That is the most basic 
concept all implementations support.  The C data interface also defines a 
"schema"[2], based on [1], that you might want to be aware of.  
PyArrow defines a Python object called a Schema[3] which wraps [2] and 
represents [1] for the Python implementation.

I don't really know what a multi-header / nested-header is exactly.  I can make 
some guesses but would rather make sure I understand what you are after first.  
Can you expand on that a little bit and maybe provide an example?

[1] https://github.com/apache/arrow/blob/apache-arrow-5.0.0/format/Schema.fbs#L415
[2] https://arrow.apache.org/docs/format/CDataInterface.html#the-arrowschema-structure
[3] https://arrow.apache.org/docs/python/generated/pyarrow.Schema.html

On Mon, Aug 30, 2021 at 11:22 AM Natasha Jokinen <[email protected]> 
wrote:
>
> Hi Team,
>
>
>
> Is the schema that PyArrow uses to know how to convert between an Apache 
> Arrow table and a Pandas Dataframe documented? I’m looking at ways my company 
> can have non-python languages share an Apache Arrow schema and it would be 
> great to build off of an existing schema like what Pandas uses rather than 
> coming up with our own.
>
>
>
> I’m particularly interested in documentation of how multi-headers/nested 
> headers are expressed in the PyArrow schema, since Pandas only supports flat 
> columns, and how PyArrow’s schema indicates row grouping.
>
>
>
> Thanks,
>
> Natasha
>
>
