Hi all,
It's been almost 6 months. I've made some progress on this and wanted to
share what I found.
First of all, let me clarify my use case.
I want to write a python library that parses binary data and converts it to
an arrow RecordBatch.
I know ahead of time the schema of the data I'm getting, so I need to
implement a function whose signature is:
```python
import pyarrow as pa

def convert_binary(data: bytes, schema: pa.Schema) -> pa.RecordBatch:
    pass  # parse `data` according to `schema`
```
For context (though it doesn't matter much actually), this binary data
comes from the python cassandra driver.
In the current implementation, the parsing is done in python (or sometimes
in cython), but ultimately it returns tuples to represent rows.
This is highly inefficient (both in terms of memory and CPU) for analytics
payloads.
There is also an implementation that returns one numpy array per column,
but it only supports a limited set of types (int, float etc).
For other types (string, datetime, UDT, lists...) it returns arrays of
np.object.
The binary data is encoded row by row.
Decoding it requires a lot of small operations on small native objects,
and this type of workload is much more efficient in C++.
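To make the row-to-column reshaping concrete, here is a minimal pure-python
sketch. The fixed-width row layout (`ROW`) and the helper name
`rows_to_columns` are made up for illustration; this is not the cassandra
wire format and not the code in [1]. It only shows the kind of per-row loop
that gets expensive in python:

```python
import struct
from array import array
from typing import Tuple

# Hypothetical row layout: big-endian int32 followed by a float64.
# This is NOT the cassandra wire format, only a stand-in for illustration.
ROW = struct.Struct(">id")


def rows_to_columns(data: bytes) -> Tuple[array, array]:
    """Decode row-encoded bytes into one typed array per column."""
    ids = array("i")
    values = array("d")
    # One small unpack and two appends per row: cheap in C++, slow in python.
    for row_id, value in ROW.iter_unpack(data):
        ids.append(row_id)
        values.append(value)
    return ids, values


# Build three rows of sample data, then split them back into columns.
payload = b"".join(ROW.pack(i, i * 0.5) for i in range(3))
ids, values = rows_to_columns(payload)
```

Each column ends up in a contiguous typed buffer, which is roughly the shape
an arrow array wants, instead of one tuple per row.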
So I decided to write my own implementation, using C++ to do the parsing,
and arrow to represent the data.
You can find the code here [1] (caution, it's a WIP).
Writing the code and setting up is fine.
There are a lot of pybind11 tutorials out there.
The difficulty is that when you run the code, you get this error:
```
ImportError: /usr/local/lib/python3.9/site-packages/_cassarrow.cpython-39-x86_64-linux-gnu.so: undefined symbol: _ZNK5arrow8DataType18ComputeFingerprintB5cxx11Ev
```
As identified in the previous emails, it is because my code hasn't been
compiled with the same toolchain as the pyarrow code.
The pyarrow binary code has been compiled on a different system and
packaged as a wheel.
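A hint about what differs is in the symbol itself: `B5cxx11` is how GCC
mangles the `[abi:cxx11]` tag, which suggests my extension was built against
the newer libstdc++ ABI while the wheel exports the untagged symbol. Here is
a small sketch of that check. The helper name `has_cxx11_abi_tag` is
hypothetical, and the substring test is only a heuristic, not a real
demangler:

```python
def has_cxx11_abi_tag(mangled: str) -> bool:
    """Heuristically detect GCC's [abi:cxx11] tag in a mangled C++ symbol.

    In the Itanium C++ ABI mangling, an ABI tag is encoded as
    "B" + <length> + <name>, so [abi:cxx11] shows up as "B5cxx11".
    """
    return mangled.startswith("_Z") and "B5cxx11" in mangled


# The symbol from the ImportError above.
missing = "_ZNK5arrow8DataType18ComputeFingerprintB5cxx11Ev"
# The same method name without the ABI tag.
untagged = "_ZNK5arrow8DataType18ComputeFingerprintEv"

missing_is_tagged = has_cxx11_abi_tag(missing)
untagged_is_tagged = has_cxx11_abi_tag(untagged)
```

Running the symbol through `c++filt`, if you have it installed, gives the
readable name and is more reliable than this check.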
I've found several ways to work around the problem.
1. Build pyarrow from source
By building pyarrow locally from source (instead of using a wheel), we
avoid this compatibility issue.
Unfortunately pyarrow has a lot of build dependencies, which need to be
installed first [2]:
```shell
pip install --no-binary=pyarrow pyarrow
```
cons:
* Lots of dependencies to install and manage for pyarrow
* Hard to make sure everyone in the team has the same dependencies and
toolchain for reproducible builds
* Pyarrow is (somewhat) slow to build
See this Dockerfile for how to install from source: [3]
2. Make sure you use the same toolchain as the wheel
You have to set up your development environment so that it replicates the
way the wheels are built.
This is a big ask though, and the only easy way to do it is to build on
docker using the image used to build the wheels.
I've done so using the docker image used by cibuildwheel (the tool most
libraries use to create their wheels).
cons:
* You can only build on docker, as it's hard to set up your environment the
same way
See this Dockerfile: [4]
3. Use Nix
Using nix as a package/dependency manager works like a charm in this
context.
It basically guarantees that all the installed libraries are built with the
same toolchain.
It is extremely powerful especially if you want to use other C++ libraries.
For example I used it in the past to use the C++ google protobuf library
from python.
cons:
* You have to use nix for everything
* Not all packages in pypi are available in nix
Here's the nix shell config for this project: [5]
4. Use flag `-D_GLIBCXX_USE_CXX11_ABI=0`
This works, and is very easy to integrate in `setup.py`.
cons:
* There's still no guarantee that the toolchains are 100% compatible.
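As a rough sketch of what this can look like in `setup.py` (not necessarily
what [1] does): the helper names `extension_kwargs` and `make_extension` and
the source path are hypothetical, and I'm assuming pybind11 is used for the
bindings. The pyarrow calls (`get_include`, `get_library_dirs`,
`get_libraries`) are the ones pyarrow provides for building extensions
against it.

```python
from typing import Any, Dict, List

# Flag from option 4: ask libstdc++ for the pre-C++11 std::string ABI.
OLD_ABI_FLAG = "-D_GLIBCXX_USE_CXX11_ABI=0"


def extension_kwargs(
    include_dirs: List[str], library_dirs: List[str], libraries: List[str]
) -> Dict[str, Any]:
    """Build the keyword arguments for a setuptools Extension."""
    return {
        "include_dirs": list(include_dirs),
        "library_dirs": list(library_dirs),
        "libraries": list(libraries),
        "extra_compile_args": [OLD_ABI_FLAG],
        "language": "c++",
    }


def make_extension():
    """Create the Extension object; call this from setup.py."""
    # Imported lazily so extension_kwargs works without these packages.
    import pyarrow as pa
    import pybind11
    from setuptools import Extension

    return Extension(
        "_cassarrow",
        sources=["src/bindings.cpp"],  # hypothetical path
        **extension_kwargs(
            include_dirs=[pa.get_include(), pybind11.get_include()],
            library_dirs=pa.get_library_dirs(),
            libraries=pa.get_libraries(),
        ),
    )
```

In `setup.py` you would then pass `ext_modules=[make_extension()]` to
`setup()`. Depending on how pyarrow is installed, linking may need some extra
setup on top of this.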
5. Use poetry/conda
I haven't explored these yet.
6. Use the C data interface
My library is building RecordBatch from rows of data.
To do so it needs access to the ArrayBuilder api, which isn't exposed in
the CDataInterface (AFAIK).
# Finishing notes:
While working on this I found this article very helpful [6].
Also, Uwe's blog is very good in general, highly recommended.
[1] https://github.com/0x26res/cassarrow
[2] https://arrow.apache.org/docs/developers/python.html#using-pip
[3] https://github.com/0x26res/cassarrow/blob/master/from-source.Dockerfile
[4] https://github.com/0x26res/cassarrow/blob/master/same-toolchain.Dockerfile
[5] https://github.com/0x26res/cassarrow/blob/master/shell.nix
[6] https://uwekorn.com/2019/09/15/how-we-build-apache-arrows-manylinux-wheels.html