This is an automated email from the ASF dual-hosted git repository. paleolimbot pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/arrow-nanoarrow.git
The following commit(s) were added to refs/heads/main by this push: new fcc540a8 docs(python): Update Python bindings readme (#474) fcc540a8 is described below commit fcc540a8fabe03a38f07e010f8a72c733e18a4a8 Author: Dewey Dunnington <de...@dunnington.ca> AuthorDate: Fri May 17 16:22:00 2024 -0300 docs(python): Update Python bindings readme (#474) The previous readme was written for the previous release and is outdated! --- python/README.ipynb | 480 +++++++++++++++++++++++++++++++++++++++++----------- python/README.md | 318 ++++++++++++++++++++++++++-------- 2 files changed, 624 insertions(+), 174 deletions(-) diff --git a/python/README.ipynb b/python/README.ipynb index 0f13829a..5d62065b 100644 --- a/python/README.ipynb +++ b/python/README.ipynb @@ -36,11 +36,19 @@ "\n", "## Installation\n", "\n", - "Python bindings for nanoarrow are not yet available on PyPI. You can install via\n", - "URL (requires a C compiler):\n", + "The nanoarrow Python bindings are available from [PyPI](https://pypi.org/) and\n", + "[conda-forge](https://conda-forge.org/):\n", "\n", - "```bash\n", - "python -m pip install \"git+https://github.com/apache/arrow-nanoarrow.git#egg=nanoarrow&subdirectory=python\"\n", + "```shell\n", + "pip install nanoarrow\n", + "conda install nanoarrow -c conda-forge\n", + "```\n", + "\n", + "Development versions (based on the `main` branch) are also available:\n", + "\n", + "```shell\n", + "pip install --extra-index-url https://pypi.fury.io/arrow-nightlies/ \\\n", + " --prefer-binary --pre nanoarrow\n", "```\n", "\n", "If you can import the namespace, you're good to go!" 
@@ -48,7 +56,7 @@ }, { "cell_type": "code", - "execution_count": 1, + "execution_count": 46, "metadata": {}, "outputs": [], "source": [ @@ -56,102 +64,326 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "## Low-level C library bindings\n", - "\n", - "The Arrow C Data and Arrow C Stream interfaces are comprised of three structures: the `ArrowSchema` which represents a data type of an array, the `ArrowArray` which represents the values of an array, and an `ArrowArrayStream`, which represents zero or more `ArrowArray`s with a common `ArrowSchema`.\n", + "## Data types, arrays, and array streams\n", "\n", - "### Schemas\n", - "\n", - "Use `nanoarrow.c_schema()` to convert an object to an `ArrowSchema` and wrap it as a Python object. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.Schema`, `pyarrow.DataType`, and `pyarrow.Field`)." + "The Arrow C Data and Arrow C Stream interfaces are comprised of three structures: the `ArrowSchema` which represents a data type of an array, the `ArrowArray` which represents the values of an array, and an `ArrowArrayStream`, which represents zero or more `ArrowArray`s with a common `ArrowSchema`. These concepts map to the `nanoarrow.Schema`, `nanoarrow.Array`, and `nanoarrow.ArrayStream` in the Python package." 
] }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "<nanoarrow.c_lib.CSchema decimal128(10, 3)>\n", - "- format: 'd:10,3'\n", - "- name: ''\n", - "- flags: 2\n", - "- metadata: NULL\n", - "- dictionary: NULL\n", - "- children[0]:" + "<Schema> int32" ] }, - "execution_count": 2, + "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "import pyarrow as pa\n", - "schema = na.c_schema(pa.decimal128(10, 3))\n", - "schema" + "na.int32()" + ] + }, + { + "cell_type": "code", + "execution_count": 48, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "nanoarrow.Array<int32>[3]\n", + "1\n", + "2\n", + "3" + ] + }, + "execution_count": 48, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "na.Array([1, 2, 3], na.int32())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The `nanoarrow.Array` can accommodate arrays with any number of chunks, reflecting the reality that many array containers (e.g., `pyarrow.ChunkedArray`, `polars.Series`) support this." + ] + }, + { + "cell_type": "code", + "execution_count": 49, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "nanoarrow.Array<int32>[6]\n", + "1\n", + "2\n", + "3\n", + "4\n", + "5\n", + "6" + ] + }, + "execution_count": 49, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "chunked = na.Array.from_chunks([[1, 2, 3], [4, 5, 6]], na.int32())\n", + "chunked" ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "You can extract the fields of a `CSchema` object one at a time or parse it into a view to extract deserialized parameters." + "Whereas chunks of an `Array` are always fully materialized when the object is constructed, the chunks of an `ArrayStream` have not necessarily been resolved yet." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 50, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "nanoarrow.ArrayStream<int32>" + ] + }, + "execution_count": 50, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "stream = na.ArrayStream(chunked)\n", + "stream" + ] + }, + { + "cell_type": "code", + "execution_count": 51, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "nanoarrow.Array<int32>[3]\n", + "1\n", + "2\n", + "3\n", + "nanoarrow.Array<int32>[3]\n", + "4\n", + "5\n", + "6\n" + ] + } + ], + "source": [ + "with stream:\n", + " for chunk in stream:\n", + " print(chunk)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The `nanoarrow.ArrayStream` also provides an interface to nanoarrow's [Arrow IPC](https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc) reader:" + ] + }, + { + "cell_type": "code", + "execution_count": 52, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "nanoarrow.ArrayStream<non-nullable struct<commit: string, time: timestamp('us', 'UTC'), files: int3...>" + ] + }, + "execution_count": 52, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "url = \"https://github.com/apache/arrow-experiments/raw/main/data/arrow-commits/arrow-commits.arrows\"\n", + "na.ArrayStream.from_url(url)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "These objects implement the [Arrow PyCapsule interface](https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html) for both producing and consuming and are interchangeable with `pyarrow` objects in many cases:" + ] + }, + { + "cell_type": "code", + "execution_count": 53, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "pyarrow.Field<: int32>" + ] + }, + "execution_count": 53, + "metadata": {}, + "output_type": "execute_result" + } + ], 
+ "source": [ + "import pyarrow as pa\n", + "\n", + "pa.field(na.int32())" + ] + }, + { + "cell_type": "code", + "execution_count": 54, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "<pyarrow.lib.ChunkedArray object at 0x12a49a250>\n", + "[\n", + " [\n", + " 1,\n", + " 2,\n", + " 3\n", + " ],\n", + " [\n", + " 4,\n", + " 5,\n", + " 6\n", + " ]\n", + "]" + ] + }, + "execution_count": 54, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pa.chunked_array(chunked)" + ] + }, + { + "cell_type": "code", + "execution_count": 55, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "<pyarrow.lib.Int32Array object at 0x11b552500>\n", + "[\n", + " 4,\n", + " 5,\n", + " 6\n", + "]" + ] + }, + "execution_count": 55, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pa.array(chunked.chunk(1))" + ] + }, + { + "cell_type": "code", + "execution_count": 56, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "nanoarrow.Array<int64>[3]\n", + "10\n", + "11\n", + "12" + ] + }, + "execution_count": 56, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "na.Array(pa.array([10, 11, 12]))" ] }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "<nanoarrow.c_lib.CSchemaView>\n", - "- type: 'decimal128'\n", - "- storage_type: 'decimal128'\n", - "- decimal_bitwidth: 128\n", - "- decimal_precision: 10\n", - "- decimal_scale: 3\n", - "- dictionary_ordered: False\n", - "- map_keys_sorted: False\n", - "- nullable: True\n", - "- storage_type_id: 24\n", - "- type_id: 24" + "<Schema> string" ] }, - "execution_count": 3, + "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "na.c_schema_view(schema)" + "na.Schema(pa.string())" ] }, { + "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "Advanced users can allocate an empty 
`CSchema` and populate its contents by passing its `._addr()` to a schema-exporting function." + "## Low-level C library bindings\n", + "\n", + "The nanoarrow Python package also provides lower level wrappers around Arrow C interface structures. You can create these using `nanoarrow.c_schema()`, `nanoarrow.c_array()`, and `nanoarrow.c_array_stream()`.\n", + "\n", + "### Schemas\n", + "\n", + "Use `nanoarrow.c_schema()` to convert an object to an `ArrowSchema` and wrap it as a Python object. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.Schema`, `pyarrow.DataType`, and `pyarrow.Field`)." ] }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "<nanoarrow.c_lib.CSchema int32>\n", - "- format: 'i'\n", + "<nanoarrow.c_schema.CSchema decimal128(10, 3)>\n", + "- format: 'd:10,3'\n", "- name: ''\n", "- flags: 2\n", "- metadata: NULL\n", @@ -159,15 +391,41 @@ "- children[0]:" ] }, - "execution_count": 4, + "execution_count": 58, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "na.c_schema(pa.decimal128(10, 3))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Using `c_schema()` is a good fit for testing and for ephemeral schema objects that are being passed from one library to another. 
To extract the fields of a schema in a more convenient form, use `Schema()`:" + ] + }, + { + "cell_type": "code", + "execution_count": 59, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(10, 3)" + ] + }, + "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "schema = na.allocate_c_schema()\n", - "pa.int32()._export_to_c(schema._addr())\n", - "schema" + "schema = na.Schema(pa.decimal128(10, 3))\n", + "schema.precision, schema.scale" ] }, { @@ -190,29 +448,28 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "<nanoarrow.c_lib.CArray string>\n", + "<nanoarrow.c_array.CArray string>\n", "- length: 4\n", "- offset: 0\n", "- null_count: 1\n", - "- buffers: (3678035706048, 3678035705984, 3678035706112)\n", + "- buffers: (4754305168, 4754307808, 4754310464)\n", "- dictionary: NULL\n", "- children[0]:" ] }, - "execution_count": 5, + "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "array = na.c_array(pa.array([\"one\", \"two\", \"three\", None]))\n", - "array" + "na.c_array([\"one\", \"two\", \"three\", None], na.string())" ] }, { @@ -220,67 +477,87 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "You can extract the fields of a `CArray` one at a time or parse it into a view to extract deserialized content:" + "Using `c_array()` is a good fit for testing and for ephemeral array objects that are being passed from one library to another. 
For a higher level interface, use `Array()`:" ] }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "<nanoarrow.c_lib.CArrayView>\n", - "- storage_type: 'string'\n", - "- length: 4\n", - "- offset: 0\n", - "- null_count: 1\n", - "- buffers[3]:\n", - " - validity <bool[1 b] 11100000>\n", - " - data_offset <int32[20 b] 0 3 6 11 11>\n", - " - data <string[11 b] b'onetwothree'>\n", - "- dictionary: NULL\n", - "- children[0]:" + "['one', 'two', 'three', None]" ] }, - "execution_count": 6, + "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "na.c_array_view(array)" + "array = na.Array([\"one\", \"two\", \"three\", None], na.string())\n", + "array.to_pylist()" + ] + }, + { + "cell_type": "code", + "execution_count": 62, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(nanoarrow.c_lib.CBufferView(bool[1 b] 11100000),\n", + " nanoarrow.c_lib.CBufferView(int32[20 b] 0 3 6 11 11),\n", + " nanoarrow.c_lib.CBufferView(string[11 b] b'onetwothree'))" + ] + }, + "execution_count": 62, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "array.buffers" ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "Like the `CSchema`, you can allocate an empty one and access its address with `_addr()` to pass to other array-exporting functions." 
+ "Advanced users can create arrays directly from buffers using `c_array_from_buffers()`:" ] }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "3" + "<nanoarrow.c_array.CArray string>\n", + "- length: 2\n", + "- offset: 0\n", + "- null_count: 0\n", + "- buffers: (0, 5002908320, 4999694624)\n", + "- dictionary: NULL\n", + "- children[0]:" ] }, - "execution_count": 7, + "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "array = na.allocate_c_array()\n", - "pa.array([1, 2, 3])._export_to_c(array._addr(), array.schema._addr())\n", - "array.length" + "na.c_array_from_buffers(\n", + " na.string(),\n", + " 2,\n", + " [None, na.c_buffer([0, 3, 6], na.int32()), b\"abcdef\"]\n", + ")" ] }, { @@ -290,30 +567,29 @@ "source": [ "### Array streams\n", "\n", - "You can use `nanoarrow.c_array_stream()` to wrap an object representing a sequence of `CArray`s with a common `CSchema` to an `ArrowArrayStream` and wrap it as a Python object. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.RecordBatchReader`)." + "You can use `nanoarrow.c_array_stream()` to wrap an object representing a sequence of `CArray`s with a common `CSchema` to an `ArrowArrayStream` and wrap it as a Python object. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.RecordBatchReader`, `pyarrow.ChunkedArray`)." 
] }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 64, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "<nanoarrow.c_lib.CArrayStream>\n", - "- get_schema(): struct<some_column: int32>" + "<nanoarrow.c_array_stream.CArrayStream>\n", + "- get_schema(): struct<col1: int64>" ] }, - "execution_count": 8, + "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "pa_array_child = pa.array([1, 2, 3], pa.int32())\n", - "pa_array = pa.record_batch([pa_array_child], names=[\"some_column\"])\n", - "reader = pa.RecordBatchReader.from_batches(pa_array.schema, [pa_array])\n", + "pa_batch = pa.record_batch({\"col1\": [1, 2, 3]})\n", + "reader = pa.RecordBatchReader.from_batches(pa_batch.schema, [pa_batch])\n", "array_stream = na.c_array_stream(reader)\n", "array_stream" ] @@ -328,25 +604,25 @@ }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 65, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "<nanoarrow.c_lib.CArray struct<some_column: int32>>\n", + "<nanoarrow.c_array.CArray struct<col1: int64>>\n", "- length: 3\n", "- offset: 0\n", "- null_count: 0\n", "- buffers: (0,)\n", "- dictionary: NULL\n", "- children[1]:\n", - " 'some_column': <nanoarrow.c_lib.CArray int32>\n", + " 'col1': <nanoarrow.c_array.CArray int64>\n", " - length: 3\n", " - offset: 0\n", " - null_count: 0\n", - " - buffers: (0, 3678035837056)\n", + " - buffers: (0, 2642948588352)\n", " - dictionary: NULL\n", " - children[0]:\n" ] @@ -358,34 +634,34 @@ ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "You can also get the address of a freshly-allocated stream to pass to a suitable exporting function:" + "Use `ArrayStream()` for a higher level interface:" ] }, { "cell_type": "code", - "execution_count": 10, + "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/plain": [ - "<nanoarrow.c_lib.CArrayStream>\n", - "- get_schema(): 
struct<some_column: int32>" + "nanoarrow.Array<non-nullable struct<col1: int64>>[3]\n", + "{'col1': 1}\n", + "{'col1': 2}\n", + "{'col1': 3}" ] }, - "execution_count": 10, + "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "array_stream = na.allocate_c_array_stream()\n", - "reader._export_to_c(array_stream._addr())\n", - "array_stream" + "reader = pa.RecordBatchReader.from_batches(pa_batch.schema, [pa_batch])\n", + "na.ArrayStream(reader).read_all()" ] }, { @@ -408,11 +684,13 @@ "\n", "```shell\n", "# Install dependencies\n", - "pip install -e .[test]\n", + "pip install -e \".[test]\"\n", "\n", "# Run tests\n", "pytest -vvx\n", - "```" + "```\n", + "\n", + "CMake is currently required to ensure that the vendored copy of nanoarrow in the Python package stays in sync with the nanoarrow sources in the working tree." ] } ], @@ -432,7 +710,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.4" + "version": "3.12.3" }, "orig_nbformat": 4 }, diff --git a/python/README.md b/python/README.md index 42b4e390..f279a095 100644 --- a/python/README.md +++ b/python/README.md @@ -29,11 +29,19 @@ interfaces. ## Installation -Python bindings for nanoarrow are not yet available on PyPI. You can install via -URL (requires a C compiler): +The nanoarrow Python bindings are available from [PyPI](https://pypi.org/) and +[conda-forge](https://conda-forge.org/): -```bash -python -m pip install "git+https://github.com/apache/arrow-nanoarrow.git#egg=nanoarrow&subdirectory=python" +```shell +pip install nanoarrow +conda install nanoarrow -c conda-forge +``` + +Development versions (based on the `main` branch) are also available: + +```shell +pip install --extra-index-url https://pypi.fury.io/arrow-nightlies/ \ + --prefer-binary --pre nanoarrow ``` If you can import the namespace, you're good to go! @@ -43,72 +51,207 @@ If you can import the namespace, you're good to go! 
import nanoarrow as na ``` -## Low-level C library bindings +## Data types, arrays, and array streams -The Arrow C Data and Arrow C Stream interfaces are comprised of three structures: the `ArrowSchema` which represents a data type of an array, the `ArrowArray` which represents the values of an array, and an `ArrowArrayStream`, which represents zero or more `ArrowArray`s with a common `ArrowSchema`. +The Arrow C Data and Arrow C Stream interfaces are comprised of three structures: the `ArrowSchema` which represents a data type of an array, the `ArrowArray` which represents the values of an array, and an `ArrowArrayStream`, which represents zero or more `ArrowArray`s with a common `ArrowSchema`. These concepts map to the `nanoarrow.Schema`, `nanoarrow.Array`, and `nanoarrow.ArrayStream` in the Python package. + + +```python +na.int32() +``` + + + + + <Schema> int32 -### Schemas -Use `nanoarrow.c_schema()` to convert an object to an `ArrowSchema` and wrap it as a Python object. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.Schema`, `pyarrow.DataType`, and `pyarrow.Field`). + + +```python +na.Array([1, 2, 3], na.int32()) +``` + + + + + nanoarrow.Array<int32>[3] + 1 + 2 + 3 + + + +The `nanoarrow.Array` can accommodate arrays with any number of chunks, reflecting the reality that many array containers (e.g., `pyarrow.ChunkedArray`, `polars.Series`) support this. + + +```python +chunked = na.Array.from_chunks([[1, 2, 3], [4, 5, 6]], na.int32()) +chunked +``` + + + + + nanoarrow.Array<int32>[6] + 1 + 2 + 3 + 4 + 5 + 6 + + + +Whereas chunks of an `Array` are always fully materialized when the object is constructed, the chunks of an `ArrayStream` have not necessarily been resolved yet. 
+ + +```python +stream = na.ArrayStream(chunked) +stream +``` + + + + + nanoarrow.ArrayStream<int32> + + + + +```python +with stream: + for chunk in stream: + print(chunk) +``` + + nanoarrow.Array<int32>[3] + 1 + 2 + 3 + nanoarrow.Array<int32>[3] + 4 + 5 + 6 + + +The `nanoarrow.ArrayStream` also provides an interface to nanoarrow's [Arrow IPC](https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc) reader: + + +```python +url = "https://github.com/apache/arrow-experiments/raw/main/data/arrow-commits/arrow-commits.arrows" +na.ArrayStream.from_url(url) +``` + + + + + nanoarrow.ArrayStream<non-nullable struct<commit: string, time: timestamp('us', 'UTC'), files: int3...> + + + +These objects implement the [Arrow PyCapsule interface](https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html) for both producing and consuming and are interchangeable with `pyarrow` objects in many cases: ```python import pyarrow as pa -schema = na.c_schema(pa.decimal128(10, 3)) -schema + +pa.field(na.int32()) ``` - <nanoarrow.c_lib.CSchema decimal128(10, 3)> - - format: 'd:10,3' - - name: '' - - flags: 2 - - metadata: NULL - - dictionary: NULL - - children[0]: + pyarrow.Field<: int32> + + + + +```python +pa.chunked_array(chunked) +``` + + + + + <pyarrow.lib.ChunkedArray object at 0x12a49a250> + [ + [ + 1, + 2, + 3 + ], + [ + 4, + 5, + 6 + ] + ] + + + + +```python +pa.array(chunked.chunk(1)) +``` + + + + + <pyarrow.lib.Int32Array object at 0x11b552500> + [ + 4, + 5, + 6 + ] + + + + +```python +na.Array(pa.array([10, 11, 12])) +``` + + + + nanoarrow.Array<int64>[3] + 10 + 11 + 12 -You can extract the fields of a `CSchema` object one at a time or parse it into a view to extract deserialized parameters. 
```python -na.c_schema_view(schema) +na.Schema(pa.string()) ``` - <nanoarrow.c_lib.CSchemaView> - - type: 'decimal128' - - storage_type: 'decimal128' - - decimal_bitwidth: 128 - - decimal_precision: 10 - - decimal_scale: 3 - - dictionary_ordered: False - - map_keys_sorted: False - - nullable: True - - storage_type_id: 24 - - type_id: 24 + <Schema> string + + + +## Low-level C library bindings +The nanoarrow Python package also provides lower level wrappers around Arrow C interface structures. You can create these using `nanoarrow.c_schema()`, `nanoarrow.c_array()`, and `nanoarrow.c_array_stream()`. +### Schemas -Advanced users can allocate an empty `CSchema` and populate its contents by passing its `._addr()` to a schema-exporting function. +Use `nanoarrow.c_schema()` to convert an object to an `ArrowSchema` and wrap it as a Python object. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.Schema`, `pyarrow.DataType`, and `pyarrow.Field`). ```python -schema = na.allocate_c_schema() -pa.int32()._export_to_c(schema._addr()) -schema +na.c_schema(pa.decimal128(10, 3)) ``` - <nanoarrow.c_lib.CSchema int32> - - format: 'i' + <nanoarrow.c_schema.CSchema decimal128(10, 3)> + - format: 'd:10,3' - name: '' - flags: 2 - metadata: NULL @@ -117,6 +260,21 @@ schema +Using `c_schema()` is a good fit for testing and for ephemeral schema objects that are being passed from one library to another. To extract the fields of a schema in a more convenient form, use `Schema()`: + + +```python +schema = na.Schema(pa.decimal128(10, 3)) +schema.precision, schema.scale +``` + + + + + (10, 3) + + + The `CSchema` object cleans up after itself: when the object is deleted, the underlying `ArrowSchema` is released. 
### Arrays @@ -125,72 +283,83 @@ You can use `nanoarrow.c_array()` to convert an array-like object to an `ArrowAr ```python -array = na.c_array(pa.array(["one", "two", "three", None])) -array +na.c_array(["one", "two", "three", None], na.string()) ``` - <nanoarrow.c_lib.CArray string> + <nanoarrow.c_array.CArray string> - length: 4 - offset: 0 - null_count: 1 - - buffers: (3678035706048, 3678035705984, 3678035706112) + - buffers: (4754305168, 4754307808, 4754310464) - dictionary: NULL - children[0]: -You can extract the fields of a `CArray` one at a time or parse it into a view to extract deserialized content: +Using `c_array()` is a good fit for testing and for ephemeral array objects that are being passed from one library to another. For a higher level interface, use `Array()`: ```python -na.c_array_view(array) +array = na.Array(["one", "two", "three", None], na.string()) +array.to_pylist() ``` - <nanoarrow.c_lib.CArrayView> - - storage_type: 'string' - - length: 4 - - offset: 0 - - null_count: 1 - - buffers[3]: - - validity <bool[1 b] 11100000> - - data_offset <int32[20 b] 0 3 6 11 11> - - data <string[11 b] b'onetwothree'> - - dictionary: NULL - - children[0]: + ['one', 'two', 'three', None] + + + + +```python +array.buffers +``` + + + (nanoarrow.c_lib.CBufferView(bool[1 b] 11100000), + nanoarrow.c_lib.CBufferView(int32[20 b] 0 3 6 11 11), + nanoarrow.c_lib.CBufferView(string[11 b] b'onetwothree')) -Like the `CSchema`, you can allocate an empty one and access its address with `_addr()` to pass to other array-exporting functions. 
+ + +Advanced users can create arrays directly from buffers using `c_array_from_buffers()`: ```python -array = na.allocate_c_array() -pa.array([1, 2, 3])._export_to_c(array._addr(), array.schema._addr()) -array.length +na.c_array_from_buffers( + na.string(), + 2, + [None, na.c_buffer([0, 3, 6], na.int32()), b"abcdef"] +) ``` - 3 + <nanoarrow.c_array.CArray string> + - length: 2 + - offset: 0 + - null_count: 0 + - buffers: (0, 5002908320, 4999694624) + - dictionary: NULL + - children[0]: ### Array streams -You can use `nanoarrow.c_array_stream()` to wrap an object representing a sequence of `CArray`s with a common `CSchema` to an `ArrowArrayStream` and wrap it as a Python object. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.RecordBatchReader`). +You can use `nanoarrow.c_array_stream()` to wrap an object representing a sequence of `CArray`s with a common `CSchema` to an `ArrowArrayStream` and wrap it as a Python object. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.RecordBatchReader`, `pyarrow.ChunkedArray`). 
```python -pa_array_child = pa.array([1, 2, 3], pa.int32()) -pa_array = pa.record_batch([pa_array_child], names=["some_column"]) -reader = pa.RecordBatchReader.from_batches(pa_array.schema, [pa_array]) +pa_batch = pa.record_batch({"col1": [1, 2, 3]}) +reader = pa.RecordBatchReader.from_batches(pa_batch.schema, [pa_batch]) array_stream = na.c_array_stream(reader) array_stream ``` @@ -198,8 +367,8 @@ array_stream - <nanoarrow.c_lib.CArrayStream> - - get_schema(): struct<some_column: int32> + <nanoarrow.c_array_stream.CArrayStream> + - get_schema(): struct<col1: int64> @@ -211,36 +380,37 @@ for array in array_stream: print(array) ``` - <nanoarrow.c_lib.CArray struct<some_column: int32>> + <nanoarrow.c_array.CArray struct<col1: int64>> - length: 3 - offset: 0 - null_count: 0 - buffers: (0,) - dictionary: NULL - children[1]: - 'some_column': <nanoarrow.c_lib.CArray int32> + 'col1': <nanoarrow.c_array.CArray int64> - length: 3 - offset: 0 - null_count: 0 - - buffers: (0, 3678035837056) + - buffers: (0, 2642948588352) - dictionary: NULL - children[0]: -You can also get the address of a freshly-allocated stream to pass to a suitable exporting function: +Use `ArrayStream()` for a higher level interface: ```python -array_stream = na.allocate_c_array_stream() -reader._export_to_c(array_stream._addr()) -array_stream +reader = pa.RecordBatchReader.from_batches(pa_batch.schema, [pa_batch]) +na.ArrayStream(reader).read_all() ``` - <nanoarrow.c_lib.CArrayStream> - - get_schema(): struct<some_column: int32> + nanoarrow.Array<non-nullable struct<col1: int64>>[3] + {'col1': 1} + {'col1': 2} + {'col1': 3} @@ -264,3 +434,5 @@ pip install -e ".[test]" # Run tests pytest -vvx ``` + +CMake is currently required to ensure that the vendored copy of nanoarrow in the Python package stays in sync with the nanoarrow sources in the working tree.
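As a reading aid for the `array.buffers` output in the diff above, the three buffers of the string array (`11100000`, `0 3 6 11 11`, `b'onetwothree'`) can be decoded by hand. The following is a minimal sketch using only the Python standard library, not nanoarrow's own implementation. The helper name `decode_utf8_array` is hypothetical, and the sketch assumes the printed validity bits are listed in element order, so the underlying byte is `0b00000111` under Arrow's least-significant-bit-first numbering.

```python
import struct


def decode_utf8_array(validity, offsets, data, length):
    """Decode the validity, offsets, and data buffers of an Arrow string array.

    Hypothetical helper for illustration only; it is not part of nanoarrow.
    """
    # The offsets buffer holds length + 1 little-endian int32 values.
    starts = struct.unpack(f"<{length + 1}i", offsets)
    out = []
    for i in range(length):
        # Validity bitmaps are least-significant-bit first; a missing bitmap
        # (None) means that every element is valid.
        is_valid = validity is None or (validity[i // 8] >> (i % 8)) & 1
        if is_valid:
            out.append(data[starts[i]:starts[i + 1]].decode("utf-8"))
        else:
            out.append(None)
    return out


# Buffers matching the README output: three valid elements followed by a null.
validity = bytes([0b00000111])
offsets = struct.pack("<5i", 0, 3, 6, 11, 11)
data = b"onetwothree"

decoded = decode_utf8_array(validity, offsets, data, 4)
print(decoded)
```

The same helper also covers the `c_array_from_buffers()` example in the diff, where the validity buffer is `None`, the offsets are `0 3 6`, and the data is `b"abcdef"`.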