GitHub user apjoseph created a discussion: codegen tool like protoc / flatc for
client type annotations for an arrow schema.
Hello, I've recently been using the ADBC driver for DuckDB and find it
convenient, but one huge missing piece that has been vexing me is codegen /
static type checking of Arrow schemas. While I know that Arrow is a columnar
storage format, when interacting with it in a client it is in many cases still
preferable to model / access it as a traditional domain object, for the simple
reason that the properties of a given Arrow schema will change over time, just
like any other data interchange structure.
For most widespread data formats, there are typically tools to take a schema
and generate the relevant client code. FlatBuffers has flatc, protobuf has
protoc. There are plenty of tools which can take a JSON schema and generate
client code, but conspicuously I have never been able to find any equivalent
for Arrow (at least for Python classes / type annotations) despite its
widespread use.
I am wondering if these capabilities exist somewhere and I have simply missed
them or if there is some underlying reason that no such tools exist.
I have a very crude system of wrapper facades that works well enough for static
type checking, which looks something like the code below, but I keep thinking
that, with 8.1 billion people on Earth, someone out there must have built a
more automated / polished version of this.
## Wrappers
```python
import typing

import pyarrow


class ListFacade:
    """A facade for a pyarrow ListScalar."""

    def __init__(self, lst: pyarrow.lib.ListScalar):
        self.lst = lst

    def __getitem__(self, index: int):
        val_type = self.lst.type.value_type
        return convert_value(self.lst[index], val_type)

    def __len__(self):
        return len(self.lst)


class StructFacade:
    """A facade for a pyarrow StructScalar."""

    def __init__(self, struct: pyarrow.lib.StructScalar):
        self.struct = struct

    def __getattr__(self, key: str):
        # Only called for names not found normally, i.e. the struct's fields.
        field = self.struct.type[key]
        return convert_value(self.struct[key], field.type)


class RecordFacade:
    """A facade for a single row in a pyarrow RecordBatch."""

    def __init__(self, batch: pyarrow.lib.RecordBatch, row_num: int):
        self.batch = batch
        self.row_num = row_num

    def __getattr__(self, key: str):
        field = self.batch.schema.field(key)
        return convert_value(self.batch.column(key)[self.row_num], field.type)


class BatchReaderFacade[T]:  # PEP 695 syntax, requires Python 3.12+
    """Wrapper for a batch reader that yields rows as facades."""

    def __init__(self, reader: pyarrow.lib.RecordBatchReader):
        self.reader = reader

    def rows(self) -> typing.Iterable[T]:
        """Yield each row as a RecordFacade that provides the attributes
        of Protocol T."""
        for batch in self.reader:
            for i in range(batch.num_rows):
                # The facade is only duck-typed as T, so tell the checker.
                yield typing.cast("T", RecordFacade(batch, i))
```
## Converter
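```python
import pyarrow


def convert_value(value, schema: pyarrow.lib.DataType):
    """Convert a pyarrow value to a Python value or a facade object."""
    if pyarrow.types.is_struct(schema):
        return StructFacade(value)
    # is_primitive() covers fixed-width types only, so strings (such as the
    # name / dtype fields used here) are checked explicitly.
    if (pyarrow.types.is_primitive(schema)
            or pyarrow.types.is_string(schema)
            or pyarrow.types.is_large_string(schema)):
        return value.as_py()
    if pyarrow.types.is_list(schema):
        return ListFacade(value)
    raise ValueError(f"Unsupported type: {schema}")
```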
## Example domain Protocols (interfaces)
```python
from typing import Protocol


class DatabaseFunctionParam(Protocol):
    name: str
    dtype: str


class DatabaseFunction(Protocol):
    name: str
    kind: str
    return_type: str
    params: list[DatabaseFunctionParam]


class DatabaseSchema(Protocol):
    id: int
    name: str
    database: str  # selected at the top level of the query, not per function
    functions: list[DatabaseFunction]
```
## Query
```python
import adbc_driver_manager.dbapi

with adbc_driver_manager.dbapi.connect(
    driver="libduckdb", entrypoint="duckdb_adbc_init"
) as conn:
    with conn.cursor() as cursor:
        cursor.execute("""
            SELECT
                s.oid as id,
                s.schema_name as name,
                s.database_name as database,
                array_agg(
                    struct_pack(
                        name:=f.function_name,
                        kind:=f.function_type,
                        return_type:=f.return_type,
                        params:=f.parameters.apply(lambda pn,i:
                            struct_pack(name:=pn,dtype:=f.parameter_types[i]))
                    ) ORDER BY f.function_name
                ) as functions
            FROM
                duckdb_functions() as f
                JOIN duckdb_schemas() as s ON f.database_name = s.database_name AND
                    f.schema_name = s.schema_name
            GROUP BY
                s.oid,
                s.schema_name,
                s.database_name
        """)
        # Wrap the Arrow stream so each row is exposed as a DatabaseSchema.
        reader = BatchReaderFacade[DatabaseSchema](cursor.fetch_record_batch())
        for dbSchema in reader.rows():
            print(dbSchema.name)
            for fn in dbSchema.functions:
                print(f" {fn.name}")
                for param in fn.params:
                    print(f" {param.name}: {param.dtype}")
```
GitHub link: https://github.com/apache/arrow/discussions/47530