[D] codegen tool like protoc / flatc for client type annotations for an arrow schema. [arrow]

via GitHub Mon, 08 Sep 2025 12:50:26 -0700


GitHub user apjoseph created a discussion: codegen tool like protoc / flatc for 
client type annotations for an arrow schema.


Hello, I've recently been using the ADBC driver for DuckDB and find it 
convenient, but one huge missing piece that has been vexing me is codegen / 
static type checking of Arrow schemas. While I know that Arrow is a columnar 
storage format, when interacting with it in a client it is in many cases still 
preferable to model / access it as a traditional domain object, for the simple 
reason that the properties of a given Arrow schema will change over time, just 
like any other data interchange structure.

For most widespread data formats, there are typically tools to take a schema 
and generate the relevant client code. FlatBuffers has flatc, protobuf has 
protoc. There are plenty of tools which can take a JSON schema and generate 
client code, but conspicuously I have never been able to find any equivalent 
for Arrow (at least for Python classes / type annotations) despite its 
widespread use.

I am wondering if these capabilities exist somewhere and I have simply missed 
them or if there is some underlying reason that no such tools exist.  

I have a very crude system of wrapper facades that works well enough for static 
type checking and looks something like the code below, but I keep thinking that 
somewhere out there, with 8.1 billion people on Earth, someone must have built 
a more automated / polished version of this.


## Wrappers
```python
import typing

import pyarrow


class ListFacade:
    """ A facade for a pyarrow ListScalar """
    def __init__(self, lst: pyarrow.lib.ListScalar):
        self.lst = lst

    def __getitem__(self, index: int):
        val_type = self.lst.type.value_type
        return convert_value(self.lst[index],val_type)

    def __len__(self):
        return len(self.lst)


class StructFacade:
    """ A facade for a pyarrow StructScalar """
    def __init__(self, struct: pyarrow.lib.StructScalar):
        self.struct = struct

    def __getattr__(self, key: str):
        field = self.struct.type[key]
        return convert_value(self.struct[key],field.type)


class RecordFacade:
    """ A facade for a single row in a pyarrow RecordBatch """

    def __init__(self, batch: pyarrow.lib.RecordBatch, row_num: int):
        self.batch = batch
        self.row_num = row_num

    def __getattr__(self, key: str):
        field = self.batch.schema.field(key)
        return convert_value(self.batch.column(key)[self.row_num],field.type)

class BatchReaderFacade[T]:
    """Wrapper for batch reader that yields rows as facades"""
    def __init__(self, reader:pyarrow.lib.RecordBatchReader):
        self.reader = reader

    def rows(self) -> typing.Iterator[T]:
        """Yield each row as a RecordFacade exposing the attributes of Protocol T."""
        for batch in self.reader:
            for i in range(batch.num_rows):
                # RecordFacade resolves attributes dynamically, so cast for the type checker.
                yield typing.cast(T, RecordFacade(batch, i))

```

## Converter
```python
def convert_value(value, schema: pyarrow.lib.DataType):
    """ Convert a pyarrow value to a Python value or a facade object """
    if pyarrow.types.is_struct(schema):
        return StructFacade(value)
    # pyarrow does not classify strings as primitive, so check for them explicitly.
    if pyarrow.types.is_primitive(schema) or pyarrow.types.is_string(schema):
        return value.as_py()
    if pyarrow.types.is_list(schema):
        return ListFacade(value)
    raise ValueError(f"Unsupported type: {schema}")
```

## Example domain Protocols (interfaces)
```python
from typing import Protocol


class DatabaseFunctionParam(Protocol):
    name: str
    dtype: str

class DatabaseFunction(Protocol):
    name: str
    kind: str
    return_type: str
    params: list[DatabaseFunctionParam]

class DatabaseSchema(Protocol):
    id: int
    name: str
    database: str  # selected per schema row in the query, not per function
    functions: list[DatabaseFunction]
```

## Query
```python
import adbc_driver_manager.dbapi

with adbc_driver_manager.dbapi.connect(
    driver="libduckdb", entrypoint="duckdb_adbc_init"
) as conn:
    with conn.cursor() as cursor:
        cursor.execute("""
        SELECT
          s.oid as id,
          s.schema_name as name,
          s.database_name as database,
          array_agg(
            struct_pack(
              name:=f.function_name,
              kind:=f.function_type,
              return_type:=f.return_type,
              params:=f.parameters.apply(lambda pn,i: struct_pack(name:=pn,dtype:=f.parameter_types[i]))
            ) ORDER BY f.function_name
          ) as functions
        FROM
          duckdb_functions() as f
          JOIN duckdb_schemas() as s ON f.database_name = s.database_name AND f.schema_name = s.schema_name
        GROUP BY
            s.oid,
            s.schema_name,
            s.database_name
        """)
        reader = BatchReaderFacade[DatabaseSchema](cursor.fetch_record_batch())
        for dbSchema in reader.rows():
            print(dbSchema.name)
            for fn in dbSchema.functions:
                print(f"  {fn.name}")
                for param in fn.params:
                    print(f"     {param.name}: {param.dtype}")
``` 



GitHub link: https://github.com/apache/arrow/discussions/47530
