Hi! One of the gaps I currently have in Funnel Rocket ( https://github.com/DynamicYieldProjects/funnel-rocket) is supporting nested columns, as in: given a Parquet file with a column of type List(int64), be able to find rows where the list holds a specific int element.
Right now, the need is fortunately limited to lists of primitives (mostly int) and maps of string->string, rather than any arbitrary complexity. Currently, I load Parquet files via pyarrow, then call to_pandas() and run multiple filters on the DataFrame. After reading Uwe's blog post ( https://uwekorn.com/2018/08/03/use-numba-to-work-with-apache-arrow-in-pure-python.html) and looking at the Fletcher project (https://github.com/xhochy/fletcher), seems the "proper" way to do it would be: * Write an ExtensionDType/ExtensionArray can wrap an arrow ChunkedArray made of ListArrays. Not even sure what the operator should be for lookup in a list - should I treat a list_series==123 as "for each list in this series, look for the element 123 in it?". * Potentially use a @jitclass for more performant lookup, as Uwe has outlined. * For now, for any abstract method I'm not sure what to do with - start with raising an exception, then run some unit tests based on my project's needs, and see that they pass :-/ * When calling Table.to_pandas(), supply a type mapper argument to map the specific supported types to the appropriate extension class. * If it seems to work, figure out if I've missed something important in the concrete classes :-/ Am I getting this right, more or less? Thanks a lot, Elad
