GitHub user shner-elmo created a discussion: Dictionary encode a field in a 
dataset (string -> int)

So I want to encode a field/column in a Parquet dataset to save memory.
Basically:
* get all unique values: `field.unique()`
* create a mapping to integers: `{val: i for i, val in 
enumerate(field.unique())}`
* save the mapping in a separate file (instead of saving the same dictionary in 
each parquet file of the dataset)

What I have been struggling with is creating a field/column in the dataset 
that contains all these integers.
The question is: how can I perform the dictionary lookup row-wise without 
writing a UDF or a Python for loop?

I was trying this:
```py
dataset = ds.dataset(...)
columns = {'ticker': pc.dictionary_encode(ds.field('ticker'))}
scanner = dataset.scanner(columns=columns)
# pyarrow.lib.ArrowInvalid: ExecuteScalarExpression cannot Execute non-scalar
# expression dictionary_encode(ticker)
```
and also:
```py
arr = pa.array([{...}], type=pa.map_(pa.string(), pa.string()))
columns = {'ticker': pc.map_lookup(arr, ds.field("ticker"), "first")}
# pyarrow.lib.ArrowInvalid: Could not convert <pyarrow.compute.Expression a>
# with type pyarrow._compute.Expression: did not recognize Python value type
# when inferring an Arrow data type
```

Is there a way to do the map lookup natively (at C speed) instead of with a 
Python for loop?

GitHub link: https://github.com/apache/arrow/discussions/47293
