GitHub user shner-elmo created a discussion: Dictionary encode a field in a
dataset (string -> int)
I want to encode a field/column in a Parquet dataset to save memory.
Basically:
* get all unique values: `field.unique()`
* create a mapping to integers: `{val: i for i, val in
enumerate(field.unique())}`
* save the mapping in a separate file (instead of saving the same dictionary in
each parquet file of the dataset)
What I have been struggling with is creating a field/column in the dataset
that contains these integers.
The question is: how can I perform the dictionary lookup row-wise without
writing a UDF or a Python for loop?
I was trying this:
```py
import pyarrow.compute as pc
import pyarrow.dataset as ds

dataset = ds.dataset(...)
columns = {'ticker': pc.dictionary_encode(ds.field('ticker'))}
scanner = dataset.scanner(columns=columns)
# pyarrow.lib.ArrowInvalid: ExecuteScalarExpression cannot Execute non-scalar
# expression dictionary_encode(ticker)
```
and also:
```py
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.dataset as ds

arr = pa.array([{...}], type=pa.map_(pa.string(), pa.string()))
columns = {'ticker': pc.map_lookup(arr, ds.field("ticker"), "first")}
# pyarrow.lib.ArrowInvalid: Could not convert <pyarrow.compute.Expression a>
# with type pyarrow._compute.Expression: did not recognize Python value type
# when inferring an Arrow data type
```
Is there a way to do the map lookup natively (at C speed) instead of with
Python for loops?
GitHub link: https://github.com/apache/arrow/discussions/47293