GitHub user shner-elmo created a discussion: Dictionary encode a field in a dataset (string -> int)
So I want to encode a field/column in a Parquet dataset to save memory. Basically:

* get all unique values: `field.unique()`
* create a mapping to integers: `{val: i for i, val in enumerate(field.unique())}`
* save the mapping in a separate file (instead of saving the same dictionary in each Parquet file of the dataset)

What I have been struggling with is creating a field/column in the dataset that contains all these integers. The question is: how can I perform the dictionary lookup row-wise without creating a UDF or doing a Python for loop?

I was trying this:

```py
dataset = ds.dataset(...)
columns = {'ticker': pc.dictionary_encode(ds.field('ticker'))}
scanner = dataset.scanner(columns=columns)
# pyarrow.lib.ArrowInvalid: ExecuteScalarExpression cannot Execute non-scalar expression dictionary_encode(ticker)
```

and also:

```py
arr = pa.array([{...}], type=pa.map_(pa.string(), pa.string()))
columns = {'ticker': pc.map_lookup(arr, ds.field("ticker"), "first")}
# pyarrow.lib.ArrowInvalid: Could not convert <pyarrow.compute.Expression a> with type pyarrow._compute.Expression: did not recognize Python value type when inferring an Arrow data type
```

Is there a way to do the map lookup natively (at C speed) instead of Python for loops?

GitHub link: https://github.com/apache/arrow/discussions/47293