Hi .. I created a pyarrow table from a dictionary array as shown below. However, I cannot figure out an easy way to recover the mapping used to build the dictionary array (vals) from the table. What is the easiest way to do this, other than approaches that go through pyarrow.compute or conversion to pandas, since those are expensive operations for large datasets?
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.dataset as ds  # needed for ds.Scanner in approach 1 below
import numpy as np
vals = ['aa', 'ab', 'ac', 'ba', 'bb', 'bc']
int_vals = [3, 4, 3, 0, 2, 0, 1, 5, 0, 0]
x = pa.DictionaryArray.from_arrays(pa.array(int_vals), vals)
tab = pa.Table.from_arrays([x], names=['x'])
Also, since this is effectively a string array that happens to be dictionary encoded, is there any way to use string compute kernels like starts_with on it directly? Right now I am aware of two methods, and neither is straightforward.
Approach 1: cast to string and then run the string kernel
expr = pc.starts_with(pc.field("x").cast(pa.string()), "a")
ds.Scanner.from_batches(tab.to_batches(), schema=tab.schema,
columns={'x': pc.field('x')}, filter=expr).to_table()
Approach 2: filter using the corresponding indices, assuming we have access to the dictionary
filter_ = np.where(pc.starts_with(x.dictionary, "a"))[0]
pc.is_in(x.indices, value_set=pa.array(filter_))
Approach 2 is better/faster, but I am not able to figure out how to get the dictionary/indices when starting from a table read from parquet/feather.
Thanks