Thank you very much for the response. I was looking directly at tab['x'] and didn't realize that the dictionary is present at the chunk level.
On Thu, Apr 21, 2022, 1:17 AM Weston Pace <[email protected]> wrote:

> > However I cannot figure out any easy way to get the mapping used to
> > create the dictionary array (vals) from the table. Can you please
> > let me know the easiest way?
>
> A dictionary is going to be associated with an array and not a table.
> So you first need to get the array from the table. Tables are made of
> columns, each column is made of chunks, and each chunk is an array.
> Each chunk could have a different mapping, so that is something you
> may need to deal with at some point depending on your goal.
>
> The table you are creating in your example has one column and that
> column has one chunk, so we can get to the mapping with:
>
> tab.column(0).chunks[0].dictionary
>
> And we can get to the indices with:
>
> tab.column(0).chunks[0].indices
>
> > Also, since this is effectively a string array which is dictionary
> > encoded, is there any way to use string compute kernels like
> > starts_with etc.? Right now I am aware of two methods and they are
> > not straightforward.
>
> Regrettably, I don't think we have kernels in place for string
> functions on dictionary arrays. At least, that is my reading of [1].
> So the two workarounds you have may be the best there is at the
> moment.
>
> [1] https://issues.apache.org/jira/browse/ARROW-14068
>
> On Wed, Apr 20, 2022 at 10:00 AM Suresh V <[email protected]> wrote:
> >
> > Hi. I created a pyarrow table from a dictionary array as shown
> > below. However, I cannot figure out any easy way to get the mapping
> > used to create the dictionary array (vals) from the table. Can you
> > please let me know the easiest way, other than the ones which
> > involve pyarrow.compute or conversion to pandas, as those are
> > expensive operations for large datasets?
> >
> > import pyarrow as pa
> > import pyarrow.compute as pc
> > import pyarrow.dataset as ds
> > import numpy as np
> >
> > vals = ['aa', 'ab', 'ac', 'ba', 'bb', 'bc']
> > int_vals = [3, 4, 3, 0, 2, 0, 1, 5, 0, 0]
> > x = pa.DictionaryArray.from_arrays(pa.array(int_vals), vals)
> > tab = pa.Table.from_arrays([x], names=['x'])
> >
> > Also, since this is effectively a string array which is dictionary
> > encoded, is there any way to use string compute kernels like
> > starts_with etc.? Right now I am aware of two methods and they are
> > not straightforward.
> >
> > Approach 1: cast to string and then run the string kernel
> >
> > expr = pc.starts_with(pc.field("x").cast(pa.string()), "a")
> > ds.Scanner.from_batches(tab.to_batches(), schema=tab.schema,
> >     columns={'x': pc.field('x')}, filter=expr).to_table()
> >
> > Approach 2: filter using the corresponding indices, assuming we
> > have access to the dictionary
> >
> > filter_ = np.where(pc.starts_with(x.dictionary, "a"))[0]
> > pc.is_in(x.indices, filter_)
> >
> > Approach 2 is better/faster, but I am not able to figure out how to
> > get the dictionary/indices when we start from a table read from
> > Parquet/Feather.
> >
> > Thanks
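For anyone who lands on this thread later, here is a minimal runnable
sketch that pulls the snippets above together. It uses the one-column,
one-chunk table from the original message; with a table read from
Parquet/Feather you would loop over every chunk, since each chunk can
carry its own dictionary:

import pyarrow as pa

vals = ['aa', 'ab', 'ac', 'ba', 'bb', 'bc']
int_vals = [3, 4, 3, 0, 2, 0, 1, 5, 0, 0]
x = pa.DictionaryArray.from_arrays(pa.array(int_vals), vals)
tab = pa.Table.from_arrays([x], names=['x'])

# Each chunk of a dictionary-encoded column is a DictionaryArray
# with its own mapping and its own integer codes.
for i, chunk in enumerate(tab.column('x').chunks):
    print(i, chunk.dictionary)  # ['aa', 'ab', 'ac', 'ba', 'bb', 'bc']
    print(i, chunk.indices)     # [3, 4, 3, 0, 2, 0, 1, 5, 0, 0]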

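And, continuing from the sketch above, one way to generalize approach 2
to a table whose column may have several chunks. The helper name
dict_starts_with_mask is mine, not a pyarrow API; it runs the string
kernel once per distinct dictionary value in each chunk, then maps the
matching codes back to rows with is_in:

import pyarrow as pa
import pyarrow.compute as pc

def dict_starts_with_mask(chunked, prefix):
    # Build one boolean mask per chunk, since each chunk can
    # have a different dictionary.
    masks = []
    for chunk in chunked.chunks:
        # One starts_with test per distinct value, not per row.
        dict_mask = pc.starts_with(chunk.dictionary, pattern=prefix)
        matching = pa.array(
            [i for i, hit in enumerate(dict_mask.to_pylist()) if hit],
            type=chunk.indices.type,
        )
        # A row matches when its code points at a matching entry.
        masks.append(pc.is_in(chunk.indices, value_set=matching))
    return pa.chunked_array(masks)

mask = dict_starts_with_mask(tab.column('x'), 'a')
print(tab.filter(mask))  # rows where x starts with 'a'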