Thanks for the explanation. But I am now worried that approach 2 (using a
string kernel like starts_with) will become more complicated, as I have to
run the filter in a Python loop over the chunks, effectively slowing it down.
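Concretely, the loop I have in mind looks something like this (untested
sketch; starts_with_mask is just a name I made up, and I am assuming every
chunk of the column is a DictionaryArray):

import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

def starts_with_mask(col, prefix):
    # Approach 2 applied chunk by chunk: run starts_with on each chunk's
    # (small) dictionary, then expand it to a mask over the (large)
    # indices with is_in.
    masks = []
    for chunk in col.chunks:
        keep = np.where(pc.starts_with(chunk.dictionary, prefix))[0]
        # Cast the kept dictionary codes to the chunk's index type so
        # is_in compares like with like.
        value_set = pa.array(keep).cast(chunk.indices.type)
        masks.append(pc.is_in(chunk.indices, value_set=value_set))
    return pa.chunked_array(masks)

# usage: tab.filter(starts_with_mask(tab.column('x'), 'a'))

The dictionary-side work is small; my worry is mostly the Python-level
overhead when there are many chunks.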
If we had access to the indices directly, approach 2 was 100% faster than
approach 1 for an array of 50M entries.

Completely unrelated: I also noticed that reading a Feather file via the
ipc/feather API is 10x faster (0.2 s vs 2 s) than dataset(...).to_table()
for 50M rows. I wouldn't expect there to be such a big difference, so I will
file a bug if it isn't a known issue.
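For reference, the two read paths I timed look like this (the file path is a
placeholder):

import pyarrow.feather as feather
import pyarrow.dataset as ds

t1 = feather.read_table("data.feather")                       # ~0.2 s for 50M rows
t2 = ds.dataset("data.feather", format="feather").to_table()  # ~2 s for the same file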
Thanks

On Thu, Apr 21, 2022, 4:07 PM Aldrin <[email protected]> wrote:

> I could be wrong, but in my experience the dictionary is available at the
> chunk level, because that is where you know it is a DictionaryArray (or at
> least, an Array). At the column level, you only know it's a ChunkedArray,
> which seems to roughly be an alias for a vector<Array> (list[Array]), at
> least type-wise.
>
> Also, I think each chunk references the same dictionary, so I think you
> can access any chunk's dictionary and get the same one.
>
> Aldrin Montana
> Computer Science PhD Student
> UC Santa Cruz
>
>
> On Wed, Apr 20, 2022 at 10:54 PM Suresh V <[email protected]> wrote:
>
>> Thank you very much for the response. I was looking directly at tab['x'].
>> Didn't realize that the dictionary is present at the chunk level.
>>
>> On Thu, Apr 21, 2022, 1:17 AM Weston Pace <[email protected]> wrote:
>>
>>> > However I cannot figure out any easy way to get the mapping
>>> > used to create the dictionary array (vals) easily from the table. Can
>>> > you please let me know the easiest way?
>>>
>>> A dictionary is going to be associated with an array and not a table.
>>> So you first need to get the array from the table. Tables are made of
>>> columns, each column is made of chunks, and each chunk is an array.
>>> Each chunk could have a different mapping, so that is something you
>>> may need to deal with at some point, depending on your goal.
>>>
>>> The table you are creating in your example has one column and that
>>> column has one chunk, so we can get to the mapping with:
>>>
>>> tab.column(0).chunks[0].dictionary
>>>
>>> And we can get to the indices with:
>>>
>>> tab.column(0).chunks[0].indices
>>>
>>> > Also since this is effectively a string array which is dictionary
>>> > encoded, is there any way to use string compute kernels like
>>> > starts_with etc. Right now I am aware of two methods and they are not
>>> > straightforward.
>>>
>>> Regrettably, I don't think we have kernels in place for string
>>> functions on dictionary arrays. At least, that is my reading of [1].
>>> So the two workarounds you have may be the best there is at the
>>> moment.
>>>
>>> [1] https://issues.apache.org/jira/browse/ARROW-14068
>>>
>>> On Wed, Apr 20, 2022 at 10:00 AM Suresh V <[email protected]> wrote:
>>> >
>>> > Hi. I created a pyarrow table from a dictionary array as shown
>>> > below. However, I cannot figure out any easy way to get the mapping
>>> > used to create the dictionary array (vals) from the table. Can
>>> > you please let me know the easiest way? Other than the ones that
>>> > involve pyarrow.compute or conversion to pandas, as those are
>>> > expensive operations for large datasets.
>>> >
>>> > import pyarrow as pa
>>> > import pyarrow.compute as pc
>>> > import pyarrow.dataset as ds
>>> > import numpy as np
>>> >
>>> > vals = ['aa', 'ab', 'ac', 'ba', 'bb', 'bc']
>>> > int_vals = [3, 4, 3, 0, 2, 0, 1, 5, 0, 0]
>>> > x = pa.DictionaryArray.from_arrays(pa.array(int_vals), vals)
>>> > tab = pa.Table.from_arrays([x], names=['x'])
>>> >
>>> > Also, since this is effectively a string array which is dictionary
>>> > encoded, is there any way to use string compute kernels like
>>> > starts_with, etc.? Right now I am aware of two methods and they are
>>> > not straightforward.
>>> >
>>> > Approach 1: cast to string and then run the string kernel.
>>> >
>>> > expr = pc.starts_with(pc.field("x").cast(pa.string()), "a")
>>> > ds.Scanner.from_batches(tab.to_batches(), schema=tab.schema,
>>> >                         columns={'x': pc.field('x')}, filter=expr).to_table()
>>> >
>>> > Approach 2: filter using the corresponding indices, assuming we have
>>> > access to the dictionary.
>>> >
>>> > filter_ = np.where(pc.starts_with(x.dictionary, "a"))[0]
>>> > pc.is_in(x.indices, filter_)
>>> >
>>> > Approach 2 is better/faster, but I am not able to figure out how to
>>> > get the dictionary/indices assuming we start from a table read from
>>> > Parquet/Feather.
>>> >
>>> > Thanks
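A quick way to check Aldrin's point above on a real table: if every chunk
shares one dictionary, the starts_with over the dictionary in approach 2 only
needs to run once, and the per-chunk work reduces to the cheap is_in calls.
An untested sketch, with a placeholder file path and the column name 'x' from
the thread:

import pyarrow.feather as feather

tab = feather.read_table("data.feather")
col = tab.column("x")
first = col.chunks[0].dictionary
# True if every chunk references the same mapping (Aldrin's observation)
print(all(chunk.dictionary.equals(first) for chunk in col.chunks))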
