Thanks for the explanation.  But I am now worried that approach 2, using a
string kernel like starts_with, will become more complicated: I would have
to run the filter in a Python loop, effectively slowing it down.

When we had access to the indices directly, approach 2 was 100% faster than
approach 1 for an array of 50M entries.

Completely unrelated, I also noticed that reading a Feather file through the
ipc/feather API is 10x faster (0.2 s vs 2 s) than dataset.to_table() for
50M rows. I wouldn't expect such a big difference. I will file a bug if it
isn't a known issue.

Thanks
On Thu, Apr 21, 2022, 4:07 PM Aldrin <[email protected]> wrote:

> I could be wrong, but in my experience the dictionary is available at the
> chunk level, because that is where you know it is a DictionaryArray (or at
> least, an Array). At the column level, you only know it's a ChunkedArray,
> which seems to roughly be an alias to a vector<Array> (list[Array]) at
> least type-wise.
>
> Also, I think each chunk references the same dictionary, so I think you
> can access any chunk's dictionary and get the same one.
>
> Aldrin Montana
> Computer Science PhD Student
> UC Santa Cruz
>
>
> On Wed, Apr 20, 2022 at 10:54 PM Suresh V <[email protected]> wrote:
>
>> Thank you very much for the response. I was looking directly at tab['x']
>> and didn't realize that the dictionary is present at the chunk level.
>>
>> On Thu, Apr 21, 2022, 1:17 AM Weston Pace <[email protected]> wrote:
>>
>>> > However I cannot figure out any easy way to get the mapping
>>> > used to create the dictionary array (vals) easily from the table. Can
>>> > you please let me know the easiest way?
>>>
>>> A dictionary is going to be associated with an array and not a table.
>>> So you first need to get the array from the table.  Tables are made of
>>> columns and each column is made of chunks and each chunk is an array.
>>> Each chunk could have a different mapping, so that is something you
>>> may need to deal with at some point depending on your goal.
>>>
>>> The table you are creating in your example has one column and that
>>> column has one chunk so we can get to the mapping with:
>>>
>>>     tab.column(0).chunks[0].dictionary
>>>
>>> And we can get to the indices with:
>>>
>>>     tab.column(0).chunks[0].indices
>>>
>>> > Also since this is effectively a string array which is dictionary
>>> > encoded, is there any way to use string compute kernels like
>>> > starts_with etc. Right now I am aware of two methods and they are not
>>> > straightforward.
>>>
>>> Regrettably, I don't think we have kernels in place for string
>>> functions on dictionary arrays.  At least, that is my reading of [1].
>>> So the two workarounds you have may be the best there is at the
>>> moment.
>>>
>>> [1] https://issues.apache.org/jira/browse/ARROW-14068
>>>
>>> On Wed, Apr 20, 2022 at 10:00 AM Suresh V <[email protected]> wrote:
>>> >
>>> > Hi .. I created a pyarrow table from a dictionary array as shown
>>> > below. However I cannot figure out any easy way to get the mapping
>>> > used to create the dictionary array (vals) easily from the table. Can
>>> > you please let me know the easiest way? Other than the ones which
>>> > involve pyarrow.compute/conversion to pandas as they are expensive
>>> > operations for large datasets.
>>> >
>>> > import pyarrow as pa
>>> > import pyarrow.compute as pc
>>> > import pyarrow.dataset as ds
>>> > import numpy as np
>>> >
>>> > vals = ['aa', 'ab', 'ac', 'ba', 'bb', 'bc']
>>> > int_vals = [3, 4, 3, 0, 2, 0, 1, 5, 0, 0]
>>> > x = pa.DictionaryArray.from_arrays(pa.array(int_vals), vals)
>>> > tab = pa.Table.from_arrays([x], names=['x'])
>>> >
>>> > Also since this is effectively a string array which is dictionary
>>> > encoded, is there any way to use string compute kernels like
>>> > starts_with etc. Right now I am aware of two methods and they are not
>>> > straightforward.
>>> >
>>> > approach 1:
>>> > Cast to string and then run string kernel
>>> > expr = pc.starts_with(pc.field("x").cast(pa.string()), "a")
>>> > ds.Scanner.from_batches(tab.to_batches(), schema=tab.schema,
>>> > columns={'x': pc.field('x')}, filter=expr).to_table()
>>> >
>>> > approach 2:
>>> > filter using the corresponding indices, assuming we have access to the
>>> > dictionary
>>> > filter_ = np.where(pc.starts_with(x.dictionary, "a"))[0]
>>> > pc.is_in(x.indices, filter_)
>>> >
>>> > Approach 2 is better/faster, but I am not able to figure out how to
>>> > get the dictionary/indices assuming we start from a table read from
>>> > parquet/feather.
>>> >
>>> > Thanks
>>>
>>