You can always turn these structures into tables and do SQL-like joins.
import pyarrow as pa

map = {"a": 1, "b": 2, "c": 3}
input_array = pa.array(["a", "b", "c", "a"])
# Build a two-column lookup table from the dict.
map_table = pa.Table.from_pylist([{"key": k, "value": v} for k, v in map.items()])
input_table = pa.Table.from_arrays([input_array], names=["key"])
# The default join type is a left outer join on "key".
joined_result = input_table.join(map_table, "key")
>>> joined_result
pyarrow.Table
key: string
value: int64
----
key: [["a","b","c","a"]]
value: [[1,2,3,1]]
-----Original Message-----
From: Joris Van den Bossche <[email protected]>
Sent: Monday, November 14, 2022 11:33 PM
To: [email protected]
Subject: Re: pyarrow.compute case_when
And as an answer to how you can use pyarrow.compute.case_when for this:
>>> map = {"a": 1, "b": 2, "c": 3}
>>> cond = pc.make_struct(*[pc.equal(input_array, val) for val in map.keys()])
>>> pc.case_when(cond, *map.values())
<pyarrow.lib.Int64Array object at 0x7f44a99f32e0> [
1,
2,
3,
1
]
The "case_when" compute function takes the multiple conditions as a
StructArray, which you can compose using the "make_struct" compute function.
It's not the most user-friendly or obvious way, so we should
add some examples to the docstring on how to achieve this.
Also, for this specific case where you already have this "mapping"
of values you want to replace, I think we should have a specialized kernel,
avoiding the need to materialize a boolean array for each value ->
https://issues.apache.org/jira/browse/ARROW-10641
Joris
On Mon, 14 Nov 2022 at 19:51, Ryan Kuhns <[email protected]> wrote:
>
> Hi,
>
> I’ve got one more question as a follow-up to my prior question on working
> with multi-file zipped CSVs. [1] Figured it was worth asking in another
> thread so it would be easier for others to see the specific question about
> case_when.
>
> I’m trying to accomplish something like pandas Series.map, where I
> map the values of an Arrow array to new values.
>
> pyarrow.compute.case_when looks like a candidate to solve this, but after
> reading the docs, I’m still not clear on how to structure the argument to the
> “cond” parameter or if there is alternative functionality that would be
> better.
>
> Example input, mapping and expected output:
>
> import pyarrow as pa
> import pyarrow.compute as pc
>
> map = {"a": 1, "b": 2, "c": 3}
> input_array = pa.array(["a", "b", "c", "a"])
> expected_output = pa.array([1, 2, 3, 1])
>
> Logic I’m hoping for would be the equivalent of the following SQL:
>
> Case
> when input_array = 'a' then 1
> when input_array = 'b' then 2
> when input_array = 'c' then 3
> else input_array
> End
>
> Or alternatively, if the input array were a pandas Series, then
> input_array.map(map).
>
> Thanks again,
>
> Ryan
>
>
>
>
>
> [1]
> https://urldefense.com/v3/__https://www.mail-archive.com/[email protected]
> rrow.org/msg02379.html__;!!KSjYCgUGsB4!ZIFs661Vs0zZZSbl7Ap0B_swBkooVbp
> HBiZavkMXQfUANYdmlbAd318opypAMkNY-O4rKPOTHVjfVdYvuuIgL0offUCiSw$