Hi,

Great, thanks. Really interesting stuff. I wasn't aware that Arrow had a row 
format as well.

Will have a look.

Regards,

Olo

________________________________
From: Raphael Taylor-Davies <[email protected]>
Sent: Monday, February 27, 2023 6:45:39 PM
To: [email protected] <[email protected]>
Subject: Re: [Rust] Sorting a RecordBatch by multiple columns


Hi,


There are a couple of possibilities here. To sort multiple rows by a single 
column, the fastest approach would be to use sort_to_indices [1] and then use 
the take [2] kernel to select the corresponding rows. There are specialized 
implementations for each of the array types, making this fairly performant.
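
To make the two-step shape concrete, here is a minimal sketch in plain Rust 
over slices rather than the arrow API. The helper names argsort and gather are 
hypothetical stand-ins for what sort_to_indices and take do conceptually: 
compute the permutation once from the sort column, then apply it to every 
column of the batch.

```rust
/// Hypothetical stand-in for a sort-to-indices kernel: returns the
/// permutation that puts `keys` in ascending order (stable).
fn argsort<T: Ord>(keys: &[T]) -> Vec<usize> {
    let mut indices: Vec<usize> = (0..keys.len()).collect();
    indices.sort_by(|&a, &b| keys[a].cmp(&keys[b]));
    indices
}

/// Hypothetical stand-in for a take kernel: picks `values[i]` for each
/// `i` in `indices`, in that order.
fn gather<T: Clone>(values: &[T], indices: &[usize]) -> Vec<T> {
    indices.iter().map(|&i| values[i].clone()).collect()
}
```

With a sort column of [3, 1, 2], argsort gives [1, 2, 0], and calling gather 
with those indices on each column reorders all of them consistently.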


If you want to sort lexicographically by multiple columns, there are the 
lexsort [3] and lexsort_to_indices [4] kernels, which return the sorted arrays 
and the sorting indices respectively. However, for multi-column sorts without 
a limit it will likely be faster to convert to the row format [5], and use 
this to perform a lexsort to indices [6], or sort the rows in place and 
convert back to arrow arrays [7]. There is more information about the row 
format in the blog series on the topic [8], which may be of interest.
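
The idea behind the row format can be sketched in plain Rust, again without 
the arrow API: each row is encoded into one byte string whose byte-wise order 
matches the lexicographic order of its columns, so the sort compares one key 
per row instead of walking column by column. The function names and the fixed 
(i32, u16) schema below are assumptions for illustration only; the real format 
also has to handle nulls, variable-length data and sort options.

```rust
/// Encodes one (i32, u16) row as 6 bytes that compare byte-wise in the
/// same order as the tuple (a, b). Flipping the sign bit and using
/// big-endian makes negative values sort before positive ones.
fn encode_row(a: i32, b: u16) -> [u8; 6] {
    let mut key = [0u8; 6];
    key[..4].copy_from_slice(&((a as u32) ^ 0x8000_0000).to_be_bytes());
    key[4..].copy_from_slice(&b.to_be_bytes());
    key
}

/// Multi-column sort via encoded rows: one byte-string compare per pair.
fn lexsort_via_rows(col_a: &[i32], col_b: &[u16]) -> Vec<usize> {
    assert_eq!(col_a.len(), col_b.len());
    let rows: Vec<[u8; 6]> = col_a
        .iter()
        .zip(col_b)
        .map(|(&a, &b)| encode_row(a, b))
        .collect();
    let mut indices: Vec<usize> = (0..rows.len()).collect();
    indices.sort_by(|&x, &y| rows[x].cmp(&rows[y]));
    indices
}

/// Column-by-column comparison, shown for contrast: ties on the first
/// column fall through to the second.
fn lexsort_columnar(col_a: &[i32], col_b: &[u16]) -> Vec<usize> {
    assert_eq!(col_a.len(), col_b.len());
    let mut indices: Vec<usize> = (0..col_a.len()).collect();
    indices.sort_by(|&x, &y| {
        col_a[x].cmp(&col_a[y]).then_with(|| col_b[x].cmp(&col_b[y]))
    });
    indices
}
```

For col_a = [2, -1, 2, -1] and col_b = [5, 9, 1, 3], both functions return 
[3, 1, 2, 0]; the resulting indices can then be applied to every column of the 
batch.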


Kind Regards,


Raphael


[1]: https://docs.rs/arrow-ord/latest/arrow_ord/sort/fn.sort_to_indices.html

[2]: https://docs.rs/arrow-select/latest/arrow_select/take/fn.take.html

[3]: https://docs.rs/arrow-ord/latest/arrow_ord/sort/fn.lexsort.html

[4]: https://docs.rs/arrow-ord/latest/arrow_ord/sort/fn.lexsort_to_indices.html

[5]: https://docs.rs/arrow-row/latest/arrow_row

[6]: https://docs.rs/arrow-row/latest/arrow_row/#lexsort

[7]: https://docs.rs/arrow-row/latest/arrow_row/struct.RowConverter.html#method.convert_rows

[8]: https://arrow.apache.org/blog/2022/11/07/multi-column-sorts-in-arrow-rust-part-1/


On 26/02/2023 11:55, Olo Sawyerr wrote:
Hi there,

Hope you're well.

I'm trying to sort a RecordBatch by multiple columns and it's not obvious how 
to achieve this. The kernels::sort() only takes a single ArrayRef.

Any ideas pls?

Regards,

Olo
