[ https://issues.apache.org/jira/browse/ARROW-9895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andy Grove resolved ARROW-9895. ------------------------------- Fix Version/s: 2.0.0 Resolution: Fixed Issue resolved by pull request 8092 [https://github.com/apache/arrow/pull/8092] > [RUST] Improve sort kernels > --------------------------- > > Key: ARROW-9895 > URL: https://issues.apache.org/jira/browse/ARROW-9895 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust > Affects Versions: 1.0.0 > Reporter: Jörn Horstmann > Assignee: Jörn Horstmann > Priority: Major > Labels: pull-request-available > Fix For: 2.0.0 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > Followup from my mailing list post: > {quote}1. When sorting by multiple columns (lexsort_to_indices) the Float32 > and Float64 data types are not supported because the implementation > relies on the OrdArray trait. This trait is not implemented because > f64/f32 only implements PartialOrd. The sort function for a single > column (sort_to_indices) has some special logic which looks like it > wants to treats NaN the same as null, but I'm also not convinced this > is the correct way. For example postgres does the following > (https://www.postgresql.org/docs/12/datatype-numeric.html#DATATYPE-FLOAT) > "In order to allow floating-point values to be sorted and used in > tree-based indexes, PostgreSQL treats NaN values as equal, and greater > than all non-NaN values." > I propose to do the same in an OrdArray impl for > Float64Array/Float32Array and then simplifying the sort_to_indices > function accordingly. > 2. Sorting for dictionary encoded strings. The problem here is that > DictionaryArray does not have a generic parameter for the value type > so it is not currently possible to only implement OrdArray for string > dictionaries. Again for the single column case, the value data type > could be checked and a sort could be implemented by looking up each > key in the dictionary. An optimization could be to check the is_sorted > flag of DictionaryArray (which does not seem to be used really) and > then directly sort by the keys. For the general case I see roughly to > options > - Somehow implement an OrdArray view of the dictionary array. This > could be easier if OrdArray did not extend Array but was a completely > separate trait. > - Change the lexicographic sort impl to not use dynamic calls but > instead sort multiple times. So for a query `ORDER BY a, b`, first > sort by b and afterwards sort again by a. With a stable sort > implementation this should result in the same ordering. I'm curious > about the performance, it could avoid dynamic method calls for each > comparison, but it would process the indices vector multiple times. > {quote} > My plan is to open a draft PR with the following changes: > - {{sort_to_indices}} further splits up float64/float32 inputs into > nulls/non-nan/nan, sorts the non-nan values and then concats those 3 slices > according to the sort options. Nans are distinct from null and sort greater > than any other valid value > - implement a sort method for dictionary arrays with string values. this > kernel checks the {{is_ordered}} flag and sorts just by the keys if it is > set, it will look up the string values otherwise > - for the lexical sort use case the above kernel are not used, instead the > {{OrdArray}} trait is used. To make that more flexible and allow wrapping > arrays with differend ordering behavior I will make it no longer extend > {{Array}} and instead only contain the {{cmp_value}} method > - string dictionary sorting can then be implemented with a wrapper struct > {{StringDictionaryArrayAsOrdArray}} which implements {{OrdArray}} > - NaN aware sorting of floats can also be implemented with a wrapper struct > and trait implementation -- This message was sent by Atlassian Jira (v8.3.4#803005)