[GitHub] [arrow] sundy-li commented on a change in pull request #9602: ARROW-11630: [Rust] Introduce limit option for sort kernel

2021-03-14 Thread GitBox


sundy-li commented on a change in pull request #9602:
URL: https://github.com/apache/arrow/pull/9602#discussion_r593899532



##
File path: rust/arrow/src/compute/kernels/sort.rs
##
@@ -686,90 +815,124 @@ pub fn lexsort_to_indices(columns: &[SortColumn]) -> 
Result {
 };
 
 let mut value_indices = (0..row_count).collect::>();
-value_indices.sort_by(lex_comparator);
+let mut len = value_indices.len();
+
+if let Some(limit) = limit {
+len = limit.min(len);
+}
+sort_by( value_indices, len, lex_comparator);
 
 Ok(UInt32Array::from(
-value_indices
-.into_iter()
-.map(|i| i as u32)
+(_indices)[0..len]
+.iter()
+.map(|i| *i as u32)
 .collect::>(),
 ))
 }
 
+pub fn partial_sort(v:  [T], limit: usize, mut is_less: F)
+where
+F: FnMut(, ) -> Ordering,
+{
+let (before, _mid, _after) = v.select_nth_unstable_by(limit,  is_less);
+before.sort_unstable_by(is_less);

Review comment:
   There are some reasons for using unstable_sort.  
   1. `select_nth_unstable_by` is unstable, so using `before.sort_stable_by` 
don't ensure it's stable. Currently we do not have select_nth_stable_by
   2. `sort_unstable` is faster than sort_stable, refer to 
[doc](https://doc.rust-lang.org/std/primitive.slice.html#method.sort_unstable)





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] sundy-li commented on a change in pull request #9602: ARROW-11630: [Rust] Introduce limit option for sort kernel

2021-03-14 Thread GitBox


sundy-li commented on a change in pull request #9602:
URL: https://github.com/apache/arrow/pull/9602#discussion_r593899532



##
File path: rust/arrow/src/compute/kernels/sort.rs
##
@@ -686,90 +815,124 @@ pub fn lexsort_to_indices(columns: &[SortColumn]) -> 
Result {
 };
 
 let mut value_indices = (0..row_count).collect::>();
-value_indices.sort_by(lex_comparator);
+let mut len = value_indices.len();
+
+if let Some(limit) = limit {
+len = limit.min(len);
+}
+sort_by( value_indices, len, lex_comparator);
 
 Ok(UInt32Array::from(
-value_indices
-.into_iter()
-.map(|i| i as u32)
+(_indices)[0..len]
+.iter()
+.map(|i| *i as u32)
 .collect::>(),
 ))
 }
 
+pub fn partial_sort(v:  [T], limit: usize, mut is_less: F)
+where
+F: FnMut(, ) -> Ordering,
+{
+let (before, _mid, _after) = v.select_nth_unstable_by(limit,  is_less);
+before.sort_unstable_by(is_less);

Review comment:
   There are some reasons for using unstable_sort.  
   1. `select_nth_unstable_by` is unstable, so using 
`before.sort_stable_by`cadon't ensure it's stable. Currently we do not have 
select_nth_stable_by
   2. `sort_unstable` is faster than sort_stable, refer to 
[doc](https://doc.rust-lang.org/std/primitive.slice.html#method.sort_unstable)





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] sundy-li commented on a change in pull request #9602: ARROW-11630: [Rust] Introduce limit option for sort kernel

2021-03-14 Thread GitBox


sundy-li commented on a change in pull request #9602:
URL: https://github.com/apache/arrow/pull/9602#discussion_r593899532



##
File path: rust/arrow/src/compute/kernels/sort.rs
##
@@ -686,90 +815,124 @@ pub fn lexsort_to_indices(columns: &[SortColumn]) -> 
Result {
 };
 
 let mut value_indices = (0..row_count).collect::>();
-value_indices.sort_by(lex_comparator);
+let mut len = value_indices.len();
+
+if let Some(limit) = limit {
+len = limit.min(len);
+}
+sort_by( value_indices, len, lex_comparator);
 
 Ok(UInt32Array::from(
-value_indices
-.into_iter()
-.map(|i| i as u32)
+(_indices)[0..len]
+.iter()
+.map(|i| *i as u32)
 .collect::>(),
 ))
 }
 
+pub fn partial_sort(v:  [T], limit: usize, mut is_less: F)
+where
+F: FnMut(, ) -> Ordering,
+{
+let (before, _mid, _after) = v.select_nth_unstable_by(limit,  is_less);
+before.sort_unstable_by(is_less);

Review comment:
   There are some reasons for using unstable_sort.  
   1. `select_nth_unstable_by` is unstable, so using 
`before.sort_stable_by`cadon't ensure it's stable.
   2. `sort_unstable` is faster than sort_stable, refer to 
[doc](https://doc.rust-lang.org/std/primitive.slice.html#method.sort_unstable)





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] sundy-li commented on a change in pull request #9602: ARROW-11630: [Rust] Introduce limit option for sort kernel

2021-03-06 Thread GitBox


sundy-li commented on a change in pull request #9602:
URL: https://github.com/apache/arrow/pull/9602#discussion_r588974704



##
File path: rust/arrow/src/compute/kernels/sort.rs
##
@@ -37,10 +37,33 @@ use TimeUnit::*;
 /// Returns an `ArrowError::ComputeError(String)` if the array type is either 
unsupported by `sort_to_indices` or `take`.
 ///
 pub fn sort(values: , options: Option) -> 
Result {
-let indices = sort_to_indices(values, options)?;
+let indices = sort_to_indices(values, options, None)?;
 take(values.as_ref(), , None)
 }
 
+/// Sort the `ArrayRef` partially.
+/// Return an sorted `ArrayRef`, discarding the data after limit.
+pub fn partial_sort(
+values: ,
+options: Option,
+limit: Option,
+) -> Result {
+let indices = sort_to_indices(values, options, limit)?;
+take(values.as_ref(), , None)
+}
+
+#[inline]
+fn sort_by(array:  [T], limit: usize, cmp: F)
+where
+F: FnMut(, ) -> Ordering,

Review comment:
   Just to be compatible with `std::sort_by`.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] sundy-li commented on a change in pull request #9602: ARROW-11630: [Rust] Introduce limit option for sort kernel

2021-03-06 Thread GitBox


sundy-li commented on a change in pull request #9602:
URL: https://github.com/apache/arrow/pull/9602#discussion_r588964696



##
File path: rust/arrow/src/compute/kernels/sort.rs
##
@@ -1560,7 +1891,14 @@ mod tests {
 Some(2),
 Some(17),
 ])) as ArrayRef];
-test_lex_sort_arrays(input, expected);
+test_lex_sort_arrays(input.clone(), expected, None);
+
+let expected = vec![Arc::new(PrimitiveArrayfrom(vec![
+Some(-1),
+Some(0),
+Some(2),
+])) as ArrayRef];
+test_lex_sort_arrays(input, expected, Some(3));

Review comment:
   Here input size is 4, expected size is 3.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] sundy-li commented on a change in pull request #9602: ARROW-11630: [Rust] Introduce limit option for sort kernel

2021-03-04 Thread GitBox


sundy-li commented on a change in pull request #9602:
URL: https://github.com/apache/arrow/pull/9602#discussion_r587330687



##
File path: rust/arrow/src/compute/kernels/sort.rs
##
@@ -517,20 +650,32 @@ where
 },
 );
 
-if !options.descending {
-valids.sort_by(|a, b| cmp_array(a.1.as_ref(), b.1.as_ref()))
+let mut len = values.len();
+let descending = options.descending;
+
+if let Some(size) = limit {
+len = size.min(len);
+}
+
+// we are not using partial_sort here, because array is ArrayRef. 
Something is not working good in that.

Review comment:
   Sorry, there is a bug in the previous version. It's fixed in the newest 
[commit](https://github.com/sundy-li/partial_sort/commit/372778d0011b4d647ed4ee13d1dacb0865eb266d#diff-b1a35a68f14e696205874893c07fd24fdb2b47c23cc0e0c80a30c7d53759R103-R141)





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] sundy-li commented on a change in pull request #9602: ARROW-11630: [Rust] Introduce limit option for sort kernel

2021-03-03 Thread GitBox


sundy-li commented on a change in pull request #9602:
URL: https://github.com/apache/arrow/pull/9602#discussion_r586975774



##
File path: rust/arrow/src/compute/kernels/sort.rs
##
@@ -278,12 +347,27 @@ fn sort_boolean(
 let valids_len = valids.len();
 let nulls_len = nulls.len();
 
-if !descending {
-valids.sort_by(|a, b| a.1.cmp());
-} else {
-valids.sort_by(|a, b| a.1.cmp().reverse());
-// reverse to keep a stable ordering
-nulls.reverse();
+let mut len = values.len();
+match limit {
+Some(limit) => {
+len = limit.min(len);
+if !descending {
+valids.partial_sort(len, |a, b| cmp(a.1, b.1));

Review comment:
   > Sorry not "fast path" but would the performance be the same as 
`sort_by`?
   
   I'm afraid not. The sort function in rust is merge sort, it is slightly 
faster than the heap sort for larger sets to sort all elements. 
   But it requires twice the memory of the heap sort because of the second 
array. 
   
   ```
   sort 2^12   time:   [799.70 us 806.41 us 814.15 us]
   sort 2^12 limit 2^12 time:   [1.2848 ms 1.3012 ms 1.3229 ms]
   
   sort nulls 2^12 time:   [647.20 us 649.27 us 651.61 us]
   sort nulls 2^12 limit 2^12   time:   [780.17 us 788.48 us 798.04 us]
   ```
   
   We can make a new function to reduce the repeated patterns or make 
partial_sort fallback to default sort when  limit equals to len.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] sundy-li commented on a change in pull request #9602: ARROW-11630: [Rust] Introduce limit option for sort kernel

2021-03-03 Thread GitBox


sundy-li commented on a change in pull request #9602:
URL: https://github.com/apache/arrow/pull/9602#discussion_r586975774



##
File path: rust/arrow/src/compute/kernels/sort.rs
##
@@ -278,12 +347,27 @@ fn sort_boolean(
 let valids_len = valids.len();
 let nulls_len = nulls.len();
 
-if !descending {
-valids.sort_by(|a, b| a.1.cmp());
-} else {
-valids.sort_by(|a, b| a.1.cmp().reverse());
-// reverse to keep a stable ordering
-nulls.reverse();
+let mut len = values.len();
+match limit {
+Some(limit) => {
+len = limit.min(len);
+if !descending {
+valids.partial_sort(len, |a, b| cmp(a.1, b.1));

Review comment:
   > Sorry not "fast path" but would the performance be the same as 
`sort_by`?
   
   I'm afraid not. The sort function in rust is merge sort, it is slightly 
faster than the heap sort for larger sets to sort all elements. 
   But it requires twice the memory of the heap sort because of the second 
array. 
   
   ```
   sort 2^12   time:   [799.70 us 806.41 us 814.15 us]
   sort 2^12 limit 2^12 time:   [1.2848 ms 1.3012 ms 1.3229 ms]
   
   sort nulls 2^12 time:   [647.20 us 649.27 us 651.61 us]
   sort nulls 2^12 limit 2^12   time:   [780.17 us 788.48 us 798.04 us]
   ```
   
   We can make a new function to reduce the repeated patterns or make 
partial_sort fallback to default sort when  limit is the len.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] sundy-li commented on a change in pull request #9602: ARROW-11630: [Rust] Introduce limit option for sort kernel

2021-03-03 Thread GitBox


sundy-li commented on a change in pull request #9602:
URL: https://github.com/apache/arrow/pull/9602#discussion_r586975774



##
File path: rust/arrow/src/compute/kernels/sort.rs
##
@@ -278,12 +347,27 @@ fn sort_boolean(
 let valids_len = valids.len();
 let nulls_len = nulls.len();
 
-if !descending {
-valids.sort_by(|a, b| a.1.cmp());
-} else {
-valids.sort_by(|a, b| a.1.cmp().reverse());
-// reverse to keep a stable ordering
-nulls.reverse();
+let mut len = values.len();
+match limit {
+Some(limit) => {
+len = limit.min(len);
+if !descending {
+valids.partial_sort(len, |a, b| cmp(a.1, b.1));

Review comment:
   > Sorry not "fast path" but would the performance be the same as 
`sort_by`?
   
   I'm afraid not. The sort function in rust is merge sort, it is slightly 
faster than the heap sort for larger sets to sort all elements.
   But it requires twice the memory of the heap sort because of the second 
array. 
   
   ```
   sort 2^12   time:   [799.70 us 806.41 us 814.15 us]
   sort 2^12 limit 2^12 time:   [1.2848 ms 1.3012 ms 1.3229 ms]
   
   sort nulls 2^12 time:   [647.20 us 649.27 us 651.61 us]
   sort nulls 2^12 limit 2^12   time:   [780.17 us 788.48 us 798.04 us]
   ```
   
   We can make a new function to reduce the repeated patterns.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] sundy-li commented on a change in pull request #9602: ARROW-11630: [Rust] Introduce limit option for sort kernel

2021-03-03 Thread GitBox


sundy-li commented on a change in pull request #9602:
URL: https://github.com/apache/arrow/pull/9602#discussion_r586975774



##
File path: rust/arrow/src/compute/kernels/sort.rs
##
@@ -278,12 +347,27 @@ fn sort_boolean(
 let valids_len = valids.len();
 let nulls_len = nulls.len();
 
-if !descending {
-valids.sort_by(|a, b| a.1.cmp());
-} else {
-valids.sort_by(|a, b| a.1.cmp().reverse());
-// reverse to keep a stable ordering
-nulls.reverse();
+let mut len = values.len();
+match limit {
+Some(limit) => {
+len = limit.min(len);
+if !descending {
+valids.partial_sort(len, |a, b| cmp(a.1, b.1));

Review comment:
   > Sorry not "fast path" but would the performance be the same as 
`sort_by`?
   
   I'm afraid not. The sort function in rust is merge sort, it is slightly 
faster than the heap sort for larger sets.
   But it requires twice the memory of the heap sort because of the second 
array. 
   
   ```
   sort 2^12   time:   [799.70 us 806.41 us 814.15 us]
   sort 2^12 limit 2^12 time:   [1.2848 ms 1.3012 ms 1.3229 ms]
   
   sort nulls 2^12 time:   [647.20 us 649.27 us 651.61 us]
   sort nulls 2^12 limit 2^12   time:   [780.17 us 788.48 us 798.04 us]
   ```
   
   We can make a new function to reduce the repeated patterns.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] sundy-li commented on a change in pull request #9602: ARROW-11630: [Rust] Introduce limit option for sort kernel

2021-03-01 Thread GitBox


sundy-li commented on a change in pull request #9602:
URL: https://github.com/apache/arrow/pull/9602#discussion_r585214440



##
File path: rust/arrow/src/compute/kernels/sort.rs
##
@@ -36,8 +36,12 @@ use TimeUnit::*;
 ///
 /// Returns an `ArrowError::ComputeError(String)` if the array type is either 
unsupported by `sort_to_indices` or `take`.
 ///
-pub fn sort(values: , options: Option) -> 
Result {
-let indices = sort_to_indices(values, options)?;
+pub fn sort(
+values: ,
+options: Option,
+limit: Option,

Review comment:
   Ok, I will introduce `partial_sort` with `limit` parameter.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org