Re: [PR] Chunk based iteration in `accumulate_indices` [datafusion]

2024-11-20 Thread via GitHub


jayzhan211 commented on PR #13451:
URL: https://github.com/apache/datafusion/pull/13451#issuecomment-2487873256

   Thanks @Dandandan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Chunk based iteration in `accumulate_indices` [datafusion]

2024-11-20 Thread via GitHub


jayzhan211 merged PR #13451:
URL: https://github.com/apache/datafusion/pull/13451


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Chunk based iteration in `accumulate_indices` [datafusion]

2024-11-19 Thread via GitHub


jayzhan211 commented on PR #13451:
URL: https://github.com/apache/datafusion/pull/13451#issuecomment-2487231707

   @Dandandan PTAL


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Chunk based iteration in `accumulate_indices` [datafusion]

2024-11-18 Thread via GitHub


Dandandan commented on code in PR #13451:
URL: https://github.com/apache/datafusion/pull/13451#discussion_r1846086708


##
datafusion/functions-aggregate-common/src/aggregate/groups_accumulator/accumulate.rs:
##
@@ -395,19 +395,38 @@ pub fn accumulate_indices(
 }
 }
 (None, Some(filter)) => {
-assert_eq!(filter.len(), group_indices.len());
-// The performance with a filter could be improved by
-// iterating over the filter in chunks, rather than a single
-// iterator. TODO file a ticket
-let iter = group_indices.iter().zip(filter.iter());
-for (&group_index, filter_value) in iter {
-if let Some(true) = filter_value {
-index_fn(group_index)
-}
-}
+debug_assert_eq!(filter.len(), group_indices.len());
+let group_indices_chunks = group_indices.chunks_exact(64);
+let bit_chunks = filter.values().bit_chunks();
+let group_indices_remainder = group_indices_chunks.remainder();
+
+group_indices_chunks.zip(bit_chunks.iter()).for_each(

Review Comment:
   Yeah it seems it is faster (generating faster code) when valid values are 
high enough.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Chunk based iteration in `accumulate_indices` [datafusion]

2024-11-17 Thread via GitHub


jayzhan211 commented on code in PR #13451:
URL: https://github.com/apache/datafusion/pull/13451#discussion_r1845965171


##
datafusion/functions-aggregate-common/src/aggregate/groups_accumulator/accumulate.rs:
##
@@ -395,19 +395,38 @@ pub fn accumulate_indices(
 }
 }
 (None, Some(filter)) => {
-assert_eq!(filter.len(), group_indices.len());
-// The performance with a filter could be improved by
-// iterating over the filter in chunks, rather than a single
-// iterator. TODO file a ticket
-let iter = group_indices.iter().zip(filter.iter());
-for (&group_index, filter_value) in iter {
-if let Some(true) = filter_value {
-index_fn(group_index)
-}
-}
+debug_assert_eq!(filter.len(), group_indices.len());
+let group_indices_chunks = group_indices.chunks_exact(64);
+let bit_chunks = filter.values().bit_chunks();
+let group_indices_remainder = group_indices_chunks.remainder();
+
+group_indices_chunks.zip(bit_chunks.iter()).for_each(

Review Comment:
   I'm quite surprise that the fixed chunk size (64) is faster than iterating 
set_indices



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Chunk based iteration in `accumulate_indices` [datafusion]

2024-11-17 Thread via GitHub


jayzhan211 commented on PR #13451:
URL: https://github.com/apache/datafusion/pull/13451#issuecomment-2482009689

   > Can we maybe find / create a query with filter? Maybe add one in 
`clickbench_extended`?
   
   aggregate filter required postgres dialect, it might be easier to run 
benchmark with [criterion](https://crates.io/crates/criterion)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Chunk based iteration in `accumulate_indices` [datafusion]

2024-11-17 Thread via GitHub


Dandandan commented on PR #13451:
URL: https://github.com/apache/datafusion/pull/13451#issuecomment-2481238182

   Can we maybe find / create a query with filter? Maybe add one in 
`clickbench_extended`?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Chunk based iteration in `accumulate_indices` [datafusion]

2024-11-17 Thread via GitHub


jayzhan211 commented on PR #13451:
URL: https://github.com/apache/datafusion/pull/13451#issuecomment-2481192336

   Neither Q7, Q10, nor Q24 uses an aggregate filter.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Chunk based iteration in `accumulate_indices` [datafusion]

2024-11-17 Thread via GitHub


Dandandan commented on PR #13451:
URL: https://github.com/apache/datafusion/pull/13451#issuecomment-2481076175

   I wonder if any of the queries in that benchmark uses aggregate filters?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Chunk based iteration in `accumulate_indices` [datafusion]

2024-11-16 Thread via GitHub


jayzhan211 commented on PR #13451:
URL: https://github.com/apache/datafusion/pull/13451#issuecomment-2480883998

   At least no regression
   
   ```
   Comparing main and filter-chunk
   
   Benchmark clickbench_1.json
   
   ┏━━┳━━━┳━━┳━━━┓
   ┃ Query┃  main ┃ filter-chunk ┃Change ┃
   ┡━━╇━━━╇━━╇━━━┩
   │ QQuery 0 │0.26ms │   0.30ms │  1.16x slower │
   │ QQuery 1 │   26.00ms │  25.44ms │ no change │
   │ QQuery 2 │   55.35ms │  53.21ms │ no change │
   │ QQuery 3 │   46.89ms │  45.90ms │ no change │
   │ QQuery 4 │  307.76ms │ 309.23ms │ no change │
   │ QQuery 5 │  574.82ms │ 570.49ms │ no change │
   │ QQuery 6 │   26.30ms │  26.27ms │ no change │
   │ QQuery 7 │   27.54ms │  29.15ms │  1.06x slower │
   │ QQuery 8 │  356.26ms │ 362.77ms │ no change │
   │ QQuery 9 │  546.82ms │ 557.03ms │ no change │
   │ QQuery 10│  131.44ms │ 141.31ms │  1.08x slower │
   │ QQuery 11│  150.23ms │ 152.91ms │ no change │
   │ QQuery 12│  614.05ms │ 610.26ms │ no change │
   │ QQuery 13│  706.93ms │ 688.08ms │ no change │
   │ QQuery 14│  566.77ms │ 547.21ms │ no change │
   │ QQuery 15│  388.56ms │ 383.52ms │ no change │
   │ QQuery 16│  932.13ms │ 941.68ms │ no change │
   │ QQuery 17│  850.16ms │ 838.91ms │ no change │
   │ QQuery 18│ 2798.37ms │2790.18ms │ no change │
   │ QQuery 19│   39.54ms │  39.65ms │ no change │
   │ QQuery 20│  707.81ms │ 702.99ms │ no change │
   │ QQuery 21│  971.53ms │ 958.09ms │ no change │
   │ QQuery 22│ 2687.40ms │2676.42ms │ no change │
   │ QQuery 23│ 5810.42ms │5824.84ms │ no change │
   │ QQuery 24│  410.33ms │ 385.74ms │ +1.06x faster │
   │ QQuery 25│  396.73ms │ 405.53ms │ no change │
   │ QQuery 26│  448.47ms │ 450.85ms │ no change │
   │ QQuery 27│ 1074.77ms │1038.62ms │ no change │
   │ QQuery 28│ 8169.84ms │7937.72ms │ no change │
   │ QQuery 29│  348.52ms │ 355.17ms │ no change │
   │ QQuery 30│  520.21ms │ 501.82ms │ no change │
   │ QQuery 31│  528.27ms │ 525.37ms │ no change │
   │ QQuery 32│ 2780.65ms │2652.40ms │ no change │
   │ QQuery 33│ 2678.26ms │2671.76ms │ no change │
   │ QQuery 34│ 2572.40ms │2586.40ms │ no change │
   │ QQuery 35│  526.14ms │ 529.53ms │ no change │
   │ QQuery 36│  115.23ms │ 112.60ms │ no change │
   │ QQuery 37│   77.91ms │  79.28ms │ no change │
   │ QQuery 38│   81.69ms │  82.17ms │ no change │
   │ QQuery 39│  203.75ms │ 207.33ms │ no change │
   │ QQuery 40│   23.99ms │  24.14ms │ no change │
   │ QQuery 41│   22.93ms │  23.93ms │ no change │
   │ QQuery 42│   28.73ms │  28.82ms │ no change │
   └──┴───┴──┴───┘
   ┏━┳┓
   ┃ Benchmark Summary   ┃┃
   ┡━╇┩
   │ Total Time (main)   │ 40332.13ms │
   │ Total Time (filter-chunk)   │ 39875.01ms │
   │ Average Time (main) │   937.96ms │
   │ Average Time (filter-chunk) │   927.33ms │
   │ Queries Faster  │  1 │
   │ Queries Slower  │  3 │
   │ Queries with No Change  │ 39 │
   └─┴┘
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org