Hi hackers,
While debugging some slow queries containing Bitmap Heap/Index Scans (in
short BHS / BIS), we observed a few issues regarding scalability:
1. The BIS always runs in just a single process, even when the parent
BHS is parallel. The first process arriving at the BHS serves as leader and
executes the BIS.
2. As long as execution is "exact" (TIDs are stored instead of page
bits), the parallel BHS sorts all TIDs to ensure pages are accessed
sequentially. This sort is also performed by just a single worker. Even
with a few tens of thousands of pages to scan, the sort time can make up a
significant portion of the total runtime. Large page counts and the need
for parallelism are not uncommon for BHS, as one of its use cases is closing
the gap between index and sequential scans. The BHS costing seems not to
account for that.
3. The BHS does not scale well with an increasing number of parallel
workers, even when accounting for the sequential parts of execution. A perf
profile shows that the TID list / bitmap iteration code heavily contends on
a mutex taken for every single TID / page bit (see
LWLockAcquire(&istate->lock, LW_EXCLUSIVE) in tidbitmap.c:1067).
4. The EXPLAIN ANALYZE statistics of the parallel BHS do not include the
statistics of the parallel workers. For example, the number of heap pages
processed reflects only what the leader did. Similarly to other parallel
plan nodes, we should aggregate statistics across workers.
The EXPLAIN ANALYZE output below shows (1) to (3) in action for
different numbers of workers. I had to obfuscate the query slightly. The
difference between the startup time of the BHS and the BIS is the time it
takes to sort the TID list. The self time of the BHS is just the time spent
processing the shared TID list and the pages. That part runs in parallel
but does not scale.
Workers | Total runtime | Startup time BIS | Startup time BHS | Self time BHS (excl. sorting)
--------|---------------|------------------|------------------|------------------------------
      2 |      15322 ms |          3107 ms |          5912 ms |                       9269 ms
      4 |      13277 ms |          3094 ms |          5869 ms |                       7260 ms
      8 |      14628 ms |          3106 ms |          5882 ms |                       8598 ms
None of this is really new and some of it is even documented. So what I am
wondering about more is why things are the way they are and how hard it
would be to change them. I am especially curious about the following:
- What stops us from extending the BIS to run in parallel? Parallel
Index Scans are already supported.
- What about reducing the sort time by, e.g.:
    - dividing TIDs across workers, ending up with N independently sorted
streams,
    - cooperatively sorting the TIDs with multiple workers, using barriers
for synchronization,
    - optimizing the PagetableEntry data structure for size and using a
faster sorting algorithm, e.g. radix sort,
    - a combination of the first three options?
- With separate TID lists per worker process the iteration problem would
be solved. Otherwise, we could
    - optimize the iteration code to minimize the duration of the critical
section, or
    - have worker processes acquire chunks of TIDs / page bits to reduce
locking.
Is there interest in patches improving on the above-mentioned shortcomings?
If so, which options do you deem best?
--
David Geier
(ServiceNow)
-- 2 workers
Finalize Aggregate (actual time=15228.937..15321.356 rows=1 loops=1)
Output: count(*)
-> Gather (actual time=15187.942..15321.345 rows=2 loops=1)
Output: (PARTIAL count(*))
Workers Planned: 2
Workers Launched: 2
-> Partial Aggregate (actual time=15181.486..15181.488 rows=1 loops=2)
Output: PARTIAL count(*)
Worker 0: actual time=15181.364..15181.366 rows=1 loops=1
Worker 1: actual time=15181.608..15181.610 rows=1 loops=1
-> Parallel Bitmap Heap Scan on foo (actual time=5912.731..15166.992 rows=269713 loops=2)
Filter: ...
Rows Removed by Filter: 4020149
Worker 0: actual time=5912.498..15166.936 rows=269305 loops=1
Worker 1: actual time=5912.963..15167.048 rows=270121 loops=1
-> Bitmap Index Scan on foo_idx (actual time=3107.947..3107.948 rows=8579724 loops=1)
Index Cond: -
Worker 1: actual time=3107.947..3107.948 rows=8579724 loops=1
Planning Time: 0.167 ms
Execution Time: 15322.081 ms
-- 4 workers
Finalize Aggregate (actual time=13175.765..13276.415 rows=1 loops=1)
Output: count(*)
-> Gather (actual time=13137.981..13276.403 rows=4 loops=1)
Output: (PARTIAL count(*))
Workers Planned: 4
Workers Launched: 4
-> Partial Aggregate (actual time=13130.344..13130.346