Re: Improving scalability of Parallel Bitmap Heap/Index Scan

2022-07-15 Thread John Naylor
On Thu, Jul 14, 2022 at 5:13 PM David Geier wrote:
> optimizing the PagetableEntry data structure for size and using a faster
> sorting algorithm like e.g. radix sort

On this note, a patch has been proposed (but, as far as I know, untested)
that speeds up this sort in a much simpler way, in this thread

https://www.postgresql.org/message-id/CA%2BhUKGKztHEWm676csTFjYzortziWmOcf8HDss2Zr0muZ2xfEg%40mail.gmail.com

where you may find this patch

https://www.postgresql.org/message-id/attachment/120560/0007-Specialize-pagetable-sort-routines-in-tidbitmap.c.patch

and see if it helps.
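
If I read it right, the idea is to specialize the pagetable sort (e.g. via
lib/sort_template.h) instead of going through qsort() with a comparator
called via function pointer. A rough sketch of what such a specialization
can look like (illustrative only; the names and details below are mine, not
taken from the patch):

/*
 * Illustrative sketch, meant to sit in tidbitmap.c.  sort_template.h
 * expands this into a static sort function specialized for an array of
 * PagetableEntry pointers ordered by block number.
 */
#define ST_SORT sort_pagetable_entries
#define ST_ELEMENT_TYPE PagetableEntry *
#define ST_COMPARE(a, b) \
    (((*(a))->blockno > (*(b))->blockno) - ((*(a))->blockno < (*(b))->blockno))
#define ST_SCOPE static
#define ST_DEFINE
#include "lib/sort_template.h"

/* ...which would then replace the qsort() call, e.g. in tbm_begin_iterate(): */
sort_pagetable_entries(tbm->spages, tbm->npages);

The win typically comes from inlining the comparator rather than calling it
through a function pointer for every comparison.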

--
John Naylor
EDB: http://www.enterprisedb.com


Improving scalability of Parallel Bitmap Heap/Index Scan

2022-07-14 Thread David Geier
Hi hackers,

While debugging some slow queries containing Bitmap Heap/Index Scans (in
short BHS / BIS), we observed a few issues regarding scalability:

   1. The BIS always runs in a single process, even when the parent BHS is
   parallel. The first process arriving at the BHS serves as the leader and
   executes the BIS.
   2. As long as execution is "exact" (TIDs are stored instead of page
   bits), the parallel BHS sorts all TIDs to ensure pages are accessed
   sequentially. This sort, too, is performed by a single worker. Already
   with a few tens of thousands of pages to scan, the sort time can make up
   a significant portion of the total runtime. Large page counts and the
   need for parallelism are not uncommon for BHS, as one of its use cases is
   closing the gap between index and sequential scans. The BHS costing does
   not seem to account for that.
   3. The BHS does not scale well with an increasing number of parallel
   workers, even when accounting for the sequential parts of execution. A
   perf profile shows that the TID list / bitmap iteration code heavily
   contends on a mutex taken for every single TID / page bit (see
   LWLockAcquire(&istate->lock, LW_EXCLUSIVE) in tidbitmap.c:1067); a
   simplified sketch of that code path follows after this list.
   4. The EXPLAIN ANALYZE statistics of the parallel BHS do not include the
   statistics of the parallel workers. For example, the number of heap pages
   processed reflects only what the leader did. As with other parallel plan
   nodes, we should aggregate the statistics across workers.
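
For context on (3), the shared iteration boils down to the following pattern
(heavily simplified paraphrase of tidbitmap.c; the field names are from
memory and should be treated as illustrative):

/*
 * Heavily simplified paraphrase of tbm_shared_iterate().  Every page
 * handed to a worker costs one exclusive acquisition of the iterator
 * lock shared by all workers, which is where the contention shows up
 * in the perf profile.
 */
TBMIterateResult *
tbm_shared_iterate(TBMSharedIterator *iterator)
{
    TBMSharedIteratorState *istate = iterator->state;
    TBMIterateResult *output = &iterator->output;

    LWLockAcquire(&istate->lock, LW_EXCLUSIVE);

    /*
     * ... under the lock: advance the shared page (or lossy chunk)
     * position by a single entry and fill *output with that page's
     * block number and offsets ...
     */

    LWLockRelease(&istate->lock);

    return output;   /* the real code returns NULL once the bitmap is exhausted */
}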

The EXPLAIN ANALYZE output below shows (1) to (3) in action for different
numbers of workers. I had to obfuscate the query slightly. The difference
between the startup times of the BHS and the BIS is the time it takes to
sort the TID list (e.g. 5912 ms - 3107 ms, roughly 2.8 s, with 2 workers).
The self time of the BHS is the time spent iterating the shared TID list and
processing the pages; that part runs in parallel but does not scale.

Workers | Total runtime | Startup time BIS | Startup time BHS | Self time BHS (excl. sorting)
--------|---------------|------------------|------------------|------------------------------
2       | 15322 ms      | 3107 ms          | 5912 ms          | 9269 ms
4       | 13277 ms      | 3094 ms          | 5869 ms          | 7260 ms
8       | 14628 ms      | 3106 ms          | 5882 ms          | 8598 ms

None of this is really new, and some of it is even documented. So what I am
really wondering about is why things are the way they are and how hard it
would be to change them. I am especially curious about the following:

   - What stops us from extending the BIS to run in parallel? Parallel Index
   Scans are, after all, supported.
   - What about reducing the sort time by, e.g.
      - dividing the TIDs across workers, ending up with N separately sorted
      streams,
      - cooperatively sorting the TIDs with multiple workers, using barriers
      for synchronization,
      - optimizing the PagetableEntry data structure for size and using a
      faster sorting algorithm such as radix sort, or
      - a combination of the first three options?
   - With separate TID lists per worker process, the iteration problem would
   be solved. Otherwise, we could
      - optimize the iteration code and thereby minimize the duration of the
      critical section, or
      - have worker processes acquire chunks of TIDs / page bits to reduce
      locking (a rough sketch of that idea follows below).
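
To illustrate the last point, a rough sketch of what chunked acquisition
could look like (purely illustrative: the function, the constant and the
field names are made up, and lossy pages / prefetching are ignored):

/*
 * Illustrative sketch: hand each worker a batch of page indexes per lock
 * acquisition instead of a single page, amortizing the lock traffic.
 */
#define TBM_SHARED_ITERATE_BATCH 64

static bool
tbm_shared_grab_page_batch(TBMSharedIteratorState *istate,
                           int *first, int *count)
{
    LWLockAcquire(&istate->lock, LW_EXCLUSIVE);
    *first = istate->spageptr;
    *count = Min(TBM_SHARED_ITERATE_BATCH, istate->npages - istate->spageptr);
    istate->spageptr += *count;
    LWLockRelease(&istate->lock);

    return *count > 0;
}

Each worker would then process the pages in [*first, *first + *count)
without touching the shared lock again, at the cost of somewhat coarser work
distribution towards the end of the scan.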

Is there interest in patches improving on the shortcomings mentioned above?
If so, which options do you deem best?

--
David Geier
(ServiceNow)



-- 2 workers

 Finalize Aggregate (actual time=15228.937..15321.356 rows=1 loops=1)
   Output: count(*)
   ->  Gather (actual time=15187.942..15321.345 rows=2 loops=1)
         Output: (PARTIAL count(*))
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate (actual time=15181.486..15181.488 rows=1 loops=2)
               Output: PARTIAL count(*)
               Worker 0:  actual time=15181.364..15181.366 rows=1 loops=1
               Worker 1:  actual time=15181.608..15181.610 rows=1 loops=1
               ->  Parallel Bitmap Heap Scan on foo (actual time=5912.731..15166.992 rows=269713 loops=2)
                     Filter: ...
                     Rows Removed by Filter: 4020149
                     Worker 0:  actual time=5912.498..15166.936 rows=269305 loops=1
                     Worker 1:  actual time=5912.963..15167.048 rows=270121 loops=1
                     ->  Bitmap Index Scan on foo_idx (actual time=3107.947..3107.948 rows=8579724 loops=1)
                           Index Cond: -
                           Worker 1:  actual time=3107.947..3107.948 rows=8579724 loops=1
 Planning Time: 0.167 ms
 Execution Time: 15322.081 ms


-- 4 workers

 Finalize Aggregate (actual time=13175.765..13276.415 rows=1 loops=1)
   Output: count(*)
   ->  Gather (actual time=13137.981..13276.403 rows=4 loops=1)
         Output: (PARTIAL count(*))
         Workers Planned: 4
         Workers Launched: 4
         ->  Partial Aggregate (actual time=13130.344..13130.346