Re: plan_create_index_workers doesn't account for TOAST

2023-06-29 Thread Jonathan S. Katz

On 6/29/23 10:12 AM, Jonathan S. Katz wrote:

Hi,

plan_create_index_workers[1] does not consider the amount of tuples 
existing in TOAST pages when determining the number of parallel workers 
to use for a build. The estimation comes from estimate_rel_size[2], 
which in this case, will just take the value from rel->rd_rel->relpages.


We probably don't notice this much with B-trees, given a B-tree is 
typically used for data that does not require toasting. However, this 
becomes more visible when working on custom index access methods that 
implement their own parallel build strategy.


For example, pgvector[3] provides its own data types and index access 
method for indexing vector data. Vectors can get quite large fairly 
quickly, e.g. a 768-dimensional vector takes up 8 + 4*768 = 3080 bytes 
on disk, which quickly clears the default TOAST tuple threshold.


In a recent patch proposal to allow for building indexes in parallel[4], 
I performed a few experiments on how many parallel workers would be 
spawned when indexing 1,000,000 (1MM) 768-dim vectors, both with 
EXTEDNED (default) and PLAIN storage. In all cases, I allowed for leader 
participation, but the leader is not considered in 
plan_create_index_workers.


With EXTENDED, plan_create_index_workers recommended 2 workers. The 
build time was ~2x faster than the serial build.


With PLAIN, plan_create_index_workers recommended 4 workers. The build 
time was **~3X faster** than the serial build.


(I've been doing more detailed, less hand-waivy performance testing, but 
I wanted to provide directional numbers here)


It seems like we're leaving some performance for columns with TOASTed 
data that require indexing, so I wanted to propose allowing the pages in 
TOASTed tables to be considered when we're trying to index a column with 
TOASTed attributes.


Just to add to this: there is a lever to get more parallel workers by 
setting "min_parallel_table_scan_size" to a lower value, which does help 
in this case. However, it does mask the fact that a large chunk of the 
data required to build the index exists in the TOAST table, which is not 
intuitive to a user who rarely has to use tuning parameters.


Thanks,

Jonathan



OpenPGP_signature
Description: OpenPGP digital signature


plan_create_index_workers doesn't account for TOAST

2023-06-29 Thread Jonathan S. Katz

Hi,

plan_create_index_workers[1] does not consider the amount of tuples 
existing in TOAST pages when determining the number of parallel workers 
to use for a build. The estimation comes from estimate_rel_size[2], 
which in this case, will just take the value from rel->rd_rel->relpages.


We probably don't notice this much with B-trees, given a B-tree is 
typically used for data that does not require toasting. However, this 
becomes more visible when working on custom index access methods that 
implement their own parallel build strategy.


For example, pgvector[3] provides its own data types and index access 
method for indexing vector data. Vectors can get quite large fairly 
quickly, e.g. a 768-dimensional vector takes up 8 + 4*768 = 3080 bytes 
on disk, which quickly clears the default TOAST tuple threshold.


In a recent patch proposal to allow for building indexes in parallel[4], 
I performed a few experiments on how many parallel workers would be 
spawned when indexing 1,000,000 (1MM) 768-dim vectors, both with 
EXTEDNED (default) and PLAIN storage. In all cases, I allowed for leader 
participation, but the leader is not considered in 
plan_create_index_workers.


With EXTENDED, plan_create_index_workers recommended 2 workers. The 
build time was ~2x faster than the serial build.


With PLAIN, plan_create_index_workers recommended 4 workers. The build 
time was **~3X faster** than the serial build.


(I've been doing more detailed, less hand-waivy performance testing, but 
I wanted to provide directional numbers here)


It seems like we're leaving some performance for columns with TOASTed 
data that require indexing, so I wanted to propose allowing the pages in 
TOASTed tables to be considered when we're trying to index a column with 
TOASTed attributes.


Thanks,

Jonathan

[1] 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/optimizer/plan/planner.c;hb=refs/heads/master#l6734
[2] 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/optimizer/util/plancat.c;hb=refs/heads/master#l1117

[3] https://github.com/pgvector/pgvector
[4] https://github.com/pgvector/pgvector/commits/parallel-index-build


OpenPGP_signature
Description: OpenPGP digital signature