Re: [Proposal] Page Compression for OLTP
On Fri, 4 Nov 2022 at 07:02, Ian Lawrence Barwick wrote:
>
> Wed, 27 Jul 2022 2:47 chenhj :
> >
> > Hi hackers,
> >
> > I have rebased this patch and made some improvements.
> >
> > 1. A header is added to each chunk in the pcd file, which records which
> > block the chunk belongs to and the checksum of the chunk.
> >
> > Accordingly, all pages in a compressed relation are stored in compressed
> > format, even if the compressed page is larger than BLCKSZ.
> >
> > The maximum space occupied by a compressed page is BLCKSZ + chunk_size
> > (exceeding this range will report an error when writing the page).
> >
> > 2. Repair the pca file through the information recorded in the pcd when
> > recovering from a crash.
> >
> > 3. For compressed relations, do not release the free blocks at the end of
> > the relation (just like what old_snapshot_threshold does), reducing the
> > risk of data inconsistency between the pcd and pca files.
> >
> > 4. During backup, only check the checksum in the chunk header for the pcd
> > file, and avoid assembling and decompressing chunks into the original page.
> >
> > 5. Bug fixes, documentation, code style and so on.
> >
> > See src/backend/storage/smgr/README.compression for details.
> >
> > Other
> >
> > 1. Removed support for a default compression option in tablespaces. I'm not
> > sure about the necessity of this feature, so it is not supported for now.
> >
> > 2. pg_rewind currently does not support copying only changed blocks from the
> > pcd file. This feature is relatively independent and could be implemented
> > later.
>
> Hi
>
> cfbot reports the patch no longer applies. As CommitFest 2022-11 is
> currently underway, this would be an excellent time to update the patch.

There have been no updates on this thread for some time, so this has been
changed to Returned with Feedback. Feel free to open it in the next
commitfest if you plan to continue working on this.

Regards,
Vignesh
Re: [Proposal] Page Compression for OLTP
Wed, 27 Jul 2022 2:47 chenhj :
>
> Hi hackers,
>
> I have rebased this patch and made some improvements.
>
> 1. A header is added to each chunk in the pcd file, which records which
> block the chunk belongs to and the checksum of the chunk.
>
> Accordingly, all pages in a compressed relation are stored in compressed
> format, even if the compressed page is larger than BLCKSZ.
>
> The maximum space occupied by a compressed page is BLCKSZ + chunk_size
> (exceeding this range will report an error when writing the page).
>
> 2. Repair the pca file through the information recorded in the pcd when
> recovering from a crash.
>
> 3. For compressed relations, do not release the free blocks at the end of
> the relation (just like what old_snapshot_threshold does), reducing the
> risk of data inconsistency between the pcd and pca files.
>
> 4. During backup, only check the checksum in the chunk header for the pcd
> file, and avoid assembling and decompressing chunks into the original page.
>
> 5. Bug fixes, documentation, code style and so on.
>
> See src/backend/storage/smgr/README.compression for details.
>
> Other
>
> 1. Removed support for a default compression option in tablespaces. I'm not
> sure about the necessity of this feature, so it is not supported for now.
>
> 2. pg_rewind currently does not support copying only changed blocks from the
> pcd file. This feature is relatively independent and could be implemented
> later.

Hi

cfbot reports the patch no longer applies. As CommitFest 2022-11 is
currently underway, this would be an excellent time to update the patch.

Thanks

Ian Barwick
Re: [Proposal] Page Compression for OLTP
On Tue, Feb 16, 2021 at 11:15:36PM +0800, chenhj wrote:
> At 2021-02-16 21:51:14, "Daniel Gustafsson" wrote:
> >
> >> On 16 Feb 2021, at 15:45, chenhj wrote:
> >>
> >> I want to know whether this patch can be accepted by the community, that
> >> is, whether it is necessary for me to continue working on this patch.
> >> If you have any suggestions, please let me know.
> >
> > It doesn't seem like the patch has been registered in the commitfest app, so
> > it may have been forgotten about; proposed patches often outnumber the
> > available code review bandwidth. Please register it at:
> >
> >   https://commitfest.postgresql.org/32/
> >
> > ..to make sure it doesn't get lost.
> >
> > --
> > Daniel Gustafsson   https://vmware.com/
>
> Thanks, I will complete this patch and register it later.
>
> Chen Huajun

The simplest way forward is to register it now so it doesn't miss the window
for the upcoming commitfest (CF), which closes at the end of this month. That
way, everybody has the entire time between now and the end of the CF to
review the patch, work on it, etc., and the CF bot will be testing it against
the changing code base to ensure people know if such a change causes it to
need a rebase.

Best,
David.

--
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: [Proposal] Page Compression for OLTP
At 2021-02-16 21:51:14, "Daniel Gustafsson" wrote:
>> On 16 Feb 2021, at 15:45, chenhj wrote:
>>
>> I want to know whether this patch can be accepted by the community, that is,
>> whether it is necessary for me to continue working on this patch.
>> If you have any suggestions, please let me know.
>
> It doesn't seem like the patch has been registered in the commitfest app, so it
> may have been forgotten about; proposed patches often outnumber the
> available code review bandwidth. Please register it at:
>
>   https://commitfest.postgresql.org/32/
>
> ..to make sure it doesn't get lost.
>
> --
> Daniel Gustafsson   https://vmware.com/

Thanks, I will complete this patch and register it later.

Chen Huajun
Re: [Proposal] Page Compression for OLTP
> On 16 Feb 2021, at 15:45, chenhj wrote:
>
> I want to know whether this patch can be accepted by the community, that is,
> whether it is necessary for me to continue working on this patch.
> If you have any suggestions, please let me know.

It doesn't seem like the patch has been registered in the commitfest app, so it
may have been forgotten about; proposed patches often outnumber the available
code review bandwidth. Please register it at:

  https://commitfest.postgresql.org/32/

..to make sure it doesn't get lost.

--
Daniel Gustafsson   https://vmware.com/
Re: [Proposal] Page Compression for OLTP
Hi, hackers

I want to know whether this patch can be accepted by the community, that is,
whether it is necessary for me to continue working on this patch.
If you have any suggestions, please let me know.

Best Regards
Chen Huajun
Re: [Proposal] Page Compression for OLTP
Hi hackers,

> # Page storage (Plan C)
>
> Further, since the size of the compress address file is fixed, the above
> address file and data file can also be combined into one file
>
>     0      1      2                     1         2
>  +======+======+======+ ... +=========+=========+
>  | head |  1   |  2   | ... |  data1  |  data2  | ...
>  +======+======+======+ ... +=========+=========+
>   head  |     address       |       data

I made a prototype according to the above storage method. Any suggestions are
welcome.

# Page compress file storage related definitions

/*
 * layout of Page Compress file:
 *
 * - PageCompressHeader
 * - PageCompressAddr[]
 * - chunks of PageCompressData
 */
typedef struct PageCompressHeader
{
	pg_atomic_uint32	nblocks;			/* number of total blocks in this segment */
	pg_atomic_uint32	allocated_chunks;	/* number of total allocated chunks in data area */
	uint16				chunk_size;			/* size of each chunk, must be 1/2, 1/4 or 1/8 of BLCKSZ */
	uint8				algorithm;			/* compress algorithm, 1=pglz, 2=lz4 */
} PageCompressHeader;

typedef struct PageCompressAddr
{
	uint8				nchunks;			/* number of chunks for this block */
	uint8				allocated_chunks;	/* number of allocated chunks for this block */

	/* variable-length field: 1-based chunk number array for this block;
	 * the size of the array must be 2, 4 or 8 */
	pc_chunk_number_t	chunknos[FLEXIBLE_ARRAY_MEMBER];
} PageCompressAddr;

typedef struct PageCompressData
{
	char	page_header[SizeOfPageHeaderData];	/* page header */
	uint32	size;								/* size of compressed data */
	char	data[FLEXIBLE_ARRAY_MEMBER];		/* compressed page, except for the page header */
} PageCompressData;

# Usage

Set whether to use compression through the storage parameters of tables and
indexes:

- compress_type
  Sets whether to compress and which compression algorithm is used.
  Supported values: none, pglz, zstd

- compress_chunk_size
  A chunk is the smallest unit of storage space allocated for compressed
  pages. The size of a chunk can only be 1/2, 1/4 or 1/8 of BLCKSZ

- compress_prealloc_chunks
  The number of chunks pre-allocated for each page. The maximum value
  allowed is BLCKSZ/compress_chunk_size - 1. If the number of chunks
  required for a compressed page is less than `compress_prealloc_chunks`,
  it allocates `compress_prealloc_chunks` chunks to avoid future storage
  fragmentation when the page needs more storage space.

# Sample

## requirement

- zstd

## build

./configure --with-zstd
make
make install

## create compressed table and index

create table tb1(id int,c1 text);
create table tb1_zstd(id int,c1 text) with(compress_type=zstd,compress_chunk_size=1024);
create table tb1_zstd_4(id int,c1 text) with(compress_type=zstd,compress_chunk_size=1024,compress_prealloc_chunks=4);

create index tb1_idx_id on tb1(id);
create index tb1_idx_id_zstd on tb1(id) with(compress_type=zstd,compress_chunk_size=1024);
create index tb1_idx_id_zstd_4 on tb1(id) with(compress_type=zstd,compress_chunk_size=1024,compress_prealloc_chunks=4);

create index tb1_idx_c1 on tb1(c1);
create index tb1_idx_c1_zstd on tb1(c1) with(compress_type=zstd,compress_chunk_size=1024);
create index tb1_idx_c1_zstd_4 on tb1(c1) with(compress_type=zstd,compress_chunk_size=1024,compress_prealloc_chunks=4);

insert into tb1 select generate_series(1,100),md5(random()::text);
insert into tb1_zstd select generate_series(1,100),md5(random()::text);
insert into tb1_zstd_4 select generate_series(1,100),md5(random()::text);

## show size of table and index

postgres=# \d+
                          List of relations
 Schema |    Name    | Type  |  Owner   | Persistence | Size  | Description
--------+------------+-------+----------+-------------+-------+-------------
 public | tb1        | table | postgres | permanent   | 65 MB |
 public | tb1_zstd   | table | postgres | permanent   | 37 MB |
 public | tb1_zstd_4 | table | postgres | permanent   | 37 MB |
(3 rows)

postgres=# \di+
                               List of relations
 Schema |       Name        | Type  |  Owner   | Table | Persistence | Size  | Description
--------+-------------------+-------+----------+-------+-------------+-------+-------------
 public | tb1_idx_c1        | index | postgres | tb1   | permanent   | 73 MB |
 public | tb1_idx_c1_zstd   | index | postgres | tb1   | permanent   | 36 MB |
 public | tb1_idx_c1_zstd_4 | index | postgres | tb1   | permanent   | 41 MB |
 public | tb1_idx_id        | index | postgres | tb1   | permanent   | 21 MB |
 public | tb1_idx_id_zstd   | index | postgres | tb1   | permanent   | 13 MB |
 public | tb1_idx_id_zstd_4 | index | postgres | tb1   | permanent   | 15 MB |
(6 rows)

# pgbench performance testing(TPC-B)
Re: [Proposal] Page Compression for OLTP
Sorry, there may be a problem with the display format of the previous mail,
so I am resending it.

At 2020-05-21 15:04:55, "Fabien COELHO" wrote:
>
> Hello,
>
> My 0.02, some of which may just show some misunderstanding on my part:
>
> - Could this be proposed as some kind of extension, provided that enough
>   hooks are available? ISTM that foreign tables and/or alternative
>   storage engines (aka ACCESS METHOD) provide convenient APIs which could
>   fit the need for these? Or are they not appropriate? You seem to
>   suggest that they are not.
>
>   If not, what could be done to improve the API to allow what you are seeking
>   to do? Maybe you need a somehow lower-level programmable API which does
>   not exist already, or at least is not exported already, but could be
>   specified and implemented with limited effort? Basically you would like
>   to read/write pg pages to somewhere, and then there is the syncing
>   issue to consider. Maybe such a "page storage" API could provide
>   benefit for some specialized hardware, eg persistent memory stores,
>   so there would be more reason to define it anyway? I think it might
>   be valuable to give it some thoughts.

Thank you for giving so many comments.

In my opinion, developing a foreign table or a new storage engine, in
addition to compression, also needs to do a lot of extra things. A similar
explanation was mentioned in Nikolay P's email.

The "page storage" API may be a good choice, and I will consider it, but I
have not yet figured out how to implement it.

> - Could you maybe elaborate on how your plan differs from [4] and [5]?

My solution is similar to CFS, and it is also embedded in the file access
layer (fd.c, md.c) to realize the mapping from block numbers to the
corresponding file and location where the compressed data is stored.
However, the most important difference is that I hope to avoid the need for
GC through the design of the page layout.

https://www.postgresql.org/message-id/flat/11996861554042351%40iva4-dd95b404a60b.qloud-c.yandex.net

>> The most difficult thing in CFS development is certainly
>> defragmentation. In CFS it is done using background garbage collection,
>> by one or more GC worker processes. The main challenges were to minimize
>> its interaction with normal work of the system, make it fault tolerant
>> and prevent unlimited growth of data segments.
>>
>> CFS is not introducing its own storage manager, it is mostly embedded in
>> the existing Postgres file access layer (fd.c, md.c). This allows it to
>> reuse the code responsible for mapping relations and the file descriptor
>> cache. As was recently discussed in hackers, it may be a good idea to
>> separate the questions "how to map blocks to filenames and offsets" and
>> "how to actually perform IO". That way it will be easier to implement a
>> compressed storage manager.

> - Have you considered keeping page headers and compressing tuple data
>   only?

In that case, we must add some additional information to the page header to
identify whether this is a compressed page or an uncompressed page.

When a compressed page becomes an uncompressed page, or vice versa, the
original page header must be modified. This is unacceptable because it
requires modifying the shared buffer and recalculating the checksum.

However, it should be feasible to put this flag in the compressed address
file. The problem with this is that even if a page occupies only the size of
one compressed block, the address file still needs to be read, that is,
going from 1 IO to 2 IOs. Since the address file is very small, reading it
is basically a memory access, so this cost may not be as large as I had
imagined.

> - I'm not sure there is a point in going below the underlying file
>   system blocksize, quite often 4 KiB? Or maybe yes? Or is there
>   a benefit to aim at 1/4 even if most pages overflow?

My solution is mainly optimized for scenarios where the original page can be
compressed to require only one compressed block of storage. The case where
the original page is stored in multiple compressed blocks is suitable for
scenarios that are not particularly sensitive to performance but are more
concerned about the compression rate, such as cold data.

In addition, users can also choose to compile PostgreSQL with a 16KB or 32KB
BLCKSZ.

> - ISTM that your approach entails 3 "files". Could it be done with 2?
>   I'd suggest that the possible overflow pointers (coa) could be part of
>   the headers so that when reading the 3.1 page, the header would
>   tell where to find the overflow 3.2, without requiring an additional
>   independent structure with very small data in it, most of it zeros.
>   Possibly this is not possible, because it would require some available
>   space in standard headers when the page is not compressible, and
>   there is not enough.
Re: [Proposal] Page Compression for OLTP
Hello,

My 0.02€, some of which may just show some misunderstanding on my part:

- You have clearly given quite a few thoughts to the what and how, which
  makes your message an interesting read.

- Could this be proposed as some kind of extension, provided that enough
  hooks are available? ISTM that foreign tables and/or alternative
  storage engines (aka ACCESS METHOD) provide convenient APIs which could
  fit the need for these? Or are they not appropriate? You seem to
  suggest that they are not.

  If not, what could be done to improve the API to allow what you are seeking
  to do? Maybe you need a somehow lower-level programmable API which does
  not exist already, or at least is not exported already, but could be
  specified and implemented with limited effort? Basically you would like
  to read/write pg pages to somewhere, and then there is the syncing
  issue to consider. Maybe such a "page storage" API could provide
  benefit for some specialized hardware, eg persistent memory stores,
  so there would be more reason to define it anyway? I think it might
  be valuable to give it some thoughts.

- Could you maybe elaborate on how your plan differs from [4] and [5]?

- Have you considered keeping page headers and compressing tuple data
  only?

- I'm not sure there is a point in going below the underlying file
  system blocksize, quite often 4 KiB? Or maybe yes? Or is there
  a benefit to aim at 1/4 even if most pages overflow?

- ISTM that your approach entails 3 "files". Could it be done with 2?
  I'd suggest that the possible overflow pointers (coa) could be part of
  the headers so that when reading the 3.1 page, the header would
  tell where to find the overflow 3.2, without requiring an additional
  independent structure with very small data in it, most of it zeros.
  Possibly this is not possible, because it would require some available
  space in standard headers when the page is not compressible, and
  there is not enough. Maybe creating a little room for that in
  existing headers (4 bytes could be enough?) would be a good compromise.
  Hmmm. Maybe the approach I suggest would only work for 1/2 compression,
  but not for other target ratios; still, I think it could be made to work
  if the pointer can entail several blocks in the overflow table.

- If one page is split into 3 parts, could that create problems on syncing,
  if only 1/3 or 2/3 of the page gets written? But maybe that is manageable
  with WAL, as it would note that the page was not synced, and that is
  enough for replay.

- I'm unclear how you would manage the 2 representations of a page in
  memory. I'm afraid that an 8 KiB page compressed to 4 KiB would basically
  take 12 KiB, i.e. reduce the available memory for caching purposes.
  Hmmm. The current status is that a written page probably takes 16 KiB,
  once in shared buffers and once in the system caches, so it would be an
  improvement anyway.

- Maybe the compressed and overflow table could become bloated somehow,
  which would require a vacuuming implementation and add to the
  complexity of the implementation?

- External tools should be available to allow page inspection, eg for
  debugging purposes.

--
Fabien.