Re: [Proposal] Page Compression for OLTP
At 2021-02-16 21:51:14, "Daniel Gustafsson" wrote:
>> On 16 Feb 2021, at 15:45, chenhj wrote:
>
>> I want to know whether this patch can be accepted by the community, that is,
>> whether it is necessary for me to continue working on this patch.
>> If you have any suggestions, please give me feedback.
>
> It doesn't seem like the patch has been registered in the commitfest app, so it
> may have been forgotten about; the number of proposed patches often outnumbers
> the code review bandwidth. Please register it at:
>
>     https://commitfest.postgresql.org/32/
>
> ...to make sure it doesn't get lost.
>
> --
> Daniel Gustafsson   https://vmware.com/

Thanks, I will complete this patch and register it later.

Chen Huajun
Re: [Proposal] Page Compression for OLTP
Hi, hackers

I want to know whether this patch can be accepted by the community, that is, whether it is necessary for me to continue working on this patch. If you have any suggestions, please give me feedback.

Best Regards
Chen Huajun
Re: [Proposal] Page Compression for OLTP
Hi hackers,

> # Page storage(Plan C)
>
> Further, since the size of the compress address file is fixed, the above
> address file and data file can also be combined into one file
>
>    0      1      2      123071    0       1       2
>  +======+======+======+~~~~~~+  +=======+=======+~~~~~~
>  | head |  1   |  2   | ...  |  | data1 | data2 | ...
>  +======+======+======+~~~~~~+  +=======+=======+~~~~~~
>   head  |       address       |          data

I made a prototype according to the above storage method. Any suggestions are welcome.

# Page compress file storage related definitions

/*
 * layout of Page Compress file:
 *
 * - PageCompressHeader
 * - PageCompressAddr[]
 * - chunks of PageCompressData
 */
typedef struct PageCompressHeader
{
    pg_atomic_uint32 nblocks;          /* number of total blocks in this segment */
    pg_atomic_uint32 allocated_chunks; /* number of total allocated chunks in data area */
    uint16           chunk_size;       /* size of each chunk, must be 1/2 1/4 or 1/8 of BLCKSZ */
    uint8            algorithm;        /* compress algorithm, 1=pglz, 2=lz4 */
} PageCompressHeader;

typedef struct PageCompressAddr
{
    uint8 nchunks;          /* number of chunks for this block */
    uint8 allocated_chunks; /* number of allocated chunks for this block */

    /* variable-length fields, 1-based chunk number array for this block;
     * size of the array must be 2, 4 or 8 */
    pc_chunk_number_t chunknos[FLEXIBLE_ARRAY_MEMBER];
} PageCompressAddr;

typedef struct PageCompressData
{
    char   page_header[SizeOfPageHeaderData]; /* page header */
    uint32 size;                              /* size of compressed data */
    char   data[FLEXIBLE_ARRAY_MEMBER];       /* compressed page, except for the page header */
} PageCompressData;

# Usage

Set whether to use compression through storage parameters of tables and indexes:

- compress_type
  Sets whether to compress and the compression algorithm used; supported values: none, pglz, zstd

- compress_chunk_size
  A chunk is the smallest unit of storage space allocated for compressed pages. The chunk size can only be 1/2, 1/4 or 1/8 of BLCKSZ

- compress_prealloc_chunks
  The number of chunks pre-allocated for each page. The maximum value allowed is BLCKSZ/compress_chunk_size - 1.
If the number of chunks required for a compressed page is less than `compress_prealloc_chunks`, it allocates `compress_prealloc_chunks` chunks to avoid future storage fragmentation when the page needs more storage space.

# Sample

## requirement

- zstd

## build

./configure --with-zstd
make
make install

## create compressed table and index

create table tb1(id int,c1 text);
create table tb1_zstd(id int,c1 text) with(compress_type=zstd,compress_chunk_size=1024);
create table tb1_zstd_4(id int,c1 text) with(compress_type=zstd,compress_chunk_size=1024,compress_prealloc_chunks=4);

create index tb1_idx_id on tb1(id);
create index tb1_idx_id_zstd on tb1(id) with(compress_type=zstd,compress_chunk_size=1024);
create index tb1_idx_id_zstd_4 on tb1(id) with(compress_type=zstd,compress_chunk_size=1024,compress_prealloc_chunks=4);

create index tb1_idx_c1 on tb1(c1);
create index tb1_idx_c1_zstd on tb1(c1) with(compress_type=zstd,compress_chunk_size=1024);
create index tb1_idx_c1_zstd_4 on tb1(c1) with(compress_type=zstd,compress_chunk_size=1024,compress_prealloc_chunks=4);

insert into tb1 select generate_series(1,100),md5(random()::text);
insert into tb1_zstd select generate_series(1,100),md5(random()::text);
insert into tb1_zstd_4 select generate_series(1,100),md5(random()::text);

## show size of table and index

postgres=# \d+
                          List of relations
 Schema |    Name    | Type  |  Owner   | Persistence | Size  | Description
--------+------------+-------+----------+-------------+-------+-------------
 public | tb1        | table | postgres | permanent   | 65 MB |
 public | tb1_zstd   | table | postgres | permanent   | 37 MB |
 public | tb1_zstd_4 | table | postgres | permanent   | 37 MB |
(3 rows)

postgres=# \di+
                                  List of relations
 Schema |       Name        | Type  |  Owner   | Table | Persistence | Size  | Description
--------+-------------------+-------+----------+-------+-------------+-------+-------------
 public | tb1_idx_c1        | index | postgres | tb1   | permanent   | 73 MB |
 public | tb1_idx_c1_zstd   | index | postgres | tb1   | permanent   | 36 MB |
 public | tb1_idx_c1_zstd_4 | index | postgres | tb1   | permanent   | 41 MB |
 public | tb1_idx_id        | index | postgres | tb1   | permanent   | 21 MB |
 public | tb1_idx_id_zstd   | index | postgres | tb1   | permanent   | 13 MB |
 public | tb1_idx_id_zstd_4 | index | postgres | tb1   | permanent   | 15 MB |
(6 rows)

# pgbench performance testing(TPC-B)

Compre
Re: [Proposal] Page Compression for OLTP
Sorry, there may be a problem with the display format of the previous mail, so I resend it.

At 2020-05-21 15:04:55, "Fabien COELHO" wrote:
>
> Hello,
>
> My 0.02, some of which may just show some misunderstanding on my part:
>
> - Could this be proposed as some kind of extension, provided that enough
>   hooks are available? ISTM that foreign tables and/or alternative
>   storage engine (aka ACCESS METHOD) provide convenient APIs which could
>   fit the need for these? Or are they not appropriate? You seem to
>   suggest that there are not.
>
>   If not, what could be done to improve API to allow what you are seeking
>   to do? Maybe you need a somehow lower-level programmable API which does
>   not exist already, or at least is not exported already, but could be
>   specified and implemented with limited effort? Basically you would like
>   to read/write pg pages to somewhere, and then there is the syncing
>   issue to consider. Maybe such a "page storage" API could provide
>   benefit for some specialized hardware, eg persistent memory stores,
>   so there would be more reason to define it anyway? I think it might
>   be valuable to give it some thoughts.

Thank you for giving so many comments.

In my opinion, developing a foreign table or a new storage engine, in addition to compression, also needs to do a lot of extra things. A similar explanation was mentioned in Nikolay P's email.

The "page storage" API may be a good choice, and I will consider it, but I have not yet figured out how to implement it.

> - Could you maybe elaborate on how your plan differs from [4] and [5]?

My solution is similar to CFS, and it is also embedded in the file access layer (fd.c, md.c) to realize the mapping from a block number to the corresponding file and the location where the compressed data is stored. However, the most important difference is that I hope to avoid the need for GC through the design of the page layout.

https://www.postgresql.org/message-id/flat/11996861554042351%40iva4-dd95b404a60b.qloud-c.yandex.net

>> The most difficult thing in CFS development is certainly
>> defragmentation. In CFS it is done using background garbage collection,
>> by one or one GC worker processes. The main challenges were to minimize its
>> interaction with normal work of the system, make it fault tolerant and
>> prevent unlimited growth of data segments.
>>
>> CFS is not introducing its own storage manager, it is mostly embedded in
>> existed Postgres file access layer (fd.c, md.c). It allows to reused
>> code responsible for mapping relations and file descriptors cache. As it
>> was recently discussed in hackers, it may be good idea to separate the
>> questions "how to map blocks to filenames and offsets" and "how to
>> actually perform IO". In this it will be easier to implement compressed
>> storage manager.

> - Have you consider keeping page headers and compressing tuple data
>   only?

In that case, we must add some additional information in the page header to identify whether this is a compressed page or an uncompressed page.

When a compressed page becomes an uncompressed page, or vice versa, the original page header must be modified. This is unacceptable because it requires modifying the shared buffer and recalculating the checksum.

However, it should be feasible to put this flag in the compress address file. The problem with this is that even if a page only occupies one compressed block, the address file still needs to be read; that is, 1 IO becomes 2 IOs. Since the address file is very small, reading it is basically a memory access, so this cost may not be as large as I had imagined.

> - I'm not sure there is a point in going below the underlying file
>   system blocksize, quite often 4 KiB? Or maybe yes? Or is there
>   a benefit to aim at 1/4 even if most pages overflow?

My solution is mainly optimized for the scenario where the original page can be compressed to require only one compressed block of storage. Storing the original page in multiple compressed blocks suits scenarios that are not particularly sensitive to performance but care more about the compression ratio, such as cold data.

In addition, users can also choose to compile PostgreSQL with a 16KB or 32KB BLCKSZ.

> - ISTM that your approach entails 3 "files". Could it be done with 2?
>   I'd suggest that the possible overflow pointers (coa) could be part of
>   the headers so that when reading the 3.1 page, then the header would
>   tell where to find the overflow 3.2, without requiring an additional
>   independent structure with very small data in it, most of it zeros.
>   Possibly this is not possible, because it would require some available
>   space in standard headers when the page is not compressible, and
>   there is not enough.
Re: [Proposal] Page Compression for OLTP
At 2020-05-21 15:04:55, "Fabien COELHO" wrote:
>
> Hello,
>
> My 0.02, some of which may just show some misunderstanding on my part:
>
> - Could this be proposed as some kind of extension, provided that enough
>   hooks are available? ISTM that foreign tables and/or alternative
>   storage engine (aka ACCESS METHOD) provide convenient APIs which could
>   fit the need for these? Or are they not appropriate? You seem to
>   suggest that there are not.
>
>   If not, what could be done to improve API to allow what you are seeking
>   to do? Maybe you need a somehow lower-level programmable API which does
>   not exist already, or at least is not exported already, but could be
>   specified and implemented with limited effort? Basically you would like
>   to read/write pg pages to somewhere, and then there is the syncing
>   issue to consider. Maybe such a "page storage" API could provide
>   benefit for some specialized hardware, eg persistent memory stores,
>   so there would be more reason to define it anyway? I think it might
>   be valuable to give it some thoughts.

Thank you for giving so many comments.

In my opinion, developing a foreign table or a new storage engine, in addition to compression, also needs to do a lot of extra things. A similar explanation was mentioned in Nikolay P's email.

The "page storage" API may be a good choice, and I will consider it, but I have not yet figured out how to implement it.

> - Could you maybe elaborate on how your plan differs from [4] and [5]?

My solution is similar to CFS, and it is also embedded in the file access layer (fd.c, md.c) to realize the mapping from a block number to the corresponding file and the location where the compressed data is stored. However, the most important difference is that I hope to avoid the need for GC through the design of the page layout.

https://www.postgresql.org/message-id/flat/11996861554042351%40iva4-dd95b404a60b.qloud-c.yandex.net

>> The most difficult thing in CFS development is certainly
>> defragmentation. In CFS it is done using background garbage collection,
>> by one or one GC worker processes. The main challenges were to minimize its
>> interaction with normal work of the system, make it fault tolerant and
>> prevent unlimited growth of data segments.
>>
>> CFS is not introducing its own storage manager, it is mostly embedded in
>> existed Postgres file access layer (fd.c, md.c). It allows to reused
>> code responsible for mapping relations and file descriptors cache. As it
>> was recently discussed in hackers, it may be good idea to separate the
>> questions "how to map blocks to filenames and offsets" and "how to
>> actually perform IO". In this it will be easier to implement compressed
>> storage manager.

> - Have you consider keeping page headers and compressing tuple data
>   only?

In that case, we must add some additional information in the page header to identify whether this is a compressed page or an uncompressed page.

When a compressed page becomes an uncompressed page, or vice versa, the original page header must be modified. This is unacceptable because it requires modifying the shared buffer and recalculating the checksum.

However, it should be feasible to put this flag in the compress address file. The problem with this is that even if a page only occupies one compressed block, the address file still needs to be read; that is, 1 IO becomes 2 IOs. Since the address file is very small, reading it is basically a memory access, so this cost may not be as large as I had imagined.

> - I'm not sure there is a point in going below the underlying file
>   system blocksize, quite often 4 KiB? Or maybe yes? Or is there
>   a benefit to aim at 1/4 even if most pages overflow?

My solution is mainly optimized for the scenario where the original page can be compressed to require only one compressed block of storage. Storing the original page in multiple compressed blocks suits scenarios that are not particularly sensitive to performance but care more about the compression ratio, such as cold data.

In addition, users can also choose to compile PostgreSQL with a 16KB or 32KB BLCKSZ.

> - ISTM that your approach entails 3 "files". Could it be done with 2?
>   I'd suggest that the possible overflow pointers (coa) could be part of
>   the headers so that when reading the 3.1 page, then the header would
>   tell where to find the overflow 3.2, without requiring an additional
>   independent structure with very small data in it, most of it zeros.
>   Possibly this is not possible, because it would require some available
>   space in standard headers when the page is not compressible, and
>   there is not enough. Maybe creating a little room for that in
>   existing headers (4 bytes could be enough?) would be a good compromise.
>   Hmmm. Maybe the approach I suggest would only work for 1/2 compression,
>   but not for other target ratios, but I thin
[Proposal] Page Compression for OLTP
Hello hackers!

This email is about the proposal to implement OLTP-oriented compression on PostgreSQL.

There are currently some alternative ways to achieve compression, such as using a file system that supports compression. However, that depends on a specific file system, is not suitable for all deployment environments, and also increases the complexity of deployment and maintenance.

I hope this compression work can meet the following goals:

1. In most scenarios, the compressed size of the table can be lower than 50% of the original table.
2. Mainly oriented to the OLTP scenario; the performance impact on workloads with frequent reads and writes is relatively small.
3. Does not rely on special external software or hardware that is difficult to obtain.
4. Friendly to application developers and database administrators.
5. The changes to PostgreSQL itself are small.

I have noticed that there has been some discussion or work related to compression before, but it does not meet the above goals. For example:

1. Use sparse files [1]

This is also the implementation method of MySQL 5.7's transparent page compression. However, sparse files may generate a lot of fragmentation inside the file system, and the "compressed" data files in sparse files will be restored to their original size after physical backup and restoration, unless our backup and recovery tools also support sparse files.

2. Use TAM (table access method interface) (pg_cryogen, zedstore) [2][3]

Neither storage engine is geared towards OLTP scenarios. It is best to make relatively small modifications to the existing code of the heap engine (mainly by modifying md.c and fd.c).

The methods proposed by Postgres Pro Enterprise CFS [4] and Nikolay P [5] are close to my needs. However, I would like to mention a slightly different implementation plan, which does not require space reclamation. Hope to get any suggestions.

# Premise assumption

1. Most of the pages in the compressed table can be compressed to less than 50% of the original size

As long as you use an algorithm with a relatively high compression ratio (such as zlib, zstd), this first point should be easy to meet, unless the table stores already-compressed data, such as pictures.

2. The compression ratio of most pages in the same table is relatively close

# Page storage

Configure 3 files for storing compressed data for each segment of each main fork. The main fork segment file (for example: 123456.2) still exists, but its size is 0.

- Compressed data file (for example: 123456.2.cd)
  Used to store the compressed pages. The block size of this file is configurable at the table level, but it can only be 1/2, 1/4 or 1/8 of BLCKSZ.

- Compress overflow address file (for example: 123456.2.coa)
  When a page cannot be compressed to less than the size of a compressed block, this file is used to store the addresses of the overflow blocks.

- Compress overflow data file (for example: 123456.2.cod)
  When a page cannot be compressed to less than the size of a compressed block, this file is used to store the overflow blocks.

The following is an example when the compressed block size is 4K, which is 1/2 of BLCKSZ.

## Scenario 1: The compressed size of the original page (including the header of the compressed page) is less than or equal to the compressed block size (4KB)

Compressed data file (123456.2.cd):

     0       1       2
 +=======+=======+=======+
 | data0 | data1 | data2 |
 +=======+=======+=======+
       ->|  4K   |<-

## Scenario 2: The compressed size of the original page (including the header of the compressed page) is larger than the compressed block size (4KB)

If the compressed size of the original page (page 3 below) is greater than 4KB, it will not be compressed. The first 4KB of the original page is stored in the compressed data file, and the last 4KB is stored in the compress overflow data file.

Compressed data file (123456.2.cd):

     0       1       2        3
 +=======+=======+=======+=========+
 | data0 | data1 | data2 | data3_1 |
 +=======+=======+=======+=========+
                       ->| 1st 4K  |<-

Compress overflow address file (123456.2.coa):

The compress overflow address file stores, for each block, the block number of the compress overflow block assigned to it, plus 1. The size of the compressed block and the number of expanded blocks of the compress overflow data file are stored in the head of the compress overflow address file.

          0      1      2      3
 +======+======+======+======+======+
 | head |      |      |      |  1   |
 +======+======+======+======+======+

Compress overflow data file (123456.2.cod):

      0         1      2      3
 +==========+======+======+======+
 | data3_2  |      |      |      |
 +==========+======+======+======+
 ->| 2nd 4K |<-

If the compressed block size is 1/4 or 1/8 of BLOCKS
Re: could not access status of transaction
Hi,

This database has not had the same failure again since 2019/09/16 (reported at 2019/09/29), so this is a very low-probability failure, but it is uncertain whether it will happen again in the future.

Now I add some information for the incident at 2019/09/16, which may be useful for analyzing the cause of this problem.

> Block 32291
> -----------
> Block Offset: 0x0fc46000   Offsets: Lower 236 (0x00ec)
> Block: Size 8192  Version 4   Upper 2112 (0x0840)
> LSN: logid 254 recoff 0xa3a598c8   Special 8192 (0x2000)
> Items: 53  Free Space: 1876
> Checksum: 0x8355  Prune XID: 0x02186566  Flags: 0x0001 (HAS_FREE_LINES)
> Length (including item array): 236
> ...
> Item 8 -- Length: 151 Offset: 4696 (0x1258) Flags: NORMAL
> XMIN: 35153480 XMAX: 35153545 CID|XVAC: 0
> Block Id: 163363 linp Index: 9 Attributes: 20 Size: 32
> infomask: 0x2503
> (HASNULL|HASVARWIDTH|XMIN_COMMITTED|XMAX_COMMITTED|UPDATED|HOT_UPDATED|HEAP_ONLY)
> t_bits: [0]: 0xff [1]: 0xff [2]: 0x07
>
> Item 9 -- Length: 151 Offset: 4544 (0x11c0) Flags: NORMAL
> XMIN: 35153545 XMAX: 0 CID|XVAC: 0
> Block Id: 163363 linp Index: 9 Attributes: 20 Size: 32
> infomask: 0x2803 (HASNULL|HASVARWIDTH|XMAX_INVALID|UPDATED|HEAP_ONLY)
> t_bits: [0]: 0xff [1]: 0xff [2]: 0x07

According to the above information, the flags of the heap page (163363) containing the problem tuple (163363, 9) are 0x0001 (HAS_FREE_LINES); that is, ALL_VISIBLE is not set. However, according to the hexdump content of the corresponding vm file, that block (location is 9F88 + 6bit) has the VISIBILITYMAP_ALL_FROZEN and VISIBILITYMAP_ALL_VISIBLE flags set. That is, the heap file and the vm file are inconsistent.
# vi 19548_vm.hexdump
0000000 0215 0858 857c 8cee 0018 2000
0000010 2000 2004
0000020 *
0002000 0215 1680 857c 3bb1 0018 2000
0002010 2000 2004
0002020 *
0004000 0215 20e8 857c 07f7 0018 2000
0004010 2000 2004
0004020 *
0006000 0215 3618 857c 4792 0018 2000
0006010 2000 2004
0006020 *
0008000 0215 17a8 8580 3d7e 0018 2000
0008010 2000 2004
0008020 *
000a000 0215 3558 8585 1239 0018 2000
000a010 2000 2004
000a020 *
000c000 0215 72a8 871b 1a23 0018 2000
000c010 2000 2004
000c020 ...
000e000 0215 93d0 8794 506a 0018 2000
000e010 2000 2004 fc03
000e020 f3ff cccf fffc f3cf ff3f
...
000f6c0 3f0f 3303 c33f 00f0 00c3 0303 0003
000f6d0 *
001

Regards,
Chen Huajun
Connections hang indefinitely while taking a gin index's LWLock buffer_content lock(PG10.7)
Hi, all

In our PostgreSQL 10.7 (RHEL 6.3) database, the autovacuum process and many insert processes have been blocked on a gin index's LWLock buffer_content for a long time. In other words, the following gin index lwlock deadlock phenomenon has occurred again. Since the following bug in 10.7 has already been fixed, this should be a new bug.

https://www.postgresql.org/message-id/flat/31a702a.14dd.166c1366ac1.Coremail.chjischj%40163.com

We have already obtained coredump files of the autovacuum process and one of the insert processes. Unfortunately the insert process (dumped by gcore) held no lwlock; it should be another process (we did not fetch its core file) that holds the lwlock needed by the autovacuum process.

The stacks are as follows:

## stack of one insert process: acquires lock 0x7f6c517dbfa4, which was held by the vacuum process

(gdb) bt
#0 0x00369ea0da00 in sem_wait () from /lib64/libpthread.so.0
#1 0x006a7910 in PGSemaphoreLock (sema=0x7f6c4f76a7b8) at pg_sema.c:316
#2 0x00718225 in LWLockAcquire (lock=0x7f6c517dbfa4, mode=LW_SHARED) at lwlock.c:1233
#3 0x0048b622 in ginTraverseLock (buffer=224225, searchMode=0 '\000') at ginbtree.c:40
#4 0x0048ca13 in ginFindLeafPage (btree=0x7fffc71c4ea0, searchMode=0 '\000', snapshot=0x0) at ginbtree.c:97
#5 0x004894db in ginInsertItemPointers (index=, rootBlkno=, items=, nitem=, buildStats=0x0) at gindatapage.c:1909
#6 0x004863a7 in ginEntryInsert (ginstate=0x1c72158, attnum=1, key=20190913, category=0 '\000', items=0x1c81508, nitem=72, buildStats=0x0) at gininsert.c:214
#7 0x0049219a in ginInsertCleanup (ginstate=0x1c72158, full_clean=0 '\000', fill_fsm=1 '\001', forceCleanup=, stats=) at ginfast.c:878
#8 0x0049308e in ginHeapTupleFastInsert (ginstate=0x1c72158, collector=) at ginfast.c:443
#9 0x00486749 in gininsert (index=, values=0x7fffc71c54f0, isnull=0x7fffc71c5600 "", ht_ctid=0x1c6d3a4, heapRel=, checkUnique=, indexInfo=0x1c61da8) at gininsert.c:522
#10 0x005f75f0 in ExecInsertIndexTuples (slot=0x1c62168, tupleid=0x1c6d3a4, estate=0x1c61768, noDupErr=0 '\000', specConflict=0x0, arbiterIndexes=0x0) at execIndexing.c:387
#11 0x00616497 in ExecInsert (pstate=0x1c61ab8) at nodeModifyTable.c:519
#12 ExecModifyTable (pstate=0x1c61ab8) at nodeModifyTable.c:1779
#13 0x005fb6bf in ExecProcNode (queryDesc=0x1c67760, direction=, count=0, execute_once=-72 '\270') at ../../../src/include/executor/executor.h:250
#14 ExecutePlan (queryDesc=0x1c67760, direction=, count=0, execute_once=-72 '\270') at execMain.c:1723
#15 standard_ExecutorRun (queryDesc=0x1c67760, direction=, count=0, execute_once=-72 '\270') at execMain.c:364
#16 0x7f6e226aa6f8 in pgss_ExecutorRun (queryDesc=0x1c67760, direction=ForwardScanDirection, count=0, execute_once=1 '\001') at pg_stat_statements.c:889
#17 0x7f6e224a474d in explain_ExecutorRun (queryDesc=0x1c67760, direction=ForwardScanDirection, count=0, execute_once=1 '\001') at auto_explain.c:267
#18 0x0072a15b in ProcessQuery (plan=, sourceText=0x1c21458 "INSERT INTO bi_dm.tdm_wh_shopgds_fnsh_rt (STATIS_DATE,SITE_CD,LGORT,ZSIZE,ZVTWEG,VSBED,TOTAL_CNT,FNSH_CNT,UNFNSH_CNT,ETL_TIME,DEPT_CD,TMALL_FLG,BUSS_TP,ZCKYWLX) VALUES($1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$"..., params=0x1c21580, queryEnv=0x0, dest=, completionTag=0x7fffc71c5de0 "") at pquery.c:161
#19 0x0072a395 in PortalRunMulti (portal=0x1c57f18, isTopLevel=1 '\001', setHoldSnapshot=0 '\000', dest=0xc9b480, altdest=0xc9b480, completionTag=0x7fffc71c5de0 "") at pquery.c:1286
#20 0x0072aa98 in PortalRun (portal=0x1c57f18, count=1, isTopLevel=1 '\001', run_once=1 '\001', dest=0x1c25768, altdest=0x1c25768, completionTag=0x7fffc71c5de0 "") at pquery.c:799
#21 0x00728c9a in exec_execute_message (argc=, argv=, dbname=0x1bbb800 "lbiwhdb", username=) at postgres.c:2007
#22 PostgresMain (argc=, argv=, dbname=0x1bbb800 "lbiwhdb", username=) at postgres.c:4180
#23 0x006bb43a in BackendRun (argc=, argv=) at postmaster.c:4405
#24 BackendStartup (argc=, argv=) at postmaster.c:4077
#25 ServerLoop (argc=, argv=) at postmaster.c:1755
#26 PostmasterMain (argc=, argv=) at postmaster.c:1363
#27 0x0063b4d0 in main (argc=3, argv=0x1b839e0) at main.c:228

(gdb) f 2
#2 0x00718225 in LWLockAcquire (lock=0x7f6c517dbfa4, mode=LW_SHARED) at lwlock.c:1233
1233    lwlock.c: No such file or directory.
        in lwlock.c
(gdb) p num_held_lwlocks
$1 = 0
(gdb)

## stack of autovacuum: acquires lock 0x7f6c519ba5a4 while holding 0x7f6c517dbfa4 and 0x7f6c51684f64

(gdb) bt
#0 0x00369ea0da00 in sem_wait () from /lib64/libpthread.so.0
#1 0x006a7910 in PGSemaphoreLock (sema=0x7f6c4f77fdb8) at pg_sema.c:316
#2 0x00718225 in LWLockAcquire (lock=0x7f6c519ba5a4, mode=LW_EXCLUSIVE) at lwlock.c:1
Re: Re: could not access status of transaction
Hi, all

Our other system has encountered the same failure, but this time it is PostgreSQL 10.7 (RHEL 6.3). Details are as follows:

Phenomenon:

app_db=# select count(*) from loba_sp_cost_xcd_104561;
ERROR: could not access status of transaction 35153545
DETAIL: Could not open file "pg_xact/0021": No such file or directory.

Related information:

The clog files in the pg_xact directory are 00CE~0179, and pg_xact/0021 has been cleaned up. Through gdb analysis, the tuple that caused the error is in block 163363.

(gdb) bt
#0 SlruReportIOError (ctl=0xca6b60, pageno=1072, xid=35153545) at slru.c:894
#1 0x004e2060 in SimpleLruReadPage (ctl=0xca6b60, pageno=1072, write_ok=1 '\001', xid=35153545) at slru.c:445
#2 0x004da431 in TransactionIdGetStatus (xid=35153545, lsn=0x7fffd07b1e88) at clog.c:411
#3 0x004e385a in TransactionLogFetch (transactionId=35153545) at transam.c:79
#4 0x004e39a5 in TransactionIdDidCommit (transactionId=) at transam.c:129
#5 0x0085e959 in HeapTupleSatisfiesMVCC (htup=, snapshot=0x2c21740, buffer=320) at tqual.c:1057
#6 0x004b49e7 in heapgetpage (scan=0x2c8a5d8, page=163363) at heapam.c:439
#7 0x004b5091 in heapgettup_pagemode (scan=0x2c8a5d8, direction=) at heapam.c:1034
#8 heap_getnext (scan=0x2c8a5d8, direction=) at heapam.c:1801
#9 0x00618b21 in SeqNext (node=0x2c63708) at nodeSeqscan.c:81
#10 0x005fe132 in ExecScanFetch (node=0x2c63708, accessMtd=0x618ae0 , recheckMtd=0x618840 ) at execScan.c:97
#11 ExecScan (node=0x2c63708, accessMtd=0x618ae0 , recheckMtd=0x618840 ) at execScan.c:164
#12 0x006061ee in ExecProcNode (aggstate=0x2c630c0) at ../../../src/include/executor/executor.h:250
#13 fetch_input_tuple (aggstate=0x2c630c0) at nodeAgg.c:695
#14 0x00607f89 in agg_retrieve_direct (pstate=0x2c630c0) at nodeAgg.c:2448
#15 ExecAgg (pstate=0x2c630c0) at nodeAgg.c:2158
#16 0x0060b240 in ExecProcNode (pstate=0x2c62da0) at ../../../src/include/executor/executor.h:250
#17 gather_getnext (pstate=0x2c62da0) at nodeGather.c:286
#18 ExecGather (pstate=0x2c62da0) at nodeGather.c:215
#19 0x006061ee in ExecProcNode (aggstate=0x2c626f8) at ../../../src/include/executor/executor.h:250
#20 fetch_input_tuple (aggstate=0x2c626f8) at nodeAgg.c:695
#21 0x006080ec in agg_retrieve_direct (pstate=0x2c626f8) at nodeAgg.c:2347
#22 ExecAgg (pstate=0x2c626f8) at nodeAgg.c:2158
#23 0x005fb6bf in ExecProcNode (queryDesc=0x2c81658, direction=, count=0, execute_once=-8 '\370') at ../../../src/include/executor/executor.h:250
#24 ExecutePlan (queryDesc=0x2c81658, direction=, count=0, execute_once=-8 '\370') at execMain.c:1723
#25 standard_ExecutorRun (queryDesc=0x2c81658, direction=, count=0, execute_once=-8 '\370') at execMain.c:364
#26 0x7fd6f161c6f8 in pgss_ExecutorRun (queryDesc=0x2c81658, direction=ForwardScanDirection, count=0, execute_once=1 '\001') at pg_stat_statements.c:889
#27 0x7fd6f141674d in explain_ExecutorRun (queryDesc=0x2c81658, direction=ForwardScanDirection, count=0, execute_once=1 '\001') at auto_explain.c:267
#28 0x0072986b in PortalRunSelect (portal=0x2c85f98, forward=, count=0, dest=) at pquery.c:932
#29 0x0072aa01 in PortalRun (portal=0x2c85f98, count=9223372036854775807, isTopLevel=1 '\001', run_once=1 '\001', dest=0x2c5fb58, altdest=0x2c5fb58, completionTag=0x7fffd07b26d0 "") at pquery.c:773
#30 0x00727051 in exec_simple_query (query_string=0x2c26358 "select count(*) from loba_sp_cost_xcd_104561;") at postgres.c:1122
#31 0x00728039 in PostgresMain (argc=, argv=, dbname=0x2bcc2f0 "app_db", username=) at postgres.c:4117
#32 0x006bb43a in BackendRun (argc=, argv=) at postmaster.c:4405
#33 BackendStartup (argc=, argv=) at postmaster.c:4077
#34 ServerLoop (argc=, argv=) at postmaster.c:1755
#35 PostmasterMain (argc=, argv=) at postmaster.c:1363
#36 0x0063b4d0 in main (argc=1, argv=0x2b8e9f0) at main.c:228

(gdb) f 5
#5 0x0085e959 in HeapTupleSatisfiesMVCC (htup=, snapshot=0x2c21740, buffer=320) at tqual.c:1057
1057    tqual.c: No such file or directory.
        in tqual.c
(gdb) p *tuple
$1 = {t_choice = {t_heap = {t_xmin = 35153545, t_xmax = 0, t_field3 = {t_cid = 0, t_xvac = 0}}, t_datum = {datum_len_ = 35153545, datum_typmod = 0, datum_typeid = 0}}, t_ctid = {ip_blkid = {bi_hi = 2, bi_lo = 32291}, ip_posid = 9}, t_infomask2 = 32788, t_infomask = 10243, t_hoff = 32 ' ', t_bits = 0x7fd6d97f0440 "\211f\030\002"}

The problematic tuple data parsed using `pg_filedump -i` is as follows:

Block 32291
-----------
Block Offset: 0x0fc46000   Offsets: Lower 236 (0x00ec)
Block: Size 8192  Version 4   Upper 2112 (0x0840)
LSN: logid 254 recoff 0xa3a598c8   Special 8192 (0x2000)
Items: 53  Free Space: 1876
Checksum: 0x8355  Prune
could not access status of transaction
In our PostgreSQL 10.2 (CentOS 7.3) database, the following error is reported when querying a table. We have already restored the production data from a backup, but I want to confirm what the cause may be and how to avoid it in the future.

lma=# select count(*) from bi_dm.tdm_ttk_site_on_way_rt;
ERROR: could not access status of transaction 3250922107
DETAIL: Could not open file "pg_xact/0C1C": No such file or directory.

Here is some related information.

The CLOG files in the pg_xact directory are as follows:

0C4A (last update date: 2018/12/29)
...
0D09 (last update date: 2019/01/13)

The stack when the error occurs is as follows:

(gdb) bt
#0 SlruReportIOError (ctl=ctl@entry=0xca98e0 , pageno=pageno@entry=99210, xid=xid@entry=3250922107) at slru.c:901
#1 0x004e53a8 in SimpleLruReadPage (ctl=ctl@entry=0xca98e0 , pageno=pageno@entry=99210, write_ok=write_ok@entry=1 '\001', xid=xid@entry=3250922107) at slru.c:445
#2 0x004e5460 in SimpleLruReadPage_ReadOnly (ctl=ctl@entry=0xca98e0 , pageno=pageno@entry=99210, xid=xid@entry=3250922107) at slru.c:492
#3 0x004dde4d in TransactionIdGetStatus (xid=xid@entry=3250922107, lsn=lsn@entry=0x7ffeb4472660) at clog.c:411
#4 0x004e6e28 in TransactionLogFetch (transactionId=3250922107) at transam.c:79
#5 0x004e6e7f in TransactionIdDidCommit (transactionId=) at transam.c:129
#6 0x0084ce6a in HeapTupleSatisfiesMVCC (htup=, snapshot=0xf7f0c8, buffer=837) at tqual.c:1124
#7 0x004b106e in heapgetpage (scan=scan@entry=0x10bb1b0, page=page@entry=1078209) at heapam.c:439
#8 0x004b23ab in heapgettup_pagemode (key=0x0, nkeys=0, dir=1078209, scan=0x10bb1b0) at heapam.c:1034
#9 heap_getnext (scan=scan@entry=0x10bb1b0, direction=direction@entry=ForwardScanDirection) at heapam.c:1801
#10 0x0060fa93 in SeqNext (node=node@entry=0x108c360) at nodeSeqscan.c:81
#11 0x005f51ca in ExecScanFetch (recheckMtd=0x60fa40 , accessMtd=0x60fa60 , node=0x108c360) at execScan.c:97
#12 ExecScan (node=0x108c360, accessMtd=0x60fa60 , recheckMtd=0x60fa40 ) at execScan.c:147
#13 0x005fad89 in ExecProcNode (node=0x108c360) at ../../../src/include/executor/executor.h:250
#14 fetch_input_tuple (aggstate=aggstate@entry=0x108bcb8) at nodeAgg.c:695
#15 0x005fcdbf in agg_retrieve_direct (aggstate=0x108bcb8) at nodeAgg.c:2448
#16 ExecAgg (pstate=0x108bcb8) at nodeAgg.c:2158
#17 0x005ef612 in ExecProcNode (node=0x108bcb8) at ../../../src/include/executor/executor.h:250
#18 ExecutePlan (execute_once=, dest=0x10c40e8, direction=, numberTuples=0, sendTuples=1 '\001', operation=CMD_SELECT, use_parallel_mode=, planstate=0x108bcb8, estate=0x108baa8) at execMain.c:1722
#19 standard_ExecutorRun (queryDesc=0x10c7168, direction=, count=0, execute_once=) at execMain.c:363
#20 0x7f766427915d in pgss_ExecutorRun (queryDesc=0x10c7168, direction=ForwardScanDirection, count=0, execute_once=) at pg_stat_statements.c:889
#21 0x0071a28b in PortalRunSelect (portal=portal@entry=0xfac708, forward=forward@entry=1 '\001', count=0, count@entry=9223372036854775807, dest=dest@entry=0x10c40e8) at pquery.c:932
#22 0x0071b63f in PortalRun (portal=portal@entry=0xfac708, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1 '\001', run_once=run_once@entry=1 '\001', dest=dest@entry=0x10c40e8, altdest=altdest@entry=0x10c40e8, completionTag=completionTag@entry=0x7ffeb4472c80 "") at pquery.c:773
#23 0x007175b3 in exec_simple_query (query_string=0x102a4e8 "select count(*) from tdm_ttk_site_on_way_rt ;") at postgres.c:1099
#24 0x007188ac in PostgresMain (argc=, argv=argv@entry=0xfa2d68, dbname=0xfa2c50 "lma", username=) at postgres.c:4088
#25 0x0047ad3c in BackendRun (port=0xfa49b0) at postmaster.c:4405
#26 BackendStartup (port=0xfa49b0) at postmaster.c:4077
#27 ServerLoop () at postmaster.c:1755
#28 0x006afacf in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0xf5fa20) at postmaster.c:1363
#29 0x0047bb6f in main (argc=3, argv=0xf5fa20) at main.c:228

(gdb) f 6
#6 0x0084ce6a in HeapTupleSatisfiesMVCC (htup=, snapshot=0xf7f0c8, buffer=837) at tqual.c:1124
1124    if (!TransactionIdDidCommit(HeapTupleHeaderGetRawXmax(tuple)))

(gdb) p *tuple
$3 = {t_choice = {t_heap = {t_xmin = 3238223060, t_xmax = 3250922107, t_field3 = {t_cid = 0, t_xvac = 0}}, t_datum = {datum_len_ = -1056744236, datum_typmod = -1044045189, datum_typeid = 0}}, t_ctid = {ip_blkid = {bi_hi = 16, bi_lo = 29633}, ip_posid = 3}, t_infomask2 = 8211, t_infomask
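For reference, the failing page and segment can be re-derived from the xid. This is a minimal sketch (not server code) assuming the default constants from clog.c/slru.c: BLCKSZ = 8192, 2 status bits per transaction (so 4 transactions per byte), and 32 SLRU pages per segment file. It reproduces the pageno=99210 seen in the stack and the missing file name 0C1C, which falls before the oldest surviving segment 0C4A, consistent with the file having been removed by CLOG truncation:

```python
# Re-derive the SLRU page number and pg_xact segment file for an xid,
# using PostgreSQL's default CLOG layout constants (assumed here).
BLCKSZ = 8192
CLOG_XACTS_PER_BYTE = 4              # 2 status bits per transaction
CLOG_XACTS_PER_PAGE = BLCKSZ * CLOG_XACTS_PER_BYTE
SLRU_PAGES_PER_SEGMENT = 32

def clog_location(xid: int):
    pageno = xid // CLOG_XACTS_PER_PAGE        # which SLRU page holds the xid
    segno = pageno // SLRU_PAGES_PER_SEGMENT   # which segment file holds the page
    return pageno, f"{segno:04X}"

print(clog_location(3250922107))   # -> (99210, '0C1C')
```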
Re:Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock
Hi,

Previously we only discussed the connection hang on the primary; the connection hang on the standby appears to be a separate problem. When the recovery process replays GIN's delete WAL record, it takes lwlocks in a different order than the select process does, resulting in a deadlock. According to the information collected from gcore, the details are as follows:

## select process

2246 (rootBlkno=2246)
  |
3254 (1. Held buffer=366260, LWLock=0x2d50e2e4) --rightlink--> 483 (2. Acquire buffer=2201739, LWLock=0x2aaab45158a4)

The ginStepRight() function in the select process takes lwlocks in left-to-right order.

## recovery process

2246 (2. Held buffer=7034160, LWLock=0x2aaac6c081e4, rootBlkno)
  |
3254 (3. Acquire buffer=366260, LWLock=0x2d50e2e4) --rightlink--> 483 (1. Held buffer=2201739, LWLock=0x2aaab45158a4)

But the ginRedoDeletePage() function in the recovery process takes lwlocks in the order current page, then parent, then left sibling. So I think the inconsistent lock order in ginRedoDeletePage() is the reason for the hang on the standby.
static void
ginRedoDeletePage(XLogReaderState *record)
{
	XLogRecPtr	lsn = record->EndRecPtr;
	ginxlogDeletePage *data = (ginxlogDeletePage *) XLogRecGetData(record);
	Buffer		dbuffer;
	Buffer		pbuffer;
	Buffer		lbuffer;
	Page		page;

	if (XLogReadBufferForRedo(record, 0, &dbuffer) == BLK_NEEDS_REDO)
	{
		page = BufferGetPage(dbuffer);
		Assert(GinPageIsData(page));
		GinPageGetOpaque(page)->flags = GIN_DELETED;
		PageSetLSN(page, lsn);
		MarkBufferDirty(dbuffer);
	}

	if (XLogReadBufferForRedo(record, 1, &pbuffer) == BLK_NEEDS_REDO)
	{
		page = BufferGetPage(pbuffer);
		Assert(GinPageIsData(page));
		Assert(!GinPageIsLeaf(page));
		GinPageDeletePostingItem(page, data->parentOffset);
		PageSetLSN(page, lsn);
		MarkBufferDirty(pbuffer);
	}

	if (XLogReadBufferForRedo(record, 2, &lbuffer) == BLK_NEEDS_REDO)
	{
		page = BufferGetPage(lbuffer);
		Assert(GinPageIsData(page));
		GinPageGetOpaque(page)->rightlink = data->rightLink;
		PageSetLSN(page, lsn);
		MarkBufferDirty(lbuffer);
	}

	if (BufferIsValid(lbuffer))
		UnlockReleaseBuffer(lbuffer);
	if (BufferIsValid(pbuffer))
		UnlockReleaseBuffer(pbuffer);
	if (BufferIsValid(dbuffer))
		UnlockReleaseBuffer(dbuffer);
}

Perhaps the order in which ginRedoDeletePage() takes the lwlocks should be changed from "dbuffer->pbuffer->lbuffer" to "lbuffer->dbuffer->pbuffer". Is this right?

## How to reproduce this issue

1. Modify ginxlog.c and rebuild (add "sleep(60)"):

	if (XLogReadBufferForRedo(record, 1, &pbuffer) == BLK_NEEDS_REDO)
	{
		page = BufferGetPage(pbuffer);
		Assert(GinPageIsData(page));
		Assert(!GinPageIsLeaf(page));
		GinPageDeletePostingItem(page, data->parentOffset);
		PageSetLSN(page, lsn);
		MarkBufferDirty(pbuffer);
	}

==>

	if (XLogReadBufferForRedo(record, 1, &pbuffer) == BLK_NEEDS_REDO)
	{
		page = BufferGetPage(pbuffer);
		Assert(GinPageIsData(page));
		Assert(!GinPageIsLeaf(page));
		GinPageDeletePostingItem(page, data->parentOffset);
		PageSetLSN(page, lsn);
		MarkBufferDirty(pbuffer);
		sleep(60);//add for debug
	}

2.
Run test SQL on the primary:

create table tb1(id int);
create index ON tb1 using gin(id);
insert into tb1 select 1 from generate_series(1,100)id;
delete from tb1;

3. Check that the recovery process on the standby has entered the "sleep()" branch:

$ ps -ef|grep reco
postgres 13418 13417  0 22:23 ?        00:00:00 postgres: startup process   recovering 0001005E
postgres 13425 31505  0 22:23 pts/8    00:00:00 grep --color=auto reco

$ pstack 13418
#0 0x7f2166d39650 in __nanosleep_nocancel () from /lib64/libc.so.6
#1 0x7f2166d39504 in sleep () from /lib64/libc.so.6
#2 0x0048614f in ginRedoDeletePage (record=0x127cbe8) at ginxlog.c:480
#3 gin_redo (record=0x127cbe8) at ginxlog.c:732
#4 0x004efec3 in StartupXLOG ()
#5 0x00697c51 in StartupProcessMain ()
#6 0x004fd22a in AuxiliaryProcessMain ()
#7 0x00694e49 in StartChildProcess ()
#8 0x00697665 in PostmasterMain ()
#9 0x004766e1 in main ()

4. Execute the select SQL:

set enable_seqscan = false;
select count(*) from tb1 where id =2;

5. Check the result: the recovery process blocks on an LWLock.

$ pstack 13418
#0 0x7f216775779b in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
#1 0x7f216775782f in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2 0x7f21677578cb in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#3 0x00685df2 in PGSemaphoreLock ()
#4 0x006edd64 in LWLockAcquire ()
#5 0x004fa66a in XLogReadBufferForRedoExtended ()
#6 0x00486161 in ginRedoDeletePage (record=0x127cbe8) at ginxlog.c:483
#7 gin_redo (record=0x127cbe8) at ginxlog.c:732
#8 0x004efec3 in StartupXLOG ()
#9 0x00697c51 in StartupProcessMain ()
#10 0x004fd22a in AuxiliaryProcessMain ()
#11 0x000
Re:Connections hang indefinitely while taking a gin index's LWLock buffer_content lock
Hi all,

I analyzed the btree blocks where the lwlock deadlock occurred, as follows:

## insert process (ginInsertValue())

644 (root blkno)
  |
7054 (2. held LWLock:0x2aaac587ae64) --rightlink--> (3. Acquire LWLock:0x2aaab4009564, buffer=2119038, blkno should be 9954)
  |
701 (1. held LWLock:0x2aaab670dfe4)

The ginInsertValue() function above takes lwlocks in the order described in the README.

src/backend/access/gin/README
---
To avoid deadlocks, B-tree pages must always be locked in the same order: left to right, and bottom to top.
...
---

## autovacuum process (ginScanToDelete())

644 (root blkno)
  |
...
9954 (1. held LWLock:0x2aaab4009564)
...
  |
701 (2. Acquire LWLock:0x2aaab670dfe4)

Note: according to autovacuum's core, 701's parent is 9954; while the insert's core shows 701's parent is 7054, and the rightlink of 7054 is 9954!

However, ginScanToDelete() depth-first scans the btree and takes the EXCLUSIVE lock top to bottom, which creates the deadlock. Is the above statement correct? If so, this deadlock should be easy to hit.

static bool
ginScanToDelete(GinVacuumState *gvs, BlockNumber blkno, bool isRoot,
				DataPageDeleteStack *parent, OffsetNumber myoff)
{
	...
	if (!isRoot)
		LockBuffer(buffer, GIN_EXCLUSIVE);
	...
	for (i = FirstOffsetNumber; i <= GinPageGetOpaque(page)->maxoff; i++)
	{
		PostingItem *pitem = GinDataPageGetPostingItem(page, i);

		if (ginScanToDelete(gvs, PostingItemGetBlockNumber(pitem), FALSE, me, i))
			i--;
	}
	...
}

src/backend/access/gin/README
---
The previous paragraph's reasoning only applies to searches, and only to posting trees. To protect from inserters following a downlink to a deleted page, vacuum simply locks out all concurrent insertions to the posting tree, by holding a super-exclusive lock on the parent page of subtree with deletable pages.
...
posting tree. To exclude interference with readers vacuum takes exclusive locks in a depth-first scan in left-to-right order of page tuples.
...
---

# stacks

## stack of insert: acquiring lock 0x2aaab4009564 (blkno 9954) while holding 0x2aaab670dfe4 (blkno 701) and 0x2aaac587ae64 (blkno 7054)

(gdb) bt
#0 0x7fe11552379b in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
#1 0x7fe11552382f in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2 0x7fe1155238cb in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#3 0x0069d362 in PGSemaphoreLock (sema=0x2ac02958) at pg_sema.c:310
#4 0x007095ac in LWLockAcquire (lock=0x2aaab4009564, mode=LW_EXCLUSIVE) at lwlock.c:1233
#5 0x00490a32 in ginStepRight (buffer=6713826, index=, lockmode=lockmode@entry=2) at ginbtree.c:174
#6 0x00490c1c in ginFinishSplit (btree=btree@entry=0x7ffd81e4f950, stack=0x28eebf8, freestack=freestack@entry=1 '\001', buildStats=buildStats@entry=0x0) at ginbtree.c:701
#7 0x00491396 in ginInsertValue (btree=btree@entry=0x7ffd81e4f950, stack=, insertdata=insertdata@entry=0x7ffd81e4f940, buildStats=buildStats@entry=0x0) at ginbtree.c:773
#8 0x0048fb42 in ginInsertItemPointers (index=, rootBlkno=rootBlkno@entry=644, items=items@entry=0x2916598, nitem=nitem@entry=353, buildStats=buildStats@entry=0x0) at gindatapage.c:1907
#9 0x0048a9ea in ginEntryInsert (ginstate=ginstate@entry=0x28e7ef8, attnum=, key=42920440, category=, items=0x2916598, nitem=353, buildStats=buildStats@entry=0x0) at gininsert.c:214
#10 0x00496d94 in ginInsertCleanup (ginstate=ginstate@entry=0x28e7ef8, full_clean=full_clean@entry=0 '\000', fill_fsm=fill_fsm@entry=1 '\001', forceCleanup=forceCleanup@entry=0 '\000', stats=stats@entry=0x0) at ginfast.c:883
#11 0x00497727 in ginHeapTupleFastInsert (ginstate=ginstate@entry=0x28e7ef8, collector=collector@entry=0x7ffd81e4fe60) at ginfast.c:448
#12 0x0048b209 in gininsert (index=, values=0x7ffd81e4ff40, isnull=0x7ffd81e50040 "", ht_ctid=0x280d98c, heapRel=, checkUnique=, indexInfo=0x28b5aa8) at gininsert.c:522
#13 0x005ee8dd in ExecInsertIndexTuples (slot=slot@entry=0x28b5d58, tupleid=tupleid@entry=0x280d98c, estate=estate@entry=0x28b5288, noDupErr=noDupErr@entry=1 '\001', specConflict=specConflict@entry=0x7ffd81e5013b "", arbiterIndexes=arbiterIndexes@entry=0x28c6dd8) at execIndexing.c:386
#14 0x0060ccf5 in ExecInsert (canSetTag=1 '\001', estate=0x28b5288, onconflict=ONCONFLICT_UPDATE, arbiterIndexes=0x28c6dd8, planSlot=0x28b5d58, slot=0x28b5d58, mtstate=0x28b5628) at nodeModifyTable.c:564
#15 ExecModifyTab
Connections hang indefinitely while taking a gin index's LWLock buffer_content lock
Hi all,

In our PostgreSQL 10.2 database, two sessions (an insert and an autovacuum of a gin index) blocked on LWLock:buffer_content. This blocked checkpoints and dead tuple recycling, and we had to restart the database. According to the information collected from gcore, a deadlock occurred when acquiring the buffer lock.

This system is a citus 7.2.1 cluster. It has been running for more than a year, and this problem had not happened before. On 10/19 the cluster was expanded from 8 workers to 16 workers; I don't know whether the issue has anything to do with the expansion. The problem happened on the coordinator node of citus, but the table is a normal table (not a partition table), and the load of this coordinator node has always been heavy. After 10/25 this issue occurred three times in 4 days, but we cannot reproduce the problem in a test environment. The 3 faults are stuck on 3 different tables, but the stack is the same.

## stack of insert: acquiring lock 0x2aaab4009564 while holding 0x2aaab670dfe4 and 0x2aaac587ae64

(gdb) bt
#0 0x7fe11552379b in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
#1 0x7fe11552382f in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2 0x7fe1155238cb in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#3 0x0069d362 in PGSemaphoreLock (sema=0x2ac02958) at pg_sema.c:310
#4 0x007095ac in LWLockAcquire (lock=0x2aaab4009564, mode=LW_EXCLUSIVE) at lwlock.c:1233
#5 0x00490a32 in ginStepRight (buffer=6713826, index=, lockmode=lockmode@entry=2) at ginbtree.c:174
#6 0x00490c1c in ginFinishSplit (btree=btree@entry=0x7ffd81e4f950, stack=0x28eebf8, freestack=freestack@entry=1 '\001', buildStats=buildStats@entry=0x0) at ginbtree.c:701
#7 0x00491396 in ginInsertValue (btree=btree@entry=0x7ffd81e4f950, stack=, insertdata=insertdata@entry=0x7ffd81e4f940, buildStats=buildStats@entry=0x0) at ginbtree.c:773
#8 0x0048fb42 in ginInsertItemPointers (index=, rootBlkno=rootBlkno@entry=644, items=items@entry=0x2916598, nitem=nitem@entry=353, buildStats=buildStats@entry=0x0) at gindatapage.c:1907
#9 0x0048a9ea in ginEntryInsert (ginstate=ginstate@entry=0x28e7ef8, attnum=, key=42920440, category=, items=0x2916598, nitem=353, buildStats=buildStats@entry=0x0) at gininsert.c:214
#10 0x00496d94 in ginInsertCleanup (ginstate=ginstate@entry=0x28e7ef8, full_clean=full_clean@entry=0 '\000', fill_fsm=fill_fsm@entry=1 '\001', forceCleanup=forceCleanup@entry=0 '\000', stats=stats@entry=0x0) at ginfast.c:883
#11 0x00497727 in ginHeapTupleFastInsert (ginstate=ginstate@entry=0x28e7ef8, collector=collector@entry=0x7ffd81e4fe60) at ginfast.c:448
#12 0x0048b209 in gininsert (index=, values=0x7ffd81e4ff40, isnull=0x7ffd81e50040 "", ht_ctid=0x280d98c, heapRel=, checkUnique=, indexInfo=0x28b5aa8) at gininsert.c:522
#13 0x005ee8dd in ExecInsertIndexTuples (slot=slot@entry=0x28b5d58, tupleid=tupleid@entry=0x280d98c, estate=estate@entry=0x28b5288, noDupErr=noDupErr@entry=1 '\001', specConflict=specConflict@entry=0x7ffd81e5013b "", arbiterIndexes=arbiterIndexes@entry=0x28c6dd8) at execIndexing.c:386
#14 0x0060ccf5 in ExecInsert (canSetTag=1 '\001', estate=0x28b5288, onconflict=ONCONFLICT_UPDATE, arbiterIndexes=0x28c6dd8, planSlot=0x28b5d58, slot=0x28b5d58, mtstate=0x28b5628) at nodeModifyTable.c:564
#15 ExecModifyTable (pstate=0x28b5628) at nodeModifyTable.c:1766
#16 0x005ef612 in standard_ExecutorRun (queryDesc=, direction=, count=, execute_once=) at ../../../src/include/executor/executor.h:250
#17 0x7fe10607d15d in pgss_ExecutorRun (queryDesc=0x28cc1d8, direction=ForwardScanDirection, count=0, execute_once=) at pg_stat_statements.c:889
#18 0x0071a7ba in ProcessQuery (plan=, sourceText=0x2849058 "INSERT INTO bi_dm.tdm_sncity_workorder_analytic_statistics (CITYCODE,CITYNAME,DISTRICT,DISTRICT_NAME,ZL_WERKS,ZL_WERKS_NAME,AREA_CD,AREA_NM,ZSIZE,DEPT_CD,TMALL_FLG,ORDER_DATE,DISTRIBUTE,WORKORDER,FNSH"..., params=0x28b2260, queryEnv=0x0, dest=0xc9e2a0 , completionTag=0x7ffd81e507b0 "") at pquery.c:161
#19 0x0071a9f7 in PortalRunMulti (portal=portal@entry=0x289fdd8, isTopLevel=isTopLevel@entry=1 '\001', setHoldSnapshot=setHoldSnapshot@entry=0 '\000', dest=0xc9e2a0 , dest@entry=0x284d778, altdest=0xc9e2a0 , altdest@entry=0x284d778, completionTag=completionTag@entry=0x7ffd81e507b0 "") at pquery.c:1286
#20 0x0071b535 in PortalRun (portal=, count=1, isTopLevel=, run_once=, dest=0x284d778, altdest=0x284d778, completionTag=0x7ffd81e507b0 "") at pquery.c:799
#21 0x00718f84 in PostgresMain (argc=, argv=, dbname=, username=) at postgres.c:1984
#22 0x0047ad3c in Backe
When using the prepared protocol, a transaction will hold backend_xmin until the end of the transaction.
Hi, hackers!

When executing SQL with the prepared protocol, a read committed transaction holds backend_xmin until the end of the transaction. Is this behavior normal? Shouldn't a read committed transaction release backend_xmin immediately after the SQL finishes executing, just as it does when executing SQL with the simple protocol?

# reproduction

## env
- PostgreSQL 9.2
- CentOS 7.2

## test script
$ cat test.sql
begin;
select 1;
\sleep 1000s

## execute with simple protocol
$ pgbench -n -t 1 -f test.sql "service=admin"

postgres=# select * from pg_stat_activity where query='select 1;';
-[ RECORD 1 ]----+------------------------------
datid            | 13805
datname          | postgres
pid              | 19641
usesysid         | 16388
usename          | admin
application_name | pgbench
client_addr      |
client_hostname  |
client_port      | -1
backend_start    | 2018-07-04 13:27:10.62635+08
xact_start       | 2018-07-04 13:27:10.629609+08
query_start      | 2018-07-04 13:27:10.629845+08
state_change     | 2018-07-04 13:27:10.63035+08
wait_event_type  | Client
wait_event       | ClientRead
state            | idle in transaction
backend_xid      |
backend_xmin     |
query            | select 1;
backend_type     | client backend

## execute with prepared protocol
$ pgbench -n -t 1 -f test.sql "service=admin" -M prepared

postgres=# select * from pg_stat_activity where query='select 1;';
-[ RECORD 1 ]----+------------------------------
datid            | 13805
datname          | postgres
pid              | 19662
usesysid         | 16388
usename          | admin
application_name | pgbench
client_addr      |
client_hostname  |
client_port      | -1
backend_start    | 2018-07-04 13:27:46.637134+08
xact_start       | 2018-07-04 13:27:46.641348+08
query_start      | 2018-07-04 13:27:46.64174+08
state_change     | 2018-07-04 13:27:46.641778+08
wait_event_type  | Client
wait_event       | ClientRead
state            | idle in transaction
backend_xid      |
backend_xmin     | 3930269815
query            | select 1;
backend_type     | client backend

backend_xmin affects dead tuple removal:

postgres=# create table tbchj(id int);
CREATE TABLE
postgres=# insert into tbchj values(1);
INSERT 0 1
postgres=# delete from tbchj;
DELETE 1
postgres=# vacuum VERBOSE tbchj;
INFO: vacuuming "public.tbchj"
INFO: "tbchj": found 0 removable, 1 nonremovable row versions in 1 out of 1 pages
DETAIL: 1 dead row versions cannot be removed yet, oldest xmin: 3930269815
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
VACUUM

Regards
Chen Huajun
Re:Re: [HACKERS] [PATCH]make pg_rewind to not copy useless WAL files
At 2018-01-23 09:56:48, "Stephen Frost" wrote:
>
>I've only read through the thread to try and understand what's going on
>and the first thing that comes to mind is that you're changing
>pg_rewind to not remove the WAL from before the divergence (split)
>point, but I'm not sure why. As noted, that WAL isn't needed for
>anything (it's from before the split, after all), so why keep it? Is
>there something in this optimization that depends on the old WAL being
>there and, if so, what and why?

After running pg_rewind, the first startup of postgres will do crash recovery, and crash recovery will begin from the last redo point preceding the divergence. So the WAL after that redo point and before the divergence is needed. Of course, the WAL before the redo point is not needed, but in my view, recycling unwanted WAL does not have to be done by pg_rewind.

Reference code:

pg_rewind.c:

main(int argc, char **argv)
{
	...
	findLastCheckpoint(datadir_target, divergerec, lastcommontliIndex,
					   &chkptrec, &chkpttli, &chkptredo);
	...
	createBackupLabel(chkptredo, chkpttli, chkptrec);
	...
}
...
createBackupLabel(XLogRecPtr startpoint, TimeLineID starttli, XLogRecPtr checkpointloc)
{
	...
	len = snprintf(buf, sizeof(buf),
				   "START WAL LOCATION: %X/%X (file %s)\n"
				   "CHECKPOINT LOCATION: %X/%X\n"
				   "BACKUP METHOD: pg_rewind\n"
				   "BACKUP FROM: standby\n"
				   "START TIME: %s\n",
				   /* omit LABEL: line */
				   (uint32) (startpoint >> 32), (uint32) startpoint, xlogfilename,
				   (uint32) (checkpointloc >> 32), (uint32) checkpointloc,
				   strfbuf);
	...
}

>That's also different from how pg_basebackup works, which I don't think
>is good (seems like pg_rewind should operate in a pretty similar manner
>to pg_basebackup).
>

Thanks for your comments! I also considered copying WAL the way pg_basebackup does, but an implementation similar to pg_basebackup's may not be so simple. Two rounds of file transfer from source to target might be needed: first for data files, and another for WAL. And the WAL that contains the last redo point preceding the divergence may exist only on the target server, having already been recycled on the source. That is a difference between pg_rewind and pg_basebackup.

Regards,
Chen Huajun
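Incidentally, the WAL segment that holds a given redo point is easy to name from its LSN. The following sketch mirrors the XLogFileName() formula, assuming the default 16 MB segment size; the timeline and LSN used below are made-up examples, not values from this thread:

```python
# Derive a WAL segment file name (TTTTTTTTXXXXXXXXYYYYYYYY) from an LSN,
# assuming the default 16 MB wal_segment_size.
WAL_SEG_SIZE = 16 * 1024 * 1024
SEGS_PER_XLOGID = 0x100000000 // WAL_SEG_SIZE  # 256 segments per 4 GB xlogid

def wal_file_name(tli: int, lsn: int) -> str:
    segno = lsn // WAL_SEG_SIZE
    return f"{tli:08X}{segno // SEGS_PER_XLOGID:08X}{segno % SEGS_PER_XLOGID:08X}"

# Hypothetical redo point at LSN 5E/02000000 on timeline 1:
print(wal_file_name(1, 0x5E02000000))   # -> 000000010000005E00000002
```

If that segment has already been recycled on the source, only the target's copy can supply it, which is why pg_rewind cannot simply fetch all needed WAL from the source the way pg_basebackup does.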
Re:Re: [HACKERS] [PATCH]make pg_rewind to not copy useless WAL files
At 2017-12-01 12:27:09, "Michael Paquier" wrote:
>On Tue, Oct 3, 2017 at 1:20 AM, chenhj wrote:
>> I had filled the authors field of this patch in commitfest, and will rebase
>> this patch if needed. Thank you for your help!
>
>The documentation of the patch needs a rebase, so I am moving it to
>next CF with "waiting on author" as status.
>
>$ git diff master --check
>src/bin/pg_rewind/pg_rewind.c:292: trailing whitespace.
>+*
>There are whitespace complains.
>--
>Michael

Rebased and removed the whitespace.

regards
Chen Huajun

pg_rewind_wal_copy_reduce_v9.patch
Description: Binary data