Re: [Proposal] Page Compression for OLTP

2021-02-16 Thread chenhj
At 2021-02-16 21:51:14, "Daniel Gustafsson"  wrote:

>> On 16 Feb 2021, at 15:45, chenhj  wrote:
>
>> I want to know whether this patch can be accepted by the community, that is, 
>> whether it is necessary for me to continue working on this patch. 
>> If you have any suggestions, please give me feedback.
>
>It doesn't seem like the patch has been registered in the commitfest app so it
>may have been forgotten about, the number of proposed patches often outnumber
>the code review bandwidth.  Please register it at:
>
>   https://commitfest.postgresql.org/32/
>
>..to make sure it doesn't get lost.
>
>--

>Daniel Gustafsson  https://vmware.com/


Thanks, I will complete this patch and register it later.
Chen Huajun

Re: [Proposal] Page Compression for OLTP

2021-02-16 Thread chenhj
Hi, hackers


I want to know whether this patch can be accepted by the community, that is, 
whether it is necessary for me to continue working on this patch. 
If you have any suggestions, please give me feedback.


Best Regards
Chen Huajun



Re: [Proposal] Page Compression for OLTP

2020-06-05 Thread chenhj
Hi hackers,


> # Page storage(Plan C)
>
> Further, since the size of the compress address file is fixed, the above 
> address file and data file can also be combined into one file
>
>    0       1       2            123071     0         1         2
>  +=======+=======+=======+ ... +=========+=========+=========+=====+ ...
>  | head  |   1   |   2   | ...           | data1   | data2   | ...
>  +=======+=======+=======+ ... +=========+=========+=========+=====+ ...
>  | head  |           address             |             data           |

I made a prototype according to the above storage method. Any suggestions are 
welcome.

# Page compress file storage related definitions

/*
 * layout of Page Compress file:
 *
 * - PageCompressHeader
 * - PageCompressAddr[]
 * - chunks of PageCompressData
 */
typedef struct PageCompressHeader
{
    pg_atomic_uint32    nblocks;            /* number of total blocks in this segment */
    pg_atomic_uint32    allocated_chunks;   /* number of total allocated chunks in data area */
    uint16              chunk_size;         /* size of each chunk, must be 1/2, 1/4 or 1/8 of BLCKSZ */
    uint8               algorithm;          /* compress algorithm, 1=pglz, 2=lz4 */
} PageCompressHeader;

typedef struct PageCompressAddr
{
    uint8               nchunks;            /* number of chunks for this block */
    uint8               allocated_chunks;   /* number of allocated chunks for this block */

    /* variable-length field: 1-based chunk numbers for this block; the size of the array must be 2, 4 or 8 */
    pc_chunk_number_t   chunknos[FLEXIBLE_ARRAY_MEMBER];
} PageCompressAddr;

typedef struct PageCompressData
{
    char    page_header[SizeOfPageHeaderData];  /* page header */
    uint32  size;                               /* size of compressed data */
    char    data[FLEXIBLE_ARRAY_MEMBER];        /* compressed page, except for the page header */
} PageCompressData;
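
To make the combined layout above concrete, here is a minimal standalone sketch of the arithmetic it implies (my own illustration, not code from the patch: PC_HEADER_SIZE, PC_ADDR_AREA_SIZE and the helper names are assumed values and names):

/* Sketch only: how a compressed page's chunk count and a 1-based chunk
 * number's file offset could be derived under the layout
 * header -> PageCompressAddr[] -> chunk data area.
 * The two size constants are illustrative assumptions, not the patch's values. */
#include <stdio.h>
#include <stdint.h>

#define BLCKSZ            8192
#define PC_HEADER_SIZE    128            /* assumed space reserved for PageCompressHeader */
#define PC_ADDR_AREA_SIZE (512 * 1024)   /* assumed fixed size of the address area */

/* number of chunks needed for a page compressed to compressed_size bytes */
static uint32_t
chunks_needed(uint32_t compressed_size, uint32_t chunk_size)
{
    return (compressed_size + chunk_size - 1) / chunk_size;
}

/* file offset of 1-based chunk number chunkno in the data area */
static uint64_t
chunk_offset(uint32_t chunkno, uint32_t chunk_size)
{
    return (uint64_t) PC_HEADER_SIZE + PC_ADDR_AREA_SIZE +
           (uint64_t) (chunkno - 1) * chunk_size;
}

int
main(void)
{
    uint32_t chunk_size = BLCKSZ / 8;    /* 1024, as in the samples below */

    /* a page compressed to 2600 bytes needs 3 chunks of 1024 bytes */
    printf("chunks needed: %u\n", chunks_needed(2600, chunk_size));
    /* chunk 5 sits right after the header, the address area and 4 earlier chunks */
    printf("offset of chunk 5: %llu\n", (unsigned long long) chunk_offset(5, chunk_size));
    return 0;
}

Because the address area has a fixed size, locating any chunk needs no extra indirection beyond the per-block PageCompressAddr entry, which is what allows the address file and data file to be merged into one file as described in the quoted Plan C.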


# Usage

Set whether to use compression through the storage parameters of tables and indexes:

- compress_type
  Sets whether to compress and which compression algorithm to use. Supported values: none, pglz, zstd

- compress_chunk_size
  A chunk is the smallest unit of storage space allocated for compressed pages.
  The chunk size can only be 1/2, 1/4 or 1/8 of BLCKSZ.

- compress_prealloc_chunks
  The number of chunks pre-allocated for each page. The maximum value allowed is BLCKSZ/compress_chunk_size - 1.
  If the number of chunks required for a compressed page is less than `compress_prealloc_chunks`,
  then `compress_prealloc_chunks` chunks are allocated, to avoid future storage fragmentation
  when the page needs more storage space.


# Sample

## requirement

- zstd

## build

./configure --with-zstd
make
make install

## create compressed table and index

create table tb1(id int,c1 text);
create table tb1_zstd(id int,c1 text) with(compress_type=zstd,compress_chunk_size=1024);
create table tb1_zstd_4(id int,c1 text) with(compress_type=zstd,compress_chunk_size=1024,compress_prealloc_chunks=4);

create index tb1_idx_id on tb1(id);
create index tb1_idx_id_zstd on tb1(id) with(compress_type=zstd,compress_chunk_size=1024);
create index tb1_idx_id_zstd_4 on tb1(id) with(compress_type=zstd,compress_chunk_size=1024,compress_prealloc_chunks=4);

create index tb1_idx_c1 on tb1(c1);
create index tb1_idx_c1_zstd on tb1(c1) with(compress_type=zstd,compress_chunk_size=1024);
create index tb1_idx_c1_zstd_4 on tb1(c1) with(compress_type=zstd,compress_chunk_size=1024,compress_prealloc_chunks=4);

insert into tb1 select generate_series(1,1000000),md5(random()::text);
insert into tb1_zstd select generate_series(1,1000000),md5(random()::text);
insert into tb1_zstd_4 select generate_series(1,1000000),md5(random()::text);

## show size of table and index

postgres=# \d+
                                 List of relations
 Schema |    Name    | Type  |  Owner   | Persistence | Size  | Description
--------+------------+-------+----------+-------------+-------+-------------
 public | tb1        | table | postgres | permanent   | 65 MB |
 public | tb1_zstd   | table | postgres | permanent   | 37 MB |
 public | tb1_zstd_4 | table | postgres | permanent   | 37 MB |
(3 rows)

postgres=# \di+
                                         List of relations
 Schema |       Name        | Type  |  Owner   | Table | Persistence | Size  | Description
--------+-------------------+-------+----------+-------+-------------+-------+-------------
 public | tb1_idx_c1        | index | postgres | tb1   | permanent   | 73 MB |
 public | tb1_idx_c1_zstd   | index | postgres | tb1   | permanent   | 36 MB |
 public | tb1_idx_c1_zstd_4 | index | postgres | tb1   | permanent   | 41 MB |
 public | tb1_idx_id        | index | postgres | tb1   | permanent   | 21 MB |
 public | tb1_idx_id_zstd   | index | postgres | tb1   | permanent   | 13 MB |
 public | tb1_idx_id_zstd_4 | index | postgres | tb1   | permanent   | 15 MB |
(6 rows)


# pgbench performance testing (TPC-B)

Compre

Re: [Proposal] Page Compression for OLTP

2020-05-21 Thread chenhj
Sorry, there may be a problem with the display format of the previous mail, so I am resending it.


At 2020-05-21 15:04:55, "Fabien COELHO"  wrote:

>
>Hello,
>
>My 0.02, some of which may just show some misunderstanding on my part:
>
>  - Could this be proposed as some kind of extension, provided that enough
>hooks are available? ISTM that foreign tables and/or alternative
>storage engine (aka ACCESS METHOD) provide convenient APIs which could
>fit the need for these? Or are they not appropriate? You seem to
>suggest that there are not.
>
>If not, what could be done to improve API to allow what you are seeking
>to do? Maybe you need a somehow lower-level programmable API which does
>not exist already, or at least is not exported already, but could be
>specified and implemented with limited effort? Basically you would like
>to read/write pg pages to somewhere, and then there is the syncing
>issue to consider. Maybe such a "page storage" API could provide
>benefit for some specialized hardware, eg persistent memory stores,
>so there would be more reason to define it anyway? I think it might
>be valuable to give it some thoughts.

Thank you for giving so many comments.
In my opinion, developing a foreign table or a new storage engine would require a lot of
extra work beyond the compression itself.
A similar explanation was mentioned in Nikolay P's email.

The "page storage" API may be a good choice, and I will consider it, but I have not yet
figured out how to implement it.

>  - Could you maybe elaborate on how your plan differs from [4] and [5]?

My solution is similar to CFS; it is also embedded in the file access layer (fd.c, md.c)
and implements the mapping from a block number to the file and location where the
compressed data is stored.

However, the most important difference is that I hope to avoid the need for GC through
the design of the page layout.

https://www.postgresql.org/message-id/flat/11996861554042351%40iva4-dd95b404a60b.qloud-c.yandex.net

>> The most difficult thing in CFS development is certainly
>> defragmentation. In CFS it is done using background garbage collection,
>> by one or one
>> GC worker processes. The main challenges were to minimize its
>> interaction with normal work of the system, make it fault tolerant and
>> prevent unlimited growth of data segments.

>> CFS is not introducing its own storage manager, it is mostly embedded in
>> existed Postgres file access layer (fd.c, md.c). It allows to reused
>> code responsible for mapping relations and file descriptors cache. As it
>> was recently discussed in hackers, it may be good idea to separate the
>> questions "how to map blocks to filenames and offsets" and "how to
>> actually perform IO". In this it will be easier to implement compressed
>> storage manager.


>  - Have you consider keeping page headers and compressing tuple data
>only?

In that case, we must add some additional information to the page header to identify
whether a page is compressed or uncompressed.
When a compressed page becomes an uncompressed page, or vice versa, the original page
header must be modified.
This is unacceptable because it requires modifying the shared buffer and recalculating
the checksum.

However, it should be feasible to put this flag in the compressed address file.
The problem is that even if a page occupies only one compressed block, the address file
still needs to be read, so reads go from 1 I/O to 2 I/Os.
Since the address file is very small, accessing it is basically a memory access, so this
cost may not be as large as I had imagined.

>  - I'm not sure there is a point in going below the underlying file
>system blocksize, quite often 4 KiB? Or maybe yes? Or is there
>a benefit to aim at 1/4 even if most pages overflow?

My solution is mainly optimized for scenarios where the original page can be compressed
into a single compressed block of storage.
Storing the original page in multiple compressed blocks is suitable for workloads that
are not particularly sensitive to performance but care more about the compression ratio,
such as cold data.

In addition, users can also choose to compile PostgreSQL with a 16KB or 32KB BLCKSZ.

>  - ISTM that your approach entails 3 "files". Could it be done with 2?
>I'd suggest that the possible overflow pointers (coa) could be part of
>the headers so that when reading the 3.1 page, then the header would
>tell where to find the overflow 3.2, without requiring an additional
>independent structure with very small data in it, most of it zeros.
>Possibly this is not possible, because it would require some available
>space in standard headers when the is page is not compressible, and
>there is not enough. 

[Proposal] Page Compression for OLTP

2020-05-20 Thread chenhj
Hello hackers!

This email is a proposal to implement OLTP-oriented compression in PostgreSQL.

There are currently some alternative ways to achieve compression, such as using a file
system that supports compression. However, that depends on a specific file system, is
not suitable for all deployment environments, and also increases the complexity of
deployment and maintenance.

I hope this compression work can meet the following goals:
1. In most scenarios, the compressed size of the table can be lower than 50% of the original table.
2. Mainly oriented to OLTP scenarios; the performance impact on workloads with frequent reads and writes is relatively small.
3. Does not rely on special external software or hardware that is difficult to obtain.
4. Friendly to application developers and database administrators.
5. The changes required to PostgreSQL are small.


I have noticed that there has been some discussion and work related to compression
before, but it does not meet the above goals. For example:
1. Use sparse files [1]
   This is also the implementation method of MySQL 5.7's transparent page compression.
However, sparse files may generate a lot of fragmentation inside the file system, and
"compressed" data files stored as sparse files will be restored to their original size
after physical backup and restore, unless our backup and recovery tools also support
sparse files.
2. Use TAM (table access method interface) (pg_cryogen, zedstore) [2][3]
   Neither storage engine is geared towards OLTP scenarios. It is best to make
relatively small modifications to the existing code of the heap engine (mainly by
modifying md.c and fd.c).


The methods proposed by Postgres Pro Enterprise CFS [4] and Nikolay P [5] are close to
my needs. However, I would like to propose a slightly different implementation plan,
which does not require space reclamation. I hope to get your suggestions.

# Premise assumptions
1. Most of the pages in the compressed table can be compressed to less than 50% of the original size.
   As long as an algorithm with a relatively high compression ratio (such as zlib or zstd) is used,
this first point should be easy to meet, unless the table stores already-compressed data, such as pictures.
2. The compression ratio of most pages in the same table is relatively close.


# Page storage

Each segment of each main fork gets 3 files for storing compressed data. The main fork
segment file (for example: 123456.2) still exists, but its size is 0.

- Compressed data file (for example: 123456.2.cd)
   Used to store the compressed pages. The block size of this file is configurable at
the table level, but it can only be 1/2, 1/4 or 1/8 of BLCKSZ.
- Compress overflow address file (for example: 123456.2.coa)
   When a page cannot be compressed to less than the compressed block size, this file
is used to store the address of the overflow block.
- Compress overflow data file (for example: 123456.2.cod)
   When a page cannot be compressed to less than the compressed block size, this file
is used to store the overflow block.

The following is an example when the compressed block size is 4KB, which is 1/2 of BLCKSZ.

## Scenario 1: The compressed size of the original page (including the header of the compressed page) is less than or equal to the compressed block size (4KB)

Compressed data file (123456.2.cd):

   0       1       2
  +=======+=======+=======+
  | data0 | data1 | data2 | ...
  +=======+=======+=======+
->|  4K   |<-


## Scenario 2: The compressed size of the original page (including the header 
of the compressed page) is larger than the compressed block size (4KB)

If the compressed size of the original page (page 3 below) is greater than 4KB, 
it will not be compressed. The first 4KB of the original page is stored in the 
compressed data file, and the last 4KB is stored in the compress overflow data 
file.

Compressed data file (123456.2.cd):

   0       1       2       3
  +=======+=======+=======+=========+
  | data0 | data1 | data2 | data3_1 |
  +=======+=======+=======+=========+
                        ->| 1st 4K  |<-

Compress overflow address file (123456.2.coa):
The compress overflow address file stores, for each block, the block number of the
compress overflow block assigned to it + 1.
The size of the compressed block and the number of extended blocks of the compress
overflow data file are stored in the head of the compress overflow address file.

             0       1       2       3
  +=======+=======+=======+=======+=======+
  | head  |       |       |       |   1   |
  +=======+=======+=======+=======+===|===+
                                      |
Compress overflow data file           |
(123456.2.cod):                       |
       _______________________________|
      |
      |  0         1         2         3
  +===|=====+=========+=========+=========+
  | data3_2 |         |         |         |
  +=========+=========+=========+=========+
->| 2nd 4K   |<-


If the compressed block size is 1/4 or 1/8 of BLOCKS
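
Putting the two scenarios together, the read path for one 8KB page is: read its 4KB block from the .cd file, look up its entry in the .coa file, and, if the entry is non-zero, read block (entry - 1) from the .cod file as the second half of an uncompressed page. Below is a minimal sketch of that path (my own illustration, not the patch's code: the 4-byte entry width, the COA_HEAD constant and the function name are assumptions, and decompression/error handling are left to the caller):

/* Sketch only: read one heap block under the 3-file scheme described above,
 * with BLCKSZ = 8KB and a 4KB compressed block size.
 * Returns CBLKSZ if the caller got one compressed block to decompress,
 * BLCKSZ if it got the page uncompressed, or -1 on I/O error. */
#include <stdint.h>
#include <unistd.h>

#define BLCKSZ   8192
#define CBLKSZ   4096   /* compressed block size: 1/2 of BLCKSZ */
#define COA_HEAD 4096   /* assumption: the coa header occupies the first 4KB */

ssize_t
read_compressed_page(int cd_fd, int coa_fd, int cod_fd, uint32_t blkno, char *page)
{
    uint32_t overflow;   /* stored as overflow block number + 1; 0 means no overflow */

    /* the first (or only) compressed block always lives in the .cd file */
    if (pread(cd_fd, page, CBLKSZ, (off_t) blkno * CBLKSZ) != CBLKSZ)
        return -1;

    /* per-block entry in the .coa file, after its header */
    if (pread(coa_fd, &overflow, sizeof(overflow),
              COA_HEAD + (off_t) blkno * sizeof(overflow)) != sizeof(overflow))
        return -1;

    if (overflow == 0)
        return CBLKSZ;   /* scenario 1: caller decompresses these 4KB into a page */

    /* scenario 2: the page is stored uncompressed; its second half is in the .cod file */
    if (pread(cod_fd, page + CBLKSZ, CBLKSZ, (off_t) (overflow - 1) * CBLKSZ) != CBLKSZ)
        return -1;

    return BLCKSZ;       /* caller uses the 8KB page as-is */
}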

Re:could not access status of transaction

2020-01-05 Thread chenhj
Hi,



This database has not had the same failure again since 2019/09/16 (reported at
2019/09/29), so this is a very low-probability failure, but it is uncertain whether it
will happen again in the future. Here is some additional information about the incident
at 2019/09/16, which may be useful for analyzing the cause of this problem.




> Block 32291 
>  -
>  Block Offset: 0x0fc46000 Offsets: Lower 236 (0x00ec)
>  Block: Size 8192  Version4Upper2112 (0x0840)
>  LSN:  logid254 recoff 0xa3a598c8  Special  8192 (0x2000)
>  Items:   53  Free Space: 1876
>  Checksum: 0x8355  Prune XID: 0x02186566  Flags: 0x0001 (HAS_FREE_LINES)
>  Length (including item array): 236
> ...
>  Item   8 -- Length:  151  Offset: 4696 (0x1258)  Flags: NORMAL
>   XMIN: 35153480  XMAX: 35153545  CID|XVAC: 0
>   Block Id: 163363  linp Index: 9   Attributes: 20   Size: 32
>   infomask: 0x2503 
> (HASNULL|HASVARWIDTH|XMIN_COMMITTED|XMAX_COMMITTED|UPDATED|HOT_UPDATED|HEAP_ONLY)
>   t_bits: [0]: 0xff [1]: 0xff [2]: 0x07
> 
> 
> 
>  Item   9 -- Length:  151  Offset: 4544 (0x11c0)  Flags: NORMAL
>   XMIN: 35153545  XMAX: 0  CID|XVAC: 0
>   Block Id: 163363  linp Index: 9   Attributes: 20   Size: 32
>   infomask: 0x2803 (HASNULL|HASVARWIDTH|XMAX_INVALID|UPDATED|HEAP_ONLY)
>   t_bits: [0]: 0xff [1]: 0xff [2]: 0x07


According to the above information, the flags of the heap page (163363) containing the
problem tuple (163363, 9) are 0x0001 (HAS_FREE_LINES), that is, ALL_VISIBLE is not set.


However, according to the hexdump of the corresponding vm file, that block (location
9F88 + bit 6) has the VISIBILITYMAP_ALL_FROZEN and VISIBILITYMAP_ALL_VISIBLE flags set.
That is, the heap file and the vm file are inconsistent.


# vi 19548_vm.hexdump


000 0215  0858 857c 8cee  0018 2000
010 2000 2004      
020        
*
0002000 0215  1680 857c 3bb1  0018 2000
0002010 2000 2004      
0002020        
*
0004000 0215  20e8 857c 07f7  0018 2000
0004010 2000 2004      
0004020        
*
0006000 0215  3618 857c 4792  0018 2000
0006010 2000 2004      
0006020        
*
0008000 0215  17a8 8580 3d7e  0018 2000
0008010 2000 2004      
0008020        
*
000a000 0215  3558 8585 1239  0018 2000
000a010 2000 2004      
000a020        
*
000c000 0215  72a8 871b 1a23  0018 2000
000c010 2000 2004      
000c020        
...
000e000 0215  93d0 8794 506a  0018 2000
000e010 2000 2004   fc03   
000e020  f3ff cccf fffc   f3cf ff3f
...
000f6c0 3f0f 3303 c33f 00f0 00c3 0303 0003 
000f6d0        
*
001

Regards,
Chen Huajun


Connections hang indefinitely while taking a gin index's LWLock buffer_content lock(PG10.7)

2019-09-29 Thread chenhj
Hi, all


In our PostgreSQL 10.7 (RHEL 6.3) database, an autovacuum process and many insert
processes were blocked on a gin index's LWLock:buffer_content for a long time.


In other words, the following gin index lwlock deadlock phenomenon has occurred again.
Since the following bug had already been fixed in 10.7, this should be a new bug.


https://www.postgresql.org/message-id/flat/31a702a.14dd.166c1366ac1.Coremail.chjischj%40163.com


We have already obtained core dump files of the autovacuum process and one of the insert
processes.
Unfortunately, the insert process (dumped by gcore) held no lwlock; it should be another
process (whose core file we did not fetch) that holds the lwlock needed by the
autovacuum process.


The stacks are as follows:


## stack of one insert process: Acquire lock 0x7f6c517dbfa4 which was held by 
vacuum process

(gdb) bt
#0  0x00369ea0da00 in sem_wait () from /lib64/libpthread.so.0
#1  0x006a7910 in PGSemaphoreLock (sema=0x7f6c4f76a7b8) at pg_sema.c:316
#2  0x00718225 in LWLockAcquire (lock=0x7f6c517dbfa4, mode=LW_SHARED) 
at lwlock.c:1233
#3  0x0048b622 in ginTraverseLock (buffer=224225, searchMode=0 '\000') 
at ginbtree.c:40
#4  0x0048ca13 in ginFindLeafPage (btree=0x7fffc71c4ea0, searchMode=0 
'\000', snapshot=0x0) at ginbtree.c:97
#5  0x004894db in ginInsertItemPointers (index=, 
rootBlkno=, items=, nitem=, buildStats=0x0)
at gindatapage.c:1909
#6  0x004863a7 in ginEntryInsert (ginstate=0x1c72158, attnum=1, 
key=20190913, category=0 '\000', items=0x1c81508, nitem=72, buildStats=0x0) at 
gininsert.c:214
#7  0x0049219a in ginInsertCleanup (ginstate=0x1c72158, full_clean=0 
'\000', fill_fsm=1 '\001', forceCleanup=, stats=) at ginfast.c:878
#8  0x0049308e in ginHeapTupleFastInsert (ginstate=0x1c72158, 
collector=) at ginfast.c:443
#9  0x00486749 in gininsert (index=, 
values=0x7fffc71c54f0, isnull=0x7fffc71c5600 "", ht_ctid=0x1c6d3a4, 
heapRel=, 
checkUnique=, indexInfo=0x1c61da8) at gininsert.c:522
#10 0x005f75f0 in ExecInsertIndexTuples (slot=0x1c62168, 
tupleid=0x1c6d3a4, estate=0x1c61768, noDupErr=0 '\000', specConflict=0x0, 
arbiterIndexes=0x0) at execIndexing.c:387
#11 0x00616497 in ExecInsert (pstate=0x1c61ab8) at nodeModifyTable.c:519
#12 ExecModifyTable (pstate=0x1c61ab8) at nodeModifyTable.c:1779
#13 0x005fb6bf in ExecProcNode (queryDesc=0x1c67760, direction=, count=0, execute_once=-72 '\270') at 
../../../src/include/executor/executor.h:250
#14 ExecutePlan (queryDesc=0x1c67760, direction=, count=0, 
execute_once=-72 '\270') at execMain.c:1723
#15 standard_ExecutorRun (queryDesc=0x1c67760, direction=, 
count=0, execute_once=-72 '\270') at execMain.c:364
#16 0x7f6e226aa6f8 in pgss_ExecutorRun (queryDesc=0x1c67760, 
direction=ForwardScanDirection, count=0, execute_once=1 '\001') at 
pg_stat_statements.c:889
#17 0x7f6e224a474d in explain_ExecutorRun (queryDesc=0x1c67760, 
direction=ForwardScanDirection, count=0, execute_once=1 '\001') at 
auto_explain.c:267
#18 0x0072a15b in ProcessQuery (plan=, 
sourceText=0x1c21458 "INSERT INTO bi_dm.tdm_wh_shopgds_fnsh_rt 
(STATIS_DATE,SITE_CD,LGORT,ZSIZE,ZVTWEG,VSBED,TOTAL_CNT,FNSH_CNT,UNFNSH_CNT,ETL_TIME,DEPT_CD,TMALL_FLG,BUSS_TP,ZCKYWLX)
 VALUES($1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$"..., params=0x1c21580, queryEnv=0x0, 
dest=, completionTag=0x7fffc71c5de0 "") at pquery.c:161
#19 0x0072a395 in PortalRunMulti (portal=0x1c57f18, isTopLevel=1 
'\001', setHoldSnapshot=0 '\000', dest=0xc9b480, altdest=0xc9b480, 
completionTag=0x7fffc71c5de0 "") at pquery.c:1286
#20 0x0072aa98 in PortalRun (portal=0x1c57f18, count=1, isTopLevel=1 
'\001', run_once=1 '\001', dest=0x1c25768, altdest=0x1c25768, 
completionTag=0x7fffc71c5de0 "") at pquery.c:799
#21 0x00728c9a in exec_execute_message (argc=, 
argv=, dbname=0x1bbb800 "lbiwhdb", username=) at postgres.c:2007
#22 PostgresMain (argc=, argv=, 
dbname=0x1bbb800 "lbiwhdb", username=) at postgres.c:4180
#23 0x006bb43a in BackendRun (argc=, argv=) at postmaster.c:4405
---Type  to continue, or q  to quit---
#24 BackendStartup (argc=, argv=) at 
postmaster.c:4077
#25 ServerLoop (argc=, argv=) at 
postmaster.c:1755
#26 PostmasterMain (argc=, argv=) at 
postmaster.c:1363
#27 0x0063b4d0 in main (argc=3, argv=0x1b839e0) at main.c:228
(gdb) f 2
#2  0x00718225 in LWLockAcquire (lock=0x7f6c517dbfa4, mode=LW_SHARED) 
at lwlock.c:1233
1233lwlock.c: No such file or directory.
in lwlock.c
(gdb) p num_held_lwlocks
$1 = 0
(gdb) 




## stack of autovacuum:Acquire lock 0x7f6c519ba5a4 and hold 0x7f6c517dbfa4, 
0x7f6c51684f64
--
(gdb) bt
#0  0x00369ea0da00 in sem_wait () from /lib64/libpthread.so.0
#1  0x006a7910 in PGSemaphoreLock (sema=0x7f6c4f77fdb8) at pg_sema.c:316
#2  0x00718225 in LWLockAcquire (lock=0x7f6c519ba5a4, 
mode=LW_EXCLUSIVE) at lwlock.c:1

Re:Re: could not access status of transaction

2019-09-29 Thread chenhj
Hi, all


Another of our systems has encountered the same failure, but this time it is
PostgreSQL 10.7 (RHEL 6.3).


Details are as follows:


Phenomenon:


app_db=# select count(*) from loba_sp_cost_xcd_104561;
ERROR:  could not access status of transaction 35153545
DETAIL:  Could not open file "pg_xact/0021": No such file or directory.




Related Information:
The clog files in the pg_xact directory are 00CE~0179, and pg_xact/0021 has already been
removed.


Through gdb analysis, the tuple that caused the error was in block 163363. 


(gdb) bt
#0  SlruReportIOError (ctl=0xca6b60, pageno=1072, xid=35153545) at slru.c:894
#1  0x004e2060 in SimpleLruReadPage (ctl=0xca6b60, pageno=1072, 
write_ok=1 '\001', xid=35153545) at slru.c:445
#2  0x004da431 in TransactionIdGetStatus (xid=35153545, 
lsn=0x7fffd07b1e88) at clog.c:411
#3  0x004e385a in TransactionLogFetch (transactionId=35153545) at 
transam.c:79
#4  0x004e39a5 in TransactionIdDidCommit (transactionId=) at transam.c:129
#5  0x0085e959 in HeapTupleSatisfiesMVCC (htup=, 
snapshot=0x2c21740, buffer=320) at tqual.c:1057
#6  0x004b49e7 in heapgetpage (scan=0x2c8a5d8, page=163363) at 
heapam.c:439
#7  0x004b5091 in heapgettup_pagemode (scan=0x2c8a5d8, direction=) at heapam.c:1034
#8  heap_getnext (scan=0x2c8a5d8, direction=) at 
heapam.c:1801
#9  0x00618b21 in SeqNext (node=0x2c63708) at nodeSeqscan.c:81
#10 0x005fe132 in ExecScanFetch (node=0x2c63708, accessMtd=0x618ae0 
, recheckMtd=0x618840 ) at execScan.c:97
#11 ExecScan (node=0x2c63708, accessMtd=0x618ae0 , recheckMtd=0x618840 
) at execScan.c:164
#12 0x006061ee in ExecProcNode (aggstate=0x2c630c0) at 
../../../src/include/executor/executor.h:250
#13 fetch_input_tuple (aggstate=0x2c630c0) at nodeAgg.c:695
#14 0x00607f89 in agg_retrieve_direct (pstate=0x2c630c0) at 
nodeAgg.c:2448
#15 ExecAgg (pstate=0x2c630c0) at nodeAgg.c:2158
#16 0x0060b240 in ExecProcNode (pstate=0x2c62da0) at 
../../../src/include/executor/executor.h:250
#17 gather_getnext (pstate=0x2c62da0) at nodeGather.c:286
#18 ExecGather (pstate=0x2c62da0) at nodeGather.c:215
#19 0x006061ee in ExecProcNode (aggstate=0x2c626f8) at 
../../../src/include/executor/executor.h:250
#20 fetch_input_tuple (aggstate=0x2c626f8) at nodeAgg.c:695
#21 0x006080ec in agg_retrieve_direct (pstate=0x2c626f8) at 
nodeAgg.c:2347
#22 ExecAgg (pstate=0x2c626f8) at nodeAgg.c:2158
#23 0x005fb6bf in ExecProcNode (queryDesc=0x2c81658, direction=, count=0, execute_once=-8 '\370') at 
../../../src/include/executor/executor.h:250
#24 ExecutePlan (queryDesc=0x2c81658, direction=, count=0, 
execute_once=-8 '\370') at execMain.c:1723
#25 standard_ExecutorRun (queryDesc=0x2c81658, direction=, 
count=0, execute_once=-8 '\370') at execMain.c:364
#26 0x7fd6f161c6f8 in pgss_ExecutorRun (queryDesc=0x2c81658, 
direction=ForwardScanDirection, count=0, execute_once=1 '\001') at 
pg_stat_statements.c:889
#27 0x7fd6f141674d in explain_ExecutorRun (queryDesc=0x2c81658, 
direction=ForwardScanDirection, count=0, execute_once=1 '\001') at 
auto_explain.c:267
#28 0x0072986b in PortalRunSelect (portal=0x2c85f98, forward=, count=0, dest=) at pquery.c:932
#29 0x0072aa01 in PortalRun (portal=0x2c85f98, 
count=9223372036854775807, isTopLevel=1 '\001', run_once=1 '\001', 
dest=0x2c5fb58, altdest=0x2c5fb58, completionTag=0x7fffd07b26d0 "")
at pquery.c:773
#30 0x00727051 in exec_simple_query (query_string=0x2c26358 "select 
count(*) from loba_sp_cost_xcd_104561;") at postgres.c:1122
#31 0x00728039 in PostgresMain (argc=, argv=, dbname=0x2bcc2f0 "app_db", username=) at 
postgres.c:4117
#32 0x006bb43a in BackendRun (argc=, argv=) at postmaster.c:4405
#33 BackendStartup (argc=, argv=) at 
postmaster.c:4077
---Type  to continue, or q  to quit---
#34 ServerLoop (argc=, argv=) at 
postmaster.c:1755
#35 PostmasterMain (argc=, argv=) at 
postmaster.c:1363
#36 0x0063b4d0 in main (argc=1, argv=0x2b8e9f0) at main.c:228
(gdb) f 5
#5  0x0085e959 in HeapTupleSatisfiesMVCC (htup=, 
snapshot=0x2c21740, buffer=320) at tqual.c:1057
1057tqual.c: No such file or directory.
in tqual.c
(gdb) p *tuple
$1 = {t_choice = {t_heap = {t_xmin = 35153545, t_xmax = 0, t_field3 = {t_cid = 
0, t_xvac = 0}}, t_datum = {datum_len_ = 35153545, datum_typmod = 0, 
datum_typeid = 0}}, t_ctid = {ip_blkid = {
  bi_hi = 2, bi_lo = 32291}, ip_posid = 9}, t_infomask2 = 32788, t_infomask = 
10243, t_hoff = 32 ' ', t_bits = 0x7fd6d97f0440 "\211f\030\002"}
  
  
The problematic tuple data parsed using `pg_filedump -i` is as follows:




Block 32291 
 -
 Block Offset: 0x0fc46000 Offsets: Lower 236 (0x00ec)
 Block: Size 8192  Version4Upper2112 (0x0840)
 LSN:  logid254 recoff 0xa3a598c8  Special  8192 (0x2000)
 Items:   53  Free Space: 1876
 Checksum: 0x8355  Prune

could not access status of transaction

2019-01-20 Thread chenhj
In our PG 10.2 (CentOS 7.3) database, the following error is reported when querying a
table. We have already restored the production data from backup, but I want to confirm
what the cause may be and how to avoid it in the future.

lma=# select count(*) from bi_dm.tdm_ttk_site_on_way_rt;
ERROR:  could not access status of transaction 3250922107
DETAIL:  Could not open file "pg_xact/0C1C": No such file or directory.

Here is some related information.

The CLOG files in the pg_xact directory are as follows:

0C4A(Last update date: 2018/12/29)
...
0D09(Last update date: 2019/01/13)
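
For reference, the segment of pg_xact that holds a given xid can be computed from the xid alone: with the default 8KB BLCKSZ, each transaction status takes 2 bits, so one clog page covers 32768 xids and one segment file (32 SLRU pages) covers 1048576 xids. A minimal sketch of that arithmetic (the constants match a standard 8KB build; the program itself is only my illustration):

/* Sketch: which pg_xact (clog) segment file holds a given transaction id,
 * assuming the default 8KB BLCKSZ (2 status bits per xid -> 32768 xids per
 * page, 32 SLRU pages -> 1048576 xids per segment file). */
#include <stdio.h>
#include <stdint.h>

#define BLCKSZ                 8192
#define CLOG_XACTS_PER_BYTE    4
#define CLOG_XACTS_PER_PAGE    (BLCKSZ * CLOG_XACTS_PER_BYTE)
#define SLRU_PAGES_PER_SEGMENT 32

int
main(void)
{
    uint32_t xid = 3250922107U;
    uint32_t pageno = xid / CLOG_XACTS_PER_PAGE;        /* 99210, as in the stack below */
    uint32_t segno = pageno / SLRU_PAGES_PER_SEGMENT;   /* 3100 = 0xC1C */

    printf("xid %u -> clog page %u -> pg_xact/%04X\n", xid, pageno, segno);
    return 0;
}

It prints pg_xact/0C1C for xid 3250922107, which is older than the oldest remaining segment 0C4A and therefore can no longer be read.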

The stack when the error occurs is as follows

(gdb) bt
#0  SlruReportIOError (ctl=ctl@entry=0xca98e0 , 
pageno=pageno@entry=99210, xid=xid@entry=3250922107) at slru.c:901
#1  0x004e53a8 in SimpleLruReadPage (ctl=ctl@entry=0xca98e0 
, pageno=pageno@entry=99210, write_ok=write_ok@entry=1 '\001', 
xid=xid@entry=3250922107)
at slru.c:445
#2  0x004e5460 in SimpleLruReadPage_ReadOnly 
(ctl=ctl@entry=0xca98e0 , pageno=pageno@entry=99210, 
xid=xid@entry=3250922107) at slru.c:492
#3  0x004dde4d in TransactionIdGetStatus 
(xid=xid@entry=3250922107, lsn=lsn@entry=0x7ffeb4472660) at clog.c:411
#4  0x004e6e28 in TransactionLogFetch 
(transactionId=3250922107) at transam.c:79
#5  0x004e6e7f in TransactionIdDidCommit 
(transactionId=) at transam.c:129
#6  0x0084ce6a in HeapTupleSatisfiesMVCC (htup=, 
snapshot=0xf7f0c8, buffer=837) at tqual.c:1124
#7  0x004b106e in heapgetpage (scan=scan@entry=0x10bb1b0, 
page=page@entry=1078209) at heapam.c:439
#8  0x004b23ab in heapgettup_pagemode (key=0x0, nkeys=0, 
dir=1078209, scan=0x10bb1b0) at heapam.c:1034
#9  heap_getnext (scan=scan@entry=0x10bb1b0, 
direction=direction@entry=ForwardScanDirection) at heapam.c:1801
#10 0x0060fa93 in SeqNext (node=node@entry=0x108c360) at 
nodeSeqscan.c:81
#11 0x005f51ca in ExecScanFetch (recheckMtd=0x60fa40 
, accessMtd=0x60fa60 , node=0x108c360) at execScan.c:97
#12 ExecScan (node=0x108c360, accessMtd=0x60fa60 , 
recheckMtd=0x60fa40 ) at execScan.c:147
#13 0x005fad89 in ExecProcNode (node=0x108c360) at 
../../../src/include/executor/executor.h:250
#14 fetch_input_tuple (aggstate=aggstate@entry=0x108bcb8) at 
nodeAgg.c:695
#15 0x005fcdbf in agg_retrieve_direct (aggstate=0x108bcb8) at 
nodeAgg.c:2448
#16 ExecAgg (pstate=0x108bcb8) at nodeAgg.c:2158
#17 0x005ef612 in ExecProcNode (node=0x108bcb8) at 
../../../src/include/executor/executor.h:250
#18 ExecutePlan (execute_once=, dest=0x10c40e8, 
direction=, numberTuples=0, sendTuples=1 '\001', 
operation=CMD_SELECT, 
use_parallel_mode=, planstate=0x108bcb8, 
estate=0x108baa8) at execMain.c:1722
#19 standard_ExecutorRun (queryDesc=0x10c7168, direction=, count=0, execute_once=) at execMain.c:363
#20 0x7f766427915d in pgss_ExecutorRun (queryDesc=0x10c7168, 
direction=ForwardScanDirection, count=0, execute_once=) at 
pg_stat_statements.c:889
#21 0x0071a28b in PortalRunSelect 
(portal=portal@entry=0xfac708, forward=forward@entry=1 '\001', count=0, 
count@entry=9223372036854775807, dest=dest@entry=0x10c40e8)
at pquery.c:932
#22 0x0071b63f in PortalRun (portal=portal@entry=0xfac708, 
count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1 '\001', 
run_once=run_once@entry=1 '\001', dest=dest@entry=0x10c40e8, 
altdest=altdest@entry=0x10c40e8, 
completionTag=completionTag@entry=0x7ffeb4472c80 "") at pquery.c:773
#23 0x007175b3 in exec_simple_query (query_string=0x102a4e8 
"select count(*) from tdm_ttk_site_on_way_rt ;") at postgres.c:1099
#24 0x007188ac in PostgresMain (argc=, 
argv=argv@entry=0xfa2d68, dbname=0xfa2c50 "lma", username=) at 
postgres.c:4088
#25 0x0047ad3c in BackendRun (port=0xfa49b0) at 
postmaster.c:4405
#26 BackendStartup (port=0xfa49b0) at postmaster.c:4077
#27 ServerLoop () at postmaster.c:1755
#28 0x006afacf in PostmasterMain (argc=argc@entry=3, 
argv=argv@entry=0xf5fa20) at postmaster.c:1363
#29 0x0047bb6f in main (argc=3, argv=0xf5fa20) at main.c:228
(gdb) f 6
#6  0x0084ce6a in HeapTupleSatisfiesMVCC (htup=, 
snapshot=0xf7f0c8, buffer=837) at tqual.c:1124
1124if 
(!TransactionIdDidCommit(HeapTupleHeaderGetRawXmax(tuple)))
(gdb) p *tuple
$3 = {t_choice = {t_heap = {t_xmin = 3238223060, t_xmax = 3250922107, 
t_field3 = {t_cid = 0, t_xvac = 0}}, t_datum = {datum_len_ = -1056744236, 
datum_typmod = -1044045189, 
  datum_typeid = 0}}, t_ctid = {ip_blkid = {bi_hi = 16, bi_lo = 
29633}, ip_posid = 3}, t_infomask2 = 8211, t_infomask

Re:Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2018-11-11 Thread chenhj
Hi


Previously we only discussed the connection hang on the primary; the connection hang on
the standby database should be another problem.
When the recovery process replays gin's delete WAL record, the order in which it
acquires lwlocks is not the same as in the select process, resulting in a deadlock.
According to information from gcore, the details are as follows:


## select process


   2246 (rootBlkno=2246)
     |
   3254 (1. Held buffer=366260, LWLock=0x2d50e2e4)  --rightlink-->  483 (2. Acquire buffer=2201739, LWLock=0x2aaab45158a4)


The ginStepRight() function in the select process acquires the lwlocks in left-to-right
order.


## recovery process


   2246 (2. Held buffer=7034160, LWLock=0x2aaac6c081e4, rootBlkno)
     |
   3254 (3. Acquire buffer=366260, LWLock=0x2d50e2e4)  --rightlink-->  483 (1. Held buffer=2201739, LWLock=0x2aaab45158a4)


However, the ginRedoDeletePage() function in the recovery process acquires the lwlocks
in the order current page, then parent, then left.
So I think the inconsistent lock order in ginRedoDeletePage() is the reason for the hang
on the standby.


static void
ginRedoDeletePage(XLogReaderState *record)
{
    XLogRecPtr  lsn = record->EndRecPtr;
    ginxlogDeletePage *data = (ginxlogDeletePage *) XLogRecGetData(record);
    Buffer      dbuffer;
    Buffer      pbuffer;
    Buffer      lbuffer;
    Page        page;

    if (XLogReadBufferForRedo(record, 0, &dbuffer) == BLK_NEEDS_REDO)
    {
        page = BufferGetPage(dbuffer);
        Assert(GinPageIsData(page));
        GinPageGetOpaque(page)->flags = GIN_DELETED;
        PageSetLSN(page, lsn);
        MarkBufferDirty(dbuffer);
    }

    if (XLogReadBufferForRedo(record, 1, &pbuffer) == BLK_NEEDS_REDO)
    {
        page = BufferGetPage(pbuffer);
        Assert(GinPageIsData(page));
        Assert(!GinPageIsLeaf(page));
        GinPageDeletePostingItem(page, data->parentOffset);
        PageSetLSN(page, lsn);
        MarkBufferDirty(pbuffer);
    }

    if (XLogReadBufferForRedo(record, 2, &lbuffer) == BLK_NEEDS_REDO)
    {
        page = BufferGetPage(lbuffer);
        Assert(GinPageIsData(page));
        GinPageGetOpaque(page)->rightlink = data->rightLink;
        PageSetLSN(page, lsn);
        MarkBufferDirty(lbuffer);
    }

    if (BufferIsValid(lbuffer))
        UnlockReleaseBuffer(lbuffer);
    if (BufferIsValid(pbuffer))
        UnlockReleaseBuffer(pbuffer);
    if (BufferIsValid(dbuffer))
        UnlockReleaseBuffer(dbuffer);
}




Perhaps the order of acquiring lwlocks in ginRedoDeletePage() should be changed from
"dbuffer->pbuffer->lbuffer" to "lbuffer->dbuffer->pbuffer". Is this right?

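A minimal sketch of what that reordering could look like is below; only the order of the XLogReadBufferForRedo() calls changes compared with the function quoted above (left sibling first, then the deleted page, then the parent), and the work done on each page is unchanged. This is just an illustration of the suggested order, not a tested patch:

static void
ginRedoDeletePage(XLogReaderState *record)
{
    XLogRecPtr  lsn = record->EndRecPtr;
    ginxlogDeletePage *data = (ginxlogDeletePage *) XLogRecGetData(record);
    Buffer      dbuffer;
    Buffer      pbuffer;
    Buffer      lbuffer;
    Page        page;

    /* 1. left sibling first, the same page a concurrent ginStepRight() locks first */
    if (XLogReadBufferForRedo(record, 2, &lbuffer) == BLK_NEEDS_REDO)
    {
        page = BufferGetPage(lbuffer);
        Assert(GinPageIsData(page));
        GinPageGetOpaque(page)->rightlink = data->rightLink;
        PageSetLSN(page, lsn);
        MarkBufferDirty(lbuffer);
    }

    /* 2. the deleted page itself */
    if (XLogReadBufferForRedo(record, 0, &dbuffer) == BLK_NEEDS_REDO)
    {
        page = BufferGetPage(dbuffer);
        Assert(GinPageIsData(page));
        GinPageGetOpaque(page)->flags = GIN_DELETED;
        PageSetLSN(page, lsn);
        MarkBufferDirty(dbuffer);
    }

    /* 3. the parent page last */
    if (XLogReadBufferForRedo(record, 1, &pbuffer) == BLK_NEEDS_REDO)
    {
        page = BufferGetPage(pbuffer);
        Assert(GinPageIsData(page));
        Assert(!GinPageIsLeaf(page));
        GinPageDeletePostingItem(page, data->parentOffset);
        PageSetLSN(page, lsn);
        MarkBufferDirty(pbuffer);
    }

    if (BufferIsValid(pbuffer))
        UnlockReleaseBuffer(pbuffer);
    if (BufferIsValid(dbuffer))
        UnlockReleaseBuffer(dbuffer);
    if (BufferIsValid(lbuffer))
        UnlockReleaseBuffer(lbuffer);
}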



## How to reproduce this issue


1. modify ginxlog.c and build (add "sleep(60)")


if (XLogReadBufferForRedo(record, 1, &pbuffer) == BLK_NEEDS_REDO)
{
    page = BufferGetPage(pbuffer);
    Assert(GinPageIsData(page));
    Assert(!GinPageIsLeaf(page));
    GinPageDeletePostingItem(page, data->parentOffset);
    PageSetLSN(page, lsn);
    MarkBufferDirty(pbuffer);
}
==>
if (XLogReadBufferForRedo(record, 1, &pbuffer) == BLK_NEEDS_REDO)
{
    page = BufferGetPage(pbuffer);
    Assert(GinPageIsData(page));
    Assert(!GinPageIsLeaf(page));
    GinPageDeletePostingItem(page, data->parentOffset);
    PageSetLSN(page, lsn);
    MarkBufferDirty(pbuffer);
    sleep(60);    //add for debug
}




2. run test SQL on the primary


create table tb1(id int);
create index ON tb1 using gin(id);
insert into tb1 select 1 from generate_series(1,100)id;
delete from tb1;


3. check that the recovery process on the standby has entered the "sleep()" branch


$ ps -ef|grep reco
postgres 13418 13417  0 22:23 ?00:00:00 postgres: startup process   
recovering 0001005E
postgres 13425 31505  0 22:23 pts/800:00:00 grep --color=auto reco
$ pstack 13418
#0  0x7f2166d39650 in __nanosleep_nocancel () from /lib64/libc.so.6
#1  0x7f2166d39504 in sleep () from /lib64/libc.so.6
#2  0x0048614f in ginRedoDeletePage (record=0x127cbe8) at 
ginxlog.c:480
#3  gin_redo (record=0x127cbe8) at ginxlog.c:732
#4  0x004efec3 in StartupXLOG ()
#5  0x00697c51 in StartupProcessMain ()
#6  0x004fd22a in AuxiliaryProcessMain ()
#7  0x00694e49 in StartChildProcess ()
#8  0x00697665 in PostmasterMain ()
#9  0x004766e1 in main ()


4. execute select SQL


set enable_seqscan = false;
select count(*) from tb1 where id =2;


5. check result


The recovery process is blocked on an LWLock:


$ pstack 13418
#0  0x7f216775779b in do_futex_wait.constprop.1 () from 
/lib64/libpthread.so.0
#1  0x7f216775782f in __new_sem_wait_slow.constprop.0 () from 
/lib64/libpthread.so.0
#2  0x7f21677578cb in sem_wait@@GLIBC_2.2.5 () from 
/lib64/libpthread.so.0
#3  0x00685df2 in PGSemaphoreLock ()
#4  0x006edd64 in LWLockAcquire ()
#5  0x004fa66a in XLogReadBufferForRedoExtended ()
#6  0x00486161 in ginRedoDeletePage (record=0x127cbe8) at 
ginxlog.c:483
#7  gin_redo (record=0x127cbe8) at ginxlog.c:732
#8  0x004efec3 in StartupXLOG ()
#9  0x00697c51 in StartupProcessMain ()
#10 0x004fd22a in AuxiliaryProcessMain ()
#11 0x000

Re:Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2018-11-06 Thread chenhj
Hi, all

I analyzed the btree blocks where the lwlock deadlock occurred, as follows:

## insert process (ginInsertValue())

     644 (root blkno)
      |
     7054 (2. held LWLock:0x2aaac587ae64)  --rightlink-->  (3. Acquire LWLock:0x2aaab4009564, buffer=2119038, blkno should be 9954)
      |
     701 (1. held LWLock:0x2aaab670dfe4)


The ginInsertValue() function above gets the lwlock in the order described in 
the README.

src/backend/access/gin/README
---
To avoid deadlocks, B-tree pages must always be locked in the same order:
left to right, and bottom to top.
...
-


## autovacuum process (ginScanToDelete())

     644 (root blkno)
      |
 ... 9954 (1. held LWLock:0x2aaab4009564) ...
      |
     701 (2. Acquire LWLock:0x2aaab670dfe4)

note: according to autovacuum's core, 701's parent is 9954, while insert's core shows
701's parent is 7054 and the rightlink of 7054 is 9954!

However, ginScanToDelete() scans the btree depth-first and takes EXCLUSIVE locks, which
creates a deadlock.
Is the above statement correct? If so, deadlocks should happen easily.


static bool
ginScanToDelete(GinVacuumState *gvs, BlockNumber blkno, bool isRoot,
                DataPageDeleteStack *parent, OffsetNumber myoff)
{
    ...
    if (!isRoot)
        LockBuffer(buffer, GIN_EXCLUSIVE);

    ...
    for (i = FirstOffsetNumber; i <= GinPageGetOpaque(page)->maxoff; i++)
    {
        PostingItem *pitem = GinDataPageGetPostingItem(page, i);

        if (ginScanToDelete(gvs, PostingItemGetBlockNumber(pitem), FALSE, me, i))
            i--;
    }
    ...
}

src/backend/access/gin/README
-
The previous paragraph's reasoning only applies to searches, and only to
posting trees. To protect from inserters following a downlink to a deleted
page, vacuum simply locks out all concurrent insertions to the posting tree,
by holding a super-exclusive lock on the parent page of subtree with deletable
pages. 
...
posting tree. To exclude interference with readers vacuum takes exclusive
locks in a depth-first scan in left-to-right order of page tuples. 
...
-

# stacks
## stack of insert:Acquire lock 0x2aaab4009564(blkno:9954) and hold 
0x2aaab670dfe4(blkno:701), 0x2aaac587ae64(blkno:7054)
-
(gdb) bt
#0  0x7fe11552379b in do_futex_wait.constprop.1 () from 
/lib64/libpthread.so.0
#1  0x7fe11552382f in __new_sem_wait_slow.constprop.0 () from 
/lib64/libpthread.so.0
#2  0x7fe1155238cb in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#3  0x0069d362 in PGSemaphoreLock (sema=0x2ac02958) at pg_sema.c:310
#4  0x007095ac in LWLockAcquire (lock=0x2aaab4009564, 
mode=LW_EXCLUSIVE) at lwlock.c:1233
#5  0x00490a32 in ginStepRight (buffer=6713826, index=, 
lockmode=lockmode@entry=2) at ginbtree.c:174
#6  0x00490c1c in ginFinishSplit (btree=btree@entry=0x7ffd81e4f950, 
stack=0x28eebf8, freestack=freestack@entry=1 '\001', 
buildStats=buildStats@entry=0x0) at ginbtree.c:701
#7  0x00491396 in ginInsertValue (btree=btree@entry=0x7ffd81e4f950, 
stack=, insertdata=insertdata@entry=0x7ffd81e4f940, 
buildStats=buildStats@entry=0x0)
at ginbtree.c:773
#8  0x0048fb42 in ginInsertItemPointers (index=, 
rootBlkno=rootBlkno@entry=644, items=items@entry=0x2916598, 
nitem=nitem@entry=353, buildStats=buildStats@entry=0x0)
at gindatapage.c:1907
#9  0x0048a9ea in ginEntryInsert (ginstate=ginstate@entry=0x28e7ef8, 
attnum=, key=42920440, category=, 
items=0x2916598, nitem=353, 
buildStats=buildStats@entry=0x0) at gininsert.c:214
#10 0x00496d94 in ginInsertCleanup (ginstate=ginstate@entry=0x28e7ef8, 
full_clean=full_clean@entry=0 '\000', fill_fsm=fill_fsm@entry=1 '\001', 
forceCleanup=forceCleanup@entry=0 '\000', stats=stats@entry=0x0) at 
ginfast.c:883
#11 0x00497727 in ginHeapTupleFastInsert 
(ginstate=ginstate@entry=0x28e7ef8, collector=collector@entry=0x7ffd81e4fe60) 
at ginfast.c:448
#12 0x0048b209 in gininsert (index=, 
values=0x7ffd81e4ff40, isnull=0x7ffd81e50040 "", ht_ctid=0x280d98c, 
heapRel=, checkUnique=, 
indexInfo=0x28b5aa8) at gininsert.c:522
#13 0x005ee8dd in ExecInsertIndexTuples (slot=slot@entry=0x28b5d58, 
tupleid=tupleid@entry=0x280d98c, estate=estate@entry=0x28b5288, 
noDupErr=noDupErr@entry=1 '\001', 
specConflict=specConflict@entry=0x7ffd81e5013b "", 
arbiterIndexes=arbiterIndexes@entry=0x28c6dd8) at execIndexing.c:386
#14 0x0060ccf5 in ExecInsert (canSetTag=1 '\001', estate=0x28b5288, 
onconflict=ONCONFLICT_UPDATE, arbiterIndexes=0x28c6dd8, planSlot=0x28b5d58, 
slot=0x28b5d58, mtstate=0x28b5628)
at nodeModifyTable.c:564
#15 ExecModifyTab

Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2018-10-29 Thread chenhj
Hi, all


In our PostgreSQL 10.2 database, two sessions (an insert and an autovacuum of a gin
index) were blocked on LWLock:buffer_content.
This blocked checkpoints and dead tuple recycling, and we had to restart the database.
According to the information collected from gcore, a deadlock occurred when acquiring
the buffer lock.


This system is a citus 7.2.1 cluster. It has been running for more than a year, and this
problem has not happened before.
On 10/19 the cluster was expanded from 8 workers to 16 workers. I don't know if it has
anything to do with the expansion.
The problem happened on the coordinator node of citus, but the table is a normal table
(not a partitioned table).
And the load of this coordinator node has always been heavy.


After 10/25 this issue occurred three times in 4 days, but we cannot reproduce the
problem in a test environment.
The 3 faults were stuck on 3 different tables, but the stacks are the same.


## stack of insert:Acquire lock 0x2aaab4009564 and hold 0x2aaab670dfe4, 
0x2aaac587ae64

(gdb) bt
#0  0x7fe11552379b in do_futex_wait.constprop.1 () from 
/lib64/libpthread.so.0
#1  0x7fe11552382f in __new_sem_wait_slow.constprop.0 () from 
/lib64/libpthread.so.0
#2  0x7fe1155238cb in sem_wait@@GLIBC_2.2.5 () from 
/lib64/libpthread.so.0
#3  0x0069d362 in PGSemaphoreLock (sema=0x2ac02958) at 
pg_sema.c:310
#4  0x007095ac in LWLockAcquire (lock=0x2aaab4009564, 
mode=LW_EXCLUSIVE) at lwlock.c:1233
#5  0x00490a32 in ginStepRight (buffer=6713826, index=, lockmode=lockmode@entry=2) at ginbtree.c:174
#6  0x00490c1c in ginFinishSplit (btree=btree@entry=0x7ffd81e4f950, 
stack=0x28eebf8, freestack=freestack@entry=1 '\001', 
buildStats=buildStats@entry=0x0) at ginbtree.c:701
#7  0x00491396 in ginInsertValue (btree=btree@entry=0x7ffd81e4f950, 
stack=, insertdata=insertdata@entry=0x7ffd81e4f940, 
buildStats=buildStats@entry=0x0)
at ginbtree.c:773
#8  0x0048fb42 in ginInsertItemPointers (index=, 
rootBlkno=rootBlkno@entry=644, items=items@entry=0x2916598, 
nitem=nitem@entry=353, buildStats=buildStats@entry=0x0)
at gindatapage.c:1907
#9  0x0048a9ea in ginEntryInsert 
(ginstate=ginstate@entry=0x28e7ef8, attnum=, key=42920440, 
category=, items=0x2916598, nitem=353, 
buildStats=buildStats@entry=0x0) at gininsert.c:214
#10 0x00496d94 in ginInsertCleanup 
(ginstate=ginstate@entry=0x28e7ef8, full_clean=full_clean@entry=0 '\000', 
fill_fsm=fill_fsm@entry=1 '\001', 
forceCleanup=forceCleanup@entry=0 '\000', stats=stats@entry=0x0) at 
ginfast.c:883
#11 0x00497727 in ginHeapTupleFastInsert 
(ginstate=ginstate@entry=0x28e7ef8, collector=collector@entry=0x7ffd81e4fe60) 
at ginfast.c:448
#12 0x0048b209 in gininsert (index=, 
values=0x7ffd81e4ff40, isnull=0x7ffd81e50040 "", ht_ctid=0x280d98c, 
heapRel=, checkUnique=, 
indexInfo=0x28b5aa8) at gininsert.c:522
#13 0x005ee8dd in ExecInsertIndexTuples (slot=slot@entry=0x28b5d58, 
tupleid=tupleid@entry=0x280d98c, estate=estate@entry=0x28b5288, 
noDupErr=noDupErr@entry=1 '\001', 
specConflict=specConflict@entry=0x7ffd81e5013b "", 
arbiterIndexes=arbiterIndexes@entry=0x28c6dd8) at execIndexing.c:386
#14 0x0060ccf5 in ExecInsert (canSetTag=1 '\001', estate=0x28b5288, 
onconflict=ONCONFLICT_UPDATE, arbiterIndexes=0x28c6dd8, planSlot=0x28b5d58, 
slot=0x28b5d58, mtstate=0x28b5628)
at nodeModifyTable.c:564
#15 ExecModifyTable (pstate=0x28b5628) at nodeModifyTable.c:1766
#16 0x005ef612 in standard_ExecutorRun (queryDesc=, 
direction=, count=, execute_once=)
at ../../../src/include/executor/executor.h:250
#17 0x7fe10607d15d in pgss_ExecutorRun (queryDesc=0x28cc1d8, 
direction=ForwardScanDirection, count=0, execute_once=) at 
pg_stat_statements.c:889
#18 0x0071a7ba in ProcessQuery (plan=, 
sourceText=0x2849058 "INSERT INTO 
bi_dm.tdm_sncity_workorder_analytic_statistics 
(CITYCODE,CITYNAME,DISTRICT,DISTRICT_NAME,ZL_WERKS,ZL_WERKS_NAME,AREA_CD,AREA_NM,ZSIZE,DEPT_CD,TMALL_FLG,ORDER_DATE,DISTRIBUTE,WORKORDER,FNSH"...,
 params=0x28b2260, queryEnv=0x0, dest=0xc9e2a0 , 
completionTag=0x7ffd81e507b0 "") at pquery.c:161
#19 0x0071a9f7 in PortalRunMulti (portal=portal@entry=0x289fdd8, 
isTopLevel=isTopLevel@entry=1 '\001', setHoldSnapshot=setHoldSnapshot@entry=0 
'\000', dest=0xc9e2a0 , 
dest@entry=0x284d778, altdest=0xc9e2a0 , 
altdest@entry=0x284d778, completionTag=completionTag@entry=0x7ffd81e507b0 "") 
at pquery.c:1286
#20 0x0071b535 in PortalRun (portal=, count=1, 
isTopLevel=, run_once=, dest=0x284d778, 
altdest=0x284d778, 
completionTag=0x7ffd81e507b0 "") at pquery.c:799
#21 0x00718f84 in PostgresMain (argc=, 
argv=, dbname=, username=) at 
postgres.c:1984
#22 0x0047ad3c in Backe

When use prepared protocol, transaction will hold backend_xmin until the end of the transaction.

2018-07-03 Thread chenhj
Hi, hackers!

When SQL is executed with the prepared protocol, a read committed transaction will hold
backend_xmin until the end of the transaction.

Is this behavior normal?

Should a read committed transaction release backend_xmin immediately after the SQL
finishes executing, just as it does when executing SQL with the simple protocol?

# reproduction
## env
- PostgreSQL 9.2
- CentOS 7.2

## test script

$ cat test.sql
begin;
select 1;
\sleep 1000s


## execute with simple protocol

$ pgbench -n -t 1 -f test.sql "service=admin"

postgres=# select * from pg_stat_activity where query='select 1;';
-[ RECORD 1 ]+--
datid| 13805
datname  | postgres
pid  | 19641
usesysid | 16388
usename  | admin
application_name | pgbench
client_addr  |
client_hostname  |
client_port  | -1
backend_start| 2018-07-04 13:27:10.62635+08
xact_start   | 2018-07-04 13:27:10.629609+08
query_start  | 2018-07-04 13:27:10.629845+08
state_change | 2018-07-04 13:27:10.63035+08
wait_event_type  | Client
wait_event   | ClientRead
state| idle in transaction
backend_xid  |
backend_xmin |
query| select 1;
backend_type | client backend

## execute with prepared protocol

$ pgbench -n -t 1 -f test.sql "service=admin" -M prepared

postgres=# select * from pg_stat_activity where query='select 1;';
-[ RECORD 1 ]+--
datid| 13805
datname  | postgres
pid  | 19662
usesysid | 16388
usename  | admin
application_name | pgbench
client_addr  |
client_hostname  |
client_port  | -1
backend_start| 2018-07-04 13:27:46.637134+08
xact_start   | 2018-07-04 13:27:46.641348+08
query_start  | 2018-07-04 13:27:46.64174+08
state_change | 2018-07-04 13:27:46.641778+08
wait_event_type  | Client
wait_event   | ClientRead
state| idle in transaction
backend_xid  |
backend_xmin | 3930269815
query| select 1;
backend_type | client backend

backend_xmin affects dead tuple removal:

postgres=# create table tbchj(id int);
CREATE TABLE
postgres=# insert into tbchj values(1);
INSERT 0 1
postgres=# delete from tbchj;
DELETE 1
postgres=# vacuum VERBOSE tbchj;
INFO:  vacuuming "public.tbchj"
INFO:  "tbchj": found 0 removable, 1 nonremovable row versions in 1 out of 
1 pages
DETAIL:  1 dead row versions cannot be removed yet, oldest xmin: 3930269815
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
VACUUM
 
Regards
Chen Huajun

Re:Re: [HACKERS] [PATCH]make pg_rewind to not copy useless WAL files

2018-01-24 Thread chenhj

At 2018-01-23 09:56:48, "Stephen Frost"  wrote:

>
>I've only read through the thread to try and understand what's going on
>and the first thing that comes to mind is that you're changing
>pg_rewind to not remove the WAL from before the divergence (split)
>point, but I'm not sure why.  As noted, that WAL isn't needed for
>anything (it's from before the split, after all), so why keep it?  Is
>there something in this optimization that depends on the old WAL being
>there and, if so, what and why?



After running pg_rewind, the first startup of postgres will perform crash recovery.
And crash recovery will begin from the redo point preceding the divergence.
So the WAL after that redo point and before the divergence is needed.

Of course, the WAL before the redo point is not needed, but in my view, recycling
unwanted WAL does not have to be done by pg_rewind.

reference code:

pg_rewind.c:
main(int argc, char **argv)
{
...
findLastCheckpoint(datadir_target, divergerec,
   lastcommontliIndex,
   &chkptrec, &chkpttli, &chkptredo);
...
createBackupLabel(chkptredo, chkpttli, chkptrec);
...
}
...
createBackupLabel(XLogRecPtr startpoint, TimeLineID starttli, XLogRecPtr 
checkpointloc)
{
...
len = snprintf(buf, sizeof(buf),
   "START WAL LOCATION: %X/%X (file %s)\n"
   "CHECKPOINT LOCATION: %X/%X\n"
   "BACKUP METHOD: pg_rewind\n"
   "BACKUP FROM: standby\n"
   "START TIME: %s\n",
/* omit LABEL: line */
   (uint32) (startpoint >> 32), (uint32) 
startpoint, xlogfilename,
   (uint32) (checkpointloc >> 32), (uint32) 
checkpointloc,
   strfbuf);
...
}




>That's also different from how pg_basebackup works, which I don't think
>is good (seems like pg_rewind should operate in a pretty similar manner
>to pg_basebackup).
>

Thanks for your comments!
I also considered copying WAL the way pg_basebackup does, but an implementation similar
to pg_basebackup's manner may not be so simple.
Two transfers of files from source to target may be needed: first for the data files,
and another for the WAL.
And the WAL that contains the redo point preceding the divergence may exist only on the
target server, having already been recycled on the source. That is a difference between
pg_rewind and pg_basebackup.


Regards,
Chen Huajun







Re:Re: [HACKERS] [PATCH]make pg_rewind to not copy useless WAL files

2017-12-02 Thread chenhj


At 2017-12-01 12:27:09, "Michael Paquier"  wrote:
>On Tue, Oct 3, 2017 at 1:20 AM, chenhj  wrote:
>> I had filled the authors field of this patch in commitfest, and will rebase
>> this patch if needed. Thank you for your help!
>
>The documentation of the patch needs a rebase, so I am moving it to
>next CF with "waiting on author" as status.
>
>$ git diff master --check
>src/bin/pg_rewind/pg_rewind.c:292: trailing whitespace.
>+*
>There are whitespace complains.
>-- 

>Michael


Rebased and removed the trailing whitespace.



regards
Chen Huajun

pg_rewind_wal_copy_reduce_v9.patch
Description: Binary data