Re: 32TB relation size makes mdnblocks overflow

2022-01-18 Thread ()
I think we hit a corner case in our storage: the relation has grown to 32TB worth of
blocks, so 'mdnblocks' returns an unexpected value. We will check it again.
Thanks a lot.

32TB relation size makes mdnblocks overflow

2022-01-17 Thread ()
Hello

We know that PostgreSQL doesn't support a single relation larger than 32TB, limited
by MaxBlockNumber. But if we keep inserting into one relation until it grows past
32TB, we get the error 'unexpected data beyond EOF in block 0 of relation' from
ReadBuffer_common. The 'block 0' comes from the mdnblocks function: once the segment
number goes over 256 in our build, segno * RELSEG_SIZE overflows past uint32's
maximum value. So would it be worthwhile to make the error more readable, e.g. an
elog in mdnblocks saying something like 'The relation size is over the maximum ...'?

The scenario we hit is shown below: 'mov %ebx,%eax' followed by 'shl $0x18,%eax'
shifts the 256 held in %ebx out of the 32-bit result, so the block number computed
from segno wraps from 256 * RELSEG_SIZE to 0.

   0x00c2cc51 <+289>: callq  0xc657f0
   0x00c2cc56 <+294>: mov    -0x8(%r15),%rdi
   0x00c2cc5a <+298>: mov    %r15,%rsi
   0x00c2cc5d <+301>: mov    %eax,%r14d
   0x00c2cc60 <+304>: mov    0x10(%rdi),%rax
   0x00c2cc64 <+308>: callq  *0x8(%rax)
   0x00c2cc67 <+311>: test   %r14d,%r14d
   0x00c2cc6a <+314>: jns    0xc2cd68
=> 0x00c2cc70 <+320>: add    $0x28,%rsp
   0x00c2cc74 <+324>: mov    %ebx,%eax
   0x00c2cc76 <+326>: shl    $0x18,%eax
   0x00c2cc79 <+329>: pop    %rbx
   0x00c2cc7a <+330>: pop    %r12
   0x00c2cc7c <+332>: pop    %r13
   0x00c2cc7e <+334>: pop    %r14
   0x00c2cc80 <+336>: pop    %r15
   0x00c2cc82 <+338>: pop    %rbp
   0x00c2cc83 <+339>: retq
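For reference, a minimal standalone sketch of the arithmetic, assuming a build where
RELSEG_SIZE is 1 << 24 blocks per segment (to match the 'shl $0x18' above; this is
not the stock value and the sketch is not the PostgreSQL source):

#include <stdint.h>
#include <stdio.h>

/* Assumption for illustration: RELSEG_SIZE of 1 << 24 blocks per segment,
 * matching the shl $0x18 seen in the disassembly above. */
#define RELSEG_SIZE (1U << 24)

int
main(void)
{
    uint32_t segno = 256;

    /* Evaluated in 32 bits: 256 << 24 == 2^32, which wraps to 0; this is
     * why mdnblocks ends up reporting block 0. */
    uint32_t blocks32 = segno * RELSEG_SIZE;

    /* Widening to 64 bits first gives the intended value. */
    uint64_t blocks64 = (uint64_t) segno * RELSEG_SIZE;

    printf("32-bit result: %u\n", blocks32);
    printf("64-bit result: %llu\n", (unsigned long long) blocks64);
    return 0;
}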

Reply: Re: Regarding the necessity of RelationGetNumberOfBlocks for every rescan / bitmap heap scan.

2021-05-30 Thread ()
+1. This would be a nice improvement; even though lseek is usually fast, it is
still a system call after all.

Buzhen--
From: Andy Fan
Date: 2021-05-31 13:46:22
To: PostgreSQL Hackers
Subject: Re: Regarding the necessity of RelationGetNumberOfBlocks for every rescan /
bitmap heap scan.



On Sat, May 29, 2021 at 11:23 AM Andy Fan  wrote:

Hi: 

I'm always confused by the following code.

static void
initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
{
    ParallelBlockTableScanDesc bpscan = NULL;
    bool        allow_strat;
    bool        allow_sync;

    /*
     * Determine the number of blocks we have to scan.
     *
     * It is sufficient to do this once at scan start, since any tuples added
     * while the scan is in progress will be invisible to my snapshot anyway.
     * (That is not true when using a non-MVCC snapshot.  However, we couldn't
     * guarantee to return tuples added after scan start anyway, since they
     * might go into pages we already scanned.  To guarantee consistent
     * results for a non-MVCC snapshot, the caller must hold some higher-level
     * lock that ensures the interesting tuple(s) won't change.)
     */
    if (scan->rs_base.rs_parallel != NULL)
    {
        bpscan = (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel;
        scan->rs_nblocks = bpscan->phs_nblocks;
    }
    else
        scan->rs_nblocks = RelationGetNumberOfBlocks(scan->rs_base.rs_rd);

    ...
}

1. Why do we need scan->rs_nblocks =
   RelationGetNumberOfBlocks(scan->rs_base.rs_rd) for every rescan? That looks
   mismatched with the comment in the code, and the comment looks reasonable
   to me.

To be more precise, the question is whether the relation size can change during a
rescan. We are sure the size can grow due to new data, but we are also sure that the
new data is useless for the query, so that case looks fine. As for the file size
shrinking: since we hold a lock on the relation, the file cannot be truncated either
(I have verified this logic for the online vacuum case; other cases should be
similar).


-- 
Best Regards
Andy Fan (https://www.aliyun.com/) 


Reply: Re: Cache relation sizes?

2021-01-18 Thread ()
Hi Thomas
I want to share a patch with you: it changes the replacement algorithm from FIFO
to a simple LRU.

Buzhen

0001-update-fifo-to-lru-to-sweep-a-valid-cache.patch
Description: Binary data
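
For illustration only, here is a minimal sketch of what a simple LRU sweep over a
fixed pool of cache slots could look like; SrSlot and the helper names below are
hypothetical and are not identifiers from the attached patch:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical slot layout, for illustration only. */
typedef struct SrSlot
{
    uint64_t last_used;   /* logical clock value at last access */
    bool     valid;
} SrSlot;

static uint64_t sr_clock;   /* advances on every access */

/* Pick a victim slot: prefer a free slot, otherwise the least recently used. */
static int
sr_pick_victim(SrSlot *pool, int nslots)
{
    int      victim = 0;
    uint64_t oldest = UINT64_MAX;

    for (int i = 0; i < nslots; i++)
    {
        if (!pool[i].valid)
            return i;               /* free slot, take it immediately */
        if (pool[i].last_used < oldest)
        {
            oldest = pool[i].last_used;
            victim = i;
        }
    }
    return victim;
}

/* On every cache hit, bump the slot's timestamp so it survives longer. */
static void
sr_touch(SrSlot *slot)
{
    slot->last_used = ++sr_clock;
}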


Re: Cache relation sizes?

2021-01-07 Thread ()
Hi, Thomas

Are there scenarios similar to dropdb for the smgr cache?
1. Drop tablespace on the primary: should an smgrdroptbs function be added to
DropTableSpace?
2. Drop database on a standby: smgrdropdb in dbase_redo.
3. Drop tablespace on a standby: smgrdroptbs in tblspc_redo.

Buzhen


 ---- Original Message ----
From: Thomas Munro
Sent: Fri Jan 8 00:56:17 2021
To: 陈佳昕(步真)
Cc: Amit Kapila, Konstantin Knizhnik, PostgreSQL Hackers
Subject: Re: Cache relation sizes?
On Wed, Dec 30, 2020 at 4:13 AM 陈佳昕(步真)  wrote:
> I found some other problems, and I want to share my changes with you for
> confirmation.
> <1> I changed the code in smgr_alloc_sr to avoid a deadlock.
>
>   LWLockAcquire(mapping_lock, LW_EXCLUSIVE);
>   flags = smgr_lock_sr(sr);
>   Assert(flags & SR_SYNCING(forknum));
>   + flags &= ~SR_SYNCING(forknum);
>   if (flags & SR_JUST_DIRTIED(forknum))
>   {
>/*
> * Someone else dirtied it while we were syncing, so we can't mark
> * it clean.  Let's give up on this SR and go around again.
> */
>smgr_unlock_sr(sr, flags);
>LWLockRelease(mapping_lock);
>goto retry;
>   }
>   /* This fork is clean! */
>   - flags &= ~SR_SYNCING(forknum);
>   flags &= ~SR_DIRTY(forknum);
>  }
>
> In smgr_drop_sr, if the sr is SR_SYNCING, it retries until the sr is no longer
> SR_SYNCING.  But in smgr_alloc_sr, if the sr is SR_JUST_DIRTIED, it retries with
> another sr, even though the fork has already been synced by smgrimmedsync; the
> SR_SYNCING flag is never cleared. This might cause a deadlock, so I changed your
> patch as above.

Right.  Thanks!

I also added smgrdropdb() to handle DROP DATABASE (the problem you
reported in your previous email).

While tweaking that code, I fixed it so that it uses a condition
variable to wait (instead of the silly sleep loop) when it needs to
drop an SR that is being sync'd.  Also, it now releases the mapping
lock while it's doing that, and reacquires it on retry.

> <2> I changed the code in smgr_drop_sr to avoid some corner cases
> /* Mark it invalid and drop the mapping. */
> smgr_unlock_sr(sr, ~SR_VALID);
> + for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
> +  sr->nblocks[forknum] = InvalidBlockNumber;
> hash_search_with_hash_value(sr_mapping_table,
>>smgr_rnode,
>hash,
>HASH_REMOVE,
>NULL);
>
> smgr_drop_sr just removes the hash entry from sr_mapping_table, but doesn't
> clear the entry in sr_pool. smgrnblocks_fast reads the sr directly from sr_pool,
> so I added the code above to avoid corner cases where smgrnblocks_fast returns
> an unexpected result. Is it necessary? I would also like your advice.

Hmm.  I think it might be better to increment sr->generation.  That
was already done in the "eviction" code path, where other processes
might still have references to the SR object, and I don't think it's
possible for anyone to access a dropped relation, but I suppose it's
more consistent to do that anyway.  Fixed.
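
For illustration, a minimal sketch of the generation-counter idea described above;
the names (SrEntry, SrRef) are hypothetical, memory-ordering details are omitted,
and this is not code from the patch:

#include <stdbool.h>
#include <stdint.h>

typedef struct SrEntry
{
    uint32_t generation;   /* bumped whenever the entry is evicted or dropped */
    uint32_t nblocks;
} SrEntry;

typedef struct SrRef
{
    SrEntry *entry;
    uint32_t generation;   /* generation observed when the ref was taken */
} SrRef;

/* Evict or drop: bump the generation so stale references can detect it. */
static void
sr_invalidate(SrEntry *entry)
{
    entry->generation++;
}

/* A reader holding a ref re-checks the generation; if it changed, the cached
 * value must not be trusted and the slow path is taken instead. */
static bool
sr_ref_still_valid(const SrRef *ref)
{
    return ref->entry->generation == ref->generation;
}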

Thanks for the review!


Reply: Re: Re: Cache relation sizes?

2020-12-29 Thread ()
Hi Thomas
I found some other problems, and I want to share my changes with you for
confirmation.
<1> I changed the code in smgr_alloc_sr to avoid a deadlock.

  LWLockAcquire(mapping_lock, LW_EXCLUSIVE);
  flags = smgr_lock_sr(sr);
  Assert(flags & SR_SYNCING(forknum));
+ flags &= ~SR_SYNCING(forknum);
  if (flags & SR_JUST_DIRTIED(forknum))
  {
      /*
       * Someone else dirtied it while we were syncing, so we can't mark
       * it clean.  Let's give up on this SR and go around again.
       */
      smgr_unlock_sr(sr, flags);
      LWLockRelease(mapping_lock);
      goto retry;
  }
  /* This fork is clean! */
- flags &= ~SR_SYNCING(forknum);
  flags &= ~SR_DIRTY(forknum);
 }

In smgr_drop_sr, if the sr is SR_SYNCING, it retries until the sr is no longer
SR_SYNCING.  But in smgr_alloc_sr, if the sr is SR_JUST_DIRTIED, it retries with
another sr, even though the fork has already been synced by smgrimmedsync; the
SR_SYNCING flag is never cleared. This might cause a deadlock, so I changed your
patch as above.

<2> I changed the code in smgr_drop_sr to avoid some corner cases
/* Mark it invalid and drop the mapping. */
smgr_unlock_sr(sr, ~SR_VALID);
+ for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+     sr->nblocks[forknum] = InvalidBlockNumber;
hash_search_with_hash_value(sr_mapping_table,
                            >smgr_rnode,
                            hash,
                            HASH_REMOVE,
                            NULL);

smgr_drop_sr just removes the hash entry from sr_mapping_table, but doesn't clear
the entry in sr_pool. smgrnblocks_fast reads the sr directly from sr_pool, so I
added the code above to avoid corner cases where smgrnblocks_fast returns an
unexpected result. Is it necessary? I would also like your advice.

Thanks a lot.
Buzhen


 ---- Original Message ----
From: Thomas Munro
Sent: Tue Dec 29 22:52:59 2020
To: 陈佳昕(步真)
Cc: Amit Kapila, Konstantin Knizhnik, PostgreSQL Hackers
Subject: Re: Re: Cache relation sizes?
On Wed, Dec 23, 2020 at 1:31 AM 陈佳昕(步真)  wrote:
> I studied your patch these days and found there might be a problem.
> When executing 'drop database', the smgr shared pool is not cleared because
> 'smgr_drop_sr' is never called. Function 'dropdb' in dbcommands.c removes the
> buffers from the buffer pool and unlinks the files with 'rmtree'; it doesn't
> call smgrdounlinkall, so the smgr shared cache is not dropped even though the
> table has been removed. This causes errors in smgr_alloc_sr -> smgropen and
> smgrimmedsync: the table file has been removed, so smgropen and smgrimmedsync
> get an unexpected result.

Hi Buzhen,

Thanks, you're right -- it needs to scan the pool of SRs and forget
everything from the database you're dropping.
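
For illustration, a minimal sketch of such a sweep; SrPoolEntry, sr_pool, and
sr_forget_database are hypothetical names, not identifiers from the patch:

#include <stdbool.h>

typedef unsigned int Oid;   /* stand-in for PostgreSQL's Oid, for this sketch */

typedef struct SrPoolEntry
{
    Oid  dbNode;      /* database the cached relation belongs to */
    bool valid;
} SrPoolEntry;

#define SR_POOL_SIZE 128
static SrPoolEntry sr_pool[SR_POOL_SIZE];

/* Forget every cached size belonging to the database being dropped. */
static void
sr_forget_database(Oid dbNode)
{
    for (int i = 0; i < SR_POOL_SIZE; i++)
    {
        /* In the real code this would run under the appropriate mapping
         * lock and would also remove the hash table entry. */
        if (sr_pool[i].valid && sr_pool[i].dbNode == dbNode)
            sr_pool[i].valid = false;
    }
}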




Reply: Re: Cache relation sizes?

2020-12-22 Thread ()
Hi Thomas:

I studied your patch these days and found there might be a problem.
When executing 'drop database', the smgr shared pool is not cleared because
'smgr_drop_sr' is never called. Function 'dropdb' in dbcommands.c removes the
buffers from the buffer pool and unlinks the files with 'rmtree'; it doesn't call
smgrdounlinkall, so the smgr shared cache is not dropped even though the table has
been removed. This causes errors in smgr_alloc_sr -> smgropen and smgrimmedsync:
the table file has been removed, so smgropen and smgrimmedsync get an unexpected
result.
 
Buzhen


 ---- Original Message ----
From: Thomas Munro
Sent: Tue Dec 22 19:57:35 2020
To: Amit Kapila
Cc: Konstantin Knizhnik, PostgreSQL Hackers
Subject: Re: Cache relation sizes?
On Tue, Nov 17, 2020 at 10:48 PM Amit Kapila  wrote:
> Yeah, it is good to verify VACUUM stuff but I have another question
> here. What about queries having functions that access the same
> relation (SELECT c1 FROM t1 WHERE c1 <= func(); assuming here function
> access the relation t1)? Now, here I think because the relation 't1'
> is already opened, it might use the same value of blocks from the
> shared cache even though the snapshot for relation 't1' when accessed
> from func() might be different. Am I missing something, or is it
> dealt with in some way?

I think it should be covered by the theory about implicit memory
barriers and snapshots, but to simplify things I have now removed the
lock-free stuff from the main patch (0001), because it was a case of
premature optimisation and it distracted from the main concept.  The
main patch has 128-way partitioned LWLocks for the mapping table, and
then per-relfilenode spinlocks for modifying the size.  There are
still concurrency considerations, which I think are probably handled
with the dirty-update-wins algorithm you see in the patch.  In short:
due to extension and exclusive locks, size changes AKA dirty updates
are serialised, but clean updates can run concurrently, so we just
have to make sure that clean updates never clobber dirty updates -- do
you think that is on the right path?
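
For illustration, a minimal sketch of the dirty-update-wins rule described above;
SizeCacheEntry and the helpers are hypothetical names, not code from the patch, and
the per-relfilenode locking is omitted:

#include <stdbool.h>
#include <stdint.h>

typedef struct SizeCacheEntry
{
    /* protected by a per-relfilenode spinlock in the real design */
    uint32_t nblocks;
    bool     dirty;      /* set by extension/truncation, cleared after sync */
} SizeCacheEntry;

/* Dirty update: extension or truncation.  These are serialised by the
 * extension/exclusive locks, so they may always overwrite the cached size. */
static void
size_cache_dirty_update(SizeCacheEntry *e, uint32_t nblocks)
{
    e->nblocks = nblocks;
    e->dirty = true;
}

/* Clean update: a size learned from lseek().  It must never clobber a dirty
 * value, because the dirty value is newer by construction. */
static void
size_cache_clean_update(SizeCacheEntry *e, uint32_t nblocks)
{
    if (!e->dirty)
        e->nblocks = nblocks;
}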