Re: Multiple hosts in connection string failed to failover in non-hot standby mode
Hi Tom, I agree with reporting a detailed error message for each failed host, as your patch 0001 does. As for patch 0004, finding ':' after "could not connect to" may fail for an error message like "could not connect to host "localhost" (::1), port 12345: Connection refused", where p2 will point at "::1" instead of ": Connection refused". But since it's only used for the test cases, we don't need to filter the error message precisely.

```
ecpg_filter_stderr(const char *resultfile, const char *tmpfile)
{
	...
	char *p1 = strstr(linebuf.data, "could not connect to ");

	if (p1)
	{
		char *p2 = strchr(p1, ':');

		if (p2)
			memmove(p1 + 17, p2, strlen(p2) + 1);
	}
}
```

Thanks, Hubert

From: Tom Lane Sent: Monday, January 11, 2021 10:56 AM To: Hubert Zhang Cc: tsunakawa.ta...@fujitsu.com ; pgsql-hack...@postgresql.org Subject: Re: Multiple hosts in connection string failed to failover in non-hot standby mode

I wrote:
> I feel pretty good about 0001: it might be committable as-is. 0002 is
> probably subject to bikeshedding, plus it has a problem in the ECPG tests.
> Two of the error messages are now unstable because they expose
> chosen-at-random socket paths:
> ...
> I don't have any non-hacky ideas what to do about that. The extra detail
> seems useful to end users, but we don't have any infrastructure that
> would allow filtering it out in the ECPG tests.

So far the only solution that comes to mind is to introduce some infrastructure to do that filtering. 0001-0003 below are unchanged, 0004 patches up the ecpg test framework with a rather ad-hoc filtering function. I'd feel worse about this if there weren't already a very ad-hoc filtering function there ;-) This set passes check-world for me; we'll soon see what the cfbot thinks.

regards, tom lane
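The IPv6 pitfall described at the top of this message can be demonstrated in isolation. This is a minimal standalone sketch (the message text is just an example, not output from a live server): strchr() finds the first ':' after "could not connect to ", which lands inside the "(::1)" address, while strrchr() finds the ": <reason>" suffix the filter actually wants.

```c
#include <string.h>

/* Find the first ':' in a connection error line -- this is what the
 * ecpg filter does, and it can land inside an IPv6 address like (::1). */
static const char *
first_colon(const char *msg)
{
	return strchr(msg, ':');
}

/* Find the last ':' instead -- this one precedes the reason text. */
static const char *
last_colon(const char *msg)
{
	return strrchr(msg, ':');
}
```

As the thread notes, precision doesn't matter much for the test harness; this only illustrates why the naive search misfires on IPv6 addresses.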
Recv-Q buffer is filled up because bgwriter continues sending statistics to the un-launched stats collector
Hi hackers, The bgwriter process calls `pgstat_send_bgwriter` to send statistics to the stats collector, but the stats collector is not started when a standby is not in hot standby mode:

```
if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
	PgStatPID = pgstat_start();
```

This leads to the kernel Recv-Q buffer filling up, which can be observed with `netstat -du`. I think we should stop sending statistics in `pgstat_send_bgwriter` when the stats collector is not running. So here are three questions about how to fix it.

1. Is there any way to effectively check the status of the stats collector from the bgwriter process?
2. Is there any way to check that the bgwriter is running on a standby that is not in hot standby mode?
3. Are there any other processes besides the bgwriter that send statistics on a standby? Should we fix them one by one, or add a check in `pgstat_send` directly?

Thanks, Hubert Zhang
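Option 3 above could be sketched roughly as follows. This is a hypothetical stand-in, not PostgreSQL code: `collector_running` represents whatever real check ends up being available to the sender, and the point is only that messages get dropped instead of piling up in the kernel's socket buffer.

```c
#include <stdbool.h>

/* Stand-in for the real "is the stats collector alive?" check. */
static bool collector_running = false;

/* Counter standing in for the sendto() call in the real pgstat_send(). */
static int msgs_sent = 0;

/* Hypothetical sketch: drop statistics messages up front when the
 * collector is known not to be running, so nothing accumulates in the
 * receive queue of the stats UDP socket. */
static void
pgstat_send_sketch(const void *msg, int len)
{
	(void) msg;
	(void) len;

	if (!collector_running)
		return;					/* silently drop the message */

	msgs_sent++;				/* real code would sendto() here */
}
```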
Re: Multiple hosts in connection string failed to failover in non-hot standby mode
Was the primary running and accepting connections when you encountered this error? That is, if you specified host="host1 host2", host1 was the non-hot standby and host2 was a running primary? Or was only the non-hot standby running? If a primary was running, I'd say it's a bug... Perhaps the following part in libpq gives up connection attempts when the above FATAL error is returned from the server. Maybe libpq should differentiate errors using SQLSTATE and continue connection attempts on other hosts.

Yes, the primary was running, but the non-hot standby comes before the primary in the connection string. Hao Wu and I wrote a patch to fix this problem. Client-side libpq should try the other hosts in the connection string when it is rejected by a non-hot standby, or when the first host encounters network problems during the libpq handshake.

Please send emails in text format. Your email was in HTML, and I changed this reply to text format.

Thanks. Is this email in text format now? I just use Outlook in Chrome. Let me know if it is still in HTML format.

Hubert & Hao Wu

From: tsunakawa.ta...@fujitsu.com Sent: Tuesday, October 27, 2020 5:30 PM To: Hubert Zhang Cc: pgsql-hack...@postgresql.org Subject: RE: Multiple hosts in connection string failed to failover in non-hot standby mode

Please send emails in text format. Your email was in HTML, and I changed this reply to text format.

From: Hubert Zhang
> Libpq has supported to specify multiple hosts in connection string and enable
> auto failover when the previous PostgreSQL instance cannot be accessed.
> But when I tried to enable this feature for a non-hot standby, it cannot do
> the failover with the following messages.
>
> psql: error: could not connect to server: FATAL: the database system is
> starting up

Was the primary running and accepting connections when you encountered this error? That is, if you specified host="host1 host2", host1 was the non-hot standby and host2 was a running primary? Or was only the non-hot standby running?
If a primary was running, I'd say it's a bug... Perhaps the following part in libpq gives up connection attempts when the above FATAL error is returned from the server. Maybe libpq should differentiate errors using SQLSTATE and continue connection attempts on other hosts.

[fe-connect.c]
```
/* Handle errors. */
if (beresp == 'E')
{
	if (PG_PROTOCOL_MAJOR(conn->pversion) >= 3)
	...
#endif
	goto error_return;
}

/* It is an authentication request. */
conn->auth_req_received = true;

/* Get the type of request. */
```

Regards Takayuki Tsunakawa

0001-Enhance-libpq-to-support-multiple-host-for-non-hot-s.patch Description: 0001-Enhance-libpq-to-support-multiple-host-for-non-hot-s.patch
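The SQLSTATE idea suggested above could be sketched like this. This is a hedged illustration, not patch code: a real change would extract the SQLSTATE field from the ErrorResponse message inside fe-connect.c, whereas here the classification is isolated into a helper. "57P03" is ERRCODE_CANNOT_CONNECT_NOW, which the server raises for "the database system is starting up".

```c
#include <stdbool.h>
#include <string.h>

/* Decide whether a server-reported error should cause libpq to move on
 * to the next host in the connection string rather than give up.
 * 57P03 (cannot_connect_now) covers "the database system is starting up". */
static bool
should_try_next_host(const char *sqlstate)
{
	return sqlstate != NULL && strcmp(sqlstate, "57P03") == 0;
}
```

Under this scheme an authentication failure (class 28) would still abort the whole attempt, matching the documented behavior quoted later in the thread.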
Multiple hosts in connection string failed to failover in non-hot standby mode
Hi hackers, Libpq supports specifying multiple hosts in the connection string and enables automatic failover when the previous PostgreSQL instance cannot be accessed. But when I tried to use this feature with a non-hot standby, it could not do the failover, giving the following message:

psql: error: could not connect to server: FATAL: the database system is starting up

The documentation says 'If a connection is established successfully, but authentication fails, the remaining hosts in the list are not tried.' I'm wondering whether this is by design or a bug. If it's a bug, I plan to fix it.

Thanks, Hubert Zhang
Re: Print physical file path when checksum check fails
I have updated the patch based on the previous comments. Sorry for the late patch. I removed `SetZeroDamagedPageInChecksum` and added a `zeroDamagePage` flag to smgrread to determine whether we should zero the damaged page when an error happens; it depends on the caller. `GetRelationFilePath` is removed as well; we print the segno on the fly.

On Thu, Feb 20, 2020 at 2:33 PM Hubert Zhang wrote:
> Thanks,
>
> On Thu, Feb 20, 2020 at 11:36 AM Andres Freund wrote:
>
>> Hi,
>>
>> On 2020-02-19 16:48:45 +0900, Michael Paquier wrote:
>> > On Wed, Feb 19, 2020 at 03:00:54PM +0900, Kyotaro Horiguchi wrote:
>> > > I have had support requests related to broken block several times, and
>> > > (I think) most of *them* had hard time to locate the broken block or
>> > > even broken file. I don't think it is useless at all, but I'm not sure
>> > > it is worth the additional complexity.
>> >
>> > I handle stuff like that from time to time, and any reports usually
>> > go down to people knowledgeable about PostgreSQL enough to know the
>> > difference. My point is that in order to know where a broken block is
>> > physically located on disk, you need to know four things:
>> > - The block number.
>> > - The physical location of the relation.
>> > - The size of the block.
>> > - The length of a file segment.
>> > The first two items are printed in the error message, and you can
>> > guess easily the actual location (file, offset) with the two others.
>>
>> > I am not necessarily against improving the error message here, but
>> > FWIW I think that we need to consider seriously if the code
>> > complications and the maintenance cost involved are really worth
>> > saving from one simple calculation.
>>
>> I don't think it's that simple for most.
>>
>> And if we e.g. ever get the undo stuff merged, it'd get more
>> complicated, because they segment entirely differently. Similar, if we
>> ever manage to move SLRUs into the buffer pool and checksummed, it'd
>> again work differently.
>>
>> Nor is it architecturally appealing to handle checksums in multiple
>> places above the smgr layer: For one, that requires multiple places to
>> compute verify them. But also, as the way checksums are computed depends
>> on the page format etc, it'll likely change for things like undo/slru -
>> which then again will require additional smarts if done above the smgr
>> layer.
>
> So considering the undo stuff, it's better to move the checksum logic into md.c.
> I will keep it there in the new patch.
>
> On 2020-02-19 16:48:45 +0900, Michael Paquier wrote:
>> > Particularly, quickly reading through the patch, I am rather unhappy
>> > about the shape of the second patch which pushes down the segment
>> > number knowledge into relpath.c, and creates more complication around
>> > the handling of zero_damaged_pages and zero'ed pages. -- Michael
>>
>> I do not like the SetZeroDamagedPageInChecksum stuff at all however.
>
> I'm +1 on the first concern, and I will delete the newly added function
> `GetRelationFilePath` in relpath.c and append the segno directly to the
> error message inside the function `VerifyPage`.
>
> As for SetZeroDamagedPageInChecksum, the reason why I introduced it is
> that when we are doing smgrread() and one of the damaged pages fails to
> pass the checksum check, we cannot directly error out, since the caller
> of smgrread() may tolerate this error and just zero the damaged page
> plus emit a warning message.
> Also, we cannot just use the GUC zero_damaged_pages for this branch,
> since we also have a ReadBufferMode (i.e. RBM_ZERO_ON_ERROR) to control it.
>
> To get rid of SetZeroDamagedPageInChecksum, one idea is to pass a
> zero_damaged_page flag into smgrread(), something like below:
> ==
> extern void smgrread(SMgrRelation reln, ForkNumber forknum,
>                      BlockNumber blocknum, char *buffer, int flag);
> ===
>
> Any comments?
> --
> Thanks
> Hubert Zhang

--
Thanks
Hubert Zhang

0003-Print-physical-file-path-when-verify-checksum-failed.patch Description: Binary data
Re: Yet another vectorized engine
Hi Konstantin, I also ran vimdiff on nodeAgg.c in your PG13 branch against nodeAgg.c in PG's main repo. Many functions have changed from PG96 to PG13, e.g. 'advance_aggregates' and 'lookup_hash_entry'. The vectorized nodeAgg still seems to follow the PG96 way of implementing these functions. In general, I think we'd better port the PG13 executor to a vectorized PG13 executor instead of merging some PG13 code into the PG96 vectorized executor to make it work, because it's hard to determine which functions need to be merged, and it's buggy if executor code from both PG13 and PG96 exists in one branch. What's your opinion?
Re: Yet another vectorized engine
On Wed, Feb 26, 2020 at 7:59 PM Konstantin Knizhnik < k.knizh...@postgrespro.ru> wrote:
>
> On 26.02.2020 13:11, Hubert Zhang wrote:
>
>> and with JIT:
>>
>> 13.88% postgres postgres [.] tts_buffer_heap_getsomeattrs
>> 7.15% postgres vectorize_engine.so [.] vfloat8_accum
>> 6.03% postgres postgres [.] HeapTupleSatisfiesVisibility
>> 5.55% postgres postgres [.] bpchareq
>> 4.42% postgres vectorize_engine.so [.] VExecStoreColumns
>> 4.19% postgres postgres [.] hashbpchar
>> 4.09% postgres vectorize_engine.so [.] vfloat8pl
>
> I also tested Q1 with your latest code. The vectorized result is still slow.
> PG13 native: 38 secs
> PG13 Vec: 30 secs
> PG13 JIT: 23 secs
> PG13 JIT+Vec: 27 secs
>
> It is strange that your results are much slower than mine and the profile is
> very different.
> Which postgres configuration are you using?

./configure CFLAGS="-O3 -g -march=native" --prefix=/usr/local/pgsql/ --disable-cassert --enable-debug --with-llvm

I also use `PGXS := $(shell $(PG_CONFIG) --pgxs)` to compile vectorize_engine, so it shares the same compile configuration. My perf result is below. There are three parts:

> 1. lookup_hash_entry(43.5%) this part is not vectorized yet.
>
> It is vectorized in some sense: lookup_hash_entry performs the bulk of
> hash lookups and passes an array with the results of such lookups to
> aggregate transition functions.
> It will be possible to significantly increase the speed of HashAgg if we
> store data in order of the grouping attributes and use RLE (run length
> encoding) to perform just one hash lookup for a group of values. But it
> requires creation of special partitions (like it is done in Vertica and
> VOPS).

Yes, Vertica's partitions need to be pre-sorted on user-defined columns, so for TPCH Q1 on Postgres we cannot make that assumption. And my Q1 plan uses HashAgg instead of GroupAgg based on cost.

> 2. scan part: fetch_input_tuple(36%)
> 3. 
vadvance_aggregates part(20%) > I also perfed on PG96 vectorized version and got similar perf results and > running time of vectorized PG96 and PG13 are also similar. But PG13 is much > faster than PG96. So I just wonder whether we merge all the latest executor > code of PG13 into the vectorized PG13 branch? > > > Sorry, I do not understand the question. vectorize_executor contains > patched versions of nodeSeqscan and nodeAgg from standard executor. > When performing porting to PG13, I took the latest version of nodeAgg and > tried to apply your patches to it. Certainly not always it was possible and > I have to rewrite a lt of places. Concerning nodeSeqscan - I took old > version from vectorize_executor and port it to PG13. > > It is strange that I am not seeing lookup_hash_entry in profile in my > case. > > So you already have the PG13 nodeAgg, that is good. Yes, it is strange. Hash table probing is always the costly part. My perf command `perf record --call-graph dwarf -p pid` Could you share your lineitem schema and Q1 query? 
My schema and Q1 query are:

```
CREATE TABLE lineitem (
    l_orderkey      BIGINT NOT NULL,
    l_partkey       INTEGER NOT NULL,
    l_suppkey       INTEGER NOT NULL,
    l_linenumber    INTEGER NOT NULL,
    l_quantity      double precision NOT NULL,
    l_extendedprice double precision NOT NULL,
    l_discount      double precision NOT NULL,
    l_tax           double precision NOT NULL,
    l_returnflag    CHAR(1) NOT NULL,
    l_linestatus    CHAR(1) NOT NULL,
    l_shipdate      DATE NOT NULL,
    l_commitdate    DATE NOT NULL,
    l_receiptdate   DATE NOT NULL,
    l_shipinstruct  CHAR(25) NOT NULL,
    l_shipmode      CHAR(10) NOT NULL,
    l_comment       VARCHAR(44) NOT NULL
);

select
    l_returnflag,
    l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(l_discount) as count_order
from lineitem
where l_shipdate <= date '1998-12-01' - interval '106 day'
group by l_returnflag, l_linestatus;
```

-- Thanks Hubert Zhang
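The RLE idea Konstantin mentions upthread (one hash lookup per run of equal grouping keys instead of one per tuple) can be sketched in isolation. The types and names below are simplified stand-ins, not VOPS or vectorize_engine code; it assumes input already ordered on the grouping key, as stated in his message.

```c
#include <stddef.h>

/* A run of consecutive equal grouping keys: one hash lookup serves
 * all `len` tuples in the run. */
typedef struct
{
	int		key;
	size_t	len;
} Run;

/* Collapse a key-ordered column into runs; returns the number of runs.
 * `out` must have room for up to `n` entries. */
static size_t
rle_runs(const int *keys, size_t n, Run *out)
{
	size_t	nruns = 0;
	size_t	i = 0;

	while (i < n)
	{
		size_t	j = i + 1;

		while (j < n && keys[j] == keys[i])
			j++;
		out[nruns].key = keys[i];
		out[nruns].len = j - i;	/* one lookup covers j - i tuples */
		nruns++;
		i = j;
	}
	return nruns;
}
```

For Q1, where the grouping keys (l_returnflag, l_linestatus) have very few distinct values, such runs would be long, which is why the thread notes this only pays off with specially ordered (partitioned) storage.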
Re: Yet another vectorized engine
Hi Konstantin, On Tue, Feb 25, 2020 at 6:44 PM Konstantin Knizhnik < k.knizh...@postgrespro.ru> wrote: > > > On 25.02.2020 11:06, Hubert Zhang wrote: > > Hi Konstantin, > > I checkout your branch pg13 in repo > https://github.com/zhangh43/vectorize_engine > After I fixed some compile error, I tested Q1 on TPCH-10G > The result is different from yours and vectorize version is too slow. Note > that I disable parallel worker by default. > no JIT no Vectorize: 36 secs > with JIT only: 23 secs > with Vectorize only: 33 secs > JIT + Vectorize: 29 secs > > My config option is `CFLAGS='-O3 -g -march=native' > --prefix=/usr/local/pgsql/ --disable-cassert --enable-debug --with-llvm` > I will do some spike on why vectorized is so slow. Could you please > provide your compile option and the TPCH dataset size and your > queries(standard Q1?) to help me to debug on it. > > > > Hi, Hubert > > Sorry, looks like I have used slightly deteriorated snapshot of master so > I have not noticed some problems. > Fixes are committed. > > Most of the time is spent in unpacking heap tuple > (tts_buffer_heap_getsomeattrs): > > 24.66% postgres postgres [.] tts_buffer_heap_getsomeattrs >8.28% postgres vectorize_engine.so [.] VExecStoreColumns >5.94% postgres postgres [.] HeapTupleSatisfiesVisibility >4.21% postgres postgres [.] bpchareq >4.12% postgres vectorize_engine.so [.] vfloat8_accum > > > In my version of nodeSeqscan I do not keep all fetched 1024 heap tuples > but stored there attribute values in vector columns immediately. > But to avoid extraction of useless data it is necessary to know list of > used columns. > The same problem is solved in zedstore, but unfortunately there is no > existed method in Postgres to get list > of used attributes. I have done it but my last implementation contains > error which cause loading of all columns. > Fixed version is committed. > > Now profile without JIT is: > > 15.52% postgres postgres [.] 
tts_buffer_heap_getsomeattrs
> 10.25% postgres postgres [.] ExecInterpExpr
> 6.54% postgres postgres [.] HeapTupleSatisfiesVisibility
> 5.12% postgres vectorize_engine.so [.] VExecStoreColumns
> 4.86% postgres postgres [.] bpchareq
> 4.80% postgres vectorize_engine.so [.] vfloat8_accum
> 3.78% postgres postgres [.] tts_minimal_getsomeattrs
> 3.66% postgres vectorize_engine.so [.] VExecAgg
> 3.38% postgres postgres [.] hashbpchar
>
> and with JIT:
>
> 13.88% postgres postgres [.] tts_buffer_heap_getsomeattrs
> 7.15% postgres vectorize_engine.so [.] vfloat8_accum
> 6.03% postgres postgres [.] HeapTupleSatisfiesVisibility
> 5.55% postgres postgres [.] bpchareq
> 4.42% postgres vectorize_engine.so [.] VExecStoreColumns
> 4.19% postgres postgres [.] hashbpchar
> 4.09% postgres vectorize_engine.so [.] vfloat8pl

I also tested Q1 with your latest code. The vectorized result is still slow.

PG13 native: 38 secs
PG13 Vec: 30 secs
PG13 JIT: 23 secs
PG13 JIT+Vec: 27 secs

My perf result is below. There are three parts:
1. lookup_hash_entry (43.5%); this part is not vectorized yet.
2. scan part: fetch_input_tuple (36%)
3. vadvance_aggregates part (20%)

I also perfed the PG96 vectorized version and got similar perf results, and the running times of vectorized PG96 and PG13 are also similar. But PG13 is much faster than PG96. So I wonder whether we should merge all the latest executor code of PG13 into the vectorized PG13 branch? 
```
- agg_fill_hash_table
  - 43.50% lookup_hash_entry (inlined)
    + 39.07% LookupTupleHashEntry
      0.56% ExecClearTuple (inlined)
  - 36.06% fetch_input_tuple
    - ExecProcNode (inlined)
      - 36.03% VExecScan
        - 34.60% ExecScanFetch (inlined)
          - ExecScanFetch (inlined)
            - VSeqNext
              + 16.64% table_scan_getnextslot (inlined)
              - 10.29% slot_getsomeattrs (inlined)
                - 10.17% slot_getsomeattrs_int
                  + tts_buffer_heap_getsomeattrs
              7.14% VExecStoreColumns
        + 1.38% ExecQual (inlined)
  - 20.30% Vadvance_aggregates (inlined)
    - 17.46% Vadvance_transition_function (inlined)
      + 11.95% vfloat8_accum
      + 4.74% vfloat8pl
        0.75% vint8inc_any
    + 2.77% ExecProject (inlined)
```

-- Thanks Hubert Zhang
Re: Yet another vectorized engine
Hi Konstantin, I checked out your branch pg13 in the repo https://github.com/zhangh43/vectorize_engine. After I fixed some compile errors, I tested Q1 on TPCH-10G. The result is different from yours, and the vectorized version is too slow. Note that I disabled parallel workers by default.

no JIT no Vectorize: 36 secs
with JIT only: 23 secs
with Vectorize only: 33 secs
JIT + Vectorize: 29 secs

My config options are `CFLAGS='-O3 -g -march=native' --prefix=/usr/local/pgsql/ --disable-cassert --enable-debug --with-llvm`. I will investigate why the vectorized version is so slow. Could you please provide your compile options, the TPCH dataset size, and your queries (standard Q1?) to help me debug it.

On Mon, Feb 24, 2020 at 8:43 PM Hubert Zhang wrote: > Hi Konstantin, > I have added you as a collaborator on github. Please accepted and try > again. > I think non collaborator could also open pull requests. > > On Mon, Feb 24, 2020 at 8:02 PM Konstantin Knizhnik < > k.knizh...@postgrespro.ru> wrote: > >> >> >> On 24.02.2020 05:08, Hubert Zhang wrote: >> >> Hi >> >> On Sat, Feb 22, 2020 at 12:58 AM Konstantin Knizhnik < >> k.knizh...@postgrespro.ru> wrote: >> >>> >>> >>> On 12.02.2020 13:12, Hubert Zhang wrote: >>> >>> On Tue, Feb 11, 2020 at 1:20 AM Konstantin Knizhnik < >>> k.knizh...@postgrespro.ru> wrote: >>> >>>> >>>> So looks like PG-13 provides significant advantages in OLAP queries >>>> comparing with 9.6! >>>> Definitely it doesn't mean that vectorized executor is not needed for >>>> new version of Postgres. >>>> Once been ported, I expect that it should provide comparable >>>> improvement of performance. >>>> >>>> But in any case I think that vectorized executor makes sense only been >>>> combine with columnar store. >>>> >>> >>> Thanks for the test. +1 on vectorize should be combine with columnar >>> store. I think when we support this extension >>> on master, we could try the new zedstore. >>> I'm not active on this work now, but will continue when I have time. 
>>> Feel free to join bring vops's feature into this extension. >>> >>> Thanks >>> >>> Hubert Zhang >>> >>> >>> I have ported vectorize_engine to the master. >>> It takes longer than I expected: a lot of things were changed in >>> executor. >>> >>> Results are the following: >>> >>> >>> par.warkers >>> PG9_6 >>> vectorize=off >>> PG9_6 >>> vectorize=on >>> master >>> vectorize=off >>> jit=on >>> master >>> vectorize=off >>> jit=off master >>> vectorize=on >>> jit=ofn master >>> vectorize=on >>> jit=off >>> 0 >>> 36 >>> 20 >>> 16 >>> 25.5 >>> 15 >>> 17.5 >>> 4 >>> 10 >>> - >>> 5 7 >>> - >>> - >>> >>> So it proves the theory that JIT provides almost the same speedup as >>> vector executor (both eliminates interpretation overhead but in different >>> way). >>> I still not sure that we need vectorized executor: because with standard >>> heap it provides almost no improvements comparing with current JIT version. >>> But in any case I am going to test it with vertical storage (zedstore or >>> cstore). >>> >>> >> Thanks for the porting and testing. >> Yes, PG master and 9.6 have many changes, not only executor, but also >> tupletableslot interface. >> >> What matters the performance of JIT and Vectorization is its >> implementation. This is just the beginning of vectorization work, just as >> your vops extension reported, vectorization could run 10 times faster in >> PG. With the overhead of row storage(heap), we may not reach that speedup, >> but I think we could do better. Also +1 on vertical storage. >> >> BTW, welcome to submit your PR for the PG master version. >> >> >> >> Sorry, but I have no permissions to push changes to your repository. >> I can certainly create my own fork of vectorize_engine, but I think it >> will be beter if I push pg13 branch in your repository. >> >> >> > > -- > Thanks > > Hubert Zhang > -- Thanks Hubert Zhang
Re: Yet another vectorized engine
Hi Konstantin, I have added you as a collaborator on github. Please accept and try again. I think non-collaborators can also open pull requests.

On Mon, Feb 24, 2020 at 8:02 PM Konstantin Knizhnik < k.knizh...@postgrespro.ru> wrote: > > > On 24.02.2020 05:08, Hubert Zhang wrote: > > Hi > > On Sat, Feb 22, 2020 at 12:58 AM Konstantin Knizhnik < > k.knizh...@postgrespro.ru> wrote: > >> >> >> On 12.02.2020 13:12, Hubert Zhang wrote: >> >> On Tue, Feb 11, 2020 at 1:20 AM Konstantin Knizhnik < >> k.knizh...@postgrespro.ru> wrote: >> >>> >>> So looks like PG-13 provides significant advantages in OLAP queries >>> comparing with 9.6! >>> Definitely it doesn't mean that vectorized executor is not needed for >>> new version of Postgres. >>> Once been ported, I expect that it should provide comparable >>> improvement of performance. >>> >>> But in any case I think that vectorized executor makes sense only been >>> combine with columnar store. >>> >> >> Thanks for the test. +1 on vectorize should be combine with columnar >> store. I think when we support this extension >> on master, we could try the new zedstore. >> I'm not active on this work now, but will continue when I have time. Feel >> free to join bring vops's feature into this extension. >> >> Thanks >> >> Hubert Zhang >> >> >> I have ported vectorize_engine to the master. >> It takes longer than I expected: a lot of things were changed in executor. >> >> Results are the following:
>>
>> par.workers | PG9_6 vec=off | PG9_6 vec=on | master vec=off jit=on | master vec=off jit=off | master vec=on jit=on | master vec=on jit=off
>> 0           | 36            | 20           | 16                    | 25.5                   | 15                   | 17.5
>> 4           | 10            | -            | 5                     | 7                      | -                    | -
>>
>> So it proves the theory that JIT provides almost the same speedup as >> vector executor (both eliminates interpretation overhead but in different >> way). 
>> I still not sure that we need vectorized executor: because with standard >> heap it provides almost no improvements comparing with current JIT version. >> But in any case I am going to test it with vertical storage (zedstore or >> cstore). >> >> > Thanks for the porting and testing. > Yes, PG master and 9.6 have many changes, not only executor, but also > tupletableslot interface. > > What matters the performance of JIT and Vectorization is its > implementation. This is just the beginning of vectorization work, just as > your vops extension reported, vectorization could run 10 times faster in > PG. With the overhead of row storage(heap), we may not reach that speedup, > but I think we could do better. Also +1 on vertical storage. > > BTW, welcome to submit your PR for the PG master version. > > > > Sorry, but I have no permissions to push changes to your repository. > I can certainly create my own fork of vectorize_engine, but I think it > will be beter if I push pg13 branch in your repository. > > > -- Thanks Hubert Zhang
Re: Yet another vectorized engine
Hi On Sat, Feb 22, 2020 at 12:58 AM Konstantin Knizhnik < k.knizh...@postgrespro.ru> wrote: > > > On 12.02.2020 13:12, Hubert Zhang wrote: > > On Tue, Feb 11, 2020 at 1:20 AM Konstantin Knizhnik < > k.knizh...@postgrespro.ru> wrote: > >> >> So looks like PG-13 provides significant advantages in OLAP queries >> comparing with 9.6! >> Definitely it doesn't mean that vectorized executor is not needed for new >> version of Postgres. >> Once been ported, I expect that it should provide comparable improvement >> of performance. >> >> But in any case I think that vectorized executor makes sense only been >> combine with columnar store. >> > > Thanks for the test. +1 on vectorize should be combine with columnar > store. I think when we support this extension > on master, we could try the new zedstore. > I'm not active on this work now, but will continue when I have time. Feel > free to join bring vops's feature into this extension. > > Thanks > > Hubert Zhang > > > I have ported vectorize_engine to the master. > It takes longer than I expected: a lot of things were changed in executor. > > Results are the following:
>
> par.workers | PG9_6 vec=off | PG9_6 vec=on | master vec=off jit=on | master vec=off jit=off | master vec=on jit=on | master vec=on jit=off
> 0           | 36            | 20           | 16                    | 25.5                   | 15                   | 17.5
> 4           | 10            | -            | 5                     | 7                      | -                    | -
>
> So it proves the theory that JIT provides almost the same speedup as > vector executor (both eliminates interpretation overhead but in different > way). > I still not sure that we need vectorized executor: because with standard > heap it provides almost no improvements comparing with current JIT version. > But in any case I am going to test it with vertical storage (zedstore or > cstore). > >

Thanks for the porting and testing. Yes, PG master and 9.6 have many changes, not only in the executor but also in the TupleTableSlot interface. What matters for the performance of JIT and vectorization is the implementation. 
This is just the beginning of the vectorization work; as your vops extension reported, vectorization can run 10 times faster in PG. With the overhead of row storage (heap), we may not reach that speedup, but I think we can do better. Also +1 on vertical storage. BTW, welcome to submit your PR for the PG master version. -- Thanks Hubert Zhang
Re: Print physical file path when checksum check fails
Thanks Kyotaro, On Wed, Feb 19, 2020 at 2:02 PM Kyotaro Horiguchi wrote: > At Wed, 19 Feb 2020 13:28:04 +0900, Michael Paquier > wrote in > > On Wed, Feb 19, 2020 at 01:07:36PM +0900, Kyotaro Horiguchi wrote: > > > If we also verify checksum in md layer, callback is overkill since the > > > immediate caller consumes the event immediately. We can signal the > > > error by somehow returning a file tag. > > > > FWIW, I am wondering if there is any need for a change here and > > complicate more the code. If you know the block number, the page size > > and the segment file size you can immediately guess where is the > > damaged block. The first information is already part of the error > > I have had support requests related to broken block several times, and > (I think) most of *them* had hard time to locate the broken block or > even broken file. I don't think it is useless at all, but I'm not sure > it is worth the additional complexity. > > > damaged block. The first information is already part of the error > > message, and the two other ones are constants defined at > > compile-time. > > May you have misread the snippet? > > What Hubert proposed is: > > "invalid page in block %u of relation file %s; zeroing out page", > blkno, > > The second format in my messages just before is: > "invalid page in block %u in relation %u, file \"%s\"", > blockNum, smgr->smgr_rnode.node.relNode, smgrfname() > > All of them are not compile-time constant at all. >

I like your error message; the block number is relation-level, not file-level. I'll change the error message to "invalid page in block %u of relation %u, file %s". -- Thanks Hubert Zhang
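The revised message could be assembled as below. This is only a formatting sketch: the actual patch would report it through ereport() in the checksum-verification path, and the OID and segment path here are the example values from the original report, not real catalog state.

```c
#include <stdio.h>

/* Build the proposed checksum-failure message: block number is
 * relation-level, plus the relation OID and the physical segment file. */
static int
format_checksum_error(char *buf, size_t buflen,
					  unsigned blkno, unsigned reloid, const char *filepath)
{
	return snprintf(buf, buflen,
					"invalid page in block %u of relation %u, file \"%s\"",
					blkno, reloid, filepath);
}
```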
Re: Print physical file path when checksum check fails
Thanks, On Thu, Feb 20, 2020 at 11:36 AM Andres Freund wrote: > Hi, > > On 2020-02-19 16:48:45 +0900, Michael Paquier wrote: > > On Wed, Feb 19, 2020 at 03:00:54PM +0900, Kyotaro Horiguchi wrote: > > > I have had support requests related to broken block several times, and > > > (I think) most of *them* had hard time to locate the broken block or > > > even broken file. I don't think it is useles at all, but I'm not sure > > > it is worth the additional complexity. > > > > I handle stuff like that from time to time, and any reports usually > > go down to people knowledgeable about PostgreSQL enough to know the > > difference. My point is that in order to know where a broken block is > > physically located on disk, you need to know four things: > > - The block number. > > - The physical location of the relation. > > - The size of the block. > > - The length of a file segment. > > The first two items are printed in the error message, and you can > > guess easily the actual location (file, offset) with the two others. > > > I am not necessarily against improving the error message here, but > > FWIW I think that we need to consider seriously if the code > > complications and the maintenance cost involved are really worth > > saving from one simple calculation. > > I don't think it's that simple for most. > > And if we e.g. ever get the undo stuff merged, it'd get more > complicated, because they segment entirely differently. Similar, if we > ever manage to move SLRUs into the buffer pool and checksummed, it'd > again work differently. > > Nor is it architecturally appealing to handle checksums in multiple > places above the smgr layer: For one, that requires multiple places to > compute verify them. But also, as the way checksums are computed depends > on the page format etc, it'll likely change for things like undo/slru - > which then again will require additional smarts if done above the smgr > layer. 
So considering the undo stuff, it's better to move the checksum logic into md.c. I will keep it that way in the new patch. On 2020-02-19 16:48:45 +0900, Michael Paquier wrote: > Particularly, quickly reading through the patch, I am rather unhappy > > about the shape of the second patch which pushes down the segment > > number knowledge into relpath.c, and creates more complication around > > the handling of zero_damaged_pages and zero'ed pages. -- Michael > > I do not like the SetZeroDamagedPageInChecksum stuff at all however. > > I'm +1 on the first concern, and I will delete the newly added function `GetRelationFilePath` in relpath.c and append the segno directly in the error message inside `VerifyPage`. As for SetZeroDamagedPageInChecksum, the reason why I introduced it is that when we are doing smgrread() and a damaged page fails the checksum check, we cannot directly error out, since the caller of smgrread() may tolerate this error and just zero the damaged page and emit a warning message. Also, we cannot just use the GUC zero_damaged_pages to drive this branch, since we also have ReadBufferMode (i.e. RBM_ZERO_ON_ERROR) to control it. To get rid of SetZeroDamagedPageInChecksum, one idea is to pass a zero_damaged_page flag into smgrread(), something like below:

```
extern void smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, char *buffer, int flag);
```

Any comments? -- Thanks Hubert Zhang
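To make the proposed control flow concrete, here is a minimal standalone sketch (not the actual patch; the function names and the corruption flag are illustrative) of how a bufmgr-level caller could react to a checksum-failure status returned from the md layer instead of having md.c raise an immediate ERROR:

```c
#include <string.h>
#include <stdbool.h>

#define BLCKSZ 8192

typedef enum { RBM_NORMAL, RBM_ZERO_ON_ERROR } ReadBufferMode;

/* Stand-in for the md-layer read: fills the buffer and reports whether
 * checksum verification passed, rather than erroring out itself. */
static bool
md_read_page(char *buffer, bool simulate_corruption)
{
    memset(buffer, simulate_corruption ? 0x7f : 0x00, BLCKSZ);
    return !simulate_corruption;     /* false => checksum failed */
}

/* Stand-in for the bufmgr-level caller: it owns the RBM_ZERO_ON_ERROR
 * decision, so the zero-damaged-page policy stays out of md.c. */
static bool
read_buffer(char *buffer, ReadBufferMode mode, bool simulate_corruption)
{
    if (!md_read_page(buffer, simulate_corruption))
    {
        if (mode != RBM_ZERO_ON_ERROR)
            return false;            /* a real backend would ERROR here */
        memset(buffer, 0, BLCKSZ);   /* tolerate: zero the damaged page */
    }
    return true;
}
```

This keeps the verification result as data flowing up to the caller, which is the shape the flag-passing idea above aims for.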
Re: Print physical file path when checksum check fails
On Wed, Feb 12, 2020 at 5:22 PM Hubert Zhang wrote: > Thanks Andres, > > On Tue, Feb 11, 2020 at 5:30 AM Andres Freund wrote: > >> HHi, >> >> On 2020-02-10 16:04:21 +0800, Hubert Zhang wrote: >> > Currently we only print block number and relation path when checksum >> check >> > fails. See example below: >> > >> > ERROR: invalid page in block 333571 of relation base/65959/656195 >> >> > DBA complains that she needs additional work to calculate which physical >> > file is broken, since one physical file can only contain `RELSEG_SIZE` >> > number of blocks. For large tables, we need to use many physical files >> with >> > additional suffix, e.g. 656195.1, 656195.2 ... >> > >> > Is that a good idea to also print the physical file path in error >> message? >> > Like below: >> > >> > ERROR: invalid page in block 333571 of relation base/65959/656195, file >> > path base/65959/656195.2 >> >> I think that'd be a nice improvement. But: >> >> I don't think the way you did it is right architecturally. The >> segmenting is really something that lives within md.c, and we shouldn't >> further expose it outside of that. And e.g. the undo patchset uses files >> with different segmentation - but still goes through bufmgr.c. >> >> I wonder if this partially signals that the checksum verification piece >> is architecturally done in the wrong place currently? It's imo not good >> that every place doing an smgrread() needs to separately verify >> checksums. OTOH, it doesn't really belong inside smgr.c. >> >> >> This layering issue was also encountered in >> >> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=3eb77eba5a51780d5cf52cd66a9844cd4d26feb0 >> so perhaps we should work to reuse the FileTag it introduces to >> represent segments, without hardcoding the specific segment size? >> >> > I checked the FileTag commit. It calls `register_xxx_segment` inside md.c > to store the sync request into a hashtable and used by checkpointer later. > > Checksum verify is simpler. 
We could move the `PageIsVerified` into md.c > (mdread). But we cannot elog an error inside md.c because read buffer mode > RBM_ZERO_ON_ERROR is at the bufmgr.c level. > > One idea is to save the error message (or the FileTag) somewhere (e.g. a > static string in bufmgr.c) by calling `register_checksum_failure` inside > mdread in md.c. > > As for your concern about the need to do checksum verification after every > smgrread, we now move the checksum verify logic into md.c, but we still > need to check the checksum verify result after smgrread and reset the buffer to > zero if the mode is RBM_ZERO_ON_ERROR. > > If this idea is OK, I will submit the new PR. > > Based on Andres's comments, here is the new patch, which moves the checksum verify logic into mdread() instead of calling PageIsVerified in every smgrread(). It also uses FileTag to print the physical file name when checksum verification fails, handling segmenting inside md.c as well. -- Thanks Hubert Zhang 0002-Print-physical-file-path-when-verify-checksum-failed.patch Description: Binary data
Re: Yet another vectorized engine
On Tue, Feb 11, 2020 at 1:20 AM Konstantin Knizhnik < k.knizh...@postgrespro.ru> wrote: > > So looks like PG-13 provides significant advantages in OLAP queries > comparing with 9.6! > Definitely it doesn't mean that vectorized executor is not needed for new > version of Postgres. > Once been ported, I expect that it should provide comparable improvement > of performance. > > But in any case I think that vectorized executor makes sense only been > combine with columnar store. > Thanks for the test. +1 on combining the vectorized executor with a columnar store. I think when we support this extension on master, we could try the new zedstore. I'm not active on this work now, but will continue when I have time. Feel free to join in bringing VOPS's features into this extension. Thanks Hubert Zhang
Re: Print physical file path when checksum check fails
Thanks Andres, On Tue, Feb 11, 2020 at 5:30 AM Andres Freund wrote: > Hi, > > On 2020-02-10 16:04:21 +0800, Hubert Zhang wrote: > > Currently we only print block number and relation path when checksum > check > > fails. See example below: > > > > ERROR: invalid page in block 333571 of relation base/65959/656195 > > > DBA complains that she needs additional work to calculate which physical > > file is broken, since one physical file can only contain `RELSEG_SIZE` > > number of blocks. For large tables, we need to use many physical files > with > > additional suffix, e.g. 656195.1, 656195.2 ... > > > > Is that a good idea to also print the physical file path in error > message? > > Like below: > > > > ERROR: invalid page in block 333571 of relation base/65959/656195, file > > path base/65959/656195.2 > > I think that'd be a nice improvement. But: > > I don't think the way you did it is right architecturally. The > segmenting is really something that lives within md.c, and we shouldn't > further expose it outside of that. And e.g. the undo patchset uses files > with different segmentation - but still goes through bufmgr.c. > > I wonder if this partially signals that the checksum verification piece > is architecturally done in the wrong place currently? It's imo not good > that every place doing an smgrread() needs to separately verify > checksums. OTOH, it doesn't really belong inside smgr.c. > > > This layering issue was also encountered in > > https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=3eb77eba5a51780d5cf52cd66a9844cd4d26feb0 > so perhaps we should work to reuse the FileTag it introduces to > represent segments, without hardcoding the specific segment size? > > I checked the FileTag commit. It calls `register_xxx_segment` inside md.c to store the sync request into a hashtable that is used by the checkpointer later. Checksum verification is simpler. We could move `PageIsVerified` into md.c (mdread).
But we cannot elog an error inside md.c, because the read buffer mode RBM_ZERO_ON_ERROR lives at the bufmgr.c level. One idea is to save the error message (or the FileTag) somewhere (e.g. a static string in bufmgr.c) by calling `register_checksum_failure` inside mdread in md.c. As for your concern about the need to do checksum verification after every smgrread: we now move the checksum verify logic into md.c, but we still need to check the checksum verify result after smgrread and reset the buffer to zero if the mode is RBM_ZERO_ON_ERROR. If this idea is OK, I will submit the new PR. Thanks Hubert Zhang
Print physical file path when checksum check fails
Hi hacker, Currently we only print block number and relation path when checksum check fails. See example below: ERROR: invalid page in block 333571 of relation base/65959/656195 DBA complains that she needs additional work to calculate which physical file is broken, since one physical file can only contain `RELSEG_SIZE` number of blocks. For large tables, we need to use many physical files with additional suffix, e.g. 656195.1, 656195.2 ... Is that a good idea to also print the physical file path in error message? Like below: ERROR: invalid page in block 333571 of relation base/65959/656195, file path base/65959/656195.2 Patch is attached. -- Thanks Hubert Zhang 0001-Print-physical-file-path-when-checksum-check-fails.patch Description: Binary data
Re: Yet another vectorized engine
Thanks Konstantin, Your suggestions are very helpful. I have added them into issues of vectorize_engine repo https://github.com/zhangh43/vectorize_engine/issues On Wed, Dec 4, 2019 at 10:08 PM Konstantin Knizhnik < k.knizh...@postgrespro.ru> wrote: > > > On 04.12.2019 12:13, Hubert Zhang wrote: > > 3. Why you have to implement your own plan_tree_mutator and not using > expression_tree_mutator? > > I also want to replace plan node, e.g. Agg->CustomScan(with VectorAgg > implementation). expression_tree_mutator cannot be used to mutate plan node > such as Agg, am I right? > > > O, sorry, I see. > > > >> 4. As far as I understand you now always try to replace SeqScan with your >> custom vectorized scan. But it makes sense only if there are quals for this >> scan or aggregation is performed. >> In other cases batch+unbatch just adds extra overhead, doesn't it? >> > Probably extra overhead for heap format and query like 'select i from t;' > without qual, projection, aggregation. > But with column store, VectorScan could directly read batch, and no > additional batch cost. Column store is the better choice for OLAP queries. > > > Generally, yes. > But will it be true for the query with a lot of joins? > > select * from T1 join T2 on (T1.pk=T2.fk) join T3 on (T2.pk=T3.fk) join T4 > ... > > How can batching improve performance in this case? > Also if query contains LIMIT clause or cursors, then batching can cause > fetching of useless records (which never will be requested by client). > > Can we conclude that it would be better to use vector engine for OLAP > queries and row engine for OLTP queries. > > 5. Throwing and catching exception for queries which can not be vectorized >> seems to be not the safest and most efficient way of handling such cases. >> May be it is better to return error code in plan_tree_mutator and >> propagate this error upstairs? 
> > > Yes, as for efficiency, another way is to enable some plan node to be > vectorized and leave other nodes not vectorized and add batch/unbatch layer > between them(Is this what you said "propagate this error upstairs"). As you > mentioned, this could introduce additional overhead. Is there any other > good approaches? > What do you mean by not safest? > PG catch will receive the ERROR, and fallback to the original > non-vectorized plan. > > > The problem with catching and ignoring exception was many times discussed > in hackers. > Unfortunately Postgres PG_TRY/PG_CATCH mechanism is not analog of > exception mechanism in more high level languages, like C++, Java... > It doesn't perform stack unwind. If some resources (files, locks, > memory,...) were obtained before throwing error, then them are not > reclaimed. > Only rollback of transaction is guaranteed to release all resources. And > it actually happen in case of normal error processing. > But if you catch and ignore exception , trying to continue execution, then > it can cause many problems. > > May be in your case it is not a problem, because you know for sure where > error can happen: it is thrown by plan_tree_mutator > and looks like there are no resources obtained by this function. But in > any case overhead of setjmp is much higher than of explicit checks of > return code. > So checking return codes will not actually add some noticeable overhead > except code complication by adding extra checks. > But in can be hidden in macros which are used in any case (like MUTATE). > > > 7. How vectorized scan can be combined with parallel execution (it is >> already supported in9.6, isn't it?) >> > > We didn't implement it yet. But the idea is the same as non parallel one. > Copy the current parallel scan and implement vectorized Gather, keeping > their interface to be VectorTupleTableSlot. 
> Our basic idea to reuse most of the current PG executor logic, and make > them vectorized, then tuning performance gradually. > > > Parallel scan is scattering pages between parallel workers. > To fill VectorTupleSlot with data you may need more than one page (unless > you make a decision that it can fetch tuples only from single page). > So it should be somehow take in account specific of parallel search. > Also there is special nodes for parallel search so if we want to provide > parallel execution for vectorized operations we need also to substitute > this nodes with > custom nodes. > > -- > Konstantin Knizhnik > Postgres Professional: http://www.postgrespro.com > <https://urldefense.proofpoint.com/v2/url?u=http-3A__www.postgrespro.com=DwMDaQ=lnl9vOaLMzsy2niBC8-h_K-7QJuNJEsFrzdndhuJ3Sw=lz-kpGdw_rtpgYV2ho3DjDSB5Psxis_b-3VZKON7K7c=vdzzVhvy3WXoHG6U6a8YqBZnVe-7lCDU5SzNWwPDxSM=0TXQmqH_G8_Nao7F_n5m-ekne2NfeaJJPCaRkH_4_ME=> > The Russian Postgres Company > > -- Thanks Hubert Zhang
Re: Yet another vectorized engine
Thanks Konstantin for your detailed review! On Tue, Dec 3, 2019 at 5:58 PM Konstantin Knizhnik < k.knizh...@postgrespro.ru> wrote: > > > On 02.12.2019 4:15, Hubert Zhang wrote: > > > The prototype extension is at https://github.com/zhangh43/vectorize_engine > > > I am very sorry, that I have no followed this link. > Few questions concerning your design decisions: > > 1. Will it be more efficient to use native arrays in vtype instead of > array of Datum? I think it will allow compiler to generate more efficient > code for operations with float4 and int32 types. > It is possible to use union to keep fixed size of vtype. Yes, I'm also considering that: when scanning a column store, the column batch is loaded into a contiguous memory region. For int32, the size of this region is 4*BATCHSIZE, while for int16 it is 2*BATCHSIZE. So with a native array, a single memcpy could fill the vtype batch. > 2. Why VectorTupleSlot contains array (batch) of heap tuples rather than > vectors (array of vtype)? > a. VectorTupleSlot stores the array of vtype in the tts_values field, which reduces code changes and lets us reuse functions like ExecProject. Of course we could use a separate field to store vtypes. b. VectorTupleSlot also contains an array of heap tuples. This is used for heap tuple deforming. In fact, the tuples in a batch may span many pages, so we also need to pin an array of related pages instead of just one page. 3. Why you have to implement your own plan_tree_mutator and not using > expression_tree_mutator? > I also want to replace plan nodes, e.g. Agg->CustomScan (with a VectorAgg implementation). expression_tree_mutator cannot be used to mutate plan nodes such as Agg, am I right? > 4. As far as I understand you now always try to replace SeqScan with your > custom vectorized scan. But it makes sense only if there are quals for this > scan or aggregation is performed. > In other cases batch+unbatch just adds extra overhead, doesn't it? 
Probably there is extra overhead for the heap format and a query like 'select i from t;' without quals, projection, or aggregation. But with a column store, VectorScan could directly read a batch, with no additional batching cost. A column store is the better choice for OLAP queries. Can we conclude that it would be better to use the vector engine for OLAP queries and the row engine for OLTP queries? 5. Throwing and catching exception for queries which can not be vectorized > seems to be not the safest and most efficient way of handling such cases. > May be it is better to return error code in plan_tree_mutator and > propagate this error upstairs? Yes, as for efficiency, another way is to enable some plan nodes to be vectorized, leave other nodes non-vectorized, and add a batch/unbatch layer between them (is this what you mean by "propagate this error upstairs"?). As you mentioned, this could introduce additional overhead. Are there any other good approaches? What do you mean by "not safest"? PG_CATCH will receive the ERROR, and we fall back to the original non-vectorized plan. > 6. Have you experimented with different batch size? I have done similar > experiments in VOPS and find out that tile size larger than 128 are not > providing noticable increase of performance. > You are currently using batch size 1024 which is significantly larger than > typical amount of tuples on one page. > Good point. We will do some experiments on it. 7. How vectorized scan can be combined with parallel execution (it is > already supported in9.6, isn't it?) > We didn't implement it yet. But the idea is the same as the non-parallel one: copy the current parallel scan and implement a vectorized Gather, keeping VectorTupleTableSlot as their interface. Our basic idea is to reuse most of the current PG executor logic, make it vectorized, and then tune performance gradually. -- Thanks Hubert Zhang
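Konstantin's suggestion 1 (native arrays inside a union, keeping vtype at a fixed size) could look roughly like the following standalone sketch; the field and function names are illustrative, not the extension's actual definitions:

```c
#include <stdbool.h>

#define BATCHSIZE 1024

/* Fixed-size vtype using a union of native arrays instead of Datum[],
 * per Konstantin's suggestion.  Loading a column batch then becomes a
 * single memcpy into the matching union member. */
typedef struct vtype
{
    int  dim;                      /* number of valid entries in the batch */
    bool isnull[BATCHSIZE];
    union
    {
        short  i16[BATCHSIZE];
        int    i32[BATCHSIZE];
        float  f4[BATCHSIZE];
        double f8[BATCHSIZE];
    } values;                      /* one contiguous native array per batch */
} vtype;

/* With native arrays the inner loop needs no per-element Datum
 * conversion, so the compiler is free to auto-vectorize it. */
static float
vfloat4_sum(const vtype *v)
{
    float s = 0.0f;
    int   i;

    for (i = 0; i < v->dim; i++)
        if (!v->isnull[i])
            s += v->values.f4[i];
    return s;
}
```

The union keeps sizeof(vtype) independent of the element type, which is the property that makes batch allocation and reuse simple.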
Re: Yet another vectorized engine
On Sun, Dec 1, 2019 at 10:05 AM Michael Paquier wrote: > On Thu, Nov 28, 2019 at 05:23:59PM +0800, Hubert Zhang wrote: > > Note that the vectorized executor engine is based on PG9.6 now, but it > > could be ported to master / zedstore with some effort. We would > appreciate > > some feedback before moving further in that direction. > > There has been no feedback yet, unfortunately. The patch does not > apply anymore, so a rebase is necessary. For now I am moving the > patch to next CF, waiting on author. > -- > Michael > Thanks, we'll rebase and resubmit the patch. -- Thanks Hubert Zhang
Re: Yet another vectorized engine
Hi Konstantin, Thanks for your reply. On Fri, Nov 29, 2019 at 12:09 AM Konstantin Knizhnik < k.knizh...@postgrespro.ru> wrote: > On 28.11.2019 12:23, Hubert Zhang wrote: > > We just want to introduce another POC for vectorized execution engine > https://github.com/zhangh43/vectorize_engine and want to get some > feedback on the idea. > > But I do not completely understand why you are proposing to implement it > as extension. > Yes, custom nodes makes it possible to provide vector execution without > affecting Postgres core. > But for efficient integration of zedstore and vectorized executor we need > to extend table-AM (VectorTupleTableSlot and correspondent scan functions). > Certainly it is easier to contribute vectorized executor as extension, but > sooner or later I think it should be added to Postgres core. > > As far as I understand you already have some prototype implementation > (otherwise how you got the performance results)? > If so, are you planning to publish it or you think that executor should be > developed from scratch? > The prototype extension is at https://github.com/zhangh43/vectorize_engine I agree the vectorized executor should be added to Postgres core someday. But it is such a huge feature: it needs changes not only to the extended table-AM you mentioned but also to every executor node, such as Agg, Join, Sort etc. What's more, we need to supply vectorized versions of the expression evaluation functions and of the aggregates' transition and combine functions. Hence, we implement it as an extension first, and if it proves popular among the community and stable, we could merge it into Postgres core whenever we want. We do want to get some feedback from the community about CustomScan. CustomScan is just an abstract layer. It's typically used to support user-defined scan nodes, but some other PG extensions (e.g. PG-Strom) have already used it as a general custom node, e.g. Agg, Join etc. 
Since the vectorized engine needs to support vectorized processing in all executor nodes, following the above idea, our choice is to use CustomScan. > Some my concerns based on VOPS experience: > > 1. Vertical (columnar) model is preferable for some kind of queries, but > there are some classes of queries for which it is less efficient. > Moreover, data is used to be imported in the database in row format. > Inserting it in columnar store record-by-record is very inefficient. > So you need some kind of bulk loader which will be able to buffer input > data before loading it in columnar store. > Actually this problem it is more related with data model rather than > vectorized executor. But what I want to express here is that it may be > better to have both representation (horizontal and vertical) > and let optimizer choose most efficient one for particular query. > > Yes, in general, row format is better for OLTP queries and column format is better for OLAP queries. As for the storage type (or data model), I think the DBA should choose row or column store for each specific table. As for the executor, it's a good idea to let the optimizer choose based on cost. That is a long-term goal; for now our extension falls back to the original row executor for Insert, Update, and IndexScan cases in a rough way. We want the extension to be enhanced in a gradual way. > 2. Columnar store and vectorized executor are most efficient for query > like "select sum(x) from T where ...". > Unfortunately such simple queries are rarely used in real life. Usually > analytic queries contain group-by and joins. > And here vertical model is not always optimal (you have to reconstruct > rows from columns to perform join or grouping). > To provide efficient execution of queries you may need to create multiple > different projections of the same data (sorted by different subset of > attributes). > This is why Vertica (one of the most popular columnar store DBMS) is > supporting projections. 
> The same can be done in VOPS: using create_projection function you can > specify which attributes should be scalar (grouping attributes) and which > vectorized. > In this case you can perform grouping and joins using standard Postgres > executor, while perform vectorized operations for filtering and > accumulating aggregates. > > This is why Q1 is 20 times faster in VOPS and not 2 times as in your > prototype. > So I think that columnar store should make it possible to maintain several > projections of table and optimizer should be able to automatically choose > one of them for particular query. > Definitely synchronization of projections is challenged problem. > Fortunately OLAP usually not require most recent data. > Projection in Vertica is useful. I tested, VOPS is really faster. It could be nice if you could contribute it to PG core. Our extension is aimed t
Yet another vectorized engine
Hi hackers, We just want to introduce another POC for a vectorized execution engine, https://github.com/zhangh43/vectorize_engine, and want to get some feedback on the idea. The basic idea is to extend TupleTableSlot and introduce VectorTupleTableSlot, which is an array of datums organized by projected columns. The array of datums for each column is contiguous in memory. This makes expression evaluation cache-friendly, and SIMD can be utilized. We have currently refactored SeqScanNode and AggNode to support VectorTupleTableSlot. Below are the features of our design. 1. Pure extension. We don't hack any code into the postgres kernel. 2. CustomScan node. We use the CustomScan framework to replace original executor nodes such as SeqScan, Agg etc. Based on CustomScan, we can extend the CustomScanState, BeginCustomScan(), ExecCustomScan(), EndCustomScan() interfaces to implement the vectorized executor logic. 3. Post-planner hook. After the plan is generated, we use plan_tree_walker to traverse the plan tree and check whether it can be vectorized. If yes, the non-vectorized nodes (SeqScan, Agg etc.) are replaced with vectorized nodes (in the form of CustomScan nodes) and the vectorized executor is used. If no, we revert to the original plan and use the non-vectorized executor. In future this part could be enhanced: for example, instead of reverting to the original plan when some nodes cannot be vectorized, we could add Batch/UnBatch nodes to generate a plan with both vectorized and non-vectorized nodes. 4. Support implementing new vectorized executor nodes gradually. We have currently only vectorized SeqScan and Agg, but other queries, including those with joins, can also run when the vectorize extension is enabled. 5. Inherit original executor code. Instead of rewriting the whole executor, we choose a smoother method: modify the current Postgres executor nodes and make them vectorized. We copy the current executor node's C file into our extension and add vectorize logic based on it. 
When Postgres enhances its executor, we could relatively easily merge them back. We want to know whether this is a good way to write a vectorized executor extension. 6. Pluggable storage. Postgres supports pluggable storage now. TupleTableSlot has been refactored around the abstract struct TupleTableSlotOps. VectorTupleTableSlot could be implemented under this framework when we upgrade the extension to the latest PG. We ran the TPCH (10G) benchmark, and the result for Q1 is 50 sec (PG) vs. 28 sec (vectorized PG). The performance gain can be improved further: 1. Heap tuple deforming occupies a lot of CPU. We will try zedstore in future, since the vectorized executor is more compatible with a column store. 2. Vectorized agg is not fully vectorized and we have many optimizations to do, for example batch-computing the hash values and optimizing the hash table for vectorized HashAgg. 3. Conversion cost from Datum to the actual type and vice versa is also high, for example DatumGetFloat4 & Float4GetDatum. One optimization may be to store the actual type in VectorTupleTableSlot directly, instead of an array of datums. Related works: 1. VOPS is a vectorized execution extension. Link: https://github.com/postgrespro/vops. It doesn't use the custom scan framework; it uses UDFs to do the vectorized operations, e.g. it changes the SQL syntax to do aggregation. 2. The Citus vectorized executor is another POC. Link: https://github.com/citusdata/postgres_vectorization_test. It uses ExecutorRun_hook to run the vectorized executor and uses cstore_fdw to support column storage. Note that the vectorized executor engine is based on PG9.6 now, but it could be ported to master / zedstore with some effort. We would appreciate some feedback before moving further in that direction. Thanks, Hubert Zhang, Gang Xiong, Ning Yu, Asim Praveen
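Point 3 above can be illustrated with a standalone sketch; the Datum pack/unpack helpers below are minimal stand-ins for PostgreSQL's pass-by-value float4 conversions, not the real implementations:

```c
#include <stdint.h>
#include <string.h>

typedef uintptr_t Datum;

/* Illustrative stand-ins for DatumGetFloat4 / Float4GetDatum: a float
 * packed into and out of a Datum via memcpy. */
static Datum Float4GetDatum(float f) { Datum d = 0; memcpy(&d, &f, sizeof(f)); return d; }
static float DatumGetFloat4(Datum d) { float f;     memcpy(&f, &d, sizeof(f)); return f; }

/* Datum-based batch: every element pays a pack/unpack on each use. */
static float
sum_datums(const Datum *vals, int n)
{
    float s = 0.0f;
    int   i;

    for (i = 0; i < n; i++)
        s += DatumGetFloat4(vals[i]);
    return s;
}

/* Native-typed batch: no per-element conversion, and the loop is
 * SIMD-friendly -- the layout the optimization above proposes. */
static float
sum_native(const float *vals, int n)
{
    float s = 0.0f;
    int   i;

    for (i = 0; i < n; i++)
        s += vals[i];
    return s;
}
```

Both produce the same result; the difference is purely the per-element conversion work and the opportunity it denies the compiler's vectorizer.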
Re: accounting for memory used for BufFile during hash joins
ich is 8 now. So we have to split some tuples in R3 to disk file R7, *which is not necessary*. When probing S3, we also need to split some tuples in S3 into S7, *which is not necessary either*. Since R3 could be loaded into memory entirely, spilling part of R3 to a disk file not only introduces more files and file buffers (which is the problem Tomas is trying to solve), but also slows down performance. After the third batch is processed, we get: disk files: batch4(R4,S4), batch(R6,S6), batch(R7,S7). Next, we begin to process R4 and S4. Similar to R3, some tuples in R4 also need to be spilled to file R8. But after this split, suppose R4 is still skewed, and we increase the batch number again to 16. As a result, some tuples in R4 will be spilled to files R12 and R16. When probing S4, we similarly need to split some tuples in S4 into S8, S12 and S16. After this step, we get: disk files: batch(R6,S6), batch(R7,S7), batch(R8,S8), batch(R12,S12), batch(R16,S16). Next, when we begin to process R6 and S6, even if we could build the hash table for R6 entirely in memory, we still have to split R6 based on the new batch number 16 and spill to file R14. *It's not necessary.* Now we can conclude that increasing the batch number introduces unnecessary repeated spills not only on the original batch (R3,S3) but also on newly generated batches (R6,S6) in a cascading way. *In a worse case, suppose R2 is super skewed and needs to split 10 times, while R3 is OK to build its hash table entirely in memory. In this case, we have to introduce R7,R11,...,R4095, in total 1023 unnecessary spill files. Each of these files may contain fewer than ten tuples. Also, we need to palloc a file buffer (512KB) for each of these spill files. This is the so-called batch explosion problem.* *Solutions:* To avoid these unnecessary repeated spills, I propose to turn the function ExecHashGetBucketAndBatch into a hash function chain to determine the batchno. 
Here is the original implementation of ExecHashGetBucketAndBatch:

```
// nbatch is the global batch number
*batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);
```

We can see the original hash function basically calculates the MOD of the global batch number. A real hybrid hash join should use a hash function chain to determine the batchno. In the new algorithm, the components of the hash function chain are defined as: MOD of #IBN, MOD of #IBN*2, MOD of #IBN*4, MOD of #IBN*8 etc., where #IBN is the initial batch number. A small batch will just use the first hash function in the chain, while a skewed batch will use as many hash functions in the chain as the number of times it has been split. Here is the new implementation of ExecHashGetBucketAndBatch:

```
/* i is the current batchno we are processing */
/* hashChainLen records the number of times batch i has been split */
for (j = 0; j < hashChainLen[i]; j++)
{
    batchno = (hashvalue >> hashtable->log2_nbuckets) & ((#IBN) * (2^j) - 1);

    /*
     * If the calculated batchno is still i, we need to call more hash
     * functions in the chain to determine the final batchno; else we
     * can return directly.
     */
    if (batchno != i)
        return batchno;
}
return batchno;
```

A quick example. Suppose R3's input is 3,7,11,15,19,23,27,31,35,15,27 (all of them satisfy MOD 4 = 3). Suppose the initial batch number is 4 and memory can hold 4 tuples, so the 5th tuple triggers a batch split. Step 1: batch3 processes 3,7,11,15,19 and now needs to split, chainLen[3]=2 batch3: 3,11,19 batch7: 7,15 Step 2: 23,27,31 coming batch3: 3,11,19,27 batch7: 7,15,23,31 Step 3: 35 coming, batch3 needs to split again, chainLen[3]=3 batch3: 3,19,35 batch7: 7,15,23,31 batch11: 11,27 Step 4: 15 coming, HashFun1 15%4=3, HashFun2 15%8=7; since 7!=3, spill 15 to batch7. Step 5: 27 coming, 27%4=3, 27%8=3, 27%16=11; since 11!=3, spill 27 to batch11. 
Final state: chainLen[3]=3 batch3: 3,19,35 batch7: 7,15,23,31,15 batch11: 11,27,27 Here is pseudo code for processing batch i:

```
/* Step 1: build hash table for Ri */
tuple = ReadFromFile(Ri);
/* get batchno by the new function */
batchno = NewExecHashGetBucketAndBatch();
/* spill if the tuple does not belong to the current batch */
if (batchno != i)
    spill to file[batchno]
flag = InsertTupleToHashTable(HT, tuple);
if (flag == NEED_SPLIT)
{
    hashChainLen[i]++;
    /* then call ExecHashIncreaseNumBatches() to do the real spill */
}

/* probe stage */
tuple = ReadFromFile(S[i + Bi*k]);
batchno = NewExecHashGetBucketAndBatch();
if (batchno == curbatch)
    probe and match
else
    spillToFile(tuple, batchno)
```

This solution only splits the batches that need to be split, in a lazy way. If this solution makes sense, I would like to write the real patch. Any comments? -- Thanks Hubert Zhang
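The hash-function-chain logic above can be exercised as a standalone sketch (assuming log2_nbuckets = 0 so the value itself serves as the hash value, and hard-coding the chain length from the worked example; names are illustrative, not a patch):

```c
#define INITIAL_NBATCH 4   /* #IBN from the worked example */

/* hashChainLen[i] = number of hash functions in the chain for batch i;
 * batch 3 was split twice, so its chain has length 3. */
static int hashChainLen[INITIAL_NBATCH] = {1, 1, 1, 3};

/* Sketch of the proposed NewExecHashGetBucketAndBatch: try MOD of
 * IBN, IBN*2, IBN*4, ... until the tuple no longer maps to batch i,
 * or the chain for batch i is exhausted. */
static int
new_get_batchno(unsigned int hashvalue, int i)
{
    int j;
    int batchno = i;

    for (j = 0; j < hashChainLen[i]; j++)
    {
        batchno = (int) (hashvalue & (INITIAL_NBATCH * (1u << j) - 1));
        if (batchno != i)
            return batchno;     /* settled by an earlier split level */
    }
    return batchno;             /* stays in the current batch */
}
```

Running the tuples from the worked example through it reproduces the final state: 35 stays in batch 3, 15 lands in batch 7, and 27 lands in batch 11.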
How to create named portal except cursor?
Hi all, Is there any way to create a named portal in PG other than a cursor? I tried the PostgreSQL JDBC driver and used PreparedStatement. The backend receives the `bind` and `execute` messages, but the portal name is still empty. How can I specify the portal name? -- Thanks Hubert Zhang
Re: Control your disk usage in PG: Introduction to Disk Quota Extension
Thanks, Thomas. On Mon, Jul 8, 2019 at 6:47 AM Thomas Munro wrote: > On Mon, Feb 18, 2019 at 7:39 PM Hubert Zhang wrote: > > Based on the assumption we use smgr as hook position, hook API option1 or option2 which is better? > > Or we could find some balanced API between option1 and option2? > > > > Again comments on other better hook positions are also appreciated! > > Hi Hubert, > > The July Commitfest is now running, and this entry is in "needs > review" state. Could you please post a rebased patch? > > I have questions about how disk quotas should work and I think we'll > probably eventually want more hooks than these, but simply adding > these hooks so extensions can do whatever they want doesn't seem very > risky for core. I think it's highly likely that the hook signatures > will have to change in future releases too, but that seems OK for such > detailed internal hooks. As for your question, my first reaction was > that I preferred your option 1, because SMgrRelation seems quite > private and there are no existing examples of that object being > exposed to extensions. But on reflection, other callbacks don't take > such a mollycoddling approach. So my vote is for option 2 "just pass > all the arguments to the callback", which I understand to be the > approach of patch you have posted. > > +if (smgrcreate_hook) > +{ > +(*smgrcreate_hook)(reln, forknum, isRedo); > +} > > Usually we don't use curlies for single line if branches. > > I have rebased the patch to v4 and removed the unnecessary braces. As per your comments, option 2 is still used in patch v4. Agreed that the diskquota extension may use more hooks in future. Currently the behavior of the diskquota extension is as follows: we use the smgr hooks to detect active tables and record them in shared memory. Bgworkers of the diskquota extension read these active tables from shared memory, calculate the latest table sizes, and sum them into the size of each schema or role. 
If size of schema of role exceeds their quota limit, they will be put into a black list in shared memory. When a new query comes, ExecutorCheckPerms_hook will be used to check the black list the cancel the query if needed. -- Thanks Hubert Zhang disk_quota_hooks_v4.patch Description: Binary data
Re: accounting for memory used for BufFile during hash joins
Hi Tomas, Here is the patch; it should be compatible with your patch, and it focuses on when to grow the number of batches. On Tue, May 28, 2019 at 3:40 PM Hubert Zhang wrote: > On Sat, May 4, 2019 at 8:34 AM Tomas Vondra > wrote: > >> The root cause is that hash join treats batches as pretty much free, but >> that's not really true - we do allocate two BufFile structs per batch, >> and each BufFile is ~8kB as it includes PGAlignedBuffer. >> >> The OOM is not very surprising, because with 524288 batches it'd need >> about 8GB of memory, and the system only has 8GB RAM installed. >> >> The second patch tries to enforce work_mem more strictly. That would be >> impossible if we were to keep all the BufFile structs in memory, so >> instead it slices the batches into chunks that fit into work_mem, and >> then uses a single "overflow" file for slices currently not in memory. >> These extra slices can't be counted into work_mem, but we should need >> just very few of them. For example with work_mem=4MB the slice is 128 >> batches, so we need 128x less overflow files (compared to per-batch). >> >> > Hi Tomas > > I read your second patch which uses overflow buf files to reduce the total > number of batches. > It would solve the hash join OOM problem what you discussed above: 8K per > batch leads to batch bloating problem. > > I mentioned in another thread: > > https://www.postgresql.org/message-id/flat/CAB0yrekv%3D6_T_eUe2kOEvWUMwufcvfd15SFmCABtYFOkxCFdfA%40mail.gmail.com > There is another hashjoin OOM problem which disables splitting batches too > early. PG uses a flag hashtable->growEnable to determine whether to split > batches. Once one splitting failed(all the tuples are assigned to only one > batch of two split ones) The growEnable flag would be turned off forever. > > The is an opposite side of batch bloating problem. It only contains too > few batches and makes the in-memory hash table too large to fit into memory. 
> > Here is the tradeoff: one batch takes more than 8KB(8KB makes sense, due > to performance), in-memory hash table takes memory as well and splitting > batched may(not must) reduce the in-memory hash table size but introduce > more batches(and thus more memory usage 8KB*#batch). > Can we conclude that it would be worth to splitting if satisfy: > (The reduced memory of in-memory hash table) - (8KB * number of new > batches) > 0 > > So I'm considering to combine our patch with your patch to fix join OOM > problem. No matter the OOM is introduced by (the memory usage of > in-memory hash table) or (8KB * number of batches). > > nbatch_inmemory in your patch could also use the upper rule to redefine. > > What's your opinion? > > Thanks > > Hubert Zhang > -- Thanks Hubert Zhang 0001-Allow-to-continue-to-split-batch-when-tuples-become-.patch Description: Binary data
Re: accounting for memory used for BufFile during hash joins
On Sat, May 4, 2019 at 8:34 AM Tomas Vondra wrote: > The root cause is that hash join treats batches as pretty much free, but > that's not really true - we do allocate two BufFile structs per batch, > and each BufFile is ~8kB as it includes PGAlignedBuffer. > > The OOM is not very surprising, because with 524288 batches it'd need > about 8GB of memory, and the system only has 8GB RAM installed. > > The second patch tries to enforce work_mem more strictly. That would be > impossible if we were to keep all the BufFile structs in memory, so > instead it slices the batches into chunks that fit into work_mem, and > then uses a single "overflow" file for slices currently not in memory. > These extra slices can't be counted into work_mem, but we should need > just very few of them. For example with work_mem=4MB the slice is 128 > batches, so we need 128x less overflow files (compared to per-batch). > > Hi Tomas, I read your second patch, which uses overflow buf files to reduce the total number of batches. It would solve the hash join OOM problem you discussed above: ~8 kB per batch leads to the batch bloating problem. I mentioned in another thread: https://www.postgresql.org/message-id/flat/CAB0yrekv%3D6_T_eUe2kOEvWUMwufcvfd15SFmCABtYFOkxCFdfA%40mail.gmail.com There is another hashjoin OOM problem, caused by disabling batch splitting too early. PG uses a flag, hashtable->growEnable, to determine whether to split batches. Once one split fails (all the tuples are assigned to only one of the two new batches), the growEnable flag is turned off forever. This is the opposite side of the batch bloating problem: the join ends up with too few batches, making the in-memory hash table too large to fit into memory. Here is the tradeoff: each batch takes more than 8 kB (8 kB makes sense, for performance reasons), the in-memory hash table takes memory as well, and splitting batches may (but need not) reduce the in-memory hash table size while introducing more batches (and thus more memory usage, 8 kB * #batches). 
Can we conclude that splitting is worthwhile only if the following holds: (the reduced memory of the in-memory hash table) - (8 kB * number of new batches) > 0 So I'm considering combining our patch with your patch to fix the join OOM problem, no matter whether the OOM is caused by (the memory usage of the in-memory hash table) or by (8 kB * number of batches). nbatch_inmemory in your patch could also be redefined using the above rule. What's your opinion? Thanks Hubert Zhang
Re: Replace hashtable growEnable flag
Thanks Tomas. I will follow this problem on your thread. This thread could be terminated. On Thu, May 16, 2019 at 3:58 AM Tomas Vondra wrote: > On Wed, May 15, 2019 at 06:19:38PM +0800, Hubert Zhang wrote: > >Hi all, > > > >When we build hash table for a hash join node etc., we split tuples into > >different hash buckets. Since tuples could not all be held in memory. > >Postgres splits each bucket into batches, only the current batch of bucket > >is in memory while other batches are written to disk. > > > >During ExecHashTableInsert(), if the memory cost exceeds the operator > >allowed limit(hashtable->spaceAllowed), batches will be split on the fly > by > >calling ExecHashIncreaseNumBatches(). > > > >In past, if data is distributed unevenly, the split of batch may > failed(All > >the tuples falls into one split batch and the other batch is empty) Then > >Postgres will set hashtable->growEnable to false. And never expand batch > >number any more. > > > >If tuples become diverse in future, spliting batch is still valuable and > >could avoid the current batch become too big and finally OOM. > > > >To fix this, we introduce a penalty on hashtable->spaceAllowed, which is > >the threshold to determine whether to increase batch number. > >If batch split failed, we increase the penalty instead of just turn off > the > >growEnable flag. > > > >Any comments? > > > > There's already another thread discussing various issues with how hashjoin > increases the number of batches, including various issues with how/when we > disable adding more batches. > > https://commitfest.postgresql.org/23/2109/ > > In general I think you're right something like this is necessary, but I > think we may need to rethink growEnable a bit more. > > For example, the way you implemented it, after reaching the increased > limit, we just increase the number of batches just like today, and then > decide whether it actually helped. 
But that means we double the number of > BufFile entries, which uses more and more memory (because each is 8kB and > we need 1 per batch). I think in this case (after increasing the limit) we > should check whether increasing batches makes sense or not. And only do it > if it helps. Otherwise we'll double the amount of memory for BufFile(s) > and also the work_mem. That's not a good idea. > > But as I said, there are other issues discussed on the other thread. For > example we only disable the growth when all rows fall into the same batch. > But that's overly strict. > > > regards > > -- > Tomas Vondra > http://www.2ndQuadrant.com > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services > > -- Thanks Hubert Zhang
Replace hashtable growEnable flag
Hi all, When we build a hash table for a hash join node etc., we split tuples into different hash buckets. Since the tuples cannot all be held in memory, Postgres splits each bucket into batches; only the current batch of a bucket is in memory, while the other batches are written to disk. During ExecHashTableInsert(), if the memory cost exceeds the operator's allowed limit (hashtable->spaceAllowed), batches will be split on the fly by calling ExecHashIncreaseNumBatches(). In the past, if data was distributed unevenly, a batch split could fail (all the tuples fall into one of the two new batches and the other is empty); Postgres would then set hashtable->growEnable to false and never expand the batch number again. If tuples become more diverse later, splitting batches is still valuable and could prevent the current batch from growing too big and finally causing OOM. To fix this, we introduce a penalty on hashtable->spaceAllowed, which is the threshold used to determine whether to increase the batch number. If a batch split fails, we increase the penalty instead of just turning off the growEnable flag. Any comments? -- Thanks Hubert Zhang 0001-Using-growPenalty-to-replace-growEnable-in-hashtable.patch Description: Binary data
Re: Control your disk usage in PG: Introduction to Disk Quota Extension
Hi Andres On Sat, Feb 16, 2019 at 12:53 PM Andres Freund wrote: > Hi, > On 2019-01-30 10:26:52 +0800, Hubert Zhang wrote: > > Hi Michael, Robert > > For you question about the hook position, I want to explain more about > the > > background why we want to introduce these hooks. > > We wrote a diskquota extension < > https://github.com/greenplum-db/diskquota> > > [ ...] > > On Tue, Jan 22, 2019 at 12:08 PM Hubert Zhang wrote: > > > > > > For this particular purpose, I don't immediately see why you need a > > >> > hook in both places. If ReadBuffer is called with P_NEW, aren't we > > >> > guaranteed to end up in smgrextend()? > > >> Yes, that's a bit awkward. > > Please note that on PG lists the customary and desired style is to quote > inline in messages rather than top-quote. > > Greetings, > > Andres Freund > Thanks for your note. I will reply the above questions again. On Wed, Nov 21, 2018 at 10:48 PM Robert Haas wrote: > On Tue, Nov 20, 2018 at 2:20 AM Haozhou Wang wrote: > > We prepared a patch that includes the hook points. And such hook points > are needed for disk quota extension. > > There are two hooks. > > One is SmgrStat_hook. It's used to perform ad-hoc logic in storage when > doing smgr create/extend/truncate in general. As for disk quota extension, > this hook is used to detect active tables(new created tables, tables > extending new blocks, or tables being truncated) > > The other is BufferExtendCheckPerms_hook. It's used to perform ad-hoc > logic when buffer extend a new block. Since ReadBufferExtended is a hot > function, we call this hook only when blockNum == P_NEW. As for disk quota > extension, this hook is used to do query enforcement during the query is > loading data. > > > > Any comments are appreciated. > > +1 for adding some hooks to support this kind of thing, but I think > the names you've chosen are not very good. 
The hook name should > describe the place from which it is called, not the purpose for which > one imagines that it will be used, because somebody else might imagine > another use. Both BufferExtendCheckPerms_hook_type and > SmgrStat_hook_type are imagining that they know what the hook does - > CheckPerms in the first case and Stat in the second case. > > For this particular purpose, I don't immediately see why you need a > hook in both places. If ReadBuffer is called with P_NEW, aren't we > guaranteed to end up in smgrextend()? We have removed the hook in ReadBufferExtended and only keep the hooks in SMGR. On Wed, Dec 26, 2018 at 1:56 PM Michael Paquier wrote: > On Wed, Nov 21, 2018 at 09:47:44AM -0500, Robert Haas wrote: > > +1 for adding some hooks to support this kind of thing, but I think > > the names you've chosen are not very good. The hook name should > > describe the place from which it is called, not the purpose for which > > one imagines that it will be used, because somebody else might imagine > > another use. Both BufferExtendCheckPerms_hook_type and > > SmgrStat_hook_type are imagining that they know what the hook does - > > CheckPerms in the first case and Stat in the second case. > > I personally don't mind making Postgres more pluggable, but I don't > think that we actually need the extra ones proposed here at the layer > of smgr, as smgr is already a layer designed to call an underlying set > of APIs able to extend, unlink, etc. depending on the storage type. > > The reason to use smgr as the hook position is as follows: These hooks will be used by diskquota extension, which needs to gather the current disk usage for each schema. We want to use hooks to detect the Active Tables, and only recalculate the disk size of all the Active Tables. As you mentioned, smgr is the layer to call underlying API to extend/unlink files. So it's also the place to detect the table size change, i.e. the Active Tables. 
For example, the smgrextend hook indicates that a new block was allocated and the corresponding table needs to be treated as an Active Table. Besides, supposing we use smgr as the hook position, I want to discuss the API passed to the hook function. Take smgrextend as an example. The function interface of smgrextend looks like this: ``` void smgrextend (SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, char *buffer, bool skipFsync); ``` So the hook API of smgrextend could have two main options. Hook API Option 1 ``` typedef void (*smgrextend_hook_type) (RelFileNode smgr_rnode, ForkNumber forknum); ``` Hook API Option 2 ``` typedef void (*smgrextend_hook_type) (SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, char *buffer, bool skipFsync); ``` As for Option 1: since the diskquota extension only needs the relfilenode information to detect the active tables, Option 1 just passes the RelFileNode to the hook function.
Re: Control your disk usage in PG: Introduction to Disk Quota Extension
Hi all, Currently we add hooks in SMGR to detect which table is being modified (disk size change). Inserting rows into an existing page with free space does not allocate a new block, and thus should not mark the table as active; smgrextend is a good position to capture this behavior. We welcome suggestions on other, better hook positions! Besides, supposing we use smgr as the hook position, I want to discuss the API passed to the hook function. Take smgrextend as an example. The function interface of smgrextend looks like this: ``` void smgrextend (SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, char *buffer, bool skipFsync); ``` So the hook API of smgrextend could have two main options. Hook API Option 1 ``` typedef void (*smgrextend_hook_type) (RelFileNode smgr_rnode, ForkNumber forknum); ``` Hook API Option 2 ``` typedef void (*smgrextend_hook_type) (SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, char *buffer, bool skipFsync); ``` As for Option 1: since the diskquota extension only needs the relfilenode information to detect the active tables, Option 1 just passes the RelFileNode to the hook function. It's clearer and carries semantic meaning. Option 2, by contrast, passes the original parameters to the hook function without any filtering. This is more general and lets different hook implementations decide how to use these parameters. Option 1 also needs some additional work to handle the smgrdounlinkall case, whose input parameter is a list of SMgrRelation; we may need to palloc a RelFileNode list and pfree it manually. The smgrdounlinkall function interface: ``` smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo, char *relstorages) ``` Based on the assumption that we use smgr as the hook position, which is better: hook API Option 1 or Option 2? Or could we find some balanced API between them? Again, comments on other, better hook positions are appreciated! 
Thanks Hubert On Wed, Jan 30, 2019 at 10:26 AM Hubert Zhang wrote: > Hi Michael, Robert > For you question about the hook position, I want to explain more about the > background why we want to introduce these hooks. > We wrote a diskquota extension <https://github.com/greenplum-db/diskquota> > for Postgresql(which is inspired by Heikki's pg_quota > <https://github.com/hlinnaka/pg_quota>). Diskquota extension is used to > control the disk usage in Postgresql in a fine-grained way, which means: > 1. You could set disk quota limit at schema level or role level. > 2. A background worker will gather the current disk usage for each > schema/role in realtime. > 3. A background worker will generate the blacklist for schema/role whose > quota limit is exceeded. > 4. New transaction want to insert data into the schema/role in the > blacklist will be cancelled. > > In step 2, gathering the current disk usage for each schema needs to sum > disk size of all the tables in this schema. This is a time consuming > operation. We want to use hooks in SMGR to detect the Active Table, and > only recalculate the disk size of all the Active Tables. > For example, the smgrextend hook indicates that you allocate a new block > and the table need to be treated as Active Table. > > Do you have some better hook positions recommend to solve the above user > case? > Thanks in advance. > > Hubert > > > > > > On Tue, Jan 22, 2019 at 12:08 PM Hubert Zhang wrote: > >> > For this particular purpose, I don't immediately see why you need a >>> > hook in both places. If ReadBuffer is called with P_NEW, aren't we >>> > guaranteed to end up in smgrextend()? >>> Yes, that's a bit awkward. >> >> >> Hi Michael, we revisit the ReadBuffer hook and remove it in the latest >> patch. >> ReadBuffer hook is original used to do enforcement(e.g. out of diskquota >> limit) when query is loading data. >> We plan to put the enforcement work of running query to separate >> diskquota worker process. 
>> Let worker process to detect the backends to be cancelled and send SIGINT >> to these backends. >> So there is no need for ReadBuffer hook anymore. >> >> Our patch currently only contains smgr related hooks to catch the file >> change and get the Active Table list for diskquota extension. >> >> Thanks Hubert. >> >> >> On Mon, Jan 7, 2019 at 6:56 PM Haozhou Wang wrote: >> >>> Thanks very much for your comments. >>> >>> To the best of my knowledge, smgr is a layer that abstract the storage >>> operations. Therefore, it is a good place to control or collect information >>> the storage operations without touching the physical storage layer. >>> Moreover, smgr is coming with actual disk IO operation (not consider the >>> OS cache) for postgres. So we do not need to worry about the buffer management in postgres.
Re: Control your disk usage in PG: Introduction to Disk Quota Extension
Hi Michael, Robert, For your question about the hook position, I want to explain more of the background on why we want to introduce these hooks. We wrote a diskquota extension <https://github.com/greenplum-db/diskquota> for Postgresql (inspired by Heikki's pg_quota <https://github.com/hlinnaka/pg_quota>). The diskquota extension is used to control disk usage in Postgresql in a fine-grained way, which means: 1. You can set a disk quota limit at the schema level or role level. 2. A background worker gathers the current disk usage for each schema/role in real time. 3. A background worker generates a blacklist of schemas/roles whose quota limit is exceeded. 4. New transactions that want to insert data into a schema/role on the blacklist are cancelled. In step 2, gathering the current disk usage for each schema requires summing the disk size of all the tables in that schema, which is a time-consuming operation. We want to use hooks in SMGR to detect the Active Tables, and only recalculate the disk size of those Active Tables. For example, the smgrextend hook indicates that a new block was allocated and the table needs to be treated as an Active Table. Do you have better hook positions to recommend for the above use case? Thanks in advance. Hubert On Tue, Jan 22, 2019 at 12:08 PM Hubert Zhang wrote: > > For this particular purpose, I don't immediately see why you need a >> > hook in both places. If ReadBuffer is called with P_NEW, aren't we >> > guaranteed to end up in smgrextend()? >> Yes, that's a bit awkward. > > Hi Michael, we revisit the ReadBuffer hook and remove it in the latest > patch. > ReadBuffer hook is original used to do enforcement(e.g. out of diskquota > limit) when query is loading data. > We plan to put the enforcement work of running query to separate diskquota > worker process. > Let worker process to detect the backends to be cancelled and send SIGINT > to these backends. > So there is no need for ReadBuffer hook anymore. 
> > Our patch currently only contains smgr related hooks to catch the file > change and get the Active Table list for diskquota extension. > > Thanks Hubert. > > > On Mon, Jan 7, 2019 at 6:56 PM Haozhou Wang wrote: > >> Thanks very much for your comments. >> >> To the best of my knowledge, smgr is a layer that abstract the storage >> operations. Therefore, it is a good place to control or collect information >> the storage operations without touching the physical storage layer. >> Moreover, smgr is coming with actual disk IO operation (not consider the >> OS cache) for postgres. So we do not need to worry about the buffer >> management in postgres. >> It will make the purpose of hook is pure: a hook for actual disk IO. >> >> Regards, >> Haozhou >> >> On Wed, Dec 26, 2018 at 1:56 PM Michael Paquier >> wrote: >> >>> On Wed, Nov 21, 2018 at 09:47:44AM -0500, Robert Haas wrote: >>> > +1 for adding some hooks to support this kind of thing, but I think >>> > the names you've chosen are not very good. The hook name should >>> > describe the place from which it is called, not the purpose for which >>> > one imagines that it will be used, because somebody else might imagine >>> > another use. Both BufferExtendCheckPerms_hook_type and >>> > SmgrStat_hook_type are imagining that they know what the hook does - >>> > CheckPerms in the first case and Stat in the second case. >>> >>> I personally don't mind making Postgres more pluggable, but I don't >>> think that we actually need the extra ones proposed here at the layer >>> of smgr, as smgr is already a layer designed to call an underlying set >>> of APIs able to extend, unlink, etc. depending on the storage type. >>> >>> > For this particular purpose, I don't immediately see why you need a >>> > hook in both places. If ReadBuffer is called with P_NEW, aren't we >>> > guaranteed to end up in smgrextend()? >>> >>> Yes, that's a bit awkward. >>> -- >>> Michael >> >> > > -- > Thanks > > Hubert Zhang > -- Thanks Hubert Zhang
Re: Control your disk usage in PG: Introduction to Disk Quota Extension
> > > For this particular purpose, I don't immediately see why you need a > > hook in both places. If ReadBuffer is called with P_NEW, aren't we > > guaranteed to end up in smgrextend()? > Yes, that's a bit awkward. Hi Michael, we revisited the ReadBuffer hook and removed it in the latest patch. The ReadBuffer hook was originally used to do enforcement (e.g. over the diskquota limit) while a query is loading data. We plan to move the enforcement work for running queries to a separate diskquota worker process, which will detect the backends to be cancelled and send SIGINT to them. So there is no need for the ReadBuffer hook anymore. Our patch currently only contains smgr-related hooks to catch file changes and get the Active Table list for the diskquota extension. Thanks, Hubert. On Mon, Jan 7, 2019 at 6:56 PM Haozhou Wang wrote: > Thanks very much for your comments. > > To the best of my knowledge, smgr is a layer that abstract the storage > operations. Therefore, it is a good place to control or collect information > the storage operations without touching the physical storage layer. > Moreover, smgr is coming with actual disk IO operation (not consider the > OS cache) for postgres. So we do not need to worry about the buffer > management in postgres. > It will make the purpose of hook is pure: a hook for actual disk IO. > > Regards, > Haozhou > > On Wed, Dec 26, 2018 at 1:56 PM Michael Paquier > wrote: > >> On Wed, Nov 21, 2018 at 09:47:44AM -0500, Robert Haas wrote: >> > +1 for adding some hooks to support this kind of thing, but I think >> > the names you've chosen are not very good. The hook name should >> > describe the place from which it is called, not the purpose for which >> > one imagines that it will be used, because somebody else might imagine >> > another use. Both BufferExtendCheckPerms_hook_type and >> > SmgrStat_hook_type are imagining that they know what the hook does - >> > CheckPerms in the first case and Stat in the second case. 
>> >> I personally don't mind making Postgres more pluggable, but I don't >> think that we actually need the extra ones proposed here at the layer >> of smgr, as smgr is already a layer designed to call an underlying set >> of APIs able to extend, unlink, etc. depending on the storage type. >> >> > For this particular purpose, I don't immediately see why you need a >> > hook in both places. If ReadBuffer is called with P_NEW, aren't we >> > guaranteed to end up in smgrextend()? >> >> Yes, that's a bit awkward. >> -- >> Michael > > -- Thanks Hubert Zhang disk_quota_hooks_v3.patch Description: Binary data
Discussion: Fast DB/Schema/Table disk size check in Postgresql
an extension. Thanks in advance! -- Thanks Hubert Zhang
Re: Control your disk usage in PG: Introduction to Disk Quota Extension
> > For this particular purpose, I don't immediately see why you need a > hook in both places. If ReadBuffer is called with P_NEW, aren't we > guaranteed to end up in smgrextend()? For the use case in diskquota: BufferExtendCheckPerms_hook is used to do dynamic query enforcement, while the smgr-related hooks are used to detect the relfilenode oids of active tables and write them into shared memory (which is used to accelerate refreshing the diskquota model). The reason we don't use the smgr_extend hook instead of the ReadBuffer hook for enforcement is twofold: 1. For enforcement, we don't want to affect the performance of insert queries, but a hook in smgr_extend would first need to convert the relfilenode to a reloid, which requires an index scan. 2. Using a hook in ReadBuffer instead of smgr_extend avoids enforcement during 'cluster relation' operations. For example, 'vacuum full table' will first cluster and create a new table, and then delete the old one. Because the disk usage first grows and then shrinks, if the quota limit is reached, vacuum full would fail (but in fact we want vacuum full to reduce disk usage). Using a hook in ReadBuffer is one solution to this problem; of course, there are others. But this is one of the reasons we use BufferExtendCheckPerms_hook for enforcement at the current stage. On Thu, Nov 22, 2018 at 7:26 PM Haozhou Wang wrote: > Thank you very much for your review. > We refactored our patch with new names and comments. > > For ReadBufferExtended hook, yes, Readbuffer with P_NEW will then call > smgrextend. > > But in smgrextend, we cannot get the oid of a relation, and it will take > some time to get the oid via smgrrelation. > We would like to add a hook just before the smgrextend to get the oid and > avoid use RelidByRelfilenode(). > > New patch is attached in the attachment. > Thank a lot! 
> > Regards, > Haozhou > > > On Wed, Nov 21, 2018 at 10:48 PM Robert Haas > wrote: > >> On Tue, Nov 20, 2018 at 2:20 AM Haozhou Wang wrote: >> > We prepared a patch that includes the hook points. And such hook points >> are needed for disk quota extension. >> > There are two hooks. >> > One is SmgrStat_hook. It's used to perform ad-hoc logic in storage when >> doing smgr create/extend/truncate in general. As for disk quota extension, >> this hook is used to detect active tables(new created tables, tables >> extending new blocks, or tables being truncated) >> > The other is BufferExtendCheckPerms_hook. It's used to perform ad-hoc >> logic when buffer extend a new block. Since ReadBufferExtended is a hot >> function, we call this hook only when blockNum == P_NEW. As for disk quota >> extension, this hook is used to do query enforcement during the query is >> loading data. >> > >> > Any comments are appreciated. >> >> +1 for adding some hooks to support this kind of thing, but I think >> the names you've chosen are not very good. The hook name should >> describe the place from which it is called, not the purpose for which >> one imagines that it will be used, because somebody else might imagine >> another use. Both BufferExtendCheckPerms_hook_type and >> SmgrStat_hook_type are imagining that they know what the hook does - >> CheckPerms in the first case and Stat in the second case. >> >> For this particular purpose, I don't immediately see why you need a >> hook in both places. If ReadBuffer is called with P_NEW, aren't we >> guaranteed to end up in smgrextend()? >> >> -- >> Robert Haas >> EnterpriseDB: http://www.enterprisedb.com >> The Enterprise PostgreSQL Company >> > > > -- Thanks Hubert Zhang
Re: Control your disk usage in PG: Introduction to Disk Quota Extension
Thanks, Tomas. Yes, we want to add the hooks to the postgres repo. But before that, we plan to first get some feedback from the community about the diskquota extension's implementation and usage. Later, we'll update our license and submit the hooks to a CF. Thanks Hubert On Wed, Nov 14, 2018 at 3:54 AM Tomas Vondra wrote: > On Tue, 2018-11-13 at 16:47 +0800, Hubert Zhang wrote: > > Hi all, > > > > We implement disk quota feature on Postgresql as an extension(link: > > https://github.com/greenplum-db/diskquota), > > If you are interested, try and use it to limit the amount of disk > > space that > > a schema or a role can use in Postgresql. > > Any feedback or question are appreciated. > > > > It's not clear to me what's the goal of this thread? I understand what > quotas are about, but are you sharing it because (a) it's a useful > extension, (b) you propose adding a couple of new hooks (and keep the > extension separate) or (c) you propose adding both the hooks and the > extension (into contrib)? > > I assume it's either (b) and (c), in which case you should add it to > 2019-01 CF: https://commitfest.postgresql.org/21/ > > In either case, it might be useful to add a LICENSE to the github > repository, it's not clear what's the situation in this respect. That > probably matters especially for (b), because for (c) it'd end up with > PostgreSQL license automatically. > > regards > > -- > Tomas Vondra http://www.2ndQuadrant.com > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services > > -- Thanks Hubert Zhang
Control your disk usage in PG: Introduction to Disk Quota Extension
disk quota stats periodically diskquota.naptime = 2 # restart database to load preload library. pg_ctl restart 1. Create diskquota extension in monitored database. create extension diskquota; 1. Reload database configuration # reset monitored database list in postgresql.conf diskquota.monitor_databases = 'postgres, postgres2' # reload configuration pg_ctl reload <https://github.com/greenplum-db/diskquota/blob/master/README.md#usage>Usage 1. Set/update/delete schema quota limit using diskquota.set_schema_quota create schema s1; select diskquota.set_schema_quota('s1', '1 MB'); set search_path to s1; create table a(i int); # insert small data succeeded insert into a select generate_series(1,100); # insert large data failed insert into a select generate_series(1,1000); # insert small data failed insert into a select generate_series(1,100); # delete quota configuration select diskquota.set_schema_quota('s1', '-1'); # insert small data succeed select pg_sleep(5); insert into a select generate_series(1,100); reset search_path; 2. Set/update/delete role quota limit using diskquota.set_role_quota create role u1 nologin; create table b (i int); alter table b owner to u1; select diskquota.set_role_quota('u1', '1 MB'); # insert small data succeeded insert into b select generate_series(1,100); # insert large data failed insert into b select generate_series(1,1000); # insert small data failed insert into b select generate_series(1,100); # delete quota configuration select diskquota.set_role_quota('u1', '-1'); # insert small data succeed select pg_sleep(5); insert into a select generate_series(1,100); reset search_path; 3. Show schema quota limit and current usage select * from diskquota.show_schema_quota_view; <https://github.com/greenplum-db/diskquota/blob/master/README.md#test>Test Run regression tests. 
cd contrib/diskquota; make installcheck

Benchmark & Performance Test <https://github.com/greenplum-db/diskquota/blob/master/README.md#benchmark--performence-test>

Cost of diskquota worker <https://github.com/greenplum-db/diskquota/blob/master/README.md#cost-of-diskquota-worker>

During each refresh interval, the disk quota worker needs to refresh the disk quota model. It takes less than 100ms with 100K user tables and no active tables, and less than 200ms with 100K user tables and 1K active tables.

Impact on OLTP queries <https://github.com/greenplum-db/diskquota/blob/master/README.md#impact-on-oltp-queries>

We ran OLTP queries to measure the impact of enabling the diskquota feature. The number of tables ranges from 2k to 10k, each connection inserts 100 rows into each table, and the number of parallel connections ranges from 5 to 25. The number of active tables is around 1k.

Without diskquota enabled (seconds):

connections |     2k |     4k |     6k |     8k |     10k
          5 |  4.002 | 11.356 | 18.460 | 28.591 |  41.123
         10 |  4.832 | 11.988 | 21.113 | 32.829 |  45.832
         15 |  6.238 | 16.896 | 28.722 | 45.375 |  64.642
         20 |  8.036 | 21.711 | 38.499 | 61.763 |  87.875
         25 |  9.909 | 27.175 | 47.996 | 75.688 | 106.648

With diskquota enabled (seconds):

connections |     2k |     4k |     6k |     8k |     10k
          5 |  4.135 | 10.641 | 18.776 | 28.804 |  41.740
         10 |  4.773 | 12.407 | 22.351 | 34.243 |  47.568
         15 |  6.355 | 17.305 | 30.941 | 46.967 |  66.216
         20 |  9.451 | 22.231 | 40.645 | 61.758 |  88.309
         25 | 10.096 | 26.844 | 48.910 | 76.537 | 108.025

The performance difference with and without diskquota enabled is less than 2-3% in most cases, so there is no significant performance downgrade when diskquota is enabled.

Notes <https://github.com/greenplum-db/diskquota/blob/master/README.md#notes>

1. Dropping a database with diskquota enabled. If the DBA enables diskquota monitoring on a database, the diskquota worker process keeps a connection to that database. The DBA needs to first remove the database from diskquota.monitor_databases in postgresql.conf and reload the configuration by calling pg_ctl reload; then the database can be dropped successfully.

2. Temp tables.
Diskquota supports limiting the disk usage of temp tables as well, but schema quotas and role quotas treat them differently. For a role quota, diskquota treats a temp table the same as a normal table and counts its size toward its owner's quota. For a schema quota, a temp table lives under the per-backend namespace 'pg_temp_<backend_id>', so its size is not counted toward the current schema's quota. -- Thanks Hubert Zhang, Haozhou Wang, Hao Wu, Jack WU
Re: Is there way to detect uncommitted 'new table' in pg_class?
Thanks On Thu, Nov 1, 2018 at 8:38 AM Michael Paquier wrote: > On Wed, Oct 31, 2018 at 01:30:52PM -0400, Robert Haas wrote: > > In theory, at least, you could write C code to scan the catalog tables > > with SnapshotDirty to find the catalog entries, but I don't think that > > helps a whole lot. You couldn't necessarily rely on those catalog > > entries to be in a consistent state, and even if they were, they might > > depend on committed types or functions or similar whose definitions > > your backend can't see. Moreover, the creating backend will have an > > AccessExclusiveLock on the table -- if you write C code to bypass that > > and read the data anyway, then you will probably destabilize the > > entire system for complicated reasons that I don't feel like > > explaining right now. > > One take here is that we cannot give any guarantee that a single DDL > will create only one consistent version of the tuple added in system > catalogs. In those cases a new version is made visible by using > CommandCounterIncrement() so as the follow-up processing can see it. > > > You should try very hard to find some way of solving this problem that > > doesn't require reading data from a table that hasn't been committed > > yet, because you are almost certainly not going to be able to make > > that work reliably even if you are willing to write code in C. > > +1. > -- > Michael > -- Thanks Hubert Zhang
Is there way to detect uncommitted 'new table' in pg_class?
Hi all, In PG, READ UNCOMMITTED is treated as READ COMMITTED, but I have a requirement to read dirty tables. Is there a way to detect a table which was created in another, uncommitted transaction?

T1: BEGIN; create table a(i int);
T2: select * from pg_class where relname='a';  -- could this return table a?

-- Thanks Hubert Zhang
Re: Is there any way to request unique lwlock inside a background worker in PG9.4?
Thanks a lot. On Wed, Oct 17, 2018 at 11:21 PM Andres Freund wrote: > Hi, > > On 2018-10-17 23:11:26 +0800, Hubert Zhang wrote: > > The section "Share Memory and LWLocks" describe the AddinShmemInitLock > which > > is used to protect the ShmemInitStruct() when backend workers initialize > > their shm. My requirement is to how to protect the shm access within the > > bgworkers(not at init stage). This lock should be bgworkers specific. > > Rereead the page please. After you RequestAddinLWLocks() during > initialization, you can use LWLockAssign() to get an lwlock. Which you > then can use as is your pleasure. > > Greetings, > > Andres Freund > -- Thanks Hubert Zhang
Re: Is there any way to request unique lwlock inside a background worker in PG9.4?
Hi Amit, The section "Shared Memory and LWLocks" describes the AddinShmemInitLock, which is used to protect ShmemInitStruct() when backend workers initialize their shared memory. My requirement is how to protect shared-memory access within the bgworkers (not at the init stage); this lock should be bgworker-specific. On Wed, Oct 17, 2018 at 7:51 PM Amit Kapila wrote: > On Wed, Oct 17, 2018 at 3:49 PM Hubert Zhang wrote: > > > > Hi all, > > > > I want to init SHM in a background worker, which is supported in PG9.4. > Also I need to use lwlock to protect the share memory inside the worker > code. > > > > RequestNamedLWLockTranche is the way to handle it, but it's not > supported in PG 9.4, is there any way I could request an unique lock inside > worker init code in PG 9.4? > > > > You might want to look at section "Shared Memory and LWLocks" in the docs > [1]. > > [1] - https://www.postgresql.org/docs/9.4/static/xfunc-c.html > > -- > With Regards, > Amit Kapila. > EnterpriseDB: http://www.enterprisedb.com > -- Thanks Hubert Zhang
Is there any way to request unique lwlock inside a background worker in PG9.4?
Hi all, I want to init SHM in a background worker, which is supported in PG 9.4. I also need to use an lwlock to protect the shared memory inside the worker code. RequestNamedLWLockTranche is the way to handle this, but it's not supported in PG 9.4; is there any way I could request a unique lwlock inside the worker init code in PG 9.4? -- Thanks Hubert Zhang
Re: Proposal for disk quota feature
> > The quotas or object limits, resource limits are pretty useful and > necessary, but I don't see these like new type of objects, it is much more > some property of current objects. Because we have one syntax for this > purpose I prefer it. SCHEMA and TABLE are OK for me, but as I mentioned before, ROLE is a special case for ALTER ... SET at this moment. TABLE and SCHEMA are both database-level: pg_class and pg_namespace both reside in a single database. But ROLE is cluster-level; roles don't belong to a database. At first glance, ALTER ROLE XXX SET disk_quota = xxx would mean setting the quota for the user on all databases, but in our first-stage design a ROLE's quota is bound to a specific database: e.g. role Jack could have a 10GB quota on database A and a 2GB quota on database B. The SQL syntax is not hard to change, and I don't think it should block the main design of the disk quota feature. Are there any comments on the design and architecture? If not, we'll submit our patch first to drive more discussion. On Sat, Sep 22, 2018 at 3:03 PM Pavel Stehule wrote: > > > so 22. 9. 2018 v 8:48 odesílatel Hubert Zhang napsal: > >> But it looks like redundant to current GUC configuration and limits >> >> what do you mean by current GUC configuration? Is that the general block >> number limit in your patch? If yes, the difference between GUC and >> pg_diskquota catalog is that pg_diskquota will store different quota limit >> for the different role, schema or table instead of a single GUC value. >> > > storage is not relevant in this moment. > > I don't see to consistent to sets some limits via SET command, or ALTER X > SET, and some other with CREATE QUOTA ON. > > The quotas or object limits, resource limits are pretty useful and > necessary, but I don't see these like new type of objects, it is much more > some property of current objects. Because we have one syntax for this > purpose I prefer it. 
Because is not good to have two syntaxes for similar > purpose. > > So instead CREATE DISC QUATA ON SCHEMA xxx some value I prefer > > ALTER SCHEMA xxx SET disc_quota = xxx; > > The functionality is +/- same. But ALTER XX SET was introduce first, and I > don't feel comfortable to have any new syntax for similar purpose > > Regards > > Pavel > > > > > >> >> On Sat, Sep 22, 2018 at 11:17 AM Pavel Stehule >> wrote: >> >>> >>> >>> pá 21. 9. 2018 v 16:21 odesílatel Hubert Zhang >>> napsal: >>> >>>> just fast reaction - why QUOTA object? >>>>> Isn't ALTER SET enough? >>>>> Some like >>>>> ALTER TABLE a1 SET quote = 1MB; >>>>> ALTER USER ... >>>>> ALTER SCHEMA .. >>>>> New DDL commans looks like too hard hammer . >>>> >>>> >>>> It's an option. Prefer to consider quota setting store together: >>>> CREATE DISK QUOTA way is more nature to store quota setting in a >>>> separate pg_diskquota catalog >>>> While ALTER SET way is more close to store quota setting in pg_class, >>>> pg_role, pg_namespace. etc in an integrated way. >>>> (Note that here I mean nature/close is not must, ALTER SET could also >>>> store in pg_diskquota and vice versa.) >>>> >>> >>> I have not a problem with new special table for storing this >>> information. But it looks like redundant to current GUC configuration and >>> limits. Can be messy do some work with ALTER ROLE, and some work via CREATE >>> QUOTE. >>> >>> Regards >>> >>> Pavel >>> >>> >>>> Here are some differences I can think of: >>>> 1 pg_role is a global catalog, not per database level. It's harder to >>>> tracker the user's disk usage in the whole clusters(considering 1000+ >>>> databases). So the semantic of CREATE DISK QUOTA ON USER is limited: it >>>> only tracks the user's disk usage inside the current database. >>>> 2 using separate pg_diskquota could add more field except for quota >>>> limit without adding too many fields in pg_class, e.g. red zone to give the >>>> user a warning or the current disk usage of the db objects. 
>>>> >>>> On Fri, Sep 21, 2018 at 8:01 PM Pavel Stehule >>>> wrote: >>>> >>>>> >>>>> >>>>> pá 21. 9. 2018 v 13:32 odesílatel Hubert Zhang >>>>> napsal: >>>>> >>>>>> >>>>>> >>>>>> >
Re: Proposal for disk quota feature
> > But it looks like redundant to current GUC configuration and limits what do you mean by current GUC configuration? Is that the general block number limit in your patch? If yes, the difference between GUC and pg_diskquota catalog is that pg_diskquota will store different quota limit for the different role, schema or table instead of a single GUC value. On Sat, Sep 22, 2018 at 11:17 AM Pavel Stehule wrote: > > > pá 21. 9. 2018 v 16:21 odesílatel Hubert Zhang napsal: > >> just fast reaction - why QUOTA object? >>> Isn't ALTER SET enough? >>> Some like >>> ALTER TABLE a1 SET quote = 1MB; >>> ALTER USER ... >>> ALTER SCHEMA .. >>> New DDL commans looks like too hard hammer . >> >> >> It's an option. Prefer to consider quota setting store together: >> CREATE DISK QUOTA way is more nature to store quota setting in a separate >> pg_diskquota catalog >> While ALTER SET way is more close to store quota setting in pg_class, >> pg_role, pg_namespace. etc in an integrated way. >> (Note that here I mean nature/close is not must, ALTER SET could also >> store in pg_diskquota and vice versa.) >> > > I have not a problem with new special table for storing this information. > But it looks like redundant to current GUC configuration and limits. Can be > messy do some work with ALTER ROLE, and some work via CREATE QUOTE. > > Regards > > Pavel > > >> Here are some differences I can think of: >> 1 pg_role is a global catalog, not per database level. It's harder to >> tracker the user's disk usage in the whole clusters(considering 1000+ >> databases). So the semantic of CREATE DISK QUOTA ON USER is limited: it >> only tracks the user's disk usage inside the current database. >> 2 using separate pg_diskquota could add more field except for quota limit >> without adding too many fields in pg_class, e.g. red zone to give the user >> a warning or the current disk usage of the db objects. >> >> On Fri, Sep 21, 2018 at 8:01 PM Pavel Stehule >> wrote: >> >>> >>> >>> pá 21. 9. 
2018 v 13:32 odesílatel Hubert Zhang >>> napsal: >>> >>>> >>>> >>>> >>>> >>>> *Hi all,We redesign disk quota feature based on the comments from Pavel >>>> Stehule and Chapman Flack. Here are the new design.OverviewBasically, disk >>>> quota feature is used to support multi-tenancy environment, different level >>>> of database objects could be set a quota limit to avoid over use of disk >>>> space. A common case could be as follows: DBA could enable disk quota on a >>>> specified database list. DBA could set disk quota limit for >>>> tables/schemas/roles in these databases. Separate disk quota worker process >>>> will monitor the disk usage for these objects and detect the objects which >>>> exceed their quota limit. Queries loading data into these “out of disk >>>> quota” tables/schemas/roles will be cancelled.We are currently working at >>>> init implementation stage. We would like to propose our idea firstly and >>>> get feedbacks from community to do quick iteration.SQL Syntax (How to use >>>> disk quota)1 Specify the databases with disk quota enabled in GUC >>>> “diskquota_databases” in postgresql.conf and restart the database.2 DBA >>>> could set disk quota limit for table/schema/role.CREATE DISK QUOTA tablea1 >>>> ON TABLE a1 with (quota = ‘1MB’);CREATE DISK QUOTA roleu1 ON USER u1 with >>>> (quota = ‘1GB’);CREATE DISK QUOTA schemas1 ON SCHEMA s1 with (quota = >>>> ‘3MB’);* >>>> >>> >>> just fast reaction - why QUOTA object? >>> >>> Isn't ALTER SET enough? >>> >>> Some like >>> >>> ALTER TABLE a1 SET quote = 1MB; >>> ALTER USER ... >>> ALTER SCHEMA .. >>> >>> New DDL commans looks like too hard hammer . 
>>> >>> >>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> *3 Simulate a schema out of quota limit case: suppose table a1 and >>>> table a2 are both under schema s1.INSERT INTO a1 SELECT >>>> generate_series(1,1000);INSERT INTO a2 SELECT >>>> generate_series(1,300);SELECT pg_sleep(5)INSERT INTO a1 SELECT >>>> generate_series(1,1000);ERROR: sc
Re: Proposal for disk quota feature
> > just fast reaction - why QUOTA object? > Isn't ALTER SET enough? > Some like > ALTER TABLE a1 SET quote = 1MB; > ALTER USER ... > ALTER SCHEMA .. > New DDL commans looks like too hard hammer . It's an option. My preference is about where the quota settings are stored: the CREATE DISK QUOTA approach maps naturally to storing quota settings in a separate pg_diskquota catalog, while the ALTER ... SET approach is closer to storing quota settings in pg_class, pg_role, pg_namespace etc. in an integrated way. (Note that neither mapping is a must: ALTER ... SET could also store into pg_diskquota, and vice versa.) Here are some differences I can think of: 1. pg_role is a global catalog, not per-database, so it's harder to track a user's disk usage across the whole cluster (considering 1000+ databases). The semantics of CREATE DISK QUOTA ON USER are therefore limited: it only tracks the user's disk usage inside the current database. 2. A separate pg_diskquota could carry more fields than just the quota limit without adding too many fields to pg_class, e.g. a red zone to give the user a warning, or the current disk usage of the db objects. On Fri, Sep 21, 2018 at 8:01 PM Pavel Stehule wrote: > > > pá 21. 9. 2018 v 13:32 odesílatel Hubert Zhang napsal: > >> >> >> >> >> *Hi all,We redesign disk quota feature based on the comments from Pavel >> Stehule and Chapman Flack. Here are the new design.OverviewBasically, disk >> quota feature is used to support multi-tenancy environment, different level >> of database objects could be set a quota limit to avoid over use of disk >> space. A common case could be as follows: DBA could enable disk quota on a >> specified database list. DBA could set disk quota limit for >> tables/schemas/roles in these databases. Separate disk quota worker process >> will monitor the disk usage for these objects and detect the objects which >> exceed their quota limit. 
Queries loading data into these “out of disk >> quota” tables/schemas/roles will be cancelled.We are currently working at >> init implementation stage. We would like to propose our idea firstly and >> get feedbacks from community to do quick iteration.SQL Syntax (How to use >> disk quota)1 Specify the databases with disk quota enabled in GUC >> “diskquota_databases” in postgresql.conf and restart the database.2 DBA >> could set disk quota limit for table/schema/role.CREATE DISK QUOTA tablea1 >> ON TABLE a1 with (quota = ‘1MB’);CREATE DISK QUOTA roleu1 ON USER u1 with >> (quota = ‘1GB’);CREATE DISK QUOTA schemas1 ON SCHEMA s1 with (quota = >> ‘3MB’);* >> > > just fast reaction - why QUOTA object? > > Isn't ALTER SET enough? > > Some like > > ALTER TABLE a1 SET quote = 1MB; > ALTER USER ... > ALTER SCHEMA .. > > New DDL commans looks like too hard hammer . > > > >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> *3 Simulate a schema out of quota limit case: suppose table a1 and table >> a2 are both under schema s1.INSERT INTO a1 SELECT >> generate_series(1,1000);INSERT INTO a2 SELECT >> generate_series(1,300);SELECT pg_sleep(5)INSERT INTO a1 SELECT >> generate_series(1,1000);ERROR: schema's disk space quota exceededDROP >> TABLE a2;SELECT pg_sleep(5)INSERT INTO a1 SELECT >> generate_series(1,1000);INSERT 0 1000ArchitectureDisk quota has the >> following components.1. Quota Setting Store is where the disk quota setting >> to be stored and accessed. We plan to use catalog table pg_diskquota to >> store these information. pg_diskquota is >> like:CATALOG(pg_diskquota,6122,DiskQuotaRelationId){ NameData quotaname; /* >> diskquota name */int16 quotatype; /* diskquota type name */ Oid >> quotatargetoid; /* diskquota target db object oid*/ int32 quotalimit; /* >> diskquota size limit in MB*/ int32 quotaredzone; /* diskquota redzone in >> MB*/} FormData_pg_diskquota;2. Quota Change Detector is the monitor of size >> change of database objects. 
We plan to use stat collector to detect the >> ‘active’ table list at initial stage. But stat collector has some >> limitation on finding the active table which is in a running transaction. >> Details see TODO section.3. Quota Size Checker is where to calculate the >> size and compare with quota limit for database objects. According to >> Pavel’s comment, autovacuum launcher and worker process could be a good >> reference to disk quota. So we plan to use a disk quota launcher daemon >> process and several disk quota worker process to finish this work. Launcher >> process is responsible for starting worker process based on a user defined >> database
Re: Proposal for disk quota feature
% cpu usage and will be finished within 50ms.

Todo/Limitation

Before we propose our patch, we plan to enhance it with the following ideas:

1. Set the list of databases with disk quota enabled dynamically, without restarting the database. Since we have the disk quota launcher process, it could detect changes to the 'diskquota_databases' list and start/stop the corresponding disk quota worker processes.

2. Enforcement while a query is running. Consider the case where 10MB of quota is left but the next query will insert 10GB of data: the current enforcement design will still allow this query to execute. This is a limitation of 'active' table detection being driven by the stat collector, since a postgres backend only sends table stat information to the collector when the transaction ends. We need a new way to detect an 'active' table even while it is being modified inside a running transaction.

3. Monitor an unlimited number of databases. Currently we cap the number of disk quota worker processes at 10 to reduce the impact on normal workloads. But what if we want to monitor the disk quota of more than 10 databases? Our solution is to let the disk quota launcher manage a queue of databases to be monitored; a disk quota worker process consumes the queue and refreshes the disk usage/quota for its database. After some period, the worker returns the database to the queue and fetches the next database from the head of the queue. This period determines the delay in detecting disk quota changes. To implement this, we need to allow a subprocess of postmaster to rebind to another database instead of the one bound in InitPostgres().

4. Support active table detection for VACUUM FULL and VACUUM ANALYZE, which are currently not tracked by the stat collector.

Thanks to Heikki, Pavel Stehule, Chapman Flack for the earlier comments on the disk quota feature. 
Any comments on how to improve disk quota feature are appreciated.* On Mon, Sep 3, 2018 at 12:05 PM, Pavel Stehule wrote: > > > 2018-09-03 3:49 GMT+02:00 Hubert Zhang : > >> Thanks Pavel. >> Your patch did enforcement on storage level(md.c or we could also use >> smgr_extend). It's straight forward. >> But I prefer to implement disk_quota as a feature with following >> objectives: >> 1 set/alter disk quota setting on different database objects, e.g. user, >> database, schema etc. not only a general GUC, but we could set separate >> quota limit for a specific objects. >> 2 enforcement operator should work at two positions: before query is >> running and when query is running. The latter one's implementation maybe >> similar to your patch. >> > > The patch was just example. The resource quotes should be more complex - > per partition, table, schema, database, user - so GUC are possible, but not > very user friendly. > > Our case is specific, but not too much. The servers are used for > multidimensional analyses - and some tables can grow too fast (COPY, INSERT > SELECT). We need to solve limits immediately. The implementation is simple, > so I did it. Same implementation on database level, or schema level needs > some more locks, so it will not be too effective. The resource management > can be complex very complex, and I expect so it will be hard work. > > Regards > > Pavel > > >> On Sun, Sep 2, 2018 at 8:44 PM, Pavel Stehule >> wrote: >> >>> Hi >>> >>> 2018-09-02 14:18 GMT+02:00 Hubert Zhang : >>> >>>> Thanks Chapman. >>>> @Pavel, could you please explain more about your second suggestion >>>> "implement >>>> some quotas on storage level?" >>>> >>> >>> See attached patch - it is very simple - and good enough for our >>> purposes. 
>>> >>> Regards >>> >>> Pavel >>> >>> >>> >>>> We will not keep the long-lived processes attach to all databases(just >>>> like you mentioned servers with thousands of databases) >>>> And you are right, we could share ideas with autovacuum process, fork >>>> worker processes in need. >>>> "autovacuum checks for tables that have had a large number of >>>> inserted, updated or deleted tuples. These checks use the statistics >>>> collection facility" >>>> diskquota process is similar to autovacuum at caring about insert, but >>>> the difference is that it also care about vucuum full, truncate and drop. >>>> While update and delete may not be interested since no file change happens. >>>> So a separate diskquota process is preferred. >>>> >>>> So if we implemented disk quota as a full native feature, and in the >>>> first initial vers
How to get active table within a transaction.
Hi all, I have a requirement to get the active table list in a child process of postmaster. We are already able to get the active table list after the corresponding transaction ends by using the stat collector: each postgres process gathers table change information locally, but only sends the stat info to the collector after the transaction ends (when the backend becomes idle). As an enhancement, we also want to detect an active table while the transaction inserting into it is still in progress. Delay is acceptable. Are there any existing ways in PG to support this? -- Thanks Hubert Zhang
Re: Proposal for disk quota feature
Thanks Pavel. Your patch does enforcement at the storage level (md.c, or we could also use smgr_extend). It's straightforward. But I prefer to implement disk_quota as a feature with the following objectives: 1. Set/alter disk quota settings on different database objects, e.g. user, database, schema etc.: not only a general GUC, but separate quota limits for specific objects. 2. The enforcement operator should work at two positions: before a query runs and while a query is running. The latter's implementation may be similar to your patch. On Sun, Sep 2, 2018 at 8:44 PM, Pavel Stehule wrote: > Hi > > 2018-09-02 14:18 GMT+02:00 Hubert Zhang : > >> Thanks Chapman. >> @Pavel, could you please explain more about your second suggestion >> "implement >> some quotas on storage level?" >> > > See attached patch - it is very simple - and good enough for our purposes. > > Regards > > Pavel > > > >> We will not keep the long-lived processes attach to all databases(just >> like you mentioned servers with thousands of databases) >> And you are right, we could share ideas with autovacuum process, fork >> worker processes in need. >> "autovacuum checks for tables that have had a large number of >> inserted, updated or deleted tuples. These checks use the statistics >> collection facility" >> diskquota process is similar to autovacuum at caring about insert, but >> the difference is that it also care about vucuum full, truncate and drop. >> While update and delete may not be interested since no file change happens. >> So a separate diskquota process is preferred. >> >> So if we implemented disk quota as a full native feature, and in the >> first initial version I prefer to implement the following features: >> 1 Fork diskquota launcher process under Postmaster serverloop, which is >> long-lived. >> 2 Diskquota launcher process is responsible for creating diskquota >> worker process for every database. >> 3 DIskquota setting is stored in a separate catalog table for each >> database. 
>> 4 Initialization stage, Diskquota launcher process creates diskquota worker >> process for all the databases(traverse like autovacuum). Worker process >> calculates disk usage of db objects and their diskquota setting. If any >> db object exceeds its quota limit, put them into the blacklist in the >> shared memory, which will later be used by enforcement operator. Worker >> process exits when works are done. >> 5 Running stage, Diskquota launcher process creates diskquota worker >> process for the database with a large number of insert, copy, truncate, >> drop etc. or create disk quota statement. Worker process updates the file >> size for db objects containing the result relation, and compare with the >> diskquota setting. Again, if exceeds quota limit, put them into blacklist, >> remove from blacklist vice versa. Worker process exits when works are >> done and a GUC could control the frequency of worker process restart to a >> specific database. As you know, this GUC also controls the delay when we do >> enforcement. >> 6 Enforcement. When postgres backend executes queries, check the >> blacklist in shared memory to determine whether the query is allowed(before >> execute) or need rollback(is executing)? >> >> If we implemented disk quota as an extension, we could just use >> background worker to start diskquota launcher process and use >> RegisterDynamicBackgroundWorker() to fork child diskquota worker >> processes by the launcher process as suggested by @Chapman. >> Diskquota setting could be stored in user table in a separate schema for >> each database(Schema and table created by create extension statement) just >> like what Heikki has done in pg_quota project. But in this case, we need to >> create extension for each database before diskquota worker process can be >> set up for that database. >> >> Any comments on the above design and which is preferred, native feature >> or extension as the POC? 
>> >> >> -- Hubert >> >> >> >> On Fri, Aug 31, 2018 at 3:32 AM, Pavel Stehule >> wrote: >> >>> >>> >>> 2018-08-30 16:22 GMT+02:00 Chapman Flack : >>> >>>> On 08/30/2018 09:57 AM, Hubert Zhang wrote: >>>> >>>> > 2 Keep one worker process for each database. But using a parent/global >>>> > quota worker process to manage the lifecycle of database level worker >>>> > processes. It could handle the newly created database(avoid restart >>>> > database) and save resource w
Re: Proposal for disk quota feature
Thanks Chapman. @Pavel, could you please explain more about your second suggestion to "implement some quotas on storage level"? We will not keep long-lived processes attached to all databases (just like the servers with thousands of databases you mentioned). And you are right, we could share ideas with the autovacuum process and fork worker processes as needed. "autovacuum checks for tables that have had a large number of inserted, updated or deleted tuples. These checks use the statistics collection facility" — the diskquota process is similar to autovacuum in caring about inserts, but the difference is that it also cares about VACUUM FULL, TRUNCATE and DROP, while UPDATE and DELETE may not be of interest since no file-size change happens. So a separate diskquota process is preferred.

So if we implemented disk quota as a full native feature, in the first initial version I prefer to implement the following:

1. Fork a diskquota launcher process under the postmaster's server loop, which is long-lived.
2. The diskquota launcher process is responsible for creating a diskquota worker process for every database.
3. Diskquota settings are stored in a separate catalog table for each database.
4. Initialization stage: the diskquota launcher process creates diskquota worker processes for all the databases (traversing like autovacuum). Each worker process calculates the disk usage of db objects and their diskquota settings. If any db object exceeds its quota limit, it is put into the blacklist in shared memory, which will later be used by the enforcement operator. The worker process exits when its work is done.
5. Running stage: the diskquota launcher process creates a diskquota worker process for any database with a large number of INSERT, COPY, TRUNCATE, DROP etc., or a create-disk-quota statement. The worker process updates the file sizes for db objects containing the result relation and compares them with the diskquota settings. Again, objects exceeding their quota limit are put into the blacklist, and removed from it vice versa. 
The worker process exits when its work is done, and a GUC could control how frequently a worker process is restarted for a specific database. As you know, this GUC also controls the delay in enforcement.
6. Enforcement: when a postgres backend executes a query, it checks the blacklist in shared memory to determine whether the query is allowed (before execution) or needs to be rolled back (while executing).

If we implemented disk quota as an extension instead, we could use a background worker to start the diskquota launcher process, and the launcher process could fork child diskquota worker processes with RegisterDynamicBackgroundWorker(), as suggested by @Chapman. Diskquota settings could be stored in a user table in a separate schema for each database (schema and table created by the CREATE EXTENSION statement), just like what Heikki has done in the pg_quota project. But in this case, we would need to create the extension in each database before a diskquota worker process can be set up for that database.

Any comments on the above design, and which is preferred as the POC, native feature or extension? -- Hubert On Fri, Aug 31, 2018 at 3:32 AM, Pavel Stehule wrote: > > > 2018-08-30 16:22 GMT+02:00 Chapman Flack : > >> On 08/30/2018 09:57 AM, Hubert Zhang wrote: >> >> > 2 Keep one worker process for each database. But using a parent/global >> > quota worker process to manage the lifecycle of database level worker >> > processes. It could handle the newly created database(avoid restart >> > database) and save resource when a database is not used. But this needs >> to >> > change worker process to be hierarchical. Postmaster becomes the >> grandfather >> > of database level worker processes in this case. >> >> I am using background workers this way in 9.5 at $work. >> >> In my case, one worker lives forever, wakes up on a set period, and >> starts a short-lived worker for every database, waiting for each >> one before starting the next. >> >> It was straightforward to implement. 
Looking back over the code, >> I see the global worker assigns its own PID to worker.bgw_notify_pid >> of each of its children, and also obtains a handle for each child >> from RegisterDynamicBackgroundWorker(). >> >> I imagine the global quota worker would prefer to start workers >> for every database and then just wait for notifications from any >> of them, but that seems equally straightforward at first glance. >> > > There are servers with thousands databases. Worker per database is not > good idea. > > It should to share ideas, code with autovacuum process. > > Not sure, how to effective implementation based on bg workers can be. On > servers with large set of databases, large set of tables it can identify > too big table too late. > > Isn't better to implement some quotas on storage level? > > Regards > > Pavel > > > >> -Chap >> >> > -- Thanks Hubert Zhang
Proposal for disk quota feature
Hi all, We want to introduce disk quota feature into Postgres. *Why disk quota* *In a multi-tenant environment, there is a requirement to limit the disk quota that database/schema/table can be written or a user can consume for different organizations.* *Meanwhile, other databases such as Oracle, Teradata, DB2 have already supported disk quota feature.* *Heikki has already implemented disk quota feature in Postgres as an extension pg_quota <https://github.com/hlinnaka/pg_quota>. We plan to enhance disk quota feature based on Heikki's implementation.* *Scope* The scope of disk quota feature is to limit the disk usage of database/schema/table objects and users. Here table means heap table, ao table, index table, partition table and associated table( toast table, visible table, large object etc.). Schema disk quota is the disk quota of all the tables in this schema. Database disk quota is the disk quota of all the tables in this database. User's quota is the size of all the tables whose owner are this user. Out of Scope: Note that spill files, xlogs, clogs and logs are not considered for database object level disk quota at this stage. *Design* We propose disk quota with the following components: 1. Quota Setting Store is where the disk quota setting to be stored and accessed. DBA or object owner uses SQL queries to configure the disk quota for each database objects. *2. Quota Change Detector is the monitor of size change of database objects in Postgres. It will write change information to shared memory to notify Quota Size Checker. The implementation of Change Detector could be hooks in smgr_extend/smgr_unlink/smgr_truncate when modifying the size of a heap table. The hooks will write to shared memory in a batched way to reduce the impact on OLTP performance.3. Quota Size Checker is implemented as a worker process. It maintains the current disk usage for each database objects and users, and compare them with settings in Quota Setting Store. 
If it detects the disk usage hit the quota redzone(either upon or below), it will notify Quota Enforcement Operator. 4. Quota Enforcement Operator has two roles: one is to check the disk quota permission before queries are executed(QueryBeforeRun Check), the other is to cancel the running queries when it reaches the disk quota limit dynamically(QueryRunning Check). Quota Enforcement Operator uses shared memory to store the enforcement information to guarantee a quick check. * *To implement the right proof of concept, we want to receive feedback from the community from the following aspects: * Q1. User Interface: when setting a disk quota, Option 1 is to use *insert into quota.config(user1, 10G)* Option 2 is to use UDF *select set_quota("role","user1",10G)* Option 3 is to use native SQL syntax *create disk quota on ROLE user1 10G, or create disk quota on SCHEMA s1 25G;* Q2. Quota Setting Store using user table or catalog? Option 1 is to create a schema called quota for each database and write quota settings into quota.config table, only DBA could modify it. This corresponds to Q1.option1 Option 2 is to store quota settings into the catalog. For Schema and Table write them to database level catalog. For database or user, write them to either database level or global catalog. I personally prefer Q1's option3 and Q2's option2, since it makes disk quota more like a native feature. We could support the quota worker process implementation as an extension for now, but in the long term, disk quota is like a fundamental feature of a database and should be a native feature just like other databases. So store quota conf into catalog and supply native syntax is better. Open Problem We prepare to implement Quota Size Checker as a worker process. Worker process needs to connect to a database to build the disk usage map and quota map(e.g. 
a user/shcema's disk usage on a given database) But one worker process can only bind to one database(InitPostgres(dbname)) at the initialization stage. It results in we need a separate worker process for each database. The solution to this problem is not straightforward, here are some ideas: 1 To make worker process could retrieve and cache information from all the databases. As Tom Lane pointed out that it needs to flush all the database specific thing, like relcache syscache etc. 2 Keep one worker process for each database. But using a parent/global quota worker process to manage the lifecycle of database level worker processes. It could handle the newly created database(avoid restart database) and save resource when a database is not used. But this needs to change worker process to be hierarchical. Postmaster becomes the grandfather of database level worker processes in this case. Any better ideas on it? -- Thanks Hubert Zhang
Is child process of postmaster able to access all the databases?
Hello all. background worker can use SPI to read a database, but it can call BackgroundWorkerInitializeConnection(dbname) only once. I wonder if there is a way to let a child process of postmaster to access all the databases one by one? -- Thanks Hubert Zhang
Re: Considering signal handling in plpython again
Hi Heikki, Not working on it now, you can go ahead. On Fri, Jun 22, 2018 at 12:56 AM, Heikki Linnakangas wrote: > Hi Hubert, > > Are you working on this, or should I pick this up? Would be nice to get > this done as soon as v12 development begins. > > - Heikki > -- Thanks Hubert Zhang
Secured and customizable PLPython and PLR on Postgresql
Hi all, As you know PLPython and PLR are untrusted language, which means only DBA could create these UDFs(very inconvenient). Moreover it's also hard to supply user specific Python/R env for different data scientists. We are working on an open source project to make PLPython and PLR trusted and customizable. The project is called PLContainer, which use Docker Container as the sandbox to avoid malicious user breaking Postgresql database or even the whole host machine. Now there is a basic version which could support PLContainer on PG9.6 (Thanks krait007 and markwort for the contribution). We still have a lot of issues to make it production ready and share with more peoples. [Github umbrella project](https://github.com/greenplum-db/plcontainer/projects/1) If you are interested in it, feel free to try it. Your suggestion and contribution will be appreciated. -- Thanks Hubert Zhang
Re: Considering signal handling in plpython again
There are remaining two problems 1. Do we need to consider when to delete the extension hook or it's not necessary? As the destroy function _PG_fini doesn't work, I cannot find a good place to reset to hook gracefully. I tested the drop language plpythonu statement which will not remove the python shared library in the current session, So it seems to be safe to leave the cancel_handler_hook not be reset. How about other extensions, for example plr. Does the "drop extension" statement will not remove the loaded shared library in the process either? -- Another idea is to register the hook at the beginning of plpython_call_handler and unregister the hook at the end of plpython_call_handler. 2. Do we need to use explicit hook list(List *cancel_hook_list) instead of implicit cancel_hook(which relies on the extension to link the cancel_hook inside their code e.g. prev_hook = cancel_hook; cancel_hook=my_hook; void my_hook(){mywork(); (*prev_hook)();} )? I didn't find any explicit hook list in PG code base, is that a good practice? -- Hubert On Mon, May 14, 2018 at 6:40 PM, Heikki Linnakangas <hlinn...@iki.fi> wrote: > On 14/05/18 10:56, Hubert Zhang wrote: > >> For nested SPI case, one option is to turn off the bool variable when >> entering the SPI function(PLy_spi_prepare, PLy_spi_execute, PLy_cursor >> etc.) >> and turn on the bool variable again when exiting the SPI function. >> > > Yeah, that seems reasonable. > > - Heikki > -- Thanks Hubert Zhang
Re: Considering signal handling in plpython again
For nested SPI case, one option is to turn off the bool variable when entering the SPI function(PLy_spi_prepare, PLy_spi_execute, PLy_cursor etc.) and turn on the bool variable again when exiting the SPI function. If it's OK, we could follow this way to update Heikki's patch. --Hubert On Fri, May 11, 2018 at 9:28 PM, Heikki Linnakangas <hlinn...@iki.fi> wrote: > > > On 11 May 2018 10:01:56 EEST, Hubert Zhang <hzh...@pivotal.io> wrote: > >2. Add a flag in hook function to indicate whether to call > >Py_AddPendingCall. > >This is straightforward.(I prefer it) > > Yeah, that's what I had in mind, too. A global bool variable that's set > when you enter libpython, and cleared on return. Need to handle nesting, > i.e if a PL/python function runs a slow query with SPI, and cancellation > happens during that. And the case that the SPI query calls another > PL/python function. > > - Heikki > -- Thanks Hubert Zhang
Re: Considering signal handling in plpython again
Thanks Heikki and Robert for your comments. I reviewed Heikki's patch and let's enhance it. As Heikki mentioned, there is a problem when no Python code is being executed. I tested it in the following case "select pysleep();" and then type ctrl-c, query cancelled successfully.(Patch works:)) "select * from a, b, c, d;" and then type ctrl-c, query cancelled successfully. "select pysleep();" It will terminate immediately(without type ctrl-c) due to the Py_AddPendingCall is registered on the last query.(The problem Heikki said) To fix this problem, we need to let query like "select * from a, b, c, d;" doesn't call Py_AddPendingCall. There are at least 3 ways. 1. Add a flag in hook function to indicate whether to call the hook function. In current patch the hook function will do two things: a. call Py_AddPendingCall; b. call the previous hook(prev_cancel_pending_hook). So this method will introduce side effect on miss the previous hook. To enhance it, we need to change the hook in postgres.c to hook array. And let hook function only do one thing. 2. Add a flag in hook function to indicate whether to call Py_AddPendingCall. This is straightforward.(I prefer it) 3. Delete the hook after it's been used. This is related to the lifecycle of signal hooks. In current patch the handler hook is registered inside PLy_initialize() which will be called for every plpython_call_handler(). I prefer to just register hook once and in _PG_init() for each extension. If follow this way, delete hook is not needed. Any comments? On Thu, May 10, 2018 at 10:50 PM, Heikki Linnakangas <hlinn...@iki.fi> wrote: > On 10/05/18 09:32, Hubert Zhang wrote: > >> Hi all, >> >> I want to support canceling for a plpython query which may be a busy loop. >> >> I found some discussions on pgsql-hackers 2 years ago. Below is the link. 
>> >> https://www.postgresql.org/message-id/CAFYwGJ3+Xg7EcL2nU-MxX >> 6p+o6c895pm3myz-b+9n9dffeh...@mail.gmail.com >> >> Mario wrote a patch to fix this problem at that time >> *https://github.com/CartoDB/postgres/commit/d587d8a6e4f035cc >> 45e1d84fc46aa7c3ab0344c3 >> <https://github.com/CartoDB/postgres/commit/d587d8a6e4f035cc >> 45e1d84fc46aa7c3ab0344c3>* >> <https://github.com/CartoDB/postgres/commit/d587d8a6e4f035cc >> 45e1d84fc46aa7c3ab0344c3> >> >> The main logic is to register a new signal handler for SIGINT/SIGTERM >> and link the old signal handler in the chain. >> >> static void PLy_python_interruption_handler() >> { >> PyErr_SetString(PyExc_RuntimeError, "test except"); >> return NULL; >> } >> static void >> PLy_handle_interrupt(int sig) >> { >> // custom interruption >> int added = Py_AddPendingCall(PLy_python_interruption_handler, NULL); >> if (coreIntHandler) { >> (*coreIntHandler)(sig); >> } >> } >> >> Does anyone have some comments on this patch? >> > > PostgreSQL assumes to have control of all the signals. Although I don't > foresee any changes in this area any time soon, there's no guarantee that > overriding the SIGINT/SIGTERM will do what you want in the future. Also, > catching SIGINT/SIGTERM still won't react to recovery conflict interrupts. > > In that old thread, I think the conclusion was that we should provide a > hook in the backend for this, rather than override the signal handler > directly. We could then call the hook whenever InterruptPending is set. > No-one got around to write a patch to do that, though. > > As for me, I think handler function should call PyErr_SetInterrupt() >> instead of PyErr_SetString(PyExc_RuntimeError, "test except"); >> > > Hmm. 
I tested that, but because the Python's default SIGINT handler is not > installed, PyErr_SetInterrupt() will actually throw an error: > > TypeError: 'NoneType' object is not callable > > I came up with the attached patch, which is similar to Mario's, but it > introduces a new "hook" for this. > > One little problem remains: The Py_AddPendingCall() call is made > unconditionally in the signal handler, even if no Python code is currently > being executed. The pending call is queued up until the next time you run a > PL/Python function, which could be long after the original statement was > canceled. We need some extra checks to only raise the Python exception, if > the Python interpreter is currently active. > > - Heikki > -- Thanks Hubert Zhang
Considering signal handling in plpython again
Hi all, I want to support canceling for a plpython query which may be a busy loop. I found some discussions on pgsql-hackers 2 years ago. Below is the link. https://www.postgresql.org/message-id/cafywgj3+xg7ecl2nu-mxx6p+o6c895pm3myz-b+9n9dffeh...@mail.gmail.com Mario wrote a patch to fix this problem at that time *https://github.com/CartoDB/postgres/commit/d587d8a6e4f035cc45e1d84fc46aa7c3ab0344c3 <https://github.com/CartoDB/postgres/commit/d587d8a6e4f035cc45e1d84fc46aa7c3ab0344c3>* <https://github.com/CartoDB/postgres/commit/d587d8a6e4f035cc45e1d84fc46aa7c3ab0344c3> The main logic is to register a new signal handler for SIGINT/SIGTERM and link the old signal handler in the chain. static void PLy_python_interruption_handler() { PyErr_SetString(PyExc_RuntimeError, "test except"); return NULL; } static void PLy_handle_interrupt(int sig) { // custom interruption int added = Py_AddPendingCall(PLy_python_interruption_handler, NULL); if (coreIntHandler) { (*coreIntHandler)(sig); } } Does anyone have some comments on this patch? As for me, I think handler function should call PyErr_SetInterrupt() instead of PyErr_SetString(PyExc_RuntimeError, "test except"); -- Thanks Hubert Zhang