Re: [HACKERS] Sharing aggregate states between different aggregate functions

2015-07-26 Thread Haribabu Kommi
On Thu, Jul 9, 2015 at 7:44 PM, David Rowley
 wrote:
> On 15 June 2015 at 12:05, David Rowley  wrote:
>>
>>
>> This basically allows an aggregate's state to be shared between other
>> aggregate functions when both aggregates' transition functions (and a few
>> other things) match.
>> There's quite a number of aggregates in our standard set which will
>> benefit from this optimisation.
>>
>
> After compiling the original patch with another compiler, I noticed a couple
> of warnings.
>
> The attached fixes these.

I did some performance tests on the patch. This patch shows a good
improvement for same-column aggregates. With int or bigint datatype columns,
the patch doesn't show any visible performance difference, but with the
numeric datatype it shows a good improvement.

select sum(x), avg(y) from test where x < $1;

Different columns:

selectivity (millions)   Head   patch
0.1                      315    322
0.3                      367    376
0.5                      419    427
1                        551    558
2                        824    826

select sum(x), avg(x) from test where x < $1;

Same column:

selectivity (millions)   Head   patch
0.1                      314    314
0.3                      363    343
0.5                      412    373
1                        536    440
2                        795    586
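A sketch of a setup matching the queries above (the table definition and
row count are assumptions, not stated in the post; only the two queries
are from the test itself):

-- assumed setup, for illustration only
CREATE TABLE test (x numeric, y numeric);
INSERT INTO test SELECT i, i FROM generate_series(1, 3000000) i;
VACUUM ANALYZE test;

SELECT sum(x), avg(y) FROM test WHERE x < 500000;  -- different columns
SELECT sum(x), avg(x) FROM test WHERE x < 500000;  -- same column, state shared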

Regards,
Hari Babu
Fujitsu Australia


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failing assertions in indxpath.c, placeholder.c and brin_minmax.c

2015-07-26 Thread Andreas Seltenreich
Tom Lane writes:

> Andreas Seltenreich  writes:
>> when running my random query generator contraption[1] against the
>> regression database of 9.5 or master, it occasionally triggers one of
>> the following three assertions.
>
> I've fixed the first two of these --- thanks for the report!

I let sqlsmith run during the night, and it did no longer trigger the
first two.  During roughly a million random queries it triggered the
already mentioned brin one 10 times, but there was also one instance of
this new one in the log:

TRAP: FailedAssertion("!(join_clause_is_movable_into(rinfo, joinrel->relids, 
join_and_req))", File: "relnode.c", Line: 987)
LOG:  server process (PID 12851) was terminated by signal 6: Aborted
DETAIL:  Failed process was running: select  
  rel65543066.tmplname as c0, 
  rel65543064.umuser as c1
from 
  public.dept as rel65543059
inner join pg_catalog.pg_user_mappings as rel65543064
left join pg_catalog.pg_enum as rel65543065
on (rel65543064.srvname = rel65543065.enumlabel )
  inner join pg_catalog.pg_ts_template as rel65543066
  on (rel65543065.enumtypid = rel65543066.tmplnamespace )
on (rel65543059.dname = rel65543064.srvname )
where ((rel65543059.mgrname = rel65543059.mgrname) 
and (rel65543064.usename = rel65543066.tmplname)) 
  and (rel65543059.mgrname ~~ rel65543059.mgrname)
fetch first 128 rows only;

>> ,[ git bisect ]
>> |   first bad commit: [3f8c23c4d31d4a0e801041733deb2c7cfa577b32] Improve
>> |   predtest.c's ability to reason about operator expressions.
>> `
>
> I'm a bit confused about this aspect of your report though, because in
> my hands that example fails clear back to 9.2.  It doesn't seem to require
> the predtest.c improvement to expose the fault.

Hmm, I actually used a different, uglier query to trigger this assertion
for the bisection run.  I'll attach it[1] along with the complete git
bisect log[2].

regards,
andreas

Footnotes: 
[1]  select  subq_717608.c3 as c0, rel4551421.inhrelid as c1, 
rel4551421.inhrelid as c2, subq_717608.c3 as c3 from 
information_schema.foreign_tables as rel4551363 right join public.hash_f8_heap 
as rel4551366 inner join pg_catalog.pg_constraint as rel4551419 inner join 
(select  rel4551420.bb as c0, rel4551420.aa as c1, rel4551420.aa as c2, 
rel4551420.aa as c3 from public.b as rel4551420 where ( 
rel4551420.bb>rel4551420.bb ) and ( rel4551420.bbrel4551419.connamespace ) 
and ( rel4551419.connamespace>=rel4551419.conrelid ) ) and ( 
rel4551421.inhparent

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Sharing aggregate states between different aggregate functions

2015-07-26 Thread David Rowley
On 27 July 2015 at 03:24, Heikki Linnakangas  wrote:

> On 07/09/2015 12:44 PM, David Rowley wrote:
>
>> On 15 June 2015 at 12:05, David Rowley 
>> wrote:
>>
>>
>>> This basically allows an aggregate's state to be shared between other
>>> aggregate functions when both aggregates' transition functions (and a few
>>> other things) match.
>>> There's quite a number of aggregates in our standard set which will
>>> benefit from this optimisation.
>>>
>> After compiling the original patch with another compiler, I noticed a
>> couple of warnings.
>>
>> The attached fixes these.
>>
>
> I spent some time reviewing this. I refactored the ExecInitAgg code rather
> heavily to make it more readable (IMHO); see attached. What do you think?
> Did I break anything?
>

Thanks for taking the time to look at this and make these fixes.

I'm just looking over your changes:

- ereport(ERROR,
- (errcode(ERRCODE_INVALID_FUNCTION_DEFINITION),
- errmsg("aggregate %u needs to have compatible input type and transition type",
- aggref->aggfnoid)));
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_FUNCTION_DEFINITION),
+ errmsg("aggregate needs to have compatible input type and transition type")));

I can't quite see the reason to remove the agg OID from the error message
here. It still seems valid to use, since build_peraggstate_for_aggref()
is only called when nothing is shared.


- * agg_input_types, agg_state_type, agg_result_type identify the input,
- * transition, and result types of the aggregate.  These should all be
- * resolved to actual types (ie, none should ever be ANYELEMENT etc).
+ * agg_input_types identifies the input types of the aggregate.  These should
+ * be resolved to actual types (ie, none should ever be ANYELEMENT etc).

I'm not sure I understand why agg_state_type and agg_result_type have
changed here.


+ peraggstate->sortstates = (Tuplesortstate **)
+ palloc0(sizeof(Tuplesortstate *) * numGroupingSets);
+ for (currentsortno = 0; currentsortno < numGroupingSets; currentsortno++)
+ peraggstate->sortstates[currentsortno] = NULL;

This was not you, but this NULL setting looks unneeded due to the palloc0().
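Purely as illustration, with palloc0() the initialization reduces to the
allocation alone, since palloc0 zero-fills and all-zero bits are how NULL
pointers are represented on supported platforms:

peraggstate->sortstates = (Tuplesortstate **)
    palloc0(sizeof(Tuplesortstate *) * numGroupingSets);
/* no explicit NULL-ing loop needed: the array is already zeroed */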


> Some comments:
>
> * In aggref_has_compatible_states(), you give up if aggtype or aggcollid
> differ. But those properties apply to the final function, so you were
> leaving some money on the table by disallowing state-sharing if they differ.
>

Good catch, and accurate analogy. Thanks for fixing.


>
> * The filter and input expressions are initialized for every AggRef,
> before the deduplication logic kicks in. The AggrefExprState.aggfilter,
> aggdirectargs and args fields really belong to the AggStatePerAggState
> struct instead. This is not a new issue, but now that we have a convenient
> per-aggstate struct to put them in, let's do so.
>

Good idea. I failed to notice that code over there in execQual.c so I agree
that where you've moved it to is much better.


>
> * There was a reference-after free bug in aggref_has_compatible_states;
> you cannot ReleaseSysCache and then continue pointing to the struct.
>

Thanks for fixing.

In this function I also wasn't quite sure whether it was worth comparing
both non-NULL INITCONDs here. I believe my code comments may slightly
contradict what the code actually does, as the comments talk about them
having to match, but the code just bails if either is non-NULL. The reason I
didn't check them was that it seems inevitable that some duplicate work
needs to be done when setting up the INITCOND. Perhaps it's worth it?
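A hypothetical version of such a check (variable names invented here, not
from the patch) could treat both-NULL as equal and otherwise compare the
text values:

/* hypothetical: allow state sharing only when the INITCONDs agree */
if (initValIsNull1 != initValIsNull2)
    return false;           /* one has an INITCOND, the other doesn't */
if (!initValIsNull1 &&
    strcmp(TextDatumGetCString(initVal1),
           TextDatumGetCString(initVal2)) != 0)
    return false;           /* both set, but to different values */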


select aggfnoid || '(' || typname || ')',aggtransfn,agginitval
from pg_aggregate
inner join pg_type on aggtranstype = oid
where aggtransfn in (select aggtransfn
from pg_aggregate
group by aggtransfn
having count(*)>1)
order by aggtransfn;

This indicates that everything using float4_accum as a transfn could
benefit from that. I just wasn't sure how far to go.



> * The code was a bit fuzzy on which parts of the per-aggstate are filled
> in at what time. Some of the fields were overwritten every time a match was
> found. With the same values, so no harm done, but I found it confusing. I
> refactored ExecInitAgg in the attached patch to clear that up.
>

> * The API of build_aggregate_fnexprs() was a bit strange now that some
> callers use it to only fill in the final function, some call it to fill
> both the transition functions and the final function. I split it to two
> separate functions.
>
>
That's much better.


> * I wonder if we should do this duplicate elimination at plan time. It's
> very fast, so I'm not worried about that right now, but you had grand plans
> to expand this so that an aggregate could optionally use one of many
> different kinds of state values. At that point, it certainly seems like a
> planning decision to decide which aggregates share state. I think we can
> leave it as it is for now, but it's something to perhaps consider later.
>
>
I don't think I'm going to get the time to work on the "supporting
aggregate" stuff you're talking about

Re: [HACKERS] LWLock deadlock and gdb advice

2015-07-26 Thread Amit Langote
On 2015-07-16 PM 04:03, Jeff Janes wrote:
> On Wed, Jul 15, 2015 at 8:44 AM, Heikki Linnakangas  wrote:
> 
>>
>> Both. Here's the patch.
>>
>> Previously, LWLockAcquireWithVar set the variable associated with the lock
>> atomically with acquiring it. Before the lwlock-scalability changes, that
>> was straightforward because you held the spinlock anyway, but it's a lot
>> harder/expensive now. So I changed the way acquiring a lock with a variable
>> works. There is now a separate flag, LW_FLAG_VAR_SET, which indicates that
>> the current lock holder has updated the variable. The LWLockAcquireWithVar
>> function is gone - you now just use LWLockAcquire(), which always clears
>> the LW_FLAG_VAR_SET flag, and you can call LWLockUpdateVar() after that if
>> you want to set the variable immediately. LWLockWaitForVar() always waits
>> if the flag is not set, i.e. it will not return regardless of the
>> variable's value, if the current lock-holder has not updated it yet.
>>
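A sketch of the call sequence that description implies (signatures as in
lwlock.c; the LW_FLAG_VAR_SET behaviour is specific to the patch described
above, so this is illustrative only):

    uint64      insertingAt;    /* illustrative variable protected by the lock */
    uint64      observed;

    /* holder: LWLockAcquire() clears LW_FLAG_VAR_SET, then publish a value */
    LWLockAcquire(lock, LW_EXCLUSIVE);
    LWLockUpdateVar(lock, &insertingAt, startPos);  /* sets LW_FLAG_VAR_SET */
    /* ... do the protected work ... */
    LWLockRelease(lock);

    /* waiter: sleeps until the holder publishes a value or releases the lock */
    LWLockWaitForVar(lock, &insertingAt, oldValue, &observed);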
>>
> I ran this for a while without casserts and it seems to work.  But with
> casserts, I get failures in the autovac process on the GIN index.
> 
> I don't see how this is related to the LWLock issue, but I didn't see it
> without your patch.  Perhaps the system just didn't survive long enough to
> uncover it without the patch (although it shows up pretty quickly).  It
> could just be an overzealous Assert, since the casserts off didn't show
> problems.
> 
> bt and bt full are shown below.
> 

I got a similar assert failure but with a btree index
(pg_attribute_relid_attnum_index). The backtrace looks like Jeff's:

(gdb) bt
#0  0x003969632625 in raise () from /lib64/libc.so.6
#1  0x003969633e05 in abort () from /lib64/libc.so.6
#2  0x0092eb9e in ExceptionalCondition (conditionName=0x9c2220
"!(((PageHeader) (page))->pd_special >= (__builtin_offsetof
(PageHeaderData, pd_linp)))",
errorType=0x9c0c41 "FailedAssertion", fileName=0x9c0c10 "nbtree.c",
lineNumber=903) at assert.c:54
#3  0x004e02d8 in btvacuumpage (vstate=0x7fff2c7655f0, blkno=9,
orig_blkno=9) at nbtree.c:903
#4  0x004e0067 in btvacuumscan (info=0x7fff2c765cd0,
stats=0x279f7d0, callback=0x668f6d ,
callback_state=0x279e338, cycleid=49190)
at nbtree.c:821
#5  0x004dfdde in btbulkdelete (fcinfo=0x7fff2c7657d0) at nbtree.c:676
#6  0x00939769 in FunctionCall4Coll (flinfo=0x7fff2c765bb0,
collation=0, arg1=140733939342544, arg2=0, arg3=6721389, arg4=41542456) at
fmgr.c:1375
#7  0x004d2a01 in index_bulk_delete (info=0x7fff2c765cd0,
stats=0x0, callback=0x668f6d , callback_state=0x279e338)
at indexam.c:690
#8  0x0066861d in lazy_vacuum_index (indrel=0x7fd40ab846f0,
stats=0x279e770, vacrelstats=0x279e338) at vacuumlazy.c:1367
#9  0x006678a8 in lazy_scan_heap (onerel=0x274ec90,
vacrelstats=0x279e338, Irel=0x279e790, nindexes=2, scan_all=0 '\000') at
vacuumlazy.c:1098
#10 0x006660f7 in lazy_vacuum_rel (onerel=0x274ec90, options=99,
params=0x27bdc88, bstrategy=0x27bdd18) at vacuumlazy.c:244
#11 0x00665c1a in vacuum_rel (relid=1249, relation=0x7fff2c7662a0,
options=99, params=0x27bdc88) at vacuum.c:1388
#12 0x006643ce in vacuum (options=99, relation=0x7fff2c7662a0,
relid=1249, params=0x27bdc88, va_cols=0x0, bstrategy=0x27bdd18,
isTopLevel=1 '\001')
at vacuum.c:293
#13 0x0075d23c in autovacuum_do_vac_analyze (tab=0x27bdc80,
bstrategy=0x27bdd18) at autovacuum.c:2807
#14 0x0075c632 in do_autovacuum () at autovacuum.c:2328
#15 0x0075b457 in AutoVacWorkerMain (argc=0, argv=0x0) at
autovacuum.c:1647
#16 0x0075b0a5 in StartAutoVacWorker () at autovacuum.c:1454
#17 0x0076f9cc in StartAutovacuumWorker () at postmaster.c:5261
#18 0x0076f28a in sigusr1_handler (postgres_signal_arg=10) at
postmaster.c:4918
#19 
#20 0x0039696e1353 in __select_nocancel () from /lib64/libc.so.6
#21 0x0076ace7 in ServerLoop () at postmaster.c:1582
#22 0x0076a538 in PostmasterMain (argc=3, argv=0x26f9330) at
postmaster.c:1263
#23 0x006c1c2e in main (argc=3, argv=0x26f9330) at main.c:223

Thanks,
Amit



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] spgist recovery assertion failure

2015-07-26 Thread Noah Misch
On Mon, Jul 27, 2015 at 02:19:09PM +0900, Michael Paquier wrote:
> On Mon, Jul 27, 2015 at 2:00 PM, Noah Misch  wrote:
> > When I caused a crash during the create_index regression test, recovery hit 
> > an
> > assertion failure.  Minimal test case:
> >
> > psql -X <<EOSQL
> > CREATE TABLE t (c text);
> > INSERT INTO t SELECT 'P0123456789abcdef' FROM generate_series(1,1000);
> > INSERT INTO t VALUES ('P0123456789abcdefF');
> > CREATE INDEX ON t USING spgist (c);
> > EOSQL
> > pg_ctl -m immediate -w restart
> 
> On which platform are you seeing the failure? I am afraid I could not
> reproduce the failure on Linux and OSX after testing it on HEAD.

I saw it on ppc64 Fedora and on {ppc32 PostgreSQL, ppc64 kernel} AIX.  Like
you, I don't see it on x86_64 Ubuntu.  Perhaps it is specific to big-endian.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] spgist recovery assertion failure

2015-07-26 Thread Michael Paquier
On Mon, Jul 27, 2015 at 2:00 PM, Noah Misch  wrote:
> When I caused a crash during the create_index regression test, recovery hit an
> assertion failure.  Minimal test case:
>
> psql -X <<EOSQL
> CREATE TABLE t (c text);
> INSERT INTO t SELECT 'P0123456789abcdef' FROM generate_series(1,1000);
> INSERT INTO t VALUES ('P0123456789abcdefF');
> CREATE INDEX ON t USING spgist (c);
> EOSQL
> pg_ctl -m immediate -w restart

On which platform are you seeing the failure? I am afraid I could not
reproduce the failure on Linux and OSX after testing it on HEAD.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench - allow backslash-continuations in custom scripts

2015-07-26 Thread Kyotaro HORIGUCHI
Hello,

> > Attached is the revised version of this patch.
> >
> > The first patch is not changed from before.
> >
> > The second fixes a kind of bug.
> >
> > The third is the new one to allow backslash continuation for
> > backslash commands.
> 
> Ah, thanks:-)
> 
> Would you consider adding the patch to the next commitfest? I may put
> myself as a reviewer...

No problem.

> > [...] I'm not satisfied by the design but I don't see another way..
> 
> I'll try to have a look.

Thanks.

> >>> I don't have idea how to deal with the copy of psqlscan.[lh] from
> >>> psql. Currently they are simply the dead copies of those of psql.
> >>
> >> I think that there should be no copies, but it should use relative
> >> symbolic links so that the files are kept synchronized.
> >
> > Yeah, I think so, but symlinks could cause problems with git and on
> > Windows. Another way would be to copy them from the psql directory.
> > They live next door to each other.
> 
> Indeed there are plenty of links already which are generated by
> makefiles (see src/bin/pg_xlogdump/*), and probably a copy is made on
> Windows. There should be no file duplication within the source tree.

Thank you for pointing that out. The Makefile of pg_xlogdump
re-creates symlinks to those files, and they are replaced with cp
on the platforms where symlinks are not usable. But the files
are explicitly added to the .sln file on MSVC. Anyway, I'll
address it.
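For reference, the pg_xlogdump rule looks roughly like this; adapting it
for psqlscan.[lh] would be similar (sketch only):

# symlink (or copy, where symlinks are unusable) the files at build time
psqlscan.l psqlscan.h: % : $(top_srcdir)/src/bin/psql/%
	rm -f $@ && $(LN_S) $< .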

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] spgist recovery assertion failure

2015-07-26 Thread Noah Misch
When I caused a crash during the create_index regression test, recovery hit an
assertion failure.  Minimal test case:

psql -X <<EOSQL
CREATE TABLE t (c text);
INSERT INTO t SELECT 'P0123456789abcdef' FROM generate_series(1,1000);
INSERT INTO t VALUES ('P0123456789abcdefF');
CREATE INDEX ON t USING spgist (c);
EOSQL
pg_ctl -m immediate -w restart

Re: [HACKERS] Parallel Seq Scan

2015-07-26 Thread Haribabu Kommi
On Thu, Jul 23, 2015 at 9:42 PM, Amit Kapila  wrote:
> On Wed, Jul 22, 2015 at 9:14 PM, Robert Haas  wrote:
>>
>> One thing I noticed that is a bit dismaying is that we don't get a lot
>> of benefit from having more workers.  Look at the 0.1 data.  At 2
>> workers, if we scaled perfectly, we would be 3x faster (since the
>> master can do work too), but we are actually 2.4x faster.  Each
>> process is on the average 80% efficient.  That's respectable.  At 4
>> workers, we would be 5x faster with perfect scaling; here we are 3.5x
>> faster.   So the third and fourth worker were about 50% efficient.
>> Hmm, not as good.  But then going up to 8 workers bought us basically
>> nothing.
>>
>
> I think the improvement also depends on how costly is the qualification,
> if it is costly, even for same selectivity the gains will be shown till
> higher
> number of clients and for simple qualifications, we will see that cost of
> having more workers will start dominating (processing data over multiple
> tuple queues) over the benefit we can achieve by them.

Yes, that's correct. When the qualification cost is increased, performance
also increases with the number of workers.

Instead of using all the configured workers per query, how about deciding the
number of workers based on the cost of the qualification? I am not sure whether
we have any information available to find out the qualification cost. This way
the workers would be distributed across all backends properly.
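For what it's worth, the planner does compute qualification costs via
cost_qual_eval(); a crude, purely illustrative heuristic (assuming root
and rel are in scope, as inside a planner function) might be:

    QualCost    qual_cost;
    int         nworkers;

    /* illustrative only: scale the worker count with per-tuple qual cost */
    cost_qual_eval(&qual_cost, rel->baserestrictinfo, root);
    nworkers = Min(max_worker_processes,
                   1 + (int) (qual_cost.per_tuple / cpu_operator_cost));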

Regards,
Hari Babu
Fujitsu Australia


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] raw output from copy

2015-07-26 Thread Craig Ringer
On 7 July 2015 at 14:32, Pavel Stehule  wrote:
> Hi
>
> previous patch was broken, and buggy
>
> Here is new version with fixed upload and more tests

I routinely see people trying to use COPY ... FORMAT binary to export
a single binary field (like an image, for example) and getting
confused by the header PostgreSQL adds. Or using text-format COPY and
struggling with the hex escaping. It's clearly something people have
trouble with.
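To make the failure mode concrete (table and column names invented for
illustration):

-- what people typically try first:
COPY (SELECT img FROM pics WHERE id = 1) TO '/tmp/img.png' (FORMAT binary);
-- not a usable image: the output starts with the 11-byte PGCOPY signature,
-- plus header flags and per-field length words.

COPY (SELECT img FROM pics WHERE id = 1) TO '/tmp/img.png';
-- text format instead: the file contains hex-escaped bytea ("\x89504e47...").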

It doesn't help that while lo_import and lo_export can read paths
outside the datadir (and refuse to read from within it),
pg_read_binary_file is superuser only and disallows absolute paths.
There's no corresponding pg_write_binary_file. So users who want to
import and export a single binary field tend to try to use COPY. We
have functionality for large objects that has no equivalent for
'bytea'.

I don't love the use of COPY for this, but it gets us support for
arbitrary clients pretty easily. Otherwise it'd be server-side only
via local filesystem access, or require special psql-specific
functionality like we have for lo_import etc.

The main point is that this is a real-world thing. People want to do
it, try to do it, and have problems doing it. So it's solving a
real issue.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] security labels on databases are bad for dump & restore

2015-07-26 Thread Craig Ringer
On 20 July 2015 at 01:18, Noah Misch  wrote:
> On Wed, Jul 15, 2015 at 11:08:53AM +0200, Andres Freund wrote:
>> On 2015-07-15 12:04:40 +0300, Alvaro Herrera wrote:
>> > Andres Freund wrote:
>> > > One thing worth mentioning is that arguably the problem is caused by the
>> > > fact that we're here emitting database level information in pg_dump,
>> > > normally only done for dumpall.
>
> Consistency with existing practice would indeed have pg_dump ignore
> pg_shseclabel and have pg_dumpall reproduce its entries.

Existing practice is pretty broken though, and not necessarily a good guide.

COMMENT ON DATABASE and SECURITY LABEL FOR DATABASE are dumped by
pg_dump, but always refer to the database's name at the time it was
dumped, so restoring it can break.

GRANTs on databases are ignored and not dumped by pg_dump or by
pg_dumpall --globals-only. The only way to dump them seems to be to
use pg_dumpall, which nobody uses in the real world.

I'd be strongly in favour of teaching GRANT, SECURITY LABEL, COMMENT
ON DATABASE, etc to recognise CURRENT_DATABASE as a keyword. Then
dumping them in pg_dump --create, and in pg_dump -Fc .
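That is, something like the following, which is proposed syntax and not
anything that exists today:

-- proposed, not current, syntax:
COMMENT ON DATABASE CURRENT_DATABASE IS 'whatever';
SECURITY LABEL FOR selinux ON DATABASE CURRENT_DATABASE IS '...';
GRANT CONNECT ON DATABASE CURRENT_DATABASE TO app_user;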

In practice I see zero real use of pg_dumpall without --globals-only,
and almost everyone does pg_dump -Fc. I'd like to see that common
case actually preserve the whole state of the system and do the right
thing sensibly.

A pg_restore option to skip database-level settings could be useful,
but I think by default they should be restored.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Combining Aggregates

2015-07-26 Thread David Rowley
On 27 July 2015 at 12:14, Kouhei Kaigai  wrote:

> > The main use case people have been talking about is parallel query, but
> > is there some other case this would be useful right now, without the
> > parallel query feature? You and Simon talked about this case:
> >
> > > 2. Queries such as:
> > >
> > > SELECT p.name, SUM(s.qty) FROM sales s INNER JOIN product p ON
> s.product_id
> > > = p.product_id GROUP BY p.name;
> > >
> > > Such a query could be transformed into:
> > >
> > > SELECT p.name,SUM(qty) FROM (SELECT product_id,SUM(qty) AS qty FROM
> sales
> > > GROUP BY product_id) s
> > > INNER JOIN product p ON p.product_id = s.product_id GROUP BY p_name;
> > >
> > > Of course the outer query's SUM and GROUP BY would not be required if
> there
> > > happened to be a UNIQUE index on product(name), but assuming there's
> not
> > > then the above should produce the results faster. This of course works
> ok
> > > for SUM(), but for something like AVG() or STDDEV() the combine/merge
> > > aggregate functions would be required to process those intermediate
> > > aggregate results that were produced by the sub-query.
> >
> > Any chance you could implement that in the planner?
> >
>
It likely needs planner enhancement prior to other applications...
>
> http://www.postgresql.org/message-id/ca+tgmobgwkhfzc09b+s2lxjtword5ht-avovdvaq4+rpwro...@mail.gmail.com
>
>
I had thought this too, but I'm not sure that's 100% correct. As I just
said in my previous email on this thread, I am now working on "group by
before join". I had started it with the intention of path-ifying the
grouping planner, but I soon realised that grouping_planner() does quite
a bit more at that final stage than what I propose to do to allow "group
by before join". This is mostly around the handling of DISTINCT, HAVING and
LIMIT. I don't think those need to be handled in my patch, perhaps with the
exception of DISTINCT without GROUP BY, but not when both are present.

Instead I've started by inventing GroupingPath and I'm now building these
new path types once there's enough tables joined for all Vars of the GROUP
BY and agg parameters.

I believe this optimisation needs to be costed as pushing the GROUP BY down
to happen as early as possible is *not* always a win. Perhaps the JOIN is
very selective and eliminates many groups. Hence I've added costing and
allowed the planner to choose what it thinks is cheapest.


> Once planner allowed to have both of normal path and partial aggregation
> paths to compare according to the cost, it is the straightforward way to
> do.
>

I've ended up with two boolean members in struct GroupingPath, combine_states
and finalize_aggs. I plan to modify nodeAgg.c to use these, and EXPLAIN to
give a better description of what it's doing.
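Roughly like this, as a sketch only; the real node will certainly carry
more than shown here (the groupClause field is my assumption):

typedef struct GroupingPath
{
    Path        path;
    List       *groupClause;    /* assumed: the grouping keys */
    bool        combine_states; /* inputs are partial states, not plain rows */
    bool        finalize_aggs;  /* apply final functions at this node */
} GroupingPath;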


>
> Here are various academic research, for example, below is the good starting
> point to clarify aggregate queries that we can run with 2-phase approach.
>
> http://www.researchgate.net/publication/2715288_Performing_Group-By_before_Join
>
>
Thanks, I've been using that very paper.

Regards

David Rowley

--
 David Rowley   http://www.2ndQuadrant.com/

 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Combining Aggregates

2015-07-26 Thread David Rowley
On 27 July 2015 at 04:58, Heikki Linnakangas  wrote:

> On 04/01/2015 06:28 PM, Robert Haas wrote:
>
>> On Mon, Mar 30, 2015 at 1:28 AM, Michael Paquier
>>  wrote:
>>
>>> I've been thinking of bumping this patch to the June commitfest as the
 patch only exists to provide the basic infrastructure for things like
 parallel aggregation, aggregate before join, and perhaps auto updating
 materialised views.

 It seems unlikely that any of those things will happen for 9.5.

>>>
>>> Yeah, I guess so...
>>>
>>>  Does anybody object to me moving this to June's commitfest?

>>>
>>> Not from my side FWIW. I think it actually makes sense.
>>>
>>
>> +1.  I'd like to devote some time to looking at this, but I don't have
>> the time right now.  The chances that we can do something useful with
>> it in 9.6 seem good.
>>
>
> And the June commitfest is now in progress.
>
> This patch seems sane to me, as far as it goes. However, there's no
> planner or executor code to use the aggregate combining for anything. I'm
> not a big fan of dead code, I'd really like to see something to use this.
>

Thanks for looking. I partially agree with that; it is a little weird to
put in code that's yet to be used. I'd certainly agree 95% if this were the
final commitfest of 9.5, but we're in the first commitfest of 9.6 and I
think there's a very high probability of this getting used in 9.6, and
likely that probability would be even higher if the code were already in.
Perhaps it's a little bit in the same situation as Robert's parallel
worker stuff?



>
> The main use case people have been talking about is parallel query, but is
> there some other case this would be useful right now, without the parallel
> query feature? You and Simon talked about this case:
>
>  2. Queries such as:
>>
>> SELECT p.name, SUM(s.qty) FROM sales s INNER JOIN product p ON
>> s.product_id
>> = p.product_id GROUP BY p.name;
>>
>> Such a query could be transformed into:
>>
>> SELECT p.name,SUM(qty) FROM (SELECT product_id,SUM(qty) AS qty FROM sales
>> GROUP BY product_id) s
>> INNER JOIN product p ON p.product_id = s.product_id GROUP BY p_name;
>>
>> Of course the outer query's SUM and GROUP BY would not be required if
>> there
>> happened to be a UNIQUE index on product(name), but assuming there's not
>> then the above should produce the results faster. This of course works ok
>> for SUM(), but for something like AVG() or STDDEV() the combine/merge
>> aggregate functions would be required to process those intermediate
>> aggregate results that were produced by the sub-query.
>>
>
> Any chance you could implement that in the planner?
>
>
Yes! I'm actually working on it now and so far have it partially working.
Quite likely I'll be able to submit for CF2. There's still some costing
tweaks to do. So far it just works for GROUP BY with no aggs. I plan to fix
that later using this patch.

I don't want to talk too much about it on this thread, but in a test query
which is the one in my example, minus the SUM(qty), with 1 million sales
records, and 100 products, performance goes from 350ms to 200ms on my
machine, so looking good so far.

Regards

David Rowley
--
 David Rowley   http://www.2ndQuadrant.com/

 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] New functions

2015-07-26 Thread Michael Paquier
On Wed, Jul 8, 2015 at 10:18 PM, Michael Paquier
 wrote:
>
> On Wed, Jul 8, 2015 at 9:15 PM, Дмитрий Воронин
>  wrote:
> >
> >
> > 07.07.2015, 18:34, "Michael Paquier" :
> >
> >>  Speaking of which, I have rewritten the patch as attached. This looks
> >>  way cleaner than the previous version submitted. Dmitry, does that
> >>  look fine for you?
> >>  I am switching this patch as "Waiting on Author".
> >>  Regards,
> >>  --
> >>  Michael
> >
> > Michael, hello. I'm looking at your variant of the patch. You created the
> > function ssl_extensions_info(), which gives information about SSL
> > extensions in a more informative view. So, it's cool.
>
> OK cool. I think a committer could look at it, hence switching it to
> "Ready for Committer".

Note for committers: attached is a small script that will generate a
client certificate with extensions enabled. This is helpful when
testing this patch. Once created, then simply connect with something
like this connection string:
"host=127.0.0.1 sslmode=verify-full sslcert=client.crt
sslkey=client.key sslrootcert=server.crt"
By querying the new function implemented by this patch, the information
about the client certificate extensions will show up.
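A hypothetical equivalent of such a script, assuming the server
certificate/key pair already exists as server.crt/server.key (the
names and the extension value here are illustrative):

openssl req -new -newkey rsa:2048 -nodes -subj "/CN=testuser" \
        -keyout client.key -out client.csr
printf "subjectAltName=DNS:client.example.com\n" > client.ext
openssl x509 -req -in client.csr -CA server.crt -CAkey server.key \
        -CAcreateserial -extfile client.ext -out client.crt -days 365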
-- 
Michael


ssl_key_generate
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-07-26 Thread Amit Kapila
On Sat, Jul 25, 2015 at 10:30 PM, Ildus Kurbangaliev <
i.kurbangal...@postgrespro.ru> wrote:

>
> > On Jul 24, 2015, at 10:02 PM, Robert Haas  wrote:
> >
> > Also, the patch should not invent a new array similar but not quite
> > identical to LockTagTypeNames[].
> >
> > This is goofy:
> >
> > +   if (tranche_id > 0)
> > +   result->tranche = tranche_id;
> > +   else
> > +   result->tranche = LW_USERDEFINED;
> >
> > If we're going to make everybody pass a tranche ID when they call
> > LWLockAssign(), then they should have to do that always, not sometimes
> > pass a tranche ID and otherwise pass 0, which is actually a valid
> > tranche ID but to which you've given a weird special meaning here.
> > I'd also name the constants differently, like LWLOCK_TRANCHE_XXX or
> > something.  LW_ is a somewhat ambiguous prefix.
> >
> > The idea of tranches is really that each tranche is an array of items
> > each of which contains 1 or more lwlocks.  Here you are intermingling
> > different tranches.  I guess that won't really break anything but it
> > seems ugly.
>
> Maybe it will be better to split LWLockAssign to two functions then, keep
> name
> LWLockAssign for user defined locks and other function with tranche_id.
>
> > On Jul 25, 2015, at 1:54 PM, Amit Kapila 
> wrote:
> >
> > He has to do that anyway: whether you go by defining individual arrays
> > or by having a unified WaitEventType enum over individual arrays, he has
> > to find out that array.  Another thing is that with that approach you can
> > encapsulate this information in one byte in the PgBackendStatus structure,
> > rather than using more bytes (4 bytes), and I think the function for
> > reporting the wait event will be much more simplified.
>
> In my approach there is no special meaning for the names. The arrays with
> names are located in lwlock.c and lock.c, and can be used for other things
> (for example tracing). One byte sounds good only for this case.


Do you mean to say that you need more than 256 events? I am not sure
if we can add that many events without adding a performance penalty
in some path.
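For reference, a one-byte field caps the space at 256 values however it
is sliced, e.g. with a hypothetical class/id split (not from the patch):

typedef uint8 PgStatWaitEvent;

/* hypothetical encoding: 4 classes x 64 events = 256 values total */
#define WAIT_EVENT_CLASS(e) ((uint8) (e) >> 6)
#define WAIT_EVENT_ID(e)    ((uint8) (e) & 0x3f)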

The original idea was proposed as one byte and the patch was written
with that in mind; now you are planning to extend it (which is okay), but
modifying it without any prior consent is slightly a matter of concern.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Failing assertions in indxpath.c, placeholder.c and brin_minmax.c

2015-07-26 Thread Michael Paquier
On Sun, Jul 26, 2015 at 10:32 PM, Andreas Seltenreich 
wrote:

> Michael Paquier writes:
>
> >> Footnotes:
> >> [1]  https://github.com/anse1/sqlsmith
> >
> > This is really interesting stuff. I think that it would be possible to
> > extract self-contained test cases from your tool and those queries to
> > reproduce the failures. It is written that this tool connects to a
> > database to retrieve the schema; what was it exactly in the case of
> > those failures?
>
> I used the database "regression" that pg_regress leaves behind when you
> remove the --temp-install from its default invocation through make
> check.  Sorry about not being explicit about that.
>
> So, dropping one of the queries into src/test/regress/sql/smith.sql and
> invoking
>
> make check EXTRA_TESTS=smith
>
> was all that was needed to integrate them.  I was then able to perform
> "git bisect run" on this command.  Er, plus consing the expected output
> file.
>
> I'm using the regression db a lot when hacking on sqlsmith, as it
> contains much more nasty things than your average database.
>

Ah, OK. Thanks. The code is licensed as GPL, has a dependency on libpqxx,
and is written in C++, so it cannot be integrated into core as a test
module in this state, but I think it would definitely be worth having
something like that in the code tree that runs on the buildfarm. We could
have caught those problems earlier.  Now I imagine that this is a costly
run, so we had better have a switch to control whether it is run, like a
configure option or a flag. Thoughts?
-- 
Michael


[HACKERS] Documentation tweak for row-valued expressions and null

2015-07-26 Thread Thomas Munro
Hi

I wonder if it might be worth adding a tiny note to the manual to
point out that the special logic for "row_constructor IS [ NOT ] NULL"
doesn't apply anywhere else that we handle nulls or talk about
[non]-null values in the manual.  See attached.
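For example, the asymmetry the note would be about:

-- ROW(1, NULL) is neither "null" nor "not null" under the special logic:
SELECT ROW(1, NULL) IS NULL AS is_null,
       ROW(1, NULL) IS NOT NULL AS is_not_null;
-- both return false, whereas most other null-handling contexts treat
-- such a row value as simply non-null.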

-- 
Thomas Munro
http://www.enterprisedb.com


note-about-row-is-null.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Combining Aggregates

2015-07-26 Thread Kouhei Kaigai
> On 04/01/2015 06:28 PM, Robert Haas wrote:
> > On Mon, Mar 30, 2015 at 1:28 AM, Michael Paquier
> >  wrote:
> >>> I've been thinking of bumping this patch to the June commitfest as the
> >>> patch only exists to provide the basic infrastructure for things like
> >>> parallel aggregation, aggregate before join, and perhaps auto updating
> >>> materialised views.
> >>>
> >>> It seems unlikely that any of those things will happen for 9.5.
> >>
> >> Yeah, I guess so...
> >>
> >>> Does anybody object to me moving this to June's commitfest?
> >>
> >> Not from my side FWIW. I think it actually makes sense.
> >
> > +1.  I'd like to devote some time to looking at this, but I don't have
> > the time right now.  The chances that we can do something useful with
> > it in 9.6 seem good.
> 
> And the June commitfest is now in progress.
> 
> This patch seems sane to me, as far as it goes. However, there's no
> planner or executor code to use the aggregate combining for anything.
> I'm not a big fan of dead code, I'd really like to see something to use
> this.
>
+1, this patch itself looks good to me, but...

> The main use case people have been talking about is parallel query, but
> is there some other case this would be useful right now, without the
> parallel query feature? You and Simon talked about this case:
>
> > 2. Queries such as:
> >
> > SELECT p.name, SUM(s.qty) FROM sales s INNER JOIN product p ON s.product_id
> > = p.product_id GROUP BY p.name;
> >
> > Such a query could be transformed into:
> >
> > SELECT p.name,SUM(qty) FROM (SELECT product_id,SUM(qty) AS qty FROM sales
> > GROUP BY product_id) s
> > INNER JOIN product p ON p.product_id = s.product_id GROUP BY p_name;
> >
> > Of course the outer query's SUM and GROUP BY would not be required if there
> > happened to be a UNIQUE index on product(name), but assuming there's not
> > then the above should produce the results faster. This of course works ok
> > for SUM(), but for something like AVG() or STDDEV() the combine/merge
> > aggregate functions would be required to process those intermediate
> > aggregate results that were produced by the sub-query.
> 
> Any chance you could implement that in the planner?
>
It likely needs planner enhancement prior to other applications...
http://www.postgresql.org/message-id/ca+tgmobgwkhfzc09b+s2lxjtword5ht-avovdvaq4+rpwro...@mail.gmail.com

Once planner allowed to have both of normal path and partial aggregation
paths to compare according to the cost, it is the straightforward way to
do.

Here are various academic research, for example, below is the good starting
point to clarify aggregate queries that we can run with 2-phase approach.
http://www.researchgate.net/publication/2715288_Performing_Group-By_before_Join

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CustomScan and readfuncs.c

2015-07-26 Thread Kouhei Kaigai
> Kouhei Kaigai  writes:
> > Under the investigation of ParallelAppend, I noticed here is a few
> > problems in CustomScan, that prevents to reproduce an equivalent
> > plan node on the background worker from serialized string.
> 
> > 1. CustomScanMethods->TextOutCustomScan callback
> > 
> > This callback allows to output custom information on nodeToString.
> > Originally, we intend to use this callback for debug only, because
> > CustomScan must be copyObject() safe, thus, all the private data
> > also must be stored in custom_exprs or custom_private.
> > However, it will lead another problem when we try to reproduce
> > CustomScan node from the string form generated by outfuncs.c.
> > If TextOutCustomScan prints something, upcoming _readCustomScan
> > has to deal with unexpected number of tokens in unexpected format.
> 
> Um ... wait a second.  There is no support in readfuncs for any
> plan node type, and never has been, and I seriously doubt that there
> ever should be.  I do not think it makes sense to ship plans around
> in the way you seem to have in mind.  (Also, I don't think the
> problems you mention are exactly unique to CustomScan.  There's no
> reason to assume that FDW plans could survive this treatment either,
> since we do not know what's in the fdw_private stuff; certainly no
> one has ever suggested that it should not contain pointers to static
> data.)
>
Yep, no Plan node types are supported at this moment; however, they will
appear soon with the Funnel + PartialSeqScan nodes.
That serializes a partial plan subtree using nodeToString() and then gives
the flattened PlannedStmt to background workers.
I'm now investigating applying the same structure to the Append node, to
kick its child nodes in parallel.
Once various plan node types appear in readfuncs.c, we have to care
about this problem, don't we? I'm working toward a patch submission
of ParallelAppend for the next commit-fest, so I'd like to reach a
consensus on how to treat this matter.
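Conceptually the flow is as follows; nodeToString()/stringToNode() are the
existing entry points, and the missing piece is the _read support for plan
nodes:

    /* leader: flatten the partial plan subtree */
    char       *plan_str = nodeToString(partial_plan);

    /* worker: rebuild it -- this is what needs _readCustomScan etc. */
    Plan       *plan = (Plan *) stringToNode(plan_str);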

> > I'd like to propose to omit this callback prior to v9.5 release,
> > for least compatibility issues.
> 
> I regard our commitment to cross-version compatibility for the
> custom scan APIs as being essentially zero, for reasons previously
> discussed.  So if this goes away in 9.6 it will not matter, but we
> might as well leave it in for now for debug support.
>
I don't argue this point strongly. If TextOutCustomScan is to be
obsoleted in v9.6, it is simply a kindness to developers to steer them
away from using this callback.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failing assertions in indxpath.c, placeholder.c and brin_minmax.c

2015-07-26 Thread Peter Geoghegan
On Sun, Jul 26, 2015 at 7:07 AM, Tom Lane  wrote:
> Andreas Seltenreich  writes:
>> when running my random query generator contraption[1] against the
>> regression database of 9.5 or master, it occasionally triggers one of
>> the following three assertions.
>
> Very very cool tool!  Please keep doing that testing.

The SQLite people have been using a tool like this for some time.
They've also had luck finding bugs with a generic fuzz-testing tool
called "american fuzzy lop" (yes, seriously, that's what it's called),
which apparently is the state of the art.

I myself ran that tool against Postgres. I didn't spend enough time to
tweak it in a way that might have been effective. I also didn't figure
out a way to make iterations fast enough for the tool to be effective,
because I was invoking Postgres in single-user mode. I might pick it
up again in the future, but probably for a more targeted case.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Failing assertions in indxpath.c, placeholder.c and brin_minmax.c

2015-07-26 Thread Tom Lane
Andreas Seltenreich  writes:
> when running my random query generator contraption[1] against the
> regression database of 9.5 or master, it occasionally triggers one of
> the following three assertions.

I've fixed the first two of these --- thanks for the report!

> ,[ git bisect ]
> |   first bad commit: [3f8c23c4d31d4a0e801041733deb2c7cfa577b32] Improve
> |   predtest.c's ability to reason about operator expressions.
> `

I'm a bit confused about this aspect of your report though, because in
my hands that example fails clear back to 9.2.  It doesn't seem to require
the predtest.c improvement to expose the fault.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] False comment about speculative insertion

2015-07-26 Thread Peter Geoghegan
Attached patch removes a reference to an executor README section about
speculative insertion. In fact, the high-level overview of speculative
insertion ended up at the top of execIndexing.c. The executor README
was not touched by the ON CONFLICT patch at all.

I don't think it's necessary to refer to execIndexing.c within
ExecInsert instead. All the routines being called from ExecInsert()
that relate to speculative insertion are in execIndexing.c anyway.

-- 
Peter Geoghegan
From 5ea69e5f98a4eeb4c9f6ffc8f161e3e16f0cda86 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Sun, 26 Jul 2015 12:28:39 -0700
Subject: [PATCH] Remove false comment about speculative insertion

There never was an executor README section that discussed speculative
insertion in the original ON CONFLICT commit.
---
 src/backend/executor/nodeModifyTable.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 874ca6a..1ef76d0 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -351,8 +351,7 @@ ExecInsert(ModifyTableState *mtstate,
 			 *
 			 * We loop back here if we find a conflict below, either during
 			 * the pre-check, or when we re-check after inserting the tuple
-			 * speculatively.  See the executor README for a full discussion
-			 * of speculative insertion.
+			 * speculatively.
 			 */
 	vlock:
 			specConflict = false;
-- 
1.9.1


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MultiXact member wraparound protections are now enabled

2015-07-26 Thread Noah Misch
On Fri, Jul 24, 2015 at 09:14:09PM -0400, Peter Eisentraut wrote:
> On 7/22/15 4:45 PM, Robert Haas wrote:
> > But it seemed to me that this could be rather confusing.  I thought it
> > would be better to be explicit about whether the protections are
> > enabled in all cases.  That way, (1) if you see the message saying
> > they are enabled, they are enabled; (2) if you see the message saying
> > they are disabled, they are disabled; and (3) if you see neither
> > message, your version does not have those protections.
> 
> But this is not documented, AFAICT, so I don't think anyone is going to
> be able to follow that logic.  I don't see anything in the release notes
> saying, look for this message to see how this applies to you, or whatever.

I supported inclusion of the message, because it has good potential to help
experts studying historical logs to find the root cause of data corruption.
The complex histories of clusters showing corruption from this series of bugs
have brought great expense to the task of debugging new reports.  Given a
cluster having full mxact wraparound protections since last corruption-free
backup (or since initdb), one can rule out some causes.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Buildfarm failure from overly noisy warning message

2015-07-26 Thread Tom Lane
Kevin Grittner  writes:
> Tom Lane  wrote:
>> What's evidently happened here is that our session tried to boot an
>> autovacuum process off a table lock, only that process was gone by the
>> time we issued the kill() call.

> I think a LOG entry when an autovacuum process is actually canceled
> has value just in case it is happening on a particular table so
> frequently that the table starts to bloat.  I see no reason to log
> anything if there is an intention to cancel an autovacuum process
> but it actually completes before we can do so.  IMV the best
> solution would be to proceed straight to the kill() and only log
> something if it was found by kill().

Hm.  By that logic, I'm not sure if we need anything to be logged here,
because the autovacuum process will log something about having received
a query cancel signal.  Also, if we do it like that there's a race
condition: it's entirely possible (maybe even likely, on single-CPU
machines) that the autovacuum process would issue its log entry before
the sender is able to log its request.  That would probably confuse people
no end.

If we're in the business of minimizing log chatter, I'd suggest that
we remove the entirely-routine "sending cancel" log message, and only
log something in the uncommon case where the kill() fails (but, per
original point, reduce that entry to LOG or so; or else print something
only for not-ESRCH cases).

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] more RLS oversights

2015-07-26 Thread Joe Conway
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 07/03/2015 10:03 AM, Noah Misch wrote:
> (2) CreatePolicy() and AlterPolicy() omit to create a pg_shdepend
> entry for each role in the TO clause.  Test case:

Please see the attached patch. Note that I used SHARED_DEPENDENCY_ACL
for this. It seems appropriate, but possibly we should invent a new
shared dependency type for this use case? Comments?

Thanks,

Joe

-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.22 (GNU/Linux)

iQIcBAEBAgAGBQJVtSlAAAoJEDfy90M199hlrkIP/0BqMj30JpJVj3H5pIopU0RZ
pesBzFzzsWmt2QmasX9hajRfo6eXQAWPmKQmw/NDTJvHHNACxLdo0MHE7A9y7No7
aZIFTm0KhnG4jpzxfzpGqQ4ron5Vsc2TlNQgCBYCtbEEtuD0mV2qmcuTkqGte7iB
7iqneRBhmTYBy63X7S0ir4AhDi+JJg8P4F7YuU/XMcha5v5CpNJAToPpW7mCoZ8O
9w/RbXCHso7p1DIxfx4tfxVwLQ7j7G2j0rXbuA2e6OKfwZWWrk5E8Wnvc3xy3yCY
J37fcjOxFdhU/j1j+Tr90LYOSuRu5fQ4z6PmmsPkBM3T0Vllz/nNP64aVKbmHzga
szrIBERRs2icKa4OI08qZF42ObIEv0/t/NuyrXpegY6awRNNNsjyZvwM+51zdjw1
fuWh+2rObzh3RDH8lPB0N0rVfDMM+wU85Vaekckv/7UQ/pbWcwsYD/tykA1engQ8
Ww8kBAEct4dWdqppTqvLLxLgo4d+vuiS1mJ7SRY2aZFRX8QQ8TfB3eObETUsU4/X
9UWyMn+E0Au9ICX+wfD4whaBKst8EuO36qOZx0oZt+++EMOlzypgkCCxDtZO0KWd
KZHbtmJXHFVH+ihbEGXAKMIx+gLqSDG3IKy9MIJxpB3JY3XSdBNqobL2hiQy36/w
svK7i+mEYmUCQ6pB1j8c
=P1AS
-END PGP SIGNATURE-
diff --git a/src/backend/commands/policy.c b/src/backend/commands/policy.c
index 17b48d4..20ac54e 100644
*** a/src/backend/commands/policy.c
--- b/src/backend/commands/policy.c
***
*** 22,27 
--- 22,28 
  #include "catalog/indexing.h"
  #include "catalog/namespace.h"
  #include "catalog/objectaccess.h"
+ #include "catalog/pg_authid.h"
  #include "catalog/pg_policy.h"
  #include "catalog/pg_type.h"
  #include "commands/policy.h"
***
*** 48,54 
  static void RangeVarCallbackForPolicy(const RangeVar *rv,
  		  Oid relid, Oid oldrelid, void *arg);
  static char parse_policy_command(const char *cmd_name);
! static ArrayType *policy_role_list_to_array(List *roles);
  
  /*
   * Callback to RangeVarGetRelidExtended().
--- 49,55 
  static void RangeVarCallbackForPolicy(const RangeVar *rv,
  		  Oid relid, Oid oldrelid, void *arg);
  static char parse_policy_command(const char *cmd_name);
! static Datum *policy_role_list_to_array(List *roles, int *num_roles);
  
  /*
   * Callback to RangeVarGetRelidExtended().
*** parse_policy_command(const char *cmd_nam
*** 130,159 
  
  /*
   * policy_role_list_to_array
!  *	 helper function to convert a list of RoleSpecs to an array of role ids.
   */
! static ArrayType *
! policy_role_list_to_array(List *roles)
  {
! 	ArrayType  *role_ids;
! 	Datum	   *temp_array;
  	ListCell   *cell;
- 	int			num_roles;
  	int			i = 0;
  
  	/* Handle no roles being passed in as being for public */
  	if (roles == NIL)
  	{
! 		temp_array = (Datum *) palloc(sizeof(Datum));
! 		temp_array[0] = ObjectIdGetDatum(ACL_ID_PUBLIC);
  
! 		role_ids = construct_array(temp_array, 1, OIDOID, sizeof(Oid), true,
!    'i');
! 		return role_ids;
  	}
  
! 	num_roles = list_length(roles);
! 	temp_array = (Datum *) palloc(num_roles * sizeof(Datum));
  
  	foreach(cell, roles)
  	{
--- 131,158 
  
  /*
   * policy_role_list_to_array
!  *	 helper function to convert a list of RoleSpecs to an array of
!  *	 role id Datums.
   */
! static Datum *
! policy_role_list_to_array(List *roles, int *num_roles)
  {
! 	Datum	   *role_oids;
  	ListCell   *cell;
  	int			i = 0;
  
  	/* Handle no roles being passed in as being for public */
  	if (roles == NIL)
  	{
! 	   *num_roles = 1;
! 		role_oids = (Datum *) palloc(*num_roles * sizeof(Datum));
! 		role_oids[0] = ObjectIdGetDatum(ACL_ID_PUBLIC);
  
! 		return role_oids;
  	}
  
!*num_roles = list_length(roles);
! 	role_oids = (Datum *) palloc(*num_roles * sizeof(Datum));
  
  	foreach(cell, roles)
  	{
*** policy_role_list_to_array(List *roles)
*** 164,187 
  		 */
  		if (spec->roletype == ROLESPEC_PUBLIC)
  		{
! 			if (num_roles != 1)
  ereport(WARNING,
  		(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
  		 errmsg("ignoring roles specified other than public"),
  	  errhint("All roles are members of the public role.")));
! 			temp_array[0] = ObjectIdGetDatum(ACL_ID_PUBLIC);
! 			num_roles = 1;
! 			break;
  		}
  		else
! 			temp_array[i++] =
  ObjectIdGetDatum(get_rolespec_oid((Node *) spec, false));
  	}
  
! 	role_ids = construct_array(temp_array, num_roles, OIDOID, sizeof(Oid), true,
! 			   'i');
! 
! 	return role_ids;
  }
  
  /*
--- 163,187 
  		 */
  		if (spec->roletype == ROLESPEC_PUBLIC)
  		{
! 			Datum   *tmp_role_oids;
! 
! 			if (*num_roles != 1)
  ereport(WARNING,
  		(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
  		 errmsg("ignoring roles specified other than public"),
  	  errhint("All roles are members of the public role.")));
! 		   *num_roles = 1;
! 			tmp_role_oids = (Datum *) palloc(*num_roles * sizeof(Datum));
! 			tmp_role_oids[0] = ObjectIdGetDatum(ACL_ID_PUBLIC);
! 
! 			return tmp_role_oids;

Re: [HACKERS] security labels on databases are bad for dump & restore

2015-07-26 Thread Noah Misch
On Thu, Jul 23, 2015 at 12:14:14PM -0400, Robert Haas wrote:
> On Wed, Jul 22, 2015 at 3:42 PM, Adam Brightwell 
>  wrote:
> >> I like Noah's proposal of having pg_dump --create reproduce all
> >> database-level state.
> >
> > Should it be enabled by default?  If so, then wouldn't it make more
> > sense to call it --no-create and do the opposite?  So, --no-create
> > would exclude rather than include database-level information?  Would
> > enabling it by default cause issues with the current expected use of
> > the tool by end users?
> 
> This seems a bit hairy to me.  If we want to transfer responsibility
> for dumping this stuff from pg_dumpall to pg_dump, I have no problem
> with that at all.  But doing it only when --create is specified seems
> odd.  Then, does pg_dumpall -g dump it or not?

The principle I had in mind was to dump ACLs, pg_db_role_setting entries,
comments and security labels if and only if we emit a CREATE statement for the
object they modify.  That is already the rule for objects located inside
databases.  Since "pg_dumpall -g" does not emit CREATE DATABASE statements[1],
it would not dump these attributes of databases.

> If it does, then we're
> sorta dumping it in two places when --create is used.  If it doesn't,
> then when --create is not used we're doing it nowhere.

Yep.  Plain "pg_dump" dumps the contents of a database without dumping the
database itself.  I don't like that as a default, but we're stuck with it.


[1] "pg_dumpall -g --binary-upgrade" _does_ emit CREATE DATABASE statements,
so _it_ would dump these attributes.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Buildfarm failure from overly noisy warning message

2015-07-26 Thread Kevin Grittner
Tom Lane  wrote:

> + WARNING:  could not send signal to process 30123: No such process

> What's evidently happened here is that our session tried to boot an
> autovacuum process off a table lock, only that process was gone by the
> time we issued the kill() call.

> I'm inclined to reduce the WARNING to LOG, and/or skip it altogether
> if the error is ESRCH.

> One could also argue that both of these ereports ought to be downgraded to 
> DEBUG1
> or less, since this mechanism is pretty well shaken out by now.

> Thoughts?

I think a LOG entry when an autovacuum process is actually canceled
has value just in case it is happening on a particular table so
frequently that the table starts to bloat.  I see no reason to log
anything if there is an intention to cancel an autovacuum process
but it actually completes before we can do so.  IMV the best
solution would be to proceed straight to the kill() and only log
something if it was found by kill().
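In other words, something along these lines (sketch only):

    /* signal first; log only on a real outcome */
    if (kill(pid, SIGINT) == 0)
        ereport(LOG,
                (errmsg("sent cancel to blocking autovacuum PID %d", pid)));
    else if (errno != ESRCH)    /* an already-gone process isn't worth noise */
        ereport(LOG,
                (errmsg("could not send signal to process %d: %m", pid)));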

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Combining Aggregates

2015-07-26 Thread Heikki Linnakangas

On 04/01/2015 06:28 PM, Robert Haas wrote:

On Mon, Mar 30, 2015 at 1:28 AM, Michael Paquier
 wrote:

I've been thinking of bumping this patch to the June commitfest as the
patch only exists to provide the basic infrastructure for things like
parallel aggregation, aggregate before join, and perhaps auto updating
materialised views.

It seems unlikely that any of those things will happen for 9.5.


Yeah, I guess so...


Does anybody object to me moving this to June's commitfest?


Not from my side FWIW. I think it actually makes sense.


+1.  I'd like to devote some time to looking at this, but I don't have
the time right now.  The chances that we can do something useful with
it in 9.6 seem good.


And the June commitfest is now in progress.

This patch seems sane to me, as far as it goes. However, there's no 
planner or executor code to use the aggregate combining for anything. 
I'm not a big fan of dead code; I'd really like to see something that 
uses this.


The main use case people have been talking about is parallel query, but 
is there some other case this would be useful right now, without the 
parallel query feature? You and Simon talked about this case:



2. Queries such as:

SELECT p.name, SUM(s.qty) FROM sales s
  INNER JOIN product p ON s.product_id = p.product_id GROUP BY p.name;

Such a query could be transformed into:

SELECT p.name, SUM(qty) FROM
  (SELECT product_id, SUM(qty) AS qty FROM sales GROUP BY product_id) s
  INNER JOIN product p ON p.product_id = s.product_id GROUP BY p.name;

Of course the outer query's SUM and GROUP BY would not be required if there
happened to be a UNIQUE index on product(name), but assuming there's not
then the above should produce the results faster. This of course works ok
for SUM(), but for something like AVG() or STDDEV() the combine/merge
aggregate functions would be required to process those intermediate
aggregate results that were produced by the sub-query.


Any chance you could implement that in the planner?

- Heikki





Re: [HACKERS] Buildfarm failure from overly noisy warning message

2015-07-26 Thread Tom Lane
Andrew Dunstan  writes:
> On 07/26/2015 11:00 AM, Andres Freund wrote:
>> On 2015-07-26 10:56:05 -0400, Tom Lane wrote:
>>> I'm inclined to reduce the WARNING to LOG

>> Hm, that doesn't seem like a very nice solution, given that LOG is even
>> more likely to end up in the server log.

>>> and/or skip it altogether if the error is ESRCH.

>> Combined with a comment why we're ignoring that case that'd work for me.

> +1 for this. if the process is gone we shouldn't really have a problem, 
> should we?

The only real reason for printing anything here is the possibility that
the shared lock table is sufficiently corrupted that we think there's
an autovacuum process blocking us when there isn't.  However, I don't
see any particular reason to think that this is more probable than any
other bad consequence of corrupted shmem, so I don't feel a need to be
in the user's face with a WARNING.  The evidence so far is that the
race condition is real, but I don't recall anyone ever reporting this
in a context where we'd suspect actual shmem corruption.

I do, however, think it might be reasonable to LOG the failure given
that we LOG'd the action.  Is nobody else on board with downgrading
both ereports to DEBUG?

regards, tom lane




Re: [HACKERS] anole: assorted stability problems

2015-07-26 Thread Andres Freund
On 2015-07-07 13:25:24 +0200, Andres Freund wrote:
> So, it's starting to look good. Not exactly allowing for a lot of
> confidence yet, but still:
> http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=anole&br=HEAD

Since there have not been any relevant failures since, I'm going to
remove this issue from the open items list.




Re: [HACKERS] Sharing aggregate states between different aggregate functions

2015-07-26 Thread Heikki Linnakangas

On 07/09/2015 12:44 PM, David Rowley wrote:

On 15 June 2015 at 12:05, David Rowley  wrote:



This basically allows an aggregate's state to be shared between other
aggregate functions when both aggregate's transition functions (and a few
other things) match
There's quite a number of aggregates in our standard set which will
benefit from this optimisation.


After compiling the original patch with another compiler, I noticed a
couple of warnings.

The attached fixes these.


I spent some time reviewing this. I refactored the ExecInitAgg code 
rather heavily to make it more readable (IMHO); see attached. What do 
you think? Did I break anything?


Some comments:

* In aggref_has_compatible_states(), you give up if aggtype or aggcollid 
differ. But those properties apply to the final function, so you were 
leaving some money on the table by disallowing state-sharing if they differ.


* The filter and input expressions are initialized for every AggRef, 
before the deduplication logic kicks in. The AggrefExprState.aggfilter, 
aggdirectargs and args fields really belong to the AggStatePerAggState 
struct instead. This is not a new issue, but now that we have a 
convenient per-aggstate struct to put them in, let's do so.
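
For illustration, the move gives the per-aggstate struct roughly this
shape (field set abridged and illustrative, not the patch's exact
layout):

    typedef struct AggStatePerAggStateData
    {
        Aggref     *aggref;         /* representative Aggref */
        FmgrInfo    transfn;        /* state transition function */
        Oid         aggtranstype;   /* type of the state value */
        ExprState  *aggfilter;      /* FILTER expression, moved here */
        List       *aggdirectargs;  /* direct arguments, moved here */
        List       *args;           /* input expressions, moved here */
    } AggStatePerAggStateData;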


* There was a reference-after-free bug in aggref_has_compatible_states; 
you cannot ReleaseSysCache and then continue pointing to the struct.
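
The safe pattern is to copy whatever is needed out of the syscache tuple
before releasing it, along these lines (variable handling illustrative,
not the committed code):

    HeapTuple   aggTuple;
    Form_pg_aggregate aggform;
    Oid         transfn_oid;
    Oid         aggtranstype;

    aggTuple = SearchSysCache1(AGGFNOID, ObjectIdGetDatum(aggref->aggfnoid));
    if (!HeapTupleIsValid(aggTuple))
        elog(ERROR, "cache lookup failed for aggregate %u", aggref->aggfnoid);
    aggform = (Form_pg_aggregate) GETSTRUCT(aggTuple);
    transfn_oid = aggform->aggtransfn;      /* copy fields out first... */
    aggtranstype = aggform->aggtranstype;
    ReleaseSysCache(aggTuple);              /* ...only then release */
    /* from here on, use only the copied values, never aggform */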


* The code was a bit fuzzy on which parts of the per-aggstate are filled 
in at what time. Some of the fields were overwritten every time a match 
was found, with the same values, so no harm was done, but I found it 
confusing. I refactored ExecInitAgg in the attached patch to clear that up.


* The API of build_aggregate_fnexprs() was a bit strange now that some 
callers use it only to fill in the final function, while others call it 
to fill in both the transition functions and the final function. I split 
it into two separate functions.


* I wonder if we should do this duplicate elimination at plan time. It's 
very fast, so I'm not worried about that right now, but you had grand 
plans to expand this so that an aggregate could optionally use one of 
many different kinds of state values. At that point, it certainly seems 
like a planning decision to decide which aggregates share state. I think 
we can leave it as it is for now, but it's something to perhaps consider 
later.



BTW, the name of the AggStatePerAggStateData struct is pretty horrible. 
The repeated "AggState" feels awkward. Now that I've stared at the patch 
for some time, it doesn't bother me anymore, but it took me quite a 
while to get over it. I'm sure it will take others a while too. And it's 
not just that struct: the comments talk about "aggregate state", which 
could be confused with "AggState" but actually means 
AggStatePerAggStateData. I don't have any great suggestions, but can you 
come up with a better naming scheme?


- Heikki

diff --git a/src/backend/executor/execQual.c b/src/backend/executor/execQual.c
index 0f911f2..fd922bd 100644
--- a/src/backend/executor/execQual.c
+++ b/src/backend/executor/execQual.c
@@ -4485,35 +4485,15 @@ ExecInitExpr(Expr *node, PlanState *parent)
 			break;
 		case T_Aggref:
 			{
-Aggref	   *aggref = (Aggref *) node;
 AggrefExprState *astate = makeNode(AggrefExprState);
 
 astate->xprstate.evalfunc = (ExprStateEvalFunc) ExecEvalAggref;
 if (parent && IsA(parent, AggState))
 {
 	AggState   *aggstate = (AggState *) parent;
-	int			naggs;
 
 	aggstate->aggs = lcons(astate, aggstate->aggs);
-	naggs = ++aggstate->numaggs;
-
-	astate->aggdirectargs = (List *) ExecInitExpr((Expr *) aggref->aggdirectargs,
-  parent);
-	astate->args = (List *) ExecInitExpr((Expr *) aggref->args,
-		 parent);
-	astate->aggfilter = ExecInitExpr(aggref->aggfilter,
-	 parent);
-
-	/*
-	 * Complain if the aggregate's arguments contain any
-	 * aggregates; nested agg functions are semantically
-	 * nonsensical.  (This should have been caught earlier,
-	 * but we defend against it here anyway.)
-	 */
-	if (naggs != aggstate->numaggs)
-		ereport(ERROR,
-(errcode(ERRCODE_GROUPING_ERROR),
-		errmsg("aggregate function calls cannot be nested")));
+	aggstate->numaggs++;
 }
 else
 {
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 2bf48c5..fcc3859 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -152,17 +152,28 @@
 
 
 /*
- * AggStatePerAggData - per-aggregate working state for the Agg scan
+ * AggStatePerAggStateData - per aggregate state data for the Agg scan
+ *
+ * Working state for calculating the aggregate state, using the state
+ * transition function. This struct does not store the information needed
+ * to produce the final aggregate result from the state value; that's stored
+ * in AggStatePerAggData instead. This separation allows multiple aggregate
+ * results to be produced 

Re: [HACKERS] Buildfarm failure from overly noisy warning message

2015-07-26 Thread Andrew Dunstan


On 07/26/2015 11:00 AM, Andres Freund wrote:

Hi,

On 2015-07-26 10:56:05 -0400, Tom Lane wrote:

   CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
+ WARNING:  could not send signal to process 30123: No such process
What's evidently happened here is that our session tried to boot an
autovacuum process off a table lock, only that process was gone by the
time we issued the kill() call.  No problem really ... but aside from
causing buildfarm noise, I could see somebody getting really panicky
if this message appeared on a production server.

Oops.


I'm inclined to reduce the WARNING to LOG

Hm, that doesn't seem like a very nice solution, given that LOG is even
more likely to end up in the server log.


and/or skip it altogether if the error is ESRCH.

Combined with a comment why we're ignoring that case that'd work for me.




+1 for this. If the process is gone we shouldn't really have a problem, 
should we?


cheers

andrew




Re: [HACKERS] GSets: Getting collation related error when GSets has text column

2015-07-26 Thread Andres Freund
Hi,

On 2015-07-17 18:55:52 +0530, Jeevan Chalke wrote:
> Attached patch which attempts to fix this issue. However I am not sure
> whether it is correct. But it does not add any new issues as such.
> Added few test in the patch as well.

Pushed the fix. Thanks for the report and fix.

- Andres




Re: [HACKERS] GSets: Fix bug involving GROUPING and HAVING together

2015-07-26 Thread Andres Freund
On 2015-07-24 11:34:22 +0100, Andrew Gierth wrote:
> > "Andrew" == Andrew Gierth  writes:
> 
>  Andrew> The other is that in subquery_planner, the optimization of
>  Andrew> converting HAVING clauses to WHERE clauses is suppressed if
>  Andrew> parse->groupingSets isn't empty. (It is empty if there's either
>  Andrew> no group by clause at all, or if there's exactly one grouping
>  Andrew> set, which must not be empty, however derived.) This is costing
>  Andrew> us some optimizations, especially in the case of an explicit
>  Andrew> GROUP BY () clause; I'll look into this.
> 
> I'm inclined to go with the simplest approach here, which copies the
> quals if there are grouping sets. The only way we could safely move a
> qual without copying is if we can show that none of the grouping sets is
> empty, that is (), and at this point the grouping set list has not been
> expanded so it's not trivial to determine that.

I pushed this to both master and 9.5. While this isn't strictly a
bugfix, it seemed better to keep the branches the same at this point.
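
The approach amounts to roughly the following in subquery_planner()'s
HAVING-clause loop (a sketch based on the description above, not the
actual committed diff):

    foreach(l, (List *) parse->havingQual)
    {
        Node       *havingclause = (Node *) lfirst(l);

        if (contain_agg_clause(havingclause) ||
            contain_volatile_functions(havingclause) ||
            contain_subplans(havingclause))
        {
            /* keep it in HAVING */
            newHaving = lappend(newHaving, havingclause);
        }
        else if (parse->groupClause && !parse->groupingSets)
        {
            /* no grouping sets, so it is safe to move it to WHERE */
            parse->jointree->quals = (Node *)
                lappend((List *) parse->jointree->quals, havingclause);
        }
        else
        {
            /* put a copy in WHERE, and also keep it in HAVING */
            parse->jointree->quals = (Node *)
                lappend((List *) parse->jointree->quals,
                        copyObject(havingclause));
            newHaving = lappend(newHaving, havingclause);
        }
    }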




Re: [HACKERS] CustomScan and readfuncs.c

2015-07-26 Thread Tom Lane
Kouhei Kaigai  writes:
> Under the investigation of ParallelAppend, I noticed a few problems
> in CustomScan that prevent reproducing an equivalent plan node on
> a background worker from its serialized string.

> 1. CustomScanMethods->TextOutCustomScan callback
> 
> This callback allows custom information to be output by nodeToString.
> Originally, we intended this callback for debugging only, because
> CustomScan must be copyObject() safe; thus, all the private data
> must be stored in custom_exprs or custom_private.
> However, it leads to another problem when we try to reproduce a
> CustomScan node from the string form generated by outfuncs.c.
> If TextOutCustomScan prints something, the upcoming _readCustomScan
> has to deal with an unexpected number of tokens in an unexpected format.

Um ... wait a second.  There is no support in readfuncs for any
plan node type, and never has been, and I seriously doubt that there
ever should be.  I do not think it makes sense to ship plans around
in the way you seem to have in mind.  (Also, I don't think the
problems you mention are exactly unique to CustomScan.  There's no
reason to assume that FDW plans could survive this treatment either,
since we do not know what's in the fdw_private stuff; certainly no
one has ever suggested that it should not contain pointers to static
data.)

> I'd like to propose omitting this callback prior to the v9.5 release,
> for the fewest compatibility issues.

I regard our commitment to cross-version compatibility for the
custom scan APIs as being essentially zero, for reasons previously
discussed.  So if this goes away in 9.6 it will not matter, but we
might as well leave it in for now for debug support.

regards, tom lane




Re: [HACKERS] GSets: Fix bug involving GROUPING and HAVING together

2015-07-26 Thread Andres Freund
On 2015-07-14 14:51:09 +0530, Jeevan Chalke wrote:
> Fix this by adding GroupingFunc node in this walker.  We do it correctly in
> contain_aggs_of_level_walker() in which we have handling for GroupingFunc
> there.
> 
> Attached patch to fix this.

Pushed, thanks for the fix!




Re: [HACKERS] Grouping Sets: Fix unrecognized node type bug

2015-07-26 Thread Andres Freund
On 2015-07-17 19:57:22 +0100, Andrew Gierth wrote:
> Attached is the current version of my fix (with Jeevan's regression
> tests plus one of mine).

Pushed, thanks for the report and fix!




Re: [HACKERS] Gsets: ROW expression semantic broken between 9.4 and 9.5

2015-07-26 Thread Andrew Gierth
> "Jeevan" == Jeevan Chalke  writes:

 Jeevan> Hi
 Jeevan> It looks like we have broken the ROW expression without
 Jeevan> explicit ROW keyword in GROUP BY.

Andres has given the short version, but here's the long version:

In the spec, GROUP BY ROW(a,b) is an error, while GROUP BY (a,b) is
exactly equivalent to GROUP BY a,b.

Previously, pg treated GROUP BY (a,b) as if it were GROUP BY ROW(a,b)
since it was parsing it as an expression, and (a,b) in an expression is
shorthand for ROW(a,b).  However, the parens are significant in many
contexts in the grouping set syntax, e.g. ROLLUP(a,(b,c)) is equivalent
to GROUPING SETS ((a,b,c), (a), ()), and we have to be able to parse
both GROUPING SETS (a,b) (which is two grouping sets) and GROUPING SETS
((a,b),(c,d)), which means that we can't actually use the grammar to
distinguish expressions from parenthesized sublists.  What the code
therefore does is to explicitly distinguish (a,b) and ROW(a,b), and
treat the first as a list and the second as a single expression.

This is documented in the following NOTE in queries.sgml:

  
   
The construct (a,b) is normally recognized in expressions as
a row constructor.
Within the GROUP BY clause, this does not apply at the top
levels of expressions, and (a,b) is parsed as a list of
expressions as described above.  If for some reason you need
a row constructor in a grouping expression, use ROW(a,b).
   
  

http://www.postgresql.org/docs/9.5/static/queries-table-expressions.html#QUERIES-GROUPING-SETS

Andres has suggested that this should be mentioned in an incompatibility
note in the release notes. I'm not sure that's needed, since I don't
believe there are any cases where previously valid queries produce
different results; a query such as

select (a,b) from (values (1,2)) v(a,b) group by (a,b);

previously evaluated the row constructor before grouping, while now it
groups by a and b separately and evaluates the row constructor
afterwards.  If there's a way to make this change affect the result,
I've not found it yet, even when using volatile functions.

-- 
Andrew (irc:RhodiumToad)




Re: [HACKERS] checkpointer continuous flushing

2015-07-26 Thread Fabien COELHO


Hello,

Attached is a very minor v5 update which does a rebase & completes the 
cleanup of doing a full sort instead of a chunked sort.


Attached is an updated version of the patch which turns the sort option 
into a boolean, and also includes the sort time in the checkpoint log.


There is still an open question about whether the sorting buffer allocation 
is lost on some signals and should be reallocated in such an event.


In that case, the allocation should probably be managed from 
CheckpointerMain, and the lazy allocation could remain for other callers (I 
guess just "initdb").



More open questions:

- best name for the flush option (checkpoint_flush_to_disk,
checkpoint_flush_on_write, checkpoint_flush, ...)

- best name for the sort option (checkpoint_sort,
checkpoint_sort_buffers, checkpoint_sort_ios, ...)


Other nice-to-have inputs:

- tests on a non-linux system with posix_fadvise
  (FreeBSD? others?)

- tests on a large dedicated box


Attached are some scripts to help with testing, if someone feels like it:

- cp_test.sh: run some tests, to adapt to one's setup...

- cp_test_count.pl: show percent of late transactions

- avg.py: show stats about stuff

  sh> grep 'progress: ' OUTPUT_FILE | cut -d' ' -f4 | avg.py

  *BEWARE* that if pgbench got stuck some "0" data points are missing;
  look at the actual tps in the output file and at the line
  count to check whether that is the case... a currently submitted
  pgbench patch helps, see https://commitfest.postgresql.org/5/199/


As this pgbench patch is now in master, pgbench is less likely to get 
stuck, but check nevertheless that the number of progress lines matches 
the expected count.


--
Fabien.

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index bbe1eb0..0257e34 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2483,6 +2483,28 @@ include_dir 'conf.d'
   
  
 
+ 
+  checkpoint_sort (bool)
+  
+   checkpoint_sort configuration parameter
+  
+  
+  
+   
+Whether to sort buffers before writing them out to disk on checkpoint.
+For HDD storage, this setting allows grouping together
+neighboring pages written to disk, thus improving performance by
+reducing random write activity.
+This sorting should have limited performance effects on SSD backends
+as such storage has good random write performance, but it may
+help with wear-leveling, so it may be worth keeping anyway.
+The default is on.
+This parameter can only be set in the postgresql.conf
+file or on the server command line.
+   
+  
+ 
+
  
   checkpoint_warning (integer)
   
@@ -2504,6 +2526,24 @@ include_dir 'conf.d'
   
  
 
+ 
+  checkpoint_flush_to_disk (bool)
+  
+   checkpoint_flush_to_disk configuration parameter
+  
+  
+  
+   
+When writing data for a checkpoint, hint the underlying OS that the
+data must be sent to disk as soon as possible.  This may help smooth
+disk I/O writes and avoid a stall when fsync is issued at the end of
+the checkpoint, but it may also reduce average performance.
+This setting may have no effect on some platforms.
+The default is off.
+   
+  
+ 
+
  
   min_wal_size (integer)
   
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index e3941c9..eea6668 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -546,6 +546,29 @@
   
 
   
+   When hard-disk drives (HDD) are used for the underlying data storage,
+   checkpoint_sort allows pages to be sorted
+   so that neighboring pages on disk will be flushed together by
+   checkpoints, reducing the random write load and improving performance.
+   If solid-state drives (SSD) are used, sorting pages brings no benefit
+   as their random write I/O performance is good: this feature could then
+   be disabled by setting checkpoint_sort to off.
+   It is possible that sorting may help with SSD wear leveling, so it may
+   be kept on that account.
+  
+
+  
+   On Linux and POSIX platforms, checkpoint_flush_to_disk
+   allows hinting the OS that pages written on checkpoints must be flushed
+   to disk quickly.  Otherwise, these pages may be kept in cache for some time,
+   inducing a stall later when fsync is called to actually
+   complete the checkpoint.  This setting helps to reduce transaction latency,
+   but it may also have a small adverse effect on the average transaction rate
+   at maximum throughput.  This feature probably brings no benefit on SSD,
+   as the I/O write latency is small on such hardware, thus it may be disabled.
+  
+
+  
The number of WAL segment files in pg_xlog directory depends on
min_wal_size, max_wal_size and
the amount of WAL generated in previous checkpoint cycles. When old log
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index bcce3e3..f565dc4 1

Re: [HACKERS] Buildfarm failure from overly noisy warning message

2015-07-26 Thread Andres Freund
Hi,

On 2015-07-26 10:56:05 -0400, Tom Lane wrote:
>   CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
> + WARNING:  could not send signal to process 30123: No such process

> What's evidently happened here is that our session tried to boot an
> autovacuum process off a table lock, only that process was gone by the
> time we issued the kill() call.  No problem really ... but aside from
> causing buildfarm noise, I could see somebody getting really panicky
> if this message appeared on a production server.

Oops.

> I'm inclined to reduce the WARNING to LOG

Hm, that doesn't seem like a very nice solution, given that LOG is even
more likely to end up in the server log.

> and/or skip it altogether if the error is ESRCH.

Combined with a comment why we're ignoring that case that'd work for me.


- Andres




Re: [HACKERS] A little RLS oversight?

2015-07-26 Thread Joe Conway
On 07/26/2015 07:19 AM, Dean Rasheed wrote:
> I'm not convinced about exporting convert_table_name() from acl.c,
> particularly with such a non-descriptive name. It's only a couple of
> lines of code, so I think they may as well just be included directly
> in the new function, as seems to be common practice elsewhere.

fair enough

> As it stands, passing checkAsUser = GetUserId() to check_enable_rls()
> will cause it to skip the check for row_security = OFF, and so it may
> falsely report that RLS is active when the user has bypassed it. To
> avoid that row_security_active() needs to pass checkAsUser =
> InvalidOid to check_enable_rls().

> [ Actually there seem to be a few other callers of check_enable_rls()
> that suffer the same problem. I don't really understand the reasoning
> behind that check in check_enable_rls() - if the current user has the
> BYPASSRLS attribute and row_security = OFF, shouldn't RLS be disabled
> on every table no matter how it is accessed? ]


Hmmm, I can see that now but find it surprising. Seems like an odd
interface, and probably deserves some explanation in the function
comments. Maybe Stephen or someone else can weigh in on this...


> I think it would be better if the security context check were part of
> this new function too. Right now that can't make any difference, since
> only the RI triggers set SECURITY_ROW_LEVEL_DISABLED and this function
> cannot be called in that code path, but it's possible that in the
> future other code might set SECURITY_ROW_LEVEL_DISABLED, so moving the
> check down guarantees that check_enable_rls()/row_security_active()
> always accurately return the RLS status for the table.
> 
> While I was looking at it, I spotted a couple of other things to tidy
> up in existing related code:
> 
> * The comment for GetUserIdAndSecContext() needed updating for the new RLS 
> bit.
> 
> * Calling GetUserIdAndSecContext() and then throwing away the user_id
> returned seems ugly. There is already a code style precedent for
> checking the other bits of SecurityRestrictionContext, so I've added a
> similar bit-testing function for SECURITY_ROW_LEVEL_DISABLED which
> makes this code a bit neater.
> 
> Attached is an updated patch (still needs some docs for the functions).

Thanks for that. I'll add the docs.

Joe







[HACKERS] Buildfarm failure from overly noisy warning message

2015-07-26 Thread Tom Lane
chipmunk failed last night
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2015-07-26%2007%3A36%3A32

like so:

== pgsql.build/src/test/regress/regression.diffs ===
*** /home/pgbfarm/buildroot/REL9_3_STABLE/pgsql.build/src/test/regress/expected/create_index.out	Sun Jul 26 10:37:41 2015
--- /home/pgbfarm/buildroot/REL9_3_STABLE/pgsql.build/src/test/regress/results/create_index.out	Sun Jul 26 10:51:48 2015
***
*** 14,19 
--- 14,20 
  CREATE INDEX tenk1_hundred ON tenk1 USING btree(hundred int4_ops);
  CREATE INDEX tenk1_thous_tenthous ON tenk1 (thousand, tenthous);
  CREATE INDEX tenk2_unique1 ON tenk2 USING btree(unique1 int4_ops);
+ WARNING:  could not send signal to process 30123: No such process
  CREATE INDEX tenk2_unique2 ON tenk2 USING btree(unique2 int4_ops);
  CREATE INDEX tenk2_hundred ON tenk2 USING btree(hundred int4_ops);
  CREATE INDEX rix ON road USING btree (name text_ops);

==

What's evidently happened here is that our session tried to boot an
autovacuum process off a table lock, only that process was gone by the
time we issued the kill() call.  No problem really ... but aside from
causing buildfarm noise, I could see somebody getting really panicky
if this message appeared on a production server.

I'm inclined to reduce the WARNING to LOG, and/or skip it altogether
if the error is ESRCH.  The relevant code in ProcSleep() is:

    ereport(LOG,
            (errmsg("sending cancel to blocking autovacuum PID %d",
                    pid),
             errdetail_log("%s", logbuf.data)));

    if (kill(pid, SIGINT) < 0)
    {
        /* Just a warning to allow multiple callers */
        ereport(WARNING,
                (errmsg("could not send signal to process %d: %m",
                        pid)));
    }

so logging failures at LOG level seems pretty reasonable.  One could
also argue that both of these ereports ought to be downgraded to DEBUG1
or less, since this mechanism is pretty well shaken out by now.

Thoughts?

regards, tom lane




[HACKERS] CustomScan and readfuncs.c

2015-07-26 Thread Kouhei Kaigai
Hello,

Under the investigation of ParallelAppend, I noticed a few problems
in CustomScan that prevent reproducing an equivalent plan node on
a background worker from its serialized string.

1. CustomScanMethods->TextOutCustomScan callback

This callback allows custom information to be output by nodeToString.
Originally, we intended this callback for debugging only, because
CustomScan must be copyObject() safe; thus, all the private data
must be stored in custom_exprs or custom_private.
However, it leads to another problem when we try to reproduce a
CustomScan node from the string form generated by outfuncs.c.
If TextOutCustomScan prints something, the upcoming _readCustomScan
has to deal with an unexpected number of tokens in an unexpected format.
I'd like to propose omitting this callback prior to the v9.5 release,
for the fewest compatibility issues.

2. Reproduce method table on background worker
--
The method field of CustomPath/Scan/ScanState is expected to be
a reference to a static structure. Thus, copyObject() does not
copy the entire table, but only pointers.
However, we have no way to guarantee that the callback functions have
the same entry-point addresses on background workers. So we may need
an infrastructure to reproduce the same CustomScan node with the same
callback function table, probably identified by name.
We may have a few ways to solve the problem.

* Add a system catalog, with a function returning pointers
The simplest way, like FDWs: a system catalog stores a name and a function
that returns the callback pointers. It also needs SQL statement support,
which is a small downside.

* Register by name, during shared_preload_libraries only
Like an early version of the CustomScan interface, this requires custom-
scan providers to register a pair of name and callbacks, but only
while shared_preload_libraries is processed, to guarantee the callbacks
are registered in the background workers as well.
(Is this assumption right on windows?)
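
A sketch of what such name-based registration could look like (function
names hypothetical, not an agreed API):

    static List *registered_methods = NIL;  /* hypothetical registry */

    void
    RegisterCustomScanMethods(const CustomScanMethods *methods)
    {
        if (!process_shared_preload_libraries_in_progress)
            elog(ERROR, "custom scan providers must be registered "
                 "while shared_preload_libraries is processed");
        registered_methods = lappend(registered_methods, (void *) methods);
    }

    const CustomScanMethods *
    GetCustomScanMethods(const char *name)
    {
        ListCell   *lc;

        foreach(lc, registered_methods)
        {
            const CustomScanMethods *methods = lfirst(lc);

            if (strcmp(methods->CustomName, name) == 0)
                return methods;
        }
        elog(ERROR, "custom scan provider \"%s\" was not registered", name);
        return NULL;            /* keep compiler quiet */
    }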

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 






Re: [HACKERS] A little RLS oversight?

2015-07-26 Thread Dean Rasheed
On 25 July 2015 at 19:12, Joe Conway  wrote:
> On 07/22/2015 02:17 PM, Dean Rasheed wrote:
>> Hmm, I think it probably ought to do more, based on whether or not RLS
>> is being bypassed or in force-mode -- see the first few checks in
>> get_row_security_policies(). Perhaps a new SQL-callable function
>> exposing those checks and calling check_enable_rls(). It's probably
>> still worth including the "c.relrowsecurity = false" check in SQL to
>> save calling the function for the majority of tables that don't have
>> RLS.
>
> Please see the attached patch and let me know what you think. I believe
> the only thing lacking is documentation for the two new user visible
> functions. Comments?
>

I'm not convinced about exporting convert_table_name() from acl.c,
particularly with such a non-descriptive name. It's only a couple of
lines of code, so I think they may as well just be included directly
in the new function, as seems to be common practice elsewhere.

As it stands, passing checkAsUser = GetUserId() to check_enable_rls()
will cause it to skip the check for row_security = OFF, and so it may
falsely report that RLS is active when the user has bypassed it. To
avoid that row_security_active() needs to pass checkAsUser =
InvalidOid to check_enable_rls().
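
In code, that call would look something like this (a sketch of the shape
being described, not necessarily the patch's exact function):

    Datum
    row_security_active(PG_FUNCTION_ARGS)
    {
        Oid         relid = PG_GETARG_OID(0);
        int         rls_status;

        /* InvalidOid: honor the current user's BYPASSRLS and row_security */
        rls_status = check_enable_rls(relid, InvalidOid, true);
        PG_RETURN_BOOL(rls_status == RLS_ENABLED);
    }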

[ Actually there seem to be a few other callers of check_enable_rls()
that suffer the same problem. I don't really understand the reasoning
behind that check in check_enable_rls() - if the current user has the
BYPASSRLS attribute and row_security = OFF, shouldn't RLS be disabled
on every table no matter how it is accessed? ]

I think it would be better if the security context check were part of
this new function too. Right now that can't make any difference, since
only the RI triggers set SECURITY_ROW_LEVEL_DISABLED and this function
cannot be called in that code path, but it's possible that in the
future other code might set SECURITY_ROW_LEVEL_DISABLED, so moving the
check down guarantees that check_enable_rls()/row_security_active()
always accurately return the RLS status for the table.

While I was looking at it, I spotted a couple of other things to tidy
up in existing related code:

* The comment for GetUserIdAndSecContext() needed updating for the new RLS bit.

* Calling GetUserIdAndSecContext() and then throwing away the user_id
returned seems ugly. There is already a code style precedent for
checking the other bits of SecurityRestrictionContext, so I've added a
similar bit-testing function for SECURITY_ROW_LEVEL_DISABLED which
makes this code a bit neater.

Attached is an updated patch (still needs some docs for the functions).

Regards,
Dean


rls-pg-stats.v2.patch
Description: Binary data



Re: [HACKERS] Grouping Sets: Fix unrecognized node type bug

2015-07-26 Thread Andres Freund
On 2015-07-17 11:37:26 +0530, Jeevan Chalke wrote:
> However I wonder why we are supporting GROUPING SETS inside GROUPING SETS.
> On Oracle, it throws an error.
> We are not trying to be Oracle compatible; I am just curious to know.

The SQL specification seems to be pretty unambiguous about supporting
nested grouping set specifications. Check 7.9 <group by clause>: a
<grouping set> (nested inside a <grouping set list>) can contain a
<grouping sets specification>.

Regards,

Andres




Re: [HACKERS] Failing assertions in indxpath.c, placeholder.c and brin_minmax.c

2015-07-26 Thread Tom Lane
Andreas Seltenreich  writes:
> when running my random query generator contraption[1] against the
> regression database of 9.5 or master, it occasionally triggers one of
> the following three assertions.

Very very cool tool!  Please keep doing that testing.

The first two seem to be planner problems, so I'll take responsibility for
digging into those.  But the third appears to be plain old brain fade in
the BRIN code.  It can be reproduced by

regression=# explain select * from brintest where int4col = NULL::integer::information_schema.cardinal_number;
                                          QUERY PLAN
----------------------------------------------------------------------------------------------
 Bitmap Heap Scan on brintest  (cost=52.01..56.02 rows=1 width=339)
   Recheck Cond: (int4col = ((NULL::integer)::information_schema.cardinal_number)::integer)
   ->  Bitmap Index Scan on brinidx  (cost=0.00..52.01 rows=1 width=0)
         Index Cond: (int4col = ((NULL::integer)::information_schema.cardinal_number)::integer)
(4 rows)

regression=# select * from brintest where int4col = NULL::integer::information_schema.cardinal_number;
server closed the connection unexpectedly

or you can do it like this:

regression=# select * from brintest where int4col = (select NULL::integer);
server closed the connection unexpectedly

or you could do it with a join to a table containing some null values.

You need some complication because if you just write a plain null literal:

regression=# explain select * from brintest where int4col = NULL::integer; 
QUERY PLAN
--
 Result  (cost=0.00..106.30 rows=1 width=339)
   One-Time Filter: NULL::boolean
   ->  Seq Scan on brintest  (cost=0.00..106.30 rows=1 width=339)
(3 rows)

the planner knows that int4eq is strict so it reduces the WHERE clause
to constant NULL and doesn't bother with an indexscan.

Bottom line is that somebody failed to consider the possibility of a
null comparison value reaching the BRIN index lookup machinery.
The code stanza that's failing supposes that only IS NULL or IS NOT NULL
tests could have SK_ISNULL set, but that's just wrong.
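
The needed defense in the consistent function would be roughly along
these lines (a sketch assuming the indexable operators are strict; not
necessarily the committed fix):

    if (key->sk_flags & SK_ISNULL)
    {
        /* IS NULL / IS NOT NULL tests, as before */
        if (key->sk_flags & SK_SEARCHNULL)
            PG_RETURN_BOOL(column->bv_allnulls || column->bv_hasnulls);
        if (key->sk_flags & SK_SEARCHNOTNULL)
            PG_RETURN_BOOL(!column->bv_allnulls);

        /*
         * Neither IS NULL nor IS NOT NULL was used; assume all indexable
         * operators are strict, so a null comparison value matches nothing.
         */
        PG_RETURN_BOOL(false);
    }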

regards, tom lane




Re: [HACKERS] Failing assertions in indxpath.c, placeholder.c and brin_minmax.c

2015-07-26 Thread Andreas Seltenreich
Michael Paquier writes:

>> Footnotes:
>> [1]  https://github.com/anse1/sqlsmith
>
> This is really interesting stuff. I think that it would be possible to
> extract self-contained test cases from your tool and those queries to
> reproduce the failures. It is written that this tool connects to a
> database to retrieve the schema; what is it exactly in the case of
> those failures?

I used the database "regression" that pg_regress leaves behind when you
remove the --temp-install from its default invocation through make
check.  Sorry about not being explicit about that.

So, dropping one of the queries into src/test/regress/sql/smith.sql and
invoking

make check EXTRA_TESTS=smith

was all that was needed to integrate them.  I was then able to perform
"git bisect run" on this command.  Er, plus consing the expected output
file.

I'm using the regression db a lot when hacking on sqlsmith, as it
contains many more nasty things than your average database.

regards
andreas




Re: [HACKERS] Failing assertions in indxpath.c, placeholder.c and brin_minmax.c

2015-07-26 Thread Michael Paquier
On Sun, Jul 26, 2015 at 9:55 PM, Andreas Seltenreich  wrote:
> when running my random query generator contraption[1] against the
> regression database of 9.5 or master, it occasionally triggers one of
> the following three assertions.  Someone more knowledgeable might want
> to take a look at them...
>
> -- FailedAssertion("!(outer_rel->rows > 0)", File: "indxpath.c", Line: 1911)
> -- sample query:
> select
>   rel1925354.loid as c0,
>   rel1925353.version as c1
> from
>   (select
> rel1925352.aa as c0,
> rel1925352.aa as c1
>   from
> public.b as rel1925352
>   where (rel1925352.bb is NULL)
> and (rel1925352.bb < rel1925352.bb)) as subq_303136
>   inner join pg_catalog.pg_stat_ssl as rel1925353
>   on (subq_303136.c0 = rel1925353.pid )
> right join pg_catalog.pg_largeobject as rel1925354
> on (subq_303136.c0 = rel1925354.pageno )
> where (rel1925353.clientdn !~ rel1925353.clientdn)
>   and (rel1925353.cipher <= rel1925353.clientdn);
>
> ,[ git bisect ]
> |   first bad commit: [3f8c23c4d31d4a0e801041733deb2c7cfa577b32] Improve
> |   predtest.c's ability to reason about operator expressions.
> `
>
> -- FailedAssertion("!(!bms_is_empty(phinfo->ph_eval_at))", File: "placeholder.c", Line: 109)
> -- sample query:
> select
>   rel1600276.viewowner as c0,
>   rel1600274.maxwritten_clean as c1,
>   rel1600275.n_tup_hot_upd as c2
> from
>   pg_catalog.pg_stat_bgwriter as rel1600274
>   inner join pg_catalog.pg_stat_xact_all_tables as rel1600275
>   on (rel1600274.maxwritten_clean = rel1600275.seq_scan )
> right join pg_catalog.pg_views as rel1600276
>   right join pg_catalog.pg_operator as rel1600277
>   on (rel1600276.viewname = rel1600277.oprname )
> on (rel1600275.relname = rel1600277.oprname )
> where 3 is not NULL;
>
> ,[ git bisect ]
> |   first bad commit: [f4abd0241de20d5d6a79b84992b9e88603d44134] Support
> |   flattening of empty-FROM subqueries and one-row VALUES tables.
> `
>
> -- FailedAssertion("!(key->sk_flags & 0x0080)", File: "brin_minmax.c", Line: 177)
> -- sample query:
> select
>   rel167978.namecol as c0
> from
>   information_schema.parameters as rel167972
> left join public.student as rel167977
>   inner join public.brintest as rel167978
>   on (rel167977.age = rel167978.int4col )
> on (rel167972.interval_precision = rel167977.age )
> where rel167977.name <> rel167977.name;
> Footnotes:
> [1]  https://github.com/anse1/sqlsmith

This is really interesting stuff. I think that it would be possible to
extract self-contained test cases from your tool and those queries to
reproduce the failures. It is written that this tool connects to a
database to retrieve the schema; what is it exactly in the case of
those failures?
-- 
Michael




[HACKERS] Failing assertions in indxpath.c, placeholder.c and brin_minmax.c

2015-07-26 Thread Andreas Seltenreich
Hi,

when running my random query generator contraption[1] against the
regression database of 9.5 or master, it occasionally triggers one of
the following three assertions.  Someone more knowledgeable might want
to take a look at them...

-- FailedAssertion("!(outer_rel->rows > 0)", File: "indxpath.c", Line: 1911)
-- sample query:
select
  rel1925354.loid as c0,
  rel1925353.version as c1
from
  (select
rel1925352.aa as c0,
rel1925352.aa as c1
  from
public.b as rel1925352
  where (rel1925352.bb is NULL)
and (rel1925352.bb < rel1925352.bb)) as subq_303136
  inner join pg_catalog.pg_stat_ssl as rel1925353
  on (subq_303136.c0 = rel1925353.pid )
right join pg_catalog.pg_largeobject as rel1925354
on (subq_303136.c0 = rel1925354.pageno )
where (rel1925353.clientdn !~ rel1925353.clientdn)
  and (rel1925353.cipher <= rel1925353.clientdn);

,[ git bisect ]
|   first bad commit: [3f8c23c4d31d4a0e801041733deb2c7cfa577b32] Improve
|   predtest.c's ability to reason about operator expressions.
`

-- FailedAssertion("!(!bms_is_empty(phinfo->ph_eval_at))", File: "placeholder.c", Line: 109)
-- sample query:
select
  rel1600276.viewowner as c0,
  rel1600274.maxwritten_clean as c1,
  rel1600275.n_tup_hot_upd as c2
from
  pg_catalog.pg_stat_bgwriter as rel1600274
  inner join pg_catalog.pg_stat_xact_all_tables as rel1600275
  on (rel1600274.maxwritten_clean = rel1600275.seq_scan )
right join pg_catalog.pg_views as rel1600276
  right join pg_catalog.pg_operator as rel1600277
  on (rel1600276.viewname = rel1600277.oprname )
on (rel1600275.relname = rel1600277.oprname )
where 3 is not NULL;

,[ git bisect ]
|   first bad commit: [f4abd0241de20d5d6a79b84992b9e88603d44134] Support
|   flattening of empty-FROM subqueries and one-row VALUES tables.
`

-- FailedAssertion("!(key->sk_flags & 0x0080)", File: "brin_minmax.c", Line: 177)
-- sample query:
select
  rel167978.namecol as c0
from
  information_schema.parameters as rel167972
left join public.student as rel167977
  inner join public.brintest as rel167978
  on (rel167977.age = rel167978.int4col )
on (rel167972.interval_precision = rel167977.age )
where rel167977.name <> rel167977.name;

regards,
andreas

Footnotes: 
[1]  https://github.com/anse1/sqlsmith




Re: [HACKERS] Improving replay of XLOG_BTREE_VACUUM records

2015-07-26 Thread Andres Freund
On 2015-07-24 09:53:49 +0300, Heikki Linnakangas wrote:
> On 05/02/2015 02:10 AM, Jim Nasby wrote:
> >This looks like a good way to address this until the more significant
> >work can be done.
> >
> >I'm not a fan of "RBM_ZERO_NO_BM_VALID"; how about RBM_ZERO_BM_INVALID?
> >or BM_NOT_VALID? Or maybe I'm just trying to impose too much English on
> >the code; I see the logic to NO_BM_VALID...
> >
> >+ * RBM_ZERO_NO_BM_VALID is the same as RBM_ZERO_AND_LOCK, but does not set
> >+ * BM_VALID bit before returning buffer so that noone could pin it.
> >
> >It would be better to explain why we want that mode. How about:
> >
> >RBM_ZERO_NO_BM_VALID is the same as RBM_ZERO_AND_LOCK but does not set
> >BM_VALID before returning the buffer. This is done to ensure that no one
> >can pin the buffer without actually reading the buffer contents in. This
> >is necessary while replaying XLOG_BTREE_VACUUM records in hot standby.
>
> That's a rather strange mode. The BM_VALID flag is internal to the buffer
> manager - if you don't know how the buffer manager works, you cannot
> understand that description. I couldn't quite understand what that means
> myself. What can you or can you not do with a buffer that's not marked as
> BM_VALID? I'd suggest a mode like this instead:
>
> In RBM_IF_IN_CACHE mode, the page is returned if it is in the buffer cache
> already. If it's not, it is not read from disk, and InvalidBuffer is
> returned instead.

To me it sounds like this shouldn't go through the full ReadBuffer()
rigamarole. That code is already complex enough, and here it's really
not needed. I think it'll be much easier to review, and actually faster
in many cases, to simply have something like

bool
BufferInCache(Relation rel, ForkNumber forkNum, BlockNumber blockNum)
{
    BufferTag   newTag;         /* identity of requested block */
    uint32      newHash;        /* hash value for newTag */
    LWLock     *newPartitionLock;   /* buffer partition lock for it */
    int         buf_id;

    /* set up tag, hash, partition (filled in here for illustration,
     * cf. the start of BufferAlloc()) */
    RelationOpenSmgr(rel);
    INIT_BUFFERTAG(newTag, rel->rd_smgr->smgr_rnode.node, forkNum, blockNum);
    newHash = BufTableHashCode(&newTag);
    newPartitionLock = BufMappingPartitionLock(newHash);

    LWLockAcquire(newPartitionLock, LW_SHARED);
    buf_id = BufTableLookup(&newTag, newHash);
    LWLockRelease(newPartitionLock);
    return buf_id != -1;
}

and then fall back to the normal ReadBuffer() when it's in cache.


Andres




Re: [HACKERS] extend pgbench expressions with functions

2015-07-26 Thread Fabien COELHO


Hello Heikki,

As soon as we add more functions, the way they are documented needs to be 
reworked too; we'll need to add a table in the manual to list them.


Here is a v8 with "abs", "min", "max", "random", "gaussian" and
"exponential".

[...]

There is no real doc, WIP...


Here is a v9 with a doc. The implicit typing of expressions is improved.

I also added two debug functions which allow showing an intermediate 
integer (idebug) or double (ddebug) value.


  \set i idebug(random(1, 10))

will print the random value and assign it to i.


I updated the default scripts, which seems to speed up meta-command 
evaluations. The initial version does less than 2 million evals per 
second:


  sh> cat before.sql
  \set naccounts 10 * :scale
  \setrandom aid 1 :naccounts

  sh> ./pgbench -T 3 -P 1 -f before.sql
  [...]
  tps = 1899004.793098 (excluding connections establishing)


The updated version does more than 3 million evals per second:

  sh> cat after.sql
  \set aid random(1, 10 * :scale)

  sh> ./pgbench -T 3 -P 1 -f after.sql
  [...]
  tps = 3143168.813690 (excluding connections establishing)


Suggestion:

A possibility would be to remove the \setrandom stuff altogether, as the 
functionality is available through \set, maybe with an error message 
advising to use \set with one of the random functions. That would 
remove about 200 ugly locs...


Another option would be to translate the \setrandom stuff into a \set 
expression; that would maybe save 100 locs for the eval, but it would 
keep, and slightly expand, the "manual" parser part.


--
Fabien.

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 2517a3a..73b7caf 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -758,17 +758,20 @@ pgbench  options  dbname
   Sets variable varname to an integer value calculated
   from expression.
   The expression may contain integer constants such as 5432,
-  references to variables :variablename,
+  double constants such as 3.14156,
+  references to integer variables :variablename,
   and expressions composed of unary (-) or binary operators
   (+, -, *, /, %)
-  with their usual associativity, and parentheses.
+  with their usual associativity, integer function calls and parentheses.
+   shows the available
+  functions.
  
 
  
   Examples:
 
 \set ntellers 10 * :scale
-\set aid (1021 * :aid) % (10 * :scale) + 1
+\set aid (1021 * random(1, 10 * :scale)) % (10 * :scale) + 1
 
 

@@ -918,18 +921,110 @@ pgbench  options  dbname

   
 
+   
+   
+PgBench Functions
+
+ 
+  
+   Function
+   Return Type
+   Description
+   Example
+   Result
+  
+ 
+ 
+  
+   abs(a)
+   same as a
+   integer or double absolute value
+   abs(-17)
+   17
+  
+  
+   ddebug(x)
+   double
+   stderr print for debug and return argument
+   ddebug(5432.1)
+   5432.1
+  
+  
+   double(i)
+   double
+   evaluate as int and cast to double
+   double(5432)
+   5432.0
+  
+  
+   exporand(i, j, t)
+   integer
+   exponentially distributed random integer in the bounds, see below
+   exporand(1, 10, 3.0)
+   int between 1 and 10
+  
+  
+   idebug(i)
+   integer
+   stderr print for debug and return argument
+   idebug(5432)
+   5432
+  
+  
+   int(x)
+   integer
+   evaluate as double and cast to int
+   int(5.4 + 3.8)
+   9
+  
+  
+   gaussrand(i, j, t)
+   integer
+   gaussian distributed random integer in the bounds, see below
+   gaussrand(1, 10, 2.5)
+   int between 1 and 10
+  
+  
+   min(i, ...)
+   integer
+   minimum value
+   min(5, 4, 3, 2)
+   2
+  
+  
+   max(i, ...)
+   integer
+   maximum value
+   max(5, 4, 3, 2)
+   5
+  
+  
+   random(i, j)
+   integer
+   uniformly distributed random integer in the bounds
+   random(1, 10)
+   int between 1 and 10
+  
+  
+   sqrt(x)
+   double
+   square root
+   sqrt(2.0)
+   1.414213562
+  
+ 
+ 
+   
+
   
As an example, the full definition of the built-in TPC-B-like
transaction is:
 
 
-\set nbranches :scale
-\set ntellers 10 * :scale
-\set naccounts 10 * :scale
-\setrandom aid 1 :naccounts
-\setrandom bid 1 :nbranches
-\setrandom tid 1 :ntellers
-\setrandom delta -5000 5000
+\set aid random(1, 10 * :scale)
+\set bid random(1, 1 * :scale)
+\set tid random(1, 10 * :scale)
+\set delta random(-5000, 5000)
 BEGIN;
 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
diff --git a/src/bin/pgbench/exprparse.y b/src/bin/pgbench/exprparse.y
index e68631e..97bb559 100644
--- a/src/bin/pgbench/exprparse.y
+++

Re: [HACKERS] Restore-reliability mode

2015-07-26 Thread Noah Misch
On Thu, Jul 23, 2015 at 04:53:49PM -0300, Alvaro Herrera wrote:
> Peter Geoghegan wrote:
> > On Sat, Jun 6, 2015 at 12:58 PM, Noah Misch  wrote:
> > >   - Call VALGRIND_MAKE_MEM_NOACCESS() on a shared buffer when its local
> > > pin count falls to zero.  Under CLOBBER_FREED_MEMORY, wipe a shared buffer
> > > when its global pin count falls to zero.
> > 
> > Did a patch for this ever materialize?
> 
> I think the first part would be something like the attached.

Neat.  Does it produce any new complaints during "make installcheck"?

