Re: Issue in ExecCleanupTupleRouting()

2019-04-14 Thread David Rowley
On Fri, 12 Apr 2019 at 01:06, Etsuro Fujita  wrote:
> While working on an update-tuple-routing bug in postgres_fdw [1], I
> noticed this change to ExecCleanupTupleRouting() made by commit
> 3f2393edefa5ef2b6970a5a2fa2c7e9c55cc10cf:

Added to open items list.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Mailing list not working

2019-04-14 Thread Ramanarayana
Hi,

I am not able to access the mailing list archive. Is the mailing list
server down or something?

-- 
Cheers
Ram 4.0


Re: cache lookup failed for collation 0

2019-04-14 Thread Jeevan Chalke
On Fri, Apr 12, 2019 at 1:26 PM Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 2019-04-11 17:04, Jeevan Chalke wrote:
> > The error is coming from get_collation_isdeterministic() when colloid
> > passed is 0. I think like we do in get_collation_name(), we should
> > return false here when such collation oid does not exist.
>
> I'm not in favor of doing that.  It would risk papering over errors of
> omission at other call sites.
>
> The root cause is that the same code match_pattern_prefix() is being
> used for text and bytea, but bytea does not use collations, so having
> the collation 0 is expected, and we shouldn't call
> get_collation_isdeterministic() in that case.
>
> Proposed patch attached.
>

Looks fine to me.


>
> --
> Peter Eisentraut  http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>


-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


plpgsql - execute - cannot use a reference to record field

2019-04-14 Thread Pavel Stehule
Hi

Is there a reason why the following code should not work?

do $$
declare r record; result int;
begin
  select 10 as a, 20 as b into r;
  raise notice 'a: %', r.a;
  execute 'select $1.a + $1.b' into result using r;
  raise notice '%', result;
end;
$$

but it fails

NOTICE:  a: 10
ERROR:  could not identify column "a" in record data type
LINE 1: select $1.a + $1.b
   ^
QUERY:  select $1.a + $1.b
CONTEXT:  PL/pgSQL function inline_code_block line 6 at EXECUTE
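
For what it's worth: the record variable has an anonymous row type, so the
parser of the EXECUTE'd query only sees a parameter of type "record" and
cannot resolve its fields.  A workaround sketch using a named composite type
(the type name here is made up):

do $$
declare r two_ints; result int;   -- assumes: CREATE TYPE two_ints AS (a int, b int);
begin
  select 10, 20 into r;           -- assigns r.a = 10, r.b = 20
  execute 'select ($1).a + ($1).b' into result using r;
  raise notice '%', result;       -- prints 30
end;
$$;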

Regards

Pavel


Re: pg_dump is broken for partition tablespaces

2019-04-14 Thread Alvaro Herrera
On 2019-Apr-15, David Rowley wrote:

> To be honest, if I'd done a better job of thinking through the
> implications of this tablespace inheritance in ca4103025d, then I'd
> probably have not bothered submitting a patch for it.  We could easily
> revert that, but we'd still be left with the same behaviour in
> partitioned indexes, which is in PG11.

Well, I suppose if we do decide to revert it for tables, we should do it
for both tables and indexes.  But as I said, I'm not yet convinced that
this is the best way forward.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Adding Unix domain socket path and port to pg_stat_get_wal_senders()

2019-04-14 Thread Michael Paquier
On Fri, Apr 12, 2019 at 11:38:52PM +0900, Tatsuo Ishii wrote:
> If we were ok to add a new column to pg_stat_activity view or
> pg_stat_replication view as well, that will be great.

Okay, no objections with a separate, new, column if that's the
consensus.
--
Michael




Re: Status of the table access method work

2019-04-14 Thread Alexander Korotkov
On Wed, Apr 10, 2019 at 8:32 PM Andres Freund  wrote:
> On 2019-04-10 20:14:17 +0300, Alexander Korotkov wrote:
> > Your explanation of existing limitations looks very good and
> > convincing.  But I think there is one you didn't mention. We require
> > new table AMs to basically preserve the old "contract" between heap and
> > indexes.  We have an "all or nothing" choice during updates.  So, the storage
> > can either insert into none of the indexes or insert into all of them
> > (HOT-like update).
>
> I think that's a problem, and yea, I should have mentioned it. I'd
> earlier thought about it and then forgot.
>
> I however don't think we should design the interface for this before we
> have at least one AM that's actually in decent-ish shape that needs
> it. I seriously doubt we'll get the interface right enough.

Sure.

My point is that once we get the first table AM which needs this, say
zheap, we shouldn't make it like this

TM_Result   (*tuple_update) (Relation rel, ... bool *update_indexes,
                             bool *delete_marking);

but rather try to design proper encapsulation of the logic inside the table AM.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: jsonpath

2019-04-14 Thread Alexander Korotkov
On Tue, Apr 9, 2019 at 7:16 PM Tomas Vondra
 wrote:
>
> On Sun, Apr 07, 2019 at 03:03:58AM +0300, Alexander Korotkov wrote:
> >On Sun, Apr 7, 2019 at 2:37 AM Tom Lane  wrote:
> >> Alexander Korotkov  writes:
> >> > Thus, contents of unused function makes test fail or pass.  So far, it
> >> > looks like a compiler bug.  Any thoughts?
> >>
> >> Yeah :-(.  The fact that we've not seen a similar failure on any other
> >> machines points in that direction, too.  Maybe it's some other aspect
> >> of the machine's toolchain, like flex or bison, but that's not that
> >> much different from our standpoint.
> >>
> >> There's a lot of stuff I don't especially like about jsonpath_scan,
> >> for instance I think the "init" arguments to resizeString etc are
> >> a pretty error-prone and unmaintainable way to do things.  But
> >> I don't see anything that looks like it'd be a portability hazard
> >> that would explain this.
> >>
> >> I still have a nagging feeling that there's a wild store somewhere
> >> in here, but I don't know how to find it based on the limited
> >> evidence we've got.
> >
> >Yeah, it might not be a compiler error.  It might depend on
> >memory layout.  So the existence of an extra function changes the memory layout
> >and, in turn, causes an error.  I will try to disassemble
> >jsonpath_scan.o and see whether content of yyparse2 influences
> >assembly of other functions.
> >
>
> Have you tried other compiler version / different optimization level?

The error goes away with -O0 -- or I just didn't manage to reproduce it.
In my observation the error depends on memory layout or something, so it
might be that it really still persists with -O0 and I just didn't manage
to reproduce it.  I didn't try other compilers yet.

> Or running it under valgrind. Not sure how difficult that is on Windows.

Valgrind isn't installed there.  I'm not sure how to do that, but I
will probably try.

The interesting thing is that on failure I got the following backtrace.

#0  0x7ff94f86a458 in ntdll!RtlRaiseStatus () from
C:\WINDOWS\SYSTEM32\ntdll.dll
#1  0x7ff94f87760e in ntdll!memset () from C:\WINDOWS\SYSTEM32\ntdll.dll
#2  0x7ff94dc42e1a in msvcrt!_setjmpex () from
C:\WINDOWS\System32\msvcrt.dll
#3  0x0086a37a in pg_re_throw () at elog.c:1720
#4  0x0086a166 in errfinish (dummy=) at elog.c:464
#5  0x007c3d18 in jsonpath_yyerror (result=result@entry=0x0,
message=message@entry=0xa87d38 <__func__.110220+1512>
"unrecognized flag of LIKE_REGEX predicate") at jsonpath_scan.l:276
#6  0x007c5f3d in makeItemLikeRegex (pattern=,
pattern=, flags=, expr=0x7216760) at
jsonpath_gram.y:500
#7  jsonpath_yyparse (result=, result@entry=0x495e818)
at jsonpath_gram.y:178

So, the error happens inside the implementation of siglongjmp().  I've checked
that the contents of *PG_exception_stack didn't change since the previous
successfully thrown error.  Probably this implementation of long jump
saves some part of its state outside of sigjmp_buf, and that part is
corrupt.  Any ideas?

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Berserk Autovacuum (let's save next Mandrill)

2019-04-14 Thread Masahiko Sawada
On Mon, Apr 15, 2019 at 10:15 AM Masahiko Sawada  wrote:
>
> On Sun, Apr 14, 2019 at 4:51 AM Tomas Vondra
>  wrote:
> >
> > On Thu, Apr 11, 2019 at 11:25:29AM +0200, Chris Travers wrote:
> > >   On Wed, Apr 10, 2019 at 5:21 PM Andres Freund  
> > > wrote:
> > >
> > > Hi,
> > >
> > > On April 10, 2019 8:13:06 AM PDT, Alvaro Herrera
> > >  wrote:
> > > >On 2019-Mar-31, Darafei "Komяpa" Praliaskouski wrote:
> > > >
> > > >> Alternative point of "if your database is super large and actively
> > > >written,
> > > >> you may want to set autovacuum_freeze_max_age to even smaller 
> > > values
> > > >so
> > > >> that autovacuum load is more evenly spread over time" may be 
> > > needed.
> > > >
> > > >I don't think it's helpful to force emergency vacuuming more
> > > >frequently;
> > > >quite the contrary, it's likely to cause even more issues.  We should
> > > >tweak autovacuum to perform freezing more preemptively instead.
> > >
> > > I still think the fundamental issue with making vacuum less painful is
> > > that all indexes have to be read entirely. Even if there's not 
> > > much
> > > work (say millions of rows frozen, hundreds removed). Without that 
> > > issue
> > > we could vacuum much more frequently. And do it properly in insert 
> > > only
> > > workloads.
> > >
> > >   So I see a couple of issues here and wondering what the best approach 
> > > is.
> > >   The first is to just skip lazy_cleanup_index if no rows were removed.  
> > > Is
> > >   this the approach you have in mind?  Or is that insufficient?
> >
> > I don't think that's what Andres had in mind, as he explicitly mentioned
> > removed rows. So just skipping lazy_cleanup_index when there were no
> > rows deleted would not help in that case.
> >
> > What I think we could do is simply leave the tuple pointers in the table
> > (and indexes) when there are only very few of them, and only do the
> > expensive table/index cleanup once there's enough of them.
>
> Yeah, we now have an infrastructure that skips index vacuuming by
> leaving the tuple pointers. So we could then have a threshold for
> autovacuum to invoke index vacuuming. Another idea is to delete
> index entries more actively by index lookup instead of scanning
> the whole index. That has been proposed in [1].
>
> [1] I couldn't get the URL of the thread right now for some reason but
> the thread subject is " [WIP] [B-Tree] Retail IndexTuple deletion".

Now I got 
https://www.postgresql.org/message-id/425db134-8bba-005c-b59d-56e50de3b41e%40postgrespro.ru

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: Berserk Autovacuum (let's save next Mandrill)

2019-04-14 Thread Masahiko Sawada
On Sun, Apr 14, 2019 at 4:51 AM Tomas Vondra
 wrote:
>
> On Thu, Apr 11, 2019 at 11:25:29AM +0200, Chris Travers wrote:
> >   On Wed, Apr 10, 2019 at 5:21 PM Andres Freund  wrote:
> >
> > Hi,
> >
> > On April 10, 2019 8:13:06 AM PDT, Alvaro Herrera
> >  wrote:
> > >On 2019-Mar-31, Darafei "Komяpa" Praliaskouski wrote:
> > >
> > >> Alternative point of "if your database is super large and actively
> > >written,
> > >> you may want to set autovacuum_freeze_max_age to even smaller values
> > >so
> > >> that autovacuum load is more evenly spread over time" may be needed.
> > >
> > >I don't think it's helpful to force emergency vacuuming more
> > >frequently;
> > >quite the contrary, it's likely to cause even more issues.  We should
> > >tweak autovacuum to perform freezing more preemptively instead.
> >
> > I still think the fundamental issue with making vacuum less painful is
> > that all indexes have to be read entirely. Even if there's not much
> > work (say millions of rows frozen, hundreds removed). Without that issue
> > we could vacuum much more frequently. And do it properly in insert only
> > workloads.
> >
> >   So I see a couple of issues here and wondering what the best approach is.
> >   The first is to just skip lazy_cleanup_index if no rows were removed.  Is
> >   this the approach you have in mind?  Or is that insufficient?
>
> I don't think that's what Andres had in mind, as he explicitly mentioned
> removed rows. So just skipping lazy_cleanup_index when there were no
> rows deleted would not help in that case.
>
> What I think we could do is simply leave the tuple pointers in the table
> (and indexes) when there are only very few of them, and only do the
> expensive table/index cleanup once there's enough of them.

Yeah, we now have an infrastructure that skips index vacuuming by
leaving the tuple pointers. So we could then have a threshold for
autovacuum to invoke index vacuuming. Another idea is to delete
index entries more actively by index lookup instead of scanning
the whole index. That has been proposed in [1].

[1] I couldn't get the URL of the thread right now for some reason but
the thread subject is " [WIP] [B-Tree] Retail IndexTuple deletion".
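
As an aside, the kind of per-table tuning Darafei alludes to upthread can be
sketched like this (the table name and values are arbitrary examples, not
recommendations):

  -- Force smaller, more frequent freeze vacuums on one big, mostly-insert
  -- table instead of one huge freeze run near the 200M-transaction default.
  ALTER TABLE big_insert_only_table
      SET (autovacuum_freeze_max_age = 50000000,
           autovacuum_freeze_table_age = 40000000);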

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: New vacuum option to do only freezing

2019-04-14 Thread Masahiko Sawada
On Mon, Apr 15, 2019 at 12:47 AM Tom Lane  wrote:
>
> Robert Haas  writes:
> > On Wed, Apr 3, 2019 at 10:32 PM Masahiko Sawada  
> > wrote:
> >> Attached the updated version patch.
>
> > Committed with a little bit of documentation tweaking.
>
> topminnow just failed an assertion from this patch:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=topminnow&dt=2019-04-14%2011%3A01%3A48
>
> The symptoms are:
>
> TRAP: FailedAssertion("!((params->index_cleanup == VACOPT_TERNARY_ENABLED && 
> nleft_dead_tuples == 0 && nleft_dead_itemids == 0) || params->index_cleanup 
> == VACOPT_TERNARY_DISABLED)", File: 
> "/home/nm/farm/mipsel_deb8_gcc_32/HEAD/pgsql.build/../pgsql/src/backend/access/heap/vacuumlazy.c",
>  Line: 1404)
> ...
> 2019-04-14 14:49:16.328 CEST [15282:5] LOG:  server process (PID 18985) was 
> terminated by signal 6: Aborted
> 2019-04-14 14:49:16.328 CEST [15282:6] DETAIL:  Failed process was running: 
> autovacuum: VACUUM ANALYZE pg_catalog.pg_depend
>
> Just looking at the logic around index_cleanup, I rather think that
> that assertion is flat out wrong:
>
> +/* No dead tuples should be left if index cleanup is enabled */
> +Assert((params->index_cleanup == VACOPT_TERNARY_ENABLED &&
> +nleft_dead_tuples == 0 && nleft_dead_itemids == 0) ||
> +   params->index_cleanup == VACOPT_TERNARY_DISABLED);
>
> Either it's wrong, or this is:
>
> +/*
> + * Since this dead tuple will not be vacuumed and
> + * ignored when index cleanup is disabled we count
> + * count it for reporting.
> + */
> +if (params->index_cleanup == VACOPT_TERNARY_ENABLED)
> +nleft_dead_tuples++;
>

Ugh, I think the assertion is right but the above condition is
completely wrong. We should increment nleft_dead_tuples when index
cleanup is *not* enabled. For nleft_dead_itemids we require that index
cleanup is disabled as follows.

    {
        /*
         * Here, we have indexes but index cleanup is disabled. Instead of
         * vacuuming the dead tuples on the heap, we just forget them.
         *
         * Note that vacrelstats->dead_tuples could have tuples which
         * became dead after HOT-pruning but are not marked dead yet.
         * We do not process them because it's a very rare condition, and
         * the next vacuum will process them anyway.
         */
        Assert(params->index_cleanup == VACOPT_TERNARY_DISABLED);
        nleft_dead_itemids += vacrelstats->num_dead_tuples;
    }
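
For readers following along, this is the SQL face of the feature being
debugged: the new INDEX_CLEANUP option and its table-level counterpart.  A
usage sketch (the table in the ALTER is made up):

  -- Skip index cleanup: only prune/freeze the heap, leaving dead item
  -- pointers behind for a later full vacuum to reclaim.
  VACUUM (INDEX_CLEANUP FALSE, VERBOSE) pg_catalog.pg_depend;

  -- Or make that the default for a given table:
  ALTER TABLE some_big_table SET (vacuum_index_cleanup = false);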

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: pg_dump is broken for partition tablespaces

2019-04-14 Thread David Rowley
On Mon, 15 Apr 2019 at 05:32, Alvaro Herrera  wrote:
>
> On 2019-Apr-14, Andres Freund wrote:
>
> > On 2019-04-14 10:38:05 -0400, Tom Lane wrote:
> > > It's entirely possible BTW that this whole business of inheriting
> > > tablespace from the partitioned table is broken and should be thrown
> > > out.  I certainly don't see any compelling reason for partitions to
> > > act differently from regular tables in this respect, and the more
> > > problems we find with the idea, the less attractive it seems.
> >
> > Indeed. After discovering during the tableam work, and trying to write
> > tests for the equivalent feature for tableam, I decided that just not
> > allowing AM specifications for partitioned tables is the right call - at
> > least until the desired behaviour is clearer. The discussion of the last
> > few days makes me think so even more.
>
> To be honest, when doing that feature I neglected to pay attention to
> (read: forgot about) default_tablespace, and it's mostly the
> interactions with that feature that makes this partitioned table stuff
> so complicated.  I'm not 100% convinced yet that we need to throw it out
> completely, but I'm less sure now about it than I was before.

FWIW, I was trying to hint in [1] that this all might be more trouble
than it's worth.

To be honest, if I'd done a better job of thinking through the
implications of this tablespace inheritance in ca4103025d, then I'd
probably have not bothered submitting a patch for it.  We could easily
revert that, but we'd still be left with the same behaviour in
partitioned indexes, which is in PG11.

[1] 
https://www.postgresql.org/message-id/CAKJS1f-52x3o16fsd4=tbpkct9_e0ueg0lmzogxbqliuzsj...@mail.gmail.com
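
For reference, the inheritance behaviour being discussed (ca4103025d) can be
sketched like this; the tablespace name and location are illustrative:

  CREATE TABLESPACE ts1 LOCATION '/srv/pg/ts1';
  CREATE TABLE p (a int) PARTITION BY RANGE (a) TABLESPACE ts1;
  CREATE TABLE p1 PARTITION OF p FOR VALUES FROM (1) TO (10);

  -- p stores no data itself, but p1 lands in ts1 rather than the database
  -- default, unlike a plain CREATE TABLE without TABLESPACE:
  SELECT c.relname, t.spcname
  FROM pg_class c LEFT JOIN pg_tablespace t ON t.oid = c.reltablespace
  WHERE c.relname IN ('p', 'p1');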

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: partitioning performance tests after recent patches

2019-04-14 Thread David Rowley
On Mon, 15 Apr 2019 at 07:19, Floris Van Nee  wrote:
> 3) What could be causing the big performance difference between case 7 
> (simple SELECT) and 8 (simple SELECT with ORDER BY  LIMIT 1)? For 4096 
> partitions, TPS of 7) is around 5, while adding the ORDER BY  LIMIT 1 
> makes TPS drop well below 1. In theory, run-time pruning of the right chunk 
> should take exactly the same amount of time in both cases, because both are 
> pruning timestamp now() on the same number of partitions. The resulting plans 
> are also identical with the exception of the top LIMIT-node (in PG11 they 
> differ slightly as a MergeAppend is chosen for the ORDER BY instead of an 
> Append, in HEAD with ordered append this is not necessary anymore). Am I 
> missing something here?

With the information provided, I don't really see any reason why the
ORDER BY LIMIT would slow it down if the plan is the same apart from
the LIMIT node. Please share the EXPLAIN ANALYZE output of each.

> 4) A more general question about run-time pruning in nested loops, like the 
> one for case 14. I believe I read in one of the previous threads that 
> run-time pruning only reoccurs if it determines that the value that 
> determines which partitions must be excluded has changed in between 
> iterations. How is this defined? Eg. let's say partitions are 1-day wide and 
> the first iteration of the loop filters on the partitioned table for 
> timestamp between 14-04-2019 12:00 and 14-04-2019 20:00 (dynamically 
> determined). Then the second iteration comes along and now filters on values 
> between 14-04-2019 12:00 and 14-04-2019 19:00. The partition that should be 
> scanned hasn't changed, because both timestamps fall into the same partition. 
> Is the full process of run-time pruning applied again, or is there some kind 
> of shortcut that first checks if the previous pruning result is still valid 
> even if the value has changed slightly? If not, would this be a possible 
> optimization, as I think it's a case that occurs very often? I don't know the 
> run-time pruning code very well though, so it may just be a crazy idea that 
> can't be practically achieved.

Currently, there's no shortcut. It knows which parameters partition
pruning depends on and it reprunes whenever the value of one of these
changes.

I'm not really sure how rechecking would work exactly. There are cases
where it wouldn't be possible, say the condition was: partkey >= $1
and there was no partition for $1 since it was beyond the range of the
defined range partitions.  How could we tell if we can perform the
shortcut if the next param value falls off the lower bound of the
defined partitions?  The first would include no partitions and the
second includes all partitions, but the actual value of $1 belongs to
no partition in either case so we can't check to see if it matches the
same partition.  Perhaps it could work for equality operators when
just a single partition is matched in the first place; it might then
be possible to do a short-circuit recheck to see if the same partition
matches the next set of values.  The problem with that is that
run-time pruning code in the executor does not care about which
operators are used. It just passes those details off to the pruning
code to deal with it.  Perhaps something can be decided in the planner
in analyze_partkey_exprs() to have it set a "do_recheck" flag to tell
the executor to check before pruning again...  Or maybe it's okay to
just try a recheck when we match to just a single partition and just
recheck the new values are allowed in that partition when re-pruning.
However, that might be just too dumb, since for inequality
operators the original values may never even have fallen inside the
partition's bounds in the first place.
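
To make the nested-loop case concrete, a sketch (names are illustrative); the
partitions a given outer row prunes away show up as "(never executed)" in
EXPLAIN ANALYZE, and pruning is repeated whenever the parameter value changes:

  CREATE TABLE events (ts timestamptz, payload text) PARTITION BY RANGE (ts);
  CREATE TABLE events_20190414 PARTITION OF events
      FOR VALUES FROM ('2019-04-14') TO ('2019-04-15');
  CREATE TABLE events_20190415 PARTITION OF events
      FOR VALUES FROM ('2019-04-15') TO ('2019-04-16');
  CREATE INDEX ON events (ts);

  CREATE TABLE probe (ts timestamptz);
  INSERT INTO probe VALUES ('2019-04-14 12:00'), ('2019-04-14 19:00');

  SET enable_hashjoin = off;   -- only to coax out a parameterized nested loop
  SET enable_mergejoin = off;
  -- Both probe rows fall into the same partition, yet the inner Append is
  -- re-pruned for each outer row, because the parameter value has changed.
  EXPLAIN (ANALYZE, COSTS OFF)
  SELECT * FROM probe p JOIN events e ON e.ts = p.ts;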

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: COLLATE: Hash partition vs UPDATE

2019-04-14 Thread Tom Lane
Jesper Pedersen  writes:
> Yeah, that works here - apart from an issue with the test case; fixed in 
> the attached.

Couple issues spotted in an eyeball review of that:

* There is code that supposes that partsupfunc[] is the last
field of ColumnsHashData, eg

    fcinfo->flinfo->fn_extra =
        MemoryContextAllocZero(fcinfo->flinfo->fn_mcxt,
                               offsetof(ColumnsHashData, partsupfunc) +
                               sizeof(FmgrInfo) * nargs);

I'm a bit surprised that this patch manages to run without crashing,
because this would certainly not allocate space for partcollid[].

I think we would likely be well advised to do

-   FmgrInfo    partsupfunc[PARTITION_MAX_KEYS];
+   FmgrInfo    partsupfunc[FLEXIBLE_ARRAY_MEMBER];

to make it more obvious that that has to be the last field.  Or else
drop the cuteness with variable-size allocations of ColumnsHashData.
FmgrInfo is only 48 bytes, I'm not really sure that it's worth the
risk of bugs to "optimize" this.

* I see collation-less calls of the partsupfunc at both partbounds.c:2931
and partbounds.c:2970, but this patch touches only the first one.  How
can that be right?

regards, tom lane




Re: Should the docs have a warning about pg_stat_reset()?

2019-04-14 Thread Jeff Janes
On Wed, Apr 10, 2019 at 2:52 PM Bruce Momjian  wrote:

>
> OK, let me step back.  Why are people resetting the statistics
> regularly?  Based on that purpose, does it make sense to clear the
> stats that affect autovacuum?
>

When I've done it (not regularly, thankfully), it was usually because I
failed to type "pg_stat_reset_shared('bgwriter')" or
"pg_stat_statements_reset()" correctly.

I've also been tempted to do it because storing snapshots with a cron job
or something requires effort and planning ahead to set up the tables and
cron and some way to limit the retention, and then running LAG window
functions over the snapshots requires a re-study of the documentation,
because they are a bit esoteric and I don't use them enough to commit the
syntax to memory.  I don't want to see pg_statio_user_indexes often enough
to make elaborate arrangements ahead of time (especially since the
track_io_timing columns are missing from it); but when I do want them, I
want them.  And when I do, I don't want them to be "since the beginning of
time".
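
A minimal sketch of that snapshot-plus-LAG approach (layout and names are
illustrative, not a recommendation):

  CREATE TABLE statio_snap AS
      SELECT now() AS snap_time, indexrelid, idx_blks_read, idx_blks_hit
      FROM pg_statio_user_indexes
      WHERE false;                       -- just captures the structure

  -- run from cron (or similar) at whatever interval you care about:
  INSERT INTO statio_snap
      SELECT now(), indexrelid, idx_blks_read, idx_blks_hit
      FROM pg_statio_user_indexes;

  -- per-index deltas between consecutive snapshots:
  SELECT snap_time, indexrelid::regclass,
         idx_blks_read - lag(idx_blks_read) OVER w AS blks_read_delta,
         idx_blks_hit  - lag(idx_blks_hit)  OVER w AS blks_hit_delta
  FROM statio_snap
  WINDOW w AS (PARTITION BY indexrelid ORDER BY snap_time);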

When I'm thinking about pg_statio_user_indexes, I am probably not thinking
about autovac, since they have about nothing in common with each other.
(Other than pg_stat_reset operating on both.)

Cheers,

Jeff


Re: hyrax vs. RelationBuildPartitionDesc

2019-04-14 Thread Tom Lane
Robert Haas  writes:
> On Fri, Mar 15, 2019 at 3:45 PM Tom Lane  wrote:
>> I agree that copying data isn't great.  What I don't agree is that this
>> solution is better.

> I am not very convinced that it makes sense to lump something like
> RelationGetIndexAttrBitmap in with something like
> RelationGetPartitionDesc.  RelationGetIndexAttrBitmap is returning a
> single Bitmapset, whereas the data structure RelationGetPartitionDesc
> is vastly larger and more complex.  You can't say "well, if it's best
> to copy 32 bytes of data out of the relcache every time we need it, it
> must also be right to copy 10k or 100k of data out of the relcache
> every time we need it."

I did not say that.  What I did say is that they're both correct
solutions.  Copying large data structures is clearly a potential
performance problem, but that doesn't mean we can take correctness
shortcuts.

> If we want an at-least-somewhat unified solution here, I think we need
> to bite the bullet and make the planner hold a reference to the
> relcache throughout planning.  (The executor already keeps it open, I
> believe.) Otherwise, how is the relcache supposed to know when it can
> throw stuff away and when it can't?

The real problem with that is that we *still* won't know whether it's
okay to throw stuff away or not.  The issue with these subsidiary
data structures is exactly that we're trying to reduce the lock strength
required for changing one of them, and as soon as you make that lock
strength less than AEL, you have the problem that that value may need
to change in relcache entries despite them being pinned.  The code I'm
complaining about is trying to devise a shortcut solution to exactly
that situation ... and it's not a good shortcut.

> The only alternative seems to be to have each subsystem hold its own
> reference count, as I did with the PartitionDirectory stuff, which is
> not something we'd want to scale out.

Well, we clearly don't want to devise a separate solution for every
subsystem, but it doesn't seem out of the question to me that we could
build a "reference counted cache sub-structure" mechanism and apply it
to each of these relcache fields.  Maybe it could be unified with the
existing code in the typcache that does a similar thing.  Sure, this'd
be a fair amount of work, but we've done it before.  Syscache entries
didn't use to have reference counts, for example.

BTW, the problem I have with the PartitionDirectory stuff is exactly
that it *isn't* a reference-counted solution.  If it were, we'd not
have this problem of not knowing when we could free rd_pdcxt.

>> I especially don't care for the much-less-than-half-baked kluge of
>> chaining the old rd_pdcxt onto the new one and hoping that it will go away
>> at a suitable time.

> It will go away at a suitable time, but maybe not at the soonest
> suitable time.  It wouldn't be hard to improve that, though.  The
> easiest thing to do, I think, would be to have an rd_oldpdcxt and
> stuff the old context there.  If there already is one there, parent
> the new one under it.  When RelationDecrementReferenceCount reduces
> the reference count to zero, destroy anything found in rd_oldpdcxt.

Meh.  While it seems likely that that would mask most practical problems,
it still seems like covering up a wart with a dirty bandage.  In
particular, that fix doesn't fix anything unless relcache reference counts
go to zero pretty quickly --- which I'll just note is directly contrary
to your enthusiasm for holding relcache pins longer.

I think that what we ought to do for v12 is have PartitionDirectory
copy the data, and then in v13 work on creating real reference-count
infrastructure that would allow eliminating the copy steps with full
safety.  The $64 question is whether that really would cause unacceptable
performance problems.  To look into that, I made the attached WIP patches.
(These are functionally complete, but I didn't bother for instance with
removing the hunk that 898e5e329 added to relcache.c, and the comments
need work, etc.)  The first one just changes the PartitionDirectory
code to do that, and then the second one micro-optimizes
partition_bounds_copy() to make it somewhat less expensive, mostly by
collapsing lots of small palloc's into one big one.
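
For context, the kind of setup such numbers come from can be sketched as
below (this is not the exact script from [1], just an illustration: a
hash-partitioned table with many partitions, probed by a single-partition
SELECT from a pgbench script):

  CREATE TABLE t (a int) PARTITION BY HASH (a);
  -- in psql, generate and run the N partition definitions, here N = 1024:
  SELECT format('CREATE TABLE t_%s PARTITION OF t FOR VALUES WITH (MODULUS 1024, REMAINDER %s);', i, i)
  FROM generate_series(0, 1023) AS i
  \gexec

  -- pgbench script:
  --   \set a random(1, 1000000)
  --   SELECT * FROM t WHERE a = :a;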

What I get for test cases like [1] is

single-partition SELECT, hash partitioning:

N       tps, HEAD       tps, patch
2       11426.243754    11448.615193
8       11254.833267    11374.278861
32      11288.329114    11371.942425
128     11222.329256    11185.845258
512     11001.177137    10572.917288
1024    10612.456470    9834.172965
4096    8819.110195     7021.864625
8192    7372.611355     5276.130161

single-partition SELECT, range partitioning:

N       tps, HEAD       tps, patch
2       11037.855338    11153.595860
8       11085.218022    11019.132341
32      10994.348207    10935.719951
128     10884.417324    10532.685237
512     10635.583411    9578.108915
1024    10407.286414    8689.585136
4096

Re: Identity columns should own only one sequence

2019-04-14 Thread Laurenz Albe
I wrote:
> Identity columns don't work if they own more than one sequence.
> 
[...]
> test=> INSERT INTO ser (id) VALUES (DEFAULT);
> ERROR:  more than one owned sequence found
> 
> 
> I propose that we check if there already is a dependent sequence
> before adding an identity column.
> 
> The attached patch does that, and also forbids setting the ownership
> of a sequence to an identity column.

Alternatively, maybe getOwnedSequence should only consider sequences
with an "internal" dependency on the column.  That would avoid the problem
without forbidding anything, since normal OWNED BY dependencies are "auto".
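
For anyone who wants to look at the dependencies in question, a sketch of a
pg_depend query (using the "ser"/"id" example from the original report;
deptype 'a' is the auto dependency created by OWNED BY, 'i' the internal one
of an identity column):

  SELECT d.objid::regclass AS owned_sequence, d.deptype
  FROM pg_depend d
  WHERE d.classid = 'pg_class'::regclass
    AND d.refclassid = 'pg_class'::regclass
    AND d.refobjid = 'ser'::regclass
    AND d.refobjsubid = 1                -- attnum of column "id"
    AND d.deptype IN ('a', 'i');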

What do you think?

Yours,
Laurenz Albe





Re: hyrax vs. RelationBuildPartitionDesc

2019-04-14 Thread Tom Lane
Amit Langote  writes:
> On 2019/04/10 15:42, Michael Paquier wrote:
>> It seems to me that Tom's argument to push in the way relcache
>> information is handled by copying its contents sounds sensible to me.
>> That's not perfect, but it is consistent with what exists (note
>> PublicationActions for a rather-still-not-much-complex example of
>> structure which gets copied from the relcache).

> I gave my vote for direct access of unchangeable relcache substructures
> (TupleDesc, PartitionDesc, etc.), because they're accessed during the
> planning of every user query and fairly expensive to copy compared to list
> of indexes or PublicationActions that you're citing.  To further my point
> a bit, I wonder why PublicationActions needs to be copied out of relcache
> at all?  Looking at its usage in execReplication.c, it seems we can do
> fine without copying, because we are holding both a lock and a relcache
> reference on the replication target relation during the entirety of
> apply_handle_insert(), which means that information can't change under us,
> neither logically nor physically.

So the point here is that that reasoning is faulty.  You *cannot* assume,
no matter how strong a lock or how many pins you hold, that a relcache
entry will not get rebuilt underneath you.  Cache flushes happen
regardless.  And unless relcache.c takes special measures to prevent it,
a rebuild will result in moving subsidiary data structures and thereby
breaking any pointers you may have pointing into those data structures.

For certain subsidiary structures such as the relation tupdesc,
we do take such special measures: that's what the "keep_xxx" dance in
RelationClearRelation is.  However, that's expensive, both in cycles
and maintenance effort: it requires having code that can decide equality
of the subsidiary data structures, which we might well have no other use
for, and which we certainly don't have strong tests for correctness of.
It's also very error-prone for callers, because there isn't any good way
to cross-check that code using a long-lived pointer to a subsidiary
structure is holding a lock that's strong enough to guarantee non-mutation
of that structure, or even that relcache.c provides any such guarantee
at all.  (If our periodic attempts to reduce lock strength for assorted
DDL operations don't scare the pants off you in this connection, you have
not thought hard enough about it.)  So I think that even though we've
largely gotten away with this approach so far, it's also a half-baked
kluge that we should be looking to get rid of, not extend to yet more
cases.

To my mind there are only two trustworthy solutions to the problem of
wanting time-extended usage of a relcache subsidiary data structure: one
is to copy it, and the other is to reference-count it.  I think that going
over to a reference-count-based approach for many of these structures
might well be something we should do in future, maybe even the very near
future.  In the mean time, though, I'm not really satisfied with inserting
half-baked kluges, especially not ones that are different from our other
half-baked kluges for similar purposes.  I think that's a path to creating
hard-to-reproduce bugs.

regards, tom lane




Re: pg_dump is broken for partition tablespaces

2019-04-14 Thread Alvaro Herrera
On 2019-Apr-14, Andres Freund wrote:

> On 2019-04-14 10:38:05 -0400, Tom Lane wrote:
> > It's entirely possible BTW that this whole business of inheriting
> > tablespace from the partitioned table is broken and should be thrown
> > out.  I certainly don't see any compelling reason for partitions to
> > act differently from regular tables in this respect, and the more
> > problems we find with the idea, the less attractive it seems.
> 
> Indeed. After discovering during the tableam work, and trying to write
> tests for the equivalent feature for tableam, I decided that just not
> allowing AM specifications for partitioned tables is the right call - at
> least until the desired behaviour is clearer. The discussion of the last
> few days makes me think so even more.

To be honest, when doing that feature I neglected to pay attention to
(read: forgot about) default_tablespace, and it's mostly the
interactions with that feature that makes this partitioned table stuff
so complicated.  I'm not 100% convinced yet that we need to throw it out
completely, but I'm less sure now about it than I was before.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Should the docs have a warning about pg_stat_reset()?

2019-04-14 Thread Tomas Vondra

On Sun, Apr 14, 2019 at 09:11:52AM -0500, Joe Conway wrote:
> On 4/13/19 3:42 PM, Tomas Vondra wrote:
>> If only we had a way to regularly snapshot the data from within the
>> database, and then compute the deltas on that. If only we could insert
>> data from one table into another one and then do some analytics on it,
>> with like small windows moving over the data or something ...
>
> Why not store deltas separately with their own delta reset command?

Well, we could do that, but we don't. Essentially, we'd implement some
sort of RRD, but we'd have to handle cleanup and configuration (how much
history, how frequently to snapshot the deltas). I think the assumption
is that people will do that in a third-party tool.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Zedstore - compressed in-core columnar storage

2019-04-14 Thread Tomas Vondra

On Sun, Apr 14, 2019 at 09:45:10AM -0700, Andres Freund wrote:
> Hi,
>
> On 2019-04-14 18:36:18 +0200, Tomas Vondra wrote:
>> I think those comparisons are cute and we did a fair amount of them when
>> considering a drop-in replacement for pglz, but ultimately it might be a
>> bit pointless because:
>>
>> (a) it very much depends on the dataset (one algorithm may work great on
>> one type of data, suck on another)
>>
>> (b) different systems may require different trade-offs (high ingestion
>> rate vs. best compression ratio)
>>
>> (c) decompression speed may be much more important
>>
>> What I'm trying to say is that we shouldn't obsess about picking one
>> particular algorithm too much, because it's entirely pointless. Instead,
>> we should probably design the system to support different compression
>> algorithms, ideally at column level.
>
> I think we still need to pick a default algorithm, and realistically
> that's going to be used by like 95% of the users.

True. Do you expect it to be specific to the column store, or should it be
the per-instance default (even for regular heap)?

FWIW I think the conclusion from past dev meetings was that we're unlikely
to find anything better than lz4. I doubt that has changed very much.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Checksum errors in pg_stat_database

2019-04-14 Thread Magnus Hagander
On Sat, Apr 13, 2019 at 8:46 PM Robert Treat  wrote:

>
> On Fri, Apr 12, 2019 at 8:18 AM Magnus Hagander 
> wrote:
> > On Sun, Apr 7, 2019 at 6:28 PM Julien Rouhaud 
> wrote:
> >> Thanks for looking it it!
> >> On Sun, Apr 7, 2019 at 4:36 PM Magnus Hagander 
> wrote:
> >> >
> >> > I'm not sure I like the idea of using "" as the
> database name. It's not very likely that somebody would be using that as a
> name for their database, but it's not impossible. But it also just looks
> strange. Wouldn't NULL be a more appropriate choice?
> >> >
> >> > Likewise, shouldn't we return NULL as the number of backends for the
> shared counters, rather than 0?
> >> I wanted to make things more POLA-compliant, but maybe it was a bad
> >> idea.  I changed it for NULL here and for numbackends.
> >>
>
> ISTM the argument here is go with zero since you have zero connections
> vs go with null since you can't actually connect, so it doesn't make
> sense. (There is a third argument about making it -1 since you can't
> connect, but that breaks sum(numbackends) so it's easily dismissed.) I
> think I would have gone for 0 personally, but what ended up surprising
> me was that a bunch of other stuff like xact_commit show zero when
> AFAICT the above reasoning would apply the same to those columns.
> (unless there is a way to commit a transaction in the global objects
> that I don't know about).
>

That's a good point. I mean, you can commit a transaction that involves
changes of global objects, but it counts in the database that you were
connected to.

We should probably at least make it consistent and make it NULL in all or 0
in all.

I'm -1 for using -1 (!), for the very reason that you mention. But either
changing the numbackends to 0, or the others to NULL would work for
consistency. I'm leaning towards the 0 as well.


>> > Micro-nit:
> >> > + Time at which the last data page checksum failures was
> detected in
> >> > s/failures/failure/
> >>
> >> Oops.
> >>
> >> v5 attached.
> >
>
> What originally got me looking at this was the idea of returning -1
> (or maybe null) for checksum failures for cases when checksums are not
> enabled. This seems a little more complicated to set up, but seems
> like it might ward off people thinking they are safe due to no
> checksum error reports when they actually aren't.
>

NULL seems like the reasonable thing to return there. I'm not sure what
you're referring to with "a little more complicated to set up", though? Do
you mean somehow for the end user?

Code-wise it seems it should be simple -- just do an "if checksums disabled
then return null"  in the two functions.
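
For reference, the columns in question as they show up in the view (with the
change discussed here, NULL would then mean "checksums not enabled" rather
than "no failures so far"):

  SELECT datname, checksum_failures, checksum_last_failure
  FROM pg_stat_database;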

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Zedstore - compressed in-core columnar storage

2019-04-14 Thread Tomas Vondra

On Mon, Apr 08, 2019 at 05:27:05PM -0700, Ashwin Agrawal wrote:

  Heikki and I have been hacking recently for few weeks to implement
  in-core columnar storage for PostgreSQL. Here's the design and initial
  implementation of Zedstore, compressed in-core columnar storage (table
  access method). Attaching the patch and link to github branch [1] to
  follow along.

  The objective is to gather feedback on design and approach to the
  same. The implementation has core basic pieces working but not close
  to complete.

  Big thank you to Andres, Haribabu and team for the table access method
  API's. Leveraged the API's for implementing zedstore, and they prove
  to be in very good shape. Had to enhance them minimally but
  in general didn't have to touch the executor much.

  Motivations / Objectives

  * Performance improvement for queries selecting subset of columns
    (reduced IO).
  * Reduced on-disk footprint compared to heap table. Shorter tuple
    headers and also leveraging compression of similar type data
  * Be first-class citizen in the Postgres architecture (tables data can
    just independently live in columnar storage)
  * Fully MVCC compliant
  * All Indexes supported
  * Hybrid row-column store, where some columns are stored together, and
    others separately. Provide flexibility of granularity on how to
    divide the columns. Columns accessed together can be stored
    together.
  * Provide better control over bloat (similar to zheap)
  * Eliminate need for separate toast tables
  * Faster add / drop column or changing data type of column by avoiding
    full rewrite of the table.



Cool. Me gusta.


  High-level Design - B-trees for the win!
  

  To start simple, let's ignore column store aspect for a moment and
  consider it as compressed row store. The column store is natural
  extension of this concept, explained in next section.

  The basic on-disk data structure leveraged is a B-tree, indexed by
  TID. BTree being a great data structure, fast and versatile. Note this
  is not referring to existing Btree indexes, but instead net new
  separate BTree for table data storage.

  TID - logical row identifier:
  TID is just a 48-bit row identifier. The traditional division into
  block and offset numbers is meaningless. In order to find a tuple with
  a given TID, one must always descend the B-tree. Having a logical TID
  provides the flexibility to move tuples around to different pages when page
  splits or page merges are performed.



So if TIDs are redefined this way, how does that affect BRIN indexes? I mean,
that's a lightweight indexing scheme which however assumes TIDs encode a
certain amount of locality - so this probably makes them (and Bitmap
Heap Scans in general) much less efficient. That's a bit unfortunate,
although I don't see a way around it :-(


  The internal pages of the B-tree are super simple and boring. Each
  internal page just stores an array of TID and downlink pairs. Let's
  focus on the leaf level. Leaf blocks have short uncompressed header,
  followed by btree items. Two kinds of items exist:

   - plain item, holds one tuple or one datum, uncompressed payload
   - a "container item", holds multiple plain items, compressed payload

  +-
  | Fixed-size page header:
  |
  |   LSN
  |   TID low and hi key (for Lehman & Yao B-tree operations)
  |   left and right page pointers
  |
  | Items:
  |
  |   TID | size | flags | uncompressed size | lastTID | payload (container
  item)
  |   TID | size | flags | uncompressed size | lastTID | payload (container
  item)
  |   TID | size | flags | undo pointer | payload (plain item)
  |   TID | size | flags | undo pointer | payload (plain item)
  |   ...
  |
  +



So if I understand it correctly, ZSUncompressedBtreeItem is the "plain"
item and ZSCompressedBtreeItem is the container one. Correct?

I find it a bit confusing, and I too ran into the issue with data that
can't be compressed, so I think the "container" should support both
compressed and uncompressed data. Heikki already mentioned that, so I
suppose it's just not implemented yet. That however means the name of
the "compressed" struct gets confusing, so I suggest to rename to:

   ZSUncompressedBtreeItem -> ZSPlainBtreeItem
   ZSCompressedBtreeItem -> ZSContainerBtreeItem

where the container supports both compressed and uncompressed mode.
Also, maybe we don't need to put "Btree" into every damn name ;-)

Looking at the ZSCompressedBtreeItem, I see it stores just first/last
TID for the compressed data. Won't that be insufficient when there are
some gaps due to deletions or something? Or perhaps I just don't
understand how it works.

Another thing is that with uncompressed size being stored as uint16,
won't that be insufficient for highly compressible data / large pages? I
mean, we can have pages up to 32kB, which is not that far.



  Column store
  

  A column 

Re: Zedstore - compressed in-core columnar storage

2019-04-14 Thread Andres Freund
Hi,

On 2019-04-14 18:36:18 +0200, Tomas Vondra wrote:
> I think those comparisons are cute and we did a fair amount of them when
> considering a drop-in replacement for pglz, but ultimately it might be a
> bit pointless because:
> 
> (a) it very much depends on the dataset (one algorithm may work great on
> one type of data, suck on another)
> 
> (b) different systems may require different trade-offs (high ingestion
> rate vs. best compression ratio)
> 
> (c) decompression speed may be much more important
> 
> What I'm trying to say is that we shouldn't obsess about picking one
> particular algorithm too much, because it's entirely pointless. Instead,
> we should probably design the system to support different compression
> algorithms, ideally at column level.

I think we still need to pick a default algorithm, and realistically
that's going to be used by like 95% of the users.

Greetings,

Andres Freund




Re: Zedstore - compressed in-core columnar storage

2019-04-14 Thread Tomas Vondra

On Thu, Apr 11, 2019 at 06:20:47PM +0300, Heikki Linnakangas wrote:

On 11/04/2019 17:54, Tom Lane wrote:

Ashwin Agrawal  writes:

Thank you for trying it out. Yes, noticed for certain patterns pg_lzcompress() 
actually requires much larger output buffers. Like for one 86 len source it 
required 2296 len output buffer. Current zedstore code doesn’t handle this case 
and errors out. LZ4 for same patterns works fine, would highly recommend using 
LZ4 only, as anyways speed is very fast as well with it.


You realize of course that *every* compression method has some inputs that
it makes bigger.  If your code assumes that compression always produces a
smaller string, that's a bug in your code, not the compression algorithm.


Of course. The code is not making that assumption, although clearly 
there is a bug there somewhere because it throws that error. It's 
early days..


In practice it's easy to weasel out of that, by storing the data 
uncompressed, if compression would make it longer. Then you need an 
extra flag somewhere to indicate whether it's compressed or not. It 
doesn't break the theoretical limit because the actual stored length 
is then original length + 1 bit, but it's usually not hard to find a 
place for one extra bit.




Don't we already have that flag, though? I see ZSCompressedBtreeItem has
t_flags, and there's ZSBT_COMPRESSED, but maybe it's more complicated.


--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Zedstore - compressed in-core columnar storage

2019-04-14 Thread Tomas Vondra

On Thu, Apr 11, 2019 at 04:52:33PM +0300, Konstantin Knizhnik wrote:

  On 11.04.2019 16:18, Andreas Karlsson wrote:

On 4/11/19 10:46 AM, Konstantin Knizhnik wrote:

  This my results of compressing pbench data using different
  compressors:

   +-----------------------+------------+------------+
   | Configuration         | Size (Gb)  | Time (sec) |
   +-----------------------+------------+------------+
   | no compression        | 15.31      | 92         |
   | zlib (default level)  | 2.37       | 284        |
   | zlib (best speed)     | 2.43       | 191        |
   | postgres internal lz  | 3.89       | 214        |
   | lz4                   | 4.12       | 95         |
   | snappy                | 5.18       | 99         |
   | lzfse                 | 2.80       | 1099       |
   | (apple) 2.80 1099     | 1.69       | 125        |
   +-----------------------+------------+------------+

  You see that zstd provides an almost 2 times better compression ratio
  at almost the same speed.

What is "(apple) 2.80 1099"? Was that intended to be zstd?

Andreas

  Ugh...
  Cut and paste problems.
  The whole document can be found here:
  http://garret.ru/PageLevelCompression.pdf

  lzfse (apple)   2.80    1099
  zstd (facebook)  1.69    125

  zstd is a compression algorithm from Facebook: 
  https://github.com/facebook/zstd
  Looks like it provides the best speed/compression ratio result.



I think those comparisons are cute and we did a fair amount of them when
considering a drop-in replacement for pglz, but ultimately it might be a
bit pointless because:

(a) it very much depends on the dataset (one algorithm may work great on
one type of data, suck on another)

(b) different systems may require different trade-offs (high ingestion
rate vs. best compression ratio)

(c) decompression speed may be much more important

What I'm trying to say is that we shouldn't obsess about picking one
particular algorithm too much, because it's entirely pointless. Instead,
we should probably design the system to support different compression
algorithms, ideally at column level.

Also, while these general purpose algorithms are nice, what I think will
be important in later stages of colstore development will be compression
algorithms allowing execution directly on the compressed data (like RLE,
dictionary and similar approaches).


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Zedstore - compressed in-core columnar storage

2019-04-14 Thread Tomas Vondra

On Tue, Apr 09, 2019 at 02:03:09PM -0700, Ashwin Agrawal wrote:

  On Tue, Apr 9, 2019 at 9:13 AM Konstantin Knizhnik
   wrote:

On 09.04.2019 18:51, Alvaro Herrera wrote:
> On 2019-Apr-09, Konstantin Knizhnik wrote:
>
>> On 09.04.2019 3:27, Ashwin Agrawal wrote:
>>> Heikki and I have been hacking recently for few weeks to implement
>>> in-core columnar storage for PostgreSQL. Here's the design and
initial
>>> implementation of Zedstore, compressed in-core columnar storage
(table
>>> access method). Attaching the patch and link to github branch [1] to
>>> follow along.
>> Thank you for publishing this patch. IMHO Postgres is really missing
normal
>> support of columnar store
> Yep.
>
>> and table access method API is the best way of integrating it.
> This is not surprising, considering that columnar store is precisely
the
> reason for starting the work on table AMs.
>
> We should certainly look into integrating some sort of columnar
storage
> in mainline.  Not sure which of zedstore or VOPS is the best
candidate,
> or maybe we'll have some other proposal.  My feeling is that having
more
> than one is not useful; if there are optimizations to one that can be
> borrowed from the other, let's do that instead of duplicating effort.
>
There are two different aspects:
1. Store format.
2. Vector execution.

1. VOPS is using mixed format, something similar with Apache parquet.
Tuples are stored vertically, but only inside one page.
It tries to minimize trade-offs between true horizontal and true
vertical storage:
first is most optimal for selecting all rows, while second - for
selecting small subset of rows.
To make this approach more efficient, it is better to use large page
size - default Postgres 8k pages is not enough.

 From my point of view such format is better than pure vertical storage
which will be very inefficient if query access larger number of columns.
This problem can be somehow addressed by creating projections: grouping
several columns together. But it requires more space for storing
multiple projections.

  Right, storing all the columns in a single page doesn't give any savings on
  IO.



Yeah, although you could save some I/O thanks to compression even in
that case.


2. Doesn't matter which format we choose, to take all advantages of
vertical representation we need to use vector operations.
And Postgres executor doesn't support them now. This is why VOPS is
using some hacks, which is definitely not good and not working in all
cases.
zedstore is not using such hacks and ... this is why it never can reach
VOPS performance.

  Vectorized execution is orthogonal to the storage format. It can even be
  applied to a row store and gain performance. Similarly, a column store
  without vectorized execution also gives benefits such as better
  compression ratios. A column store combined with vectorized execution is
  a lot more performant, agreed. Zedstore currently is focused on having
  the AM piece in place, which fits the postgres ecosystem and supports
  all the features heap does.


Not sure it's quite orthogonal. Sure, you can apply it to rowstores too,
but I'd say column stores are naturally better suited for it.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Zedstore - compressed in-core columnar storage

2019-04-14 Thread Tomas Vondra

On Tue, Apr 09, 2019 at 02:29:09PM -0400, Robert Haas wrote:

On Tue, Apr 9, 2019 at 11:51 AM Alvaro Herrera  wrote:

This is not surprising, considering that columnar store is precisely the
reason for starting the work on table AMs.

We should certainly look into integrating some sort of columnar storage
in mainline.  Not sure which of zedstore or VOPS is the best candidate,
or maybe we'll have some other proposal.  My feeling is that having more
than one is not useful; if there are optimizations to one that can be
borrowed from the other, let's do that instead of duplicating effort.


I think that conclusion may be premature.  There seem to be a bunch of
different ways of doing columnar storage, so I don't know how we can
be sure that one size will fit all, or that the first thing we accept
will be the best thing.

Of course, we probably do not want to accept a ton of storage manager
implementations is core.  I think if people propose implementations
that are poor quality, or missing important features, or don't have
significantly different use cases from the ones we've already got,
it's reasonable to reject those.  But I wouldn't be prepared to say
that if we have two significantly different column store that are both
awesome code with a complete feature set and significantly disjoint
use cases, we should reject the second one just because it is also a
column store.  I think that won't get out of control because few
people will be able to produce really high-quality implementations.

This stuff is hard, which I think is also why we only have 6.5 index
AMs in core after many, many years.  And our standards have gone up
over the years - not all of those would pass muster if they were
proposed today.



It's not clear to me whether you're arguing for not having any such
implementation in core, or having multiple ones? I think we should aim
to have at least one in-core implementation, even if it's not the best
possible one for all sizes. It's not like our rowstore is the best
possible implementation for all cases either.

I think having a colstore in core is important not just for adoption,
but also for testing and development of the executor / planner bits.

If we have multiple candidates with sufficient code quality, then we may
consider including both. I don't think it's very likely to happen in the
same release, considering how much work it will require. And I have no
idea if zedstore or VOPS are / will be the only candidates - it's way
too early at this point.

FWIW I personally plan to focus primarily on the features that aim to
be included in core, and that applies to colstores too.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: pg_dump is broken for partition tablespaces

2019-04-14 Thread Andres Freund
Hi,

On 2019-04-14 10:38:05 -0400, Tom Lane wrote:
> It's entirely possible BTW that this whole business of inheriting
> tablespace from the partitioned table is broken and should be thrown
> out.  I certainly don't see any compelling reason for partitions to
> act differently from regular tables in this respect, and the more
> problems we find with the idea, the less attractive it seems.

Indeed. After discovering during the tableam work, and trying to write
tests for the equivalent feature for tableam, I decided that just not
allowing AM specifications for partitioned tables is the right call - at
least until the desired behaviour is clearer. The discussion of the last
few days makes me think so even more.

Greetings,

Andres Freund




Identity columns should own only one sequence

2019-04-14 Thread Laurenz Albe
Identity columns don't work if they own more than one sequence.

So if one tries to convert a "serial" column to an identity column,
the following can happen:

test=> CREATE TABLE ser(id serial);
CREATE TABLE
test=> ALTER TABLE ser ALTER id ADD GENERATED ALWAYS AS IDENTITY;
ERROR:  column "id" of relation "ser" already has a default value

Hm, ok, let's drop the column default value.

test=> ALTER TABLE ser ALTER id DROP DEFAULT;
ALTER TABLE

Now it works:

test=> ALTER TABLE ser ALTER id ADD GENERATED ALWAYS AS IDENTITY;
ALTER TABLE

But not very much:

test=> INSERT INTO ser (id) VALUES (DEFAULT);
ERROR:  more than one owned sequence found


I propose that we check if there already is a dependent sequence
before adding an identity column.

The attached patch does that, and also forbids setting the ownership
of a sequence to an identity column.
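
For illustration, the second case would then be rejected like this (a
sketch; the error text is the message added in the patch):

test=> CREATE TABLE ident(id int GENERATED ALWAYS AS IDENTITY);
CREATE TABLE
test=> CREATE SEQUENCE s;
CREATE SEQUENCE
test=> ALTER SEQUENCE s OWNED BY ident.id;
ERROR:  column "id" of relation "ident" is an identity column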

I think this should be backpatched.

Yours,
Laurenz Albe
From ab536da87fa8ffc70469d3dbdaf3e1b84b0ef793 Mon Sep 17 00:00:00 2001
From: Laurenz Albe 
Date: Sun, 14 Apr 2019 17:37:03 +0200
Subject: [PATCH] Make sure identity columns own only a single sequence

If an identity column owns more than one sequence, inserting the
default value will fail with "more than one owned sequence found".

Consequently, we should prevent this from happening, and disallow
a) turning a column that already owns a sequence into an identity column
b) setting ownership of a sequence to an identity column
---
 src/backend/commands/sequence.c  | 20 +---
 src/backend/commands/tablecmds.c | 11 +++
 2 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index e9add1b987..dc3bff5fdf 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -1644,6 +1644,7 @@ process_owned_by(Relation seqrel, List *owned_by, bool for_identity)
 	int			nnames;
 	Relation	tablerel;
 	AttrNumber	attnum;
+	HeapTuple	heaptup;
 
 	deptype = for_identity ? DEPENDENCY_INTERNAL : DEPENDENCY_AUTO;
 
@@ -1694,9 +1695,22 @@ process_owned_by(Relation seqrel, List *owned_by, bool for_identity)
 	(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 	 errmsg("sequence must be in same schema as table it is linked to")));
 
-		/* Now, fetch the attribute number from the system cache */
-		attnum = get_attnum(RelationGetRelid(tablerel), attrname);
-		if (attnum == InvalidAttrNumber)
+		/* Now, fetch the attribute from the system cache */
+		heaptup = SearchSysCacheAttName(RelationGetRelid(tablerel), attrname);
+		if (HeapTupleIsValid(heaptup))
+		{
+			Form_pg_attribute att_tup = (Form_pg_attribute) GETSTRUCT(heaptup);
+			bool	is_identity = att_tup->attidentity;
+
+			attnum = att_tup->attnum;
+			ReleaseSysCache(heaptup);
+			if (is_identity && !for_identity)
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("column \"%s\" of relation \"%s\" is an identity column",
+								attrname, RelationGetRelationName(tablerel))));
+		}
+		else
 			ereport(ERROR,
 					(errcode(ERRCODE_UNDEFINED_COLUMN),
 					 errmsg("column \"%s\" of relation \"%s\" does not exist",
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index d48a947f7c..9ea9dae2fb 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -6378,6 +6378,17 @@ ATExecAddIdentity(Relation rel, const char *colName,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 				 errmsg("column \"%s\" of relation \"%s\" already has a default value",
 						colName, RelationGetRelationName(rel))));
+	/*
+	 * Make sure that the column does not already own a sequence.
+	 * Otherwise, inserting a default value would fail, since it is not
+	 * clear which sequence should be used.
+	 */
+	if (getOwnedSequences(RelationGetRelid(rel), attnum) != NIL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("column \"%s\" of relation \"%s\" already owns a sequence",
+						colName, RelationGetRelationName(rel)),
+				 errhint("Drop the dependent sequence before making the column an identity column.")));
 
 	attTup->attidentity = cdef->identity;
 	CatalogTupleUpdate(attrelation, &tuple->t_self, tuple);
-- 
2.20.1



Re: New vacuum option to do only freezing

2019-04-14 Thread Tom Lane
Robert Haas  writes:
> On Wed, Apr 3, 2019 at 10:32 PM Masahiko Sawada  wrote:
>> Attached the updated version patch.

> Committed with a little bit of documentation tweaking.

topminnow just failed an assertion from this patch:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=topminnow&dt=2019-04-14%2011%3A01%3A48

The symptoms are:

TRAP: FailedAssertion("!((params->index_cleanup == VACOPT_TERNARY_ENABLED && 
nleft_dead_tuples == 0 && nleft_dead_itemids == 0) || params->index_cleanup == 
VACOPT_TERNARY_DISABLED)", File: 
"/home/nm/farm/mipsel_deb8_gcc_32/HEAD/pgsql.build/../pgsql/src/backend/access/heap/vacuumlazy.c",
 Line: 1404)
...
2019-04-14 14:49:16.328 CEST [15282:5] LOG:  server process (PID 18985) was 
terminated by signal 6: Aborted
2019-04-14 14:49:16.328 CEST [15282:6] DETAIL:  Failed process was running: 
autovacuum: VACUUM ANALYZE pg_catalog.pg_depend

Just looking at the logic around index_cleanup, I rather think that
that assertion is flat out wrong:

+	/* No dead tuples should be left if index cleanup is enabled */
+	Assert((params->index_cleanup == VACOPT_TERNARY_ENABLED &&
+			nleft_dead_tuples == 0 && nleft_dead_itemids == 0) ||
+		   params->index_cleanup == VACOPT_TERNARY_DISABLED);

Either it's wrong, or this is:

+				/*
+				 * Since this dead tuple will not be vacuumed and
+				 * ignored when index cleanup is disabled we count
+				 * count it for reporting.
+				 */
+				if (params->index_cleanup == VACOPT_TERNARY_ENABLED)
+					nleft_dead_tuples++;

The poor quality of that comment suggests that maybe the code is just
as confused.

(I also think that that "ternary option" stuff is unreadably overdesigned
notation, which possibly contributed to getting this wrong.  If even the
author can't keep it straight, it's unreadable.)

regards, tom lane




Re: pg_dump is broken for partition tablespaces

2019-04-14 Thread Tom Lane
David Rowley  writes:
> On Mon, 15 Apr 2019 at 02:16, Tom Lane  wrote:
>> Well, it's not really nice perhaps, but you cannot just put in some
>> other concrete tablespace OID instead.  What a zero there means is
>> "use the database's default tablespace", and the point of it is that
>> it still means that after the DB has been cloned with a different
>> default tablespace.  If we don't store 0 then we break
>> "CREATE DATABASE ... TABLESPACE = foo".

> [ quotes documents ]

I think those documentation statements are probably wrong in detail,
or at least you're misreading them if you think they are justification
for this patch.  *This change will break CREATE DATABASE*.

(And, apparently, the comment you tried to remove isn't sufficiently
clear about that.)

> Alvaro is proposing to change this behaviour for partitioned tables
> and indexes. I'm proposing not having that special case and just
> changing it.

It's possible that Alvaro's patch is also broken, but I haven't had time
to review it.  The immediate question is what happens when somebody makes
a partitioned table in template1 and then does CREATE DATABASE with a
tablespace option.  Does the partitioned table end up in the same
tablespace as ordinary tables do?
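
Spelled out, the scenario in question is roughly this (a sketch; names
and the tablespace path are placeholders):

    \c template1
    CREATE TABLE plain_tab (a int);
    CREATE TABLE part_tab (a int) PARTITION BY RANGE (a);
    \c postgres
    CREATE TABLESPACE foo LOCATION '/some/path';
    CREATE DATABASE newdb TABLESPACE foo;
    \c newdb
    SELECT relname, reltablespace FROM pg_class
    WHERE relname IN ('plain_tab', 'part_tab');

The question is whether both rows end up with reltablespace = 0, i.e.
following newdb's default tablespace, or whether the partitioned table
behaves differently.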

It's entirely possible BTW that this whole business of inheriting
tablespace from the partitioned table is broken and should be thrown
out.  I certainly don't see any compelling reason for partitions to
act differently from regular tables in this respect, and the more
problems we find with the idea, the less attractive it seems.

regards, tom lane




Re: pg_dump is broken for partition tablespaces

2019-04-14 Thread David Rowley
On Mon, 15 Apr 2019 at 02:16, Tom Lane  wrote:
>
> David Rowley  writes:
> > I'd say the fact that we populate reltablespace with 0 is a bug as
> > it's not going to do what they want after a dump/restore.
>
> Well, it's not really nice perhaps, but you cannot just put in some
> other concrete tablespace OID instead.  What a zero there means is
> "use the database's default tablespace", and the point of it is that
> it still means that after the DB has been cloned with a different
> default tablespace.  If we don't store 0 then we break
> "CREATE DATABASE ... TABLESPACE = foo".
>
> You could imagine using some special tablespace OID that has these
> semantics (*not* pg_default, but some new row in pg_tablespace).
> I'm not sure that that'd provide any functional improvement over
> using zero, but we could certainly entertain such a change if
> partitioned tables seem to need it.

The patch only changes that behaviour when the user does something like:

set default_tablespace = 'pg_default';
create table ... (...);

or:

create table ... (...) tablespace pg_default;

The 0 value is still maintained when the tablespace is not specified
or default_tablespace is an empty string.
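
To make that concrete (a sketch; with the patch, t2 would record
pg_default's OID in reltablespace instead of 0):

create table t1 (a int);                        -- not specified: stays 0
create table t2 (a int) tablespace pg_default;  -- currently also stored as 0
select relname, reltablespace from pg_class where relname in ('t1', 't2');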

The CREATE TABLE docs mention:

"The tablespace_name is the name of the tablespace in which the new
table is to be created. If not specified, default_tablespace is
consulted, or temp_tablespaces if the table is temporary. For
partitioned tables, since no storage is required for the table itself,
the tablespace specified here only serves to mark the default
tablespace for any newly created partitions when no other tablespace
is explicitly specified."

and the default_tablespace docs say:

"When default_tablespace is set to anything but an empty string, it
supplies an implicit TABLESPACE clause for CREATE TABLE and CREATE
INDEX commands that do not have an explicit one."

so the change just seems to be altering the code to follow the documents.

Alvaro is proposing to change this behaviour for partitioned tables
and indexes. I'm proposing not having that special case and just
changing it.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: pg_dump is broken for partition tablespaces

2019-04-14 Thread Tom Lane
David Rowley  writes:
> I'd say the fact that we populate reltablespace with 0 is a bug as
> it's not going to do what they want after a dump/restore.

Well, it's not really nice perhaps, but you cannot just put in some
other concrete tablespace OID instead.  What a zero there means is
"use the database's default tablespace", and the point of it is that
it still means that after the DB has been cloned with a different
default tablespace.  If we don't store 0 then we break
"CREATE DATABASE ... TABLESPACE = foo".

You could imagine using some special tablespace OID that has these
semantics (*not* pg_default, but some new row in pg_tablespace).
I'm not sure that that'd provide any functional improvement over
using zero, but we could certainly entertain such a change if
partitioned tables seem to need it.

regards, tom lane




Re: Should the docs have a warning about pg_stat_reset()?

2019-04-14 Thread Joe Conway
On 4/13/19 3:42 PM, Tomas Vondra wrote:
> If only we had a way to regularly snapshot the data from within the
> database, and then compute the deltas on that. If only we could insert
> data from one table into another one and then do some analytics on it,
> with like small windows moving over the data or something ...
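
For context, that snapshot-and-delta approach can be sketched in plain
SQL (the snapshot table and the chosen counters here are arbitrary):

CREATE TABLE stats_snap AS
  SELECT now() AS snap_time, datname, xact_commit, blks_read, blks_hit
  FROM pg_stat_database;

-- taken again later, e.g. from a cron job:
INSERT INTO stats_snap
  SELECT now(), datname, xact_commit, blks_read, blks_hit
  FROM pg_stat_database;

-- deltas between consecutive snapshots:
SELECT datname, snap_time,
       xact_commit - lag(xact_commit) OVER w AS commits_delta
FROM stats_snap
WINDOW w AS (PARTITION BY datname ORDER BY snap_time);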

Why not store deltas separately with their own delta reset command?

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development




Re: Berserk Autovacuum (let's save next Mandrill)

2019-04-14 Thread Komяpa
>
>
> >I don't think it's helpful to force emergency vacuuming more
> >frequently;
> >quite the contrary, it's likely to cause even more issues.  We should
> >tweak autovacuum to perform freezing more preemptively instead.
>
> I still think the fundamental issue with making vacuum less painful is
> that all indexes have to be read entirely. Even if there's not much
> work (say millions of rows frozen, hundreds removed). Without that issue we
> could vacuum much more frequently. And do it properly in insert only
> workloads.
>

Deleting hundreds of rows on default settings will cause the same
behavior now.  If there were zero updates, the index cleanup will
currently be skipped.

https://commitfest.postgresql.org/22/1817/ got merged.  This means
autovacuum could have two separate thresholds: the current one on dead
tuples, triggering VACUUM the same way it does now, and a new one on
inserted tuples only, triggering VACUUM (INDEX_CLEANUP FALSE)?
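
As a sketch, that could look like an insert-based counterpart to the
existing dead-tuple settings (the *_insert_* names below are invented
for illustration; only the dead-tuple settings exist today):

-- existing behaviour, driven by dead tuples:
--   autovacuum_vacuum_threshold / autovacuum_vacuum_scale_factor
-- hypothetical insert-driven counterpart, running VACUUM (INDEX_CLEANUP FALSE):
ALTER TABLE big_append_only SET (
    autovacuum_vacuum_insert_threshold    = 1000000,
    autovacuum_vacuum_insert_scale_factor = 0.0
);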

-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Re: Berserk Autovacuum (let's save next Mandrill)

2019-04-14 Thread Komяpa
On Wed, Apr 10, 2019 at 6:13 PM Alvaro Herrera 
wrote:

> On 2019-Mar-31, Darafei "Komяpa" Praliaskouski wrote:
>
> > Alternative point of "if your database is super large and actively
> written,
> > you may want to set autovacuum_freeze_max_age to even smaller values so
> > that autovacuum load is more evenly spread over time" may be needed.
>
> I don't think it's helpful to force emergency vacuuming more frequently;
> quite the contrary, it's likely to cause even more issues.  We should
> tweak autovacuum to perform freezing more preemptively instead.
>

Okay.  What would be your recommendation for the case of Mandrill running
current Postgres 11?  Which parameters should they tune, and to which values?



>
> --
> Álvaro Herrerahttps://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>


-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Re: pg_dump is broken for partition tablespaces

2019-04-14 Thread David Rowley
On Sat, 13 Apr 2019 at 11:36, Alvaro Herrera  wrote:
> Here's a patch to fix the reported problems.  It's somewhat invasive,
> and I've spent a long time staring at it, so I very much appreciate eyes
> on it.

I think it's a bit strange that we don't store pg_default's OID in
reltablespace for objects other than partitioned tables and indexes,
given that the documents [1] say:

"When default_tablespace is set to anything but an empty string, it
supplies an implicit TABLESPACE clause for CREATE TABLE and CREATE
INDEX commands that do not have an explicit one."

I'd say the fact that we populate reltablespace with 0 is a bug as
it's not going to do what they want after a dump/restore.

If that's ok to change then maybe the attached is an okay fix. Rather
nicely it gets rid of the code that's commented with "Yes, this is a
bit of a hack." and also changes the contract with heap_create() so
that we just pass InvalidOid to mean use MyDatabaseTableSpace.  I've
not really documented that in the patch yet.  It also does not need
the pg_dump change to have it use ATTACH PARTITION instead of
PARTITION OF, although perhaps that's an okay change to make
regardless of this bug.

On Wed, 10 Apr 2019 at 10:58, Alvaro Herrera  wrote:
> There is one deficiency that needs to be solved in order for this to
> work fully: currently there is no way to reset "reltablespace" to 0.

Yeah, I noticed that too.  My patch makes that a consistent problem
with all object types that allow tablespaces. Perhaps we can allow
ALTER ...  SET TABLESPACE DEFAULT; since "DEFAULT" is fully
reserved it can't be the name of a tablespace.
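
Something like (hypothetical syntax, not implemented anywhere yet):

alter table t set tablespace default;   -- would reset reltablespace to 0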

[1] https://www.postgresql.org/docs/devel/manage-ag-tablespaces.html

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


partition-tablespaces-david.patch
Description: Binary data


doc: datatype-table and xfunc-c-type-table

2019-04-14 Thread Chapman Flack
Hi,

Both tables in $subject (in datatype.sgml and xfunc.sgml, respectively)
contain similar information (though the xfunc one mentions C structs and
header files, while the datatype one does not, but has a description
column), and both seem out-of-date with respect to the currently
supported types ... though not identically out-of-date; they have
different numbers of rows, and the types that are missing differ.

How crazy an idea would it be to have include/catalog/pg_type.dat
augmented with description, ctypename, and cheader fields, and let
both tables be generated, with their respective columns?

Regards,
-Chap