[HACKERS] Fix header comment of streamutil.c

2017-07-06 Thread Masahiko Sawada
Hi,

While reading the source code, I found that the header comment of
streamutil.c is not correct.  I guess pg_receivelog is a typo for
pg_receivexlog, and that it was overlooked when xlog was renamed to wal.

Attached patch fixes this.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


fix_header_of_streamutil_c.patch
Description: Binary data



Re: [HACKERS] pgsql 10: hash indexes testing

2017-07-06 Thread AP
On Thu, Jul 06, 2017 at 05:19:59PM +0530, Amit Kapila wrote:
> I think if you are under development, it is always advisable to create
> indexes after initial bulk load.  That way it will be faster and will
> take lesser space atleast in case of hash index.

This is a bit of a pickle, actually:
* if I do have a hash index, I'll wind up with a bloated one at some stage
  that refuses to allow more inserts until the index is re-created
* if I don't have an index, I'll wind up with a table on which I cannot
  create a hash index because it has too many rows for it to handle

I'm at a bit of a loss as to how to deal with this. The DB design does come
with a kind of partitioning, where a bundle of tables gets put off to the side
and searched separately as needed, but too many of those and the impact on
performance becomes noticeable, so I need to minimise them.

> >> As mentioned above REINDEX might be a better option.  I think for such
> >> situation we should have some provision to allow squeeze functionality
> >> of hash exposed to the user, this will be less costly than REINDEX and
> >> might serve the purpose for the user.  Hey, can you try some hack in
> >
> > Assuming it does help, would this be something one would need to guess
> > at? "I did a whole bunch of concurrent INSERT heavy transactions so I
> > guess I should do a squeeze now"?
> >
> > Or could it be figured out programmatically?
> 
> I think one can refer free_percent and number of overflow pages to
> perform such a command.  It won't be 100% correct, but we can make a
> guess.  We can even check free space in overflow pages with page
> inspect to make it more accurate.

Does this take much time? The main reason I ask is that this looks like
something the db ought to handle underneath (say as part of an autovac
run), so if there are stats that the index code can maintain, which the
autovac (or something) code could then use to trigger a cleanup, I think
that would be of benefit.

Unless I am being /so/ unusual that it's not worth it. :)
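
For reference, a rough sketch of the check suggested above, using the
pgstattuple extension's pgstathashindex() (the same function whose output
appears elsewhere in this thread; the index name below is just a placeholder):

    -- Requires the pgstattuple extension, which provides pgstathashindex().
    CREATE EXTENSION IF NOT EXISTS pgstattuple;

    -- Eyeball free space and overflow pages to guess whether a squeeze
    -- (or REINDEX) is likely to be worthwhile.
    SELECT free_percent,
           overflow_pages,
           bucket_pages,
           overflow_pages::float8 / nullif(bucket_pages, 0) AS overflow_ratio
    FROM pgstathashindex('my_hash_index');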

I'll reply to the rest in a separate thread, as I'm still poking at other
work-related things at the moment and so can't do the debug testing just yet.

AP




Re: [HACKERS] Multi column range partition table

2017-07-06 Thread Amit Langote
On 2017/07/07 4:55, Dean Rasheed wrote:
> On 5 July 2017 at 18:07, Dean Rasheed  wrote:
>> So if we were to go for maximum flexibility and compatibility with
>> Oracle, then perhaps what we would do is more like the original idea
>> of UNBOUNDED ABOVE/BELOW, except call them MINVALUE and MAXVALUE,
>> which conveniently are already unreserved keywords, as well as being
>> much shorter. Plus, we would also relax the constraint about having
>> finite values after MINVALUE/MAXVALUE.
>>
> So I know that I have flip-flopped a few times on this now, but I'm
> now starting to think that this approach, replacing UNBOUNDED with
> MINVALUE and MAXVALUE is the best way to go, along with permitting
> finite values after MINVALUE/MAXVALUE.

Sure.

> This gives the greatest flexibility, it's not too verbose, and it
> makes it easy to define contiguous sets of partitions just by making
> the lower bound of one match the upper bound of another.
> 
> With this approach, any partition bounds that Oracle allows are also
> valid in PostgreSQL, not that I would normally give too much weight to
> that, but it is I think quite a nice syntax. Of course, we also
> support things that Oracle doesn't allow, such as MINVALUE and gaps
> between partitions.

Agreed.  MINVALUE/MAXVALUE seems like a good way forward.
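
For anyone following along, here is a minimal sketch of what the proposed
syntax looks like, based on the examples in the documentation diff further
down (table and partition names are made up, and this of course only works
with the patch applied):

    CREATE TABLE measurement (val int) PARTITION BY RANGE (val);

    -- No lower bound: accepts any value less than 10.
    CREATE TABLE measurement_low PARTITION OF measurement
        FOR VALUES FROM (MINVALUE) TO (10);

    -- No upper bound: accepts any value greater than or equal to 10.
    CREATE TABLE measurement_high PARTITION OF measurement
        FOR VALUES FROM (10) TO (MAXVALUE);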

> Parts of the patch are similar to your UNBOUNDED ABOVE/BELOW patch,
> but there are a number of differences -- most notably, I replaced the
> "infinite" boolean flag on PartitionRangeDatum with a 3-value enum and
> did away with all the DefElem nodes and the associated special string
> constants being copied and compared.

That's better.

> However, this is also an incompatible syntax change, and any attempt
> to support both the old and new syntaxes is likely to be messy, so we
> really need to get consensus on whether this is the right thing to do,
> and whether it *can* be done now for PG10.

+1 to releasing this syntax in PG 10.

The patch looks generally good, although I found and fixed some minor
issues (typos and such).  Please find attached the updated patch.

Thanks,
Amit
diff --git a/doc/src/sgml/ref/create_table.sgml 
b/doc/src/sgml/ref/create_table.sgml
index b15c19d3d0..06bea2a18a 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -87,8 +87,8 @@ CREATE [ [ GLOBAL | LOCAL ] { TEMPORARY | TEMP } | UNLOGGED ] 
TABLE [ IF NOT EXI
 and partition_bound_spec 
is:
 
 IN ( { numeric_literal | 
string_literal | NULL } [, ...] ) |
-FROM ( { numeric_literal | 
string_literal | UNBOUNDED } [, 
...] )
-  TO ( { numeric_literal | 
string_literal | UNBOUNDED } [, 
...] )
+FROM ( { numeric_literal | 
string_literal | MINVALUE | 
MAXVALUE } [, ...] )
+  TO ( { numeric_literal | 
string_literal | MINVALUE | 
MAXVALUE } [, ...] )
 
 index_parameters in 
UNIQUE, PRIMARY KEY, and 
EXCLUDE constraints are:
 
@@ -269,10 +269,10 @@ FROM ( { numeric_literal | 
   Each of the values specified in
   the partition_bound_spec is
-  a literal, NULL, or UNBOUNDED.
-  Each literal value must be either a numeric constant that is coercible
-  to the corresponding partition key column's type, or a string literal
-  that is valid input for that type.
+  a literal, NULL, MINVALUE, or
+  MAXVALUE.  Each literal value must be either a
+  numeric constant that is coercible to the corresponding partition key
+  column's type, or a string literal that is valid input for that type.
  
 
  
@@ -300,13 +300,47 @@ FROM ( { numeric_literal | 
 
  
-  Writing UNBOUNDED in FROM
-  signifies -infinity as the lower bound of the
-  corresponding column, whereas when written in TO,
-  it signifies +infinity as the upper bound.
-  All items following an UNBOUNDED item within
-  a FROM or TO list must also
-  be UNBOUNDED.
+  The special values MINVALUE and MAXVALUE
+  may be used when creating a range partition to indicate that there
+  is no lower or upper bound on the column's value. For example, a
+  partition defined using FROM (MINVALUE) TO (10) allows
+  any values less than 10, and a partition defined using
+  FROM (10) TO (MAXVALUE) allows any values greater than
+  or equal to 10.
+ 
+
+ 
+  When creating a range partition involving more than one column, it
+  can also make sense to use MAXVALUE as part of the lower
+  bound, and MINVALUE as part of the upper bound. For
+  example, a partition defined using
+  FROM (0, MAXVALUE) TO (10, MAXVALUE) allows any rows
+  where the first partitioned column is greater than 0 and less than
+  or equal to 10. Similarly, a partition defined using
+  FROM ('a', MINVALUE) TO ('b', MINVALUE)
+  allows only rows where the first partitioned column starts with "a".
+ 
+
+ 
+  Note that any values after MINVALUE or
+  MAXVALUE in a partition bound are ignored; so the bound
+  (10, 

Re: [HACKERS] Error while copying a large file in pg_rewind

2017-07-06 Thread Michael Paquier
On Thu, Jul 6, 2017 at 8:51 PM, Dilip Kumar  wrote:
> On Thu, Jul 6, 2017 at 4:18 PM, Kuntal Ghosh  
> wrote:
>> But, I'm little concerned/doubt regarding the following part of the code.
>> +/*
>> + * Converts an int64 from network byte order to native format.
>> + */
>> +static int64
>> +pg_recvint64(int64 value)
>> +{
>> +   union
>> +   {
>> +   int64   i64;
>> +   uint32  i32[2];
>> +   } swap;
>> +   int64   result;
>> +
>> +   swap.i64 = value;
>> +
>> +   result = (uint32) ntohl(swap.i32[0]);
>> +   result <<= 32;
>> +   result |= (uint32) ntohl(swap.i32[1]);
>> +
>> +   return result;
>> +}
>> Does this always work correctly irrespective of the endianess of the
>> underlying system?
>
> I think this will have problem, we may need to do like
>
> and reverse complete array if byte order is changed

This comes from the implementation of 64-bit large objects, which was
first broken on big-endian systems until Tom fixed it with the following
commit, and this looks fine to me:
commit: 26fe56481c0f7baa705f0b3265b5a0676f894a94
author: Tom Lane 
date: Mon, 8 Oct 2012 18:24:32 -0400
Code review for 64-bit-large-object patch.

At some point it would really make sense to group all these things under
the same banner (64-bit LOs, pg_basebackup, and now pg_rewind).

> or we can use something like "be64toh"

That's even less portable. For example, macOS would need a macro mapping
to OSSwapBigToHostInt64 or OSSwapLittleToHostInt64 from OSByteOrder.h.
Brr.
-- 
Michael




Re: [HACKERS] New partitioning - some feedback

2017-07-06 Thread Amit Langote
Hi Mark,

On 2017/07/07 9:02, Mark Kirkwood wrote:
> I've been trying out the new partitioning in version 10. Firstly, I must
> say this is excellent - so much nicer than the old inheritance based method!

Thanks. :)

> My only niggle is the display of partitioned tables via \d etc. e.g:
>
> part=# \d
> List of relations
>  Schema | Name | Type  |  Owner
> +--+---+--
>  public | date_fact| table | postgres
>  public | date_fact_201705 | table | postgres
>  public | date_fact_201706 | table | postgres
>  public | date_fact_20170601   | table | postgres
>  public | date_fact_2017060100 | table | postgres
>  public | date_fact_201707 | table | postgres
>  public | date_fact_rest   | table | postgres
> (7 rows)
> 
> Now it can be inferred from the names that date_fact is a partitioned
> table and the various date_fact_ are its partitions - but \d is not
> providing any hints of this. The more detailed individual describe is fine:
> 
> part=# \d date_fact
>   Table "public.date_fact"
>  Column |   Type   | Collation | Nullable | Default
> +--+---+--+-
>  id | integer  |   | not null |
>  dte| timestamp with time zone |   | not null |
>  val| integer  |   | not null |
> Partition key: RANGE (dte)
> Number of partitions: 6 (Use \d+ to list them.)
> 
> I'd prefer *not* to see a table and its partitions all intermixed in the
> same display (especially with nothing indicating which are partitions) -
> as this will make for unwieldy long lists when tables have many
> partitions. Also it would be good if the 'main' partitioned table and its
> 'partitions' showed up as a different type in some way.
> I note the they do in pg_class:
> 
> part=# SELECT relname,relkind,relispartition FROM pg_class WHERE relname
> LIKE 'date_fact%';
>relname| relkind | relispartition
> --+-+
>  date_fact| p   | f
>  date_fact_201705 | r   | t
>  date_fact_201706 | r   | t
>  date_fact_20170601   | r   | t
>  date_fact_2017060100 | r   | t
>  date_fact_201707 | r   | t
>  date_fact_rest   | r   | t
> (7 rows)
> 
> ...so it looks to be possible to hide the partitions from the main display
> and/or mark them as such. Now I realize that making this comment now that
> beta is out is a bit annoying - apologies, but I think seeing a huge list
> of 'tables' is going to make \d frustrating for folk doing partitioning.

Someone complained about this a while back [1].  And then it came up again
[2], where Noah appeared to take the stance that partitions should be
visible in views / output of commands that list "tables".

Although I too tend to prefer not filling up the \d output by listing
partitions (pg_class.relispartition = true relations), there perhaps
wasn't enough of a push to create a patch for that.  If some committer
is willing to consider such a patch, I can make one.
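
In the meantime, a rough way to get a partition-free listing from SQL,
sketched from the pg_class query above (adjust the schema filter to taste):

    -- List plain and partitioned tables, hiding relations that are
    -- themselves partitions (relispartition = true).  psql's \d does not
    -- do this today; this is only an illustration.
    SELECT n.nspname AS schema,
           c.relname AS name,
           CASE c.relkind WHEN 'p' THEN 'partitioned table' ELSE 'table' END AS type
    FROM pg_class c
         JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE c.relkind IN ('r', 'p')
      AND NOT c.relispartition
      AND n.nspname NOT IN ('pg_catalog', 'information_schema');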

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CAM-w4HOZ5fPS7GoCTTrW42q01%2BwPrOWFCnr9H0iDyVTZP2H1CA%40mail.gmail.com

[2]
https://www.postgresql.org/message-id/20170406070227.GA2741046%40tornado.leadboat.com





Re: [HACKERS] Out of date comment in predicate.c

2017-07-06 Thread Thomas Munro
On Sat, Jul 1, 2017 at 6:38 AM, Peter Eisentraut
 wrote:
> On 6/27/17 01:21, Thomas Munro wrote:
>> Commit ea9df812d8502fff74e7bc37d61bdc7d66d77a7f got rid of
>> FirstPredicateLockMgrLock, but it's still referred to in a comment in
>> predicate.c where the locking protocol is documented.  I think it's
>> probably best to use the name of the macro that's usually used to
>> access the lock array in the code.  Please see attached.
>
> Does this apply equally to PredicateLockHashPartitionLock() and
> PredicateLockHashPartitionLockByIndex()?  Should the comment mention or
> imply both?

Yeah, I guess so.  How about listing the hashcode variant, since of the
two it's the more commonly used and the more important for a reader to
understand, and mentioning the ByIndex variant in a bullet point below?
Like this.

-- 
Thomas Munro
http://www.enterprisedb.com


fix-comments-v2.patch
Description: Binary data



Re: [HACKERS] Multi column range partition table

2017-07-06 Thread Amit Langote
On 2017/07/06 18:30, Dean Rasheed wrote:
> On 5 July 2017 at 10:43, Amit Langote  wrote:
>> 0001 is your patch to tidy up check_new_partition_bound()  (must be
>> applied before 0002)
>>
> 
> I pushed this first patch, simplifying check_new_partition_bound() for
> range partitions, since it seemed like a good simplification, but note
> that I don't think that was actually the cause of the latent bug you
> saw upthread.

I like how simple check_new_partition_bound() has now become.

> I think the real issue was in partition_rbound_cmp() -- normally, if
> the upper bound of one partition coincides with the lower bound of
> another, that function would report the upper bound as the smaller
> one, but that logic breaks if any of the bound values are infinite,
> since then it will exit early, returning 0, without ever comparing the
> "lower" flags on the bounds.
> 
> I'm tempted to push a fix for that independently, since it's a bug
> waiting to happen, even though it's not possible to hit it currently.

Oops, you're right.  Thanks for the fix.

Regards,
Amit





[HACKERS] New partitioning - some feedback

2017-07-06 Thread Mark Kirkwood
I've been trying out the new partitioning in version 10. Firstly, I must
say this is excellent - so much nicer than the old inheritance-based method!


My only niggle is the display of partitioned tables via \d etc. e.g:

part=# \d
                  List of relations
 Schema |         Name         | Type  |  Owner
--------+----------------------+-------+----------
 public | date_fact            | table | postgres
 public | date_fact_201705     | table | postgres
 public | date_fact_201706     | table | postgres
 public | date_fact_20170601   | table | postgres
 public | date_fact_2017060100 | table | postgres
 public | date_fact_201707     | table | postgres
 public | date_fact_rest       | table | postgres
(7 rows)

Now it can be inferred from the names that date_fact is a partitioned 
table and the various date_fact_ are its partitions - but \d is not 
providing any hints of this. The more detailed individual describe is fine:


part=# \d date_fact
  Table "public.date_fact"
 Column |   Type   | Collation | Nullable | Default
+--+---+--+-
 id | integer  |   | not null |
 dte| timestamp with time zone |   | not null |
 val| integer  |   | not null |
Partition key: RANGE (dte)
Number of partitions: 6 (Use \d+ to list them.)

I'd prefer *not* to see a table and its partitions all intermixed in the
same display (especially with nothing indicating which are partitions), as
this will make for unwieldy, long lists when tables have many partitions.
It would also be good if the 'main' partitioned table and its 'partitions'
showed up as different types in some way.


I note that they do in pg_class:

part=# SELECT relname, relkind, relispartition FROM pg_class
       WHERE relname LIKE 'date_fact%';
       relname        | relkind | relispartition
----------------------+---------+----------------
 date_fact            | p       | f
 date_fact_201705     | r       | t
 date_fact_201706     | r       | t
 date_fact_20170601   | r       | t
 date_fact_2017060100 | r       | t
 date_fact_201707     | r       | t
 date_fact_rest       | r       | t
(7 rows)

...so it looks to be possible to hide the partitions from the main 
display and/or mark them as such. Now I realize that making this comment 
now that beta is out is a bit annoying - apologies, but I think seeing a 
huge list of 'tables' is going to make \d frustrating for folk doing 
partitioning.


regards

Mark





Re: [HACKERS] Multi column range partition table

2017-07-06 Thread Joe Conway
On 07/06/2017 01:24 PM, Dean Rasheed wrote:
> On 6 July 2017 at 21:04, Tom Lane  wrote:
>> Dean Rasheed  writes:
>>> However, this is also an incompatible syntax change, and any attempt
>>> to support both the old and new syntaxes is likely to be messy, so we
>>> really need to get consensus on whether this is the right thing to do,
>>> and whether it *can* be done now for PG10.
>>
>> FWIW, I'd much rather see us get it right the first time than release
>> PG10 with a syntax that we'll regret later.  I do not think that beta2,
>> or even beta3, is too late for such a change.
>>
>> I'm not taking a position on whether this proposal is actually better
>> than what we have.  But if there's a consensus that it is, we should
>> go ahead and do it, not worry that it's too late.
>>
> 
> OK, thanks. That's good to know.

I agree we should get this right the first time, and I also agree with
Dean's proposal, so I guess I'm a +2.

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development



signature.asc
Description: OpenPGP digital signature


Re: [HACKERS] WIP patch: distinguish selectivity of < from <= and > from >=

2017-07-06 Thread Tom Lane
Kuntal Ghosh  writes:
> On Thu, Jul 6, 2017 at 3:45 AM, Tom Lane  wrote:
> + * In the first bin (i==1), add a fudge factor that ensures
> + * that histfrac is at least eq_selec.  We do this because we
> + * know that the first histogram entry does satisfy the
> + * inequality (if !isgt) or not satisfy it (if isgt), so our
> + * estimate here should certainly not be zero even if binfrac
> + * is zero.  (XXX experimentally this is the correct way to do
> + * it, but why isn't it a linear adjustment across the whole
> + * histogram rather than just the first bin?)

> Given that the values are distinct, (I've some doubts for real number case)

Well, really, we are *always* dealing with discrete distributions here.

> if histogram_bounds are assigned as,
> {0,9,19,29,39,49,59,69,79,89,99,109,119,129,13,..}
> ...
> So, if val=low, then hisfrac = (bucket_current - 1)/num_of_buckets
> which means it assumes val is included in the previous bucket.

I looked at that again and realized that one of the answers I was missing
lies in the behavior of analyze.c's compute_scalar_stats().  When it has
collected "nvals" values and wants to make a histogram containing
"num_hist" entries, it does this:

 * The object of this loop is to copy the first and last values[]
 * entries along with evenly-spaced values in between.  So the
 * i'th value is values[(i * (nvals - 1)) / (num_hist - 1)].

(where i runs from 0 to num_hist - 1).  Because the "/" denotes integer
division, this effectively means that for all but the first entry,
we are taking the last value out of the corresponding bucket of samples.
That's obviously true for the last histogram entry, which will be the very
last sample value, and it's also true for earlier ones --- except for the
*first* histogram entry, which is necessarily the first one in its bucket.
So the first histogram bucket is actually a tad narrower than the others,
which is visible in the typical data you showed above: 0 to 9 is only 9
counts wide, but all the remaining buckets are 10 counts wide.  That
explains why we need a scale adjustment just in the first bucket.
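
To see that concretely, here is a small reproduction sketch (the table,
column, and statistics target are made up; a per-column target of 10 allows
up to 11 histogram entries):

    -- 1000 rows over the distinct values 0 .. 99.
    CREATE TABLE histdemo AS
        SELECT (g % 100) AS x FROM generate_series(1, 1000) AS g;
    ALTER TABLE histdemo ALTER COLUMN x SET STATISTICS 10;
    ANALYZE histdemo;

    -- Expect bounds along the lines of {0,9,19,29,...,89,99}: the first
    -- bucket is one count narrower than the rest, as described above.
    SELECT histogram_bounds
    FROM pg_stats
    WHERE tablename = 'histdemo' AND attname = 'x';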

I think I'm also beginning to grasp why scalarineqsel's basic estimation
process produces an estimate for "x <= p" rather than some other condition
such as "x < p".  For clarity, imagine that we're given p equal to the
last histogram entry.  If the test operator is in fact "<=", then it will
say "true" for every histogram entry, and we'll fall off the end of the
histogram and return 1.0, which is correct.  If the test operator is "<",
it will return "true" for all but the last entry, so that we end up with
"i" equal to sslot.nvalues - 1 --- but we will compute binfrac = 1.0,
if convert_to_scalar() produces sane answers, again resulting in
histfrac = 1.0.  Similar reasoning applies for ">" and ">=", so that in
all cases we arrive at histfrac = 1.0 if p equals the last histogram
entry.  More generally, if p is equal to some interior histogram entry,
say the k'th one out of n total, we end up with histfrac = (k-1)/(n-1)
no matter which operator we probe with, assuming that convert_to_scalar()
correctly gives us a binfrac of 0.0 or 1.0 depending on which of the
adjacent bins we end up examining.  So, remembering that the histogram
entries occupy the right ends of their buckets, the proper interpretation
of that is that it's the probability of "x <= p".  (This is wrong for the
first histogram entry, but that's why we need a correction there.)

regards, tom lane




Re: [HACKERS] Multi column range partition table

2017-07-06 Thread Dean Rasheed
On 6 July 2017 at 21:04, Tom Lane  wrote:
> Dean Rasheed  writes:
>> However, this is also an incompatible syntax change, and any attempt
>> to support both the old and new syntaxes is likely to be messy, so we
>> really need to get consensus on whether this is the right thing to do,
>> and whether it *can* be done now for PG10.
>
> FWIW, I'd much rather see us get it right the first time than release
> PG10 with a syntax that we'll regret later.  I do not think that beta2,
> or even beta3, is too late for such a change.
>
> I'm not taking a position on whether this proposal is actually better
> than what we have.  But if there's a consensus that it is, we should
> go ahead and do it, not worry that it's too late.
>

OK, thanks. That's good to know.

Regards,
Dean




Re: [HACKERS] Multi column range partition table

2017-07-06 Thread Tom Lane
Dean Rasheed  writes:
> However, this is also an incompatible syntax change, and any attempt
> to support both the old and new syntaxes is likely to be messy, so we
> really need to get consensus on whether this is the right thing to do,
> and whether it *can* be done now for PG10.

FWIW, I'd much rather see us get it right the first time than release
PG10 with a syntax that we'll regret later.  I do not think that beta2,
or even beta3, is too late for such a change.

I'm not taking a position on whether this proposal is actually better
than what we have.  But if there's a consensus that it is, we should
go ahead and do it, not worry that it's too late.

regards, tom lane




Re: [HACKERS] Multi column range partition table

2017-07-06 Thread Dean Rasheed
On 5 July 2017 at 18:07, Dean Rasheed  wrote:
> So if we were to go for maximum flexibility and compatibility with
> Oracle, then perhaps what we would do is more like the original idea
> of UNBOUNDED ABOVE/BELOW, except call them MINVALUE and MAXVALUE,
> which conveniently are already unreserved keywords, as well as being
> much shorter. Plus, we would also relax the constraint about having
> finite values after MINVALUE/MAXVALUE.
>

So I know that I have flip-flopped a few times on this now, but I'm
now starting to think that this approach, replacing UNBOUNDED with
MINVALUE and MAXVALUE, is the best way to go, along with permitting
finite values after MINVALUE/MAXVALUE.

This gives the greatest flexibility, it's not too verbose, and it
makes it easy to define contiguous sets of partitions just by making
the lower bound of one match the upper bound of another.

With this approach, any partition bounds that Oracle allows are also
valid in PostgreSQL (not that I would normally give too much weight to
that, but it is, I think, quite a nice syntax). Of course, we also
support things that Oracle doesn't allow, such as MINVALUE and gaps
between partitions.
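
As a sketch of the multi-column case (again using the proposed syntax, so
this assumes the patch; table names are made up), contiguous partitions fall
out of making one partition's lower bound equal to its neighbour's upper
bound:

    CREATE TABLE t (a int, b int) PARTITION BY RANGE (a, b);

    -- Rows with a > 0 and a <= 10 (any b), per the proposed semantics.
    CREATE TABLE t_0_10 PARTITION OF t
        FOR VALUES FROM (0, MAXVALUE) TO (10, MAXVALUE);

    -- Contiguous with the above: rows with a > 10 and a <= 20.
    CREATE TABLE t_10_20 PARTITION OF t
        FOR VALUES FROM (10, MAXVALUE) TO (20, MAXVALUE);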

Parts of the patch are similar to your UNBOUNDED ABOVE/BELOW patch,
but there are a number of differences -- most notably, I replaced the
"infinite" boolean flag on PartitionRangeDatum with a 3-value enum and
did away with all the DefElem nodes and the associated special string
constants being copied and compared.


However, this is also an incompatible syntax change, and any attempt
to support both the old and new syntaxes is likely to be messy, so we
really need to get consensus on whether this is the right thing to do,
and whether it *can* be done now for PG10.

Regards,
Dean


Replace_UNBOUNDED_with_MINVALUE_and_MAXVALUE.patch
Description: Binary data



Re: [HACKERS] [COMMITTERS] pgsql: pg_test_timing: Add NLS

2017-07-06 Thread Alvaro Herrera
Peter Eisentraut wrote:
> pg_test_timing: Add NLS
> 
> Also straighten out use of time unit abbreviations a bit.

We (well, Carlos Chapi, who's doing the translation work now) just
noticed that this has a bug in this line

+   printf("%6s   %10s %10s\n", _("< us"), _("% of total"), _("count"));

_() marks the strings with the c-format flag, which means that the
%-specifiers are checked by gettext, but the % in the third literal is
not a printf specifier.  So there's no correct way to write the
translation.  We need to use a different xgettext trigger there, one
that doesn't set c-format, but I don't know what.

Babel is now complaining:

/home/nlsuser/admin/wwwtools/scm/postgresql-master/src/bin/pg_test_timing/po/es.po:73:
 format specifications in 'msgid' and 'msgstr' for argument 1 are not the same

where the line is

#: pg_test_timing.c:181
#, c-format
msgid "% of total"
msgstr "% del total"

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Rust bindings to pgtypes lib

2017-07-06 Thread Andres Freund
Hi,

On 2017-07-06 20:26:29 +0200, Michael Meskes wrote:
> > But is this possible, or feasible? I see that the makefile refers to 
> 
> Possible yes, but in general I'm not a big fan of duplicating code. I
> spend too much time to keep those copies in sync.

Indeed. I'm quite strongly against exposing / using pgtypeslib in more
places; if anything, it should be phased out, because that code is
definitely not always kept up to date, and it's a lot of work to do so.
If anything, the code should be rewritten to *not* require so much
duplication; then we could talk.


> > How much of the source tree would I have to carve out?
> > Or from another perspective; how do other language (if any) solve this?
> 
> I'm not aware of any other language binding for pgtypeslib.

Some people use http://libpqtypes.esilo.com/

Greetings,

Andres Freund




Re: [HACKERS] Rust bindings to pgtypes lib

2017-07-06 Thread Michael Meskes
> But is this possible, or feasible? I see that the makefile refers to 

Possible, yes, but in general I'm not a big fan of duplicating code. I
spend too much time keeping those copies in sync.

> Makefile.global, and also includes stuff from ecpg, aat least. I don't want 
> to 

Yes, but these should be minor. Makefile.global is mostly included for
some rules, and other than some includes I don't know what else is taken
from ecpg.

However, there are a couple of small files that are taken from the
backend source tree, like pgstrcasecmp.c.

> How much of the source tree would I have to carve out?
> Or from another perspective; how do other language (if any) solve this?

I'm not aware of any other language binding for pgtypeslib.

Michael
-- 
Michael Meskes
Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org)
Meskes at (Debian|Postgresql) dot Org
Jabber: michael at xmpp dot meskes dot org
VfL Borussia! Força Barça! Go SF 49ers! Use Debian GNU/Linux, PostgreSQL




Re: [HACKERS] Challenges preventing us moving to 64 bit transaction id (XID)?

2017-07-06 Thread Greg Stark
On 6 July 2017 at 15:29, Jim Finnerty  wrote:
>
> Feel free to knock down this 'straw man' and propose something better!

I think the pattern in this design that we don't want is that it
imposes extra complexity on every user of every page even when the
page doesn't have the problem and even when the problem isn't anywhere
in the database. Even years from now when this problem is long gone
you'll have code paths for dealing with this special page format that
are rarely executed and never tested that will have to be maintained
blind.

Ideally, a solution to this problem that imposes a cost only on the
weird pages, and only temporarily, and that leaves the database in a
"consistent" state that doesn't require any special processing when
reading the data, would be better.

The "natural" solution is what was discussed for incompatible page
format changes in the past where there's an point release of one
Postgres version that tries to ensure there's enough space on the page
for the next version and keeps track of whether there are any
problematic pages. Then you would be blocked from upgrading until you
had ensured all pages had space (presumably by running some special
"vacuum upgrade" or something like that).

Incidentally it's somewhat intriguing to think about what would happen
if we *always* did such a tombstone for deletes. Or perhaps only when
it's a full_page_write. Since the whole page is going into the log and
that tuple will never be modified again you could imagine just
replacing the tuple with the LSN of the deletion and letting anyone
who really needs it fetch it from the xlog. That would be a completely
different model from the way Postgres works though. More like a
log-structured storage system.
-- 
greg




Re: [HACKERS] More race conditions in logical replication

2017-07-06 Thread Petr Jelinek
On 06/07/17 17:33, Petr Jelinek wrote:
> On 03/07/17 01:54, Tom Lane wrote:
>> I noticed a recent failure that looked suspiciously like a race condition:
>>
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet=2017-07-02%2018%3A02%3A07
>>
>> The critical bit in the log file is
>>
>> error running SQL: 'psql::1: ERROR:  could not drop the replication 
>> slot "tap_sub" on publisher
>> DETAIL:  The error was: ERROR:  replication slot "tap_sub" is active for PID 
>> 3866790'
>> while running 'psql -XAtq -d port=59543 host=/tmp/QpCJtafT7R 
>> dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'DROP SUBSCRIPTION 
>> tap_sub' at 
>> /home/nm/farm/xlc64/HEAD/pgsql.build/src/test/subscription/../../../src/test/perl/PostgresNode.pm
>>  line 1198.
>>
>> After poking at it a bit, I found that I can cause several different
>> failures of this ilk in the subscription tests by injecting delays at
>> the points where a slot's active_pid is about to be cleared, as in the
>> attached patch (which also adds some extra printouts for debugging
>> purposes; none of that is meant for commit).  It seems clear that there
>> is inadequate interlocking going on when we kill and restart a logical
>> rep worker: we're trying to start a new one before the old one has
>> gotten out of the slot.
>>
> 
> Thanks for the test case.
> 
> It's not actually that we start new worker fast. It's that we try to
> drop the slot right after worker process was killed but if the code that
> clears slot's active_pid takes too long, it still looks like it's being
> used. I am quite sure it's possible to make this happen for physical
> replication as well when using slots.
> 
> This is not something that can be solved by locking on subscriber. ISTM
> we need to make pg_drop_replication_slot behave more nicely, like making
> it wait for the slot to become available (either by default or as an
> option).
> 
> I'll have to think about how to do it without rewriting half of
> replication slots or reimplementing lock queue though because the
> replication slots don't use normal catalog access so there is no object
> locking with wait queue. We could use some latch wait with small timeout
> but that seems ugly as that function can be called by user without
> having dropped the slot before so the wait can be quite long (as in
> "forever").
> 

A naive fix would be something like the attached. But as I said, it's
not exactly pretty.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services
>From 6d016a3fad8635c2366bcf5438d49f41c905621c Mon Sep 17 00:00:00 2001
From: Petr Jelinek 
Date: Thu, 6 Jul 2017 18:16:44 +0200
Subject: [PATCH] Wait for slot to become free when dropping it

---
 src/backend/replication/logical/logicalfuncs.c |  2 +-
 src/backend/replication/slot.c | 51 ++
 src/backend/replication/slotfuncs.c|  2 +-
 src/backend/replication/walsender.c|  6 +--
 src/include/replication/slot.h |  4 +-
 5 files changed, 50 insertions(+), 15 deletions(-)

diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index 363ca82..a3ba2b1 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -244,7 +244,7 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 	else
 		end_of_wal = GetXLogReplayRecPtr();
 
-	ReplicationSlotAcquire(NameStr(*name));
+	ReplicationSlotAcquire(NameStr(*name), true);
 
 	PG_TRY();
 	{
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index dc7de20..ab115f4 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -46,6 +46,7 @@
 #include "pgstat.h"
 #include "replication/slot.h"
 #include "storage/fd.h"
+#include "storage/ipc.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
 #include "utils/builtins.h"
@@ -323,7 +324,7 @@ ReplicationSlotCreate(const char *name, bool db_specific,
  * Find a previously created slot and mark it as used by this backend.
  */
 void
-ReplicationSlotAcquire(const char *name)
+ReplicationSlotAcquire(const char *name, bool nowait)
 {
 	ReplicationSlot *slot = NULL;
 	int			i;
@@ -331,6 +332,8 @@ ReplicationSlotAcquire(const char *name)
 
 	Assert(MyReplicationSlot == NULL);
 
+retry:
+
 	/* Search for the named slot and mark it active if we find it. */
 	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
 	for (i = 0; i < max_replication_slots; i++)
@@ -350,16 +353,48 @@ ReplicationSlotAcquire(const char *name)
 	}
 	LWLockRelease(ReplicationSlotControlLock);
 
-	/* If we did not find the slot or it was already active, error out. */
+	/* If we did not find the slot, error out. */
 	if (slot == NULL)
 		ereport(ERROR,
 (errcode(ERRCODE_UNDEFINED_OBJECT),
  errmsg("replication slot \"%s\" does not exist", name)));
+
+	/*
+	 

[HACKERS] paths in partitions of a dummy partitioned table

2017-07-06 Thread Ashutosh Bapat
If a partitioned table is proven dummy, set_rel_pathlist() doesn't mark the
partition relations dummy and thus doesn't set any (dummy) paths in the
partition relations. The lack of paths in the partitions means that we cannot
use partition-wise join while joining this table with some other similarly
partitioned table, since the partitions do not have any paths that child-joins
can use. This means that we cannot create partition relations for such a join
and thus cannot consider the join to be partitioned. This doesn't matter much
when the dummy relation is the outer relation, since the resultant join is also
dummy. But when the dummy relation is the inner relation, the resultant join is
not dummy and can be considered to be partitioned, with the same partitioning
scheme as the outer relation, to be joined with other similarly partitioned
tables. Not having paths in the partitions deprives us of this future
optimization. Without partition-wise join, not setting dummy paths in the
partition relations makes sense, since they have no use; but with
partition-wise join, that changes.
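
For context, a tiny example of the case in question (names are made up); the
parent below is proven empty, so today its partitions never get any paths:

    CREATE TABLE pt (a int) PARTITION BY RANGE (a);
    CREATE TABLE pt1 PARTITION OF pt FOR VALUES FROM (0) TO (100);

    -- The constant-false qual lets the planner prove the whole appendrel
    -- empty ("dummy"), so set_rel_pathlist() stops before touching pt1.
    EXPLAIN SELECT * FROM pt WHERE false;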

If we always mark the partitions dummy, that effort may be wasted if the
partitioned table doesn't participate in a partition-wise join. A possible
solution would be to set the partitions dummy only when the partitioned table
participates in a join, but that happens during make_join_rel(), well after
we have set up the simple relations. So I am not sure whether that's a good
option, but I am open to suggestions.

What if we always mark the partition relations of a dummy partitioned table
dummy? I tried the attached patch on a table with a thousand partitions; the
planning time increased by 3-5%. That doesn't look too bad for a thousand
partitions.

Any other options/comments?

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index f087ddb..4af95b9 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -421,7 +421,7 @@ static void
 set_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
  Index rti, RangeTblEntry *rte)
 {
-	if (IS_DUMMY_REL(rel))
+	if (IS_DUMMY_REL(rel) && !rte->inh)
 	{
 		/* We already proved the relation empty, so nothing more to do */
 	}
@@ -1244,6 +1244,8 @@ set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 		/*
 		 * Compute the child's access paths.
 		 */
+		if (IS_DUMMY_REL(rel))
+			set_dummy_rel_pathlist(childrel);
 		set_rel_pathlist(root, childrel, childRTindex, childRTE);
 
 		/*
@@ -1258,6 +1260,12 @@ set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 		live_childrels = lappend(live_childrels, childrel);
 	}
 
+	if (!live_childrels)
+	{
+		Assert(IS_DUMMY_REL(rel));
+		return;
+	}
+
 	/* Add paths to the "append" relation. */
 	add_paths_to_append_rel(root, rel, live_childrels);
 }



Re: [HACKERS] More race conditions in logical replication

2017-07-06 Thread Petr Jelinek
On 03/07/17 01:54, Tom Lane wrote:
> I noticed a recent failure that looked suspiciously like a race condition:
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet=2017-07-02%2018%3A02%3A07
> 
> The critical bit in the log file is
> 
> error running SQL: 'psql::1: ERROR:  could not drop the replication 
> slot "tap_sub" on publisher
> DETAIL:  The error was: ERROR:  replication slot "tap_sub" is active for PID 
> 3866790'
> while running 'psql -XAtq -d port=59543 host=/tmp/QpCJtafT7R 
> dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'DROP SUBSCRIPTION 
> tap_sub' at 
> /home/nm/farm/xlc64/HEAD/pgsql.build/src/test/subscription/../../../src/test/perl/PostgresNode.pm
>  line 1198.
> 
> After poking at it a bit, I found that I can cause several different
> failures of this ilk in the subscription tests by injecting delays at
> the points where a slot's active_pid is about to be cleared, as in the
> attached patch (which also adds some extra printouts for debugging
> purposes; none of that is meant for commit).  It seems clear that there
> is inadequate interlocking going on when we kill and restart a logical
> rep worker: we're trying to start a new one before the old one has
> gotten out of the slot.
> 

Thanks for the test case.

It's not actually that we start the new worker too fast. It's that we try
to drop the slot right after the worker process was killed, but if the code
that clears the slot's active_pid takes too long, the slot still looks like
it's being used. I am quite sure it's possible to make this happen for
physical replication as well when using slots.

This is not something that can be solved by locking on the subscriber. ISTM
we need to make pg_drop_replication_slot behave more nicely, like making
it wait for the slot to become available (either by default or as an
option).

I'll have to think about how to do it without rewriting half of the
replication slot code or reimplementing a lock queue, though, because
replication slots don't use normal catalog access, so there is no object
locking with a wait queue. We could use a latch wait with a small timeout,
but that seems ugly, as that function can be called by a user even when the
slot hasn't been released, so the wait could be quite long (as in
"forever").

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Bug in Physical Replication Slots (at least 9.5)?

2017-07-06 Thread Ryan Murphy
Poking this.  Looking back through the discussion, this issue has been 
reproduced by multiple people.  The patch still applies to HEAD without issues. 
 I have no experience with PostgreSQL replication, so I'm not qualified to 
really review this.  From what I can see with the patch, it's just a small 
block of code added to /src/backend/replication/walreceiver.c to handle some 
edge case where the WAL file no longer exists or something.

I think one thing that would help move this forward is if we edited the patch 
to include a comment explaining why this new code is necessary.  There's lots 
of great discussion on this issue in the email list, so if a summary of that 
gets into the code I think it would make the patch easier to understand and 
make the new walreceiver.c less confusing.


[HACKERS] Rust bindings to pgtypes lib

2017-07-06 Thread Kaare Rasmussen
Hi

For learning purposes (and because I probably need it) , I've started to make 
Rust bindings to the pgtypes Library.

One problem I'm thinking about right now is how to know if, and where, the 
library and the include files are installed. One way to avoid that problem is 
to include the source of pgtypeslib in my project, and let the rust build 
system take care of it *.

But is this possible, or feasible? I see that the makefile refers to 
Makefile.global, and also includes stuff from ecpg, aat least. I don't want to 
include the whole pg source tree, just the nice, safe, little corner that 
would be used for this purpose.

How much of the source tree would I have to carve out?
Or from another perspective; how do other language (if any) solve this?

* Not sure it can, but other rust crates seem to do it that way, e.g. https://
github.com/alexcrichton/bzip2-rs

-- 
Med venlig hilsen
Kaare Rasmussen, Jasonic

Jasonic:  Nordre Fasanvej 12, 2000 Frederiksberg





Re: [HACKERS] pgsql 10: hash indexes testing

2017-07-06 Thread AP
On Wed, Jul 05, 2017 at 07:31:39PM +1000, AP wrote:
> On Tue, Jul 04, 2017 at 08:23:20PM -0700, Jeff Janes wrote:
> > On Tue, Jul 4, 2017 at 3:57 AM, AP  wrote:
> > > The data being indexed is BYTEA, (quasi)random and 64 bytes in size.
> > > The table has over 2 billion entries. The data is not unique. There's
> > > an average of 10 duplicates for every unique value.
> > 
> > What is the number of duplicates for the most common value?
> 
> Damn. Was going to collect this info as I was doing a fresh upload but
> it fell through the cracks of my mind. It'll probably take at least
> half a day to collect (a simple count(*) on the table takes 1.5-1.75
> hours parallelised across 11 processes) so I'll probably have this in
> around 24 hours if all goes well. (and I don't stuff up the SQL :) )

Well...

 num_ids |  count
---------+----------
       1 | 91456442
       2 | 56224976
       4 | 14403515
      16 | 13665967
       3 | 12929363
      17 | 12093367
      15 | 10347006


So the most common is a unique value, then a dupe.
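
(For anyone wanting to reproduce that breakdown, the query shape is roughly
the following sketch; the table and column names are the ones given later in
the thread:)

    -- How many distinct datum_id values occur exactly N times?
    SELECT num_ids, count(*) AS count
    FROM (SELECT datum_id, count(*) AS num_ids
          FROM link
          GROUP BY datum_id) AS per_value
    GROUP BY num_ids
    ORDER BY count DESC;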

AP.




Re: [HACKERS] pgsql 10: hash indexes testing

2017-07-06 Thread AP
On Thu, Jul 06, 2017 at 12:38:38PM +0530, Amit Kapila wrote:
> On Thu, Jul 6, 2017 at 9:32 AM, AP  wrote:
> > On Thu, Jul 06, 2017 at 08:52:03AM +0530, Amit Kapila wrote:
> >> On Thu, Jul 6, 2017 at 2:40 AM, AP  wrote:
> >> > On Wed, Jul 05, 2017 at 05:52:32PM +0530, Amit Kapila wrote:
> >> >> >> >  version | bucket_pages | overflow_pages | bitmap_pages | 
> >> >> >> > unused_pages | live_items | dead_items |   free_percent
> >> >> >> > -+--++--+--+++--
> >> >> >> >3 | 10485760 |2131192 |   66 |  
> >> >> >> >   0 | 2975444240 |  0 | 1065.19942179026
> >> >> >> > (1 row)
> >> > ...
> >> >> >> > And I do appear to have an odd percentage of free space. :)
> >> >>
> >> >> Are you worried about "unused_pages"? If so, then this is not a major
> >> >
> >> > Nope. "free_percent" Just a bit weird that I have it at over 1000% free. 
> >> > :)
> >> > Shouldn't that number be < 100?
> >>
> >> Yes, there seems to be some gotcha in free percent calculation.  Is it
> >> possible for you to debug or in some way share the test?
> >
> > I can try to debug but I need to know what to look for and how.
> 
> Okay,  you need to debug function pgstathashindex and have your
> breakpoint at free_percent calculation, then try to get the values of
> nblocks, all the values in stats struct and total_space.  I think
> after getting this info we can further decide what to look for.

Ok. I'll try and get to this tomorrow amidst fun with NFS. Hopefully
there'll be time.

So... I'll need

postgresql-10-dbg - debug symbols for postgresql-10

Then given 
https://doxygen.postgresql.org/pgstatindex_8c.html#af86e3b4c40779d4f30a73b0bfe06316f
set a breakpoint at pgstatindex.c:710 via gdb and then have fun with
print?

> > As for sharing the test, that'd mean sharing the data. If it helps I can
> > provide the content of that column but you're looking at an sql dump that
> > is roughly (2*64+1)*2.3 billion (give or take a (few) billion) in size. :)
> 
> This is tricky, will Ibe able to import that column values by creating
> table, if so, then probably it is worth.

Should do. Thinking about it a little more, I can shrink the file down by
roughly half if I don't do a pg_dump or similar. Doing

COPY link (datum_id) TO '/tmp/moocow.copy' WITH (FORMAT BINARY)

should allow you to use COPY FROM to restore the file and produce something
a lot smaller than a dump, right?

The table is simple:

CREATE TABLE link (datum_id BYTEA);
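
For completeness, the corresponding restore on your side would be something
along the lines of (same path as above):

    COPY link (datum_id) FROM '/tmp/moocow.copy' WITH (FORMAT BINARY);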

I can't give you the rest of the table (one other column) as the stuff hidden
in there is private.

The only thing that won't give you is the manner in which the column is filled
(i.e. the transactions, their size, how long they run, their concurrency, etc).
Don't know if that's important.

> >> So at this stage, there are two possibilities for you (a) run manual
> >> Vacuum in-between (b) create the index after bulk load.  In general,
> >> whatever I have mentioned in (b) is a better way for bulk loading.
> >> Note here again the free_percent seems to be wrong.
> >
> > If you didn't mean VACUUM FULL then (a) does not appear to work and (b)
> > would kill usability of the db during import, which would happen daily
> > (though with a vastly reduced data size).
> 
> If the data size in subsequent import is very less, then you only need
> to Create the index after first import and then let it continue like
> that.

Managing this has suddenly gotten a lot more complex. :)

> > It also messes with the
> > permission model that has been set up for the least-trusted section of
> > the project (at the moment that section can only INSERT).
> 
> As per your permission model Vacuum Full is allowed, but not Create index?

Well, at the moment it's all in development (though time for that to end
is coming up). As such I can do things with enhanced permissions manually.

When it hits production, that rather stops.

> As mentioned above REINDEX might be a better option.  I think for such
> situation we should have some provision to allow squeeze functionality
> of hash exposed to the user, this will be less costly than REINDEX and
> might serve the purpose for the user.  Hey, can you try some hack in

Assuming it does help, would this be something one would need to guess
at? "I did a whole bunch of concurrent INSERT heavy transactions so I
guess I should do a squeeze now"?

Or could it be figured out programmatically?

> the code which can at least tell us whether squeezing hash can give us
> any benefit or is the index in such a situation that only REINDEX will
> help.  Basically, while doing Vacuum (non-full), you need to somehow
> bypass the below line in lazy_scan_heap
> lazy_scan_heap
> {
> ..
> if (vacrelstats->num_dead_tuples > 0)
> ..
> }
> 
> I am not sure it will work, but we can try once if you are okay.

So this would be at L1284 of 

Re: [HACKERS] AdvanceXLInsertBuffer vs. WAL segment compressibility

2017-07-06 Thread Heikki Linnakangas

On 07/03/2017 06:30 PM, Chapman Flack wrote:
> On 07/03/2017 09:39 AM, Heikki Linnakangas wrote:
>> Hmm. That's not the problem, though. Imagine that instead of the loop
>> above, you do just:
>>
>> WALInsertLockUpdateInsertingAt(CurrPos);
>> AdvanceXLInsertBuffer(EndPos, false);
>>
>> AdvanceXLInsertBuffer() will call XLogWrite(), to flush out any pages
>> before EndPos, to make room in the wal_buffers for the new pages. Before
>> doing that, it will call WaitXLogInsertionsToFinish()
>
> Although it's moot in the straightforward approach of re-zeroing in
> the loop, it would still help my understanding of the system to know
> if there is some subtlety that would have broken what I proposed
> earlier, which was an extra flag to AdvanceXLInsertBuffer() that
> would tell it not only to skip initializing headers, but also to
> skip the WaitXLogInsertionsToFinish() check ... because I have
> the entire region reserved and I hold all the writer slots
> at that moment, it seems safe to assure AdvanceXLInsertBuffer()
> that there are no outstanding writes to wait for.


Yeah, I suppose that would work, too.

- Heikki





Re: [HACKERS] pgsql 10: hash indexes testing

2017-07-06 Thread AP
On Thu, Jul 06, 2017 at 08:52:03AM +0530, Amit Kapila wrote:
> On Thu, Jul 6, 2017 at 2:40 AM, AP  wrote:
> > On Wed, Jul 05, 2017 at 05:52:32PM +0530, Amit Kapila wrote:
> >> >> >  version | bucket_pages | overflow_pages | bitmap_pages | 
> >> >> > unused_pages | live_items | dead_items |   free_percent
> >> >> > -+--++--+--+++--
> >> >> >3 | 10485760 |2131192 |   66 |
> >> >> > 0 | 2975444240 |  0 | 1065.19942179026
> >> >> > (1 row)
> > ...
> >> >> > And I do appear to have an odd percentage of free space. :)
> >>
> >> Are you worried about "unused_pages"? If so, then this is not a major
> >
> > Nope. "free_percent" Just a bit weird that I have it at over 1000% free. :)
> > Shouldn't that number be < 100?
> 
> Yes, there seems to be some gotcha in free percent calculation.  Is it
> possible for you to debug or in some way share the test?

I can try to debug but I need to know what to look for and how. If it
requires data reloads then that's around 12-15 hours per hit.

As for sharing the test, that'd mean sharing the data. If it helps I can
provide the content of that column, but you're looking at an SQL dump that
is roughly (2*64+1)*2.3 billion (give or take a (few) billion) in size. :)

> > Well, if this is the cause of my little issue, it might be nice. ATM
> > my import script bombs out on errors (that I've duplicated! :) It took
> > 11 hours but it bombed) and it sounds like I'll need to do a manual
> > VACUUM before it can be run again.
> >
> 
> Yeah, I think after manual vacuum you should be able to proceed.

I don't think that'll help. I did a small check which I hope is helpful in
seeing if it will. Working off a similar db that completed (as it was
smaller and I did not want to mess with my one copy of the broken db)
I got the following results:

Pre-VACUUM:
---
# \di+
                                List of relations
 Schema |       Name        | Type  |   Owner   | Table | Size  | Description
--------+-------------------+-------+-----------+-------+-------+-------------
...
 public | link_datum_id_idx | index | mdkingpin | link  | 90 GB |
...

# select * from  pgstathashindex('link_datum_id_idx');
 version | bucket_pages | overflow_pages | bitmap_pages | unused_pages | live_items | dead_items |  free_percent
---------+--------------+----------------+--------------+--------------+------------+------------+-----------------
       3 |      7611595 |        3451261 |          106 |       777013 | 3076131325 |          0 | 1512.8635780908

# vacuum VERBOSE ANALYZE link;
INFO:  vacuuming "public.link"
INFO:  "link": found 0 removable, 2272156152 nonremovable row versions in 
120507771 out of 123729466 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 8594
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 457.16 s, system: 755.84 s, elapsed: 4196.75 s.
INFO:  vacuuming "pg_toast.pg_toast_183727891"
INFO:  index "pg_toast_183727891_index" now contains 1441820 row versions in 
3956 pages
DETAIL:  0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.02 s, system: 0.00 s, elapsed: 0.56 s.
INFO:  "pg_toast_183727891": found 0 removable, 1441820 nonremovable row 
versions in 332271 out of 332271 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 8594
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 1.16 s, system: 2.26 s, elapsed: 22.80 s.
INFO:  analyzing "public.link"
INFO:  "link": scanned 300 of 123729466 pages, containing 56661337 live 
rows and 0 dead rows; 300 rows in sample, 2330296882 estimated total rows
VACUUM
Time: 7057484.079 ms (01:57:37.484)

Post-VACUUM:

# \di+
 Schema |       Name        | Type  |   Owner   | Table | Size  | Description
--------+-------------------+-------+-----------+-------+-------+-------------
 public | link_datum_id_idx | index | mdkingpin | link  | 90 GB |

# select * from  pgstathashindex('link_datum_id_idx');
 version | bucket_pages | overflow_pages | bitmap_pages | unused_pages | live_items | dead_items |  free_percent
---------+--------------+----------------+--------------+--------------+------------+------------+-----------------
       3 |      7611595 |        3451261 |          106 |       777013 | 3076131325 |          0 | 1512.8635780908

The end results are the same.

Then I did:
# CREATE INDEX CONCURRENTLY ON link USING hash (datum_id) WITH (fillfactor = 
90);
CREATE INDEX
Time: 12545612.560 ms (03:29:05.613)
# \di+   
 List of relations
 Schema |Name| Type  |   Owner   |  Table  |  Size   | 
Description

Re: [HACKERS] Proposal : For Auto-Prewarm.

2017-07-06 Thread Mithun Cy
On Thu, Jul 6, 2017 at 10:52 AM, Amit Kapila  wrote:
> I am not able to understand what you want to say. Unlogged tables
> should be empty in case of crash recovery.  Also, we never flush
> unlogged buffers except for shutdown checkpoint, refer BufferAlloc and
> in particular below comment:

-- Sorry I said that because of my lack of knowledge about unlogged
tables. Yes, what you say is right "an unlogged table is automatically
truncated after a crash or unclean shutdown". So it will be enough if
we just dump them during shutdown time.


-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error while copying a large file in pg_rewind

2017-07-06 Thread Dilip Kumar
On Thu, Jul 6, 2017 at 4:18 PM, Kuntal Ghosh  wrote:
> On Wed, Jul 5, 2017 at 9:35 AM, Michael Paquier
>  wrote:
>> On Tue, Jul 4, 2017 at 4:41 PM, Kuntal Ghosh  
>> wrote:
>>> I've not yet started the patch and it may take some time for me to
>>> understand and write
>>> the patch in a correct way. Since, you've almost written the patch,
>>> IMHO, please go ahead
>>> and submit the patch. I'll happily review and test it. :-)
>>>
>>> Thanks for the notes.
>>
>> OK, thanks. Here you go.
>>
> Thanks for the patch. It looks good and it solves the existing issues.
>
> But, I'm little concerned/doubt regarding the following part of the code.
> +/*
> + * Converts an int64 from network byte order to native format.
> + */
> +static int64
> +pg_recvint64(int64 value)
> +{
> +   union
> +   {
> +   int64   i64;
> +   uint32  i32[2];
> +   } swap;
> +   int64   result;
> +
> +   swap.i64 = value;
> +
> +   result = (uint32) ntohl(swap.i32[0]);
> +   result <<= 32;
> +   result |= (uint32) ntohl(swap.i32[1]);
> +
> +   return result;
> +}
> Does this always work correctly irrespective of the endianess of the
> underlying system?

I think this will have problem, we may need to do like

 union
   {
   int64   i64;
   uint8  i8[8];
   } swap;


and reverse complete array if byte order is changed

or we can use something like "be64toh"
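
To make the suggestion concrete, here is a minimal, self-contained sketch of
the byte-array approach; the function name and the runtime endianness test
are mine, not taken from the patch:

#include <stdint.h>

/*
 * Illustrative only: convert an int64 from network byte order (big-endian)
 * to native format by reversing the complete byte array on little-endian
 * machines, as described above.
 */
static int64_t
recv_int64_sketch(int64_t value)
{
	union
	{
		int64_t		i64;
		uint8_t		i8[8];
	}			in,
				out;
	int			one = 1;
	int			i;

	/* Big-endian host: network order already matches native order. */
	if (*(char *) &one == 0)
		return value;

	/* Little-endian host: reverse the complete byte array. */
	in.i64 = value;
	for (i = 0; i < 8; i++)
		out.i8[i] = in.i8[7 - i];
	return out.i64;
}

be64toh() would do the same in a single call, but as far as I know it is a
glibc/BSD extension rather than standard C, which is presumably why a
hand-rolled version is being discussed at all.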

>
>
> --
> Thanks & Regards,
> Kuntal Ghosh
> EnterpriseDB: http://www.enterprisedb.com
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgsql 10: hash indexes testing

2017-07-06 Thread Amit Kapila
On Thu, Jul 6, 2017 at 5:04 PM, AP  wrote:
> On Thu, Jul 06, 2017 at 12:38:38PM +0530, Amit Kapila wrote:
>> On Thu, Jul 6, 2017 at 9:32 AM, AP  wrote:
>> > On Thu, Jul 06, 2017 at 08:52:03AM +0530, Amit Kapila wrote:
>> >> On Thu, Jul 6, 2017 at 2:40 AM, AP  wrote:
>> >> > On Wed, Jul 05, 2017 at 05:52:32PM +0530, Amit Kapila wrote:
>> >> >> >> >  version | bucket_pages | overflow_pages | bitmap_pages | unused_pages | live_items | dead_items |   free_percent
>> >> >> >> > ---------+--------------+----------------+--------------+--------------+------------+------------+------------------
>> >> >> >> >        3 |     10485760 |        2131192 |           66 |            0 | 2975444240 |          0 | 1065.19942179026
>> >> >> >> > (1 row)
>> >> > ...
>> >> >> >> > And I do appear to have an odd percentage of free space. :)
>> >> >>
>> >> >> Are you worried about "unused_pages"? If so, then this is not a major
>> >> >
>> >> > Nope. "free_percent" Just a bit weird that I have it at over 1000% 
>> >> > free. :)
>> >> > Shouldn't that number be < 100?
>> >>
>> >> Yes, there seems to be some gotcha in free percent calculation.  Is it
>> >> possible for you to debug or in some way share the test?
>> >
>> > I can try to debug but I need to know what to look for and how.
>>
>> Okay,  you need to debug function pgstathashindex and have your
>> breakpoint at free_percent calculation, then try to get the values of
>> nblocks, all the values in stats struct and total_space.  I think
>> after getting this info we can further decide what to look for.
>
> Ok. I'll try and get to this tomorrow amidst fun with NFS. Hopefully
> there'll be time.
>

Cool.

> So... I'll need
>
> postgresql-10-dbg - debug symbols for postgresql-10
>
> Then given 
> https://doxygen.postgresql.org/pgstatindex_8c.html#af86e3b4c40779d4f30a73b0bfe06316f
> set a breakpoint at pgstatindex.c:710 via gdb and then have fun with
> print?
>

If I am reading it correctly it should be line 706 as below:
if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
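
For anyone else wanting to reproduce this, the session would look roughly
like the following; the debug symbols package mentioned upthread must be
installed, and the exact breakpoint line (706 vs 710) depends on the build,
so treat the numbers and the PID as placeholders:

-- in the psql session that will run the query
postgres=# SELECT pg_backend_pid();

$ gdb -p <backend_pid>
(gdb) break pgstatindex.c:706
(gdb) continue

-- back in the same psql session
postgres=# SELECT * FROM pgstathashindex('link_datum_id_idx');

-- gdb stops at the breakpoint; "next" down to the free_percent line,
-- then inspect the inputs Amit asked about
(gdb) print nblocks
(gdb) print stats
(gdb) print total_space
(gdb) continue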

>> > As for sharing the test, that'd mean sharing the data. If it helps I can
>> > provide the content of that column but you're looking at an sql dump that
>> > is roughly (2*64+1)*2.3 billion (give or take a (few) billion) in size. :)
>>
>> This is tricky; will I be able to import that column's values by creating
>> the table? If so, then it is probably worth it.
>
> Should do. Thinking about it a little more, I can shrink the file down by
> roughly half if I don't do a pg_dump or similar. Doing
>
> COPY link (datum_id) TO '/tmp/moocow.copy' WITH (FORMAT BINARY)
>
> should allow you to use COPY FROM to restore the file and produce something
> a lot smaller than a dump, right?
>

I think so.  You can share it. I will try.

> The table is simple:
>
> CREATE TABLE link (datum_id BYTEA);
>
> I can't give you the rest of the table (one other column) as the stuff hidden
> in there is private.
>

No issues.
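
Putting the two quoted pieces together, restoring on the receiving side
should presumably be just:

CREATE TABLE link (datum_id BYTEA);
COPY link (datum_id) FROM '/tmp/moocow.copy' WITH (FORMAT BINARY);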

> The only thing that wont give you is the manner in which the column is filled
> (ie: the transactions, their size, how long they run, their concurrency etc).
> Don't know if that's important.
>
>> >> So at this stage, there are two possibilities for you (a) run manual
>> >> Vacuum in-between (b) create the index after bulk load.  In general,
>> >> whatever I have mentioned in (b) is a better way for bulk loading.
>> >> Note here again the free_percent seems to be wrong.
>> >
>> > If you didn't mean VACUUM FULL then (a) does not appear to work and (b)
>> > would kill usability of the db during import, which would happen daily
>> > (though with a vastly reduced data size).
>>
>> If the data size in subsequent imports is much smaller, then you only need
>> to create the index after the first import and then let it continue like
>> that.
>
> Managing this has suddenly gotten a lot more complex. :)
>
>> > It also messes with the
>> > permission model that has been set up for the least-trusted section of
>> > the project (at the moment that section can only INSERT).
>>
>> As per your permission model Vacuum Full is allowed, but not Create index?
>
> Well, at the moment it's all in development (though time for that to end
> is coming up). As such I can do things with enhanced permissions manually.
>

I think if you are under development, it is always advisable to create
indexes after initial bulk load.  That way it will be faster and will
take lesser space atleast in case of hash index.

> When it hits production, that rather stops.
>
>> As mentioned above REINDEX might be a better option.  I think for such
>> situation we should have some provision to allow squeeze functionality
>> of hash exposed to the user, this will be less costly than REINDEX and
>> might serve the purpose for the user.  Hey, can you try some hack in
>
> Assuming it does help, would this be something one would need to guess
> at? "I did a whole bunch of concurrent 

Re: [HACKERS] hash index on unlogged tables doesn't behave as expected

2017-07-06 Thread Amit Kapila
On Thu, Jul 6, 2017 at 3:24 PM, Kyotaro HORIGUCHI
 wrote:
> Hello,
>
> At Tue, 4 Jul 2017 14:49:26 +0530, Amit Kapila  
> wrote in 
>> On Tue, Jul 4, 2017 at 1:23 PM, Michael Paquier
>>  wrote:
>> > On Mon, Jul 3, 2017 at 7:40 PM, Amit Kapila  
>> > wrote:
>> >
>> > It seems to me that this qualifies as an open item for 10. WAL-logging
>> > is new for hash tables. Amit, could you add one?
>> >
>>
>> Added.
>>
>> > +   use_wal = RelationNeedsWAL(rel) ||
>> > + (rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
>> > +  forkNum == INIT_FORKNUM);
>> > In access AMs, empty() routines are always only called for unlogged
>> > relations, so you could relax that to check for INIT_FORKNUM only.
>> >
>>
>> Yeah, but _hash_init() is an exposed API, so I am not sure it is a
>> good idea to make such an assumption at that level of code.  Now, as
>> there is no current user which wants to use it any other way, we can
>> make such an assumption, but does it serve any purpose?
>
> Check for INIT_FORKNUM appears both accompanied and not
> accompanied by check for RELPER.._UNLOGGED, so I'm not sure which
> is the convention here.
>
> The difference of the two is an init fork of TEMP
> relations. However I believe that init fork is by definition a
> part of an unlogged relation, it seems to me that it ought not to
> be wal-logged if we had it. From this viewpoint, the middle line
> makes sense.
>

What is the middle line?  Are you okay with the current check, do you
see any problem with it, or do you prefer to write it differently?

>
>
>> > Looking at the patch, I think that you are right to do the logging
>> > directly in _hash_init() and not separately in hashbuildempty().
>> > Compared to other relations the init data is more dynamic.
>> >
>> > +   /*
>> > +* Force the on-disk state of init forks to always be in sync with 
>> > the
>> > +* state in shared buffers.  See XLogReadBufferForRedoExtended.
>> > +*/
>> > +   XLogRecGetBlockTag(record, 0, NULL, &forknum, NULL);
>> > +   if (forknum == INIT_FORKNUM)
>> > +   FlushOneBuffer(metabuf);
>> > I find those forced syncs a bit surprising. The buffer is already
>> > marked as dirty, so even if there is a concurrent checkpoint they
>> > would be flushed.
>> >
>>
>> What if there is no concurrent checkpoint activity and the server is
>> shutdown?  Remember this replay happens on standby and we will just
>> try to perform Restartpoint and there is no guarantee that it will
>> flush the buffers.  If the buffers are not flushed at shutdown, then
>> the Init fork will be empty on restart and the hash index will be
>> corrupt.
>
> XLogReadBufferForRedoExtended() automatically forces the flush for
> an XLOG record on the init fork that has FPW images. And
> _hash_init seems to have emitted such an XLOG record using
> log_newpage.
>

Sure, but log_newpage is not used for metapage and bitmappage.

>> > If recovery comes again here, then they would just
>> > be replayed. I am unclear why XLogReadBufferForRedoExtended is not
>> > enough to replay a FPW of that.
>> >
>>
>> Where does FPW come into the picture for a replay of metapage or bitmappage?
>
> Since required FPW seems to be emitted and the flush should be
> done automatically, the issuer side (_hash_init) may attach wrong
> FPW images if hash_xlog_init_meta_page requires additional
> flushes. But I don't see such a flaw there. I think Michael wants
> to say such a kind of thing.
>

I am not able to understand what you want to say, but I think you
don't see any problem with the patch as such.

>
>
> By the way, the comment of the function ReadBufferWithoutRelcache
> has the following sentence.
>

I don't think we need anything with respect to this patch because
ReadBufferWithoutRelcache is used in case of FPW replay of INIT_FORK
pages as well.  Do you have something else in mind which I am not able
to see?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add support for tuple routing to foreign partitions

2017-07-06 Thread Etsuro Fujita

On 2017/07/05 9:13, Amit Langote wrote:

On 2017/06/29 20:20, Etsuro Fujita wrote:



In relation to this, I also allowed expand_inherited_rtentry() to
build an RTE and AppendRelInfo for each partition of a partitioned table
that is an INSERT target, as mentioned by Amit in [1], by modifying
transformInsertStmt() so that we set the inh flag for the target table's
RTE to true when the target table is a partitioned table.


About this part, Robert sounded a little unsure back then [1]; his words:

"Doing more inheritance expansion early is probably not a good idea
because it likely sucks for performance, and that's especially unfortunate
if it's only needed for foreign tables."

But...


The partition
RTEs are not only referenced by FDWs, but used in selecting relation
aliases for EXPLAIN (see below) as well as extracting plan dependencies in
setref.c so that we force rebuilding of the plan on any change to
partitions.  (We do replanning on FDW table-option changes to foreign
partitions, currently.  See regression tests in the attached patch.)


...this part means we cannot really avoid locking at least the foreign
partitions during the planning stage, which we currently don't, as we
don't look at partitions at all before ExecSetupPartitionTupleRouting() is
called by ExecInitModifyTable().


Another case where we need inheritance expansion during the planning 
stage would probably be INSERT .. ON CONFLICT into a partitioned table, 
I guess.  We don't support that yet, but if implementing that, I guess 
we would probably need to look at each partition and do something like 
infer_arbiter_indexes() for each partition during the planner stage. 
Not sure.



Since we are locking *all* the partitions anyway, it may be better to
shift the cost of locking to the planner by doing the inheritance
expansion even in the insert case (for partitioned tables) and avoid
calling the lock manager again in the executor when setting up
tuple-routing data structures (that is, in ExecSetupPartitionTupleRouting).


We need the execution-time lock, anyway.  See the comments from Robert 
in [3].



An idea that Robert recently mentioned on the nearby "UPDATE partition
key" thread [2] seems relevant in this regard.  By that idea,
expand_inherited_rtentry(), instead of reading in the partition OIDs in
the old catalog order, will instead call
RelationBuildPartitionDispatchInfo(), which locks the partitions and
returns their OIDs in the canonical order.  If we store the foreign
partition RT indexes in that order in ModifyTable.partition_rels
introduced by this patch, then the following code added by the patch could
be made more efficient (as also explained in aforementioned Robert's email):

+foreach(l, node->partition_rels)
+{
+Index   rti = lfirst_int(l);
+Oid relid = getrelid(rti, estate->es_range_table);
+boolfound;
+int j;
+
+/* First, find the ResultRelInfo for the partition */
+found = false;
+for (j = 0; j < mtstate->mt_num_partitions; j++)
+{
+resultRelInfo = partitions + j;
+
+if (RelationGetRelid(resultRelInfo->ri_RelationDesc) ==
relid)
+{
+Assert(mtstate->mt_partition_indexes[i] == 0);
+mtstate->mt_partition_indexes[i] = j;
+found = true;
+break;
+}
+}
+if (!found)
+elog(ERROR, "failed to find ResultRelInfo for rangetable
index %u", rti);


Agreed.  I really want to improve this based on that idea.

Thank you for the comments!

Best regards,
Etsuro Fujita

[3] 
https://www.postgresql.org/message-id/ca+tgmoyiwvicdri3zk+quxj1r7umu9t_kdnq+17pcwgzrbz...@mail.gmail.com




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error while copying a large file in pg_rewind

2017-07-06 Thread Kuntal Ghosh
On Wed, Jul 5, 2017 at 9:35 AM, Michael Paquier
 wrote:
> On Tue, Jul 4, 2017 at 4:41 PM, Kuntal Ghosh  
> wrote:
>> I've not yet started the patch and it may take some time for me to
>> understand and write
>> the patch in a correct way. Since, you've almost written the patch,
>> IMHO, please go ahead
>> and submit the patch. I'll happily review and test it. :-)
>>
>> Thanks for the notes.
>
> OK, thanks. Here you go.
>
Thanks for the patch. It looks good and it solves the existing issues.

But, I'm little concerned/doubt regarding the following part of the code.
+/*
+ * Converts an int64 from network byte order to native format.
+ */
+static int64
+pg_recvint64(int64 value)
+{
+   union
+   {
+   int64   i64;
+   uint32  i32[2];
+   } swap;
+   int64   result;
+
+   swap.i64 = value;
+
+   result = (uint32) ntohl(swap.i32[0]);
+   result <<= 32;
+   result |= (uint32) ntohl(swap.i32[1]);
+
+   return result;
+}
Does this always work correctly irrespective of the endianess of the
underlying system?


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] hash index on unlogged tables doesn't behave as expected

2017-07-06 Thread Kyotaro HORIGUCHI
FWIW..

At Thu, 06 Jul 2017 18:54:47 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20170706.185447.256482539.horiguchi.kyot...@lab.ntt.co.jp>
> > > +   /*
> > > +* Force the on-disk state of init forks to always be in sync 
> > > with the
> > > +* state in shared buffers.  See XLogReadBufferForRedoExtended.
> > > +*/
> > > +   XLogRecGetBlockTag(record, 0, NULL, &forknum, NULL);
> > > +   if (forknum == INIT_FORKNUM)
> > > +   FlushOneBuffer(metabuf);
> > > I find those forced syncs a bit surprising. The buffer is already
> > > marked as dirty, so even if there is a concurrent checkpoint they
> > > would be flushed.
> > >
> > 
> > What if there is no concurrent checkpoint activity and the server is
> > shutdown?  Remember this replay happens on standby and we will just
> > try to perform Restartpoint and there is no guarantee that it will
> > flush the buffers.  If the buffers are not flushed at shutdown, then
> > the Init fork will be empty on restart and the hash index will be
> > corrupt.
> 
> XLogReadBufferForRedoExtended() automatically forces the flush for
> an XLOG record on the init fork that has FPW images. And
> _hash_init seems to have emitted such an XLOG record using
> log_newpage.
> 
> > > If recovery comes again here, then they would just
> > > be replayed. I am unclear why XLogReadBufferForRedoExtended is not
> > > enough to replay a FPW of that.
> > >
> > 
> > Where does FPW come into the picture for a replay of metapage or bitmappage?
> 
> Since required FPW seems to be emitted and the flush should be
> done automatically, the issuer side (_hash_init) may attach wrong

"may attach" -> "may have attached"

> FPW images if hash_xlog_init_meta_page requires additional
> flushes. But I don't see such a flaw there. I think Michael wants
> to say such a kind of thing.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] increasing the default WAL segment size

2017-07-06 Thread Beena Emerson
On Thu, Jul 6, 2017 at 3:21 PM, tushar  wrote:
> On 07/06/2017 12:04 PM, Beena Emerson wrote:
>>
>> The 04-initdb-walsegsize_v2.patch has the following improvements:
>> - Rebased over new 03 patch
>> - Pass the wal-segsize initdb option as a command-line option rather
>> than in an environment variable.
>> - Since the new function check_wal_size had only two checks and was
>> used once, moved the code to ReadControlFile where it is used and
>> removed this function.
>> - Improve comments and add validations where required.
>> - Use DEFAULT_XLOG_SEG_SIZE to set min_wal_size and
>> max_wal_size, instead of the value 16.
>> - Use XLogSegMaxSize and XLogSegMinSize to calculate the range of the guc
>> wal_segment_size instead of 16 - INT_MAX.
>
> Thanks Beena. I tested with below following scenarios  and all are working
> as expected
> .)Different WAL segment size i.e 1,2,8,16,32,64,512,1024 at the time of
> initdb
> .)SR setup in place.
> .)Combinations of  max/min_wal_size in postgresql.conf with different
> wal_segment_size.
> .)shutdown the server forcefully (kill -9) / promote slave / to make sure
> -recovery happened successfully.
> .)with different utilities like pg_resetwal/pg_upgrade/pg_waldump
> .)running pg_bench for substantial workloads (~ 1 hour)
> .)WAL segment size is not default (changed at the time of ./configure) +
> different wal_segment_size (at the time of initdb) .
>
> Things looks fine.
>

Thank you Tushar.


-- 

Beena Emerson

EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] hash index on unlogged tables doesn't behave as expected

2017-07-06 Thread Kyotaro HORIGUCHI
Hello,

At Tue, 4 Jul 2017 14:49:26 +0530, Amit Kapila  wrote 
in 
> On Tue, Jul 4, 2017 at 1:23 PM, Michael Paquier
>  wrote:
> > On Mon, Jul 3, 2017 at 7:40 PM, Amit Kapila  wrote:
> >
> > It seems to me that this qualifies as an open item for 10. WAL-logging
> > is new for hash tables. Amit, could you add one?
> >
> 
> Added.
> 
> > +   use_wal = RelationNeedsWAL(rel) ||
> > + (rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
> > +  forkNum == INIT_FORKNUM);
> > In access AMs, empty() routines are always only called for unlogged
> > relations, so you could relax that to check for INIT_FORKNUM only.
> >
> 
> Yeah, but _hash_init() is an exposed API, so I am not sure it is a
> good idea to make such an assumption at that level of code.  Now, as
> there is no current user which wants to use it any other way, we can
> make such an assumption, but does it serve any purpose?

Check for INIT_FORKNUM appears both accompanied and not
accompanied by check for RELPER.._UNLOGGED, so I'm not sure which
is the convention here.

The difference of the two is an init fork of TEMP
relations. However I believe that init fork is by definition a
part of an unlogged relation, it seems to me that it ought not to
be wal-logged if we had it. From this viewpoint, the middle line
makes sense.



> > Looking at the patch, I think that you are right to do the logging
> > directly in _hash_init() and not separately in hashbuildempty().
> > Compared to other relations the init data is more dynamic.
> >
> > +   /*
> > +* Force the on-disk state of init forks to always be in sync with 
> > the
> > +* state in shared buffers.  See XLogReadBufferForRedoExtended.
> > +*/
> > +   XLogRecGetBlockTag(record, 0, NULL, &forknum, NULL);
> > +   if (forknum == INIT_FORKNUM)
> > +   FlushOneBuffer(metabuf);
> > I find those forced syncs a bit surprising. The buffer is already
> > marked as dirty, so even if there is a concurrent checkpoint they
> > would be flushed.
> >
> 
> What if there is no concurrent checkpoint activity and the server is
> shutdown?  Remember this replay happens on standby and we will just
> try to perform Restartpoint and there is no guarantee that it will
> flush the buffers.  If the buffers are not flushed at shutdown, then
> the Init fork will be empty on restart and the hash index will be
> corrupt.

XLogReadBufferForRedoExtended() automatically forces the flush for
an XLOG record on the init fork that has FPW images. And
_hash_init seems to have emitted such an XLOG record using
log_newpage.

> > If recovery comes again here, then they would just
> > be replayed. I am unclear why XLogReadBufferForRedoExtended is not
> > enough to replay a FPW of that.
> >
> 
> Where does FPW come into the picture for a replay of metapage or bitmappage?

Since required FPW seems to be emitted and the flush should be
done automatically, the issuer side (_hash_init) may attach wrong
FPW images if hash_xlog_init_meta_page requires additional
flushes. But I don't see such a flaw there. I think Michael wants
to say such a kind of thing.



By the way, the comment of the function ReadBufferWithoutRelcache
has the following sentence.

| * NB: At present, this function may only be used on permanent relations, which
| * is OK, because we only use it during XLOG replay.  If in the future we
| * want to use it on temporary or unlogged relations, we could pass additional
| * parameters.

and does

| return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
 mode, strategy, &hit);

This surely works since BufferAlloc recognizes INIT_FORK. But
some adjustment may be needed around here.


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] increasing the default WAL segment size

2017-07-06 Thread tushar

On 07/06/2017 12:04 PM, Beena Emerson wrote:

The 04-initdb-walsegsize_v2.patch has the following improvements:
- Rebased over new 03 patch
- Pass the wal-segsize initdb option as a command-line option rather
than in an environment variable.
- Since the new function check_wal_size had only two checks and was
used once, moved the code to ReadControlFile where it is used and
removed this function.
- Improve comments and add validations where required.
- Use DEFAULT_XLOG_SEG_SIZE to set min_wal_size and
max_wal_size, instead of the value 16.
- Use XLogSegMaxSize and XLogSegMinSize to calculate the range of the guc
wal_segment_size instead of 16 - INT_MAX.
Thanks Beena. I tested with below following scenarios  and all are 
working as expected
.)Different WAL segment size i.e 1,2,8,16,32,64,512,1024 at the time of 
initdb

.)SR setup in place.
.)Combinations of  max/min_wal_size in postgresql.conf with different 
wal_segment_size.
.)shutdown the server forcefully (kill -9) / promote slave / to make 
sure -recovery happened successfully.

.)with different utilities like pg_resetwal/pg_upgrade/pg_waldump
.)running pg_bench for substantial workloads (~ 1 hour)
.)WAL segment size is not default (changed at the time of ./configure) + 
different wal_segment_size (at the time of initdb) .


Things looks fine.

--
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Multi column range partition table

2017-07-06 Thread Dean Rasheed
On 5 July 2017 at 10:43, Amit Langote  wrote:
> 0001 is your patch to tidy up check_new_partition_bound()  (must be
> applied before 0002)
>

I pushed this first patch, simplifying check_new_partition_bound() for
range partitions, since it seemed like a good simplification, but note
that I don't think that was actually the cause of the latent bug you
saw upthread.

I think the real issue was in partition_rbound_cmp() -- normally, if
the upper bound of one partition coincides with the lower bound of
another, that function would report the upper bound as the smaller
one, but that logic breaks if any of the bound values are infinite,
since then it will exit early, returning 0, without ever comparing the
"lower" flags on the bounds.

I'm tempted to push a fix for that independently, since it's a bug
waiting to happen, even though it's not possible to hit it currently.

Regards,
Dean


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP patch: distinguish selectivity of < from <= and > from >=

2017-07-06 Thread Kuntal Ghosh
On Thu, Jul 6, 2017 at 3:45 AM, Tom Lane  wrote:
> I wrote:
>> (Pokes at it some more...) Oh, interesting: it behaves that way except
>> when p is exactly the lowest histogram entry.
>
+ /*
+ * In the first bin (i==1), add a fudge factor that ensures
+ * that histfrac is at least eq_selec.  We do this because we
+ * know that the first histogram entry does satisfy the
+ * inequality (if !isgt) or not satisfy it (if isgt), so our
+ * estimate here should certainly not be zero even if binfrac
+ * is zero.  (XXX experimentally this is the correct way to do
+ * it, but why isn't it a linear adjustment across the whole
+ * histogram rather than just the first bin?)
+ */
Given that the values are distinct (I have some doubts about the real-number case),

if histogram_bounds are assigned as,
{0,9,19,29,39,49,59,69,79,89,99,109,119,129,13,..}

I think the buckets are defined as,
0 < bucket1 <= 9
9 < bucket2 <=19
19 < bucket3 <= 29 and so on.

Because, the histfrac is calculated as follows:

histfrac = (double) (bucket_current - 1) + (val - low) / (high - low);
(where bucket_current is obtained by doing a binary search on
histogram_bounds.)
histfrac /= (double) (nvalues - 1);

So, if val=low, then histfrac = (bucket_current - 1)/num_of_buckets,
which means it assumes val is included in the previous bucket.

This means that it always fails to calculate the selectivity for the
lowest histogram boundary. Hence, we need the adjustment only for the
first bucket.
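
To make that arithmetic concrete, here is a tiny stand-alone sketch of the
calculation quoted above (the names are simplified and mine, not the actual
selfuncs.c code):

#include <stdio.h>

/*
 * Simplified sketch of the histfrac calculation: val falls in bucket
 * "bucket_current" (1-based) with bounds [low, high], and there are
 * nvalues histogram boundaries, i.e. nvalues - 1 buckets.
 */
static double
histfrac_sketch(double val, double low, double high,
				int bucket_current, int nvalues)
{
	double		binfrac = (val - low) / (high - low);
	double		histfrac = (double) (bucket_current - 1) + binfrac;

	return histfrac / (double) (nvalues - 1);
}

int
main(void)
{
	/*
	 * val equal to the lower bound of its bucket: binfrac is 0, so the
	 * value is effectively credited to the previous bucket, e.g. val = 9
	 * sitting on the boundary between bucket 1 (0..9] and bucket 2 (9..19].
	 */
	printf("%g\n", histfrac_sketch(9.0, 9.0, 19.0, 2, 15));	/* prints 1/14 */
	return 0;
}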

Do you think my reasoning justifies your concern?

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Another comment typo in execMain.c

2017-07-06 Thread Etsuro Fujita

Here is a comment in ExecFindPartition() in execMain.c:

/*
 * First check the root table's partition constraint, if any.  No 
point in

 * routing the tuple it if it doesn't belong in the root table itself.
 */

I think that in the second sentence "it" just before "if" is a typo. 
Attached is a small patch for fixing that.


Best regards,
Etsuro Fujita
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 0f08283..06b467b 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -3310,7 +3310,7 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, 
PartitionDispatch *pd,
 
/*
 * First check the root table's partition constraint, if any.  No point 
in
-* routing the tuple it if it doesn't belong in the root table itself.
+* routing the tuple if it doesn't belong in the root table itself.
 */
if (resultRelInfo->ri_PartitionCheck)
ExecPartitionCheck(resultRelInfo, slot, estate);

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Oddity in error handling of constraint violation in ExecConstraints for partitioned tables

2017-07-06 Thread Etsuro Fujita

Here is an example:

postgres=# create table col_desc (a int, b int) partition by list (a);
postgres=# create table col_desc_1 partition of col_desc for values in (1);
postgres=# alter table col_desc_1 add check (b > 0);
postgres=# create role col_desc_user;
postgres=# grant insert on col_desc to col_desc_user;
postgres=# revoke select on col_desc from col_desc_user;
postgres=# set role col_desc_user;
postgres=> insert into col_desc values (1, -1) returning 1;
ERROR:  new row for relation "col_desc_1" violates check constraint 
"col_desc_1_b_check"

DETAIL:  Failing row contains (a, b) = (1, -1).

Looks good, but

postgres=> reset role;
postgres=# create table result (f1 text default 'foo', f2 text default 
'bar', f3 int);

postgres=# grant insert on result to col_desc_user;
postgres=# set role col_desc_user;
postgres=> with t as (insert into col_desc values (1, -1) returning 1) 
insert into result (f3) select * from t;
ERROR:  new row for relation "col_desc_1" violates check constraint 
"col_desc_1_b_check"


Looks odd to me because the error message doesn't show any DETAIL info; 
since the CTE query, which produces the message, is the same as the 
above query, the message should also be the same as the one for the 
above query.  I think the reason for that is: ExecConstraints looks at 
an incorrect resultRelInfo to build the message for a violating tuple 
that has been routed *in the case where the partitioned table isn't the 
primary ModifyTable node*, which leads to deriving an incorrect 
modifiedCols and then invoking ExecBuildSlotValueDescription with the 
wrong bitmap.  I think this should be fixed.  Attached is a patch for that.


Best regards,
Etsuro Fujita
*** a/src/backend/executor/execMain.c
--- b/src/backend/executor/execMain.c
***
*** 92,97  static bool ExecCheckRTEPermsModified(Oid relOid, Oid userid,
--- 92,99 
  Bitmapset *modifiedCols,
  AclMode requiredPerms);
  static void ExecCheckXactReadOnly(PlannedStmt *plannedstmt);
+ static ResultRelInfo *ExecFindResultRelInfo(EState *estate, Oid relid,
+   
bool missing_ok);
  static char *ExecBuildSlotValueDescription(Oid reloid,
  TupleTableSlot *slot,
  TupleDesc tupdesc,
***
*** 1360,1365  InitResultRelInfo(ResultRelInfo *resultRelInfo,
--- 1362,1392 
  }
  
  /*
+  * ExecFindResultRelInfo -- find the ResultRelInfo struct for given relation 
OID
+  *
+  * If no such struct, either return NULL or throw error depending on 
missing_ok
+  */
+ static ResultRelInfo *
+ ExecFindResultRelInfo(EState *estate, Oid relid, bool missing_ok)
+ {
+   ResultRelInfo *rInfo;
+   int nr;
+ 
+   rInfo = estate->es_result_relations;
+   nr = estate->es_num_result_relations;
+   while (nr > 0)
+   {
+   if (RelationGetRelid(rInfo->ri_RelationDesc) == relid)
+   return rInfo;
+   rInfo++;
+   nr--;
+   }
+   if (!missing_ok)
+   elog(ERROR, "failed to find ResultRelInfo for relation %u", 
relid);
+   return NULL;
+ }
+ 
+ /*
   *ExecGetTriggerResultRel
   *
   * Get a ResultRelInfo for a trigger target relation.  Most of the time,
***
*** 1379,1399  ResultRelInfo *
  ExecGetTriggerResultRel(EState *estate, Oid relid)
  {
ResultRelInfo *rInfo;
-   int nr;
ListCell   *l;
Relationrel;
MemoryContext oldcontext;
  
/* First, search through the query result relations */
!   rInfo = estate->es_result_relations;
!   nr = estate->es_num_result_relations;
!   while (nr > 0)
!   {
!   if (RelationGetRelid(rInfo->ri_RelationDesc) == relid)
!   return rInfo;
!   rInfo++;
!   nr--;
!   }
/* Nope, but maybe we already made an extra ResultRelInfo for it */
foreach(l, estate->es_trig_target_relations)
{
--- 1406,1419 
  ExecGetTriggerResultRel(EState *estate, Oid relid)
  {
ResultRelInfo *rInfo;
ListCell   *l;
Relationrel;
MemoryContext oldcontext;
  
/* First, search through the query result relations */
!   rInfo = ExecFindResultRelInfo(estate, relid, true);
!   if (rInfo)
!   return rInfo;
/* Nope, but maybe we already made an extra ResultRelInfo for it */
foreach(l, estate->es_trig_target_relations)
{
***
*** 1828,1833  ExecPartitionCheck(ResultRelInfo *resultRelInfo, 
TupleTableSlot *slot,
--- 1848,1854 
   EState *estate)
  {
Relationrel = 

Re: [HACKERS] pgsql 10: hash indexes testing

2017-07-06 Thread Amit Kapila
On Thu, Jul 6, 2017 at 9:32 AM, AP  wrote:
> On Thu, Jul 06, 2017 at 08:52:03AM +0530, Amit Kapila wrote:
>> On Thu, Jul 6, 2017 at 2:40 AM, AP  wrote:
>> > On Wed, Jul 05, 2017 at 05:52:32PM +0530, Amit Kapila wrote:
>> >> >> >  version | bucket_pages | overflow_pages | bitmap_pages | unused_pages | live_items | dead_items |   free_percent
>> >> >> > ---------+--------------+----------------+--------------+--------------+------------+------------+------------------
>> >> >> >        3 |     10485760 |        2131192 |           66 |            0 | 2975444240 |          0 | 1065.19942179026
>> >> >> > (1 row)
>> > ...
>> >> >> > And I do appear to have an odd percentage of free space. :)
>> >>
>> >> Are you worried about "unused_pages"? If so, then this is not a major
>> >
>> > Nope. "free_percent" Just a bit weird that I have it at over 1000% free. :)
>> > Shouldn't that number be < 100?
>>
>> Yes, there seems to be some gotcha in free percent calculation.  Is it
>> possible for you to debug or in some way share the test?
>
> I can try to debug but I need to know what to look for and how.
>

Okay,  you need to debug function pgstathashindex and have your
breakpoint at free_percent calculation, then try to get the values of
nblocks, all the values in stats struct and total_space.  I think
after getting this info we can further decide what to look for.


> If it
> requires data reloads then that's around 12-15 hours per hit.
>
> As for sharing the test, that'd mean sharing the data. If it helps I can
> provide the content of that column but you're looking at an sql dump that
> is roughly (2*64+1)*2.3 billion (give or take a (few) billion) in size. :)
>

This is tricky; will I be able to import that column's values by creating
the table? If so, then it is probably worth it.

>> > Well, if this is the cause of my little issue, it might be nice. ATM
>> > my import script bombs out on errors (that I've duplicated! :) It took
>> > 11 hours but it bombed) and it sounds like I'll need to do a manual
>> > VACUUM before it can be run again.
>> >
>>
>> Yeah, I think after manual vacuum you should be able to proceed.
>
>
> That's markedly different. At a rough estimate I should be able to double
> the row count before I hit the "real" limits of a hash index.
>
> When you refer to VACUUM do you mean VACUUM FULL?
>

Normal Vacuum won't work for this case as you don't have dead tuples
(deleted rows in table).

> That's going to get nasty
> if that's the case as the table is almost 1TB in size.
>

Yeah,  I think for this situation REINDEX will be a better option
because anyway Vacuum Full will rewrite the entire index and heap.
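
That is, using the index name from upthread, something as simple as:

REINDEX INDEX link_datum_id_idx;

(REINDEX rebuilds only the index, without rewriting the heap.)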

>> From above stats, it is clear that you have hit the maximum number of
>> overflow pages we can support today.  Now, here one can argue that we
>> should increase the limit of overflow pages in hash index which we can
>> do, but I think you can again hit such a problem after some more time.
>
> True, though I'm not far off hitting it "for real" at the present limits
> so an increase would be of benefit in other respects (ie not needing to
> have as many tables to manage because we have to bundle a set off due
> to index limits).
>
>> So at this stage, there are two possibilities for you (a) run manual
>> Vacuum in-between (b) create the index after bulk load.  In general,
>> whatever I have mentioned in (b) is a better way for bulk loading.
>> Note here again the free_percent seems to be wrong.
>
> If you didn't mean VACUUM FULL then (a) does not appear to work and (b)
> would kill usability of the db during import, which would happen daily
> (though with a vastly reduced data size).
>

If the data size in subsequent imports is much smaller, then you only need
to create the index after the first import and then let it continue like
that.

> It also messes with the
> permission model that has been set up for the least-trusted section of
> the project (at the moment that section can only INSERT).
>

As per your permission model Vacuum Full is allowed, but not Create index?

> Still, I may wind up going with (b) if a VACUUM FULL is the only other
> real choice but would prefer to avoid it. The fact that the index is
> around 300GB smaller (so far) than btree may well be worth the pain
> all by its lonesome.
>

As mentioned above REINDEX might be a better option.  I think for such
situation we should have some provision to allow squeeze functionality
of hash exposed to the user, this will be less costly than REINDEX and
might serve the purpose for the user.  Hey, can you try some hack in
the code which can at least tell us whether squeezing hash can give us
any benefit or is the index in such a situation that only REINDEX will
help.  Basically, while doing Vacuum (non-full), you need to somehow
bypass the below line in lazy_scan_heap
lazy_scan_heap
{
..
if (vacrelstats->num_dead_tuples > 0)
..
}

I am not sure it will work, but we can try once if you are 

Re: [HACKERS] Update comment in ExecPartitionCheck

2017-07-06 Thread Etsuro Fujita

On 2017/07/04 18:15, Amit Langote wrote:

On 2017/07/04 17:55, Etsuro Fujita wrote:

This comment in an error handling in ExecPartitionCheck():

 if (!ExecCheck(resultRelInfo->ri_PartitionCheckExpr, econtext))
 {
 char   *val_desc;
 Relationorig_rel = rel;

 /* See the comment above. */
 if (resultRelInfo->ri_PartitionRoot)

should be updated because we don't have any comment on that above in the
code.  Since we have a comment on that in ExecConstraints() defined just
below that function, I think the comment should be something like this:
"See the comment in ExecConstraints().".  Attached is a patch for that.


Thanks for fixing that.  I forgot to do the same in the patch that got
committed as 15ce775faa428 [1], which moved that code block from
ExecConstraints() to ExecPartitionCheck().


Thanks for the explanation!

In relation to this, I found odd behavior in the error handling.  Since 
that is another topic, I'll start a new thread.


Best regards,
Etsuro Fujita



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] increasing the default WAL segment size

2017-07-06 Thread Beena Emerson
Hello,



On Tue, Mar 28, 2017 at 1:06 AM, Beena Emerson  wrote:
> Hello,
>
> On Sat, Mar 25, 2017 at 10:32 PM, Peter Eisentraut
>  wrote:
>>
>> At this point, I suggest splitting this patch up into several
>> potentially less controversial pieces.
>>
>> One big piece is that we currently don't support segment sizes larger
>> than 64 GB, for various internal arithmetic reasons.  Your patch appears
>> to address that.  So I suggest isolating that.  Assuming it works
>> correctly, I think there would be no great concern about it.
>>
>> The next piece would be making the various tools aware of varying
>> segment sizes without having to rely on a built-in value.
>>
>> The third piece would then be the rest that allows you to set the size
>> at initdb
>>
>> If we take these in order, we would make it easier to test various sizes
>> and see if there are any more unforeseen issues when changing sizes.  It
>> would also make it easier to do performance testing so we can address
>> the original question of what the default size should be.
>
>
> PFA the patches divided into 3 parts:
>
> 02-increase-max-wal-segsize.patch - Increases the wal-segsize and changes
> the internal representation of max_wal_size and min_wal_size to mb.

Already committed.

>
> 03-modify-tools.patch - Makes XLogSegSize into a variable, currently set as
> XLOG_SEG_SIZE and modifies the tools to fetch the size instead of using
> inbuilt value.
>

The updated 03-modify-tools_v2.patch has the following changes:
 - Rebased over current HEAD
 - Improved comments
 - Adding error messages where applicable.
 - Replace XLOG_SEG_SIZE in the tools and xlog_internal.h with
XLogSegSize. XLOG_SEG_SIZE is the wal_segment_size the server was
compiled with and XLogSegSize is the wal_segment_size of the target
server on which the tool is run. When the binaries used and the target
server are compiled with different wal_segment_size, the calculations
would be affected and the tool would crash. To avoid it, all the
calculations used by the tool should use XLogSegSize.
 - pg_waldump : The  fuzzy_open_file is split into two functions -
open_file_in_directory and identify_target_directory so that code can
be reused when determining the XLogSegSize from the WAL file header.
 - IsValidXLogSegSize macro is moved from 04 to here so that we can
use it for validating the size in all the tools.
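
For readers who have not opened the patch, such a validation macro would
presumably boil down to something like the following; the exact limits and
spelling here are assumptions of mine (derived from the 1 MB - 1024 MB sizes
exercised in testing), not copied from the patch:

/* Assumed limits: 1 MB .. 1 GB segments. */
#define XLogSegMinSize	(1024 * 1024)
#define XLogSegMaxSize	(1024 * 1024 * 1024)

/* Power-of-two check within the supported range; illustrative only. */
#define IsValidXLogSegSize(size) \
	((size) >= XLogSegMinSize && (size) <= XLogSegMaxSize && \
	 ((size) & ((size) - 1)) == 0)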


> 04-initdb-walsegsize.patch - Adds the initdb option to set wal-segsize and
> make related changes. Update pg_test_fsync to use DEFAULT_XLOG_SEG_SIZE
> instead of XLOG_SEG_SIZE

The 04-initdb-walsegsize_v2.patch has the following improvements:
- Rebased over new 03 patch
- Pass the wal-segsize initdb option as a command-line option rather
than in an environment variable.
- Since the new function check_wal_size had only two checks and was
used once, moved the code to ReadControlFile where it is used and
removed this function.
- Improve comments and add validations where required.
- Use DEFAULT_XLOG_SEG_SIZE to set min_wal_size and
max_wal_size, instead of the value 16.
- Use XLogSegMaxSize and XLogSegMinSize to calculate the range of the guc
wal_segment_size instead of 16 - INT_MAX.
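
For reference, with these patches applied the segment size would be chosen
at cluster creation time, along these lines (the spelling of the option is
inferred from the description above, so treat it as an assumption):

$ initdb -D /path/to/data --wal-segsize=64    # 64 MB WAL segments instead of the 16 MB default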


>
>>
>> One concern I have is that your patch does not contain any tests.  There
>> should probably be lots of tests.
>
>
> 05-initdb_tests.patch adds tap tests to initialize cluster with different
> wal_segment_size and then check the config values. What other tests do you
> have in mind? Checking the various tools?
>
>




-- 

Beena Emerson

EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


04-initdb-walsegsize_v2.patch
Description: Binary data


03-modify-tools_v2.patch
Description: Binary data


01-add-XLogSegmentOffset-macro_rebase.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal : For Auto-Prewarm.

2017-07-06 Thread Amit Kapila
On Wed, Jul 5, 2017 at 6:25 PM, Mithun Cy  wrote:
> On Mon, Jul 3, 2017 at 11:58 AM, Amit Kapila  wrote:
>> On Sun, Jul 2, 2017 at 10:32 PM, Mithun Cy  
>> wrote:
>>> On Tue, Jun 27, 2017 at 11:41 AM, Mithun Cy  
>>> wrote:
 On Fri, Jun 23, 2017 at 5:45 AM, Thom Brown  wrote:
>
> Also, I find it a bit messy that launch_autoprewarm_dump() doesn't
> detect an autoprewarm process already running.  I'd want this to
> return NULL or an error if called for a 2nd time.

 We log instead of error as we try to check only after launching the
 worker and inside worker. One solution could be as similar to
 autoprewarm_dump_now(), the autoprewarm_start_worker() can init shared
 memory and check if we can launch worker in backend itself. I will try
 to fix same.
>>>
>>> I have fixed it now as follows
>>>
>>> +Datum
>>> +autoprewarm_start_worker(PG_FUNCTION_ARGS)
>>> +{
>>> +   pid_t   pid;
>>> +
>>> +   init_apw_shmem();
>>> +   pid = apw_state->bgworker_pid;
>>> +   if (pid != InvalidPid)
>>> +   ereport(ERROR,
>>> +   (errmsg("autoprewarm worker is already running under PID 
>>> %d",
>>> +   pid)));
>>> +
>>> +   autoprewarm_dump_launcher();
>>> +   PG_RETURN_VOID();
>>> +}
>>>
>>> In backend itself, we shall check if an autoprewarm worker is running
>>> then only start the server. There is a possibility if this function is
>>> executed concurrently when there is no worker already running (Which I
>>> think is not a normal usage) then both call will say it has
>>> successfully launched the worker even though only one could have
>>> successfully done that (other will log and silently die).
>>
>> Why can't we close this remaining race condition?  Basically, if we
>> just perform all of the actions in this function under the lock and
>> autoprewarm_dump_launcher waits till the autoprewarm worker has
>> initialized the bgworker_pid, then there won't be any remaining race
>> condition.  I think if we don't close this race condition, it will be
>> unpredictable whether the user will get the error or there will be
>> only a server log for the same.
>
> Yes, I can make autoprewarm_dump_launcher wait until the launched
> bgworker sets its pid, but this requires one more synchronization
> variable between the launcher and the worker. More than that, I see that
> ShmemInitStruct() and LWLockAcquire() can throw an ERROR (restarting the
> worker), and they need to be called before setting the pid. So I thought it
> wouldn't be harmful to let two concurrent calls launch workers and
> just log the failures. Please let me know if I need to rethink this.
>

I don't know whether you need to rethink it, but as presented in the
patch, the specs of the API seem unclear to me.  As this is a function
exposed to the user, I think the behavior should be well defined.  If
you don't think changing the code has many advantages, then at the
very least update the docs to indicate the expected behavior of the
API.  Also, I think it is better to add a few comments in the code
describing the unpredictable behavior in case of a race condition and
the reason for it.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers