Re: [HACKERS] Re: Clarifying "server starting" messaging in pg_ctl start without --wait

2017-01-18 Thread Stephen Frost
Michael,

* Michael Paquier (michael.paqu...@gmail.com) wrote:
> On Wed, Jan 18, 2017 at 7:31 AM, Stephen Frost  wrote:
> > Perhaps we need a way for pg_ctl to realize a cold-standby case and
> > throw an error or warning if --wait is specified then, but that hardly
> > seems like the common use-case.  It also wouldn't make any sense to have
> > anything in the init system which depended on PG being up in such a case
> > because, well, PG isn't ever going to be 'up'.
> 
> Yeah, it seems to me that we are likely looking for a wait mode saying
> to exit pg_ctl once Postgres is happily rejecting connections, because
> that means that it is up and that it is sorting out something first
> before accepting them. This would basically filter the states in the
> control file that we find as acceptable if the connection test
> continues complaining about PQPING_REJECT.

If you're suggesting this *only* in the case where PG is starting up as
a cold standby, then, ok, maybe.  I don't think '-w' should mean
anything less than "up and accepting connections" for regular or hot
standby systems.

I'm not really convinced that the code is worth the trouble to handle
this case, but I'm not going to argue if someone wants to write it.

Thanks!

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Parallel Index Scans

2017-01-18 Thread Amit Kapila
On Tue, Jan 17, 2017 at 11:27 PM, Robert Haas  wrote:
> On Mon, Jan 16, 2017 at 7:11 AM, Amit Kapila  wrote:
>
>
> WAIT_EVENT_PARALLEL_INDEX_SCAN is in fact btree-specific.  There's no
> guarantee that any other AMs the implement parallel index scans will
> use that wait event, and they might have others instead.  I would make
> it a lot more specific, like WAIT_EVENT_BTREE_PAGENUMBER.  (Waiting
> for the page number needed to continue a parallel btree scan to become
> available.)
>

WAIT_EVENT_BTREE_PAGENUMBER - NUMBER sounds slightly inconvenient. How
about just WAIT_EVENT_BTREE_PAGE?  We can keep the description as
suggested by you?

> Why do we need separate AM support for index_parallelrescan()?  Why
> can't index_rescan() cover that case?

The reason is that sometime index_rescan is called when we have to
just update runtime scankeys in index and we don't want to reset
parallel scan for that.  Refer ExecReScanIndexScan() changes in patch
parallel_index_opt_exec_support_v4.  Rescan is called from below place
during index scan.

ExecIndexScan(IndexScanState *node)
{
/*
* If we have runtime keys and they've not already been set up, do it now.
*/
if (node->iss_NumRuntimeKeys != 0 && !node->iss_RuntimeKeysReady)
ExecReScan((PlanState *) node);

>  If the AM-specific method is
> getting the IndexScanDesc, it can see for itself whether it is a
> parallel scan.
>

I think if we want to do that way then we need to pass some additional
information related to runtime scan keys in index_rescan method and
then probably to amspecific rescan method. That sounds scary.


>
> I think it's a bad idea to add a ParallelIndexScanDesc argument to
> index_beginscan().  That function is used in lots of places, and
> somebody might think that they are allowed to actually pass a non-NULL
> value there, which they aren't: they must go through
> index_beginscan_parallel.  I think that the additional argument should
> only be added to index_beginscan_internal, and
> index_beginscan_parallel should remain unchanged.
>

If we go that way then we need to set few parameters like heapRelation
and xs_snapshot in index_beginscan_parallel as we are doing in
index_beginscan. Does going that way sound better to you?

>  Either that, or get
> rid of index_beginscan_parallel altogether and have everyone use
> index_beginscan directly, and put the snapshot-restore logic in that
> function.
>

I think there is value in retaining index_beginscan_parallel as that
is parallel to heap_beginscan_parallel.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Generate fmgr prototypes automatically

2017-01-18 Thread Tom Lane
Peter Eisentraut  writes:
> Generate fmgr prototypes automatically

BTW, I notice some suspicious-looking behavior with -j:

$ make -j8 -s
Writing fmgroids.h
Writing fmgroids.h
Writing postgres.bki
Writing fmgrprotos.h
Writing fmgrtab.c
Writing schemapg.h
Writing postgres.description
Writing postgres.shdescription
Writing fmgrprotos.h
Writing fmgrtab.c
...

Why do fmgroids.h, fmgrprotos.h, and fmgrtab.c now get mentioned
twice?  I suspect there is something broken about the parallelization.
If indeed multiple instances of gmake are writing these files
concurrently, that's likely to result in irreproducible build failures.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Rename pg_switch_xlog to pg_switch_wal

2017-01-18 Thread Michael Paquier
On Wed, Jan 18, 2017 at 8:15 PM, Vladimir Rusinov  wrote:
> On the topic of binaries, there's going to be another patch renaming them.
> Those will have no aliases as it's trivial to work-around (symlinks, shell
> scripts, etc) and not so trivial to implement in a portable way.

On Windows that would be a pain... So let's not use symlinks for binaries.

> I'm used to workflows with lots of small commits, so I was not able to force
> myself to include all of the renames in one diff.

That's a good habit IMO.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] kNN for btree

2017-01-18 Thread Nikita Glukhov

Sorry for the broken formatting in my previous message.
Below is a corrected version of this message.


I'd like to present a series of patches that implements k-Nearest Neighbors
(kNN) search for btree, which can be used to speed up ORDER BY distance
queries like this:
SELECT * FROM events ORDER BY date <-> '2000-01-01'::date ASC LIMIT 100;

Now only GiST supports kNN, but kNN on btree can be emulated using
contrib/btree_gist.


Scanning algorithm
==

Algorithm is very simple: we use bidirectional B-tree index scan starting at
the point from which we measure the distance (target point).  At each step,
we advance this scan in the direction that has the  nearest point.  But when
the target point does not fall into the scanned range, we don't even need to
use a bidirectional scan here --- we can use ordinary unidirectional scan
in the right direction.


Performance results
===

Test database is taken from original kNN-GiST presentation (PGCon 2010).

Test query

SELECT * FROM events ORDER BY date <-> '1957-10-04'::date ASC LIMIT k;

can be optimized to the next rather complicated UNION form,
which no longer requires kNN:

WITH
t1 AS (SELECT * FROM events WHERE date >= '1957-10-04'::date
   ORDER BY date ASC  LIMIT k),
t2 AS (SELECT * FROM events WHERE date <  '1957-10-04'::date
   ORDER BY date DESC LIMIT k),
t  AS (SELECT * FROM t1 UNION SELECT * FROM t2)
SELECT * FROM t ORDER BY date <-> '1957-10-04'::date ASC LIMIT k;


In each cell of this table shown query execution time in milliseconds and
the number of accessed blocks:


 k  |  kNN-btree   |  kNN-GiST|  Opt. query   |  Seq. scan
|  | (btree_gist) |  with UNION   |  with sort
|--|--|---|
  1 |  0.041 4 |  0.079 4 |   0.060 8 |  41.1 1824
 10 |  0.048 7 |  0.091 9 |   0.09717 |  41.8 1824
100 |  0.10747 |  0.19252 |   0.342   104 |  42.3 1824
   1000 |  0.735   573 |  0.913   650 |   2.970  1160 |  43.5 1824
  1 |  5.070  5622 |  6.240  6760 |  36.300 11031 |  54.1 1824
 10 | 49.600 51608 | 61.900 64194 | 295.100 94980 | 115.0 1824


As you can see, kNN-btree can be two times faster than kNN-GiST (btree_gist)
when k < 1000, but the number of blocks read is roughly the same.


Implementation details
==

A brief description is given below for each of the patches:

1. Introduce amcanorderbyop() function

This patch transforms existing boolean AM property amcanorderbyop into a method
(function pointer).  This is necessary because, unlike GiST, kNN for btree
supports only a one ordering operator on the first index column and we need a
different pathkey matching logic for btree (there was a corresponding comment
in match_pathkeys_to_index()).  GiST-specific logic has been moved from
match_pathkeys_to_index() to gistcanorderbyop().

2. Extract substructure BTScanState from BTScanOpaque

This refactoring is necessary for bidirectional kNN-scan implementation.
Now, BTScanOpaque's substructure BTScanState containing only the fields
related to scan position is passed to some functions where the whole
BTScanOpaque was passed previously.

3. Extract get_index_column_opclass(), get_opclass_opfamily_and_input_type().

Extracted two simple common functions used in gistproperty() and
btproperty() (see the next patch).

4. Add kNN support to btree

  * Added additional optional BTScanState to BTScanOpaque for
bidirectional kNN scan.
  * Implemented bidirectional kNN scan.
  * Implemented logic for selecting kNN strategy
  * Implemented btcanorderbyop(), updated btproperty() and btvalidate()

B-tree user interface functions have not been altered because ordering
operators are used directly.

5. Add distance operators for some types

These operators for integer, float, date, time, timestamp, interval, cash and
oid types have been copied from contrib/btree_gist and added to the existing
btree opclasses as ordering operators.  Their btree_gist duplicates are removed
in the next patch.

6. Remove duplicate distance operators from contrib/btree_gist.

References to their own distance operators in btree_gist opclasses are
replaced with references to the built-in operators and than duplicate
operators are dropped.  But if the user is using somewhere these operators,
upgrade of btree_gist from 1.3 to 1.4 would fail.

7. Add regression tests for btree kNN.

Tests were added only after the built-in distance operators were added.


--
Nikita Glukhov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgres_fdw bug in 9.6

2017-01-18 Thread Ashutosh Bapat
On Wed, Jan 18, 2017 at 8:18 AM, Etsuro Fujita
 wrote:
> On 2017/01/16 11:38, Etsuro Fujita wrote:
>>
>> On 2017/01/14 6:39, Jeff Janes wrote:
>>>
>>> I do get a compiler warning:
>>>
>>> foreign.c: In function 'CreateLocalJoinPath':
>>> foreign.c:832: warning: implicit declaration of function
>>> 'pathkeys_contained_in'
>
>
>> Will fix.
>
>
> Done.  Attached is the new version.  I also adjusted the code a bit further.

With this patch there are multiple cases, where CreateLocalJoinPath()
would return NULL and hence postgres_fdw would not push a join down
for example
/*
 * (1) if either cheapest-total path is parameterized by the
 * other rel, we can't generate a hashjoin/mergejoin path, and
 * (2) proposed hashjoin/mergejoin path is still parameterized
 * (ie, the required_outer set calculated by
 * calc_non_nestloop_required_outer isn't NULL), it's not what
 * we want; which means that both the cheapest-total paths
 * should be unparameterized.
 */
if (outer_path->param_info || inner_path->param_info)
return NULL;
or
/*
 * If the cheapest-total outer path is parameterized by the
 * inner rel, we can't generate a nestloop path.  (There's no
 * use looking for alternative outer paths, since it should
 * already be the least-parameterized available path.)
 */
if (PATH_PARAM_BY_REL(outer_path, innerrel))
return NULL;
/* If proposed path is still parameterized, don't return it. */
required_outer = calc_nestloop_required_outer(outer_path,
  inner_path);
if (required_outer)
{
bms_free(required_outer);
return NULL;
}

Am I right?

The earlier version of this function GetExistingLocalJoinPath() used
to return a local path in those cases and hence postgres_fdw was able
to push down corresponding queries. So, we are introducing a
performance regression. We need to plug those holes.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Rename pg_switch_xlog to pg_switch_wal

2017-01-18 Thread Vladimir Rusinov
On Tue, Jan 17, 2017 at 10:03 PM, Robert Haas  wrote:

> Q: OK, where is my WAL stored?
> A: pg_wal
> Q: How do I reset it?
> A: pg_resetxlog
>

On the topic of binaries, there's going to be another patch renaming them.
Those will have no aliases as it's trivial to work-around (symlinks, shell
scripts, etc) and not so trivial to implement in a portable way.

I'm used to workflows with lots of small commits, so I was not able to
force myself to include all of the renames in one diff.

-- 
Vladimir Rusinov
Storage SRE, Google Ireland

Google Ireland Ltd.,Gordon House, Barrow Street, Dublin 4, Ireland
Registered in Dublin, Ireland
Registration Number: 368047


smime.p7s
Description: S/MIME Cryptographic Signature


Re: [HACKERS] Add pgstathashindex() to get hash index table statistics.

2017-01-18 Thread Kuntal Ghosh
On Fri, Jan 6, 2017 at 6:58 PM, Ashutosh Sharma  wrote:

I've successfully applied the patch on the latest head and ran a
regression tests without any failure. There is no major changes.
However, I've some minor comments on the patch:

+/*
+ * HASH_ALLOCATABLE_PAGE_SZ represents allocatable
+ * space (pd_upper - pd_lower) on a hash page.
+ */
+#define HASH_ALLOCATABLE_PAGE_SZ \
+   BLCKSZ - \
+   (SizeOfPageHeaderData + sizeof(HashPageOpaqueData))
My suggestion will be not to write "(pd_upper - pd_lower)" in the
comment. You may write allocatable space on a empty hash page.

+   buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno,
RBM_NORMAL, NULL);
Use BAS_BULKREAD strategy to read the buffer.


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] emergency outage requiring database restart

2017-01-18 Thread Ants Aasma
On Wed, Jan 4, 2017 at 5:36 PM, Merlin Moncure  wrote:
> Still getting checksum failures.   Over the last 30 days, I see the
> following.  Since enabling checksums FWICT none of the damage is
> permanent and rolls back with the transaction.   So creepy!

The checksums still only differ in least significant digits which
pretty much means that there is a block number mismatch. So if you
rule out filesystem not doing its job correctly and transposing
blocks, it could be something else that is resulting in blocks getting
read from a location that happens to differ by a small multiple of
page size. Maybe somebody is racily mucking with table fd's between
seeking and reading. That would explain the issue disappearing after a
retry.

Maybe you can arrange for the RelFileNode and block number to be
logged for the checksum failures and check what the actual checksums
are in data files surrounding the failed page. If the requested block
number contains something completely else, but the page that follows
contains the expected checksum value, then it would support this
theory.

Regards,
Ants Aasma


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pageinspect: Hash index support

2017-01-18 Thread Ashutosh Sharma
Hi,

> 1.
> +static Page
> +verify_hash_page(bytea *raw_page, int flags)
>
> Few checks for meta page are missing, refer _hash_checkpage.

okay, I have added the checks for meta page as well. Please refer to
attached patch.

>
> 2.
> + if (!superuser())
> + ereport(ERROR,
> + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
> + (errmsg("must be superuser to use pageinspect functions";
>
> Isn't it better to use "raw page" instead of "pageinspect" in the
> above error message? If you agree, then fix other similar occurrences
> in the patch.

Yes, knowing that we deal with raw pages as in brin index it would be
good to use raw page instead of pageinspect to maintain the
consistency in the error message.

>
> 3.
> + values[j++] = CStringGetTextDatum(psprintf("(%u,%u)",
> +  BlockIdGetBlockNumber(&(itup->t_tid.ip_blkid)),
> + itup->t_tid.ip_posid));
>
> Fix indentation in the third line.
>

Corrected. Please check the attached patch.

> 4.
> +Datum
> +hash_page_items(PG_FUNCTION_ARGS)
> +{
> + bytea   *raw_page = PG_GETARG_BYTEA_P(0);
>
>
> +Datum
> +hash_bitmap_info(PG_FUNCTION_ARGS)
> +{
> + Oid indexRelid = PG_GETARG_OID(0);
> + uint32 ovflblkno = PG_GETARG_UINT32(1);
>
> Is there a reason for keeping the input arguments for
> hash_bitmap_info() different from hash_page_items()?
>

Yes, there are two reasons behind it.

Firstly, we need metapage to identify the bitmap page that holds the
information about the overflow page passed as an input to this
function.

Secondly, we will have to input overflow block number as an input to
this function so as to determine the overflow bit number which can be
used further to identify the bitmap page.

+   /* Read the metapage so we can determine which bitmap page to use */
+   metabuf = _hash_getbuf(indexRel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
+   metap = HashPageGetMeta(BufferGetPage(metabuf));
+
+   /* Identify overflow bit number */
+   ovflbitno = _hash_ovflblkno_to_bitno(metap, ovflblkno);
+
+   bitmappage = ovflbitno >> BMPG_SHIFT(metap);
+   bitmapbit = ovflbitno & BMPG_MASK(metap);
+
+   if (bitmappage >= metap->hashm_nmaps)
+   ereport(ERROR,
+   (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+errmsg("invalid overflow bit number %u", ovflbitno)));
+
+   bitmapblkno = metap->hashm_mapp[bitmappage];


> 5.
> +hash_metapage_info(PG_FUNCTION_ARGS)
> {
> ..
> + spares = palloc0(HASH_MAX_SPLITPOINTS * 5 + 1);
> ..
> + mapp = palloc0(HASH_MAX_BITMAPS * 5 + 1);
> ..
> }
>
> Don't you think we should free the allocated memory in this function?
> Also, why are you 5 as a multiplier in both the above pallocs,
> shouldn't it be 4?

Yes, we should free it. We have used 5 as a multiplier instead of 4
because of ' ' character.

Apart from above comments, the attached patch also handles all the
comments from Mithun.

With Regards,
Ashutosh Sharma.
EnterpriseDB: http://www.enterprisedb.com


0001-Add-support-for-hash-index-in-pageinspect-contrib-mo.patch
Description: invalid/octet-stream

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Floating point comparison inconsistencies of the geometric types

2017-01-18 Thread Kyotaro HORIGUCHI
Hello.

(I apologize in advance for possible inaccurate wording on maths,
which might cause confusion..)

At Wed, 11 Jan 2017 16:37:59 +0100, Emre Hasegeli  wrote in 

> > - Floating point comparisons for gemetric types
> >
> >   Comparison related to geometric types is performed by FPeq
> >   macro and similars defined in geo_decls.h. This intends to give
> >   tolerance to the comparisons.
> >
> >  A
> >   FPeq: |<=e-|-e=>|(<= means inclusive, e = epsilon = tolerance)
> >   FPne:   ->|  e | e  |<-  (<- means exclusive)
> >   FPlt:  | e  |<-
> >   FPle: |<=e |
> >   FPgt:   ->|  e |
> >   FPge:  | e=>|
> >
> >   These seems reasonable ignoring the tolerance amount issue.
> 
> I tried to explain below why I don't find this method reasonable.

Thank you very much for the explanation.

> > At least for the point type, (bitmap) index scan is consistent
> > with sequential scan.  I remember that the issue was
> > "inconsistency between indexed and non-indexed scans over
> > geometric types" but they seem consistent with each other.
> 
> You can see on the Git history that it was a lot of trouble to make
> the indexes consistent with the operators, and this is limiting
> improving indexing of the geometric types.

Yeah. geo_ops.c has the following comment from its first version
in the current repository.

| * Relational operators for Points.
| * Since we do have a sense of coordinates being
| * "equal" to a given accuracy (point_vert, point_horiz),

So, we take it as a requirement for geometric types. All the
difficulties come from the "nature".

> > You mentioned b-tree, which don't have predefined opclass for
> > geometric types. So the "inconsistency" may be mentioning the
> > difference between geometric values and combinations of plain
> > (first-class) values. But the two are different things and
> > apparently using different operators (~= and = for equality) so
> > the issue is not fit for this. More specifically, "p ~= p0" and
> > "x = x0 and y = y0" are completely different.
> 
> Yes, the default btree operator class doesn't have to implement
> standard equality and basic comparison operator symbols.  We can
> implement it with different symbols.

Perhaps I don't understand it. Many opclass are defined for
btree. But since (ordinary) btree handles only one-dimentional,
totally-orderable sets, geometric types even point don't fit
it. Even if we relaxed it by removing EPSILON, multidimentional
data still doesn't fit.

Still we can implement equality mechanism *over* multiple btree
indexes, but it cannot be a opclass for btree.

> > Could you let me (or any other guys on this ml) have more precise
> > description on the problem and/or what you want to do with this
> > patch?
> 
> I will try to itemize the current problems with the epsilon method:
> 
> = Operator Inconsistencies
> 
> My current patches address all of them.

No. It just removes tolerans that is a "requirement" for
coordinates of geometric types. I think we don't have enough
reason to remove it. We could newly define geometric types with
no-tolerance as 'exact_point" or something but it still doesn't
fit btree.

> - Only some operators are using the epsilon.
> 
> I included the list of the ones which doesn't use the epsilon on
> initial post of this thread.  This makes the operators not compatible
> with each other.  For example, you cannot say "if box_a @> box_b and
> box_b @> point then box_a @> box_b".

(It must be "then box_a @> point".)  Both of the operators don't
seem to use EPSILON so the transitivity holds, but perhaps we
should change them to more appropriate shape, that is, to use
EPSILON. Even having the tolerance, we can define the operators
to keep the transitivity.

> - The application of epsilon is implementation dependent.
> 
> Epsilon works between two numbers.  There is not always a single way
> of implementing geometric operators.  This is surprising to the users.
> For example, you cannot say "if box_a @> box_b then box_a <-> box_b <=
> epsilon".

Maybe it should be like "if box_a ~= box_b && box_b @< box_a then
..". I'm not sure off-hand. But I don't see the relationship is
significant.

> This is also creating a high barrier on developing those operators.
> Fixing bugs and performance regressions are very likely to change the
> results.

I agree to you. The effect of the EPSILON is quite arbitrary and
difficult to see their chemistry.

> - Some operators are just wrong:
> 
> Line operators are not well maintained.  The following are giving
> wrong results when neither A or B of lines are not exactly 1:
> 
> * line # line
> * line <-> line
> * point ## line
> * point ## lseg

These are not comparison operators so EPSILON is unrelated. And I
agree that they needs fix.

> They are broken independently from the epsilon, though it is not easy
> to fix them without 

<    1   2